TT Blog from SafeTTy Systems™

This page hosts an informal “Technical Blog” from members of the team at SafeTTy Systems Ltd.

The focus is on the development of reliable, real-time embedded systems (in sectors ranging from household goods to satellite systems).

The material presented on this page is based on the use of ‘Time Triggered’ (TT) software architectures.

If you are unfamiliar with TT architectures, our Technology page provides some background material.



Building a high performance ‘SIL 3’ development platform out of wood …


At SafeTTy, we are working on a range of TT projects based on microcontrollers (MCUs) from NXP®, Infineon® and ST®.

One of our current projects is based on the interesting STM32H745 family.  MCUs in this family incorporate both an Arm® Cortex®-M7 core (operating at up to 480 MHz) and an Arm® Cortex®-M4 core (operating at up to 240 MHz).  By combining two of these MCUs (as we do on our DuplicaTTor boards), we are able to provide a very high performance hardware platform that is capable of meeting ‘SIL 3’ requirements (in compliance with IEC 61508 and related international safety standards). 

A key feature of STM32H745 designs is that we are able to use the M4 core to provide a ‘TT Wrapper’ and combine this with high-performance processing on the M7 core.  Please note that the software on the M7 core need not always have a TT architecture (for example, some of our customers plan to run Linux® on this core).

The first prototype of this platform is shown below.   This is what our Australian colleagues call a ‘brown board’: it is based on two off-the-shelf ST evaluation units.  The final design will – of course – be based on a single PCB (and won’t involve any wood or shelf brackets …).



Please contact us if you are interested in exploring a high-performance platform for your next safety-related embedded system.

[24 September 2020]


The rise of the ‘TT Wrapper’ – An interview with Dr Michael J. Pont

Dr Michael J. Pont, Founder and CEO at SafeTTy Systems, was interviewed recently by AutoSens.

During this interview, Michael discussed the changes in demand for TT systems, complexity-management challenges, and insights from his book “The Engineering of Reliable Embedded Systems”.

You have been working in the field of time-triggered (TT) embedded systems for more than 25 years. How much of this work has been directly related to automotive?

I’ve supported the development of safety-related embedded systems in a range of sectors over the years, including industrial control, civilian aircraft, space and medical. I began my first major TT project in the automotive sector around 15 years ago. Since this time, I have seen two step-changes in demand for TT systems in this sector.

The first step-change came in the lead up to the publication of the first edition of the international standard ISO 26262 in 2011. At this time, many organisations realised that they needed to be able to provide evidence that the vehicles or automotive components that they were producing had been ‘designed for safety’. TT architectures provide a highly-effective way of achieving this.

The second step-change came in the last few years as people became interested in ADAS / AV designs. At this point, the complexity of automotive designs increased very significantly, and I saw further demand for cost-effective TT designs as a means of improving confidence in the safety of such systems.

The end result is that – at the present time – around 60% of my work is in the automotive sector.

What have you learnt in working in other areas of Embedded Systems that can be applied to automotive?

My main goal is to help organisations to produce systems where we can be confident about safety. The key thing that I have learned from different sectors – particularly the aerospace sector – is the importance of having what is sometimes called a ‘safety culture’ in any organisation that wishes to achieve this goal. For me, a safety culture relies on having good people throughout an organisation who are not afraid to question design decisions that – in their view – may have a negative impact on safety.

I think it’s important to add that this is no longer simply a question about the lessons that automotive organisations can learn from other sectors. The ADAS / AV designs that automotive organisations are currently involved with present safety challenges that are – in my view – greater than those faced in many aerospace designs. Over the next few years, I would expect to see experienced automotive designers providing advice in many other sectors.

You can read the full interview on the AutoSens website.


[9 July 2018]



New case study that illustrates ‘ASIL decomposition’ on a single MCU

The decomposition of ‘functional safety requirements’ (FSRs) is a process that is often employed in designs that are developed in compliance with international safety standards such as ISO 26262 and IEC 61508. The process is usually referred to as ‘ASIL decomposition’ or ‘SIL decomposition’.

As an illustration, the figure below summarises the end result of decomposing a single FSR (at ‘ASIL B’) into two equivalent FSRs (in this case at ‘ASIL A’).



Decomposition of FSRs is often employed because:

  • confidence in the safety of the associated system can be increased through this process;
  • the cost of meeting the two decomposed FSRs can be significantly lower than that of meeting the original FSR (‘ASIL B’).

To justify such a decomposition, the two new FSRs must be implemented independently (and it must be possible for the development team to demonstrate – clearly and unambiguously – that the two implementations are independent).

One way of providing an independent implementation is to begin by allocating the two new FSRs to different MCUs: our DuplicaTTor Evaluation Board is often used to prototype such designs.

An alternative way of supporting the decomposition of FSRs is to execute the two sets of tasks on the same MCU. We can do this if we are able to ensure that the two implementations will operate independently: what we call a ‘DecomposiTTor Platform’ provides an effective means of achieving this goal.



A DecomposiTTor Platform is made up of a single MCU, a single scheduler and two independent sets of tasks. To create the two independent task sets, we can, for example, design one set of tasks and then, using the techniques discussed in Chapter 6 of ‘ERES2’, create a set of matching ‘Diverse Tasks’.

To be clear: we do not attempt to claim that we can prevent all interference between tasks that execute on the same MCU. However, by building on the techniques presented in Part Two and Part Four of the ‘ERES2’ book, we can be confident that we will be able to detect any interference between such tasks: this is what we require to support successful ASIL / SIL decomposition on a single MCU.
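As a minimal, hypothetical illustration of the ‘diverse task’ idea (the names and algorithms here are our own, and are not taken from ERES2), two tasks might compute the same result by different routes, with a comparison step flagging any disagreement:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical 'diverse task' pair: both compute a checksum over the same
   data, but by different algorithms. A mismatch indicates a fault in (or
   interference with) one of the two implementations. */

static uint32_t checksum_forward(const uint8_t *data, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++) {
        sum += (uint32_t)data[i] * (i + 1u);   /* position-weighted sum */
    }
    return sum;
}

static uint32_t checksum_reverse(const uint8_t *data, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = len; i > 0u; i--) {
        sum += (uint32_t)data[i - 1u] * i;     /* same weights, reverse order */
    }
    return sum;
}

/* Comparison: returns true if the two diverse results agree. */
static bool results_agree(const uint8_t *data, uint32_t len)
{
    return checksum_forward(data, len) == checksum_reverse(data, len);
}
```

In a real design, each checksum would be computed by a task in a different task set, and the comparison result would feed into the monitoring mechanisms described above.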

We have released a simple case study to illustrate the decomposition of FSRs on a single MCU.

We have also released an associated reference design (TTRD2-25a).

[11 June 2018]



New public TTRDs to illustrate Mode-change mechanisms

In the TT designs that we work on, we use the term ‘Mode’ to refer to: “a software configuration on a Processor that involves the release by means of a Scheduler of a named set of periodic Tasks in accordance with a pre-determined Task schedule”.

In practice, during normal operation, a TT design will usually support a number of different Modes.

Changing Mode will usually involve changing the Task set that we are running. For example, the figure below is intended to illustrate (schematically) that we are likely to require different Modes – and different Task sets – when controlling a passenger car on a motorway, during city driving and when parking.



Changing Modes in any system must be carried out with care, and the mechanisms used to achieve this in a TT design are always the subject of discussion in our TTb training courses.

We can change the Mode by means of a Processor reset. This will typically involve the following steps:

  • we store the required new Mode;
  • we perform a software reset;
  • after the reset, we retrieve the required Mode, add the required Tasks to the schedule and restart the scheduler.
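The reset-based steps above can be sketched as follows. This is a simplified, host-testable sketch: the names are illustrative, and the ‘persistent’ structure stands in for a RAM location (e.g. a ‘.noinit’ section or a backup register) that survives a software reset:

```c
#include <stdint.h>

/* Sketch of a reset-based Mode change (names are illustrative).
   MODE_KEY marks the stored Mode as a deliberate request (rather than
   uninitialised RAM contents after a power-on reset). */

#define MODE_KEY 0xA5A5u

static struct {
    uint16_t key;     /* == MODE_KEY => 'mode' field is valid          */
    uint16_t mode;    /* Mode to enter after the next software reset   */
} persistent_mode;

/* Step 1 and 2: store the required new Mode, then reset. */
static void store_mode_and_reset(uint16_t new_mode)
{
    persistent_mode.mode = new_mode;
    persistent_mode.key  = MODE_KEY;
    /* On a real Cortex-M target we would now trigger a software reset
       (e.g. via the AIRCR SYSRESETREQ mechanism). */
}

/* Step 3: called early after any reset - decide which Mode to start. */
static uint16_t retrieve_mode(uint16_t default_mode)
{
    uint16_t mode = default_mode;
    if (persistent_mode.key == MODE_KEY) {
        mode = persistent_mode.mode;        /* planned Mode change     */
    }
    persistent_mode.key = 0;                /* consume the request     */
    return mode;
}
```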

Alternatively, we can perform a ‘manual’ Mode change as follows:

  • we stop the scheduler;
  • we call ‘deinit’ functions for each of the current Tasks (to ‘tidy up’);
  • we clear the schedule;
  • we add the required new Tasks to the schedule and restart the scheduler;
  • we keep refreshing the watchdog timer during the above transition.
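A ‘manual’ Mode change might be sketched like this (again, a simplified illustration: the task structure and function names are hypothetical, and a counter stands in for the watchdog hardware):

```c
#include <stdint.h>

#define MAX_TASKS 8u

typedef struct {
    void (*update)(void);   /* periodic task function          */
    void (*deinit)(void);   /* 'tidy up' before removal        */
} Task_t;

static Task_t   schedule[MAX_TASKS];
static uint32_t task_count = 0;

static uint32_t wdt_refresh_count = 0;     /* stands in for the iWDT */
static void refresh_watchdog(void) { wdt_refresh_count++; }

static void add_task(void (*update)(void), void (*deinit)(void))
{
    if (task_count < MAX_TASKS) {
        schedule[task_count].update = update;
        schedule[task_count].deinit = deinit;
        task_count++;
    }
}

/* 'Manual' Mode change: call deinit for each current task, then clear
   the schedule, refreshing the watchdog throughout the transition.
   The caller then adds the new task set and restarts the scheduler. */
static void begin_mode_change(void)
{
    for (uint32_t i = 0; i < task_count; i++) {
        refresh_watchdog();
        if (schedule[i].deinit) {
            schedule[i].deinit();
        }
    }
    task_count = 0;                        /* clear the schedule       */
    refresh_watchdog();
}
```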

Where possible, we prefer to perform reset-based Mode changes. There are two main reasons for this: [i] we find that the code is easier to understand and maintain, even by less-experienced teams; [ii] we know exactly what the hardware and software state will be when we enter a new Mode. The main cost is that it will typically take 1-3 ms to perform the Mode change: we may not always be able to spare this time.

We have released two new TTRDs – TTRD2-08a and TTRD2-08d – that are designed to illustrate the differences between reset-based and manual Mode changes.

You can download the TTRDs here.

[7 June 2018]



Developing safety-related embedded systems that incorporate neural networks

Over the last few months, we’ve had several conversations with customers about the development of safety-related systems that incorporate neural networks. It probably won’t come as a surprise if we say that the majority of these discussions have been centred on the development of control systems for autonomous vehicles.

In summary, this is the approach that we have been exploring:

  • use an appropriate learning algorithm;
  • use an appropriate implementation of the above algorithm (in compliance with ISO 26262-6 or another relevant standard);
  • use appropriate training data with the above implementation;
  • ‘lock down’ the neural network after the training process is complete;
  • perform a comprehensive test & verification process on the ‘locked down’ neural network (ldnn);
  • view the ldnn (after the T&V activities) much like a new human driver who has just passed their driving test;
  • incorporate additional – non-neural – monitoring in the system to improve confidence in the safe operation of the system.

We appreciate that this is all a little vague. We are currently working on a demonstration system that will illustrate what we have in mind.

We hope to be able to say more about this demo in late June 2018.



[18 May 2018]



Snail brains and autonomous vehicles

We were interested to read the results from a recent study about the transfer of memories between one marine snail and another.

Understanding how snail brains operate may seem of limited relevance to developers of safety-related automotive systems, but there is (in our view) a potential link.

To summarise, this study suggests that memories (at least in this marine snail) are not stored by means of changes in the connections – synapses – between cells, as has often been assumed.

This may matter, because – as far as we aware – the ‘neural networks’ that are being used to control various autonomous vehicles are based on a ‘brain model’ that assumes synapse-based learning.

If we want to claim that our neural networks are mimicking the way that humans learn, we may have to revise our neural models …

[16 May 2018]



Attempting to model the links between reliability, security and safety in real-time embedded systems

George E.P. Box is often quoted as saying: “All models are wrong but some are useful”.

We’ve been having a discussion with a customer about the links between reliability, safety and security in embedded systems (in the automotive sector) and the results of this discussion are the ‘model’ below.



What this model tries to represent is our view that we cannot generate safe embedded systems unless we have addressed both reliability and security concerns.

In particular, the key purpose of the model was to capture the links between safety and security (and address a view that is sometimes heard that security is – somehow – independent of safety).

We accept that this model is imperfect, but we hope that it may provide food for thought / further discussion.

[2 May 2018]



Performing temporal and logical program sequence monitoring

Eric Morecambe was an English comedian. About his piano playing, he once said, “I’m playing all the right notes, but not necessarily in the right order”.

Temporal and logical program sequence monitoring (PSM) means checking that our system is playing all the right notes, in the right order – and at the right time. PSM is often employed as a run-time monitoring mechanism in designs that are developed in compliance with standards such as IEC 61508 and ISO 26262.

It is probably fair to say that these standards leave some scope for individual interpretation of the precise form of PSM that is required to achieve compliance.

In our designs, we check that the tasks are released in the expected sequence. We do this before each task is released (rather than after the event).

We also monitor the timing of task releases (and task completions). The resolution of the timing measurements required depends on the application (in some cases this may be at the microsecond level).
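The sequence checks described above can be illustrated with a pre-computed ‘tick list’ of expected task IDs. This is a much-simplified, hypothetical sketch of the idea, not the actual PredicTTor implementation:

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified program-sequence monitoring: the task IDs and the expected
   (repeating) release sequence below are purely illustrative. */

#define SEQ_LENGTH 4u

static const uint8_t expected_sequence[SEQ_LENGTH] = { 1, 2, 1, 3 };
static uint32_t seq_index = 0;

/* Called immediately BEFORE each task is released. Returns false if the
   task about to run is not the one the tick list predicts: the system
   would then move to an appropriate fail-safe state. */
static bool check_task_sequence(uint8_t task_id)
{
    bool ok = (task_id == expected_sequence[seq_index]);
    seq_index = (seq_index + 1u) % SEQ_LENGTH;   /* sequence repeats   */
    return ok;
}
```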

These tests are straightforward in our TT designs, and are based on the use of ‘MoniTTor’ and ‘PredicTTor’ mechanisms.

You’ll find further information about these mechanisms in the ‘ERES2’ book.

[16 March 2018]



What should you do if your system fails a POST?


‘Power On Self Tests’ are an important consideration in many safety-related embedded systems.

This – below – is a ‘checklist’ that we have found useful when developing (and reviewing) TT software for safety-related embedded systems in compliance with standards such as ISO 26262 and IEC 61508.

  • use an independent clock source to check the oscillator frequency;
  • use a suitable library to test RAM memory and CPU registers;
  • use a ‘Golden Signature’ to check the executable code;
  • test the memory areas used to store Register Variables;
  • test the internal WDT (iWDT);
  • use the (tested) iWDT to check the Scheduler operation;
  • perform startup checks on all peripherals used in the application;
  • perform checks on the eWDC, if present;
  • perform checks on the MoniTTor;
  • perform checks on the PredicTTor;
  • perform key environmental checks, such as measuring the ambient temperature and the Processor operating voltages;
  • use a WarranTTor unit to check for security infringements, if this is deemed to be necessary;
  • use a second Processor (in multi-Processor designs) to perform additional checks on the Platform, including the Schedulers / interrupt system and operating frequencies.
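As an illustration of one item on this list, a ‘Golden Signature’ check might be sketched as follows. The CRC-32 used here is a stand-in for whatever signature the build tools generate, and an array stands in for the flash code region:

```c
#include <stdint.h>
#include <stdbool.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320), used here as an
   illustrative signature algorithm. */
static uint32_t crc32_simple(const uint8_t *data, uint32_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (uint32_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* POST: compare the computed signature of the code area with the
   'Golden Signature' stored at build time. */
static bool post_check_code_signature(const uint8_t *code_start,
                                      uint32_t code_len,
                                      uint32_t golden_signature)
{
    return crc32_simple(code_start, code_len) == golden_signature;
}
```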

The key question that we are trying to answer through the use of low-level POSTs is as follows: ‘Do we have an operational Processor to work with?’

If the answer is ‘no’, then all we can do is try to enter a Fail-Safe Processor State.

If our Platform is based on a single Processor and this fails a POST, it is difficult to be confident that this Processor will be able to perform the operations that are needed in order to ‘fail safely’.

If we have only one Processor in the Platform and failure of the POSTs could result in a dangerous situation, then we will almost certainly need to consider adding an external Watchdog Controller (eWDC) unit or an additional Processor to the Platform.

This issue must be considered carefully at an early stage in the development process.



[2 March 2018]



Can you use the library code provided by your MCU manufacturer in a safety-related system?

We have been asked several times in recent months about the use of third-party library code in safety-related systems (for example, designs developed in compliance with IEC 61508 or ISO 26262).

For designs programmed in ‘C’, we typically need to consider (at least) the following third-party code:

  • MCU startup code
  • C-language Standard Libraries
  • MCU peripheral libraries (for example, GPIO interface libraries, ADC libraries, …).

Example startup code will often be provided by the MCU manufacturer. This may be written in Assembly language.

C-language standard libraries will often be provided by the compiler manufacturer.

The MCU peripheral libraries are – in most designs – the largest suite of third-party code that is used in a project. This code will generally be provided by the MCU manufacturer.

Let’s assume that we are developing a design in compliance with IEC 61508 (SIL 3). We’ll further suppose that we have used all of the above code (as provided by our compiler or MCU manufacturer) in a prototype design. Can we use this code in our production system?

In terms of IEC 61508, the library code identified above can be viewed as a “pre-existing software element”.

IEC 61508-3 [2010] has clear requirements for the situation in which such pre-existing software elements are used to implement all or part of a safety function (see the relevant clause of IEC 61508-3).

The options are described as three “routes to compliance” for a software element.

These routes are paraphrased below:

Route 1S: compliant development. The element was developed in compliance with IEC 61508.
Route 2S: proven in use. There is evidence available that the element is proven in use (see Clause 7.4.10 in IEC 61508-2).
Route 3S: assessment of non-compliant development. The element has been shown to be compliant with the relevant clause of IEC 61508-3.

It is possible to find ‘qualified’ code for the ‘C’ standard libraries (that is, code that meets the ‘Route 1S’ requirements).

For startup and MCU peripheral libraries, finding qualified code is likely to be much more challenging.

If you cannot achieve ‘Route 1S’ compliance for your startup and peripheral code, you may be tempted to look at ‘Route 2S’ (proven in use).

In our experience, a ‘proven in use’ requirement is often very difficult to justify. (Did you use exactly the same library code in the previous product? Did you use it in exactly the same way?)

In most cases, organisations are (in our experience) left with ‘Route 3S’ as the only viable solution. This involves (in effect) applying a review and ‘code hardening’ process to that subset of the startup and MCU library code that is used in a particular project, with the aim of bringing the code up to the required safety standard.

This is not a trivial process (and time needs to be allowed for it in the project timeline).

[14 February 2018]



TT design examples

Our customers typically develop automotive, industrial, medical, rail, aircraft or space systems, in compliance with ISO 26262, IEC 61508 and related international safety standards and guidelines.

We help them to meet these requirements through the use of state-of-the-art “Time-Triggered” (TT) software architectures.

To illustrate the application of TT architectures, various design examples are available.

Introductory material

  • You’ll find an introductory guide for people who want to learn how to program reliable, real-time embedded systems here.
  • You can view 7 hours of introductory ‘TT’ training videos here.
  • You can download the complete ‘PTTES’ book (and related code examples) here.
  • You’ll find some introductory material from the ‘ERES2’ book here (PDF file).
  • You’ll find various code examples from the ‘ERES2’ book here.

Automotive (ISO 26262)

  • You’ll find an automotive ECU design example (ISO 26262, ASIL D) here.
  • You’ll find an article that summarises some of the ways in which a ‘TT wrapper’ can be used to improve confidence in the safety of Level 3 / Level 4 / Level 5 road vehicles (developed in compliance with ISO 26262) here.

Industrial control / generic safety (IEC 61508)

  • You’ll find an industrial control example (IEC 61508, SIL 2) here.
  • You’ll find a TT framework that can meet IEC 61508 ‘SIL 3’ requirements using two low-cost MCUs here.

Machinery (ISO 13849)

  • You’ll find a machinery-control example (ISO 13849, ‘PLe’) here.

Medical (IEC 62304 and related standards)

  • You’ll find an article that summarises some of the ways in which a ‘TT wrapper’ can be used to improve confidence in the safety of systems that contain ‘SOUP’ (developed in compliance with IEC 62304 or other standards) here.
  • You’ll find an example that illustrates the development of a controller for a hospital radiotherapy machine (in compliance with IEC 60601-2-1 and IEC 62304) in the ‘ERES2‘ book.

Aerospace (DO-178C)

  • You’ll find an example that illustrates (in outline) the development of a controller for an aircraft jet engine (in compliance with DO-178C) in the ‘ERES2‘ book.

Household goods (IEC 60730, IEC 60335)

  • You’ll find an example that illustrates the development of a controller for a domestic washing machine (in compliance with IEC 60730 and IEC 60335) in the ‘ERES2‘ book.


[12 January 2018]



Expanded User Guide released for DuplicaTTor® Evaluation Board (DEB-0405)

The recently-released DuplicaTTor® Evaluation Board (DEB-0405) is primarily intended to support developers that need to meet the requirements of IEC 61508 (up to SIL 3), ISO 13849 (up to PL e, Cat 4) and related international safety standards using ‘Time Triggered’ (TT) software architectures.

The DEB-0405 incorporates two independent hardware channels, each based on an STM32F405VG microcontroller.

In response to a number of requests, we have released an expanded version of the User Guide for the DEB-0405. In addition to a general ‘tidying up’, this document now includes further information about the fault-injection capabilities of this platform.

You can download the new User Guide from the DEB-0405 page.

[21 October 2017]



Examples of current ISO 26262 projects

At present, we are helping many of our customers to develop automotive systems in compliance with ISO 26262.

For example, we can assist in the development of ‘Safety Elements out of Context‘ (SEooCs). These are ‘components’ (such as a sensor or a software library) that will ultimately be used as part of a larger vehicle system.

In such projects, our role might include: [i] performing an ISO 26262 ‘gap analysis’; [ii] providing design advice and / or training; [iii] assisting with the process of obtaining an ‘ISO 26262 SEooC certificate’ from a third-party organisation (such as TÜV); [iv] assisting with the creation of the Safety Manual.

A particular focus of our current work is on SEooCs for use in semi-autonomous / autonomous vehicles (up to SAE Level 4 / Level 5).

Need help with your ISO 26262 project? Learn more on our Consultancy page – or contact us to discuss your requirements.

[1 September 2017]



Developing software for embedded systems that are safe, reliable and secure

Various design guides, technical reports and related material are available on this website.

  • You’ll find a guide to ‘Time Triggered’ embedded systems here.
  • You’ll find an automotive design example (ISO 26262, ASIL D) here.
  • You’ll find an article that summarises some of the ways in which a ‘TT wrapper’ can be used to improve confidence in the safety of Level 3 / Level 4 / Level 5 road vehicles (developed in compliance with ISO 26262) here.
  • You’ll find an industrial control example (IEC 61508, SIL 2) here.
  • You’ll find a TT framework that can meet IEC 61508 ‘SIL 3’ requirements using two low-cost MCUs here.
  • You’ll find an article that summarises some of the ways in which a ‘TT wrapper’ can be used to improve confidence in the safety of systems that contain ‘SOUP’ (developed in compliance with IEC 62304 or other standards) here.

You’ll find a full list of available design guides and related material here.

[29 August 2017]



An expanded version of ‘TTRD2-19a’ is now available

We’ve been releasing public TTRDs (for various processor targets) since 2014. These releases have always resulted in a number of technical questions from our customers. However, the level of interest in ‘TTRD2-19a’ (available for download from 4 January 2017) has exceeded anything that we’ve seen previously.

TTRD2-19a is a complete example of a ‘CorrelaTTor-A’ design. This means that it incorporates a TT scheduler, and two key monitoring components (MoniTTor and PredicTTor). Appropriate startup tests are also included in this demo system. This design is documented in ‘ERES2‘ (Chapter 19).

TTRD2-19a demonstrates a highly-effective TT software platform. Using an appropriate MCU and with the addition of a small external ‘watchdog’ device (eWDC), this platform can – for example – form the basis of an ‘ASIL D’ design (in compliance with ISO 26262).

Most of the questions that we have received in recent weeks relate to the modelling of TTRD2-19a, and the configuration of the run-time monitoring components. More specifically, we’ve been asked about the generation of the ‘tick list’, and about techniques for determining the task execution timings (‘WCET’ and ‘BCET’).

We decided that the simplest way to address these questions was to release an expanded TTRD2-19a example: this code is now available on our public TTRD page.

[17 August 2017]



More about architectures for fully-autonomous vehicles (SAE Level 5)

As we noted on 5 February 2017, we’ve recently received several requests for advice about the selection of software and hardware architectures for use in autonomous vehicles. In these discussions, we’ve been exploring several ‘Design Sketches’.

Two of these Design Sketches have generated particular interest: these are ‘DS4’ and ‘DS6’ (summarised below).



We’ll say a little more about these designs in a moment. First, let’s be clear about our assumptions.

This is our assumed operating scenario:

  • our ‘Autonomous Road Vehicle’ (ARV) is to ‘drive itself’ on a motorway network (only);
  • there will be other vehicles on this motorway (other ARVs and other ‘normal’ cars);
  • at a service area, the driver becomes a passenger: he / she moves into the back seat of the ARV, inserts the vehicle ‘key’, enters the required destination and presses the green ‘go’ button;
  • the ARV will then drive on the motorway to the service area nearest to the passenger’s required destination and will stop; the passenger then becomes the driver again;
  • the speed of the vehicle is limited to 50 mph in ‘autonomous’ mode;
  • while in autonomous mode, there is no (human) driver; therefore, there is no ‘emergency stop’ button anywhere in the vehicle.

These are our key design assumptions:

  • our starting point is a ‘COTS’ ARV Controller (ARVC) that has been developed by a third party;
  • the ARVC has some ‘input sensors’ and it generates two outputs: [i] the required vehicle speed (0-50 mph); and [ii] the required vehicle direction (‘steering wheel angle’);
  • the ARVC may incorporate a neural network or other adaptive software;
  • the adaptive nature of the ARVC design presents significant challenges for a traditional certification process, and we are looking for ways of increasing confidence that the system incorporating the ARVC will operate safely while the vehicle is in use.

Our key design goal is to monitor the operation of the ARVC and intervene if we detect that something is wrong.

The key design challenge is that – apart from very basic sanity checks (e.g. the ARVC is requesting a vehicle speed of 60 mph when our maximum allowed speed is 50 mph) – it is difficult to know whether the ARVC is operating correctly while the vehicle is moving.
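Such basic sanity checks are simple to express in code. For example (the limits and the function name here are purely illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Plausibility check on the two ARVC outputs (illustrative limits only):
   requested vehicle speed and requested steering-wheel angle. */

#define MAX_SPEED_MPH  50
#define MAX_STEER_DEG  45   /* hypothetical steering-angle limit */

static bool arvc_outputs_plausible(int32_t speed_mph, int32_t steer_deg)
{
    if (speed_mph < 0 || speed_mph > MAX_SPEED_MPH) {
        return false;       /* e.g. 60 mph requested: reject */
    }
    if (steer_deg < -MAX_STEER_DEG || steer_deg > MAX_STEER_DEG) {
        return false;
    }
    return true;
}
```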

Design Sketch 4 (DS4) – shown above – illustrates one way in which we may be able to improve confidence in designs that are based on a single ARVC. In this case, we use a TT Monitoring System (TT-MS): the TT-MS has the ability to inject faults / tests (e.g. test images) into the ARVC at run time, in order to check that the behaviour is as expected.

DS6 illustrates another option. In this case there are three independent ARVCs used to create the system, and we take action only if 2 units agree. Initially, DS6 would be a more expensive option than DS4. However, it is likely that we would have greater confidence in the decisions made by this system (particularly if some of the injection techniques from DS4 were also employed). In addition – once the DS6 implementation had been fully evaluated – it should be possible to reduce unit costs significantly by integrating the three ARVCs (and TT monitoring) into a single compact device.
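The ‘2-out-of-3’ agreement at the heart of DS6 might be sketched as follows (the tolerance handling, names and the averaging of agreeing channels are our own illustrative choices, not part of the DS6 specification):

```c
#include <stdint.h>
#include <stdbool.h>

/* 2-out-of-3 voter over (for example) the speed demands from three
   independent ARVCs: act only if at least two channels agree. */

static bool within_tolerance(int32_t a, int32_t b, int32_t tol)
{
    int32_t diff = a - b;
    return (diff >= -tol) && (diff <= tol);
}

/* Returns true (and sets *voted) if at least two channels agree
   within 'tol'; returns false if no pair agrees, in which case the
   system would take a pre-determined safe action. */
static bool vote_2oo3(int32_t c1, int32_t c2, int32_t c3,
                      int32_t tol, int32_t *voted)
{
    if (within_tolerance(c1, c2, tol)) { *voted = (c1 + c2) / 2; return true; }
    if (within_tolerance(c1, c3, tol)) { *voted = (c1 + c3) / 2; return true; }
    if (within_tolerance(c2, c3, tol)) { *voted = (c2 + c3) / 2; return true; }
    return false;
}
```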

At present, we are prototyping some of these designs (on a lab testbench) using a DuplicaTTor board to implement the TT-MS. We hope to have the opportunity to explore this design in more detail on a road vehicle shortly.

If you are interested in working with us on this interesting project, please contact us.

[17 February 2017]



Architectures for fully-autonomous vehicles

In the last few weeks we’ve had several requests for advice about the selection of software and hardware architectures for use in autonomous vehicles.

The questions have related to what are defined as ‘Level 5’ designs by SAE.

We don’t think there are easy answers to these questions (but you’ll not be surprised to learn that we believe that TT monitoring systems could provide a very useful ‘safety net’ for many such systems).

In a recent interview in IEEE Spectrum magazine, Dr Gill Pratt (Head of the Toyota Research Institute) makes a number of wise observations about some of the emerging challenges in this area.

You’ll find the article here.

[We are grateful to David Mentré and Paul Bennett for drawing this article to our attention.]

[5 February 2017]



ISO 26262 vs. IEC 61508

The latest NMI ISO 26262 Workshop took place at HORIBA MIRA (Nuneaton, UK) on 26 January 2017.

At this event, Dr Michael J. Pont (CEO, SafeTTy Systems Ltd) gave a presentation entitled: “Are there lessons that ISO 26262 developers can (and should) learn from IEC 61508?”

The presentation abstract is reproduced below:

This presentation will be concerned with the development of software for real-time automotive systems that need to be both safe and reliable.

The goal of the presentation is to explore one of the central differences between ISO 26262 and IEC 61508, and to consider whether there are lessons that can (and perhaps should) be learned from the earlier (generic / industrial) safety standard by developers of automotive systems.

During the talk it will be suggested that one key difference between IEC 61508 and ISO 26262 is that the latter standard places less (explicit) reliance on the idea of fault tolerance. In particular, the phrase ‘Hardware Fault Tolerance’ (which is referred to throughout IEC 61508) does not appear in ISO 26262. One important consequence of this difference is that, while IEC 61508 can be seen to favour use of multi-processor architectures, there is much less emphasis on such a solution in ISO 26262.

Does this mean that ISO 26262 designs are likely to be ‘less safe’ than equivalent IEC 61508 designs?

It is hoped that this presentation will encourage a debate at the workshop.

You can now download a copy of the presentation slides for this talk (PDF file).

You’ll find further information about this event on the NMI website.

[5 February 2017]



New DuplicaTTor® Evaluation Boards


To support organisations that want to explore the use of modern TT designs we have introduced our first DuplicaTTor® Evaluation Board (DEB).

Using a DEB, organisations can evaluate design options up to ‘SIL 3’ / ‘ASIL D’ level (and equivalent).

Learn more on our DuplicaTTor page.

[17 January 2017]



‘TTRD2-19a’ released (a full ‘CorrelaTTor-A’ design)

As we noted on 20 December, we are currently preparing a set of ‘Time Triggered Reference Designs’ (TTRDs) to go with the new ‘ERES2‘ book.

We have now released TTRD2-19a for an STM32F401 target (Nucleo board).

This is a ‘CorrelaTTor-A’ design (with iWDT, MoniTTor and PredicTTor support).

It also includes a full suite of ‘Power-On Self Tests’ (POSTs).

To illustrate an extreme (but thorough) form of ‘Built-In Self Test’ (BIST) behaviour, TTRD2-19a executes the following checks every few seconds: [i] it stores its state; [ii] it performs a system reset and a full suite of startup tests; [iii] it continues its operation from where it left off.

This type of behaviour is common in satellite designs, and can be useful in many other designs too (usually with less frequent testing).

Further information about this TTRD can be found in ERES2 (Chapter 19).

You can download all of the current TTRDs on the ERES2 page.

[4 January 2017]



Latest TTRDs released

As we noted on 15 December, we are currently preparing a set of ‘Time Triggered Reference Designs’ (TTRDs) to go with the new ‘ERES2‘ book.

So far, the suite consists of two simple examples: TTRD2-02a (STM32F0 target) and TTRD2-03a (XMC4500 target).

You can download the TTRDs on the ERES2 page.

Our aim is to complete TTRD2-09a (TMS570 target) and TTRD2-18a (STM32F4 target) before our Christmas break, then fill in some of the gaps in January.

[20 December 2016]



First TTRD released for ‘ERES2’


We’re currently working on the ‘Time-Triggered Reference Designs’ (TTRDs) to support the new ‘ERES2’ book.

So far we’ve released the first TTRD (TTRD2-02a): this example implements the basic scheduler code from Chapter 2.

(Chapter 2 is included in the sample material for this book and can also be downloaded from the ERES2 page.)

We hope to complete work on these TTRDs by the end of January.

[15 December 2016]



Meeting ‘Category 3’ / ‘Category 4’ requirements (ISO 13849-1: 2015)

ISO 13849-1: 2015 is concerned with the development of the safety-related parts of control systems that are used in machinery.

A very wide range of products is covered by this standard, ranging from factory machines to mobile agricultural equipment. Within the EU, this standard is associated with the Machinery Directive.

One way in which ISO 13849-1 differs from many of the other standards that we work with is that it includes five ‘designated architectures’ (DAs): Category B, Category 1, … Category 4. Use of one of these DAs in a system design can reduce the effort required to achieve compliance with the standard.

We’ve been asked recently about the links between the DAs presented in ISO 13849-1 and our recommended TT platforms.

We’ve summarised some of the potential links in the table below:


As an example, the figure below illustrates how it may be possible to implement a Category 3 DA using a DecomposiTTor platform.


As a second example, the figure below illustrates how it may be possible to implement a Category 4 DA using a DuplicaTTor-He platform.

[23 June 2016]



New MISRA guidelines on security

Application of the ‘MISRA C’ guidelines is widely seen as an effective way of improving the safety of embedded systems that are implemented using the popular ‘C’ programming language.

Since the latest version of MISRA C was published (MISRA C: 2012), ISO/IEC JTC1/SC22/WG14 (the committee responsible for maintaining the C standard) has published a set of guidelines that are intended to improve the security of systems that are implemented using C: these guidelines have the snappy title “ISO/IEC TS 17961:2013”.

ISO/IEC TS 17961:2013 specifies rules for secure coding in the C programming language, and provides examples of both ‘compliant’ and ‘non-compliant’ code.

MISRA has now published two documents related to ISO/IEC TS 17961:2013:

  • MISRA C:2012 (Addendum 1) illustrates links between existing MISRA rules and the “C Secure” requirements found in ISO/IEC TS 17961:2013;
  • MISRA C:2012 (Amendment 1) provides some additional guidelines that are intended to improve the security of systems implemented using C.


Both of these documents can be downloaded (free of charge) from the MISRA Bulletin Board.

It is (of course) unlikely that coding guidelines alone will provide the level of security (or safety) that is required in many modern, highly interconnected, embedded systems. However, we take the view that – by combining such coding guidelines with an appropriate TT architecture – designers have the potential to create a foundation for systems that are both safe and secure.

[25 May 2016]



IEC 62304:2006 – Amendment 1:2015


IEC 62304 is concerned with the development of software for use in medical devices.

The standard IEC 62304 dates back to 2006 and – as part of the progress towards a new edition – ‘Amendment 1’ was published in 2015.

This article summarises some of the key changes that will result from the introduction of Amendment 1.


Brief overview of the standard


The IEC 62304 standard notes that software is often an integral part of medical-device technology. It further notes that the effectiveness of a medical device that contains software requires: [i] knowledge of what the software is intended to do, and [ii] demonstration that the software will fulfill such intentions without causing unacceptable risks.

IEC 62304 requires that the manufacturer of the device assigns a safety class (Class A, Class B or Class C) to each software system.

The classes are assigned based on the impact that (failure of) the system may have:

  • Class A: No injury or damage to health is possible
  • Class B: Non-serious injury is possible
  • Class C: Death or serious injury is possible


Scope of the amended standard

As noted above, IEC 62304 is concerned with the development of software for use in medical devices.

The original standard was felt to be rather ambiguous about the meaning of ‘software’ in this context.

The amended standard tries to make it clearer that the standard applies to any medical device that executes software.


Software safety classification

One of the areas of the original standard that gave rise to numerous discussions was the assumption that if a failure of the software could give rise to a hazard, then it must be assumed that the probability of such a failure was 100%. The intention of such a statement was to focus attention on the need for appropriate (external) risk-control measures that could be used – for example – to move the device from Class C to Class B.

The amended standard attempts to clarify the allocation into classes (including the ‘default’ of Class C).

The assumption about 100% probability of failure remains, but there is (in our view) greater clarity in the amended standard about the application of risk-control measures.


Dealing with legacy software

The amended standard includes a new clause (Clause 4.4) that describes the process of dealing with legacy software.

This might apply (for example) when updating a product that is already being used by the medical community.


Identification and avoidance of common software defects

The amended standard includes (in Clause 5.1.12 – another new clause) the requirement that [i] typical programming ‘bugs’ should be identified, and [ii] evidence should be provided that such bugs cannot give rise to unacceptable risk.

This is an interesting requirement. We suspect that developers working in ‘C’ will use adherence to the ‘MISRA C’ guidelines as a means of (partially) addressing such matters.

Beyond this, the requirement can perhaps be seen as another view of the ‘100% probability of software failure’ assumption. Fully addressing such requirements is likely (again) to require appropriate – external – risk-control measures.


Detailed design

There are new requirements (Clause 5.4.2, Clause 5.4.3) for detailed design documentation.

These are – in our view – sensible changes (previous requirements for design documentation were very limited).


Impact on ‘Class A’ designs

In the amended standard, various requirements that applied only to Class B and Class C designs now also apply to Class A designs.


TT architectures and IEC 62304

You’ll find an example of the use of TT architectures in an IEC 62304 (Class C) design here.

[9 February 2016]



Towards the second edition of ISO 26262


International standard ISO 26262 is the adaptation of IEC 61508 to comply with needs specific to the application sector of electrical and/or electronic (E/E) systems within road vehicles. This adaptation applies to all activities during the safety lifecycle of safety-related systems comprised of electrical, electronic and software components.

First introduced in 2011, ISO 26262 has had a major impact on the development of automotive systems. However – given the rapid rate of change in this sector – it is perhaps not surprising that this new standard already feels rather out of date (for example, it has little to say about the development of autonomous vehicles).

Perhaps an even more significant concern about ISO 26262: 2011-2012 is that it applies only to the development of passenger cars: trucks, buses and motorcycles (for example) are not considered.

It is expected that the second edition of ISO 26262 – which is due for publication in 2018 – may begin to address some of the shortcomings of the current edition of the standard.

As an indication of where the standard is heading, the first Publicly Available Specification (PAS) related to ISO 26262 has now been published. This is ISO/PAS 19695:2015, and it applies to motorcycles.

From the abstract:

ISO/PAS 19695:2015 is intended to be applied to safety-related systems that include one or more electrical and/or electronic (E/E) systems and that are installed in series production two-wheeled or three-wheeled motorcycles.

ISO/PAS 19695:2015 Standard does not address unique E/E systems in special purpose vehicles, such as vehicles designed for competition.

ISO/PAS 19695:2015 Standard addresses possible hazards caused by malfunctioning behaviour of E/E safety-related systems, including interaction of these systems. It does not address hazards related to electric shock, fire, smoke, heat, radiation, toxicity, flammability, reactivity, corrosion, release of energy, and similar hazards, unless directly caused by malfunctioning behaviour of E/E safety-related systems.

ISO/PAS 19695:2015 Standard does not address the nominal performance of E/E systems, even if dedicated functional performance standards exist for these systems.

ISO 26262:2011-2012 has 10 parts.

It is expected that ISO/PAS 19695:2015 will form the basis of a new Part 12 in the next edition of ISO 26262.

Further information about ISO/PAS 19695:2015 is available here.

You may be wondering about Part 11?

The new Part 11 in the next edition of ISO 26262 is expected to focus on semiconductor requirements. The related PAS document (ISO/PAS 19451) should be published later this year (rumour has it that the current draft of this document is around 160 pages long …).

[22 January 2016]



The three laws of safe embedded systems


This short article is part of an ongoing series in which I aim to explore some techniques that may be useful for developers and organisations that are beginning their first safety-related embedded project.

In the present article, I want to take a slightly different perspective on Stage 3 from my previous post:

Read more on Michael J. Pont’s EmbeddedRelated Blog.

[First published 12 November 2015. Updated 29 November 2015]



Developing software for a safety-related embedded system for the first time


I spend most of my working life with organisations that develop software for high-reliability, real-time embedded systems. Some of these systems are created in compliance with IEC 61508, ISO 26262, DO-178C or similar international standards.

When working with organisations that are developing software for their first safety-related design, I’m often asked to identify the key issues that distinguish this process from the techniques used to develop “ordinary” embedded software.

This is never an easy question to answer, not least because every organisation faces different challenges. However, in this article I’ve pulled together a list of steps that may provide some “food for thought” for organisations that are considering the development of their first safety-related embedded system.

Read more on Michael J. Pont’s EmbeddedRelated Blog.

[31 October 2015]



How to test a Tesla?


You can read this article on Michael J. Pont’s EmbeddedRelated Blog.

[23 October 2015]



“Smarter” cars, unintended acceleration – and unintended consequences


In this article, I consider some recent press reports relating to embedded software in the automotive sector.

In The Times newspaper (London, 2015-10-16) the imminent arrival of Tesla cars that “use autopilot technology to park themselves and change lane without intervention from the driver” was noted.

By most definitions, the Tesla design incorporates what is sometimes called “Artificial Intelligence” (AI). Others might label it a “Smart” (or at least “Smarter”) Vehicle.

Read more on Michael J. Pont’s EmbeddedRelated Blog.

[20 October 2015]



Safety, reliability and security in embedded systems

In the last few weeks, there have been a number of discussions about the vulnerability of vehicles to hacking (including a well-documented case with a Jeep). There has also been a case (reported in The Times newspaper, UK, on 11 August 2015) in which an electric skateboard was hacked.

In light of these reports, we thought it might be an appropriate time to mull over some of the competing design constraints involved in creating embedded systems that are safe, reliable and secure.


The competition between safety and reliability is easily explained (and generally well understood). Put simply, we could (for example) make our electric skateboard 100% safe by making it 100% unreliable: that is, by ensuring that it could never move. This would – clearly – not be a great design solution …

The starting point for many successful designs is therefore: [i] we consider what we need to do in order to meet the safety requirements; then [ii] we consider what we need to do to meet the reliability requirements – without making the system any less safe.

This is – of course – all very well in theory. In practice, a key part of the process for ensuring safety and reliability will involve discussions with as many “stakeholders” as possible and – in particular – talking about the design with people who understand the environment in which the product or system will be used. In the case of our skateboard (for example) it is probably going to be beneficial to talk to people who use such boards to commute to work (for example).

This all sounds fine, until your CEO appoints your company’s first Head of Secure Design (HoSD).

When you tell the new HoSD that you are about to talk to as many people as possible about the design for the new electric skateboard (so that you can make the product safe and reliable), you may find that attempts are made to have you sacked, and escorted promptly from the building.

The problem is – of course – that secrecy is often seen as a key requirement in a secure design: unless care is taken, this constraint may be at odds with the need to ensure system safety and reliability.

What impact will increased demands for security have over the next few years? One impact is that influential safety standards (e.g. IEC 61508, ISO 26262) will need to be revised in order to address security concerns more fully.

Even after key standards have been updated, we think that there is a risk that attempting to “bolt on” security checks to existing designs may lead to a reduction in the safety and reliability of many future embedded systems.

[Originally published 14 August 2015; updated 15 August 2015]



What is “Functional Safety”?


In our experience, the phrase “Functional Safety” (FS) often causes confusion.

When talking about FS, we are concerned with active (rather than passive) safety mechanisms.

In a passenger car (for example) a simple seatbelt serves as a passive safety mechanism, while an airbag or Collision Avoidance System (CAS) serves as an active safety mechanism.

When implementing active safety mechanisms (with the aim of achieving functional safety), the majority of current designs will involve a microcontroller and appropriate software: such designs will usually be implemented in compliance with international safety standards and guidelines, such as IEC 61508, ISO 26262 and DO-178.

[22 June 2015]



Common-Cause Failure vs. Common-Mode Failure

Our recent posts about redundant systems have provoked some discussion about “Common Cause Failures” and “Common Mode Failures”.

In our experience, definitions (or lack of them) can be a significant cause of confusion in this area.

First, it needs to be clear that we are typically talking about two units (components, systems, etc), arranged in some form of “1 out of 2” (1oo2) arrangement: our goal is usually to ensure that if at least one unit operates correctly, the related system will be able to operate safely.

[Depending – again – on our definitions, the system may not simply “operate safely” if at least one unit out of two is operating correctly, it may remain “fully operational” or may “operate normally” in these circumstances. However, this is not always what is meant. Clear definitions are required for a given system, including definitions of “operate safely”, etc.]

Given the above, most people seem to agree on what is meant by a “Common Cause Failure” (CCF). We find that it is sometimes helpful to think of this in terms of “system inputs”. Failure of a power supply (input) that is linked to both of our units is a typical example of a CCF. Water ingress, EMI, radiation, etc, may all also be common root causes of failure in our two units.

In our experience, there is usually less agreement on what is meant by a “Common Mode Failure” (CMF). We find that it is sometimes helpful to think of this in terms of “system outputs”. More specifically, a CMF is concerned with the way in which our units fail (usually in situations where they both fail). For example, if – in response to a power supply “glitch” – both units shut down (“fail silently”), then this can be viewed as an example of a Common-Mode Failure. If – in the event of the same power glitch – one unit shuts down but the other unit “hangs” in a potentially dangerous state, then this is not what we see as a CMF situation (because the two units are demonstrating a different failure mode).

Note that it is usually assumed that a CCF will precede any CMF, but – again – this depends on your definitions.

Probably because a CCF may precede a CMF, people sometimes assume that the two phrases are synonymous: this is not generally helpful, in our view.

[Originally published 16 April 2015; updated 11 June 2015]



A TT design from the 17th Century?


We are always on the lookout for effective applications of TT architectures.

Here’s a very early example: an “automated carillon”. Dating from around the 17th Century, these devices were designed to ring bells in a pre-determined sequence. By changing the arrangement of “pins” in a drum, the melody (program) could be changed.

The photograph below (taken by our CEO in April 2015) shows the automated carillon mechanism from the belfry in Ghent (Belgium).


[If you’ve landed on this page because you want to purchase an automated carillon, our local bell foundry may be able to help.]

[1 May 2015]



Working with “Time-Triggered Hybrid” (TTH) schedulers

We’ve received some questions about the configuration of “TTH” schedulers, and about TTRD11b in particular from the “ERES (1769)” book.

The main questions (paraphrased slightly) are as follows:

On page 269 [ERES(1769), first printing], you mentioned in the Table 12 that task A, B, C, D are jitter sensitive, so to reduce the jitter levels, they are put into the timer ISR as TTH tasks. But I am not sure if I understood it correctly. Now there are four tasks with the same periods (10 seconds), and the GCD of them should be 1 second, so the timer ISR will run every 1 second. If we make them synchronized, i.e. every task will be released at the same tick interval every time, then again we will face the jitter problem, the worst case will happen to task D, since it is the last to be executed. So, we need to add offset with them, so that at each tick interval there is only one task running. Am I right on this?

Another issue is, now the tick interval is 1 second, but the WCET is 2 seconds, which means one task might not be finished before next tick interval. So do we need to increase the tick interval to take this into account?

In TTRD11b, the tick interval used is 10 seconds (which corresponds to the GCD for this task set, as discussed in Chapter 4 of the book).

In this design, Task A, Task B, Task C and Task D are all released (in sequence) from the scheduler ISR. The total execution time of this task sequence will be 8 seconds, in a period of 10 seconds: these tasks (therefore) consume around 80% of the available CPU capacity (ignoring scheduler overheads).

We assume that all of these tasks will have a “balanced” execution time (as discussed in Chapter 6), in order to help meet the jitter requirements.

Task E is not jitter sensitive. It is released as a low-priority task, by the (TTC) scheduler “dispatcher”, every 20 seconds.

The WCET of Task E is 3 seconds and there is only a “slot” of approximately 2 seconds available: Task E will therefore be pre-empted by the high-priority tasks (Task A – Task D) every time it executes.

The end result is that we can meet the requirements for all tasks in the set using this simple architecture.

[20 January 2015]



Use of “idle” mode in TT systems

We’ve received some questions about the use of idle mode in the schedulers that are presented in the ERES (1769) book. The questions concern the links between idle mode and jitter in the task release times. We’ve included some notes about this important topic in this post.

In most cases, the schedulers in the ERES (LPC1769) book enter “idle” mode at the end of the Dispatcher: this is usually achieved by means of the SCH_Go_To_Sleep() function. The system will then remain “asleep” until the next timer Tick is generated.

Clearly, the use of idle mode can help to reduce power consumption. However, in most cases a more important reason for putting the processor “to sleep” is to control the level of “jitter” in the Tick timing.

Control of jitter is an important consideration in the majority of control systems and data-acquisition / sensing systems. As we detail in Chapter 4 of ERES, jitter levels in the region of 1 microsecond can be significant even in designs that appear to have rather rudimentary signal-processing requirements: the constraints on high-performance systems are often far more severe.

Use of idle mode in a TT scheduler allows us to control jitter in the task (release) timing because we can ensure that both the hardware and the software are in the same (known) state every time an interrupt takes place.

No matter what you do, it is simply not possible to achieve the same level of temporal determinism in a conventional (event-triggered) design.

In a typical TTC design, the Dispatcher is called (as the only function) in an endless loop in main(). If idle mode is not used, then the system will keep calling the Dispatcher once it has finished releasing the tasks that are scheduled to run in a given Tick. When the next Tick occurs, we cannot be sure where we will be in the Dispatcher code, and the time taken to respond will (therefore) inevitably vary.

If we put the processor into an idle mode at the end of the Dispatcher, we are placing both the software and hardware into a known state. If we are always in the same state when the Ticks occur, the time taken to respond will be essentially identical (or as close to identical as it is possible to get on a given hardware platform). The result is a very low level of jitter.

For example, as we demonstrate in Chapter 5 of ERES, an LPC1769 microcontroller running at 100 MHz and supporting a TTC scheduler already has a low level of Tick jitter (around 0.2 microseconds) even without use of idle mode in the microcontroller: if we use the idle mode, the level of Tick jitter becomes too low to measure.

The discussions above concern TTC schedulers, but control of Tick jitter in this way also applies in TT designs that support task pre-emption. For example, in a simple “Time-Triggered Hybrid” (TTH) scheduler running on the same LPC1769 platform (100 MHz), we obtain jitter levels of around 0.4 microseconds: if we incorporate idle mode in this design (as discussed in ERES, Chapter 12) we can – again – bring the jitter levels down to levels that are too low to measure.

Overall, appropriate use of idle modes is a very simple way of ensuring that we can – if and when required – have very precise control of the timing behaviour in a TT design. This is, in turn, an important reason for the popularity of this architecture in what are sometimes called “hard” real-time systems.

[16 January 2015]



Use of watchdog timers in TT systems

We’ve received some questions about the use of watchdog timers in the ERES (1769) book. The questions relate specifically to TTRD2a and TTRD3a.

The questions are as follows (paraphrased slightly):

I have one question regarding the watchdog timer in Chapter 2 and 3. I only see in the system initialization phase System_Init(), that it checks if the system is reset by the watchdog timer or not, if so the system goes into a fail silent mode. What about later? How does the system go into the fail silent mode during the process, due to some task overrun for example.

As far as I know, the watchdog feed function is also a task in the scheduler (the first one), and will be executed at the start of each scheduler tick. So, for any reason (task overrun), the watchdog feed task cannot be run in time, it will timeout, then what? The system is supposed to go into the fail silent mode, but how? I didn’t find any place other than System_Init() that deals with the fail silent mode.

It may be that we need to say more about the watchdog timer in a future edition of this book. For now, we’ll provide some further information here. You’ll also find some background information in the book “Patterns for Time-Triggered Embedded Systems” (in Chapter 12): you can download this book here. Some of the early parts of the answer below are adapted from material in “PTTES”.

Let’s start at the beginning …

What is a watchdog timer?
Suppose there is a hungry dog guarding a house, and someone wishes to break in. If the burglar’s accomplice repeatedly throws the guard dog small pieces of meat at (say) 2-minute intervals, then the dog will be so busy concentrating on the food that he will ignore his guard duties and will not bark. However, if the accomplice runs out of meat or forgets to feed the dog for some other reason, the animal will start barking, thereby alerting the neighbours, property occupants or police.

A similar approach is followed in computerised ‘watchdog timers’. Very simply, these are timers which, if not refreshed at regular intervals, will overflow. In most cases, overflow of the timer will reset the system. Such watchdogs are intended to deal with the fact that, even with meticulous planning and careful design, embedded systems can ‘hang’ in the field: for example, as a result of programming faults or electromagnetic interference (EMI). A watchdog can allow the system to recover from this situation, in certain circumstances.

General use of WDTs
In practice, most modern microcontrollers (including the LPC1769 that is used in the examples we are considering here) incorporate a watchdog timer (WDT) unit. This WDT unit can be initialised with an “overflow” period (e.g. 1.1 ms). As long as we “feed” or “refresh” the WDT at intervals less than the overflow period, then the WDT will do nothing. However, if we don’t refresh the WDT within the overflow period, then the processor will be reset.

Note that “feeding” the WDT is a very straightforward process: please refer to the WATCHDOG_Update() function in Listing 11 for code details for the LPC1769 microcontroller.

The key thing to appreciate is that – when the microcontroller starts up – we can distinguish between a “normal” reset and a “watchdog” reset. If the microcontroller has performed a normal reset (typically caused by the user switching on the device), then we would – of course – expect the system to operate normally. If, however, the reset / startup was caused by a watchdog overflow, then we would expect to enter a different mode: this will typically be some form of “limp home” or “fail silent” mode: we give various examples of systems with multiple operating modes later in the ERES (1769) book, starting in Chapter 7.

Use of WDTs to detect task overruns and system overloads in TT designs
The above notes describe the general use of WDTs in an embedded system. We’ll now consider the use of such timers in a time-triggered (TT) design. For simplicity, we will assume that a tick interval of 1ms is being used in the system. We’ll also assume that a “TTC” scheduler is being employed (as is the case in the examples in Chapter 2 and Chapter 3 of the ERES book). Finally, in this example, we’ll assume that we are currently in “normal” mode.

First, we set up our WDT, with an overflow / timeout value slightly greater than the tick interval: the tick interval is assumed to be 1ms, and we’ll use 1.1 ms as the overflow value. The code for achieving this on the LPC1769 is shown in Listing 11 (starting on page 57 of the ERES book).

Once we’ve set up the WDT, we need to set up the scheduler (as discussed in Chapter 2). We then need to add the WDT “update” task to the schedule: this is the task that feeds the timer, and it must be called once per millisecond, at the start of each scheduler tick, as shown below:

Figure 18 from "The Engineering of Reliable Embedded Systems: LPC1769 edition" (2014)

Note that the WDT update task is also given in Listing 11 (details above).

Other tasks are then added to the schedule, as required: see, for example, Chapter 3 (TTRD3a), where a number of additional tasks are added in order to complete the controller for the washing machine.

Used in this way, the WDT allows us to deal with general faults (such as those related to EMI), as outlined above. In addition, use of a WDT in this way provides us with a simple way of detecting “task overruns”: that is, tasks that take longer to complete than their expected “worst-case execution time” (WCET).

We can detect task overruns because (at least in the majority of TTC designs) we know that all tasks should complete in the tick interval in which they are released, as illustrated below:

Figure 21 from "The Engineering of Reliable Embedded Systems: LPC1769 edition" (2014)

As an example, suppose that we have only one task (“Task B”) running in a tick interval of 1 ms. Suppose that this task has an assumed WCET of 0.9 ms. Further suppose that – as a consequence of a coding error – Task B gets caught in an endless “while(1)” loop. In these circumstances, Task B will still be running when the next tick occurs. It will therefore prevent the watchdog update task from running at the start of that tick, as scheduled: because the watchdog is not “fed” on time, the WDT will trigger a processor reset (as illustrated below):


Following the WDT-induced reset, the microcontroller will start up again. When it does restart, it will determine that the WDT caused the reset. The system will therefore (in TTRD2a) enter a “Fail Silent” mode: this is documented in Listing 7 of the ERES book.

Better ways of detecting task overruns (and underruns) in TT designs
It should be emphasised that the above notes summarise the use of WDTs as a simple means of detecting task overruns in TT designs.

More generally (and more flexibly) we would aim to use a MoniTTor unit to check that each task operates within the “best case execution time” (BCET) and “worst case execution time” (WCET) limits that were determined when the system was developed. Design and use of MoniTTor units is described in detail in Chapter 9 of the ERES book.
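To give a flavour of the execution-time checks involved, the sketch below compares a task's measured execution time against its BCET / WCET limits. This covers only the "after the event" comparison (the names are illustrative); a full MoniTTor unit, as described in Chapter 9 of the ERES book, also employs a hardware timer so that a task which is still running at its WCET limit can be interrupted.

```c
#include <stdint.h>

typedef enum { TASK_OK, TASK_UNDERRUN, TASK_OVERRUN } task_status_t;

/* Compare a task's measured execution time against the BCET / WCET
   limits determined during development. Note that an underrun
   (completing suspiciously quickly) can indicate a fault, just as an
   overrun can. */
task_status_t check_exec_time(uint32_t measured_us,
                              uint32_t bcet_us,
                              uint32_t wcet_us)
{
    if (measured_us < bcet_us) { return TASK_UNDERRUN; }
    if (measured_us > wcet_us) { return TASK_OVERRUN; }
    return TASK_OK;
}
```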

[12 January 2015]



Some notes on N-version programming and “backup tasks”


In these notes, we explore some of the links between N-Version Programming (NVP) and the creation of backup tasks for use in a safety-related system.


The problem


We’ll assume that “Task X” in a particular system has failed at run time (that is, while the system is operating).

We’ll also assume – for the purposes of discussion – that we’ve detected the problem with Task X because it has overrun: that is, its execution time on a particular run has exceeded its predicted “Worst Case Execution Time” (WCET).

Task overruns can occur for a number of reasons. For example, if a task is designed to read from an analogue-to-digital converter (ADC) and the ADC is damaged, then the task may “hang” while waiting for a conversion to complete. This is – of course – the kind of behaviour that can be handled locally (for example, the task should check that the ADC has been configured correctly before using it, and should incorporate a suitable timeout mechanism whenever it takes a reading). We’ll assume that Task X has been designed to deal with such problems and has been carefully reviewed: our task overrun problem therefore indicates that a significant (unknown) error has occurred.
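The local timeout mechanism mentioned above might be sketched as follows. The ADC interface here is simulated (a real task would poll the converter's "done" flag), and the loop limit is an assumed value that, in a real design, would be derived from the ADC's worst-case conversion time.

```c
#include <stdbool.h>
#include <stdint.h>

#define ADC_TIMEOUT_POLLS 10000u  /* Assumed limit: in a real design,
                                     derived from the ADC's worst-case
                                     conversion time */

/* --- Simulated ADC hardware (for illustration only) --- */
static uint32_t sim_polls_needed;  /* 0 = conversion never completes */
static uint32_t sim_polls_so_far;

static bool adc_conversion_done(void)
{
    if (sim_polls_needed == 0) { return false; }
    return ++sim_polls_so_far >= sim_polls_needed;
}

static uint16_t adc_read_result(void) { return 0x0123; }

void adc_sim_reset(uint32_t polls_needed)
{
    sim_polls_needed = polls_needed;
    sim_polls_so_far = 0;
}

/* --- Task-level read with a bounded wait --- */
/* Returns true (and stores the reading) on success; returns false if
   the conversion does not complete in time, so that the task reports
   a fault rather than "hanging" */
bool adc_read_with_timeout(uint16_t *result)
{
    for (uint32_t i = 0; i < ADC_TIMEOUT_POLLS; i++) {
        if (adc_conversion_done()) {
            *result = adc_read_result();
            return true;
        }
    }
    return false;
}
```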

In some designs, the overrun of Task X (in these circumstances) might be reason enough to shut down the system as cleanly as possible. However, not all systems can be shut down immediately in this way: for example, if our task is part of a safety-related automotive system and the vehicle is currently travelling at speed down a motorway, then simply “shutting down” is an option that we would generally wish to avoid.

For similar reasons, we will assume that it is not possible to simply terminate Task X (and have the system carry on without this task).

In these circumstances, we may wish to “switch in” a backup task (“Backup Task X”). This backup task will – we assume – replace Task X in the schedule: indeed, our aim is that, the next time Task X is due to run, Backup Task X will run in its place. Depending on the system requirements, we may then wish to continue running with Backup Task X indefinitely, or we may wish to signal an error (and – for example – encourage the driver to stop the vehicle as soon as it is safe to do so).
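In a TT design, “switching in” a backup can be as simple as replacing a function pointer in the task array. The sketch below assumes a schedule structure with a pre-registered backup function for each slot; the names are illustrative, not an API from the ERES book.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 10u

typedef struct {
    void (*task_fn)(void);    /* Function run when slot is released */
    void (*backup_fn)(void);  /* Pre-registered backup (NULL if none) */
    uint32_t period;          /* Ticks between task releases */
} task_t;

static task_t schedule[MAX_TASKS];

/* Hypothetical tasks, for illustration */
void Task_X(void)        { /* full implementation */ }
void Backup_Task_X(void) { /* simpler "limp home" implementation */ }

/* Replace a failed task with its backup. The slot keeps its place in
   the schedule, so the next time Task X would have been released,
   Backup Task X runs instead. Returns 0 on success, -1 if no backup
   has been registered for the slot. */
int SCH_Switch_To_Backup(size_t slot)
{
    if (slot >= MAX_TASKS || schedule[slot].backup_fn == NULL) {
        return -1;
    }
    schedule[slot].task_fn = schedule[slot].backup_fn;
    return 0;
}
```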


Applying an NVP approach


We now turn our attention to the design and implementation of Backup Task X.

Clearly, we’d like to try and ensure that – whatever the underlying faults that gave rise to the failure of Task X – Backup Task X will not suffer from the same problems. The challenge is that we presumably created Task X to be reliable, using appropriate design and implementation approaches. In particular, we will – presumably – have designed Task X in such a way that we didn’t expect it to suffer from any faults. So, how can we try to ensure that Backup Task X does not suffer from the same problems as Task X when we don’t know what the problem is that we are trying to avoid?

One way in which we might try to achieve this is through “N-Version Programming”. This will involve creating a single specification for the system task, and passing it to two independent teams (for example, we might have one team in Europe and another team in Singapore). Each team is expected to produce a version of the task to the best of their ability.

The underlying assumption is not that either task will be perfect. Instead, what we are trying to do is find a way of creating two tasks that have different faults (that is, statistically uncorrelated faults).

In the mid 1980s, John Knight and Nancy Leveson set out to explore this issue in a much-cited study (Knight and Leveson, 1985; Knight and Leveson, 1986):

“In order to test for statistical independence, we designed and executed an experiment in which 27 versions of a program were prepared independently from the same requirements specification by graduate and senior undergraduate students at two universities. The students tested the programs themselves, but each program was subjected to an acceptance procedure for the experiment consisting of 200 typical inputs. Operational usage of the programs was simulated by executing them on one million inputs that were generated according to a realistic operational profile for the application. Using a statistical hypothesis test, we concluded that the assumption of independence of failures did not hold for our programs and, therefore, that reliability improvement predictions using models based on this assumption may be unrealistically optimistic.”
[Knight and Leveson, 1986]

The Knight and Leveson study has proved influential and it now appears to be generally accepted that N-Version programming is not an effective solution to the problem of creating two “equivalent” tasks that can be expected to fail in different ways.


Alternative solutions (1)


So – if we return to our original problem – how should we produce our backup task?

The answer in the general case is that – as with all of the other tasks in the system – we should start with an appropriate specification and produce a new task to the best of our ability. However, rather than do so in a “Black Box” way – allowing a separate team to create this task – we should work in a “White Box” way, and focus on trying to minimise the opportunities for common-cause failures. This may mean – for example – trying to ensure that we employ different hardware resources and different algorithms in the two tasks. We also know that coding errors are much less likely in short pieces of code and we may – therefore – aim to keep the backup task very simple (in the expectation that this will be likely to improve reliability even if it results in reduced quality or performance). This is sometimes known as a “limp home” task.
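To make the “limp home” idea concrete, here is a deliberately contrived example (a heater controller, invented for these notes and not taken from the ERES book): the primary task computes its output from a sensor reading, while the backup ignores the potentially faulty sensor path altogether and applies a fixed, conservative output.

```c
#include <stdint.h>

#define LIMP_HOME_DUTY 10u  /* Assumed safe, low duty cycle (percent) */

/* Primary task: simple proportional control based on a temperature
   reading: more capable, but dependent on the sensor path */
uint32_t primary_heater_duty(int32_t temp_c, int32_t setpoint_c)
{
    int32_t error = setpoint_c - temp_c;
    if (error <= 0) { return 0; }
    uint32_t duty = (uint32_t) error * 5u;
    return (duty > 100u) ? 100u : duty;
}

/* Backup "limp home" task: no sensor, no algorithm - just a fixed,
   conservative output that keeps the system safe at reduced
   performance, and is short enough to be reviewed exhaustively */
uint32_t backup_heater_duty(void)
{
    return LIMP_HOME_DUTY;
}
```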

Working in this general way may be our only option, if we don’t have any knowledge about the faults that have caused our original task to fail. However, if we can obtain information from the original task about the problem that caused it to fail, this may allow us to execute a more appropriate backup option. For example, if – as suggested above – our task fails because of an ADC problem (and we are aware of this), we may be able to run a replacement task that uses a different ADC.


Further information


You’ll find further information about some of the issues raised in this note in “The Engineering of Reliable Embedded Systems” (2014) by Michael J. Pont.


Related references


Knight, J.C. and Leveson, N.G. (1985) “A Large Scale Experiment In N-Version Programming”, Digest of Papers FTCS-15: Fifteenth International Symposium on Fault-Tolerant Computing, June 1985, Ann Arbor, MI, pp. 135-139.

Knight, J.C. and Leveson, N.G. (1986) “An Experimental Evaluation of the Assumption of Independence in Multi-version Programming”, IEEE Transactions on Software Engineering, Vol. SE-12, No. 1 (January 1986), pp. 96-109.

Knight, J.C. and Leveson, N.G. (1990) “A reply to the criticisms of the Knight & Leveson experiment”, ACM SIGSOFT Software Engineering Notes 15 (1), pp. 24-35.

[15 August 2014]