Developers’ Blog from SafeTTy Systems Ltd

This page hosts an informal “Technical Blog” from members of the team at SafeTTy Systems Ltd.

The focus is on the development of reliable, real-time embedded systems (in sectors ranging from household goods to satellite systems).

The material presented on this page is based on the use of ‘Time Triggered’ (TT) software architectures.

If you are unfamiliar with TT architectures, our Technology page provides some background material.

You may also like to take a look at our News Blog.


S32K TTRDs?

The NXP S32K family of AEC-Q100 qualified 32-bit ARM Cortex-M4F and Cortex-M0+ based MCUs has now been released.

Individually, these MCUs are aimed at automotive (up to ‘ASIL B’) and industrial applications (up to ‘SIL 2’). In a DuplicaTTor formation, we would expect to be able to use these devices in a range of high-integrity designs (up to ‘ASIL D’ / ‘SIL 3’).

We are currently awaiting our first set of evaluation boards (Cortex-M4F).

If our initial tests run to plan (and we anticipate sufficient demand) we would expect to release a DuplicaTTor Evaluation Board based on this platform before the end of the year.

Please contact us for further information.

[16 March 2017]


Expanded version of ‘TTRD2-19a’ now available

As promised (see Blog Post 25 February 2017), we have now released an expanded version of our popular ‘TTRD2-19a’ example. You can download this example on our new TTRD page.

TTRD2-19a is a complete example of a ‘CorrelaTTor-A’ design. This means that it incorporates a TT scheduler, and two key monitoring components (MoniTTor and PredicTTor). Appropriate startup tests are also included in this demo system. This design is documented in ‘ERES2‘ (Chapter 19).

TTRD2-19a demonstrates a very popular TT software platform. Using an appropriate MCU and with the addition of a small external ‘watchdog’ device (eWDC), this platform can – for example – form the basis of an ‘ASIL D’ design (in compliance with ISO 26262).

To configure the monitoring systems that are used in TTRD2-19a, we need:

  • information about the task execution times, both ‘worst case’ (WCET) and ‘best case’ (BCET);
  • information about the system ‘tick list’ (in Normal mode).

We have released two related TTRDs that can be used to provide the required data:

  • TTRD2-19a uses task timing data that were created using TTRD2-a07a;
  • TTRD2-19a uses task-sequence data that were created using TTRD2-a08a.
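
As a rough illustration of how this data is typically used (this is not code from the TTRDs themselves; all names and values below are hypothetical), the measured timings and the expected task sequence are usually stored as small tables of constants that the MoniTTor and PredicTTor can check against at run time:

    #include <stdint.h>

    /* Hypothetical per-task timing record (values in microseconds).
       In practice these figures would be produced by a measurement
       build such as TTRD2-a07a.                                      */
    typedef struct {
        void     (*task_p)(void);   /* Task function                  */
        uint32_t wcet_us;           /* Measured worst-case exec. time */
        uint32_t bcet_us;           /* Measured best-case exec. time  */
    } Task_timing_t;

    void Task_Update_Outputs(void); /* Placeholder task               */

    static const Task_timing_t Task_timing[] = {
        { Task_Update_Outputs, 950u, 410u }   /* Illustrative values  */
    };

    /* Hypothetical 'tick list': the expected task sequence in Normal
       mode, of the kind produced using TTRD2-a08a and checked by the
       PredicTTor at run time.                                        */
    static const uint8_t Tick_list_normal[] = { 0u, 1u, 0u, 2u };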

TTRD2-19a is documented in the ERES2 book.

We also include a discussion about this TTRD on our Taster Days and during our TTb training course.

[6 March 2017]


An expanded version of ‘TTRD2-19a’ will be available shortly

We’ve been releasing public TTRDs (for various processor targets) since 2014. These releases have always resulted in a number of technical questions from our customers. However, the level of interest in ‘TTRD2-19a’ (available for download from 4 January 2017) has exceeded anything that we’ve seen previously.

TTRD2-19a is a complete example of a ‘CorrelaTTor-A’ design. This means that it incorporates a TT scheduler, and two key monitoring components (MoniTTor and PredicTTor). Appropriate startup tests are also included in this demo system. This design is documented in ‘ERES2‘ (Chapter 19).

TTRD2-19a demonstrates a highly-effective TT software platform. Using an appropriate MCU and with the addition of a small external ‘watchdog’ device (eWDC), this platform can – for example – form the basis of an ‘ASIL D’ design (in compliance with ISO 26262).

Most of the questions that we have received in recent weeks relate to the modelling of TTRD2-19a, and the configuration of the run-time monitoring components. More specifically, we’ve been asked about the generation of the ‘tick list’, and about techniques for determining the task execution timings (‘WCET’ and ‘BCET’).

We’ve decided that the simplest way to address these questions is to release an expanded TTRD2-19a example along with linked TTRD2-a07a and TTRD2-a08a examples: we aim to do this by the middle of March 2017.

Further information about these releases will be available shortly.

[25 February 2017]


TMS570 TTRDs?

We are pleased to confirm that the public TTRDs released in March 2017 will include examples for the TI TMS570 (judging by the comments in our inbox, this is a popular family of MCUs).

[24 February 2017]


The next suite of public TTRDs will be released in March 2017

We are currently working on the next suite of public TTRDs (linked to the ERES2 book).

The aim is to ‘fill in the gaps’ in the current set of examples by the end of March 2017 (we realise that there are still quite a few gaps – please accept our apologies for the delay with this release).

Further reports will follow in this blog in the next few weeks.

[23 February 2017]


More about architectures for fully-autonomous vehicles (SAE Level 5)

As we noted on 5 February 2017, we’ve recently received several requests for advice about the selection of software and hardware architectures for use in autonomous vehicles. In these discussions, we’ve been exploring several ‘Design Sketches’.

Two of these Design Sketches have generated particular interest: these are ‘DS4’ and ‘DS6’ (summarised below).

[Figures: Design Sketch 4 (DS4) and Design Sketch 6 (DS6)]

We’ll say a little more about these designs in a moment. First, let’s be clear about our assumptions.

This is our assumed operating scenario:

  • our ‘Autonomous Road Vehicle’ (ARV) is to ‘drive itself’ on a motorway network (only);
  • there will be other vehicles on this motorway (other ARVs and other ‘normal’ cars);
  • at a service area, the driver becomes a passenger: he / she moves into the back seat of the ARV, inserts the vehicle ‘key’, enters the required destination and presses the green ‘go’ button;
  • the ARV will then drive on the motorway to the service area nearest to the passenger’s required destination and will stop; the passenger then becomes the driver again;
  • the speed of the vehicle is limited to 50 mph in ‘autonomous’ mode;
  • while in autonomous mode, there is no (human) driver; therefore, there is no ‘emergency stop’ button anywhere in the vehicle.

These are our key design assumptions:

  • our starting point is a ‘COTS’ ARV Controller (ARVC) that has been developed by a third party;
  • the ARVC has some ‘input sensors’ and it generates two outputs: [i] the required vehicle speed (0-50 mph); and [ii] the required vehicle direction (‘steering wheel angle’);
  • the ARVC may incorporate a neural network or other adaptive software;
  • the adaptive nature of the ARVC design presents significant challenges for a traditional certification process, and we are looking for ways of increasing confidence that the system incorporating the ARVC will operate safely while the vehicle is in use.

Our key design goal is to monitor the operation of the ARVC and intervene if we detect that something is wrong.

The key design challenge is that – apart from very basic sanity checks (e.g. the ARVC is requesting a vehicle speed of 60 mph when our maximum allowed speed is 50 mph) – it is difficult to know whether the ARVC is operating correctly while the vehicle is moving.
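
To make the idea of such basic sanity checks concrete, here is a minimal sketch of a TT-MS task that range-checks the two ARVC outputs described above. All function and constant names are hypothetical, and the way in which the ARVC outputs reach the TT-MS (and the form of any intervention) is design-specific:

    #include <stdbool.h>
    #include <stdint.h>

    #define ARV_MAX_SPEED_MPH       (50u)
    #define ARV_MAX_STEER_ANGLE_DEG (45)    /* Illustrative limit only */

    /* Hypothetical interfaces to the ARVC outputs and the safe state  */
    uint32_t ARVC_Get_Requested_Speed_mph(void);
    int32_t  ARVC_Get_Requested_Steer_deg(void);
    void     TTMS_Force_Safe_State(void);

    void TTMS_ARVC_Sanity_Check_Task(void)
    {
        bool fault = false;

        /* e.g. 60 mph requested when the permitted maximum is 50 mph  */
        if (ARVC_Get_Requested_Speed_mph() > ARV_MAX_SPEED_MPH)
        {
            fault = true;
        }

        int32_t steer = ARVC_Get_Requested_Steer_deg();
        if ((steer > ARV_MAX_STEER_ANGLE_DEG) ||
            (steer < -ARV_MAX_STEER_ANGLE_DEG))
        {
            fault = true;
        }

        if (fault)
        {
            TTMS_Force_Safe_State();
        }
    }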

Design Sketch 4 (DS4) – shown above – illustrates one way in which we may be able to improve confidence in designs that are based on a single ARVC. In this case, we use a TT Monitoring System (TT-MS): the TT-MS has the ability to inject faults / tests (e.g. test images) into the ARVC at run time, in order to check that the behaviour is as expected.

DS6 illustrates another option. In this case there are three independent ARVCs used to create the system, and we take action only if 2 units agree. Initially, DS6 would be a more expensive option than DS4. However, it is likely that we would have greater confidence in the decisions made by this system (particularly if some of the injection techniques from DS4 were also employed). In addition – once the DS6 implementation had been fully evaluated – it should be possible to reduce unit costs significantly by integrating the three ARVCs (and TT monitoring) into a single compact device.

At present, we are prototyping some of these designs (on a lab testbench) using a DuplicaTTor board to implement the TT-MS. We hope to have the opportunity to explore this design in more detail on a road vehicle shortly.

If you are interested in working with us on this interesting project, please contact us.

[17 February 2017]


Architectures for fully-autonomous vehicles

In the last few weeks we’ve had several requests for advice about the selection of software and hardware architectures for use in autonomous vehicles.

The questions have related to what are defined as ‘Level 5’ designs by SAE.

We don’t think there are easy answers to these questions (but you’ll not be surprised to learn that we believe that TT monitoring systems could provide a very useful ‘safety net’ for many such systems).

In a recent interview in IEEE Spectrum magazine, Dr Gill Pratt (Head of the Toyota Research Institute) makes a number of wise observations about some of the emerging challenges in this area.

You’ll find the article here.

[We are grateful to David Mentré and Paul Bennett for drawing this article to our attention.]

[5 February 2017]


ISO 26262 vs. IEC 61508

The latest NMI ISO 26262 Workshop took place at HORIBA MIRA (Nuneaton, UK) on 26 January 2017.

At this event, Dr Michael J. Pont (Executive Director, SafeTTy Systems Ltd) gave a presentation entitled: “Are there lessons that ISO 26262 developers can (and should) learn from IEC 61508?”

The presentation abstract is reproduced below:

This presentation will be concerned with the development of software for real-time automotive systems that need to be both safe and reliable.

The goal of the presentation is to explore one of the central differences between ISO 26262 and IEC 61508, and to consider whether there are lessons that can (and perhaps should) be learned from the earlier (generic / industrial) safety standard by developers of automotive systems.

During the talk it will be suggested that one key difference between IEC 61508 and ISO 26262 is that the latter standard places less (explicit) reliance on the idea of fault tolerance. In particular, the phrase ‘Hardware Fault Tolerance’ (which is referred to throughout IEC 61508) does not appear in ISO 26262. One important consequence of this difference is that, while IEC 61508 can be seen to favour use of multi-processor architectures, there is much less emphasis on such a solution in ISO 26262.

Does this mean that ISO 26262 designs are likely to be ‘less safe’ than equivalent IEC 61508 designs?

It is hoped that this presentation will encourage a debate at the workshop.

You can now download a copy of the presentation slides for this talk (PDF file).

You’ll find further information about this event on the NMI website.

[5 February 2017]


New DuplicaTTor® Evaluation Boards


To support organisations that want to explore the use of modern TT designs we have introduced our first DuplicaTTor® Evaluation Board (DEB).

Using a DEB, organisations can evaluate design options up to ‘SIL 3’ / ‘ASIL D’ level (and equivalent).

Learn more on our DuplicaTTor page.

[17 January 2017]


‘TTRD2-19a’ released (a full ‘CorrelaTTor-A’ design)

As we noted on 20 December, we are currently preparing a set of ‘Time Triggered Reference Designs’ (TTRDs) to go with the new ‘ERES2‘ book.

We have now released TTRD2-19a for an STM32F401 target (Nucleo board).

This is a ‘CorrelaTTor-A’ design (with iWDT, MoniTTor and PredicTTor support).

It also includes a full suite of ‘Power-On Self Tests’ (POSTs).

To illustrate an extreme (but thorough) form of ‘Built-In Self Test’ (BIST) behaviour, TTRD2-19a carries out the following checks every few seconds: [i] it stores its state; [ii] it performs a system reset and a full suite of startup tests; [iii] it continues its operation from where it left off.

This type of behaviour is common in satellite designs, and can be useful in many other designs too (usually with less frequent testing).
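
As a rough sketch of how this reset-based BIST behaviour can be structured on a Cortex-M target such as the STM32F401 (this is not the TTRD2-19a source code): NVIC_SystemReset() is the standard CMSIS system-reset call, and the state-handling functions below are hypothetical.

    #include "stm32f4xx.h"   /* Device CMSIS header: provides NVIC_SystemReset()  */

    void STATE_Store(void);               /* Save system state (hypothetical)     */
    void STATE_Restore_If_Present(void);  /* Restore state after a BIST reset     */
    void STARTUP_Run_Self_Tests(void);    /* Full suite of POSTs                  */

    /* Released periodically (every few seconds) by the scheduler                 */
    void BIST_Trigger_Task(void)
    {
        STATE_Store();        /* [i]  store the current state                     */
        NVIC_SystemReset();   /* [ii] system reset: all startup tests run again   */
    }

    /* Called towards the end of the startup sequence                             */
    void System_Init(void)
    {
        STARTUP_Run_Self_Tests();
        STATE_Restore_If_Present();   /* [iii] continue from where we left off    */
    }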

Further information about this TTRD can be found in ERES2 (Chapter 19).

You can download all of the current TTRDs on the ERES2 page.

[4 January 2017]


Latest TTRDs released

As we noted on 15 December, we are currently preparing a set of ‘Time Triggered Reference Designs’ (TTRDs) to go with the new ‘ERES2‘ book.

So far, the suite consists of two simple examples: TTRD2-02a (STM32F0 target) and TTRD2-03a (XMC4500 target).

You can download the TTRDs on the ERES2 page.

Our aim is to complete TTRD2-09a (TMS570 target) and TTRD2-18a (STM32F4 target) before our Christmas break, then fill in some of the gaps in January.

[20 December 2016]


‘ERES2’ is now back in stock

We are pleased to confirm that ‘ERES2’ is now back in stock:

    Pont, M.J. (2016) “The Engineering of Reliable Embedded Systems: Developing software for ‘SIL 0’ to ‘SIL 3’ designs using Time-Triggered architectures”, (Second Edition) SafeTTy Systems. ISBN: 978-0-9930355-3-1.

To place your order, please visit the ERES2 page.

[16 December 2016]


First TTRD released for ‘ERES2’


We’re currently working on the ‘Time-Triggered Reference Designs’ (TTRDs) to support the new ‘ERES2’ book.

So far we’ve released the first TTRD (TTRD2-02a): this example implements the basic scheduler code from Chapter 2.

(Chapter 2 is included in the sample material for this book and can also be downloaded from the ERES2 page.)

We hope to complete work on these TTRDs by the end of January.

[15 December 2016]


‘ERES1’ is now available for download (free of charge)


Following publication of ‘ERES2‘, we have released a full PDF copy of ‘ERES1’.

You can download the complete book and the latest code examples on the ERES1 page.

Release of this book and related code examples is primarily intended to support requests from universities and colleges for access to this material. However, the book and code examples are freely available to anyone, subject to the restrictions listed on Page xxv in the book.

[11 December 2016]


‘ERES2’ has finally been published … and has now sold out!

We are pleased to announce that ‘ERES2‘ has now been published:

    Pont, M.J. (2016) “The Engineering of Reliable Embedded Systems: Developing software for ‘SIL 0’ to ‘SIL 3’ designs using Time-Triggered architectures”, (Second Edition) SafeTTy Systems. ISBN: 978-0-9930355-3-1.


All pre-orders for this book have now been dispatched (the final shipments left our office on 30 November).

Due to a late surge in demand for pre-orders, we are temporarily out of stock of ERES2 …

… however a second print run is in progress and we will have new stock available on 16 December.

Further information can be found on the ERES2 page.

[2 December 2016]


Meeting ‘Category 3’ / ‘Category 4’ requirements (ISO 13849-1: 2015)

ISO 13849-1: 2015 is concerned with the development of the safety-related parts of control systems that are used in machinery.

A very wide range of products is covered by this standard, ranging from factory machines to mobile agricultural equipment. Within the EU, this standard is associated with the Machinery Directive.

One way in which ISO 13849-1 differs from many of the other standards that we work with is that it includes five ‘designated architectures’ (DAs): Category B, Category 1, … Category 4. Use of one of these DAs in a system design can reduce the effort required to achieve compliance with the standard.

We’ve been asked recently about the links between the DAs presented in ISO 13849-1 and our recommended TT platforms.

We’ve summarised some of the potential links in the table below:

[Table: possible links between the ISO 13849-1 designated architectures and TT platforms]

As an example, the figure below illustrates how it may be possible to implement a Category 3 DA using a DecomposiTTor platform.

[Figure: implementing a Category 3 DA using a DecomposiTTor platform]

As a second example, the figure below illustrates how it may be possible to implement a Category 4 DA using a DuplicaTTor-He platform.

[Figure: implementing a Category 4 DA using a DuplicaTTor-He platform]

[23 June 2016]


New MISRA guidelines on security

Application of the ‘MISRA C’ guidelines is widely seen as an effective way of improving the safety of embedded systems that are implemented using the popular ‘C’ programming language.

Since the latest version of MISRA C was published (MISRA C: 2012), ISO/IEC JTC1/SC22/WG14 (the committee responsible for maintaining the C standard) has published a set of guidelines that are intended to improve the security of systems that are implemented using C: these guidelines have the snappy title “ISO/IEC TS 17961:2013”.

ISO/IEC TS 17961:2013 specifies rules for secure coding in the C programming language, and provides examples of both ‘compliant’ and ‘non-compliant’ code.
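
As a flavour of the kind of rule involved (this is our own illustration, not text taken from the TS or from MISRA): unbounded string copies are a classic source of security defects in C, and the "secure" alternative bounds the copy and guarantees termination.

    #include <string.h>

    #define BUF_SIZE (16u)

    void copy_id_noncompliant(char *dest, const char *src)
    {
        strcpy(dest, src);                  /* No bound: 'dest' may overflow */
    }

    void copy_id_compliant(char dest[BUF_SIZE], const char *src)
    {
        strncpy(dest, src, BUF_SIZE - 1u);  /* Bounded copy ...              */
        dest[BUF_SIZE - 1u] = '\0';         /* ... with guaranteed null      */
    }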

MISRA has now published two documents related to ISO/IEC TS 17961:2013:

  • MISRA C:2012 (Addendum 1) illustrates links between existing MISRA rules and the “C Secure” requirements found in ISO/IEC TS 17961:2013;
  • MISRA C:2012 (Amendment 1) provides some additional guidelines that are intended to improve the security of systems implemented using C.


Both of these documents can be downloaded (free of charge) from the MISRA Bulletin Board.

It is (of course) unlikely that coding guidelines alone will provide the level of security (or safety) that is required in many modern, highly interconnected, embedded systems. However, we take the view that – by combining such coding guidelines with an appropriate TT architecture – designers have the potential to create a foundation for systems that are both safe and secure.

[25 May 2016]


Use of library code in safety-related systems

We have been asked several times in recent months about the use of third-party library code in safety-related systems (for example, designs developed in compliance with IEC 61508).

For designs programmed in ‘C’, we typically need to consider (at least) the following third-party code:

  • MCU startup code
  • C-language Standard Libraries
  • MCU peripheral libraries (for example, GPIO interface libraries, ADC libraries, …).

Example startup code will often be provided by the MCU manufacturer. This may be written in Assembly language.

C-language standard libraries will often be provided by the compiler manufacturer.

The MCU peripheral libraries are – in most designs – the largest suite of third-party code that is used in a project. This code will generally be provided by the MCU manufacturer.

Let’s assume that we are developing a design in compliance with IEC 61508 (SIL 3). We’ll further suppose that we have used all of the above code (as provided by our compiler or MCU manufacturer) in a prototype design. Can we use this code in our production system?

In terms of IEC 61508, the library code identified above can be viewed as a “pre-existing software element”.

IEC 61508-3 [2010] has clear requirements for the situation in which such pre-existing software elements are used to implement all or part of a safety function (see IEC 61508-3 Clause 7.4.2.12).

The options are described as three “routes to compliance” for a software element.

These routes are paraphrased below:

Route 1S: compliant development. The element was developed in compliance with IEC 61508.
Route 2S: proven in use. There is evidence available that the element is proven in use (see Clause 7.4.10 in IEC 61508-2).
Route 3S: assessment of non-compliant development. The element has been shown to be compliant with Clause 7.4.2.13 in IEC 61508-3.

It is possible to find ‘qualified’ code for the ‘C’ standard libraries (that is, code that meets the ‘Route 1S’ requirements).

For startup and MCU peripheral libraries, finding qualified code is likely to be much more challenging.

If you cannot achieve ‘Route 1S’ compliance for your startup and peripheral code, you may be tempted to look at ‘Route 2S’ (proven in use).

In our experience, a ‘proven in use’ requirement is often very difficult to justify. (Did you use exactly the same library code in the previous product? Did you use it in exactly the same way?)

In most cases, organisations are (in our experience) left with ‘Route 3S’ as the only viable solution. This involves (in effect) applying a review and ‘code hardening’ process to that subset of the startup and MCU library code that is used in a particular project, with the aim of bringing the code up to the required safety standard.

This is not a trivial process (and time needs to be allowed for it in the project timeline).

[31 March 2016]


IEC 62304:2006 – Amendment 1:2015


IEC 62304 is concerned with the development of software for use in medical devices.

The standard IEC 62304 dates back to 2006 and – as part of the progress towards a new edition – ‘Amendment 1’ was published in 2015.

This article summarises some of the key changes that will result from the introduction of Amendment 1.


Brief overview of the standard


The IEC 62304 standard notes that software is often an integral part of medical-device technology. It further notes that the effectiveness of a medical device that contains software requires: [i] knowledge of what the software is intended to do, and [ii] demonstration that the software will fulfill such intentions without causing unacceptable risks.

IEC 62304 requires that the manufacturer of the device assigns a safety class (Class A, Class B or Class C) to each software system.

The classes are assigned based on the impact that (failure of) the system may have:

  • Class A: No injury or damage to health is possible
  • Class B: Non-serious injury is possible
  • Class C: Death or serious injury is possible


Scope of the amended standard

As noted above, IEC 62304 is concerned with the development of software for use in medical devices.

The original standard was felt to be rather ambiguous about the meaning of ‘software’ in this context.

The amended standard tries to make it clearer that the standard applies to any medical device that executes software.


Software safety classification

One of the areas of the original standard that gave rise to numerous discussions was the assumption that if a failure of the software could give rise to a hazard, then it must be assumed that the probability of such a failure was 100%. The intention of such a statement was to focus attention on the need for appropriate (external) risk-control measures that could be used – for example – to move the device from Class C to Class B.

The amended standard attempts to clarify the allocation into classes (including the ‘default’ of Class C).

The assumption about 100% probability of failure remains, but there is (in our view) greater clarity in the amended standard about the application of risk-control measures.


Dealing with legacy software

The amended standard includes a new clause (Clause 4.4) that describes the process of dealing with legacy software.

This might apply (for example) when updating a product that is already being used by the medical community.


Identification and avoidance of common software defects

The amended standard includes (in Clause 5.1.12 – another new clause) the requirement that [i] typical programming ‘bugs’ should be identified, and [ii] evidence should be provided that such bugs cannot give rise to unacceptable risk.

This is an interesting requirement. We suspect that developers working in ‘C’ will use adherence to the ‘MISRA C’ guidelines as a means of (partially) addressing such matters.

Beyond this, this requirement can perhaps be seen as another view of the ‘100% probability of software failure’ assumption. Fully addressing such requirements is likely (again) to require appropriate – external – risk-control measures.


Detailed design

There are new requirements (Clause 5.4.2, Clause 5.4.3) for detailed design documentation.

These are – in our view – sensible changes (previous requirements for design documentation were very limited).


Impact on ‘Class A’ designs

In the amended standard, various requirements that applied only to Class B and Class C designs now also apply to Class A designs.


TT architectures and IEC 62304

You’ll find an example of the use of TT architectures in an IEC 62304 (Class C) design here.

[9 February 2016]


Towards the second edition of ISO 26262


International standard ISO 26262 is the adaptation of IEC 61508 to comply with needs specific to the application sector of electrical and/or electronic (E/E) systems within road vehicles. This adaptation applies to all activities during the safety lifecycle of safety-related systems comprised of electrical, electronic and software components.

First introduced in 2011, ISO 26262 has had a major impact on the development of automotive systems. However – given the rapid rate of change in this sector – it is perhaps not surprising that this new standard already feels rather out of date (for example, it has little to say about the development of autonomous vehicles).

Perhaps an even more significant concern about ISO 26262: 2011-2012 is that it applies only to the development of passenger cars: trucks, buses and motorcycles (for example) are not considered.

It is expected that the second edition of ISO 26262 – which is due for publication in 2018 – may begin to address some of the shortcomings of the current edition of the standard.

As an indication of where the standard is heading, the first Publicly Available Specification (PAS) related to ISO 26262 has now been published. This is ISO/PAS 19695:2015, and it applies to motorcycles.

From the abstract:

ISO/PAS 19695:2015 is intended to be applied to safety-related systems that include one or more electrical and/or electronic (E/E) systems and that are installed in series production two-wheeled or three-wheeled motorcycles.

ISO/PAS 19695:2015 Standard does not address unique E/E systems in special purpose vehicles, such as vehicles designed for competition.

ISO/PAS 19695:2015 Standard addresses possible hazards caused by malfunctioning behaviour of E/E safety-related systems, including interaction of these systems. It does not address hazards related to electric shock, fire, smoke, heat, radiation, toxicity, flammability, reactivity, corrosion, release of energy, and similar hazards, unless directly caused by malfunctioning behaviour of E/E safety-related systems.

ISO/PAS 19695:2015 Standard does not address the nominal performance of E/E systems, even if dedicated functional performance standards exist for these systems.

ISO 26262:2011-2012 has 10 parts.

It is expected that ISO/PAS 19695:2015 will form the basis of a new Part 12 in the next edition of ISO 26262.

Further information about ISO/PAS 19695:2015 is available here.

You may be wondering about Part 11?

The new Part 11 in the next edition of ISO 26262 is expected to focus on semiconductor requirements. The related PAS document (ISO/PAS 19451) should be published later this year (rumour has it that the current draft of this document is around 160 pages long …).

[22 January 2016]


The three laws of safe embedded systems


This short article is part of an ongoing series in which I aim to explore some techniques that may be useful for developers and organisations that are beginning their first safety-related embedded project.

In the present article, I want to take a slightly different perspective on Stage 3 from my previous post:

Read more on Michael J. Pont’s EmbeddedRelated Blog.

[First published 12 November 2015. Updated 29 November 2015]


Developing software for a safety-related embedded system for the first time


I spend most of my working life with organisations that develop software for high-reliability, real-time embedded systems. Some of these systems are created in compliance with IEC 61508, ISO 26262, DO-178C or similar international standards.

When working with organisations that are developing software for their first safety-related design, I’m often asked to identify the key issues that distinguish this process from the techniques used to develop “ordinary” embedded software.

This is never an easy question to answer, not least because every organisation faces different challenges. However, in this article I’ve pulled together a list of steps that may provide some “food for thought” for organisations that are considering the development of their first safety-related embedded system.

Read more on Michael J. Pont’s EmbeddedRelated Blog.

[31 October 2015]


How to test a Tesla?


You can read this article on Michael J. Pont’s EmbeddedRelated Blog.

[23 October 2015]


“Smarter” cars, unintended acceleration – and unintended consequences


In this article, I consider some recent press reports relating to embedded software in the automotive sector.

In The Times newspaper (London, 2015-10-16) the imminent arrival of Tesla cars that “use autopilot technology to park themselves and change lane without intervention from the driver” was noted.

By most definitions, the Tesla design incorporates what is sometimes called “Artificial Intelligence” (AI). Others might label it a “Smart” (or at least “Smarter”) Vehicle.

Read more on Michael J. Pont’s EmbeddedRelated Blog.

[20 October 2015]


Safety, reliability and security in embedded systems

In the last few weeks, there have been a number of discussions about the vulnerability of vehicles to hacking (including a well-documented case with a Jeep). There has also been a case (reported in The Times newspaper, UK, on 11 August 2015) in which an electric skateboard was hacked.

In light of these reports, we thought it might be an appropriate time to mull over some of the competing design constraints involved in creating embedded systems that are safe, reliable and secure.

[Figure: the competing constraints of safety, reliability and security]

The competition between safety and reliability is easily explained (and generally well understood). Put simply, we could (for example) make our electric skateboard 100% safe by making it 100% unreliable: that is, by ensuring that it could never move. This would – clearly – not be a great design solution …

The starting point for many successful designs is therefore: [i] we consider what we need to do in order to meet the safety requirements; then [ii] we consider what we need to do to meet the reliability requirements – without making the system any less safe.

This is – of course – all very well in theory. In practice, a key part of the process for ensuring safety and reliability will involve discussions with as many “stakeholders” as possible and – in particular – talking about the design with people who understand the environment in which the product or system will be used. In the case of our skateboard, for example, it is probably going to be beneficial to talk to people who use such boards to commute to work.

This all sounds fine, until your CEO appoints your company’s first Head of Secure Design (HoSD).

When you tell the new HoSD that you are about to talk to as many people as possible about the design for the new electric skateboard (so that you can make the product safe and reliable), you may find that attempts are made to have you sacked, and escorted promptly from the building.

The problem is – of course – that secrecy is often seen as a key requirement in a secure design: unless care is taken, this constraint may be at odds with the need to ensure system safety and reliability.

What impact will increased demands for security have over the next few years? One impact is that influential safety standards (e.g. IEC 61508, ISO 26262) will need to be revised in order to address security concerns more fully.

Even after key standards have been updated, we think that there is a risk that attempting to “bolt on” security checks to existing designs may lead to a reduction in the safety and reliability of many future embedded systems.

[Originally published 14 August 2015; updated 15 August 2015]


What is “Functional Safety”?


In our experience, the phrase “Functional Safety” (FS) often causes confusion.

When talking about FS, we are concerned with active (rather than passive) safety mechanisms.

In a passenger car (for example) a simple seatbelt serves as a passive safety mechanism, while an airbag or Collision Avoidance System (CAS) serves as an active safety mechanism.

When implementing active safety mechanisms (with the aim of achieving functional safety), the majority of current designs will involve a microcontroller and appropriate software: such designs will usually be implemented in compliance with international safety standards and guidelines, such as IEC 61508, ISO 26262 and DO-178.

[22 June 2015]


Common-Cause Failure vs. Common-Mode Failure

Our recent posts about redundant systems have provoked some discussion about “Common Cause Failures” and “Common Mode Failures”.

In our experience, definitions (or lack of them) can be a significant cause of confusion in this area.

First, it needs to be clear that we are typically talking about two units (components, systems, etc), arranged in some form of “1 out of 2” (1oo2) arrangement: our goal is usually to ensure that if at least one unit operates correctly, the related system will be able to operate safely.

[Depending – again – on our definitions, the system may not simply “operate safely” if at least one unit out of two is operating correctly, it may remain “fully operational” or may “operate normally” in these circumstances. However, this is not always what is meant. Clear definitions are required for a given system, including definitions of “operate safely”, etc.]

Given the above, most people seem to agree on what is meant by a “Common Cause Failure” (CCF). We find that it is sometimes helpful to think of this in terms of “system inputs”. Failure of a power supply (input) that is linked to both of our units is a typical example of a CCF. Water ingress, EMI, radiation, etc, may all also be common root causes of failure in our two units.

In our experience, there is usually less agreement on what is meant by a “Common Mode Failure” (CMF). We find that it is sometimes helpful to think of this in terms of “system outputs”. More specifically, a CMF is concerned with the way in which our units fail (usually in situations where they both fail). For example, if – in response to a power supply “glitch” – both units shut down (“fail silently”), then this can be viewed as an example of a Common-Mode Failure. If – in the event of the same power glitch – one unit shuts down but the other unit “hangs” in a potentially dangerous state, then this is not what we see as a CMF situation (because the two units are demonstrating a different failure mode).

Note that it is usually assumed that a CCF will precede any CMF, but – again – this depends on your definitions.

Probably because a CCF may precede a CMF, people sometimes assume that the two phrases are synonymous: this is not generally helpful, in our view.

[Originally published 16 April 2015; updated 11 June 2015]


A TT design from the 17th Century?


We are always on the lookout for effective applications of TT architectures.

Here’s a very early example: an “automated carillon”. Dating from around the 17th Century, these devices were designed to ring bells in a pre-determined sequence. By changing the arrangement of “pins” in a drum, the melody (program) could be changed.

The photograph below (taken by our Executive Director in April 2015) shows the automated carillon mechanism from the belfry in Ghent (Belgium).

[Photograph: the automated carillon mechanism in the Ghent belfry]

[If you’ve landed on this page because you want to purchase an automated carillon, our local bell foundry may be able to help.]

[1 May 2015]


Working with “Time-Triggered Hybrid” (TTH) schedulers

We’ve received some questions about the configuration of “TTH” schedulers, and about TTRD11b in particular from the “ERES (1769)” book.

The main questions (paraphrased slightly) are as follows:

On page 269 [ERES(1769), first printing], you mentioned in the Table 12 that task A, B, C, D are jitter sensitive, so to reduce the jitter levels, they are put into the timer ISR as TTH tasks. But I am not sure if I understood it correctly. Now there are four tasks with the same periods (10 seconds), and the GCD of them should be 1 second, so the timer ISR will run every 1 second. If we make them synchronized, i.e. every task will be released at the same tick interval every time, then again we will face the jitter problem, the worst case will happen to task D, since it is the last to be executed. So, we need to add offset with them, so that at each tick interval there is only one task running. Am I right on this?

Another issue is, now the tick interval is 1 second, but the WCET is 2 seconds, which means one task might not be finished before next tick interval. So do we need to increase the tick interval to take this into account?

In TTRD11b, the tick interval used is 10 seconds (which corresponds to the GCD for this task set, as discussed in Chapter 4 of the book).

In this design, Task A, Task B, Task C and Task D are all released (in sequence) from the scheduler ISR. The total execution time of this task sequence will be 8 seconds, in a period of 10 seconds: these tasks (therefore) consume around 80% of the available CPU capacity (ignoring scheduler overheads).

We assume that all of these tasks will have a “balanced” execution time (as discussed in Chapter 6), in order to help meet the jitter requirements.

Task E is not jitter sensitive. It is released as a low-priority task, by the scheduler (TTC) “dispatcher”, every 20 seconds.

The WCET of Task E is 3 seconds and there is only a “slot” of approximately 2 seconds available: Task E will therefore be pre-empted by the high-priority tasks (Task A – Task D) every time it executes.

The end result is that we can meet the requirements for all tasks in the set using this simple architecture.
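
For readers who have not met a TTH structure before, the overall shape of such a design can be sketched as follows (a simplified illustration with hypothetical names, not the TTRD11b source): the jitter-sensitive tasks run to completion inside the timer ISR, while the low-priority task is released co-operatively from the main loop and is pre-empted whenever the next tick fires.

    #include <stdint.h>
    #include <stdbool.h>

    void TASK_A(void);  void TASK_B(void);  void TASK_C(void);  void TASK_D(void);
    void TASK_E(void);                     /* Low priority, not jitter-sensitive */

    static volatile uint32_t Tick_count   = 0;
    static volatile bool     Tick_pending = false;

    void Timer_ISR(void)      /* One tick every 10 seconds (the task-set GCD)    */
    {
        TASK_A();             /* Jitter-sensitive tasks released in sequence     */
        TASK_B();             /* from the ISR (total execution time ~8 seconds)  */
        TASK_C();
        TASK_D();

        Tick_count++;
        Tick_pending = true;
    }

    int main(void)
    {
        /* ... timer and task initialisation ... */

        while (1)
        {
            if (Tick_pending)
            {
                Tick_pending = false;

                if ((Tick_count % 2u) == 0u)   /* Task E: 20-second period       */
                {
                    TASK_E();  /* WCET 3 s in a ~2 s slot: pre-empted by the ISR */
                }
            }
            /* The processor may be placed in idle mode here until the next tick */
        }
    }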

[20 January 2015]


Use of “idle” mode in TT systems

We’ve received some questions about the use of idle mode in the schedulers that are presented in the ERES (1769) book. The questions concern the links between idle mode and jitter in the task release times. We’ve included some notes about this important topic in this post.

In most cases, the schedulers in the ERES (LPC1769) book enter “idle” mode at the end of the Dispatcher: this is usually achieved by means of the SCH_Go_To_Sleep() function. The system will then remain “asleep” until the next timer Tick is generated.

Clearly, the use of idle mode can help to reduce power consumption. However, in most cases a more important reason for putting the processor “to sleep” is to control the level of “jitter” in the Tick timing.

Control of jitter is an important consideration in the majority of control systems and data-acquisition / sensing systems. As we detail in Chapter 4 of ERES, jitter levels in the region of 1 microsecond can be significant even in designs that appear to have rather rudimentary signal-processing requirements: the constraints on high-performance systems are often far more severe.

Use of idle mode in a TT scheduler allows us to control jitter in the task (release) timing because we can ensure that both the hardware and the software are in the same (known) state every time an interrupt takes place.

No matter what you do, it is simply not possible to achieve the same level of temporal determinism in a conventional (event-triggered) design.

In a typical TTC design, the Dispatcher is called (as the only function) in an endless loop in main(). If the idle mode is not used, then the system will keep calling the Dispatcher once it has finished releasing the tasks that are scheduled to run in a given Tick. When the next Tick occurs, we cannot be sure where we will be in the Dispatcher code, and the time taken to respond will (therefore) inevitably vary.

If we put the processor into an idle mode at the end of the Dispatcher, we are placing both the software and hardware into a known state. If we are always in the same state when the Ticks occur, the time taken to respond will be essentially identical (or as close to identical as it is possible to get on a given hardware platform). The result is a very low level of jitter.
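
As a minimal sketch of this mechanism (our simplification, not a listing from the book, and the exact sleep intrinsic depends on the toolchain): on a Cortex-M device, SCH_Go_To_Sleep() can simply execute the "wait for interrupt" (WFI) instruction, and is called once the dispatcher has released the tasks that are due in the current tick.

    void SCH_Dispatch_Tasks(void);   /* Releases the tasks due in this tick       */

    void SCH_Go_To_Sleep(void)
    {
        __asm volatile ("wfi");      /* Idle until the next interrupt (the tick)  */
    }

    int main(void)
    {
        /* ... scheduler, task and timer initialisation ... */

        while (1)
        {
            SCH_Dispatch_Tasks();
            SCH_Go_To_Sleep();       /* Known HW/SW state when the tick arrives   */
        }
    }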

For example, as we demonstrate in Chapter 5 of ERES, an LPC1769 microcontroller running at 100 MHz and supporting a TTC scheduler already has a low level of Tick jitter (around 0.2 microseconds) even without use of idle mode in the microcontroller: if we use the idle mode, the level of Tick jitter becomes too low to measure.

The discussions above concern TTC schedulers, but control of Tick jitter in this way also applies in TT designs that support task pre-emption. For example, in a simple “Time-Triggered Hybrid” (TTH) scheduler running on the same LPC1769 platform (100 MHz), we obtain jitter levels of around 0.4 microseconds: if we incorporate idle mode in this design (as discussed in ERES, Chapter 12) we can – again – bring the jitter levels down to levels that are too low to measure.

Overall, appropriate use of idle modes is a very simple way of ensuring that we can – if and when required – have very precise control of the timing behaviour in a TT design. This is, in turn, an important reason for the popularity of this architecture in what are sometimes called “hard” real-time systems.

[16 January 2015]


Use of watchdog timers in TT systems

We’ve received some questions about the use of watchdog timers in the ERES (1769) book. The questions relate specifically to TTRD2a and TTRD3a.

The questions are as follows (paraphrased slightly):

I have one question regarding the watchdog timer in Chapter 2 and 3. I only see in the system initialization phase System_Init(), that it checks if the system is reset by the watchdog timer or not, if so the system goes into a fail silent mode. What about later? How does the system go into the fail silent mode during the process, due to some task overrun for example.

As far as I know, the watchdog feed function is also a task in the scheduler (the first one), and will be executed at the start of each scheduler tick. So, for any reason (task overrun), the watchdog feed task cannot be run in time, it will timeout, then what? The system is supposed to go into the fail silent mode, but how? I didn’t find any place other than System_Init() that deals with the fail silent mode.

It may be that we need to say more about the watchdog timer in a future edition of this book. For now, we’ll provide some further information here. You’ll also find some background information in the book “Patterns for Time-Triggered Embedded Systems” (in Chapter 12): you can download this book here. Some of the early parts of the answer below are adapted from material in “PTTES”.

[Please note that the code listing discussed in this post can be downloaded on the ‘ERES‘ page.]

Let’s start at the beginning …

What is a watchdog timer?
Suppose there is a hungry dog guarding a house, and someone wishes to break in. If the burglar’s accomplice repeatedly throws the guard dog small pieces of meat at (say) 2-minute intervals, then the dog will be so busy concentrating on the food that he will ignore his guard duties and will not bark. However, if the accomplice runs out of meat or forgets to feed the dog for some other reason, the animal will start barking, thereby alerting the neighbours, property occupants or police.

A similar approach is followed in computerised ‘watchdog timers’. Very simply, these are timers which, if not refreshed at regular intervals, will overflow. In most cases, overflow of the timer will reset the system. Such watchdogs are intended to deal with the fact that, even with meticulous planning and careful design, embedded systems can ‘hang’ in the field due to faults, such as the impact of programming faults or electromagnetic interference (EMI). The use of a watchdog can be used to recover from this situation, in certain circumstances.

General use of WDTs
In practice, most modern microcontrollers (including the LPC1769 that is used in the examples we are considering here) incorporate a watchdog timer (WDT) unit. This WDT unit can be initialised with an “overflow” period (e.g. 1.1 ms). As long as we “feed” or “refresh” the WDT at intervals less than the overflow period, then the WDT will do nothing. However, if we don’t refresh the WDT within the overflow period, then the processor will be reset.

Note that “feeding” the WDT is a very straightforward process: please refer to the WATCHDOG_Update() function in Listing 11 for code details for the LPC1769 microcontroller.

The key thing to appreciate is that – when the microcontroller starts up – we can distinguish between a “normal” reset and a “watchdog” reset. If the microcontroller has performed a normal reset (typically caused by the user switching on the device), then we would – of course – expect the system to operate normally. If, however, the reset / startup was caused by a watchdog overflow, then we would expect to enter a different mode: this will typically be some form of “limp home” or “fail silent” mode: we give various examples of systems with multiple operating modes later in the ERES (1769) book, starting in Chapter 7.
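
For reference, here is a sketch of the two operations discussed above, using the LPC17xx CMSIS register names (please treat this as illustrative and check it against the LPC1769 user manual and against Listing 11 in the book):

    #include <stdint.h>
    #include "LPC17xx.h"    /* NXP CMSIS device header for the LPC1769           */

    void WATCHDOG_Update(void)
    {
        /* Standard LPC17xx feed sequence: write 0xAA then 0x55 to WDFEED        */
        LPC_WDT->WDFEED = 0xAA;
        LPC_WDT->WDFEED = 0x55;
    }

    uint32_t WATCHDOG_Caused_Last_Reset(void)
    {
        /* WDTOF (bit 2 of WDMOD) remains set after a watchdog-induced reset,    */
        /* so System_Init() can use it to select a "fail silent" mode            */
        return (LPC_WDT->WDMOD & (1u << 2));
    }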

Use of WDTs to detect task overruns and system overloads in TT designs
The above notes describe the general use of WDTs in an embedded system. We’ll now consider the use of such timers in a time-triggered (TT) design. For simplicity, we will assume that a tick interval of 1ms is being used in the system. We’ll also assume that a “TTC” scheduler is being employed (as is the case in the examples in Chapter 2 and Chapter 3 of the ERES book). Finally, in this example, we’ll assume that we are currently in “normal” mode.

First, we set up our WDT, with an overflow / timeout value slightly greater than the tick interval: the tick interval is assumed to be 1ms, and we’ll use 1.1 ms as the overflow value. The code for achieving this on the LPC1769 is shown in Listing 11 (starting on page 57 of the ERES book).

Once we’ve set up the WDT, we need to set up the scheduler (as discussed in Chapter 2). We then need to add the WDT “update” task to the schedule: this is the task that feeds the timer, and it must be called once per millisecond, at the start of each scheduler tick, as shown below:

Figure 18 from "The Engineering of Reliable Embedded Systems: LPC1769 edition" (2014)

Note that the WDT update task is also given in Listing 11 (details above).

Other tasks are then added to the schedule, as required: see, for example, Chapter 3 (TTRD3a), where a number of additional tasks are added in order to complete the controller for the washing machine.
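
Putting these steps together, the initialisation sequence has the following overall shape. The SCH_Add_Task() parameters shown here (task, offset in ticks, period in ticks) are simplified for illustration; please refer to the book listings for the exact interface:

    #include <stdint.h>

    /* Simplified interfaces (illustrative only)                                 */
    void WATCHDOG_Init(uint32_t timeout_us);
    void WATCHDOG_Update(void);
    void SCH_Init(void);
    void SCH_Add_Task(void (*task_p)(void), uint32_t offset, uint32_t period);
    void TASK_Washer_Control(void);      /* Example application task             */

    void System_Setup(void)
    {
        WATCHDOG_Init(1100u);            /* 1.1 ms overflow, 1 ms tick interval  */
        SCH_Init();                      /* Scheduler set up with a 1 ms tick    */

        /* WDT update task added first, so it runs at the start of every tick    */
        SCH_Add_Task(WATCHDOG_Update, 0u, 1u);

        /* Application tasks added as required (see TTRD3a for a complete set)   */
        SCH_Add_Task(TASK_Washer_Control, 0u, 10u);
    }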

Used in this way, the WDT allows us to deal with general faults (such as those related to EMI), as outlined above. In addition, use of a WDT in this way provides us with a simple way of detecting “task overruns”: that is, tasks that take longer to complete than their expected “worst-case execution time” (WCET).

We can detect task overruns because (at least in the majority of TTC designs) we know that all tasks should complete in the tick interval in which they are released, as illustrated below:

Figure 21 from "The Engineering of Reliable Embedded Systems: LPC1769 edition" (2014)

As an example, suppose that we have only one task (“Task B”) running in a tick interval of 1 ms. Suppose that this task has an assumed WCET of 0.9 ms. Further suppose that – as a consequence of a coding error – Task B gets caught in an endless “while(1)” loop. In these circumstances, Task B will keep running after the next tick interval. Because it does this, it will prevent the watchdog update task from running at the start of the next tick, as scheduled: because the watchdog is not “fed” on time, the WDT will trigger a processor reset (as illustrated below):

Figure 19 from "The Engineering of Reliable Embedded Systems: LPC1769 edition" (2014)

Following the WDT-induced reset, the microcontroller will start up again. When it does restart, it will determine that the WDT caused the reset. The system will therefore (in TTRD2a) enter a “Fail Silent” mode: this is documented in Listing 7 of the ERES book.

Better ways of detecting task overruns (and underruns) in TT designs
It should be emphasised that the above notes summarise the use of WDTs as a simple means of detecting task overruns in TT designs.

More generally (and more flexibly) we would aim to use a MoniTTor unit to check that each task operates within the “best case execution time” (BCET) and “worst case execution time” (WCET) limits that were determined when the system was developed. Design and use of MoniTTor units is described in detail in Chapter 9 of the ERES book.

[12 January 2015]


Some notes on N-version programming and “backup tasks”


In these notes, we explore some of the links between N-Version Programming (NVP) and the creation of backup tasks for use in a safety-related system.


The problem


We’ll assume that “Task X” in a particular system has failed at run time (that is, while the system is operating).

We’ll also assume – for the purposes of discussion – that we’ve detected the problem with Task X because it has overrun: that is, its execution time on a particular run has exceeded its predicted “Worst Case Execution Time” (WCET).

Task overruns can occur for a number of reasons. For example, if a task is designed to read from an analogue-to-digital converter (ADC) and the ADC is damaged, then the task may “hang” while waiting for a conversion to complete. This is – of course – the kind of behaviour that can be handled locally (for example, the task should check that the ADC has been configured correctly before using it, and should incorporate a suitable timeout mechanism whenever it takes a reading). We’ll assume that Task X has been designed to deal with such problems and has been carefully reviewed: our task overrun problem therefore indicates that a significant (unknown) error has occurred.

In some designs, the overrun of Task X (in these circumstances) might be reason enough to shut down the system as cleanly as possible. However, not all systems can be shut down immediately in this way: for example, if our task is part of a safety-related automotive system and the vehicle is currently travelling at speed down a motorway, then simply “shutting down” is an option that we would generally wish to avoid.

For similar reasons, we will assume that it is not possible to simply terminate Task X (and have the system carry on without this task).

In these circumstances, we may wish to “switch in” a backup task (“Backup Task X”). This backup task will – we assume – replace Task X in the schedule: indeed, our aim is that, the next time Task X is due to run, Backup Task X will run in its place. Depending on the system requirements, we may then wish to continue running with Backup Task X indefinitely, or we may wish to signal an error (and – for example – encourage the driver to stop the vehicle as soon as it is safe to do so).
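
As a sketch of the “switching in” mechanism itself (all names here are hypothetical; a MoniTTor overrun handler is one natural place to trigger it): the scheduler's task record is simply updated so that the backup function runs at the next scheduled release of Task X.

    #include <stdint.h>

    #define SCH_MAX_TASKS (10u)

    typedef struct {
        void     (*task_p)(void);   /* Function to be released                   */
        uint32_t period;            /* Ticks between releases                    */
        uint32_t delay;             /* Ticks until the next release              */
    } Sch_task_t;

    static Sch_task_t SCH_tasks[SCH_MAX_TASKS];

    void Task_X_Backup(void);       /* Backup ("limp home") version of Task X    */

    /* Called, for example, when an overrun of Task X is detected                */
    void SCH_Replace_Task(uint32_t task_index, void (*backup_p)(void))
    {
        if (task_index < SCH_MAX_TASKS)
        {
            SCH_tasks[task_index].task_p = backup_p;  /* Release timing unchanged */
        }
    }

    /* Example use: SCH_Replace_Task(TASK_X_INDEX, Task_X_Backup);               */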


Applying an NVP approach


We now turn our attention to the design and implementation of Backup Task X.

Clearly, we’d like to try and ensure that – whatever the underlying faults that gave rise to the failure of Task X – Backup Task X will not suffer from the same problems. The challenge is that we presumably created Task X to be reliable, using appropriate design and implementation approaches. In particular, we will – presumably – have designed Task X in such a way that we didn’t expect it to suffer from any faults. So, how can we try to ensure that Backup Task X does not suffer from the same problems as Task X when we don’t know what the problem is that we are trying to avoid?

One way in which we might try to achieve this is through “N-Version Programming”. This will involve creating a single specification for the system task, and passing this to two independent teams (for example, we might have one team in Europe and another team in Singapore). Each team is expected to produce a version of the task to the best of their ability.

The underlying assumption is not that either task will be perfect. Instead, what we are trying to do is find a way of creating two tasks that have different faults (that is, statistically uncorrelated faults).

In the mid 1980s, John Knight and Nancy Leveson set out to explore this issue in a much-cited study (Knight and Leveson, 1985; Knight and Leveson, 1986):

“In order to test for statistical independence, we designed and executed an experiment in which 27 versions of a program were prepared independently from the same requirements specification by graduate and senior undergraduate students at two universities. The students tested the programs themselves, but each program was subjected to an acceptance procedure for the experiment consisting of 200 typical inputs. Operational usage of the programs was simulated by executing them on one million inputs that were generated according to a realistic operational profile for the application. Using a statistical hypothesis test, we concluded that the assumption of independence of failures did not hold for our programs and, therefore, that reliability improvement predictions using models based on this assumption may be unrealistically optimistic.”[Knight and Leveson, 1991]

The Knight and Leveson study has proved influential and it now appears to be generally accepted that N-Version programming is not an effective solution to the problem of creating two “equivalent” tasks that can be expected to fail in different ways.


Alternative solutions (1)


So – if we return to our original problem – how should we produce our backup task?

The answer in the general case is that – as with all of the other tasks in the system – we should start with an appropriate specification and produce a new task to the best of our ability. However, rather than do so in a “Black Box” way – allowing a separate team to create this task – we should work in a “White Box” way, and focus on trying to minimise the opportunities for common-cause failures. This may mean – for example – trying to ensure that we employ different hardware resources and different algorithms in the two tasks. We also know that coding errors are much less likely in short pieces of code and we may – therefore – aim to keep the backup task very simple (in the expectation that this will be likely to improve reliability even if it results in reduced quality or performance). This is sometimes known as a “limp home” task.

Working in this general way may be our only option, if we don’t have any knowledge about the faults that have caused our original task to fail. However, if we can obtain information from the original task about the problem that caused it to fail, this may allow us to execute a more appropriate backup option. For example if – as suggested above – our task fails because of an ADC problem (and we are aware of this), we may be able to run a replacement task that uses a different ADC.


Further information


You’ll find further information about some of the issues raised in this note in “The Engineering of Reliable Embedded Systems” (2014) by Michael J. Pont.


Related references


Knight, J.C. and Leveson, N.G. (1985) “A Large Scale Experiment In N-Version Programming”, Digest of Papers FTCS-15: Fifteenth International Symposium on Fault-Tolerant Computing, June 1985, Ann Arbor, MI, pp. 135-139.

Knight, J.C. and Leveson, N.G. (1986) “An Experimental Evaluation of the Assumption of Independence in Multi-version Programming”, IEEE Transactions on Software Engineering, Vol. SE-12, No. 1 (January 1986), pp. 96-109.

Knight, J.C. and Leveson, N.G. (1990) “A reply to the criticisms of the Knight & Leveson experiment”, ACM SIGSOFT Software Engineering Notes, 15 (1), pp. 24-35.

[15 August 2014]
