Design guide for developers of TT embedded systems
At SafeTTy Systems Ltd we help our customers to meet their system and safety requirements by means of what is sometimes called a ‘semi-formal’ engineering process that combines ‘Time-Triggered‘ (TT) software architectures with state-of-the-art run-time monitoring.
On this page, we present an overview of the process that we use in the form of a ‘design guide’.
The material introduced here is presented in more detail in the ‘ERES2‘ book.
[This page was last updated: 2016-12-11]
Who should read this guide?
This guide is primarily intended for developers and project managers who are planning the development of an embedded system that: [i] needs to be reliable; [ii] may need to be secure; [iii] may be safety related; and [iv] will employ a time-triggered (TT) system architecture.
We assume that people reading this guide are already familiar with the development of ‘general purpose’ embedded systems. We provide some introductory training materials that may help to fill in some gaps, if necessary.
We also assume that readers of this guide have some familiarity with international standards that are relevant to their business. Further information about such standards can be found on our Technology page.
Finally, we assume that readers of this guide understand what is meant by a TT embedded system. If required, background information about TT designs can also be found on our Technology page.
1. Define key terms
Before you begin to think about the system requirements, you need to make sure that everyone involved in the project is talking the same language. By this we mean that key terms need to be defined.
As a starting point, these are some of the standard definitions that we use:
- An Uncontrolled System Failure means that the system has not detected a System Fault correctly or – having detected such a fault – has not executed a Controlled System Failure correctly, with the consequence that significant System Damage may be caused. The system may be in any mode other than a Fail-Silent Mode when an Uncontrolled System Failure occurs.
- A Controlled System Failure means that – having correctly detected a System Fault – a reset is performed, after which the system enters a Normal Mode, or a Limp-Home Mode, or a Fail-Silent Mode. A Controlled System Failure may proceed in stages. For example, after a System Fault is detected in a Normal Mode, the system may (after a system reset) re-enter the same Normal Mode; if another System Fault is detected within a pre-determined interval (e.g. 1 hour), the system may then enter a Limp-Home Mode. Depending on the nature of the fault, the sequence may vary: for example, the system may move immediately from a Normal Mode to a Fail-Silent Mode if a significant fault is detected. The system may be in any mode other than a Fail-Silent Mode when a Controlled System Failure occurs.
- A Normal Mode means a pre-determined dynamic mode in which the system is fully operational and is meeting all of the expected system requirements, without causing System Damage. The system may support multiple Normal Modes.
- A Limp-Home Mode means a pre-determined dynamic mode in which – while the system is not meeting all of the expected system requirements – a core subset of the system requirements is being met, and little or no System Damage is being caused. The system may support multiple Limp-Home Modes. In many cases, the system will enter a Limp-Home Mode on a temporary basis (for example, while attempts are made to bring a damaged road vehicle to rest in a location at the side of a motorway), before it enters a Fail-Silent Mode.
- A Fail-Silent Mode means a pre-determined static mode in which the system has been shut down in such a way that it will cause little or no System Damage. The system will usually support only a single Fail-Silent Mode. In many cases, it is expected that intervention by a qualified individual (e.g. a Service Technician) may be required to re-start the system once it has entered a Fail-Silent Mode.
- System Damage results from action by the system that is not in accordance with the system requirements. System Damage may involve loss of life or injury to users of the system, or to people in the vicinity of the system, or loss of life or injury to other animals. System Damage may involve direct or indirect financial losses. System Damage may involve a wider environmental impact (such as an oil spill). System Damage may involve more general damage (for example, through incorrect activation of a building sprinkler system).
- A System Fault means a Hardware Fault and / or a Software Fault.
- A Software Fault means a manifestation of a Software Error or a Deliberate Software Change.
- A Hardware Fault means a manifestation of a Hardware Error, or a Deliberate Hardware Change, or the result of physical damage. Physical damage may arise – for example – from a broken connection, or from the impact of electromagnetic interference (EMI), radiation, vibration or humidity.
- A Deliberate Software Change means an intentional change to the implementation of any part of the System Software that occurs as a result of a “computer virus” or any other form of malicious interference.
- A Software Error means a mistake in the requirements, design, or implementation (that is, programming) of any part of the System Software.
- A Deliberate Hardware Change means an intentional change to the implementation of any part of the System Hardware that occurs as a result of any form of malicious interference.
- A Hardware Error means a mistake in the requirements, design, or implementation of any part of the System Hardware.
- System Software means all of the software in the system, including tasks, scheduler, any support libraries and “startup” code.
- System Hardware means all of the computing and related hardware in the system, including any processing devices (such as microcontrollers, microprocessors, FPGAs, DSPs and similar items), plus associated peripherals (e.g. memory components) and any devices under control of the computing devices (e.g. actuators), or providing information used by these devices (e.g. sensors, communication links).
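The staged behaviour described in the definition of a Controlled System Failure can be sketched in C as a function that selects the mode to enter after a reset. This is a minimal illustration only: the type and function names are hypothetical, the 1-hour window is the example interval from the definition, and it is assumed that the fault record is held in memory that survives a reset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mode identifiers, following the definitions above. */
typedef enum { MODE_NORMAL, MODE_LIMP_HOME, MODE_FAIL_SILENT } system_mode_t;

/* Fault record (assumption: stored in memory that survives a reset). */
typedef struct {
    bool     fault_seen;         /* Has a fault been recorded previously?   */
    uint32_t last_fault_time_s;  /* Time of the previous fault, in seconds  */
} fault_record_t;

/* Example interval from the text (1 hour); a real value is design-specific. */
#define REPEAT_FAULT_WINDOW_S 3600u

/* Choose the mode to enter after the reset that follows a detected fault.
   Assumes now_s is never earlier than the recorded fault time. */
system_mode_t mode_after_fault(fault_record_t *rec, uint32_t now_s, bool significant)
{
    system_mode_t next = MODE_NORMAL;   /* First fault: re-enter Normal Mode */

    if (significant) {
        next = MODE_FAIL_SILENT;        /* Significant fault: shut down at once */
    } else if (rec->fault_seen &&
               (now_s - rec->last_fault_time_s) <= REPEAT_FAULT_WINDOW_S) {
        next = MODE_LIMP_HOME;          /* Repeat fault inside the window */
    }

    rec->fault_seen = true;
    rec->last_fault_time_s = now_s;
    return next;
}
```

In a real design the transition rules would come from the system requirements and hazard analysis; the point of the sketch is only that the sequence is pre-determined and easy to review.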
2. Ensure that the system requirements have been fully documented
In this guide, we are concerned with the creation of reliable, real-time embedded systems that must be: [i] fully tested and verified during development; and [ii] monitored for faults when in use.
It must be emphasised that it is impossible to conduct a test and verification (T&V) process for any system unless we have a complete requirements specification to work from (since the requirements specification is the only “benchmark” against which the “correctness” – or otherwise – of the system may be assessed).
At a minimum:
- The system requires a numbered list of requirements (start with a table in a Word document if you have no other support available).
- Software components (and / or hardware components) then need to be designed and implemented to meet these requirements: the link between Requirement X and the related software / hardware module Y needs to be clearly documented.
- The T&V process must provide a “paper trail”, explaining how it was confirmed that the completed system does – indeed – meet all of the requirements.
3. Ensure that your team understands the operating environment
The key to developing safe systems is to consider [i] what might possibly go wrong; and [ii] how your system will react – safely – in these circumstances.
During this process, you may need to consider hardware failures, software bugs, operator errors, the impact of vibration, radiation, moisture, and so on.
In a non-trivial system, this process can be very challenging.
In “certified” systems, particular documentation may be required in order to achieve compliance with specific standards and guidelines (but the underlying need to consider the impact of potential faults and hazards is universal).
4. Consider international safety standards
Once your team has begun to understand the operating environment, it is important to consider whether any international safety standards may apply to your project.
Some examples of relevant international standards are here (but there are many others that you may need to consider).
Note that it is important to consider such standards early in the project (because – for example – compliance with a given standard may require use of a particular system architecture, and late changes are always expensive).
5. Select an appropriate hardware platform
It is assumed that your design will be based on a suitable microcontroller or microprocessor.
In this context, a “suitable” processor is one that has been designed to operate in the environment in which your system will be deployed. At a minimum, this means that the processor must have been designed to operate at the required temperature range.
In many safety-related embedded systems, use of a “lockstep” processor will now often be assumed, as such a design may offer ways of ameliorating the impact of EMI (for example).
In addition – as also discussed in more detail below – determination of “worst-case execution time” (WCET), and – often – “best-case execution time” (BCET) for the various software tasks in your system is a key activity in most development projects. If the hardware design allows static timing analysis to be carried out, this can save a great deal of time, and avoid the risk of timing errors.
Note that – whatever type of processor is employed – some form of independent monitoring system is usually also required if the system is safety-critical: we say more about this below.
6. Use appropriate software tools
Suppose that your team creates high-quality source code for your system (in ‘C’ or Ada, for example). You perform detailed code inspections, walkthroughs, etc. You are happy that the work has been carried out to a high standard and that the chances of errors are very low.
Suppose that you then use a faulty compiler to generate the executable code for your system. This may – unless care is taken – undo much of the good work that your team has carried out when creating the source code.
Does this mean that you need to use “certified” tools when creating your system? Does this mean – for example – that you cannot use “open source” compilers?
The answer to both questions may be “no”, provided that – as part of your development process – you carry out appropriate tests on the executable code, as well as on the source code.
7. Use an appropriately-qualified team
Development of high-integrity embedded systems involves use of appropriate development processes, software tools and hardware platforms — but it also requires that the people involved have relevant experience and qualifications.
In many ways, this may seem like common sense, but modern standards now make this requirement explicit, requiring that organisations can provide evidence of compliance.
For example, organisations developing household appliances (such as washing machines) in compliance with IEC 60335 are told (in the first line of the introduction):
“It has been assumed in the drafting of this International Standard that the execution of its provisions is entrusted to appropriately qualified and experienced persons.” [IEC 60335-1: 2010.]
Clearly, even what might be seen as apparently “simple” household appliances have safety implications. For example, manufacturers need to ensure that the door of a washing machine cannot be opened by a child during a “spin” cycle, and must do all they can to avoid the risk of fires in “always on” applications, such as fridges and freezers.
Demonstrating that your team has the qualifications and experience required to develop modern safety-related embedded systems is an important consideration (in any sector).
Where required, our SafeTTy Certified™ programme can assist with this process.
8. Employ a TT software architecture
As noted at the start of this guide, it is assumed that you will base your system on a time-triggered (TT) architecture.
In our experience, use of a TT architecture in your system (rather than an equivalent event-triggered approach) may deliver some or all of the following benefits for your organisation:
- Improved product reliability
- Ease of certification
- Reduced testing time
- Reduced maintenance / warranty / product recall costs
- Reduced unit costs
- Simpler and more deterministic run-time monitoring
We consider each of these points in turn on our Technology page: we won’t repeat the arguments here.
9. Make an initial choice of TT platform
Even when you have decided on use of a TT architecture, there are a number of design options available.
We have documented several of the options that we find most useful in the form of “TT platforms”.
You’ll find information about the design platforms that we employ in the ‘ERES2‘ book.
10. Consider how the system will be shut down
During the design process, we (clearly) need to consider the processes of powering the system up and shutting the system down. It may seem a little illogical, but we usually need to think first about the ways in which we expect the system to shut down (because this is likely to have an impact on the state of the system software and hardware when it starts up).
In the first instance, we need to think about a “normal” shut down, such as when the user presses the “Off” switch on a machine or removes an ignition key from a vehicle.
We also need to consider reasonably foreseeable misuse. For example, we need to consider what the implications might be for a piece of industrial equipment if the user avoids a lengthy shut-down procedure at the end of a shift by simply switching off the power.
In addition, the system is likely to have at least one “fault” mode (we say more about the modes shortly). This fault mode needs to be handled appropriately and – probably – reported to the user.
In the presence of some faults, it may be felt to be necessary to move the system into Limp-Home Mode, and maintain it there until the device is switched off. At this point, it may be felt to be appropriate to prevent the system from being started / activated again, until it has been reviewed by a suitably-qualified individual (either in person, or – in some circumstances – via a remote diagnostics system).
In some designs, the above scenario can be challenging. For example, a vehicle left stranded on a “level crossing” or in the fast lane of a motorway is a scenario that needs to be addressed.
One effective solution to such problems is to provide a Limp-Home Mode in which it is still possible to move the vehicle, but only up to a speed of (say) 10 miles per hour (~15 kilometres per hour). This will allow the user to move the vehicle to a safe location, but will not allow “normal” operation.
11. Consider how the system will “power up”
The acronym “POST” is often used to describe a suite of low-level tests that need to be performed before the system starts to operate. The acronym is usually expanded as “Power-On Self Tests”, or – sometimes – as “Pre Operation Self Tests”.
The key question that we are trying to answer through the use of such tests is whether we have an operational computer system to work with. If the answer is “no”, then all we can do is try to “fail as safely as possible”.
In most cases (as both of the acronym expansions make clear) we generally need to use the computer system itself to try and answer this question.
In order to determine whether the computer is operating correctly, we will typically aim to investigate some or all of the following elements during POSTs:
- The CPU registers;
- The Program Counter (PC);
- The RAM;
- The ROM / Flash memory;
- The clock frequency;
- The interrupt operation.
Note that we also often need to check the system configuration when we power up. For example, we need to check that the system is running the correct software (and that the software has not been changed, accidentally or maliciously).
To achieve this, a “signature” is usually calculated from the (Flash) memory contents, and stored at the time of manufacture or after an authorised system update. Before the system begins operation (and possibly during operation), we can compare the correct (sometimes called “golden”) signature with the results of a calculation made on the current memory contents. Any differences indicate that something is amiss.
Note that some processors provide hardware support to speed up such configuration checks.
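The signature comparison described above can be sketched in C as follows. This is an illustration under stated assumptions: the text does not specify which signature algorithm is used, so a software CRC-32 (reflected polynomial 0xEDB88320) stands in here, and the function names are hypothetical. A CRC is suited to detecting accidental corruption; detecting malicious changes would normally call for a cryptographic hash or authenticated signature instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320).
   Assumption: used here only as a stand-in for the "signature". */
uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; apply the polynomial when the low bit was set */
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Compare the signature of the current memory image with the stored
   ("golden") value. Returns false if anything has changed. */
bool signature_matches(const uint8_t *image, size_t len, uint32_t golden)
{
    return crc32_calc(image, len) == golden;
}
```

At power-up, `image` would point at the start of the program memory region and `golden` would be read from wherever it was stored at manufacture; a `false` result would lead to a Controlled System Failure rather than normal operation.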
12. Consider what periodic (self) tests will be required
Assuming that we have passed the “POSTs”, we can begin the system operation. However, in many designs we need to perform further checks on the system integrity during the system operation. These tests are sometimes called “Built-In Self Tests” (BISTs).
Note that some tests can be lengthy (e.g. some memory tests), and it is particularly important that time is allowed for such tests in the task schedule (again, late changes are usually expensive).
13. Identify the required system operating modes
We need to identify the required system operating modes.
As always, clear definitions are essential. We consider that there is a change in the system mode if the task set is changed. Changing the task set means that we either: [i] change the tasks in the set (for example, replace Task A with Task B, or add Task C); or [ii] keep the same tasks, but change the parameters of one or more of the tasks (for example, we change the period of Task A from 10ms to 12ms).
In this context (and using the definitions provided earlier on this page), most designs will have at least one Normal Mode, and one Fail-Silent Mode. Many designs will have several Normal Modes. Most designs will also require a Limp-Home Mode.
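The definition of a mode change given above can be made concrete with a small C sketch. The structures, task names and helper function are hypothetical, and a 1 ms tick is assumed so that a 10 ms period is written as 10 ticks. Two task sets are treated as different modes if they contain different tasks or the same tasks with different parameters.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*task_fn_t)(void);

/* One periodic task (hypothetical layout). */
typedef struct {
    task_fn_t run;           /* Task function                         */
    uint32_t  period_ticks;  /* Period, in ticks (1 ms tick assumed)  */
    uint32_t  offset_ticks;  /* Offset of first release, in ticks     */
} task_t;

/* A task set: in this guide, one task set corresponds to one mode. */
typedef struct {
    const task_t *tasks;
    size_t        count;
} task_set_t;

/* Placeholder tasks for illustration. */
static volatile uint32_t task_a_runs;
static volatile uint32_t task_b_runs;
static void Task_A(void) { task_a_runs++; }
static void Task_B(void) { task_b_runs += 2u; }

static const task_t mode1_tasks[] = { { Task_A, 10u, 0u } };
static const task_t mode2_tasks[] = { { Task_A, 12u, 0u } }; /* Period changed */
static const task_t mode3_tasks[] = { { Task_B, 10u, 0u } }; /* Task replaced  */

/* True if moving from set a to set b counts as a mode change.
   Assumes tasks are listed in the same order in both sets. */
bool task_sets_differ(const task_set_t *a, const task_set_t *b)
{
    if (a->count != b->count) {
        return true;
    }
    for (size_t i = 0; i < a->count; i++) {
        if (a->tasks[i].run != b->tasks[i].run ||
            a->tasks[i].period_ticks != b->tasks[i].period_ticks ||
            a->tasks[i].offset_ticks != b->tasks[i].offset_ticks) {
            return true;
        }
    }
    return false;
}
```

Here `mode1_tasks` and `mode2_tasks` differ only in the period of Task A (10 ms against 12 ms), which is enough to make them separate modes under the definition above.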
14. Identify the required system states in each mode
In many designs, the system will operate in multiple states in each mode.
As defined in the section above, the transitions between states (in a given mode) will not involve changes to the task set.
As an example, some of the states in a design for a controller for a domestic washing machine might be as shown in the table below.
15. Identify the required task sets
We need to identify the required task sets in each mode.
One way in which we can begin this process is by considering a “Context Diagram” for this system: this shows the components and / or other systems with which the system under development must interact. In an initial design, each of these interactions will require a task.
As an example, consider the washing-machine controller mentioned earlier. A possible Context Diagram for this system is shown below.
The core tasks in this design are as follows: Detergent_Hatch_Update(); Door_Lock_Update(); Door_Sensor_Update(); Drum_Motor_Update(); Drum_Sensor_Update(); Selector_Dial_Update(); Start_Switch_Update(); Water_Heater_Update(); Water_Level_Update(); Water_Pump_Update(); Water_Temperature_Update(); Water_Valve_Update().
This matches the diagram, with the exception of an interface to the LED indicators (this functionality was incorporated in the “System_State_Update()” task in the finished design).
16. Model the task set
This guide is concerned with the development of real-time embedded systems.
In real-time systems, it is not enough to ensure that the processing is ‘as fast as we can make it’: the key requirement is for deterministic processing. What this means is that we need to be able to guarantee that a particular activity will always be completed within (say) 2 ms (+/- 5 µs), or at 6 ms intervals (+/- 1 µs): if the processing does not match this specification, then the system is not merely slower than we would like: it is simply not fit for purpose.
The key to deterministic processing is being able to model the system behaviour. In TT systems, we generate such models using “Tick Lists”. This is a list of all of the system ticks in the hyperperiod, with the details of the tasks that execute in each tick.
Using the Tick List, we can answer some key questions at an early stage in the project, such as:
- Will the selected microcontroller (MCU) be powerful enough to run the task set?
- Assuming that the MCU can handle the tasks, what will the maximum CPU loading be? 50%? 70%? 90%?
- Will the system meet all response-time requirements? For example, precisely how long will it take for the system to shut down after the emergency stop button is pressed?
- How much jitter will there be in the execution times of key tasks? Will the timing characteristics of data-acquisition or control tasks be met? Will the system be stable under all circumstances?
Being able to answer such questions early in the system lifecycle is invaluable, and means that – for example – decisions about hardware platforms can be reviewed at an early stage, should this be necessary.
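As a rough illustration of how a Tick List might be generated and used, the C sketch below computes the hyperperiod of a task set, lists the tasks released in each tick, and finds the worst-case total execution time in any single tick. The structure and function names are hypothetical, each task's offset is assumed to be smaller than its period, and the WCET figures are whatever estimates are available at the time. This is not the Tick List format used in ERES2.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Timing model of one periodic task (hypothetical layout). */
typedef struct {
    const char *name;
    uint32_t    period_ticks;  /* Period, in ticks (must be non-zero)   */
    uint32_t    offset_ticks;  /* Offset, in ticks (assumed < period)   */
    uint32_t    wcet_us;       /* Estimated WCET, in microseconds       */
} tt_task_model_t;

static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
    while (b != 0u) { uint32_t t = a % b; a = b; b = t; }
    return a;
}

/* Hyperperiod: least common multiple of all task periods. */
uint32_t hyperperiod_ticks(const tt_task_model_t *tasks, size_t n)
{
    uint32_t h = 1u;
    for (size_t i = 0; i < n; i++) {
        h = (h / gcd_u32(h, tasks[i].period_ticks)) * tasks[i].period_ticks;
    }
    return h;
}

/* Non-zero if the task is released in the given tick. */
static int runs_in_tick(const tt_task_model_t *t, uint32_t tick)
{
    return (tick % t->period_ticks) == (t->offset_ticks % t->period_ticks);
}

/* Sum of the WCETs of all tasks released in one tick. */
uint32_t tick_load_us(const tt_task_model_t *tasks, size_t n, uint32_t tick)
{
    uint32_t load = 0u;
    for (size_t i = 0; i < n; i++) {
        if (runs_in_tick(&tasks[i], tick)) { load += tasks[i].wcet_us; }
    }
    return load;
}

/* Largest per-tick load found anywhere in the hyperperiod. */
uint32_t max_tick_load_us(const tt_task_model_t *tasks, size_t n)
{
    uint32_t h = hyperperiod_ticks(tasks, n);
    uint32_t worst = 0u;
    for (uint32_t tick = 0u; tick < h; tick++) {
        uint32_t load = tick_load_us(tasks, n, tick);
        if (load > worst) { worst = load; }
    }
    return worst;
}

/* Print one line per tick: the tasks released and the total load. */
void print_tick_list(const tt_task_model_t *tasks, size_t n)
{
    uint32_t h = hyperperiod_ticks(tasks, n);
    for (uint32_t tick = 0u; tick < h; tick++) {
        printf("Tick %lu:", (unsigned long)tick);
        for (size_t i = 0; i < n; i++) {
            if (runs_in_tick(&tasks[i], tick)) { printf(" %s", tasks[i].name); }
        }
        printf(" (load %lu us)\n", (unsigned long)tick_load_us(tasks, n, tick));
    }
}
```

If the largest per-tick load exceeds the tick interval, the task set cannot run as modelled on the chosen processor, and either the offsets, the task design or the hardware choice needs to be revisited.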
Further information about the development and use of Tick Lists is included in “The Engineering of Reliable Embedded Systems“.
17. Be prepared to invest effort in “WCET” (and “BCET”) determination
To complete the Tick List, we need to know the worst-case execution time (WCET) of all of the tasks in the system. Because of the central importance of WCET, projects usually begin by predicting timing values for all tasks. The development team then needs to be prepared to obtain WCET information for all tasks frequently during the system development, in order to ensure that these requirements are met (late changes can be very expensive, and can reduce system reliability).
In most cases, we also need “Best-Case Execution Time” (BCET) information.
It is – therefore – essential that the development team can obtain timing data very easily throughout the project lifecycle.
One way in which this can be achieved is by ‘instrumenting’ a version of the scheduler that is used in your system: this process is illustrated in the ‘ERES2‘ book.
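The general idea of instrumenting task execution can be sketched in C as shown below. This is not the instrumented scheduler from ERES2: the names are hypothetical, and `timer_now_us` stands for whatever free-running microsecond timer the target hardware provides. Note that values recorded in this way are the shortest and longest times observed during testing, not guaranteed bounds.

```c
#include <stdint.h>

/* Observed timing for one task (hypothetical layout). */
typedef struct {
    uint32_t bcet_us;  /* Shortest execution time observed so far */
    uint32_t wcet_us;  /* Longest execution time observed so far  */
    uint32_t runs;     /* Number of measurements recorded         */
} task_timing_t;

void timing_init(task_timing_t *t)
{
    t->bcet_us = UINT32_MAX;
    t->wcet_us = 0u;
    t->runs    = 0u;
}

/* Record one measurement. Unsigned subtraction gives the correct
   elapsed time even if the timer wrapped once during the task. */
void timing_record(task_timing_t *t, uint32_t start_us, uint32_t end_us)
{
    uint32_t elapsed = end_us - start_us;

    if (elapsed < t->bcet_us) { t->bcet_us = elapsed; }
    if (elapsed > t->wcet_us) { t->wcet_us = elapsed; }
    t->runs++;
}

/* Run a task between two timer readings and record the result.
   timer_now_us is an assumed hardware timer access function. */
void run_instrumented(void (*task)(void),
                      uint32_t (*timer_now_us)(void),
                      task_timing_t *t)
{
    uint32_t start = timer_now_us();
    task();
    timing_record(t, start, timer_now_us());
}
```

A scheduler built for measurement would call `run_instrumented()` in place of a direct task call, and the recorded values could then be compared with the predictions used in the Tick List.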
18. Implement appropriate (TT) tasks
As noted above, a TT system design is usually based on a set of periodic tasks.
Each task must – of course – be designed and implemented with care, in line with good practice. For example, all incoming data and outgoing data should be checked before use. Basic checks include confirmation that the data are in range and that the rate-of-change is plausible. More comprehensive “sanity” checks may also be appropriate.
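The basic range and rate-of-change checks mentioned above might look something like the following C sketch. The structure and function names are hypothetical, and the limits themselves would come from the system requirements.

```c
#include <stdbool.h>
#include <stdint.h>

/* Plausibility limits for one signal (hypothetical layout). */
typedef struct {
    int32_t min_value;  /* Lowest valid reading                          */
    int32_t max_value;  /* Highest valid reading                         */
    int32_t max_step;   /* Largest credible change between two samples   */
} input_limits_t;

/* True if the new sample is in range and has not changed implausibly
   fast since the previous (already accepted) sample. */
bool input_is_plausible(const input_limits_t *lim, int32_t previous, int32_t current)
{
    if (current < lim->min_value || current > lim->max_value) {
        return false;                       /* Out of range */
    }

    /* 64-bit arithmetic avoids overflow when the samples are far apart */
    int64_t step = (int64_t)current - (int64_t)previous;
    if (step < 0) { step = -step; }

    return step <= (int64_t)lim->max_step;  /* Rate-of-change check */
}
```

A task would typically apply such a check to each input before use and to each output before it is applied, and report a fault if the check fails.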
In some cases, such basic checks will not be enough. For example, suppose you are controlling a gas-turbine engine as part of the primary flight control system in a passenger aircraft. Further suppose that the engine will begin to break down if the temperature of a key component exceeds 1,650°C (approximately the melting point of titanium). In this case, you would probably expect to employ at least three temperature sensors, in order to be sure that you have measured the temperature correctly.
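One common way of combining three redundant readings is to take the median and to check that at least two sensors agree. The text does not say how the three sensors are combined, so the C sketch below is an assumption offered for illustration, with hypothetical function names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Median of three readings: a single faulty sensor (too high or
   too low) cannot pull the result outside the other two values. */
int32_t median_of_three(int32_t a, int32_t b, int32_t c)
{
    if (a > b) { int32_t t = a; a = b; b = t; }  /* Now a <= b        */
    if (b > c) { b = c; }                        /* b = min(b, c)     */
    return (a > b) ? a : b;                      /* max(a, min(b, c)) */
}

static int64_t abs_diff(int32_t x, int32_t y)
{
    int64_t d = (int64_t)x - (int64_t)y;
    return (d < 0) ? -d : d;
}

/* True if at least two of the three readings lie within the given
   tolerance of each other; if not, no reading can be trusted. */
bool two_of_three_agree(int32_t a, int32_t b, int32_t c, int32_t tolerance)
{
    return abs_diff(a, b) <= tolerance ||
           abs_diff(b, c) <= tolerance ||
           abs_diff(a, c) <= tolerance;
}
```

With readings of 1600, 1605 and 20 (one failed sensor), the median is 1600 and two sensors still agree, so the measurement can be used; if no two readings agree, the task should report a fault.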
Basic techniques for designing tasks used in TT systems also need to be considered: our set of “patterns” for time-triggered designs may be of use in these circumstances.
Measurements of both “worst case” (WCET) and “best case” (BCET) task execution time will usually continue throughout the task design and implementation process. For example, tasks in TT systems are often designed to be “balanced”: that is, the BCET matches the WCET. This can – for example – reduce jitter levels in all system tasks, and make it easier to detect faults (and security breaches) at run time.
Further information about task design for TT embedded systems is provided in ‘ERES2‘.
19. Test (and verify) the operation of your system
When developing a TT design, we usually begin by modelling the system with one or more Tick Lists.
Having completed tests on the individual tasks, final system integration tests will involve confirming that the completed system matches the behaviour recorded in the Tick Lists.
Again, this process is described in ‘ERES2‘.
20. Incorporate an appropriate run-time monitoring system
In a perfect world, embedded systems would not require a run-time monitoring system. Instead, by the time the system was released into the field, the design, test and verification processes employed during development would guarantee that the system would always operate correctly.
Unfortunately, we don’t inhabit such a world:
- It is generally accepted that – as designs become ever larger – there will be some residual software errors in most products that have been released into the field: it is clearly important that the system should not operate dangerously in the event that such residual errors are present.
- We usually also need to ensure that the computer system functions correctly in the event that hardware faults occur (as a result, for example, of electromagnetic interference, or physical damage).
- We may also need to be concerned about the possibility that attempts could be made to introduce deliberate software errors into the system, by means of “computer viruses” and similar technology.
To address such concerns, we usually employ a MoniTTor® and / or PredicTTor® unit to support effective run-time monitoring.
When incorporating such monitoring systems in our design, we need to choose an appropriate implementation.
For example, suppose that we have created a TT software design to run on a microcontroller: we’ll refer to this combination of hardware and software as the “Main Processor”. If we wish to add a run-time monitoring system to this design, we have two broad options: [i] we can incorporate some or all of this monitoring on the Main Processor; and / or [ii] we can carry out some or all of this monitoring activity on a separate (independent) processor.
Where an independent processor (incorporating MoniTTor and / or PredicTTor technology) is employed with a TT design, we call this a WarranTTor® unit.
- Where the WarranTTor unit detects a problem, it is able to move the system to a safe state (even if the main processor unit is behaving outside the specification, or has ceased to function altogether): the timing of the transition to a safe state can be determined precisely at design time.
- The WarranTTor unit need not be located immediately adjacent to the unit that is being monitored: for example in some designs the WarranTTor unit may be located several metres from the main processor.
Both “internal” and “external” monitoring can be effective in TT designs, but there is usually a higher risk of common-cause failures in situations where internal monitoring is employed (that is, where there is no WarranTTor unit). It is therefore not surprising that – in an ISO 26262, ASIL D design (for example) – use of an external monitoring facility is “Highly Recommended” (ISO 26262-6, Table 4) as a means of error detection at the software architecture level.
Similar mechanisms are used to meet the requirements in other sectors. For example, IEC 61508 requires what is referred to as a “diverse monitor” in order to: “protect against residual specification and implementation faults in software which adversely affect safety” [61508-7: 2010, Section C.3.4].
You’ll find examples of the use of “internal” and “external” monitoring in ‘ERES2‘.
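As a generic illustration of the kind of check that a run-time monitor might perform, the C sketch below compares a measured task execution time with the BCET and WCET limits established during development. It is not a description of how the MoniTTor, PredicTTor or WarranTTor units work; the names are hypothetical and the check is deliberately minimal.

```c
#include <stdint.h>

/* Result of a timing check (hypothetical names). */
typedef enum {
    TIMING_OK,        /* Execution time within the expected limits    */
    TIMING_UNDERRUN,  /* Task finished sooner than its BCET           */
    TIMING_OVERRUN    /* Task took longer than its WCET               */
} timing_result_t;

/* Compare one measured execution time with the limits recorded for
   the task during development. */
timing_result_t check_task_timing(uint32_t measured_us,
                                  uint32_t bcet_us,
                                  uint32_t wcet_us)
{
    if (measured_us > wcet_us) { return TIMING_OVERRUN; }
    if (measured_us < bcet_us) { return TIMING_UNDERRUN; }
    return TIMING_OK;
}
```

Any result other than `TIMING_OK` suggests that the task is not behaving as it did during test and verification, and would normally trigger a Controlled System Failure. When tasks are “balanced”, the permitted window between BCET and WCET is narrow, which makes this kind of check more sensitive.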
21. From processor to distributed system?
Of course, many systems incorporate more than one processor. In such circumstances, we need to consider both the scheduling of tasks on the individual processors, and the scheduling of messages on the links between the processors.
We may also need to be sure (using “bus guardians”, for example) that failure of one processor in the system cannot have an impact on the other processors. We may require a duplicated (or triplicated) bus arrangement, or a different network topology.
You’ll find some detailed guidance on the development of reliable, distributed TT systems using “shared-clock” schedulers in ‘ERES2‘.
Further information and support
The design methodology presented in outline on this page is explored in detail in ‘ERES2‘.
This methodology is also discussed in our TTb training courses.
Many of the techniques introduced on this page are supported by our ReliabiliTTy Technology Licences.