In the last few months I have been investigating real-time systems and the conflict between the two major approaches used to build them. This confrontation has caused me to think more deeply about the concept of time-triggered systems, and I am beginning to view it as a significant and valuable paradigm shift when compared with my earlier understanding.
To set the stage properly, we need to start with an understanding of the concept of “real-time”. In this article I will be using “real-time” as a synonym for “deterministic”. This means that every action performed in a real-time system must be done at the required moment (within predetermined margins of error and with a predetermined maximum level of uncertainty, also known as jitter).
The classical means of dealing with real-time requirements is to start with the observation that the world around us is asynchronous. We are constantly bombarded with inputs to which we must react. Real-time systems are therefore built to be predictably reactive. External events result in interrupts, which trigger the actions needed to respond to those events. Synchronization mechanisms (e.g. semaphores, spinlocks, etc.) are used to allow software to wait, poised, ready to react “instantly” when there is something to do. When conflicts arise because of the arrival of multiple closely spaced (or simultaneous) events, the conflicts are resolved first by folding in the concept of priorities and then by using clever and complex scheduling algorithms. In this world-view, polling is abjured as a waste of processing power and an unreliable mechanism for achieving the desired goals of performance and reactivity.
Such systems are both widely used and extremely powerful. Unfortunately, they are very difficult to certify when functional safety is an issue. The problem typically results from the need to prove that fault tolerance times are satisfied. When a safety-relevant problem is detected, the system must respond within a specified maximum time during which the fault can be tolerated. In a normal POSIX-oriented RTOS, the actual response time will be impacted by the internal state of the system and by asynchronous events from unrelated sources. For example, the number of active threads can affect the time required for scheduling. Since the performance of the system when responding to a problem cannot be precisely predicted, certification is challenging. One is forced to use statistical arguments, which require a large enough sample to be meaningful. This in turn results in extensive and time-consuming testing.
For many years this has been my world-view as well. I find it fascinating now to watch this world-view being turned inside out by my recent in-depth introduction to the concept of time-triggered systems.
Let us start by revisiting the idea of determinism. Every action in a deterministic system will be performed at a precise moment in time. If this were to be recorded, the result would be a simple list of times and actions. What happens if we turn this around? Let’s eliminate all other interrupts in the system and use only the timer interrupt. The timer can then be used to trigger each action in our list so that it will be done at precisely the correct moment in time. The result is the creation of a strictly deterministic system in place of a “reactive” system.
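To make the “list of times and actions” concrete, here is a minimal sketch in Python. It is only a simulation under stated assumptions: the three actions, the 10 ms cycle, the slot offsets and the 1 ms tick are all hypothetical, and the loop in `simulate` merely stands in for a hardware timer interrupt.

```python
from typing import Callable, List, Tuple

# Record of (time in ms, action name), so the behavior can be inspected.
log: List[Tuple[int, str]] = []

# Hypothetical actions; in a real system each would have a known
# worst-case execution time that fits inside its slot.
def read_sensor(now_ms: int) -> None:
    log.append((now_ms, "read_sensor"))

def run_control(now_ms: int) -> None:
    log.append((now_ms, "run_control"))

def write_actuator(now_ms: int) -> None:
    log.append((now_ms, "write_actuator"))

# The list of times and actions: offsets (ms) within one repeating cycle.
# Cycle length and offsets are assumed values for illustration.
CYCLE_MS = 10
SCHEDULE: List[Tuple[int, Callable[[int], None]]] = [
    (0, read_sensor),
    (2, run_control),
    (6, write_actuator),
]

def on_timer_tick(now_ms: int) -> None:
    """Stand-in for the timer interrupt handler, the only trigger in the system."""
    offset = now_ms % CYCLE_MS
    for slot_offset, action in SCHEDULE:
        if slot_offset == offset:
            action(now_ms)

def simulate(total_ms: int) -> None:
    """Simulate a 1 ms timer tick for total_ms milliseconds."""
    for now_ms in range(total_ms):
        on_timer_tick(now_ms)

simulate(20)
for entry in log:
    print(entry)
```

Nothing in this sketch waits on an event or compares priorities. The order and timing of the actions are fixed entirely by the table, which is what makes the behavior repeatable from one cycle to the next.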
What are the stumbling blocks that must be considered?
The construction of the timing table is a complex task. It requires a detailed understanding of the dynamic behavior of the system that is rarely available. It is rare to find a software architecture analysis that goes into sufficient depth to fully describe timing interactions. Furthermore, the integration of all required activities into a single overall scheme can easily consume vast numbers of engineering hours when done by hand. The aerospace industry reportedly has experience with spending months in this fashion. This is an area which requires good tool support to automate the search for an optimal solution (or to detect when no solution exists).
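The kind of search such a tool performs can be hinted at with a toy example. The sketch below is an assumption-laden simplification, not a description of any real tool: tasks are strictly periodic, non-preemptive, described by a hypothetical `(name, period, wcet)` tuple in abstract ticks, and placed by a simple greedy first-fit pass over one hyperperiod. A real tool would also handle dependencies, jitter constraints and multiple cores, and would search far more thoroughly.

```python
from functools import reduce
from math import gcd
from typing import List, Optional, Tuple

Task = Tuple[str, int, int]  # (name, period, worst-case execution time), in ticks

def _lcm(a: int, b: int) -> int:
    return a * b // gcd(a, b)

def build_table(tasks: List[Task]) -> Optional[List[Optional[str]]]:
    """Greedy first-fit placement of every job within one hyperperiod.

    Returns a timeline with one entry per tick (task name or None for idle),
    or None if this greedy pass cannot place some job before its deadline.
    A None result does not prove that no schedule exists; it only means
    this simple heuristic did not find one.
    """
    hyper = reduce(_lcm, (period for _, period, _ in tasks), 1)
    timeline: List[Optional[str]] = [None] * hyper

    # Place the shortest-period tasks first (stable sort keeps input order on ties).
    for name, period, wcet in sorted(tasks, key=lambda t: t[1]):
        for release in range(0, hyper, period):
            deadline = release + period  # each job must finish before its next release
            for start in range(release, deadline - wcet + 1):
                if all(timeline[t] is None for t in range(start, start + wcet)):
                    for t in range(start, start + wcet):
                        timeline[t] = name
                    break
            else:
                return None  # no free window for this job
    return timeline

# Hypothetical task set: names, periods and execution times are made up.
print(build_table([("A", 4, 1), ("B", 8, 2), ("C", 8, 3)]))
# An overloaded task set (utilization above 100%) cannot be placed.
print(build_table([("A", 2, 1), ("B", 4, 3)]))
```

Even this toy version shows why hand construction scales badly: every added task changes which windows remain free for all the others.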
How about external events that are truly asynchronous? These must be handled by polling at sufficiently frequent intervals that critical events will always be seen. However, this is not the classical “active wait loop” that represents an unbounded waste of processing capacity. Instead it is a repeated time trigger at controlled intervals to check for the occurrence of an expected event. If input buffer overflow, resulting in loss of data, would be a problem, the system will need to poll often enough to ensure that this cannot happen. Similarly, if output buffer underflow is a problem, these buffers will also need to be periodically refilled as long as there is data available to send.
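How often is “often enough”? A rough bound follows from the buffer size and the fastest rate at which data can arrive. The sketch below assumes that each poll drains the buffer completely and that the gap between two polls never exceeds the nominal period plus a known jitter. The function name, the 64-byte buffer and the UART figures are illustrative assumptions, not values from any particular system.

```python
def max_poll_period_s(capacity_bytes: int,
                      max_rate_bytes_per_s: float,
                      jitter_s: float = 0.0) -> float:
    """Longest polling period that cannot overflow the input buffer.

    Between two polls at most max_rate * (period + jitter) bytes arrive,
    so we need max_rate * (period + jitter) <= capacity, which gives
    period <= capacity / max_rate - jitter.
    """
    if capacity_bytes <= 0 or max_rate_bytes_per_s <= 0 or jitter_s < 0:
        raise ValueError("capacity and rate must be positive, jitter non-negative")
    period = capacity_bytes / max_rate_bytes_per_s - jitter_s
    if period <= 0:
        raise ValueError("buffer too small for this data rate and jitter")
    return period

# Hypothetical example: a 64-byte receive buffer fed by a 115200 baud UART.
# With 10 bits on the wire per byte, that is at most 11520 bytes per second.
rate = 115200 / 10
period = max_poll_period_s(64, rate)
print(f"poll at least every {period * 1000:.2f} ms")
```

In practice one would pick a period comfortably below this limit and then fit it into the timing table. The same reasoning, mirrored, gives the refill interval for an output buffer.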
My knee-jerk response to the concept of polling is rejection. The most pertinent argument is that it is an unproductive use of processing time. However, the time actually lost is only the sum of the polls that find nothing to do. The advantage of time-triggered systems is that an absolute guarantee exists for the frequency and timing of all processing steps involved. When the polling frequencies are matched to the processing times at each step along the data path, most polls find work waiting, and the amount of “unproductive” polling ends up being very small.
When functional safety is added to the mixture, the timing behavior of the system remains fully deterministic. Because the system behaves the same way in every cycle, it becomes straightforward to verify that problems are handled within the specified time after detection.
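As a rough illustration of why this verification becomes simple, the worst-case reaction time can be read almost directly from the timing table. The sketch below assumes a fault check that samples its input at the instant it is triggered, and a reaction that runs in a fixed slot after each check. All names and numbers are hypothetical.

```python
def worst_case_reaction_ms(poll_period_ms: int,
                           reaction_offset_ms: int,
                           reaction_wcet_ms: int) -> int:
    """Upper bound on the time from fault occurrence to completed reaction.

    Worst case: the fault appears just after a check, so it waits one full
    polling period to be seen. The reaction then starts reaction_offset_ms
    after that check and takes at most reaction_wcet_ms to finish.
    """
    return poll_period_ms + reaction_offset_ms + reaction_wcet_ms

def meets_fault_tolerance_time(poll_period_ms: int,
                               reaction_offset_ms: int,
                               reaction_wcet_ms: int,
                               fault_tolerance_time_ms: int) -> bool:
    """True if the worst-case reaction fits within the fault tolerance time."""
    bound = worst_case_reaction_ms(poll_period_ms, reaction_offset_ms, reaction_wcet_ms)
    return bound <= fault_tolerance_time_ms

# Hypothetical figures: check every 5 ms, reaction slot 1 ms after the check,
# reaction takes at most 2 ms, fault must be handled within 10 ms.
print(worst_case_reaction_ms(5, 1, 2))
print(meets_fault_tolerance_time(5, 1, 2, 10))
```

The bound is plain arithmetic on values taken from the table. No statistical argument is needed, which is the contrast with the event-driven case described earlier.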
Another argument against the time-triggered concept is that a significant amount of processing capacity is unused. The time interval reserved for an action must be long enough to allow for the worst-case execution time. Usually actions will complete earlier, leaving the system with nothing to do until the next time trigger is received. I have recently visited a company, Krono-Safe, that has a solution for this problem. Their Asterios RTOS, which is designed and optimized to support time-triggered concepts, allows cohabitation with a second OS. This second OS (e.g. Linux) is then scheduled as the “idle task” of the time-triggered RTOS. A variant of this concept, which was implemented for the French military, provides even greater flexibility by allowing cohabitation with a classical hypervisor. This cohabitation concept allows extremely strong real-time guarantees in the time-triggered environment while allowing less time-critical activities to run in a more traditional environment.
I am beginning to understand that the classical POSIX-oriented approach to real-time can only provide relatively weak guarantees for deterministic behavior. The time-triggered approach offers much stronger guarantees but requires a significant amount of infrastructure (specialized OS and tooling) to be able to fulfill its promise. In tomorrow’s world, with the development of autonomous driving and the resulting need for “fail-operational” behavior, the time-triggered philosophy is likely to become a very valuable tool.
Further Reading: “Computing Needs Time”, Technical Report by E.A. Lee, EECS dept. UC Berkeley, Feb 18, 2009