Software tests its own integrity
October 18, 2018
In a previous posting, I talked about self-testing possibilities for embedded software, where the goal was to detect and mitigate the effects of hardware failure. I also briefly discussed software failure modes and I thought that this was a subject worth a more detailed look.
All non-trivial software has errors. I prefer not to call them “bugs,” as this disassociates them from the developer, who needs to own their mistakes. Obviously, well-designed software is likely to have fewer errors, and the application of modern embedded software development tools can keep them to a minimum. Of course, specific errors cannot be predicted (otherwise they could be eradicated), but certain types of software problems can be identified, and it may be possible to spot a problem before it becomes a disaster.
I would divide such software problems into two broad categories:
- data corruption
- code looping
As a significant amount of embedded code is written in C, developers are likely to be making use of pointers. Used carefully, pointers are a powerful feature of the language, but they are also one of the most common sources of programmer error. Problems with pointer usage are hard to identify statically, and the errors introduced might manifest themselves in subtle ways when the code is executed. Some things, like dereferencing a null pointer, are easily detected on processors where they cause a trap; it is then just necessary to implement a trap handler. Others are harder, as a pointer could end up pointing just about anywhere - more often than not it will be to a valid address, but, unfortunately, it may not be the correct one. There is little that self-testing code can do about this. There are, however, two special, but very common, cases of pointer usage where there is a chance: stack overflow and array bound violations.
Stack overflow should not occur, as the stack allocation should be carefully determined and its usage verified during the debug phase. However, it is quite possible to overlook an unusual situation or make use of a less testable construct (like a recursive function). A simple solution is to include an extra word at either end of the stack space - "guard words." These are pre-loaded with a specific value, which is monitored by a self-test task (which may run in the background). If the value changes, the stack limits have been violated. The value should be chosen carefully. An odd number is best, as that would not represent a valid word-aligned address on most processors. Perhaps 0x55555555. So long as the value is "unlikely" - so not 0x00000001 or 0xffffffff, for example - there is only about a 1 in 4 billion (1 in 2^32) chance that a stray write to a 32-bit guard word happens to store the same value and so goes unnoticed.
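To make the guard word idea concrete, here is a minimal sketch in C. It is only an illustration under some assumptions: the type `guarded_stack_t`, the size `STACK_WORDS` and the function names are all hypothetical, and in a real system the stack region and its guard words would normally be placed by the linker or the RTOS rather than declared as a structure like this.

```c
#include <stdbool.h>
#include <stdint.h>

#define GUARD_VALUE  0x55555555u  /* odd, so not a valid word-aligned address */
#define STACK_WORDS  64u          /* hypothetical stack size in 32-bit words */

/* A stack region with one guard word at either end of the usable space. */
typedef struct {
    uint32_t low_guard;
    uint32_t space[STACK_WORDS];
    uint32_t high_guard;
} guarded_stack_t;

/* Pre-load both guard words; call once before the stack is put to use. */
static void stack_guard_init(guarded_stack_t *s)
{
    s->low_guard  = GUARD_VALUE;
    s->high_guard = GUARD_VALUE;
}

/* Called periodically by the self-test task.
   Returns true while both guard words still hold the expected value. */
static bool stack_guard_ok(const volatile guarded_stack_t *s)
{
    return s->low_guard == GUARD_VALUE && s->high_guard == GUARD_VALUE;
}
```

The self-test task would call `stack_guard_ok()` every so often and take action as soon as it returns false. The `volatile` qualifier is there so that the compiler really re-reads the guard words on each check.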
In some languages, there is built-in detection for addressing outside the bounds of an array, but this introduces a runtime overhead, which may be unwelcome. So, this is not implemented in C. Also, it is possible to access array elements using pointers, instead of the [ ] operator, so any checking might be circumvented. The best approach is to just check for buffer overrun errors by locating a guard word at the end of an array and monitoring it in the same way as the stack overflow check.
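A guard word at the end of an array might look something like the following sketch. The names `guarded_buf_t`, `BUF_SIZE` and `BUF_GUARD_VALUE` are made up for illustration. I have assumed a byte buffer whose size is a multiple of four, so that the compiler inserts no padding between the data and the guard word; the static assertion (C11) checks that assumption at build time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_GUARD_VALUE 0x55555555u
#define BUF_SIZE        16u  /* hypothetical; a multiple of 4 avoids padding */

/* A buffer with a guard word located immediately after its last element. */
typedef struct {
    uint8_t  data[BUF_SIZE];
    uint32_t guard;
} guarded_buf_t;

/* Build-time check that an overrun of even one byte lands on the guard. */
_Static_assert(offsetof(guarded_buf_t, guard) == BUF_SIZE,
               "guard word must directly follow the data");

/* Clear the buffer and pre-load the guard word. */
static void guarded_buf_init(guarded_buf_t *b)
{
    memset(b->data, 0, sizeof b->data);
    b->guard = BUF_GUARD_VALUE;
}

/* Called by the self-test task; false means something wrote past the end. */
static bool guarded_buf_ok(const volatile guarded_buf_t *b)
{
    return b->guard == BUF_GUARD_VALUE;
}
```

A write that runs just one byte beyond `data` changes the guard word, so `guarded_buf_ok()` reports the problem the next time the self-test task runs, regardless of whether the access was made with an index or a pointer.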
In both these cases, when the guard word is corrupted, this is an indication of an impending failure. It may be that the stack or array has over- or under-flowed by just one word, so no real damage has yet been done. Locating the cause of the bad access is much easier than debugging the random crash that might otherwise occur.
Code should never get stuck in an infinite loop, but a logic error or the non-occurrence of an expected external event might result in code hanging. When code is waiting for something to happen, there should ideally be a timeout mechanism, so that the code does not get hung indefinitely.
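As an illustration of a timeout, here is one possible shape for a wait loop. Everything here is an assumption made for the sketch: `wait_with_timeout()` is a hypothetical helper, and the tick source and the "event ready" test are passed in as function pointers. The `fake_now()`, `never_ready()` and `always_ready()` functions only simulate hardware so that the code can be tried on a host; on a target they would be replaced by a real timer read and a real status check.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t (*tick_fn)(void);  /* returns a free-running tick count */
typedef bool (*ready_fn)(void);     /* returns true once the event has occurred */

/* Poll for an event, but give up after timeout_ticks.
   Returns true if the event occurred, false if the wait timed out. */
static bool wait_with_timeout(ready_fn ready, tick_fn now, uint32_t timeout_ticks)
{
    const uint32_t start = now();

    while (!ready()) {
        /* Unsigned subtraction stays correct when the counter wraps around. */
        if ((uint32_t)(now() - start) >= timeout_ticks) {
            return false;
        }
    }
    return true;
}

/* Simulated hooks, for demonstration only. */
static uint32_t fake_ticks;
static uint32_t fake_now(void)     { return fake_ticks++; }
static bool     never_ready(void)  { return false; }
static bool     always_ready(void) { return true; }
```

With these hooks, `wait_with_timeout(never_ready, fake_now, 100u)` returns false instead of hanging, and the caller can then decide how to handle the missing event.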
In any kind of multi-threaded environment - either an RTOS or mainline code with ISRs - it is possible to implement a "watchdog" mechanism. Each task that runs continuously (which might be just the mainline code) needs to "check in" with the watchdog task (which may be a timer ISR) every so often. If a timeout occurs, action needs to be taken.
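The check-in scheme could be sketched along these lines. The number of tasks, the tick limit and the function names are all hypothetical, and I have assumed that `watchdog_tick()` is called from a timer ISR, so that it cannot be interrupted by the tasks it monitors. If the watchdog were itself a task under a preemptive RTOS, access to the counters might need protection.

```c
#include <stdint.h>

#define WD_TASKS        3u  /* hypothetical number of monitored tasks */
#define WD_LIMIT_TICKS  5u  /* hypothetical: ticks allowed between check-ins */

/* Ticks elapsed since each task last checked in. */
static volatile uint32_t wd_idle_ticks[WD_TASKS];

/* Called by each monitored task from its main loop. */
static void watchdog_check_in(uint32_t task_id)
{
    if (task_id < WD_TASKS) {
        wd_idle_ticks[task_id] = 0u;
    }
}

/* Called once per tick from the timer ISR (or watchdog task).
   Returns the ID of the first task that has exceeded the limit,
   or -1 if every task has checked in recently enough. */
static int watchdog_tick(void)
{
    int hung = -1;

    for (uint32_t i = 0u; i < WD_TASKS; i++) {
        const uint32_t idle = wd_idle_ticks[i] + 1u;
        wd_idle_ticks[i] = idle;
        if (idle > WD_LIMIT_TICKS && hung < 0) {
            hung = (int)i;
        }
    }
    return hung;
}
```

A non-negative return value from `watchdog_tick()` identifies a task that has stopped checking in, which is the point at which action needs to be taken.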
So, what is to be done when a stack overflow, array bound violation or hanging task is detected? This depends on the application. It may be necessary to just stop and restart a single task, but more drastic action may be required: stop the system, sound an alarm of some kind, or simply reset the system. The choice depends on many factors, but broadly the goal is for something better than a crashed system.