When human health and safety depend on software, correct system operation is the most important concern. Unfortunately, inadequate testing and verification in such systems allow program defects to go undetected and later turn into hazardous failures. Here are a few examples from a long history of software failures that have cost lives:
- Therac-25 (the successor of Therac-6 and Therac-20) was a radiation therapy machine manufactured by Atomic Energy of Canada Limited (AECL). The machine had two modes of radiation therapy: high power and low power. In Therac-25’s predecessors, switching between modes was controlled by a hardware lock. In Therac-25, the hardware lock was replaced by a software lock, which failed because of an undetected race condition. This fatal bug caused the deaths of three patients between 1985 and 1987, and several lawsuits were filed as a result of these accidents in the late 1980s. Nancy Leveson, a professor at the University of Washington, and Clark Turner, a graduate student at the University of California at Irvine, conducted a long investigation into the software defect and published a report in IEEE Computer in 1993 [1].
- Toyota was forced to recall more than 10 million vehicles between 2009 and 2011 due to an unintended acceleration problem. Reports from the National Highway Traffic Safety Administration documented 6,200 complaints involving unintended acceleration of Toyota vehicles. Further investigation revealed that 89 deaths and 57 injuries were potentially linked to major Toyota recalls. NASA also investigated the issue and published a public report of its findings. Several problems were detected in Toyota’s Electronic Throttle Control System (ETCS) and chipsets, and NASA’s investigation traced these defects to significant software development mistakes. For example, NASA confirmed that no timing analysis had been performed. In particular, no worst-case execution time (WCET) analysis was conducted because of the complex nature of the embedded software. As with Therac-25, lawsuits have been filed against Toyota, and in March 2014, Toyota was fined 1.2 billion dollars for “concealing safety defects.” Other suits are still pending. Professor Philip Koopman of Carnegie Mellon University, who served as an expert witness in this case, gave a talk about his findings, which is available to watch [2].
Dr. Mahdi Eslamimehr, an expert witness in testing and analysis, and his group at UCLA and SAP Labs have been studying critical fault detection in embedded software systems for many years. In a study published in 2013, they showed how testing can accurately predict stack memory overflow and could have prevented serious incidents like the closure of a German railway station in 1995 [3]. In another recent publication, they addressed a challenging problem: how accurately the WCET of a program can be determined [4]. This is not a new problem, and researchers have tried to solve it using a variety of static and dynamic analysis techniques. Static analysis tends to overestimate execution time and produces an upper bound for WCET, while dynamic analysis underestimates execution time and provides a lower bound for WCET. To narrow this gap, Dr. Eslamimehr and his team proposed a new technique, called event-based directed testing, for testing real-time embedded software. They implemented their technique in an existing tool called VICE, augmenting it to automatically test event-driven software without a human in the loop. Experiments show that, compared to previous techniques, VICE produces significantly more accurate WCET estimates. Compared to random testing, genetic algorithms, and traditional directed testing, VICE improves estimates of WCET by 203%, 176%, and 97%, respectively.
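The relationship between the two kinds of bounds can be illustrated with a small toy model. This is only a sketch under stated assumptions and is not VICE or event-based directed testing: the handler, its abstract "cycle" costs, and the queue limit `MAX_EVENTS` are all invented for illustration, the dynamic side is plain random testing, and the static side is a deliberately naive path-insensitive bound.

```python
import random

MAX_EVENTS = 8  # assumed upper limit on queued events (hypothetical)


def handler_cost(events):
    """Abstract cycle count of a toy event handler for one batch of events."""
    cycles = 0
    for e in events:
        cycles += 2            # fixed dispatch overhead per event
        if e % 7 == 0:
            cycles += 50       # rare, expensive path
        if e % 7 == 1:
            cycles += 30       # mutually exclusive with the branch above
    return cycles


def static_upper_bound():
    """Naive path-insensitive bound: assume both branches run every iteration."""
    return MAX_EVENTS * (2 + 50 + 30)


def measured_lower_bound(trials, seed=0):
    """Random testing: the worst cost observed over randomly generated batches."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        n = rng.randint(0, MAX_EVENTS)
        events = [rng.randint(1, 1000) for _ in range(n)]
        worst = max(worst, handler_cost(events))
    return worst


# In this toy model the true worst case is known: every event takes the
# expensive branch, for example a full queue of events equal to 7.
true_wcet = handler_cost([7] * MAX_EVENTS)

if __name__ == "__main__":
    print("measured lower bound:", measured_lower_bound(1000))
    print("true WCET:           ", true_wcet)
    print("static upper bound:  ", static_upper_bound())
```

Random testing can only report a cost it actually observed, so it never exceeds the true WCET, while the naive static bound counts an infeasible path and so overshoots it. Better test generation aims to push the measured value up toward the true worst case.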