Safety is a Herculean Task!

C.P. Ravikumar, Texas Instruments

“Better safe than sorry,” so goes the old saying. I had a chance to listen to Karl Greb, a veteran in the subject of functional safety at Texas Instruments, speak on this specialized topic when he visited Bangalore.

You may have read in the newspapers about incidents where lives are lost, people are injured, or property is damaged due to the malfunctioning of a system. A lithium-ion battery catches fire due to overheating during a recharge; remember that lithium-ion batteries are used in a number of electronic gadgets, including mobile phones and laptops. A car accelerates or decelerates by itself, without the driver intending it to. An elevator crashes. An automatic door fails to sense a hand and closes, crushing it in the process. A joy ride in a park turns into a nightmare. A user of electrical equipment receives a shock.

In each of these examples, considerable analysis is required to pinpoint the precise reason for the malfunction of the complete system. A system comprises mechanical, electrical, and electronic components. Apart from manufacturing defects, components also fail due to mechanical stress, thermal stress, electrical stress, aging, and environmental influences. With more functionality integrated into the same chip, one can expect a higher failure rate if no changes are made to the design.

Tolerating Faults

One of my teachers had a permanent disability in his right hand due to an accident; he began to use his left hand to write on the blackboard. Fault tolerance is a design principle along the same lines: the device is built so that it can keep working correctly, or at least fail safely, even when some part of it develops a fault. Fault-tolerant design is important in safety-critical systems. Given that more and more electronic content is entering safety-critical systems such as automobiles, medical systems, and industrial automation, system designers seek building blocks that are fault-tolerant. In a technique called spatial redundancy, designers duplicate hardware blocks. In a technique called code redundancy, designers use error correction codes (ECC) to detect, and where possible correct, faults in the storage or communication of data.
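To make code redundancy a little more concrete, here is a small Python sketch of a textbook extended Hamming code that protects 4 data bits with 4 check bits. It can correct any single flipped bit and detect (but not correct) two flipped bits. The names `encode` and `decode` are my own, chosen for illustration, and the sketch is not meant to describe how any particular memory or MCU implements its ECC.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 holds the overall parity bit
    # Data bits go in positions 3, 5, 6, 7; check bits in positions 1, 2, 4.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2                # makes the total number of 1s even
    return code


def decode(code: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); data is None when the error cannot be corrected."""
    # XOR of the positions holding a 1 is zero for a valid codeword,
    # and equals the position of the flipped bit after a single error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_parity_bad = sum(code) % 2 == 1

    fixed = list(code)
    if syndrome == 0 and not overall_parity_bad:
        status = "ok"
    elif overall_parity_bad:
        # An odd number of bits flipped: treat it as a single error and fix it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with good overall parity: two bits flipped.
        return None, "double error detected"

    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return data, status


word = encode(0b1011)
word[5] ^= 1                 # simulate a single-bit fault in storage
print(decode(word))          # (11, 'corrected')
```

The price of this protection is the extra storage for the check bits, which is the "redundancy" in code redundancy.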


Fault Tolerance!

Art by Ananya Ravikumar

The Hercules processor from Texas Instruments is intended for safety-critical applications and has been designed with safety features built in. The use of two CPUs in this MCU is an example of spatial redundancy. The use of ECC memory is an example of code redundancy. Many other design features are included in these MCUs to enable customers to implement safety standards for the end equipment. Finally, the software developer can also make use of temporal redundancy, where a computation is repeated to verify the result; remember how you verify your answers in an examination by repeating your calculation! Go through the online documentation on Hercules to learn more about the other safety features it includes; the quiz below may help you in the process.
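Temporal redundancy is easy to try out in software. The sketch below, in Python, runs a computation twice and accepts the result only if both runs agree. The helper `run_twice_and_compare` and the exception `RedundancyMismatch` are hypothetical names of my own, not part of any TI library.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class RedundancyMismatch(Exception):
    """Raised when two runs of the same computation disagree."""


def run_twice_and_compare(compute: Callable[[], T]) -> T:
    """Run `compute` twice and return its result only if both runs match."""
    first = compute()
    second = compute()
    if first != second:
        raise RedundancyMismatch(f"results differ: {first!r} vs {second!r}")
    return first


# Usage: the computation is passed in as a function taking no arguments.
checksum = run_twice_and_compare(lambda: sum(range(10)))
print(checksum)  # 45
```

Note that repeating a calculation on the same hardware can only catch a fault that goes away between the two runs. A permanent fault would produce the same wrong answer twice, which is one reason spatial redundancy is used alongside it.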


Testing and Quality Assurance

Testing is how a vendor ensures that chips with manufacturing defects are not sold, so that quality can be assured to the customer. The number of parts found to be defective at the customer’s site for every million chips purchased from the vendor is a measure of quality known as DPPM (Defective Parts Per Million). Good manufacturing and testing bring this figure down. To enable testing of the device for the purposes of maintenance, designers implement a “self-testing” feature in both logic and memory blocks.
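The DPPM figure is simple arithmetic: scale the fraction of defective parts to one million. Here is a minimal Python sketch; the function name `dppm` and the numbers in the usage line are made up for illustration.

```python
def dppm(defective_parts: int, parts_shipped: int) -> float:
    """Defective parts per million parts shipped."""
    if parts_shipped <= 0:
        raise ValueError("parts_shipped must be positive")
    if not 0 <= defective_parts <= parts_shipped:
        raise ValueError("defective_parts must be between 0 and parts_shipped")
    return defective_parts * 1_000_000 / parts_shipped


# Hypothetical example: 3 defective parts found among 2,000,000 shipped.
print(dppm(3, 2_000_000))  # 1.5
```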

Semiconductor vendors specify the expected lifetime of their devices, since a chip may fail due to electrical and thermal stresses as well as environmental influences such as humidity. Sometimes, faults are caused by electrostatic discharge, electromagnetic interference, or radiation. Semiconductor manufacturers perform stress testing to reduce the probability of a chip failing before the end of its specified lifetime. The lifetime of the overall system is expected to be extended by regular maintenance and replacement of old parts.

 There is still a chance that a device may fail within its lifetime and result in malfunctioning of the overall system. This is where fault-tolerance becomes important. 

Teaching Safety Concepts

While listening to Karl Greb's presentation, the thought going through my mind was: how can these important concepts be taught to engineering students? That was my motivation for writing this blog entry. I am not sure if concepts such as testing, quality, safety, and fault tolerance are emphasized in the coursework. Perhaps it would be good to include at least one lecture on this topic in a course on microcontrollers or embedded system design. I would think that a full course on the topic is needed in a postgraduate curriculum specializing in electronics or allied areas. I will be glad to hear your opinions on the subject! Before I sign off, here is an invitation to take part in an adventure! Don't worry, it is perfectly safe!


Quiz - The 12 Labors of Hercules!

Hercules is the Roman name for the divine hero Heracles from Greek mythology. He was the son of Zeus and is known for his physical strength, with which he performed twelve great feats, also known as the “Twelve Labors of Hercules.” Here is your chance to perform a Herculean task. Fortunately, you will have the power of search engines to locate answers to these quiz questions on the Internet!

1. Many processors are used inside an automobile. These processors make use of a network communication protocol to exchange information. Which of these network communication protocols, supported by the TMS570 Hercules processor, was designed for fault-tolerant operation?

    1. CAN
    2. LIN
    3. FlexRay
    4. SafeTI

2. Match the following!

    End Equipment                 Safety Standard
    Car                           IEC 61508
    Washing Machine               IEC 60730
    Ventilator used in an ICU     ISO 26262

 3. Two processor cores are used in the Hercules MCU. Which ones?
 

    1. Two MSP430 processors
    2. One ARM Cortex-M4 and an MSP430
    3. Two ARM Cortex-R4 cores
    4. One ARM9 core and one ARM Cortex-R4 core

 4. Which of the following design precautions reduce common-mode failures in Hercules?

    1. Two separate clock trees are used to provide clock signals to the processors
    2. Use of ECC Flash Memory and RAM
    3. Use of Built-in Self-Test for memory and logic
    4. All of the above

 5. TI provides SafeTI design packages to help customers achieve safety certification for the end products that are used in safety-critical applications. SafeTI design packages are available for which safety standards?

    1. IEC 61508
    2. ISO 26262
    3. IEC 60730
    4. All of the above

 6. In an MCU a watchdog timer is

    1. A timer intended for computing elapsed time between two events
    2. A safety device to prevent theft of CPU cycles by a virus
    3. A safety device to prevent system lockup
    4. A timer intended to turn on the burglar alarm

 7. In the Hercules MCU, the watchdog timer is made more robust by

    1. Doubling the number of bits in the timer
    2. Doubling the clock speed of the timer
    3. Flagging a fault if the watchdog timer is reset outside a time window
    4. Ensuring that the watchdog timer cannot be reset

 

8. Hercules uses two CPU cores – let us call them A and B. Which of these statements is correct?

    1. A and B use different instruction sets to compute the same function, thereby catching an error if the outputs do not match
    2. B executes the same instruction as A and checks its output matches that of A
    3. Both A and B execute the same instruction at the same time and a checker is used to compare the results from A and B
    4. B executes the same instruction as A, but with a small delay, and a checker is used to compare the results from A and B

 

9. The ECC memory in Hercules is capable of

    1. Single Error Detection and Correction
    2. Double Error Detection and Single Error Correction
    3. Double Error Correction and Single Error Detection
    4. Double Error Detection and Double Error Correction

 

10. A Hercules MCU is used in a motor control application. The feedback signal from the motor is a critical signal and must be monitored in a fail-safe way. How does Hercules support this?

    1. By providing a special hardware accelerator for monitoring critical signals
    2. By allowing more than one on-chip ADC to receive the same signal for monitoring
    3. By providing parity check on critical signals
    4. By providing a special instruction in the CPU for monitoring critical signals

 

11. Use of a safety MCU from the Hercules family will

    1. Help improve the MTBF metric
    2. Help improve the fault coverage metric
    3. Help improve the safety factor of the mechanical load connected to motors
    4. All of the above

 

12. FIT is defined to be 1 failure in 1,000,000,000 hours. If I have a piece of equipment that has a rating of 50 FIT, then

    1. It may fail a maximum of 50 times in a year
    2. It may fail a maximum of 0.0000000005 times in a year
    3. It may fail a maximum of 0.0000000005 times in an hour
    4. None of the above

 

 


Anonymous
  • 1. FlexRay

    2.  Car: ISO 26262,

      Washing Machine: IEC 60730,

      Ventilator- IEC 61508

    3. Two ARM Cortex-R4 cores

    4. All of the above

    5. All of the above

    6. A safety device to prevent system lock up.

    7. Flagging a fault if the watchdog timer is reset outside a time window.

    8. B executes the same instruction as A, but with a small delay, and a checker is used to compare the results from A and B.

    9. Double Error Detection and Single Error Correction

    10. By allowing more than one on-chip ADC to receiving the same signal for monitoring

    11. All the above

    12. None of the above

  • On behalf of Rubin Kothari Answers received on Feb 22,2013

    1) CAN

    2) car- ISO26262

    WASHING MACHINE-IEC60730

    Ventilator- IEC61508

    3)TWO ARM Cortex-R4 cores

    4)ALL OF ABOVE

    5) ALL OF ABOVE

    6) A safety device to prevent system lockup

    7) Flagging a fault if the watchdog timer is reset outside a time window

    8) B executes the same instruction as A, but with a small delay, and a checker is used to compare the results from A and B

    9) Single Error Detection and Correction

    10) By allowing more than on on-chip ADC to receiving the same signal for monitoring

    11)all of the above

    12)none of above

  • On Behalf of Vardhan Roy received on March 15,2013 10.38 PM

    1) Flex Ray

    2).  Car : ISO 26262,

         Washing Machine : IEC60730,

         Ventilator - IEC61508 .

    3) C. Two ARM Cortex-R4 Cores

    4) D) All of The above

    5) A) IEC 61508

    6) C) A safety device to prevent system lockup

    7) A

    8) D) B executes the same instruction as A, but with a small delay, and a checker is used to compare the results from A and B.

    9) D) Double Error Detection and Double Error Correction

    10) B) By allowing more than one on-chip ADC to receiving the same signal for monitoring

    11) D) All the above

    12) D) None of the above

  • Thanks to everyone who took part in the quiz! We will contact the winner by 6.00pm, March 15, 2013.

  • On Behalf of Seenu Malepati answers received on

    1) c.     FlexRay

    2)

    Car ISO 26262

    Washing Machine IEC 60730

    Ventilator used in an ICU IEC 61508

    3) c.     Two ARM Cortex-R4 cores

    4)  d.    All of the above

    5)  d.    All of the above

    6)  c.    A safety device to prevent system lockup

    7)  c.    Flagging a fault if the watchdog timer is reset outside a time window

    8)  d.    B executes the same instruction as A, but with a small delay, and a checker is used to compare the results from A and B

    9)  b.     Double Error Detection and Single Error Correction

    10) b.   By allowing more than on on-chip ADC to receiving the same signal for monitoring

    11) d.   All of the above

    12)  d.  None of the above