Faults, errors and failures induced by internal causes (process variability, aging, signal integrity) and external ones (radiation, single-event effects, dynamic operating environments) represent an increasing source of reliability problems for new technological processes. Yet the implicit expectation is that each new technological wonder must perform safely and reliably. Thus, the success of future systems depends on how well they can overcome errors and function in the presence of perturbations. Established solutions from high-reliability (space and military) applications are available, but their high cost may drastically limit their applicability to commercial products.
Multiple approaches are possible: process improvements, hardened cells, circuit- and system-level error mitigation techniques, etc. Most of these solutions come at some added cost, so a compromise must be found between error handling capability and cost overhead.
Memory devices are almost always the first circuits to be implemented in a new process. Their highly regular structure makes them perfect candidates and a highly effective benchmark for any technological evolution. In addition, established error mitigation techniques such as Error Detecting and Correcting (EDAC) codes are well known to reduce the effects of errors. Thus, protecting memory blocks is an effective and affordable way to improve the reliability of a circuit. However, newer processes or implementation schemes may be sensitive to multi-cell and multi-bit upsets, reducing the efficiency of SECDED (Single Error Correcting, Double Error Detecting) codes.
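To make the SECDED mechanism concrete, the sketch below implements a classic extended Hamming (8,4) code: an overall parity bit added to a Hamming(7,4) codeword lets the decoder correct any single-bit upset and detect (but not correct) any double-bit upset. This is an illustrative toy, not IROC's implementation; production memories typically use wider codes such as (72,64).

```python
def encode(data4):
    """Encode 4 data bits (list of 0/1) into an 8-bit SECDED codeword."""
    d = data4
    c = [0] * 8                      # c[1..7]: Hamming(7,4); c[0]: overall parity
    c[3], c[5], c[6], c[7] = d[0], d[1], d[2], d[3]
    c[1] = c[3] ^ c[5] ^ c[7]        # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]        # parity over positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]        # parity over positions with bit 2 set
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]  # overall parity
    return c

def decode(c):
    """Return (status, data); status is 'ok', 'corrected' or 'double_error'."""
    s = ((c[1] ^ c[3] ^ c[5] ^ c[7])
         | ((c[2] ^ c[3] ^ c[6] ^ c[7]) << 1)
         | ((c[4] ^ c[5] ^ c[6] ^ c[7]) << 2))   # syndrome = error position
    p = c[0] ^ c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    if s == 0 and p == 0:
        return 'ok', [c[3], c[5], c[6], c[7]]
    if p == 1:                       # odd number of flips -> single error
        c = c[:]                     # copy so the caller's word is untouched
        if s != 0:                   # s == 0 means the parity bit itself flipped
            c[s] ^= 1
        return 'corrected', [c[3], c[5], c[6], c[7]]
    return 'double_error', None      # even parity but nonzero syndrome
```

A multi-bit upset hitting two cells of the same codeword lands in the `double_error` branch: the data is lost, which is exactly why physical interleaving (spreading a codeword across distant cells) is usually combined with SECDED.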
Logic networks have a much more complex internal structure, which allows faults to manifest in very diverse ways with varying levels of criticality. Protecting standard logic circuitry is indeed a difficult task and may introduce significant area, speed and power overheads. To determine the best compromise between protection level and cost, the first task to accomplish is estimating the sensitivity of the circuit to the various types of faults and errors. Once Error Rate (ER) figures are associated with the various blocks of the circuit, informed decisions can be made about whether to protect them, and about which mitigation techniques are best adapted to the specifics of the circuit.
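As a rough illustration of how per-block ER figures might be derived from a fault-injection campaign, the sketch below computes a point estimate and a normal-approximation 95% confidence interval for each block. The block names and counts are purely hypothetical, and the statistics are a textbook simplification rather than any specific tool's method.

```python
import math

def error_rate(failures, injections, z=1.96):
    """Point estimate and approximate 95% confidence interval for an
    error rate observed in a fault-injection campaign (normal approx.)."""
    p = failures / injections
    half = z * math.sqrt(p * (1.0 - p) / injections)
    return p, max(0.0, p - half), min(1.0, p + half)

# Hypothetical per-block results: (block name, failing runs, injected faults)
campaign = [("alu", 48, 10_000), ("ctrl", 310, 10_000), ("mem_if", 12, 10_000)]
for block, f, n in campaign:
    p, lo, hi = error_rate(f, n)
    print(f"{block:7s} ER = {p:.4f}  (95% CI [{lo:.4f}, {hi:.4f}])")
```

The width of the interval makes the cost of the analysis explicit: halving the uncertainty on a block's ER requires roughly four times as many injections, which is one reason the effort is focused on the most critical blocks first.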
Moreover, many faults and errors are physical phenomena, strongly dependent on the technological process and on the design implementation. At the other end of the scale, any fault has the potential to cause system-wide consequences. Thus, any fault analysis approach requires a multitude of competencies. The reliability engineer has to interact with all the actors in the design flow, from the technology/library provider to the system architect, while taking into account the reliability targets required by the final application. They have to work with the design files available at the various design stages and use EDA tools, both standard and reliability-dedicated ones. Any suggestions for improving circuit performance should be provided to the design team at the earliest appropriate stage.
IROC has developed an overall reliability analysis and management flow that closely follows and integrates with the standard design flow. The adopted methodology aims at providing relevant and usable functional safety figures for the various features of the circuit (cells, instances, blocks, etc.) while minimizing the effort and time spent on the analysis. The results of the analysis flow should be used to direct implementation, design and functional choices, allowing the reliability engineer to provide a valuable service to all the actors in the design flow.
In high-reliability and safety-critical applications, RT-level fault-injection simulations are often performed to demonstrate the level of fault detection coverage necessary for compliance with standards such as ISO 26262.
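As a simplified illustration of such a coverage figure, the sketch below grades a hypothetical campaign into detected, safe, and dangerous-undetected faults. It uses one common simplification (safe faults are excluded from the denominator) and is not a substitute for the full ISO 26262 hardware metrics (SPFM/LFM); all counts are invented for the example.

```python
# Hypothetical fault-injection campaign results (illustrative only).
injected = 5_000
detected = 4_830                 # faults flagged by a safety mechanism
safe = 120                       # faults with no effect on the safety goal
dangerous_undetected = injected - detected - safe

# Simplified diagnostic coverage: fraction of dangerous faults that the
# safety mechanisms detect (safe faults excluded from the denominator).
dc = detected / (detected + dangerous_undetected)
print(f"diagnostic coverage = {dc:.1%}")
```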
Many techniques are available for accelerating these simulations, including emulation platforms; however, in most cases, classifying the failing scenarios remains a manual task and is often the limiting factor in the number of fault injections that can be performed.
IROC has developed an approach that extends the components and features of a UVM functional verification environment to record additional information about the types of errors that have occurred. This information can be used to classify failing tests by their system-level impact (e.g. Silent Data Corruption, Detected Uncorrected Error, etc.).
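A UVM environment itself is written in SystemVerilog, but the grading logic that such an extended scoreboard applies can be sketched in a language-agnostic way. The Python below shows one plausible mapping from two recorded observations (data comparison against a fault-free reference run, and whether any error flag was raised) to the failure classes named above; the class names, fields, and decision table are illustrative assumptions, not IROC's actual implementation.

```python
from dataclasses import dataclass

# Failure classes commonly used when grading fault-injection outcomes.
SDC = "Silent Data Corruption"       # wrong result, no error reported
DUE = "Detected Uncorrected Error"   # error reported, result not recovered
DCE = "Detected Corrected Error"     # error reported, result recovered
BENIGN = "Benign"                    # fault had no observable effect

@dataclass
class TestOutcome:
    data_matches_golden: bool  # scoreboard comparison vs. a fault-free run
    error_flag_raised: bool    # e.g. an ECC/parity alarm seen by a monitor

def classify(o: TestOutcome) -> str:
    """Map recorded observations to a system-level failure class."""
    if o.data_matches_golden:
        return DCE if o.error_flag_raised else BENIGN
    return DUE if o.error_flag_raised else SDC

# Grading a batch of runs, as an extended scoreboard might at end-of-test:
outcomes = [TestOutcome(False, False), TestOutcome(False, True),
            TestOutcome(True, True), TestOutcome(True, False)]
print([classify(o) for o in outcomes])
```

Because the classification is computed from data already captured by the verification environment, it removes the manual-triage bottleneck noted above and lets many more injections be graded automatically.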
The outcomes of the analysis help the reliability engineer choose the optimal error handling methodology in order to meet harsh reliability constraints, ensure adequate data protection, respect pre-defined system uptime constraints and provide support and maintenance to the final user throughout the lifetime of the product. Moreover, the methodology provides early results that can be used to improve circuit resilience through architectural and design choices, with the firm goal of improving the customer experience with high-availability, high-dependability products.