Your Functional Reliability Partner – IROC Technologies

IPs for Fault-Free, High-Performance Operation

Double Sampling Approaches for High Performance Designs

Aggressive technology scaling has a dramatic impact on: (1) process, voltage, and temperature (PVT) variations; (2) circuit aging and wearout induced by failure mechanisms such as NBTI and HCI; and (3) sensitivity to EMI (e.g. cross-talk and ground bounce). The resulting high defect levels, heterogeneous behavior of identical circuit nodes, circuit degradation over time, and integrated-circuit complexity adversely affect fabrication yield and reliability.

Sensors that monitor the electrical characteristics of transistors (such as ION sensors), as well as replica-based canary circuits that mimic the critical-path delays of the functional circuit [1-4], can be used to detect circuit degradation induced by aging or timing degradation induced by PVT variations, and to trigger the regulation of circuit operating parameters (such as clock frequency, voltage, or body bias) in response. However, these approaches cannot address certain failure mechanisms, such as single event effects and EMI. Furthermore, they monitor dedicated test structures distributed over the die, which are not part of the operating circuit. Because aging-induced performance degradation depends on various design and operating parameters, the test structures may age differently from the transistors of the operating circuit. Random process variations may also affect test structures and operating circuits differently, and the same holds for voltage and temperature variations. Consequently, monitoring the electrical parameters of dedicated test structures may produce false positives (the monitors indicate degradation that the operating circuit does not actually exhibit) or false negatives (degradation has reached the failure threshold but the monitors do not detect it). Monitoring the operating circuit itself is therefore more appropriate.

Monitoring the electrical characteristics of every individual transistor of a circuit would incur very large area and power penalties. The alternative is to check the impact of the failure mechanisms on the operation of the functional circuit concurrently with the execution of the application (concurrent error detection). Traditionally, this is done with the DMR (dual modular redundancy) scheme, which duplicates the operating circuit and compares the outputs of the two copies. However, the area and power penalties exceed 100% and are unacceptable for the large majority of applications. Furthermore, DMR relies on the assumption that only one circuit copy is faulty, which may not hold for failures induced by variability and aging mechanisms. In particular, aging will have a similar impact on two identical circuits performing identical operations.
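The DMR limitation described above can be illustrated with a small behavioural sketch in Python. It is not taken from any IROC product: the functions `golden` and `stuck_bit0` and the fault they model are hypothetical, chosen only to show that a comparator flags a fault present in one copy but stays silent when both copies degrade in the same way.

```python
from typing import Callable, Tuple


def dmr_check(copy_a: Callable[[int], int],
              copy_b: Callable[[int], int],
              x: int) -> Tuple[int, bool]:
    """Run both circuit copies on the same input and compare their outputs.

    Returns the output of copy A and an error flag that is True on mismatch.
    """
    out_a = copy_a(x)
    out_b = copy_b(x)
    return out_a, out_a != out_b


def golden(x: int) -> int:
    # Hypothetical fault-free 8-bit function standing in for the circuit.
    return (x * 3 + 1) & 0xFF


def stuck_bit0(x: int) -> int:
    # Hypothetical faulty copy: output bit 0 is forced to 1.
    return golden(x) | 0x01


# A fault in only one copy is detected (for an input that excites it).
print(dmr_check(golden, stuck_bit0, 1))      # (4, True)

# The same fault in both copies produces matching outputs and goes unnoticed.
print(dmr_check(stuck_bit0, stuck_bit0, 1))  # (5, False)
```

The second call is the common-mode case: both copies are wrong in the same way, so the comparison passes even though the result is erroneous.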

An important observation is that all of these effects induce timing faults: (1) process, voltage, and temperature (PVT) variations; (2) circuit aging and wearout induced by failure mechanisms such as NBTI and HCI; and (3) EMI (e.g. cross-talk and ground bounce).

IROC has developed a scheme to cope with timing faults. It is based on double sampling architectures that detect the errors induced by timing faults and activate an error recovery process (re-execution of the latest operation at half clock frequency) to correct them. The system can therefore be operated under error-prone conditions to satisfy various application constraints:

  • Operating in hostile environments.
  • Operating at aggressively higher speeds than those allowed by the actual circuit delays.
  • Operating at aggressively low voltage levels to achieve aggressive power reduction.

Furthermore, the activation frequency of the error detection signal can be used to monitor circuit degradation induced by process variations and aging, and to adapt the circuit's operating parameters (clock frequency and voltage) accordingly.
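As a rough illustration of the double sampling principle described above, the following Python sketch models a single pipeline stage at a purely behavioural level. It is not IROC's implementation: the class `DoubleSampledStage`, the `window_ns` parameter (the assumed delay between the two samples), and the `adapt_period` policy with its threshold and step values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class DoubleSampledStage:
    """Behavioural model of one pipeline stage (all names are hypothetical)."""
    period_ns: float   # nominal clock period
    window_ns: float   # assumed delay between the first and the second sample

    @staticmethod
    def _sample(old: int, new: int, path_delay_ns: float, at_ns: float) -> int:
        # A sampler captures the new value only if the logic settled in time;
        # otherwise it still sees the previous (stale) value.
        return new if path_delay_ns <= at_ns else old

    def execute(self, old: int, new: int, path_delay_ns: float) -> Tuple[int, bool]:
        first = self._sample(old, new, path_delay_ns, self.period_ns)
        second = self._sample(old, new, path_delay_ns,
                              self.period_ns + self.window_ns)
        error = first != second            # mismatch reveals a timing fault
        if error:
            # Recovery: re-execute at half clock frequency (doubled period).
            first = self._sample(old, new, path_delay_ns, 2 * self.period_ns)
        return first, error


def adapt_period(period_ns: float, error_flags: Sequence[bool],
                 max_rate: float = 0.01, step: float = 1.05) -> float:
    """Hypothetical policy: lengthen the period when errors become too frequent."""
    if not error_flags:
        return period_ns
    rate = sum(error_flags) / len(error_flags)
    return period_ns * step if rate > max_rate else period_ns


stage = DoubleSampledStage(period_ns=1.0, window_ns=0.3)
print(stage.execute(old=0, new=1, path_delay_ns=0.8))  # (1, False)
print(stage.execute(old=0, new=1, path_delay_ns=1.2))  # (1, True)
```

In this simplified model, a path delay longer than `period_ns + window_ns` makes both samples stale, so the fault escapes detection; the width of the detection window is therefore a design parameter.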

Advanced Error Correction for Memories

Error Detection and Correction (EDC) for electronic memory devices is a proven, industry-standard technology. In particular, the Single Error Correcting, Double Error Detecting (SECDED) variant based on standard Hamming codes is widely used and has a considerable positive impact on memory reliability. An ECC-protected memory is able to detect and correct single-bit errors. This property is particularly valuable for correcting data errors induced by ionizing particles (soft errors), since the majority of such errors are Single Bit Upsets (SBUs). Multiple Cell Upsets (MCUs) are also possible and are expected to become more frequent as process technologies advance. However, address and data coding techniques (such as scrambling) can alleviate the problem by placing the bits of the same word in physically distant cells. This way, an MCU manifests as single erroneous bits in several words stored at different addresses, a situation easily handled by standard SECDED codes.
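The behaviour described above can be sketched with a textbook extended Hamming code in Python. This is a generic illustration, not IROC's implementation: the function names, the status strings, and the simple column interleaving used to model scrambling (physical cell `c` holds bit `c // W` of word `c % W`) are assumptions made for this example.

```python
from typing import List, Tuple


def check_bits_needed(k: int) -> int:
    """Smallest r such that 2**r >= k + r + 1 (Hamming bound for k data bits)."""
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    return r


def encode(data: List[int]) -> List[int]:
    """Index 0 holds the overall parity bit, indices 1..n the Hamming codeword."""
    n = len(data) + check_bits_needed(len(data))
    code = [0] * (n + 1)
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = next(bits)
    p = 1
    while p <= n:                        # check bits sit at power-of-two positions
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
        p <<= 1
    code[0] = sum(code) % 2              # makes the total parity even
    return code


def decode(code: List[int]) -> Tuple[List[int], str]:
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions holding a 1
    odd_parity = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome <= n:
        fixed[syndrome] ^= 1             # syndrome 0 means the overall parity bit
        status = "corrected"
    elif odd_parity:
        status = "uncorrectable"         # syndrome points outside the codeword
    else:
        status = "double_error"          # detected but not correctable
    data = [fixed[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return data, status


# Four 4-bit words interleaved in one physical row of W-word granularity.
W = 4
words = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
codes = [encode(w) for w in words]
for cell in (13, 14, 15):                # an MCU hitting three adjacent cells
    codes[cell % W][cell // W] ^= 1
results = [decode(c) for c in codes]
print([status for _, status in results])
# ['ok', 'corrected', 'corrected', 'corrected']
```

Because the three adjacent cells belong to three different words, each word sees only a single-bit error and all stored data are recovered. Two flips inside the same word would instead be reported as `double_error`.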

Traditional ECC requires additional time to compute the redundant check bits during write operations and to verify the output data during reads. In addition, while its cost is low for memories with wide data buses, it incurs a large area overhead for small word sizes (75% for 4-bit memories, 50% for 8-bit, 31.25% for 16-bit) and for memories with maskable operations (200%). This is particularly penalizing for large ASICs/SoCs, which may contain hundreds of memory instances with widely different address and data parameters. It is very challenging to protect these memories in a cost-efficient manner. These circuits also require fast memory blocks, since the embedded memory instances sit very close to fast logic blocks. Any additional delay introduced by the error mitigation circuits will negatively affect the global performance of the circuit.
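The word-size overheads quoted above match the check-bit count of a single-error-correcting Hamming code, where the number of check bits r is the smallest value satisfying 2^r >= k + r + 1 for k data bits. The short Python sketch below reproduces these figures under that assumption; the function names are illustrative, and a SECDED code would add one further overall-parity bit per word.

```python
def hamming_check_bits(k: int) -> int:
    """Smallest r with 2**r >= k + r + 1, for k data bits."""
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    return r


def overhead_percent(k: int) -> float:
    """Check bits as a percentage of data bits."""
    return 100.0 * hamming_check_bits(k) / k


for k in (4, 8, 16, 32, 64):
    print(f"{k:>2} data bits: {hamming_check_bits(k)} check bits, "
          f"overhead {overhead_percent(k):.2f}%")
#  4 data bits: 3 check bits, overhead 75.00%
#  8 data bits: 4 check bits, overhead 50.00%
# 16 data bits: 5 check bits, overhead 31.25%
# 32 data bits: 6 check bits, overhead 18.75%
# 64 data bits: 7 check bits, overhead 10.94%
```

The relative cost falls quickly as the word widens, which is why narrow and bit-maskable memories are the expensive cases.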

Ensuring minimal overheads for memory blocks used with 8-bit or 16-bit CPUs (either discrete circuits or embedded blocks) is also challenging. Even more difficult is optimally protecting maskable memories, where the CPU can individually access single bits, nibbles, or bytes of a wide 32-bit or 64-bit word.

To help solve these issues, IROC has developed a variety of solutions allowing for:

  • at-speed write operations and improved read timing
  • at-speed read operations with a decontamination procedure for on-demand error correction
  • lower-cost solutions adapted to memory with small word sizes and mask features
  • transparent latch-up management without power cycling, with error correction following the event