Chipkill

Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology that protects memory systems from single memory chip failures and multi-bit errors from any portion of a single memory chip.^[1]^[2]

One simple scheme to perform this function scatters the bits of an ECC word across multiple memory chips, so that the failure of any single memory (SDRAM) chip affects only one ECC bit per word. With the typical 72-bit SECDED (single-error-correcting, double-error-detecting) Hamming code, this means placing each of the 72 bits of a word on its own memory chip. This is easily achieved with four ranks of standard x4 ECC DIMM, as each rank has 18 chips (4 × 18 = 72). With wider chips or fewer ranks, longer or shorter words must be used, either to increase the number of correctable bits per word or to maintain the one-bit-per-chip scatter.^[3]
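The one-bit-per-chip scatter can be illustrated with a minimal Python sketch. It assumes a textbook extended Hamming (72,64) code and models each of 72 chips as holding exactly one bit of the codeword. The names `encode`, `decode`, and `fail_chip` are hypothetical; real memory controllers do this in hardware and may use different codes.

```python
# Sketch only: textbook extended Hamming (72,64) SECDED, one codeword bit per chip.
# Index 0 holds the overall parity bit; indices 1..71 form a Hamming(71,64) word.
N_CHIPS = 72                               # assumed: 4 ranks x 18 x4 chips
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]      # Hamming parity bits sit at powers of two
DATA_POS = [p for p in range(1, N_CHIPS) if p not in PARITY_POS]  # 64 data positions


def encode(data):
    """Encode a 64-bit integer into a list of 72 bits (bit i lives on chip i)."""
    word = [0] * N_CHIPS
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Parity bit p covers every position whose index has bit p set.
        word[p] = sum(word[i] for i in range(1, N_CHIPS) if i & p and i != p) % 2
    word[0] = sum(word[1:]) % 2            # overall parity enables double-error detection
    return word


def decode(word):
    """Return (status, data); status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for i in range(1, N_CHIPS):
        if word[i]:
            syndrome ^= i                  # XOR of set-bit indices is 0 for a valid word
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < N_CHIPS:
        word[syndrome] ^= 1                # single-bit error: syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None       # even number of flips with nonzero syndrome
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= word[pos] << i
    return status, data


def fail_chip(word, chip):
    """Model a failed chip by inverting the single bit it contributes to the word."""
    damaged = list(word)
    damaged[chip] ^= 1
    return damaged
```

Because each chip contributes only one bit, `decode(fail_chip(encode(d), k))` recovers `d` for any single chip `k`. If one chip instead held two or more bits of the same word, its failure could flip several bits at once, which this code can at best detect and not correct.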

Chipkill is frequently combined with dynamic bit-steering, so that if a chip fails (or exceeds a threshold of bit errors), a spare memory chip takes its place. The concept is similar to that of RAID, which protects against disk failure, except that here it is applied to individual memory chips. The technology was developed by the IBM Corporation in the early and mid-1990s.
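The threshold-and-spare behaviour of bit-steering can be sketched as a small bookkeeping model. This is an illustration under stated assumptions, not a description of IBM's implementation: the class `BitSteering`, its method names, the single spare chip, and the choice of per-chip counters are all hypothetical.

```python
class BitSteering:
    """Toy model: count corrected errors per chip and steer one chip to a spare."""

    def __init__(self, n_chips, threshold):
        self.n_chips = n_chips
        self.threshold = threshold          # corrected errors tolerated before sparing
        self.error_counts = [0] * n_chips
        self.steered_chip = None            # logical chip currently served by the spare

    def record_corrected_error(self, chip):
        """Log one corrected error; return True if this call triggered steering."""
        self.error_counts[chip] += 1
        over_limit = self.error_counts[chip] >= self.threshold
        if over_limit and self.steered_chip is None:
            self.steered_chip = chip        # assumed: only one spare is available
            return True
        return False

    def physical_chip(self, chip):
        """Map a logical chip index to the physical chip that now holds its bits."""
        # The spare is modelled as physical index n_chips, one past the regular chips.
        return self.n_chips if chip == self.steered_chip else chip
```

In this model, once the spare has been used, further chips that cross the threshold are not steered and rely on the ECC alone.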

Although "Chipkill" remains a valid trademark, "chipkill correct" has become the standard term for equivalent scattering schemes. An important reliability, availability, and serviceability (RAS) feature, chipkill correct is deployed primarily on SSDs, mainframes, and midrange servers.

Equivalent, derived, and similar systems

An equivalent system from Sun Microsystems is called Extended ECC, while equivalent systems from HP are called Advanced ECC and Chipspare.^[4]

Intel has two similar systems:

Single-device data correction (SxEC-DxED, where x is 4 or 8, the width of a single DRAM chip). In S4EC-D4ED, 36-bit SECDED words are used, achieving one bit per chip across 36 memory chips.^[5]
Lockstep memory provides double-device data correction (DDDC) functionality, where the chips across two memory modules (sticks) are pooled together to scatter the bits. The downside is that the channels now work in lockstep, causing higher latency.^[6]

Similar systems from Micron, called redundant array of independent NAND (RAIN), and from SandForce, called RAISE level 2, protect data stored on SSDs from any single NAND flash chip failure.^[7]^[8]

Evaluation

A 2009 paper using data from Google's data centers^[9] found that, in the systems observed, DRAM errors tended to recur at the same location and that 8% of DIMMs were affected each year. Specifically, "In more than 85% of the cases a correctable error is followed by at least one more correctable error in the same month." A smaller fraction of DIMMs with Chipkill error correction reported uncorrectable errors than of DIMMs whose error-correcting codes can correct only single-bit errors. A 2010 paper from the University of Rochester, using both real-world memory traces and simulations, also showed that Chipkill memory resulted in substantially fewer memory errors.^[10]

References

[1] Timothy J. Dell (1997-11-19). "A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory" (PDF). IBM. Archived from the original (PDF) on 2015-09-23. Retrieved 2015-02-02.
[2] "Enhancing IBM Netfinity Server Reliability: IBM Chipkill Memory" (PDF). IBM. 2000. Archived from the original (PDF) on 2015-09-23. Retrieved 2015-02-02.
[3] Locklear, David (2000). "CHIPKILL CORRECT MEMORY ARCHITECTURE" (PDF). www.ece.umd.edu.
[4] "Best Practice Guidelines for ProLiant Servers with the Intel Xeon 5500 processor series Engineering Whitepaper, 1st Edition" (PDF). HP. May 2009. p. 8. Retrieved 2014-09-09.
[5] "Intel® E7500 Chipset MCH Intel® x4 Single Device Data Correction (x4 SDDC) Implementation and Validation" (PDF). 2002.
[6] Thomas Willhalm (2014-07-11). "Independent Channel vs. Lockstep Mode – Drive your Memory Faster or Safer". Intel. Retrieved 2015-02-02.
[7] Lee Hutchinson. "Solid-state revolution: in-depth on how SSDs really work". 2012.
[8] Eric Slack. "How to Make Reliable SSDs - Reliable NAND Flash".
[9] Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich (2009). "DRAM errors in the wild: A large-scale field study" (PDF). Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems. SIGMETRICS '09. ACM. pp. 193–204. doi:10.1145/1555349.1555372. ISBN 9781605585116. S2CID 6115552. Retrieved 7 September 2011.
[10] Li, Xin; Huang, Michael; Shen, Kai; Lingkun, Chu (2010). "A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility" (PDF). Usenix Annual Tech Conference 2010.

External links

Intel E7500 Chipset MCH Intel x4 Single Device Data Correction (x4 SDDC) Implementation and Validation, Intel Application note AP-726, August 2002.
DRAM study turns assumptions about errors upside down, Ars Technica, October 7, 2009
Enabling Memory Reliability, Availability, and Serviceability Features on Dell PowerEdge Servers, 2005
Chipkill correct memory architecture, August 2000, by David Locklear
The Mathematics of Chipkill ECC, October 2015, by Bob Day
