Tech enthusiasts come across the concept of ECC (Error Correction Code) from time to time. To be clear from the start, this technology is generally used in servers and workstations, that is, in the enterprise segment. Error correcting codes are designed to automatically detect and correct errors that may occur in RAM chips.
Electrical or magnetic interference and cosmic rays can corrupt data in memory. The purpose of ECC is to correct the corrupted data, and to report the error to the system if it cannot be corrected. On-die ECC (ODECC), which came to the fore with DDR5, has caused much discussion and confusion among consumers. First of all, it should be noted that this technology is very different from standard ECC. We will briefly cover ECC first, then discuss how ODECC (ECC on the chip die) differs.
What is ECC RAM?
An error correction code is a mathematical scheme for verifying that the data stored in memory is correct. ECC also allows the system to regenerate the correct data in real time in the event of an error.
ECC builds on parity, a simpler method that uses a single bit (the parity bit) to detect errors in a group of data bits, such as the eight bits of a byte in RAM. Unfortunately, while a parity bit allows the system to detect an error, it does not provide enough information to correct it.
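The limits of a single parity bit are easy to see in a few lines of Python. This is a minimal sketch of even parity over one byte; the function names `parity_bit` and `check` are made up for illustration and do not come from any real memory controller.

```python
def parity_bit(byte: int) -> int:
    """Return the even-parity bit for an 8-bit value (1 if the count of 1 bits is odd)."""
    return bin(byte & 0xFF).count("1") % 2


def check(byte: int, stored_parity: int) -> bool:
    """True if the byte still matches the parity bit that was stored with it."""
    return parity_bit(byte) == stored_parity


data = 0b10110010               # four 1 bits, so the parity bit is 0
p = parity_bit(data)

one_flip = data ^ 0b00001000    # a single bit flipped in memory
two_flips = data ^ 0b00011000   # two bits flipped in memory

print(check(data, p))       # intact byte passes the check
print(check(one_flip, p))   # single-bit error fails the check (detected)
print(check(two_flips, p))  # double-bit error passes the check (missed)
```

Parity can say that something is wrong, but not which bit is wrong, and an even number of flips cancels out entirely.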
Most systems move data in larger chunks of 64 bits. Instead of generating one parity bit for every eight bits of data, ECC stores extra check bits alongside each 64 bits of data: seven check bits are enough to locate and correct a single wrong bit, and an eighth allows double-bit errors to be detected, which is why ECC modules are 72 bits wide. The memory controller runs a mathematical algorithm over the check bits to make sure the other 64 bits are correct. If a single bit is wrong (a single-bit error), the ECC algorithm can reconstruct the data. A two-bit error cannot be corrected, only detected and reported to the system, and larger errors are not guaranteed to be caught.
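To make the idea concrete, here is a sketch of a textbook extended Hamming code over 64 data bits: seven Hamming check bits plus one overall parity bit, 72 bits in total. It illustrates single-error correction and double-error detection in general; the bit layout and the names `encode` and `decode` are assumptions for this example, and real memory controllers use their own code layouts.

```python
from typing import List, Optional, Tuple

DATA_BITS = 64
CHECK_BITS = 7                           # 2**7 >= 64 + 7 + 1
CODE_LEN = DATA_BITS + CHECK_BITS + 1    # 72; position 0 holds overall parity


def encode(data: int) -> List[int]:
    """Spread 64 data bits over positions that are not powers of two,
    then fill in the Hamming check bits and the overall parity bit."""
    code = [0] * CODE_LEN
    bit = 0
    for pos in range(1, CODE_LEN):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = (data >> bit) & 1
            bit += 1
    for i in range(CHECK_BITS):
        p = 1 << i                       # check bit p covers positions with bit i set
        code[p] = sum(code[pos] for pos in range(1, CODE_LEN) if pos & p) % 2
    code[0] = sum(code) % 2              # makes the parity of the whole word even
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None if the error cannot be corrected."""
    syndrome = 0
    for pos in range(1, CODE_LEN):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all 1 bits
    overall = sum(code) % 2
    code = code.copy()
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < CODE_LEN:
        code[syndrome] ^= 1              # single-bit error: syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None     # e.g. two flipped bits
    data, bit = 0, 0
    for pos in range(1, CODE_LEN):
        if pos & (pos - 1):
            data |= code[pos] << bit
            bit += 1
    return status, data


word = 0xDEADBEEFCAFEF00D
stored = encode(word)

single = stored.copy()
single[37] ^= 1                          # one bit flips in memory

double = stored.copy()
double[5] ^= 1                           # two bits flip in memory
double[60] ^= 1

print(decode(stored)[0])                 # clean word
print(decode(single)[0], decode(single)[1] == word)
print(decode(double)[0])                 # detected, but not correctable
```

The single flipped bit is located by the syndrome and repaired, while the double flip leaves the overall parity even with a non-zero syndrome, which the decoder can only report.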
Brief History of ECC
For many years, Intel offered ECC support only on Xeon processors, treating it as a feature exclusive to the professional segment. AMD changed that by adding ECC support to its Ryzen processors. Even so, ECC memory has remained more expensive, and finding suitable ECC RAM has been a problem in its own right. That changes with the DDR5 standard: a form of ECC, on-die ECC, is now a regular part of every DDR5 chip.
Modern processors use ECC (or some other form of error checking) internally to verify the consistency of data in caches and other components. However, without ECC RAM, the system has no way to verify data held in RAM or data moving between the CPU and RAM.
The Importance of ECC
The operating system checks memory consistency to some extent. This process is slow and not entirely reliable. As a result, the operating system cannot detect every problem with data stored in RAM. In other words, it cannot guarantee 100% that operations are carried out on the correct data or that data is written to the correct file.
In daily use, this is not so important. For example, an invalid character in a Word document does not cause major problems. In banking transactions, however, every step is critical.
Windows usually shows a blue screen error when it detects a data inconsistency. As we said, though, the operating system’s checks are not entirely reliable.
What’s the Difference Between ECC and On-Die ECC (ODECC)?
Unlike standard ECC, ODECC primarily aims to improve yields on advanced manufacturing processes so that DRAM chips can be produced more cheaply. On-die ECC only handles errors that occur in a cell or row inside the chip, for example during refreshes. When data is moved from the chip to the cache or CPU, a bit flip or other corruption in transit is not corrected by on-die ECC. Standard ECC, on the other hand, can correct corruption both inside the cell and while the data is moving to another device.
DDR5 splits the memory module into two independent 32-bit addressable subchannels to increase efficiency and reduce data access latency for the memory controller. The data width of a DDR5 module is still 64 bits, the same as before. However, splitting this bus into two 32-bit addressable subchannels improves overall performance. Server-grade memory (RDIMMs) adds 8 bits per subchannel for ECC support, giving a bus of 40 bits per subchannel, or 80 bits per rank. Dual-rank modules repeat this arrangement on each rank.
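The bus widths quoted above follow from simple arithmetic, sketched here in Python. The constant and function names are made up for illustration; the numbers are the ones given in the text.

```python
SUBCHANNELS = 2                  # independent subchannels per DDR5 module
DATA_BITS_PER_SUBCHANNEL = 32
ECC_BITS_PER_SUBCHANNEL = 8      # extra bits on ECC modules, as quoted above


def module_width(with_ecc: bool) -> int:
    """Total bus width in bits across both subchannels."""
    per_subchannel = DATA_BITS_PER_SUBCHANNEL
    if with_ecc:
        per_subchannel += ECC_BITS_PER_SUBCHANNEL
    return SUBCHANNELS * per_subchannel


print(module_width(with_ecc=False))   # 64
print(module_width(with_ecc=True))    # 80

# Extra bits carried for ECC, relative to the data bits
overhead = ECC_BITS_PER_SUBCHANNEL / DATA_BITS_PER_SUBCHANNEL
print(f"{overhead:.0%}")              # 25%
```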
On-die ECC is a new feature designed to correct bit errors inside the DRAM chip. As with CPUs and GPUs, the manufacturing technologies used to produce RAM are also evolving. As the density of DRAM chips increases with new lithography techniques, so does the potential for cells to leak charge and flip bits. ECC integrated into DDR5 chips corrects these on-chip errors, improving reliability and reducing defect rates.
This technology cannot correct off-chip errors or errors on the bus between the module and the memory controller inside the CPU. ECC-enabled processors used in servers and workstations have coding features that can correct single-bit or multi-bit errors on the fly.
To repeat, DDR5’s on-die ECC does not correct errors on the DDR channel. In other words, businesses will continue to use standard sideband ECC alongside DDR5’s ODECC. Long story short, the scope of on-die ECC (ODECC) is much narrower.