<img src = "https://cdn.arstechnica.net/wp-content/uploads/2021/01/linus-eff-you-ram-800×454.png" alt = "We've enjoyed a friendlier Linus Torvalds over the past few years … but that doesn't mean he stopped having opinions."/> We've enjoyed a friendlier Linus Torvalds over the past few years … but that doesn't mean he stopped having opinions.
This Monday, Linux kernel developer Linus Torvalds expressed frustration with the lack of ECC (Error Correcting Code) RAM in consumer PCs and laptops:
… the misguided and backwards "consumers don't need ECC" policy has made the market for ECC memory disappear.
The arguments against ECC were always complete and utter garbage. Now even the memory manufacturers are starting to do ECC internally because they finally owned up to the fact that they absolutely have to.
If you're unfamiliar with ECC RAM, it's probably because you don't build or spec dedicated servers with server CPUs and motherboards. Sadly, that's really the only place you'll find ECC. In short, ECC RAM includes a small amount of extra memory that is used to detect and correct errors.
Memory failure and probability
In most modern implementations, that extra memory works out to eight check bits for every 64 bits of data stored in RAM. A single-bit error (a 0 flipped to 1, or a 1 flipped to 0) can be detected and corrected automatically. Two bits flipped in the same word can be detected, but not corrected. Three or more bits flipped in the same word are likely to be detected, but detection is not guaranteed.
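The eight-check-bit scheme described above is usually called SECDED (single-error correction, double-error detection). Below is a minimal Python sketch of one such code, assuming a textbook extended Hamming layout: seven Hamming check bits at positions 1, 2, 4, 8, 16, 32 and 64, plus one overall parity bit at position 0, for 72 bits in total. Real memory controllers may use other SECDED codes (Hsiao codes, for example) with different bit layouts, and `encode`, `decode` and `hamming_syndrome` are hypothetical helper names chosen for illustration.

```python
# Minimal (72,64) SECDED sketch: an extended Hamming code, chosen for
# illustration only. Real memory controllers may use different codes/layouts.
from typing import Optional, Tuple

DATA_BITS = 64
CODE_BITS = 72  # codeword positions 0..71; position 0 is the overall parity bit
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]  # the seven Hamming check bits

# Data bits fill every position in 1..71 that is not a power of two (64 slots).
DATA_POSITIONS = [p for p in range(1, CODE_BITS) if p & (p - 1) != 0]
assert len(DATA_POSITIONS) == DATA_BITS


def hamming_syndrome(word: int) -> int:
    """XOR together the positions (1..71) of every set bit in the codeword."""
    syndrome = 0
    for pos in range(1, CODE_BITS):
        if (word >> pos) & 1:
            syndrome ^= pos
    return syndrome


def encode(data: int) -> int:
    """Spread 64 data bits into a 72-bit codeword and fill in the 8 check bits."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    # Set Hamming check bits so the syndrome of the finished codeword is zero.
    syndrome = hamming_syndrome(word)
    for p in PARITY_POSITIONS:
        if syndrome & p:
            word |= 1 << p
    # Overall parity bit: make the total number of 1 bits even.
    if bin(word).count("1") % 2:
        word |= 1
    return word


def decode(word: int) -> Tuple[Optional[int], str]:
    """Return (data, status); status is "ok", "corrected" or "uncorrectable"."""
    syndrome = hamming_syndrome(word)
    parity_odd = bin(word).count("1") % 2 == 1
    status = "ok"
    if parity_odd:
        # Odd overall parity: treat it as a single flipped bit at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        if syndrome >= CODE_BITS:
            return None, "uncorrectable"
        word ^= 1 << syndrome
        status = "corrected"
    elif syndrome != 0:
        # Even parity but a nonzero syndrome: two bits flipped. Detect, don't fix.
        return None, "uncorrectable"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status


if __name__ == "__main__":
    original = 0xDEADBEEFCAFEF00D
    codeword = encode(original)
    for label, received in [
        ("no flips", codeword),
        ("one flip", codeword ^ (1 << 37)),
        ("two flips", codeword ^ (1 << 5) ^ (1 << 40)),
    ]:
        data, status = decode(received)
        print(label, status, data == original)
    # Expected output:
    # no flips ok True
    # one flip corrected True
    # two flips uncorrectable False
```

The overall parity bit is what separates a correctable single flip (odd parity) from a detectable double flip (even parity with a nonzero syndrome). Three or more flips fall outside the code's guarantees: depending on the pattern, this sketch either reports them as uncorrectable or returns wrong data without complaint.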
Bit flips can happen for many reasons, ranging from cosmic-ray strikes to simple hardware failure. A large-scale study of Google's servers found that approximately 32 percent of all servers (and 8 percent of all DIMMs) in Google's fleet experience at least one memory error per year. The vast majority of these, however, are single-bit errors. Since Google uses server CPUs and ECC RAM, the machines in question simply keep on running.
On consumer computers without ECC, even these single-bit errors, which Google's data shows are more than 40 times as common as multi-bit errors, go undetected and can lead to system instability and data corruption.
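To make "undetected data corruption" concrete, the following minimal Python sketch simulates what a single stray bit flip does to stored values. The helpers `flip_bit_int` and `flip_bit_float` are hypothetical and only manipulate numbers in software; they do not model real DRAM. The point is that nothing flags the change: the program just carries on with the wrong value.

```python
# Simulates a single stray bit flip in stored values; no real memory is touched.
import struct


def flip_bit_int(value: int, bit: int) -> int:
    """Return `value` with one bit inverted, as a stray bit flip would leave it."""
    return value ^ (1 << bit)


def flip_bit_float(value: float, bit: int) -> float:
    """Invert one bit of a float's 64-bit IEEE 754 representation."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    # Nothing raises an error here: without ECC, the wrong value is simply used.
    print(flip_bit_int(1000, 30))    # 1073742824 (off by 2**30)
    print(flip_bit_float(1.0, 52))   # 0.5 (lowest exponent bit flipped)
    print(flip_bit_float(1.0, 62))   # inf (highest exponent bit flipped)
```

A single flipped bit can change a value by a tiny amount or by many orders of magnitude, depending only on which bit is hit.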
Bit flips aren't always random
Not every memory error is the result of a hardware fault or stray electromagnetic interference. In recent years, researchers have developed increasingly practical, physics-based side-channel attacks that use rapid, controlled bit flips in areas of RAM an application can access to infer or change data values in adjacent areas of RAM that it should not be able to reach.
Although ECC RAM can't mitigate RAMBleed-style attacks, which infer the values of neighboring memory, it can generally stop Rowhammer attacks, in which rapidly flipping bits in one area of RAM causes bits in an adjacent area to change.
Even when ECC can't prevent a Rowhammer attack from affecting the system (for example, when several bits in one word are flipped), it can at least alert the system to the problem and, in most cases, keep the attack from causing anything worse than downtime. (Most ECC systems are configured to halt the entire machine when an uncorrectable error is detected.)
Torvalds accuses Intel
And memory manufacturers claim it's about economics and lower power. And they are lying bastards – let me once again point to row-hammer about how these problems have existed for several generations already, but these fuckers happily sold broken hardware to consumers and claimed it was an "attack," when it always was "we're cutting corners."
How many times has a row-hammer-like bit flip happened just by bad luck on real non-attack loads? We will never know. Because Intel was pushing shit to consumers.
Torvalds bluntly blames the lack of ECC RAM in consumer technology on Intel's policy of artificial market segmentation. Intel has a vested interest in pushing companies with deeper pockets toward its more expensive, more profitable server CPUs rather than letting those companies make do with its necessarily lower-margin consumer parts.
Removing ECC support from CPUs that aren't aimed squarely at the server world is one of the ways Intel has heavily segmented these markets. Torvalds' argument here is that Intel's refusal to support ECC RAM in its consumer-facing parts, combined with its de facto monopoly in this area, is the real reason ECC is almost unavailable outside the server space.
The usual argument for leaving ECC out of consumer technology revolves around cost, but we suspect Torvalds is right here. Although ECC RAM has effectively become a hard-to-find specialty part, it typically costs only about 20 percent more per DIMM than retail non-ECC memory. The real problem is that without motherboards and CPUs that support it, there is no point in buying it.