Make 2022 a more stable year, use ECC Memory

Are you annoyed at those random game crashes, Blue Screens and weird behavior from your applications? Do you create documents or media you don’t want to become corrupted? Then your next upgrade should include ECC Memory.

What is ECC memory?

Error correction code (ECC) memory detects and corrects bit flips (data corruption) in the data stored or transported as part of the computer’s memory. It stores a small amount of check data alongside each larger block of data, and can recover the original value if only a single bit in that block has flipped.

The best analogy I can think of is a simple Lego structure. Say you have eight differently sized Lego blocks in a set and know the weight of the entire set. If you weigh it again and it is lighter by exactly one block’s weight, you know which block is missing. Traditional memory never weighs the set, so it never knows if something is wrong with said Lego structure.
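To make the "weigh it again" idea concrete, here is a minimal sketch using a Hamming(7,4) code, which protects 4 data bits with 3 check bits. This is a teaching-sized stand-in, not what a DIMM actually implements: real side-band ECC typically protects 64 data bits with 8 check bits, and the function names below are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits (0/1 values) into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, error_position); error_position is 0 if clean."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three checks, read as a binary number, point at the flipped position.
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    original = [1, 0, 1, 1]
    stored = hamming74_encode(original)
    stored[4] ^= 1  # simulate a single bit flip in memory
    recovered, position = hamming74_correct(stored)
    print("recovered:", recovered, "flipped position:", position)
```

A code this small can only fix one flipped bit per codeword. Two flips in the same codeword would be "corrected" to the wrong value, which is why real ECC memory adds an extra check bit to at least detect double errors.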

There are actually several different types of ECC. DDR5 will come with “on-chip” ECC, but when talking casually or buying memory with ECC support, we mean “side-band” ECC. And both “side-band” and “on-chip” ECC can work together to make an even more robust system.

Does it really matter?

Yes. Your bits are being flipped and you don’t even realize it. If you are using 16GB of memory, you’ll be experiencing around three bit flips per hour. Keep in mind that means full-on hammering that RAM, not just it sitting idle. I did some testing myself on a system by running stress-ng on 160GB of ECC RAM for an hour, and the memory corrected 24 bit flips in that time. That works out to one flip per hour for every 6.7GB of RAM.
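For anyone who wants to check the arithmetic, here is a small sketch that turns the stress test numbers above into a rate and scales it to a 16GB system. It assumes flips scale linearly with memory size and time, which is a simplification, and the function names are just for this example.

```python
def flips_per_gb_hour(flips, capacity_gb, hours):
    """Observed corrected flips per GB of RAM per hour."""
    return flips / (capacity_gb * hours)


def expected_flips(capacity_gb, hours, rate):
    """Flips expected at the given rate, assuming linear scaling."""
    return rate * capacity_gb * hours


if __name__ == "__main__":
    # Numbers from the stress-ng run: 24 corrected flips, 160GB, one hour.
    rate = flips_per_gb_hour(flips=24, capacity_gb=160, hours=1)
    print(f"GB of RAM per flip per hour: {1 / rate:.1f}")
    print(f"Flips per hour on 16GB under the same load: "
          f"{expected_flips(16, 1, rate):.1f}")
```

Scaled down this way, the stress test lands a little under the three-per-hour figure for 16GB, so the two numbers are in the same ballpark.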

Now those numbers seem really scary, but let’s take it down to a more normal use level. Checking on my NAS server, which uses 15GB of its RAM overall, it had on average one reported ECC bit flip fix every 41.4 hours this past year. Though that may be highly underestimated because of background ECC correction (memory scrubbing) not being reported to the OS.
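If you want to check your own machine, here is one possible way to do it, not necessarily how the numbers above were gathered. The sketch assumes a Linux system with an EDAC driver loaded, which exposes corrected error counters as `ce_count` files under `/sys/devices/system/edac/mc`. Those counters start over when the driver is reloaded or the machine reboots, so the uptime you divide by has to match. The function names are made up for this example.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location; only present if an EDAC driver is loaded.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_corrected_counts(edac_root=EDAC_ROOT):
    """Return {memory_controller_name: corrected_error_count}."""
    root = Path(edac_root)
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC support visible on this system
    for ce_file in sorted(root.glob("mc*/ce_count")):
        try:
            counts[ce_file.parent.name] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or malformed counters
    return counts


def hours_per_correction(total_corrections, uptime_hours):
    """Average hours between corrected errors, or None if there were none."""
    if total_corrections == 0:
        return None
    return uptime_hours / total_corrections


if __name__ == "__main__":
    counts = read_corrected_counts()
    if not counts:
        print("No EDAC counters found (no ECC, or driver not loaded).")
    for controller, count in counts.items():
        print(f"{controller}: {count} corrected errors since driver load")
```

Taking an input path instead of hard-coding the sysfs location makes it easy to point the function at a copy of the directory from another machine.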

Obviously the bit flip doesn’t make a huge difference most of the time because everyone’s systems aren’t constantly crashing. However, I guarantee someone reading this has had a crash due to an unsuspected bit flip, and never truly knew the real culprit.

What systems support it?

Sadly, not all of them. Intel has gone the route of not including it in consumer chips at all recently, and AMD lets motherboard manufacturers decide if they support it.

Chip Series       Audience  Platform   Support
AMD Zen           Consumer  AM4        Varies by motherboard (most ASRock support ECC)
AMD Threadripper  Prosumer  TR4/sTRX4  Varies by motherboard
Intel Core        Consumer  LGA 1700   No
Intel Xeon        Server    FCBGA1787  Yes

If you’re building anything for a NAS or home server you need to be extra careful to select NAS boxes that support ECC ram or already come with it. As the TrueNAS community guide (PDF) states about ECC: “If you’re going to do it, do it right.”

6 thoughts on “Make 2022 a more stable year, use ECC Memory”

  1. “on-chip” ECC in DDR5 shouldn’t count as ECC, because this is done to improve manufacturer yield, like 4kN sectors on HDDs.
    Large memory capacities are more error-prone. So w/o on-chip ECC, more memory modules would come as defective ex-factory. It cannot replace the “classic” ECC.

  2. Don’t forget that Threadripper and Zen both require unbuffered ECC (afaik), which is only made by Nemix (afaik) and last I looked was only available as large-capacity 4 and 8 stick kits for threadripper systems. For Zen, a minimum of 128GB doesn’t make sense for most people (especially when it costs $800+), assuming they don’t have one of the smaller form factor boards with only 2 memory slots which makes it worse.

    1. I own both v-Color and Nemix 2x32gb ECC for my own computers so I know for a fact they sell smaller kits.

      1. Yeah I just checked again and it looks like it’s finally available readily as dual and single sticks… what I found cost double what UDIMM non-ECC or LRDIMM / RDIMM ECC is selling for though, which is unfortunate but kinda makes sense given the tolerances need to be tighter for unregistered / unbuffered memory.

        Threadripper Pro supports RDIMM and 256GB / 512GB 8 channel kits can be had for around the same prices as equivalent non-ECC UDIMMs so it looks like that (or a 2S Epyc with only one processor installed initially) will be the direction I’ll be taking for my next workstation unless something better appears or prices change again.

        Warning: Wall of text follows.

        Xeons and older i7s / newer XE i9s can be put into a mode that mirrors the 2 memory banks, which halves the available ram but allows failure of a stick without degrading performance or shutting down the system immediately. The spec docs for my processor also seemed to indicate some kind of bank striping that would theoretically allow 2x the bandwidth of quad/hex channel but increased latency correspondingly. These need to be turned on at boot by UEFI which would require some hacking on every consumer BIOS I’ve heard of. I think mirroring is only useful with ECC on since failures can’t be corrected otherwise except on write by the CPU, and striping would lower the effectiveness of ECC but not eliminate it. There were several other oddball memory modes that can be enabled but I’d have to look them up again and they’re not very relevant if they can’t be turned on easily. I’m comfortable disassembling and changing UEFI / BIOS modules but those init ones are going to be pretty specific to one board most of the time and most people can’t or won’t do this.

        Another possibly good mention here is that given the insane scalper prices on gaming-oriented video cards and the increasing amount of compute being done on GPU, I’ve been looking at AMD’s W6800 cards which are a slightly lower clocked RX 6800, but with 32GB of ECC vram instead of 16GB. This future-proofs them quite a bit, is great for things like GPU-based rendering, and they’re always in stock and regularly go on sale for below MSRP. I’ve seen them for $2100. If you’re stuck upgrading and can’t find a 6800 or 6900 for MSRP (less likely now that manufacturers have taken advantage and are all putting out customized versions with insane MSRPs anyway) the extra $600-700 premium over scalper prices for a huge vram increase is well worth it IMO. Even over MSRP it’s a nice way to extend the lifespan of the card and great for video editing. AMD workstation cards allow hot-switching between the pro and gaming drivers if that matters to you but the lower clock speeds won’t appeal to most gamers.

        I threw caution to the wind on my current system because ECC RDIMM was still 2x the price of regular UDIMM a couple of summers ago and am running 160GB of ram on an i7-6950x. This goes against my personal system design philosophy completely and I don’t recommend it, but I can highly recommend massively overshooting suspected RAM needs. With Win10 / superfetch, commonly used applications effectively run off of ramdisk, and enormous video files exist in ram until something else needs it and simpler operations like HEVC extract, dovi_tool profile conversions, or dovi_tool generated RPU injection (from madmeasurehdr, which is an amazing feature from quietvoid) can run at extremely high speeds.

        Even if you’re not working with 90GB files, having basically every piece of software you commonly use running off of virtual ramdisk makes everything instant. I’m living dangerously with lack of ECC and that large amount of ram right now but I’ll be correcting that soon, and honestly haven’t hit major issues. I suspect with 160GB to deal with the ~1 bit in every 41 hours flipping ends up being in memory that’s on standby and overwritten or unused the majority of the time. Or in something like an executable that’s signed and has a CRC associated so it can be reloaded from disk. Most file types have some kind of error checking built-in and Windows will generally reload them from disk if there’s some kind of issue. I have a system crash about once every 4 months, and they’re all related to terrible Asus / asmedia drivers for the USB ports. I haven’t had any data corruption as of yet. The most dangerous part of the large amount of RAM is that Win10 allocates a correspondingly huge amount as disk write cache, so I’ll often see copies to HDDs claiming to run at the speed of the source NVMe for about 15s before dropping to their real speed. This is highly dangerous but I’ve yet to find a way to limit it without disabling entirely which is less than desirable.

        Fingers crossed it holds out until I can put together the next workstation.

    2. I do agree we are in no way spoiled for choice with ECC. Was really hopeful either Intel or AMD would push for side-band ECC to be a requirement with DDR5.
