I tested four NVMe SSDs from four vendors – half lose FLUSH’d data on power loss

I always liked the embedded system model where you get flash hardware that has two operations -- erase block and write block. GC, RAID, error correction, etc. are then handled at the application level. It was never clear to me that the current tradeoff with consumer-grade SSDs was right. On the one hand, things like the error correction, redundancy, and garbage collection don't require the attention from CPU (and more importantly, doesn't tie up any bus). On the other hand, the user has no control over what the software on the SSD's chip does. Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out.

It would be nice if we could just buy dumb flash and let the application do whatever it wants (I guess that application would be your filesystem; but it could also be direct access for specialized use cases like databases). If you want maximum speed, adjust your settings for that. If you want maximum write durability, adjust your settings for that. People are always looking for that one size fits all use case, but it's hard here. Some people may be running cloud providers and already have software to store that block on 3 different continents. Some people may be an embedded system with a fixed disk image that changes once a year, with some temporary storage for logs. There probably isn't a single setting that gets optimal use out of the flash memory for both use cases. The cloud provider doesn't care if a block, flash chip, drive, server rack, availability zone, or continent goes away. The embedded system may be happy to lose logs in exchange for having enough writes left to install the next security update.

It's all a mess, but the constraints have changed since we made the mess. You used to be happy to get 1/6th of a PCI Express lane for all your storage. Now processors directly expose 128 PCIe lanes and have a multitude of underused efficiency cores waiting to be used. Maybe we could do all the "smart" stuff in the OS and application code, and just attach commodity dumb flash chips to our computer.

ATsch · 4 years ago

There's really two problems here:

1. Contemporary mainstram OSes have not risen to the challenge of dealing appropriately with the multi-CPU, multi-address space nature of modern computers. The proportion of the computer that the "OS" runs on has been shrinking for a long time and there have only been a few efforts to try to fix that (e.g. HarmonyOS, nrk, RTKit)

2. Hardware vendors, faced with proprietary or non-malleable OSes and incentives to keep as much magic in the firmware as possible, have moved forward by essentially sandboxing the user OS behind a compatibility shim. And because it works well enough, OS developers do not feel the need to adjust to the hardware, continuing the cycle.

There is one notable recent exception in adjusting filesystems to SMR/Zoned devices. However this is only on Linux, so desktop PC component vendors do not care. (Quite the opposite: they disable the feature on desktop hardware for market segmentation)

btown · 4 years ago

Are there solutions to this in the high-performance computing space, where random access to massive datasets is frequent enough that the “sandboxing” overhead adds up?

the_duke · 4 years ago

I can recommend the related talk "It's Time for Operating Systems to Rediscover Hardware". [1]

It explores how modern systems are a set of cooperating devices (each with their own OS) while our main operating systems still pretend to be fully in charge.

[1] https://www.youtube.com/watch?v=36myc8wQhLo

StillBored · 4 years ago

Fundamentally the job of the OS is resource sharing and scheduling. All the low level device management is just a side show.

Hence why SSD's use a block layer (or in the case of NVMe key/value, hello 1964/CKD) abstraction above whatever pile of physical flash, caches, non-volatile caps/batts, etc. That abstraction holds from the lowliest SD card, to huge NVMe-OF/FC/etc smart arrays which are thin provisioning, deduplicating, replicating, snapshoting, etc. You wouldn't want this running on the main cores for performance and power efficiency reasons. Modern m.2/SATA SSD's have a handful of CPUs managing all this complexity, along with background scrubbing, error correction, etc so you would be talking fully heterogeneous OS kernels knowledgeable about multiple architectures, etc.

Basically it would be insanity.

SSDs took a couple orders of magnitude off the time values of spinning rust/arrays, but many of the optimization points of spinning rust still apply. Pack your buffers and submit large contiguous read/write accesses, queue a lot of commands in parallel, etc.

So, the fundamental abstraction still holds true.

And this is true for most other parts of the computer as well. Just talking to a keyboard involves multiple microcontrollers, scheduling the USB bus, packet switching, and serializing/deserializing the USB packets, etc. This is also why every modern CPU has a mgmt CPU that bootstraps and manages it power/voltage/clock/thermals.

So, hardware abstractions are just as useful as software abstractions like huge process address spaces, file IO, etc.

ksec · 4 years ago

Our modern system has sort of achieved what microkernel was set out to do. Our Storage and Network each has their own OS.

rhizome · 4 years ago

And if the entire purpose of computer programming is to control and/or reduce complexity, I should think the discipline would be embarrassed with the direction in which the industries have been going the past several years. AWS alone should serve as an example.

rbanffy · 4 years ago

An interesting approach would be to standardize a way to program the controllers in flash disks, maybe something similar to OpenFirmware. Mainframes farm out all sort of IO to secondary processors and it was relatively common to overwrite the firmware in Commodore 1541 drives, replacing the basic IO routines with faster ones (or with copy protection shenanigans). I'm not sure anyone ever did that, but it should be possible to process data stored in files without tying up the host computer.

wtallis · 4 years ago

Consumer SSDs don't have much room to offer a different abstraction from emulating the semantics of hard drives and older technology. But in the enterprise SSD space, there's a lot of experimentation with exactly this kind of thing. One of the most popular right now is zoned namespaces, which separates write and erase operations but otherwise still abstracts away most of the esoteric details that will vary between products and chip generations. That makes it a usable model for both flash and SMR hard drives. It doesn't completely preclude dishonest caching, but removes some of the incentive for it.

sitkack · 4 years ago

Check out https://www.snia.org/ if you want to track development in this area.

namibj · 4 years ago

There is no strong reason why a consumer SSD can't allow reformatting to a smaller normal namespace and a separate zoned namespace. Zone-aware CoW file systems allow efficiently combining FS-level compaction/space-reclamation with NAND-level rewrites/write-leveling.

I'd probably pay for "unlocking" ZNS on my Samsung 980 Pro, if just to reduce the write amplification.

Deleted Comment

zrm · 4 years ago

> Consumer SSDs don't have much room to offer a different abstraction from emulating the semantics of hard drives and older technology.

From what I understand the abstraction works a lot like virtual memory. The drive shows up as a virtual address space pretending to be a a disk drive and then the drive's firmware maps virtual addresses to physical ones.

That doesn't seem at all incompatible with exposing the mappings to the OS through newer APIs so the OS can inspect or change the mappings instead of having the firmware do it.

_0w8t · 4 years ago

If anything consumer-level SSDs move to the opposite direction. On Samsung 980 Pro it is not even possible to change the sector size from 512 bytes to 4K.

throwawaylinux · 4 years ago

It's called the program-erase model. Some flash devices do expose raw flash, although it's then usually used by a filesystem (I don't know if any apps use it natively).

There's a _lot_ of problems doing high performance NAND yourself. You honestly don't want to do that in your app. If vendors would provide full specs and characterization of NAND and create software-suitable interfaces for the device then maybe it would be feasible to do in a library or kernel driver, but even then it's pretty thankless work.

You almost certainly want to just buy a reliable device.

Endurance management is very complicated. It's not just a matter of PE cycles for any given block will meet UBER spec at data retention limits with the given ECC scheme. Well, it could be in a naive scheme but then your costs go up.

Even something as simple as error correction is not. Error correction is too slow to do on the host for most IOs, so you need hardware ECC engines on the controller. But those become very large if you have a huge amount of correction capability in them so if errors exceed their capability you might go to firmware. Either way, the error rate is still important to know the health of the data, so you would need error rate data to be sent side-band with the data by the controller somehow. If you get a high error rate, does that mean the block is bad or does it mean you chose the wrong Vt to issue the read with, retention limit was approached, the page had read disturb events, dwell time was suboptimal, operating temperature was too low? All these things might factor in to your garbage collection and endurance management strategy.

Oh and all these things depend on every NAND design/process from each NAND manufacturer.

And then there's higher level redundancy than just per-cell (e.g., word line, chip, block, etc). Which all depend on the exact geometry of the NAND and how the controller wires them up.

I think better would be a higher level logical program/free model that sits above the low level UBER guarantees. GC would have to heed direction coming back from the device about what blocks must be freed, and what the next blocks to be allocated must be.

CGamesPlay · 4 years ago

> Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out.

I don't know, maybe if there was a "my files exist after the power goes out" column on the website, then I'd sort by that, too?

javajosh · 4 years ago

Ultimately the problem is on the review side. Probably because there's no money in it. There just aren't enough channels to sell that kind of content into, and it all seems relatively celebrity driven. That said, I bet there's room for a YouTube personality to produce weekly 10 minute videos where they torture hard-drives old and new - and torture them properly, with real scientific/journalistic integrity. So, basically you need to be an idealistic outspoken nerd and a little money to burn on HDDs and audio/video setup. Such a person would definitely have such a "column" included in their reviews!

(And they could review memory, too, and do backgrounder videos about standards and commonly available parts.)

gruez · 4 years ago

>I don't know, maybe if there was a "my files exist after the power goes out" column on the website

more like, "don't lose the last 5 seconds of writes if the power goes out". If ordering is preserved you should keep your filesystem, just lose more writes than you expected.

upofadown · 4 years ago

The flip side of the tyranny of the hardware flash controller is that the user can't reliably lose data even if they want to. Your super secure end to end messaging system that automatically erases older messages is probably leaving a whole bunch of copies of those "erased" messages laying around on the raw flash on the other side of the hardware flash controller. This can create a weird situation where it is literally impossible to reliably delete anything on certain platforms.

There is sometimes a whole device erase function provided, but it turns out that a significant portion of tested devices don't actually manage to do that.

infogulch · 4 years ago

"Securely erased" has transformed into 1. encrypting all erasable data with a key and 2. "erasing" becomes throwing away the key.

matheusmoreira · 4 years ago

> Maybe we could do all the "smart" stuff in the OS and application code, and just attach commodity dumb flash chips to our computer.

Yeah, this is how things are supposed to be done and the fact it's not happening is a huge problem. Hardware makers isolate our operating systems in the exact same way operating systems isolate our processes. The operating system is not really in charge of anything, the hardware just gives it an illusory sanboxed machine to play around in, a machine that doesn't even reflect what hardware truly looks like. The real computers are all hidden and programmed by proprietery firmware.

https://youtu.be/36myc8wQhLo

aseipp · 4 years ago

Flash storage is incredibly complex in the extreme at the low level. The very fact we're talking about microcontroller flash as if it's even the same ballpark as NVMe SSDs in terms of complexity or storage management says a lot on its own about how much people here understand the subject (including me.)

I haven't done research on flash design in almost a decade back when I worked on backup software, and my conclusions back then were basically that: you're just better off buying a reliable drive that can meet your your own reliability/performance characteristics, and making tweaks to your application to match the underlying drive operational behavior (coalesce writes, append as much as you can, take care with multithreading vs HDDs/SSDs, et cetera), and testing the hell out of that with a blessed software stack. So we also did extensive tests on what host filesystems, kernel versions, etc seemed "valid" or "good". It wasn't easy.

The amount of complexity to manage error correction and wear leveling on these devices alone, including auxiliary constraints, probably rivals the entire Linux I/O stack. And it's all incredibly vendor specific in the extreme. An auxiliary case e.g. the case of the OP, of handling power loss and flushing correctly, is vastly easier when you only consider some controller firmware and some capacitors on the drive, versus the whole OS being guaranteed to handle any given state the drive might be in, with adequate backup power, at time of failure, for any vendor and any device class. You'll inevitably conclude the drive is the better place to do this job precisely because it eliminates a massive amount of variables like this.

"Oh, but what about error correction and all that? Wouldn't that be better handled by the OS?" I don't know. What do you think "error correction" means for a flash drive? Every PHY on your computer for almost every moderately high-speed interface has a built in error correction layer to account for introduced channel noise, in theory no different than "error correction" on SSDs in the large, but nobody here is like, "damn, I need every number on the USB PHY controller on my mobo so that I can handle the error correction myself in the host software", because that would be insane for most of the same reasons and nearly impossible to handle for every class of device. Many "Errors" are transients that are expected in normal operation, actually, aside from the extra fact you couldn't do ECC on the host CPU for most high speed interfaces. Good luck doing ECC across 8x NVMe drives when that has to go over the bus to the CPU to get anything done...

You think you want this job. You do not want this job. And we all believe we could handle this job because all the complexity is hidden well enough and oiled by enough blood, sweat, and tears, to meet most reasonable use cases.

colejohnson66 · 4 years ago

Apple’s SSDs are like that in some systems, and they’ve gotten flack for it.

bob1029 · 4 years ago

I would absolutely love to have access to "dumb" flash from my application logic. I've got append only systems where I could be writing to disk many times faster if the controller weren't trying to be clever in anticipation of block updates.

bradfa · 4 years ago

The ECC and anything to do with multi or triple level cell flashes is quite non-trivial. You don’t want to have to think about these things if you don’t have to. But yes, better control over the flash controllers would be nice. There are alternative modes for NVMe like those specifically for key-value stores: https://nvmexpress.org/developers/nvme-command-set-specifica...

StillBored · 4 years ago

This is like the statement that if I optimize memcpy() for the number of controllers, levels of cache, and latency to each controller/cache, its possible to make it faster than both the CPU microcoded version (rep stosq/etc) and the software versions provided by the compiler/glibc/kernel/etc. Particularly if I know what the workload looks like.

And it breaks down the instant you change the hardware, even in the slightest ways. Frequently the optimizations then made turn around and reduce the speed below naive methods. Modern flash+controllers are massively more complex than the old NOR flash of two decades ago. Which is why they get multiple CPUs managing them.

snek_case · 4 years ago

IMO the problem here is that even if your flash drive presents a "dumb flash" API to the OS, there can still be caching and other magic that happens underneath. You could still be in a situation where you write a block to the drive, but the drive only writes that to local RAM cache so that it can give you very fast burst write speeds. Then, if you try to read the same block, it could read that block from its local cache. The OS would assume that the block has been successfully written, but if the power goes out, you're still out of luck.

ATsch · 4 years ago

Have you had a look at ZoneFS? It exposes pretty much exactly that model to userspace: https://www.kernel.org/doc/html/latest/filesystems/zonefs.ht...

It does need support from the storage device though.

ClumsyPilot · 4 years ago

"Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out."

Clearly the vendors are at odds with the law, selling a storage device that doesn't store.

I think they are selling snake-oil, otherwise known as commiting fraud. Maybe they made a mistake in design, and at the very least they should be forced to recall faulty products. If they know about the problem and this behaviour continues ait is basically a fraud.

We allow this to continue, and the manufacturers that actually do fulfill their obligations to the customer suffer financially, while unscurpulous ones laugh all the way to the bank.

myself248 · 4 years ago

I agree, all the way up to entire generations of SDRAM being unable to store data at their advertised speeds and refresh timings. (Rowhammer.) This is nothing short of fraud; they backed the refresh off WAY below what's necessary to correctly store and retrieve data accurately regardless of adjacent row access patterns. Because refreshing more often would hurt performance, and they all want to advertise high performance.

And as a result, we have an entire generation of machines that cannot ever be trusted. And an awful lot of people seem fine with that, or just haven't fully considered what it implies.

mikepurvis · 4 years ago

I don't know if a legal angle is the most helpful, but we probably need a Kyle Kingsbury type to step into this space and shame vendors who make inaccurate claims.

Which is currently all of them, but that was also the case in the distributed systems space when he first started working on Jepsen.

erosenbe0 · 4 years ago

This isn't fraud.

The tester is running the device out of spec.

The manufacturers warrant these devices to behave on a motherboard with proper power hold up times, not in whatever enclosures.

If the enclosure vendor suggests that behavior on cable pull will fully mimick motherboard atx power loss then that is fraud. But they probably have fine print about that, I'd hope.

ghshephard · 4 years ago

Nothing says that you can't both offload everything to hardware, and have the application level configure it. Just need to expose the API for things like FLUSH behavior and such...

jrockway · 4 years ago

Yeah, you're absolutely right. I'd prefer that the world dramatically change overnight, but if that's not going to happen, some little DIP switch on the drive that says "don't acknowledge writes that aren't durable yet" would be fine ;)

nayuki · 4 years ago

> the embedded system model where you get flash hardware that has two operations -- erase block and write block

> just attach commodity dumb flash chips to our computer

I kind of agree with your stance; it would be nice for kernel- or user-level software to get low-level access to hardware devices to manage them as they see fit, for the reasons you stated.

Sadly, the trend has been going toward smart devices for a very long time now. In the very old days, stuff like floppy disk seeks and sector management were done by the CPU, and "low-level formatting" actually meant something. Decades ago, IDE HDDs became common, LBA addressing became the norm, and the main CPU cannot know about disk geometry anymore.

mmf · 4 years ago

I think the main reason they did not expose lower level semantics is that the wanted a drop in replacement for hdds. The second is liability: unfettered access ti arbitrary location erases (and writes) can let you kill (wear out) a flash device in a really short time.

MobiusHorizons · 4 years ago

I could see that argument for sata SSDs, but the subject of the thread is NVME drives.

I think this is something LTT could handle with their new test lab. They already said they want to set new standards when it comes to hardware testing, so if they can hold up to what they promised and hire enough experts this should be a trivial thing to add to a test Parcours for disk drives.

balls187 · 4 years ago

LTT's commentary makes it difficult to trust they are objective (even if they are).

I loved seeing how giddy Linus got while testing Valve's Steamdeck, but when it comes to actual benchmarks and testing, I would appreciate if they dropped the entertainment aspect.

KennyBlanken · 4 years ago

GamersNexus seems to really be trying to work on improving and expanding their testing methodology as much as possible.

I feel like they've successfully developed enough clout/trust that they have escaped the hell of having to pull punches in order to assure they get review samples.

They eviscerated AMD for the 6500xt. They called out NZXT repeatedly for a case that was a fire hazard (!). Most recently they've been kicking Newegg in the teeth for trying to scam them over a damaged CPU. They've called out some really overpriced CPU coolers that underperform compared to $25-30 coolers. Etc.

I bet they'd go for testing this sort of thing, if they haven't already started working on it already. They'd test it and then describe for what use cases it would be unlikely to be a problem vs what cases would be fine. For example, a game-file-only drive where if there's an error you can just verify the game files via the store application. Or a laptop that's not overclocked and only is used by someone to surf the web and maybe check their email.

judge2020 · 4 years ago

From the most recent WAN show at 2:22:52[0]:

> for starters i think the lab is going to focus on written for its own content and then supporting our other content [mainly their unboxing videos]... or we will create a lab channel that we just don't even worry about any kind of upload frequency optimization and we just have way more basic, less opinionated videos, that are just 'here is everything you need to know about it' in video form if, for whatever reason, you prefer to watch a video compared to reading an article

0: https://youtu.be/rXHSbIS2lLs?t=8572

cheschire · 4 years ago

The forums contain more of the actual stuff. Not necessarily for the steamdeck, but you'll tend to find less entertainment type stuff.

https://linustechtips.com/topic/1410081-valve-left-me-unsupe...

yokoprime · 4 years ago

AFAIK they are under embargo still as far as software and performance, remember the steam deck doesnt launch until the 25th

ryan29 · 4 years ago

I'd really like to see one of the popular influencers disrupt the review industry by coming up with a way to bring back high quality technical analysis. I'd love to see what the cost of revenue looks like in the review industry. I'm guessing in-depth technical analysis does really bad in the cost of revenue department vs shallow articles with a bunch of ads and affiliate links.

I think the current industry players have tunnel vision and are too focused on their balance sheets. Things like reputation, trust, and goodwill are crucial to their businesses, but no one is getting a bonus for something that doesn't directly translate into revenue, so those things get ignored. That kind of short sighted thinking has left the whole industry vulnerable to up and coming influencers who have more incentive to care about things like reputation and brand loyalty.

I've been watching LTT with a fair bit of interest to see if they can come up with a winning formula. The biggest problem is that in-depth technical analysis isn't exciting. I remember reading something many years ago, maybe from JonnyGuru, where the person was explaining how most visitors read the intro and conclusion of an article and barely anyone reads the actual review.

Basically you need someone with a long term vision who understands the value you get from in-depth technical analysis and doesn't care if the cost of it looks bad on the balance sheet. Just consider it the cost of revenue for creating content and selling merchandise.

The most interesting thing with LTT is that I think they've got the pieces to make it work. They could put the most relevant / interesting parts of a review on YouTube and skew it towards the entertainment side of things. Videos with in-depth technical analysis could be very formulaic to increase predictability and reduce production costs and could be monetized directly on FloatPlane.

That way they build their own credibility for their shallow / entertaining videos without boring the core audience, but they can get some cost recovery and monetization from people that are willing to pay for in-depth technical analysis.

I also think it could make sense as bait to get bought out. If they start cutting into the traditional review industry someone might come along and offer to buy them as a defensive move. I wonder if Linus could resist the temptation of a large buyout offer. I think that would instantly torpedo their brand, but you never know.

______-_-______ · 4 years ago

I use rtings.com every time I buy a monitor.

https://www.rtings.com/monitor/tools/table

They rigorously test their hardware and you can filter/sort by literally hundreds of stats.

I just built a PC and I would have killed for a site that had apples-to-apples benchmarks for SSDs/RAM/etc. Motherboard reviews especially are a huge joke. We're badly missing a site like that for PC components.

KennyBlanken · 4 years ago

I mentioned this in another comment, but I think GamersNexus is doing exactly what you want.

Regarding influencers: they're being leveraged by companies precisely because they are about "the experience", not actual subjective analysis and testing. 99% of the "influencers" or "digital content creators" don't even pretend to try to do analysis or testing, and those that do generally zero in on one specific, usually irrelevant, thing to test.

throwaway2037 · 4 years ago

You wrote: <<bring back high quality technical analysis>>

How about Tom's Hardware and AnandTech? If they don't count, who does? Years ago, I used to read CDRLabs for optical media drives. Their reviews were very scientific and consistent. (Of course, optical media is all but dead now.)

goodpoint · 4 years ago

LTT is more focused on entertaining the audience than providing thorough, professional testing.

Operyl · 4 years ago

He’s recently pivoted a ton of his business to proper lab testing, and is hiring for it. It’ll be interesting to see, I think he might strike a better balance for those types of videos (I too am a bit tired of the clickbait nature these days).

kevincox · 4 years ago

But audience is also important. If it is only super-technical sources that are reporting faulty drives then the manufactures won't care much. However if you get a very popular source that has a lot of audience, especially in the high-margin "gamer" vertical then all of a sudden the manufactures will care a lot.

So if LTT does start providing more objective benchmarks and reviews it could be a powerful market force.

bstar77 · 4 years ago

I would personally leave this kind of testing to the pros, like Phoronix, Gamers Nexus, etc. LTT is a facade for useful performance testing and understanding of hardware issues.

OJFord · 4 years ago

That's like Ellen DeGeneres declaring a desire to set new standards for film critique.

bradenb · 4 years ago

How so? What's the problem with LTT? Are you just bothered that they're more than a purely informational source?

wmf · 4 years ago

Gamers Nexus?

nebula8804 · 4 years ago

Man thats a bit harsh.