I always liked the embedded system model where you get flash hardware that has two operations -- erase block and write block. GC, RAID, error correction, etc. are then handled at the application level. It was never clear to me that the current tradeoff with consumer-grade SSDs was right. On the one hand, things like error correction, redundancy, and garbage collection don't require attention from the CPU (and, more importantly, don't tie up any bus). On the other hand, the user has no control over what the software on the SSD's chip does. Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out.
It would be nice if we could just buy dumb flash and let the application do whatever it wants (I guess that application would usually be your filesystem, but it could also be direct access for specialized use cases like databases). If you want maximum speed, adjust your settings for that. If you want maximum write durability, adjust your settings for that. People are always looking for one setting that fits every use case, but that's hard here. Some people may be running cloud providers and already have software to store each block on 3 different continents. Others may be building an embedded system with a fixed disk image that changes once a year, plus some temporary storage for logs. There probably isn't a single setting that gets optimal use out of the flash memory for both use cases. The cloud provider doesn't care if a block, flash chip, drive, server rack, availability zone, or continent goes away. The embedded system may be happy to lose logs in exchange for having enough writes left to install the next security update.
It's all a mess, but the constraints have changed since we made the mess. You used to be happy to get 1/6th of a PCI Express lane for all your storage. Now processors directly expose 128 PCIe lanes and have a multitude of underused efficiency cores waiting to be used. Maybe we could do all the "smart" stuff in the OS and application code, and just attach commodity dumb flash chips to our computer.
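To make that two-operation model concrete, here is a minimal illustrative sketch (Python, with invented class names and sizes) of the constraints raw NAND imposes. It adds the page granularity real parts have -- programming happens per page, erasing per block, and erases are what wear the device out. Real NAND adds many more rules (program order within a block, partial-page limits, read disturb), so treat this as a cartoon, not a datasheet.

    # Toy model of a "dumb flash" erase block: program pages, erase whole blocks.
    # Names and sizes are illustrative only; real NAND parts differ widely.

    PAGE_SIZE = 4096        # bytes per page (assumed)
    PAGES_PER_BLOCK = 64    # pages per erase block (assumed)

    class EraseBlock:
        def __init__(self):
            self.pages = [None] * PAGES_PER_BLOCK  # None = erased (all 1s)
            self.erase_count = 0                   # wear indicator

        def program(self, page_index: int, data: bytes) -> None:
            # NAND can only flip erased cells; there is no in-place overwrite.
            if self.pages[page_index] is not None:
                raise ValueError("page must be erased before programming")
            if len(data) != PAGE_SIZE:
                raise ValueError("must program a full page")
            self.pages[page_index] = data

        def erase(self) -> None:
            # Erase is block-granular and is what wears the flash out.
            self.pages = [None] * PAGES_PER_BLOCK
            self.erase_count += 1

        def read(self, page_index: int) -> bytes:
            data = self.pages[page_index]
            return data if data is not None else b"\xff" * PAGE_SIZE

    # Everything an SSD firmware normally hides -- wear leveling, garbage
    # collection, logical-to-physical mapping -- would sit on top of this.
    block = EraseBlock()
    block.program(0, b"a" * PAGE_SIZE)
    block.erase()   # the only way to make page 0 writable again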
1. Contemporary mainstream OSes have not risen to the challenge of dealing appropriately with the multi-CPU, multi-address-space nature of modern computers. The proportion of the computer that the "OS" runs on has been shrinking for a long time, and there have only been a few efforts to try to fix that (e.g. HarmonyOS, nrk, RTKit).
2. Hardware vendors, faced with proprietary or non-malleable OSes and incentives to keep as much magic in the firmware as possible, have moved forward by essentially sandboxing the user OS behind a compatibility shim. And because it works well enough, OS developers do not feel the need to adjust to the hardware, continuing the cycle.
There is one notable recent exception in adjusting filesystems to SMR/zoned devices. However, this is only on Linux, so desktop PC component vendors do not care. (Quite the opposite: they disable the feature on desktop hardware for market segmentation.)
Are there solutions to this in the high-performance computing space, where random access to massive datasets is frequent enough that the “sandboxing” overhead adds up?
I can recommend the related talk "It's Time for Operating Systems to Rediscover Hardware". [1]
It explores how modern systems are a set of cooperating devices (each with their own OS) while our main operating systems still pretend to be fully in charge.
[1] https://www.youtube.com/watch?v=36myc8wQhLo
Fundamentally the job of the OS is resource sharing and scheduling. All the low level device management is just a side show.
Hence why SSDs use a block layer (or in the case of NVMe key/value, hello 1964/CKD) abstraction above whatever pile of physical flash, caches, non-volatile caps/batts, etc. That abstraction holds from the lowliest SD card, to huge NVMe-oF/FC/etc smart arrays which are thin provisioning, deduplicating, replicating, snapshotting, etc. You wouldn't want this running on the main cores for performance and power efficiency reasons. Modern M.2/SATA SSDs have a handful of CPUs managing all this complexity, along with background scrubbing, error correction, etc, so you would be talking about fully heterogeneous OS kernels knowledgeable about multiple architectures, etc.
Basically it would be insanity.
SSDs took a couple orders of magnitude off the time values of spinning rust/arrays, but many of the optimization points of spinning rust still apply. Pack your buffers and submit large contiguous read/write accesses, queue a lot of commands in parallel, etc.
So, the fundamental abstraction still holds true.
And this is true for most other parts of the computer as well. Just talking to a keyboard involves multiple microcontrollers, scheduling the USB bus, packet switching, and serializing/deserializing the USB packets, etc. This is also why every modern CPU has a mgmt CPU that bootstraps it and manages its power/voltage/clock/thermals.
So, hardware abstractions are just as useful as software abstractions like huge process address spaces, file IO, etc.
And if the entire purpose of computer programming is to control and/or reduce complexity, I should think the discipline would be embarrassed with the direction in which the industries have been going the past several years. AWS alone should serve as an example.
An interesting approach would be to standardize a way to program the controllers in flash disks, maybe something similar to OpenFirmware. Mainframes farm out all sorts of IO to secondary processors, and it was relatively common to overwrite the firmware in Commodore 1541 drives, replacing the basic IO routines with faster ones (or with copy protection shenanigans). I'm not sure anyone ever did that, but it should be possible to process data stored in files without tying up the host computer.
Consumer SSDs don't have much room to offer a different abstraction from emulating the semantics of hard drives and older technology. But in the enterprise SSD space, there's a lot of experimentation with exactly this kind of thing. One of the most popular right now is zoned namespaces, which separates write and erase operations but otherwise still abstracts away most of the esoteric details that will vary between products and chip generations. That makes it a usable model for both flash and SMR hard drives. It doesn't completely preclude dishonest caching, but removes some of the incentive for it.
There is no strong reason why a consumer SSD can't allow reformatting to a smaller normal namespace and a separate zoned namespace.
Zone-aware CoW file systems allow efficiently combining FS-level compaction/space reclamation with NAND-level rewrites/wear leveling.
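As a rough sketch of what the zoned model asks of the host (this is not the actual NVMe ZNS command set, just the write-pointer state machine it implies; names and sizes are made up):

    # Minimal sketch of a write-pointer-managed zone, loosely modeled on the
    # zoned storage idea (ZNS SSDs / SMR HDDs). Not a real NVMe interface.

    class Zone:
        def __init__(self, capacity_blocks: int):
            self.capacity = capacity_blocks
            self.write_pointer = 0          # next writable block offset
            self.data = [None] * capacity_blocks

        def append(self, block: bytes) -> int:
            # Writes must land at the write pointer; random writes are rejected,
            # which is what lets the device skip page-level remapping.
            if self.write_pointer >= self.capacity:
                raise IOError("zone full -- must reset before reuse")
            lba = self.write_pointer
            self.data[lba] = block
            self.write_pointer += 1
            return lba                      # caller records where the data landed

        def reset(self) -> None:
            # The host decides when a whole zone's data is dead (FS-level GC),
            # so device-level GC and its write amplification shrink.
            self.data = [None] * self.capacity
            self.write_pointer = 0

    zone = Zone(capacity_blocks=4)
    print(zone.append(b"log entry 1"), zone.append(b"log entry 2"))  # 0 1
    zone.reset()

The reset-instead-of-overwrite rule is the point: the host's compaction and the device's garbage collection stop working at cross purposes, which is where the claimed write-amplification savings come from.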
I'd probably pay for "unlocking" ZNS on my Samsung 980 Pro, if just to reduce the write amplification.
> Consumer SSDs don't have much room to offer a different abstraction from emulating the semantics of hard drives and older technology.
From what I understand, the abstraction works a lot like virtual memory. The drive shows up as a virtual address space pretending to be a disk drive, and then the drive's firmware maps virtual addresses to physical ones.
That doesn't seem at all incompatible with exposing the mappings to the OS through newer APIs so the OS can inspect or change the mappings instead of having the firmware do it.
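A toy version of that mapping-table view, under the assumption of a simple page-level table (real FTLs use hybrid schemes, cached tables, and journaling of the table itself):

    # Toy flash translation layer: logical LBA -> physical page, like a page table.
    # Purely illustrative; real FTLs are far more elaborate.

    class ToyFTL:
        def __init__(self, physical_pages: int):
            self.l2p = {}                         # logical LBA -> physical page
            self.free = list(range(physical_pages))
            self.invalid = set()                  # stale pages awaiting GC

        def write(self, lba: int, data: bytes) -> None:
            new_page = self.free.pop(0)           # out-of-place write
            old_page = self.l2p.get(lba)
            if old_page is not None:
                self.invalid.add(old_page)        # the old copy is now garbage
            self.l2p[lba] = new_page
            # (data would be programmed into new_page here)

        def read(self, lba: int) -> int:
            return self.l2p[lba]                  # physical location, firmware-private today

    ftl = ToyFTL(physical_pages=8)
    ftl.write(0, b"v1")
    ftl.write(0, b"v2")               # same LBA, different physical page
    print(ftl.read(0), ftl.invalid)   # 1 {0}

Exposing or manipulating that l2p table through an API is essentially what the parent is asking for; it also illustrates the "can't reliably delete" point elsewhere in the thread, since the old physical copy lingers until GC gets to it.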
If anything, consumer-level SSDs are moving in the opposite direction. On the Samsung 980 Pro it is not even possible to change the sector size from 512 bytes to 4K.
It's called the program-erase model. Some flash devices do expose raw flash, although it's then usually used by a filesystem (I don't know if any apps use it natively).
There's a _lot_ of problems doing high performance NAND yourself. You honestly don't want to do that in your app. If vendors would provide full specs and characterization of NAND and create software-suitable interfaces for the device then maybe it would be feasible to do in a library or kernel driver, but even then it's pretty thankless work.
You almost certainly want to just buy a reliable device.
Endurance management is very complicated. It's not just a matter of whether any given block's P/E cycles will meet the UBER spec at the data retention limit with the given ECC scheme. Well, it could be in a naive scheme, but then your costs go up.
Even something as seemingly simple as error correction is not simple. Error correction is too slow to do on the host for most IOs, so you need hardware ECC engines on the controller. But those become very large if you have a huge amount of correction capability in them, so if errors exceed their capability you might go to firmware. Either way, the error rate is still important for knowing the health of the data, so you would need error rate data to be sent side-band with the data by the controller somehow. If you get a high error rate, does that mean the block is bad, or does it mean you chose the wrong Vt to issue the read with, the retention limit was approached, the page had read disturb events, dwell time was suboptimal, or the operating temperature was too low? All these things might factor into your garbage collection and endurance management strategy.
Oh and all these things depend on every NAND design/process from each NAND manufacturer.
And then there's higher level redundancy than just per-cell (e.g., word line, chip, block, etc). Which all depend on the exact geometry of the NAND and how the controller wires them up.
I think a better approach would be a higher-level logical program/free model that sits above the low-level UBER guarantees. GC would have to heed direction coming back from the device about what blocks must be freed, and what the next blocks to be allocated must be.
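To give a flavor of how policy-laden this gets, here is a toy victim-selection heuristic of the kind a garbage collector might use, trading reclaimable space against wear. It is illustrative only, not any vendor's algorithm, and it ignores everything listed above (error rates, retention, read disturb, temperature):

    # Toy GC victim selection: prefer blocks with lots of invalid pages,
    # but avoid hammering blocks that have already been erased many times.
    # Illustrative only -- real endurance management weighs far more inputs.

    from dataclasses import dataclass

    @dataclass
    class BlockStats:
        block_id: int
        invalid_pages: int     # reclaimable space
        valid_pages: int       # data that must be copied out first
        erase_count: int       # wear so far

    def gc_score(b: BlockStats, max_erases: int) -> float:
        reclaim = b.invalid_pages / (b.invalid_pages + b.valid_pages + 1)
        wear_headroom = 1.0 - (b.erase_count / max_erases)
        return reclaim * wear_headroom     # higher = better victim

    def pick_victim(blocks: list[BlockStats], max_erases: int = 3000) -> BlockStats:
        return max(blocks, key=lambda b: gc_score(b, max_erases))

    blocks = [
        BlockStats(0, invalid_pages=60, valid_pages=4, erase_count=2900),
        BlockStats(1, invalid_pages=40, valid_pages=24, erase_count=150),
    ]
    print(pick_victim(blocks).block_id)   # 1: block 0 reclaims more but is nearly worn out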
> Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out.
I don't know, maybe if there was a "my files exist after the power goes out" column on the website, then I'd sort by that, too?
Ultimately the problem is on the review side. Probably because there's no money in it. There just aren't enough channels to sell that kind of content into, and it all seems relatively celebrity driven. That said, I bet there's room for a YouTube personality to produce weekly 10 minute videos where they torture hard drives old and new -- and torture them properly, with real scientific/journalistic integrity. So basically you need to be an idealistic, outspoken nerd with a little money to burn on HDDs and an audio/video setup. Such a person would definitely have such a "column" included in their reviews!
(And they could review memory, too, and do backgrounder videos about standards and commonly available parts.)
>I don't know, maybe if there was a "my files exist after the power goes out" column on the website
more like, "don't lose the last 5 seconds of writes if the power goes out". If ordering is preserved you should keep your filesystem, just lose more writes than you expected.
The flip side of the tyranny of the hardware flash controller is that the user can't reliably lose data even if they want to. Your super secure end to end messaging system that automatically erases older messages is probably leaving a whole bunch of copies of those "erased" messages laying around on the raw flash on the other side of the hardware flash controller. This can create a weird situation where it is literally impossible to reliably delete anything on certain platforms.
There is sometimes a whole device erase function provided, but it turns out that a significant portion of tested devices don't actually manage to do that.
> Maybe we could do all the "smart" stuff in the OS and application code, and just attach commodity dumb flash chips to our computer.
Yeah, this is how things are supposed to be done, and the fact it's not happening is a huge problem. Hardware makers isolate our operating systems in the exact same way operating systems isolate our processes. The operating system is not really in charge of anything; the hardware just gives it an illusory sandboxed machine to play around in, a machine that doesn't even reflect what hardware truly looks like. The real computers are all hidden and programmed by proprietary firmware.
Flash storage is complex in the extreme at the low level. The very fact we're talking about microcontroller flash as if it's even in the same ballpark as NVMe SSDs in terms of complexity or storage management says a lot on its own about how much people here understand the subject (including me.)
I haven't done research on flash design in almost a decade, back when I worked on backup software, and my conclusions back then were basically this: you're just better off buying a reliable drive that can meet your own reliability/performance requirements, making tweaks to your application to match the underlying drive's operational behavior (coalesce writes, append as much as you can, take care with multithreading vs HDDs/SSDs, et cetera), and testing the hell out of that with a blessed software stack. So we also did extensive tests on what host filesystems, kernel versions, etc seemed "valid" or "good". It wasn't easy.
The amount of complexity to manage error correction and wear leveling on these devices alone, including auxiliary constraints, probably rivals the entire Linux I/O stack. And it's all vendor specific in the extreme. An auxiliary case, e.g. the OP's case of handling power loss and flushing correctly, is vastly easier when you only have to consider some controller firmware and some capacitors on the drive, versus the whole OS being guaranteed to handle any given state the drive might be in, with adequate backup power, at time of failure, for any vendor and any device class. You'll inevitably conclude the drive is the better place to do this job precisely because it eliminates a massive number of variables like this.
"Oh, but what about error correction and all that? Wouldn't that be better handled by the OS?" I don't know. What do you think "error correction" means for a flash drive? Every PHY on your computer for almost every moderately high-speed interface has a built in error correction layer to account for introduced channel noise, in theory no different than "error correction" on SSDs in the large, but nobody here is like, "damn, I need every number on the USB PHY controller on my mobo so that I can handle the error correction myself in the host software", because that would be insane for most of the same reasons and nearly impossible to handle for every class of device. Many "Errors" are transients that are expected in normal operation, actually, aside from the extra fact you couldn't do ECC on the host CPU for most high speed interfaces. Good luck doing ECC across 8x NVMe drives when that has to go over the bus to the CPU to get anything done...
You think you want this job. You do not want this job. And we all believe we could handle this job because all the complexity is hidden well enough and oiled by enough blood, sweat, and tears, to meet most reasonable use cases.
I would absolutely love to have access to "dumb" flash from my application logic. I've got append only systems where I could be writing to disk many times faster if the controller weren't trying to be clever in anticipation of block updates.
The ECC and anything to do with multi or triple level cell flashes is quite non-trivial. You don’t want to have to think about these things if you don’t have to. But yes, better control over the flash controllers would be nice. There are alternative modes for NVMe like those specifically for key-value stores: https://nvmexpress.org/developers/nvme-command-set-specifica...
This is like the statement that if I optimize memcpy() for the number of controllers, levels of cache, and latency to each controller/cache, it's possible to make it faster than both the CPU microcoded version (rep stosq/etc) and the software versions provided by the compiler/glibc/kernel/etc. Particularly if I know what the workload looks like.
And it breaks down the instant you change the hardware, even in the slightest ways. Frequently the optimizations then made turn around and reduce the speed below naive methods. Modern flash+controllers are massively more complex than the old NOR flash of two decades ago. Which is why they get multiple CPUs managing them.
IMO the problem here is that even if your flash drive presents a "dumb flash" API to the OS, there can still be caching and other magic that happens underneath. You could still be in a situation where you write a block to the drive, but the drive only writes that to local RAM cache so that it can give you very fast burst write speeds. Then, if you try to read the same block, it could read that block from its local cache. The OS would assume that the block has been successfully written, but if the power goes out, you're still out of luck.
It does need support from the storage device though.
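A tiny simulation of that failure mode, assuming (purely for illustration) a drive that acknowledges writes out of volatile cache: reading the data back proves nothing about durability, and if the device doesn't honor FLUSH honestly the host has no way to tell until the power actually drops.

    # Toy model of a drive with a volatile write-back cache.
    # Illustrative only; the point is that "reads back OK" != "durable".

    class CachedDrive:
        def __init__(self, honest_flush: bool = True):
            self.media = {}          # what survives power loss
            self.cache = {}          # volatile RAM cache
            self.honest_flush = honest_flush

        def write(self, lba: int, data: bytes) -> None:
            self.cache[lba] = data   # acked immediately -- great benchmarks!

        def read(self, lba: int) -> bytes:
            return self.cache.get(lba, self.media.get(lba, b""))

        def flush(self) -> None:
            if self.honest_flush:    # a dishonest drive acks FLUSH and does nothing
                self.media.update(self.cache)

        def power_loss(self) -> None:
            self.cache = {}          # volatile contents are gone

    drive = CachedDrive(honest_flush=False)
    drive.write(7, b"important")
    print(drive.read(7))             # b'important' -- looks written
    drive.flush()
    drive.power_loss()
    print(drive.read(7))             # b'' -- it never reached the media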
"Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out."
Clearly the vendors are at odds with the law, selling a storage device that doesn't store.
I think they are selling snake oil, otherwise known as committing fraud. Maybe they made a mistake in design, and at the very least they should be forced to recall faulty products. If they know about the problem and this behaviour continues, it is basically fraud.
We allow this to continue, and the manufacturers that actually do fulfill their obligations to the customer suffer financially, while unscrupulous ones laugh all the way to the bank.
I agree, all the way up to entire generations of SDRAM being unable to store data at their advertised speeds and refresh timings. (Rowhammer.) This is nothing short of fraud; they backed the refresh off WAY below what's necessary to correctly store and retrieve data accurately regardless of adjacent row access patterns. Because refreshing more often would hurt performance, and they all want to advertise high performance.
And as a result, we have an entire generation of machines that cannot ever be trusted. And an awful lot of people seem fine with that, or just haven't fully considered what it implies.
I don't know if a legal angle is the most helpful, but we probably need a Kyle Kingsbury type to step into this space and shame vendors who make inaccurate claims.
Which is currently all of them, but that was also the case in the distributed systems space when he first started working on Jepsen.
The tester is running the device out of spec.
The manufacturers warrant these devices to behave on a motherboard with proper power hold-up times, not in arbitrary enclosures.
If the enclosure vendor suggests that behavior on cable pull will fully mimic motherboard ATX power loss, then that is fraud. But they probably have fine print about that, I'd hope.
Nothing says that you can't both offload everything to hardware, and have the application level configure it. Just need to expose the API for things like FLUSH behavior and such...
Yeah, you're absolutely right. I'd prefer that the world dramatically change overnight, but if that's not going to happen, some little DIP switch on the drive that says "don't acknowledge writes that aren't durable yet" would be fine ;)
> the embedded system model where you get flash hardware that has two operations -- erase block and write block
> just attach commodity dumb flash chips to our computer
I kind of agree with your stance; it would be nice for kernel- or user-level software to get low-level access to hardware devices to manage them as they see fit, for the reasons you stated.
Sadly, the trend has been going toward smart devices for a very long time now. In the very old days, stuff like floppy disk seeks and sector management were done by the CPU, and "low-level formatting" actually meant something. Decades ago, IDE HDDs became common, LBA addressing became the norm, and the main CPU cannot know about disk geometry anymore.
I think the main reason they did not expose lower-level semantics is that they wanted a drop-in replacement for HDDs. The second is liability: unfettered access to arbitrary-location erases (and writes) can let you kill (wear out) a flash device in a really short time.
I've actually run into some data loss running simple stuff like pgbench on Hetzner due to this -- I ended up just turning off write-back caching at the device level for all the machines in my cluster:
https://vadosware.io/post/everything-ive-seen-on-optimizing-...
Granted, I was doing something highly questionable (running Postgres with fsync off on ZFS). It was very painful to get to the actual issue, but I'm glad I found out.
I've always wondered if it was worth pursuing to start a simple data product with tests like these on various cloud providers to know where these corners are and what you're really getting for the money (or lack thereof).
[EDIT] To save people some time (that post is long), the command to set the feature is the following:
nvme set-feature -f 6 -v 0 /dev/nvme0n1
The docs for `nvme` (nvme-cli package, if you're Ubuntu based) can be pieced together across some man pages:
https://man.archlinux.org/man/nvme.1
https://man.archlinux.org/man/nvme-set-feature.1.en
It's a bit hard to find all the NVMe features but 6 is the one for controlling write-back caching.
https://unix.stackexchange.com/questions/472211/list-feature...
[1] https://github.com/linux-nvme/nvme-cli/blob/master/nvme-prin...
[2] https://github.com/linux-nvme/libnvme/blob/master/src/nvme/t...
From reading your vadosware.io notes, I'm intrigued that replacing fdatasync with fsync is supposed to make a difference to durability at the device level. Both functions are supposed to issue a FLUSH to the underlying device, after writing enough metadata that the file contents can be read back later.
If fsync works and fdatasync does not, that strongly suggests a kernel or filesystem bug in the implementation of fdatasync that should be fixed.
That said, I looked at the logs you showed, and those "Bad Address" errors are the EFAULT error, which only occurs in buggy software, or some issue with memory-mapping. I don't think you can conclude that NVMe writes are going missing when the pg software is having EFAULTs, even if turning off the NVMe write cache makes those errors go away. It seems likely that that's just changing the timing of whatever is triggering the EFAULTs in pgbench.
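For reference, the pattern being discussed looks roughly like this on Linux; both calls are specified to return only after the data (plus, for fsync, the remaining metadata) has been pushed toward stable storage, which is why a behavioral difference between them points at something lower in the stack. The path here is a throwaway placeholder.

    # Both fsync() and fdatasync() are supposed to force data to stable
    # storage (fdatasync may skip non-essential metadata like mtime).
    # Path is a placeholder; run against a throwaway file.

    import os

    fd = os.open("/tmp/durability-test", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    os.write(fd, b"commit record\n")

    os.fdatasync(fd)   # flush file data (and metadata needed to read it back)
    os.fsync(fd)       # flush data plus remaining metadata

    os.close(fd)
    # If the device's write cache lies about FLUSH, neither call can help:
    # the kernel issued the flush, the drive just didn't honor it.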
> From reading your vadosware.io notes, I'm intrigued that replacing fdatasync with fsync is supposed to make a difference to durability at the device level. Both functions are supposed to issue a FLUSH to the underlying device, after writing enough metadata that the file contents can be read back later.
Yeah I thought the same initially which is why I was super confused --
> If fsync works and fdatasync does not, that strongly suggests a kernel or filesystem bug in the implementation of fdatasync that should be fixed.
Gulp.
> That said, I looked at the logs you showed, and those "Bad Address" errors are the EFAULT error, which only occurs in buggy software, or some issue with memory-mapping. I don't think you can conclude that NVMe writes are going missing when the pg software is having EFAULTs, even if turning off the NVMe write cache makes those errors go away. It seems likely that that's just changing the timing of whatever is triggering the EFAULTs in pgbench.
It looks like I'm going to have to do some more experimentation on this -- maybe I'll get a fresh machine and try to reproduce this issue again.
What led me to suspect the NVMe drive was dropping writes was the complete lack of errors on the pg and OS side (dmesg, etc).
I think this is something LTT could handle with their new test lab. They already said they want to set new standards when it comes to hardware testing, so if they can hold up to what they promised and hire enough experts, this should be a trivial thing to add to a test course for disk drives.
LTT's commentary makes it difficult to trust they are objective (even if they are).
I loved seeing how giddy Linus got while testing Valve's Steamdeck, but when it comes to actual benchmarks and testing, I would appreciate if they dropped the entertainment aspect.
GamersNexus seems to really be trying to work on improving and expanding their testing methodology as much as possible.
I feel like they've successfully developed enough clout/trust that they have escaped the hell of having to pull punches in order to assure they get review samples.
They eviscerated AMD for the 6500xt. They called out NZXT repeatedly for a case that was a fire hazard (!). Most recently they've been kicking Newegg in the teeth for trying to scam them over a damaged CPU. They've called out some really overpriced CPU coolers that underperform compared to $25-30 coolers. Etc.
I bet they'd go for testing this sort of thing, if they haven't started working on it already. They'd test it and then describe the use cases where it would likely be a problem vs the cases where it would be fine. For example, a game-file-only drive where, if there's an error, you can just verify the game files via the store application. Or a laptop that's not overclocked and is only used by someone to surf the web and maybe check their email.
> for starters i think the lab is going to focus on written for its own content and then supporting our other content [mainly their unboxing videos]... or we will create a lab channel that we just don't even worry about any kind of upload frequency optimization and we just have way more basic, less opinionated videos, that are just 'here is everything you need to know about it' in video form if, for whatever reason, you prefer to watch a video compared to reading an article
0: https://youtu.be/rXHSbIS2lLs?t=8572
https://linustechtips.com/topic/1410081-valve-left-me-unsupe...
I'd really like to see one of the popular influencers disrupt the review industry by coming up with a way to bring back high quality technical analysis. I'd love to see what the cost of revenue looks like in the review industry. I'm guessing in-depth technical analysis does really bad in the cost of revenue department vs shallow articles with a bunch of ads and affiliate links.
I think the current industry players have tunnel vision and are too focused on their balance sheets. Things like reputation, trust, and goodwill are crucial to their businesses, but no one is getting a bonus for something that doesn't directly translate into revenue, so those things get ignored. That kind of short sighted thinking has left the whole industry vulnerable to up and coming influencers who have more incentive to care about things like reputation and brand loyalty.
I've been watching LTT with a fair bit of interest to see if they can come up with a winning formula. The biggest problem is that in-depth technical analysis isn't exciting. I remember reading something many years ago, maybe from JonnyGuru, where the person was explaining how most visitors read the intro and conclusion of an article and barely anyone reads the actual review.
Basically you need someone with a long term vision who understands the value you get from in-depth technical analysis and doesn't care if the cost of it looks bad on the balance sheet. Just consider it the cost of revenue for creating content and selling merchandise.
The most interesting thing with LTT is that I think they've got the pieces to make it work. They could put the most relevant / interesting parts of a review on YouTube and skew it towards the entertainment side of things. Videos with in-depth technical analysis could be very formulaic to increase predictability and reduce production costs and could be monetized directly on FloatPlane.
That way they build their own credibility for their shallow / entertaining videos without boring the core audience, but they can get some cost recovery and monetization from people that are willing to pay for in-depth technical analysis.
I also think it could make sense as bait to get bought out. If they start cutting into the traditional review industry someone might come along and offer to buy them as a defensive move. I wonder if Linus could resist the temptation of a large buyout offer. I think that would instantly torpedo their brand, but you never know.
https://www.rtings.com/monitor/tools/table
They rigorously test their hardware and you can filter/sort by literally hundreds of stats.
I just built a PC and I would have killed for a site that had apples-to-apples benchmarks for SSDs/RAM/etc. Motherboard reviews especially are a huge joke. We're badly missing a site like that for PC components.
I mentioned this in another comment, but I think GamersNexus is doing exactly what you want.
Regarding influencers: they're being leveraged by companies precisely because they are about "the experience", not actual objective analysis and testing. 99% of the "influencers" or "digital content creators" don't even pretend to try to do analysis or testing, and those that do generally zero in on one specific, usually irrelevant, thing to test.
You wrote: <<bring back high quality technical analysis>>
How about Tom's Hardware and AnandTech? If they don't count, who does? Years ago, I used to read CDRLabs for optical media drives. Their reviews were very scientific and consistent. (Of course, optical media is all but dead now.)
He’s recently pivoted a ton of his business to proper lab testing, and is hiring for it. It’ll be interesting to see, I think he might strike a better balance for those types of videos (I too am a bit tired of the clickbait nature these days).
But audience is also important. If it is only super-technical sources that are reporting faulty drives, then the manufacturers won't care much. However, if you get a very popular source with a large audience, especially in the high-margin "gamer" vertical, then all of a sudden the manufacturers will care a lot.
So if LTT does start providing more objective benchmarks and reviews it could be a powerful market force.
I would personally leave this kind of testing to the pros, like Phoronix, Gamers Nexus, etc. LTT is a facade for useful performance testing and understanding of hardware issues.
I developed SSD firmware in the past, and our team always made sure it would write the data and check the write status. We also used to analyze competitor products using bus analyzers and could determine some wouldn't do that. Also, in the past many OS filesystems would ignore many of the errors we returned anyway.
Edit: Here is an old paper on the subject of OS filesystem error handling:
https://research.cs.wisc.edu/wind/Publications/iron-sosp05.p...
> The models that never lost data: Samsung 970 EVO Pro 2TB and WD Red SN700 1TB.
I always buy the EVO Pros for external drives and use TB-to-NVMe bridges, and they are pretty good.
There is a 970 Evo, a 970 Pro and a 970 Evo Plus, but no 970 Evo Pro as far as I am aware. It would be interesting to know which model the OP is actually talking about and whether it is the same for other Samsung NVMe SSDs. I also prefer Samsung SSDs because they are reliable and they usually don't change parts to lower-spec ones while keeping the same model number, like some other vendors do.
And watch out with the 980 Pro, Samsung has just changed the components.
Samsung have removed the Elpis controller from the 980 PRO and replaced it with an unknown one, and also removed any speed reference from the spec sheet.
Take a look here for what's changed on the 980 PRO: https://www.guru3d.com/index.php?ct=news&action=file&id=4489...
It's OK for them to do this, but then they should give the new product a new name, not re-use the old name so that buying it becomes a "silicon lottery" as far as performance goes.
I mostly buy Samsung Pro. Today I put an Evo in a box which I'm sending back for RMA because of damaged LBAs. I guess I'm stopping my tests on getting anything else but the Pros.
But IIRC Samsung was also called out for switching controllers last year.
"Yes, Samsung Is Swapping SSD Parts Too | Tom's Hardware"
I'm curious whether the drives are at least maintaining write-after-ack ordering of FLUSHed writes in spite of a power failure. (I.e., whether the contents of the drives after power loss are nonetheless crash consistent.) That still isn't great, as it messes with consistency between systems, but at least a system solely dependent on that drive would not suffer loss of integrity.
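One way to state that property as a check (a toy sketch, not a test harness): if write-after-ack ordering survived the power cut, then whatever is on the drive afterwards should be a prefix of the acknowledged write sequence.

    # Toy crash-consistency check: after power loss, the surviving records
    # should be a prefix of the acknowledged sequence if ordering was preserved.

    def is_crash_consistent(acknowledged: list[int], survived: list[int]) -> bool:
        # ordering preserved => what survived is exactly the first N acknowledged writes
        return survived == acknowledged[:len(survived)]

    print(is_crash_consistent([1, 2, 3, 4], [1, 2]))  # True: only the tail was lost
    print(is_crash_consistent([1, 2, 3, 4], [1, 3]))  # False: a later write outlived an earlier one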
Enterprise drives with PLP (power loss protection) are surprisingly affordable. I would absolutely choose them for workstation / home use.
The new Micron 7400 Pro M.2 960GB is $200, for example.
Sure, the published IOPS figures are nothing to write home about, but drives like these 1) hit their numbers every time, in every condition, and 2) can just skip flushes altogether, making them much faster in uses where data integrity is important (and flushes would otherwise be issued).
https://news.ycombinator.com/item?id=30371857
The Samsung EVO drives are interesting because they have a few GB of SLC that they use as a secondary buffer before they reflush to the MLC.
I'm nitpicking, but an EVO has TLC. Also an SLC write cache is the norm for any high performance consumer ssd, it's not just Samsung.
b...but the M in MLC stands for multi... as in multiple... right?
checks
Oh... uh; Apparently the obvious catch-all term MLC actually only refers to two-bit (two-level) cells, but they didn't call it DLC, and now there's no catch-all term for > SLC. TIL.