deepsquirrelnet · 2 years ago
I used to work in manufacturing test for an SSD supplier. This would normally be covered by an “ongoing reliability test” in quality. But I also witnessed that quality can be a highly politicized arm of manufacturing companies, and finding issues with products is not always well received, while approving products is always well received.

In many consumer products, tests like that are often not implemented or curtailed compared to OEM products. When you buy from a company like Dell or Apple, you get the benefit of having a large organization providing accountability. In other words, when a company like Dell represents their interests in receiving quality products to uphold their reputation, they also have a shared interest with the end consumer — but carry a lot more weight since they represent large contracts with the supplier. Suppliers tend to put more effort into testing their OEM products so as not to damage their business relationships.

Anyway, this kind of thing happens all the time in consumer storage. Likely nobody was doing reliability testing on these drives in the first place since that costs money and can only expose problems they didn’t really want to know about.

KennyBlanken · 2 years ago
In a perfect world this would be true, especially at the large business level where the integrator will get their ass sued by the customer or at least be forced to make good on the situation.

In the retail and small/medium business market the reality is that Dell, HP, and the like are under so much pressure to cut margins that they'll go with whoever is cheapest, and customers almost never escalate things to tort.

Dell PC power supplies are made for them by someone else, proprietary in size and connector, and gosh, wouldn't you know it - they have a pretty high failure rate. They last just long enough to make it out of the warranty period, and then they make for a really nice revenue stream for Dell via replacement PSUs or pushing the customer to buy a new system entirely.

Even failure within the warranty period is acceptable in the consumer market because integrators have it down to a science exhausting people on the customer support side. Long phone queue times, incompetent support agents who have to transfer you to different agents and likely drop the call entirely, silly policies like requiring a reformat/OS reinstall for everything, and so on.

aurareturn · 2 years ago
>Even failure within the warranty period is acceptable in the consumer market because integrators have it down to a science exhausting people on the customer support side. Long phone queue times, incompetent support agents who have to transfer you to different agents and likely drop the call entirely, silly policies like requiring a reformat/OS reinstall for everything, and so on.

This is one reason why I believe Apple computers last much longer than Windows computers. With Apple, they only sell a few models in high volume. So if there's an issue, everyone will know about it and Apple will often have to do a mass recall or provide free repairs. And since Apple prices are higher, you'd assume that they use better-grade parts on average.

deepsquirrelnet · 2 years ago
Buying from an OEM certainly doesn’t come with any guarantees. It’s a price/quality contract in almost all cases though. The OEM defines an acceptable defectivity rate in their contract (even if the allowed DPM, defects per million, is high). This effectively establishes a requirement at the supplier to ensure they will meet it.
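
To make the DPM idea concrete, here is a minimal sketch of how an observed defect rate compares against a contractual limit. The function names and all the numbers are hypothetical, picked only for illustration; real contracts define the counting rules in far more detail.

```python
def defects_per_million(defective_units: int, units_shipped: int) -> float:
    """Return the observed defect rate expressed in defects per million (DPM)."""
    if units_shipped <= 0:
        raise ValueError("units_shipped must be positive")
    return defective_units / units_shipped * 1_000_000


def meets_contract(defective_units: int, units_shipped: int, allowed_dpm: float) -> bool:
    """True when the observed DPM is at or below the contractual limit."""
    return defects_per_million(defective_units, units_shipped) <= allowed_dpm


# Hypothetical numbers: 42 returns out of 120,000 drives, 500 DPM allowed.
observed = defects_per_million(42, 120_000)
print(f"observed: {observed:.0f} DPM, within limit: {meets_contract(42, 120_000, 500)}")
```

The point is only that a written limit gives the supplier a number to test against, which is what a retail product typically lacks.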

For consumer products, you can assume that this added requirement doesn’t exist.

Edit: as another example, it’s well known among hardware suppliers that being a supplier to Apple can be a double edged sword for this reason. They have very high quality expectations and they squeeze extremely hard on price. But for that, they bring high volumes. If your company doesn’t have their stuff together, they can easily get raked over the coals in Apple contracts.

bayindirh · 2 years ago
> ...Dell, HP, and the like are under so much pressure to cut margins that they'll go with whoever is cheapest, and customers almost never escalate things to tort.

Can confirm. I have an office-supplied HP business desktop. One day I noticed that my system was slower than normal. After 5 minutes with smartctl, I found out that the SSD was constantly throttling down the SATA link (SATA downshift), was not reading or writing more than ~250 MB/s, and had some wonky latency issues.
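
For context on that ~250 MB/s ceiling, a small sketch of the theoretical payload limit per SATA generation. It relies only on the standard line rates and 8b/10b encoding; the comment does not say which speed the link actually negotiated, so treating it as a 3 Gb/s downshift is an inference, not a measurement.

```python
# Line rates of the three SATA generations, in gigabits per second.
SATA_LINE_RATE_GBPS = {"SATA I": 1.5, "SATA II": 3.0, "SATA III": 6.0}


def max_payload_mb_per_s(line_rate_gbps: float) -> float:
    """Upper bound on payload throughput in MB/s for a SATA line rate.

    SATA uses 8b/10b encoding, so 10 bits on the wire carry 8 bits (one byte)
    of data. Protocol overhead lowers real-world numbers further.
    """
    bits_per_second = line_rate_gbps * 1e9
    return bits_per_second / 10 / 1e6


for name, rate in SATA_LINE_RATE_GBPS.items():
    print(f"{name}: {max_payload_mb_per_s(rate):.0f} MB/s")
```

A drive topping out a bit under 300 MB/s is at least consistent with a link that fell back from 6 Gb/s to 3 Gb/s.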

Got a new SSD, cloned the old drive onto it with dd, and all my problems were solved. The previous drive was by Samsung, but it was a "value" drive which even Google knew nothing about. It was probably built with bottom of the barrel parts, and something went bad earlier than expected.

JohnFen · 2 years ago
> Even failure within the warranty period is acceptable in the consumer market because integrators have it down to a science exhausting people on the customer support side.

This has been true for at least decades. It's why I completely ignore all warranties when I'm deciding what to purchase -- they tend to be essentially worthless, once you factor in the cost of trying to make a warranty claim.

Scoundreller · 2 years ago
It's funny because retail-boxed Intel CPUs used to overclock better, at least in the Celeron 300A days.
jonny_eh · 2 years ago
Except that a non-overclockable CPU isn't a lower quality one. In fact, they may be sold cheaper to the OEM because they are less likely to be overclockable.
bogantech · 2 years ago
> On the one hand, the resistors used in these SSDs are too big for the circuit board, causing weak connections

I am an electronics / PCB hobbyist and I can't for the life of me figure out how they came to such a weird conclusion. What does this even mean?

Larger components will have more surface area at the joint and should be stronger than a smaller component

> On the other hand, the soldering material used to attach these resistors is prone to forming bubbles and breaking easily, according to Häfele.

Never heard of solder doing this - it seems more likely to me that the solder wasn't reflowed properly in manufacturing.

What's more is that the component pictured is a capacitor.

The only conclusion I can draw here is that the guy has no clue what he's talking about

bunnie · 2 years ago
Hard to tell from appearance only but my initial impression is that's an inductor, not a capacitor. The circuit looks like a switching power regulator. The capacitors would be beige with silver ends, this one looks like an over molded inductor, similar to [1], and is used as the main power inductor in a buck regulator.

If this is an inductor, my gut reaction is it has an insufficient current rating for the application and it is overheating. Inductors have a bunch of loss mechanisms that contribute to heating. Depending on the type of metal used to build the core, it can 'hard saturate' and effectively walk itself off a cliff once the current draw gets too high. At some point, it gets hot enough to desolder itself from the circuit board. It's possible they did not see this in validation because the power draw of SSDs depend heavily on the work load and process variations in the chips; erase current can have a fairly wide variation.
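
A rough way to see why an under-rated inductor runs hot: winding loss grows with the square of the current. The sketch below covers only copper loss and a lumped thermal resistance; it does not model core loss or saturation, and every number in it is an assumption for illustration, not a measurement from these SSDs.

```python
def conduction_loss_w(rms_current_a: float, dc_resistance_ohm: float) -> float:
    """Copper (I^2 * R) loss in the winding, in watts. Core loss is ignored."""
    return rms_current_a ** 2 * dc_resistance_ohm


def temperature_rise_c(power_w: float, thermal_resistance_c_per_w: float) -> float:
    """Steady-state temperature rise above ambient for a given dissipation."""
    return power_w * thermal_resistance_c_per_w


# All values below are assumed for illustration, not taken from the drive.
DCR_OHM = 0.05        # hypothetical winding resistance
R_THETA_C_PER_W = 80  # hypothetical inductor-to-ambient thermal resistance

for current in (1.0, 2.0, 3.0):
    loss = conduction_loss_w(current, DCR_OHM)
    rise = temperature_rise_c(loss, R_THETA_C_PER_W)
    print(f"{current:.1f} A -> {loss:.2f} W, about {rise:.0f} C above ambient")
```

Tripling the current multiplies the loss by nine, which is why a wide spread in erase current can matter so much for a part sized near its limit.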

fwiw, voiding of solder joints is a problem. The solder is applied as a paste - fine particles of metal solder suspended in solder flux. During reflow the flux evaporates and leaves the metal behind, but if the process isn't tuned right bubbles of gas can be trapped in the joint. This can lead to reliability problems. It can also increase the effective thermal resistance to the circuit board, which for tiny components like this can often be the primary path for heat removal during normal operation.
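
To put a number on the thermal-resistance point, here is a deliberately crude sketch that treats the solder layer as a flat slab and treats voids as area that conducts no heat. The slab model, the function name, and the example values (50 µm thick, 50 W/m·K, 1 mm² pad) are all assumptions; real joints are not uniform slabs.

```python
def joint_thermal_resistance(thickness_m: float, conductivity_w_per_mk: float,
                             pad_area_m2: float, void_fraction: float) -> float:
    """Thermal resistance (K/W) of a solder layer treated as a flat slab.

    Voids are modeled, crudely, as area that conducts no heat at all.
    """
    if not 0 <= void_fraction < 1:
        raise ValueError("void_fraction must be in [0, 1)")
    effective_area = pad_area_m2 * (1 - void_fraction)
    return thickness_m / (conductivity_w_per_mk * effective_area)


# Assumed geometry and material values, for illustration only.
THICKNESS_M = 50e-6
CONDUCTIVITY = 50.0
PAD_AREA_M2 = 1e-6

for voids in (0.0, 0.3, 0.5):
    r = joint_thermal_resistance(THICKNESS_M, CONDUCTIVITY, PAD_AREA_M2, voids)
    print(f"{voids:.0%} voided -> {r:.2f} K/W")
```

In this model the resistance scales as 1 / (1 - void fraction), so a half-voided joint has twice the thermal resistance of a solid one.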

[1] https://www.digikey.com/en/products/detail/pulse-electronics...

Scoundreller · 2 years ago
The article says:

> the problem lies in hardware, not firmware, which could explain the lack of corrective firmware updates for those models and SanDisk's continued silence about the source of the issues.

But I'd guess a firmware update that slowed down the erase process could let it cool down, at the cost of a performance hit.

Are they not using charge pumps, and are these some of the first SSDs upgraded to on-board, inductor-based boost converters?

These messes could be solved if system power supplies had a 20V rail instead of requiring tiny devices to make it. Maybe an integrated manufacturer (hi apple) will spec out proprietary SSDs like this one day.

Charge pumps are cheap and small, but not as efficient (ie: HEAT!):

> By using the boost converter with the optimized inductor, the energy during write-operation of the proposed 1.8-V 3D-SSD is decreased by 68% compared with the conventional 3.3-V 3D-SSD with the charge pump.

https://dl.acm.org/doi/10.1145/1594233.1594253

2023 paper:

> One of the main causes is the on-die charge pump circuit, which has a low conversion efficiency and induces high heat generation.

> Using the in package boost converter, we show that the power consumption can be reduced by up to 39% while the temperature rise can be reduced by 50%.

https://ieeexplore.ieee.org/document/10145971

3seashells · 2 years ago
No vacuum oven?
onetimeuse92304 · 2 years ago
I am an electronics / PCB hobbyist and I can definitely see how their explanation can be true. I can't say it is, but I can see how it could be.

If you design a PCB for a given resistor size but then decide to use larger resistors without redesigning the pads, you may have reflow problems and weak joints. This is simply because the components are positioned by surface tension during the reflow process (they are pulled into place as the solder melts). If the pads are sized for smaller components, there will be too little solder for the larger surface and weight of the component, and it will pull at the wrong angle to seat the part, potentially causing a higher failure rate.

> What's more is that the component pictured is a capacitor.

And that means what? From the picture I can tell that there is very little solder between component and the pad. Potentially too little to hold the component well in place.

> The only conclusion I can draw here is that the guy has no clue what he's talking about

Maybe he does, maybe he doesn't. Have you considered the possibility that you are not an expert either?

eaasen · 2 years ago
As someone who designs circuit boards professionally, the explanation is clearly lacking. There might be a thermal issue or there might not be. There is nothing conclusive in the pictures either way. What I do see is the following:

1. Underfill (the brownish-tan smooth material surrounding the components towards the bottom of the picture) around the IC, which is typically done to make parts more mechanically robust.

2. No evidence of overheating on any of the thermal interface material that is left stuck to most of the components and no evidence of overheating on the PCB or the components themselves.

3. Completely insufficient evidence to declare a soldering issue. The way to prove this one way or another is x-ray inspection to look for voids in the solder or a mechanical cross-section of the suspect solder joints.

While this certainly could actually be the problem, I see insufficient evidence to conclude one way or another. Manufacturers don’t put underfill under a part unless it’s required through testing or experience with similar package types in prior designs since it adds cost, additional process steps and makes it a PITA or impossible to rework any bad components in the area.

As to the pad size/shape, there are three general classes of design defined by the IPC (standards body that deals with PCBs and PCB assemblies). Depending on how space constrained your design is, there are different recommended pad designs for passive components like these. They might be using one of the tighter spacing guidelines, but if their process is well controlled, it can be perfectly fine for the design life of the product.

If you want to see small pad layouts done well, look at an iPhone logic board.

If you want to know more about pad design for SMT parts, search for IPC-7352

jchw · 2 years ago
Does seem a bit strange, but the original article[1] in German, translated using Google Translate, reads as follows:

> “It's definitely a hardware problem. It is a design and construction weakness. The entire soldering process of the SSD is a problem,” says Häfele. A hard drive has components that need to be soldered to the circuit board. “The soldering material used, i.e. the solder, creates bubbles and therefore breaks more easily.”

> “In addition, the components used are far too large for the layout intended on the board,” says Häfele, explaining the technical problems: “As a result, the components are a little higher than the board and the contact with the intended pads is weaker. All it takes is a little something for solder joints to suddenly break.”

It sounds like what they're saying is that the solder pads are too small for some of the components. Not sure about what they're saying about the solder though.

[1]: https://futurezone.at/produkte/sandisk-ssd-ausfaelle-western...

exmadscientist · 2 years ago
> Not sure about what they're saying about the solder though.

There's more than one solder alloy in use. There's more than one class of solder alloy in use. Some are easier to use, some are harder to use. Some are high-performance, low-tolerance, some are low-performance, high-tolerance. Some are expensive, some are cheap.

The most troublesome family is SnBi. These are relatively new. They have a big "greenwashing" problem in that they solder at lower temperatures, which is "environmentally friendly" (and cheaper to run). Also the base metal is dirt cheap. (Wonder why manufacturers are interested?) It's also very, very brittle. It also happens to be a low-temperature alloy... so it's much easier to get hot enough to desolder during operation. Lots of trouble all around and in general a very high field failure rate. Not recommended... oh wait but it's cheap and greenwashable. Sigh.

jeffbee · 2 years ago
> It sounds like what they're saying is that the solder pads are too small for some of the components

The converse is also possible. Instead of being a design flaw with the pads too small for the component, it could be that a larger component was substituted during manufacturing. Even terrible freeware EDA packages have design rules that will flag improper solder pad layouts, so it seems like what might have happened is the physical part does not resemble its model.

nurple · 2 years ago
If the correct amount of pad is not exposed at the edge of the part, the solder will have nowhere to form a fillet which is critical to its physical attachment. Solder is not glue, and even with more pad contact beneath this is a physically weaker connection which often results in tombstones like pictured in TFA.

If you read the integration documents for these packages, you'll see that they distinctly specify the requirements for these margins. Probably the length is the more important axis and may be what he was referring to when saying "large". I've seen this be a problem particularly during the "chip shortage" where jellybean parts like these capacitors have the weakest specs in a design, meaning unilateral substitutions can happen at many points in the design/mfg pipeline.

Indeed brittle solder is a real phenomenon which is often easily visible in hand soldered joints that we call "cold" joints. Formation of bubbles can happen for a number of reasons, but IME it's the result of low quality solder or flux/cleaning. The organic compounds gasify in the heat and form an internal structure similar to bread.

ETA: an interesting paper exploring the cause and minimization of voiding in the reflow process. Particularly, the decrease in thermal conductivity in voided solder can critically contribute to its failure in high-heat operational environments.

https://www.circuitinsight.com/pdf/controlling_voiding_mecha...

exmadscientist · 2 years ago
> Larger components will have more surface area at the joint and should be stronger than a smaller component

Larger components are also, well, larger, and have much bigger forces on them. For ceramic capacitors you need to avoid shearing and torquing as the body of the capacitor is very brittle and a small crack means a dead part, possibly dead short. Big ceramics are dangerous to use as they have a high failure rate. I personally won't use anything larger than a 1210. Some of my colleagues think I'm nuts and should stop at 0805, but I think the flexible terminations available these days make 1210 viable. At least in medium volumes, I don't ship SSDs!
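
For anyone not used to the size codes (0805, 1206, 1210), a small sketch that decodes the four-digit imperial code into millimeters. The helper name is made up for this example, and it handles only four-digit codes; published nominal dimensions are rounded slightly differently from a straight inch conversion.

```python
MM_PER_INCH = 25.4


def imperial_chip_size_mm(code: str) -> tuple[float, float]:
    """Convert a four-digit imperial chip code to (length, width) in mm.

    The first two digits are the length and the last two the width, both in
    hundredths of an inch. Published nominal dimensions are rounded a bit
    differently (1206 is usually listed as 3.2 x 1.6 mm).
    """
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected a four-digit code, got {code!r}")
    length_in = int(code[:2]) / 100
    width_in = int(code[2:]) / 100
    return round(length_in * MM_PER_INCH, 2), round(width_in * MM_PER_INCH, 2)


for code in ("0805", "1206", "1210"):
    length, width = imperial_chip_size_mm(code)
    print(f"{code}: {length} x {width} mm, footprint {length * width:.1f} mm^2")
```

A 1210 has roughly three times the footprint area of an 0805, which gives a feel for how much more brittle ceramic is exposed to board flex.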

> I can't for the life of me figure out how they came to such a weird conclusion

What I see when I look at this is they have a part with a 5-sided termination (typical MLCC capacitor with metallized cap) but they have a footprint that only gets fillets on 1 of those 5 sides (typical would be 3). This is common for resistors... but resistors (a) have only 3-sided terminations anyway and (b) are made of robust alumina bodies, not fragile ceramics. So someone either got dumb with the footprint library or more likely overly aggressive to pack things in, not appreciating what MLCCs really need to be happy. I don't think it's part size changes, because the fillets along the length dimension that are visible look about right in size.

mips_r4300i · 2 years ago
My gut feel was also cracked MLCC ceramics from thermal expansion or shock.

I've seen some 1206s shear right off a pcb from merely mechanical shock to the PCB, not the cap directly.

When I use them I try to orient them parallel with any PCB bending forces, but they are still fragile.

negative_zero · 2 years ago
This is something that is in my area of expertise, and your suspicions are correct.

Solder can "bubble" but this is a line process issue that is easily picked up even in old AOI systems (automatic optical inspections) from 10-15 years ago.

To be frank, this article reads to me like a piece put together by somebody who has no idea what they're on about, to generate publicity for their company. Nothing to see here.

bravo22 · 2 years ago
The most charitable way I can read their statement is that the resistors are too large for the pad, and along with poor solder material it forms a weak joint which breaks over time.

I have a hard time accepting that because there is not a lot of heat on that line nor is there a lot of physical stress, like constant vibration on SSDs.

nrp · 2 years ago
These SSDs are tiny. The controllers can easily get up to 80C during sustained writes, so there could be mechanical stress from thermal cycling. (Source: we also make small USB-interfaced high-speed storage devices and do a range of reliability testing for stuff like this)
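
A back-of-the-envelope sketch of the thermal cycling point: the board and the part expand by different amounts, and the solder joint has to absorb the difference. The 80C figure is from the comment above; the 25C ambient, both expansion coefficients, and the 2 mm part length are assumed values for illustration only.

```python
def mismatch_strain(cte_board_ppm: float, cte_part_ppm: float, delta_t_c: float) -> float:
    """Unconstrained thermal mismatch strain (dimensionless) for one temperature swing."""
    return abs(cte_board_ppm - cte_part_ppm) * 1e-6 * delta_t_c


def length_mismatch_um(strain: float, part_length_mm: float) -> float:
    """Differential expansion across the part length, in micrometers."""
    return strain * part_length_mm * 1000


# Assumed values: in-plane board CTE, ceramic part CTE, and ambient temperature.
CTE_BOARD_PPM = 16.0
CTE_PART_PPM = 7.0
DELTA_T_C = 80.0 - 25.0

strain = mismatch_strain(CTE_BOARD_PPM, CTE_PART_PPM, DELTA_T_C)
print(f"strain per swing: {strain:.2e}")
print(f"mismatch over a 2 mm part: {length_mismatch_um(strain, 2.0):.2f} um")
```

With these inputs the mismatch is just under a micrometer per swing; whether that matters depends on joint geometry, solder alloy, and cycle count, none of which this models.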
londons_explore · 2 years ago
It reads to me more like the journalist writing the article summarized a technical report badly.
Taniwha · 2 years ago
It looks to me like some glued on covering has been removed here, which in turn could have pulled the components off (could still be weak solder joints) rather than it being a manufacturing problem - the components don't look too big for the pads to me

Most modern manufacturing lines have manual and automatic (vision system) inspections that would detect badly soldered or tombstoned components like the ones shown here.

sheepshear · 2 years ago
> What does this even mean?

It means you should click through to look at the pictures in the original article.

RantyDave · 2 years ago
But there was something in the article about epoxy - so potentially the components are glued down with a conductive epoxy instead of being actually soldered. Why would you do this? Don't know. But it would explain why the solder is losing the plot.
tyingq · 2 years ago
"Too big" could mean the pads on the circuit board were made for a smaller component, and now with the larger one, there's less overlap and direct contact from the pads on the board and the contacts on the component.
bastard_op · 2 years ago
I stopped buying WD anything early 2010's, but then they acquired everyone else like Seagate, meaning even decent Hitachi disks would be now tainted to become typical WD garbage. I still won't buy anything WD, but alternatives are hardly attractive with the market limited to like 3-4 players.

Good old monopolies in effect, your options are bad or worse.

bayindirh · 2 years ago
If Backblaze's yearly disk stats and my personal experience in our datacenter are anything to go by, WD has generally been the more reliable disk brand for the last decade or so.

I remember an era when Seagate Constellation (enterprise) disks were so bad that I was replacing a dozen of them per week.

Also, from my experience SanDisk didn't get tainted by WD acquisition. Their Extreme Pro SDs still as reliable as before, and their portable SSDs hit the speeds and reliability they advertise.

Every manufacturer makes a design error almost once a decade. Seagate did it, Maxtor did it, WD did it before (their drives were very finicky), however all big producers are in good shape now, from my experience. I can equally trust a Seagate IronWolf Pro or its WD equivalent, or a Samsung SSD and its SanDisk equivalent.

Problems happen, PCBs get revised, things get recalled. Everything is new, but nothing has changed.

justinclift · 2 years ago
> Their Extreme Pro SDs still as reliable as before

Try this: https://news.ycombinator.com/item?id=38244389

AussieWog93 · 2 years ago
It's funny you say that. I always thought WD were the more reliable brand, and Seagate were trash.

I wonder if it's just a case of each of us having one HDD of a particular brand fail on us violently, and then finding others who were in the same boat.

tharkun__ · 2 years ago
Pronounce this in German: "Sea gate oder sea gate nicht" ("Sie geht oder Sie geht nicht"). It means "she works or she doesn't work", a German wordplay on early failure rates for Seagate drives.

Coined at a time when, if you didn't have your Seagate drives in a RAID, you were more likely to lose your data than not ;)

And yeah I started buying WD at that point. Backblaze stats weren't a thing back then tho.

themagician · 2 years ago
> I wonder if it's just a case of each of us having one HDD of a particular brand fail on us violently, and then finding others who were in the same boat.

That is absolutely the case and anyone with enough experience could confirm it. Both WD and Seagate have made some real trash drives, and both made at least one or two models that were trash at scale. If you timed it just right you could jump from one to another and experience massive failures with both! You also probably have a drive from each that's been running for 20 years somehow.

icehawk · 2 years ago
I take it you mean "like Seagate [acquired everyone else]" because Seagate, Western Digital, and Micron are all competitors.

asmor · 2 years ago
And don't forget Hynix. They somewhat recently got into the B2C business, and while they command a premium, the SSDs I use from them, both OEM and retail, have been very solid.

There's also Samsung.

vanderZwan · 2 years ago
I hadn't heard about the Seagate acquisition, that sucks. So what are my options now if I want a reliable external hard drive for example?
justinclift · 2 years ago
Just to be clear, WD has not acquired Seagate. They're still two different, competing, companies.

The above post probably typo-d "Seagate" while meaning "SanDisk".

rft · 2 years ago
For external drives, I would seriously consider using SSDs. Unless you use them exclusively as cold backups and handle them carefully and seldom, I would be far too worried about accidental drops. I have killed some external HDDs this way, never killed an SSD, even though I am far rougher with them. For extra reliability, buy two disks from different manufacturers (e.g. Sandisk/WD and Samsung) at different times and mirror the contents. Less chance of both disks going bad at the same time.
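
If you do mirror two disks by hand, it is worth checking that the copies actually match. Below is a minimal sketch that compares SHA-256 digests file by file; the function names are made up for this example, and it only checks one direction (files on the primary that are missing or different on the mirror).

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def mirror_differences(primary: Path, mirror: Path) -> list[str]:
    """Return relative paths that are missing from, or differ on, the mirror."""
    problems = []
    for source in sorted(primary.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(primary)
        copy = mirror / relative
        if not copy.is_file() or file_digest(source) != file_digest(copy):
            problems.append(str(relative))
    return problems
```

Usage would look like `mirror_differences(Path("/mnt/ssd_a"), Path("/mnt/ssd_b"))`, where both mount points are hypothetical; an empty list means every file on the primary has an identical copy on the mirror.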

Talking about 3.5" HDDs, sourced from external drives: WD is still ok in my book. Both the Backblaze report [1] (newest, quarterly version, check the drive hours, WDC has less than HGST so far) and my own experience show they are ok. I used to buy HGST based on Backblaze's reports, but now I am using WD external drives in my NAS. My oldest and most used disk (one of the parity drives) has more than 3 years power on hours with nearly 900 start/stop cycles. It shows no signs of failure so far.
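
On why the drive hours matter: an annualized failure rate is failures divided by drive-years, so the same percentage can rest on very different amounts of evidence. The sketch below assumes the formula failures / (drive days / 365) × 100, which is how I understand Backblaze's reports to compute it; the example counts are hypothetical.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """AFR in percent: failures per drive-year of operation, times 100."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years * 100


# Hypothetical fleets with the same rate but very different sample sizes.
small_fleet = annualized_failure_rate(failures=2, drive_days=10_000)
large_fleet = annualized_failure_rate(failures=200, drive_days=1_000_000)
print(f"small fleet: {small_fleet:.1f}%  large fleet: {large_fleet:.1f}%")
```

Both fleets come out at the same rate, but one more failure would move the small fleet's number by half again, which is why a model with few drive hours deserves less trust.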

I get these HDDs from external drives (called "shucking"), 10TB WD My Book or WD Elements Desktop. It is a bit random what you get, but between 7 HDDs (+1 currently in testing) over about 3 years, I only had one non-Helium drive that runs hotter than the other all Helium drives. No failures yet, no bit errors as well, performance is at least good enough for media storage, currently reading at about 180MB/s sequentially.

I saw one problem: USB errors with WD's USB-SATA bridge and I even had to remove the newest disk to run the test, it would drop from the bus via USB. Might be because it is a refurbished disk or something fishy with the USB 3.0 ports on my server, so I won't blame WD for it.

[1] https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-...

asddubs · 2 years ago
What's wrong with the WD ones? I have a bunch of them and never had any problems
bastard_op · 2 years ago
The funny thing is that since these started making the news months ago, there were almost immediate fire sales on all the main deal sites to sell them off. Everyone who bought them now has a ticking time bomb of a disk to use. Thanks Western Digital for your contribution to society.
hobobaggins · 2 years ago
Costco was selling them (still!): https://www.costco.com/CatalogSearch?dept=All&keyword=ssd

Is Costco completely unaware of these massive issues?

bastard_op · 2 years ago
Costco is actually a decent org, and if anyone knew they were selling this time-bomb garbage, they would stop it, as they will warranty stuff for YEARS, just to be a somewhat decent company in a time of pirates.
dryheat3 · 2 years ago
Not the same series. "Extreme Go" is not the same product as "Extreme Pro". I have two of these from Costco and they have worked fine for several years.
HankB99 · 2 years ago
Maybe Costco caught up with this. I can't find it on their web site (at least in the US.)

All I see is the "Extreme Go" which I presume is a different product.

bastard_op · 2 years ago
Blissful ignorance imho.
RachelF · 2 years ago
They did warn you - they put the words "Extreme Pro" in the name.

I guess the "Extreme Pro" solder reflowing skillz are required ;-)

frankjr · 2 years ago
Sounds like Western Digital's strategy is to play dead and wait for it to blow over. And it will most likely work.
baz00 · 2 years ago
They saw Apple get away with it and tried to do the same.
bboygravity · 2 years ago
I've had a Fujitsu (if I remember correctly) drive many many years ago that had a hardware bug that would cause an IC on it to spontaneously flash fire and die.

It was a known flaw. They got away with it too.

RCitronsBroker · 2 years ago
no matter how bad the idea, there’s always someone waiting to turn Apple’s bad idea into a poorly implemented, even worse idea
ipqk · 2 years ago
There will probably be a class action lawsuit where everyone that bought one gets a $20 coupon towards a new WD product, and the lawyers make millions.
dboreham · 2 years ago
"resistors too big" ... <accompanied by picture of a capacitor>
layer8 · 2 years ago
Tom’s Hardware’s fault. The original source only says “components”.
newaccount74 · 2 years ago
I told myself I'd never again buy a WD drive when I realised the WD Red NAS drives I bought were completely unsuitable for NAS because they secretly replaced the product line with SMR drives.

And now you are telling me that the Sandisk SSD I bought as a replacement also has a fatal design flaw? And apparently Sandisk is a WD subsidiary?

I'm feeling slightly less bad about spending a fortune on getting a bigger built-in SSD in my Macbook. Please don't tell me they are flawed as well.

layer8 · 2 years ago
TFA is only about external drives.
newaccount74 · 2 years ago
Yeah, I know, I replaced my NAS with external SSDs.
Phostera · 2 years ago
Well, they do have the "kill your MacBook when they fail" problem. Ref: Rossmann on YouTube.
cvccvroomvroom · 2 years ago
I'm unmoved and unsurprised. Retail parts are unreliable, cheap crap by the nature of the market created to perpetuate the fantasies of something for nothing.

Coincidentally, I recently selected Max Endurance with a 15 year warranty for a noncritical application and a non-retail channel Industrial XI for something else.

I'm also unsurprised there are no SLC or traditional EEPROM SD cards advertising these facts because of the race-to-the-bottom commodification of garbage by the price point obsession of users who don't know any better. In an ideal world™, all network and computing devices would use ECC memory but no we can't have nice things and would rather have silent corruption and bitsquatting to save a few cents.

PS: Circa 2001, I intentionally tried to induce errors, for failure analysis purposes, in industrial Maxim flash EEPROM ICs rated for 10k cell writes by using an environmental cycling chamber with heat, cold, and humidity. The damn parts wouldn't fail even at 2.5 orders of magnitude beyond that, and I started to question whether the writes were happening at all. If I had had more time, I would've burned them to the ground until there were enough errors to characterize. At the end of the day, it had to be left at using turbo codes to ensure redundancy of data by cell and across chips.
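
For scale, here is what 2.5 orders of magnitude beyond a 10k-write rating works out to. The two inputs are the figures quoted above; the function name is just for this example.

```python
RATED_WRITES = 10_000      # endurance rating quoted in the comment
ORDERS_OF_MAGNITUDE = 2.5  # margin quoted in the comment


def writes_at_margin(rated_writes: int, orders_of_magnitude: float) -> float:
    """Write count that sits the given number of decades above the rating."""
    return rated_writes * 10 ** orders_of_magnitude


multiplier = 10 ** ORDERS_OF_MAGNITUDE
survived = writes_at_margin(RATED_WRITES, ORDERS_OF_MAGNITUDE)
print(f"x{multiplier:.0f} the rating, roughly {survived:,.0f} writes per cell")
```

That is a bit over three million writes per cell without a failure, on a part rated for ten thousand.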

mips_r4300i · 2 years ago
Maxim parts were and remain bulletproof, with prices to match.

I think eeprom longevity is intentionally understated due to practicalities of testing and possibly wide variations in lifetime beyond the spec.

And then you have Chinese domestic SPI NOR flash that kills itself after 3-4 erase cycles...