Intel Processor Instability Causing Oodle Decompression Failures

franzb · 2 years ago

Reminds me of this saga I went through as an early adopter of AMD Threadripper 3970X:

https://forum.level1techs.com/t/amd-threadripper-3970x-under...

HN discussion: https://news.ycombinator.com/item?id=22382946

Ended up investigating the issue with AMD for several months, was generously compensated by AMD for all the troubles (sending motherboards and CPUs back and forth, a real PITA), but the outcome is that I've been running since then with a custom BIOS image provided by AMD. I think at the end the fault was on Gigabyte's side.

rwmj · 2 years ago

Reminded me of the Intel Skylake bug found by the OCaml compiler developers: https://tech.ahrefs.com/skylake-bug-a-detective-story-ab1ad2...

rkagerer · 2 years ago

Holy cow I had no idea CPU vendors would do this for you.

zare_st · 2 years ago

Supermicro gave us same type of assistance. Then new feature of bifurcation did not work correctly. Without it, enterprise telecommunications peripheral that costs 10x more than 4 socket Xeon motherboard can't run at nominal speed, and it was ran on real lines, not test data.

They sent us custom BIOSes until it got stabilized and said they'll put the patch in the following BIOS releases.

The thing is neither Intel nor AMD nor Supermicro can test edge cases at max usage in niche environments without paying money, but they would really love to claim with backup they can be integrated for such solutions. If Intel wants to test stuff in space for free they have to cooperate with NASA; the alternative is in-house launch.

devmor · 2 years ago

When you’re not only helping them debug their own hardware but are also spending money on their ridiculously overpriced HEDT platform, it probably makes them want to keep you happy.

enraf · 2 years ago

I got one of the faulty 13900k, at least in my case I can confirm that the fault appeared using the default settings for pl1/pl2.

I was doing reinforcement learning on that system and it was always crashing, I spent quite a bit of time trying to find the problem, swapped the CPU for a 13700kf I was using in another PC, the problem was solved.

So I contact Intel to start the RMA process, Intel said that the MSI motherboard I was using doesn't support Linux, I emailed them the official Intel GitHub repo with the microcode that enables the support, they switched agents at that point but I was clear to me at that moment that Intel was trying their best to avoid the RMA, luckily I live in Europe, so I contacted my local consumer protection agency and did the RMA through them, in the meanwhile I saw a good offer for a 7950x + motherboard in an online retailer, bought it and sold in the second market my old motherboard and the RMA 13900k when I got it.

Not buying Intel ever again, I was using Intel because they sponsor some projects in DS but damn.

hopfenspergerj · 2 years ago

I’ve had instability with my 7700k since I bought it, and 16 months of bios updates haven’t helped. Maybe this latest generation of processors just has more trouble than older, simpler designs.

acdha · 2 years ago

Intel has been struggling with CPU performance for a decade, and has been trying to regain their position in absolute performance and performance/{price,watt} comparisons. I think that means they’re being less conservative than they used to be on the hardware margins and also that their teams are likely demoralized, too.

smolder · 2 years ago

Possibly. I would start swapping parts around at that point. Different memory, different CPU, or different motherboard. Just 1 more anecdote, but my r7-7700x has been a dream (won the silicon lottery). It runs at the maximum undervolt & RAM at 6000 with no stability problems.

Ochi · 2 years ago

So ideally, we should disable hyper threading to mitigate security issues and now also disable turbo mode to mitigate memory corruption issues. Maybe we should also disable C states to avoid side-channel attacks and disable efficiency cores to avoid scheduler issues... and at some point we are back to a feature set from 20+ years ago. :P

vondur · 2 years ago

It seems like problems occur from different firmware from the various motherboard manufacturers. I have a motherboard with a Ryzen 7950x and it would randomly not boot. I'd have to remove the battery from the system, let it fully reset, and then it would work again. Finally an update to the firmware fixed that bug.

Deleted Comment

secondcoming · 2 years ago

Or just disable overclocking.

zenonu · 2 years ago

Intel is already running their CPUs at the red line. We're seeing the margin breaking down as Intel tries to remain competitive. The latest 14900KS can even pull > 400W. It's utter insanity.

bee_rider · 2 years ago

At least the built in “multicore enhancement” type overclocks that are popular nowadays with motherboard manufacturers.

I wonder if the old style “bump it up and memtest” type overclocking would catch this. Actually, what is the good testing tool nowadays? Does memtest check AVX frequencies?

Kon-Peki · 2 years ago

But isn't overclocking the entire point of buying the K version of these chips?

ajross · 2 years ago

Why is this downvoted? That's exactly what's happening here. The affected devices are being overclocked, and the instructions at the end of the linked support document detail how to find the correct limits for your CPU and set them in your BIOS.

rkagerer · 2 years ago

Right, when they still knew how to make reliable hardware instead of cramming in features that aren't fully thought out and come with questionable tradeoffs to hit the bleeding edge.

NikkiA · 2 years ago

TBH if we'd stopped at coppermine or tualatin and focused entirely on making the software better, it probably would be a better world.

whoisthemachine · 2 years ago

Good, fast, cheap. Choose two.

paulmd · 2 years ago

haha, knew it wouldn't take long for the AMD fanboys to get winding up on how awful this is gonna be.

https://news.ycombinator.com/item?id=39479081

Somehow people think that it's a strawman, but people like parent comment actually think and post like this lol

bee_rider · 2 years ago

IMO it is worth noting that the “turbo mode,” as you call it, seems to be an overlock that some motherboards do by default. Not the stock boost frequencies.

The hyperthread and c-state stuff, eh, if you want to run code that might be a virus you will have to limit your system. I dunno. It would be a shame if we lost the ability to ignore that advice. Most desktops are single-user after all.

jrockway · 2 years ago

Remember that you run a lot of untrusted code on your single-user desktop through Javascript on websites. Javascript can do all those side channel attacks like Spectre and Meltdown.

blibble · 2 years ago

turbo boost is an advertised feature of the chip

these chips that have been specially binned because they are supposedly stable at those frequencies (within an envelope set by intel)

if intel can't get it to work they shouldn't be selling these chips at all

alwayslikethis · 2 years ago

Provided enough cooling, a chip that can boost to its turbo frequency for a few seconds should also run stably at that frequency indefinitely. Nowadays these boost clocks are so high that there is often not much gained by pushing any further.

jnxx · 2 years ago

> The hyperthread and c-state stuff, eh, if you want to run code that might be a virus you will have to limit your system.

So, you are trusting all web pages you view? Because these are unknown code running on your box which probably has some beefy private data.

dist-epoch · 2 years ago

Intel should police their own ecosystem.

mbrumlow · 2 years ago

I recently built a new system with a i9 149kf and a Ausus Formula motherboard. For a VFIO system so I could run windows and play some games.

It was a nightmare to get running stable. None is the default settings the motherboard used worked. Games crashed, kernel and emacs compiles failed.

End result I had to cap turbo to 5.4ghz on a 6ghz chip, and enable settings that capped max watts and temperature for throttling to 90c.

System seems stable now. Can get sustained 5.4ghz without throttling and enjoying games at 120fps with 4k resolution.

Even though it is working I do feel a way about not being able to run the system at any of the advertised numbers I paid for.

JohnBooty · 2 years ago

    enable settings that capped max watts and temperature for throttling to 90c.

You were going above 90C before???

My first thought is that seems insane, but is apparently normal for that chip, according to Intel: "your processor supports up to 100°C and any temperatures below it are normal and expected"

https://community.intel.com/t5/Processors/i9-14900K-temperat...

That is just wild though. On one hand you should obviously get the performance that was advertised and that you paid for. On the other hand IMO operating a CPU at 90-100C is just insane. It really feels like utter desperation on Intel's part.

I would be curious what kind of cooling setup you have.

crote · 2 years ago

Temperatures like that have been fairly normal for a few generations now - both for Intel and AMD. It might look insane compared to what you were used to seeing a decade ago, but it's actually not that crazy.

First, the temperature sensors got a lot better. Previously you only had one sensor per core/cpu, and it was placed wherever there was space - nowadays it'll have dozens of sensors placed in the most likely hotspots. A decade ago a 70C temp meant that some parts of the CPU were closer to 90C, whereas nowadays a 90C temp means the hottest part is actually 90C.

Second, the better sensors allow more accurate tuning. While 100C might be totally fine, 120C is probably already going to cause serious damage. The problem here is that you can't just rely on a somewhat-distant sensor to always be a constant 20C below the peak value: it's also going to be lagging a bit. It might take a tenth of a second for that temp spike in the hotspot to reach the sensor, but in the time between the spike starting and the temp at the sensor raising enough to trigger a downthrottle you could've already caused serious damage. A decade ago that meant leaving some margin for safety, but these days they can just keep going right up to the limit.

It's also why overclocking simply isn't really a "thing" anymore. Previous CPUs had plenty of safety margin left for risk-takers to exploit, modern CPUs use up all that margin by automatically overclocking until it hits either a temperature limit or a power draw limit.

jcalvinowens · 2 years ago

> On the other hand IMO operating a CPU at 90-100C is just insane.

No it isn't, the manufacturer literally says it's normal! I think people who spend as much money on cooling setups as the chip are the insane ones.

My favorite story: I once put Linux on a big machine that had been running windows, and discovered dmesg was full of thermal throttling alerts. Turns out, the heatsink was not in contact with the CPU die because it had a nub that needed to occupy the same space as a little capacitor.

I'd been using that machine to play X-plane for over two years, and I never noticed. It was not meaningfully slower: the throttling events would only happen every ten or so seconds. I'm still using it today, although with the heatsink fixed :)

I have a garage machine with a ca. 2014 Haswell that's been running full tilt at 90C+ for a good bit of its life. It just won't die.

legosexmagic · 2 years ago

the amount of cooling you get is proportional to the difference of component temperature to ambient temperature. thats why modern chips are engineered to run much hotter.

dist-epoch · 2 years ago

For both Intel/AMD 100C is now a target, not a limit.

bee_rider · 2 years ago

Hey, is there a cooling solution that sprays water on some sort of heat spreader and lets it evaporate? Kidding. Kinda. But actually, is that possible?

doubled112 · 2 years ago

What I'm not happy about is the marketing around turbo boost.

You know how ISPs used to sell "up to X Mbps"? Same idea. Your chip will turbo boost "up to 6.00 GHz".

It's basically automated overclocking, and as you learned, sometimes it can't even do it in a stable fashion. Some of those chips will never clock "up to 6.00 GHz" but they didn't lie. "up to"

wtallis · 2 years ago

It's particularly bad when they stop telling you what clock speeds are achievable with more than one core active. At best these days you get a "base clock" spec that's very slow that doesn't correspond to any operating mode that occurs in real life. You used to get a table of x GHz for y active cores, but then the core counts got too large and the limits got fuzzier.

And laptops have another layer of bullshit, because the theoretical boost clocks the chip is capable of will in practice be limited by the power delivery and cooling provided by that specific machine, and the OEMs never tell you what those limits are. So they'll happily take an extra $200 for another 100MHz that you'll never see for more than a few milliseconds while a different model with a slower-on-paper CPU with better cooling can easily be more than 20% faster.

Deleted Comment

phantomwhiskers · 2 years ago

I also recently built a system with the 14900KF on an ASUS TUF motherboard and NZXT Kraken 360 cooler, and so far I haven't experienced any issues running everything at default BIOS settings (defaulted to 5.7GHz). I haven't seen temps above 70C yet, although granted I also haven't seen CPU load go above 40%, and haven't tried running any benchmarking software.

I'm curious about what you are using for cooling, as 90C at 5.4ghz seems way off compared to what I am seeing on my processor, but it could just be that I'm not pushing my processor quite as hard even with the higher clock rate.

callalex · 2 years ago

Did you try cleaning everything and re-mounting the cooler with new paste? I’ve seen similar behavior when people mess up and get bubbles in their paste. Do you see wildly different temperature readouts for different cores?

BobbyTables2 · 2 years ago

Seems scary that 10% difference in clock frequency is makes/breaks stability.

How much margin is really there?

rygorous · 2 years ago

Dynamic switching power (i.e. the fraction of the chip's power consumption from actually switching transistors, as opposed to just "being on") scales with V^2 * f, where V=voltage and f=frequency, and V in turn depends on f, where higher frequencies need higher voltage. Not really linearly (It's Complicated(tm)), but it's not a terrible first-order approximation, which makes the dynamic switching power have a roughly cubic dependency on frequency.

Therefore, 1.1x the frequency at the high end (where switching power dominates) is 1.33x the power draw.

Those final few hundred MHz really hurt. Conversely, that's also why you see "Eco" power profiles with a major reduction in power draw that cost you maybe 5-10% of your peak performance.

CooCooCaCha · 2 years ago

Do you think that happened because you had insufficient cooling?

Its hard to cool these new chips. AMD included.

kijin · 2 years ago

Even if GP's cooling setup was less than ideal, the chip should have throttled itself to a stable frequency instead of crashing left and right.

Hikikomori · 2 years ago

For different reasons though. AMD's chiplets produce heat in a small area which makes it hard to transfer heat quickly. Intel just use a shitload more power and thus more heat.

PawBer · 2 years ago

Reminds me of this Raymond Chen classic: https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35...

lostmsu · 2 years ago

I wonder why didn't they add a system crash analyzer component that would tell user their CPU is misbehaving (xor eax, eax) to save themselves some hard to debug support volume.

mhio · 2 years ago

This sounds familiar... ye olde pentium III 1.13 GHz

https://www.tomshardware.com/reviews/intel-admits-problems-p...

ManuelKiessling · 2 years ago

Unrelated to the actual topic, but kudos to the Tom’s Hardware site that they serve a 24 years old web posting flawlessly.

terrelln · 2 years ago

We also regularly run into hardware issues with Zstd. Often the decompressor is the first thing to interact with data coming over the network. Or like in this case the decompressor is generally very sensitive to bit-flips, with or without checksumming enabled, so notices other hardware problems more than other processes running on the same host.

One decision that Zstd made was to include only a checksum of the original data. This is sufficient to ensure data integrity. But, it makes it harder to rule out the decompressor as the source of the corruption, because you can't determine if the compressed data is corrupt.

mjevans · 2 years ago

Compressed data is like a backup. It's not valid until it's tested.

lifthrasiir · 2 years ago

This page doesn't seem to be linked from any other public page, so I think it was a response to unwanted complaints from users who tried to track the "oodle" thing in the error log---like SQLite back in 2006 [1].

[1] https://news.ycombinator.com/item?id=36302805

eqvinox · 2 years ago

It's linked from https://www.radgametools.com/tech.htm (click "support" at the top, look next to "Oodle" logo → "Note: If you are having trouble with an Intel 13900K or 14900K CPU, please [[read this page]].")

lifthrasiir · 2 years ago

Ooh, thank you! I looked so long at the Oodle section and skimmed other sections as well (even searched for the `oodleintel.htm` link in their source codes), but somehow missed that...

atesti · 2 years ago

There are some pages that are not linked, wondered what happened to these products

https://www.radgametools.com/granny.html https://www.radgametools.com/iggy.htm https://www.radgametools.com/milesperf.htm

rygorous · 2 years ago

Granny, Iggy and Miles are all discontinued as stand-alone products. We're still providing support to existing customers but not selling any new licenses.