jclay · a year ago
I was seeing constant instability when running basically any large C++ build that saturated all of the cores: odd clang segfaults indicating an invalid AST, which would succeed on a re-run.

This was getting very frustrating; at various points I tried every other option I found online, including restoring the BIOS to Intel Baseline settings.

I came across Keean's investigations into the matter on the Intel forums:

> I think there is an easy solution for Intel, and that is to limit P-cores with both hyper-threads busy to 5.8 GHz and allow cores with only one hyper-thread active to boost up to 5.9/6.2. They would then have a chip that matched advertised multi-core and single-thread performance, and would be stable without any specific power limits.

> I still think the real reason for this problem is that hyper-threading creates a hot spot somewhere in the address-arithmetic part of the core, and this was missed in the design of the chip. Had a thermal sensor been placed there, the chip could throttle back the core ratio to remain stable automatically; or perhaps the transistors needed to be bigger for higher current - not sure that would solve the heat problem. Ultimately an extra pipeline stage might be needed, and this would be a problem, because it would slow down when only one hyper-thread is in use too. I wonder if this has something to do with why Intel is getting rid of hyper-threading in 15th gen?

From: https://community.intel.com/t5/Processors/14900ks-unstable/m...

Based on this, I set a P-core limit of 5.8 GHz in my BIOS, and after several months of daily use building Chromium I can say this machine is now completely stable.

If you're seeing instability on an i9 14900K or 13900K, see the above forum post for more details and try setting the all-core limit. I've now seen this fix instability in 3+ of our build machines so far.
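
A rough way to verify the cap actually holds under a build load (a sketch assuming Linux and the turbostat utility from the kernel's tools; column names vary a bit across versions):

    # Sample per-core busy frequency and temperature every 2 seconds
    # while the build runs; every P-core should stay at or below 5800 MHz.
    sudo turbostat --quiet --show Core,CPU,Bzy_MHz,CoreTmp --interval 2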

jclay · a year ago
I'll also add that I was never able to get the instability to show up when running the classic stress testing tools: MemBench, Prime95, and Intel's own stability tests could all run for hours and pass.

There's something unique about the workload of ninja launching a bunch of clang processes that draws this out.

On my machine, a clean build of the llvm-project would consistently fail to complete, so that may be a reasonable workload to A/B test with if you're looking into this.
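
If you want to script that A/B test, here's a minimal sketch (the checkout path, job settings, and iteration count are my own assumptions, not a vetted repro):

    #!/bin/sh
    # Repeatedly clean-build llvm-project and note which runs fail.
    set -u
    SRC="$HOME/llvm-project"   # assumes an existing checkout
    for i in $(seq 1 10); do
        rm -rf build
        cmake -S "$SRC/llvm" -B build -G Ninja -DCMAKE_BUILD_TYPE=Release > /dev/null
        if ninja -C build > "run-$i.log" 2>&1; then
            echo "run $i: OK"
        else
            echo "run $i: FAILED (see run-$i.log)"
        fi
    done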

The user quoted above was running Gentoo builds on specific P-cores to test various solutions, ultimately finding that the P-core limit was the only fix that yielded stability.

xuejie · a year ago
Just to offer an unrelated but IMHO still interesting case: I used to have a Kioxia CM6 U.2 SSD. It would pass every benchmark the reseller was willing to run, but whenever I tried a clean Rust compile on it, the drive would almost certainly fail somewhere in the build process. While there are configurations that compile Rust against a pre-built LLVM, in my tests I was compiling LLVM along the way. So I can agree with the comment here that there may be some unique property to multi-core compilation, though my tests pointed to a faulty drive, while the comment above is about an Intel CPU.
instagib · a year ago
I set a limit for ninja/cmake to only run 4 or so cores when I was getting hang-ups during large compiles.

Rename ninja to oninja, then create an executable shell script named ninja in the ninja directory:

    #!/bin/sh
    # Forward all arguments to the real binary, capping parallelism at 4 jobs.
    exec oninja "$@" -j4
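
If the build is driven through cmake --build rather than invoking ninja directly, you can get the same cap without the wrapper (assuming CMake 3.12 or newer):

    # CMake uses this environment variable as the default parallel level
    # for `cmake --build`, so no wrapper script is needed.
    export CMAKE_BUILD_PARALLEL_LEVEL=4
    cmake --build build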

more_corn · a year ago
This is going to be bad. I’ve done enough hardware diagnostics to have an intuition about this sort of thing. It’s going to be massive.
zamadatix · a year ago
With the 13900K being nearly two years old at this point, I think the biggest factor in whether this turns into something massive is whether the issue stops with the Core Ultra 200 CPUs launching in the next several months.

Even in the best-case resolution, this is still going to be yet another black mark on Intel's reputation.

ralferoo · a year ago
I was reading some discussion of this recently, and the common factor seems to be overclocking, along with the fact that many motherboards default to overclocking instead of running at the approved clock speed.

And wherever it was I read that (sorry, I'd link if I remembered), the commenters pointing this out were basically shouted down by people saying that buyers pick these chips precisely because they believe they're better suited to overclocking...

Personally, I haven't overclocked anything since the Athlon days and the pencil-mark trick for a 10% gain, but I guess a lot of hobbyists nowadays are more interested in getting the biggest number than in having the most stable system, so overclocking is now the norm. The only way this can really be resolved is an official clarification of whether overclocking is manufacturer-supported or not.

Also, from TFA, if the problems persist after swapping the processor, that suggests a motherboard/power/settings issue to me.

treprinum · a year ago
They mention that 50% of 13900Ks and 14900Ks in workstation W680 motherboards had the same issue, so it's unlikely to be overclocking/overvolting, as those boards are made for stability.
quercusa · a year ago
Doesn't the 'K' in the CPU name signify it's been tested/binned to support overclocking?
D13Fd · a year ago
I thought it just meant the multiplier was unlocked.
zamadatix · a year ago
Just calling it "overclocking" is probably not the right way to describe it. All modern CPUs do a form of what might be called overclocking from the factory, and this issue was about the parameters feeding the voltage behaviors that drive it; it's not really about unsupported clock speeds, just about holding those clock speeds in the wrong ways. Intel made an effort to avoid calling it plain overclocking in its release on the matter, as that typically means directly setting the clock higher, not all of the other things too. One more clarification: overclocking is never something that's supported by any of the CPU vendors; it's just something the CPU has the capability to do instead of being locked. Some of these settings (such as disabling C-states) should definitely be something Intel supports, though, as that disables a power-saving feature for inactivity rather than pushing the CPU further above its nominal limits during turbo.
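
(As an aside, on Linux you can sanity-check whether C-states are actually enabled; a sketch assuming the standard sysfs cpuidle layout:)

    # List the idle states the kernel exposes for core 0; seeing only
    # POLL, or no cpuidle directory at all, suggests C-states are off.
    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name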

While fixing these settings definitely cleans up a lot of issues, Intel also noted it doesn't consider them the root cause of the overall stability problem, just something problematic it noticed causing a lot of issues while digging in.

Original statement by Intel on this topic back in April:

"Intel® has observed that this issue may be related to out of specification operating conditions resulting in sustained high voltage and frequency during periods of elevated heat.

Analysis of affected processors shows some parts experience shifts in minimum operating voltages which may be related to operation outside of Intel® specified operating conditions.

While the root cause has not yet been identified, Intel® has observed the majority of reports of this issue are from users with unlocked/overclock capable motherboards.

Intel® has observed 600/700 Series chipset boards often set BIOS defaults to disable thermal and power delivery safeguards designed to limit processor exposure to sustained periods of high voltage and frequency, for example:

– Disabling Current Excursion Protection (CEP)
– Enabling the IccMax Unlimited bit
– Disabling Thermal Velocity Boost (TVB) and/or Enhanced Thermal Velocity Boost (eTVB)

Additional settings which may increase the risk of system instability:

– Disabling C-states
– Using Windows Ultimate Performance mode
– Increasing PL1 and PL2 beyond Intel® recommended limits

Intel® requests system and motherboard manufacturers to provide end users with a default BIOS profile that matches Intel® recommended settings.

Intel® strongly recommends customer's default BIOS settings should ensure operation within Intel's recommended settings.

In addition, Intel® strongly recommends motherboard manufacturers to implement warnings for end users alerting them to any unlocked or overclocking feature usage.

Intel® is continuing to actively investigate this issue to determine the root cause and will provide additional updates as relevant information becomes available.

Intel® will be publishing a public statement regarding issue status and Intel® recommended BIOS setting recommendations targeted for May 2024."

elric · a year ago
I was hoping that Intel would get back on track with Pat Gelsinger at the helm, but so far things have turned out to be pretty lacklustre. i9 stability issues are the last thing it needs right now.
karmakaze · a year ago
Ever since the poor initial rollout of the i9 I've never even considered getting one. The performance/price always seemed to be better with the i7. Even the i7 is a second choice for me if an AMD CPU isn't offered.
chrisandchris · a year ago
I was close to buying an i7 or i9 for a build server. Then I stumbled across those reports, and now I'm considering moving to AMD, like a Ryzen 79xx.
Apreche · a year ago
Exactly. I find it hilarious that the problem is only with the i9s. The i9s cost hundreds more than the i7s and don't seem to be significantly better for any real-world usage.

There are just those people out there who have an entire hobby of getting the biggest benchmark number possible. Most of them don’t even use their computers to actually do anything. Intel makes the i9 only to extract more cash from them.

IMO both Intel and the people foolish enough to buy an i9 deserve this problem, and each other.

jauntywundrkind · a year ago
Warframe saying 80% of crash reports come from affected CPUs is horrible. https://www.tomshardware.com/pc-components/cpus/warframe-dev...
icf80 · a year ago
This looks like general memory corruption generated by the CPU.

Deleted Comment