This is super cool. This exploit will become one of the canonical examples that just running something in a VM does not mean it's safe. We've always known about VM breakouts, but this is a massive no-breakout exploit that is simple to execute and gives big payoffs.
Remember: just because this one bug gets fixed in microcode doesn't mean there's not another one of these waiting to be discovered. Many (most?) 0-days are known about by black-hats-for-hire well before they're made public.
The problem is, VMs aren't really "Virtual Machines" anymore. You're not parsing opcodes in a big switch statement, you're running instructions on the actual CPU, with a few hardware flags that the CPU says will guarantee no data or instruction overlap. It promises! But that's a hard promise to make in reality.
Just how many times is the average operating system workload (with or without a virtual machine also running a second average operating system workload) context switching a second?
Like... unless I'm wrong... the kernel is the main process, and then it slices up processes/threads, and each time those run, they have their own EAX/EBX/ECX/ESP/EBP/EIP/etc. (I know it's RAX, etc. for 64-bit now)
How many cycles is a thread/process given before it context switches to the next one? How is it managing all of the pushfd/popfd, etc. between them? Is this not how modern operating systems work, am I misunderstanding?
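Roughly, yes: on every switch the kernel saves the outgoing task's architectural registers and restores the incoming one's. A toy round-robin sketch of that bookkeeping (Python standing in for what kernels actually do in asm; all names here are made up for illustration):

```python
# Toy model of a context switch: each "task" owns a saved copy of the
# architectural register state, and the "kernel" swaps it in and out of the
# single physical register set on every switch (real kernels do this in asm,
# e.g. saving RAX..R15, RFLAGS and RIP into the task's kernel stack).

REGISTERS = ["rax", "rbx", "rcx", "rsp", "rbp", "rip", "rflags"]

class Task:
    def __init__(self, name):
        self.name = name
        # Register state held while the task is not running.
        self.saved = {r: 0 for r in REGISTERS}

cpu = {r: 0 for r in REGISTERS}   # the one live register file

def context_switch(prev, nxt):
    prev.saved = dict(cpu)        # save outgoing task's registers
    cpu.update(nxt.saved)         # restore incoming task's registers

a, b = Task("A"), Task("B")
cpu["rax"] = 111                  # task A computes something
context_switch(a, b)
cpu["rax"] = 222                  # task B reuses the same register freely
context_switch(b, a)
assert cpu["rax"] == 111          # A's value came back intact
```

On Linux the timeslice is dynamic rather than a fixed cycle count (on the order of milliseconds under CFS), and a busy machine can easily context switch thousands to tens of thousands of times per second; `vmstat 1` shows the live count in the `cs` column.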
The comparison to Meltdown/Spectre is a bit misleading though - they were a whole new form of attack based on timing, where the CPU did exactly what it should have done. This Zenbleed case is a good old-fashioned bug: data in a register that shouldn't be there.
Running untrusted code whether in a sandbox, container, or VM, has not been safe since at least Rowhammer, maybe before. I believe a lot of these exploits are down to software and hardware people not talking. Software people make assumptions about the isolation guarantees, hardware people don't speak up when said assumptions are made.
Hardware people are the ones making those promises, so I don't think that's right at all. And Rowhammer is a way overstated vulnerability - there are all sorts of practical issues with it, especially if you're on modern, patched hardware.
In the end, I'm thinking most of these are related to branch prediction?
It strikes me that either branch prediction is so inherently complex that it's always going to be vulnerable to this, and/or it so defies the way most of us intuitively think about code paths and instruction execution that it's hard to conceive of the edge cases until it's too late.
At what point does the complexity of CPU architectures become so difficult to reason about that we just accept the performance penalty of keeping it simpler?
More generally, most of them are related to speculative execution, where branch mis-prediction is a common gadget to induce speculative mis-execution.
Speculation is hard. It's akin to introducing multithreading into a program: you are explicitly choosing to tilt at the windmill of pure technical correctness, because in a highly concurrent application every error will occur fairly routinely. Speculation is also great: in combination with out-of-order execution it's a multithreading-like boon to overall performance, because now you can resolve several chunks of code in parallel instead of one at a time. It's just a minefield of correctness issues as well, and the alternative would be losing something like ten years of performance gains (going back to roughly ARM Cortex-A53 levels).
The recent thing is that "observably correct" needs to include timings. If you can just guess at what the data might be, and the program runs faster if you're correct, that's basically the same thing as reading the data by another means. It's a timing oracle attack.
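The timing-oracle idea in miniature, using an early-exit comparison instead of speculation (purely an analogy for "correct result, leaky timing", not the Zenbleed mechanism; loop iterations stand in for wall-clock time):

```python
# An early-exit comparison leaks through how long it runs: the more leading
# bytes of a guess are correct, the more loop iterations execute before the
# mismatch. Counting iterations here is a stand-in for the timing a real
# attacker would measure.

SECRET = b"hunter2"   # hypothetical secret the "victim" compares against

def compare(guess):
    steps = 0
    for g, s in zip(guess, SECRET):
        if g != s:
            break
        steps += 1
    return steps      # proxy for elapsed time

# Recover the secret one byte at a time by picking whichever candidate
# makes the comparison run longest.
recovered = b""
for _ in range(len(SECRET)):
    best = max(range(256), key=lambda c: compare(recovered + bytes([c])))
    recovered += bytes([best])

assert recovered == SECRET
```

The comparison always returns a "correct" answer; the secret escapes entirely through the running time, which is why observable correctness has to include timing.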
(in this case AMD just fucked up though, there's no timing attack, this is just implemented wrong and this instruction can speculate against changes that haven't propagated to other parts of the pipeline yet)
The cache is the other problem, modern processors are built with every tenant sharing this single big L3 cache and it turns out that it also needs to be proof against timing attacks for data present in the cache too.
> At what point does the complexity of CPU architectures become so difficult to reason about that we just accept the performance penalty of keeping it simpler?
Never for branch prediction. It just gets you too much performance. If it becomes too much of a problem, the solution is greater isolation of workloads.
>At what point does the complexity of CPU architectures become so difficult to reason about that we just accept the performance penalty of keeping it simpler?
Basically never for anything that's at all CPU-bound: that growth in complexity is really the only thing that's been powering single-threaded CPU performance improvements since Dennard scaling stopped in about 2006 (and by that time they were already plenty complex: by the late 90s and early 2000s, x86 CPUs were firmly superscalar, out-of-order, branch-predicting, speculatively executing devices). If your workload can be made fast without needing that stuff (i.e. no branches and easily parallelised), you're probably using a GPU instead nowadays.
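For a sense of why branch prediction is too valuable to give up: even a toy 2-bit saturating-counter predictor, a tiny fraction of what a real front end does, gets typical loop branches almost perfectly right:

```python
import random

# Minimal 2-bit saturating-counter branch predictor (states 0-3; predict
# "taken" when the counter is >= 2). Real predictors layer history tables,
# tags and indirect-target prediction on top, but this is the core idea.

def predict(history):
    counter, correct = 2, 0            # start in "weakly taken"
    for taken in history:
        if (counter >= 2) == taken:
            correct += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(history)

# A loop branch: taken 99 times, then falls through once, repeated.
loop = ([True] * 99 + [False]) * 100
print(f"loop-branch accuracy: {predict(loop):.1%}")    # ~99% correct

# Truly random branches defeat any predictor.
random.seed(0)
rand = [random.random() < 0.5 for _ in range(10_000)]
print(f"random-branch accuracy: {predict(rand):.1%}")  # ~50% correct
```

That near-99% hit rate on regular control flow is what keeps deep pipelines full, which is why vendors mitigate around prediction rather than remove it.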
You can rent one of the Atom Kimsufi boxes (N2800) to experience first-hand a CPU with no speculative execution. The performance is dire, but at least it hasn't gotten worse over the years - they are immune to just about everything.
We demanded more performance and we got what we demanded. I doubt manufacturers are going to walk back on branch prediction no matter how flawed it is. They'll add some more mitigations and features which will be broken-on-arrival.
There's VLIW / EPIC / some other technical name I forget for architectures which instead ask you to explicitly schedule instruction/data/branch prediction. If I remember right, the two biggest examples were IA-64 and Alpha. I want to say HP PA-RISC did the same, but I'm not clear on that one.
For various reasons, all these architectures eventually lost out to market pressure (and cost/watt/IPC, I guess).
Yup! I worked at a few companies that would co-mingle Internet-facing/DMZ VMs with internal VMs. When pointing this out and recommending we airgap these VMs onto their own dedicated hypervisor, it always fell on deaf ears. Joke's on them, I guess.
The problem is that the logical registers don't have a 1:1 relation to the physical registers.
For example, let's imagine a toy architecture with two registers: r0 and r1. We can create a little assembly snippet using them: "r0 = load(addr1); r1 = load(addr2); r0 = r0 + r1; store(addr3, r0)". Pretty simple.
Now, what happens if we want to do that twice? Well, we get something like "r0 = load(addr1); r1 = load(addr2); r0 = r0 + r1; store(addr3, r0); r0 = load(addr4); r1 = load(addr5); r0 = r0 + r1; store(addr6, r0)". Because there is no overlap between the accessed memory sections, they are completely independent. In theory they could even execute at the same time - but that is impossible because they use the same registers.
This can be solved by adding more physical registers to the CPU, let's call them R0-R6. During execution the CPU can now analyze and rewrite the original assembly into "R1 = load(addr1); R4 = load(addr4); R2 = load(addr2); R5 = load(addr5); R3 = R1 + R2; R6 = R4 + R5; store(addr3, R3); store(addr6, R6)". This means we can now start the loads for the second addition before the first addition is done, which means we have to wait less time for the data to arrive when we finally want to actually do the second addition. To the user nothing has changed and the results are identical!
The issue here is that when entering/exiting a VM you can definitely clear the logical registers r0 and r1, but there is no guarantee that you are actually clearing the physical registers. On a hardware level, "clearing a register" now means "mark the logical register as empty". The CPU makes sure that any future use of that logical register behaves as if it had been cleared, but there is no need to touch the contents of the physical register. It just gets marked as "free for use". After all, the only way that physical register becomes available again is via a write, and that write would by definition overwrite the stale contents - so clearing it would be pointless. Unless your CPU misbehaves and you run into this new bug, of course.
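The renaming scheme above can be sketched as a toy simulator. Note how "clearing" a logical register only redirects the mapping, leaving the stale value sitting in the physical register file, which is exactly the kind of state this bug manages to observe (the structure is illustrative, not AMD's actual implementation):

```python
# Toy register renamer: logical registers r0/r1 are just names mapped onto a
# larger physical register file via a register alias table (RAT). Writes
# allocate a fresh physical register; "clearing" merely repoints the name at
# a zero register and puts the old entry, data intact, on the free list.

PHYS = 8                       # physical registers R0..R7
phys = [0] * PHYS              # the physical register file
free = list(range(1, PHYS))    # R0 is reserved as an always-zero register
rat = {"r0": 0, "r1": 0}       # register alias table: logical -> physical

def write(reg, value):
    p = free.pop(0)            # allocate a fresh physical register
    phys[p] = value
    old, rat[reg] = rat[reg], p
    if old != 0:
        free.append(old)       # freed physical reg: contents NOT erased

def read(reg):
    return phys[rat[reg]]

def clear(reg):
    old, rat[reg] = rat[reg], 0   # point at the zero register; no data touched
    if old != 0:
        free.append(old)

write("r0", 0xC0FFEE)          # guest puts data in r0
clear("r0")                    # VM exit: architecturally r0 is now zero...
assert read("r0") == 0
assert 0xC0FFEE in phys        # ...but the stale value still exists physically
```

Normally that stale entry is unobservable because any reallocation overwrites it before use; Zenbleed is a case where a mispredicted `vzeroupper` lets it become visible again.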
The problem is the freed entries in the register file. A VM can, at least, use this bug to read registers from a non-VM thread running on the adjacent SMT/HT of a single physical core. I suspect a VM could also read registers from other processes scheduled on the same SMT/HT.
The README in the tar file with the exploit (linked at "If you want to test the exploit, the code is available here") contains some more details, including a timeline:
- `2023-05-09` A component of our CPU validation pipeline generates an anomalous result.
- `2023-05-12` We successfully isolate and reproduce the issue. Investigation continues.
- `2023-05-14` We are now aware of the scope and severity of the issue.
- `2023-05-15` We draft a brief status report and share our findings with AMD PSIRT.
- `2023-05-17` AMD acknowledge our report and confirm they can reproduce the issue.
- `2023-05-17` We complete development of a reliable PoC and share it with AMD.
- `2023-05-19` We begin to notify major kernel and hypervisor vendors.
- `2023-05-23` We receive a beta microcode update for Rome from AMD.
- `2023-05-24` We confirm the update fixes the issue and notify AMD.
- `2023-05-30` AMD inform us they have sent a SN (security notice) to partners.
- `2023-06-12` Meeting with AMD to discuss status and details.
- `2023-07-20` AMD unexpectedly publish patches, earlier than an agreed embargo date.
- `2023-07-21` As the fix is now public, we propose privately notifying major distributions that they should begin preparing updated firmware packages.
- `2023-07-24` Public disclosure.
This is incredibly scary. On my Zen 2 box (Ryzen 3600), logging the output of the exploit running as an unprivileged user while copying and pasting a string into a text editor in the background (I used Kate) resulted in pieces of the string being logged in the output of zenbleed. And this is after a few seconds of runtime, mind you, not even a full minute.
Thankfully the exploit is highly dependent on a specific asm routine so exploiting it from JS or WASM in a browser should be extremely difficult. Otherwise a nefarious tab left open for hours in the background could exfiltrate without an issue.
I'm eagerly waiting for Fedora maintainers to push the new microcode so the kernel can update it during the boot process.
> Thankfully the exploit is highly dependent on a specific asm routine so exploiting it from JS or WASM in a browser should be extremely difficult.
I assume that once/if a method is found it will be applicable broadly though. At the same time, hopefully software patches in V8 and SpiderMonkey will be able to mitigate this further and sooner.
But a JS exploit would require some way to exfiltrate data and presumably doing that would be quite difficult to hide entirely.
I had to run make on the uncompressed folder. Perhaps the build-essential package doesn't come with NASM in Ubuntu? I'll need a bit more info on the error if you want me to try and help you :)
> AMD have released a microcode update for affected processors.
I don't think that is correct. AMD has released a microcode update[0] for family 17h models 0x31 and 0xa0, which corresponds to Rome, Castle Peak and Mendocino as per WikiChip [1].
So far, there seems to be no microcode update for Renoir, Grey Hawk, Lucienne, Matisse and Van Gogh. Fortunately, the newly released kernels can and do simply set the chicken bit for those. [2]
This technique is CVE-2023-20593 and it works on all Zen 2 class processors, which includes at least the following products:
AMD Ryzen 3000 Series Processors
AMD Ryzen PRO 3000 Series Processors
AMD Ryzen Threadripper 3000 Series Processors
AMD Ryzen 4000 Series Processors with Radeon Graphics
AMD Ryzen PRO 4000 Series Processors
AMD Ryzen 5000 Series Processors with Radeon Graphics
AMD Ryzen 7020 Series Processors with Radeon Graphics
AMD EPYC “Rome” Processors
Do they mean "only confirmed on Zen2", or is the problem definitely confined to only this architecture?
Is it likely that this same technique (or similar) also works on earlier (Zen/Zen+) or later (Zen3) cores, but they just haven't been able to demonstrate it yet?
I mean, the PS5 is running a Zen 2 processor [0] so I would assume it's vulnerable. In general I would assume that AAA games are safe. Websites and smaller games made by malefactors will be the issue. (Note that AAA game makers have little interest in antagonizing the audience, OTOH they also will push limits to install anti-cheat mechanisms. On balance I'd trust them.)
You are likely frequently running untrusted workloads. As javascript in a browser. I don't know about this one, but at least meltdown was fully exploitable from js.
It is a simple static HTML page; how is it possible in 2023 for a static site to be hugged to death? In most cases HN traffic barely hits 100 page views per second.
It's a security writeup so it's probably run by a security expert who is not an expert at running high traffic websites. Most likely there is something on the page that causes a database hit. Possibly the page content itself.
According to AMD's security bulletin, firmware updates for non-EPYC CPUs won't be released until the end of the year. What should users do until then, set the chicken bit and take the performance hit?
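For reference, the interim mitigation described in the disclosure is setting bit 9 (the "chicken bit") of AMD's DE_CFG MSR, 0xC0011029, which switches the affected optimization off. A sketch of the mask arithmetic, with the msr-tools one-liner from the writeup shown as a comment:

```python
# Zenbleed chicken bit: bit 9 of AMD's DE_CFG MSR (0xC0011029).
# Setting it disables the affected optimization at some performance cost.

DE_CFG_MSR = 0xC0011029
ZENBLEED_CHICKEN_BIT = 1 << 9          # == 0x200

current = 0x0                          # placeholder: really `rdmsr 0xc0011029`
patched = current | ZENBLEED_CHICKEN_BIT
assert patched & ZENBLEED_CHICKEN_BIT  # bit is now set

# Equivalent root-only shell one-liner with msr-tools, per the disclosure:
#   wrmsr -a 0xc0011029 $(( $(rdmsr -c 0xc0011029) | (1 << 9) ))
```

Recent kernels (Linux 6.4.6 and the stable backports) apply this bit automatically on affected Zen 2 parts that lack fixed microcode, so manually poking the MSR should only be needed on older kernels.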
Presumably classified as severity 'medium' in an attempt to look marginally less negligent when announcing that they can't be bothered to issue microcode updates for most CPU models until Nov or Dec.
Under what circumstances is this not a medium? The only case where this applies is if you have public runners running completely untrusted code, and if you're doing that, I hope you're doing it on EPYC, which is fixed. And if you're doing that, you're probably mining crypto for randoms.
IBM's VM was and is a hypervisor. It dates to the mid 1960s, in the form of CP-40, and it didn't run opcodes in software, but in hardware.
https://en.wikipedia.org/wiki/IBM_CP-40
p-code machines, which interpret bytecode, date back almost as far, such as the O-code machine for BCPL.
https://en.wikipedia.org/wiki/BCPL
Getting people to distinguish between these concepts is probably a lost cause.
Wouldn't we be able to avoid the "big payoffs" of no-breakout exploits if we had specialized hardware handle the secrets?
> As the fix is now public, we propose privately notifying major distributions that they should begin preparing updated firmware packages.
AMD had to drop the ball somewhere, didn't it.
amd-ucode 20230625.ee91452d-5
last updated 2023-07-25 11:48 UTC
Contains the microcode update that addresses this?
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/lin... says that the fixed version is 2023-07-18, but the amd-ucode version in Arch is 20230625, even though it was last updated on 2023-07-25.
My guess is that this is still getting the 20230625 firmware, per the PKGBUILD at https://gitlab.archlinux.org/archlinux/packaging/packages/li...
Which contains those lines
_tag=20230625
source=("git+https://git.kernel.org/pub/scm/linux/kernel/git/firmware/lin...")
I suppose that it isn't up to date and thus Arch Linux is still vulnerable, right?
edit:
but actually there's two commits in the _backports array (which contains cherry-picked commits) that was last edited 20 hours ago
https://gitlab.archlinux.org/archlinux/packaging/packages/li...
Which is 0bc3126c9cfa0b8c761483215c25382f831a7c6f and b250b32ab1d044953af2dc5e790819a7703b7ee6
And b250b32ab1d044953af2dc5e790819a7703b7ee6 appears to be the commit I linked earlier at git.kernel.org, so hopefully up-to-date Arch is not vulnerable to zenbleed.
Either way, as noted elsewhere in the comments, only the Rome CPU series has received updated microcode with fixes. All other Zen 2 users need the fix that was released as part of Linux 6.4.6: https://lwn.net/Articles/939102/
(which has been built and packaged for Arch)
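To verify which microcode revision actually got loaded after an update, you can read the `microcode` field from `/proc/cpuinfo`. A parsing sketch, run against a sample string so it works anywhere (the sample values are made up):

```python
# Check the running microcode revision on Linux by parsing /proc/cpuinfo.
# Shown against an inline sample so this runs anywhere; on a real machine,
# feed it the actual file contents instead.

import re

SAMPLE = """\
processor\t: 0
vendor_id\t: AuthenticAMD
model name\t: AMD EPYC 7702 64-Core Processor
microcode\t: 0x830104d
"""

def microcode_rev(cpuinfo_text):
    """Return the first reported microcode revision as an int, or None."""
    m = re.search(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.M)
    return int(m.group(1), 16) if m else None

rev = microcode_rev(SAMPLE)
print(hex(rev))                # -> 0x830104d

# On a live system:
#   rev = microcode_rev(open("/proc/cpuinfo").read())
```

Comparing that value against the `good_rev` table in the kernel commit linked elsewhere in the thread tells you whether your part got the fixed microcode or is relying on the chicken-bit fallback.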
At least one commenter here claims to be able to reproduce this with JavaScript: https://news.ycombinator.com/item?id=36849767
https://news.ycombinator.com/item?id=36838511
[0] https://git.kernel.org/pub/scm/linux/kernel/git/firmware/lin...
[1] https://en.wikichip.org/wiki/amd/cpuid#Family_23_.2817h.29
[2] https://github.com/torvalds/linux/commit/522b1d69219d8f08317...
`good_revs` as per the kernel: https://github.com/torvalds/linux/commit/522b1d69219d8f08317...
Currently published revs ("Patch") (git HEAD):
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/lin...
As of this writing, only two of the five `good_rev`s have been published.
That's the same codename Intel used for Celerons 24 years ago, the ones famous for 50% overclocks:
https://ark.intel.com/content/www/us/en/ark/products/codenam...
And also Xbox, and that thing from Valve?
0 - https://blog.playstation.com/2020/03/18/unveiling-new-detail...
I have an "AMD Ryzen 9 5950x Desktop Processor" which appears to be Zen 3. I think I'm good?
(Not that I'm running untrusted workloads, but yknow, fortune favors the prepared)
But yes, you are fine, 5950x is Zen3.
The above are desktop. If they meant APUs, it would list "Ryzen 3000 Series Processors with Radeon Graphics."
It's a single-core 128 MB VPS, which seemed fine for my boring static html articles. I guess I underestimated the interest.