BeeOnRope · 3 years ago
What I find odd is that after the initial Spectre attacks, there have been a long string of these attacks discovered by outside researchers and then patched by the chipmakers.

In principle it seems like the chipmakers should hold all the cards when it comes to discovery: they are experts in speculative execution, know exactly how their chips work and have massive existing validation suites, simulators and internal machine-readable specifications for the low-level operations of these chips.

Outside researchers need to reverse-engineer all this by probing a black box (plus a few much-worse-than-insider sources like patents).

Yet years after the initial disclosures it's still random individuals or groups who are discovering these? Perhaps pre-Spectre this attack vector wasn't even considered, but once the general mechanism was obvious did the chip-makers not simply set their biggest brains down and say "go through this with a fine-toothed comb looking for other Spectre attacks"?

Maybe they did and are well aware of all these attacks but to save face and performance hits they simply hold on to them hoping nobody makes them public?
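(For readers who haven't seen one: the general shape researchers comb for is the classic Spectre v1 "bounds check bypass" gadget. Here is a minimal sketch in C, with hypothetical array names; it is not a working exploit, and architecturally it behaves correctly — the vulnerability is only in the speculative cache footprint, which this sketch does not demonstrate.)

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative Spectre v1 "bounds check bypass" gadget. Names are
 * hypothetical and this is NOT a working exploit: architecturally,
 * out-of-bounds indices just return 0. The problem is that a branch
 * predictor trained on in-bounds indices can speculatively execute the
 * array2 load with an attacker-chosen out-of-bounds x, leaving a
 * secret-dependent cache line hot even after the mis-speculation is
 * rolled back. */
uint8_t array1[16];
uint8_t array2[256 * 512]; /* probe buffer: one cache line per byte value */
size_t array1_size = 16;

uint8_t victim(size_t x) {
    if (x < array1_size) {              /* predicted taken after training */
        return array2[array1[x] * 512]; /* secret-dependent memory access */
    }
    return 0;
}
```

An attacker then times loads from the probe buffer to see which line became hot; variations on that theme are what the whole family of attacks shares.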

infinityio · 3 years ago
This could be a case of survivorship bias - we don't know how many spectre-like bugs did get patched, because they never made it to the public
BeeOnRope · 3 years ago
I considered this, but we have pretty good evidence that the chipmakers have not been busily secretly patching Spectre attacks:

1) Microcode updates are visible and Spectre fixes are hard to hide: most have performance impacts and most require coordination from the kernel to enable or make effective (which are visible for Linux). There have been a large number of microcode changes tied to published attacks and corresponding fixes, but no corresponding "mystery" updates and hidden kernel fixes to my knowledge which have a similar shape to Spectre fixes.

It's possible they could wait to try to bundle these fixes into a microcode update that arrives for another reason, but the performance impacts and kernel-side changes are harder to hide.

2) If this were the case, we'd expect independent researchers to be at least in part re-discovering these attacks, rather than finding completely new ones. This would lead to cases where an attack was already resolved in a released microcode version. To my knowledge this hasn't really happened.

hinkley · 3 years ago
Or the problem could be with methodology (the wrong people are in charge, or the right people left), and so the mindset for testing is just wrong.

Also you’re dealing with a company that has been running to stand still for a long time. There’s been a lot of pressure to meet numbers that they simply cannot keep up with. At some point people cheat, unconsciously or consciously, to achieve the impossible.

api · 3 years ago
We also don't know how many are still out there unreported and part of the secret zero-day caches of various intelligence agencies.
zamadatix · 3 years ago
The logic in this comment rubs me the wrong way. You could use the same train of thought to postulate programmers that have made 2 memory safety errors are nefarious instead of simply human.

When billions use something I expect them to find more problems, flaws, and exploits in it than the creator/manufacturer did. The presence of this does nothing to indicate (or refute) any further conclusion about why.

BeeOnRope · 3 years ago
I think the comparison between CPU and software exploits holds at a very high level, but in the case of software the gap between internal and external researchers seems lower. Much software is open source, in which case the playing field is almost level, and even closed-source software is available as assembly, which exposes the entire attack surface in a reasonably consumable form.

Software reverse-engineering is a hugely popular, fairly accessible field with good tools. Hardware not so much.

> When billions use something I expect them to find more problems, flaws, and exploits in it than the creator/manufacturer did. The presence of this does nothing to indicate (or refute) any further conclusion about why.

To be very clear, none of these errors have been found by billions of random users, but by a few interested third parties: many of them working as students with microscopic funding levels and no apparent inside information.

I'm not actually suggesting that the nefarious explanation holds: I'm genuinely curious.

mike_hearn · 3 years ago
One possibility nobody mentioned yet: the chip vendors don't invest a ton of time looking for them because they don't actually matter that much.

Bear in mind, security researchers are incentivized to find things to build their reputation. It's very often the case that they claim something is a world-shaking security vulnerability when in reality it doesn't matter much for real world attackers. Has anyone ever found a speculation attack in the wild? I think the answer might be no. In which case, why would chip vendors invest tons of money into this? Real customers aren't being hurt by it except in the sense that when an external researcher forces action, they're made to release new microcode that slows things down. Note how all their mitigations for these attacks always have off switches: not something you usually see in security fixes. It's because in many, many cases, these attacks just don't matter. All software running on the same physical core or even the same physical CPU is either running at the same trust level, or sandboxed so heavily it can't mount the attack.

dijit · 3 years ago
> Has anyone ever found a speculation attack in the wild? I think the answer might be no.

this is known as the Y2K paradox.

The Y2K bug had the potential to be very dangerous, but due to a wide-reaching campaign and tonnes of investment in prevention, the millennium came with very few issues (though there were still some). This led many to speculate that the problem was overblown.

yencabulator · 3 years ago
> Note how all their mitigations for these attacks always have off switches: not something you usually see in security fixes. It's because in many, many cases, these attacks just don't matter.

They have off switches because they can have severe performance costs, which most security fixes don't have.

> All software running on the same physical core or even the same physical CPU is either running at the same trust level, or sandboxed so heavily it can't mount the attack.

The former is simply not true. For example, EC2 avoids mixing tenants on the same core, but even the AWS serverless stuff doesn't do that, for cost reasons. For most end-user computers, this is blatantly not true.

Operating system level sandboxing is largely irrelevant if the attacker can simply read secrets from other processes running on the same core at the CPU level. Most processes will have a way to smuggle the stolen goods out.
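The "read secrets from other processes at the CPU level" step usually ends in a cache-timing measurement. A sketch of that decoding step, with simulated latencies (a real attack would time actual loads, e.g. with rdtsc):

```c
/* Flush+Reload-style decode shared by Spectre-family attacks: after a
 * victim's speculative access pulls one probe line (per possible secret
 * byte value) into the cache, the attacker times a load from each line
 * and picks the fast one. Latencies here are simulated, illustrative
 * numbers only. */
enum { HIT_NS = 40, MISS_NS = 300 };

int recover_byte(const int probe_ns[256]) {
    int best = 0;
    for (int i = 1; i < 256; i++)
        if (probe_ns[i] < probe_ns[best])
            best = i;
    return best; /* index of the cache-hot line = leaked byte value */
}
```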

l33t7332273 · 3 years ago
Software running on the same CPU as the kernel is not necessarily at the same trust level.
toyg · 3 years ago
Maybe they're simply victims of Kernighan's Law of Debugging: "Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?"

There is no doubt that Intel make chips "as clever as they can". Hence, by definition, they can't fully debug them.

djbusby · 3 years ago
If debugging is what we call it when fixing things, does that mean we're "bugging" when we make it?
weebull · 3 years ago
As a CPU engineer, I can say that spectre highlighted channels of information escape that weren't previously considered as vulnerable. That's why it kicked off a new batch of exploits. There was a new idea at the heart of it and others built on that idea.

It's also important to say that these are not bugs. The design is behaving as intended. That the performance differs based on the previous code the CPU has executed was understood, and deemed acceptable because the cost of the alternative (in power, performance, or area) was considered too high. That's what it means to be an engineer: you weigh up alternatives and make a choice.

In this case, CPUs became fast enough that the fractional part of a bit leaked per iteration became high enough bandwidth to be exploitable, but it needed someone to demonstrate it for it to be understood in the industry. That changed the engineering decision.
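That bandwidth point can be made concrete with rough arithmetic (all numbers are illustrative guesses, not measurements):

```c
/* Back-of-envelope side-channel bandwidth: even a tiny fraction of a
 * bit leaked per probe iteration becomes usable once the iteration
 * rate is high enough. */
double leak_bits_per_sec(double iters_per_sec, double bits_per_iter) {
    return iters_per_sec * bits_per_iter;
}
```

At a hypothetical 10^4 probe iterations per second leaking 10^-3 bit each, that's roughly 10 bit/s, which is marginal; at 10^6 iterations per second the same per-iteration leak yields about 1000 bit/s, enough to pull a 2048-bit key in a couple of seconds.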

bsder · 3 years ago
The problem is that nobody wants to admit that the old, stodgy mainframe guys were right 30 years ago and that sharing anything always results in an exfiltration surface.

Nobody wants to be the first to take proper steps because they have to either:

1) Partition hardware properly so that users are genuinely isolated. This costs silicon area that nobody wants to pay for.

2) Stop all the speculative bullshit. This throws performance into rewind and will put chip performance back a decade or two that nobody wants to pay for.

Until people are willing to value security enough to put real money behind it, this will continue ad nauseam.

tracker1 · 3 years ago
I'd add that the status quo has done pretty well, and many of these exploits are fixed. It's also worth noting that a lot of the exploits in question may be known internally, but the people working on them couldn't theorize a practical attack. How many web browser sandbox breaches have there been over the years? Far fewer than the CPU exploits in the past several years. The latter can have a much bigger impact, though.

The biggest risk target seems to be shared servers, and you often don't know who you're sharing with, so is it worth trying? Usually, it seems, no; against a specific target, maybe.

pnut · 3 years ago
Bad news, 1993 was 30 years ago. Maybe 50+ years ago at this point.
hackermatic · 3 years ago
A fundamental problem is that the attack surface is so, so huge. Even if their security researchers are doing blue-sky research on both very small and very broad areas of processor functionality, they're going to miss a lot.

And in line with that and

>Maybe they _did_ and are well aware of all these attacks but to save face and performance hits they simply hold on to them hoping nobody makes them public?

... maybe they have patched a number of issues and just never announced them.

BeeOnRope · 3 years ago
> A fundamental problem is that the attack surface is so, so huge. Even if their security researchers are doing blue-sky research on both very small and very broad areas of processor functionality, they're going to miss a lot.

Sure. If they had patched a bunch of Spectre vulnerabilities and independent researchers had discovered a few more, that would be one thing; but as far as I can tell they have patched _zero_ while independent researchers have found many, and it has been years since the initial attack. Many of these follow very similar patterns, and "in what cases is protected data exposed via speculative execution" is something that an architect or engineer could definitely assess.

yencabulator · 3 years ago
Generally all these workarounds have a measurable slowdown associated with them. This mitigation apparently has an up to 50% cost. It's unlikely many of them have been silently fixed without people noticing.
whoisthemachine · 3 years ago
It's the same as any product: the product team wants a faster, cheaper product, yesterday. Security and trust are secondary, because if you're lucky enough, that will fall on the next product team.

Beyond that, processors contain billions (or trillions?) of possible outcomes from a set of inputs. Testing for all of these just to verify reliability and stamp out logic bugs is really hard due to the combinatorial explosion. Putting security testing on top just complicates matters further. The best they can probably do is map out potential ways in which their general-purpose processors could be used for specific nefarious purposes.

dougall · 3 years ago
Yeah... I don't know if you saw Rodrigo Branco's damning "The Microarchitectures That I Saw And The Ones That I Hope To One Day See":

https://www.youtube.com/watch?v=WlcQrx7VK00

https://hardwear.io/usa-2023/presentation/the-microarchitect...

But it definitely seems to be a culture/disclosure problem.

(Also, hi - hope things are going well! We miss you on Mastodon)

yyyk · 3 years ago
The chipmakers don't have an incentive to look too hard for speculation security issues beyond a bit of PR. If they succeed, they lose money and market share, while their 'insecure' competitors gain; at most they'd ship patches later. And in fairness, a lot of these bugs are rather theoretical. Until buyers take these bugs much more seriously, this isn't going to change.
peddling-brink · 3 years ago
Clouds take these vulns seriously, and have a lot to lose, and have deep wallets. I'd be surprised if this topic didn't come up when large purchases are discussed.

Not that there are many alternatives..

justinator · 3 years ago
>In principle it seems like the chipmakers should hold all the cards when it comes to discovery: they are experts in speculative execution, know exactly how their chips work and have massive existing validation suites, simulators and internal machine-readable specifications for the low-level operations of these chips.

I hope you're not in charge of hiring QA. Bugs are often found by people who AREN'T thinking like the developers, who start wearing blinders about how their stuff should work and stop trying stupid things.

BeeOnRope · 3 years ago
I'm not directly in charge of hiring QA, no!

I think this sort of excuses the initial blindness to Spectre-style attacks, but once the basic pattern was clear it doesn't excuse the failure to discover any of the subsequent issues.

It is as if someone found a bug by examining the assembly of your process, one caused by unsafe inlined `strcpy` calls (they could not see the source, so they had no idea strcpy was the problem), and then over the subsequent 6 years other people slowly found more strcpy issues in your application using brute-force black-box reverse engineering, and meanwhile you never just grepped your (closed) source for strcpy uses and audited them, or used any of the compiler or runtime mitigations against this issue.

margalabargala · 3 years ago
> I hope you're not in charge of hiring QA.

This seems unnecessarily harsh. The whole post would be improved by removing that sentence IMO.

mekoka · 3 years ago
> I hope you're not in charge of hiring QA.

Why the personal attack? The logic is sound. Isn't QA typically part of the same organization?

specialist · 3 years ago
Spot on.

Generalizing: QA/Test and dev people just think differently.

I served as QA manager for a while. I was fortunate to have worked with some really, really good QA/Test people. They're more rare than good devs.

Some individuals can do well in both worlds.

I've had great devs who were pretty good at test. It seems to me like these devs came from outside of software and CS. Like from aerospace or ballet or history.

I've mentored QA/Test people so they could better automate and manage their work. Then they could pick up tasks like CI/CD, testing scripts, scrub data, etc.

But, in my experience, devs are bad at testing, and just terrible at QA.

Getting enough QA/Test support on a team in the 90s was a tough sell, even though everyone gave lip service to quality.

It's been a long time since I've worked with an actual Test person, dedicated to that role. And that was just 1 person vs 8 devs. So ridiculous.

These days, any kind of "test" is done by "business analysts", whatever that means. And I can't recall the last QA person I've worked with (since my stint as QA Manager in the '90s).

FWIW, I wholly agree with Dan Luu's observations about today's QA/Test standard practice in software vs hardware.

netheril96 · 3 years ago
No QA is able to find out a Spectre vulnerability. QA is irrelevant in this conversation.
CTDOCodebases · 3 years ago
And maybe just maybe when the Snowden revelations started to come out some people woke up and realised that the companies who design the processors used in the vast majority of computers are from the US.

https://www.theverge.com/2013/12/20/5231006/nsa-paid-10-mill...

CanaryLayout · 3 years ago
Since Applied Cryptography, and every day since its publication, it's been well known and well understood that the NSA's never-ending efforts to weaken systems to make its own work easier wreak havoc and cost everyone else billions. Luckily their attempt to get everyone to adopt a PRNG that was broken on purpose was thwarted.

But who's to say a chipmaker doesn't get bribed to backdoor their design to allow reading any page of RAM from any protection level? It would make their job super easy.

raggi · 3 years ago
The intersection of probabilistic optimization and timing based side channels is a gift that will never ever fully go away.

Everything _really fast_, which is approximately all very mature systems gear, has probabilistic optimization in it now, and that's where a great deal of modern performance comes from.

Even thermals and power draw produce side channels. Eradicating every side channel is untenable; research is required to understand which side channels are tolerable, and then we have to patch up the rest as best we can.

I'm waiting for the one where we find that binning is involved in an integrated way, and that some arbitrary sub-population of popular chips turns out to be much more exploitable than others due to particular paths being disabled and leaking extra timing info. That'll be a really fun and awful day.

z3t4 · 3 years ago
You are describing a creator bias (there is probably a better name), where you think the ones who created something know it best. For example, you could create a programming language, a game, or anything, and think that you know it better than someone who uses it for several hours every day. The larger the user base, the less likely you are to know it better than all users.
weebull · 3 years ago
The fans know Star Wars far better than George Lucas.
adgjlsfhk1 · 3 years ago
I think your last statement is half right. They probably don't bother looking for them that hard, because if they look for them they might find them, and then have to make their CPUs slower before launch.
cjbprime · 3 years ago
I think there are simply very few humans performing competent vulnerability research (proactive discovery, not reactive patching) in public.
thayne · 3 years ago
> Maybe they did and are well aware of all these attacks but to save face and performance hits they simply hold on to them hoping nobody makes them public?

Or they chose not to look for these types of bugs in the first place, for those reasons.

chrsw · 3 years ago
There's more money to be made in performance and efficiency than security. If chipmakers could design and build the perfect processor, they would. But like anything else complex, there are compromises everywhere.
BeeOnRope · 3 years ago
Granted, but the vendors accept that these are serious problems, given that they are immediately patched and the mitigations all enabled by default, even at significant performance cost (most chip generations are down double-digit percent versus "zero mitigations" at this point).

So they don't need to build the "perfect processor" but why aren't they discovering any of these issues themselves?

TZubiri · 3 years ago
It's probable that they are aware of several vulnerabilities; such is life. But they are unable to prioritize them and assess which to solve first, so they need external auditors using black-box techniques to help them identify which are exploitable by external attackers, so they can fix those and not the other non-issues.
DropInIn · 3 years ago
Tinfoil hat time:

They know at release about many of these bugs, but between incentivizing upgrades and selling the resultant bugs as backdoors to 'three letter agencies', there's simply too much money to be made by not disclosing/patching the problems before third parties publish their discoveries.

mad · 3 years ago
Company politics? According to this tweet, parts of Intel did know about this attack: https://twitter.com/bsdaemon/status/1688978152201015301
quickthrower2 · 3 years ago
My guess: There are a lot more researchers on the outside than inside. And also incentives - you would need a red team at Intel who tries to attack their own chips, and those people would be even fewer. And the best people may want to stay independent.
Method-X · 3 years ago
Like how car companies will do a cost/benefit analysis to see if a recall is worth it.
david-gpu · 3 years ago
Having worked in the industry, my gut feeling is that chipmakers don't invest all that much in looking for and preventing these sorts of attacks.

When working on a new feature, you are desperately trying to deliver on time something that adds value in the sorts of scenarios that it was designed for. And that is already hard enough.

rvba · 3 years ago
Maybe Intel cut corners wherever possible, so their processors are a few percent faster due to each bug, but much less safe.

But hey they were faster than AMD!

CanaryLayout · 3 years ago
shoulda bought the Threadripper the intern recommended instead of that vulnshit your ISV recommended to you
IshKebab · 3 years ago
As I understand it this isn't really a Spectre style bug, it's just a straight up bug. It's pretty much the hardware equivalent of use-after-free.

Simple to fix with a microcode update. No serious performance implications.

efficax · 3 years ago
an easier explanation is that modern chipsets are incredibly complex, side channel attacks like these are hard to reason about and traditionally, processors themselves have not been the target of these kinds of attacks, so i don’t think the engineers working on them are accustomed to thinking about them as attack surfaces in this way.
CanaryLayout · 3 years ago
You didn't have to think about this when you ran your code on your own chips; so it's your fault for backdooring the front end into the datacenter.

But now we have the 21st-century mainframe we like to call the cloud, where everything is shared. So I upload a container image with the vampire vuln, intending to read all the activity on the host: other customers' jobs, the OS itself, even internal keys used at Amazon.

The motivation to do this kind of attack now is incalculable.

jgalt212 · 3 years ago
> Outside researches need to reverse all this by probing a black box

The kids are calling this activity "prompt engineering".

radium3d · 3 years ago
Do "they know exactly how their chips work"?

throwawaylinux · 3 years ago
There are security teams and/or methodologies inside all major CPU designers today that look at speculation and other side channels, although these might still not be up to quite the kind of rigor that you see in traditional verification. That is to say, their unit tests, fuzzers, formal analysis, and proving methodologies are all well set up to verify that architectural results are correct; my guess is that they don't all verify that intermediate results or side effects can't be observed by different privilege contexts.

In many ways it is a much more difficult problem too. Going back to first principles, what execution in one context could have an effect on a subsequent context? The answer is just about everything. If you really wanted to be sure you couldn't leak information between contexts, you could only ever run one privilege level on the machine at any time. Not just the core, the entire machine. When switching contexts, you would have to stop everything, stop all memory and IO traffic, flush all caches and queues, and idle all silicon until all temperatures had reached ambient and voltages, fans, and frequency scaling had settled. Then you could start your next program. Even then there are probably things you've forgotten: programs can access DRAM in certain ways to cause bitflips in adjacent cells, for example, so you've probably got a bunch of persistent and unfixable problems there too if you're paranoid. That's not even starting on any of the OS, hypervisor, or firmware state that the attacking program might have influenced. So the real answer is that you simply can't share any silicon, software, or wires whatsoever between different trust domains if you are totally paranoid.

All of these things are well known about, but at some point you make your best estimation of whether something could realistically be exploited and that's very hard to actually prove one way or another. Multiply by all possible channels and techniques.

That's probably why you see a side channel vulnerability discovered every month by outsiders, but very few architectural defects (Pentium FDIV type bugs).

That said, this issue looks like a clear miss by their security process. Supplying data to an untrusted context, even if it can only be used speculatively, is clearly outside what is acceptable, and it is one of the things that could be discovered by an analysis of the pipeline.

Contrast with the recent AMD branch prediction vulnerability, which could plausibly fall under the category of a known risk that was not thought to be realistically exploitable.

As others have said though, everyone makes mistakes, every CPU and program and every engineering project has bugs and mistakes. I don't know if you can deduce much about a CPU design company's internal process from looking at things like this.

acheong08 · 3 years ago
NSA Backdoor \s
mike_hearn · 3 years ago
The Intel paper link is dead, this seems to be the right one:

https://www.intel.com/content/www/us/en/developer/articles/t...

General caveats: are there many clouds that still run workloads from different users on the same physical core? I thought most had changed their schedulers years ago so you can't get cross-domain leaks between hyperthreads anymore. Claiming that it affects all users on the internet seems like a massive exaggeration, as he hasn't demonstrated any kind of browser-based exploit, and even if such a thing did exist, it'd affect only a tiny minority of targeted users. AFAIK, many years after the introduction of Spectre, nobody has ever found a specex attack in the wild (or have they?)

I think the more interesting thing here is that it continues the long run of speculation bugs that always seem to be patchable in microcode. When this stuff first appeared there was the looming fear that we'd have to be regularly junking and replacing the physical chips en masse, but has that ever been necessary? AFAIK all of the bugs could be addressed via a mix of software and microcode changes, sometimes at the cost of some performance. But there's never been a bug that needed new physical silicon (except for the early versions of AMD SEV, which were genuinely jailbroken in unpatchable ways).

kiririn · 3 years ago
>are there many clouds that still run workloads from different users on the same physical core?

There are a vast number of VPS providers out there that aren’t AWS/GCP/Azure/etc where the answer is yes. Even the ones that sell ‘dedicated’ cores, which really just means unmetered cpu

H8crilA · 3 years ago
What about burstable instances on AWS, and whatever is the equivalent in other clouds? Hard to imagine those having a dedicated core, would probably defeat the purpose.
ajross · 3 years ago
Per the paper, this looks like an attack against speculated instructions that modify the store forward buffer. The details aren't super clear, but that seems extremely unlikely to survive a context switch. In practice this is probably only an attack against hyperthread code running simultaneously on the same CPU, which I'd expect cloud hosts to have eliminated long ago.
winternewt · 3 years ago
The Spectre attack had to be patched in the kernel in a way that significantly slowed down execution on Intel CPUs: https://www.notebookcheck.net/Spectre-v2-mitigation-wreaks-h...
dmatech · 3 years ago
What's interesting is that the FDIV bug from 1994 could also be worked around, but Intel recalled and wrote off those processors[1]. For their latest several problems, their response was more of a "sucks to be you". While they provided microcode updates and worked with OS vendors, there were performance impacts that materially affected the value of the chips.

1. https://www.intel.com/content/www/us/en/history/history-1994...

mike_hearn · 3 years ago
Yes, I think that's what I said? Every attack no matter how deep it seemed to be has been patchable in microcode, sometimes at a cost in performance. But so far nobody had to toss the physical silicon, at least not with Intel. The malleability of these chips is quite fascinating.
vladvasiliu · 3 years ago
> General caveats: are there many clouds that still run workloads from different users on the same physical core? I thought most had changed their schedulers years ago so you can't get cross-domain leaks between hyperthreads anymore.

Isn't this the whole point of AWS' t instances? It's my understanding that they are "shared" at the core level, or else there wouldn't be a reason for the CPU credit balance thing.

BeeOnRope · 3 years ago
They are definitely time-sliced among tenants and very possibly two tenants may run at the same time on two hardware threads on the same core: but you could have a viable burstable instance with time-slicing alone.
pwarner · 3 years ago
I think most if not all cloud VMs dedicate a core to you. Well, there are some that share like the T series on AWS and I think other clouds have similar, but my bet is they can put in an extra "flush" between users to prevent cross tenant leakage.

Of course cross process leakage for a single tenant is an issue, in cloud or on prem, and folks will have to decide how much they trust the processes on their machine to not become evil...

CanaryLayout · 3 years ago
This concern I also share, and it's probably worth converting into layman's terms so that all computer users understand it. Basically, the job scheduler behavior in the OS needs to surface to the user in understandable language, so they can make the trade-off decision.
calibas · 3 years ago
> Claiming that it affects all users on the internet seems like a massive over-exaggeration, as he hasn't demonstrated any kind of browser based exploit and even if such a thing did exist

He's saying it likely affects "everyone on the Internet" because most servers are vulnerable.

Dylan16807 · 3 years ago
Most servers being vulnerable to a local attack is generally pretty boring news.
CanaryLayout · 3 years ago
If you throw in the same vulnerability that AMD has with the list from Intel I think it pretty much covers every server available for rent at Quadra.
amarshall · 3 years ago
> same physical core…between hyperthreads

These are not the same thing. Afaik, most “vCPU” are hyperthreads, not physical cores.

> I thought most had changed their schedulers years ago so you can't get cross-domain leaks between hyperthreads anymore

It would be great to have a source on this.

BeeOnRope · 3 years ago
> These are not the same thing. Afaik, most “vCPU” are hyperthreads, not physical cores.

OP didn't say otherwise. They are saying that public clouds do not let work from different tenants run on the same physical core (on different hyperthreads) at the same time.

This doesn't prevent you from selling 1 hyperthread as 1 vCPU, it just means there are some scheduling restrictions and your smallest instance type will probably have 2 vCPUs if you have SMT-2 hardware (and that's exactly what you see on AWS outside of the burstable instance types).

tedunangst · 3 years ago
Does Digital Ocean count as a major cloud player?
CanaryLayout · 3 years ago
Yes. And Linode. And Quadra. And OVH.

A lot of people on YC are enterprisey-brained and only think there are 3 possible clouds, and then there is the rest of the planet who can't afford to park their cash at AWS and set it on fire.

Negitivefrags · 3 years ago
Once again it seems clear that running code from two security domains on the same physical processor cores is just not possible to get right, and we should probably just stop doing it.

There are really only two common cases for this anyway. VMs and JavaScript.

For VMs we just need to give up on it. Dedicate specific cores to specific VMs or at least customers.

For JavaScript it’s a bit harder.

Either way, we need to not be giving up performance for the normal case.
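Linux actually exposes a primitive for exactly this kind of partitioning: core scheduling (kernel 5.14+), which guarantees that tasks carrying different "cookies" are never co-scheduled on SMT siblings of the same physical core. A minimal sketch, assuming a kernel built with CONFIG_SCHED_CORE; the prctl constants below are the values from <linux/prctl.h>:

```python
# Sketch: give this process its own core-scheduling "cookie" so the Linux
# scheduler refuses to run unrelated tasks on the SMT siblings of any core
# it occupies. Requires Linux >= 5.14 with CONFIG_SCHED_CORE.
import ctypes
import os

PR_SCHED_CORE = 62          # prctl option, from <linux/prctl.h>
PR_SCHED_CORE_CREATE = 1    # create a new cookie for the target task
PIDTYPE_TGID = 1            # apply the cookie to the whole thread group

libc = ctypes.CDLL(None, use_errno=True)

def isolate_current_process() -> str:
    """Create a core-scheduling cookie for the calling process."""
    ret = libc.prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE,
                     os.getpid(), PIDTYPE_TGID, 0)
    if ret != 0:
        err = ctypes.get_errno()
        return f"core scheduling unavailable: {os.strerror(err)}"
    return "core scheduling cookie created"

print(isolate_current_process())
```

This trades some throughput (a sibling thread may sit idle) for the guarantee that two security domains never share a physical core concurrently.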

marssaxman · 3 years ago
> For JavaScript it’s a bit harder.

"We should probably just stop doing it" works for me.

throwaway892238 · 3 years ago
Agreed. Browsers are now nothing but an application platform of APIs (https://developer.mozilla.org/en-US/docs/Web/API). For some reason they still retain the vestigial HTML, CSS and JS, but really all you need is bytecode that calls an ABI, and a widget toolkit that talks to a rendering API. Then we can finally ship apps to users without the shackles of how a browser wants to interpret and render some markup.

The idea of security "sandboxes" is quaint, but they have been defeated pretty much since their inception. And the only reason we have "frontend developers" rather than just "developers" is that HTML/CSS/JS/DOM/etc. is a byzantine relic we refuse to let go of. Just let us deliver regular-old apps to users and control how they're displayed, in regular programming languages. Let users find any app in any online marketplace based on open standards.

jl6 · 3 years ago
There’s a marketing opportunity here to put multi-core back in the spotlight. Most workloads have reached the point of diminishing returns for adding more cores to a CPU, but if it turns out we need more cores just so we can run more concurrent processes (or browser tabs) securely, then here come the 128-core laptop chips…
dgb23 · 3 years ago
From a user’s perspective I often think that applications which run multiple processes, demand multiple threads and large chunks of memory are too entitled.

I know it’s a (not even) half baked thought. But there’s something to that. We never really think of “how many resources is this application allowed to demand?”

Software would be orders of magnitude faster if there were some standard, sensible way of giving applications fixed resource buckets.

brundolf · 3 years ago
Intel playing 4D chess to sell more new hardware
demindiro · 3 years ago
I've wondered for a while whether it would make sense to split the CPU into an "IOPU" and an "SPU":

- The IOPU would be responsible for directing other hardware on the system. It doesn't need to be very performant.

- The SPU would be optimized for scalar and branch-heavy code that needs to run fast.

The SPU could have minimal security, just enough so it can't read arbitrary memory when fetching from RAM. It would only run one program at a time, so speculation shouldn't be an issue.

At least on my system few programs need a lot of processing power (and even then only intermittently), so little task switching should occur on an SPU.

kazinator · 3 years ago
> Once again it seems clear that running code from two security domains on the same physical processor cores is just not possible to get right, and we should probably just stop doing it.

Yes. This has had its heyday: the era of time-shared systems, from the 1960s right into the 1990s (Unix systems with multiple users having "shell accounts" at universities and ISPs and such). These CPU attacks show us that secure time-sharing where users run arbitrary machine code is no longer feasible.

There will still be time sharing where users trust each other, like workers on the same projects accessing a build machine and whatnot.

bee_rider · 3 years ago
It is sort of like the second law of thermodynamics (before statistical mechanics came around and cleared things up): the claim that no two programs running on the same computer can truly be prevented from snooping on each other may not be well founded in any analytical or philosophical sense, but it is experimentally bulletproof, to the point where anyone who tries to sell you otherwise should be regarded very suspiciously.
hinkley · 3 years ago
This seems like a job for Arm or RISC-V or ???

Sacrifice IPC/core to reduce gates per core, stuff more cores into a square centimeter, pin processes to cores or groups of cores, keep thread preemption cheap, but let thread migration take as long as it takes.

Arm already has asymmetric multiprocessing, which I feel like is halfway there. Or maybe a third. Fifteen years ago asynchrony primitives weren’t what they are today. I think there’s more flexibility to design a core around current or emerging primitives instead of the old ways. And then there are kernel primitives like io_uring meant to reduce system call overhead, amortizing over multiple calls. If you split the difference, you could afford to allow individual calls to get several times more expensive, while juggling five or ten at once for the cost of two.

hyperman1 · 3 years ago
I've wondered if we can't give a dedicated core to the browser. Of course, then web pages can steal from other web pages. Maybe task switching needs to erect much higher barriers between security contexts, a complete flush or so?
bee_rider · 3 years ago
I wish chips would come with a core devoted to running this kind of untrusted code; maybe they could take something like an old bonnell atom core, strip out the hyper threading, and run JavaScript on that.

If a script can’t run happily on a core like that, it should really be a program anyway, and I don’t want to run it.

Negitivefrags · 3 years ago
It probably would be possible to add a new instruction that causes the processor to flush all state in exchange for sacrificing task switching speed. Of course it might still have bugs, but you could imagine that it would be easier to get right.

Of course, it’s not doing much for the billions of devices that exist.

I would hope that we could find a software solution that web browsers can implement so that devices can be patched.

Either way, I would want such a solution to not compromise performance in the case where code is running in the same security context.

This is what I don’t like about existing mitigations. It’s making computers slower in many contexts where they don’t need to be.

moonchild · 3 years ago
You would need a dedicated core per tab.

Partitioning tabs according to trust would be ~fine, but laborious and error-prone.
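One building block for this already exists: CPU affinity. A host application can pin an untrusted worker process to a dedicated CPU so it never migrates onto cores running trusted code. A minimal Linux-only sketch using Python's standard library (note this controls placement only; it does not by itself keep the pinned core's SMT sibling idle):

```python
# Sketch: confine the calling process to a single dedicated CPU, run
# untrusted work there, then restore the original affinity mask.
# Linux-only; assumes CPU 0 exists.
import os

def pin_to_cpu(cpu: int) -> set:
    """Restrict the calling process to one CPU and return the old mask."""
    old = os.sched_getaffinity(0)   # 0 = the calling process
    os.sched_setaffinity(0, {cpu})
    return old

old_mask = pin_to_cpu(0)            # ...run untrusted work here...
os.sched_setaffinity(0, old_mask)   # then restore the original mask
print("restored affinity to", len(old_mask), "CPU(s)")
```

The laborious part the parent describes is deciding which workers may share a mask, and keeping that partition correct as tabs come and go.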

kristjank · 3 years ago
Do you really think that giving up on getting things done right is the way to progress computing? While AMD has its own spectrum of problems and not-quite-there security features, most of their vulnerabilities have been fixed in microcode shortly after disclosure.

We as an industry should stop excusing chipmakers from doing their jobs and reject broken products. It's brand loyalty all over again, like when Apple does something ill-considered like dropping the headphone jack and the whole industry follows, breaking years of interoperability.

When the products/services we buy break, we should demand better, not lower our expectations.

paxys · 3 years ago
So one should never install software from more than one company on a computer?
CanaryLayout · 3 years ago
You install App X from Vendor Y on to vSystem Z.

Vector is found to get untrusted code C to run in the user area on Z via exploit in X that Y has not acknowledged, so researchers publish a CVE with an example.

C starts trying to read memory from threads shared on same vCPU, revealing db connection string used by X, the nonce and salt for hashing.

Attacker now has the keys to the entire kingdom.

accrual · 3 years ago
> Once again it seems clear that running code from two security domains on the same physical processor cores is just not possible to get right, and we should probably just stop doing it.

I believe this is why OpenBSD disabled SMT by default in June of 2018. [0]

It can still be enabled with a simple sysctl (hw.smt=1), though.

[0] https://www.mail-archive.com/source-changes@openbsd.org/msg9...

bhk · 3 years ago
Once you "dedicate cores to specific VMs" you will find that chip designers can also screw that up, just like they can screw up protection within a core. So you might as well proclaim that "impossible to get right" preemptively.
quickthrower2 · 3 years ago
They said physical processor; I took this to mean the chip. Of course you can't trust chips on the same motherboard either (there could be a bug where they can read the same memory), so you need a network boundary. So: a different rack in the server room for each [boundary, however you define that...]
malfist · 3 years ago
This is an unreasonable position. Vulnerabilities can be fixed
jjav · 3 years ago
> This is an unreasonable position. Vulnerabilities can be fixed

That's a highly optimistic position, to the point of being almost wishful thinking.

The vulnerability being talked about today has been around since 2014, according to the report, and possibly exploited for an unknown number of years since. Sure, maybe we can work around this one, now.

Other similar ones to be published years into the future are also there today being likely exploited as we speak.

Running untrusted code on the same silicon as sensitive code (data), is unlikely to ever actually be truly safe.

hinkley · 3 years ago
Vulnerabilities are like unreliable cars. It doesn't matter so much if they can be fixed, it's the very, very high opportunity cost of needing to be fixed, when you were busy doing something else of high value.

Responsible people tend to pick the small consequence now over the unknowable distribution of consequences later.

pseudalopex · 3 years ago
> Vulnerabilities can be fixed

Not always the damage.

2OEH8eoCRo0 · 3 years ago
The mitigation here can incur a whopping 50% performance penalty. At what point can customers return these CPUs for either being defective or sue for false advertising? If they can't safely meet the target performance they shouldn't be doing these tricks at all.
layer8 · 3 years ago
You are assuming that white hats discover them early enough before black hats.

Deleted Comment

zerocrates · 3 years ago
Only up to 11th gen... it didn't seem like this could have been disclosed to Intel soon enough for them to have fixed it for 12th gen, so had they just happened to fix it while fixing something else, or what?

Decided to look in the paper: Intel states that newer CPUs such as Alder Lake, Raptor Lake, and Sapphire Rapids are unaffected, although this appears to be not a deliberate security fix but just a side effect of a significantly modified architecture. So basically they randomly fixed it, or at least made this particular exploit nonworkable.

broodbucket · 3 years ago
Microarchitectural behaviour changes from generation to generation, and thus so do side effects. Fixing things by accident (and also introducing new problems by accident) are relatively frequent occurrences
stevefan1999 · 3 years ago
Fix one big bug, get two small bugs...
dncornholio · 3 years ago
Yes
v8xi · 3 years ago
From FAQ: [Q] How long have users been exposed to this vulnerability? [A] At least nine years. The affected processors have been around since 2014.

Amazing how these vulnerabilities sit around unnoticed for years and then it takes two weeks for someone to code up an exploit.

robotnikman · 3 years ago
I have a feeling the time spent searching for the vulnerability in the first place was more than 2 weeks though.
H8crilA · 3 years ago
Those things come in waves. Once the first large CPU vulnerability was found then more followed soon. I think it's obvious why this is so.
Liquix · 3 years ago
All a publication indicates is that a white/grey hat researcher has discovered the vulnerability. There is no way to know if or how many times the same flaw has been exploited by less scrupulous parties in the interim.
mrob · 3 years ago
And information leak exploits are less likely to be detected than arbitrary code execution. If somebody is exploiting a buffer overflow, they need to get it exactly right, or they'll probably crash the process, which can be logged and noticed. The only sign of somebody attempting Downfall or similar attacks is increased CPU use, which has many benign causes.
ibejoeb · 3 years ago
Since it is in a class of other well known vulnerabilities, I'm going to assume that there has been quite a bit of active research by state-operated and state-sponsored labs. I think it's more likely than not that this has been exploited.
xyst · 3 years ago
Likely work based off of previous exploits.
kzrdude · 3 years ago
See this LWN story: https://lwn.net/Articles/940783/

On Linux, any CPUs that don't have updated microcode will have AVX completely disabled as a mitigation for this issue. That's rather harsh if you ask me, and would be very noticeable. Now I'm interested in finding out if I can get updated microcode..

hansendc · 3 years ago
The AVX disable is only when you use "gather_data_sampling=force". The default is to leave AVX alone and proclaim the system to be vulnerable.

From https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... :

> Specifying "gather_data_sampling=force" will use the microcode mitigation when > available or disable AVX on affected systems where the microcode hasn't been > updated to include the mitigation.

Disclaimer: I work on Linux at Intel. I probably wrote or tweaked the documentation and changelogs that are confusing folks.
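For anyone checking their own machine: patched kernels report the GDS status through sysfs, alongside the other per-vulnerability files. A small sketch (the file name matches the kernel documentation linked above; the file is simply absent on kernels without the patch):

```python
# Read the kernel's reported GDS (Downfall) mitigation status from sysfs.
# Typical values include "Vulnerable", "Mitigation: Microcode", or
# "Unknown: Dependent on hypervisor status".
from pathlib import Path

def gds_status() -> str:
    p = Path("/sys/devices/system/cpu/vulnerabilities/gather_data_sampling")
    try:
        return p.read_text().strip()
    except OSError:
        return "unknown (kernel does not report gather_data_sampling)"

print(gds_status())
```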

kzrdude · 3 years ago
Great, thanks for the clarification
Bu9818 · 3 years ago
`[ 0.000000] microcode: updated early: 0x27 -> 0x28, date = 2019-11-12`

I'm on haswell. Is there a list of what CPUs get updated microcode? Sad.
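On x86 Linux, the microcode revision the kernel actually loaded shows up in /proc/cpuinfo, which makes it easy to compare against Intel's published fixed revisions for your CPU model. A small sketch (the "microcode" field is x86-specific, so this prints "unknown" elsewhere):

```python
# Collect the distinct microcode revisions reported in /proc/cpuinfo
# (one "microcode" line per logical CPU on x86 Linux).
def microcode_revisions() -> list:
    revs = set()
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("microcode"):
                    revs.add(line.split(":", 1)[1].strip())
    except OSError:
        pass
    return sorted(revs) or ["unknown"]

print(microcode_revisions())
```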

starlevel003 · 3 years ago
haswell isn't vulnerable, it only affects skylake or later
kzrdude · 3 years ago
Maybe that's not the default? Not entirely clear from the text.
bironran · 3 years ago
Worth noting that GCP has this patched (https://cloud.google.com/support/bulletins#gcp-2023-024)
palcu · 3 years ago
My adjacent teams in London who work in SRE on Google Cloud (GCE) got some well deserved doughnuts today for rolling out the patches on time.
zgluck · 3 years ago
Corresponding AWS notice:

https://aws.amazon.com/security/security-bulletins/AWS-2023-...

AWS customers’ data and instances are not affected by this issue, and no customer action is required. AWS has designed and implemented its infrastructure with protections against this class of issues. Amazon EC2 instances, including Lambda, Fargate, and other AWS-managed compute and container services protect customer data against GDS through microcode and software based mitigations.