I remember Apple had a totally different but equally clever solution back in the days of the 68K-to-PowerPC migration. The 68K had 16-bit instruction words, usually with some 16-bit arguments. The emulator’s core loop would read the next instruction and branch directly into a big block of 64K x 8 bytes of PPC code. So each 68K instruction got 2 dedicated PPC instructions, typically one to set up a register and one to branch to common code.
What that solution and Rosetta 2 have in common is that they’re super pragmatic - fast to start up, with fairly regular and predictable performance across most workloads, even if the theoretical peak speed is much lower than a cutting-edge JIT.
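A rough sketch of that dispatch idea in C, for anyone who hasn't seen the trick (the real thing was hand-written PPC assembly, and every name below is made up for illustration): each possible 16-bit instruction word indexes straight into a table of fixed-size handler stubs, so there's no decode step at all.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        uint32_t d[8], a[8];      /* 68K data and address registers */
        const uint16_t *pc;       /* program counter into 68K code  */
    } Cpu68k;

    typedef void (*Handler)(Cpu68k *cpu, uint16_t opcode);

    /* 65,536 entries, one per possible instruction word. In the real emulator
       each entry was 8 bytes of PPC code (two instructions: set up a register,
       branch to common code); here each entry is a function pointer. */
    static Handler dispatch_table[65536];

    static void unimplemented(Cpu68k *cpu, uint16_t opcode) {
        (void)cpu;
        fprintf(stderr, "unhandled 68K opcode %04x\n", opcode);
        exit(1);
    }

    void emulator_loop(Cpu68k *cpu) {
        for (int i = 0; i < 65536; i++)
            if (!dispatch_table[i]) dispatch_table[i] = unimplemented;
        for (;;) {
            uint16_t opcode = *cpu->pc++;          /* fetch the next 16-bit word    */
            dispatch_table[opcode](cpu, opcode);   /* branch straight into its stub */
        }
    }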
Anyone know how they implemented PPC-to-x86 translation?
From what I understand, they purchased a piece of software that already existed to translate PPC to x86 in some form or another and iterated on it. I believe the software may have already even been called ‘Rosetta’.
My memory is very hazy, though. While I experienced this transition firsthand and was an early Intel adopter, that’s about all I can remember about Rosetta or where it came from.
I remember before Adobe had released the Universal Binary CS3 that running Photoshop on my Intel Mac was a total nightmare. :( I learned to not be an early adopter from that whole debacle.
Assuming you're talking about PPC-to-x86, it was certainly usable, though noticeably slower. Heck, I used to play Tron 2.0 that way; the frame rate suffered, but it was still quite playable.
Interactive 68K programs were usually fast, since they would still call native PPC QuickDraw code. It was processor-intensive code that was slow, especially with the first-generation 68K emulator.
I remember years ago, when Java adjacent research was all the rage, HP had a problem that was “Rosetta lite” if you will. They had a need to run old binaries on new hardware that wasn’t exactly backward compatible. They made a transpiler that worked on binaries. It might have even been a JIT, but that part of my memory is fuzzy.
What made it interesting here was that as a sanity check they made an A->A mode where they took in one architecture and spit out machine code for the same architecture. The output was faster than the input. Meaning that even native code has some room for improvement with JIT technology.
I have been wishing for years that we were in a better place with regard to compilers and NP-complete problems, where the compilers had a fast mode for code-build-test cycles and a very slow incremental mode for official builds. I recall someone telling me the only thing they liked about the Rational IDE (C and C++?) was that it cached precompiled headers, one of the Amdahl’s Law areas for compilers. If you changed a header, you paid the recompilation cost and everyone else got a copy. I love it whenever the person who cares about something gets to pay the cost instead of externalizing it onto others.
And having some CI machines or CPUs that just sit around chewing on Hard Problems all day for that last 10% seems to me to be a really good use case in a world that’s seeing 16-core consumer hardware. Also, caching hints from previous runs is a good thing.
Could it be simply because many binaries were produced by much older, outdated optimizers, or optimized for size?
Also, optimizers usually target “most common denominator” so native binaries rarely use full power of current instruction set.
Jumping from that peculiar finding to praising runtime JIT feels like a longshot. To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.
> To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.
This turns out to be quite difficult, especially if you're using bitcode as a compiler IL. You have to know what the right "intermediate" level is; if assumptions change too much under you then it's still too specific. And it means you can't use things like inline assembly.
That's why bitcode is dead now.
By the way, I don't know why this thread is about how JITs can optimize programs when this article is about how Rosetta is not a JIT and intentionally chose a design that can't optimize programs.
Note that on gcc (I think) and clang (I'm sure), -Oz is a strict superset of -O2 (the "fast+safe" optimizations, compared to -O3 that can be a bit too aggressive, given C's minefield of Undefined Behavior that compilers can exploit).
I'd guess that, with cache fit considerations, -Oz can even be faster than -O2.
All reasonable points, but examples where JIT has an advantage are well supported in the research literature. The typical workload that shows this is something with a very large space of conditionals, but where at runtime there's a lot of locality, e.g. matching and classification engines.
> To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.
Or distribute it in source form and make compilation part of the install process. Aka, the Gentoo model.
It was particularly poignant at the time because JITed languages were looked down on by the “static compilation makes us faster” crowd. So it was a sort of “wait a minute Watson!” moment in that particular tech debate.
No one cares as much nowadays; we’ve moved our overrated opinion battlegrounds to other portions of what we do.
People have mentioned the Dynamo project from HP. But I think you're actually thinking of the Aries project (I worked in a directly adjacent project) that allowed you to run PA-RISC binaries on IA-64.
Something that fascinates me about this kind of A -> A translation (which I associate with the original HP Dynamo project on HPPA CPUs) is that it was able to effectively yield the performance effect of one or two increased levels of -O optimization flag.
Right now it's fairly common in software development to have a debug build and a release build with potentially different optimisation levels. So that's two builds to manage - if we could build with lower optimisation and still effectively run at higher levels then that's a whole load of build/test simplification.
Moreover, debugging optimised binaries is fiddly due to information that's discarded. Having the original, unoptimised, version available at all times would give back the fidelity when required (e.g. debugging problems in the field).
Java effectively lives in this world already as it can use high optimisation and then fall back to interpreted mode when debugging is needed. I wish we could have this for C/C++ and other native languages.
It depends greatly on which optimization levels you’re going through. -O0 to -O1 can easily be a 2-3x performance improvement, which is going to be hard to get otherwise. -O2 to -O3 might be 15% if you’re lucky, in which case LTO+PGO can absolutely get you wins that beat that.
One of the engineers I was working with on a project, who was from Transitive (the company that made QuickTransit, which became Rosetta), found that their JIT-based translator could not deliver significant performance increases for A->A outside of pathological cases, and it was very mature technology at the time.
I think it's a hypothetical. The Mill Computing lectures talk about a variant of this, which is sort of equivalent to an install-time specializer for intermediate code which might work, but that has many problems (for one thing, it breaks upgrades and is very, very problematic for VMs being run on different underlying hosts).
If JIT-ing a statically compiled input makes it faster, does that mean that JIT-ing itself is superior or does it mean that the static compiler isn't outputting optimal code? (real question. asked another way, does JIT have optimizations it can make that a static compiler can't?)
It's more the case that the ahead-of-time compilation is suboptimal.
Modern compilers have a thing called PGO (Profile Guided Optimization) that lets you take a compiled application, run it to generate an execution profile, and then compile the application again using information from the profiling step. The reason this works is that lots of optimization involves time-space tradeoffs that only make sense if the code is frequently called. A JIT only optimizes frequently-called code, so it has the advantage of runtime profiling information, while ahead-of-time (AOT) compilers have to make educated guesses about which loops are hottest. PGO closes that gap.
Theoretically, a JIT could produce binary code hyper-tailored to a particular user's habits and their computer's specific hardware. However, I'm not sure if that has that much of a benefit versus PGO AOT.
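For anyone who hasn't used it, the PGO cycle described above looks roughly like this with clang (gcc has an analogous -fprofile-generate / -fprofile-use pair); the program and file names here are just placeholders:

    /* hot.c -- toy program to profile. The branch bias is exactly the kind of
       thing PGO discovers at runtime and feeds back into the second compile. */
    #include <stdio.h>

    int main(void) {
        long sum = 0;
        for (long i = 0; i < 100000000; i++) {
            if (i % 1000 != 0)       /* taken ~99.9% of the time */
                sum += i;
            else
                sum -= i;
        }
        printf("%ld\n", sum);
        return 0;
    }

    /* Typical clang PGO cycle (commands shown as comments):

         clang -O2 -fprofile-instr-generate hot.c -o hot
         LLVM_PROFILE_FILE=hot.profraw ./hot
         llvm-profdata merge -output=hot.profdata hot.profraw
         clang -O2 -fprofile-instr-use=hot.profdata hot.c -o hot

       The second compile knows which branches and loops are actually hot,
       which is the same information a JIT gets for free at runtime. */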
In addition to the sibling comments, one simple opportunity available to a JIT and not AOT is 100% confidence about the target hardware and its capabilities.
For example AOT compilation often has to account for the possibility that the target machine might not have certain instructions - like SSE/AVX vector ops, and emit both SSE and non-SSE versions of a codepath with, say, a branch to pick the appropriate one dynamically.
Whereas a JIT knows what hardware it's running on - it doesn't have to worry about any other CPUs.
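As a concrete illustration, here's the kind of dispatch an AOT compiler (or the programmer) ends up emitting, as a minimal sketch using GCC/Clang's __builtin_cpu_supports; the function names are made up. A JIT or an install-time compiler can simply pick one path and drop the check.

    #include <stddef.h>

    #if defined(__x86_64__) || defined(__i386__)
    #include <immintrin.h>

    /* AVX path: only called if the CPU reports AVX support at runtime. */
    __attribute__((target("avx")))
    static void add_arrays_avx(float *dst, const float *a, const float *b, size_t n) {
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; i++) dst[i] = a[i] + b[i];
    }
    #endif

    /* Fallback path that any target can run. */
    static void add_arrays_scalar(float *dst, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i++) dst[i] = a[i] + b[i];
    }

    void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    #if defined(__x86_64__) || defined(__i386__)
        if (__builtin_cpu_supports("avx")) {   /* the AOT binary decides at runtime */
            add_arrays_avx(dst, a, b, n);
            return;
        }
    #endif
        add_arrays_scalar(dst, a, b, n);
    }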
It means that in this case, the static compiler emitted code that could be further optimised, that's all. It doesn't mean that that's always the case, or that static compilers can't produce optimal code, or that either technique is "better" than the other.
An easy example is code compiled for 386 running on a 586. The A->A compiler can use CPU features that weren't available to the 386. As with PGO you have branch prediction information that's not available to the static compiler. You can statically compile the dynamically linked dependencies, allowing inlining that wasn't previously available.
On the other hand you have to do all of that. That takes warmup time just like a JIT.
I think the road to enlightenment is letting go of phrasing like "is superior". There are lots of upsides and downsides to pretty much every technique.
It depends on what the JIT does exactly, but in general yes a JIT may be able to make optimisations that a static compiler won't be aware of because a JIT can optimise for the specific data being processed.
That said, a sufficiently advanced CPU could also make those optimisations on "static" code. That was one of the things Transmeta had been aiming towards, I think.
> where the compilers had a fast mode for code-build-test cycles and a very slow incremental mode for official builds.
That already exists with C/C++: unity builds, aka single-translation-unit builds. Compiling and linking a ton of object files takes an inordinate amount of time, often the majority of the build time.
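For the curious, a unity build is literally one translation unit that #includes all the others; a minimal sketch with hypothetical file names:

    /* unity.c -- hypothetical single-translation-unit build. Instead of
       compiling parser.c, codegen.c, ... separately and linking the object
       files, you compile just this one file:

           cc -O2 unity.c -o app

       The compiler sees the whole program at once (cheap cross-file inlining,
       headers parsed once) at the cost of rebuilding everything whenever any
       included file changes. */
    #include "parser.c"
    #include "codegen.c"
    #include "optimizer.c"
    #include "main.c"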
Outside of gaming, or hyper-CPU-critical workflows like video editing, I'm not really sure if people actually even care about that last 10% of performance.
I know most of the time when I get frustrated by everyday software, it's doing something unnecessary in a long loop, and possibly forgetting to check for Windows messages too.
Should you feel inspired to share your learnings, insights, or future ideas about the computing spaces you know, I'm sure many other people and I would be interested to listen!
My preferred way to learn about a new (to me) area of tech is to hear the insights of the people who have provably advanced that field. There's a lot of noise relative to signal in tech blogs.
May I ask – would it be possible to implement support for 32-bit VST and AU plugins?
This would be a major bonus, because it could e.g. enable producers like me to open up our music projects from earlier times, and still have the old plugins work.
Are you able to speak at all to the known performance struggles with x87 translation? Curious to know if we're likely to see any updates or improvements there into the future.
Huh, this is timely. Incredibly random but: do you know if there was anything that changed as of Ventura to where trying to mmap below the 2/4GB boundary would no longer work in Rosetta 2? I've an app where it's worked right up to Monterey yet inexplicably just bombs in Ventura.
I know a few of their devs went to ARM, some to Apple & a few to IBM (who bought Transitive). I do know a few of their ex staff (and their twitter handles), but I don’t feel comfortable linking them here.
> To see ahead-of-time translated Rosetta code, I believe I had to disable SIP, compile a new x86 binary, give it a unique name, run it, and then run otool -tv /var/db/oah/*/*/unique-name.aot (or use your tool of choice – it’s just a Mach-O binary). This was done on an old version of macOS, so things may have changed and improved since then.
> Rosetta 2 translates the entire text segment of the binary from x86 to ARM up-front.
Do I understand correctly that Rosetta is basically a transpiler from x86-64 machine code to ARM machine code, which is run prior to the binary execution? If so, does it affect the application startup times?
> If so, does it affect the application startup times?
It does, but only the very first time you run the application. The result of the transpilation is cached so it doesn't have to be computed again until the app is updated.
And deleting the cache is undocumented (it is not in the file system) so if you run Mac machines as CI runners they will trash and brick themselves running out of disk space over time.
The first load is fairly slow, but once it's done, every load after that is pretty much identical to what it'd be running on an x86 Mac, due to the caching it does.
For me my M1 was fast enough that the first load didn't seem that different - and more importantly subsequent loads were lightning fast! It's astonishing how good Rosetta 2 is - utterly transparent and faster than my Intel Mac thanks to the M1.
"I believe there’s significant room for performance improvement in Rosetta 2... However, this would come at the cost of significantly increased complexity...
Engineering is about making the right tradeoffs, and I’d say Rosetta 2 has done exactly that."
They could've amazed a few people a bit more by emulating x86 apps even faster (but M1+Rosetta can already run some stuff faster than an Intel Mac), but then the benefit of releasing native apps would be much decreased ("why bother, it's good enough ...").
It's a delicate political game that they, yet again, seem to be playing pretty well.
One thing that’s interesting to note is that the amount of effort expended here is not actually all that large. Yes, there are smart people working on this, but the performance of Rosetta 2 for the most part is probably the work of a handful of clever people. I wouldn’t be surprised if some of them have an interest in compilers but the actual implementation is fairly straightforward and there isn’t much of the stuff you’d typically see in an optimizing JIT: no complicated type theory or analysis passes. Aside from a handful of hardware bits and some convenient (perhaps intentionally selected) choices in where to make tradeoffs there’s nothing really specifically amazing here. What really makes it special is that anyone (well, any company with a bit of resources) could’ve done it but nobody really did. (But, again, Apple owning the stack and having past experience probably did help them get over the hurdle of actually putting effort into this.)
Yeah, agreed. I get the impression it's a small team.
But there is a long-tail of weird x86 features that are implemented, that give them amazing compatibility, that I regret not mentioning:
* 32-bit support for Wine
* full x87 emulation
* full SSE2 support (generally converting to efficient NEON equivalents) for performance on SIMD code
I consider all of these "compatibility", but that last one in particular should have been in the post, since that's very important to the performance of optimised SIMD routines (plenty of emulators also do SIMD->SIMD, but others just translate SIMD->scalar or SIMD->helper-runtime-call).
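To make the SSE2-to-NEON point concrete, this is roughly what the mapping looks like at the intrinsics level. It's a hand-written illustration, not Rosetta 2's output; the translator does the equivalent mapping on machine instructions (PADDD becoming a 4-lane NEON ADD, and so on).

    #include <stdint.h>

    #if defined(__SSE2__)
    #include <emmintrin.h>
    /* x86: a single 128-bit PADDD */
    void add4_i32(int32_t *dst, const int32_t *a, const int32_t *b) {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        _mm_storeu_si128((__m128i *)dst, _mm_add_epi32(va, vb));
    }
    #elif defined(__ARM_NEON)
    #include <arm_neon.h>
    /* ARM: a single 128-bit vector add -- same width, same lane layout */
    void add4_i32(int32_t *dst, const int32_t *a, const int32_t *b) {
        int32x4_t va = vld1q_s32(a);
        int32x4_t vb = vld1q_s32(b);
        vst1q_s32(dst, vaddq_s32(va, vb));
    }
    #endif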
I think it's about the incentive and not about other companies not doing it. Apple decided to move to ARM, and the reason is probably their strong connection to the ARM ecosystem, which basically means they have an edge with their vertical-integration approach compared to the other competitors. Apple is one of the three _founding_ companies of ARM. The other two were VLSI Technology and Acorn.
Vertical integration. My understanding is that it's because Apple Silicon has special hardware support to make it fast. Apple has had enough experience to know that some hardware support can go a long way toward making the binary emulation situation better.
That is correct; the article goes into detail about why. See the "Apple's Secret Extension" section as well as the "Total Store Ordering" section.
The "Apple's Secret Extension" section talks about how the M1 has 4 flag bits and the x86 has 6 flag bits, and how emulating those 2 extra flags would make every add/sub/cmp instruction significantly slower. Apple has an undocumented extension that adds 2 more flag bits to make the M1's flag bits behave the same as x86.
The "Total Store Ordering" section talks about how Apple has added a non-standard store ordering to the M1 than makes the M1 order its stores in the same way x86 guarantees instead of the way ARM guarantees. Without this, there's no good way to translate instructions in code in and around an x86 memory fence; if you see a memory fence in x86 code it's safe to assume that it depends on x86 memory store semantics and if you don't have that you'll need to emulate it with many mostly unnecessary memory fences, which will be devastating for performance.
> Anyone know how they implemented PPC-to-x86 translation?
They licensed Transitive's retargetable binary translator and renamed it Rosetta; very Apple.
It was originally a startup, but had been bought by IBM by the time Apple was interested.
Rosetta shipped in 2005.
IBM bought Transitive in 2008.
The last version of OS X that supported Rosetta shipped in 2009.
I always wondered if the issue was that IBM tried to alter the terms of the deal too much for Steve's taste.
They squeezed a virtual machine with 88 instructions into less than 1k of memory!
[1] https://thechipletter.substack.com/p/bytecode-and-the-busico...
https://en.wikipedia.org/wiki/SWEET16
Connectix SpeedDoubler was definitely faster.
The HP-UX Aries(5) man page: https://nixdoc.net/man-pages/HP-UX/man5/Aries.5.html
What is meant by "Java adjacent research"? I'm not familiar with what that was.
> The output was faster than the input.
So if you ran the input back through the output multiple times then that means you could eventually get the runtime down to 0.
In my experience, exceptionally well executed tech like this tends to have 1-2 very talented people leading. I'd like to follow their blog or Twitter.
What was the most surprising thing you learned while working on Rosetta 2?
Is there anything (that you can share) that you would do differently?
Can you recommend any great starting places for someone interested in instruction translation?
Looking forward, did your work on Rosetta give you ideas for unfilled needs in the virtualization/emulation/translation space?
What's the biggest inefficiency you see today in the tech stacks you interact most with?
A lot of hard decisions must have been made while building Rosetta 2; can you shed light on some of those and how you navigated them?
My aotool project uses a trick to extract the AOT binary without root or disabling SIP: https://github.com/lunixbochs/meta/tree/master/utils/aotool