I remember Apple had a totally different but equally clever solution back in the days of the 68K-to-PowerPC migration. The 68K had 16-bit instruction words, usually with some 16-bit arguments. The emulator’s core loop would read the next instruction and branch directly into a big block of 64K x 8 bytes of PPC code. So each 68K instruction got 2 dedicated PPC instructions, typically one to set up a register and one to branch to common code.
What that solution and Rosetta 2 have in common is that they’re super pragmatic - fast to start up, with fairly regular and predictable performance across most workloads, even if the theoretical peak speed is much lower than a cutting-edge JIT.
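A rough sketch of that dispatch idea in C, for anyone who hasn't seen the trick (the real thing was hand-written PPC assembly, and every name below is made up for illustration): each possible 16-bit instruction word indexes straight into a table of fixed-size handler stubs, so there's no decode step at all.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        uint32_t d[8], a[8];      /* 68K data and address registers */
        const uint16_t *pc;       /* program counter into 68K code  */
    } Cpu68k;

    typedef void (*Handler)(Cpu68k *cpu, uint16_t opcode);

    /* 65,536 entries, one per possible instruction word. In the real emulator
       each entry was 8 bytes of PPC code (two instructions: set up a register,
       branch to common code); here each entry is a function pointer. */
    static Handler dispatch_table[65536];

    static void unimplemented(Cpu68k *cpu, uint16_t opcode) {
        (void)cpu;
        fprintf(stderr, "unhandled 68K opcode %04x\n", opcode);
        exit(1);
    }

    void emulator_loop(Cpu68k *cpu) {
        for (int i = 0; i < 65536; i++)
            if (!dispatch_table[i]) dispatch_table[i] = unimplemented;
        for (;;) {
            uint16_t opcode = *cpu->pc++;          /* fetch the next 16-bit word    */
            dispatch_table[opcode](cpu, opcode);   /* branch straight into its stub */
        }
    }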
Anyone know how they implemented PPC-to-x86 translation?
From what I understand, they purchased a piece of software that already existed to translate PPC to x86 in some form or another and iterated on it. I believe the software may have already even been called ‘Rosetta’.
My memory is very hazy, though. While I experienced this transition firsthand and was an early Intel adopter, that’s about all I can remember about Rosetta or where it came from.
I remember before Adobe had released the Universal Binary CS3 that running Photoshop on my Intel Mac was a total nightmare. :( I learned to not be an early adopter from that whole debacle.
Assuming you're talking about PPC-to-x86, it was certainly usable, though noticeably slower. Heck, I used to play Tron 2.0 that way; the frame rate suffered, but it was still quite playable.
Interactive 68K programs were usually fast, since they would still call native PPC QuickDraw code. It was processor-intensive code that was slow, especially with the first-generation 68K emulator.
I remember years ago, when Java adjacent research was all the rage, HP had a problem that was “Rosetta lite” if you will. They had a need to run old binaries on new hardware that wasn’t exactly backward compatible. They made a transpiler that worked on binaries. It might have even been a JIT, but that part of my memory is fuzzy.
What made it interesting here was that as a sanity check they made an A->A mode where they took in one architecture and spit out machine code for the same architecture. The output was faster than the input. Meaning that even native code has some room for improvement with JIT technology.
I have been wishing for years that we were in a better place with regard to compilers and NP-complete problems, where the compilers had a fast mode for code-build-test cycles and a very slow incremental mode for official builds. I recall someone telling me the only thing they liked about the Rational IDE (C and C++?) was that it cached precompiled headers, one of the Amdahl’s Law areas for compilers. If you changed a header, you paid the recompilation cost and everyone else got a copy. I love it whenever the person who cares about something gets to pay the cost instead of externalizing it onto others.
And having some CI machines or CPUs that just sit around chewing on Hard Problems all day for that last 10% seems to me to be a really good use case in a world that’s seeing 16-core consumer hardware. Also, caching hints from previous runs is a good thing.
Could it be simply because many binaries were produced by much older, outdated optimizers, or optimized for size?
Also, optimizers usually target “most common denominator” so native binaries rarely use full power of current instruction set.
Jumping from that peculiar finding to praising runtime JIT feels like a longshot. To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.
> To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.
This turns out to be quite difficult, especially if you're using bitcode as a compiler IL. You have to know what the right "intermediate" level is; if assumptions change too much under you then it's still too specific. And it means you can't use things like inline assembly.
That's why bitcode is dead now.
By the way, I don't know why this thread is about how JITs can optimize programs when this article is about how Rosetta is not a JIT and intentionally chose a design that can't optimize programs.
Note that on gcc (I think) and clang (I'm sure), -Oz is a strict superset of -O2 (the "fast+safe" optimizations, compared to -O3 that can be a bit too aggressive, given C's minefield of Undefined Behavior that compilers can exploit).
I'd guess that, with cache fit considerations, -Oz can even be faster than -O2.
All reasonable points, but examples where JIT has an advantage are well supported in the research literature. The typical workload that shows this is something with a very large space of conditionals, but where at runtime there's a lot of locality, e.g. matching and classification engines.
> To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.
Or distribute it in source form and make compilation part of the install process. Aka, the Gentoo model.
It was particularly poignant at the time because JITed languages were looked down on by the “static compilation makes us faster” crowd. So it was a sort of “wait a minute Watson!” moment in that particular tech debate.
No one cares as much nowadays; we’ve moved our overrated opinion battlegrounds to other portions of what we do.
People have mentioned the Dynamo project from HP. But I think you're actually thinking of the Aries project (I worked in a directly adjacent project) that allowed you to run PA-RISC binaries on IA-64.
Something that fascinates me about this kind of A -> A translation (which I associate with the original HP Dynamo project on HPPA CPUs) is that it was able to effectively yield the performance effect of one or two increased levels of -O optimization flag.
Right now it's fairly common in software development to have a debug build and a release build with potentially different optimisation levels. So that's two builds to manage - if we could build with lower optimisation and still effectively run at higher levels then that's a whole load of build/test simplification.
Moreover, debugging optimised binaries is fiddly due to information that's discarded. Having the original, unoptimised, version available at all times would give back the fidelity when required (e.g. debugging problems in the field).
Java effectively lives in this world already as it can use high optimisation and then fall back to interpreted mode when debugging is needed. I wish we could have this for C/C++ and other native languages.
It depends greatly on which optimization levels you’re going through. -O0 to -O1 can easily be a 2-3x performance improvement, which is going to be hard to get otherwise. -O2 to -O3 might be 15% if you’re lucky, in which case LTO+PGO can absolutely get you wins that beat that.
One of the engineers I was working with on a project, who was from Transitive (the company that made QuickTransit, which became Rosetta), found that their JIT-based translator could not deliver significant performance increases for A->A outside of pathological cases, and it was very mature technology at the time.
I think it's a hypothetical. The Mill Computing lectures talk about a variant of this, which is sort of equivalent to an install-time specializer for intermediate code which might work, but that has many problems (for one thing, it breaks upgrades and is very, very problematic for VMs being run on different underlying hosts).
If JIT-ing a statically compiled input makes it faster, does that mean that JIT-ing itself is superior or does it mean that the static compiler isn't outputting optimal code? (real question. asked another way, does JIT have optimizations it can make that a static compiler can't?)
It's more the case that the ahead-of-time compilation is suboptimal.
Modern compilers have a thing called PGO (Profile Guided Optimization) that lets you take a compiled application, run it to generate an execution profile, and then compile the application again using information from the profiling step. The reason this works is that lots of optimization involves time-space tradeoffs that only make sense if the code is frequently called. A JIT only optimizes frequently-called code, so it has the advantage of runtime profiling information, while ahead-of-time (AOT) compilers have to make educated guesses about which loops are hottest. PGO closes that gap.
Theoretically, a JIT could produce binary code hyper-tailored to a particular user's habits and their computer's specific hardware. However, I'm not sure if that has that much of a benefit versus PGO AOT.
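For anyone who hasn't used it, the PGO cycle described above looks roughly like this with clang (gcc has an analogous -fprofile-generate / -fprofile-use pair); the program and file names here are just placeholders:

    /* hot.c -- toy program to profile. The branch bias is exactly the kind of
       thing PGO discovers at runtime and feeds back into the second compile. */
    #include <stdio.h>

    int main(void) {
        long sum = 0;
        for (long i = 0; i < 100000000; i++) {
            if (i % 1000 != 0)       /* taken ~99.9% of the time */
                sum += i;
            else
                sum -= i;
        }
        printf("%ld\n", sum);
        return 0;
    }

    /* Typical clang PGO cycle (commands shown as comments):

         clang -O2 -fprofile-instr-generate hot.c -o hot
         LLVM_PROFILE_FILE=hot.profraw ./hot
         llvm-profdata merge -output=hot.profdata hot.profraw
         clang -O2 -fprofile-instr-use=hot.profdata hot.c -o hot

       The second compile knows which branches and loops are actually hot,
       which is the same information a JIT gets for free at runtime. */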
In addition to the sibling comments, one simple opportunity available to a JIT and not AOT is 100% confidence about the target hardware and its capabilities.
For example AOT compilation often has to account for the possibility that the target machine might not have certain instructions - like SSE/AVX vector ops, and emit both SSE and non-SSE versions of a codepath with, say, a branch to pick the appropriate one dynamically.
Whereas a JIT knows what hardware it's running on - it doesn't have to worry about any other CPUs.
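As a concrete illustration, here's the kind of dispatch an AOT compiler (or the programmer) ends up emitting, as a minimal sketch using GCC/Clang's __builtin_cpu_supports; the function names are made up. A JIT or an install-time compiler can simply pick one path and drop the check.

    #include <stddef.h>

    #if defined(__x86_64__) || defined(__i386__)
    #include <immintrin.h>

    /* AVX path: only called if the CPU reports AVX support at runtime. */
    __attribute__((target("avx")))
    static void add_arrays_avx(float *dst, const float *a, const float *b, size_t n) {
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; i++) dst[i] = a[i] + b[i];
    }
    #endif

    /* Fallback path that any target can run. */
    static void add_arrays_scalar(float *dst, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i++) dst[i] = a[i] + b[i];
    }

    void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    #if defined(__x86_64__) || defined(__i386__)
        if (__builtin_cpu_supports("avx")) {   /* the AOT binary decides at runtime */
            add_arrays_avx(dst, a, b, n);
            return;
        }
    #endif
        add_arrays_scalar(dst, a, b, n);
    }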
It means that in this case, the static compiler emitted code that could be further optimised, that's all. It doesn't mean that that's always the case, or that static compilers can't produce optimal code, or that either technique is "better" than the other.
An easy example is code compiled for 386 running on a 586. The A->A compiler can use CPU features that weren't available to the 386. As with PGO you have branch prediction information that's not available to the static compiler. You can statically compile the dynamically linked dependencies, allowing inlining that wasn't previously available.
On the other hand you have to do all of that. That takes warmup time just like a JIT.
I think the road to enlightenment is letting go of phrasing like "is superior". There are lots of upsides and downsides to pretty much every technique.
It depends on what the JIT does exactly, but in general yes a JIT may be able to make optimisations that a static compiler won't be aware of because a JIT can optimise for the specific data being processed.
That said, a sufficiently advanced CPU could also make those optimisations on "static" code. That was one of the things Transmeta had been aiming towards, I think.
> where the compilers had a fast mode for code-build-test cycles and a very slow incremental mode for official builds.
That already exists with C/C++: unity builds, aka single-translation-unit builds. Compiling and linking a ton of object files takes an inordinate amount of time, often the majority of the build time.
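For the curious, a unity build is literally one translation unit that #includes all the others; a minimal sketch with hypothetical file names:

    /* unity.c -- hypothetical single-translation-unit build. Instead of
       compiling parser.c, codegen.c, ... separately and linking the object
       files, you compile just this one file:

           cc -O2 unity.c -o app

       The compiler sees the whole program at once (cheap cross-file inlining,
       headers parsed once) at the cost of rebuilding everything whenever any
       included file changes. */
    #include "parser.c"
    #include "codegen.c"
    #include "optimizer.c"
    #include "main.c"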
Outside of gaming, or hyper-CPU-critical workflows like video editing, I'm not really sure if people actually even care about that last 10% of performance.
I know most of the time when I get frustrated by everyday software, it's doing something unnecessary in a long loop, and possibly forgetting to check for Windows messages too.
Should you feel inspired to share your learnings, insights, or future ideas about the computing spaces you know, I'm sure many other people and I would be interested to listen!
My preferred way to learn about a new (to me) area of tech is to hear the insights of the people who have provably advanced that field. There's a lot of noise relative to signal in tech blogs.
May I ask – would it be possible to implement support for 32-bit VST and AU plugins?
This would be a major bonus, because it could e.g. enable producers like me to open up our music projects from earlier times, and still have the old plugins work.
Are you able to speak at all to the known performance struggles with x87 translation? Curious to know if we're likely to see any updates or improvements there into the future.
Huh, this is timely. Incredibly random but: do you know if there was anything that changed as of Ventura to where trying to mmap below the 2/4GB boundary would no longer work in Rosetta 2? I've an app where it's worked right up to Monterey yet inexplicably just bombs in Ventura.
I know a few of their devs went to ARM, some to Apple & a few to IBM (who bought Transitive). I do know a few of their ex staff (and their twitter handles), but I don’t feel comfortable linking them here.
> To see ahead-of-time translated Rosetta code, I believe I had to disable SIP, compile a new x86 binary, give it a unique name, run it, and then run otool -tv /var/db/oah/*/*/unique-name.aot (or use your tool of choice – it’s just a Mach-O binary). This was done on an old version of macOS, so things may have changed and improved since then.
> Rosetta 2 translates the entire text segment of the binary from x86 to ARM up-front.
Do I understand correctly that Rosetta is basically a transpiler from x86-64 machine code to ARM machine code, which is run prior to the binary execution? If so, does it affect the application startup times?
> If so, does it affect the application startup times?
It does, but only the very first time you run the application. The result of the transpilation is cached so it doesn't have to be computed again until the app is updated.
And deleting the cache is undocumented (it is not in the file system) so if you run Mac machines as CI runners they will trash and brick themselves running out of disk space over time.
The first load is fairly slow, but once it's done, every load after that is pretty much identical to what it'd be running on an x86 Mac, due to the caching it does.
For me my M1 was fast enough that the first load didn't seem that different - and more importantly subsequent loads were lightning fast! It's astonishing how good Rosetta 2 is - utterly transparent and faster than my Intel Mac thanks to the M1.
"I believe there’s significant room for performance improvement in Rosetta 2... However, this would come at the cost of significantly increased complexity...
Engineering is about making the right tradeoffs, and I’d say Rosetta 2 has done exactly that."
They could've amazed a few people a bit more by emulating x86 apps even faster (but M1+Rosetta can already run some stuff faster than an Intel Mac), but then the benefit of releasing native apps would be much decreased ("why bother, it's good enough ...").
It's a delicate political game that they, yet again, seem to be playing pretty well.
One thing that’s interesting to note is that the amount of effort expended here is not actually all that large. Yes, there are smart people working on this, but the performance of Rosetta 2 for the most part is probably the work of a handful of clever people. I wouldn’t be surprised if some of them have an interest in compilers but the actual implementation is fairly straightforward and there isn’t much of the stuff you’d typically see in an optimizing JIT: no complicated type theory or analysis passes. Aside from a handful of hardware bits and some convenient (perhaps intentionally selected) choices in where to make tradeoffs there’s nothing really specifically amazing here. What really makes it special is that anyone (well, any company with a bit of resources) could’ve done it but nobody really did. (But, again, Apple owning the stack and having past experience probably did help them get over the hurdle of actually putting effort into this.)
Yeah, agreed. I get the impression it's a small team.
But there is a long-tail of weird x86 features that are implemented, that give them amazing compatibility, that I regret not mentioning:
* 32-bit support for Wine
* full x87 emulation
* full SSE2 support (generally converting to efficient NEON equivalents) for performance on SIMD code
I consider all of these "compatibility", but that last one in particular should have been in the post, since that's very important to the performance of optimised SIMD routines (plenty of emulators also do SIMD->SIMD, but others just translate SIMD->scalar or SIMD->helper-runtime-call).
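To make the SSE2-to-NEON point concrete, this is roughly what the mapping looks like at the intrinsics level. It's a hand-written illustration, not Rosetta 2's output; the translator does the equivalent mapping on machine instructions (PADDD becoming a 4-lane NEON ADD, and so on).

    #include <stdint.h>

    #if defined(__SSE2__)
    #include <emmintrin.h>
    /* x86: a single 128-bit PADDD */
    void add4_i32(int32_t *dst, const int32_t *a, const int32_t *b) {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        _mm_storeu_si128((__m128i *)dst, _mm_add_epi32(va, vb));
    }
    #elif defined(__ARM_NEON)
    #include <arm_neon.h>
    /* ARM: a single 128-bit vector add -- same width, same lane layout */
    void add4_i32(int32_t *dst, const int32_t *a, const int32_t *b) {
        int32x4_t va = vld1q_s32(a);
        int32x4_t vb = vld1q_s32(b);
        vst1q_s32(dst, vaddq_s32(va, vb));
    }
    #endif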
I think it's about the incentive and not about other companies not doing it. Apple decided to move to ARM, and the reason is probably their strong connection to the ARM ecosystem, which basically means they have an edge with their vertical-integration approach compared to the other competitors. Apple is one of the three _founding_ companies of ARM. The other two were VLSI Technology and Acorn.
Vertical integration. My understanding is that it's because Apple Silicon has special hardware support to make it fast. Apple has had enough experience to know that some hardware support can go a long way toward making the binary emulation situation better.
That is correct; the article goes into detail about why. See the "Apple's Secret Extension" section as well as the "Total Store Ordering" section.
The "Apple's Secret Extension" section talks about how the M1 has 4 flag bits and the x86 has 6 flag bits, and how emulating those 2 extra flags would make every add/sub/cmp instruction significantly slower. Apple has an undocumented extension that adds 2 more flag bits to make the M1's flag bits behave the same as x86.
The "Total Store Ordering" section talks about how Apple has added a non-standard store ordering to the M1 than makes the M1 order its stores in the same way x86 guarantees instead of the way ARM guarantees. Without this, there's no good way to translate instructions in code in and around an x86 memory fence; if you see a memory fence in x86 code it's safe to assume that it depends on x86 memory store semantics and if you don't have that you'll need to emulate it with many mostly unnecessary memory fences, which will be devastating for performance.
> Anyone know how they implemented PPC-to-x86 translation?
They licensed Transitive's retargetable binary translator and renamed it Rosetta; very Apple.
It was originally a startup, but had been bought by IBM by the time Apple was interested.
Rosetta shipped in 2005.
IBM bought Transitive in 2008.
The last version of OS X that supported Rosetta shipped in 2009.
I always wondered if the issue was that IBM tried to alter the terms of the deal too much for Steve's taste.
They squeezed a virtual machine with 88 instructions into less than 1k of memory!
[1] https://thechipletter.substack.com/p/bytecode-and-the-busico...
https://en.wikipedia.org/wiki/SWEET16
Connectix SpeedDoubler was definitely faster.
The HP-UX Aries(5) man page: https://nixdoc.net/man-pages/HP-UX/man5/Aries.5.html
What is meant by "Java adjacent research"? I'm not familiar with what that was.
> The output was faster than the input.
So if you ran the input back through the output multiple times then that means you could eventually get the runtime down to 0.
In my experience, exceptionally well executed tech like this tends to have 1-2 very talented people leading. I'd like to follow their blog or Twitter.
What was the most surprising thing you learned while working on Rosetta 2?
Is there anything (that you can share) that you would do differently?
Can you recommend any great starting places for someone interested in instruction translation?
Looking forward, did your work on Rosetta give you ideas for unfilled needs in the virtualization/emulation/translation space?
What's the biggest inefficiency you see today in the tech stacks you interact most with?
A lot of hard decisions must have been made while building Rosetta 2; can you shed light on some of those and how you navigated them?
My aotool project uses a trick to extract the AOT binary without root or disabling SIP: https://github.com/lunixbochs/meta/tree/master/utils/aotool