cpr · 2 years ago
Was at Multiflow (Yale spinoff with Josh Fisher and John O'Donnell) '85-90 and saw the VLIW problem up close (was in the OS group, eventually running it).

The main problem was compiler complexity -- the hoped-for "junk parallelism" gains really never panned out (maybe 2-3X?), so the compiler was best when it could discover, or be fed, vector operations.

But Convex (main competitor at the time) already had the "minisupercomputer vector" market locked up.

So Multiflow folded in early '90 (I had already bailed, seeing the handwriting mural) after burning through $60M in VC, which was a record at the time, I believe.

gregw2 · 2 years ago
Multiflow folded in early '90, but then in 1994 HP and Intel bet their futures on this? Convex (using HP PA-RISC, no less) defeated Multiflow, but HP still thought they could make it work? What was the nature of HP's misjudgment? They must have had a theory for why Multiflow failed but they would succeed, no?

I recall there was new research coming out of the University of Illinois at Urbana-Champaign that breathed new hope into this, but I would be curious why HP+Intel pursued it if Multiflow failed. I do recall HP brought a compiler team to the table, and per Wikipedia they had a research team working on VLIW since 1989. HP held an internal bakeoff of RISC vs. VLIW, but did it not capture the compiler challenges or workloads properly?

Strategy-wise, it made great sense for a vendor (HP) looking to phase out its RISC investments while maintaining some degree of backwards compatibility, and to become first among equals of its RISC peers by partnering with the market volume leader, Intel... but that only works if the technology pans out. Did HP management get seduced by a competitive strategic play and make a poor engineering/technical call? Did the internal research team oversell itself even knowing about Multiflow? Or was it something else? Hindsight is 20/20, but what key assumption(s) went wrong that we can all learn from?

Taniwha · 2 years ago
I worked at Chromatic, where we built a series of 2-wide VLIWs. Writing a compiler (actually just the assembler) that could extract that parallelism was pretty easy, just some low-level register flow analysis. I can imagine getting something like 6-wide would be a lot harder, though.
pavlov · 2 years ago
Funny how today burning through that amount is entirely ordinary and expected for most startups working on much more trivial problems.

Just the other day it was reported that Brex, an expense management SaaS, has a $17M / month burn rate. That’s almost $60M in one quarter.

mananaysiempre · 2 years ago
Not that it undermines your point much, but $60M (1990 resp. 1985) = $140M resp. $170M today. (I’m not good enough at this to correct for the interest rate differences as well.)
raverbashing · 2 years ago
"Expected"

I think people are just going happy-go-click-click in the AWS panel and not auditing it correctly

(And if AWS would release aws left-pad I'm sure some people would pay for it)

OhMeadhbh · 2 years ago
And hilariously*, Convex was eventually eaten by HP. Though the PowerPC Altivec/Velocity engine always looked a lot like Parsec to me. The past lives on in weird places.

[*] And by "hilariously," I mean "painfully."

mkhnews · 2 years ago
Was at Convex and then HP (and then Convey) and worked quite hard porting/optimizing numerical/scientific apps for the I2. Eventually, I think performance for some apps was OK, I mean, considering a 900 MHz clock and all.
ghaff · 2 years ago
I actually have a short book on the Itanic/Itanium done, and I planned to have it released as a free download by now. But various schedule-related stuff happened and it just hasn't happened yet.

I was a mostly hardware-focused industry analyst during Itanium's heyday so I find the topic really interesting. From a technical perspective, compilers (and dependency on them) certainly played a role but there were a bunch of other lessons too around market timing, partner strategies, fighting the last war, etc.

demondemidi · 2 years ago
I worked on Merced post-silicon and McKinley pre-silicon. I wasn't an architect on the project; I just worked on keeping the power grid alive and thermals under control. It reminded me of working on the 486: the team was small and engaged, even though HP was problematic for parts of it. Pentium Pro was sucking up all the marketing air, so we were kind of left alone to do our own thing since the part wasn't making money yet.

This was also during the corporate-wide transition to Linux, replacing AIX/SunOS/HP-UX. I had a Merced in my office, but sadly it was running Linux in 32-bit compatibility mode, which is where we spent a lot of time fixing bugs because we knew lots of people weren't going to port to IA64 right away, and that ate up a ton of debug resources. The world was still migrating to Windows NT 3.5 and Windows 95, so migrating to 64-bit was way too soon.

I don't remember when the Linux kernel was finally ported to IA64, but it seemed odd to have a platform without an OS (or with an OS running in 32-bit mode). We had plenty of emulators; there's no reason why pre-silicon kernel development couldn't have happened faster (which was what HP was supposed to be doing). Kind of a bummer, but it was a fun time, before the race to 1 GHz became the next $$$ sink / pissing contest.
hawflakes · 2 years ago
I was at HP pre-Merced tape-out and HP did have a number of simulators available. I worked on a compiler-related team so we were downstream.

As for running Linux in 32-bit compatibility mode, wasn't that the worst of all worlds on Merced? When I was there, which was pre-Merced tape-out, a tiny bit of the chip was devoted to the IVE (Intel Value Engine), which the docs stated was supposed to be just good enough to boot the firmware and then jump into IA64 mode. I figured at the time that this was the goal: boot in 32-bit x86, then jump to 64-bit mode.

BirAdam · 2 years ago
Do it, do it, do it!
ghaff · 2 years ago
I will but I want to use it as part of a website relaunch and, for various reasons, the appropriate timing of that relaunch slipped out.
ethbr1 · 2 years ago
Curious question on the period.

Assuming Itanium released as actually happened... (timeline, performance, compiler support, etc)

What else would have had to change for it to get market adoption and come out on top? (competitors, x86 clock rate running into ceiling sooner, etc)

captaincrowbar · 2 years ago
Well, what actually killed it historically was AMD64. AMD64 could easily not have happened; AMD has a very inconsistent track record. Other contemporary CPUs like Alpha were never serious competitors for mainstream computing, and ARM was nowhere near being a contender yet. In that scenario, mainstream PC users would obviously have stuck with x86-32 for much longer than they actually did, but I think in the end they wouldn't have had any real choice but to be dragged kicking and screaming to Itanium.
hinoki · 2 years ago
My uninformed opinion: lots of speculative execution is good for single core performance, but terrible for power efficiency.

Have data centres always been limited by power/cooling costs, or did that only become a major consideration during the move to more commodity hardware?

tibbydudeza · 2 years ago
Seeing the direction Intel is going with heterogeneous compute (P vs. E cores) and their patent to replace hyperthreading with the concept of "rentable units", it seems they are now exposing the innards of the CPU (the Thread Director) and making it more flexible for OS control, so that the OS can use better algorithms to decide where, when, and for how long work runs.

quic_bcain · 2 years ago
A modern history of VLIW should also mention the Hexagon DSP architecture used by Qualcomm in its SoCs.

With a smaller target market it's probably more sustainable than Itanium was.

Disclaimer: Qualcomm employee working on hexagon toolchain.

p_l · 2 years ago
Also, GPU VLIW architectures (including GCN and its successors CDNA and RDNA), and yes, various coprocessors.

Once heard a comparison that Itanium was pretty good as a fast DSP, but too expensive XD

MindSpunk · 2 years ago
GCN is not VLIW (and it follows that neither are RDNA and its derivatives). You're thinking of TeraScale, the generation before GCN, which was VLIW.
chasil · 2 years ago
Sophie Wilson also mentions Firepath in several of her YouTube lectures.
mjevans · 2 years ago
VLIW reminded me of Transmeta, but unfortunately...

"For Sun, however, their VLIW project was abandoned. David Ditzel left Sun and founded Transmeta along with Bob Cmelik, Colin Hunter, Ed Kelly, Doug Laird, Malcolm Wing and Greg Zyner in 1995. Their new company was focused on VLIW chips, but that company is a story for another day."

chx · 2 years ago
> These delays didn’t stop the hypetrain.

This is an understatement. From an older article, "How the Itanium killed the Computer Industry": https://www.pcmag.com/archive/how-the-itanium-killed-the-com...

> In 1997 Intel was the king of the hill; in that year it first announced the Itanium or IA-64 processor. That same year, research company IDC predicted that the Itanium would take over the world, racking up $38 billion in sales in 2001.

> What we heard was that HP, IBM, Dell, and even Sun Microsystems would use these chips and discontinue anything else they were developing. This included Sun making noise about dropping the SPARC chip for this thing—sight unseen. I say "sight unseen" because it would be years before the chip was even prototyped. The entire industry just took Intel at its word that Itanium would work as advertised in a PowerPoint presentation.

And then the original article has an Intel leader saying "Everything was new. When you do that, you're going to stumble". Yeah, much as Intel stumbled with the Pentium 4 and basically everything since Skylake in 2015 (which was late). Let's emphasize this: for nearly ten years now, Intel hasn't been able to deliver on time and on target. Just last year, Sapphire Rapids, after being two years late, shipped in March 2023 and had to be paused in June because of a bug. Meteor Lake was also two years late. And in 2020: https://www.zdnet.com/article/intels-7nm-product-transition-...

> Intel's first 7nm product, a client CPU, is now expected to start shipping in late 2022 or early 2023, CEO Bob Swan said on a conference call Thursday.

> The yield of Intel's 7nm process is now trending approximately 12 months behind the company's internal target.

Well, then the internal target must have been late 2021, and it came out in late 2023.

gregw2 · 2 years ago
I’d be interested in understanding why the compilers never panned out but have never seen a good writeup on that. Or why people thought the compilers would be able to succeed in the first place at the mission.
_chris_ · 2 years ago
> I’d be interested in understanding why the compilers never panned out but have never seen a good writeup on that. Or why people thought the compilers would be able to succeed in the first place at the mission.

It's a fundamentally impossible ask.

Compilers are being asked to look at a program (perhaps watch it run a sample set) and guess the bias of each branch to construct a most-likely 'trace' path through the program, and then generate STATIC code for that path.

But programs (and their branches) are not statically biased! So it simply doesn't work out for general-purpose codes.

However, programs are fairly predictable, which means a branch predictor can dynamically learn the program path and regurgitate it on command. And if the program changes phases, the branch predictor can re-learn the new program path very quickly.
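
A toy C example of that distinction (purely illustrative; the names and the phase scheme are mine): the branch below is heavily taken in one phase and heavily not-taken in the other, so no single static trace layout wins in both, while a dynamic predictor just re-learns the bias when the phase changes.

    #include <stddef.h>

    /* The branch bias flips with `phase`: statically unpredictable
       (any one trace layout loses in one of the phases), dynamically
       easy (the predictor re-learns within a few iterations). */
    long walk(const int *data, size_t n, int phase) {
        long acc = 0;
        for (size_t i = 0; i < n; i++) {
            if ((data[i] > 0) == (phase == 1))
                acc += data[i];
            else
                acc -= data[i];
        }
        return acc;
    }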

Now if you wanted to couple a VLIW design with a dynamically re-executing compiler (dynamic binary translation), then sure, that can be made to work.

yvdriess · 2 years ago
> Now if you wanted to couple a VLIW design with a dynamically re-executing compiler (dynamic binary translation), then sure, that can be made to work.

RIP Transmeta

gregw2 · 2 years ago
This makes a lot of sense to me, thanks for boiling it down. Compilers can predict the upcoming code instructions decently, but not really the upcoming data, so on branch-heavy commercial/database server workloads VLIW doesn't work that well compared to the branch prediction, speculative execution, and out-of-order complexity it tried to simplify away. Does that sound right?
actionfromafar · 2 years ago
I think it could have worked if the IDE had had performance instrumentation (some kind of tracing) that would have been fed into the next build. (And perhaps several iterations of this.)
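
That feedback loop exists today as profile-guided optimization. A minimal sketch, assuming GCC (the flags are real GCC flags; the file name and the mod-17 workload are made up for illustration):

    /* pgo_demo.c -- a branch whose bias only a training run reveals.
     * A possible build loop:
     *   gcc -O2 -fprofile-generate pgo_demo.c -o demo
     *   ./demo < training_input      # writes .gcda profile data
     *   gcc -O2 -fprofile-use pgo_demo.c -o demo
     * The second compile lays out and schedules code using the
     * measured branch probabilities, much like the IDE loop above. */
    #include <stdio.h>

    int main(void) {
        int x, hits = 0, total = 0;
        while (scanf("%d", &x) == 1) {
            total++;
            if (x % 17 == 0)   /* bias unknown until profiled */
                hits++;
        }
        printf("%d/%d\n", hits, total);
        return 0;
    }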

Another way to leverage the Itanium's power would have been to make a Java virtual machine go really fast, with dynamic binary translation. That way you'd sidestep all the C UB optimization caveats.

dfox · 2 years ago
One big reason is that it was 20 years ago. At that time, gcc did only rudimentary data-flow analysis, and full SSA dataflow was at best an experimental feature. Also, the market would not really accept a C compiler that did the kind of aggressive UB exploitation needed to extract the parallelism from C code (instead, people mostly tended to pass -Wno-strict-aliasing and friends to reduce "warning noise").

This issue is somewhat C-specific, and Fortran compilers produced decidedly better IA-64 code than C compilers did. That, together with the Itanium's respectable FP performance, made it somewhat popular for HPC.
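
A minimal sketch of the aliasing point (function and variable names are mine): under C's strict-aliasing rule the compiler may assume stores through an int pointer and a float pointer can't overlap, which is exactly the independence a VLIW scheduler needs. Fortran grants the equivalent no-aliasing assumption for array arguments by default, which is one reason it fared better.

    /* With strict aliasing, the compiler may assume *f cannot overlap
       *i, so the two stores can be scheduled into parallel slots and
       the final read folded to the constant 1 without reloading memory.
       Code that type-puns through pointer casts breaks under this
       assumption, hence the -Wno-strict-aliasing in real-world builds. */
    int scheduled_freely(int *i, float *f) {
        *i = 1;
        *f = 2.0f;
        return *i;
    }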

clausecker · 2 years ago
There are a number of reasons for the Itanium's poor performance, and it's the combination of these various factors that did it in. I wasn't present back in the Itanium's heyday, but this is what I gathered.

As a quick recap, superscalar processors have multiple execution units, each of which can execute one instruction per cycle. So if you have three execution units, your CPU can execute up to three instructions every cycle. The conventional way to make use of more than one execution unit is an out-of-order design, where a complicated mechanism (the Tomasulo algorithm) decodes multiple instructions in parallel, tracks their dependencies, and dispatches them onto execution units as they become executable. Dependencies are resolved by having a large physical register file, which is dynamically mapped onto the programmer-visible logical register file (register renaming). This works well, but is notoriously complex to implement and requires a couple of extra pipeline stages between decode and execution, increasing the latency of mispredicted branches.
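
A toy C sketch of the renaming point (illustrative only): the two products below are independent, and even if the compiler's register allocator happens to reuse one architectural register for both temporaries, an out-of-order core gives each write a fresh physical register, so both multiplies can be in flight at once.

    /* Independent work that register renaming lets overlap: a WAR/WAW
       hazard on a reused architectural register disappears once each
       write is mapped to its own physical register. */
    float dot2(const float *a, const float *b) {
        float t0 = a[0] * b[0];   /* can be issued together...  */
        float t1 = a[1] * b[1];   /* ...with this one           */
        return t0 + t1;           /* the add waits on both      */
    }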

The idea of VLIW architectures was to improve on this by moving the decision of which instruction to execute on which port into the compiler. The compiler, having prescient knowledge of what your code is going to do next, can compute the optimal assignment of instructions to execution units. Each instruction word is a pack of multiple instructions, one per port, that execute simultaneously (these words become very wide, hence VLIW, Very Long Instruction Word). In essence, all the machinery of the out-of-order mechanism between decode and the execution ports can be done away with, and the decoder is much simpler, too.
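
To make the contrast concrete, a hedged C sketch (not real IA-64 code; names are mine): with non-aliasing pointers, the first function's three updates are provably independent, so a VLIW compiler can place them in separate slots of one instruction word; the second is a dependency chain that no slot count can speed up.

    /* Three independent updates: restrict promises no overlap, so a
       3-wide VLIW can (roughly) pack these into one bundle's slots. */
    void packable(int *restrict a, int *restrict b, int *restrict c) {
        *a += 1;   /* slot 0 */
        *b += 1;   /* slot 1 */
        *c += 1;   /* slot 2 */
    }

    /* A serial dependency chain: each step needs the previous result,
       so extra slots can only be filled with NOPs (or with unrelated
       work the compiler manages to hoist in). */
    int serial(int x) {
        x = x + 1;
        x = x * 3;
        return x - 2;
    }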

However, things fail in practice:

* the whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all too well.

* This issue was exacerbated by the Itanium's dreadful model for fast memory loads. You see, loads can take a long time to finish, especially if cache misses or page faults occur. To fix that, the Itanium has the option to do a speculative load, which may or may not succeed at a later point. So you can do a load from a dubious pointer, then check if the pointer is fine (e.g. is it in bounds? Is it a null pointer?), and only once it has been validated you make use of the result. This allows you to hide the latency of the load, significantly speeding up typical business logic. However, the load can still fail (e.g. due to a page fault), in which case your code has to roll back to where the load should have been performed and then do a conventional load as a back-up. Understandably, few if any compilers ever made use of this feature, and load latency was dealt with rather poorly. (A C sketch of this pattern follows after this list.)

* Relatedly, the latency of some instructions like loads and division is variable and cannot easily be predicted. So there usually isn't even one perfect schedule for the compiler to find. It turns out the schedule is much better when you leave it to the Tomasulo mechanism, which has accurate knowledge of the latency of already-executing long-latency instructions.

* By design, VLIW instruction sets encode a lot about how the execution units work in the instruction format. For example, Itanium is designed for a machine with three execution units and each instruction pack has up to three instructions, one for each of them. But what if you want to put more execution units into the CPU in a future iteration of the design? Well, it's not straightforward. One approach is to ship executables as bytecode, which is only scheduled and encoded on the machine it is installed on, allowing the instruction encoding, and thus the number of ports, to vary. Intel chose a different approach and instead implemented later Itanium CPUs as out-of-order designs, combining the worst of both worlds.

* Due to not having register renaming, VLIW architectures conventionally have a large register file (128 registers in the case of the Itanium). This slows down context switches, further reducing performance. Out-of-order CPUs can cheat by having comparatively small programmer-visible state, with most of the state hidden in the bowels of the processor and consequently not in need of saving or restoring.

* Branch prediction rapidly grew more and more accurate shortly after the Itanium's release, reducing the importance of fast recovery from mispredictions. These days, branch prediction is up to 99% accurate, and out-of-order CPUs can evaluate multiple branches per cycle using speculative execution, a feature that is not possible with a straightforward VLIW design due to the lack of register renaming. So Intel locked itself out of one of the most crucial strategies for better performance with this approach.

* Another engineering issue was that x86 emulation on the Itanium performed quite poorly, giving existing customers no incentive to switch. And those who did decide to switch found that if they were going to invest in porting their software, they might as well make it fully portable and be independent of the architecture. This is the same problem that led to the death of DEC: by forcing their customers to rewrite all the VAX software for the Alpha, they created a bunch of customers who were no longer locked into their ecosystem and could now buy whatever UNIX box was cheapest on the open market.
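
As promised above, a sketch of the speculative-load pattern in C, with the rough IA-64 mapping in comments. The mnemonics (ld8.s, chk.s) and the NaT mechanism are from memory of the architecture docs; treat the details as approximate, and the struct is a stand-in.

    struct node { int val; };   /* stand-in type for illustration */

    int lookup(const struct node *p) {
        /* The compiler hoists the load ABOVE the guard, roughly:
         *     ld8.s  r4 = [p]        // speculative load: a fault is
         *                            // deferred by setting r4's NaT
         *                            // ("Not a Thing") bit
         *     ... independent work scheduled here hides the latency ...
         *     cmp.eq p6, p7 = 0, p   // the original null check
         *     chk.s  r4, recover     // NaT set? branch to recovery
         *                            // code that redoes the load */
        if (p == NULL)
            return -1;
        return p->val;   /* in the speculative schedule this value was
                            already in flight before the check */
    }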

Kon-Peki · 2 years ago
> To fix that, the Itanium has the option to do a speculative load, which may or may not succeed at a later point. So you can do a load from a dubious pointer, then check if the pointer is fine (e.g. is it in bounds? Is it a null pointer?), and only once it has been validated you make use of the result.

Way back in the day, as a fairly young engineer, I was assigned to a project to get a bunch of legacy code migrated from Alpha to Itanium. The assignment was to "make it compile, run, and pass the tests. Do nothing else. At all."

We were using the Intel C compiler on OpenVMS, and every once in a while we would encounter a crash in a block of code that looked something like this:

   if(ptr != NULL && ptr->val > 0) {
     //do something
   } else {
     //init the ptr
   }
It was evaluating both parts of the if statement simultaneously and crashing on the second. Not being allowed to spend too much time debugging or investigating the compiler options, we did the following:

   if(ptr != NULL) {
     if(ptr->val > 0) {
       //do something
     }
   } else {
     //init the ptr
   }
Which resolved the problem!

EDIT - I recognize that the above change introduces a potential bug in the program ;) Obviously I wasn't copying code verbatim - it was 10-15 years ago! But you get the picture - the compiler was wonky, even the one you paid money for.

acdha · 2 years ago
> the whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all too well.

I definitely think that keeping their compilers as an expensive license was a somewhat legendary bit of self-sabotage but I’m not sure it would’ve helped even if they’d given them away or merged everything into GCC. I worked for a commercial software vendor at the time before moving into web development, and it seemed like they basically over-focused on HPC benchmarks and a handful of other things like encryption. All of the business code we tried was usually slower even before you considered price, and nobody wanted to spend time hand-coding it hoping to make it less uneven. I do sometimes wonder if Intel’s compiler team would have been able to make it more competitive now with LLVM, WASM, etc. making the general problem of optimizing everything more realistic but I think the areas where the concept works best are increasingly sewn up by GPUs.

Your comment with DEC was spot-on. A lot of people I met had memories of the microcomputer era and were not keen on locking themselves in. The company I worked for had a pretty large support matrix because we had customers running most of the “open systems” platforms to ensure they could switch easily if one vendor got greedy.

sillywalk · 2 years ago
> By design, VLIW instruction sets encode a lot about how the execution units work in the instruction format. For example, Itanium is designed for a machine with three execution units and each instruction pack has up to three instructions, one for each of them. But what if you want to put more execution units into the CPU in a future iteration of the design? Well, it's not straightforward. One approach is to ship executables as bytecode, which is only scheduled and encoded on the machine it is installed on, allowing the instruction encoding, and thus the number of ports, to vary.

This was how Sun's MAJC[0] worked -

" For instance, if a particular implementation took three cycles to complete a floating-point multiplication, MAJC compilers would attempt to schedule in other instructions that took three cycles to complete and were not currently stalled. A change in the actual implementation might reduce this delay to only two instructions, however, and the compiler would need to be aware of this change.

This means that the compiler was not tied to MAJC as a whole, but a particular implementation of MAJC, each individual CPU based on the MAJC design.

...

The developer ships only a single bytecode version of their program, and the user's machine compiles that to the underlying platform. "[0]

[0] https://en.wikipedia.org/wiki/MAJC

hawflakes · 2 years ago
> Itanium is designed for a machine with three execution units and each instruction pack has up to three instructions, one for each of them.

The design was that each bundle had some extra bits, including a stop bit, which acted as a sort of barrier to execution. The idea was that you could have a series of bundles with no stop bit, and the last one would set it. That meant the whole series could be safely scheduled on a future, wider IA64 machine. Of course, that meant the compiler had to be explicit about the parallelism (hence EPIC), but future machines would be able to schedule onto the extra execution units. This also addressed the traditional VLIW problem of requiring recompilation to run, or to run more efficiently, on newer hardware.

> Due to not having register renaming, VLIW architectures conventionally have a large register file (128 registers in the case of the Itanium). This slows down context switches, further reducing performance. Out-of-order CPUs can cheat by having comparatively small programmer-visible state, with most of the state hidden in the bowels of the processor and consequently not in need of saving or restoring.

Itanium borrowed register windows from SPARC. It was effectively a hardware stack with a minimum of 128 physical registers, referenced in instructions by 7-bit fields (128 addressable registers, IIRC). So you could make a function call and the stack would push, and a return would pop. Just like SPARC, except they weren't fixed-size windows.

That said, the penalty for spilling the RSE (they called this part the Register Stack Engine) on, say, an OS context switch was quite heavy, since you'd have to write the whole RSE state to memory.

It was pretty cool reading about this stuff as a new grad.

> Another engineering issue was that x86 emulation on the Itanium performed quite poorly, giving existing customers no incentive to switch.

As I mentioned in my previous comment, Merced had a tiny corner of the chip devoted to the IVE (Intel Value Engine), which was meant to be a very simple 32-bit x86 core, mainly for booting the system. The intent was (and the docs also had sample code) to boot, do some setup of system state, and then jump into IA64 mode, where you would actually get a fast system.

I think they did devote more silicon to x86 support, but I had already served my very short time at HP, and Merced still took 2+ years to tape out.

xoranth · 2 years ago
> * the whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all too well.

Was Intel's compiler actually able to get good performance out of Itanium? How much less screwed would Itanium have been if other toolchains had matched the performance of Intel's compiler?

Also, I vaguely remember reading that Itanium also had a different page table structure (like a hash table?). Did that cause problems too?

actionfromafar · 2 years ago
Wow, another great explanation!

I didn't know these things; I don't think they're part of the meme-lore about Itanium:

- The problems with the fast speculative loads and their compiler support

- I didn't understand the implications of a completely visible register file

- The trouble with "hard-coding" three execution units: very bad if you can't recompile your code and/or bytecode to a new binary when you get a new CPU.

Your last point, about coding your way out of the ecosystem: I wonder if that might have been a reason why Intel didn't go all-in on making Itanium the Java machine...

Findecanor · 2 years ago
"Something of a tragedy: the Itanium was Bob Rau's design, and he died before he had a chance to do it right. His original efforts wound up being taken over for commercial reasons and changed into a machine that was rather different than what he had originally intended and the result was the Itanium. While it was his machine in many ways, it did not reflect his vision."

Quote from Ivan Godard of Mill Computing: https://www.youtube.com/watch?v=JS5hCjueqQ0&t=4054s

Bob Rau: https://en.wikipedia.org/wiki/Bob_Rau

flakiness · 2 years ago
VLIW is everywhere in the client-side ML accelerator space, for some reason.

Another comment mentioned Snapdragon's Hexagon, which they are trying to rebrand as an NPU with some matmul circuits.

Intel Core's NPU, which is based on the Movidius VPU, also has a VLIW-based core in it, called SHAVE.

And AMD's XDNA NPU, which is based on Xilinx Alveo, also has a VLIW-based core they call the AI Engine.