Yeah this isn't quite what happened. Firstly, Intel didn't start Itanium, HP did, as a successor to their HP Precision (PA-RISC) line. I forget how they got together; it became a collaboration between Intel and HP, but HP started it and had the architecture largely defined before Intel got involved.
Secondly, it's true that AMD hammered the nails into the coffin, but AMD wouldn't have mattered if Itanic had been faster, cheaper, and on time. Itanic was a disaster partly because of an overly complicated design by committee and partly because of the fundamentally flawed assumption that you don't need dynamic scheduling (AKA OoO processing).
I have an Itanium in the garage, a monument to hubris.
UPDATE: I forgot to mention that from the outside it might seem that Intel had a singular vision, but the reality is that there were massive political battles internally and the company was largely split into IA-64 and x86 camps.
UPDATE2: Itanium was massively successful at one thing: it killed off Alpha and a few other CPUs, based purely on what Intel claimed.
I've always thought that killing off Alpha in favour of pushing Itanium was one of the worst things Intel/HP could have done. Not only was Alpha more advanced architecturally, it was actively implemented and mature. With active development by HP, it could have easily snowballed into the standard cloud hardware platform.
Alpha's fate, like that of the other proprietary RISC architectures that focused on the lucrative but ultimately small workstation market, was sealed. With exponentially increasing R&D and manufacturing costs, massive industry consolidation was inevitable. That it was Itanium that delivered the coup de grâce to Alpha was merely the final insult; it would have happened anyway without Itanium.
And it wasn't like the Alpha was some embodiment of perfection either. E.g. that mindbogglingly crazy memory consistency model.
Itanium was Intel's second VLIW design. The first was the i860, which was a mixed bag: eye-poppingly fast if the instruction bundles were handcrafted by a human, or dog slow if a compiler emitted the code.
Perhaps there was a belief back then that compilers could easily be optimised or uplifted to generate fast and efficient code, and that did not turn out to be the case. Project management and mismanagement certainly did not help either.
I wonder how a VLIW architecture would pan out today given advances in compilers in the last three decades, and whether an ML-assisted VLIW backend could deliver on the old dream.
GPUs today have a more VLIW-like architecture than CPUs and almost every neural network accelerator is a VLIW chip of some kind. It's worked out really well.
The big problem is SMT, since it's hard to share a VLIW core between processes, while a superscalar core shares really well.
> fundamentally flawed assumption (that you don't need dynamic scheduling, AKA OoO processing)
I'm still unconvinced this is fundamental. It certainly was flawed back then, but compiler theory has improved a LOT since then; we have polyhedral optimization, for example, which we didn't have access to... You could probably optimize delay-line technology that way.
If you know how long data will take to go to/from memory then you can schedule pretty well.
If you don't know whether some value will hit in the L1 or the L3 cache there's wild variance on how long it'll take so you have to do something else in the meantime. On x64, that's the pipeline and speculation. On a GPU, you swap to another fibre/warp until the memory op finished.
Fundamentally the hardware knows how long the memory access took, the software can only guess how long it will take. That kills effective ahead of time scheduling on most architectures.
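To make that concrete, here's a toy Python model of an in-order, statically scheduled machine. All the latency numbers and hit probabilities are invented for illustration; the point is only that the compiler schedules for one fixed latency while the actual latency is a runtime distribution, so the static schedule is always paying for the misses it couldn't predict.

```python
import random

# Hypothetical numbers: the compiler schedules every load as if it hits
# L1 (4 cycles); at run time some loads actually miss to L2/L3/DRAM.
ASSUMED_LATENCY = 4
ACTUAL = {"L1": 4, "L2": 14, "L3": 50, "DRAM": 200}
HIT_PROB = [("L1", 0.90), ("L2", 0.06), ("L3", 0.03), ("DRAM", 0.01)]

def where_it_hits():
    """Randomly pick which cache level a load hits, per HIT_PROB."""
    r, acc = random.random(), 0.0
    for level, p in HIT_PROB:
        acc += p
        if r < acc:
            return level
    return "DRAM"

def ideal_cycles(n_loads):
    # What the compiler planned for: every load takes ASSUMED_LATENCY.
    return n_loads * ASSUMED_LATENCY

def static_schedule_cycles(n_loads):
    # In-order VLIW: any load slower than planned stalls the whole
    # bundle stream until the data actually arrives.
    return sum(ACTUAL[where_it_hits()] for _ in range(n_loads))

random.seed(0)
n = 10_000
planned, actual = ideal_cycles(n), static_schedule_cycles(n)
print(f"planned {planned} cycles, actual {actual} cycles "
      f"({actual / planned:.1f}x slower than scheduled)")
```

Even with a 90% L1 hit rate, the rare DRAM misses dominate the schedule, which is exactly the gap an OoO core (or a GPU's warp switching) papers over at runtime.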
Applications tend to exhibit a lot of dynamic behaviour. Consider any graph munger. I do expect there is an interesting subset of applications, particularly those that have fairly narrow scope and that have been highly optimised, which could be effectively statically scheduled. But for general-purpose computation, I don't buy it.
Perhaps as part of the trend towards increasingly heterogeneous architectures, we'll see big VLIW coprocessors for power efficiency in certain serial workloads (GPUs are massively parallel but only slightly VLIW coprocessors; and they do have dynamic scheduling).
> It certainly was flawed back then, but compiler theory has improved a LOT since then
Which would be fine if there was a time machine available to send back what we know now to the past. But at the time, having sucky compilers for what you wanted to (try to) accomplish was a bad idea.
The same decision now may be good, but then it was a mistake. It'd be like trying to have a Moon landing project in the 1920s: we got there eventually, but certain things are only possible when the time 'is right'.
This is the same argument that was being made in the 90s, ironically, that compilers were now good enough that it would work.
The thing is, on regular, array-based codes, it's great. And it was then, too -- the polyhedral approach was all being developed at exactly that time, but maybe it's not clear in hindsight because the terminology hadn't settled yet. Ancourt and Irigoin's "Scanning Polyhedra With Do Loops" was published in 1991. Lots of the unification of loop optimization was labeled affine optimization or described as based on Presburger arithmetic. But that is the technology that they were depending on to make it work.
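For anyone who hasn't seen it, the kind of transformation those affine/polyhedral frameworks prove legal looks like this sketch (sizes and data are arbitrary): a loop nest with purely affine array accesses can have its iteration space re-tiled or reordered, because the dependence analysis is exact at compile time.

```python
# A matrix-vector product: all accesses are affine functions of the
# loop counters, so a polyhedral compiler knows the full dependence
# structure and can legally retile the iteration space.
N = 64
A = [[(i * 31 + j * 7) % 10 for j in range(N)] for i in range(N)]
x = [(j * 3) % 10 for j in range(N)]

def matvec_naive(A, x):
    y = [0] * N
    for i in range(N):
        for j in range(N):
            y[i] += A[i][j] * x[j]
    return y

def matvec_tiled(A, x, T=8):
    # Same iteration space traversed in T x T tiles (better locality on
    # real hardware); legal because any order of the j-sums per i gives
    # the same result, which the dependence analysis proves.
    y = [0] * N
    for ii in range(0, N, T):
        for jj in range(0, N, T):
            for i in range(ii, min(ii + T, N)):
                for j in range(jj, min(jj + T, N)):
                    y[i] += A[i][j] * x[j]
    return y

assert matvec_naive(A, x) == matvec_tiled(A, x)
```

That guarantee evaporates the moment an access is `A[idx[j]]` instead of `A[j]`, which is the next comment's point.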
But most code we run is not Fortran in another guise. The dependences aren't known at compile time. That issue hasn't changed much.
The one change now is that workloads that were called "scientific computing" then are now being run for machine learning. But now it doesn't make sense to run regular, FP-intensive codes on a CPU at all, because of GPUs and ML accelerators. So what's left for CPUs that excel on that workload? I'm not sure there is a niche there.
The halting problem means that for typical programs, you can't prove control flow. Proving optimization means not only proving control flow, but then knowing which branches get taken most often so you can optimize. There may be a small subset of programs for which this is true, but the rest leave the compiler completely blinded.
OoO execution does an end run around this by examining the program as it runs and adapting to the current reality rather than a simulated reality. This is the same reason a JIT can do optimizations that a compiler cannot do. The ability to look 500-700 instructions into the future to bundle them together into a kind of VLIW dynamically is a very powerful feature.
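A toy sketch of that dynamic bundling, with a made-up five-instruction stream: each cycle the "core" issues up to `width` instructions whose inputs are already computed, in whatever order the runtime state allows, rather than an order fixed ahead of time.

```python
def dynamic_schedule(instrs, width=4):
    """instrs: list of (name, deps), deps being the set of instruction
    names this one needs. Returns the names issued each cycle."""
    done, cycles = set(), []
    pending = list(instrs)
    while pending:
        # Readiness is *observed*, not predicted: pick any instructions
        # whose dependencies have completed, up to the issue width.
        ready = [ins for ins in pending if ins[1] <= done][:width]
        if not ready:
            raise ValueError("dependency cycle in instruction stream")
        cycles.append([name for name, _ in ready])
        done.update(name for name, _ in ready)
        pending = [ins for ins in pending if ins not in ready]
    return cycles

# a, b, c are independent; d needs a; e needs d.
program = [("a", set()), ("b", set()), ("c", set()),
           ("d", {"a"}), ("e", {"d"})]
print(dynamic_schedule(program))  # [['a', 'b', 'c'], ['d'], ['e']]
```

A real OoO core does this over a window of hundreds of in-flight instructions, with latencies (cache misses, branch resolutions) feeding back into the "done" set as they actually complete.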
As to compiler theory, it really isn't that advanced. Our "cutting edge" compilers are doing glorified find-and-replace for their final optimizations (peephole optimization).
Look at SIMD and auto-vectorization. There are so many questions the compiler can't prove the answer to that even trivial vectorization opportunities any programmer would spot go unused, to the point where the entire area of research has resulted in basically zero improvement in real-world code.
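Here's a minimal sketch of the classic blocker, using an invented example: a loop with a carried dependence. "Vectorizing" it means reading a batch of inputs up front, and if the compiler can't prove the reads don't overlap the writes, doing so silently changes the answer, so it has to give up.

```python
def scalar_shift_add(buf, lo, hi):
    # buf[i] += buf[i-1], element by element: each iteration reads the
    # value the previous iteration just wrote (a loop-carried dependence).
    for i in range(lo, hi):
        buf[i] += buf[i - 1]
    return buf

def vectorized_shift_add(buf, lo, hi):
    # What a vector unit would do if the compiler wrongly assumed no
    # dependence: snapshot all the buf[i-1] inputs at once, then add.
    snapshot = buf[lo - 1:hi - 1]
    for k, i in enumerate(range(lo, hi)):
        buf[i] += snapshot[k]
    return buf

print(scalar_shift_add([1, 1, 1, 1], 1, 4))      # [1, 2, 3, 4]
print(vectorized_shift_add([1, 1, 1, 1], 1, 4))  # [1, 2, 2, 2]
```

In C the same problem shows up as pointer aliasing: unless the programmer promises `restrict` (or the compiler emits runtime overlap checks), vectorization is off the table.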
Because even if you get compilers to optimize for your current VLIW, that makes it harder to make improvements down the line.
The current way, while arguably pretty wasteful given all the micro-optimizing the CPU does on the incoming instruction stream, allows designers to expand the hardware almost freely to meet demand without requiring compilers to produce different code.
I've heard a rumour that supposedly one of the lead designers of the IA-64 architecture had died prematurely mid-project. And that would have left the project without the man with the vision. Hence, design by committee.
I thought the biggest problem with Itanium was the fact that it was optimising the wrong thing; it maximised single thread performance by going all in on speculative execution, but it turns out that optimising joules per instruction is much more important.
Certainly one of the issues with Itanium was it was fighting the last war.
It was trying to optimize for instruction-level parallelism when power efficiency and thread level parallelism were coming into vogue. Arguably companies like Sun overoptimized for the latter too soon but it was the direction things were going.
A senior exec at Intel told me at the time the focus on frequency in the case of Netburst was driven by Microsoft being uncomfortable with highly multi-core designs--and I have no reason to doubt that was one of the drivers. There was a lot of discussion around the challenges of parallelism, especially on the desktop, at the time. It generally wasn't the problem the hand wringing suggested it would be.
> the fundamentally flawed assumption (that you don't need dynamic scheduling, AKA OoO processing)
Could this have worked better with JIT-compiled applications, e.g. Java given a sufficiently clever JVM, where assumptions can be dynamically adjusted at runtime?
This was the grand hope but it never panned out. It’s possible now that a sufficiently brilliant compiler could make a difference since there was nothing like LLVM at the time and GCC was far less sophisticated. One of the many acts of self-sabotage Intel committed was insisting on their hefty license fees for icc, which meant that almost all real-world comparisons were made using code compiled using GCC or MSVC, which were not as effective optimizing for Itanium. There’s no way they made enough in revenue to balance out all of those lost sales.
The other point in favor of this approach now is that far more code is using high-level libraries. Back then there was still the assumption that distributing packages was hard and open source was distrusted in many organizations so you had many codebases with hand rolled or obsolete snapshots of things we’d get from a package manager now. It’s hard to imagine that wouldn’t make a difference now if Intel was smart enough to contribute optimizations upstream.
Wasn’t that always the issue with Itanium - that it could have been fast with a sufficiently clever compiler? The problem seemed to be no one was clever enough to write that compiler.
PA-RISC seemed fairly neat from the somewhat limited information you can find online (there's a 1.1 and 2.0 ISA manual on kernel.org). Were there major issues with the ISA? Killing your entire product line and starting over has, of course, rarely worked.
The PA-RISC 1.1 ISA encoded specific implementation details that didn't age well, like the branch delay slot and instruction address queues. It also required in-order memory accesses, because there was no support for cache-coherent I/O.
HP-UX 11 was one of the UNIXes I worked on (1999 - 2002, 2005), and the only issue I had, at least at the first employer, was their ongoing transition to 64 bits, and that the C compiler we had available (aC) was a mix of K&R C with some ISO/ANSI C compliance.
I don't recall any issues with the ISA, and we really liked using its early container capabilities (HP Vault).
> but AMD wouldn't have mattered if Itanic had been faster, cheap, and on time
I dunno about that.
My own personal opinion is that Intel has never been able to re-architect their way out of the fact that the cornerstone of their success is that they are selling x86, and their customers mostly don't care about the theoretical advantages of the bright shiny new thing. They just want to run their software just like they always have. There's a reason why IA-64 has joined iAPX432, i860, StrongARM and i960 as Intel footnotes (outside of the embedded market).
When they were philosophizing about what Itanic should look like, the only thing that x86 obviously needed from a market perspective was a bigger address space. And AMD was smart enough to deliver on that, and here we are.
Itanium didn't kill off Alpha. Intel x86 pricing did. But the most important unheeded lesson in those days was software compatibility. We went from the days of each computer having its own word length, instruction set, heck, even data format (remember the endian wars?) to source compatibility to binary compatibility. We learned that for most usage, software stability that allowed taking advantage of Moore's law was seriously more valuable in most cases than gaining a bit more performance or price/performance by changing architectures.
Intel kept the X86 price at a point where no bean counter would favor investing in new architectures. Fortunately AMD broke the headlock on x86.
Well Intel's original plan was to keep x86 32 bit, forcing anyone that needed more into IA64. Fortunately AMD came out with x86-64, and when it was clear that IA64 wasn't going to be competitive, Intel brought x86-64 to their chips.
If you look at the SPEC CPU benchmarks, Itanium was not bad at all in terms of instructions per cycle. IIRC in FP performance it could beat a Netburst Pentium 4 running at twice its clock speed, and it even compared favorably to Core. I.e. if Intel had produced Itaniums that ran at the same clock speed as Core CPUs, it would have been the world's fastest single-threaded number cruncher.
So I don't really buy the "Itanium is bad architecture" story.
Its fate was probably decided around 1999-2000. At that point Itanium still looked pretty good against the Pentium 3 and Pentium 4. And the name "IA-64" indicates Intel didn't plan to make 64-bit Pentiums. So eventually Pentiums would fill the low-end segment while the rest would be occupied by 64-bit Itaniums.
AMD killed that plan by releasing the AMD64 architecture. It was an obvious upgrade to x86, so it would clearly do better in the market than IA-64. So Intel decided to go for x86-64 too, and Itanium was doomed at that point. They didn't even bother making Itaniums with the same clock speed as Xeons.
So it's definitely possible that if AMD had decided to stick to 32 bits at that time, Intel would have pushed an optimized IA-64. Also, AMD64 could have been worse than it is. E.g. if they had only increased the register size but kept the number of registers the same, IA-64 could still have come out on top.
> AMD killed that plan by releasing AMD64 architecture. It was an obvious upgrade to x86, so it would clearly do better in the market than IA-64.
One of the best aspects of the Opteron was that it also happened to be a fantastic 32-bit CPU in addition to AMD64. This was a period where a lot of software, even FOSS, wasn't 64-bit clean. There was a lot of pointer arithmetic hiding deep in libraries that assumed pointers would always be 32 bits.
The Opteron running a 32-bit OS at least as well as a 32-bit Athlon was a huge point in its favor. So your existing system running on new Opteron hardware ran fine and you could mix and match Xeon and Opterons in a fleet. Then switch over to 64-bit on the Opterons for (hopefully) better performance.
One thing to remember is that FP performance with well-scheduled code was by far its strongest performing area, and Intel put a lot of work into tuning their compiler for those specific tests. The problem was that it fell off heavily the less your code is like that, especially for the branchy code most business apps depend on.
The other big problem was that the x86 compatibility story was worse than the earlier hopes. That meant it not only wasn't competitive with the current-generation competition but often even the previous generation or worse - note losing to the original Pentium or even a 486 here:
Now, they could have improved that but statistically nobody was going to pay considerably more for lower performance in the hopes that a future update would improve matters.
The Athlon and Opteron weren’t just fast, they also had flawless 32-bit support so even if your 64-bit software update never happened you could justify the purchase based on their price/performance.
Itanium can be regarded as a huge success. Maybe some don't remember, but the minicomputers and unices were where all the money was. Sure, Intel had the process edge, and had completely cornered the microcomputer market. But PCs had slim margins and didn't really matter in the bigger scope of important data processing.
Many of the computer nerds watched in awe as vendor after vendor dropped their hugely expensive and engineering heavy custom CPU architectures and lined up behind Intel. IBM was the only big player who didn't swallow the bait. "Even if they fail, that's a huge success" was a common observation at the time.
And sure enough. I don't think they failed on purpose, but business wise it was a win-win situation. The x86 architecture would have won anyway, because of the sheer scale, but the Itanium wreckage hastened it. Everyone needed to move, so why not move to x86/Linux directly?
> In some ways Itanium was the most successful bluff ever played in the tech industry. In much the same way that Reagan's Star Wars bankrupted the Soviet Union, it got almost every single competitor to fold. Back at the beginning of the project, Intel was nowhere in high-end & 64-bit computing. There was HP (PA-RISC), Sun (Sparc), Dec (Alpha), IBM (Power), MIPS (SGI). Intel wisely picked the partner with the stupidest management (Carly) to give up their competitive edge and announce to analysts that Intel's vision/roadmap was so awesome that RISC was dead and that they were going to follow the bidding of their master Intel for their 64-bit plan. Wall Street bought into the story so much that almost everyone else with competitive chips folded their strong hands to Itanium's bluff - SGI spun off MIPS and MIPS decided to leave the high-end space. Compaq undervalued Alpha and let it die. Sun tried to become a software company, and if it weren't for Fujitsu making modern Sparcs, Sparc would be dead.
> Basically, with nothing but PR and Carly's stupidity, Intel wiped out over half of the high-end computing processor market.
> Thankfully AMD had the vision to see through the bluff and saw the opportunity for 64-bit computing that worked; and thankfully IBM didn't have someone like Carly around, so they saw the value in retaining competitive advantages; or the computing world would be a pretty bleak place right now..
For anyone who doesn't know but is curious, "Carly" refers to Hewlett Packard (HP) CEO Carly Fiorina. Fiorina oversaw HP's acquisition of Compaq, which had previously acquired DEC, IIRC. Reportedly, the HP-Compaq acquisition was opposed by many, including board members and family. (This was before my time, but I read a lot of trade rags as a kid, and then later occasionally heard insider stories.)
HP was legendary for culture, like "management by walking around, and talking to the people on the ground", which was different from Fiorina's style.
Compaq was the most noteworthy IBM-compatible PC company, before Dell's dorm room dirt-cheap generic PC clones business skyrocketed into an empire.
DEC was the maker of the PDPs and VAX-based minicomputers on which much of the field of Computer Science was arguably developed, and later MIPS- and then Alpha-based workstations and servers, while also still developing VAXen (the plural form of the word).
All those proprietary CPU ISAs listed (PA-RISC, SPARC, Alpha, POWER), when they were introduced on engineering/graphics workstation computers, were especially exciting, because -- separate from the technical architecture itself -- they would briefly probably be the fastest workstation in your shop. All of these made MS-DOS/Windows PCs and Macs look like toys by comparison (though, eventually, Windows NT 3.51 started to be semi-credible if you just needed to run a single big-ticket application program). And you didn't know what exciting new development would be next.
Maybe it was like if, today, several makers of top-end gaming GPUs resulted in a leapfrogging on a cadence of every few/several months. And if they had different strengths, and, incidentally, curious exclusive game software to explore. Or like the very recent succession of Stable Diffusion, ChatGPT, etc., and wondering what the next big wow will be, what they've done with it, and what you can do with it.
When I knew some Linux developers working on Itanium, some were already calling it "Itanic". (I didn't read much into the name at the time, because there were a lot of joke derogatory names for brands and technologies.) Later, I thought "Itanic" was because it was a huge expensive thing that was doomed to sink. The theory in the TFA sounds like most competing ship companies gave up on their own engine designs when they heard how great the Titanic would be.
The HP-Intel joint effort to develop what became the Itanium was announced 5 years before Fiorina became the CEO of HP. During that time she was working at AT&T/Lucent and had zero input into HP's strategy.
The comparison to Star Wars is certainly apt. It doesn't need to work, in the engineering sense, to be useful. Sometimes economics can trump engineering.
An important part of the PR machinery was that by picking up a hot topic from academia, they got absolutely everyone to talk about VLIW as the next generation of RISC. And everyone already knew that RISC was superior and x86 was a toy, which was also mostly true at the time.
In the end, what won was huge caches and huge OoO pipelines. Linus Torvalds had some strong and well-known opinions on this, which turned out to be mostly right.
It was a humiliating failure, not a success. The trend against those other architectures was already clear: as processor complexity went up, the costs of building them skyrocketed and most of those companies had no plan for the kind of volume you’d need to support them. Don’t forget that applied on both the hardware and software sides: a competitive compiler and optimized libraries were important.
This is why Itanium got traction: everyone knew that you needed volume to stay in the game. IBM had a strategy to get that with Apple & Motorola (PowerPC started in 1991), but HP did not have anything like that for PA-RISC. DEC might have gotten there if they’d had a more aggressive partner for the lower-end Alpha strategy but the merger killed any chance of that.
Since x86 was rising so fast, it might not be clear why Intel got involved. That goes back to the licensing rights: they couldn’t prevent companies like AMD from competing directly with them. Itanium was the attempt to close off that line of competition legally and they were willing to attack their own product margins to do it.
Doesn't quite add up. SGI folded their CPUs (thanks Mr Belluzzo) before Itanium was even released, Sun offered their x86 server about the same time (and kept their SPARCs).
HP was the only casualty to Itanium, but that was self-inflicted.
Yes, but the CPU design pipeline is years long. So years earlier, the Alpha, Sparc, MIPS, PA-RISC and related teams had to decide whether the R&D for a next-generation CPU made sense in the face of the announced Itanium, which most believed was going to dominate the server industry.
When the Itanium shipped years late and slower than expected it was too late for any of the competition (except Power) to recover. Granted the x86-64s were ramping up and they would have all had tough competition, even without Itanium.
To be completely honest, MIPS CPU's had never been known for their speed – not until the release of the 64-bit MIPS architecture anyway when they finally became competitive with other RISC CPU's, but it was a complete ISA redesign.
Sun was in a somewhat similar boat with the SPARC v8 architecture, and they were rather late with UltraSPARC (SPARC v9 ISA). Yet, they managed to hold out longer due to having a switched memory controller and a very wide memory bus, which allowed them to become the best hardware appliance to run the Oracle database (despite being less performant), and divert the cash flow into the UltraSPARC development. UltraSPARC I was underwhelming, and with UltraSPARC II they finally caught up with other RISC vendors and gradually started outperforming some (e.g. MIPS) in some areas.
Amusingly, the 512-bit wide memory bus has made a comeback in Apple M1 Max laptops (laptops!), and M1 Ultra has a 1024-bit wide memory bus.
Why HP went all in with ditching their own perfectly fine PA-RISC 2.0 architecture is an enigma to me tho.
So long as one puts big, fat "giga-money-losing" and "humiliating" disclaimers on "success", then yes.
Vs. - what if, instead of Itanium, Intel had more-quietly designed and delivered good, high-performance x86-64 CPU's? I'm thinking that, by bottom-line metrics, would have been a vastly more successful business strategy.
Of course the whole story is way more complex, but I have been saying for years that the Itanium might have succeeded if AMD had not extended x86 to 64 bits. The 64-bit extensions not only fixed some of the x86 problems (increased register count, pushed 64-bit double floats) but made x86 a choice for the more serious compute platforms and servers.
Back then, the whole professional world had switched to 64 bits, both from a performance and memory size perspective. That is why the dotcom era basically ran on Sparc Suns. The Itanium was way late, but it still was Intel's only offering in the domain. Until x86-64 came and very quickly entered the professional compute centers. The performance race in the consumer space then sealed the deal by providing faster CPUs than the classic RISC processors of the time, including Itanium.
It is a bit sad to see it go, I wonder how well the architecture would have performed in modern processes. After all, an iPhone has a much larger transistor count than those "large" and "hot" Itaniums.
The industry was already moving away from the big 64-bit SMP machines made by Sun, SGI & IBM. In many cases a cluster of 32-bit x86 machines made more sense than one expensive big machine with high-priced support contracts and parts. 32-bit x86 machines already supported more than 4GB total memory with PAE; it was just that one process couldn't use more than 4GB. Other 64-bit chips were already well established (SPARC, POWER, MIPS), though most of their users probably couldn't easily move to a new CPU architecture. For other users, by the time they needed the bigger machines, x86 64-bit was already available, including from Intel themselves. AMD was limited to 8 sockets from what I remember, so there was still a small market for big Itanium systems (like SGI's Altix).
At the time there were computers containing Alpha chips with quite a PC-ish design. I nearly bought one, so they were semi-affordable. They ran Linux well. So it seems a bit more likely that these might have succeeded if AMD hadn't extended x86.
What you also have to remember is that Itanic was a very weird architecture. It's hard to write compilers for it, and it made the cardinal error of baking microarchitectural decisions into the ISA.
Yeah, by mid-2000s most RISC workstations (Alpha, Sun, SGI, PowerPC tho not IBM POWER and certainly not the top of the range datacentre class servers) had converged on the mainstream PC architecture, e.g. a PCI bus, EIDE disk controllers, standard PC memory (168-pin DIMM's and such). I had a Sun Ultra-10 as the sole «PC» at home for some years running Solaris and later Linux. After quickly getting fed up with the standard Sun keyboard, I bought a generic, no-name USB PCI card, plugged it into my Ultra-10 and connected a Microsoft Natural keyboard. It just worked, with Solaris not even requiring extra drivers or kernel modules by virtue of being a USB keyboard.
In that case, another more proven RISC architecture like Alpha would have replaced x86. At that time, the only reason why x86 was still "competitive" was the enormous amount of x86 software (specifically Windows software). If Microsoft would have to switch to another ISA anyway, there wasn't really a reason to bet on something as risky as Itanium.
That came after AMD64. It was Intel trying to prevent brand damage by implying their AMD-compatible CPUs were not knock-offs of a competitor that had plagued them with knock-offs.
Actually, it was originally called x86-64. The AMD64 branding came around about the time the chips were actually released.
Back around 2002 or 2003, I sent away to AMD for a set of reference manuals, got back a nice five-volume set which said "x86-64" everywhere and a little note in the box saying "where it says x86-64, read AMD64"
Does anyone remember that when the first Intel processors with AMD64 support came out, Intel called it "IA32e" to downplay its importance?
To differentiate it from the
IA32=x86 (32bit)
IA64=Itanium
AMD was more like the last straw that broke the camel's back.
Itanium was already in trouble. It was hot (really hot) and underperforming. It wasn't selling really well, since it was too expensive. Of the UNIX vendors, only HP was left standing behind Itanium. IBM had already long pulled out of Itanium and also out of Monterey, the UNIX that would unify all unixes.
AMD64 was the light that suddenly came shining and everybody knew that was where everybody was going.
Itanium is something that sounds good in theory but didn't quite deliver in practice. It is my impression that this is how things usually turn out when you do top-down design of a new product. Maybe my impression is wrong and it has nothing to do with top-down design and it's just that the majority of products fail. Maybe 99 out of 100 fail and you have to launch 100 to have a statistical chance to get 1 winner. Intel didn't launch 100 different processor designs. They surely have a dozen designs that got killed early but then they focused most of their energy on one design and that one failed. Not enough good luck probably.
I'm sure you're right in general - but I would say that a counterexample, or the 1 in 100 top-down design of a new product that did succeed, is the iPhone.
If I remember well what I learned in college, the organization of x86 processors is superscalar, which means that the processor uses an internal statistics mechanism to predict the code that's going to be executed next. Itanium, on the other hand, used the VLIW architecture, so that the instructions are optimized for a long vector of execution already at compilation time.
I always found the idea behind the VLIW processor architecture to be a quite good one to be honest, but I read many engineers in many places saying that it's a bad one and it was doomed from the beginning.
The article says that the death of Itanium is mostly due to the disinvestment in IA-64 caused by the threat of AMD overtaking the x86 market. Even though the competition for the x86 market probably benefited all of us, I still find it a bit sad that we lost this architectural diversity, and sometimes I wonder how well the Itanium would perform today if it hadn't been killed.
Superscalar just means the processor can execute more than one instruction at a time (in parallel, not just pipelined). The main distinction between VLIW and mainstream x86 is in how the instruction execution is scheduled: x86 cores track dependencies between instructions, and most support reordering instructions to deal with stalls from things like cache misses or the next instruction being of the wrong type for any of the free execution ports. VLIW relies on the compiler to schedule instructions, so that the hardware does not need to do dependency tracking.
And the problem is that such a thing doesn't really work on mainstream CPUs, because memory access instructions take a highly variable amount of cycles depending on which cache level the data is in, which the compiler cannot know, and generating code for all possibilities leads to exponential blowup of the source code size.
It's not clear how Intel thought it could possibly work.
> ...I read many engineers in many places saying that it's a bad [idea]...
I read an article several years ago from an engineer at...ooh, I think it was DEC or IBM. He said that during development of the Itanium, the Intel guys had talked to them and they advised most strongly that Intel drop the project, because they had been down that road and thought it was a dead end.
VLIW is considered Multiple Instruction Multiple Data (MIMD): in each line of assembly you can issue something like 4 (or 8) instructions, each with a different target, and it will work as long as there aren't dependency issues.
GPUs are still Single Instruction Multiple Data (SIMD): for every vector operation (adding vectors, taking a dot product) you are only executing a single op at a time.
SIMD is really close to the RISC/CISC paradigm, and there are various extensions for other types of SIMD processing in the ISAs used today. VLIW is a much different set of assumptions, requiring the compiler to encode the same instruction-level parallelism that a superscalar chip extracts via its architectural features (pipelines/branch prediction/et cetera).
Can GPUs really be considered VLIW just because they might share an instruction pointer across a couple of 'micro cores'? AFAIK they even went back from SIMD to scalar a long time ago.
.. and then Nvidia brought their GPUs to the clusters and ate both Intel’s and AMDs lunch with a better version of VLIW. The CUDA programming model and hardware IMO is quite successful at abstracting vectors the right way[TM], separating the problem of the grid setup from that of the kernel code (which you can still largely program in a scalar way if you’re not after the last ~30% of performance). IMO a shame that OpenCL never got good.
I have an Itanium on a shelf somewhere-- while I was using it I got to do some assembly-level debugging to track down and report a GCC bug I hit. Well worth the $50 or whatever I paid for it in entertainment.
Nice story, but it doesn't fully add up:
"When Intel started Itanium development work in the mid 1990s"
That's more than a decade after AMD released their first x86-compatible CPU. Intel was very aware of the threat before they started work on Itanium. They even tried to hold back AMD with a lawsuit, which they lost, ultimately allowing AMD to release their 80386-compatible CPU.
I find it more likely that it failed for reasons similar to why the iAPX 432 and i860 failed: there just wasn't a market.
Why would AMD do that today though? They are in a privileged position to have an x86 license and zillions of customers asking for an x86. AMD once made some ARM chips and realised they didn’t need to.
Agreed. The only reason I can think of for AMD to set up an alternative instruction set would be a sudden rise in competition from other players. If big advances are made in the RISC-V space that make the architecture a cost effective alternative to amd64 (including the cost of porting software to RISC-V) then I can see them setting themselves up to allow running software on both platforms at native speeds.
I don't think AMD is currently limited by their instruction set. Even if they are, there may be an argument to move to ARM instead of RISC-V to take advantage of the software already ported because of Apple's transition and the Graviton chips. Windows already runs on ARM but hasn't been announced to run on RISC-V, after all.
Does the core-area math check out on that? Is x86 instruction decoding a small enough part of today's cores that a chip including both a RISC-V decoder and an x86 one wouldn't pay a big penalty?
It's also not clear that would be a gain for AMD. x86 has a lot of lock-in and AMD is one of two viable suppliers of it. Helping along a RISC-V transition would create opportunities for attackers that don't need to license x86. AMD doesn't seem to have a reason for that right now. But maybe that's an Innovator's Dilemma kind of situation and they should be cannibalizing the present to set up their future.
Perhaps there was a belief back then that compilers could easily be optimised or uplifted to generate fast and efficient code, and that did not turn out to be the case. Project management and mismanagement certainly did not help either.
I wonder how a VLIW architecture would pan out today given the advances in compilers over the last three decades, and whether an ML-assisted VLIW backend could deliver on the old dream.
The big problem is SMT, since it's hard to share a VLIW core between processes, while a superscalar core shares really well.
I'm still unconvinced this is fundamental. It certainly was flawed back then, but compiler theory has improved a LOT since; we have things like polyhedral optimization that we didn't have access to... You could probably optimize delay-line technology that way.
If you don't know whether some value will hit in the L1 or the L3 cache there's wild variance on how long it'll take so you have to do something else in the meantime. On x64, that's the pipeline and speculation. On a GPU, you swap to another fibre/warp until the memory op finished.
Fundamentally the hardware knows how long the memory access took, the software can only guess how long it will take. That kills effective ahead of time scheduling on most architectures.
Perhaps, as part of the trend towards increasingly heterogeneous architectures, we'll see big VLIW coprocessors for power efficiency in certain serial workloads (GPUs are, in a sense, massively parallel collections of little VLIW-ish coprocessors; however, they do have dynamic scheduling).
Which would be fine if there was a time machine available to send back what we know now to the past. But at the time having sucky compilers for what you want to (try to) accomplish was a bad idea.
The same decision now may be good, but then it was a mistake. It'd be like trying to have a Moon landing project in the 1920s: we got there eventually, but certain things are only possible when the time 'is right'.
The thing is, on regular, array-based codes, it's great. And it was then, too -- the polyhedral approach was all being developed at exactly that time, but maybe it's not clear in hindsight because the terminology hadn't settled yet. Ancourt and Irigoin's "Scanning Polyhedra With Do Loops" was published in 1991. Lots of the unification of loop optimization was labeled affine optimization or described as based on Presburger arithmetic. But that is the technology that they were depending on to make it work.
But most code we run is not Fortran in another guise. The dependences aren't known at compile time. That issue hasn't changed much.
The one change now is that workloads that were called "scientific computing" then are now being run for machine learning. But now it doesn't make sense to run regular, FP-intensive codes on a CPU at all, because of GPUs and ML accelerators. So what's left for CPUs that excel on that workload? I'm not sure there is a niche there.
OoO execution does an end run around this by examining the program as it runs and adapting to the current reality rather than a simulated one. This is the same reason a JIT can do optimizations that a compiler cannot. The ability to look 500-700 instructions into the future and bundle them together into a kind of VLIW dynamically is a very powerful feature.
As to compiler theory, it really isn't that advanced. Our "cutting edge" compilers are doing glorified find-and-replace for their final optimizations (peephole optimization).
Look at SIMD and auto-vectorization. There are so many questions the compiler can't prove the answer to that even trivial vectorization, which any programmer would identify at a glance, can't be applied by the compiler, to the point where the entire area of research has produced basically zero improvement in real-world code.
The current way, while arguably pretty wasteful given all the micro-optimizing the CPU does on the incoming instruction stream, allows designers to expand the hardware nearly freely to meet needs without requiring compilers to produce different code.
Also the Alpha was the utter opposite of design by committee. A quick skim of the alpha ISA would show you that.
It was trying to optimize for instruction-level parallelism when power efficiency and thread level parallelism were coming into vogue. Arguably companies like Sun overoptimized for the latter too soon but it was the direction things were going.
A senior exec at Intel told me at the time the focus on frequency in the case of Netburst was driven by Microsoft being uncomfortable with highly multi-core designs--and I have no reason to doubt that was one of the drivers. There was a lot of discussion around the challenges of parallelism, especially on the desktop, at the time. It generally wasn't the problem the hand wringing suggested it would be.
Could this have worked better with JIT-compiled applications, e.g. Java given a sufficiently clever JVM, where assumptions can be dynamically adjusted at runtime?
(Edit: As opposed to an AOT compiler.)
The other point in favor of this approach now is that far more code is using high-level libraries. Back then there was still the assumption that distributing packages was hard and open source was distrusted in many organizations so you had many codebases with hand rolled or obsolete snapshots of things we’d get from a package manager now. It’s hard to imagine that wouldn’t make a difference now if Intel was smart enough to contribute optimizations upstream.
I don't recall any issues with the ISA, and we really liked using its early container capabilities (HP Vault).
I dunno about that.
My own personal opinion is that Intel has never been able to re-architect their way out of the fact that the cornerstone of their success is that they are selling x86, and their customers mostly don't care about the theoretical advantages of the bright shiny new thing. They just want to run their software just like they always have. There's a reason why IA-64 has joined iAPX432, i860, StrongARM and i960 as Intel footnotes (outside of the embedded market).
When they were philosophizing about what Itanic should look like, the only thing that x86 obviously needed from a market perspective was a bigger address space. And AMD was smart enough to deliver on that, and here we are.
Intel kept the x86 price at a point where no bean counter would favor investing in new architectures. Fortunately AMD broke the headlock on x86.
So I don't really buy the "Itanium is bad architecture" story.
Its fate was probably decided around 1999-2000. At that point Itanium still looked pretty good against the Pentium 3 and Pentium 4. And the name "IA-64" indicates Intel didn't plan to make 64-bit Pentiums. So eventually Pentiums would fill the low-end segment while the rest would be occupied by 64-bit Itaniums.
AMD killed that plan by releasing the AMD64 architecture. It was an obvious upgrade to x86, so it would clearly do better in the market than IA-64. So Intel decided to go for x86-64 too, and Itanium was doomed at that point. They didn't even bother making Itaniums with the same clock speed as Xeons.
So it's definitely possible that if AMD had decided to stick with 32 bits at that time, Intel would have pushed an optimized IA-64. Also, AMD64 could have been worse than it is. E.g. if they had decided to increase only the register size but keep the number of registers the same, IA-64 could still have come out on top.
One of the best aspects of the Opteron was that it also happened to be a fantastic 32-bit CPU in addition to AMD64. This was a period when a lot of software, even FOSS, wasn't 64-bit clean. There was a lot of pointer arithmetic hiding deep in libraries that assumed pointers would always be 32 bits.
The Opteron running a 32-bit OS at least as well as a 32-bit Athlon was a huge point in its favor. So your existing system running on new Opteron hardware ran fine and you could mix and match Xeon and Opterons in a fleet. Then switch over to 64-bit on the Opterons for (hopefully) better performance.
The other big problem was that the x86 compatibility story was worse than the earlier hopes. That meant that it not only wasn’t competitive with the current generation competition but often even the previous or worse - note losing to the original Pentium or even a 486 here:
https://tweakers.net/reviews/204/8/intel-itanium-sneak-previ...
Now, they could have improved that but statistically nobody was going to pay considerably more for lower performance in the hopes that a future update would improve matters.
The Athlon and Opteron weren’t just fast, they also had flawless 32-bit support so even if your 64-bit software update never happened you could justify the purchase based on their price/performance.
I love monuments to hubris so I too have one in my garage-- a four-node SGI Altix 350.
Many of us computer nerds watched in awe as vendor after vendor dropped their hugely expensive, engineering-heavy custom CPU architectures and lined up behind Intel. IBM was the only big player who didn't swallow the bait. "Even if they fail, that's a huge success" was a common observation at the time.
And sure enough. I don't think they failed on purpose, but business wise it was a win-win situation. The x86 architecture would have won anyway, because of the sheer scale, but the Itanium wreckage hastened it. Everyone needed to move, so why not move to x86/Linux directly?
> In some ways Itanium was the most successful bluff ever played in the tech industry. In much the same way that Reagan's Star Wars bankrupted the Soviet Union, it got almost every single competitor to fold. Back at the beginning of the project, Intel was nowhere in high-end & 64-bit computing. There was HP (PA-RISC), Sun (SPARC), DEC (Alpha), IBM (Power), MIPS (SGI). Intel wisely picked the partner with the stupidest management (Carly) to give up their competitive edge and announce to analysts that Intel's vision/roadmap was so awesome that RISC was dead and that they would follow the bidding of their master Intel for their 64-bit plan. Wall Street bought into the story so much that almost everyone else with competitive chips folded their strong hands to Itanium's bluff - SGI spun off MIPS and MIPS decided to leave the high-end space. Compaq undervalued Alpha and let it die. Sun tried to become a software company, and if it weren't for Fujitsu making modern SPARCs, SPARC would be dead.
> Basically, with nothing but PR and Carly's stupidity, Intel wiped out over half of the high-end computing processor market.
> Thankfully AMD had the vision to see through the bluff, and saw the opportunity for 64-bit computing that worked; and thankfully IBM didn't have someone like Carly around, so they saw the value in retaining competitive advantages; or the computing world would be a pretty bleak place right now..
HP was legendary for culture, like "management by walking around, and talking to the people on the ground", which was different from Fiorina's style.
Compaq was the most noteworthy IBM-compatible PC company, before Dell's dorm room dirt-cheap generic PC clones business skyrocketed into an empire.
DEC was the maker of the PDPs and VAX-based minicomputers on which much of the field of Computer Science was arguably developed, and later MIPS- and then Alpha-based workstations and servers, while also still developing VAXen (the plural form of the word).
All those proprietary CPU ISAs listed (PA-RISC, SPARC, Alpha, POWER), when they were introduced on engineering/graphics workstation computers, were especially exciting, because -- separate from the technical architecture itself -- they would briefly probably be the fastest workstation in your shop. All of these made MS-DOS/Windows PCs and Macs look like toys by comparison (though, eventually, Windows NT 3.51 started to be semi-credible if you just needed to run a single big-ticket application program). And you didn't know what exciting new development would be next.
Maybe it was like if, today, several makers of top-end gaming GPUs resulted in a leapfrogging on a cadence of every few/several months. And if they had different strengths, and, incidentally, curious exclusive game software to explore. Or like the very recent succession of Stable Diffusion, ChatGPT, etc., and wondering what the next big wow will be, what they've done with it, and what you can do with it.
When I knew some Linux developers working on Itanium, some were already calling it "Itanic". (I didn't read much into the name at the time, because there were a lot of derogatory joke names for brands and technologies.) Later, I thought "Itanic" was because it was a huge expensive thing that was doomed to sink. The theory in TFA sounds like most competing ship companies gave up on their own engine designs when they heard how great the Titanic would be.
An important part of the PR machinery was that by picking up a hot topic from academia, they got absolutely everyone to talk about VLIW as the next-generation RISC. And everyone already knew that RISC was superior and x86 was a toy, which was also mostly true at the time.
In the end, what won was huge caches and huge OoO pipelines. Linus Torvalds had some strong and well-known opinions on this, which turned out to be mostly right.
This is why Itanium got traction: everyone knew that you needed volume to stay in the game. IBM had a strategy to get that with Apple & Motorola (PowerPC started in 1991), but HP did not have anything like that for PA-RISC. DEC might have gotten there if they’d had a more aggressive partner for the lower-end Alpha strategy but the merger killed any chance of that.
Since x86 was rising so fast, it might not be clear why Intel got involved. That goes back to the licensing rights: they couldn’t prevent companies like AMD from competing directly with them. Itanium was the attempt to close off that line of competition legally and they were willing to attack their own product margins to do it.
HP was the only casualty of Itanium, but that was self-inflicted.
When the Itanium shipped years late and slower than expected it was too late for any of the competition (except Power) to recover. Granted the x86-64s were ramping up and they would have all had tough competition, even without Itanium.
Sun was in a somewhat similar boat with the SPARC v8 architecture, and they were rather late with UltraSPARC (SPARC v9 ISA). Yet, they managed to hold out longer due to having a switched memory controller and a very wide memory bus, which allowed them to become the best hardware appliance to run the Oracle database (despite being less performant), and divert the cash flow into the UltraSPARC development. UltraSPARC I was underwhelming, and with UltraSPARC II they finally caught up with other RISC vendors and gradually started outperforming some (e.g. MIPS) in some areas.
Amusingly, the 512-bit wide memory bus has made a comeback in Apple M1 Max laptops (laptops!), and M1 Ultra has a 1024-bit wide memory bus.
Why HP went all-in on ditching their own perfectly fine PA-RISC 2.0 architecture is an enigma to me, though.
So long as one puts big, fat "giga-money-losing" and "humiliating" disclaimers on "success", then yes.
Vs. - what if, instead of Itanium, Intel had more-quietly designed and delivered good, high-performance x86-64 CPU's? I'm thinking that, by bottom-line metrics, would have been a vastly more successful business strategy.
Meanwhile, ARM was designing little low-power RISC toys that were obviously no danger to Intel at all.
Back then, the whole professional world had switched to 64 bits, both for performance and for memory-size reasons. That is why the dotcom era basically ran on SPARC Suns. The Itanium was way late, but it was still Intel's only offering in that domain, until x86-64 came along and very quickly entered the professional compute centers. The performance race in the consumer space then sealed the deal by providing faster CPUs than the classic RISC processors of the time, including Itanium.
It is a bit sad to see it go, I wonder how well the architecture would have performed in modern processes. After all, an iPhone has a much larger transistor count than those "large" and "hot" Itaniums.
What you also have to remember is that Itanic was a very weird architecture. It's hard to write compilers for it, and it made the cardinal error of baking microarchitectural decisions into the ISA.
Had that not been the case, Intel alongside its OS partners would have managed to push it through no matter what.
That came after AMD64. It was Intel trying to prevent brand damage by saying their AMD compatible cpus were not knock-offs of a competitor that had plagued them with knock-offs.
Back around 2002 or 2003, I sent away to AMD for a set of reference manuals, got back a nice five-volume set which said "x86-64" everywhere and a little note in the box saying "where it says x86-64, read AMD64"
Itanium was already in trouble. It was hot (really hot) and underperforming. It wasn't selling really well, since it was too expensive. Of the UNIX vendors, only HP was left standing behind Itanium. IBM had already long pulled out of Itanium and also out of Monterey, the UNIX that would unify all unixes.
AMD64 was the light that suddenly came shining and everybody knew that was where everybody was going.
Even the M1 could be argued to be close, it’s a very wide machine.
https://en.wikipedia.org/wiki/Itanium#/media/File:Itanium_Sa...
In the modern landscape, it'd be closer to a v86 mode, where the hypervisor is RISC-V, but user x86 applications can run full speed.
All the legacy PC platform crap could be done away with, replaced by the RISC-V standard OS-A profile.