With all the discussion about what the “big trick” is that makes the M1 seem to be such a breakthrough, I can’t help but wonder if the M1 is more like the iPhone: the sum of a large number of small engineering improvements, coupled with a lot of component integration detail work, topped off by some very shrewd supply chain arrangements.
Analogous to the iPod foreshadowing the iPhone without most experts believing Apple could make a mobile phone from it, the M1 was foreshadowed by the A-series chips for mobile devices, with many (most?) experts not forecasting how much they could become the base for laptops and desktops.
It seems the M1 includes numerous small engineering advances, and the near-term lockup of the top-of-the-line fab in the supply chain also reminds me of how Apple had secured exclusivity for some key leading-edge iPhone parts (was it the screens?).
So the M1 strikes me as the result of something that Apple has the ability to pull off from time to time.
And that is rather hard to pull off financially, organizationally and culturally. And it more than makes up for some pretty spectacular tactical mis-steps (I’m looking at you, puck mouse, cube mac, butterfly keyboard).
> The sum of a large number of small engineering improvements, coupled with a lot of component integration detail work, topped off by some very shrewd supply chain arrangements.
I think the vertical integration they have is a major advantage too.
I used to work at Arm on CPUs. One thing I worked on was memory prefetching, which is critical to performance. When designing a prefetcher you can do a better job if you have some understanding of, or guarantees about, the behaviour of the wider memory system (better yet if you can add prefetching-specific functionality to it). The issue I faced is that the partners (Samsung, Qualcomm etc.) are the ones implementing the SoC and hence controlling the wider memory system. They don't give you detailed specs of how that works, nor is there a method for discussing with them appropriate ways to build things that enable better prefetching performance.

You end up building something that's hopefully adaptable to multiple scenarios, and no one ever gets a chance to do some decent end-to-end performance tuning. I'm working with a model of what the memory system might be, while Qualcomm/Samsung etc. engineers are working with the CPU as a black box, trying to tune their side of things to work better. Were we all under one roof, I suspect we could easily have got more out of it.
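To make the tuning problem concrete, here is a minimal sketch of a classic stride prefetcher (my own illustration, not Arm's design). The one magic number, PREFETCH_DISTANCE, only has a "right" value if you know the latency and queue depths of the memory system behind it, which is exactly the information a CPU IP vendor often never gets from the SoC integrator:

    // Toy stride prefetcher model (illustrative only, not a real design).
    // The key tunable is PREFETCH_DISTANCE: how far ahead of the demand
    // stream to fetch. Picking it well requires knowing the memory
    // system's latency and queue depths.
    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    constexpr int64_t CACHE_LINE = 64;
    constexpr int64_t PREFETCH_DISTANCE = 4;  // lines ahead; a guess without SoC specs

    struct StridePrefetcher {
        // Per-PC table: last address seen and the stride between the last two accesses.
        struct Entry { int64_t last_addr = 0; int64_t stride = 0; };
        std::unordered_map<uint64_t, Entry> table;

        // Called on every demand load; returns an address to prefetch, or -1.
        int64_t on_load(uint64_t pc, int64_t addr) {
            Entry &e = table[pc];
            int64_t new_stride = addr - e.last_addr;
            int64_t prefetch = -1;
            if (new_stride == e.stride && new_stride != 0) {
                // Stride confirmed: fetch PREFETCH_DISTANCE strides ahead.
                prefetch = addr + PREFETCH_DISTANCE * new_stride;
            }
            e.stride = new_stride;
            e.last_addr = addr;
            return prefetch;
        }
    };

    int main() {
        StridePrefetcher pf;
        // A load at PC 0x400 streaming through an array one cache line at a time.
        for (int i = 0; i < 8; ++i) {
            int64_t addr = 0x10000 + i * CACHE_LINE;
            int64_t p = pf.on_load(0x400, addr);
            if (p >= 0) std::cout << "load " << std::hex << addr
                                  << " -> prefetch " << p << std::dec << "\n";
        }
    }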
You also get requirements based on targets to hit for some specific IP, rather than requirements around the final product, e.g. silicon area. Generally Arm will be keen to keep area increases low or improve the performance/area ratio without any huge shocks to overall area. If you're Apple, you just care about the final end-user experience and the potential profit margin. You can run the numbers and realise you can go big on silicon area and get where you want to be. With a multi-company/vendor chain, each link is trying to optimise for some number it controls, even if that has a negative impact on the final product overall.
Very interesting comment. I mean you see some of the same things with companies like Tesla also pushing vertical integration.
A lot of the examples you see are similar to what you talk about. You can cut down on the friction between different parts.
I remember an example of software controlling a blinking icon on the dashboard, where this was a 10 minute code change for Tesla but a 2-3 month update cycle for a traditional automaker due to the dashboard hardware coming from a supplier.
If we're comparing the M1 to x86, though, then all the prefetching and other memory shenanigans are on the CPU die. The A-series had an advantage over the SoCs used in Android phones here, but the M1 doesn't have an advantage over Intel and AMD CPUs.
> the partners (Samsung, Qualcomm etc) are the ones implementing the SoC and hence controlling the wider memory system.
And I assume the partners also do some things differently, for at least somewhat legitimate reasons, and no one ARM design can be optimal for everyone.
You use the word partner with the proper noun Qualcomm but there are no quotes. Qualcomm's only focus is to make money while delivering the worst experience in every direction. They are often stuck in local maximums and they are too big to just flow around.
Apple has used exclusive access to advanced hardware as a differentiator several times. With screens it was Retina. They funded the development and actually owned the manufacturing equipment and leased it to the manufacturing subcontractors.
Also in 2008 they secured exclusive access to a then new laser cutting technology that they used to etch detail cuts in the unibody enclosures of their MacBooks, and then iPads. This enables them to mill and then precision cut the device bodies out of single blocks of Aluminium.
They’ve also frequently bought small companies to secure exclusive use of their specialist tech, like Anobit for their flash memory controllers, Primesense for the face mapping tech in FaceID, and there are many more. For Apple simply having the best isn’t enough, they want to be the only people with the best.
Retina is a very interesting example of how Apple works. They identified the necessary resolution (200+ ppi) for this technology and worked towards it across their whole product range. The technology isn't exclusive to Apple, but they are the only company that pushes it, even if it sometimes means quite odd display resolutions.
Other manufacturers seem to be completely oblivious to it. They still equip their laptops with either full-HD or 4K screens. The resulting ppi are all over the place: sometimes way too low (bad quality), sometimes way too high (4K in a 13" laptop halves the battery runtime). Same with standalone screens: there is a good selection around 100 ppi, but for "high res" the manufacturers just offer 4K in whatever size, so once again the ppi are all over the place.
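For reference, pixel density is just sqrt(width² + height²) divided by the diagonal in inches. A quick sketch (panel sizes are nominal, numbers purely illustrative):

    // PPI = sqrt(width_px^2 + height_px^2) / diagonal_inches.
    #include <cmath>
    #include <cstdio>

    double ppi(double w, double h, double diag) {
        return std::sqrt(w * w + h * h) / diag;
    }

    int main() {
        std::printf("13.3\" 2560x1600 (MacBook Pro): %.0f ppi\n", ppi(2560, 1600, 13.3)); // ~227
        std::printf("13.3\" 1920x1080 (typical FHD): %.0f ppi\n", ppi(1920, 1080, 13.3)); // ~166
        std::printf("13.3\" 3840x2160 (typical 4K):  %.0f ppi\n", ppi(3840, 2160, 13.3)); // ~331
        std::printf("27\"   2560x1440 (QHD desktop): %.0f ppi\n", ppi(2560, 1440, 27.0)); // ~109
    }

Which is roughly why 2560x1600 at 13.3" lands near the 200+ ppi target, while FHD falls short and 4K overshoots at that size.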
I believe this is the only consumer 5nm chip currently available as well. Ryzen gen 3 is still on 7nm. I'd be interested to see how well general purpose compute on the m1 vs ryzen gen 3 mobile will be.
> The sum of a large number of small engineering improvements, coupled with a lot of component integration detail work, topped off by some very shrewd supply chain arrangements.
I think you precisely have it. There is no single magic reason the M1 is so good, just a lot of things coming together. They start with a better instruction set than x86, have of course the best process available, and, perhaps the largest part, they have built up an incredible team over a decade. And they are extremely focused in what they target. If anything, that is Apple's "magic". They are not making a chip built in an abstract manner to be sold to random customers. They have exactly their own needs in mind and execute towards those.
In a sense AMD did that with the chips for the PlayStation/Xbox. Like the M1, each is basically an SoC, there optimized for great graphics performance. Unfortunately, those chips are not sold separately for building your own PC.
> So the M1 strikes me as the result of something that Apple has the ability to pull off from time to time.
Perhaps you haven't been paying attention?
Apple shipped 64-bit ARM processors for the iPhone at least a year before Qualcomm could do it for Android devices. The reaction to the A7 was similar to what we're seeing now with the M1—not possible, there's some trickery going on, etc.
Apple is pretty good at this processor transition thing, going from 68k to PowerPC to Intel to ARM.
> And it more than makes up for some pretty spectacular tactical mis-steps (I’m looking at you, puck mouse, cube mac, butterfly keyboard).
Except for the recent keyboard issues, you're literally talking about another era. I wouldn't put the shape of the mouse for the 1998 iMac in the same category as transitioning a $9 billion revenue product line to a radically different processor architecture.
> Apple shipped 64-bit ARM processors for the iPhone at least a year before Qualcomm could do it for Android devices. The reaction to the A7 was similar to what we're seeing now with the M1—not possible, there's some trickery going on, etc.
That is because it is not possible to ship a top-end design for a new ISA in that amount of time. The more reasonable answer is they had been working on a new core design for some years before. AMD has hinted that their Zen design makes it relatively easy to swap the x86 frontend for an ARM frontend.
Apple was considering buying MIPS around that time. I suspect they strong-armed ARM into accepting their ARMv8 proposal because it was good and because Apple buying MIPS would be disastrous for ARM's share price. At that point, the timeline wasn't impossibly fast; it was just a matter of designing the last part of the chip (or, if both frontends were being worked on in tandem, cancelling one of them and focusing everyone on the other).
This explains why ARM announced v8 and then took the full 4 years to ship their first low-power core (A53) and even longer to ship their bad first try at a high-performance core (A57 -- with the more baked A72 being superior in almost every way).
> the near term lockup of the top of the line fab in the supply chain also reminds me of how Apple had secured exclusivity for some key leading edge iPhone parts (was it the screens?).
Yes, Apple managed to lock up most of the global supply of capacitive touchscreens for about a year after the iPhone came out. The iPhone wasn't the first phone to use a capacitive touchscreen, but for a while, it seemed like it was because nobody else could produce devices with them in large volumes.
People used to dismiss Tim Cook as "just" a supply chain guy. But I think it's become clear that supply chain management is at least as important to Apple's success as anything on the design or marketing side.
In some ways it is the environment the M1 was born from that helped. CPUs in the mobile space focus on low power usage, which has seen many core software tasks get dedicated instructions. That is why in some tests the M1 utterly trounces the competition: it has dedicated hardware catering for the common niche things that software ends up doing, with hardware video encoding being one small example, but deep down there is more than that. Add to this advances in software/hardware integration and the ability to synergise the two at a level nobody else can. The way to think of it is: if Intel wrote an operating system from scratch, it would tap the CPU extremely well compared to others, because Intel knows the internals better and fully. Then add the ability to see where some dedicated hardware could replace certain software instruction combinations, and you start to see a tightly integrated team of CPU and operating system/software.
One direction I've always wished CPUs would take is a dedicated core or two for the OS, completely isolated from the other cores, which would be left for the software/applications you run. Now if those ran on two different architectures - darn, that would be the inner geek in me appeased.
What would your goal be? I think locking a modern, general-purpose OS to a small number of cores would artificially constrain performance, assuming a reasonable scheduler.
> The sum of a large number of small engineering improvements, coupled with a lot of component integration detail work,
Exactly. ARM has been progressing faster than Intel. For the past 8 years or so, Apple has had the fastest ARM CPU out there on the iPhone/ iPad. Apple has sucked up TSMC's 5nm production. They've integrated a pile of relevant coprocessors into the CPU and put fast RAM on the package. The SSD is lightning fast and SSD encryption is done via a dedicated coprocessor.
It's not one magic trick, it is countless bits of engineering, manufacturing, and purchase choices.
> by the iPod without most experts believing Apple could make a mobile phone
Except for all the people practically begging Apple to make a phone for years, except all the analysts who wrote essays on how computer companies could make successful phones, except for all the fanboys making fan-art of phones with that big circular wheel.
I don't buy it. I think there is in fact one "trick," which is shedding the X86 decode bottleneck.
People always make the point that the X86 decoder is only ~5% of the die. Sure, that's true, but keep two things in mind:
(1) While it's only 5% of the die, it runs constantly at full utilization. The ALU is also only a small percentage of the die (5-10%). How hot does your CPU get when you're running the ALU full blast? Now consider that there is a roughly ALU-sized piece always running full blast no matter what the CPU is doing, because X86 instructions are so complex to decode. Not only does this give X86 a higher power-use "floor," it also means there's always more heat being dissipated. This extra heat eats into thermal headroom and thus limits sustained clock speed unless you have really good cooling, which is why the super high performance X86 chips need beefy heatsinks or water cooling.
(2) It apparently takes exponentially more silicon to decode X86 instructions with parallelism beyond 4 instructions at once. This limits instruction level parallelism unless you're willing to add heat dissipation and power, which is a show stopper for phones and laptops and undesirable even for servers and desktops.
People make the point that ARM64 (and even RISC-V) are not really "RISC" in the classic "reduced" sense as they have a lot of instructions, but that's not really relevant. The complexity in X86 decoding does not come from the number of instructions or even the number of legacy modes and instructions but from the variable length of these instructions and the complexity of determining that length during pipelined decode.
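A toy way to see the serial dependency (the length function below is made up, nothing like a real x86 decoder, just to show the structure of the problem): with variable-length encoding you can't know where instruction N+1 starts until you've at least length-decoded instruction N, whereas with fixed 4-byte instructions every boundary in a fetch block is known up front.

    // Toy illustration of the boundary-finding problem.
    // Real x86 lengths depend on prefixes, opcode, ModRM, SIB and
    // immediates and can be 1..15 bytes; here we just fake it.
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    size_t fake_x86_length(uint8_t first_byte) {
        return 1 + (first_byte % 7);   // stand-in for the real rules
    }

    int main() {
        std::vector<uint8_t> bytes(64);
        for (size_t i = 0; i < bytes.size(); ++i) bytes[i] = uint8_t(i * 37);

        // Variable length: each boundary depends on the previous one (a serial chain),
        // unless you throw hardware at speculating every possible start byte.
        size_t pos = 0;
        std::cout << "variable-length starts:";
        for (int n = 0; n < 8 && pos < bytes.size(); ++n) {
            std::cout << ' ' << pos;
            pos += fake_x86_length(bytes[pos]);  // must finish before the next start is known
        }

        // Fixed length: all 8 starts are independent and can be decoded in parallel.
        std::cout << "\nfixed-length starts:   ";
        for (int n = 0; n < 8; ++n) std::cout << ' ' << n * 4;
        std::cout << '\n';
    }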
M1 leverages the ARM64 instruction set's relative decode simplicity to do 8X parallel decode and keep a really deep reorder buffer full, permitting a lot of reordering and instruction level parallelism for a very low cost in power and complexity. That's a huge win. Moreover there is nothing stopping them from going to 12X, 16X, 24X, and so on if it's profitable to do so.
The second big win is probably weaker memory ordering requirements in multiprocessor ARM, which allows more reordering.
There are other wins in M1 like shared memory between CPU, GPU, and I/O, but those are smaller wins compared to the big decoder win.
So yes this does foreshadow the rise of RISC-V as RISC-V also has a simple-to-decode instruction set. It would be much easier to "pull an M1" with RISC-V than with X86. Apple could have gone RISC-V, but they already had a huge investment in ARM64 due to the iPhone and iPad.
X86 isn't quite on its death bed, but it's been delivered a fatal prognosis. It'll be around for a long long time due to legacy demand but it won't be where the action is.
>This extra heat limits thermal throttling and thus sustained clock speed unless you have really good cooling, which is why the super high performance X86 chips need beefy heatsinks or water cooling.
The 16 core Ryzen has the same TDP as the 8 core Ryzen. Increasing the clock speed for slightly more single core performance is an intentional design decision, not an engineering flaw. Clock up those Apple chips and they are going to guzzle more power than AMD's chips. https://images.anandtech.com/doci/14892/a12-fvcurve_575px.pn...
Apple's preference for manufacturing processes that optimize for low mobile power consumption below the 4 GHz range means scaling up is harder than just slapping a higher TDP on the chips. Remember, the TDP of the whole package already exceeds the TDP of the most power-hungry Ryzen core running at 4.8 GHz. Apple has enough headroom to boost to the same frequencies but they don't, because the manufacturing process they have chosen loses all of its efficiency beyond 4 GHz.
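That's easiest to see with the usual back-of-envelope dynamic-power relation, P ≈ C·V²·f, where the voltage itself has to rise as you chase higher frequency (the knee in the frequency/voltage curve linked above). The numbers below are invented purely for illustration; only the shape matters:

    // Back-of-envelope dynamic power: P ~ C * V^2 * f, with V rising as f rises.
    // (frequency, voltage) pairs are illustrative, not measured values.
    #include <cstdio>

    int main() {
        const double C = 1.0;  // arbitrary capacitance units
        const double points[][2] = {
            {2.0, 0.75}, {2.5, 0.80}, {3.0, 0.90}, {3.5, 1.05}, {4.0, 1.25},
        };
        for (auto &p : points) {
            double f = p[0], v = p[1];
            double power = C * v * v * f;  // relative dynamic power
            std::printf("%.1f GHz @ %.2f V -> relative power %.2f\n", f, v, power);
        }
    }

With these made-up points, doubling the frequency costs roughly 5-6x the power, which is why "just clock it up" is not free.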
I haven't studied it carefully, but it sure looks like 90% of the performance improvement is using a big cache, which is a totally obvious thing to do. Also the big x86 guys have more or less been asleep at the wheel for almost a decade.
My go to example: my 2011 x220 sandybridge stinkpad is faster than my 2017 kaby lake mbp. 2005 machines (I dunno, Lakeport?) aren't even in the same ballpark as modern machines. Had that pace continued up to current year, the M1 chip would be a stinker. As it is, AMD is close and could smoke M1 in next generation 5nm chips, restoring order to the universe.
> I haven't studied it carefully, but it sure looks like 90% of the performance improvement is using a big cache, which is a totally obvious thing to do. Also the big x86 guys have more or less been asleep at the wheel for almost a decade.
Dude, has Intel called you yet? You've got some serious CTO chops.
I think the narrative around the instruction set is a bit overblown. I was a chip architect for the shader core at a major GPU company. I worked on simulators and on modeling performance for next-generation chips, where we changed the ISA for each family of chips. The big reason why Apple Silicon is so damn fast is that they were able to shape the design at modeling time exclusively around Mac system-level workloads. At best, Intel would get some subset of important traces from Apple to optimize for. Combine being able to narrow traces down exclusively to one platform with a heterogeneous design space (CPU + coprocessors), and you can really tune a monster.
> The big reason why Apple Silicon is so damn fast is because they were able to shape the design at modeling time exclusively around Mac system level workloads
Is that really the case? My understanding was that M1 is fast because it's able to keep the chip saturated with instructions due to a large L1 cache and wide instruction decoders. Is anything about that specific to mac workloads?
> Is anything about that specific to mac workloads?
The memory and instruction architecture may be more 'generic' but it and the neural engine, storage and media controllers, image processor etc will have been shaped and fine tuned by the requirements of the mac.
It is probably the marginal gains of each subsystem being 5-10% better for purpose that gives it the edge.
Does optimizing for a single system really improve performance significantly in general purpose computing benchmarks like SPEC? IIRC, the M1 also does fairly well with virtualized Linux.
It does. The point is how much silicon space Apple could use for the CPU in the SoC. Most application processors and CPUs are designed to be as general-purpose as possible, so some space goes to interfaces a given customer doesn't need. Apple could use that space for the CPU instead. Apple could also increase the die size without worrying about profit on chip production, because they earn money from their devices, not by selling SoCs like the others do.
No other companies can do silicon business like Apple.
being "a chip architect for the shader core at a major GPU company" sounds like a dream job for me.
Do you have any interesting tips or books to read for a fellow hardware design engineer? :)
I don't see how an ISA doesn't matter. While not a chip architect like you, I do work as a developer and I know that the interface you make to something affects what kind of performance you can build in the backend.
In principle, whether you are using Python or C++ doesn't matter. It is just an interface; the compiler or interpreter behind it decides the actual performance. Yet it is pretty obvious that the C++ specification makes it much easier to create a high-performance compiler than the Python specification does.
I have been quite involved with Julia. It is a dynamic language like Python, but specific language-design choices have made it possible to create a JIT that rivals Fortran and C in performance.
Likewise, we have seen from Nvidia slides, when they went with RISC-V over ARM, that the simpler and smaller instruction set of RISC-V allowed them to make cores consuming much less silicon, better fitting their silicon budget.
When you worked as a chip architect didn't the ISA affect in any way how hard or easy it would be for you to make an optimization/improvement in silicon?
I mean, if one ISA requires memory writes to happen in order, or has variable-length instructions, or leaves too little space for encoding register targets, etc., all that kind of stuff is going to make your job as a chip architect harder, isn't it?
Also, I don't quite get your argument about modeling the M1 around Mac workloads. We know the M1 has great performance on Geekbench and other benchmarks which have not been specifically designed for Mac workloads.
The only things I can see in the M1 that are specific to the Mac are:
1) They run the code needed for automatic reference counting faster (see the sketch after this list). That is a big deal on the Mac since more of the software is Objective-C or Swift, which use automatic reference counting.
2) They prioritized faster single cores over multiple cores. Hence optimizing more for a desktop type workload than a server workload.
3) A number of coprocessors/accelerators for tasks popular on Macs such as image processing and video encoding. But that is orthogonal to the design of the Firestorm cores.
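On point 1, a rough illustration of why cheap uncontended atomics matter for reference-counting-heavy code. This uses C++ shared_ptr as a stand-in (ARC is not shared_ptr, but the hot path has the same shape: an atomic increment per retain and an atomic decrement per release):

    // Reference-counting traffic sketch. Every copy of a shared_ptr is a
    // "retain" (atomic ++), every destruction a "release" (atomic --).
    // Code that passes objects around generates a constant stream of these
    // uncontended atomics, so a core that retires them cheaply wins.
    #include <cstdio>
    #include <memory>
    #include <vector>

    struct Node { int value; };

    int sum(std::vector<std::shared_ptr<Node>> nodes /* copied: one retain each */) {
        int total = 0;
        for (const auto &n : nodes) total += n->value;
        return total;  // copies destroyed here: one release each
    }

    int main() {
        std::vector<std::shared_ptr<Node>> nodes;
        for (int i = 0; i < 1000; ++i)
            nodes.push_back(std::make_shared<Node>(Node{i}));
        std::printf("%d\n", sum(nodes));  // ~1000 retains + ~1000 releases per call
    }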
I don't claim to know this remotely as well as you. I am just trying to reason based on what you said and what I know. Would be interested in hearing your thoughts/response. Thanks.
Does this potentially mean that as the OS evolves, the chip will likely become less efficient, as it becomes "out of tune"? Apple could mitigate this obviously.
Not clear what RISC-V has to do with the Apple M1.
Also not clear what benefit RISC-V would have for "coprocessors". GPUs and various machine learning speedup devices are massively parallel devices, intended to run small, specialized programs in parallel on multiple specialized execution units.
Also note that the real win of the Apple M1 is lower power consumption. In terms of basic compute speed, there are Intel products that are roughly comparable. But they use much more power. This is more about battery life than compute power. (Also heat. Apple laptops have had a long-standing problem with running too hot, from having too much electronics in a thin fanless enclosure. The M1 gets them past that.)
The hardware video decoder is probably to make it possible to play movies with most of the rest of the machine in a sleep mode. The CPU is probably fast enough to do the decode in software, but would use more power than the video decoder.
> Also not clear what benefit RISC-V would have for "coprocessors".
As the article states, the architects of RISC-V recognized that co-processors that assist the CPU to do more and more specialised repetitive tasks will be the norm. Thus, RISC-V was designed in a way to accommodate such co-processors, with limited instruction sets that make its CPU design simpler.
The Apple ARM processor is also similar - the ARM system-on-chip they have designed is highly customised with many co-processors all optimised for the macos / ios platform.
Apple's SoC contains a GPU, an image processing unit, a digital signal processing unit, Neural processing unit, video encoder / decoder, a "secure enclave" for encryption / decryption and unified memory (RAM integrated) etc. (Note that this is not a unique innovation - many ARM SoCs like these already exist in different variations. In fact, it's what made ARM popular.) Obviously, when a system software or application uses these specific units of an SoC to process specific data, they may be faster than a processor that doesn't have these units. And Intel and AMD processors currently don't have these specific units integrated with their CPU.
Anyway, the point the article is making is that the RISC-V architects recognized that such co-processors will be the norm in the future, and thus the author is predicting that RISC-V will become more popular, now that the M1 acts as a showpiece for the architectural idea that RISC-V wants to popularise.
And this is just idle speculation that I (and I guess Animats) don't necessarily buy. I think RISC-V is doing fine and will grow regardless.
Where more Arm mainstream success will have a slipstream effect on RISC-V is in app porting. There are significant differences between x86 and Arm, notably the memory model (Apple Silicon does support TSO with a flag, but native apps use the weak mode). Porting from x86 to Arm can be non-trivial, whereas porting from Arm to RISC-V is far easier.
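As a concrete example of the kind of porting hazard the memory model creates (an illustration, not anything M1-specific): the classic message-passing pattern below happens to work with relaxed ordering on x86 hardware because of TSO (leaving compiler reordering aside), but is allowed to break on a weakly ordered Arm core unless you ask for release/acquire ordering explicitly.

    // Message passing: thread A writes data then sets a flag; thread B waits
    // on the flag then reads the data. Release/acquire makes this correct on
    // any architecture; with weaker ordering, a weakly ordered machine may
    // reorder the accesses, while x86's TSO happens to keep them in order.
    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;
        ready.store(true, std::memory_order_release);   // all prior writes visible first
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) { }  // later reads can't move above this
        assert(data == 42);   // guaranteed with release/acquire
    }

    int main() {
        std::thread a(producer), b(consumer);
        a.join();
        b.join();
    }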
RISC-V is just an ISA. How exactly can it popularize the already extremely popular idea of shoving a bunch of peripherals onto a SoC?
> RISC-V was designed in a way to accommodate such co-processors, with limited instruction sets that make its CPU design simpler
This only affects the extremely tiny embedded space; only under the most extreme constraints do you get the "simpler ISA → simpler CPU core design → more space on the silicon for coprocessors" effect.
For a general purpose high performance SoC, you don't want a simple CPU design, you want a fast CPU design, and you have space for all the coprocessors you want anyway.
Other than "being simple", an ISA has little to do with coprocessors. There's nothing ISA-specific about having memory-mapped peripherals.
Adding custom instructions directly to the CPU ISA instead? That's not exactly coprocessors, that's more like modifying the main processor, it's annoying (fragmentation) and Apple for some reason was allowed to do it with Arm anyway >_<
> Intel and AMD processors currently don't have these specific units integrated with their CPU
Of course they have GPU, video encode/decode, "secure enclave" (fTPM).
Hardware accelerator modules also often need their own mini-cpu built in to them as a controller, apart from the main CPU cores. RISC-V was specifically designed for this use case to have a very light weight core ISA, with an extensions mechanism that make it easily customisable for specific accelerator design. Even the lightest weight ARM cores are monsters in comparison.
Thus we are likely to see a lot of new ARM chips containing a few RISC-V cores tucked away inside the design. In fact NVIDIA already does this on some graphics cards and it’s not impossible M1 does as well.
They are both RISC instruction sets used for SoC-style chips, so what makes the M1 successful should also work for RISC-V. The key non-technical difference is that you won't have to license it from Nvidia, which might make it attractive to companies that don't want to pay as many license fees.
Licensing and patents have historically been in the hands of only a few companies, which limits other companies doing custom designs. With RISC-V, that could change. Of course, that's only the instruction set, and you'd likely still need to license lots of patents to get anything shipping. But it fits the pattern of open source driving a lot of innovation, and of hardware design becoming more like software design.
Theoretically, if Intel wanted to make a comeback, RISC-V might actually be interesting for them. Right now they would have to compete with Apple, Nvidia/ARM and the likes of Qualcomm for non-x86 CPUs. Those three are basically using ARM-based designs, and you need to license patents and designs to do anything there. Intel having to license chip designs and patents from their competitors is likely not compatible with their ambitions of dominating that market (like they dominated x86 for nearly half a century). They are clearly having issues keeping x86 relevant, so RISC-V might provide them an alternative. The question is whether they have enough will left to think laterally like that, or whether they are doomed to slowly become less relevant as they milk the x86 architecture.
> Also note that the real win of the Apple M1 is lower power consumption.
Doesn’t this automatically translate into higher performance — by adding more cores or increasing clock rate — since TDP is the limiting factor for CPU speed?
I mean, if someone created a 1W CPU that performed as well as a 100W CPU, would you say “lovely, a lower power CPU” or “overclock/add cores until it reaches 100W and give me that”?
Why do people ascribe broader ARM implications to the M1? Apple uses the ARM instruction set to make an amazing CPU. It could probably make one with the x86 set too. It doesn’t mean everyone else making ARM processors will suddenly get much better. Not to mention that Apple’s very similar A series has already been around for years.
There was an article posted not long ago that suggested the variable length instruction set in x86 chips prevented some of M1's most important design innovations being replicated by Intel and AMD.
It's true that ARM64 has a load-store architecture and fixed-length instructions (the latter depending on the former for encoding space efficiency). Other than that, the instruction set design is very far from minimalist textbook-style RISC ISAs like RISC-V. It has both flag-based branches and fused compare-to-zero-and-branch instructions. It has very complex immediate encodings. It has instructions for loading/storing register pairs. It has pre-increment/post-increment addressing modes of the kind that were hallmarks of CISCs like M68K and VAX.
It seems unwise to draw far-reaching conclusions about RISC-V or even ARM64's intrinsic merits versus Apple's CPU designers when there are so many variables. The frontend decoder hasn't been a frequent bottleneck in Intel cores for a long time and they could scale it up more aggressively if they wanted.
Apple's engineers did a great job. That seems to be the conclusion we can draw based on currently available evidence.
I doubt that --- modern x86 (everything since the original Pentium) breaks instructions into uops anyway and caches those, so if anything I'd say the M1 is impressive despite having relatively large fixed-length instructions.
There's some more discussion in here about the source of the M1's performance, and it largely seems to come down to the smaller process size that enabled Apple to scale up a lot of the structures in the uarch:
But perhaps Intel/AMD can surprise us with a dynamic allocator that runs in the reorder buffer. Or perhaps they can still push the limit one more time with more transistors. Another option would be to implement a fast-path for small instructions, so in effect they would be moving from CISC to RISC but only for parts of the code that need the extra performance.
Well, the perception was that AMD and Intel had an unassailable lead. Yet even with a power and clock-speed disadvantage, the M1 is quite competitive with several other serious mobile chips, like the Intel i9.
Now apple has proved that a cool running chip that sips power can run a wide variety of intensive applications well.
People were quite dubious of Apple's chances with a competitive desktop chip and have just received a wake-up call from a relatively conservative M1 chip (3.2 GHz and 4 fast cores).
It wasn't generally thought that AMD/Intel were unassailable engineering-wise, just that software compatibility, x86 patents and volumes were important enough that it was economically hard to go against them. Other chips (e.g. IBM's) regularly challenged them on speed despite relatively tiny volumes and budgets. And of course, years earlier the Itanium debacle plus exponentially increasing fab costs (favouring volume) killed off most of the RISC competition.
Trivia: around the time of the previous Mac ISA transition, Apple acquired PA Semi, who had a power-efficient and fast PPC chip. Apple decided to go to Intel anyway instead of betting on that new in-house chip. Discarding the highly acclaimed design, they put the newly acquired team to work on the A series of chips instead.
But they had no reason to be skeptical, given the A series. To only take a processor seriously once it’s housed in a case with a keyboard attached is ridiculous.
It’s slightly tangential, but I’m managing an engineering team of 28-30 and we’re currently considering a wholesale change to ARM CPUs across the board.
MacBooks are our de facto development laptop and all our services use skaffold for local development, Docker basically. If we consider the perhaps likely outcome that MacBooks will one day be ARM-only, that Docker will not offer cross-arch emulation, and that our development environment will be ARM only, it then becomes likely that we’ll migrate our UAT and PROD to ARM based instances.
If we go that route it’ll mean more money to the AWS Graviton programme and likely further development of ARM chips. I can’t see this affecting RISC-V but the M1 switch could very well benefit the wider ARM ecosystem.
You’re basically locking yourself to a single development ecosystem, and a highly limited deployment ecosystem.
It’s not clear what the benefits of either are. I get that the MacBook gets great performance for its battery life, but the majority of the work is going to be done in desktop settings, so simply using more/equally powerful x86 chips is only going to cost you a few dollars per developer per year in electricity.
And all that despite the fact that your development is on Docker which doesn’t even have a working solution for the workflow you’re considering at the moment.
x64 virtual machines, Docker, etc have to be supported on Apple's M chips for a long time to come. There's zero risk of this changing soon unless Apple wants to scuttle the non-iOS/non-Mac developer market for Mac.
M1 is a cool chip, but there's no reason for an average development company to rush into it unless targeting M1 MacOS specifically. Maybe the server world swings to ARM, but that will take decades to sort out, if it actually happens at all.
"Macs are now crazy-fast but they're still Macs so few people will switch".
Anecdotally there have been a bunch of posts on HN since the M1 Macs shipped by people who've either stopped using Macs years ago or who've never bought a Mac previously who are happy M1 Mac owners.
The M1 Mac mini retails at $699, but I've already seen it as low as $625. There's certainly nothing in that price range that's better.
And even before the M1 Macs shipped in November, Mac revenue hit an all-time high of $9 billion in the quarter that ended September 26, 2020 [1]. Apple often highlights that about 50% of Mac customers are new to the Mac, a trend that's likely to accelerate.
The second narrative doesn’t explain AWS throwing its weight behind ARM.
Not to say that ARM is killing x64, it’s definitely not, but ARM is clearly being invested in and rolled out at a massive scale by 2 of the biggest tech companies in the world in both consumer devices and server side. To me that’s quite something.
Apple is playing the margin game not the volume game. Just like Apple takes something like 98% of the profit in the global phone manufacturing business, I wouldn’t be surprised if they’re doing the same thing in the developer compute market.
Worth noting that an ISA is more than a set of instructions; it’s also a semantics for those instructions. For example, the concurrency semantics of ARM processors permit a much larger array of optimizations at the per-thread level, which is good for performance.
2) More registers. ARM64 has 32 general-purpose registers and 32 registers for SIMD; x86 has fewer, and some of those are wasted on all sorts of legacy junk.
3) Laxer restrictions on memory write-back. It is easier to optimize out-of-order execution on ARM, as you don't need to write everything back to memory in order.
As for everybody else: ARM designs from ARM Ltd. are showing rapid performance increases and gradually closing the gap to x86. It really is inevitable, as there is NOTHING special about the x86 ISA that gives it higher performance. Nothing prevents other ARM makers from catching up: https://medium.com/swlh/is-it-game-over-for-the-x86-isa-and-...
It's not even clear that the M1's big leap is due to ARM vs x86 rather than, say, 5nm vs 7nm (AMD) or 14nm (Intel), or design ideas such as big/little cores and more specialized accelerators (which, ironically, run against the RISC idea people claim is the reason ARM beats x86 and thus the reason the M1 does well).
Specialized accelerators don't explain it, because we're measuring general-purpose CPU tasks for the most part.
Big/little is good for power consumption, not so much for performance (which is still good).
There's a lot of microarchitectural goodness here beyond ARM, though. Apple got lots of little details right, and a fat connection to memory helps, too. It doesn't hurt to be on the leading fab either.
I'm out of my league here, but I've seen references to 8-bit cores that can run at a couple of giga-instructions per second. It's hard to overstate the performance per watt cores like that are capable of. Also sub-nanosecond interrupt latency.
Think of a small coprocessor with local memory that's pulling commands out of a queue and managing an I/O controller. A couple of wins: lower power consumption, fewer context switches, and less cache pressure.
Originally it was because the US government required a second source for any component, and so Intel had to license x86 to somebody in order to supply the US government.
Then later AMD's 64 bit instructions became the standard, so Intel needed the license for the 64 bit extensions and AMD needed the x86 base and so they just decided to cross-license and call it good.
There's actually a 3rd x86 license, that has changed hands quite a few times (Cyrix -> National Semiconductor -> Centaur(IDT) -> VIA -> Zhaoxin, I think, unless I missed a few transitions?)
ARM has been eating away at Intel for a while now, the same way Intel ate away at the mainframe and minicomputer market in the 1980s and the MIPS/Sparc/HPPA/Alpha workstation market in the 1990s. While the mainframes and minis ate the low-cost PCs for lunch in the 1980s, and the $20,000 workstations of the 1990s had far better performance than a 386 (or even 486 did), PCs were cheaper and more widely available.
It was the economies of scale and the standardization on x86_64 that made the PC the performance king in the first 2000s decade. Intel (and, of course, AMD) x86 did not have the best ISA but they, because of economies of scale, had the best fabs which let them outperform anything else.
While Intel was dominating with raw performance in the first 2000s decade, embedded chipsets slowly coalesced around the ARM ISA, a process which was accelerated by Apple choosing ARM for the iPhone (Nokia also used ARM in a lot of their phones).
Moore’s Law finally stopped working for Intel and they stopped being able to outfab everyone else in the mid-to-late 2010s; a 2012 x86 chip has about the same performance as, say, a 2017 x86 chip.
Intel saw the writing on the wall with people using non-Intel ISAs for phones, and tried to make an Atom chip which would work in a phone; it was a flop. Nobody wanted the x86 ISA unless they needed it in systems which ran legacy applications.
With the Raspberry Pi moving up from being suitable only for specialized embedded applications to having near-desktop-level performance, with Apple finally making an ARM chip which is competitive with (and in some cases superior to) Intel’s desktop chips, and with legacy x86 Windows applications in many cases being replaced by webpages and smartphone applications, it looks like the industry as a whole is finally moving past x86 and its bloated instruction set.
This is a much needed breath of fresh air for the computer industry. I like the M1 because I like that we now have mainstream non-x86 desktop/laptop computers again.
I think RISC-V has a lot of potential, and I am interested in what comes of it in the 2020s, whether it blooms like ARM did, or goes the way of HPPA, Alpha, or Sparc.
> Nobody wanted the x86 ISA unless they needed it in systems which ran legacy applications.
It wasn't fast/efficient enough. If it had been faster or had better power consumption, it would have been fine. There was a massive push to get Android Studio to automatically compile x86 binaries for you, etc.
But why would you put an Atom in your phone if it means slower performance and worse battery life? That's the reason it flopped. An ISA change is a hindrance to switching, but it can be overcome. Even if Intel had sold Atom chips with the ARM ISA and identical performance to the x86 variant, they would still have flopped due to the poor performance and efficiency.
>legacy x86 Windows applications being in many cases replaced with webpage and smart phone applications, it looks the industry as a whole is finally moving past x86 and its bloated instruction set
Similar predictions about the declining importance of legacy support have been made over the past 40 years and have not been borne out. Performant x86 emulation is an absolute must for a replacement ISA.
The broader point is that RISC-V provides the freedom and a practical ecosystem in which to innovate. Custom instructions may only be a small part of that.
The PC platform has plenty of freedom for innovation. It's actually quite open: you can create whatever peripheral, add-on, whatever you want on top of it.
The problem is that getting mass adoption of your fancy new bespoke offering is quite a bit trickier, no matter how good it is at doing its thing, and that problem does not go away with a different ISA; it probably gets harder, to be honest.
You cannot innovate in how things are integrated, however. You have to stick to the industry standards. Nor can you, say, innovate by switching the CPU architecture in your computer.
A vertically integrated system like a Mac allows for much more innovation.
In fact this is true for any vertically integrated system. If you look at the Amiga, NeXT, SGI, SPARC and many others, they were always far ahead of the PC in terms of technology.
I'm missing the reason why RISC-V in particular is (claimed to be) so much better suited for building specialized co-processors. Are they talking about ISA extensions? Or maybe that the royalty free model makes it cheaper?
Maybe you skipped the section "What is the benefit of sticking with Risc-V"?
> But for a coprocessor you don’t want or need this large instruction-set. You want an eco-system of tools that have been built around the idea of a minimal fixed base instruction-set with extensions.
Essentially: the modular nature of Risc-V and tooling/ecosystem built around it, with first class extension support.
ARM is closed, too complex and not friendly to extensions, while custom ISAs mandate a huge amount of extra work.
I only dabble in this field, but I see the ecosystem rapidly maturing. The open nature also leads to a general propensity to open source designs and tooling, lowering the barrier to entry and reducing cost.
I think people think it's going to be a vibrant open ecosystem in collaboration with industry and academia with a lot of development and fresh ideas leading to some significant simplification and so opportunity for performance and functionality breakthroughs. I don't know if that's true or not.
Same, I love the idea of RISC-V but I don’t really get how it’s better either from a technical standpoint. Would love to hear more about the main advantage of RISC-V against others.
I implement RISC-V for a living and for fun. Arm64 is a very good modern ISA. RISC-V is good too. Comparison with RISC-V must account for the much broader purpose of RISC-V, but just to focus on a few points (I'm not an Arm64 expert though):
* RISC-V (RV64GC) has simpler instructions than Arm64. It's possible it would have a slight frequency advantage given the same implementation resources, but Arm64 might need slightly fewer instructions. Notably, Arm64 has more addressing modes. Fusion and cracking make this mostly a wash, but implementing an RV64 core is a lot easier than an Arm64 one (I speculate).
* Arm64 has load pairs and store pairs; this is a significant advantage.
* RISC-V has no flags, and conditional branches compare their operands directly. This looks like a significant advantage in the code I have looked at, and it is easier to implement (no flag renaming etc.).
RISC-V has a tiny base instruction set consisting of very simple instructions. That makes it possible to implement a RISC-V core on a very small silicon die. Smaller, simpler cores make it easier to increase clock frequency as well as reduce power usage. That gives RISC-V an advantage over competition such as ARM, where any implementation will require a much larger and more complex core to handle all the instructions you must implement to be a valid ARM CPU.
It won’t happen. People (aka businesses) want something that just works for the person doing data entry or scheduling calendars, and they expect the coder guy (even at a company that's top 3 in the world by valuation) to suck it up and live with bare-bones IT policy.
My opinions are my own. But the things one sees coming from them, they make things up to justify their continued existence.
Although I think the premise of the article is wrong, it is nicely written.
A major nitpick: unified memory is being massively over-hyped. There is a reason GPUs have their own memory bus: contention. CPUs and GPUs fighting over access to memory cause massive disruption to very parallel computation. Even if Intel/Nvidia resolved their fight over inter-CPU connectivity, or we're talking POWER and Nvidia using NVLink, you still need extra memory ports to keep things fed. The more cores and the faster the GPU, the more memory bandwidth required.
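A crude way to put numbers on the contention argument (these are assumptions for illustration: roughly 68 GB/s is the commonly cited figure for the M1's LPDDR4X interface, and the workload demands are invented):

    // Bandwidth-budget sketch: CPU and GPU sharing one bus vs. a GPU with
    // its own dedicated memory. Numbers are illustrative, not measurements.
    #include <cstdio>

    int main() {
        const double shared_bus_gbs = 68.0;   // unified memory bus (assumed)
        const double gpu_demand_gbs = 50.0;   // hypothetical heavy GPU workload
        const double cpu_demand_gbs = 30.0;   // hypothetical CPU streaming workload

        double total = gpu_demand_gbs + cpu_demand_gbs;
        std::printf("demand %.0f GB/s vs shared bus %.0f GB/s -> ", total, shared_bus_gbs);
        if (total > shared_bus_gbs)
            std::printf("both sides stall; each gets roughly %.0f%% of what it asked for\n",
                        100.0 * shared_bus_gbs / total);
        else
            std::printf("no contention\n");

        // A discrete GPU with its own GDDR bus (hundreds of GB/s) never competes
        // with the CPU for these bytes, which is the point about dedicated buses.
    }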
I expect to see future Mac Pro M1 series machines with multiple CPU sockets -- at which point memory isn't unified any more, and all the regular CC-NUMA tricks will be used. But it won't be a big deal.
> unified memory is being massively over hyped. There is a reason GPUs have their own memory bus -- contention. CPU/GPUs fighting over access to memory causes massive disruption to very parallel computation.
Not sure I totally agree with this.
Game consoles have a unified memory architecture and it’s a beloved feature. It greatly simplifies things and allows the CPU to use results computed by the GPU far more easily, without complicated sync commands or frame delays.
Maybe unified architecture is less valuable for non-interactive programs. I’m not sure. This is a fair bit outside of my wheelhouse.
Memory access is definitely one of the biggest bottlenecks. So I fully agree with the general concern. And you may even be right that the unified architecture isn’t that interesting. But I’m not so sure it’s the problem you think it is.
It does indeed make it easier to code. But you may have noticed that the Xbox Series X has moved to having different-speed memory pools with different memory controllers, even if they are logically contiguous and uniformly accessible. The CPU pool has a narrower bus and can't be used for graphics objects; you have to queue a copy to the graphics pool, and the driver then schedules a deconflicted copy.
The PS5 architecture also has different speeds -- but the slow access is to SSD! It has manual management of streaming resources from SSD to RAM, but also allows direct SSD reads -- but that is a trick, because the SSD controller has a huge RAM cache too.
So I'd say we're actually moving away from UMA in general. I think that memory aware scheduling is going to be the next win -- online learning to understand memory access patterns and scheduling compute and cache fill. Fancy cache algorithms used to take too much logic (and slow cache fill logic down), but for SSD->RAM you can do lots of prediction based on program state.
Apple is super fucking rich, I wouldn't be surprised if they are ready to pay TSMC enormous money to make monster monolithic dies, and Mac Pro could be single-socket only, packing >128 cores into one socket.
You need lots of memory too. Maybe not an issue for the desktop. If they doubled the SOC for 16 cores (8 hot/8 cool) and 32GB of RAM that would be a decent Mac Pro.
But plenty of people want >32GB of memory; I routinely use machines with 256GB. There's no way you can get enough RAM into the SoC. Large core counts are even harder, because the speed of light means you need a new on-socket switched architecture, memory on the far side is slow to access, etc. TANSTAAFL.
I did indeed read that, but it has nothing real to say about it. Instead it says this:
> Apple uses memory which serves both large chunks of data and serves it fast. In computer speak that is called low latency and high throughput. Thus the need to be connected to separate types of memory is removed.
That is just hand waving. It is possible to produce such memory, but it involves ultra wide busses, far wider than optimal for filling CPU caches, and preferably directly connected to the GPU rather than a multi-master bus or switch.
There is the possibility that Apple has built a very fancy memory interposer that leverages the short distances in the SOC to present the memory both wide (to GPU) and narrow (for filling a queue of L2 misses), so that cache fills pause while GPUs read/write. That would be a highly interesting piece of logic. But of course it can't scale outside of the SOC.
Unfortunately, it looks like the project is now abandoned.
Exactly. ARM has been progressing faster than Intel. For the past 8 years or so, Apple has had the fastest ARM CPU out there on the iPhone/iPad. Apple has sucked up TSMC's 5nm production. They've integrated a pile of relevant coprocessors into the CPU and put fast RAM on the package. The SSD is lightning fast and SSD encryption is done via a dedicated coprocessor.
It's not one magic trick, it is countless bits of engineering, manufacturing, and purchase choices.
Except for all the people practically begging Apple to make a phone for years, except all the analysts who wrote essays on how computer companies could make successful phones, except for all the fanboys making fan-art of phones with that big circular wheel.
People always make the point that the X86 decoder is only ~5% of the die. Sure, that's true, but keep two things in mind:
(1) While it's only 5% of the die, it runs constantly at full utilization. The ALU is also only a small percentage of the die (5-10%). How hot does your CPU get when you're running the ALU full blast? Now consider that there is a roughly ALU-sized piece always running full blast no matter what the CPU is doing, because X86 instructions are so complex to decode. Not only does this give X86 a higher power use "floor," but it means there's always more heat being dissipated. This extra heat forces thermal throttling and thus limits sustained clock speed unless you have really good cooling, which is why the super high performance X86 chips need beefy heatsinks or water cooling.
(2) It apparently takes exponentially more silicon to decode X86 instructions with parallelism beyond 4 instructions at once. This limits instruction level parallelism unless you're willing to add heat dissipation and power, which is a show stopper for phones and laptops and undesirable even for servers and desktops.
People make the point that ARM64 (and even RISC-V) are not really "RISC" in the classic "reduced" sense as they have a lot of instructions, but that's not really relevant. The complexity in X86 decoding does not come from the number of instructions or even the number of legacy modes and instructions but from the variable length of these instructions and the complexity of determining that length during pipelined decode.
M1 leverages the ARM64 instruction set's relative decode simplicity to do 8X parallel decode and keep a really deep reorder buffer full, permitting a lot of reordering and instruction level parallelism for a very low cost in power and complexity. That's a huge win. Moreover there is nothing stopping them from going to 12X, 16X, 24X, and so on if it's profitable to do so.
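To make the decode point concrete, here is a toy sketch in plain C. The x86_insn_length() helper is purely hypothetical, standing in for the real prefix/opcode/ModRM/immediate parsing a decoder has to do before it even knows how long an instruction is; the fixed-width case needs no such helper at all:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper: returns the encoded length (1..15 bytes) of the
       x86 instruction starting at p. In real hardware this means parsing
       prefixes, opcode bytes, ModRM/SIB and immediates before the length
       is known -- it is only a stand-in here, not a real decoder. */
    size_t x86_insn_length(const uint8_t *p);

    /* Variable-length ISA: the start of instruction i depends on the
       lengths of instructions 0..i-1, so boundary finding is inherently
       serial. */
    void find_boundaries_x86(const uint8_t *code, size_t starts[], int n) {
        size_t off = 0;
        for (int i = 0; i < n; i++) {   /* serial dependency chain */
            starts[i] = off;
            off += x86_insn_length(code + off);
        }
    }

    /* Fixed 4-byte ISA (ARM64-style): every boundary is known up front,
       so n decoders can each grab their own instruction independently. */
    void find_boundaries_fixed(size_t base, size_t starts[], int n) {
        for (int i = 0; i < n; i++) {   /* iterations are independent */
            starts[i] = base + 4u * (size_t)i;
        }
    }

The second loop has no dependency between iterations, which is what makes a wide decoder comparatively cheap. The first loop has a serial chain; real x86 decoders break it by speculatively decoding at many byte offsets and throwing most of that work away, which costs power and area.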
The second big win is probably weaker memory ordering requirements in multiprocessor ARM, which allows more reordering.
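As a rough illustration of what the weaker model buys, here is a C11 atomics sketch. This is about the architecturally visible memory model, not anything M1-specific: two relaxed stores may become visible out of order on ARM, while x86's TSO keeps them in program order whether you asked for it or not.

    #include <stdatomic.h>

    atomic_int data  = 0;
    atomic_int ready = 0;

    /* Publish data, then raise a flag. With relaxed ordering an ARM-class
       memory model lets the hardware commit these two stores in either
       order; x86 effectively keeps them in program order regardless. */
    void produce_relaxed(void) {
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        atomic_store_explicit(&ready, 1, memory_order_relaxed); /* may be seen first on ARM */
    }

    /* To get the x86-like guarantee portably you must ask for it, which is
       exactly the constraint the weaker model lets the core skip whenever
       the code doesn't need it. */
    void produce_ordered(void) {
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        atomic_store_explicit(&ready, 1, memory_order_release); /* data visible before ready */
    }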
There are other wins in M1 like shared memory between CPU, GPU, and I/O, but those are smaller wins compared to the big decoder win.
So yes this does foreshadow the rise of RISC-V as RISC-V also has a simple-to-decode instruction set. It would be much easier to "pull an M1" with RISC-V than with X86. Apple could have gone RISC-V, but they already had a huge investment in ARM64 due to the iPhone and iPad.
X86 isn't quite on its death bed, but it's been delivered a fatal prognosis. It'll be around for a long long time due to legacy demand but it won't be where the action is.
The 16 core Ryzen has the same TDP as the 8 core Ryzen. Increasing the clock speed for slightly more single core performance is an intentional design decision, not an engineering flaw. Clock up those Apple chips and they are going to guzzle more power than AMD's chips. https://images.anandtech.com/doci/14892/a12-fvcurve_575px.pn...
Apple's preference for manufacturing processes that optimize for low power consumption below the 4GHz range means scaling up is harder than just slapping a higher TDP on the chips. Remember that the TDP of the whole package already exceeds the TDP of the most power-hungry Ryzen core running at 4.8GHz. Apple has enough headroom to boost to the same frequencies, but they don't, because the manufacturing process they have chosen loses all of its efficiency beyond 4GHz.
My go-to example: my 2011 X220 Sandy Bridge stinkpad is faster than my 2017 Kaby Lake MBP. 2005 machines (I dunno, Lakeport?) aren't even in the same ballpark as modern machines. Had that pace continued up to the current year, the M1 chip would be a stinker. As it is, AMD is close and could smoke the M1 in next-generation 5nm chips, restoring order to the universe.
Dude, has Intel called you yet? You've got some serious CTO chops.
Comparing next gen to current gen is a strange way to do things. Apple will also have a next gen M chip.
Is that really the case? My understanding was that M1 is fast because it's able to keep the chip saturated with instructions due to a large L1 cache and wide instruction decoders. Is anything about that specific to mac workloads?
The memory and instruction architecture may be more 'generic', but it, along with the neural engine, storage and media controllers, image processor etc., will have been shaped and fine-tuned by the requirements of the Mac.
It is probably the marginal gains of each subsystem being 5-10% better for purpose that gives it the edge.
People need to stop saying stupid stuff like "It's only fast because it's so intensely focused on macOS."
In principle, whether you are using Python or C++ doesn't matter. It is just an interface; the compiler or interpreter behind it decides the actual performance. Yet it is pretty obvious that the specification of C++ makes it much easier to create a high-performance compiler than the specification of Python.
I have been quite involved with Julia. It is a dynamic language like Python, but specific language syntax choices have made it possible to create a JIT that rivals Fortran and C in performance.
Likewise, we have seen from Nvidia slides, when they went with RISC-V over ARM, that the smaller and simpler instruction set of RISC-V allowed them to make cores consuming much less silicon, better fitting their silicon budget.
When you worked as a chip architect, didn't the ISA affect in any way how hard or easy it would be for you to make an optimization/improvement in silicon?
I mean, if one ISA requires memory writes to happen in order, has variable-length instructions, or leaves too little space for encoding register targets, etc., all that kind of stuff is going to make your job as a chip architect harder, isn't it?
Also, I don't quite get your argument about modeling the M1 around Mac workloads. We know the M1 performs great on Geekbench and other benchmarks, which have not been specifically designed for Mac workloads.
The only things I can see in the M1 which are specific to the Mac are:
1) They make the code needed for automatic reference counting faster (see the sketch just after this list). A big deal on the Mac, since more of its software is Objective-C or Swift, which use automatic reference counting.
2) They prioritized faster single cores over more cores, optimizing more for a desktop-type workload than a server workload.
3) A number of coprocessors/accelerators for tasks popular on Macs, such as image processing and video encoding. But that is orthogonal to the design of the Firestorm cores.
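On point 1, to make it concrete: ARC-compiled Objective-C/Swift code is peppered with retain/release calls, and at bottom each one is an atomic increment or decrement of a per-object reference count. A minimal generic sketch in plain C (this is just the shape of the operation, not Apple's actual runtime):

    #include <stdatomic.h>
    #include <stdlib.h>

    /* A generic refcounted object -- a stand-in for what ARC's
       retain/release do per object, not Apple's runtime. */
    typedef struct {
        atomic_size_t refcount;
        /* ... payload ... */
    } object_t;

    /* retain: one atomic increment */
    static inline void obj_retain(object_t *o) {
        atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
    }

    /* release: one atomic decrement, freeing the object when the
       count drops to zero */
    static inline void obj_release(object_t *o) {
        if (atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_acq_rel) == 1) {
            free(o);
        }
    }

A core that makes uncontended atomics like these very cheap speeds up essentially every Objective-C and Swift program, which is the kind of workload-aware tuning being discussed, while costing other workloads nothing.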
I don't claim to know this remotely as well as you. I am just trying to reason based on what you said and what I know. Would be interested in hearing your thoughts/response. Thanks.
Also not clear what benefit RISC-V would have for "coprocessors". GPUs and various machine learning speedup devices are massively parallel devices, intended to run small, specialized programs in parallel on multiple specialized execution units.
Also note that the real win of the Apple M1 is lower power consumption. In terms of basic compute speed, there are Intel products that are roughly comparable, but they use much more power. This is more about battery life than compute power. (Also heat. Apple laptops have had a long-standing problem with running too hot, from having too much electronics in a thin fanless enclosure. The M1 gets them past that.)
The hardware video decoder is probably to make it possible to play movies with most of the rest of the machine in a sleep mode. The CPU is probably fast enough to do the decode in software, but would use more power than the video decoder.
As the article states, the architects of RISC-V recognized that co-processors that assist the CPU to do more and more specialised repetitive tasks will be the norm. Thus, RISC-V was designed in a way to accommodate such co-processors, with limited instruction sets that make its CPU design simpler.
The Apple ARM processor is also similar - the ARM system-on-chip they have designed is highly customised with many co-processors all optimised for the macos / ios platform.
Apple's SoC contains a GPU, an image processing unit, a digital signal processing unit, Neural processing unit, video encoder / decoder, a "secure enclave" for encryption / decryption and unified memory (RAM integrated) etc. (Note that this is not a unique innovation - many ARM SoCs like these already exist in different variations. In fact, it's what made ARM popular.) Obviously, when a system software or application uses these specific units of an SoC to process specific data, they may be faster than a processor that doesn't have these units. And Intel and AMD processors currently don't have these specific units integrated with their CPU.
Anyway, the point the article is making is that the RISC-V architects recognized that such co-processors will be the norm in the future, and thus the author is predicting that RISC-V will become more popular, now that the M1 acts as a showpiece for the architectural idea that RISC-V wants to popularise.
Where more Arm mainstream success will have a slipstream effect on RISC-V is in app porting. There are significant differences between x86 and Arm, notably memory model (AS does support TSO with a flag, but native apps use the weak mode). Porting from x86 to Arm can be non-trivial, whereas porting from Arm to RISC-V is far easier.
> RISC-V was designed in a way to accommodate such co-processors, with limited instruction sets that make its CPU design simpler
This only matters in the extremely tiny embedded space; only under the most extreme constraints do you get the "simpler ISA → simpler CPU core design → more space on the silicon for coprocessors" effect.
For a general purpose high performance SoC, you don't want a simple CPU design, you want a fast CPU design, and you have space for all the coprocessors you want anyway.
Other than "being simple", there's nothing an ISA has to do with coprocessors. There's nothing ISA-specific about having memory-mapped peripherals.
Adding custom instructions directly to the CPU ISA instead? That's not exactly coprocessors, that's more like modifying the main processor, it's annoying (fragmentation) and Apple for some reason was allowed to do it with Arm anyway >_<
> Intel and AMD processors currently don't have these specific units integrated with their CPU
Of course they have GPU, video encode/decode, "secure enclave" (fTPM).
There's even an ISP on some Intel laptop chips: https://www.kernel.org/doc/html/v5.4/media/v4l-drivers/ipu3....
Neural thingy.. I'm happy not to pay for one :P
Thus we are likely to see a lot of new ARM chips containing a few RISC-V cores tucked away inside the design. In fact NVIDIA already does this on some graphics cards and it’s not impossible M1 does as well.
Licensing and patents have historically been in the hands of only a few companies; which limits other companies doing custom designs. With RISC-V, that could change. Of course that's only the instruction set and you'd likely still need to license lots of patents to get anything shipping. But it fits the pattern of OSS software driving a lot of innovation and hardware design becoming more like software design.
Theoretically, if Intel wanted to make a comeback, RISC-V might actually be interesting for them. Right now they would have to compete with Apple, Nvidia/ARM and the likes of Qualcomm for non X-86 based CPUs. Those three are basically using ARM based designs and you need to license patents and designs to do anything there. Intel having to license chip designs + patents from their competitors is likely not compatible with their ambitions of wanting to dominate that market (like they dominated X86 for nearly half a century). They are clearly having issues keeping X86 relevant. So, RISC-V might provide them an alternative. The question is if they have enough will left to think laterally like that or whether they are just doomed to slowly become less relevant as they milk the X86 architecture.
Doesn’t this automatically translate into higher performance — by adding more cores or increasing clock rate — since TDP is the limiting factor for CPU speed?
I mean, if someone created a 1W CPU that performed as well as a 100W CPU, would you say “lovely, a lower power CPU” or “overclock/add cores until it reaches 100W and give me that”?
https://debugger.medium.com/why-is-apples-m1-chip-so-fast-32...
It seems unwise to draw far-reaching conclusions about RISC-V or even ARM64's intrinsic merits versus Apple's CPU designers when there are so many variables. The frontend decoder hasn't been a frequent bottleneck in Intel cores for a long time and they could scale it up more aggressively if they wanted.
Apple's engineers did a great job. That seems to be the conclusion we can draw based on currently available evidence.
There's some more discussion in here about the source of the M1's performance, and it largely seems to come down to the smaller process size that enabled Apple to scale up a lot of the structures in the uarch:
https://news.ycombinator.com/item?id=25394301
But perhaps Intel/AMD can surprise us with a dynamic allocator that runs in the reorder buffer. Or perhaps they can still push the limit one more time with more transistors. Another option would be to implement a fast-path for small instructions, so in effect they would be moving from CISC to RISC but only for parts of the code that need the extra performance.
Now Apple has proved that a cool-running chip that sips power can run a wide variety of intensive applications well.
People were quite dubious of Apple's chances at a competitive desktop chip and have just received a wake-up call from a relatively conservative M1 chip (3.2 GHz and 4 fast cores).
Trivia: around the time of the previous Mac ISA transition, Apple acquired PA Semi, who had a power-efficient and fast PPC chip. Then Apple decided to go with Intel anyway instead of betting on their new in-house chip. Discarding that highly acclaimed chip design, they put the newly acquired semi team to work on the A series of chips instead.
MacBooks are our de facto development laptop and all our services use skaffold for local development, Docker basically. If we consider the perhaps likely outcome that MacBooks will one day be ARM-only, that Docker will not offer cross-arch emulation, and that our development environment will be ARM only, it then becomes likely that we’ll migrate our UAT and PROD to ARM based instances.
If we go that route it’ll mean more money to the AWS Graviton programme and likely further development of ARM chips. I can’t see this affecting RISC-V but the M1 switch could very well benefit the wider ARM ecosystem.
You’re basically locking yourself into a single development ecosystem, and a highly limited deployment ecosystem.
It’s not clear what the benefit of either would be. I get that the MacBook gets great performance for the battery life, but the majority of work is gonna be done in desktop settings, so simply using more/equally powerful x86 chips is only gonna cost you a few dollars per developer per year in electricity costs.
And all that despite the fact that your development is on Docker which doesn’t even have a working solution for the workflow you’re considering at the moment.
x64 virtual machines, Docker, etc have to be supported on Apple's M chips for a long time to come. There's zero risk of this changing soon unless Apple wants to scuttle the non-iOS/non-Mac developer market for Mac.
M1 is a cool chip, but there's no reason for an average development company to rush into it unless targeting M1 MacOS specifically. Maybe the server world swings to ARM, but that will take decades to sort out, if it actually happens at all.
Anecdotally, since the M1 Macs shipped there have been a bunch of posts on HN from people who either stopped using Macs years ago or never bought a Mac previously, and who are now happy M1 Mac owners.
The M1 Mac mini retails at $699, but I've already seen it as low as $625. There's certainly nothing in that price range that's better.
And even before the M1 Macs shipped in November, Mac revenue hit an all-time high of $9 billion in the quarter that ended September 26, 2020 [1]. Apple often highlights that about 50% of Mac customers are new to the Mac, a trend that's likely to accelerate.
[1]: https://www.apple.com/newsroom/2020/10/apple-reports-fourth-...
Not to say that ARM is killing x64, it’s definitely not, but ARM is clearly being invested in and rolled out at a massive scale by 2 of the biggest tech companies in the world in both consumer devices and server side. To me that’s quite something.
Your interface, whether a programming language, a library API, or an ISA, has strong implications for what optimizations an implementer can do.
The ARM ISA has many advantages over x86:
1) Fixed sized instructions, which make it easier to add more instruction decoders. Discussed here: https://debugger.medium.com/why-is-apples-m1-chip-so-fast-32...
2) More registers. ARM64 has 32 general purpose registers and 32 registers for SIMD stuff. x86 has fewer registers which are also wasted on all sorts of legacy junk.
3) Laxer restrictions on memory write-back. It is easier to optimize out-of-order execution on ARM, as you don't need to make writes visible to memory in program order.
As for everybody else: ARM designs from ARM Ltd. are showing rapid performance increases and gradually closing the gap to x86. It really is inevitable, as there is NOTHING special about the x86 ISA that gives it higher performance. Nothing prevents other ARM makers from catching up: https://medium.com/swlh/is-it-game-over-for-the-x86-isa-and-...
Big/little is mainly good for power consumption, not so much for performance, though performance is still good.
There's a lot of microarchitectural goodness here beyond ARM, though. Apple got lots of little details right, and the fat connection to memory helps too. It doesn't hurt to be on the leading fab, either.
Think of a small coprocessor with local memory that's pulling commands out of a queue and managing an I/O controller. A couple of wins: lower power consumption, fewer context switches, and less cache pressure.
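Something like this minimal sketch, assuming a hypothetical memory-mapped single-producer/single-consumer command ring that the CPU fills and the coprocessor drains (the names and layout are made up for illustration, not any real Apple interface):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical command handed off to an I/O coprocessor. */
    typedef struct {
        uint32_t opcode;   /* e.g. READ_BLOCK, WRITE_BLOCK */
        uint64_t addr;     /* DMA target address */
        uint32_t len;
    } io_cmd_t;

    #define RING_SIZE 256  /* power of two so we can mask instead of mod */

    /* Single-producer (CPU) / single-consumer (coprocessor) ring in shared memory. */
    typedef struct {
        io_cmd_t slots[RING_SIZE];
        _Atomic uint32_t head;   /* written only by the producer */
        _Atomic uint32_t tail;   /* written only by the consumer */
    } cmd_ring_t;

    /* CPU side: enqueue a command and move on -- no interrupt, no context switch. */
    bool ring_push(cmd_ring_t *r, io_cmd_t cmd) {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE) return false;        /* ring full */
        r->slots[head & (RING_SIZE - 1)] = cmd;
        /* release: payload must be visible before the head bump */
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Coprocessor side: drain commands at its own pace. */
    bool ring_pop(cmd_ring_t *r, io_cmd_t *out) {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head) return false;                     /* ring empty */
        *out = r->slots[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

The CPU just posts work and goes back to whatever it was doing; the coprocessor keeps its command state in its own local memory, so the big cores keep their caches and their sleep states.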
Also I wouldn't be surprised if one actually could not build anything like M1 (at that power usage) w/ x86.... Intel certainly hasn't been able to.
Originally it was because the US government requires a second source for any components and so Intel had to license it to somebody to supply the US government.
Then later AMD's 64 bit instructions became the standard, so Intel needed the license for the 64 bit extensions and AMD needed the x86 base and so they just decided to cross-license and call it good.
There's also the https://opencores.org/projects/ao486 - the relevant patents on a 486-era design would have expired
It was the economies of scale and the standardization on x86_64 that made the PC the performance king in the first decade of the 2000s. Intel (and, of course, AMD) x86 did not have the best ISA, but because of economies of scale they had the best fabs, which let them outperform anything else.
While Intel was dominating with raw performance in that decade, embedded chipsets slowly coalesced around the ARM ISA, a process which was accelerated by Apple choosing ARM for the iPhone (Nokia also used ARM in a lot of their phones).
Moore’s Law finally stopped working for Intel and they stopped being able to outfab everyone else in the mid-to-late 2010s; a 2012 x86 chip has about the same performance as, say, a 2017 x86 chip.
Intel saw the writing on the wall with people using non-Intel ISAs for phones, and tried to make an Atom chip which would work in a phone; it was a flop. Nobody wanted the x86 ISA unless they needed it in systems which ran legacy applications.
With the Raspberry Pi moving up from being suitable only for specialized embedded applications to having near-desktop-level performance, with Apple finally making an ARM chip which is competitive with (and in some cases superior to) Intel's desktop chips, and with legacy x86 Windows applications in many cases being replaced with web pages and smartphone applications, it looks like the industry as a whole is finally moving past x86 and its bloated instruction set.
This is a much needed breath of fresh air for the computer industry. I like the M1 because I like that we now have mainstream non-x86 desktop/laptop computers again.
I think RISC-V has a lot of potential, and I am interested in what comes of it in the 2020s, whether it blooms like ARM did, or goes the way of PA-RISC (HPPA), Alpha, or SPARC.
It wasn't fast/efficient enough. If it had been faster or had better power consumption, it would've been fine. There was a massive push to get Android Studio to automatically compile x86 binaries for you, etc.
But why would you put an Atom in your phone if it means it's slower and worse battery life? That's the reason it flopped. ISA change is a hindrance for switching, but it can be overcome. Even if Intel had sold Atom chips with ARM ISA and identical performance to the X86 variant it would have still flopped due to the poor performance and efficiency.
Similar predictions about lack of importance of legacy support made over the past 40 years have not borne out. Performant x86 emulation is an absolute must for a replacement ISA.
• Smartphones
• iPads and Android tablets
• Chromebooks
• Game consoles
• The Phoenix-like return of the Mac (About half of the shops I have seen in the last decade were mixed or Mac shops)
Point being, it’s a different world than it was 20 years ago, when one needed to run on Windows to be a viable software product.
Anyway, the M1 has excellent x86 emulation; it runs at 50% of the speed of native ARM code, if not better.
And, yes, I felt Windows for ARM was not viable just 10 years ago: https://www.samiam.org/blog/20101224.html But a lot has changed since then.
The problem is that if you want mass adoption of your fancy new bespoke offering, it's quite a bit trickier no matter how good it is at doing its thing, and that problem does not go away with a different ISA; it probably gets harder, to be honest.
A vertically integrated system like a Mac allows for much more innovation.
In fact this is true for any vertically integrated system. If you look at the Amiga, NeXT, SGI, SPARC and many others, they were always far ahead of the PC in terms of technology.
> But for a coprocessor you don’t want or need this large instruction-set. You want an eco-system of tools that have been built around the idea of a minimal fixed base instruction-set with extensions.
Essentially: the modular nature of Risc-V and tooling/ecosystem built around it, with first class extension support.
ARM is closed, too complex and not friendly to extensions, while custom ISAs mandate a huge amount of extra work.
I only dabble in this field, but I see the ecosystem rapidly maturing. The open nature also leads to a general propensity to open source designs and tooling, lowering the barrier to entry and reducing cost.
* RISC-V (RV64GC) has simpler instructions than Arm64. It's possible it would have a slight frequency advantage given the same implementation resources, but Arm64 might need slightly fewer instructions. Notably, Arm64 has more addressing modes. Fusion and cracking make this mostly a wash, but implementing an RV64 core is a lot easier than an Arm64 one (I speculate).
* Arm64 has load pairs and store pairs; this is a significant advantage.
* RISC-V has no flags, and conditional branches directly compare their operands. This looks like a significant advantage in the code I have looked at, and it is easier to implement (no flag renaming etc.).
My opinions are my own. But the things one sees coming from them, they make things up to justify their continued existence.
A major nitpick: unified memory is being massively overhyped. There is a reason GPUs have their own memory bus: contention. CPUs and GPUs fighting over access to memory cause massive disruption to highly parallel computation. Even if Intel/nVidia resolved their fight over inter-CPU connectivity, or we're talking POWER and nVidia using NV-LINK, you still need extra memory ports to keep things fed. The more cores and the faster the GPU, the more memory bandwidth required.
I expect to see future Mac Pro M1 series machines with multiple CPU sockets -- at which point memory isn't unified any more, and all the regular CC-NUMA tricks will be used. But it won't be a big deal.
Not sure I totally agree with this.
Game consoles have a unified memory architecture and it's a beloved feature. It greatly simplifies things and allows the CPU to far more easily use results computed by the GPU, without complicated sync commands or frame delays.
Maybe unified architecture is less valuable for non-interactive programs. I’m not sure. This is a fair bit outside of my wheelhouse.
Memory access is definitely one of the biggest bottlenecks. So I fully agree with the general concern. And you may even be right that the unified architecture isn’t that interesting. But I’m not so sure it’s the problem you think it is.
The PS5 architecture also has different speeds -- but the slow access is to SSD! It has manual management of streaming resources from SSD to RAM, but also allows direct SSD reads -- but that is a trick, because the SSD controller has a huge RAM cache too.
So I'd say we're actually moving away from UMA in general. I think that memory aware scheduling is going to be the next win -- online learning to understand memory access patterns and scheduling compute and cache fill. Fancy cache algorithms used to take too much logic (and slow cache fill logic down), but for SSD->RAM you can do lots of prediction based on program state.
But plenty of people want >32GB of memory; I routinely use machines with 256GB. There is no way you can get enough RAM into the SoC. Large core counts are even harder, because the speed of light means you need a new on-socket switched architecture, memory on the other side is slow to access, etc. TANSTAAFL.
> Apple uses memory which serves both large chunks of data and serves it fast. In computer speak that is called low latency and high throughput. Thus the need to be connected to separate types of memory is removed.
That is just hand waving. It is possible to produce such memory, but it involves ultra wide busses, far wider than optimal for filling CPU caches, and preferably directly connected to the GPU rather than a multi-master bus or switch.
There is the possibility that Apple has built a very fancy memory interposer that leverages the short distances in the SOC to present the memory both wide (to GPU) and narrow (for filling a queue of L2 misses), so that cache fills pause while GPUs read/write. That would be a highly interesting piece of logic. But of course it can't scale outside of the SOC.