Everyone here seems to be making this solely about the processor, which it partly is, but the other story is that Rosetta 2 is looking really good. Obviously there’s plenty more to test beyond a GeekBench benchmark, but hitting nearly 80% of native performance shows both the benefits of ahead-of-time translation and the sophistication of Apple's implementation.
I’ll still be waiting to see what latency-sensitive performance is like (specifically audio plugins) but this is halfway to addressing my biggest concern about buying an M-series Mac while the software market is still finding its footing.
As some others have mentioned, Rosetta uses AOT translation, which contrasts with Microsoft's approach of emulation.
I think the difference in approach here comes down to motive: I get the impression that Microsoft wanted developers to write ARM apps and publish them to its Store, so it made emulated programs less attractive by virtue of their poorer performance.
Apple, on the other hand, is keen to get rid of Intel as quickly as it can, so it had to make the transition as seamless as possible.
Microsoft calls their approach emulation, but if you read the details you see they are also translating instructions and caching the result (AOT translating).
Their current implementation only supports 32-bit code; x64 translation is still underway. It is not known how well x64-translated code will perform relative to native code.
https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on...
I am similarly curious about Rosetta 2, but there seems to be very little marketing, let alone technical, information available. All I can figure is that it performs user-mode emulation, similar to what QEMU can do, and does not cover some of the newer instructions.
Off topic: my job has been in virtualisation for the last 12 years, so I am very familiar with the publicly available body of research on this topic. Ahead-of-time binary translation has been a niche area at best.
My understanding from WWDC is that it does AOT compilation at install time when possible. If an app marks a page executable at runtime (such as a JIT) it will interpret that.
https://developer.apple.com/documentation/apple_silicon/abou...
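To make that split concrete, here is a toy sketch (purely illustrative; none of these names or structures come from Apple, which hasn't published Rosetta's internals) of the policy described above: translate the code that is visible at install time, and fall back to interpreting pages that only appear at runtime.

```python
# Toy sketch of "translate ahead of time, handle runtime codegen separately".
# This is NOT how Rosetta is implemented; the class and method names are
# invented here to illustrate the policy discussed above.

class ToyBinaryTranslator:
    def __init__(self):
        self.cache = {}  # guest code-page address -> pre-translated host code

    def install_time_pass(self, pages):
        """AOT step: translate every executable page found in the binary."""
        for addr, guest_code in pages.items():
            self.cache[addr] = self.translate(guest_code)

    def execute(self, addr, guest_code):
        """Runtime step: reuse the cached translation if the page was known
        at install time; otherwise the page was produced at runtime (e.g. by
        a JIT), so interpret it instead."""
        if addr in self.cache:
            return self.run_host_code(self.cache[addr])
        return self.interpret(guest_code)

    # Stubs standing in for the hard parts of a real translator.
    def translate(self, guest_code):
        return ("host", guest_code)

    def run_host_code(self, host_code):
        return host_code

    def interpret(self, guest_code):
        return ("interpreted", guest_code)
```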
Newer instructions are still encumbered by patents.
Why would they market it? They will market the result--performance. Rosetta is just for geeks to appreciate, laypersons don't care about that stuff, all they want to know is whether the end result is faster or slower, as that's what affects them.
FWIW Chris Randall from Audio Damage posted a quick note saying performance of plugins under Rosetta was basically comparable with Intel: https://www.audiodamage.com/blogs/news/a-quick-note-about-ap...
I've read tweets from other plugin developers saying similar things, so preliminary feedback seems quite positive!
That doesn't match with what we know about Rosetta 2. Rosetta 2 can't run processes with mixed architectures, so ARM hosts can only run in-process ARM plug-ins, and x86 hosts can only run in-process x86 plug-ins. Apple's AUv3 plug-in architecture is out-of-process, so you can mix those, but there is no way you can mix ARM and x86 VSTs, for example, without specific work by hosts to provide an out-of-process shim translation layer.
Either he's talking about AUv3 specifically, or the hosts he tested already are doing out-of-process wrapping, or Rosetta 2 is actually magic (AFAICT this isn't a generally solvable problem at that layer), or he's confused.
Why aren't programs distributed by the Mac App Store pre-translated once and the translation downloaded to the Mac?
They'd still have to be run under Rosetta 2 (because programs can write code and branch to it), but a lot of computation could be done once rather than every time.
Not sure, but my educated guess is that it will be more efficient, as they improve Rosetta 2, to do this on demand instead of pre-translating every app on the store every time Rosetta 2 is updated.
Even if that means the users have to retranslate, that's still essentially "free" (to Apple) distributed compute.
Translation of most apps doesn't take a significant amount of time, and having them pre-translated would mean shipping code that was not signed by the app developer.
The fact that it's marginally faster is almost irrelevant.
What really matters is that it's using about 1/3 the power of a comparable Intel chip while doing it, and that Intel routinely has a 60%+ quarterly profit margin (probably much higher on PC chips) that Apple now gets to keep.
https://www.tsmc.com/uploadfile/ir/quarterly/2020/3PTwU/E/3Q...
It is a lot harder, technically and politically, for Apple to replace the TSMC part of the value chain. Apple would lose a lot of brand value if it directly employed cheap Asian labour.
They could in theory move production to the U.S.; however, without the ecosystem of talent, suppliers and partners that Taiwan or China has, it will be hard.
They already spent 10+ years improving their iPad chips, so switching made sense, especially given Intel's lack of a reasonable roadmap; the effort to integrate the TSMC part of the chain does not have the same value given the risks today.
Your point is theoretically valid, but considering the M1 is essentially a rebranded A14Z chip they would have had to make for the iPad anyway, the marginal R&D spent on the M1 should be epsilon.
It could be dramatically more expensive to make an M1 chip and Apple is making the trade off for performance/walled garden security.
Unlikely, but possible
Intel can distribute its overhead costs over a lot more CPUs than Apple, and Apple pays TSMC for making the chips, so, presumably, a large fraction of the marginal profits go to TSMC (probably not all of it, as I think Apple paid for building production capacity up front)
Purely from a CPU / SoC unit-shipped perspective, Apple currently ships more CPU units than Intel, both on the leading edge and in total volume.
I wonder how this is possible. I imagine it's partly due to manufacturing improvements (14nm vs 5nm), but how can Intel fall so far behind in only a couple of years? I know they were one of the earliest investors in ASML's EUV. Why couldn't they push for smaller nodes? Is it because they were milking their current nodes too far? I saw them marketing freaking TECs to cool down their 500W CPU gaming rigs on LTT and der8auer recently.
What made TSMC so successful? Is it primarily thanks to their business strategy? Or did Intel do something so wrong they tumbled down this far?
I know that Intel's 10nm is closer to TSMC's 7nm, but still, their competition is coming up with interesting and relevant technologies while Intel is like a junkyard of half-baked ideas. 5G modems? An Arduino competitor? Vaporware GPUs since Larrabee? Claims of dominance in NN accelerators with nothing solid? Nervana? Optane 600 series garbage SSDs? Stupid desktop computing form factor ideas? I can go on...
I don't hate Intel or root for any other company. I'm just trying to understand how incompetence like this happens in companies.
"What made TSMC so successful? Is it primarily thanks to their business strategy?"
Basically they are riding the new wave of cheap devices that outnumber x86 devices by 10x-20x.
Basically everything uses an ARM CPU these days, not just tablets and phones, but microwaves, TVs, projectors, refrigerators, ovens, 3D printers...
That makes those devices extremely cheap at volume and makes innovation happen faster than at a single company like Intel, which was not interested in those low-margin products.
Intel is far from incompetent; they just decided to take advantage of their monopoly position and reap the biggest profits and margins they could for the longest possible time, instead of cannibalizing themselves with lower margins.
And it was great for them. Their executives have done great. They have just ruled the semiconductor industry and wanted to enjoy it.
I suspect what Intel missed was that firms would be selling phones in very large volume for c$1000. That price includes enough margin to spend quite a lot on the SoC and associated research.
When the iPhone launched Steve Ballmer laughed [1] at the price and pushed a $99 competitor with MS software. The phone market was very, very different before the iPhone got real traction.
[1] https://www.youtube.com/watch?v=eywi0h_Y5_U
> Basically everything uses an ARM CPU these days, not just tablets and phones, but microwaves, TVs, projectors, refrigerators, ovens, 3D printers...
Most of these things are not on bleeding edge 5nm or 7nm process, though. Most microcontrollers are more like 90nm (e.g. STM32 up to F7 is 90nm; STM32H7 is 40nm... many smaller micros of the M0 variety are even 180nm...)
Basically, if you're not video processing, or a real computing device, or something power sensitive, 90nm is still a pretty sweet place to be-- ~$300k for a mask set, easy to have 5V tolerant I/O if that's something you need, high likelihood of common I/O and core voltage, &c.
It's funny because this seems like a textbook case of the innovator's dilemma (from Clayton Christensen) in a nutshell - what worked for Intel was just working so well, that cannibalizing it with something new didn't make sense - until it was too late.
> Intel is far from incompetent, they just decided to get advantage of their monopoly position to reap as big profits and margins as they could get for the longest possible time
That sounds exactly like an incompetent strategy: being lazy and ignoring possible competitors.
Intel's current fab troubles are simply inexcusable. There are some factors that can account for part of the problem, but at this point Intel is 5+ years late on delivering a usable, profitable successor to their 14nm process. And 14nm got off to a slow, rocky start too. Intel's fab business has been horribly mismanaged, and the CPU design business has been forced to believe fab roadmaps that don't have any credibility.
Perhaps it is because all the latest Intel fabs are in the U.S., in Hillsboro, Oregon. Other foundries like TSMC benefit from the ecosystem and cheaper costs in East Asia, i.e. they can afford to make more mistakes than Intel can if it is cheaper to do so.
Intel (or AMD for that matter, who are using TSMC) isn't falling behind; it's just that Geekbench is completely unrepresentative of real-world performance across architectures, as Linus points out here [0], because the test includes hardware-accelerated tasks that specifically benefit these modern ARM chips.
[0] https://www.realworldtech.com/forum/?threadid=136526&curpost...
>There’s been a lot of criticism about more common benchmark suites such as GeekBench, but frankly I've found these concerns or arguments to be quite unfounded. The only factual differences between workloads in SPEC and workloads in GB5 is that the latter has less outlier tests which are memory-heavy, meaning it’s more of a CPU benchmark whereas SPEC has more tendency towards CPU+DRAM.
https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
No idea but I suspect the "unified" on chip memory is very very quick.
Some friends and I were BSing about the "pro" level parts: if you can graft 2 or 4 M1s together, use off-chip RAM and then treat that onboard 16GB like cache? We're talking about some game-changing stuff.
The Xeon Phi had up to 16GB of fast memory in the same package as the main die. IIRC, it could be used as memory or as cache for external memory (which was much slower).
If Apple integrates two more memory chips, it'll be able to power a pretty solid desktop or laptop.
On the performance, Rosetta is most likely doing JIT so that most of the time it's running native ARM code. It did this with PPC binaries and DEC had it for Alpha.
It is not on-chip memory; the dies are separate, they're just in the same package. They seem to use standard LPDDR4 connectivity, so I don't think it's actually faster. The "unified" bit seems to matter more: having a single address space for both CPU & GPU, but this is pure speculation. I don't know if AMD or Intel APUs do this too.
Didn't Intel have a similar idea with Skylake? Those had a very fast, albeit smaller, eDRAM die glued to the processor. It was dropped in subsequent generations.
It might make sense to use a very fast SSD as the main memory and on-chip RAM as cache. Huge amounts of RAM only make sense if your disks are slow or your workload actually needs all that RAM, which is rare.
>> I know they were one of the earliest investors in ASML’s EUV.
They may have been one of the earliest investors in EUV (along with TSMC, by the way), but in terms of adoption and roadmap they have been way behind both TSMC and Samsung. I don't know the exact numbers of machines but my educated guess is that TSMC and Samsung together probably have close to 10x the EUV wafer capacity compared to Intel. And have had it for much longer as well.
The problem Intel created for itself is that they have always had a very stubborn over-confidence in their own knowledge of process technology, and have driven tool manufacturers like ASML to work within Intel's constraints, instead of working together to alleviate them. Their hubris has bitten them now that EUV has become economically viable compared to Intel's process technology, which relies heavily on triple and quadruple patterning, and very little of Intel's 'old' process technology knowledge carries over to EUV.
TSMC has also had a lot of teething pains with EUV but they have been very determined to make it work, and that's paying off now.
> I don't know the exact numbers of machines but my educated guess is that TSMC and Samsung together probably have close to 10x the EUV wafer capacity compared to Intel.
There are currently zero EUV wafers from Intel, which means the answer to your question would be closer to infinite.
Binary translation can work pretty well for user code, especially synthetic benchmarks.
> I mean how can Intel fall so far behind in only a couple years?
Arguably Intel has been falling behind since the delays in replacing Haswell (so, last six years or so). It just hasn’t been particularly visible, as the ARM vendors simply don’t compete in the same spaces, until now.
Though, in what might be an early sign in retrospect, x86 phone chips, after a lacklustre launch, vanished without a trace some years back.
There were some ex-Intel people commenting on a previous thread, and they told of a lot of internal politics/fighting between groups. It might not be the main reason, but it's part of it.
Geekbench benchmarks provide evidence of an upper bound, which is useful for determining what is technically possible. Real-world benchmarks instead show what can be realistically observed. The upper bound benchmarks are still useful, just in a different way.
For example: these benchmarks show that under excessive load (as benchmarks do), the M1 is capable of out-performing Intel chips at something like 1/3 the energy usage. No matter how you slice it, that is an impressive achievement and shows what M1 could potentially do for the most perfectly optimized software.
Realistically, of course, software will not attain this. But having the upper bound lets software writers know what to aim for and when it's not worth pushing for extra performance.
Would be a lot worse if performance was 1/10th. That’s what this benchmark shows. Essentially that it’s a lot better than people feared. The question for users like myself is, will things like IntelliJ still be decently fast? Still don’t know what kind of performance we will get from the JVM, but this shows some decent hope!
https://www.geekbench.com/doc/geekbench5-cpu-workloads.pdf
* Compress a 2.3MB file that fits in cache.
* Alter a 24MP JPEG (that's around 5MB filesize).
* Gaussian blur of 24MP JPEG
* Gumbo Parser of an HTML file then execute some stuff with duktape -- note: this parses HTML to a simple DOM -- nowhere close to actual rendering
* Text rendering of 1,700 words into a 12MP image
* Horizon detection of 9MP image (that's around 2MB)
* Image repaint of 1MP image (that's around 200KB)
* HDR image (4MP -- around 800KB) from 4 normal images
* Neural Net of tiny 224x224 images
* Navigate using a graph with 200k nodes and 450k edges.
* SQLite is between 0.5MB and 1.1MB depending on the compiler options. Maybe the dataset pushes it out of cache, but I wouldn't bet on them creating many millions of rows.
Not sure, but may not fit in cache:
* Google's PDFium render. Not sure about the library itself, but a 200dpi map doesn't sound like anything worth mentioning
* Camera test gives very little in specifics, but with several steps and a handful of libraries, this probably overflows cache a little.
* Ray Tracing 3.6K triangles and 768x768 output. I'd put it elsewhere, but they could be using a huge number of rays (though I seriously doubt it)
Undoubtedly doesn't fit in cache:
* Clang compiling 730 LOC (seriously?) Clang itself is pretty big and most likely overflows the cache.
Zero actual details:
* Speech Recognition
* N-Body Simulation
* Rigid Body Simulation
* Face Detection
* Structure from Motion
All in all, them saying these things are "real world" would be a huge overstatement at best. I don't see anything here to contradict Linus' assessment.
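For what it's worth, the parenthetical sizes above are consistent with a rule of thumb of roughly 0.2 bytes per pixel of compressed JPEG, and the whole "fits in cache" argument is just a back-of-envelope comparison like the one below (the cache figure is an assumed round number picked for illustration, not a published spec for any particular chip).

```python
# Back-of-envelope version of the size estimates above. The ~0.2 bytes per
# pixel of compressed JPEG matches the figures quoted (24MP ~ 5MB, 9MP ~ 2MB,
# 1MP ~ 200KB, 4MP ~ 800KB). CACHE_BYTES is an assumed round number, not a
# published cache size.

MB = 1024 * 1024
CACHE_BYTES = 12 * MB           # assumption; adjust for the chip in question
JPEG_BYTES_PER_PIXEL = 0.2      # rough compressed-JPEG rule of thumb

workloads = {
    "Compressed 2.3MB file":    2.3 * MB,
    "24MP JPEG":                24e6 * JPEG_BYTES_PER_PIXEL,
    "9MP horizon detection":    9e6 * JPEG_BYTES_PER_PIXEL,
    "1MP image repaint":        1e6 * JPEG_BYTES_PER_PIXEL,
    "HDR from four 4MP images": 4 * 4e6 * JPEG_BYTES_PER_PIXEL,
}

for name, size in workloads.items():
    verdict = "fits" if size < CACHE_BYTES else "spills"
    print(f"{name:<28} {size / MB:5.1f} MB  ({verdict} in a {CACHE_BYTES // MB} MB cache)")
```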
ARM already has instructions for improving performance of Javascript [1]. What if Apple added custom ISA extensions to their chips to support efficient x86 emulation? Current evidence seems to suggest that much of the translation is happening statically; a few additional instructions might greatly increase x86 emulation performance if most of the code can be translated instruction-by-instruction.
Also curious to see how Rosetta will work with x86 code whose instruction alignments can't be determined statically.
[1] https://stackoverflow.com/questions/50966676/why-do-arm-chip...
I think this is very unlikely because: 1. Apple is still almost required to follow the ARM ISA as a result of its licensing agreement; 2. they will regard x86 emulation as a short-term legacy issue (developers will recompile their apps for ARM soon), so why design and build in extra hardware you won't need long term; and 3. Rosetta 1 worked fine without this sort of help.
I don't agree that this is so farfetched. As noted in a sibling, Apple has released ISA extensions before. If the architecture is decoding to uops anyway the addition of x86 helper instructions may not affect the architecture much at all (or may even be implemented in microcode). Further, the comparison between Rosetta 1 and 2 may not necessarily apply, as the switch to intel was (arguably) a more substantial perf increase over POWER.
People need to stop acting like that instruction is some amazing magical "make JS fast". Literally the only thing it does is to change the rounding mode and sentinel for NaNs.
(I didn't know what you were talking about at first, had to work it out.)
That's it.
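For context, the conversion JavaScript needs here is its ToInt32 operation: truncate toward zero, wrap modulo 2^32, and send NaN and the infinities to 0. The sketch below just spells out that spec behaviour (written here for illustration; it is not a model of any specific hardware instruction). The point of giving it hardware help is that doing it with ordinary truncating conversions needs extra range checks and fix-ups.

```python
import math

# JavaScript's ToInt32 conversion, spelled out from the ECMAScript spec.
# Written for illustration only; not a model of any specific instruction.
def js_to_int32(d: float) -> int:
    if math.isnan(d) or math.isinf(d):
        return 0                                   # NaN and +/-Inf map to 0
    i = math.trunc(d)                              # round toward zero
    u = i % (1 << 32)                              # wrap modulo 2^32 ...
    return u - (1 << 32) if u >= (1 << 31) else u  # ... reinterpret as signed

# A few spot checks of the spec behaviour.
assert js_to_int32(3.9) == 3
assert js_to_int32(-3.9) == -3
assert js_to_int32(float("nan")) == 0
assert js_to_int32(2.0**32 + 5.0) == 5
```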
One thing I’m unclear on still is the windows VM situation.
Is it just that “right now” we can’t run a windows VM but with some work by Apple/Microsoft/Parallels it can happen or is there some fundamental blocker here?
I thought there was an ARM version of Windows already?
Parallels is planning an ARM release [1]. This won't run Intel-arch Windows; rather, they hint that it will run Windows for ARM64. They also mention Microsoft's recent announcement that Windows for ARM will soon have the ability to run x64 (Intel-arch) applications.
[1] https://www.parallels.com/blogs/parallels-desktop-apple-sili...
If you read that link, there is no hint at anything to do with Windows on there at all. It's quite a short press release, it makes no mention of what might be run inside parallels. I would expect linux to run, but Windows would be complete speculation.
Those are weasel words. They are not saying parallels will run windows ARM. Read carefully and see what they're actually saying. There is not even a hint that parallels will run windows ARM on M1 macs, but the sequence of sentences creates the illusion of one.
The link does mention an early access program.
I remember this was promised a long time ago. I'm surprised it's not the case.
Parallels, like vmware, is depending on microsoft here. If microsoft doesn't play ball there will be no windows on apple silicon macs. Currently there can be no officially supported way of running windows ARM on M1 macs, as the windows license does not allow it.
Parallels is trying emulation on M1 but who knows how well that will work. https://www.parallels.com/blogs/parallels-desktop-apple-sili...
Not "officially", but you can get hold of it as an individual and install it on e.g. a raspberry pi relatively easily already. Running it in a VM might be doable.
Actual Windows 10 ARM too, not the restrictive IoT version that's officially supported on the pi.
https://www.windowslatest.com/2020/01/27/windows-10-arm-on-r...
In any case, I bet the M3 will be able to run rings around the current MacPro.
I had the impression that virtualization was not possible on the current M1, as I sadly discovered that Docker doesn't and won't run on the current silicon. Is it still possible for Parallels to move forward without virtualization?
I'm an old dog trying to learn a new trick. Old dog says an Nvidia GPU with 8+GB of RAM and thousands of cores is powerful. New trick says an M1 with max 8 cores and ??? amount of RAM is more powerful??? My head hurts trying to figure out how. Where is the magic happening that makes 6K+ video run in real time in Resolve, when an Intel CPU with multiple GPUs needs a good wind at its back going downhill? Of the many hand-wavy demo details left out, what kind of video are they using? MP4 video, RAW video, ProRes, etc.? Is Resolve running in real time just playing back the video, but immediately choking when you apply a single node with a light grade? The M1 release videos are too much PR-speak for me.
• NVIDIA is on a dumpster fire cheap “Samsung 8nm” process that is quite possibly the worst <=14nm process especially when it comes to power and heat.
• Apple has the benefits of complete vertical integration, both on the hardware and software side.
• Neural engine is essentially Tensor Cores in NVIDIA’s GPU but occupies at least 4x the equivalent die area as tensor cores (no public details on performance yet).
• NVIDIA doesn’t want to make their consumer GPUs too powerful on tensor operations in order to not cannibalise their 1000% markup ML cards.
It’s almost a classic Intel: financial greed and financial engineering, combined with complacency from being ahead for so long.
Heck, AMD’s new top end card is tied with the 3090 - but $500 cheaper.
NVIDIA is reportedly scrambling to try and get back into TSMC who is going to make an example out of them.
Indeed this is the number for single-core performance. A previous benchmark [1] showed a MacBook Air with M1 getting 1687 on the single-core test and 7433 on multi-core. Compared to a 4-month-old kitted-out iMac (1251 & 9014) [2], this is about 35% faster on single-core and 17% slower on multi-core (8 cores on the M1, 10 cores / 20 threads on the i9).
Getting a better result via Rosetta 2 on the single-threaded benchmark is very impressive. I assumed that their benchmark included some vector instructions (which as far as I understand Rosetta 2 does not emulate) so this would mean that these higher numbers are for general-purpose instructions vs vector. That said, I can't find any reference to AVX or SSE in the "Geekbench 5 CPU Workloads" document, although the one for Geekbench 4 does mention them. It would be interesting to see numbers for Geekbench 4 on M1, if it runs via Rosetta at all. I would imagine that the tool is able to detect what is supported to run optimized code for each CPU being evaluated.
[1] https://www.macrumors.com/2020/11/11/m1-macbook-air-first-be...
[2] https://browser.geekbench.com/macs/imac-27-inch-retina-mid-2...
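For reference, the percentages quoted above fall straight out of the cited scores; a quick check using only those numbers:

```python
# Quick check of the percentages quoted above, using only the cited scores.
m1_single, m1_multi = 1687, 7433    # MacBook Air (M1), from [1]
i9_single, i9_multi = 1251, 9014    # mid-2020 27" iMac i9, from [2]

print(f"single-core: {(m1_single / i9_single - 1) * 100:+.1f}%")  # about +35%
print(f"multi-core:  {(m1_multi / i9_multi - 1) * 100:+.1f}%")    # about -17.5%
```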
If the M1's emulated x86 performance is this good, why didn't Apple showcase this sort of comparison at their launch announcement? This seems like something that would have assuaged concerns about the apparent tradeoff of switching silicon. That is, it seems there is very little downside, especially if the M1 is more power/heat efficient.
A comparison like this would have demonstrated this point more definitively than the many graphs Apple showed that were conspicuously lacking axis labels.
As another commenter said, they would prefer native apps. If emulation were that good, it might cause many developers to just keep putting off development of native apps.
The cynical side, however, is that Apple is having difficulty getting developers on board. Apple is notorious for not encouraging game developers on OS X, to the point that the common refrain in the Mac community has been to "buy a different system to game on". Many people don't have that luxury, or, like me, don't want two systems on their desk when one should be sufficient.
The M1 debut was notable for one reason too many overlooked: they had very few developers showing off their wares, and most of those they did have are not well known.