macOS 26.2 enables fast AI clusters with RDMA over Thunderbolt

I follow the MLX team on Twitter and they sometimes post about using MLX on two or more joined together Macs to run models that need more than 512GB of RAM.

A couple of examples:

Kimi K2 Thinking (1 trillion parameters): https://x.com/awnihannun/status/1986601104130646266

DeepSeek R1 (671B): https://x.com/awnihannun/status/1881915166922863045 - that one came with setup instructions in a Gist: https://gist.github.com/awni/ec071fd27940698edd14a4191855bba...

awnihannun · 2 days ago

For a bit more context, those posts are using pipeline parallelism. For N machines put the first L/N layers on machine 1, next L/N layers on machine 2, etc. With pipeline parallelism you don't get a speedup over one machine - it just buys you the ability to use larger models than you can fit on a single machine.

The release in Tahoe 26.2 will enable us to do fast tensor parallelism in MLX. Each layer of the model is sharded across all machines. With this type of parallelism you can get close to N-times faster for N machines. The main challenge is latency since you have to do much more frequent communication.

dpe82 · a day ago

> The main challenge is latency since you have to do much more frequent communication.

Earlier this year I experimented with building a cluster to do tensor parallelism across large cache CPUs (AMD EPYC 7773X have 768mb of L3). My thought was to keep an entire model in SRAM and take advantage of the crazy memory bandwidth between CPU cores and their cache, and use Infiniband between nodes for the scatter/gather operations.

Turns out the sum of intra-core latency and PCIe latency absolutely dominate. The Infiniband fabric is damn fast once you get data to it, but getting it there quickly is a struggle. CXL would help but I didn't have the budget for newer hardware. Perhaps modern Apple hardware is better for this than x86 stuff.

aimanbenbaha · a day ago

Exo-Labs is an open source project that allows this too, pipeline parallelism I mean not the latter, and it's device agnostic meaning you can daisy-chain anything you have that has memory and the implementation will intelligently shard model layers across them, though its slow but scales linearly with concurrent requests.

Exo-Labs: https://github.com/exo-explore/exo

liuliu · 2 days ago

But that's only for prefilling right? Or is it beneficial for decoding too (I guess you can do KV lookup on shards, not sure how much speed-up that will be though).

Deleted Comment

anemll · a day ago

Tensor Parallel test with RDMA last week https://x.com/anemll/status/1996349871260107102

Note fast sync workaround

andy99 · 2 days ago

I’m hoping this isn’t as attractive as it sounds for non-hobbyists because the performance won’t scale well to parallel workloads or even context processing, where parallelism can be better used.

Hopefully this makes it really nice for people that want the experiment with LLMs and have a local model but means well funded companies won’t have any reason to grab them all vs GPUs.

codazoda · 2 days ago

I haven’t looked yet but I might be a candidate for something like this, maybe. I’m RAM constrained and, to a lesser extent, CPU constrained. It would be nice to offload some of that. That said, I don’t think I would buy a cluster of Macs for that. I’d probably buy a machine that can take a GPU.

willtemperley · a day ago

I think it’s going to be great for smaller shops that want on premise private cloud. I’m hoping this will be a win for in-memory analytics on macOS.

api · 2 days ago

No way buying a bunch of minis could be as efficient as much denser GPU racks. You have to consider all the logistics and power draw, and high end nVidia stuff and probably even AMD stuff is faster than M series GPUs.

What this does offer is a good alternative to GPUs for smaller scale use and research. At small scale it’s probably competitive.

Apple wants to dominate the pro and serious amateur niches. Feels like they’re realizing that local LLMs and AI research is part of that, is the kind of thing end users would want big machines to do.

bigyabai · 2 days ago

The lack of official Linux/BSD support is enough to make it DOA for any serious large-scale deployment. Until Apple figures out what they're doing on that front, you've got nothing to worry about.

CamperBob2 · a day ago

Almost the most impressive thing about that is the power consumption. ~50 watts for both of them? Am I reading it wrong?

wmf · a day ago

Yeah, two Mac Studios is going to be ~400 W.

This implies you'd run more than one Mac Studio in a cluster, and I have a few concerns regarding Mac clustering (as someone who's managed a number of tiny clusters, with various hardware):

1. The power button is in an awkward location, meaning rackmounting them (either 10" or 19" rack) is a bit cumbersome (at best)

2. Thunderbolt is great for peripherals, but as a semi-permanent interconnect, I have worries over the port's physical stability... wish they made a Mac with QSFP :)

3. Cabling will be important, as I've had tons of issues with TB4 and TB5 devices with anything but the most expensive Cable Matters and Apple cables I've tested (and even then...)

4. macOS remote management is not nearly as efficient as Linux, at least if you're using open source / built-in tooling

To that last point, I've been trying to figure out a way to, for example, upgrade to macOS 26.2 from 26.1 remotely, without a GUI, but it looks like you _have_ to use something like Screen Sharing or an IP KVM to log into the UI, to click the right buttons to initiate the upgrade.

Trying "sudo softwareupdate -i -a" will install minor updates, but not full OS upgrades, at least AFAICT.

wlesieutre · 2 days ago

For #2, OWC puts a screw hole above their dock's thunderbolt ports so that you can attach a stabilizer around the cord

https://www.owc.com/solutions/thunderbolt-dock

It's a poor imitation of old ports that had screws on the cables, but should help reduce inadvertent port stress.

The screw only works with limited devices (ie not the Mac Studio end of the cord) but it can also be adhesive mounted.

https://eshop.macsales.com/item/OWC/CLINGON1PK/

crote · 2 days ago

That screw hole is just the regular locking USB-C variant, is it not?

See for example:

https://www.startech.com/en-jp/cables/usb31cctlkv50cm

eurleif · 2 days ago

I have no experience with this, but for what it's worth, looks like there's a rack mounting enclosure available which mechanically extends the power switch: https://www.sonnetstore.com/products/rackmac-studio

geerlingguy · a day ago

I have something similar from MyElectronics, and it works, but it's a bit expensive, and still imprecise. At least the power button isn't in the back corner underneath!

rsync · a day ago

"... Thunderbolt is great for peripherals, but as a semi-permanent interconnect, I have worries over the port's physical stability ..."

Thunderbolt as a server interconnect displeases me aesthetically but my conclusion is the opposite of yours:

If the systems are locked into place as servers in a rack the movements and stresses on the cable are much lower than when it is used as a peripheral interconnect for a desktop or laptop, yes ?

827a · a day ago

This is a semi-solved problem e.g. https://www.sonnetstore.com/products/thunderlok-a

Apple’s chassis do not support it. But conceptually that’s not a Thunderbolt problem, it’s an Apple problem. You could probably drill into the Mac Studio chassis to create mount points.

cromniomancer · a day ago

VNC over SSH tunneling always worked well for me before I had Apple Remote Desktop available, though I don't recall if I ever initiated a connection attempt from anything other than macOS...

erase-install can be run non-interactively when the correct arguments are used. I've only ever used it with an MDM in play so YMMV:

https://github.com/grahampugh/erase-install

ThomasBb · a day ago

With MDM solutions you can not only get software update management, but even full LOM for models that support this. There are free and open source MDM out there.

827a · a day ago

They do still sell the Mac Pro in a rack mount configuration. But, it was never updated for M3 Ultra, and feels not long for this world.

badc0ffee · a day ago

> To that last point, I've been trying to figure out a way to, for example, upgrade to macOS 26.2 from 26.1 remotely,

I think you can do this if you install a MDM profile on the Macs and use some kind of management software like Jamf.

timc3 · 2 days ago

It’s been terrible for years/forever. Even Xserves didn’t really meet the needs of a professional data centre. And it’s got worse as a server OS because it’s not a core focus. Don’t understand why anyone tries to bother - apart from this MLX use case or as a ProRes render farm.

crote · 2 days ago

iOS build runner. Good luck developing cross-platform apps without a Mac!

colechristensen · 2 days ago

There are open source MDM projects, I'm not familiar but https://github.com/micromdm/nanohub might do the job for OS upgrades.

simonw · 2 days ago

btown · 2 days ago

It would be incredibly ironic if, with Apple's relatively stable supply chain relative to the chaos of the RAM market these days (projected to last for years), Apple compute became known as a cost-effective way to build medium-sized clusters for inference.

It’s gonna suck if all the good Macs get gobbled up by commercial users.

icedchai · 2 days ago

Outside of YouTube influencers, I doubt many home users are buying a 512G RAM Mac Studio.

mschuster91 · 2 days ago

it's not like regular people can afford this kind of Apple machine anyway.

teaearlgraycold · 2 days ago

It already is depending on your needs.

reilly3000 · 2 days ago

dang I wish I could share md tables.

Here’s a text edition: For $50k the inference hardware market forces a trade-off between capacity and throughput:

* Apple M3 Ultra Cluster ($50k): Maximizes capacity (3TB). It is the only option in this price class capable of running 3T+ parameter models (e.g., Kimi k2), albeit at low speeds (~15 t/s).

* NVIDIA RTX 6000 Workstation ($50k): Maximizes throughput (>80 t/s). It is superior for training and inference but is hard-capped at 384GB VRAM, restricting model size to <400B parameters.

To achieve both high capacity (3TB) and high throughput (>100 t/s) requires a ~$270,000 NVIDIA GH200 cluster and data center infrastructure. The Apple cluster provides 87% of that capacity for 18% of the cost.

mechagodzilla · 2 days ago

You can keep scaling down! I spent $2k on an old dual-socket xeon workstation with 768GB of RAM - I can run Deepseek-R1 at ~1-2 tokens/sec.

Weryj · a day ago

Just keep going! 2TB of swap disk for 0.0000001 t/sec

jacquesm · a day ago

I did the same, then put in 14 3090's. It's a little bit power hungry but fairly impressive performance wise. The hardest parts are power distribution and riser cards but I found good solutions for both.

ternus · a day ago

And if you get bored of that, you can flip the RAM for more than you spent on the whole system!

a012 · a day ago

And heat the whole house in parallel

rpastuszak · a day ago

Nice! What do you use it for?

Dead Comment

For $50K, you could buy 25 Framework desktop motherboards (128G VRAM each w/Strix Halo, so over 3TB total) Not sure how you'll cluster all of them but it might be fun to try. ;)

sspiff · a day ago

There is no way to achieve a high throughput low latency connection between 25 Strix Halo systems. After accounting for storage and network, there are barely any PCIe lanes left to link two of them together.

You might be able to use USB4 but unsure how the latency is for that.

3abiton · a day ago

You could use llama.cpp rpc mode over "network" via usb4/thunderbolt connection

What's the math on the $50k nvidia cluster? My understanding these things cost ~$8k and you can at least get 5 for $40k, that's around half a tb.

That being said, for inference mac still remain the best, and the M5 Ultra will even be a better value with its better PP.

reilly3000 · a day ago

GPUs: 4x NVIDIA RTX 6000 Blackwell (96GB VRAM each) • Cost: 4 × $9,000 = $36,000

• CPU: AMD Ryzen Threadripper PRO 7995WX (96-Core) • Cost: $10,000

• Motherboard: WRX90 Chipset (supports 7x PCIe Gen5 slots) • Cost: $1,200

• RAM: 512GB DDR5 ECC Registered • Cost: $2,000

• Chassis & Power: Supermicro or specialized Workstation case + 2x 1600W PSUs. • Cost: $1,500

• Total Cost: ~$50,700

It’s a bit maximalist, but if you had to spend $50k it’s going to be about as fast as you can make it.

FuckButtons · 2 days ago

Are you factoring in the above comment about as yet un-implemented parallel speed up in there? For on prem inference without any kind of asic this seems quite a bargain relatively speaking.

conradev · a day ago

Apple deploys LPDDR5X for the energy efficiency and cost (lower is better), whereas NVIDIA will always prefer GDDR and HBM for performance and cost (higher is better).

_zoltan_ · a day ago

the GH/GB compute has LPDDR5X - a single or dual GPU shares 480GB, depending if it's GH or GB, in addition to the HBM memory, with NVLink C2C - it's not bad!

dsrtslnd23 · a day ago

what about a GB300 workstation with 784GB unified mem?

rbanffy · 17 hours ago

That thing will be extremely expensive I guess. And neither CPU nor GPU have that much memory. It's also not a great workstation either - macOS is a lot more comfortable to use.

wmf · 18 hours ago

$95K

yieldcrv · a day ago

15 t/s way too slow for anything but chatting, call and response, and you don't need a 3T parameter model for that

Wake me up when the situation improves

Just wait for the M5-Ultra with a terabyte of RAM.

geerlingguy · 2 days ago

int32_64 · a day ago

Apple should setup their own giant cloud of M chips with tons of vram, make Metal as good as possible for AI purposes, then market the cloud as allowing self-hosted models for companies and individuals that care about privacy. They would clean up in all kinds of sectors whose data can't touch the big LLM companies.

That exists but it's only for iUsers running Apple models. https://security.apple.com/blog/private-cloud-compute/

make3 · a day ago

The advantages of having a single big memory per gpu are not as big in a data center where you can just shard things between machines and use the very fast interconnect, saturating the much faster compute cores of a non Apple GPU from Nvidia or AMD

timsneath · 2 days ago

Also see https://www.engadget.com/ai/you-can-turn-a-cluster-of-macs-i...

irusensei · a day ago

I am waiting for M5 studio but due to current price of hardware I'm not sure it will be at a level that I would call affordable. Currently I'm watching for news and if there is any announcement prices will go up I'll probably settle for an M4 Max.