zackangelo commented on macOS 26.2 enables fast AI clusters with RDMA over Thunderbolt   developer.apple.com/docum... · Posted by u/guiand
wtallis · 12 days ago
That doesn't answer the question, which was how to get a high-speed interconnect between a Mac and a DGX Spark. The most likely solution would be a Thunderbolt PCIe enclosure and a 100Gb+ NIC, and passive DAC cables. The tricky part would be macOS drivers for said NIC.
zackangelo · 12 days ago
You’re right, I misunderstood.

I’m not sure it would be of much utility, because this would presumably be for tensor-parallel workloads. In that case you want the ranks in your cluster to be uniform, or else everything will be forced to run at the speed of the slowest rank.

You could run pipeline parallel, but I'm not sure it’d be that much better than what we already have.

zackangelo commented on macOS 26.2 enables fast AI clusters with RDMA over Thunderbolt   developer.apple.com/docum... · Posted by u/guiand
storus · 13 days ago
Is there any way to connect DGX Sparks to this via USB4? Right now only 10GbE can be used, despite both the Spark and the Mac Studio having vastly faster options.
zackangelo · 12 days ago
Sparks are built for this and actually have ConnectX-7 NICs built in! You just need to get the SFPs for them. This means you can natively cluster them at 200Gbps.
zackangelo commented on macOS 26.2 enables fast AI clusters with RDMA over Thunderbolt   developer.apple.com/docum... · Posted by u/guiand
liuliu · 13 days ago
But that's only for prefilling right? Or is it beneficial for decoding too (I guess you can do KV lookup on shards, not sure how much speed-up that will be though).
zackangelo · 12 days ago
No, you use tensor parallelism in both cases.

The way it typically works in an attention block: smaller portions of the Q, K, and V linear layers are assigned to each node and processed independently. Attention, rope, norm, etc. are run on the node-specific output of that. Then, when the output linear layer is applied, an "all reduce" is computed which combines the outputs of all the nodes.

EDIT: just realized it wasn't clear -- this means that each node ends up holding the portion of the KV cache corresponding to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA, where there are fewer KV heads than ranks, you end up having to do some replication, etc.).
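
Rough sketch of what that looks like on a single rank (names and shapes are made up; it assumes a torch.distributed process group is already set up with one rank per node, and it omits rope, the causal mask, and KV caching):

    # rough tensor-parallel attention sketch -- illustrative only
    import torch
    import torch.distributed as dist

    def tp_attention(x, wq, wk, wv, wo, n_local_heads, head_dim):
        # each rank holds only its shard of the Q/K/V projections, so these
        # matmuls produce activations for this rank's heads only
        seq = x.shape[0]
        q = (x @ wq).view(seq, n_local_heads, head_dim).transpose(0, 1)
        k = (x @ wk).view(seq, n_local_heads, head_dim).transpose(0, 1)
        v = (x @ wv).view(seq, n_local_heads, head_dim).transpose(0, 1)

        # attention runs independently per rank on its local heads; the KV cache
        # entries for these heads would live on this rank
        scores = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
        out = (scores @ v).transpose(0, 1).reshape(seq, n_local_heads * head_dim)

        # each rank applies its shard of the output projection, then an all-reduce
        # sums the partial results into the full layer output
        partial = out @ wo
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial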

zackangelo commented on Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model   moonshotai.github.io/Kimi... · Posted by u/nekofneko
riku_iki · 2 months ago
> How do the Chinese train these models if they don't have access to the GPUs to train them?

they may be taking some western models (llama, gpt-oss, gemma, mistral, etc.) and doing post-training, which requires far fewer resources.

zackangelo · 2 months ago
What 1T-parameter base model have you seen from any of those labs?

zackangelo commented on NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference   lmsys.org/blog/2025-10-13... · Posted by u/yvbbrjdr
moondev · 2 months ago
Just don't try to run NCCL
zackangelo · 2 months ago
Wouldn't you be able to test NCCL if you had two of these?
zackangelo commented on Launch HN: LlamaFarm (YC W22) – Open-source framework for distributed AI   github.com/llama-farm/lla... · Posted by u/mhamann
zackangelo · 3 months ago
Just a bit of feedback:

> Instead of one brittle giant, we orchestrate a Mixture of Experts…

“Mixture of experts” is a specific term of art that describes an architectural detail of a type of transformer model. It definitely does not mean using smaller specialized models for individual tasks: experts in an MoE model are actually routed to on a per-token basis, not on a per-task or per-generation basis.

I know it’s tempting to co-opt this term because it would fit nicely with what you’re trying to do, but it just adds confusion.
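
For contrast, here's roughly what per-token routing inside an MoE layer looks like (a toy sketch with made-up names, looped over tokens for clarity rather than batched): each token independently picks its own top-k experts within a single layer, which is a very different thing from routing whole tasks to separate models.

    # toy per-token MoE routing sketch -- names and shapes are illustrative
    import torch

    def moe_layer(x, router_w, experts, top_k=2):
        # x: [tokens, d_model]; router_w: [d_model, n_experts];
        # experts: list of callables mapping [d_model] -> [d_model]
        probs = torch.softmax(x @ router_w, dim=-1)      # [tokens, n_experts]
        weights, idx = torch.topk(probs, top_k, dim=-1)  # each token picks its own experts
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):          # routing happens per token, not per task
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * experts[int(e)](x[t])
        return out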

zackangelo commented on Apps SDK   developers.openai.com/app... · Posted by u/alvis
typpilol · 3 months ago
How's having the best model not a moat?
zackangelo · 3 months ago
Because it depends on how much better “best” is. If it’s only incrementally better than open source models that have other advantages, why would you bother?

OpenAI’s moat will only come from the products they build on top. Theoretically their products will be better because they’ll be more vertically integrated with the underlying models. It’s not unlike Apple’s playbook with regard to hardware and software integration.

zackangelo commented on From multi-head to latent attention: The evolution of attention mechanisms   vinithavn.medium.com/from... · Posted by u/mgninad
mrtesthah · 4 months ago
Do we know if any of these techniques are actually used in the so-called "frontier" models?
zackangelo · 4 months ago
Not quite a frontier model but definitely built by a frontier lab: Grok 2 was recently open-sourced, and I believe it uses a fairly standard MHA architecture with MoE.
zackangelo commented on Mosh Mobile Shell   mosh.org... · Posted by u/rbinv
kenrose · 4 months ago
When mosh came out back in 2013, it solved a pretty real problem: ssh crapping out when you changed networks (like moving from in-office to home). It solves it at the app layer, uses UDP, and is designed to work in high-loss / high-latency environments. Very cool.

At the same time, in recent years I've found that ssh running on top of WireGuard / Tailscale is way more usable than it was in 2013. Those tools address the roaming-IP issue directly at the network layer.

So while there are still issues with ssh / TCP if you're on a really crappy network (heavy packet loss, satellite link, etc.), those have been less common in my experience compared to IP changes.

The “killer use case” for Mosh feels a lot less killer now.

zackangelo · 4 months ago
I feel a bit silly for not noticing this before. Over the last year or so I've often wondered when ssh added protocol-level support for session resume: I'd open my laptop on a new network and everything would be ready to go. But of course, it's nothing to do with ssh; it's just that I started using Tailscale.

u/zackangelo

Karma: 333 · Cake day: July 23, 2012
About
building mixlayer, zack at mixlayer.com