The way it typically works in an attention block is: smaller portions of the Q, K, and V linear layers are assigned to each node and processed independently. Attention, RoPE, norms, etc. are then run on that node-specific output. Finally, when the output linear layer is applied, an "all-reduce" is computed which combines the outputs of all the nodes.
EDIT: just realized it wasn't clear -- this means that each node ends up holding the portion of the KV cache specific to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA, where there are fewer KV heads than ranks, you end up having to do some replication, etc.).
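For the curious, here's roughly what that split looks like in code -- a minimal sketch assuming PyTorch with torch.distributed already initialized, illustrative class/variable names, and no KV caching or GQA handling:

```python
# Sketch of tensor-parallel attention: column-parallel Q/K/V shards per rank,
# local attention over that rank's heads, row-parallel output projection
# combined with an all-reduce. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.distributed as dist


class TensorParallelAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert num_heads % world_size == 0, "heads must divide evenly across ranks"
        self.local_heads = num_heads // world_size
        self.head_dim = d_model // num_heads
        local_dim = self.local_heads * self.head_dim

        # Column-parallel: each rank owns only a slice of the Q, K, V projections.
        self.q_proj = nn.Linear(d_model, local_dim, bias=False)
        self.k_proj = nn.Linear(d_model, local_dim, bias=False)
        self.v_proj = nn.Linear(d_model, local_dim, bias=False)
        # Row-parallel: each rank owns a slice of the output projection's input dim.
        self.o_proj = nn.Linear(local_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Each rank attends only over its own heads; the K/V computed here is
        # exactly the shard of the KV cache this rank would hold.
        q = self.q_proj(x).view(b, t, self.local_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.local_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.local_heads, self.head_dim).transpose(1, 2)
        attn = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, -1)

        # Partial output from this rank's heads; summing the partial outputs
        # across all ranks (all-reduce) reconstructs the full output projection.
        out = self.o_proj(attn)
        dist.all_reduce(out, op=dist.ReduceOp.SUM)
        return out
```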
They may be taking some Western models (Llama, gpt-oss, Gemma, Mistral, etc.) and doing post-training, which requires far fewer resources.
> Instead of one brittle giant, we orchestrate a Mixture of Experts…
“Mixture of experts” is a specific term of art that describes an architectural detail of a type of transformer model. It definitely does not mean using smaller specialized models for individual tasks: experts in an MoE model are routed to on a per-token basis, not on a per-task or per-generation basis.
I know it’s tempting to co-opt the term because it would fit nicely with what you’re trying to do, but it just adds confusion.
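To make the per-token point concrete, here's a toy sketch (plain PyTorch, made-up sizes, naive loops instead of the fused dispatch real implementations use) of what an MoE layer's router actually does:

```python
# Toy per-token MoE routing: the router picks experts independently for every
# token, so adjacent tokens in the same request can hit different experts.
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    def __init__(self, d_model: int = 64, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Routing scores are computed per token.
        scores = self.router(x).softmax(dim=-1)
        weights, expert_ids = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e
                if mask.any():
                    # Each token goes only to the experts its router chose for
                    # it, weighted by the routing score.
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(10, 64)   # 10 tokens from a single sequence
layer = ToyMoELayer()
print(layer(tokens).shape)     # each of the 10 tokens got its own expert mix
```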
OpenAI’s moat will only come from the products they build on top. Theoretically their products will be better because they’ll be more vertically integrated with the underlying models. It’s not unlike Apple’s playbook with regard to hardware and software integration.
At the same time, in recent years I've found that ssh running on top of WireGuard / Tailscale is way more usable than it was back in the 2013 days. Those tools address the roaming-IP issue directly at the network layer.
So while there are still issues with ssh / TCP if you're on a really crappy network (heavy packet loss, satellite link, etc.), those have been less common in my experience than IP changes.
The “killer use case” for Mosh feels a lot less killer now.
I’m not sure it would be of much utility, because this would presumably be for tensor-parallel workloads. In that case you want the ranks in your cluster to be uniform, or else everything is forced to run at the speed of the slowest rank.
You could run pipeline parallel instead, but I’m not sure it’d be that much better than what we already have.
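A back-of-the-envelope illustration of the slowest-rank effect, with made-up per-shard timings:

```python
# A tensor-parallel layer ends with an all-reduce, so no rank can move on
# until the slowest one arrives: step time is the max over ranks, not the mean.
per_rank_ms = {"gpu_0": 10.0, "gpu_1": 10.0, "gpu_2": 10.0, "old_gpu": 25.0}

step_ms = max(per_rank_ms.values())  # the all-reduce waits for the straggler
for rank, t in per_rank_ms.items():
    idle = step_ms - t
    print(f"{rank}: compute {t:.0f} ms, idle {idle:.0f} ms ({idle / step_ms:.0%} of the step)")
# The three fast GPUs sit idle 60% of every step: the cluster runs at the old
# GPU's pace even though most of the FLOPs come from the fast ones.
```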