The way it typically works in an attention block is: smaller portions of the Q, K, and V linear layers are assigned to each node and processed independently. Attention, RoPE, norms, etc. are then run on that node-specific output. Finally, when the output linear layer is applied, an "all-reduce" is computed which combines the outputs of all the nodes.
EDIT: just realized it wasn't clear -- this means that each node ends up holding the portion of the KV cache specific to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA, where there are fewer KV heads than ranks, you end up having to do some replication, etc.).
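For the curious, here's roughly what that split looks like in code -- a minimal sketch assuming PyTorch with torch.distributed already initialized, illustrative class/variable names, and no KV caching or GQA handling:

```python
# Sketch of tensor-parallel attention: column-parallel Q/K/V shards per rank,
# local attention over that rank's heads, row-parallel output projection
# combined with an all-reduce. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.distributed as dist


class TensorParallelAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert num_heads % world_size == 0, "heads must divide evenly across ranks"
        self.local_heads = num_heads // world_size
        self.head_dim = d_model // num_heads
        local_dim = self.local_heads * self.head_dim

        # Column-parallel: each rank owns only a slice of the Q, K, V projections.
        self.q_proj = nn.Linear(d_model, local_dim, bias=False)
        self.k_proj = nn.Linear(d_model, local_dim, bias=False)
        self.v_proj = nn.Linear(d_model, local_dim, bias=False)
        # Row-parallel: each rank owns a slice of the output projection's input dim.
        self.o_proj = nn.Linear(local_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Each rank attends only over its own heads; the K/V computed here is
        # exactly the shard of the KV cache this rank would hold.
        q = self.q_proj(x).view(b, t, self.local_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.local_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.local_heads, self.head_dim).transpose(1, 2)
        attn = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, -1)

        # Partial output from this rank's heads; summing the partial outputs
        # across all ranks (all-reduce) reconstructs the full output projection.
        out = self.o_proj(attn)
        dist.all_reduce(out, op=dist.ReduceOp.SUM)
        return out
```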
They may be taking some Western models (Llama, gpt-oss, Gemma, Mistral, etc.) and doing post-training, which requires far fewer resources.
> Instead of one brittle giant, we orchestrate a Mixture of Experts…
“Mixture of experts” is a specific term of art that describes an architectural detail of a type of transformer model. It definitely does not mean using smaller specialized models for individual tasks: experts in an MoE model are routed to on a per-token basis, not on a per-task or per-generation basis.
I know it’s tempting to co-opt the term because it would fit nicely with what you’re trying to do, but it just adds confusion.
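To make the per-token point concrete, here's a toy sketch (plain PyTorch, made-up sizes, naive loops instead of the fused dispatch real implementations use) of what an MoE layer's router actually does:

```python
# Toy per-token MoE routing: the router picks experts independently for every
# token, so adjacent tokens in the same request can hit different experts.
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    def __init__(self, d_model: int = 64, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Routing scores are computed per token.
        scores = self.router(x).softmax(dim=-1)
        weights, expert_ids = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e
                if mask.any():
                    # Each token goes only to the experts its router chose for
                    # it, weighted by the routing score.
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(10, 64)   # 10 tokens from a single sequence
layer = ToyMoELayer()
print(layer(tokens).shape)     # each of the 10 tokens got its own expert mix
```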
OpenAI’s moat will only come from the products they build on top. Theoretically their products will be better because they’ll be more vertically integrated with the underlying models. It’s not unlike Apple’s playbook with regard to hardware and software integration.
At the same time, in recent years I've found that ssh running on top of WireGuard / Tailscale is way more usable than it was back in the 2013 days. Those tools address the roaming-IP issue directly at the network layer.
So while there are still issues with ssh / TCP if you're on a really crappy network (heavy packet loss, satellite link, etc.), those have been less common in my experience than IP changes.
The “killer use case” for Mosh feels a lot less killer now.
I’m not sure it would be of much utility, because this would presumably be for tensor-parallel workloads. In that case you want the ranks in your cluster to be uniform, or else everything is forced to run at the speed of the slowest rank.
You could run pipeline parallel instead, but I’m not sure it’d be that much better than what we already have.
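A back-of-the-envelope illustration of the slowest-rank effect, with made-up per-shard timings:

```python
# A tensor-parallel layer ends with an all-reduce, so no rank can move on
# until the slowest one arrives: step time is the max over ranks, not the mean.
per_rank_ms = {"gpu_0": 10.0, "gpu_1": 10.0, "gpu_2": 10.0, "old_gpu": 25.0}

step_ms = max(per_rank_ms.values())  # the all-reduce waits for the straggler
for rank, t in per_rank_ms.items():
    idle = step_ms - t
    print(f"{rank}: compute {t:.0f} ms, idle {idle:.0f} ms ({idle / step_ms:.0%} of the step)")
# The three fast GPUs sit idle 60% of every step: the cluster runs at the old
# GPU's pace even though most of the FLOPs come from the fast ones.
```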