> Since we did not have time to change the cooling infrastructure, we had to remain in an air-cooled environment. The mechanical and thermal designs had to change to accommodate this, and that triggered a validation cycle to support a large-scale deployment.
> All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with a very small degree of freedom to change and meet a tight schedule.
Seems like the time constraints put on the team impacted the overall quality of the model.
The last tech team to have no budget and time constraints to pursue their vision? I don't know, the Xanadu team? Romero's original Daikatana team?
> So we decided to build both: two 24k clusters, one with RoCE and another with InfiniBand. Our intent was to build and learn from the operational experience.
I love how they built two completely insane clusters just to learn. That's badass.
It's not just to learn; a RoCE Ethernet cluster with Aristas is way cheaper to build and maintain than a fancy InfiniBand cluster with Mellanox/Nvidia networking, so proving that the former is good enough at scale will eventually save Meta a huge amount of money. InfiniBand cards are much more expensive than Ethernet because there are few vendors, which have a quasi-monopoly, and because far fewer of them are produced overall, so there is less economy of scale.
I wish that instead of just training another stupid LLM, Meta would use it to improve their search and help me find the content I'm actually interested in.
Their revenue depends on it being hard (but not impossible) for you to find the content you're actually interested in. Would be nice if it didn't, but in this reality, money on the Internet is made by wasting users' lives. That is what attention economy is about.
It's actually a mix. They need to disappoint the user for the right amount of time, then please them at the right moment and in the right dose. This maximizes the dopamine release and increases addictiveness.
When you find good content depends on when the algo judges you're already primed for a colorful dopamine intake.
This is like asking Disney to reduce wait times for their rides. In other words, it is against content aggregation platforms' interest to let you get what you want directly.
Which data sources? How much of Meta users' data (fb, instagram… etc)? How do they sanitize PII?
I can't comment on how things like faces get used, but in my experience, PII at Meta is inaccessible by default.
Unless you're impersonating a user on the platform (to access what PII they can see), you have to request special access for logs or database columns that contain so much as user IDs, otherwise the data simply won't show up when you query for it. This is baked into the infrastructure layer, so I doubt the GenAI teams are using something else.
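To make the "columns simply won't show up" behaviour concrete, here is a minimal hypothetical sketch of a default-deny query wrapper. The column tags, grant set, and function names are all invented for illustration; this is not Meta's actual infrastructure.

```python
# Hypothetical illustration of default-deny column access: columns tagged as
# PII are silently dropped from query results unless the caller holds an
# approved access grant. Not Meta's real system; all names are invented.
from typing import Dict, List, Set

PII_COLUMNS: Set[str] = {"user_id", "email", "ip_address"}  # illustrative tags

def run_query(rows: List[Dict[str, object]],
              requested_columns: List[str],
              approved_grants: Set[str]) -> List[Dict[str, object]]:
    """Return only the columns the caller is allowed to see."""
    visible = [
        col for col in requested_columns
        if col not in PII_COLUMNS or col in approved_grants
    ]
    return [{col: row.get(col) for col in visible} for row in rows]

logs = [{"user_id": 123, "country": "DE", "latency_ms": 87}]

# Without a grant, the PII column is simply absent from the result.
print(run_query(logs, ["user_id", "country", "latency_ms"], approved_grants=set()))
# [{'country': 'DE', 'latency_ms': 87}]

# With an approved grant for user_id, the column appears.
print(run_query(logs, ["user_id", "country", "latency_ms"], approved_grants={"user_id"}))
# [{'user_id': 123, 'country': 'DE', 'latency_ms': 87}]
```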
In the paper covering the original Llama they explicitly list their data sources in table 1 - including saying that they pretrained on the somewhat controversial books3 dataset.
The paper for Llama 2 also explicitly says they don't take data from Meta's products and services; and that they filter out data from sites known to contain a lot of PII. Although it is more coy about precisely what data sources they used, like many such papers are.
By literally opting everyone into their training data set and making it very cumbersome to opt out: https://threadreaderapp.com/thread/1794863603964891567.html
Yikes, the little InfiniBand+A100 cluster I installed for my previous company seemed useful at the time (12 GPUs), and that was at a cost of around $300k. With LLMs it feels like game over for non-cloud applications if you are not a mega-corp.
Well, yes, but not all models need to be "super large." Smaller models, specialized in specific tasks, working together and then reporting to a slightly larger model, is the way to go.
Think of everything being connected to a "Home Computer" in those "Future House of 2020" videos that were out there in the '70s or whatnot.
Another example (very rough) would be something like: "Weather data gets to a small model via an API, the model looks at it, updates the home dashboard, and also checks whether there are any alerts; if so, it adds x or y to the home dashboard as it thinks best."
We can probably achieve the latter example today, without any significant 'coding' on anyone's part except the API owner's.
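As a rough illustration of that latter example, here is a minimal sketch assuming a small local model behind an OpenAI-compatible HTTP endpoint (e.g. a llama.cpp or Ollama-style server); the weather API URL, model name, and dashboard file are placeholders, not a real integration.

```python
# Hedged sketch: a small local model reads weather data from an API and decides
# what to put on a home dashboard. Endpoints, model name, and dashboard path
# are all placeholders.
import json
import urllib.request

WEATHER_API = "https://example.com/api/weather?city=Berlin"   # placeholder
MODEL_API = "http://localhost:8080/v1/chat/completions"        # local OpenAI-compatible server (assumption)
DASHBOARD_FILE = "home_dashboard.json"                          # placeholder

def fetch_json(url, payload=None):
    """POST payload as JSON if given, otherwise GET; return parsed JSON."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

weather = fetch_json(WEATHER_API)

# Ask the small model to summarise and flag anything alert-worthy.
reply = fetch_json(MODEL_API, {
    "model": "small-local-model",
    "messages": [{
        "role": "user",
        "content": "Summarise this weather for a home dashboard and list any "
                   "alerts as JSON with keys 'summary' and 'alerts': "
                   + json.dumps(weather),
    }],
})
decision = reply["choices"][0]["message"]["content"]

# Write whatever the model decided onto the dashboard.
with open(DASHBOARD_FILE, "w") as f:
    json.dump({"weather_panel": decision}, f, indent=2)
```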
> Well, yes, but not all models need to be "super large." Smaller models, specialized in specific tasks, working together and then reporting to a slightly larger model, is the way to go.
I want to believe, but I have yet to see this kind of setup get anywhere near GPT-4 level.
The weather example seems quite contrived. Why not just display the alerts for your area? Why is a complex system of smaller models reporting up to a slightly larger model necessary?
Posts like this underscore why the smart money is betting on Google as the long-term AI winner. Meta, Microsoft, OpenAI, etc. are trying to address problems with consumer video cards and spending billions to try to outbid each other to win Nvidia's favor, while Google is on its 6th generation of custom silicon.
Literally the only thing that can stop Google now is the fact they keep bringing Microsoft and Oracle flunkies into leadership positions.
I think it's likely that Nvidia's GPUs, many of which are $50,000+ for a single unit, far surpass Google's custom silicon; otherwise, why wouldn't Google be selling shovels like Nvidia?
If Google had a better chip, or even a chip that was close, they would sell it to anyone and everyone.
From a quick search I can see Google's custom chips are 15x to 30x slower at training AI compared to Nvidia's current latest-gen, AI-specific GPUs.
Nvidia has decades of experience selling hardware to people, with all the pains that entails: support, sales channels, customer acquisition, software. It's something you don't just do overnight, and it does cost money. Google's TPUs get some of their cost efficiency from not supporting COTS use cases and not carrying the overhead of selling to people, and the total wall-clock time has to also include the total operational costs, which dominate at their size (e.g. if it's 30x slower but 1/50th the TCO then it's a win; I don't know how TPUv5 stacks up against the B200). It's not as simple as "just put it on a shelf and sell it and make a gajillion dollars like Nvidia."
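A back-of-the-envelope sketch of that "slower but cheaper can still win" point, using entirely made-up numbers (the 30x and 1/50th figures are the hypothetical from the comment above, not real TPU/GPU pricing):

```python
# Back-of-the-envelope: a chip that is 30x slower per unit but 1/50th the total
# cost of ownership still comes out cheaper per training run. Numbers are
# illustrative only.
fast_tco_per_chip_hour = 5.00                         # made-up $/chip-hour for the "fast" chip
slow_tco_per_chip_hour = fast_tco_per_chip_hour / 50  # 1/50th the TCO
slowdown = 30                                         # slow chip needs 30x more chip-hours

fast_chip_hours = 1_000_000                           # made-up size of one training run
slow_chip_hours = fast_chip_hours * slowdown

fast_cost = fast_chip_hours * fast_tco_per_chip_hour
slow_cost = slow_chip_hours * slow_tco_per_chip_hour

print(f"fast chip: ${fast_cost:,.0f}")  # $5,000,000
print(f"slow chip: ${slow_cost:,.0f}")  # $3,000,000 -> still cheaper per run
```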
TPU v5p is ~2 times slower than H100 at larg(ish)-scale training (order of 10k chips) [1]. And they already have v6 [2]. I think it's safe to say that they are fairly close to Nvidia in terms of performance.
[1] https://mlcommons.org/benchmarks/training/
[2] https://cloud.google.com/blog/products/compute/introducing-t...
We have almost 400 H100s sitting idle. I wonder how many other companies are buying millions of dollars' worth of these chips hoping to use them, only to leave them sitting unutilized?
> If Google had a better chip, or even a chip that was close, they would sell it to anyone and everyone.
While I do not actually think Google's chips are better or close to being better, I don't think this actually holds?
If the upside of <better chip> is effectively unbounded, it would outweigh the short term benefit of selling them to others, I would think. At least for a company like Google.
They do sell them - but through their struggling cloud business. Either way, Nvidia's margin is Google's opportunity to lower costs.
> I can see Google's custom chips are 15x to 30x slower to train AI
TPUs are designed for inference not training - they're betting that they can serve models to the world at a lower cost structure than their competition. The compute required for inference to serve their billions of customers is far greater than training costs for models - even LLMs. They've been running model inference as a part of production traffic for years.
That’s not necessarily true. Many companies make chips they won’t sell to support lucrative, proprietary offerings. Mainframe processors are the classic example.
In AI, Google (TPU) and Intel (Gaudi) each have chips they push in cloud offerings. The cloud offerings have cross-selling opportunities. That by itself would be a reason to keep it internal at their scale. It might also be easier to support one, or a small set, of deployments that are internal vs the variety that external customers would use.
https://www.theverge.com/2023/11/15/23960345/microsoft-cpu-g...
Intel, AMD and others also have chips for training that perform close to, or sometimes better than, Nvidia's. These are already in the market. Two problems: the CUDA moat, and "no one gets fired for buying green".
Can't agree. This is like saying $popularApp will fail because they buy expensive hosting at AWS.
Rubbish. They will fail because the product didn't fit the market; if they're successful, they'll have money to buy servers and colo, then drive down cost. If they succeed, it will be in large part because they spent their capital, and more importantly their time, on code and engineers rather than servers.
Right now companies are searching for a use of AI that will add hundreds of billions to their market cap. Once they find that, they can make TPUs; right now only one thing matters: getting there first.
> This is like saying $popularApp will fail because they buy expensive hosting at AWS.
For any given mobile app startup, AWS is effectively infinite. The more money you throw at it, the more doodads you get back. Nvidia's supply chain is not infinite and is the bottleneck for all the non-Google players to fight over.
Custom silicon is fantastic when things have stabilised and you know exactly what you want. But things are still evolving fast in real time, and in that environment, whoever can move fastest, be ultra flexible, and deploy the latest architecture as soon as possible is the winner. I think in a nutshell that is the story of Nvidia's success here: they created a GPGPU platform with just the right level of abstraction to capture the market for AI research.
^ How to pack as many mistakes as possible into one single comment.
1. Google's stock didn't significantly outperform Meta, Microsoft, etc., in the past two years.
2. Meta and Microsoft are trying to make their own chips as well.
3. They're not using "consumer video cards" to train AI. I don't even know if you can call these beasts video cards any more; an H100 doesn't have an HDMI port.
That's how all huge tech companies become dinosaurs, though. Upper management that is already stupidly wealthy (and therefore unmotivated) has the funding and patience to hire geniuses to build incredible machines, and then constantly ties their shoelaces together while asking them to sprint. Examples include Microsoft and Oracle as you said, and before them IBM, AT&T, TIBCO, Marvell, Motorola; I could go on for a while…
https://www.yitay.net/blog/training-great-llms-entirely-from...
GPU vs TPU, and good software managing large clusters of them across all sorts of failures.
The funny bit from the above article is the incident where someone forgot about a training job at Google, and a month later had the model fully trained without an alert of any kind. "Outrageously good infra."
Yeah, OP's comment makes it seem like they are building racks of RTX 4090s, when this isn't remotely true. Tensor Core performance is far different on the data center class devices vs consumer ones.
I would call it "stupid" money. This isn't a commodity business. Value of the final product is orthogonal to amount invested in compute. If Google is 10% slower or its product is 10% worse, it can lose all the value. This is like valuing a software company higher because its devs are using cheap PC desktops instead of Mac.
Much the same way you can have all the best gear and still fail - Google’s primary strength seems to be the DeepMind group. I’m not affiliated with Google, but IMHO the reason they will slowly die is because their engineering culture has taken a backseat due to their broken hiring practices.
Bad hiring practices aren’t exclusive to them, but from all accounts it seems like their internal focus is on optimizing ad revenue over everything else. I could be wrong or misinformed, but it seems to me like they are playing the finite game in the AI space (DeepMind group aside) while FAIR are playing the infinite game.
*meanwhile MSFT are simply trying to buy their way to relevance (e.g. OpenAI investments, etc) and carve out future revenues (Recall) and Jobs-less Apple is building their trademark walled-garden (AppleIntelligence?). Although the use of unified memory in Apple silicon poses some interesting possibilities for enabling the use of sizable models on consumer hardware.
Overall it seems like “big-tech” is by-and-large uninspired and asleep at the wheel save specific teams like those led by Lecun, Hassabis, etc. not sure where that leaves OpenAI now that Karpathy is gone.
> because their engineering culture has taken a backseat due to their broken hiring practices.
What company do you think has better hiring practices, and subsequently a higher talent pool? Meta's is pretty similar to Google's (though with an emphasis on speed over creativity). Microsoft is certainly worse at hiring than the two aforementioned...
Whereas others are playing the LLM race to the bottom.
And we'll be writing case studies of how they squandered billions of R&D to help found other companies (kinda like Xerox PARC).
Almost every interesting paper after transformers has had its authors leave to commercialize their own companies.
The only thing that can stop Google is Google. Somehow every bet that isn't Search doesn't pan out. And inexplicably, they're working hard to kill Search now. As a shareholder, I hope they succeed. But I am more pessimistic about it than you.
What can stop Google is building the wrong thing, or being so scared to launch anything that they smother their own fledgling products before they are born or before they can mature. Their product and finance and accounting teams should be tossed.
Google is old like MetLife, relative to their respective industries. Both are carrying too much baggage and are top heavy. As a result, I personally don't think Google will be able to keep pace with OpenAI in the long run.
[0] https://www.reuters.com/technology/artificial-intelligence/h...
Apple, AWS, Google, Meta, Microsoft all have custom AI-centric silicon.
If needed, these other companies have the $$$ to buy the best chips money can buy from Nvidia. Better chips than Google could ever produce.
If anything, this is why IMO Google will fail.
Google is the biggest loser in all of this.
This was the same argument presented a decade ago for why Google was supposed to win the cloud, because their internal infra was miles ahead of Amazon's and Microsoft's.
Yet here we are. Will the consumer video cards get cheaper and better faster, or will Google's directors' infighting stop first?
You mean the thing that's already stopped them? If they had seriously invested into the TPU ecosystem in 2015, they would already have "won" AI.
> Meta, Microsoft, OpenAI, etc. are trying to address problems with consumer video cards
Yes, but you are buying access to tested, supported units that are proven to work, don't require custom software, and are almost plug and go. When it's time to upgrade, it's not that costly.
Designing, fabricating and deploying your own silicon is expensive; creating software support for it, more expense still. Then there is the opportunity cost of having to optimise the software stack yourself.
You're exchanging a large capex for a similar-sized capex plus a fuckton of opex as well.
They're using advanced cards meant for data centers and machine learning -- almost effectively "custom silicon".
After everything I've seen and the litigation coming out of Europe, I really can't see AI lasting long after they're obligated to prove rights for the data they're training on.
They can't get away with having scraped people's own work forever. You can't steal things from workers and then undercut them by selling that hard work for pennies, and not expect everything to collapse. I mean, I know that the folks in charge of this aren't really known for their foresight, especially when stock numbers and venture capital are the entire point, but... surely people can recognize that this can't go on unimpeded.
Eventually they're going to put vision LLMs in robotic bodies and they'll be able to learn just by listening and watching, just like humans, at which stage the idea that they're "stealing" just by viewing content will be seen as absurd.
Being first/early doesn't always mean long-term dominance; see TensorFlow.
At the consumer level, NPUs are becoming a useful accelerator, and here Google can choose to become the platform of choice.
But nobody comes close to Nvidia for training workflows as it stands. IMHO this is currently possible thanks to community efforts afforded by CUDA being the only realistic option.
Politics and leadership aside, it would be nice to sustain a market for highly efficient matrix multipliers, and also a s/w ecosystem that finally makes multiprocessing workloads easy to integrate for dummies like me.
Ecosystem support for GPU is very strong and so smaller teams might find TPUs to have a steeper learning curve. Sophisticated users can definitely get wins today on TPU if they can get capacity.
And it's not as if Nvidia is standing still, so how the strengths of each change over future generations isn't set either. I.e., TPUs are "simple" matrix multipliers also optimized for operation at scale, while GPUs are more general purpose and have strong ecosystem power.
Disclaimer - work on GKE on enabling AI workloads.
The winner is just Nvidia; I see this like the battle of gas vs electric cars. Nvidia is basically making the wheels: whichever company wins, you'd still need wheels.
The problem isn't just developing your own processor. Nvidia has a huge stack of pretty cutting-edge technology, including a huge stack from Mellanox, an enormous OSS toolchain around CUDA, etc., that people seeking to make comparable products have to overcome.
Frustratingly little information. For example, I'm exceedingly curious how they deal with scheduling jobs on such a huge array of machines. The article:
> Efficient scheduling helps ensure that our resources are used optimally. This involves sophisticated algorithms that can allocate resources based on the needs of different jobs and dynamic scheduling to adapt to changing workloads.
Wow thanks for that, captain obvious. So how do you do it?
I usually assume these companies are using some of the popular schedulers (e.g., Slurm, MOAB, SGE) that have existed in the HPC community for many years.
I have anecdotally also heard that some are using k8s, but I've not seen that myself. Slurm [1] is basically built for this stuff; that's definitely what I would use!
[1] https://slurm.schedmd.com/documentation.html
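For a sense of what using Slurm for this looks like, here is a minimal sketch of submitting a hypothetical multi-node, multi-GPU training job via sbatch. Nothing here is specific to Meta's setup; train.py and the resource sizes are placeholders, and a real cluster would add partitions, containers, and a proper launcher on top.

```python
# Hedged sketch: submit a hypothetical 16-node x 8-GPU training job to Slurm.
# The flags are standard sbatch/srun options; train.py and the sizes are
# placeholders.
import subprocess

cmd = [
    "sbatch",
    "--job-name=llm-pretrain",
    "--nodes=16",
    "--ntasks-per-node=8",   # one task per GPU
    "--gres=gpu:8",          # 8 GPUs per node
    "--time=7-00:00:00",     # 7-day wall-clock limit
    "--wrap", "srun python train.py --config config.yaml",
]

result = subprocess.run(cmd, check=True, capture_output=True, text=True)
print(result.stdout.strip())  # e.g. "Submitted batch job 123456"
```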
Slurm is definitely still dominant, but OpenAI has been using k8s for training for many years now¹, and there are various ways to run Slurm on top of Kubernetes, including the recent SUNK from CoreWeave².
¹: https://openai.com/index/scaling-kubernetes-to-2500-nodes/
²: https://www.coreweave.com/blog/sunk-slurm-on-kubernetes-impl...
At my company we use Slurm "directly" for static compute we rent or own (i.e. not in a public cloud), but we are considering Kubernetes because that's how we run the rest of the company, and we'd rather invest more effort into getting better at k8s than into becoming good Slurm admins.
RDMA and GPUDirect capable. Coordinates over MPI or (hi)redis.
Edit: I had a brain freeze or something... Gloo is not CPU-only, but for whatever reason I don't see it used outside of CPU comms.
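For context on those backends, here is a minimal PyTorch torch.distributed sketch: NCCL for GPU collectives (RDMA/GPUDirect capable when the fabric supports it), Gloo typically for CPU tensors or control traffic. This uses the standard env:// rendezvous and is not specific to Meta's setup.

```python
# Minimal torch.distributed sketch: choose NCCL for GPU collectives, Gloo for
# CPU tensors. Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the
# launcher (torchrun, Slurm, etc.).
import torch
import torch.distributed as dist

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, init_method="env://")

rank = dist.get_rank()
if backend == "nccl":
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    torch.cuda.set_device(device)
else:
    device = torch.device("cpu")

# Each rank contributes a tensor; all_reduce sums it across the whole job.
x = torch.ones(4, device=device) * rank
dist.all_reduce(x, op=dist.ReduceOp.SUM)

if rank == 0:
    print(f"world_size={dist.get_world_size()} backend={backend} sum={x.tolist()}")

dist.destroy_process_group()
```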