dudus · a year ago
> Since we did not have time to change the cooling infrastructure, we had to remain in an air-cooled environment. The mechanical and thermal designs had to change to accommodate this, and that triggered a validation cycle to support a large-scale deployment.

> All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with a very small degree of freedom to change and meet a tight schedule.

Seems like the time constraints put on the team impacted the overall quality of the model.

vessenes · a year ago
This sounds to me like standard CYA / perhaps good-natured complaining from the tech team.

The last tech team to have no budget and time constraints to pursue their vision? I don’t know, the Xanadu team? Romero’s original Daikatana team?

Deleted Comment

nomel · a year ago
What's the cost of letting that hardware sit idle?
Muhtasham · a year ago
it’s a serious crime
xvector · a year ago
> So we decided to build both: two 24k clusters, one with RoCE and another with InfiniBand. Our intent was to build and learn from the operational experience.

I love how they built two completely insane clusters just to learn. That's badass.

logicchains · a year ago
It's not just to learn; an RoCE Ethernet cluster with Aristas is way cheaper to build and maintain than a fancy InfiniBand cluster with Mellanox/Nvidia networking, so proving that the former is good enough at scale will eventually save Meta a huge amount of money. InfiniBand cards are much more expensive than Ethernet cards because there are only a few vendors, which hold a quasi-monopoly, and because far fewer of them are produced overall, so there's less economy of scale.
riku_iki · a year ago
More like Mark gave them 100k GPUs, and they're not sure what exactly to do with them...
yosito · a year ago
I wish that instead of just training another stupid LLM, Meta would use it to improve their search and help me find the content I'm actually interested in.
TeMPOraL · a year ago
Their revenue depends on it being hard (but not impossible) for you to find the content you're actually interested in. Would be nice if it didn't, but in this reality, money on the Internet is made by wasting users' lives. That is what attention economy is about.
rmbyrro · a year ago
It's actually a mix. They need to disappoint the user for the right amount of time, then please them at the right moment and in the right dose. This maximizes dopamine release and increases addictiveness.

When you find good content depends on when the algo judges you're already primed for a dopamine hit.

rldjbpin · a year ago
this is like asking disney to reduce wait times for their rides. in other words, it is against content aggregation platforms' interest to let you get what you want directly.
Oras · a year ago
Would be nice to read how they collect/prepare data for training.

Which data sources? How much Meta user data (FB, Instagram, etc.)? How do they sanitize PII?

OsrsNeedsf2P · a year ago
> How do they sanitize PII?

I can't comment on how things like faces get used, but in my experience, PII at Meta is inaccessible by default.

Unless you're impersonating a user on the platform (to access what PII they can see), you have to request special access for logs or database columns that contain so much as user IDs, otherwise the data simply won't show up when you query for it. This is baked into the infrastructure layer, so I doubt the GenAI teams are using something else.
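For illustration, here's a hypothetical sketch of that "inaccessible by default" pattern, i.e. PII-tagged columns simply not appearing in query results unless the caller holds an explicit grant. The column list, grant format, and function names are all invented for this sketch; it is not Meta's actual infrastructure.

    # Hypothetical illustration of "PII inaccessible by default": columns tagged
    # as PII are silently dropped from query results unless the caller holds an
    # explicit grant. PII_COLUMNS, Caller, and the grant format are made up.
    from dataclasses import dataclass, field

    PII_COLUMNS = {"user_id", "email", "ip_address"}

    @dataclass
    class Caller:
        name: str
        grants: set = field(default_factory=set)  # e.g. {"pii:user_id"}

    def query(rows: list[dict], columns: list[str], caller: Caller) -> list[dict]:
        # Ungranted PII columns simply don't show up, rather than raising an error
        visible = [c for c in columns
                   if c not in PII_COLUMNS or f"pii:{c}" in caller.grants]
        return [{c: row.get(c) for c in visible} for row in rows]

    rows = [{"user_id": 42, "country": "DE", "email": "a@b.example"}]
    analyst = Caller("analyst")  # no PII grants
    print(query(rows, ["user_id", "country"], analyst))  # -> [{'country': 'DE'}]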

actionfromafar · a year ago
For a convenient definition of PII. Isn’t everything a user does in aggregate PII?
sonofaragorn · a year ago
What about a post or comment that includes proper names?
michaelt · a year ago
Then, you should check out papers like https://arxiv.org/abs/2302.13971 and https://arxiv.org/abs/2307.09288

In the paper covering the original Llama they explicitly list their data sources in table 1 - including saying that they pretrained on the somewhat controversial books3 dataset.

The paper for Llama 2 also explicitly says they don't take data from Meta's products and services; and that they filter out data from sites known to contain a lot of PII. Although it is more coy about precisely what data sources they used, like many such papers are.

troupo · a year ago
> Would be nice to read how they collect/prepare data for training.

By literally opting everyone into their training data set and making it very cumbersome to opt out: https://threadreaderapp.com/thread/1794863603964891567.html

discobot · a year ago
They explicitly train models only on public datasets
lokimedes · a year ago
Yikes, the little Infiniband+A100 cluster I installed for my previous company seemed useful at the time (12 GPUs) and that was at a cost of around $300k. With LLMs it feels like game over for non-cloud applications if you are not a mega-corp.
lannisterstark · a year ago
Well, yes, but not all models need to be "super large." Smaller models, specialized in specific tasks, working together - and then reporting to a slightly larger model - is the way to go.

Think of everything being connected to a "Home Computer" in those "Future House of 2020" videos that were out there in the '70s or whatnot.

Another example (very rough) would be something like "Weather data gets to a small model via an API; the model looks at it, updates the home dashboard, and also checks whether there are any alerts; if so, it adds x or y to the home dashboard as it thinks best."

We can probably achieve the latter example today (without any significant "coding" on anyone's part except the API owner's).
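Something along those lines is sketched below. Everything in it is hypothetical: the URL, fetch_weather, classify_alert, and Dashboard are stand-ins for whatever weather API, small model, and home hub you actually have, so treat it as an illustration of the pattern rather than a working integration.

    # Rough sketch of the "small model watches the weather and updates the home
    # dashboard" idea. fetch_weather, classify_alert, and Dashboard are stand-ins,
    # not real APIs; classify_alert is where a small specialized model would sit.
    import json
    import urllib.request
    from typing import Optional

    WEATHER_URL = "https://example.com/api/weather?lat=52.5&lon=13.4"  # placeholder

    def fetch_weather(url: str) -> dict:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    def classify_alert(report: dict) -> Optional[str]:
        """Stand-in for the small model: look at the data, decide if an alert is needed."""
        if report.get("wind_kph", 0) > 90:
            return "storm warning"
        if report.get("temp_c", 20) < -10:
            return "extreme cold"
        return None

    class Dashboard:
        def update(self, summary: str, alert: Optional[str]) -> None:
            print(f"dashboard: {summary}" + (f" | ALERT: {alert}" if alert else ""))

    report = fetch_weather(WEATHER_URL)
    alert = classify_alert(report)                           # "model looks at it"
    Dashboard().update(f"{report.get('temp_c')} C", alert)   # "updates the dashboard"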

afro88 · a year ago
> Well, yes, but not all models need to be "super large." Smaller models, specialized in specific tasks, working together - and then reporting to a slightly larger model - is the way to go.

I want to believe, but I've yet to see this kind of setup get anywhere near GPT-4 level.

The weather example seems quite contrived. Why not just display the alerts for your area? Why is a complex system of smaller models reporting up to a slightly larger model necessary?

mike_d · a year ago
Posts like this underscore why the smart money is betting on Google as the long term AI winner. Meta, Microsoft, OpenAI, etc. are trying to address problems with consumer video cards and spending billions to try and outbid each other to win Nvidia's favor - while Google is on their 6th generation of custom silicon.

Literally the only thing that can stop Google now is the fact they keep bringing Microsoft and Oracle flunkies into leadership positions.

throwaway_ab · a year ago
I think it's likely Nvidia's GPUs, many of which are $50,000+ for a single unit, far surpass Google's custom silicon; otherwise why wouldn't Google be selling shovels like Nvidia?

If Google had a better chip, or even a chip that was close, they would sell it to anyone and everyone.

From a quick search I can see Google's custom chips are 15x to 30x slower to train AI compared to Nvidia's current latest-gen AI-specific GPUs.

aseipp · a year ago
Nvidia has decades of experience selling hardware to people, with all the pains that entails: support, sales channels, customer acquisition, software. It's something you don't just stand up overnight, and it does cost money. Google's TPUs get some of their cost efficiency from not supporting COTS use cases or the overhead of selling to people, and any wall-clock comparison also has to include total operational costs, which dominate at their size (e.g. if it's 30x slower but 1/50th the TCO, that's 30 × 1/50 = 0.6x the cost per unit of work, so it's a win; I don't know how TPUv5 stacks up against the B200). It's not as simple as "just put it on a shelf and sell it and make a gajillion dollars like Nvidia".
EvgeniyZh · a year ago
TPU v5p is ~2 times slower than the H100 at large(ish)-scale training (order of 10k chips) [1]. And they already have v6 [2]. I think it's safe to say that they are fairly close to Nvidia in terms of performance.

[1] https://mlcommons.org/benchmarks/training/

[2] https://cloud.google.com/blog/products/compute/introducing-t...

bluedino · a year ago
We have almost 400 H100s sitting idle. I wonder how many other companies are buying millions of dollars' worth of these chips in the hope of using them, only to leave them underutilized?
candiddevmike · a year ago
They do sell shovels; you can get Google TPUs on Google Cloud.
girvo · a year ago
> If Google had a better chip, or even a chip that was close, they would sell it to anyone and everyone.

While I do not actually think Google's chips are better or close to being better, I don't think this actually holds?

If the upside of <better chip> is effectively unbounded, it would outweigh the short term benefit of selling them to others, I would think. At least for a company like Google.

vineyardmike · a year ago
> why wouldn't Google be selling shovels

They do sell them - but through their struggling cloud business. Either way, Nvidia's margin is Google's opportunity to lower costs.

> I can see Google's custom chips are 15x to 30x slower to train AI

TPUs are designed for inference not training - they're betting that they can serve models to the world at a lower cost structure than their competition. The compute required for inference to serve their billions of customers is far greater than training costs for models - even LLMs. They've been running model inference as a part of production traffic for years.

nickpsecurity · a year ago
That’s not necessarily true. Many companies make chips they won’t sell to support lucrative, proprietary offerings. Mainframe processors are the classic example.

In AI, Google (TPU) and Intel (Gaudi) each have chips they push in cloud offerings. The cloud offerings have cross selling opportunities. That by itself would be a reason to keep it internal at their scale. It might also be easier to support one, or a small set, of deployments that are internal vs the variety that external customers would use.

xipix · a year ago
Intel, AMD and others also have chips for training that perform close to, or sometimes better than, Nvidia's. These are already in the market. Two problems: the CUDA moat, and "no one gets fired for buying green".
megablast · a year ago
Is that why Apple sells their chips to everyone??
tw04 · a year ago
Except Microsoft is making their own chips as well?

https://www.theverge.com/2023/11/15/23960345/microsoft-cpu-g...

jauer · a year ago
matt-p · a year ago
Can't agree. This is like saying $popularApp will fail because they buy expensive hosting at AWS.

Rubbish. They will fail because the product didn't fit the market; if they're successful, they'll have money to buy servers and colo, then drive down cost. If they succeed, it will be in large part because they spent their capital, and more importantly their time, on code/engineers rather than servers.

Right now companies are searching for a use of AI that will add hundreds of billions to their market cap. Once they find that, they can make TPUs; right now only one thing matters: getting there first.

mike_d · a year ago
> This is like saying $popularApp will fail because they buy expensive hosting at AWS.

For any given mobile app startup, AWS is effectively infinite: the more money you throw at it, the more doodads you get back. Nvidia's supply chain is not infinite, and it is the bottleneck for all the non-Google players to fight over.

Deleted Comment

hooloovoo_zoo · a year ago
Google has been working on TPUs and Tensorflow for a decade with pretty mixed success; I don't think it's clear that they're going to win.
zmmmmm · a year ago
Custom silicon is fantastic when things have stabilised and you know exactly what you want. But things are still evolving fast in real time, and in that environment, whoever can move fastest to be ultra flexible and deploy the latest architecture as soon as possible is the winner. I think in a nutshell that is the story of Nvidia's success here: they created a GPGPU platform with just the right level of abstraction to capture the market for AI research.
raincole · a year ago
^ How to pack as many mistakes as possible into one single comment.

1. Google's stock hasn't significantly outperformed Meta, Microsoft, etc., in the past two years.

2. Meta and Microsoft are trying to make their own chips as well.

3. They're not using "consumer video cards" to train AI. I don't even know if you can call these beasts video cards any more. The H100 doesn't even have an HDMI port.

callalex · a year ago
That’s how all huge tech companies become dinosaurs, though. Upper management that is already stupidly wealthy (and therefore unmotivated) has the funding and patience to hire geniuses to build incredible machines, then constantly ties their shoelaces together while asking them to sprint. Examples include Microsoft and Oracle as you said, and before them IBM, AT&T, TIBCO, Marvell, Motorola; I could go on for a while…
ai4ever · a year ago
here is an older take on the same topic:

https://www.yitay.net/blog/training-great-llms-entirely-from...

GPU vs TPU, and good software managing large clusters of them across all sorts of failures.

the funny bit from the above article is the incident when someone forgot about a training job at google, and a month later had the model fully trained without an alert of any kind. "outrageously good infra"

zdyn5 · a year ago
H100s are far from consumer video cards
stygiansonic · a year ago
Yeah, OP's comment makes it seem like they are building racks of RTX 4090s, when that isn’t remotely true. Tensor Core performance is far different on data-center-class devices vs consumer ones.
blackoil · a year ago
I would call it "stupid" money. This isn't a commodity business; the value of the final product is orthogonal to the amount invested in compute. If Google is 10% slower or its product is 10% worse, it can lose all the value. This is like valuing a software company higher because its devs are using cheap PC desktops instead of Macs.
r_hanz · a year ago
Much the same way you can have all the best gear and still fail - Google’s primary strength seems to be the DeepMind group. I’m not affiliated with Google, but IMHO the reason they will slowly die is because their engineering culture has taken a backseat due to their broken hiring practices.

Bad hiring practices aren’t exclusive to them, but from all accounts it seems like their internal focus is on optimizing ad revenue over everything else. I could be wrong or misinformed, but it seems to me like they are playing the finite game in the AI space (DeepMind group aside) while FAIR are playing the infinite game.

Meanwhile, MSFT is simply trying to buy its way to relevance (e.g. the OpenAI investments) and carve out future revenues (Recall), and Jobs-less Apple is building their trademark walled garden (Apple Intelligence?). Although the use of unified memory in Apple silicon poses some interesting possibilities for running sizable models on consumer hardware.

Overall it seems like "big tech" is by and large uninspired and asleep at the wheel, save specific teams like those led by LeCun, Hassabis, etc. Not sure where that leaves OpenAI now that Karpathy is gone.

VirusNewbie · a year ago
> because their engineering culture has taken a backseat due to their broken hiring practices.

What company do you think has better hiring practices, and consequently a deeper talent pool? Meta's process is pretty similar to Google's (though with an emphasis on speed over creativity). Microsoft is certainly worse at hiring than the two aforementioned...

rapsey · a year ago
Their work on protein folding and AI could very well be a business worth hundreds of billions, and they know it, as could many of their other bets.

Whereas others are playing the LLM race to the bottom.

htrp · a year ago
> Their work on protein folding and AI could very well be a business worth hundreds of billions, and they know it, as could many of their other bets.

And we'll be writing case studies of how they squandered billions in R&D to help found other companies (kinda like Xerox PARC).

Almost every interesting paper after transformers has had its authors leave to commercialize the work at their own companies.

loeg · a year ago
The only thing that can stop Google is Google. Somehow every bet that isn't Search doesn't pan out. And inexplicably, they're working hard to kill Search now. As a shareholder, I hope they succeed. But I am more pessimistic about it than you.
yellow_postit · a year ago
And they missed multiple waves of effectively building on their own in-house research.

Deleted Comment

Deleted Comment

throwaway920102 · a year ago
What can stop Google is building the wrong thing, or being so scared to launch anything that they smother their own fledgling products before they are born or before they can mature. Their product and finance and accounting teams should be tossed.
xcv123 · a year ago
None of these companies are using consumer video cards. https://www.nvidia.com/en-us/data-center/h200/
guardiang · a year ago
Google is old, like MetLife, relative to their respective industries. Both are carrying too much baggage and are top-heavy. As a result, I personally don't think Google will be able to keep pace with OpenAI in the long run.
jatins · a year ago
this was the same argument that was presented a decade ago on why Google was supposed to win the cloud because their internal infra was miles ahead of Amazon and Microsoft.

Yet here we are. Will the consumer video cards get cheaper and better faster or will Google's directors' infighting stop first?

Deleted Comment

htrp · a year ago
> Literally the only thing that can stop Google now is the fact they keep bringing Microsoft and Oracle flunkies into leadership positions.

You mean the thing that's already stopped them? If they had seriously invested into the TPU ecosystem in 2015, they would already have "won" AI.

KaiserPro · a year ago
> Meta, Microsoft, OpenAI, etc. are trying to address problems with consumer video cards

Yes, but you are buying access to tested, supported units that are proven to work, don't require custom software, and are almost plug and play. When it's time to upgrade, it's not that costly.

Designing, fabricating, and deploying your own silicon is expensive; creating software support for it, more expense still. Then there is the opportunity cost of having to optimise the software stack yourself.

You're exchanging a large capex for a similarly sized capex plus a fuckton of opex on top.

HarHarVeryFunny · a year ago
The chips are somewhat irrelevant. It's the overall system architecture, management, and fault recovery that matters.
lxgr · a year ago
I wish I had your faith in Google’s ability to refrain from kneecapping their own perfectly fine product.
houseplant · a year ago
After everything I've seen, and the litigation coming out of Europe, I really can't see AI lasting long once they're obligated to prove they have rights to the data they're training on.

They can't get away with having scraped people's own work forever. You can't steal from workers and then undercut them by selling that hard work for pennies, and not expect everything to collapse. I mean, I know that the folks in charge of this aren't really known for their foresight, especially when stock numbers and venture capital are the entire point, but... surely I hope people can recognize that this can't go on unimpeded.

logicchains · a year ago
Eventually they're going to put vision LLMs in robotic bodies and they'll be able to learn just by listening and watching, just like humans, at which stage the idea that they're "stealing" just by viewing content will be seen as absurd.
Jabrov · a year ago
"Consumer video cards"? Meta's not building their clusters out of 3090s.

They're using advanced cards built for data centers and machine learning -- effectively "custom silicon".

rldjbpin · a year ago
being first/early doesn't always mean long-term dominance, see tensorflow.

at the consumer level, npus are becoming a useful accelerator, and here google can choose to become the platform of choice.

but nobody comes close to nvidia for training workflows as it stands. imho that is currently possible thanks to community efforts afforded by cuda being the only realistic option.

politics and leadership aside, it would be nice to sustain a market for highly efficient matrix multipliers. also a s/w ecosystem that finally makes multiprocessing workloads easy to integrate for dummies like me.

hipadev23 · a year ago
Google’s on their 6th generation and still can’t find anyone to use it. Hmm.
coralreef · a year ago
How do TPUs perform compared to GPUs on LLMs and image generation?
smarterclayton · a year ago
Pretty well. Anthropic runs some of Claude inference on GKE and TPU v5e - this talk at the last GCP conference has some details:

https://youtu.be/b87I1plPeMg?si=T4XSFUzXG8BwpphR

Ecosystem support for GPU is very strong and so smaller teams might find TPUs to have a steeper learning curve. Sophisticated users can definitely get wins today on TPU if they can get capacity.

And it’s not as if Nvidia is standing still, so how the strengths of each change over future generations isn’t set either. I.e., TPUs are "simple" matrix multipliers optimized for operation at scale, while GPUs are more general purpose and have strong ecosystem power.

Disclaimer: I work on GKE, enabling AI workloads.

fragmede · a year ago
Smart money has a diversified portfolio: it isn't betting on any one winner, and has invested in all of them, and then some.
checkyoursudo · a year ago
What would any company as "the long term AI winner" look like? What would it mean to be the winner in this context?
ketchupdebugger · a year ago
The winner is just Nvidia. I see this like the battle of gas vs electric cars: Nvidia is basically making the wheels, and whichever company wins, you'd still need wheels.
Der_Einzige · a year ago
Until the lion's share of AI projects supports that custom silicon, I will continue to bet on anyone buying Nvidia GPUs.
iamflimflam1 · a year ago
Don’t forget Apple’s Private Compute Cloud - built on top of Apple Silicon.
towawy · a year ago
Apple's models were reportedly trained on Google TPUs, according to Reuters [0]. Does anyone know the "technical document" the news article references?

[0] https://www.reuters.com/technology/artificial-intelligence/h...

moneywoes · a year ago
Is no one else working on custom silicon?
fnordpiglet · a year ago
The problem isn’t just developing your own processor. Nvidia has a huge stack of pretty cutting-edge technology, including the whole Mellanox networking stack, an enormous OSS toolchain around CUDA, etc., that people seeking to make comparable products have to overcome.
threeseed · a year ago
Everyone is.

Apple, AWS, Google, Meta, Microsoft all have custom AI-centric silicon.

dinobones · a year ago
Do you really think Google’s hardware expertise is better than Nvidia’s?

If needed, these other companies have the $$$ to buy the best chips money can buy from Nvidia - better chips than Google could ever produce.

If anything, this is why IMO Google will fail.

candiddevmike · a year ago
I thought NVIDIA's moat was mostly software/CUDA?
mike_d · a year ago
Yes. Google was building custom HPC hardware 5-8 years before Nvidia decided to expand outside the consumer and "workstation" markets.
asynchronous · a year ago
Laughable response when you actually look at the quality of the algorithms being produced by Google. They’re so behind it’s embarrassing.
jejeyyy77 · a year ago
lolwat.

google is the biggest loser in all of this.

Dead Comment

Deleted Comment

radarsat1 · a year ago
Frustratingly little information. For example, I'm exceedingly curious how they deal with scheduling jobs on such a huge array of machines. The article:

> Efficient scheduling helps ensure that our resources are used optimally. This involves sophisticated algorithms that can allocate resources based on the needs of different jobs and dynamic scheduling to adapt to changing workloads.

Wow, thanks for that, Captain Obvious. So how do you actually do it?

p4ul · a year ago
I usually assume these companies are using some of the popular schedulers (e.g., Slurm, MOAB, SGE) that have existed in the HPC community for many years.

I have anecdotally also heard that some are using k8s, but I've not seen that myself. Slurm [1] is basically built for this stuff; that's definitely what I would use!

[1] https://slurm.schedmd.com/documentation.html
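For a concrete flavor of what running a training job under one of those schedulers looks like, here's a minimal sketch of a process discovering its rank from the environment variables Slurm sets for each task (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID); MASTER_ADDR/MASTER_PORT are assumed to be exported by the job script. This is the generic Slurm + PyTorch pattern, not anything specific to Meta's setup.

    # Generic Slurm + torch.distributed bootstrap: Slurm sets SLURM_PROCID,
    # SLURM_NTASKS and SLURM_LOCALID per task; the job script is assumed to
    # export MASTER_ADDR and MASTER_PORT for rendezvous.
    import os
    import torch
    import torch.distributed as dist

    def init_from_slurm():
        rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
        world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
        local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

        dist.init_process_group(
            backend="nccl" if torch.cuda.is_available() else "gloo",
            init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
            rank=rank,
            world_size=world_size,
        )
        if torch.cuda.is_available():
            torch.cuda.set_device(local_rank)  # one GPU per task
        return rank, world_size, local_rank

Launched with something like `srun --ntasks-per-node=8 python train.py`, Slurm handles the node allocation, gang scheduling, and requeueing; the job itself only needs to read those variables.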

claytonjy · a year ago
Slurm is definitely still dominant, but OpenAI has been using k8s for training for many years now¹, and there are various ways to run Slurm on top of Kubernetes, including the recent SUNK from CoreWeave².

at my company we use slurm "directly" for static compute we rent or own (i.e. not in a public cloud), but are considering using Kubernetes because that's how we run the rest of the company, and we'd rather invest more effort into being better at k8s than becoming good slurm admins.

¹: https://openai.com/index/scaling-kubernetes-to-2500-nodes/

²: https://www.coreweave.com/blog/sunk-slurm-on-kubernetes-impl...

jauntywundrkind · a year ago
Random q, I wonder if gloo is used in these systems? https://github.com/facebookincubator/gloo

RDMA and GPUDirect capable. Coordinates over MPI or (hi)redis.

runeblaze · a year ago
~~IIRC gloo is CPU tensors only so likely not~~

Edit: I had a brain freeze or something... gloo is not CPU-only, but for whatever reason I don't see it outside of CPU comms.
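For anyone curious what that looks like in practice, here's an illustrative PyTorch snippet (not taken from Meta's stack) that uses NCCL for the GPU collectives and a separate gloo group for small CPU-side coordination tensors, which matches the "mostly CPU comms" usage pattern. Rendezvous environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are assumed to be set by the launcher.

    # Illustrative only: NCCL for GPU collectives, gloo for CPU-side coordination.
    # Assumes the launcher set MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")       # default group for GPU tensors

    gpu_tensor = torch.ones(4, device="cuda")
    dist.all_reduce(gpu_tensor)                   # runs over NCCL

    cpu_group = dist.new_group(backend="gloo")    # secondary group for CPU tensors
    cpu_tensor = torch.tensor([1.0])
    dist.all_reduce(cpu_tensor, group=cpu_group)  # runs over gloo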