> Since we did not have time to change the cooling infrastructure, we had to remain in an air-cooled environment. The mechanical and thermal designs had to change to accommodate this, and that triggered a validation cycle to support a large-scale deployment.
> All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with a very small degree of freedom to change and meet a tight schedule.
Seems like the time constraints put on the team impacted the overall quality of the model.
The last tech team to have no budget and time constraints to pursue their vision? I don't know, the Xanadu team? Romero's original Daikatana team?
> So we decided to build both: two 24k clusters, one with RoCE and another with InfiniBand. Our intent was to build and learn from the operational experience.
I love how they built two completely insane clusters just to learn. That's badass.
It's not just to learn; a RoCE Ethernet cluster with Aristas is way cheaper to build and maintain than a fancy InfiniBand cluster with Mellanox/Nvidia networking, so proving that the former is good enough at scale will eventually save Meta a huge amount of money. InfiniBand cards are much more expensive than Ethernet because there are few vendors, which have a quasi-monopoly, and because far fewer of them are produced overall, so there is less economy of scale.
I wish that instead of just training another stupid LLM, Meta would use it to improve their search and help me find the content I'm actually interested in.
Their revenue depends on it being hard (but not impossible) for you to find the content you're actually interested in. Would be nice if it didn't, but in this reality, money on the Internet is made by wasting users' lives. That is what attention economy is about.
It's actually a mix. They need to disappoint the user for the right amount of time, then please them at the right moment and in the right dose. This maximizes the dopamine release and increases addictiveness.
When you find good content depends on when the algo judges you're already primed for a colorful dopamine intake.
This is like asking Disney to reduce wait times for their rides. In other words, it is against content aggregation platforms' interest to let you get what you want directly.
Which data sources? How much of Meta users' data (fb, instagram… etc)? How do they sanitize PII?
I can't comment on how things like faces get used, but in my experience, PII at Meta is inaccessible by default.
Unless you're impersonating a user on the platform (to access what PII they can see), you have to request special access for logs or database columns that contain so much as user IDs, otherwise the data simply won't show up when you query for it. This is baked into the infrastructure layer, so I doubt the GenAI teams are using something else.
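To make the "columns simply won't show up" behaviour concrete, here is a minimal hypothetical sketch of a default-deny query wrapper. The column tags, grant set, and function names are all invented for illustration; this is not Meta's actual infrastructure.

```python
# Hypothetical illustration of default-deny column access: columns tagged as
# PII are silently dropped from query results unless the caller holds an
# approved access grant. Not Meta's real system; all names are invented.
from typing import Dict, List, Set

PII_COLUMNS: Set[str] = {"user_id", "email", "ip_address"}  # illustrative tags

def run_query(rows: List[Dict[str, object]],
              requested_columns: List[str],
              approved_grants: Set[str]) -> List[Dict[str, object]]:
    """Return only the columns the caller is allowed to see."""
    visible = [
        col for col in requested_columns
        if col not in PII_COLUMNS or col in approved_grants
    ]
    return [{col: row.get(col) for col in visible} for row in rows]

logs = [{"user_id": 123, "country": "DE", "latency_ms": 87}]

# Without a grant, the PII column is simply absent from the result.
print(run_query(logs, ["user_id", "country", "latency_ms"], approved_grants=set()))
# [{'country': 'DE', 'latency_ms': 87}]

# With an approved grant for user_id, the column appears.
print(run_query(logs, ["user_id", "country", "latency_ms"], approved_grants={"user_id"}))
# [{'user_id': 123, 'country': 'DE', 'latency_ms': 87}]
```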
In the paper covering the original Llama they explicitly list their data sources in table 1 - including saying that they pretrained on the somewhat controversial books3 dataset.
The paper for Llama 2 also explicitly says they don't take data from Meta's products and services; and that they filter out data from sites known to contain a lot of PII. Although it is more coy about precisely what data sources they used, like many such papers are.
By literally opting everyone into their training data set and making it very cumbersome to opt out: https://threadreaderapp.com/thread/1794863603964891567.html
Yikes, the little InfiniBand+A100 cluster I installed for my previous company seemed useful at the time (12 GPUs), and that was at a cost of around $300k. With LLMs it feels like game over for non-cloud applications if you are not a mega-corp.
Well, yes, but not all models need to be "super large." Smaller models, specialized in specific tasks, working together and then reporting to a slightly larger model, is the way to go.
Think of everything being connected to a "Home Computer" in those "Future House of 2020" videos that were out there in the '70s or whatnot.
Another example (very rough) would be something like: "Weather data gets to a small model via an API, the model looks at it, updates the home dashboard, and also checks whether there are any alerts; if so, it adds x or y to the home dashboard as it thinks best."
We can probably achieve the latter example today, without any significant 'coding' on anyone's part except the API owner's.
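As a rough illustration of that latter example, here is a minimal sketch assuming a small local model behind an OpenAI-compatible HTTP endpoint (e.g. a llama.cpp or Ollama-style server); the weather API URL, model name, and dashboard file are placeholders, not a real integration.

```python
# Hedged sketch: a small local model reads weather data from an API and decides
# what to put on a home dashboard. Endpoints, model name, and dashboard path
# are all placeholders.
import json
import urllib.request

WEATHER_API = "https://example.com/api/weather?city=Berlin"   # placeholder
MODEL_API = "http://localhost:8080/v1/chat/completions"        # local OpenAI-compatible server (assumption)
DASHBOARD_FILE = "home_dashboard.json"                          # placeholder

def fetch_json(url, payload=None):
    """POST payload as JSON if given, otherwise GET; return parsed JSON."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

weather = fetch_json(WEATHER_API)

# Ask the small model to summarise and flag anything alert-worthy.
reply = fetch_json(MODEL_API, {
    "model": "small-local-model",
    "messages": [{
        "role": "user",
        "content": "Summarise this weather for a home dashboard and list any "
                   "alerts as JSON with keys 'summary' and 'alerts': "
                   + json.dumps(weather),
    }],
})
decision = reply["choices"][0]["message"]["content"]

# Write whatever the model decided onto the dashboard.
with open(DASHBOARD_FILE, "w") as f:
    json.dump({"weather_panel": decision}, f, indent=2)
```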
> Well, yes, but not all models need to be "super large." Smaller models, specialized in specific tasks, working together and then reporting to a slightly larger model, is the way to go.
I want to believe, but I have yet to see this kind of setup get anywhere near GPT-4 level.
The weather example seems quite contrived. Why not just display the alerts for your area? Why is a complex system of smaller models reporting up to a slightly larger model necessary?
Posts like this underscore why the smart money is betting on Google as the long-term AI winner. Meta, Microsoft, OpenAI, etc. are trying to address problems with consumer video cards and spending billions to try to outbid each other to win Nvidia's favor, while Google is on its 6th generation of custom silicon.
Literally the only thing that can stop Google now is the fact they keep bringing Microsoft and Oracle flunkies into leadership positions.
I think it's likely that Nvidia's GPUs, many of which are $50,000+ for a single unit, far surpass Google's custom silicon; otherwise, why wouldn't Google be selling shovels like Nvidia?
If Google had a better chip, or even a chip that was close, they would sell it to anyone and everyone.
From a quick search I can see Google's custom chips are 15x to 30x slower at training AI compared to Nvidia's current latest-gen, AI-specific GPUs.
Nvidia has decades of experience selling hardware to people, with all the pains that entails: support, sales channels, customer acquisition, software. It's something you don't just do overnight, and it does cost money. Google's TPUs get some of their cost efficiency from not supporting COTS use cases and not carrying the overhead of selling to people, and the total wall-clock time has to also include the total operational costs, which dominate at their size (e.g. if it's 30x slower but 1/50th the TCO then it's a win; I don't know how TPUv5 stacks up against the B200). It's not as simple as "just put it on a shelf and sell it and make a gajillion dollars like Nvidia."
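A back-of-the-envelope sketch of that "slower but cheaper can still win" point, using entirely made-up numbers (the 30x and 1/50th figures are the hypothetical from the comment above, not real TPU/GPU pricing):

```python
# Back-of-the-envelope: a chip that is 30x slower per unit but 1/50th the total
# cost of ownership still comes out cheaper per training run. Numbers are
# illustrative only.
fast_tco_per_chip_hour = 5.00                         # made-up $/chip-hour for the "fast" chip
slow_tco_per_chip_hour = fast_tco_per_chip_hour / 50  # 1/50th the TCO
slowdown = 30                                         # slow chip needs 30x more chip-hours

fast_chip_hours = 1_000_000                           # made-up size of one training run
slow_chip_hours = fast_chip_hours * slowdown

fast_cost = fast_chip_hours * fast_tco_per_chip_hour
slow_cost = slow_chip_hours * slow_tco_per_chip_hour

print(f"fast chip: ${fast_cost:,.0f}")  # $5,000,000
print(f"slow chip: ${slow_cost:,.0f}")  # $3,000,000 -> still cheaper per run
```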
TPU v5p is ~2 times slower than H100 at larg(ish)-scale training (order of 10k chips) [1]. And they already have v6 [2]. I think it's safe to say that they are fairly close to Nvidia in terms of performance.
[1] https://mlcommons.org/benchmarks/training/
[2] https://cloud.google.com/blog/products/compute/introducing-t...
We have almost 400 H100s sitting idle. I wonder how many other companies are buying millions of dollars' worth of these chips hoping to use them, only to leave them sitting unutilized?
> If Google had a better chip, or even a chip that was close, they would sell it to anyone and everyone.
While I do not actually think Google's chips are better or close to being better, I don't think this actually holds?
If the upside of <better chip> is effectively unbounded, it would outweigh the short term benefit of selling them to others, I would think. At least for a company like Google.
They do sell them - but through their struggling cloud business. Either way, Nvidia's margin is Google's opportunity to lower costs.
> I can see Google's custom chips are 15x to 30x slower to train AI
TPUs are designed for inference not training - they're betting that they can serve models to the world at a lower cost structure than their competition. The compute required for inference to serve their billions of customers is far greater than training costs for models - even LLMs. They've been running model inference as a part of production traffic for years.
That’s not necessarily true. Many companies make chips they won’t sell to support lucrative, proprietary offerings. Mainframe processors are the classic example.
In AI, Google (TPU) and Intel (Gaudi) each have chips they push in cloud offerings. The cloud offerings have cross-selling opportunities. That by itself would be a reason to keep it internal at their scale. It might also be easier to support one, or a small set, of deployments that are internal vs the variety that external customers would use.
https://www.theverge.com/2023/11/15/23960345/microsoft-cpu-g...
Intel, AMD and others also have chips for training that perform close to, or sometimes better than, Nvidia's. These are already in the market. Two problems: the CUDA moat, and "no one gets fired for buying green".
Can't agree. This is like saying $popularApp will fail because they buy expensive hosting at AWS.
Rubbish. They will fail because the product didn't fit the market; if they're successful, they'll have money to buy servers and colo, then drive down cost. If they succeed, it will be in large part because they spent their capital, and more importantly their time, on code and engineers rather than servers.
Right now companies are searching for a use of AI that will add hundreds of billions to their market cap. Once they find that, they can make TPUs; right now only one thing matters: getting there first.
> This is like saying $popularApp will fail because they buy expensive hosting at AWS.
For any given mobile app startup, AWS is effectively infinite. The more money you throw at it, the more doodads you get back. Nvidia's supply chain is not infinite and is the bottleneck for all the non-Google players to fight over.
Custom silicon is fantastic when things have stabilised and you know exactly what you want. But things are still evolving fast in real time, and in that environment, whoever can move fastest, be ultra flexible, and deploy the latest architecture as soon as possible is the winner. I think in a nutshell that is the story of Nvidia's success here: they created a GPGPU platform with just the right level of abstraction to capture the market for AI research.
^ How to pack as many mistakes as possible into one single comment.
1. Google's stock didn't significantly outperform Meta, Microsoft, etc., in the past two years.
2. Meta and Microsoft are trying to make their own chips as well.
3. They're not using "consumer video cards" to train AI. I don't even know if you can call these beasts video cards any more; an H100 doesn't have an HDMI port.
That's how all huge tech companies become dinosaurs, though. Upper management that is already stupidly wealthy (and therefore unmotivated) has the funding and patience to hire geniuses to build incredible machines, and then constantly ties their shoelaces together while asking them to sprint. Examples include Microsoft and Oracle as you said, and before them IBM, AT&T, TIBCO, Marvell, Motorola; I could go on for a while…
https://www.yitay.net/blog/training-great-llms-entirely-from...
GPU vs TPU, and good software managing large clusters of them across all sorts of failures.
The funny bit from the above article is the incident where someone forgot about a training job at Google, and a month later had the model fully trained without an alert of any kind. "Outrageously good infra."
Yeah, OP's comment makes it seem like they are building racks of RTX 4090s, when this isn't remotely true. Tensor Core performance is far different on the data center class devices vs consumer ones.
I would call it "stupid" money. This isn't a commodity business. Value of the final product is orthogonal to amount invested in compute. If Google is 10% slower or its product is 10% worse, it can lose all the value. This is like valuing a software company higher because its devs are using cheap PC desktops instead of Mac.
Much the same way you can have all the best gear and still fail - Google’s primary strength seems to be the DeepMind group. I’m not affiliated with Google, but IMHO the reason they will slowly die is because their engineering culture has taken a backseat due to their broken hiring practices.
Bad hiring practices aren’t exclusive to them, but from all accounts it seems like their internal focus is on optimizing ad revenue over everything else. I could be wrong or misinformed, but it seems to me like they are playing the finite game in the AI space (DeepMind group aside) while FAIR are playing the infinite game.
*meanwhile MSFT are simply trying to buy their way to relevance (e.g. OpenAI investments, etc) and carve out future revenues (Recall) and Jobs-less Apple is building their trademark walled-garden (AppleIntelligence?). Although the use of unified memory in Apple silicon poses some interesting possibilities for enabling the use of sizable models on consumer hardware.
Overall it seems like “big-tech” is by-and-large uninspired and asleep at the wheel save specific teams like those led by Lecun, Hassabis, etc. not sure where that leaves OpenAI now that Karpathy is gone.
> because their engineering culture has taken a backseat due to their broken hiring practices.
What company do you think has better hiring practices, and subsequently a higher talent pool? Meta's is pretty similar to Google's (though with an emphasis on speed over creativity). Microsoft is certainly worse at hiring than the two aforementioned...
Whereas others are playing the LLM race to the bottom.
And we'll be writing case studies of how they squandered billions of R&D to help found other companies (kinda like Xerox PARC).
Almost every interesting paper after transformers has had its authors leave to commercialize their own companies.
The only thing that can stop Google is Google. Somehow every bet that isn't Search doesn't pan out. And inexplicably, they're working hard to kill Search now. As a shareholder, I hope they succeed. But I am more pessimistic about it than you.
What can stop Google is building the wrong thing, or being so scared to launch anything that they smother their own fledgling products before they are born or before they can mature. Their product and finance and accounting teams should be tossed.
Google is old like MetLife, relative to their respective industries. Both are carrying too much baggage and are top heavy. As a result, I personally don't think Google will be able to keep pace with OpenAI in the long run.
[0] https://www.reuters.com/technology/artificial-intelligence/h...
Apple, AWS, Google, Meta, Microsoft all have custom AI-centric silicon.
If needed, these other companies have the $$$ to buy the best chips money can buy from Nvidia. Better chips than Google could ever produce.
If anything, this is why IMO Google will fail.
Google is the biggest loser in all of this.
This was the same argument presented a decade ago for why Google was supposed to win the cloud, because their internal infra was miles ahead of Amazon's and Microsoft's.
Yet here we are. Will the consumer video cards get cheaper and better faster, or will Google's directors' infighting stop first?
You mean the thing that's already stopped them? If they had seriously invested into the TPU ecosystem in 2015, they would already have "won" AI.
> Meta, Microsoft, OpenAI, etc. are trying to address problems with consumer video cards
Yes, but you are buying access to tested, supported units that are proven to work, don't require custom software, and are almost plug and go. When it's time to upgrade, it's not that costly.
Designing, fabricating and deploying your own silicon is expensive; creating software support for it, more expense still. Then there is the opportunity cost of having to optimise the software stack yourself.
You're exchanging a large capex for a similar-sized capex plus a fuckton of opex as well.
They're using advanced cards meant for data centers and machine learning -- almost effectively "custom silicon".
After everything I've seen and the litigation coming out of Europe, I really can't see AI lasting long after they're obligated to prove rights for the data they're training on.
They can't get away with having scraped people's own work forever. You can't steal things from workers and then undercut them by selling that hard work for pennies, and not expect everything to collapse. I mean, I know that the folks in charge of this aren't really known for their foresight, especially when stock numbers and venture capital are the entire point, but... surely people can recognize that this can't go on unimpeded.
Eventually they're going to put vision LLMs in robotic bodies and they'll be able to learn just by listening and watching, just like humans, at which stage the idea that they're "stealing" just by viewing content will be seen as absurd.
Being first/early doesn't always mean long-term dominance; see TensorFlow.
At the consumer level, NPUs are becoming a useful accelerator, and here Google can choose to become the platform of choice.
But nobody comes close to Nvidia for training workflows as it stands. IMHO this is currently possible thanks to community efforts afforded by CUDA being the only realistic option.
Politics and leadership aside, it would be nice to sustain a market for highly efficient matrix multipliers, and also a s/w ecosystem that finally makes multiprocessing workloads easy to integrate for dummies like me.
Ecosystem support for GPU is very strong and so smaller teams might find TPUs to have a steeper learning curve. Sophisticated users can definitely get wins today on TPU if they can get capacity.
And it's not as if Nvidia is standing still, so how the strengths of each change over future generations isn't set either. I.e., TPUs are "simple" matrix multipliers also optimized for operation at scale, while GPUs are more general purpose and have strong ecosystem power.
Disclaimer - work on GKE on enabling AI workloads.
The winner is just Nvidia; I see this like the battle of gas vs electric cars. Nvidia is basically making the wheels: whichever company wins, you'd still need wheels.
The problem isn't just developing your own processor. Nvidia has a huge stack of pretty cutting-edge technology, including a huge stack from Mellanox, an enormous OSS toolchain around CUDA, etc., that people seeking to make comparable products have to overcome.
Frustratingly little information. For example, I'm exceedingly curious how they deal with scheduling jobs on such a huge array of machines. The article:
> Efficient scheduling helps ensure that our resources are used optimally. This involves sophisticated algorithms that can allocate resources based on the needs of different jobs and dynamic scheduling to adapt to changing workloads.
Wow thanks for that, captain obvious. So how do you do it?
I usually assume these companies are using some of the popular schedulers (e.g., Slurm, MOAB, SGE) that have existed in the HPC community for many years.
I have anecdotally also heard that some are using k8s, but I've not seen that myself. Slurm [1] is basically built for this stuff; that's definitely what I would use!
[1] https://slurm.schedmd.com/documentation.html
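For a sense of what using Slurm for this looks like, here is a minimal sketch of submitting a hypothetical multi-node, multi-GPU training job via sbatch. Nothing here is specific to Meta's setup; train.py and the resource sizes are placeholders, and a real cluster would add partitions, containers, and a proper launcher on top.

```python
# Hedged sketch: submit a hypothetical 16-node x 8-GPU training job to Slurm.
# The flags are standard sbatch/srun options; train.py and the sizes are
# placeholders.
import subprocess

cmd = [
    "sbatch",
    "--job-name=llm-pretrain",
    "--nodes=16",
    "--ntasks-per-node=8",   # one task per GPU
    "--gres=gpu:8",          # 8 GPUs per node
    "--time=7-00:00:00",     # 7-day wall-clock limit
    "--wrap", "srun python train.py --config config.yaml",
]

result = subprocess.run(cmd, check=True, capture_output=True, text=True)
print(result.stdout.strip())  # e.g. "Submitted batch job 123456"
```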
Slurm is definitely still dominant, but OpenAI has been using k8s for training for many years now¹, and there are various ways to run Slurm on top of Kubernetes, including the recent SUNK from CoreWeave².
¹: https://openai.com/index/scaling-kubernetes-to-2500-nodes/
²: https://www.coreweave.com/blog/sunk-slurm-on-kubernetes-impl...
At my company we use Slurm "directly" for static compute we rent or own (i.e. not in a public cloud), but we are considering Kubernetes because that's how we run the rest of the company, and we'd rather invest more effort into getting better at k8s than into becoming good Slurm admins.
RDMA and GPUDirect capable. Coordinates over MPI or (hi)redis.
Edit: I had a brain freeze or something... Gloo is not CPU-only, but for whatever reason I don't see it used outside of CPU comms.
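For context on those backends, here is a minimal PyTorch torch.distributed sketch: NCCL for GPU collectives (RDMA/GPUDirect capable when the fabric supports it), Gloo typically for CPU tensors or control traffic. This uses the standard env:// rendezvous and is not specific to Meta's setup.

```python
# Minimal torch.distributed sketch: choose NCCL for GPU collectives, Gloo for
# CPU tensors. Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the
# launcher (torchrun, Slurm, etc.).
import torch
import torch.distributed as dist

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, init_method="env://")

rank = dist.get_rank()
if backend == "nccl":
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    torch.cuda.set_device(device)
else:
    device = torch.device("cpu")

# Each rank contributes a tensor; all_reduce sums it across the whole job.
x = torch.ones(4, device=device) * rank
dist.all_reduce(x, op=dist.ReduceOp.SUM)

if rank == 0:
    print(f"world_size={dist.get_world_size()} backend={backend} sum={x.tolist()}")

dist.destroy_process_group()
```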