latchkey · a year ago
I am building a bare metal mi300x service provider business.

Anyone offering $2 GPUs is either losing money on DC space/power, or running something so sketchy under the covers that they do their best to hide it. It is one thing to play around with $2 GPUs and another to run a business on them. If you're trying to do the latter, you're not considering how much you are risking your business on unreliable compute.

AWS really warped people's perception of what it takes to run high-end enterprise GPU infrastructure like this. People got used to the reliability hyperscalers offer. They don't consider what 999999% uptime + 45kW+ rack infrastructure truly costs.

There is absolutely no way anyone is going to be making any money offering $2 H100s unless they stole them and they get free space/power...

dijit · a year ago
> 999999% uptime

Assuming you mean 99.9999%; your hyperscaler isn't giving you that. MTBF is comparable.

It's hardware at the end of the day, the VM hypervisor isn't giving you anything on GPU instances because those GPU instances aren't possible to live-migrate. (even normal VMs are really tricky).

In a country with a decent power grid and a UPS (or if you use a colo provider) you're going to get the same availability from a single machine, maybe even slightly higher because there are fewer moving parts.

I think this "cloud is god" mentality obscures the fact that server hardware is actually hugely reliable once it's working; and the cloud model literally depends on this fact. The reliability of cloud is simply the reliability of hardware; they only provided an abstraction on management not on reliability.

llm_trw · a year ago
I think people just don't realize how big computers have gotten since 2006. A t2.micro was an OK desktop computer back then. Today you can have something 1000 times as big for a few tens of thousands of dollars. You can easily run a company that serves the whole of the US out of a closet.
zaptrem · a year ago
As someone who has done a bunch of large-scale ML on hyperscaler hardware, I will say the uptime is nowhere near 99.9999%. Given a cluster of only a few hundred GPUs, one or more failures are a near certainty, to the point where we spend a bunch of time on recovery-time optimization.
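A quick back-of-the-envelope sketch of why failures become near-certain at cluster scale. The per-GPU failure rate below is an assumed illustrative number, not a vendor figure:

```python
# Toy model: probability that at least one of n independent GPUs fails
# within a given window. The 0.1%/day rate is an assumption for illustration.
def p_any_failure(n_gpus: int, p_daily: float, days: float) -> float:
    """Chance that at least one of n GPUs fails within `days` days."""
    p_ok_one = (1 - p_daily) ** days        # one GPU survives the window
    return 1 - p_ok_one ** n_gpus           # at least one failure overall

# 512 GPUs, hypothetical 0.1% daily failure rate, one-week training run:
print(f"{p_any_failure(512, 0.001, 7):.0%}")  # → 97%
```

Even with an optimistic per-device rate, the exponent in `p_ok_one ** n_gpus` makes a multi-hundred-GPU run almost guaranteed to see at least one failure, which is why recovery-time optimization matters more than chasing extra nines.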
everforward · a year ago
> The reliability of cloud is simply the reliability of hardware; they only provided an abstraction on management not on reliability.

This isn't really true. I mean it's true in the sense that you could get the same reliability on-premise given a couple decades of engineer hours, but the vast majority of on-premise deployments I have seen have significantly lower reliability than clouds and have few plans to build out those capabilities.

E.g. if I exclude public cloud operator employers, I've never worked for a company that could mimic an AZ failover on-prem, and I've worked for a couple of F500s. As far as I can recall, none of them had even segmented their network beyond the management plane having its own hardware. The rest of the DC network was centralized; I recall one in particular because an STP loop screwed up half of it at one point.

Part of paying for the cloud is centralizing the costs of thinking up and implementing platform-level reliability features. Some of those things are enormously expensive and not really practical for smaller economies of scale.

Just one random example is tracking hardware-level points of failure and exposing that to the scheduler. E.g. if a particular datacenter has 4 supplies from mains and each rack is only connected to a single one of those supplies, when I schedule 4 jobs to run there it will try to put each job in a rack with a separate power supply to minimize the impact of losing a mains. Ditto with network, storage, fire suppression, generators, etc, etc, etc.

That kind of thing makes 0 economic sense for an individual company to implement, but it starts to make a lot of sense for a company who does basically nothing other than manage hardware failures.
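The failure-domain-aware scheduling described above can be sketched as a toy greedy placement. The rack and feed names are made up for illustration:

```python
# Toy sketch: place jobs so that losing any one mains feed takes out as
# few jobs as possible, per the scheduler behavior described above.
from collections import defaultdict

def spread_jobs(jobs, racks):
    """Greedily assign each job to a rack on the least-loaded power feed.
    `racks` maps rack name -> power feed id."""
    load = defaultdict(int)          # number of jobs per power feed
    placement = {}
    for job in jobs:
        # pick the rack whose feed currently carries the fewest jobs
        rack = min(racks, key=lambda r: load[racks[r]])
        placement[job] = rack
        load[racks[rack]] += 1
    return placement

racks = {"rack-a": "feed-1", "rack-b": "feed-2",
         "rack-c": "feed-3", "rack-d": "feed-4"}
print(spread_jobs(["j1", "j2", "j3", "j4"], racks))
```

With four jobs and four independent feeds, each job lands on a distinct feed, so a single mains failure costs at most one job. Real schedulers generalize this across network, storage, and cooling domains simultaneously.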

traceroute66 · a year ago
> instances aren't possible to live-migrate

Some of the cloud providers don't even do live migration. They adhere to the cloud mantra of "oh well, it's up to the customer to spin up and carry on elsewhere".

I have it on good authority that some of them don't even take A+B feeds to their DC suites - and then have the chutzpah to shout at the DC provider when their only feed goes down, but that's another story... :)

yencabulator · 10 months ago
> (even normal VMs are really tricky)

For what it's worth, GCP routinely live-migrates customer VMs to schedule hardware for maintenance/decommissioning when hardware sensors start indicating trouble. It's standard everyday basic functionality by now, but only for the vendors who built the feature in from the beginning.

wkat4242 · a year ago
> Assuming you mean 99.9999%; your hyperscaler isn't giving you that. MTBF is comparable.

Yeah we've already had about a day's worth of downtime this year on office 365 and Microsoft is definitely a hyperscaler. So that's 99.3% at best.

dijit · a year ago
meta: I'm always interested how the votes go on comments like this. I've been watching periodically and it seems like I get "-2" at random intervals.

This is not the first time that "low yield" karma comments have sporadic changes to their votes.

It seems unlikely at the rate of change (roughly 3-5 point changes per hour) that two people would simultaneously (within a minute) have the same desire to flag a comment, so I can only speculate that:

A) Some people's flag is worth -2

B) Some people, passionate about this topic, have multiple accounts

C) There's bots that try to remain undetected by making only small adjustments to the conversation periodically.

I'm aware that some people's jobs depend very strongly on the cloud, but nothing I said could be considered off-topic or controversial: cloud GPU compute relies on hardware reliability just like everything else does. This is fact. Regardless, the voting behaviour on comments such as this is extremely suspicious.

michaelt · a year ago
> There is absolutely no way anyone is going to be making any money offering $2 H100s unless they stole them and they get free space/power...

At the highest power settings, H100s consume 400 W. Add another 200 W for CPU/RAM. Assume you have an incredibly inefficient cooling system, so you also need 600 W of cooling.

Google tells me US energy prices average around 17 cents/kWh - even if you don't locate your data centre somewhere with cheap electricity.

17 cents/kWh * 1200 watts * 1 hour is only 20.4 cents/hour.

ckastner · a year ago
That's just the power. If one expects an H100 to run for three years at full load, that's 24 x 365 x 3 = 26,280 hours. Assuming a price of $25K per H100, that means about $1/hour just to amortize the hardware. Hence the "unless they stole them", I guess.

Factor in space, networking, cooling, security, etc., and $2 really does seem undoable.
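Putting the two comments' numbers together (all figures are the assumptions stated above: 1.2 kW per GPU including host and cooling, $0.17/kWh, a $25K card, three-year life):

```python
# Back-of-the-envelope cost floor for one H100-hour, using the thread's numbers.
POWER_KW   = 1.2                 # GPU + CPU/RAM + inefficient cooling
PRICE_KWH  = 0.17                # average US electricity price assumed above
CARD_COST  = 25_000              # assumed H100 purchase price
LIFE_HOURS = 24 * 365 * 3        # three years at full load = 26,280 h

power_per_hour = POWER_KW * PRICE_KWH      # ≈ $0.204/h for electricity
amort_per_hour = CARD_COST / LIFE_HOURS    # ≈ $0.95/h to pay off the card
print(f"power  ${power_per_hour:.3f}/h")
print(f"amort  ${amort_per_hour:.2f}/h")
# before space, networking, security, staff, financing:
print(f"floor  ${power_per_hour + amort_per_hour:.2f}/h")
```

On these assumptions the floor is roughly $1.16/hour before any of the other costs, which is why $2/hour leaves so little margin.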

latchkey · a year ago
You are not looking at the full economics of the situation.

There are very few data centers left that can do 45kW+ rack density, which translates to 32 H100/MI300x GPUs in a rack.

In most datacenters, you're looking at 1 or 2 boxes of 8 GPUs per rack. As a result, it isn't just the price of power; it is whatever the data center wants to charge you.

Then you factor in cooling on top of that...

sandworm101 · a year ago
For the fuller math one has to include the cost of infrastructure financing, which is tied to interest rates. Given how young most of these H100 shops are, I assume that they pay more to service their debts than for power.
neom · a year ago
This reads exactly like what people said about DigitalOcean when we launched it.
count · a year ago
To be fair, DO was muuuch sketchier in the past (eg https://news.ycombinator.com/item?id=6983097).

Launching any multitenant system is HARD. Many of them are held together with bubble gum and good intentions….

imglorp · a year ago
How was DO able to provide what AWS didn't want to? Was it purely margins?


bjornsing · a year ago
> There is absolutely no way anyone is going to be making any money offering $2 H100s unless they stole them and they get free space/power...

That’s essentially what the OP says. But once you’ve already invested in the H100s you’re still better off renting them out for $2 per hour rather than having them idle at $0 per hour.

Wytwwww · a year ago
Then how come you can still get several last gen EPYC or Xeon systems that would use the same amount of power for under $1 per hour?

For datacentre GPUs, the energy, infrastructure and other variable costs seem to be relatively insignificant compared to the fixed capital costs. Nvidia's GPUs are just extremely expensive relative to how much power they use (compared to CPUs).

> H100s you’re still better off renting them out for $2 per hour rather than having them idle at $0 per hour.

If you're barely breaking even at $2, then immediately selling them would seem like the only sensible option (depreciation alone is significantly higher than the cost of power for running an H100 24x365 at 100% utilization).

traceroute66 · a year ago
> 999999% uptime

I've said it before and I'll say it again...

Read the cloud provider small-print before you go around boasting about how great their SLAs are.

Most of the time they are not worth the paper they are written on.

kjs3 · a year ago
This is beyond true. Read and understand what your cloud SLAs are, not what you think they are or what you think they should be. There was significant consternation generated when I pointed out that the SLA for availability for an Azure storage blob was only 4 nines with zone redundancy.

https://azure.microsoft.com/files/Features/Reliability/Azure...
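To make the "four nines" point concrete, here is a small converter from an SLA's nines to the downtime it actually permits:

```python
# What an N-nines availability SLA allows in downtime per year.
def allowed_downtime_minutes(nines: int, days: int = 365) -> float:
    """Minutes of downtime per period permitted by an N-nines SLA."""
    availability = 1 - 10 ** (-nines)
    return (1 - availability) * days * 24 * 60

for n in (3, 4, 6):
    print(f"{n} nines -> {allowed_downtime_minutes(n):,.1f} min/year")
```

Four nines permits about 52.6 minutes of downtime a year; the six nines bandied about earlier would permit barely 32 seconds, which no contract in the thread actually promises.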

latchkey · a year ago
Not just the fine print, but also look at how they present themselves. A provider with pictures of equipment and detailed specifications is always going to be more interesting than a provider with just a marketing website and a "contact us" page.
marcyb5st · a year ago
But it is about minimizing losses, not making profits.

If you read the article, such prices happen because a lot of companies bought hardware reservations for the next few years. Instead of keeping the hardware idle (since they pay for it anyway), they rent it out on the cheap to recoup something.

rajnathani · a year ago
From your bio, your company is Hot Aisle.

This company TensorWave covered by TechCrunch [0] this week sounds very similar, I almost thought it was the same! Anyway, best of luck, we need more AMD GPU compute.

[0] https://techcrunch.com/2024/10/08/tensorwave-claims-its-amd-...

latchkey · a year ago
Thanks! Definitely not the same at all.
tasuki · a year ago
> If you're trying to do the latter, you're not considering how you are risking your business on unreliable compute.

What do you mean by "risking your business on unreliable compute"? Is there a reason not to use one of these to train whatever neural nets one's business needs?

oefrha · a year ago
Well, someone who’s building a GPU renting service right now obviously wants to scare you into using expensive and “reliable” services; the market crashing is disastrous for them. The reality is high price is hardly an indicator of reliability, and the article very clearly explains why H100 hours are being sold at $2 or less, and it’s not because of certain providers lacking reliability.
lazide · a year ago
If it crashes half way through, you don’t get a useful model, and you’re still on the hook for the rental costs to get there maybe?
dx034 · a year ago
Since most applications aren't latency-sensitive, space and power can be nearly free if you set up the data center somewhere cold, with very cheap electricity and few people. That leaves the cost of infrastructure and connectivity, but I guess electricity prices shouldn't be the issue?
tonetegeatinst · a year ago
I'd think the cost of internet would be the big issue, even if you can afford the AI hardware.

In rural or low-population areas it takes forever for fiber to roll out, and if you're selling access to your hardware infrastructure you really want a direct connection to the nearest IX, so you can offer customers the best speed for accessing data; the IX is probably also one of the few places where you could get 400G or higher direct fiber. But if you're hooking up to an IX, chances are you're not an end user but an autonomous system, and you're already signing NDAs to peer with the other autonomous systems in the exchange and announce routes over BGP.

(Source: my old high school networking class, where I got sick of my shitty internet and looked into how I could get fiber from an exchange. I'm probably mistaken on some of this; it was years ago, and it may be wrong or outdated.)

serjester · a year ago
Ambient cooling can only go so far. At the end of the day, if you have a rack of GPUs using 6,000 watts per node, you're going to need some very serious active cooling regardless of your location. You'll save a little, but it's a small percentage of your overall costs.
foobiekr · a year ago
You should consider the possibility that one outcome is that no one is going to make money offering H100s.
fhars · a year ago
I think this is what they are insinuating with the "How the Bubble Burst" in the headline. You are not expected to make money if you have invested in a bursting bubble.
wolfgangK · a year ago
For training, doesn't checkpoint saving make high reliability a moot point? Why pay for 99.99999…% uptime when you can restart your training from the last/best checkpoint?
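The checkpoint-and-resume pattern this comment describes can be sketched with stdlib only. A real trainer would save model and optimizer state (e.g. with `torch.save`), but the control flow is the same; the file name and fake loss update here are illustrative:

```python
# Minimal sketch of checkpoint-and-resume: a crash costs you only the work
# done since the last checkpoint, not the whole run.
import json, os, tempfile

CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt_demo.json")

def train(total_steps, crash_at=None):
    """Run (or resume) a fake training loop, checkpointing every step."""
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(CKPT):                  # resume from last checkpoint
        with open(CKPT) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated node failure")
        state["step"] += 1
        state["loss"] *= 0.99                 # pretend the loss improves
        with open(CKPT, "w") as f:            # checkpoint after each step
            json.dump(state, f)
    return state

if os.path.exists(CKPT):                      # start the demo fresh
    os.remove(CKPT)
try:
    train(100, crash_at=60)                   # first run dies at step 60
except RuntimeError:
    pass
final = train(100)                            # resume; only steps 61-100 rerun
print(final["step"])                          # → 100
```

This is why training workloads tolerate hardware failures far better than serving workloads: the cost of a crash is bounded by the checkpoint interval, not by the SLA.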
scotty79 · a year ago
> There is absolutely no way anyone is going to be making any money offering $2 H100s unless they stole them and they get free space/power...

I think that's the point. Trying to buy and run H100s now either for yourself or for someone else to rent it is a terrible investment because of oversupply.

And prices you can get for compute are not enough to cover the costs.

acd10j · a year ago
Maybe their business model is running compute at a loss and stealing IP/code from people using the platform?
hnaccount_rng · a year ago
Can you elaborate on the cost basis? With how little could a very lean operation still make money?

I know that's basically impossible to answer generically, especially since the capital cost is likely already sunk, given that the GPUs are already paid for...

pico_creator · a year ago
Someone is losing money. It's elaborated in the article how and why this happens.

TL;DR: VC money is being burnt/lost.

shermantanktop · a year ago
Tons of VC money burned in pursuit of low-probability success. It’s no wonder that some people find it easier to scam VCs than it is to build a real business.
TechDebtDevin · a year ago
I've been saying this would happen for months. There (was) a giant arbitrage for data centers that already have the infra.

If you could get ahold of H100s and had an operational data center, you essentially had the keys to an infinite money printer at anything above $3.50/hr.

Of course, because we live in a world of efficient markets, that was never going to last forever. But they are still profitable at $2.00, assuming they have cheap electricity/infra/labor.

pico_creator · a year ago
Problem is, you can find some at $1.
startupsfail · a year ago
The screenshot there is 1xH100 PCIE, for $1.604. Which is likely promotional pricing to get customers onboarded.

With promotional pricing it can be $0 for qualified customers.

Note also, how the author shows screenshots for invites for private alpha access. It can be mutually beneficial for the data center to provide discounted alpha testing access. The developer gets discounted access, the data center gets free/realistic alpha testing workflows.

swyx · a year ago
original title I wrote for this piece was "$1 H100s" but I deleted it because even I thought it was so ridiculously low lol

but yes sfcompute home page is now quoting $0.95/hr average. wild.

electronbeam · a year ago
The real money is in renting infiniband clusters, not individual gpus/machines

If you look at lambda one click clusters they state $4.49/H100/hr

latchkey · a year ago
I'm in the business of mi300x. This comment nails it.

In general, the $2 GPUs are either PE/VC money losing money, long contracts, huge quantities, PCIe cards, slow (<400G) networking, or some other limitation, like unreliable uptime from some bitcoin miner that decided to pivot into the GPU space and has zero experience running these more complicated systems.

Basically, all the things that if you decide to build and risk your business on these sorts of providers, you "get what you pay for".

jsheard · a year ago
> slow (<400G) networking

We're not getting Folding@Home style distributed training any time soon, are we.

marcyb5st · a year ago
I agree with you, but as the article mentioned, if you need to finetune a small/medium model you really don't need clusters. Getting a whole server with 8/16x H100s is more than enough. And I also agree with the article when it states that most companies today are finetuning some version of Llama or other open-weights models.
pico_creator · a year ago
Exactly. It's covered in the article that there is segmentation happening by GPU cluster size.

If it's big enough for foundation model training from scratch, pricing holds at ~$3+; otherwise it drops hard.

Problem is, "big enough" is a moving goalpost now: what was big becomes small.

swyx · a year ago
so why not buy up all the little H100 deployments and string enough together for a cluster? seems like a decent rollup strategy?

of course it would still cost a lot to do... but if the difference is $2/hr vs $4.49/hr then there's some size where it makes sense

ranger_danger · a year ago
Last year we reached out to a major GPU vendor for a need to get access to a seven figure dollar amount worth of compute time.

They contacted (and we spoke with) several of the largest partners they had, including education/research institutions and some private firms, and could not find ANYONE that could accommodate our needs.

AWS also did not have the capacity, at least for spot instances since that was the only way we could have afforded it.

We ended up rolling our own solution with (more but lower-end) GPUs we sourced ourselves that actually came out cheaper than renting a dozen "big iron" boxes for six months.

It sounds like currently that capacity might actually be available now, but at the time we could not afford to wait another year to start the job.

chronogram · a year ago
If you were able to make do with cheaper GPUs, then you didn't need FP64 so you didn't need H100s in the first place right? Then you made the right choice in buying a drill for your screw work instead of renting a jackhammer even if the jackhammer would've seemed cooler to you at the time.
KeplerBoy · a year ago
Hardly anyone doing AI needs FP64, and yet they sell well.
ranger_danger · a year ago
> didn't need H100s

I think we're splitting hairs here, it was more about choosing a good combination of least effort, time and money involved. When you're spending that amount of money, things are not so black and white... rented H100s get the job done faster and easier than whatever we can piece together ourselves. L40 (cheaper but no FP64) was also brand new at the time. Also our code was custom OpenCL and could have taken advantage of FP64 to go faster if we had the devices for it.


wg0 · a year ago
> Collectively there are less than <50 teams worldwide who would be in the market for 16 nodes of H100s (or much more), at any point in time, to do foundation model training

At best 100, and this number will go down as many will fail to make money. Even among 100 traditional software development companies the success rate would be very low, and here we're talking about products that themselves work probabilistically all the way down.

pico_creator · a year ago
I'm quite sure there are more than 100 clusters, even. Though that would be harder to prove.

So yeah, it would be rough.

Der_Einzige · a year ago
I just want to observe that there are a lot of people paying huge amounts of money for consulting about this exact topic and that this article is jam packed with more recent and relevant information than almost any of these consultants have.
pico_creator · a year ago
Feel free to forward to the clients of those "paid consultants". Also, how do I collect my cut?
swyx · a year ago
author @pico_creator is in here actively replying in case you have any followups... I just did the editing


pico_creator · a year ago
Also: how many of those consultants have actually rented GPUs, used them for inference, or used them to finetune/train?
aurareturn · a year ago
I’m guessing most of them are advising Wallstreet on AI demand.
grues-dinner · a year ago
> For all the desperate founders rushing to train their models to convince their investors for their next $100 million round.

Has anyone actually trained a model worth all this money? Even OpenAI is struggling to staunch the outflow of cash. Even if you can get a profitable model (for what?), how many billion-dollar models does the world support? And everyone is throwing money into the pit and just hoping that there's no technical advance that obsoletes everything from under them, or commoditisation leading to a "good enough" competitor that does it cheaper.

I mean, I get that everyone and/or their investors have got the FOMO about not being the guys holding the AGI demigod at the end of the day. But from a distance it mostly looks like a huge speculative cash bonfire.

justahuman74 · a year ago
> For all the desperate founders rushing to train their models to convince their investors for their next $100 million round.

I would say Meta (though not a startup) has justified the expenditure.

By freely releasing Llama they undercut a huge swath of competitors who could get funded during the hype. Then when the hype dies they can pick up whatever the real size of the market is, with much better margins than if there were a competitive market. Watch as one day they stop releasing free versions and start rent-seeking on N+1.

grues-dinner · a year ago
Right, but that is all predicated on the idea that, when they get to the end, having spent tons of nuclear fuel, container shiploads of GPUs and whole national GDPs on the project, there will be some juice worth all that squeeze.

And even if AI as we know it today is still relevant and useful in that future, and the marginal value per training-dollar stays (becomes?) positive, will they be able to defend that position against lesser, cheaper, but more agile AIs? What will the position even be that Llama2030 or whatever will be worth that much?

Like, I know that The Market says the expected payoff is there, but what is it?

pico_creator · a year ago
Given their rising stock price trend, due to their moves in AI, it's definitely worth it for them.
mlinhares · a year ago
Given Meta hasn't been able to properly monetize WhatsApp, I seriously doubt they can monetize this.
jordwest · a year ago
> I get that everyone and/or their investors have got the FOMO about not being the guys holding the AGI demigod at the end of the day

Don't underestimate the power of the ego...

Look at their bonfire, we need one like that but bigger and hotter

bugbuddy · a year ago
I spit out my tea when I read your last sentence. You should consider standup comedy.
Aeolun · a year ago
Isn’t OpenAI profitable if they stop training right at this moment? Just because they’re immediately reinvesting all that cash doesn’t mean they’re not profitable.
Attach6156 · a year ago
And if they stop training right now their "moat" (which I think is only o1 as of today) would last a good 3 to 6 months lol, and then to the Wendy's it is.
0xDEAFBEAD · a year ago
This guy claims they are losing billions of dollars on free ChatGPT users:

https://nitter.poast.org/edzitron/status/1841529117533208936

elcomet · a year ago
Not everyone is doing LLM training. I know plenty of startups selling AI products for various image tasks (agriculture, satellite, medical...)
mark_l_watson · a year ago
Yes, a lot of the money to be made is in the middleware and application sides of development. I find even small models like Llama 3.2 2B to be extremely useful and fine tuning and integration with existing businesses can have a large potential payoff for smaller investments.
hackernewds · a year ago
Lots of companies have. Most recently, Character AI trained an internal model and raised over $100M early last year. They didn't release any benchmarks, since the founding team, including Noam, went to Google.
tonetegeatinst · a year ago
Pretty sure Anthropic has.
anshulbhide · a year ago
This reminds me of the boom and bust oil cycle as outlined in The Prize: The Epic Quest for Oil, Money & Power by Daniel Yergin.
swyx · a year ago
care to summarize key points for the class?
dplgk · a year ago
It seems appropriate, in this thread, to have ChatGPT provide the summary:

In The Prize: The Epic Quest for Oil, Money & Power, Daniel Yergin explains the boom-and-bust cycle in the oil industry as a recurring pattern driven by shifts in supply and demand. Key elements include:

1. Boom Phase: High oil prices and increased demand encourage significant investment in exploration and production. This leads to a surge in oil output, as companies seek to capitalize on the favorable market.

2. Oversupply: As more oil floods the market, supply eventually exceeds demand, causing prices to fall. This oversupply is exacerbated by the long lead times required for oil development, meaning that new oil from earlier investments continues to come online even as demand weakens.

3. Bust Phase: Falling prices result in lower revenues for oil producers, leading to cuts in exploration, production, and jobs. Smaller or higher-cost producers may go bankrupt, and oil-dependent economies suffer from reduced income. Investment in new production declines during this phase.

4. Correction and Recovery: Eventually, the cutbacks in production lead to reduced supply, which helps stabilize or raise prices as demand catches up. This sets the stage for a new boom phase, and the cycle repeats.

Yergin highlights how this cycle has shaped the global oil industry over time, driven by technological advances, geopolitical events, and market forces, while creating periods of both rapid growth and sharp decline.