Apparently ChatGPT went from zero to a million users in five days[1], at a cost of "single-digit cents per chat."[2]
For perspective, the likes of Instagram and Twitter took anywhere from a few months to a couple of years to get to a million users, all while doing less work to service a request for a "page."
Hats off to OpenAI for not falling over more often under a hug of death that doesn't seem to be letting up.
To put this in perspective: the outage was 52 minutes, OpenAI has roughly 375 employees, and ChatGPT launched just a few months ago.
At Google, the Cloud SQL dashboard was unavailable for around 12 hours a couple of weeks ago if I read this correctly:
https://status.cloud.google.com/incidents/xg2qrL1UuSJiPDZALJ...
Google has roughly 156,500 employees, and Google Cloud launched in 2008.
So when people say scaling is hard... it's not a solved problem, and you shouldn't be surprised when these things happen.
A solid theoretical foundation and more testing are better, of course.
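For a back-of-the-envelope sense of what those two outage durations mean in availability terms (measured over a 30-day window — my choice of window, not anything either company publishes):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def availability(downtime_minutes, window_minutes=MINUTES_PER_MONTH):
    """Fraction of the window the service was up."""
    return 1 - downtime_minutes / window_minutes

# The 52-minute ChatGPT outage vs. the ~12-hour Cloud SQL dashboard outage:
chatgpt = availability(52)         # ~0.9988, i.e. about "three nines"
cloud_sql = availability(12 * 60)  # ~0.9833, i.e. under "two nines"

print(f"ChatGPT:   {chatgpt:.4%}")
print(f"Cloud SQL: {cloud_sql:.4%}")
```

Neither number says much on its own, but it shows how quickly a single long incident eats a monthly availability figure.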
Not sure what "perspective" you are trying to put this in by offering these comparison points between OpenAI and Google. Are you trying to say that OpenAI is more reliable than Google?
For additional perspective, OpenAI has a very small number of highly similar offerings (neural net with API access), while Google has a huge host of very different offerings, from web indexing and search, to email, to file hosting, to video streaming, to cloud compute... etc. Google also has a vastly larger user pool by at least a couple orders of magnitude. Google's core services are extremely solid even at ridiculous scale and have few outages, any of which would be considered major news.
Google also operates all this at a profit, while OpenAI works at a deficit. Google has had to scale larger than nearly any other service while maintaining profitability, whereas OpenAI is more or less free to throw more compute power at problems, at any cost, to build valuation. Google has written entire programming languages to help it keep up at an unprecedented scale.
Comparing a single minor Google offering going down to an OpenAI outage isn't a fair comparison to either company. Yes, Google has, by your numbers, about 400 times the number of OpenAI employees. I'd be willing to bet that a single large Google service like YouTube handles more than 400 times the amount of compute, data, and traffic that OpenAI does. I wouldn't draw a comparison between the size and efficacy of the employees, but again, Google operates at a very different scale, and has operated at that scale very well.
Also anecdotally, half the time I've tried to use ChatGPT it's "at capacity" or throws an internal error, and I've also seen Dall-E unavailable even though I've barely tried to use it, so I wouldn't say that OpenAI service has been ironclad this whole time.
How Google, and most likely all businesses, should think about outages is more complex than being up or down. Is it just part of the product, like Slack threads yesterday? Is it affecting 1 in 2 requests or operations, or 1 in a million?
That's a completely bizarre comparison. No number of Android developers will lead to higher availability of Cloud SQL observability. The gross number of employees at an org is meaningless.
Presumably the data scientists at OpenAI would also not directly lead to higher availability of ChatGPT; it's just an order-of-magnitude comparison.
To put this in perspective, my hello world service has less than one employee working on it, was launched 10 years ago, and has roughly 5 minutes of downtime per year, when I restart the server.
I am available for hire to teach a thing or two to Google (and OpenAI). /s
We don't know anything about these outages and their nature. Maybe OpenAI's outage was caused by a typo somewhere in code, while Google had rolled out a system-wide monitoring + SQL change that had cascading issues and required many non-obvious steps and processing hundreds of TBs of data.
I mean, I've had outages that were 60 seconds long with an order of magnitude fewer people than OpenAI while serving more req/s than they do. What does that tell you? Nothing at all.
Internal people should care about why, but customers/users really don’t. We expect openai / Google to take steps to balance likelihood of outage with impact of outage, and plan accordingly.
So what this tells us, based on very limited data, is that this level of outage happens to even the biggest of companies.
Slightly funny but also slightly concerning: they were also having service issues a few days ago that resulted in ChatGPT giving me answers to prompts I wasn't submitting. I reproduced this 30 or so times just to see the different results, and it was interesting — everything from answers to questions about mental health, to marketing tips, to a request to write a thesis on social media and its negative effects on our society.
Of course, upon trying the next day, all was fine again and I was no longer able to reproduce.
This is precisely why they should really open source their model, so that anyone can download and run it on their own infrastructure — just like Google and others have done, leaving one free to run it on their own laptop (some models even without a GPU), on premises, or on any cloud provider's infrastructure. They can continue to provide a hosted service for their model, but they should allow it to be downloaded, just like BERT.
Don't get me wrong, I'd love to be able to run GPT3 and the subsequent finetuned versions thereof myself, but OpenAI has essentially no financial incentive to do so.
A few years ago, we could maybe lean on their open aspirations to get that done, but with the "limited profitability model" they've since adopted, I think that dream is mostly gone.
At least we still get the occasional treat like Whisper out of them.
OpenAI CEO Sam Altman's take is that they will only ever allow API access to their models, to avoid misuse.
I don't get HN's take with wanting everything open sourced. Some things are expensive to create and dangerous in the wrong hands. Not everything can and should be open sourced.
I see OpenAI as a 'company' that (eventually) wants to earn money with their models. The 'open' part is about commercializing it in such a way that everyone can use it, and about being open with the research and risks. But not open as in open-source.
I think from Apple’s POV, this is great news. Moore’s Law has been dead for years, and there has been no good reason to upgrade your devices until now. AI means it’s 1990 again, and you need to buy a new device every 18 months because the performance leap is so meaningful to the UX.
I'd agree if Apple were in the business of making datacentre infrastructure.
Nobody is running these large scale models on their personal devices.
Sure, some of the image generation tech is seeing personal use, so you'd have a point there, but these immense language models are something else entirely.
With all of these outages, you have to wonder if it's a lack of skilled engineers on their part, or if they simply don't have enough GPUs to keep the lights on all the time.
It's a reflection of demand, I imagine ChatGPT is still growing with the media attention + word of mouth. I've referred at least 30 friends to try it out, and many are still using it.
ChatGPT is currently the 4th most read page on English Wikipedia; earlier I noticed it was 2nd. This is the first time I've noticed it on their ranking list.
In every company I've worked at that does ML work, the data scientists building the models don't have a strong software delivery background. They aren't building a production system with fail safes and redundancy. Their focus is the model itself. Consequently, when they stand up services in front of these models they are not even close to "production grade". I'd hope that a company focused on ML (versus an enterprise just dipping their toes into ML) would have solved these problems, but I wonder if it's related.
They have first-class platform and systems engineers. Do you think one of the most desirable and well-funded software shops in the world wouldn't be able to get skilled engineers?
A friend that interviewed with them shared that they were a bit too smug about “running at google scale with much leaner but highly skilled staff” (paraphrasing here). Perhaps it starts becoming clear why you need that kind of staffing…
Many companies have skilled engineers and enough hardware, and still have outages. It's an expected part of running any software service. Outages can be minimized, but the idea that they can be eliminated with enough people and computing resources is not how things work, from my experience.
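One way to see why outages can't be engineered away: a request that touches many components in series multiplies their availabilities, so even very reliable pieces compound into nontrivial failure rates. A toy illustration (the numbers are illustrative, not anyone's real architecture):

```python
def chain_availability(per_component, n_components):
    """Availability of a request that must traverse n independent
    components in series, each up per_component of the time."""
    return per_component ** n_components

# Twenty 99.95%-available dependencies in a row:
overall = chain_availability(0.9995, 20)
print(f"{overall:.4f}")  # ~0.9900 -> roughly 1 in 100 requests hits a failure
```

More people and hardware shrink the per-component failure rate, but the compounding never goes to zero.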
The more realistic scenario is they charge money to use it. A few cents per query or so and you'll cut out almost all the traffic while still keeping it available to anyone making good use of it.
I just saw that the Azure OpenAI service has an SLA and OpenAI does not. I thought they would have separated the infrastructure for free ChatGPT users and paying API customers.
[1] https://twitter.com/sama/status/1599668808285028353 [2] https://www.afr.com/technology/chatgpt-takes-the-internet-by...
The Google SRE book is an excellent work: https://sre.google/sre-book/introduction/
Also, for some perspective: if you look at the OpenAI outages for Jan 2023, the uptime is worse than Google Cloud SQL's.
https://status.openai.com/uptime
Some chapters in the book include "Being On-Call", "Effective Troubleshooting" & "Postmortem Culture: Learning from Failure".
Which signals to me that a normal work culture, plus time, will improve stability at most companies.
They outsource infra to MS?
This is a non-sequitur. This statement has nothing to do with the outage. It’s just thrown in there to pad the number of words in the comment.
https://futurism.com/the-byte/amazon-begs-employees-chatgpt
I imagine the huge demand that ChatGPT is seeing would make any cloud vendor sweat if you were to suddenly lump it on top of the usual demand.
To me it's entirely unsurprising that OpenAI would have trouble keeping up. Good luck to them.