Apparently ChatGPT went from zero to a million users in five days[1], at a cost of "single-digit cents per chat."[2]
For perspective, the likes of Instagram and Twitter took anywhere from a few months to a couple of years to get to a million users, all while doing less work to service a request for a "page."
Hats off to OpenAI for not falling over more often under a hug of death that doesn't seem to be letting up.
To put this in perspective: the outage was 52 minutes, OpenAI has roughly 375 employees, and ChatGPT launched just a few months ago.
At Google, the Cloud SQL dashboard was unavailable for around 12 hours a couple of weeks ago if I read this correctly:
https://status.cloud.google.com/incidents/xg2qrL1UuSJiPDZALJ...
Google has roughly 156,500 employees, and Google Cloud launched in 2008.
So when people say scaling is hard... it's not a solved problem, and you shouldn't be surprised when these things happen.
A solid theoretical foundation and more testing are better, of course.
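For a back-of-the-envelope sense of what those two outage durations mean in availability terms (measured over a 30-day window — my choice of window, not anything either company publishes):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def availability(downtime_minutes, window_minutes=MINUTES_PER_MONTH):
    """Fraction of the window the service was up."""
    return 1 - downtime_minutes / window_minutes

# The 52-minute ChatGPT outage vs. the ~12-hour Cloud SQL dashboard outage:
chatgpt = availability(52)         # ~0.9988, i.e. about "three nines"
cloud_sql = availability(12 * 60)  # ~0.9833, i.e. under "two nines"

print(f"ChatGPT:   {chatgpt:.4%}")
print(f"Cloud SQL: {cloud_sql:.4%}")
```

Neither number says much on its own, but it shows how quickly a single long incident eats a monthly availability figure.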
Not sure what "perspective" you are trying to put this in by offering these comparison points between OpenAI and Google. Are you trying to say that OpenAI is more reliable than Google?
For additional perspective, OpenAI has a very small number of highly similar offerings (neural net with API access), while Google has a huge host of very different offerings, from web indexing and search, to email, to file hosting, to video streaming, to cloud compute... etc. Google also has a vastly larger user pool by at least a couple orders of magnitude. Google's core services are extremely solid even at ridiculous scale and have few outages, any of which would be considered major news.
Google also operates all this at a profit, while OpenAI works at a deficit. Google has had to scale larger than nearly any other service while maintaining profitability, whereas OpenAI is more or less free to throw more compute power at problems, at any cost, to build valuation. Google has written entire programming languages to help it keep up at an unprecedented scale.
Comparing a single minor Google offering going down to an OpenAI outage isn't a fair comparison to either company. Yes, Google has, by your numbers, about 400 times the number of OpenAI employees. I'd be willing to bet that a single large Google service like YouTube handles more than 400 times the amount of compute, data, and traffic that OpenAI does. I wouldn't draw a comparison between the size and efficacy of the employees, but again, Google operates at a very different scale, and has operated at that scale very well.
Also anecdotally, half the time I've tried to use ChatGPT it's "at capacity" or throws an internal error, and I've also seen Dall-E unavailable even though I've barely tried to use it, so I wouldn't say that OpenAI service has been ironclad this whole time.
How Google, and most likely all businesses, should think about outages is more complex than being up or down. Is it just part of the product, like Slack threads yesterday? Is it affecting 1 in 2 requests or operations, or 1 in a million?
That's a completely bizarre comparison. No number of Android developers will lead to higher availability of Cloud SQL observability. The gross number of employees at an org is meaningless.
Presumably the data scientists at OpenAI would also not directly lead to higher availability of ChatGPT; it's just an order-of-magnitude comparison.
To put this in perspective, my hello world service has less than one employee working on it, was launched 10 years ago, and has roughly 5 minutes of downtime per year, when I restart the server.
I am available for hire to teach a thing or two to Google (and OpenAI). /s
We don't know anything about these outages and their nature. Maybe OpenAI's outage was caused by a typo somewhere in code, while Google had rolled out a system-wide monitoring + SQL change that had cascading issues and required many non-obvious steps and processing hundreds of TBs of data.
I mean, I've had outages that were 60 seconds long with an order of magnitude fewer people than OpenAI while serving more req/s than they do. What does that tell you? Nothing at all.
Internal people should care about why, but customers/users really don’t. We expect openai / Google to take steps to balance likelihood of outage with impact of outage, and plan accordingly.
So what this tells us, based on very limited data, is that this level of outage happens to even the biggest of companies.
Slightly funny but also slightly concerning: they were also having service issues a few days ago that resulted in ChatGPT giving me answers to prompts I wasn't submitting. I reproduced this 30 or so times just to see the different results, and it was interesting — everything from answers to questions about mental health, to marketing tips, to a request to write a thesis on social media and its negative effects on our society.
Of course, upon trying the next day, all was fine again and I was no longer able to reproduce.
This is precisely why they should really open source their model, so that anyone can download and run it on their own infrastructure — just like Google and others have done, leaving one free to run it on their own laptop (some models even without a GPU), on premises, or on any cloud provider's infrastructure. They can continue to provide a hosted service for their model, but they should allow it to be downloaded, just like BERT.
Don't get me wrong, I'd love to be able to run GPT3 and the subsequent finetuned versions thereof myself, but OpenAI has essentially no financial incentive to do so.
A few years ago, we could maybe lean on their open aspirations to get that done, but with the "limited profitability model" they've since adopted, I think that dream is mostly gone.
At least we still get the occasional treat like Whisper out of them.
OpenAI CEO Sam Altman's take is that they will only ever allow API access to their models, to avoid misuse.
I don't get HN's take with wanting everything open sourced. Some things are expensive to create and dangerous in the wrong hands. Not everything can and should be open sourced.
I see OpenAI as a 'company' that (eventually) wants to earn money with their models. The 'open' part is about commercializing it in such a way that everyone can use it, and about being open with the research and risks. But not open as in open-source.
I think from Apple’s POV, this is great news. Moore’s Law has been dead for years, and there has been no good reason to upgrade your devices until now. AI means it’s 1990 again, and you need to buy a new device every 18 months because the performance leap is so meaningful to the UX.
I'd agree if Apple were in the business of making datacentre infrastructure.
Nobody is running these large scale models on their personal devices.
Sure, some of the image generation tech is seeing personal use, so you'd have a point there, but these immense language models are something else entirely.
With all of these outages, you have to wonder if it's a lack of skilled engineers on their part, or if they simply don't have enough GPUs to keep the lights on all the time.
It's a reflection of demand, I imagine ChatGPT is still growing with the media attention + word of mouth. I've referred at least 30 friends to try it out, and many are still using it.
ChatGPT is currently the 4th most read page on English Wikipedia; earlier I noticed it was 2nd. This is the first time I've noticed it on their ranking list.
In every company I've worked at that does ML work, the data scientists building the models don't have a strong software delivery background. They aren't building a production system with fail safes and redundancy. Their focus is the model itself. Consequently, when they stand up services in front of these models they are not even close to "production grade". I'd hope that a company focused on ML (versus an enterprise just dipping their toes into ML) would have solved these problems, but I wonder if it's related.
They have first-class platform and systems engineers. Do you think one of the most desirable and well-funded software shops in the world wouldn't be able to get skilled engineers?
A friend that interviewed with them shared that they were a bit too smug about “running at google scale with much leaner but highly skilled staff” (paraphrasing here). Perhaps it starts becoming clear why you need that kind of staffing…
Many companies have skilled engineers and enough hardware, and still have outages. It's an expected part of running any software service. Outages can be minimized, but the idea that they can be eliminated with enough people and computing resources is not how things work, from my experience.
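One way to see why outages can't be engineered away: a request that touches many components in series multiplies their availabilities, so even very reliable pieces compound into nontrivial failure rates. A toy illustration (the numbers are illustrative, not anyone's real architecture):

```python
def chain_availability(per_component, n_components):
    """Availability of a request that must traverse n independent
    components in series, each up per_component of the time."""
    return per_component ** n_components

# Twenty 99.95%-available dependencies in a row:
overall = chain_availability(0.9995, 20)
print(f"{overall:.4f}")  # ~0.9900 -> roughly 1 in 100 requests hits a failure
```

More people and hardware shrink the per-component failure rate, but the compounding never goes to zero.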
The more realistic scenario is they charge money to use it. A few cents per query or so and you'll cut out almost all the traffic while still keeping it available to anyone making good use of it.
I just saw that the Azure OpenAI service has an SLA and OpenAI does not. I thought they would have separated the infrastructure for free ChatGPT users and paying API customers.
[1] https://twitter.com/sama/status/1599668808285028353 [2] https://www.afr.com/technology/chatgpt-takes-the-internet-by...
The Google SRE book is an excellent work: https://sre.google/sre-book/introduction/
Also, for some perspective: if you look at the OpenAI outages for Jan 2023, the uptime is worse than Google Cloud SQL's.
https://status.openai.com/uptime
Some chapters in the book include "Being On-Call", "Effective Troubleshooting" & "Postmortem Culture: Learning from Failure".
Which signals to me that a normal work culture, plus time, will improve stability at most companies.
They outsource infra to MS?
This is a non-sequitur. This statement has nothing to do with the outage. It’s just thrown in there to pad the number of words in the comment.
https://futurism.com/the-byte/amazon-begs-employees-chatgpt
I imagine the huge demand that ChatGPT is seeing would make any cloud vendor sweat if you were to suddenly lump it on top of the usual demand.
To me it's entirely unsurprising that OpenAI would have trouble keeping up. Good luck to them.