Google Cloud Is Down - Readit News

Disclosure: I work on Google Cloud (but disclaimer, I'm on vacation and so not much use to you!).

We're having what appears to be a serious networking outage. It's disrupting everything, including unfortunately the tooling we usually use to communicate across the company about outages.

There are backup plans, of course, but I wanted to at least come here to say: you're not crazy, nothing is lost (to those concerns downthread), but there is serious packet loss at the least. You'll have to wait for someone actually involved in the incident to say more.

boulos · 6 years ago

To clarify something: this outage doesn’t appear to be global, but it is hitting us particularly hard in parts of the US. So for the folks with working VMs in Mumbai, you’re not crazy. But for everyone with sadness in us-central1, the team is on it.

digaozao · 6 years ago

It seems global to me. This is really strange compared to AWS. I don't remember an outage there (other than s3) impacting instances or networking globally.

captn3m0 · 6 years ago

Some services are still impacted globally. Gmail over IMAP is unreachable for me. (Edit: gmail web is fine)

ls612 · 6 years ago

I’m from the US and in Australia right now. Both me and my friends in the US are experiencing outages across google properties and Snapchat, so it’s pretty global.

the-rc · 6 years ago

Fiber cut? SDN bug that causes traffic to be misdirected? One or more core routers swallowing or corrupting packets?

falcon2_0 · 6 years ago

It seemed to be congestion in the North East US.

odiroot · 6 years ago

> including unfortunately the tooling we usually use to communicate across the company about outages.

There's some irony in that.

boulos · 6 years ago

Edit: and I agree!

I’m not in SRE so I don’t bother with all the backup modes (direct IRC channel, phone lines, “pagers” with backup numbers). I don’t think the networking SRE folks are as impacted in their direct communication, but they are (obviously) not able to get the word out as easily.

Still, it seems reasonable to me to use tooling for most outages that relies on “the network is fine overall”, to optimize for the common case.

Note: the status dashboard now correctly highlights (Edit: with a banner at the top) that multiple things are impacted because Networking. The Networking outage is the root cause.

Twirrim · 6 years ago

AWS experienced a major outage a few years ago that couldn't be communicated to customers because it took out all the components central to updating the status board. One of those obvious-in-hindsight situations.

Not long after that incident, they migrated it to something that couldn't be affected by any outage. I imagine Google will probably do the same thing after this :)

k_bx · 6 years ago

Even more irony: Google+ shown as working fine: https://i.imgur.com/52ACuiY.png

ohazi · 6 years ago

> including unfortunately the tooling we usually use to communicate across the company about outages.

So memegen is down?

ChuckMcM · 6 years ago

I'm guessing this will be part of the next DiRT exercise :-) (DiRT being the disaster recovery exercises that Google runs internally to prepare for this sort of thing)

bufferoverflow · 6 years ago

Well, lots of revenue is lost, that's for sure.

SmokeGS · 6 years ago

>nothing is lost

except time

stanfordkid · 6 years ago

Can't use my Nest lock to let guests into my house. I'm pretty sure their infrastructure is hosted in Google Cloud. So yeah... definitely some stuff lost.

sdan · 6 years ago

and a nice Sunday afternoon

countbackula · 6 years ago

And reputation. With this outage the global media socket is going to be in gCloud nine.

jussij · 6 years ago

and reputation.

Deleted Comment

foobarbazetc · 6 years ago

Seems to be the private network. The public network looks fine to us from all over the world?

nodesocket · 6 years ago

Not on my end. Public access in us-west2 (Los Angeles) is down for me.

Double_a_92 · 6 years ago

> but there is serious packet loss at the least.

Can confirm with Gmail in Europe. Everything works but it's sluggish (i.e. no immediate reaction on button clicks).

gingabriska · 6 years ago

We are also hosted on GCP but nothing is down for us. We are using 3 regions in US and 2 in EU.

breadandcrumbel · 6 years ago

What could be the reason for the outage? Could it be a cyberattack on your servers?

ikiris · 6 years ago

go/stopleaks :)

foota · 6 years ago

Hm, isn't releasing go links publicly also verboten? :)

123jay7 · 6 years ago

This happened to Amazon S3 as well once. The "X" image they use to indicate a service outage was served by... yup, S3, which was down obviously.

hinkley · 6 years ago

One of the projects I worked on was using data URIs for critical images, and I wouldn’t trust that particular team to babysit my goldfish.

Sounds like Google and Amazon are hiring way too many optimists. I kinda blame the war on QA for part of this, but damn that’s some Pollyanna bullshit.

dosy · 6 years ago

You're brave to jump on here when on holiday!

Shouldn't that outage system be aware when service heartbeats stop?

Could this be a solar flare?

You know this reminds me of a bad taste that Google Sales team left when I asked for some of my billing that I was unaware of running after following a quickstart guide.

AWS refunded me in the first reply on the same day!

GCP sales rep just copy pasted a link to a self support survey that essentially told me, after a series of YES or NO questions that they can't refund me.

So why not just tell your customers like it is? Google Cloud is super strict when it comes to billing. I have called my bank to do a chargeback and put a hold on all future billing with GCP.

I'm now back to AWS and still on a Free Tier. Apparently the $300 Trial with Google Cloud did not include some critical products. The AWS Free Tier makes it super clear, and even then I sometimes leave something running and discover it in my invoice...

I've yet to receive a reply from Google and it's been a week now.

I do appreciate other products such as Firebase but honestly for infrastructure and for future integration with enterprise customers I feel AWS is more appropriate and mature.

mcintyre1994 · 6 years ago

The thing that worries me most about Google Cloud and these billing stories is that I’m assuming if you chargeback or block them at your bank then they’ll ban all Google accounts of yours - and they’re obviously going to be able to make the link between an account made just for Google Cloud and my real account.

ganeshkrishnan · 6 years ago

They WILL absolutely block and suspend all accounts indefinitely. They have terminated accounts for failed credit card transactions.

I really wanted to try out their new autoML but I was paranoid of entering my credit card and getting banned from Google

lucb1e · 6 years ago

Are you seriously complaining about having to pay for using their resources? I understand that you're surprised some things aren't covered in the free trial or free credit or whatever, but getting $300 free already sounded a little too good to be true (I heard about it from a friend and was dubious; at least in Europe, consumers are told not to enter deals that are too good to be true), you could at least have checked what you're actually getting.

I think it's weird to say you get credit in dollars and then not be able to spend it on everything. That's not how money works. But that's the way hosting providers work and afaik it's quite well known. Especially with a large sum of "free money", even if it's not well known, it was on you to check any small print.

mcherm · 6 years ago

> Are you seriously complaining about having to pay for using their resources?

I didn't read it that way. I thought they were complaining about poor customer service that made it difficult to understand the bill or respond to it appropriately.

kerng · 6 years ago

Google is well known for not caring about small shops. Only if you are a multi-million-dollar customer with a dedicated account manager can you expect reasonable support. That's been the case forever with them.

fredthomsen · 6 years ago

Does Amazon treat smaller customers any better? I am genuinely asking, as I have no clue.

bscphil · 6 years ago

>I asked for some of my billing that I was unaware of running

>I have called my bank to do a chargeback

You're issuing a chargeback because you made a mistake and spent someone else's resources? And you're admitting to this on HN? I'm not a lawyer, but that sounds like fraud and / or theft to me.

prepend · 6 years ago

It’s not, read the terms of your credit card. It’s basically “I didn’t intend to buy this. I tried in good faith to contact the merchant for return and support. I was ignored. I’m contacting you.”

It’s pretty convenient for companies like Comcast and Google that have poor customer service.

Can_Not · 6 years ago

GCE charged me for "Chinese egress" but doesn't provide me a way to block China via firewall or other methods. They have the ability to check and bill me for it but if I want to use the same logic for a firewall rule I'm on my own. That sounds like theft and or fraud to me.

OP sounds like they're just defending themselves from ambiguous draconian billing robots.

espeed · 6 years ago

What was the quickstart guide?

Deleted Comment

WC3w6pXxgGd · 6 years ago

Anything created in-house at Google (GCP) is typically created by technically-proficient devs, those devs then leave the project to start something new and maintenance is left to interns and new hires. Google customer service basically doesn't care and also has no tools at their disposal to fix any issues anyway.

The infinite money spout that is Google Ads has created a situation in which devs are at Google just to have fun - there really is no incentive to maintain anything because the money will flow regardless of quality.

Source: I interned at Google.

marcinzm · 6 years ago

Isn't it also that promotions at Google are based on creating new products/projects rather than maintaining existing ones? So engineers have a negative incentive to maintain things since it costs them promotions.

panopticon · 6 years ago

From what I’ve been told, the issue is that the people with political capital (managers, PMs, etc) are quick to move after successful launches and milestones. No matter how many competent engineers hang around, the product/team becomes resource and attention starved.

kerng · 6 years ago

I'm not sure why you are downvoted - seems like a reasonable insight and explanation for the drop in quality and weird decisions Google is making recently.

Deleted Comment

Yrlec · 6 years ago

Now is a good time to point out that the SLA of Google Cloud Storage only covers HTTP 500 errors: https://cloud.google.com/storage/sla. So if the servers are not responding at all then it's not covered by the SLA. I've brought this to their attention and they basically responded that their network is never down.
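To see why that distinction matters, here is a minimal sketch in Python. It assumes the reading described above: only responses with HTTP status 500 count as errors, and requests that never get a response do not. The function name, the sample data, and the choice to keep unanswered requests in the denominator are illustrative assumptions, not taken from the SLA text.

```python
from typing import List, Optional


def error_rate(outcomes: List[Optional[int]], count_no_response: bool) -> float:
    """Fraction of requests counted as errors.

    Each outcome is an HTTP status code, or None if the request never got
    a response (timeout, connection refused, packet loss).
    """
    if not outcomes:
        return 0.0
    errors = 0
    for status in outcomes:
        if status is None:
            # Hypothetical switch: does "no response at all" count as an error?
            if count_no_response:
                errors += 1
        elif status == 500:
            errors += 1
    return errors / len(outcomes)


# Made-up sample: 3 successes, one HTTP 500, six requests with no response.
sample = [200, 200, 500, None, None, None, None, 200, None, None]

print(f"only HTTP 500 counted:    {error_rate(sample, False):.0%}")  # 10%
print(f"no-response also counted: {error_rate(sample, True):.0%}")   # 70%
```

Under these assumptions, the same ten requests look like a 10% error rate or a 70% error rate depending on whether an unreachable server counts.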

crazygringo · 6 years ago

Ironically I can't read that page because, since it's Google-hosted, I'm getting an HTTP 500 error... but which means at least that service is SLA-covered...

Cloud services live and die by their reputation, so I'd be shocked if Google ever tried to get out of following an SLA contract based on a technicality like that. It would be business suicide, so it doesn't seem like something to be too worried about?

based2 · 6 years ago

https://www.reddit.com/r/sysadmin/comments/bw1gye/most_googl...

https://www.zdnet.com/article/some-internet-outages-predicte... 768k Day

_Marak_ · 6 years ago

This should be voted higher up.

According to https://twitter.com/bgp4_table, we have just exceeded 768k Border Gateway Protocol routing entries, which may be causing some routers to malfunction.

dreamer_soul · 6 years ago

Isn't it weird that it's happening now even though that number was surpassed nearly a month and a half ago?

juanuys · 6 years ago

Will this affect more than just Google? I haven't seen any outages from other cloud providers.

namibj · 6 years ago

packet.net was hit. Specifically, also their San Jose DC. Internet only. It took more than 20 minutes but less than an hour to recover. I didn't ping it continuously, but I can say that the traceroute got stuck in Frankfurt (where my ISP and their ISP first meet, as seen from me).

I was actually surprised, as they tend to have excellent networking. Now I'm not nearly as distrusting as I was initially, knowing it was likely their ISP getting screwed by routing table overflow.

tntn · 6 years ago

There goes 3 nines for June and for Q2. I guess everyone gets a 10% discount for the month? https://cloud.google.com/compute/sla
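For scale, a quick sketch of the arithmetic in Python. It assumes "three nines" means 99.9% availability, measured over a 30-day June and a 91-day Q2 (April through June); the function name is made up for illustration, and how the SLA actually measures downtime may differ.

```python
def downtime_budget_minutes(availability_pct: float, days: int) -> float:
    """Minutes of downtime allowed over `days` days at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


june = downtime_budget_minutes(99.9, 30)
q2 = downtime_budget_minutes(99.9, 30 + 31 + 30)

print(f"June budget at 99.9%: {june:.1f} minutes")  # 43.2 minutes
print(f"Q2 budget at 99.9%:   {q2:.1f} minutes")    # 131.0 minutes
```

So an outage lasting a few hours uses up the whole quarter's budget under this reading, not just the month's.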

OkGoDoIt · 6 years ago

Remember to request the credit!

From that linked page:

"Customer Must Request Financial Credit

In order to receive any of the Financial Credits described above, Customer must notify Google technical support within thirty days from the time Customer becomes eligible to receive a Financial Credit. Customer must also provide Google with server log files showing loss of external connectivity errors and the date and time those errors occurred. If Customer does not comply with these requirements, Customer will forfeit its right to receive a Financial Credit. If a dispute arises with respect to this SLA, Google will make a determination in good faith based on its system logs, monitoring reports, configuration records, and other available information, which Google will make available for auditing by Customer at Customer’s request."

gundmc · 6 years ago

A couple more hours and everyone will get 25% off for June.

twistedpair · 6 years ago

Does that apply to the rest of June?

Might be a good month to rebuild all your models ;)

CamelCaseName · 6 years ago

Ironically, the SLA page returns a 502 error.

londons_explore · 6 years ago

The discount seems way too small.

I would pay a premium for a cloud provider happy to give 100 percent discount for the month for 10 minutes downtime, and 100 percent discount for the year for an hour's downtime.

Any cloud provider offering those terms would go out of business VERY quickly. Outages happen, all providers are incentivized to minimize the frequency and severity of disruptions - not just from the financial hit of breaching SLA (which for something like this will be significant), but for the reputational damage which can be even more impactful.

Johnny555 · 6 years ago

Just take the premium that you'd be willing to pay and put it in the bank -- the premium would be priced such that the expected payout of the premium would be less than or equal to what you'd be paying.

Besides, a provider credit is the least of most companies' concerns after an extended outage, it's a small fraction of their remediation costs and loss of customer goodwill.
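The pricing argument can be made concrete with a rough sketch in Python. Everything here is a made-up assumption for illustration: a $1000 base bill, a 5% chance per month that the provider breaches the downtime threshold, and a refund that covers the whole amount paid (base bill plus premium). The break-even premium is the one where the expected refund equals the premium itself.

```python
def break_even_premium(base_bill: float, p_breach: float) -> float:
    """Premium at which the expected refund equals the premium paid.

    Solves premium = p_breach * (base_bill + premium) for premium.
    """
    if not 0 <= p_breach < 1:
        raise ValueError("p_breach must be in [0, 1)")
    return p_breach * base_bill / (1 - p_breach)


base_bill = 1000.0  # hypothetical monthly bill
p_breach = 0.05     # hypothetical chance of a breach in a given month

premium = break_even_premium(base_bill, p_breach)
expected_refund = p_breach * (base_bill + premium)

print(f"break-even premium: {premium:.2f}")          # 52.63
print(f"expected refund:    {expected_refund:.2f}")  # 52.63
```

A provider that wants to stay in business would charge at least this much, so on average the customer gets back no more than the premium they paid.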

ddalex · 6 years ago

> The discount seems way too small.

> I would pay a premium for a cloud provider happy to give 100 percent discount for the month for 10 minutes downtime, and 100 percent discount for the year for an hour's downtime.

It takes a lot of effort (exponential) to reliably (i.e. designed to keep working when components fail) build something that is guaranteed to have this level of uptime at these penalties.

So I'm sure that I can build something that works like this, but would you pay me $100 per GB of storage per month? $100 per wall-time hour of CPU usage? $100 per GB of Ram used per hour? Because these are the premium prices for your specs.

ksajadi · 6 years ago

GCP status page is worthless as it's always happy and green when production systems are down and then they might acknowledge something an hour later

JimboOmega · 6 years ago

Just like AWS, then. "Some users are experiencing increased error rates" = "Everything has been down for hours"

bsimpson · 6 years ago

"Everything is fine, unless you're Carl. There's a massage outage, but only at Carl's house. Sorry, Carl."

djsumdog · 6 years ago

I remember when S3 was down and the status was green because the updates for the status page were pushed via S3.

Analemma_ · 6 years ago

Azure too. During the most recent outage a couple weeks ago their Twitter account acknowledged the incident an hour before the status page did.

So no matter where you go for your cloud services, you're guaranteed a useless status page. Yippee.

actuator · 6 years ago

AWS is no better. Something from 2015 I remember: https://twitter.com/SIGKILL/status/630684777813684224?s=19

duxup · 6 years ago

I swear most status pages are run by folks who aren't "there".

It’s an easy problem to fix as basically services should emit performance data, openly, and the status page should just summarize that. So if a service doesn’t report out, it’s assumed down or erroring out.

Having an excel file where people enter statuses is not very useful to me as a customer. That’s more like a blog.

I haven’t written a status page in a while, but the rest of my infrastructure starts freaking out if it hasn’t heard from a service in a while. Why doesn’t their status page have at least a warning about things not looking good?
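A minimal sketch of that idea in Python, assuming each service pushes a periodic heartbeat with an ok/not-ok flag. The class name, status strings, and 60-second staleness threshold are made up for illustration; this is not how any provider's status page is known to work.

```python
import time
from typing import Dict, Optional


class HeartbeatBoard:
    """Derives each service's status from its most recent heartbeat."""

    def __init__(self, stale_after_s: float = 60.0) -> None:
        self.stale_after_s = stale_after_s
        self._last_seen: Dict[str, float] = {}
        self._last_ok: Dict[str, bool] = {}

    def report(self, service: str, ok: bool, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` can be injected for testing."""
        self._last_seen[service] = time.time() if now is None else now
        self._last_ok[service] = ok

    def status(self, service: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        last = self._last_seen.get(service)
        # Silence is treated as failure: no recent report means "assumed down".
        if last is None or now - last > self.stale_after_s:
            return "assumed down"
        return "up" if self._last_ok[service] else "erroring"


board = HeartbeatBoard(stale_after_s=60.0)
board.report("storage", ok=True, now=0.0)
print(board.status("storage", now=30.0))   # up
print(board.status("storage", now=120.0))  # assumed down
```

The design choice that matters is the default: a service that stops reporting turns red on its own instead of staying green until a human edits the page.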

emilfihlman · 6 years ago

Digitalocean has the same issue: status pages are actually manually updated and no live data is fed into them.

intellix · 6 years ago

Was noticing massive issues earlier and thought that maybe my account was blocked for breaching the TOS, as I was heavily playing with Cloud Run. Then I noticed gitlab was also acting up but my Chinese internet was still surprisingly responsive. Tried the status page, which said everything was fine, and searched Twitter for "google cloud" and also found nobody talking about it. Typically Twitter is the single source of truth for service outages as people start talking about it.

lionradio · 6 years ago

I think this might be a static page they are hosting on Akamai?

Godel_unicode · 6 years ago

Doesn't look like it:

https://www.whoishostingthis.com/#search=status.cloud.google...

domenici2000 · 6 years ago

They update the page manually.

colinbartlett · 6 years ago

Google Cloud is the number 4 most monitored status page on StatusGator and Google Apps is number 12. In addition, at least 20 other services we monitor seemingly depend on Google Cloud because they all posted issues as soon as Google went down.

It's always interesting to see these outages at large cloud providers spider out across the rest of the internet, a lot of the world depends on Google to stay up.

nabla9 · 6 years ago

This feels like the '80s.

When the mainframe is down terminals are useless.

pmlnr · 6 years ago

Yep. The cloud is just a lot of cheap hardware acting together as a shitty mainframe.

hhs · 6 years ago

"a lot of the world depends on Google to stay up."

Yup, I'm trying to check the Associated Press News right now and it's having trouble connecting to "storage.googleapis.com".

FPGAhacker · 6 years ago

I guess we know what steam uses (the store at least).

I don't know about Steam, but I know Apple must use Google Cloud: https://www.apple.com/support/systemstatus/

xNevo · 6 years ago

No issues for me. Maybe they have a failover mechanism?

hazeii · 6 years ago

...and only the paranoid survive?

MrMorden · 6 years ago

Just because you're paranoid, it doesn't mean they're not out to get you.

macintux · 6 years ago

And thus was ruined hundreds or thousands of pleasant Sunday afternoons.

I don’t miss being on pager duty one bit. I see it looming in my headlights, sadly.

xerxes901 · 6 years ago

Spare a thought for the pleasant Australian early Monday mornings too! Always a rude awakening...

newsbinator · 6 years ago

It's the Queen's birthday, a Monday off here in New Zealand...

... but not for everybody now.

Scoundreller · 6 years ago

So what happens when the crown changes? They change the holiday? Immediately? For the next year? Sounds like a bit of a nightmare.

jacques_chester · 6 years ago

The only response is to wait for Google to fix it.

Nothing you or I or the pager can do will speed that up.

I am aware some bosses won't believe that and I am not trying to make light of it. But there really isn't much else to do except wait.

warp_factor · 6 years ago

Either you wait for Google, or you are frantically trying to move everything you've got to AWS.

jagtesh · 6 years ago

Multi-cloud for those times when you really need that level of availability and can afford it.

jjeaff · 6 years ago

It's not even about being able to afford it. Some things just don't lend themselves to hot failover. If your data throughput is high, it may not be feasible or possible to stream a redundant copy to a data center outside the network.

bowmessage · 6 years ago

Do you work at G?

Nope. I was more thinking of everyone else.