palcu commented on Ask HN: GCP Outage?    · Posted by u/grilledchickenw
palcu · 2 months ago
palcu commented on Google Cloud Incident Report – 2025-06-13   status.cloud.google.com/i... · Posted by u/denysvitali
PNewling · 3 months ago
Not sure if you'll get an answer (I'd be interested in a response as well), but from the blog in their profile it looks like they moved to be a 'member of technical staff working in the AI Reliability Engineering (AIRE) team at Anthropic'. So it might have just been an upward move to something different/more-exciting.
palcu · 3 months ago
I confirm. I’m still doing reliability, but in a fun and exciting way for Claude and Anthropic.
palcu commented on Google Cloud Incident Report – 2025-06-13   status.cloud.google.com/i... · Posted by u/denysvitali
btown · 3 months ago
From the OP:

> a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds

If there’s a root cause here, it’s that “given the global nature of quota management” wasn’t seen as a red flag that “quota policy changes must use the standard gradual rollout tooling.”

The baseline can’t be “the trend isn’t worsening;” the baseline should be that if global config rollouts are commonly the cause of problems, there should be increasingly elevated standards for when config systems can bypass best practices. Clearly that didn’t happen here.

palcu · 3 months ago
One of the problems has been that most users have requested that quotas get updated as fast as possible and that they should be consistent across regions, even for global quotas. As such people have been prioritising user experience rather than availability.

I hope the pendulum swings the other way around now in the discussion.

[disclaimer: I worked as a GCP SRE for a long time, but left recently]

palcu commented on Coreweave S-1   sec.gov/Archives/edgar/da... · Posted by u/calvinfo
TheAlchemist · 6 months ago
"For the year ended December 31, 2022, our largest customer accounted for 16% of our revenue. For the years ended December 31, 2023 and 2024, our largest customer was Microsoft, which accounted for 35% and 62% of our revenue, respectively."

Wow, I didn't know they were basically a MSFT subsidiary. Who are their other customers?

palcu · 6 months ago
palcu commented on Tell HN: Merry Christmas    · Posted by u/LorenDB
linsomniac · 2 years ago
May your pagers be silent.
palcu · 2 years ago
And the queries flow through.
palcu commented on Downfall Attacks   downfall.page/... · Posted by u/WalterSobchak
bironran · 2 years ago
Worth noting that GCP has this patched (https://cloud.google.com/support/bulletins#gcp-2023-024).
palcu · 2 years ago
My adjacent teams in London who work in SRE on Google Cloud (GCE) got some well deserved doughnuts today for rolling out the patches on time.
palcu commented on TweetDeck is falling apart after Twitter’s rate-limiting fiasco   theverge.com/2023/7/3/237... · Posted by u/mfiguiere
palcu · 2 years ago
Yeah yeah, they've broken the old TweetDeck. You need to wait for the pop-up to ask you to transition to the new TweetDeck. Or, search the internet for the JavaScript variable you have to change in your console.

The more important issue is that they've removed the Activity feed, where you could see likes from other people. It was like a realtime feed of what your friends were doing on the website. The website is way more boring now.

palcu commented on Google Cloud region currently down due to water intrusion   status.cloud.google.com/i... · Posted by u/kalabilla
throwbigdata · 2 years ago
Who trains the trainers?
palcu · 2 years ago
Life and experience, if you're looking for a short answer. For example, last year we had an outage in London[0] and the folks who worked on it learnt a lot. Now, they applied the learnings in this incident.

[0]: https://news.ycombinator.com/item?id=32161755

palcu commented on Google Cloud region currently down due to water intrusion   status.cloud.google.com/i... · Posted by u/kalabilla
Waterluvian · 2 years ago
Would you be able to comment a bit on the emotional (perhaps there’s a better word) aspect of the response?

Was there a lot of anxiety? Panic? Or was it just a “woof that sucks. Time to follow a checklist and then do a bunch of paperwork”?

What I’m curious about is what it feels like on a team at a company like Google when there is a major system failure.

palcu · 2 years ago
There's not much emotion as the core team working on the huge outages is more like an "SRE for SRE". They are all people who've been with the company for a long time and they've been in the secondary seat for at least one previous big rodeo. Not to mention that we're all running a checklist that has been exercised multiple times and there's always somebody on the call who could help if a step fails.

Personally, I wasn't part of the actual mitigation of the overall Paris DC recovery this time, as I was busy with an unfortunate[0] side effect of the outage. These generate more anxiety, as being woken up at 6am and being told that nobody understands exactly why the system is acting this way is not great. But then again, we're trained for this situation and there are always at least several ways of fixing the issue.

Finally, it's worth repeating that incident management is just a part of the SRE job and after several years I've understood that it is not the most important one. The best SREs I know are not great when it comes to a huge incident. But their work has avoided the other 99 outages that could have appeared on the front page of Hacker News.

[0]: https://news.ycombinator.com/item?id=35734224

palcu commented on Google Cloud region currently down due to water intrusion   status.cloud.google.com/i... · Posted by u/kalabilla
palcu · 2 years ago
[disclaimer: SRE @ Google, I was involved with the incident, obvious conflicts of interest]

Hey Dang, thanks for cleaning up the thread. One thing to note is that the title is not correct. The entire region is not currently down, as the regional impact was mitigated as of 06:39 PDT, per the support dashboard (though I think it was earlier). The impact is currently zonal (europe-west9-a), so having zone in the title as opposed to region would reflect reality more closely.

Finally, there's lots of good feedback on this thread and on the previous one (https://news.ycombinator.com/item?id=35711349), so we obviously have a lot of lessons to learn.

u/palcu

Karma: 277 · Cake day: November 13, 2011
About
Member of Technical Staff, AI Reliability Engineering (AIRE) at Anthropic

Previously: SRE at Google Cloud (taking care of GCE and GCP) and Forward Deployed Engineer at Palantir.

https://www.palcu.net/

[ my public key: https://keybase.io/palcu; my proof: https://keybase.io/palcu/sigs/425bZ1Ip-RwE52Uws06rLF54qq21RU2pQQyi8cXICv8 ]
