ndneighbor (u/ndneighbor)

ndneighbor commented on Railway (PaaS) global outage status.railway.com... · Posted by u/TealMyEal

vintagedave · a month ago

> We rolled out a change to update our fraud model, and that uses workload fingerprinting

> Since, in all likelihood, your projects are similarly structured...

Thanks for the info. For what it's worth and to inform your retrospective, this included:

* A WordPress frontend, with just a few posts, minimal traffic -- but one that had been posted to LinkedIn yesterday

* A Docusaurus-generated static site. Completely static.

* A Python server where workload would show OpenAI API usage, with consistent behavioural patterns for at least two months (and I strongly doubt its patterns differ from those of any other hosted service that calls OpenAI).

These all seem pretty different to me. Some that _are_ similarly structured (eg a second Python OpenAI-using server) were not killed.

Some things come to mind for your post-mortem:

* If 3% of your services were affected, does that match your expected fraud rate? That is an awful lot of customers to take down in one go, and you'd want to be very accurate in your modeling. I can't see how you'd plan to kill that many without false positives and negative media.

* I'm speaking only for myself but I cannot understand what these three services have in common, nor how at least 2/3 of them (WordPress, static HTML) could seem anything other than completely normal.

* How or why were customers not notified? I have used services before where if something seemed dodgy they would proactively reach out and say 'tell us if it's legit or in 24 hours it will be shut down' or for something truly bad, eg massive CPU usage affecting other services, they'd kill it right away but would _tell you_. Invisible SIGTERMS to random containers we find out about the hard way seems the exact opposite of sensible handling of supposedly questionable clients.

ndneighbor · a month ago

We have more info coming soon but I think the best way to frame this is actually working backwards and then explain how it impacted yours and other services.

So Railway (and other cloud providers) deal with fraud near constantly. The internet is a bad and scary place and we spend maybe a third to half of our total engineering cycles just on fraud/up-time related work. I don't wanna give any credit to anyone from script kiddies to hostile nation states, but we (and others) are under near-constant bombardment from crap workloads in the form of traffic, or not-great CPU cycles, or sometimes, more benignly, movie pirating.

Most cloud providers understandably don't like talking about it because, ironically, the more they talk about it, the more the bad actors get a kick out of seeing the chaos they cause. Begin the vicious cycle...

This hopefully answers:

> If 3% of your services were affected, does that match your expected fraud rate? That is an awful lot of customers to take down in one go, and you'd want to be very accurate in your modeling. I can't see how you'd plan to kill that many without false positives and negative media.

In our 5-year history, this is the third abuse-related major outage. One was a nation-state DDoS, one was a coordinated denial. This is the first one where a false positive took down services automatically. We tune our detection constantly, so it's not really an issue, except when it is.

So, with that background: we tune our boxes of, let's say, "performance" rules constantly. When we see bad workloads, or bad traffic, we have automated systems that "discourage" that use entirely.

We updated those rules because we detected a new pattern, and when we rolled the update out, that's when we nailed the legit users. Since this used the abuse pattern, it didn't show on your dash, hence the immediate gaslighting.

Which leads to the other question:

> How or why were customers not notified? I have used services before where if something seemed dodgy they would proactively reach out and say 'tell us if it's legit or in 24 hours it will be shut down' or for something truly bad, eg massive CPU usage affecting other services, they'd kill it right away but would _tell you_.

We don't want to tell fraudulent customers if they are effective or not. For this instance, it was a straight-up logic bug in the heuristics match. But we have done this for our whole existence: for example, black-holing illegitimate traffic first, then banning. We did this because some coordinated actors would deploy, get banned with a stated "reason", and then turn to backup accounts once they found that whatever they were doing was working. If you knew where to look, sometimes they would brag on their IRCs/Discords.

Candidly, we don't want to be transparent about this, but with user impact like this, it's the least we can do. Zooming out, macro-wise, this is why Discord and other services are leaning towards ID verification, and it's hard for people on the non-service-provider side to appreciate the level of garbage out there on the internet. That said, that is an excuse. We shovel that so that you can do your job, and if we stop you, then that's on us, which we own and will hopefully do better about.

That said, you and others are understandably miffed (understatement), and all we can do is work through our actions to rebuild trust.

ndneighbor commented on Railway (PaaS) global outage status.railway.com... · Posted by u/TealMyEal

vintagedave · a month ago

Multiple services are receiving SIGTERM or shutdown signals. See dozens of support messages here: https://station.railway.com/questions/services-down-799f7bc1

Here's a sample log entry:

> 2026-02-11T14:35:11.916787622Z [err] 2026/02/11 14:35:03 [notice] 1#1: signal 15 (SIGTERM) received, exiting

I've had about one third of my Railway services affected. I had no notification from Railway, and logging in showed each affected service as 'Online', even though it had been shut down.

I'm pretty annoyed. I am hosting some key sites on Railway. This is not their first outage recently, and one time a couple of months ago was just as I was about to give our company owner a demo of the live product.
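For anyone who wants their own record of a termination like the one in the sample log, here is a minimal Python sketch of a SIGTERM handler. It is only an illustration: the names `handle_sigterm`, `install_handler`, and `shutdown_requested` are made up for this example, and it says nothing about how Railway delivers or reports the signal.

```python
import signal
import sys
import threading

# Set when a termination signal arrives, so the main loop can exit cleanly.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Log the signal to stderr and flag the process for shutdown."""
    name = signal.Signals(signum).name
    print(f"signal {signum} ({name}) received, shutting down",
          file=sys.stderr, flush=True)
    shutdown_requested.set()


def install_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)
```

Calling `install_handler()` at startup and checking `shutdown_requested.is_set()` in the service's main loop gives the service its own log line, independent of what a dashboard shows.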

ndneighbor · a month ago

Hey there Dave, Angelo from Railway here-

First off, super duper sorry. It's sometimes a good/bad thing if I can remember someone's handle, and I specifically remember the support thread where we did have an outage before your demo :| The number one goal for us is to deliver a great product. Number two is that we should never embarrass a user, and outages do exactly that.

We just wrapped up the post mortem, which will be published soon and explains why the dashboard was reporting the state of the application incorrectly. We would be more than happy to credit you for the impact to keep your business. That said, I totally understand if two outages is way too much impact for your services.

ndneighbor commented on Railway (PaaS) global outage status.railway.com... · Posted by u/TealMyEal

iJohnDoe · a month ago

Affected by the outage since about 6:15 AM PT this morning. We're still down as of 9:00 AM PT.

Our existing containers were in a failure state and are now in a partial failure state. Containers are running, but underlying storage/database is offline.

Many questions on their forum are similar to our situation. People wondering if they should restart their containers to get things working again. Worried about if they should do anything, risk losing data if they do anything, or just give everything more time.

I'm glad Railway updated their status page, but more details need to be posted so everyone knows what to do now.

Everyone has outages, it's the way of life and technology. Communication with your customers always makes it less painful and people remember good communication and not the outage. Railway, let's start hearing more communication. Forum is having problems as well. Thanks.

ndneighbor · a month ago

(Angelo from Railway here)

Heard. Being transparent, usually the delay on ack is us trying to determine and correlate the issue. We have a post mortem going out, but we note that the first report was in our system 10 minutes before it was acked, during which time the platform team was trying to see which layer the impact was at.

That said, this is maybe concern #1 of the support team: we want the delta between a report and a customer outage being detected to be as small as possible. The way it usually works is that we have the platform alarms and pages go first, and then the platform engineer will usually page a support eng. to run communications.

Usually the priority is to have the platform engineer focus on triaging the issue and then offload the workload to our support team so that we can accurately state what is going on. We have a new comms clustering system that is rolling out so that if we get 5 reports with similar content, it pages up to the support team as well. (We will roll this out after we have communicated with affected customers first.)
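To make the "5 reports with similar content" idea concrete, here is a rough Python sketch of how such clustering could work. This is not Railway's system: the similarity measure (`difflib.SequenceMatcher`), the 0.6 cutoff, and the `ReportClusterer` class are all assumptions for illustration; only the threshold of 5 comes from the comment above.

```python
from difflib import SequenceMatcher

PAGE_THRESHOLD = 5        # from the comment: 5 similar reports
SIMILARITY_CUTOFF = 0.6   # assumed value, purely illustrative


def similar(a: str, b: str) -> bool:
    """Crude text similarity; a real system would likely use something better."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= SIMILARITY_CUTOFF


class ReportClusterer:
    """Groups incoming reports by similarity to each cluster's first report."""

    def __init__(self) -> None:
        self.clusters: list[list[str]] = []

    def add_report(self, text: str) -> bool:
        """Add a report; return True if this report should trigger a page."""
        for cluster in self.clusters:
            if similar(cluster[0], text):
                cluster.append(text)
                # Page exactly once, when the cluster first hits the threshold.
                return len(cluster) == PAGE_THRESHOLD
        # No similar cluster found: start a new one.
        self.clusters.append([text])
        return False
```

Comparing only against the first report in each cluster keeps the sketch simple; it trades accuracy for not having to recompute a cluster centre.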

ndneighbor commented on Vercel's CEO offers to cover expenses of 'Jmail' threads.com/@qa_test_hq/p... · Posted by u/vinnyglennon

ramoz · a month ago

Every single mentioned service is either an AWS or GCP abstraction.

ndneighbor · a month ago

Angelo from Railway here. Railway runs its own metal for the sheer reason of preserving margins so we can run in perpetuity.

We're nuts for studying failure at the company, and Heroku's margins were one of the things we considered to be among the many nails in that coffin. (RIP)

(my rant here: https://blog.railway.com/p/heroku-walked-railway-run)

ndneighbor commented on OpenAI acquires Sky.app openai.com/index/openai-a... · Posted by u/meetpateltech

ndneighbor · 5 months ago

I think this acquisition makes a lot of sense and it's good business. Finding good macOS developers who know the system-level APIs beyond what the docs cover is a tough go. It would make a lot of sense that OpenAI would just go ahead and hire this expertise as they try to get their Mac app and their iOS app closer and closer to the system.

ndneighbor commented on UA 1093 windbornesystems.com/blog... · Posted by u/c420

ndneighbor · 5 months ago

The unfortunate irony is not lost on me that WindBorne's H1 is "record breaking Weather Balloons".

I don't think any company would want this record. I am very glad the pilot and the souls on board are safe.

ndneighbor commented on Why most product planning is bad and what to do about it blog.railway.com/p/produc... · Posted by u/ndneighbor

stavros · 5 months ago

Or create.

ndneighbor · 5 months ago

A man can dream ;-;

ndneighbor commented on Why most product planning is bad and what to do about it blog.railway.com/p/produc... · Posted by u/ndneighbor

apsurd · 5 months ago

You got downvoted but then I read some and you're not wrong.

I think we should call out bad writing assuming English is their first language. Bad, lazy writing, doesn't respect the audience.

> Instead of crowd sourcing the OKRs from the company and bubbling them up per function.

First sentence under the heading "Good Ole Projects". This is not a sentence.

edit: The charitable pov is that writing is very hard work and writing and publishing anything is a net good. I wish for more people to respect how hard writing is and also to take the time to write well! So that's why that sentence bothered me.

ndneighbor · 5 months ago

Author here, not my intent! My deepest apologies. English is my first language, but people do joke that I write English like I learned it as a second language.

I have fixed the sentence fragment and connected the two thoughts together. Thank you for keeping me honest.

ndneighbor commented on Why most product planning is bad and what to do about it blog.railway.com/p/produc... · Posted by u/ndneighbor

wrs · 5 months ago

I must have missed something, because this seems to say you do capacity and headcount planning and publicly commit to the work before you know how to solve the problem? This seems like the “draw the rest of the owl” part of this process…

ndneighbor · 5 months ago

I am more than happy to add color here, I am sorry, I try my best to write everything but my editor cuts as much as I add. We also tend to hire really autonomous engineers who tend to like just going off on their own to try to solve the issue.

There have been a few times where we would commit to the problem, assign a DRI, and then find out midway that... no we have to hire/consult our way out of the issue. I think that's okay, we then look back at the retro to see what we missed.

If interested, I think we can blog about what happens when a problem gets converted to an RFC and then we have more engineering discussions with the stakeholders but the piece was pushing a 10 min read time as it was...

ndneighbor commented on Why most product planning is bad and what to do about it blog.railway.com/p/produc... · Posted by u/ndneighbor

procaryote · 5 months ago

Am I getting it wrong? It sounds like they're still doing quarterly planning just with a different ritual?

I had hoped they'd realise quarterly planning is a bad premise and asked themselves why they do it.

If you have a mature product where you add incremental features, you don't need that plan because it's just an arbitrary block of pretty fungible work.

If you're still looking for product market fit, that three month plan won't last a week before becoming obsolete.

If you need to build a bigger thing that is only valuable once it's all done, you A: need a project and B: probably don't because it will fail.

ndneighbor · 5 months ago

Hello there! Author here, and surprised/delighted with the response. I don't think we had an issue with the cadence. The quarter is arbitrary, but we think it gives us the ability to just go heads down to focus.

With that said, one thing we did, and I don't know why we did it, was that we would "re-justify" why we would want to work on something every three months, which isn't great. There is a world where if we had more eng. resources we could have more people than problems and we could take stuff on board as it arrives, but for us deciding on what to work on is a hard decision.

I also agree that market fit is a key factor. I think Railway was lucky that we didn't have to pivot the product 3 to 5 times to get some latch.

What would be the post-quarterly planning process that you would like to see?