AWS Lambda Silent Crash – A Platform Failure, Not an Application Bug [pdf]

Oh wow, a 23-page write up about how the author misunderstood AWS Lambda's execution model [1].

> It emits an event, then immediately returns a response — meaning it always reports success (201), regardless of whether the downstream email handler succeeds or fails.

It should be understood that after Lambda returns a response the MicroVM is suspending, interrupting your background HTTP request. There is zero guarantee that the request would succeed.

1: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-...

frenchtoast8 · a month ago

It took me a couple reads of the PDF but I think you're right. The author creates an HTTP request Promise, and then immediately returns a response thereby shutting down the Lambda. They have logging which shows the background HTTP request was in the early stages of being sent to the server but the server never receives anything. They also have an error handler that is supposed to catch errors during the HTTP request but it isn't executed either. The reason for both seems quite obvious: it's completely expected that a Lambda being shutdown wouldn't finish making the request and it certainly wouldn't stick around to execute error handling code after the request was cancelled.

As an aside I find it strange that the author spent all this time writing this document but would not provide the actual code that demonstrates the issue. They say they wrote "minimal plain NodeJS functions" to reproduce it. What would be the reason to not show proof of concept code? Instead they only show code written by an AWS engineer, with multiple caveats that their code is different in subtle ways.

The author intends for this to be some big exposé of AWS Support dropping the ball but I think it's the opposite. They entertained him through many phone calls and many emails, and after all that work they still offered him a $4000 account credit. For comparison, it's implied that the Lambda usage they were billed for is less than $700 as that figure also includes their monthly AWS Support cost. In other words they offered him a credit for over 5x the financial cost to him for a misunderstanding that was his fault. On the other hand, he sounds like a nightmare customer. He used AWS's offer of a credit as an admission of fault ("If the platform functioned correctly, then why offer credits?") then got angry when AWS reasonably took back the offer.

kevin_nisbet · a month ago

I don’t know about node but a fun abuse of this is background tasks can still sometimes run on a busy lambda as the same process will unsuspend and resuspend the same process. So you can abuse this sometimes for non essential background tasks and to keep things like caches in process. You just cant rely on this since the runtime instead might just cycle out the suspended lambda.

johnduhart · a month ago

Absolutely, I do this at $dayjob to update feature flags and refresh config. Your code just needs to understand that such execution is not guaranteed to happened, and in-flight requests may get interrupted and should be retried.

__turbobrew__ · a month ago

TiTiler does exactly that. Geospatial rasters are stored in S3, and the lambda retains a cache in memory of loaded data from S3. So if the same lambda execution is used it can return cached data without hitting S3.

lbreakjai · a month ago

I wouldn't really call this an abuse. I remember their documentation mentioning it.

dsmurrell · a month ago

"This document reflects the outcome of a seven-week independent technical investigation" - :)

Hikikomori · a month ago

Guy seems to be a consultant, so good money invoiced on this "problem".

belter · a month ago

Uhhmm… it seems both you and @frenchtoast8 are misrepresenting the code flow. The author doesn’t return early, the handler is async and explicitly awaits an https.request() Promise. The return only happens after the request resolves or fails. Lambda is terminating mid-await, not post-return.

That’s the whole point and it’s not the behavior your comments describe.

Scaevolus · a month ago

Page 5: "It emits an event, then immediately returns a response — meaning it always reports success (201), regardless of whether the downstream email handler succeeds or fails."

This is not the right way to accomplish this. If you want a Lambda function to trigger background processing, you invoke another lambda function before returning. Using Node.js events for background processing doesn't work, since the Lambda runtime shuts down the event loop as quickly as possible.

souldeux · a month ago

Respectfully, the documentation makes it clear that the author is incorrect.

I saw this come up on /r/aws a few days back.

This response seemed illuminating:

https://www.reddit.com/r/aws/comments/1lxfblk/comment/n2qww9...

> Looking at (this section)[https://will-o.co/gf4St6hrhY.png], it seems like you're trying to queue up an asyncronous task and then return a response. But when a Lambda handler returns a response, that's the end of execution. You can't return an HTTP response and then do more work after that in the same execution; it's just not a capability of the platform. This is documented behavior: "Your function runs until the handler returns a response, exits, or times out". After you return the object with the message, execution will immediately stop even if other tasks had been queued up.

mcflubbins · a month ago

If this is the case (it might very well be) I do at least find it odd that no one on AWS' side was able to explain this to them.

eddythompson80 · a month ago

They did (in a typical non-fault admitting way). They didn't escalate to Lambda engineering team, then said that this is a code issue, and that they should move to EC2 or Fargate, which is the polite way of saying "you can't do that on lambda. its your issue. no refunds. try fargate."

OP seems to be fixated on wanting MicroVM logs from AWS to help them correlate their "crash", but there likely no logs support can share with them. The microVM is suspended in a way you can't really replicate or test locally. "Just don't assume you can do background processing". Also to be clear, AWS used to allow the microVM to run for a bit after the response is completed to make sure anything small like that has been done.

It's an nondeterministic part of the platform. You usually don't run into it until some library start misbehaving for one reason or another. To be clear, it does break applications and it's a common support topic. The main support response is to "move to EC2 or fargate" if it doesn't work for you. Trying to debug and diagnose the lambda code with the customer is out of support scope.

somethingAlex · a month ago

It may just be such a basic tenant of the platform that no one thought to. You stop getting billed after lambda returns a response so why would you expect computation to continue? This guy expected free lunch.

semiquaver · a month ago

Based on the way the document is written, it seems very likely that several people did realize exactly what the misunderstanding was and try to explain it to them.

arpinum · a month ago

They did, he has writings elsewhere where he says AWS told him it was an application code problem. The sub-title of this screed also says it.

This person should not be trusted to accurately recount a story, I would be sceptical of any claim in the doc.

ceejayoz · a month ago

Some people have a lot of certainty coupled with a lack of accuracy.

m3sta · a month ago

AWS has a lot of employees. Many of them don't know AWS very well.

xer0x · a month ago

There is a "post-invocation phase" for tidying up. It's not long, and it's enough for things like Datadog's metrics plugin to send off some numbers. Yet it will fail, if you have too many metrics. It's fuzzy.

Scaevolus · a month ago

You only get that when you register extensions: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-...

nemothekid · a month ago

If I'm reading this correctly, then AWS Support dropped the ball here but this isn't a bug in lambda. This is the documented behavior of the lambda runtime.

The document is long, and the examples seem contrived, so anyone is free to correct me but as I understand it the lambda didn't crash, after you returned 201, your lambda instance was put to sleep. You aren't guaranteed that any code will remain running after your lambda "ends". I am not sure why AWS Support was unable to communicate this OP.

If you are using Lambda with a function URL, you aren't guaranteed that anything after you return your http response remains running. I believe Lambda has some callbacks/signals you can listen to, to ensure your function properly cleans up before the Lambda is frozen, but if you want the lambda to return as fast as possible it seems you are better off having your service publish to an SQS queue instead.

This document is bizarre. The author is so confidently verbose about something they are clearly misunderstanding, and have been told as much dozens of times. It’s humbling, in a way, to think of times I’ve felt this strongly about something and to consider the possibility I could have been this wrong.

johnfn · a month ago

It's pretty clearly written by GPT.

tough · a month ago

stubbornness is a very human thing

encomiast · a month ago

> I am not sure why AWS Support was unable to communicate this OP.

Maybe we should be asking why the OP was not able to hear what AWS was telling them. I think there is a fairly troublesome cognitive bias that gets flipped on when people are telling you that you're wrong — especially when all your personal branding and identity seems to be about you being right.

Everdred2dx · a month ago

Yep. I've ran into this using Bugsnag for reporting on unhandled exceptions in Python-based lambda functions. The exception handler would get called, but because the library is async by default the HTTP request wouldn't make it out before the runtime was torn down.

I sympathize with OP because debugging this was painful, but I'm sorry to say this is sort of just a "you're holding it wrong" situation.

cldcntrl · a month ago

> If I'm reading this correctly, then AWS Support dropped the ball here but this isn't a bug in lambda. This is the documented behavior of the lambda runtime.

AWS Support is generally ineffective unless you're stuck on something very simple at a higher level of the platform (e.g. misunderstanding an SDK API).

Even with their higher tier support - where you can summon a subject matter expert via Chime almost instantly - they're often clueless, and will confidently pass you misleading or incorrect information just to get you off the line. I've successfully used them as a very expensive rubber ducky, but that's about it.

happytoexplain · a month ago

Consider that this conversation is even being had - that you are even in the situation to be writing what you are writing.

"Failure" != "Bug"

Analemma_ · a month ago

This post is a good lesson on humility, and why emotionally-heated blog posts might not be a great idea, no matter how satisfying, and you should possibly stick to dry technicalities instead. If OP had just written up their experience without all the sneering and potshots at Amazon support, people would've just gently corrected them. But as it is they look like a tremendous asshole, and certainly not fit for a CTO role, either temperamentally or skill-wise.

Quarrel · a month ago

The poor guy has obviously wound himself in knots over this for several months. He went through all this, obviously really bothered by the impersonal Corporation, "solved" his problem by moving to Azure, but then wrote up this massive rant.

Sounds like he needs a holiday at the least.

The mere idea that reddit has a special Amazon protection team that is the cause of his account suspension is ludicrous.

Hopefully this exposure will help him out.

lazystar · a month ago

> Hopefully this exposure will help him out.

looks lile he took down his site after some other commentators in this discussion noticed some discrepencies in his resume. specifically his claims of having a PhD and a JD. Con men never learn, they just play the next fool.

sneilan1 · a month ago

`This wasn’t a request for clarification. It was a reset — a request for artefacts that have already been delivered, multiple times, across prior case updates, call notes, and debug summaries. The very same package AWS previously reviewed. The same logs they had already quoted. The same failure they had already confirmed.`

While the author is probably correct in this being a platform level bug (I have not ran the code to confirm myself though), they should stop being so angry at a corporation doing what corporations do which is run a big bureaucracy and require extraordinary amounts of extra communication. One of job requirements of a principle engineer, often making hundreds of thousands if not millions of dollars a year, is to communicate across levels because it is so hard to do this effectively without being angry or losing your cool.

`As a solo engineer, I outpaced AWS’s own diagnostics, rebuilt our deployment stack from scratch, eliminated every confounding factor, and forced AWS to inadvertently reproduce a failure they still refused to acknowledge.`,

I do not think that a solo engineer would have the economic clout to get AWS to pay attention to the bug you found. Finding a bug in a system does not imply any ability to fix a bug in a bureaucracy.

Plus, given the tone of the article, I have a funny feeling that the author may have rubbed people in AWS the wrong way, preventing more progress from being made on their bug.

But, a wise fellow once said, "all progress in this world depends upon the unreasonable man".

It's not a platform level bug. It's as-designed, as-documented behavior. Return a response and Lambda suspends.

They seem to want a fast response with background processing of a potentially slow task. The answer on Lambda (and honestly, elsewhere) is having this function send the payload to a queue, and processing the queue with another function.

Yes you’re right. I wrote the parent comment and then someone else also mentioned the lambda is returning before background processing. I missed that one!

Deleted Comment

reilly3000 · a month ago

What a tremendous waste of resources. As a startup engineer and leader your job is to build value for customers and investors as quickly and effectively as possible, knowing the privilege of their participation with you is fleeting. The time spent in those meetings or writing this vitriolic pdf would have gone a long way towards shipping features, even if you had to survive with a suboptimal approach for a while. Time isn’t cheap and this cost a fortune.

PartiallyTyped · a month ago

I work at lambda, and my team would be the one to engage with OP, it most likely would have been me actually to investigate this.

The fact that support didn’t engage us seems odd as we have gotten engaged for far more silly stuff.

However, in this case you should have awaited for the response from event emitter.

The lambda execution model clearly does not allow you to run things past the response as execution is frozen by the sandbox / virtualization software. From there, you’ve most likely caused a timeout when issuing the request.

WatchDog · a month ago

This is a simple misunderstanding of the Lambda execution lifecycle, but it's maybe not a great look for AWS support, that no one there was able to explain the problem to this guy.

I guess it's possible someone did explain it to him, and David hasn't mentioned that in his diatribe.

jdeastwood · a month ago

I actually witnessed almost exactly the same thing happen in my org recently. AWS support was far _too_ nice and actually did escalate to the service team (or claimed to).

There was a few days of back and forth which should have been resolved by saying "please read the how lambda works" section of the documentation.

the_plus_one · a month ago

It's been several years at this point, but I used to be a cloud support engineer for AWS (i.e. the first line of defense against support tickets). We were trained to always be kind to the customer, even if we're frustrated with them, so a response like "please read how the lambda works" probably would've gotten me reprimanded. Customers can rate your response from one to five stars, and if I remember correctly, anything less than four required a written explanation to your manager as to what went wrong.

That job was a glorified call center; I didn't last very long.

> I guess it's possible someone did explain it to him, and David hasn't mentioned that in his diatribe.

Not just possible. Quite likely. It seems very clear the answer was "yes, that's how it works on Lambda, and it is not going to change" and this person said "but I want it to work the other way on Lambda".