I think the send/recv with a timeout example is very interesting, because in a language where futures start running immediately without being polled, I think the situation is likely to be the opposite way around. send with a timeout is probably safe (you may still send if the timeout happened, which you might be sad about, but the message isn't lost), while recv with a timeout is probably unsafe, because you might read the message out of the channel but then discard it because you selected the timeout completion instead. And the fix is similar, you want to select either the timeout or 'something is available' from the channel, and if you select the latter you can peek to get the available data.
Some other material that has been written by me on that topic:
- Proposal from 2020 about async functions which are forced to run to completion (and thereby would use graceful cancellation if necessary). Quite old, but I still feel that no better idea has come up so far. https://github.com/Matthias247/rfcs/pull/1
I'm not understanding what the supposed problem with these futures getting cancelled is. Since futures are not tasks, as the post itself acknowledges, does it not logically follow that one should not expect futures to complete if the future is not driven to completion, for one reason or another? What else could even be expected to happen?
The examples presented for "cancel unsafe" futures seem to me like the root of the problem is some sort of misalignment of expectations to the reality:
Example 1: one future cancelled on error in the other
let res = tokio::try_join!(
do_stuff_async(),
more_async_work(),
);
Example 2: data not written out on cancellation
let buffer: &[u8] = /* ... */;
writer.write_all(buffer)?;
Both of these cases are claimed to not be cancel-safe, because the work gets interrupted and so not driven to completion. But again, what else is supposed to happen? If you want the work to finish regardless of the async context being cancelled, then don't put it in the same async context but spawn a task instead.
I feel like I must be missing something obvious that keeps me from understanding the author's issue here. I thought work getting dropped on cancellation is exactly how futures are supposed to work. What's the nuance that I'm missing?
You're absolutely right! The problem is that this has introduced many bugs in our experience at Oxide. If you've already fully internalized the idea that futures are passive and can be cancelled at any await point, the talk is just a bunch of details.
I see. Do you suppose that the origin of these bugs is more about the difficulty of reasoning about the execution of deep async stacks, or does it come down to the developers holding an incorrect mental model of the Rust futures in their minds?
I am asking because I've noticed that many developers with previous experience from "task-based" languages (specifically the JS/TS world) tend to grasp the basics of Rust async quickly enough, but then run into expectation-misalignment related problems similar to the examples that you used in your post. That in turn has made want to understand whether it is the Rust futures that are themselves difficult or strange, or whether it's a case of the Rust futures appearing simple and familiar, even though they are completely different in very subtle ways. I suppose that it's a combination of both.
I don't write Rust often, and I think part of the issue is that the async/await style is encouraging you to write code that looks a lot like synchronous code, and so it makes it really easy to forget that your code is cancelable at any await point.
I'm sure experienced async Rust programmers always have this things in mind, but Rust is also about preventing these kinds of missable behaviour, be it via the type system or otherwise.
"Cancel correctness" makes a lot of sense, because it puts the cancellation in some context.
I don't like the "cancel safety" term. Not only it's unrelated to the Rust's concept of safety, it's also unnecessarily judgemental.
Safe/unsafe implies there's a better or worse behavior, but what is desirable for cancellation to do is highly context-dependent.
Futures awaiting spawned tasks are called "cancellation safe", because they won't stop the task when dropped. But that's not an inherently safe behavior – leaving tasks running after their spawner has been cancelled could be a bug: piling up work that won't be used, and even interfering with the rest of the program by keeping locks locked or ports used. OTOH a spawn handle that stops the task when dropped would be called "cancellation unsafe", despite being a very useful construct specifically for propagating cleanup to dependent tasks.
I'm a Go developer and this was still useful for me! Obviously Rust devs are more accustomed to more assistance from their tools than Go devs, but just about every gotcha listed is something that can happen in Go with goroutines, channels, select, and other shared concurrency primitives.
In the initial example, it's not clear what the desired behavior is. If the queue is full, the basic options are drop something, block and wait, or panic. Timing out on a block is usually deadlock detection. He writes "It turns out that this code is often incorrect, because not all messages make their way to the channel." Well, yes.
You're out of resources. Now what?
What's he trying to do? Get a clean program shutdown? That's moderately difficult in threaded programs, and async has problems, too. The use case here is unclear.
The real use cases involve when you're sending messages back and forth to a remote site, and the remote site goes away. Now you need to dispose of the state on your end.
Ideally, what you would like to do is buffer up the message until there's space in the channel. I cover this later in the talk under "What can be done".
Is that ideal, though? I mean, the channel is the buffer. If you need more buffer, it should have been bigger to start with. Generally this reflects a resource exhaustion failure, which you don't handle by adding code. Fix the resource allocation.
It's in the example isn't it? The example is logging "No space for 5 seconds". It's just a helpful diagnostic that subtly turned into data loss.
Maybe it's a bit contrived, but it's also the kind of code you'd sprinkle through your system in response to "nothing seems to be happening and I don't know why".
It's definitely a bit contrived, but to me it's also emblematic of the issues with async Rust.
The note on mpsc::Sender::send losing the message on drop [1] was actually added by me [2], after I wrote the Oxide RFD on cancellations [3] that this talk is a distilled form of. So even the great folks on the Tokio project hadn't documented this particular landmine.
One should always keep in mind that await is always a potential return point. So, using await between two actions which always should be performed together should be avoided.
That… seems bad? Like I guess it is what it is and you just have to deal with it but what if your "critical section" has two await calls? The code can be paused between them but it's such that it must eventually resume. Say making a change in the database and emitting an audit edit for that change. Is your only option to either not do that or put a big do not cancel sign on the function docs?
Even if you guaranteed the calling code would always logically continue running the function till completion, you wouldn’t have the guarantee the code would actually resume — eg the program crashes between the two calls, network dies, etc.
If you want to tie multiple actions together as an atomic unit, you need the other side to have some concept of transactions; — and you need to utilize it.
A DB action and audit emission have to run transactionally anyway.
So on cancellation, the transaction times out and nothing is written. Bad but safe.
The problem is the same on other platforms. For example, what if writing to the DB throws an exception if you’re on Python? Your app just dies, the transaction times out. Unfortunate but safe.
If it does not run transactionally you have a problem in any execution scenario.
Rust's Future is somehow like move semantics in C++, where you may leave a Future in an invalid state after it finishes. Besides, Rust adopts a stackless coroutine design, so you need to maintain the state in your struct if you would like to implement a poll-based async structure manually.
These are all common traps. And now cancellations in async Rust are a new complement to state management in async Rust (Futures).
When I'm developing the mea (Make Easy Async) [1] library, I document the cancel safety attribute when it's non-trivial.
Additionally, I recall [2] an instance where a thoughtless async cancellation can disrupt the IO stack.
- Proposal from 2020 about async functions which are forced to run to completion (and thereby would use graceful cancellation if necessary). Quite old, but I still feel that no better idea has come up so far. https://github.com/Matthias247/rfcs/pull/1
- Proposal for unified cancellation between sync and async Rust ("A case for CancellationTokens" - https://gist.github.com/Matthias247/354941ebcc4d2270d07ff0c6...)
- Exploration of an implementation of the above: https://github.com/Matthias247/min_cancel_token
The examples presented for "cancel unsafe" futures seem to me like the root of the problem is some sort of misalignment of expectations to the reality:
Example 1: one future cancelled on error in the other
let res = tokio::try_join!( do_stuff_async(), more_async_work(), );
Example 2: data not written out on cancellation
let buffer: &[u8] = /* ... */; writer.write_all(buffer)?;
Both of these cases are claimed to not be cancel-safe, because the work gets interrupted and so not driven to completion. But again, what else is supposed to happen? If you want the work to finish regardless of the async context being cancelled, then don't put it in the same async context but spawn a task instead.
I feel like I must be missing something obvious that keeps me from understanding the author's issue here. I thought work getting dropped on cancellation is exactly how futures are supposed to work. What's the nuance that I'm missing?
I am asking because I've noticed that many developers with previous experience from "task-based" languages (specifically the JS/TS world) tend to grasp the basics of Rust async quickly enough, but then run into expectation-misalignment related problems similar to the examples that you used in your post. That in turn has made want to understand whether it is the Rust futures that are themselves difficult or strange, or whether it's a case of the Rust futures appearing simple and familiar, even though they are completely different in very subtle ways. I suppose that it's a combination of both.
I'm sure experienced async Rust programmers always have this things in mind, but Rust is also about preventing these kinds of missable behaviour, be it via the type system or otherwise.
Glad to see it converted to a blog post. Talks are great, but blogs are much easier to share and reference.
I don't like the "cancel safety" term. Not only it's unrelated to the Rust's concept of safety, it's also unnecessarily judgemental.
Safe/unsafe implies there's a better or worse behavior, but what is desirable for cancellation to do is highly context-dependent.
Futures awaiting spawned tasks are called "cancellation safe", because they won't stop the task when dropped. But that's not an inherently safe behavior – leaving tasks running after their spawner has been cancelled could be a bug: piling up work that won't be used, and even interfering with the rest of the program by keeping locks locked or ports used. OTOH a spawn handle that stops the task when dropped would be called "cancellation unsafe", despite being a very useful construct specifically for propagating cleanup to dependent tasks.
What's he trying to do? Get a clean program shutdown? That's moderately difficult in threaded programs, and async has problems, too. The use case here is unclear.
The real use cases involve when you're sending messages back and forth to a remote site, and the remote site goes away. Now you need to dispose of the state on your end.
Maybe it's a bit contrived, but it's also the kind of code you'd sprinkle through your system in response to "nothing seems to be happening and I don't know why".
The note on mpsc::Sender::send losing the message on drop [1] was actually added by me [2], after I wrote the Oxide RFD on cancellations [3] that this talk is a distilled form of. So even the great folks on the Tokio project hadn't documented this particular landmine.
[1] https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.h...
[2] https://github.com/tokio-rs/tokio/pull/5947
[3] https://rfd.shared.oxide.computer/rfd/0400
They go by they/she
https://sunshowers.io/about/
Dead Comment
If you want to tie multiple actions together as an atomic unit, you need the other side to have some concept of transactions; — and you need to utilize it.
So on cancellation, the transaction times out and nothing is written. Bad but safe.
The problem is the same on other platforms. For example, what if writing to the DB throws an exception if you’re on Python? Your app just dies, the transaction times out. Unfortunate but safe.
If it does not run transactionally you have a problem in any execution scenario.
Let's say my code looks like this
Where does an issue occur which causes `d` to not to be called? Is it some sort of cancellation in c? Or some upstream action in a?`d` not being called would happen because of actions in `a`.
If `a` were rewritten as
Then if `c` ends up failing in the try_join then process on `b` will be halted and thus the `d` in `b` won't be executed.These are all common traps. And now cancellations in async Rust are a new complement to state management in async Rust (Futures).
When I'm developing the mea (Make Easy Async) [1] library, I document the cancel safety attribute when it's non-trivial.
Additionally, I recall [2] an instance where a thoughtless async cancellation can disrupt the IO stack.
[1] https://github.com/fast/mea
[2] https://www.reddit.com/r/rust/comments/1gfi5r1/comment/luido...