Indeed, one thing I’ve always wondered is if you can submit a read request for a page aligned buffer and have the kernel arrange for data to be written directly into that without any additional copies. That’s probably not possible since there’s routing happening in the kernel and it accumulates everything into sk_buffs.
But maybe it could arrange for the framing part of the packet and the data to be decoupled so that it can just give you a mapping into the data region (maybe instead of you even providing a buffer, it gives you back an address mapped into your space). Not sure if that TLB update might be more expensive than a single copy.
Parity with synchronous programming is an explicit goal of Rust async declared many times (e.g. see here https://github.com/rust-lang/rust-project-goals/issues/105). I agree with your rant about the illusion of synchronicity, but it does not matter. The synchronous abstraction is immensely useful in practice and less leaky it is, the better.
Is this really better than what we have now? I don't think async is perfect, but I can see what tradeoffs they are currently making and how they plan to address most if not all of them. "General" unsoundness seems like a rather large downside.
> In future I plan to create a custom "green-thread" fork of `std` to ease limitations a bit
Can you go more in-depth into these limitations and which would be alleviated by having first class support for your approach in the compiler/std?
Depends on the metric you use. Memory-wise it's a bit less efficient (our tasks usually are quite big, so relative overhead is small in our case), runtime-wise it should be on par or slightly ahead. From the source code perspective, in my opinion, it's much better. We don't have the async/await noise everywhere and after development of the `std` fork we will get async in most dependencies as well for "free" (we still would need to inspect the code to see that they do not use blocking `libc` calls for example). I always found it amusing that people use "sync" `log`-based logging in their async projects, we will not have this problem. The approach also allows migration of tasks across cores even if you keep `Rc` across yield points. And of course we do not need to duplicate traits with their async counterparts and Drop implementations with async operations work properly out of the box.
>Can you go more in-depth into these limitations and which would be alleviated by having first class support for your approach in the compiler/std?
The most obvious example is thread locals. Right now we have to ensure that code does not wait on completion while having a thread local reference (we allow migration of tasks across workers/cores by default). We ban use of thread locals in our code and assume that dependencies are unable to yield into our executor. With forked `std` we can replace the `thread_local!` macro with a task-local implementation which would resolve this issue.
Another source of potential unsoundness is reuse of parent task stack for sub-task stacks in our implementation of `select!`/`join!` (we have separate variants which allocate full stacks for sub-tasks which are used for "fat" sub-tasks). Right now we have to provide stack size for sub-tasks manually and check that the value is correct using external tools (we use raw syscalls for interacting with io-uring and forbid external shared library calls inside sub-tasks). This could be resolved with the aforementioned special async ABI and tracking of maximum stack usage bound.
Finally, our implementation may not work out-of-box on Windows (I read that it has protections against messing with stack pointer on which we rely), but it's not a problem for us since we target only modern Linux.