infogulch · 4 years ago
The dense fog lifts, tree branches part, a ray of light beams down on a pedestal revealing the hidden intentions of the ancients. A plaque states "The operational semantics of the most basic primitives of your operating system are designed to simplify the implementation of shells." You hesitantly lift your eyes to the item presented upon the pedestal, take a pause in respect, then turn away slumped and disappointed but not entirely surprised. As you walk you shake your head trying to evict the after image of a beam of light illuminating a turd.
chubot · 4 years ago
It seems like this 2019 paper covers this point, and the content in the gist? I was expecting to see a reference to it

A fork() in the road

https://dl.acm.org/doi/abs/10.1145/3317550.3321435

Discussed at the time: https://news.ycombinator.com/item?id=19621799

Although it does say that vfork() is difficult to use safely, while the gist recommends it? I think there is still some clarity needed around the use cases.

Fork today is a convenient API for a single-threaded process with a small memory footprint and simple memory layout that requires fine-grained control over the execution environment of its children but does not need to be strongly isolated from them. In other words, a shell. It’s no surprise that the Unix shell was the first program to fork [69], nor that defenders of fork point to shells as the prime example of its elegance [4, 7]. However, most modern programs are not shells. Is it still a good idea to optimise the OS API for the shell’s convenience?

cryptonector · 4 years ago
As u/amaranth pointed out, my gist predates the MSFT paper, which mostly explains why I didn't reference it. Though, to be fair, I saw that paper posted here back in 2019, and I commented on it plenty (13 comments) then. I could have edited my gist to reference it, and, really, probably should have. Sometime this week I will add a reference to it, as well as to this and that HN post, since they are clearly germane and useful threads.

I vehemently disagree with those who say that vfork() is much more difficult to use correctly than fork(). Neither is particularly easy to use though. Both have issues to do with, e.g., signals. posix_spawn() is not exactly trivial to use, but it is easier to use it correctly than fork() or vfork(). And posix_spawn() is extensible -- it is not a dead end.

My main points are that vfork() has been unjustly vilified, fork() is really not good, vfork() is better than fork(), and we can do better than vfork(). That said, posix_spawn() is the better answer whenever it's applicable.

Note that the MSFT paper uncritically accepts the idea that vfork() is dangerous. I suspect that is because their focus was on the fork-is-terrible side of things. Their preference seems to be for spawn-type APIs, which is reasonable enough, so why bother with vfork() anyways, right? But here's the thing: Windows WSL can probably get a vfork() added easily enough, and replacing fork() with vfork() will generally be a much simpler change than replacing fork() with posix_spawn(), so I think there is value in vfork() for Microsoft.

Use cases for vfork() or afork()? Wherever you're using fork() today to then exec, vfork() will make that code more performant and it generally won't take too much effort to replace the call to fork() with vfork(). afork() is for apps that need to spawn lots of processes quickly -- these are rare apps, but uses for them do arise from time to time. But also, afork() should be easier to use safely than vfork(). And, again, for Microsoft there is value in vfork() as a smaller change to Linux apps so they can run well in WSL.

BTW, see @famzah's popen-noshell issue #11 [0] for a high-perf spawn use case. I linked it from my gist, and, in fact, the discussion there led directly to my writing that gist.

  [0] https://github.com/famzah/popen-noshell/issues/11

amaranth · 4 years ago
The gist seems to be from 2017 so it wouldn't have been able to reference that paper.
cryptonector · 4 years ago
I've updated the gist to include that, this, and many other links.
rezonant · 4 years ago
I too could use some more clarity around the use cases
ckastner · 4 years ago
> "The operational semantics of the most basic primitives of your operating system are designed to simplify the implementation of shells."

Yes, but why is this characterized as something negative?

Isn't that the entire point? Operating systems are there to serve user requests, and shells are an interface between user and OS.

Shells simply developed features that users required of them.

eru · 4 years ago
> Isn't that the entire point?

The exokernel people would disagree.

You see, an operating system as commonly conceived has at least two major jobs:

- abstract away underlying hardware

- safely multiplex resources

And do the above with as little overhead as possible.

Now the thing is: whenever you have multiple goals, you need to make trade-offs, and you aren't as good at any one goal as you could be.

So the exokernel folks made a suggestion in the 90s: let the OS concentrate on safely multiplexing resources, and do all the abstracting in user level libraries.

See eg https://www.classes.cs.uchicago.edu/archive/2019/winter/3310... or https://people.eecs.berkeley.edu/~kubitron/cs262/handouts/pa...

Normal application programming would mostly look the same as before, your libraries just do more of the heavy lifting. But it's much easier to swap out different libraries than it is to swap out kernel-level functionality.

That vision never caught on with mainstream OSes. But: widespread virtualisation made it possible. You can see hypervisors like Xen as exokernel OSes that do the bare minimum required to safely multiplex, but don't provide (many) abstractions.

rtpg · 4 years ago
Shells have relatively simple operational models, so _any_ API would probably be workable for shells.

Meanwhile, programs with more complex requirements have to work around these APIs. And many programs call other programs, or otherwise have to do tricky process lifecycle management.

The lowest-level APIs should, in theory, cater to the most complex cases, not to the simplest ones. This doesn't prevent a simpler API from existing, but catering to a simple use case in the primitives does hinder more complex needs.

(I think the more nuanced point is that the OS itself might not have a much better design available in any case. Unixes have a lot of neat stuff, but it's a lot of "design by user feature request", and "standardize 4 slightly different ways of doing things", so there is a lot of weirdness and it's hard to have The Perfect API in that case)

matu3ba · 4 years ago
> Yes, but why is this characterized as something negative?

Unfortunately, the text does not provide sufficient context. Shells are not properly supported in any OS (except perhaps Plan 9), since 1. the OS provides no enforcement of, or convention for, a CLI API interface (there is no enforced encoding standard or anything checkable), 2. the OS provides no rules for file names to be shell-friendly, and 3. there are no dedicated communication channels toward shells or between programs and shells.

So, all in all, shells remain a hack around the system that is "simple to implement" initially and annoying to use and write in many corner cases.

> Shells simply developed features that users required of them.

Cross out "simply" and call it convenience+arbitrary complex scripting glue for 4 main goals: 1. piping 2. basic text processing 3. basic job control 4. path hackery

int_19h · 4 years ago
Shells haven't been the primary interface between the user and the OS for decades.
peterburkimsher · 4 years ago
That is the most glorious ** that I've read all day.

Larry Wall, creator of Perl, famously wrote that "It is easier to port a shell than a shell script."

https://en.wikipedia.org/wiki/Shell_script

So we can write operating systems easily if it's just an infinite superloop?


evmar · 4 years ago
In Ninja, which needs to spawn a lot of subprocesses but is otherwise not especially large in memory and which doesn't use threads, we moved from fork to posix_spawn (which is the "I want fork+exec immediately, please do the smartest thing you can" wrapper) because it performed better on OS X and Solaris:

https://github.com/ninja-build/ninja/commit/89587196705f54af...
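
A minimal sketch of what a posix_spawn() call looks like (illustrative only; Ninja's actual code is in the commit above):

    #include <spawn.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>

    extern char **environ;

    int main(void) {
        pid_t pid;
        char *argv[] = { "ls", "-l", NULL };

        /* NULL file_actions/attrp: child inherits fds, signal mask, etc. */
        int err = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
        if (err != 0) {     /* unlike fork(), errors come back as a value */
            fprintf(stderr, "posix_spawnp: %s\n", strerror(err));
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }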

ridiculous_fish · 4 years ago
posix_spawn also outperforms fork on Linux under more recent glibc and musl, which can use vfork under the hood. https://twitter.com/ridiculous_fish/status/12328893907639336...
xroche · 4 years ago
The issue with posix_spawn is that you can't close all descriptors before exec. This is especially an issue as most libraries are still unaware they need to open every single handle with the close-on-exec flag set.
kazinator · 4 years ago
Closing all descriptors is next to useless; you usually need to inherit at least standard in/out/error.

What you need is an operation like "close all descriptors >= N" as a posix_spawn file action.

cryptonector · 4 years ago
Solaris/Illumos has an extension[0] for that.

  [0] http://src.illumos.org/source/search?project=illumos-gate&full=posix_spawn_file_actions_addclosefrom_np&defs=&refs=&path=&hist=&type=&xrd=&nn=1
  [1] https://docs.oracle.com/cd/E36784_01/html/E36874/posix-spawn-file-actions-addclosefrom-np-3c.html
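
A sketch of how that extension composes with the rest of posix_spawn (the _np suffix marks it non-portable; I believe recent glibc, 2.34+, ships the same call):

    #define _GNU_SOURCE
    #include <spawn.h>

    extern char **environ;

    /* Spawn argv with every fd >= lowfd closed in the child;
       lowfd = 3 keeps only stdin/stdout/stderr. */
    static int spawn_closefrom(pid_t *pid, char *const argv[], int lowfd) {
        posix_spawn_file_actions_t fa;
        posix_spawn_file_actions_init(&fa);
        posix_spawn_file_actions_addclosefrom_np(&fa, lowfd);
        int err = posix_spawnp(pid, argv[0], &fa, NULL, argv, environ);
        posix_spawn_file_actions_destroy(&fa);
        return err;
    }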

mattgreenrocks · 4 years ago
> Long ago, I, like many Unix fans, thought that fork(2) and the fork-exec process spawning model were the greatest thing, and that Windows sucked for only having exec() and _spawn(), the last being a Windows-ism.

I appreciate this quite a bit. Vocal Unix proponents tend to believe that anything Unix does is automatically better than Windows, sometimes without even knowing what the Windows analogue is. Programming in both is necessary to have an informed opinion on this subject.

The one thing I miss most on Unix: the unified model of HANDLEs that enables you to WaitOnMultipleObjects() with almost any system primitive you could want, such as an event with a socket (blocking I/O + a shutdown notification) in one call. On Unix, a flavor of select() tends to be the base primitive for waiting on things to happen, which means you end up writing adapter code for file descriptors to other resources, or need something like eventfd.
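
For example (a from-memory sketch, assuming `sock` is an already-connected SOCKET; error handling elided):

    /* Wait for "socket readable/closed" OR "shutdown requested" in one call. */
    HANDLE shutdown_evt = CreateEvent(NULL, TRUE, FALSE, NULL);
    WSAEVENT sock_evt = WSACreateEvent();
    WSAEventSelect(sock, sock_evt, FD_READ | FD_CLOSE);

    HANDLE h[2] = { shutdown_evt, sock_evt };
    switch (WaitForMultipleObjects(2, h, FALSE, INFINITE)) {
    case WAIT_OBJECT_0:      /* shutdown notification */
        ...
    case WAIT_OBJECT_0 + 1:  /* socket activity */
        ...
    }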

Things I don't miss from Windows at all: wchar_t everywhere. :)

cryptonector · 4 years ago
WIN32 got a few things very right:

  - SIDs
  - access tokens
    (like struct cred / cred_t in Unix kernels,
     but exposed as a first-class type to user-land)
  - security descriptors
    (like owner + group mode_t + ACL in Unix land,
     but as a first-class type)
  - HANDLEs, as you say
  - HANDLEs for processes
Many other things, Windows got wrong. But the above are far superior to what Unix has to offer.

mananaysiempre · 4 years ago
How are SIDs the right thing?

Superficial silliness like allocating 48 bits to encode integers in [0,18] aside, what problem do structured SIDs actually solve? I’ve been trying to figure that out for the last couple of days and I still don’t get it, possibly because the Windows documentation doesn’t seem to actually say it anywhere.

I completely agree with having UUIDs or something in that vein for user and group IDs and will not dismiss IDs for sessions and such in the same namespace (although haven’t actually seen a use case for those), but structured variable-length SIDs as NT defines them just don’t make sense to me.

twoodfin · 4 years ago
I’d add an I/O interface to the kernel that was built to be asynchronous from Day 0.
al2o3cr · 4 years ago
I'd be curious how many of those derive from NT's VMS roots - for instance:

http://lxmi.mi.infn.it/~calcolo/OpenVMS/ssb71/6346/6346p004....

monocasa · 4 years ago
These decisions all predate Windows and weren't made in reaction to it. They were a reaction to the awful mainframe ways of spawning processes, like using JCL.

We've sort of come back to that with Kubernetes YAML files that describe how to launch an executable in a specific environment along with all of the resources it needs. The lineage can be traced explicitly: the Borg paper references mainframes and knowingly calls the language that would later be replaced by Kubernetes's YAML files 'BCL', a nod to z/OS's JCL.

zozbot234 · 4 years ago
Plan9 is a lot older than Kubernetes and has the same namespacing of all processes. So it's not impossible to have a "*nix like" OS that still has mainframe-like separation of concerns to ease deployment.
ogazitt · 4 years ago
Having written server software that had to work in both places, I always loved the simplicity of fork(2) / vfork(2) relative to Windows CreateProcess. Threading models in Win32 were always a pain. Which only got worse with COM (remember apartment threading? rental threading? ugh)

Back in the 90s, processes had smaller memory footprints, and every UNIX my software supported had COW optimizations. So the difference between fork(2) and vfork(2) was not very large in practice. Often, the TCP handshake behind the accept(2) call was of more concern than how long it would take fork(2) to complete. Of course, bandwidth has increased by a factor of 1000 since then, so considerations have changed.

t43562 · 4 years ago
It's how CreateProcess handles command-line arguments that infuriates me - not as an argv array but as one big string. It's so difficult to work around the quoting.
pcwalton · 4 years ago
The problem with WaitForMultipleObjects (WFMO) is that it's limited to 64 handles, which basically makes it useless for anything where the number of handles is dynamic as opposed to static. There are ways to get around this limitation by grouping handles into trees, but it's tremendously clunky.
AnIdiotOnTheNet · 4 years ago
UCS-2 seemed like a good(ish) idea at the time when Unicode's scope didn't include every possible human concept represented in icon form and UTF-8 hadn't yet been spec'd on a napkin by the first adults to bother thinking about the problem.
xiaq · 4 years ago
Even in 1989, it should have been clear that 16 bits were not enough to encode all of the Chinese characters, let alone encoding all the human scripts. Unicode today encodes 92,865 Chinese characters (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs).

The only reason anybody would think UCS-2 was a good idea was that they did not consult a single Chinese or Japanese scholar on Chinese characters.

cryptonector · 4 years ago
Quite true. One of the things Windows got very wrong was UCS-2 and, later, UTF-16. So did JavaScript.
marwis · 4 years ago
Is there any difference between Windows HANDLE and Linux file descriptor? Aren't they both just indexes into a table of objects managed by the kernel?
cryptonector · 4 years ago
HANDLE values are opaque, and generally not reused. Imagine an implementation like this:

  typedef struct HANDLE_s {
    uintptr_t ptr;
    uintptr_t verifier;
  } HANDLE;
where `ptr` might be an index into a table (much like a file descriptor) or maybe a pointer in kernel-land (dangerous sounding!) and `verifier` is some sort of value that can be used by the kernel to validate the `ptr` before "dereferencing" it.

On Unix the semantics of file descriptors are dangerous. EBADF can be a symptom of a very dangerous bug where some thread closed a still-in-use FD, then an open gets the same FD, and now maybe you get file corruption. This particular type of bug doesn't happen with HANDLEs.
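
The reuse is easy to demonstrate, since POSIX requires open() to return the lowest free descriptor number:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int a = open("/dev/null", O_WRONLY);
        close(a);                             /* fd number freed...           */
        int b = open("/dev/zero", O_RDONLY);  /* ...and handed right back out */
        printf("a=%d b=%d\n", a, b);          /* same number                  */
        /* A stale write(a, ...) in another thread now hits the new file. */
        return 0;
    }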

Cloudef · 4 years ago
Isn't HANDLE basically fd?
notriddle · 4 years ago
FD has been gradually turned into HANDLE.
cryptonector · 4 years ago
Well, I'm surprised to see this on the front page, let alone as #1. Ask me anything.

EDIT: Also, don't miss @NobodyXu's comment on my gist, and don't miss @NobodyXu's aspawn[1].

  [0] https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234?permalink_comment_id=3467980#gistcomment-3467980 
  [1] https://github.com/NobodyXu/aspawn/

Lerc · 4 years ago
Since you said anything... This is not strictly related to the article but your expertise seems to be in the right area.

I have a process that executes actions for users, at the moment that process runs as root until it receives a token indicating an accepted user, then it fork()s and the fork changes to the UID of the user before executing the action.

Is there a better way? I hadn't actually heard of vfork() before reading this article. I'm guessing maybe you could do a threaded server model where each thread vfork()s. I'm not really aware what happens when threads and forks combine. Does the v/fork() branch get trimmed down to just that one thread? If so what happens to the other thread stacks? It feels like a can of worms.

cryptonector · 4 years ago
If the parent is threaded, then yes, vfork() will be better. You could also use posix_spawn().

As to "becoming a user", that's a tough one. There are no standard tools for this on Unix. The most correct way to do it would be to use PAM in the child. See su(1) and sudo(1), and how they do it.

> I'm not really aware what happens when threads and forks combine. Does the v/fork() branch get trimmed down to just that one thread? If so what happens to the other thread stacks? It feels like a can of worms.

Yes, fork() only copies the calling thread. The other threads' stacks also get copied (because, well, you might have pointers into them, who knows), but there will only be one thread in the child process.

vfork() also creates only one thread in the child.

There used to be a forkall() on Solaris that created a child with copies of all the threads in the parent. That system call was a spectacularly bad idea that existed only to help daemonize: the parent would do everything to start the service, then it would forkall(), and on the parent side it would exit() (or maybe _exit()). That is, the idea is that the parent would not finish daemonizing (i.e., exit) until the child (or grandchild) was truly ready. However, there's no way to make forkall() remotely safe, and there's a much better way to achieve the same effect of not completing daemonization until the child (or grandchild) is fully ready.

In fact, the daemonization pattern of not exiting the parent until the child (or grandchild) is ready is very important, especially in the SMF / systemd world. I've implemented the correct pattern many times now, starting in 2005 when project Greenline (SMF) delivered into OS/Net. It's this: instead of calling daemon(), you need a function that calls pipe() and then fork() or vfork(). On the parent side it calls read() on the read end of the pipe and exits once the read completes; on the child side it returns immediately so the child can do the rest of the setup work, and finally the child writes one byte into the write side of the pipe to tell the parent it's ready, so the parent can exit.
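
A sketch of that pattern (error handling mostly elided; shown with fork(), since a vfork() child can't just return):

    #include <unistd.h>

    /* Returns only in the child; the parent exits 0 once the child signals
       readiness (or 1 if the child dies first and the pipe just closes). */
    static int start_daemonize(int *ready_fd) {
        int pfd[2];
        char b;

        if (pipe(pfd) == -1) return -1;
        switch (fork()) {
        case -1: return -1;
        case 0:                     /* child: go do the rest of the setup */
            close(pfd[0]);
            setsid();
            *ready_fd = pfd[1];
            return 0;
        default:                    /* parent: block until the ready byte */
            close(pfd[1]);
            _exit(read(pfd[0], &b, 1) == 1 ? 0 : 1);
        }
    }

    /* The child calls this only when the service is truly ready: */
    static void daemonize_done(int ready_fd) {
        char b = 0;
        (void) write(ready_fd, &b, 1);
        close(ready_fd);
    }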

aidenn0 · 4 years ago
What about fork(2) for network servers? I've written parallel network servers two ways: open the socket to listen on and call fork() N times for the desired level of parallelism, or just create N processes and use SO_REUSEPORT. I prefer the former. I suppose there is hidden option C of "have a simple process that opens the listening port and then vfork/execs each worker". I find that to be a bit strange because the code will be split into "things that happen before listening on the port" (which includes, e.g., reading configuration files) and "things that happen after listening on the port" (which also includes, e.g., reading configuration files).
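
The former in miniature, for concreteness (sketch, error handling elided; the parent plus N children all accept() on the one listening socket):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa;

        memset(&sa, 0, sizeof(sa));          /* INADDR_ANY */
        sa.sin_family = AF_INET;
        sa.sin_port = htons(8080);
        bind(lfd, (struct sockaddr *)&sa, sizeof(sa));
        listen(lfd, 128);

        for (int i = 0; i < 4; i++)          /* desired parallelism */
            if (fork() == 0) break;

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            /* ... handle the connection ... */
            close(cfd);
        }
    }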
ahmedalsudani · 4 years ago
No questions yet as I am yet to read ... but I can already comment and say grade A title.
cryptonector · 4 years ago
It's a bit opinionated. It's meant to get a reaction, but also to have meaningful and thought-provoking content, and I think it's correct in the main too. Anyways, hope you and others enjoy it.
monocasa · 4 years ago
Hard disagree to most of this.

fork(2) makes a lot more sense when you realize its heritage. It came from a land before Unix supported full MMUs. In this model, to still have per-process address spaces and preemptive multitasking on what was essentially a PC-DOS level of hardware, the kernel would checkpoint the memory for a process, slurp it all out to dectape or some such, and load in the memory for whatever the scheduler wanted to run next. Its simplicity of being process-checkpoint based wasn't a reaction to Windows-style calls (which wouldn't exist for almost a couple of decades), but instead to mainframe process-spawning abominations like JCL. The idea "you probably want most of what you have, so force a checkpoint, copy the checkpoint into a new slot, and continue separately from both checkpoints" was soooo much better than JCL and its tomes of incantations to do just about anything.

vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec(). All of the bugs that causes are super fun to chase, lemme tell you. AFAIC, about the only valid use for vfork now is nommu systems where fork() is incredibly expensive compared to what is generally expected.

clone(2) is great. Start from a checkpoint like fork, but instead of semantically copying everything, optionally share or not based on a bitmask. Share a tgid, virtual address space, and FD table? You just made a thread. Share nothing? You just made a process. It's the most 'mechanism, not policy' way I've seen to do context creation outside of maybe the l4 variants and the exokernels. This isn't an old holdover, this is how threads work today, processes spawned that happen to share resources. Modern archs on linux don't even have a fork(2) syscall; it all happens through clone(2). Even vfork is clone set to share virtual address space and nothing else that fork wouldn't share. Namespaces are a way to opt into not sharing resources that normally fork would share.
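
For flavor, the classic glibc clone() usage (Linux-specific sketch; a real thread needs much more, e.g. CLONE_THREAD and CLONE_SETTLS for TLS):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    static int child_fn(void *arg) {
        printf("child: %s\n", (char *)arg);
        return 0;
    }

    int main(void) {
        size_t sz = 1024 * 1024;
        char *stack = malloc(sz);    /* stack grows down: pass the top */

        /* Share memory, cwd/umask, fd table, and signal handlers:
           thread-like. Drop the CLONE_* flags and it acts like fork(). */
        int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;
        pid_t pid = clone(child_fn, stack + sz, flags, "hello");
        if (pid == -1) return 1;

        waitpid(pid, NULL, 0);   /* SIGCHLD as exit signal => plain waitpid works */
        free(stack);
        return 0;
    }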

And I don't see what afork gets you that clone doesn't, except afork isn't as general.

Quekid5 · 4 years ago
(This is a bit of a tangent, apologies.)

> fork(2) makes a lot more sense when you realize its heritage.

I think it only makes sense when you consider its heritage. It has ALL the wrong defaults for what it's almost always used for these days: running a subprocess.

It copies "random" kernel data structures like open FDs, etc. and you have to be very careful about closing the ones you don't want to be inherited, etc. etc. It may copy things that weren't even a relevant concept when you wrote your program.

The correct thing to do is to be very explicit about what you want to pass on to the subprocess, and to choose safe defaults for programs compiled against the old API when you extend it. (Off the top of my head, the only things I'd want to be automatically inherited by default would be the environment and CWD.)

It's 100% the wrong API for spawning processes.

Now, I don't think afork() solves any of these problems, AFAICT. But my personal perspective is that fork() and its derivatives are the wrong starting point in the first place for what they are used for in 99% of all cases.

twic · 4 years ago
The behaviour of subprocesses inheriting resources like file descriptors is absolutely bizarre. Why on earth would you want this to be the default?! But we're so used to it, we think it's normal.
monocasa · 4 years ago
Practically, this is the struct you have to fill in if you don't use clone or fork.

https://github.com/torvalds/linux/blob/719fce7539cd3e186598e...

IMO clone looks a lot better than screwing with that giant struct and all of the kernel bugs that would exist from validating every goofy way those options could be setup wrong by user space.

cryptonector · 4 years ago
afork() could do some things differently. The point of afork() is to be able to spawn child processes (that will exec-or-_exit) faster.
kragen · 4 years ago
The PDP-11 had segment base registers and memory protection, so it wasn't necessary to swap out one process to run another one at the same (virtual) address. It didn't have paging, so it couldn't swap out part of a segment. I think it's true that PDP-11 fork() would stop the process to make a copy of the writable segments, but it didn't have to "checkpoint" the process to a disk or tape. Are you talking about the PDP-7? I don't know anything about the PDP-7.

I agree about vfork(), since I haven't seen a system with segment base registers and no paging in a long time, and about clone(). Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.

What's the L4 approach? Construct the state of the process you want to run in some memory and then use a launch-new-thread system call, then possibly relinquish access to that memory?

monocasa · 4 years ago
> Are you talking about the PDP-7?

Yes

> Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.

clone was literally designed to support posix threads.

> What's the L4 approach?

Capabilities over all of the kernel objects so user space can do safe brain surgery on them. Since everything is capability based, including the cap tables, you end up duping a cap table, allocating a non-running thread, setting registers, and attaching the duped cap table. Four syscalls in the minimal case, but it's L4 so they're fairly cheap. Full disclosure: one of my side projects is a kernel with caps and a first-class VM to do that in one syscall amortized.

cryptonector · 4 years ago
> vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec().

What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you are the author of the code calling vfork() and you know not to do that, so why would that happen?

A: It just wouldn't happen.

And as to exec() failing, this is why the exec call in the child must be followed by a call to _exit(), and this is true even if you use fork() instead of vfork(). I.e.:

    /* do a bunch of pre-vfork() setup */
    ...
    
    pid_t pid = vfork();
    
    if (pid == -1) err(1, "Couldn't vfork()");
    
    if (pid == 0) {
      /* do a bunch of child-side setup */
      execve(...);
      /* oops, ENOENT or something */
      _exit(1);
    }
    
    /* the child either exec'ed or exited */
    if (waitpid(pid, &status, 0) != pid) err(1, "...");
    
    ...
How do you detect if the child exec'ed or exited? Well, you make a pipe before you vfork(), you set its ends to be O_CLOEXEC, then on the child side of vfork() you write one byte into it if the exec call fails. On the parent side you read from the pipe before you reap the child, and if you get EOF then you know the child exec'ed, and if you get one byte then you know the child exited. The one byte could be an errno value.
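
A sketch of that trick (error-path cleanup elided; strictly speaking POSIX permits a vfork() child to do almost nothing, so treat this as illustrating the idea, not as gospel):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Returns the child's errno if its exec failed, 0 if it exec'ed.
       The caller still reaps the child with waitpid(). */
    static int vfork_exec(pid_t *pidp, char *const argv[], char *const envp[]) {
        unsigned char e = 0;
        int pfd[2];

        if (pipe(pfd) == -1) return errno;
        (void) fcntl(pfd[0], F_SETFD, FD_CLOEXEC);
        (void) fcntl(pfd[1], F_SETFD, FD_CLOEXEC);

        *pidp = vfork();
        if (*pidp == -1) return errno;
        if (*pidp == 0) {
            execve(argv[0], argv, envp);
            e = (unsigned char) errno;    /* exec failed */
            (void) write(pfd[1], &e, 1);
            _exit(127);
        }
        close(pfd[1]);
        /* EOF means CLOEXEC closed the write end, i.e. the exec succeeded. */
        if (read(pfd[0], &e, 1) != 1) e = 0;
        close(pfd[0]);
        return e;
    }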

No, really, what you say about vfork() is lore, and very very wrong.

That said, vfork() blocks a thread in the parent. The point of my gist was to explain why fork() sucks, why vfork() is much better, and what would be better still.

> And I don't see what afork gets you that clone doesn't, except afork isn't as general.

afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.

clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.

monocasa · 4 years ago
> What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you're the author of the code calling vfork() and you know not to do that

Within a sentence you described the stack modification. 'It's not a footgun, just don't make mistakes' doesn't hold a lot of water with me.

> No, really, what you say about vfork() is lore, and very very wrong.

Like I've said elsewhere in the comments, I've literally had to fix awful bugs, some security related, from how much vfork() is a preloaded foot gun with the safety off. Not everyone who has a bad impression of it is just following the "lore".

> afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.

Ok, but I'm not going to hold it against clone for being a more general solution.

> clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.

I agree with this, but there's practical reasons why this isn't the case, mainly around how asking user space for every little thing is expensive, and large sparse structs to copy into kernel space covering basically everything in struct task sounds like a special kind of security hell I would not want to be a part of.

A flag to clone to create an empty process and something like a bunch of io_uring calls or a box program to hydrate the new task state would be really neat, and has been kicked around a bunch. There's just a ton of corner cases that haven't been ironed out.

quietbritishjim · 4 years ago
Your code snippet assumes that your C compiler is just a high-level assembler. But it's not - it executes against a theoretical C virtual machine that doesn't know about forking. It's allowed to generate some non-obvious code so long as it acts "as if" it has the same behaviour - but only from the point of view of that theoretical C VM.

For example, in theory _exit(1) could be implemented as longjmp(...) up to a point in some compiler-created top-level function that wraps up main(). Then that wrapper function could perform some steps to communicate the return code to the OS that trashes the stack before actually exiting. After all, if the process is about to exit anyway, what difference does it make if a bunch of memory is fiddled with? We know the answer to this but, from the point of view of the C virtual machine, it's irrelevant.

That particular scenario is unlikely, but the point is that compiler implementations and optimisations are allowed to do very non-obvious things. You're only safe if you stick to the rules of the C standard, which this 100% does not.

__s · 4 years ago
Stack manipulations are a real problem. Say some parameter to exec after vfork uses stack slots created by the compiler for temporary variables. Sure, you compute those before the call to vfork, but then the compiler applies code motion...
jandrese · 4 years ago
I'm still struggling to understand the point of vfork(). The whole point of fork is to offload work to a different part of your program so the original part can continue to do work. The entire idea fails if it halts the original program for the duration of the child's life. How is this better than just doing a regular function call?
ddulaney · 4 years ago
vfork halts the parent until the child exits or calls exec, getting its own address space. In the normal case, you vfork and immediately exec, and the parent continues on with what it was doing. The time between vfork and exec is “special” in that the child is temporarily running in the parent’s address space, then it uses exec to separate and do its own thing.
monocasa · 4 years ago
I've seen an argument for it when immediately execing: why mark the whole mutable process VA space as 'trap on write', including the thread stack that you're about to immediately write to, if you're going to throw that work away and exec()? There's also 'I want to support cheap forks on a nommu system and vforking is easier to retrofit in'.
cryptonector · 4 years ago
If you really think vfork() is hard to use because of the stack sharing, then avfork() should be good for you!
mark_undoio · 4 years ago
The code I currently work on actually has a use of `clone` with the `CLONE_VM` flag to create something that isn't a thread. Since `CLONE_VM` will share the entire address space with the child (you know, like a thread does!) a very reasonable response would be "WAT?!"

What led us here was a need to create an additional thread within an existing process's address space but in a way that was non-disruptive - to the rest of the process it shouldn't really appear to exist.

We achieved this by using `CLONE_VM` (and a handful of other flags) to give the new "thread-like" entity access to the whole address space. But, we omitted `CLONE_THREAD`, as if we were making a new process. The new "thread-like" entity would not technically be part of the same thread group but would live in the same address space.

We also used two chained `clone()` calls (with the intermediate exiting, like when you daemonise) so that the new "thread-like" wouldn't be a child of the original process.

All this existed before I joined; it's just really cool that it works. I've never encountered such a non-standard use of clone before, but it was the right tool for this particular job!
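
Very roughly, the shape of it (a heavily simplified sketch with invented names; the real flag set, stacks, and ptrace plumbing are more involved):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>

    #define STACK_SZ (64 * 1024)
    static char stk1[STACK_SZ], stk2[STACK_SZ];

    static int helper(void *arg) {
        /* Shares the original address space (CLONE_VM), has its own pid,
           is not in the thread group (no CLONE_THREAD), and after the hop
           below is not a child of the original process either. */
        for (;;) pause();
    }

    static int intermediate(void *arg) {
        /* exit signal 0: the original process never sees a SIGCHLD for it */
        clone(helper, stk2 + STACK_SZ, CLONE_VM, arg);
        _exit(0);    /* dying here reparents the helper away from us */
    }

    void spawn_hidden_helper(void) {
        clone(intermediate, stk1 + STACK_SZ, CLONE_VM, NULL);
        /* (reaping the intermediate needs waitpid(..., __WALL); omitted) */
    }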

scottlamb · 4 years ago
> What led us here was a need to create an additional thread within an existing process's address space but in a way that was non-disruptive - to the rest of the process it shouldn't really appear to exist.

I'm curious to hear more. What's its purpose?

mark_undoio · 4 years ago
> I'm curious to hear more. What's its purpose?

Sure! I'll try to illustrate the general idea, though I'm taking liberties with a few of the details to keep things simple(r).

Our software (see https://undo.io) does record and replay (including the full set of Time Travel Debug stuff - executing backwards, etc) of Linux processes. Conceptually that's similar to `rr` (see https://rr-project.org/) - the differences probably aren't relevant here.

We're using `ptrace` as part of monitoring process behaviour (we also have in-process instrumentation). This reflects our origins in building a debugger - but it's also because `ptrace` is just very powerful for monitoring a process / thread. It is a very challenging API to work with, though.

One feature / quirk of `ptrace` is that you can't really do anything useful with a traced thread that's currently running - including peeking its memory. So if a program we're recording is just going about its day, we can't examine it whenever we want.

First choice is just to avoid messing with the process but sometimes we really do need to interact with it. We could just interrupt a thread, use `ptrace` to examine it, then start it up again. But there's a problem - in the corners of Linux kernel behaviour there's a risk that this will have a program-visible side effect. Specifically, you might cause a syscall restart not to happen.

So when we're recording a real process we need something that:

* acts like a thread in the process - so we can peek / poke its memory, etc via ptrace

* is always in a known, quiescent state - so that we can use ptrace on it whenever we want

* doesn't impact the behaviour of the process it's "in" - so we don't affect the process we're trying to record

* doesn't cause SIGCHLD to be sent to the process we're recording when it does stuff - so we don't affect the process we're trying to record

Our solution is double clone + magic flags. There are other points in the solution space (manage without, handle the syscall restarting problem, ...) but this seems to be a pretty good tradeoff.

[edit: fixed a typo]

kccqzy · 4 years ago
Maybe some kind of snapshotting for an in-memory database?
Ericson2314 · 4 years ago
This stuff is still all confused

Read http://catern.com/rsys21.pdf

What you want is:

1. create "embryonic" unscheduled process

2. Set it up from the parent process; it just lies on the operating table passively.

3. Submit it to the scheduler.

This is just... obviously correct. Totally flexible. Totally efficient. Hell, if you really want to fork anything, fork those embryonic processes which have no active threads! Much safer and easier to understand!
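
In hypothetical-API form (none of these calls exist; the names are invented purely to illustrate the three steps):

    /* 1. create an "embryonic" unscheduled process -- inert, no threads */
    int embryo = proc_embryo_create();

    /* 2. set it up from the parent while it lies there passively
       (pipe_rd, argv, envp are whatever the parent has prepared) */
    proc_embryo_dup_fd(embryo, 0, pipe_rd);           /* its fd 0 = our pipe */
    proc_embryo_load(embryo, "/bin/sh", argv, envp);  /* program image      */

    /* 3. submit it to the scheduler */
    proc_embryo_start(embryo);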

I did not write the paper above, but I did write

https://lore.kernel.org/lkml/f8457e20-c3cc-6e56-96a4-3090d7d...

https://lists.freebsd.org/archives/freebsd-arch/2022-January...

I hope I or someone else will have time to make it happen!

IgorPartola · 4 years ago
When I was first learning about UNIX and similar OSes I just assumed that this is how things worked because this is the obvious way of doing it. Why would you fork a process, then try to determine which of the two processes you are, then fix whatever the parent process messed up in your global state, and only then execute what you actually wanted to do? That seems insane (I guess until you realize that the main use case is creating /bin/sh).
Ericson2314 · 4 years ago
Me too!

But even when writing /bin/sh, I don't see why this would get in the way? I was once told earlier Unix didn't even have fork, but something more purpose-made for shells instead.

zokier · 4 years ago
Sounds a bit like Fuchsia's launchpad library, where you create a launchpad object, do all the setup, and then call launchpad_go to actually start the process. Launchpad doesn't allow arbitrary syscalls in the setup, so in that sense it is maybe closer to a "spawn" interface, but with better ergonomics.

https://cs.opensource.google/fuchsia/fuchsia/+/main:zircon/s...

Ericson2314 · 4 years ago
Yes, it is basically the same thing. Fuchsia has the capabilities mindset that would lead one here.
cryptonector · 4 years ago
Yes, I like the larval process idea. No doubt it's good.
londons_explore · 4 years ago
I was always disappointed by the performance of fork()/clone().

CompSci class told me it was a very cheap operation, because all the actual memory is copy-on-write, so it's a great way to do all kinds of things.

But the reality is that duplicating huge page tables and hundreds of file descriptors is very slow. Like tens of milliseconds slow for a big process.

And then the process runs slowly for a long time after that, because every memory write ends up causing faults and page copying.

I think my CompSci class lied to me... it might seem cheap and a neat thing to do, but the reality is there are very few use cases where it makes sense.
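
Easy to check with a rough benchmark along these lines (sketch; numbers vary enormously with resident set size and kernel):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(int argc, char **argv) {
        /* Touch N MiB so it's resident (and its page tables exist). */
        size_t mb = argc > 1 ? strtoul(argv[1], NULL, 10) : 1024;
        char *p = malloc(mb << 20);
        if (!p) return 1;
        memset(p, 1, mb << 20);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pid_t pid = fork();
        if (pid == 0) _exit(0);    /* child exits immediately */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        waitpid(pid, NULL, 0);

        printf("fork() with ~%zu MiB resident: %.3f ms\n", mb,
               (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
        return 0;
    }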

vgel · 4 years ago
CS classes (and, far too often, professional programmers) talk about computers like they're just faster PDP-11s with fundamentally the same performance characteristics.
mark_undoio · 4 years ago
Agreed that these costs can be larger than is perhaps implied in compsci classes (though it's possible that they've changed their message since I took them!)

I suppose it is still essentially free for some common uses - e.g. if a shell uses `fork()` rather than one of the alternatives it's unlikely to have a very big address space, so it'll still be fast.

My experience has been that big processes - 100+GB - which are now pretty reasonable in size really do show some human-perceptible latency for forking. At least tens of milliseconds matches my experience (I wouldn't be surprised to see higher). This is really jarring when you're used to thinking of it as cost-free.

The slowdown afterwards, resulting from copy-on-write, is especially noticeable if (for instance) your process has a high memory dirtying rate. Simulators that rapidly write to a large array in memory are a good example here.

When you really need `fork()` semantics this could all still be acceptable - but I think some projects do ban the use of `fork()` within a program to avoid unexpected costs. If you really have a big process that needs to start workers I guess it might be worth having a small daemon specifically for doing that.

cryptonector · 4 years ago
Right, shells are not threaded and they tend to have small resident set sizes. Even in shells, though, there's no reason not to use vfork(), and if you have a tight loop starting a bunch of child processes, you might as well use it. Though, in a shell, you do need fork() in order to trivially implement sub-shells.

fork() is most problematic for things like Java.

smasher164 · 4 years ago
Also, mandating copy-on-write as an implementation strategy is a huge burden to place on the host. Now you've made the amount of memory a process is using unquantifiable.
vgel · 4 years ago
It's not necessarily unquantifiable -- the kernel can count the not-yet-copied pages pessimistically as allocated memory, triggering OOM allocation failures if the amount of potential memory usage is greater than RAM. IIUC, this is how Linux vm.overcommit_memory[1] mode 2 works, if overcommit_ratio = 100.

However, if an application is written to assume that it can fork a ton and rely on COW to not trigger OOM, it obviously won't work under mode 2.

[1] https://www.kernel.org/doc/Documentation/vm/overcommit-accou...

> 2 - Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable amount (default is 50%) of physical RAM.

> Depending on the amount you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate.

> Useful for applications that want to guarantee their memory allocations will be available in the future without having to initialize every page.

cryptonector · 4 years ago
POSIX doesn't require that fork() be implemented using copy-on-write techniques. An implementation is free to copy all of the parent's writable address space.
immibis · 4 years ago
You also mandate a system complex enough to have an MMU.
cryptonector · 4 years ago
Copy-on-write is supposed to be cheap, but in fact it's not. MMU/TLB manipulations are very slow. Page faults are slow. So the common thing now is to just copy the entire resident set size (well, the writable pages in it), and if that is large, that too is slow.