I like how they won't say the Windows API's name. WaitForMultipleObjects is quite nice - an epoll that works with all sorts of things, not just fds. This is kind of a half-assed implementation - sometimes I wish Linux would admit Windows has a good idea once in a while.
fds are equivalent to Windows HANDLEs, and WaitForMultipleObjects() is equivalent to poll().
But, Linux also has epoll() which scales better for non-trivial numbers of things, and Windows has IOCP. So WaitForMultipleObjects isn't particularly special.
Both Windows and Linux have things that don't work with these interfaces. Nonetheless, Linux has been trending towards "waiting on all sorts of things" by making more and more things into fds that can be waited on with poll/epoll. Examples: timerfd, signalfd, eventfd. It's quite a unixy approach.
In fact, Wine already uses eventfd to implement WaitForMultipleObjects. This kernel change is just an optimisation, to speed up Wine, and a workaround for some distros setting Wine's max fd limit too low for Windows apps.
Futexes used to support waiting on multiple futexes, using FUTEX_FD. That was arguably better than the new patch FUTEX_WAIT_MULTIPLE, because in old Linux you could wait for futexes and other fds at the same time - it did work with "all sorts of things".
But FUTEX_FD was removed after searches online found no code using it, and kernel devs didn't like keeping it. (To my mind, this was a surprising, unusual breakage of system call binary compatibility.) The new FUTEX_WAIT_MULTIPLE allows programs like Wine to wait for multiple futexes faster than before, but it's more limited than the old FUTEX_FD because you can't mix them with other things.
WaitForMultipleObjects is problematic for a few reasons:
* limited to 64 handles.
* passes the entire handle buffer to the kernel on every call. That is also a key reason poll is worse than epoll/kqueue. A better way is to let the kernel retain the list of handles across calls.
* when two handles are signalled, the one earlier in the array is always the one returned. So a handle with a lower array index that is frequently signalled can starve out your opportunity to process the later ones.
* since everything happens by returning an index, when multiple handles are signalled, you can only process one at a time, needing a new syscall for each.
That's my point. They should just bite the bullet and implement WaitForMultipleObjects instead of having all these disjoint APIs. Plus, sometimes you want to wait on both kinds of things at once, and that is very difficult on Linux.
It's also telling that Linux has gone through the whole select/poll/epoll madness while WaitForMultipleObjects has worked well in Windows NT since the 90s. It's a proven design.
Personally, for me the problematic part of gaming on Linux has been input (i.e. mouse) latency and the acceleration profile.
I am not sure if this is just my experience, but when using libinput on Fedora, for example, the cursor movement is not exactly precise. This is not obvious when working, but while gaming it is a deal breaker.
Games on Windows can get raw mouse input just by writing the code for it; Linux games usually don't, because that would require root permissions, adding the user to the input group, and so on.
It is a security issue and an X-Window design issue.
Obvious question: is there any notion of granular permissions, i.e. a 'mouse' user group that the game could add itself to? Seems like something like this shouldn't be a deal-breaker.
There is nothing in its design that prevents X Windows from being made to send an event like WM_INPUT, which games could handle much as they handle WM_INPUT on Windows today.
Seems like in an ideal world, games would select that profile programmatically. Assuming, of course, that there's no reason to prefer mouse acceleration and that we're talking about using the mouse as a linear controller for stuff like aiming or moving.
Linux Gaming with Steam is actually quite nice these days. I spent about 3 years using an Ubuntu desktop for all my gaming at home. Most of the games I played installed via steam and worked great on Linux.
The only reason I switched back to a Windows Desktop was that there were just one or two games I specifically wanted to try, but couldn't install to Linux. And once I had switched back (and paid the price for Windows) there were no games that needed Linux, so no motivation to go back.
The article is discussing optimizing Proton, a version of Wine embedded in Steam, meaning you can (try to) play Windows-only games on Linux with minimal effort. It works quite well, though graphically intensive games are more likely to experience glitches or slowdown from the translation.
I don't have the nerves to bear a native Windows install in my environment, so I turned their LTSC into a "Windows Gaming Container" with VFIO.
Alright, it's a headless VM, but it's pared down with 'unfuck' and other telemetry- and uselessness-neutering projects into something safer. Lookingglass peers into one of my GPU's framebuffers, so it lives inside a window on my Linux host.
It's also been at least a year since I've needed it, though, since Steam and wine cover everything else I'd want to play or run, so it might be time to cut it loose for good.
This looks to me like the main change is to make it easier to create mutexes with timeouts.
Isn’t a mutex timing out an indication that:
a) a lock wasn’t needed in the first place or
b) the program is incorrect?
It feels more like they just want the API to match Win32 better, but most of the multithreaded programming I've done lately has just used Go's channels, so I could totally be missing something.
Correctness isn't always the right thing to do. Games, in particular, are full of code that approximates the right thing and falls back to less and less correct solutions. It's more important to be fast than right in a lot of cases. Dropping frames can significantly degrade the experience for players. Dropping an AI pathing algorithm, particle physics computation, or other background task can often be fine or even unnoticed.
Approximations, including temporal ones, are also science.
Ever seen a texture pop in instead of a stutter? If the lock taken on that load had no (very short) timeout, the texture would either not load in time or load when it's no longer needed.
You can't just acquire locks in any order; that's a recipe for deadlock when one process has acquired half the locks and another has the other half, leaving them waiting on each other.
Use of WaitForMultipleObjects is more usually for completion of tasks, in the way that win32 "overlapped" works.
One valid scenario for mutexes timing out is stealing:
1. Each CPU first attempts to take an allocated object from its own pool, using a 1ms timeout to acquire the mutex on the pool.
2. If that fails, the CPU attempts to steal from the other CPU pools, using a zero timeout (immediately return if the mutex can't be taken).
3. If that fails, acquire the mutex on its own pool with an infinite timeout.
This is definitely correct and is much faster than any of the lock-free approaches I could come up with. I guess the general class of problems where mutex timeouts help is when you have "many valid options, some of which are being used by others."
It could be (b) -- it's incorrect because there's a deadlock due to incorrect mutex usage here.
It could also be (c) -- it's correct, but there's contention: the mutex is too coarse, or there's "too much" work being protected by the mutex.
But I think your point about matching Windows is likely the case (this is how Wine implements WaitForMultipleObjects, maybe?). The fd exhaustion with FUTEX_FD means they need another way.
Yes, it looks exactly like WaitForMultipleObjects to me. A very reasonable plan. The mutexes aren't being used to exclude critical sections but as completion indicators for various tasks.
Mutex timeouts show up in some implementations as leases.
Essentially you're putting a lower bound on Availability in favor of Consistency. Lease expiry happens when the lessor isn't around to retire the lease in an orderly fashion. It's detecting a Partition. In theory a network partition, but we all know how CPU boundedness creeps into the system as the feature set or the data set grows... and that can show up even on a solitary machine.
Valve's post suggests that they believe this to be the case, although they didn't explain the specific details.
> We think that if this feature (or an equivalent) was adopted upstream, we would achieve efficiency gains by adopting it in native massively-threaded applications such as Steam and the Source 2 engine.
Wine is a native program, so yes. But it is a niche facility: Wine benefits from it, and other programs that handle a lot of locks/queues also would, but it may not directly benefit native Linux game performance.
I know the patch mentions interactive multimedia applications (games) in particular, but an actual mechanism to implement WaitForMultipleObjects on Linux would be very welcome for many high-performance multi-threaded applications.
Say you have one worker thread per CPU core. On Windows, each thread would get an Event object and you would WaitForMultipleObjects to be able to act on the first unit of work that was complete. On Linux you would have to roll your own solution using lower-level primitives, and it will not be correct. Being able to wait on multiple events on Linux will be awesome.
Linux already has what you’re talking about with eventfd and epoll.
In Linux each thread can get an eventfd and you can POLLIN all of them.
In fact I would argue that using futexes is the "roll your own solution" using lower-level primitives (and easier to fuck up), much more so than eventfd and epoll.
As mentioned (somewhat poorly) in the post, using futexes gives a performance boost, which is not surprising since they are fast userspace mutexes. FWIW I didn't think Windows events had a fast user-space path, but I may be mistaken.
For most worker pool scenarios you’re describing, the overhead of eventfd is probably in the noise.
You’re talking about interfaces for waiting on multiple kernel resources but the new futex interface enables you to wait for multiple user resources.
Though it can emulate a win32 api for waiting on multiple “objects”, it’s strictly more powerful than WaitForMultiple if you are dealing with user objects since futexes impose very few constraints on how your user synchronization object is shaped and how it works.
So, the new interface is totally different from things like epoll. In one case the kernel is helping you wait for multiple user objects, and in the other case it's helping you wait for multiple kernel objects. The distinction is intentional, because the whole point is that the user object that contains the futex can be shaped however the user likes, and can implement whatever synchronization protocol the user likes.
Finally, it's worth remembering that futex interfaces are all about letting you avoid going into the kernel unless there is actually something to wait for. The best part of the API is that it helps you avoid calling it. So for typical operations, if the resource being waited on can have its wait state represented as a 32-bit int in user memory, the futex-based APIs will be a lot faster.
They point out that they already have an implementation that does just this .... and it fails on some programs due to running out of file descriptors (they have one program that needs ~1 million of them ...)
Yes, and you have to cobble together an event implementation out of eventfd and epoll. There are two problems (specifically talking about multi-platform software):
1. You'll likely get it wrong and have subtle bugs.
2. This is significantly different from the Windows model, where you wait on events. Now you have two classes of events - regular ones, and ones that can be waited on as a group. The second class also comes with its own event-manager class that needs to manage the eventfd for this group of events.
You end up with a specialised class of event that needs to be used whenever you need to wait on several of them at once. Then you realise you used a normal POSIX event somewhere else and now you want to wait on that as well, so you have to rewrite parts of your program to use your special multi-waitable event.
It's mostly trivial to write an event wrapper on top of POSIX primitives that behaves the same as Windows Events, except for the part where you might want to wait on multiple of them. I would expect that once this kernel interface is implemented we'll get GNU extensions in glibc allowing for waiting on multiple POSIX events. I absolutely do not want to roll my own thread synchronisation primitives, except for very thin wrappers over the platform-specific primitives. Rolling your own synchronisation primitives is about as fraught with peril as rolling your own crypto.
To be honest, WaitForMultipleObjects will probably become much less useful in the near future. We're getting 32-core workstation CPUs today, and it's quite likely near-future workstations will have CPUs with more than 64 cores, making it impossible to use this classic Windows primitive with one handle per core. But I suspect Microsoft will provide a WaitForMultipleObjectsEx2.
On Linux your workers would push the work onto a single output queue, or could signal a condition variable pointed to by the work item. I've never really felt the need for WaitForMultiple.
This is awesome. If I understand this correctly, this would allow you to write a library that allows you to wait on multiple, fully independent queues from a single consumer thread. At the moment, those queues would have to share a mutex.
I'd like a way to wait on multiple condition variables. Windows has this... Or to treat condition variables as file descriptors so that select/poll/epoll can be used to wait on them (you'd get a notification of a signal, though it might be spurious, so you still have to take the lock and check the condition before you consider the CV signaled).
This legit isn't really like anything in Windows; the closest thing I can think of is that you'd be able to do something like this with XOK wake predicates.
Touchpad flat profile was flat up to a certain velocity, but when you exceeded that it accelerated a lot.
Flat profile on a physical mouse worked correctly. Even if I yanked the device, it wouldn't accelerate.
This is one of the reasons why I'm sticking to Ubuntu 16.04: "xset m 1 0" disables all acceleration on xinput.
That will never happen. You game in Linux because you need Linux first and games second.
Suppose you have:
- a program with fine-grained locks
- an algorithm that needs to acquire N of those locks, one at a time, and can do it in any order
In that case, you really want to:
- first attempt fast path locking on all of them in any order
- if that fails, tell the kernel about all of the locks you are waiting on at once.
This means that you will become unblocked as soon as any lock becomes available.
That will make your program run faster. Therefore, it’s a good improvement for futexes in my opinion.
Wine stands for "Wine Is Not an Emulator".