I'm amazed that polyinstantiation of directories via pam_namespace.so[1] is so unheard of. Setting this up fixes almost all of the qualms mentioned in the article by giving each user their own mount namespace with an isolated /tmp directory (and others if configured). Still, this wouldn't prevent poorly written applications that use /tmp from clashing with other applications running under the same user.
It's relatively easy to set up[2] and provides a pretty strong mitigation against /tmp abuse.

[1] https://www.man7.org/linux/man-pages/man8/pam_namespace.8.ht...
[2] https://docs.redhat.com/en/documentation/red_hat_enterprise_...
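For flavor, here's roughly what the setup in [2] boils down to — a minimal sketch, field layout per the pam_namespace(8) man page; adjust the paths and which PAM service files you touch for your distro:

    # /etc/security/namespace.conf
    # polydir    instance-prefix   method   exempt-users
    /tmp         /tmp-inst/        user     root,adm
    /var/tmp     /var/tmp-inst/    user     root,adm

    # then enable it for the services you care about, e.g. in /etc/pam.d/login:
    session    required    pam_namespace.so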
Is there an easy way to duplicate a specific process' namespace? My biggest issue with all these new features that privatize state is how much harder they make it to reproduce that state.
Back when it was just environment variables, I could pipe /proc/PID/environ to xargs and get basically the same state. Given that things like unix domain sockets may end up in $TMPDIR, I can be left unable to do certain things.
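(For the environment part, that trick is still something like this bash sketch — the PID is made up:)

    pid=1234                        # hypothetical target process
    # /proc/$pid/environ is NUL-delimited KEY=value pairs
    while IFS= read -r -d '' kv; do
        export "$kv"
    done < "/proc/$pid/environ"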
I don't think there is, or should be, a way to do that. Granular copying of per-process resource state seems like a need that would be better served either at a layer closer to the program (i.e. debug hooks in code you control that provide information on how to reconstruct its state) or much further away (e.g. via CRIU/whole-machine snapshots, or scary tricks like SIGSTOP or ptrace-injecting calls to fork(2)).
> I can be left unable to do certain things
Most of what I can imagine of "certain things" falls into two categories: debugging (for which much better tools exist), or concerns that would be better served by a program providing an API of some kind rather than "go muck with state in $TMPDIR".
For example, Kubernetes doesn't use PAM in the pods it creates to run your containers.
You might think "who cares", but I've written code that is agnostic as to whether it's running in a logged-in user's session or something else. https://news.ycombinator.com/item?id=41916623
> There should be per-user temporary directories. In fact, on modern systems there are per-user temporary directories!
On Linux+systemd, I think this is referring to /run/user/$UID. $XDG_RUNTIME_DIR is set to this path in a session by default. There's a spec for that environment variable at <https://specifications.freedesktop.org/basedir-spec/latest/>. I assume there's also some systemd doc talking about this.
On macOS, I see that $TMPDIR points to a path like /var/folders/jd/d94zfh8d1p3bv_q56wmlxn6w0000gq/T/ that appears to be per-user also.
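(So portable code that wants a per-user temp location ends up doing something like this sketch — "myapp" is a made-up name:)

    # prefer the per-user runtime dir, fall back to $TMPDIR, then /tmp
    dir="${XDG_RUNTIME_DIR:-${TMPDIR:-/tmp}}"
    scratch=$(mktemp "$dir/myapp.XXXXXX")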
Unfortunately /run/user/$UID/ is NOT universally available.
On Linux it's typically created by a PAM module (pam_systemd), so if you're not using PAM then it doesn't exist. This means that on Kubernetes pods/containers... it doesn't exist!
Yes, /tmp/ is a security nightmare on multi-user systems, but those are a rarity nowadays.
Lots of things want to write into /tmp, Kerberos among them, but not only. I recently implemented a file-based token cache for JWTs that... is a lot like a Kerberos ticket cache. I needed it because the tokens all have specific aud (audience) values. Now where to keep that cache? The only reasonable place turned out to be /tmp/, precisely because /run/user/$UID/ is not universally available, not even on Linux.
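(The fallback dance looks something like this bash sketch — the cache path is made up, and the -O ownership test is a bashism:)

    d="${XDG_RUNTIME_DIR:-/tmp/token-cache-$(id -u)}"
    mkdir -p -m 0700 "$d" 2>/dev/null
    # refuse the dir if someone squatted the path before we got there
    if [ -L "$d" ] || [ ! -O "$d" ]; then
        echo "refusing unsafe cache dir: $d" >&2
        exit 1
    fi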
> Yes, /tmp/ is a security nightmare on multi-user systems, but those are a rarity nowadays.
What's not a rarity though is apps (or code in general) that you don't fully trust, and that you don't want to give a chance to exfiltrate all your data for example.
Sadly, the POSIX permission model is entirely ill-suited for that, precisely because it tries to solve the multi-user problem, wherein all code belonging to a single user is effectively treated as omnipotent within that user's domain (i.e. the files the user owns). That's why iOS and macOS (the non-POSIX parts) have a container model with strong sandboxing, entitlements, etc.
FreeBSD doesn't create a user tmp when setting up a new user using the automatic tool. I don't recall if that's an option or not. There is, of course, a /tmp.
iOS and macOS go further and separate their (native) apps almost entirely, including temporary files. That way, if you download "Super Free VPN Pro!!", it at least doesn't get access to, say, photos, temporary data or not.
Windows seemed like a mess to me but it’s actually not too bad. Local/roaming per-app data can be separated and the per-user temp would always then be in the local part.
I think the idea of a shared filesystem in general is bad. It’s not the 80s anymore where we’re logging on to a shared mainframe. Applications should be completely sandboxed from each other by default and only allowed to see what they need to see. Real sandboxing by default (not like systemd’s opt in sandboxing, which is an absolute mess) would eliminate entire classes of vulnerabilities.
This has been iOS's (and macOS's, for native apps) model for a long time, yeah. "Multiuser" computers are not common anymore, but computers with a bunch of apps/code that you put different levels of trust in (especially on a level on what you want a given app to have access to), now are.
macOS sandboxing for native apps has a long way to go to catch up with iOS. As one example, any application can read from your clipboard at any time. Web apps are, I believe, much better sandboxed.
> It’s not the 80s anymore where we’re logging on to a shared mainframe.
Modern high performance clusters still follow that logic, and are found in almost all large universities or companies doing research on heavy computational topics (artificial intelligence, comp. chemistry, comp. biology, comp. engineering, and so on)
Oh man, this sort of thing is part of what I love love love about systemd. It bakes in so many great isolation/sandboxing/privacy measures for units! From the article:
> There should be per-user temporary directories. In fact, on modern systems there are per-user temporary directories! But this solution came several decades too late.
> If you have per-user $TMPDIR then temporary filenames can safely be created using the simple mechanisms described in the mktemp(1) rationale or used by the old deprecated C functions. There’s no need to defend against an attacker who doesn’t have sufficient access to mount an attack! There’s no need for sticky directories because there aren’t any world-writable directories.
May I introduce you to PrivateTmp= ?
> PrivateTmp=¶
> Takes a boolean argument. If true, sets up a new file system namespace for the executed processes and mounts private /tmp/ and /var/tmp/ directories inside it that are not shared by processes outside of the namespace

https://www.freedesktop.org/software/systemd/man/latest/syst...
Notably you don't even need to change how programs work (no $TMPDIR necessary)! It creates a filesystem namespace for your process, such that you see the normal fs but with your own /tmp! Your program behaves as convention dictates everywhere else, and existing programs can benefit without being rewritten!
I cannot emphasize enough how many excellent, well integrated, kick-ass security features systemd gives you totally for free. DynamicUser= turns on PrivateTmp= by default and is an easy way to ensure isolation and avoid hand-coding & safely managing uid/gids yourself; I'd start there if you can.
There's so so so many great isolation features in this man page.
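(A sketch of what this looks like in a unit file — the daemon itself is hypothetical:)

    [Service]
    ExecStart=/usr/local/bin/mydaemon
    # private /tmp and /var/tmp, visible only inside this unit's namespace
    PrivateTmp=yes
    # allocate a transient uid/gid at start; per the docs this implies PrivateTmp=
    DynamicUser=yes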
A more canonical means would be to use the runtime directory. Explicitly setting RuntimeDirectory= for each service would be appropriate.
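(Sketch, with "myapp" as a placeholder; %t expands to /run for system units:)

    [Service]
    # systemd creates /run/myapp, chowns it to the service's user,
    # and removes it when the service stops
    RuntimeDirectory=myapp
    Environment=TMPDIR=%t/myapp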
I get your point. Yeah, as a newbie, flipping on random options listed under "sandbox" may be bad for you. But this hardly seems like a good dig against a well-integrated init system that has a lot on tap to do the job very well, in a succinct manner.
Mounting /tmp on tmpfs is also the default on most distros these days, since it means /tmp is always wiped on every reboot. (From a programming perspective, this also means that a file written to /tmp will probably have the fastest read/write speed you can find on the OS, which can be desirable.)
You'd need to pin pages in physical memory to guarantee it stays in physical memory. What happens if an 'attacker' (or accidental user) exceeds available physical memory? OOM Kill other applications? Just don't accept temp data, leading to failures in operations requested by the user or system?
Pages in physical memory are not typically zeroed out upon disuse. Yes, they're temporary... but only guaranteed temporary if you turn the system off and the DRAM cells bleed out their voltage.
By default a tmpfs has a really low RAM priority, so the OS will try to move it to swap space if memory gets low. tmpfs size is specified when the tmpfs is created (and can't be larger than the total memory available, i.e. swap + RAM), but it's only "occupied" as files begin to fill it.
If it gets too full for regular OS operations, you get the fun of the OOM Killer shutting down services (tmpfs is never targeted by the OOM Killer) until the entire OS just deadlocks if you somehow manage to fill the tmpfs up entirely.
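(For reference, the size cap and mode are set at mount time, e.g.:)

    # cap /tmp at 2 GiB; 1777 = world-writable with the sticky bit
    mount -t tmpfs -o size=2G,mode=1777 tmpfs /tmp

    # or persistently, via /etc/fstab:
    # tmpfs  /tmp  tmpfs  size=2G,mode=1777  0  0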
Well, I guess you could tell Linux not to use some memory addresses using the BadRAM feature, then set up an `mtd` device over those memory addresses to create a RAM-based block device, then use `cryptsetup` to encrypt it. If your Linux box is headless and you have a GPU whose RAM is mostly sitting unused, you could use the VRAM.
This is fine until something uses more memory than is available, such as sox insisting on routing a huge audio file through a too-small /tmp, or the MATLAB installer likewise only using /tmp. With sox you could in theory recompile to get it to use some other path (iirc none of the TMP or TMPDIR environment variables did anything), but I instead gave up on the /tmp in memory (it complicated the OpenBSD desktop setup). I forget exactly how I worked around the MATLAB installer issue, probably something horrible involving LD_PRELOAD, or time wasted reconfiguring and rebooting and reconfiguring and rebooting one node to do the install on, plus more time wasted running the massive and bloated installer process(es) under strace to see exactly what file paths were in play.
So, not really a fan of /tmp in memory. (And I don't much run massive and bloated browsers that may murder your SSD lifetime with excessive file writes better diverted to an in-memory /tmp.)
Ideally you have swap, so that stuff that's not actually in use ends up in swap instead of occupying physical memory.
That's why you still usually see machines with unreasonably large amounts of RAM having swap partitions: instead of having data that's rarely, if ever, used occupy precious DRAM, it's much better to have that data in swap so that the DRAM can hold, say, filesystem caches.
I guess the general gist is that shared spaces between users cause security issues.
I recall using 'shared hosting' where instead of using your default IP address for fetching anything from the network, you could do some funky stuff in the shared environment to discover many more IPs that could be used. Useful for scraping and such. Generally any shared hosting that used cpanel would expose all their network interfaces, often a /24 or two.
Any shared resource seems to give rise to security issues. Extracting data through side channels in the hardware's architecture is what woke me up to this.
That's true of physical reality itself. Everything that happens constantly leaks information to the surrounding, spreading outward at the speed of light.
I remember digging into this 10-15 years ago. 'shared hosting' per provider had some arbitrary resource restrictions, but you could still find out via a cron job or some such. Like `cat`ting /etc/network stuff. Basically a sieve.
I agree, but I think that shared mutable global state is a bad default. It'd be better as opt-in (e.g., you get a `/tmp/${USER}`, and your user can `chmod o+rw` it during setup if it needs to be globally mutable).
There are very few "always"es in such matters, but I view this one as an 'except for rare circumstances'. Even when true, it should be modeled as "contained state where the container includes everyone".
The problem is that Unices use access control, rather than capabilities, so ensuring state is shared only by those who need it is quite a bit more difficult than just punting, and declaring that 'those who need it' is 'everyone'.
Nor has the design problem of a user-friendly capabilities architecture truly been solved, IMHO. Nonetheless, we shouldn't confuse convenience with correctness.
How much is 1 GB*0 bytes? Better serve 1GB of /dev/zero instead! (Or even better, /dev/urandom, because zeroes compress very well and are easy to spot.)
> The fix, way back when, should have been for login(8) to create a per-user temporary directory in a sensible place before it drops privilege, and set $TMPDIR so the user’s shell and child processes can find it.
Something like
    tmpdir="/tmp/${USERNAME}"
    # retry until we atomically create a fresh directory (mkdir fails if it exists)
    until mkdir -m 0700 "$tmpdir" 2>/dev/null; do
        rm -rf -- "$tmpdir"
    done
    chown "$USERNAME:$USERGROUP" "$tmpdir"
    export TMPDIR="$tmpdir"
with /tmp having root:root owner with 0o775 permissions on it? Yeah, would've been nice.
macOS does something like this. Not by username, but through /private, which is a private mount, and then /tmp is linked to /private/tmp, as are /var and /etc.
You're right that macOS has per-user temp (and cache) dirs under /private/var/folders/ (since 10.5), but it still has the traditional shared /tmp (via the /private/tmp symlink) since not everything respects the per-user temp dir.

https://magnusviri.com/what-is-var-folders.html
That's not the reason for /private though. Rather, /private is a holdover from NeXTSTEP days which could mount the OS via NFS (NetBoot), and where /private was local to the machine:
"Each NetBoot client will share the server's root file system, but there are several administrative files (such as the NetInfo database, log files, and the swapfile) that must be unique to each client. The server must have a separate directory tree for each client, which the client mounts on its own /private directory during startup. This lets a client keep its own files separate from those of other clients."
IMO even a home-level, per-user tmp directory isn't ideal (though it is better). In a single-user environment, where malware is the biggest concern in current times, what difference does it make if it's a process running under a different user or one that is running under your current user that is attacking you?
In other words, for many systems, a home-level temp directory is virtually the same as /tmp anyway since other than system daemons, all applications are being started as a single user anyway.
And that might be a security regression. For servers you're spinning up most services at bootup and those should either be running fully sandboxed from each other (containerization) or at least as separate system users.
But malware doesn't necessarily need root, or a daemon process user id to inflict harm if it's running as the human user's id and all temp files are in $HOME/.tmp.
What you really want is transient application-specific disk storage that is isolated to the running process and protected, so that any malware that tries to attack another running application's temp files can't since they don't have permission even when both processes are running under the same user id.
At that point malware requires privilege escalation to root first to be able to attack temp files. And again, if we're talking about a server, you're better off running your services in sandboxes when you can because then even root privilege escalation limits the blast radius.
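(On Linux you can approximate that today with a private mount namespace — a sketch that assumes root and a hypothetical app user/binary:)

    # give one process tree its own tmpfs-backed /tmp, invisible to everyone else
    unshare --mount sh -c '
        mount -t tmpfs -o mode=0700 tmpfs /tmp
        exec setpriv --reuid=appuser --regid=appuser --init-groups /usr/bin/app
    '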
I'm guessing, but I would think that the idea is to have all the junk in one place so that it can be safely cleared at startup and excluded from backups.
If the user tmp files were placed in /tmp/${USER}/ then that would achieve the same goal.
> Is there an easy way to duplicate a specific process' namespace?

Also, you can use nsenter(1) to run a command (or even a shell) under another process's mount, PID, network, etc. namespaces:

https://man7.org/linux/man-pages/man1/nsenter.1.html
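(e.g., with a made-up PID:)

    # join PID 1234's mount, PID, and network namespaces and look around
    sudo nsenter -t 1234 -m -p -n /bin/bash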
> On Linux+systemd, I think this is referring to /run/user/$UID.

What do FreeBSD/OpenBSD/NetBSD do?
> Yes, /tmp/ is a security nightmare on multi-user systems, but those are a rarity nowadays.

It's very different from 20-30 years ago.
It's even worse than that: We're all using the same shared applications on some cloud.
> May I introduce you to PrivateTmp= ?

I wonder if Fedora does this by default?
EDIT: I do this more for avoiding certain disk reads/writes than for security, actually.
> If it gets too full for regular OS operations, you get the fun of the OOM Killer shutting down services until the entire OS just deadlocks if you somehow manage to fill the tmpfs up entirely.

shm and memory mounts use half the available system memory by default, so this is not typically possible.
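(Easy to see — the mountpoint is made up:)

    # with no size= option, tmpfs defaults to half of physical RAM
    sudo mount -t tmpfs tmpfs /mnt/scratch
    df -h /mnt/scratch    # Size column shows roughly 50% of RAM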
> are not typically zeroed out upon disuse
They're zeroed when they're reallocated.
> and the DRAM cells bleed out their voltage.
This occurs in less than a second in almost every room temperature environment.
Point being, there always are side channels.
So "echo $API_TOKEN" failed, but getting the output of the complete environment was as easy as "env | base64".
> The fix, way back when, should have been for login(8) to create a per-user temporary directory in a sensible place before it drops privilege.

Anything that requires login(8) or PAM to make it happen is insufficient. This has to happen in environments like Kubernetes too.