Fedora 38 LLVM vs. Team Fortress 2

Unfortunately this is exactly the type of stuff that makes supporting commercial apps on linux a nightmare. Weird crashes due to weird linking of system libraries.

Common distros are very adamant about dynamic linking everything in order to support the use case of "core library has vulnerability, upgrade it in place without rebuilding consuming apps." Along with a desire to avoid "dll hell" and force a single canonical version of every library systemwide. This leads to these sorts of issues.

Windows gets around it by letting applications put the DLLs they care about beside the executable, and having it check there first by default.

mjg59 · 3 years ago

This is largely solved with approaches such as Flatpak or Snap, but graphics drivers are still an issue - they're expected to be supplied by the distribution, and components of them end up in-process in the application even if the rest of the application's runtime is shipped with the application. If there's an incompatibility between the application runtime and assumptions made by that driver code (as there appears to be in this case - TF2 ships its own malloc() implementation, but the graphics driver code ends up using it inconsistently and so blows up) then you're going to have problems.

I don't think there's anything about Windows that would fundamentally change things here. Windows apps aren't shipping their own graphics drivers, even if they're bundling everything else.

cesarb · 3 years ago

> TF2 ships its own malloc() implementation, but the graphics driver code ends up using it inconsistently and so blows up) then you're going to have problems. I don't think there's anything about Windows that would fundamentally change things here.

Yes, there is: on Windows, due to the way DLL linking works there, the graphics driver wouldn't use the malloc() implementation from TF2. The flat linking namespace in which you can globally replace the memory allocator for every dynamic library does not exist on Windows; if the graphics driver is linked to the memory allocator from the C library, it will get the memory allocator from that C library, not from some other DLL in the same process.

That's not to say Windows is free of dynamic linking problems. While on Linux it's mostly only NSS and the graphics driver (and only when explicitly requested), on Windows it's common for unrelated third party software to inject DLLs and threads all over every process on the system. And it's not uncommon for these injected DLLs to do things like hooking into system DLLs (by overwriting the entry point of exported functions, or even internal functions), leading to hard-to-diagnose crashes when things are not like they expected.

Dalewyn · 3 years ago

>I don't think there's anything about Windows that would fundamentally change things here. Windows apps aren't shipping their own graphics drivers, even if they're bundling everything else.

Over on Windows, GPU drivers are provided and distributed by the manufacturer, Microsoft themselves might also distribute them through Windows Update.

GPU manufacturers also work together with Microsoft and bigger game dev studios (read: studios with sufficient cash/influence) to make sure everything works well together. The drivers are also signed off by Microsoft, both figuratively and literally.

Linux has none of this. Drivers are provided primarily by volunteers (most of whom couldn't care less about proprietary code), packaged and distributed by each distro, and most game devs couldn't care less about issues concerning less than one percent of their customers.

gabcoh · 3 years ago

Can Linux not trivially do the same thing as windows with LD_PRELOAD? If so why is this more of an issue on Linux than Windows? Is it really less a technical challenge and more just a matter of Linux getting less support from upstream developers?

stabbles · 3 years ago

LD_PRELOAD is too global to be useful, it's hard to scope it to one process (and not child processes). macOS is better in the sense that it clears DYLD_* variables when the dynamic linker has done its work and the process starts. (Although that can also be painful when you want to run a shell script and set DYLD_* outside)

bravetraveler · 3 years ago

I was thinking/wondering this myself. Not to reinvent the wheel - more toss an idea around, but a 'venv for LD_PRELOAD' sounds like it'd deal with this pretty handily

Not... in a way I'd use as a distribution/release maintainer. Probably as an administrator [of my LAN]

LispSporks22 · 3 years ago

It can be done by setting rpath to origin, even post compilation using the patchelf tool. Works great with C shared libraries. Perhaps ABI issues with C++ shared libs introduces other problems.

admax88qqq · 3 years ago

Yes linux _can_, the machinery is there, but culturally the common distros do not. And the defaults do not. On windows I can literally drop a DLL next to an executable and it will pick it up. On linux I have to do a wrapper script to set LD_PRELOAD, or mutate the binary's rpath to get it to load.

It's not really a question of capability, but a question of culture and defaults that makes linux hard to support.

Debian for example goes through great pains (or used to at least) to unbundle shared libraries such as openssl from projects like chromium.

aidenn0 · 3 years ago

This sounds like it's an interaction with the GPU driver though, which could also happen on windows...

ho_schi · 3 years ago

Aehm. That is what a lot of closed-source applications do on Linux. And Valve does that, too.

The open-source ones are maintained in the packing system and kept lean.

doublepg23 · 3 years ago

The funny thing is in on Fedora in 2023 I don't feel like I'm missing out on most software.

nsajko · 3 years ago

Completely off base. If you want to distribute your application to users yourself (instead of letting the distro take care of that), then distribute all dependencies together with it.

marcthe12 · 3 years ago

There are a few dependencies that can not be easily vendored (At least not recommended). Mesa is probably the biggest example (An this case was caused by a mesa dependency). You can vendor them technically or even static link them but then you might end up with limited hardware support. The only alternative is to setup a mini opengl distro.

q3k · 3 years ago

Which libGL.so should I be distributing alongside my application?

sebazzz · 3 years ago

> Unfortunately this is exactly the type of stuff that makes supporting commercial apps on linux a nightmare. Weird crashes due to weird linking of system libraries.

That is the true reason containers were born, isn't it? The kernel is perfect, the public interface of the kernel is perfect. Userspace is a mess. Let's fix it by adding a layer between and have a userspace per application.

eikenberry · 3 years ago

Isn't this exactly the use case for which flatpaks are designed? Isn't Redhat/Fedora in the process of adopting them as the primary way to support third party/proprietary graphical apps like Steam? Doesn't the current Steam flatpak avoid this issue?

TLDR; isn't this already addressed?

MichaelZuo · 3 years ago

It would appear so, I don't understand why this blog post is so popular.

==27467== Invalid write of size 2 ==27467== at 0x406526A: GetSteamContentPath() (pathmatch.cpp:523) ==27467== by 0x4065927: pathmatch(char const*, char\*, bool, char*, unsigned int) [clone .part.1] (pathmatch.cpp:594) ==27467== by 0x4066849: pathmatch (pathmatch.cpp:541) ==27467== by 0x4066849: CWrap (pathmatch.cpp:685) ==27467== by 0x4066849: __wrap___xstat (pathmatch.cpp:907) ==27467== by 0x406294A: stat (stat.h:455) ==27467== by 0x406294A: CFileSystem_Stdio::FS_stat(char const*, stat*) (FileSystem_Stdio.cpp:225) ==27467== by 0x4060819: CBaseFileSystem::AddPackFiles(char const*) (BaseFileSystem.cpp:1325) ==27467== by 0x4060AA4: CBaseFileSystem::AddSearchPathInternal(char const*, char const*, bool) (BaseFileSystem.cpp:254) ==27467== by 0x4060B37: CBaseFileSystem::AddSearchPath(char const*, char const*) (BaseFileSystem.cpp:186) ==27467== by 0x8049003: main (launcher.cpp:413) ==27467== Address 0x45e5f4e is 30 bytes inside a block of size 31 alloc'd ==27467== at 0x4041714: malloc (vg_replace_malloc.c:393) ==27467== by 0x4357C4A: strdup (strdup.c:42) ==27467== by 0x42F1A76: realpath_stk (canonicalize.c:410) ==27467== by 0x42F1A76: realpath@@GLIBC_2.3 (canonicalize.c:432) ==27467== by 0x406525B: GetSteamContentPath() (pathmatch.cpp:520) ==27467== by 0x4065927: pathmatch(char const*, char\*, bool, char*, unsigned int) [clone .part.1] (pathmatch.cpp:594) ==27467== by 0x4066849: pathmatch (pathmatch.cpp:541) ==27467== by 0x4066849: CWrap (pathmatch.cpp:685) ==27467== by 0x4066849: __wrap___xstat (pathmatch.cpp:907) ==27467== by 0x406294A: stat (stat.h:455) ==27467== by 0x406294A: CFileSystem_Stdio::FS_stat(char const*, stat*) (FileSystem_Stdio.cpp:225) ==27467== by 0x4060819: CBaseFileSystem::AddPackFiles(char const*) (BaseFileSystem.cpp:1325) ==27467== by 0x4060AA4: CBaseFileSystem::AddSearchPathInternal(char const*, char const*, bool) (BaseFileSystem.cpp:254) ==27467== by 0x4060B37: CBaseFileSystem::AddSearchPath(char const*, char const*) (BaseFileSystem.cpp:186) ==27467== by 0x8049003: main (launcher.cpp:413)

It should be straightforward to make a little LD_PRELOAD shim to implement the new operator new on top of old overloads and thus restore proper functioning.

It would be a gross kludge, though.

olliej · 3 years ago

I'm not sure that's sound. You can't just redirect an aligned new to the unaligned operator new as you may get unaligned result. It _sounds_ like what is happening is

    a = ::operator new(some size, some alignment)
    ...
    ::operator delete(a);

where delete is dropping the align_val_t parameter that would guarantee it hits the same allocator family. There are a variety of ways this can happen, and let's just take it as given that it is.

The problem is that if operator new(size_t, align_val_t) is called then the struct has an alignment annotation. That can lead to codegen that reasonably assumes alignment, even without any source level decisions that depend on alignment. The result of having some equivalent of (either at runtime or link time)

    void * operator new(size_t sz, align_val_t a) {
      if (operator new(size_t) has been overridden) return ::operator new(sz); 
      ...
    }

could be an "aligned" allocation returning an unaligned value, causing crashes later on.

jenadine · 3 years ago

That's not sound in general, but it is "probably" going to work for this specific case because the previous build was build with allocator that did not support this alignment, meaning that they did not need extra alignment. This is pretty rare actually. And you had anyway to use a custom allocator already with previous C++ versions to make it work.

viraptor · 3 years ago

If you don't mind wasting a bit of time, you could forward size+alignment to the allocator, return the aligned version and keep a record of aligned-to-allocation mapping. (For freeing later)

But as the other comment mentioned - it should be a problem for tf2 in the first place since that's not the behaviour they're after.

Asooka · 3 years ago

The C interface for aligned memory allocation is aligned_alloc(). The returned pointers are always freed with free(). So what is probably happening is that aligned new calls aligned_alloc(), and then aligned delete simply calls the regular delete, expecting to end up in free(), which by design should work with both kinds of pointers.

I think the problem here is partly with the implementation of aligned new/delete. Since one is free to override only the old versions, the ones supplied by the standard library should make sure not to fall back to functions that may be partially overriden.

amluto · 3 years ago

As pure speculation, one could forward to aligned_alloc and still free with ::delete. I haven’t tested this, nor have I looked at the code.

zokier · 3 years ago

LD_PRELOAD would probably run afoul with VAC though?

Polycryptus · 3 years ago

Steam on Linux already uses LD_PRELOAD under-the-hood to load the overlay. Valve signs the overlay SO files, so they could be making an exception for Valve-signed-preloads in VAC, but it's also possible that VAC does something else to check for suspicious libraries loaded in.

This is a predictable outcome of overriding the global operator new. It remains annoying that this was ever allowed, and is a constant source of pain for c++ standard library implementations.

DannyBee · 3 years ago

It actually should still work, since fedora38 includes the llvm15 versioned libs.

The only way to make this break is if something is loading random unversioned solibs or whatever the latest one it can find is, and expecting this to work forever.

If it actually used a versioned solib, it would get llvm 15 just like it did before.

This is the whole point of versioned solibs.

Versioning does not solve the problem.

The aligned allocation operators have existed since llvm 8.x.

The problem is not that the aligned allocation APIs are new. The problem is that TCMalloc is only partially replacing the global allocation APIs, it's just taken until this year for that bug to be exposed.

What has happened is presumably some part of the OS has updated its target C++ version so is now using the aligned allocators, which exposes the gap in TCMalloc.

I'm not sure if the spec explicitly allows an aligned allocation to be fed into an unaligned operator delete, but it seems like implementations do, so that's probably why adopting aligned operator new wouldn't be seen as an ABI break.

phkahler · 3 years ago

It seems more like the app and driver are mixing their new/delete pairs. That seems like a bug to me. Maybe even an API design issue if it's supposed to happen.

mmh0000 · 3 years ago

I loved the premise of the article, though I really wish the author had gone into detail about how he discovered the root cause.

boutique · 3 years ago

Funnily enough, on Half-Life 1 engine-based games (i.e. the engine that came before HL2 - on which Team Fortress 2 runs; such as Counter-Strike 1.6), a different allocator problem exists -- glibc's malloc() just decides to fail miserably[0] on some setups.

[0] https://github.com/ValveSoftware/halflife/issues/3158

kimixa · 3 years ago

that's exactly the sort of error you get if something has written just out of bounds on a malloc'd chunk - it clobbers the allocator's internal state, which appears to be what that assert() is checking.

It's probably an allocation before the failing one that is being misued - so the backtrace pointing to openal doesn't necessarily mean it's openal's fault.

Running with valgrind or another heap memory checking tool will probably be helpful to track down that particular linked bug.

EDIT:

It looks like that there's at least one out-of-bounds write when starting up half life (On arch linux, so maybe slightly different library versions and not loaded the counterstrike mod).

It looks like a valve bug - writing 2 bytes at index [30] of a malloc'd size of 31 goes one byte over, and it looks from the backtrace it's all valve's code and not deep in some library that might have been loaded in. Writing 2 bytes to a string is a bit odd, perhaps it's trying to null-terminate but somehow uses a wstring null? Or some attempt at SIMD that isn't correctly bound?

It doesn't seem to crash for me though, it might just be luck that nothing important is put 1 byte over, and it feels a bit unlikely something would be due to allocation and type alignment requirements, but it's perfectly valid for the malloc implementation to keep something important in that byte.

Or perhaps there's some other dynamics that change this - it looks like it's doing stuff with paths, so may change size (of the allocation or even the amount written) based on where the steam app is installed - stuff like your user name length changing that may be the difference between a crash. Or even another issue somewhere else I didn't see, or valgrind didn't catch.

Just goes to show how many games ship for years with "big" bugs :P

For reference:

Thanks a lot for the guidance/tip, I've learned something new. And you're absolutely right about the cause of the mentioned crash -- I've updated the Github issue with a bit of new info I've gathered.

Regarding the function, here it is: https://github.com/dreamstalker/rehlds/blob/master/rehlds/fi...

Interestingly, strdup gets compiled into:

  89 04 24           mov   [esp+101Ch+name], pszContentPath ; s
  E8 82 DC 00 00     call  strlen
  66 C7 04 03 2F 00  mov   word ptr [pszContentPath+eax], 2Fh ; '/'

Which is basically:

  *(_WORD *)&pPath[strlen(pPath)] = '/';`

and would explain why Valgrind says it goes one byte over.

Karliss · 3 years ago

Whole graphics drivers using LLVM in the backend has caused countless issues. The way I look at it one of the main problems is that graphic API libraries shouldn't leak symbols from implementation details like them using LLVM. They should expose only the graphics API and nothing more.

vchuravy · 3 years ago

Don't ask me about GNU_UNIQUE...

Due to some wonderful C++ features the dynamic linker is forced to unify symbols across shared libraries, even if those symbols have different versions.

This utterly breaks loading multiple libLLVM's except if you build the copy you care about with -no-gnu-unique (or whatever the flag was called)

I have seen wonderful things like the initializers of an already loaded libLLVM being rerun when a new one is loaded.

planede · 3 years ago

The presume wonderful C++ feature is spelled __attribute__((weak)) in GNU C.

IceWreck · 3 years ago

Isn't this the reason why people recommend using the flatpak version of Steam ?

PlutoIsAPlanet · 3 years ago

Yes, especially on Fedora.

This isn't something Fedora is doing wrong, unfortunately some games build against older libraries or are built against Debian/Ubuntu and the Flatpak runtimes generally have better compatibility.

Fedora 38 includes the LLVM15 libs to maintain backwards compatibility.

Why is this automatically using a new, incompatible solib, instead of a versioned solib?

AnssiH · 3 years ago

The LLVM dependency is in the HW-specific driver solib which is loaded by the OpenGL library, which has the same soname as before.

Okay, then why does it fail when it should then still use llvm15?

The author states this is an llvm16 issue, but unless the driver was built against llvm16, it should still be loading llvm15.

If it was built against llvm16 (or loads llvm16), and doesn't work, that's not a failure of anything other than QA testing.