Solve the libGL ABI problem #31189

Open
dezgeg opened this issue Nov 3, 2017 · 26 comments · May be fixed by #337995

Comments

@dezgeg
Contributor

dezgeg commented Nov 3, 2017

The problem

The design of libGL drivers is such that the userspace part of the driver consists of a libGL.so that gets loaded in each process using OpenGL. That is, each driver vendor (Mesa, NVIDIA, AMD) ships their own libGL.so that we select dynamically (and impurely) by having NixOS set:

LD_LIBRARY_PATH=/run/opengl-driver/lib:/run/opengl-driver-32/lib

and having the NixOS module set the symlinks to point to the proper packages depending on the system configuration. While the OpenGL ABI itself is stable, this impurity causes a major pain point for us: conflicting versions of any library that both the driver and the application depend on.
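To make the selection mechanism concrete, here is a minimal sketch of a first-match search along a colon-separated path, which is the behaviour the LD_LIBRARY_PATH setting above relies on. The function name `resolve_library` is made up for illustration, and the sketch ignores details of the real loader such as empty path entries, the ELF class check that separates 32-bit from 64-bit libraries, and the loader cache.

```python
import os


def resolve_library(soname, ld_library_path):
    """Return the first existing file named `soname` along the search path.

    `ld_library_path` is a colon-separated string such as
    "/run/opengl-driver/lib:/run/opengl-driver-32/lib".
    Simplification: empty entries are skipped and no ELF checks are done.
    """
    for directory in ld_library_path.split(":"):
        if not directory:
            continue
        candidate = os.path.join(directory, soname)
        if os.path.exists(candidate):
            # The first hit wins; later directories are never consulted.
            return candidate
    return None
```

Whatever the symlinks in the first directory point at is what every OpenGL process picks up, regardless of what the application was built against.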

Issue #16779 shows a manifestation of this problem: applications built on NixOS 16.03 would stop working on NixOS 16.09 because of a version conflict in libwayland.so, which is used both by the application and by Mesa. The application itself causes version X of libwayland.so to be loaded into the process, but Mesa requires version Y of libwayland.so, so the application cannot start up and fails with:

Note that this problem is not inherently specific to NixOS -- the same problem is known to happen on other distros as well when the libstdc++ version provided by the Steam runtime conflicts with the libstdc++ that Mesa requires.
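The conflict can be modelled with a toy loader that, like a single dynamic-linker namespace, keeps only one copy per soname. This is a sketch under simplifying assumptions: the `Process` class and the symbol names `sym_old` and `sym_new` are hypothetical, and the real failure mode depends on the libraries involved.

```python
class LoadError(Exception):
    """Raised when a needed symbol is missing from the loaded copy."""


class Process:
    """Toy model of one linker namespace: at most one copy per soname."""

    def __init__(self):
        self.loaded = {}  # soname -> set of exported symbol names

    def load(self, soname, exports):
        # An soname that is already loaded is reused, never replaced.
        self.loaded.setdefault(soname, set(exports))

    def bind(self, soname, needed_symbols):
        # Check that the copy that won the race exports everything needed.
        missing = set(needed_symbols) - self.loaded[soname]
        if missing:
            raise LoadError(f"undefined symbols in {soname}: {sorted(missing)}")


proc = Process()
# The application pulls in version X first (hypothetical symbol set).
proc.load("libwayland-client.so.0", {"sym_old"})
# The driver's request for version Y is satisfied by the copy already loaded.
proc.load("libwayland-client.so.0", {"sym_old", "sym_new"})
proc.bind("libwayland-client.so.0", {"sym_old"})  # the application is fine
try:
    proc.bind("libwayland-client.so.0", {"sym_new"})  # the driver is not
except LoadError as error:
    print(error)
```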

A (potential) solution

An attempt at solving this has been made in the libcapsule project (https://git.collabora.com/cgit/user/vivek/libcapsule.git/tree/README) by a Collabora employee. The approach taken there is to build a stub libGL.so that uses the little-known dlmopen() function to create a completely new symbol namespace for dynamic linking and loads the real libGL.so of the graphics driver there. All exported symbols of the stub libGL.so are then redirected to the entry points of the real libGL.so living in the segregated dynamic linker namespace. This is implemented via a clever hack of patching the PLT of the stub libGL.so to point to the real libGL.so's entry points, so there is zero overhead for function calls to libGL!
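The idea can be illustrated with a toy model in Python. This is not libcapsule's code and does not call dlmopen(); `Namespace`, `StubLibrary`, and the symbol `gl_entry_point` are hypothetical names that only mimic the structure described above: two isolated namespaces, and a stub whose jump table is patched once so that later calls go straight to the real entry points.

```python
class Namespace:
    """Toy stand-in for a dynamic-linker namespace (what dlmopen() creates)."""

    def __init__(self, name):
        self.name = name
        self.libraries = {}  # soname -> {symbol name: callable}

    def load(self, soname, symbols):
        self.libraries[soname] = dict(symbols)
        return self.libraries[soname]


class StubLibrary:
    """Stub whose jump table is patched to point at the real entry points."""

    def __init__(self, exported_names):
        self.jump_table = {name: self._unresolved for name in exported_names}

    @staticmethod
    def _unresolved(*args):
        raise RuntimeError("stub has not been patched yet")

    def patch_from(self, real_symbols):
        # One-time patching: afterwards a call is a plain table lookup.
        for name in self.jump_table:
            self.jump_table[name] = real_symbols[name]

    def call(self, name, *args):
        return self.jump_table[name](*args)


main = Namespace("main")
capsule = Namespace("capsule")
# Each namespace holds its own libwayland copy; the two do not conflict.
main.load("libwayland-client.so.0", {"version": lambda: "X"})
wayland_y = capsule.load("libwayland-client.so.0", {"version": lambda: "Y"})
real_gl = capsule.load(
    "libGL.so.1",
    {"gl_entry_point": lambda: "driver sees wayland " + wayland_y["version"]()},
)

stub = StubLibrary(["gl_entry_point"])
stub.patch_from(real_gl)
print(stub.call("gl_entry_point"))
```

The application in the main namespace only ever talks to the stub, while the driver resolves its own dependencies inside the capsule.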

The problems in practice

I attempted to package and use libcapsule during NixCon 2017, with not-so-great success (https://github.com/dezgeg/libcapsule, https://github.com/dezgeg/nixpkgs/tree/libcapsule). While the approach taken by libcapsule seems theoretically sound, one problem seems to be that the proxied libGL driver also needs to provide exports for libX11.so and some xcb libraries. I'm not totally sure why, but I'm guessing the X11 client driver keeps some per-process state on which GLX client-side library is associated with which X screen, so having two different libX11.so's, one in the main symbol namespace and one inside the capsuled symbol namespace, would break things.

Now, that causes a problem because libraries like libXi (probably accidentally) allocate memory with malloc() from outside the capsule but free it with XFree(), which crashes because XFree() calls the free() inside the capsule, and those two glibcs of course have independent heaps. AFAICT, there's currently no way to have certain libraries loaded only once and shared by both the main dlmopen() namespace and the in-capsule dlmopen() namespace.
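A toy model of the heap mismatch, assuming each glibc copy only knows about the blocks it handed out itself. The `Heap` class is hypothetical and raises a Python exception where a real free() on a foreign pointer would more likely corrupt memory or abort.

```python
class Heap:
    """Toy allocator: tracks only the blocks this copy of the allocator returned."""

    def __init__(self, name, base_address):
        self.name = name
        self.next_address = base_address
        self.live = set()

    def malloc(self, size):
        address = self.next_address
        self.next_address += max(size, 1)  # keep addresses unique
        self.live.add(address)
        return address

    def free(self, address):
        if address not in self.live:
            # Stand-in for the crash described above.
            raise MemoryError(f"{self.name}: free() of unknown pointer {address:#x}")
        self.live.remove(address)


outside = Heap("main namespace glibc", 0x1000)
inside = Heap("capsule glibc", 0x800000)

pointer = outside.malloc(64)  # allocated outside the capsule, as libXi does
try:
    inside.free(pointer)  # XFree() ends up in the capsule's free()
except MemoryError as error:
    print(error)
```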

A potential way to avoid that problem would be to use libcapsule between libglvnd and the driver, which shouldn't require the hack of exporting symbols from libX11. What worries me a bit, though, is whether having multiple glibcs loaded will work either, given that there are some per-process and per-thread kernel APIs where the two glibcs might step on each other's toes (set_robust_list() and sbrk() come to mind). But presumably the glibc people have given at least some thought to that, or dlmopen() as a whole would be pretty much useless...

cc: @vcunat, @abbradar

@vcunat
Member

vcunat commented Nov 3, 2017

AFAIK glibc is very good at staying ABI-compatible, so IMO we had better make it shared, as the risk of e.g. using malloc in one copy and free in the other is the larger one. (I haven't looked into how difficult that might be.)

@dezgeg
Contributor Author

dezgeg commented Nov 3, 2017

I agree. I guess in principle the same PLT patching approach would work, except for exported global variables, which libcapsule currently doesn't know how to redirect.

@TerkiKerel

Is there any progress on this issue? :(

@dezgeg
Contributor Author

dezgeg commented Feb 14, 2018

No new progress. This one:

A potential way to avoid that problem would be to use libcapsule between libglvnd and the driver, which shouldn't require the hack of exporting symbols from libX11.

is still the next plan (at least for me).

@fledermaus

The shared libc approach is in progress, patches are currently awaiting review on libc-alpha.

@dezgeg
Contributor Author

dezgeg commented Aug 29, 2018

Neat! I hope to have time to take a look at this again sometime this year. Especially since (IIRC) in my fork I added some potentially useful things like DT_RUNPATH support.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/xorgserver-upgrade-and-startx/6834/10

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Oct 20, 2020
@xaverdh
Contributor

xaverdh commented Nov 20, 2020

still an issue

@stale

stale bot commented Jun 4, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 4, 2021
@xaverdh
Contributor

xaverdh commented Jun 4, 2021

Well, still an issue, unfortunately.

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 4, 2021
@Atemu
Member

Atemu commented Jul 20, 2021

I have no expertise in this but could our graphics drivers just be statically linked perhaps?

(Note: pkgsStatic.mesa is broken currently)

@vcunat
Member

vcunat commented Jul 20, 2021

The malloc problem from the OP would still apply (especially since you hint at combining musl and glibc).

EDIT: maybe it could work if the drivers were built against quite an old glibc and linked everything except glibc statically.

@Apteryks

IIUC, this problem is the cost to pay to support closed source drivers (AMD, NVIDIA), right? Otherwise, mesa covers all the free software drivers available at least on GNU/Linux and thus should be safe to rely on for libGL.so. Is my understanding correct?

@vcunat
Member

vcunat commented Nov 14, 2021

Yes, I think so. The other cost (that I know of) would be having Mesa drivers in way more closures, though I expect that's a price we'd be willing to pay.

@ppaalanen

I would not expect the libcapsule approach to work with libwayland-client. While libwayland-client is developed in a backward-compatible way so as not to break the ABI, I do not think it was ever designed for multiple different versions of itself to be interoperable.

What I mean is, an application calls libwayland-client to create a wl_display and passes that through the EGL API to the EGL implementation. The EGL implementation had better use the exact same version of libwayland-client, so that the internal details of the opaque wl_display object and others are the same.
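To see why an opaque object cannot safely cross between two library versions, here is a sketch using ctypes with two made-up struct layouts. `DisplayV1` and `DisplayV2` are hypothetical and are not the real wl_display layout; they only show what happens when one version writes an object and another version reads it with a different idea of where the fields live.

```python
import ctypes


class DisplayV1(ctypes.Structure):
    # Hypothetical layout used by "version X" of a library.
    _fields_ = [("fd", ctypes.c_int32), ("last_error", ctypes.c_int32)]


class DisplayV2(ctypes.Structure):
    # Hypothetical layout used by "version Y": a field was added in front.
    _fields_ = [
        ("flags", ctypes.c_int32),
        ("fd", ctypes.c_int32),
        ("last_error", ctypes.c_int32),
    ]


def read_fd_as_v2(display_v1):
    """Interpret the bytes of a V1 object using the V2 layout."""
    raw = bytes(display_v1)
    # Real C code would read past the end of the object here; the sketch
    # pads with zeros instead so that it stays memory-safe.
    padded = raw.ljust(ctypes.sizeof(DisplayV2), b"\0")
    return DisplayV2.from_buffer_copy(padded).fd


display = DisplayV1(fd=7, last_error=22)
print(display.fd)              # what version X stored
print(read_fd_as_v2(display))  # what version Y believes the fd is
```

The second value comes from the bytes of `last_error`, so code built against the other layout would operate on the wrong file descriptor.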

Now, an EGL implementation is different from the GL implementations and drivers, but close enough that I thought it worth mentioning.

With X11 you have the workaround that the EGL or whatever implementation could just open more X11 connections itself and just pass the window id around, but that won't work with Wayland where you must use the same connection where all the protocol objects live, and your windows simply do not exist on additional connections.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/could-we-robustly-protect-against-errors-version-glibc-2-33/18343/4

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/problems-with-using-packages-from-unstable/18999/10

@tobiasBora
Contributor

tobiasBora commented May 11, 2022

I don't know much about the internals of glibc, so I apologize if my comment makes no sense… But I can imagine a generic solution that combines the best of both worlds (runnable on newer systems, and purity when running on older systems): if we wrap all packages, the wrapper could check the GLIBC version used by the drivers of the system. If that version is newer than the GLIBC used by the current system, then we set LD_PRELOAD to the glibc used by the system, and if it is older or equal, then we do nothing and use the glibc hardcoded in the executable. Note that if other libraries always need to be the exact same version in the system and in the software (maybe libwayland-client?), we could certainly handle them here as well.
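The decision such a wrapper would make can be sketched as follows. This is only an illustration of the comparison, under the assumption that the version being compared against is the one of the glibc hardcoded in the executable; the function names and the libc path passed in are hypothetical, and whether preloading a different libc actually works is a separate question.

```python
def parse_version(text):
    """Turn a version string such as "2.33" into a comparable tuple (2, 33)."""
    return tuple(int(part) for part in text.split("."))


def wrapper_environment(driver_glibc, executable_glibc, system_libc_path):
    """Return the extra environment variables the wrapper would export.

    Assumption: `executable_glibc` is the version of the glibc hardcoded in
    the executable, and `system_libc_path` is wherever the system's libc lives.
    """
    if parse_version(driver_glibc) > parse_version(executable_glibc):
        # The system drivers need a newer glibc than the executable ships.
        return {"LD_PRELOAD": system_libc_path}
    # Older or equal: keep the glibc hardcoded in the executable.
    return {}


print(wrapper_environment("2.35", "2.33", "/path/to/system/libc.so.6"))
print(wrapper_environment("2.33", "2.33", "/path/to/system/libc.so.6"))
```

Versions are compared as integer tuples rather than strings so that, for example, 2.9 sorts before 2.33.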

Note that this comes at the price of wrapping all executables or the loaders (wrapping loaders would not work for statically linked binaries, see also #150841), but I don't see any other solution and I'm not skilled enough to understand and compare this approach to the solution proposed above. My hope is that it may work with libwayland-client, but I know nothing about libwayland-client, so I'm curious to know what you think, @ppaalanen.

@ppaalanen

I have no idea about libc.

All I was saying is that you do not want to link two different versions of libwayland-client into the same process.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/do-gui-applications-not-work-on-non-nixos-using-nixpkgs-only/19070/13

@xaverdh
Contributor

xaverdh commented May 18, 2022

IIUC, this problem is the cost to pay to support closed source drivers (AMD, NVIDIA), right? Otherwise, mesa covers all the free software drivers available at least on GNU/Linux and thus should be safe to rely on for libGL.so. Is my understanding correct?

Since NVIDIA appears to be opening up their driver (not quite fully yet, just the kernel module atm), this may actually become viable I guess.

@vcunat
Member

vcunat commented May 29, 2022

The issue here is user-space. The things I've heard so far haven't raised my hopes wrt. this ticket.

EDIT: well OK, perhaps in the sense that it might help improving the nouveau drivers over the following years.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/help-understanding-the-libgl-abi-problem-and-possible-solutions/42022/1

@Pandapip1
Contributor

Pandapip1 commented Nov 14, 2024

Duplicate of #9415? There's a significant amount of history in both issues, but I feel like it should really be consolidated into one.

@Atemu
Member

Atemu commented Nov 14, 2024

They're quite similar and have the same root cause, but this one is more general, as this is an issue you can run into on NixOS as well.
