0x06 Single GPU Passthrough Ubuntu 24.04
After spending 10+ years running the same Windows installation (that’s been upgraded from HDD to SSD and been through 3 separate computer builds), it finally gave up the ghost when I wanted to jump in the new Factorio DLC for a “quick 20min session, in-and-out adventure”.
I click “Play” in Steam and immediately blue screen, the first I’ve had in recent memory. “No problem”, I say to myself. “It’ll recover just fine”, I foolishly follow up with.
Dear readers, it was not fine. The Windows Audio Server had proceeded to completely and irreparably die. Adding permissions for the nt service accounts, fixing the registry, modifying the permissions on the binary - nothing. Zip. Nada. Sound output from my Windows machine was now just a memory. The furthest I had gotten was bringing back the output devices to list in the troubleshooter but every time I tried to move the output volume from 0 to 100, the slider would do its best yo-yo impression and snap straight back to 0.
“That’s it, I’m doing Linux on the desktop. It’s time - enough pissfarting around. I’m ready”. Of course, I had “dabbled” in having Linux on the desktop before. The machine was currently dual-booted with Manjaro, for goodness sake! But the most use it had seen recently (in the last 4 years) was letting my friend ssh in to run some power calculations for grid energy in SEA. Weird, but glad the GPU+CPU got a workout.
Long Story Short
yada yada Ubuntu 24.04 is fine. I’m using gnome which is fine (though I do prefer KDE) - I’m looking for something that “just works (TM)” as I’ve been spoiled the last few years with MacBooks.
But that’s not enough - years ago I had tried doing GPU passthrough on a laptop. Not the best idea. But I still held out hope that one day, I would be able to boot into a Windows VM with full GPU and then exit back into my Linux install. Running a Ryzen 3900X and a single Nvidia RTX3080 LHR, I knew that single GPU passthrough would be difficult and there would be some concessions I’d need to make. That’s fine - if it worked, wonderful. If it didn’t, I’d let bygones be bygones and move on with my life. Spoiler: I did not move on with my life.
“Why not dual-boot? You’ve done it before!” - I hear you cry. This is part of my effort to remove myself from surveillance tech (Windows) and force myself to use Linux, so I can make (very small) contributions again. I feel bad enough that I’ve let my on-prem Bitwarden server collapse and use their cloud offering now. My home lab is a mess. I have a new NUC to set up that’s been sitting in the box for 2 months. My Plex server doesn’t have enough RAM to index files properly so just keels over. I must get back into Linux.
Wow, That’s Still Long and You Haven’t Told Us the Problem Yet
Fun fact: Gnome defaults to X11 instead of Wayland if you use Nvidia graphics cards.
I follow this guide here: https://gitlab.com/risingprismtv/single-gpu-passthrough
Went great. I’m using a raw SSD as the disk drive, the GPU released from the host correctly, games run in Windows great. Wonderful.
So what’s the problem?
When the VM shuts down and releases the GPU back to the host, my entire host would crash. No bueno. Nothing. Not even if I’m SSH’d in from somewhere else - all dead.
Some system details:
OS: Ubuntu 24.04.1 LTS x86_64
Host: MS-7C35 2.0
Kernel: 6.8.0-48-generic
Shell: zsh 5.9
Resolution: 3440x1440, 1920x1080
DE: GNOME 46.0
CPU: AMD Ryzen 9 3900X (24) @ 3.800GHz
GPU: NVIDIA GeForce RTX 3080 Lite Hash Rate
Memory: 3658MiB / 15901MiB
Cue spending days tracking down the problem. A non-exhaustive list of the things I tried:
- reload the systemctl daemon
- setting the target for gdm back to graphical.install
- add delays between each kernel module being reloaded
- forcing reset of the GPU
- modifying GRUB with nomodeset, video=efifb:off,vesafb:off
- kernel module blacklisting various framebuffers (hint hint)
- so much more
But it would never work. Checking journalctl -b -1, there would be BUG: kernel NULL pointer dereference, address: 00000000000003e0 right at the end. The incriminating lines were slightly further up however:
Nov 15 11:40:52 hostname kernel: Console: switching to mono frame buffer device 80x25
Nov 15 11:40:52 hostname kernel: ------------[ cut here ]------------
Nov 15 11:40:52 hostname: UBSAN: array-index-out-of-bounds in /build/linux-21sZ5Q/linux-6.8.0/drivers/video/fbdev/core/fbcon.c:120:28
Nov 15 11:40:52 hostname kernel: index -1 is out of range for type 'fb_info *[32]'
Looked like some issue with the framebuffer console when it was being reinitialized. I spent a long time trying to disable it, but could never get it fully disabled.
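If you want to check whether you're hitting the same thing, here is a rough sketch of how I'd filter the previous boot's kernel log for these signatures. The function name find_fbcon_crash is made up for illustration, and the patterns are just the three messages quoted above; your kernel may word them differently.

```shell
# Hypothetical helper: filter kernel log text (on stdin) for the messages
# quoted in this post. The patterns are assumptions based on my own logs.
find_fbcon_crash() {
    grep -E 'UBSAN: array-index-out-of-bounds.*fbcon\.c|Console: switching to .* frame buffer device|BUG: kernel NULL pointer dereference'
}

# Usage: kernel messages (-k) from the previous boot (-b -1).
#   journalctl -k -b -1 --no-pager | find_fbcon_crash
```

Reading from stdin means it works on a saved log file too, which is handy when the host is too dead to query live.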
FFS What is the fix?
In hooks/vfio-teardown.sh, don’t rebind the consoles:
echo 1 > /sys/class/vtconsole/vtcon"${consoleNumber}"/bind
Just remove that line / adjust the shell scripts to your liking and done. Now I can boot back into my Linux install after shutting down Windows :)
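If you'd rather not delete anything by hand, a small sed filter can comment the rebind out instead. This is only a sketch: comment_out_vtconsole_rebind is a name I made up, and it assumes the rebind is written as an `echo 1 > /sys/class/vtconsole/.../bind` line like the one above. Check the output before replacing your real hook.

```shell
# Hypothetical helper: read a hook script on stdin and comment out any line
# that writes 1 to a vtconsole "bind" file. All other lines pass through
# unchanged.
comment_out_vtconsole_rebind() {
    sed -E 's|^([[:space:]]*echo[[:space:]]+1[[:space:]]*>[[:space:]]*/sys/class/vtconsole/.*/bind.*)$|# \1|'
}

# Usage: write to a new file first, diff it, then swap it in yourself.
#   comment_out_vtconsole_rebind < hooks/vfio-teardown.sh > vfio-teardown.sh.new
```

Commenting rather than deleting keeps the original line around in case a later kernel fixes the crash and you want the console back.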
UPDATE: Looks like this has already been reported as a bug here: https://bugzilla.kernel.org/show_bug.cgi?id=216475
UPDATE Again: Turns out this has been an issue since the 5.19 kernel: https://old.reddit.com/r/VFIO/comments/wp85ve/linux_519_kernel_single_gpu_passthough_black/
The current consensus is to reboot the guest, smash ESC on Tianocore, and then type reset -s in the EFI shell :/
Final Update: I just don’t unbind any vtconsoles or try to rebind - works fine. Seems like this has been a long-running issue and a point of contention. Detaching / attaching the PCI GPU with virsh doesn’t seem to help either. It is what it is.
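To confirm the hooks really are leaving the consoles alone, you can read the bind state straight from sysfs before starting the VM and again after it shuts down. A minimal sketch: list_vtconsoles is a hypothetical name, and the optional directory argument only exists so it can be pointed at a test directory instead of /sys/class/vtconsole.

```shell
# Hypothetical helper: print the bind state and name of each vtconsole.
# $1 is an optional base directory (defaults to the real sysfs path).
list_vtconsoles() {
    base="${1:-/sys/class/vtconsole}"
    for con in "$base"/vtcon*; do
        # Skip entries without a readable bind file (or an unmatched glob).
        [ -r "$con/bind" ] || continue
        printf '%s: bind=%s name=%s\n' "${con##*/}" "$(cat "$con/bind")" "$(cat "$con/name" 2>/dev/null)"
    done
}

# Usage: run before and after the VM; the output should be identical.
#   list_vtconsoles
```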