GPU passthrough failed for LXC/LXD virtual machine

I enabled IOMMU, SR-IOV and SVM in the BIOS, and also added GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt" to /etc/default/grub.
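For reference, a minimal sketch of applying and verifying that change, assuming a standard Ubuntu install booting via GRUB:

sudo update-grub
sudo reboot
# after the reboot, confirm the options took effect and the IOMMU is active (AMD systems log "AMD-Vi")
cat /proc/cmdline
sudo dmesg | grep -i -e AMD-Vi -e iommu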

Next, I successfully launched a VM with lxc launch images:ubuntu/20.04/cloud v1 --vm.

Finally, I executed lxc config device add v1 gpu0 gpu gputype=physical pci=0000:xx:00.0.
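To double-check that the device was recorded, a quick sketch (v1 and gpu0 are the names used above; 0000:xx:00.0 stands for the real address):

lxc config device show v1
# expected output along these lines:
# gpu0:
#   gputype: physical
#   pci: 0000:xx:00.0
#   type: gpu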

When I tried to run lxc start v1, it just got stuck with no error message and I had to reset the server.

Can anyone advise where I went wrong? Is there something I need to set up for VFIO? Thank you.

Specs:
AMD Milan Processor
Asrock Rack ROMED8-2T Motherboard
Nvidia RTX A4000 GPU

Anything in dmesg? Also, just making sure: is this GPU dedicated to the VM and not currently tied to the host NVIDIA GPU drivers or the like?

I saw the https://github.com/bryansteiner/gpu-passthrough-tutorial tutorial, which dynamically unbinds the nvidia/amd drivers and binds the vfio drivers right before the VM starts, then reverses these actions when the VM stops. That way, whenever the VM isn't in use, the GPU is available to the host machine for work on its native drivers. How can this be achieved with lxc/lxd?
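For context, what such tutorials do under the hood is an unbind/bind dance through sysfs; a rough sketch of the mechanism (0000:xx:00.0 is a placeholder address, and note that LXD rebinds a physical GPU to vfio-pci on VM start by itself, as the volatile.*.last_state.pci.driver key later in this thread suggests):

# before the VM starts: detach the card from the host driver and hand it to vfio-pci
echo 0000:xx:00.0 > /sys/bus/pci/drivers/nvidia/unbind
modprobe vfio-pci
echo vfio-pci > /sys/bus/pci/devices/0000:xx:00.0/driver_override
echo 0000:xx:00.0 > /sys/bus/pci/drivers_probe
# after the VM stops: reverse it so the host driver can claim the card again
echo 0000:xx:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
echo > /sys/bus/pci/devices/0000:xx:00.0/driver_override
echo 0000:xx:00.0 > /sys/bus/pci/drivers_probe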

Hi Stefan,

Right now I face exactly the same situation as @zachary. Could you please elaborate on how to make sure that the GPU is dedicated and not currently tied to the host NVIDIA GPU drivers?
Maybe you, @zachary, figured out how to do that?

After enabling IOMMU, SVM and SR-IOV, I checked dmesg:

dmesg | grep -i gpu
[    6.548783] [drm] [nvidia-drm] [GPU ID 0x00001800] Loading driver
[    6.548880] [drm] [nvidia-drm] [GPU ID 0x0000af00] Loading driver
[   51.884863] audit: type=1400 audit(1664891780.437:47): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-k3s-1gpu-node_</var/snap/lxd/common/lxd>" pid=5073 comm="apparmor_parser"
[   51.953564] audit: type=1400 audit(1664891780.504:48): apparmor="DENIED" operation="open" profile="lxd-k3s-1gpu-node_</var/snap/lxd/common/lxd>" name="/proc/5075/cpuset" pid=5075 comm="lxd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[  518.796965] audit: type=1400 audit(1664892247.206:53): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="lxd-k3s-1gpu-node_</var/snap/lxd/common/lxd>" pid=83460 comm="apparmor_parser"
[ 1306.293261] audit: type=1400 audit(1664893034.700:54): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-k3s-1gpu-node_</var/snap/lxd/common/lxd>" pid=123261 comm="apparmor_parser"
[ 1306.353177] audit: type=1400 audit(1664893034.760:55): apparmor="DENIED" operation="open" profile="lxd-k3s-1gpu-node_</var/snap/lxd/common/lxd>" name="/proc/123263/cpuset" pid=123263 comm="lxd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0

Moreover

for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU Group %s ' "$n"; lspci -nns "${d##*/}"; done | grep VGA;
IOMMU Group 26 03:00.0 VGA compatible controller [0300]: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) [102b:0522] (rev 42)

This is not my card :frowning:

Looks like it’s still tied to nvidia drivers, how can I change that?
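For reference, a quick way to see which driver currently owns each card (the addresses 18:00.0 and af:00.0 come from the [GPU ID 0x00001800] / [GPU ID 0x0000af00] lines in the dmesg output above):

lspci -nnk -s 18:00.0
lspci -nnk -s af:00.0
# the "Kernel driver in use:" line shows whether nvidia or vfio-pci has the device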

Thank you


The easiest approach is usually to put a blacklist nvidia entry in a modprobe.d config file in /etc/modprobe.d. This also needs an initrd update, which on Debian/Ubuntu is done with update-initramfs -u.
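A minimal sketch of that, assuming the stock NVIDIA module names (the exact list can vary by driver version):

# /etc/modprobe.d/blacklist-nvidia.conf
blacklist nouveau
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm

sudo update-initramfs -u
sudo reboot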

It did not help

cat /etc/modprobe.d/blacklist-nvidia.conf
blacklist nvidia

but I will try the advice from here, maybe something will help:
https://linuxconfig.org/how-to-disable-blacklist-nouveau-nvidia-driver-on-ubuntu-20-04-focal-fossa-linux

The error was an extremely stupid one on my side; it turned out that blacklisting the nvidia drivers was not needed at all.
The problem was a typo in vfio-pci.ids in my GRUB default config.

If anyone has the same problem, I found a great article describing 4 ways to do that.
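For anyone landing here, the GRUB variant is just the vendor:device pairs from lspci -nn appended to the kernel command line; a sketch using the Tesla V100 ID that shows up later in this thread (10de:1db6), with the rest of the existing options left as "...":

GRUB_CMDLINE_LINUX_DEFAULT="... vfio-pci.ids=10de:1db6"
sudo update-grub
sudo reboot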


UPDATE:
It worked, but only partially. As I said, the GPU passthrough works (the card is visible as a PCI device in the VM), but it cannot be detected by nvidia-smi on either a Windows or a Linux VM.
After installing the drivers from the official NVIDIA website, it states that there are no NVIDIA GPU devices.

@stgraber I am wondering about one thing. There is one fact I forgot to mention:
the host OS is also a VM (created by a type 1 hypervisor, VMware if I recall correctly), so what we are trying to pass the GPU into is technically a VM inside another VM. Since I am not that proficient in terms of virtualization, I thought that it may have an impact; am I right?

Ah yeah, passing a GPU to a nested VM may be tricky.
The main reason is that the IOMMU groups provided in your VM may not allow for that to happen. I know there's ongoing work in QEMU and the Linux kernel to make this work with KVM, but it's not there yet.

In your case, if the first level of virtualization is VMware, then things may be different though.
You’d want to check the IOMMU groups with lspci to see if the GPU is indeed in a dedicated group.
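For example, reusing the one-liner from earlier in this thread, but grepping for NVIDIA rather than VGA (datacenter cards like the V100s shown later in this thread report as 3D controllers, class 0302, so a VGA grep misses them):

for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU Group %s ' "$n"; lspci -nns "${d##*/}"; done | grep -i nvidia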

Thank you for your answer. I checked it at the very beginning; each of the 2 cards has its own group. So I guess the only option is to try to pass the GPU through to the initial host VM (created in the type 1 hypervisor, VMware)?

Hi all,

Small update:

We reorganised the infrastructure and installed Ubuntu 20.04 on our host machine, and are now trying to pass one of our 2 GPUs through to an LXD VM.
The card is visible inside the VM via lspci -nn:

root@k3s-1gpu-node:~# lspci -nn | grep -i nvidia
06:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:1db6] (rev a1)

One thing I noted is that the PCI address is different from the one on the host:

mother@node53:~$ sudo lspci -nn | grep -i nvidia
18:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:1db6] (rev a1)
af:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:1db6] (rev a1)
mother@node53:~$ lxc config show k3s-1gpu-node | grep gpu -A 5
  volatile.gpu1.last_state.pci.driver: vfio-pci
  volatile.gpu1.last_state.pci.slot.name: 0000:af:00.0
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 61b7e20b-4b93-450d-b8ec-7e6a05ecffab
  volatile.vsock_id: "18"
devices:
  gpu1:
    gputype: physical
    pci: 0000:af:00.0
    type: gpu
ephemeral: false
profiles:
- vm-linux-btrfs
stateful: false
description: ""
mother@node53:~$
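For completeness, the differing address is expected: the guest sees the card at its own QEMU-assigned slot. A quick check from inside the VM (06:00.0 is the guest address from the lspci output above):

# run inside the VM
lspci -nnk -s 06:00.0
# "Kernel driver in use:" shows which driver, if any, has claimed the passed-through card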

IOMMU seems to be working just fine and the card seems to have the right driver:

root@node53:~# dmesg | grep -E "DMAR|IOMMU"
[    0.021318] ACPI: DMAR 0x000000005F5CC9D8 000292 (v01 FUJ    D3384-B1 00000001 INTL 20091013)
[    0.021383] ACPI: Reserving DMAR table memory at [mem 0x5f5cc9d8-0x5f5ccc69]
[    0.448729] DMAR: IOMMU enabled
[    0.728487] DMAR: Host address width 46
[    0.728490] DMAR: DRHD base: 0x000000d37fc000 flags: 0x0
[    0.728497] DMAR: dmar0: reg_base_addr d37fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.728501] DMAR: DRHD base: 0x000000e0ffc000 flags: 0x0
[    0.728507] DMAR: dmar1: reg_base_addr e0ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.728510] DMAR: DRHD base: 0x000000ee7fc000 flags: 0x0
[    0.728515] DMAR: dmar2: reg_base_addr ee7fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.728518] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[    0.728524] DMAR: dmar3: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.728527] DMAR: DRHD base: 0x000000aaffc000 flags: 0x0
[    0.728532] DMAR: dmar4: reg_base_addr aaffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.728535] DMAR: DRHD base: 0x000000b87fc000 flags: 0x0
[    0.728540] DMAR: dmar5: reg_base_addr b87fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.728543] DMAR: DRHD base: 0x000000c5ffc000 flags: 0x0
[    0.728552] DMAR: dmar6: reg_base_addr c5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.728555] DMAR: DRHD base: 0x0000009d7fc000 flags: 0x1
[    0.728560] DMAR: dmar7: reg_base_addr 9d7fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.728563] DMAR: RMRR base: 0x0000006f5d4000 end: 0x0000006f5e5fff
[    0.728566] DMAR: RMRR base: 0x000000672bd000 end: 0x0000006f2c4fff
[    0.728568] DMAR: ATSR flags: 0x0
[    0.728570] DMAR: ATSR flags: 0x0
[    0.728572] DMAR: RHSA base: 0x0000009d7fc000 proximity domain: 0x0
[    0.728575] DMAR: RHSA base: 0x000000aaffc000 proximity domain: 0x0
[    0.728577] DMAR: RHSA base: 0x000000b87fc000 proximity domain: 0x0
[    0.728579] DMAR: RHSA base: 0x000000c5ffc000 proximity domain: 0x0
[    0.728581] DMAR: RHSA base: 0x000000d37fc000 proximity domain: 0x1
[    0.728583] DMAR: RHSA base: 0x000000e0ffc000 proximity domain: 0x1
[    0.728585] DMAR: RHSA base: 0x000000ee7fc000 proximity domain: 0x1
[    0.728588] DMAR: RHSA base: 0x000000fbffc000 proximity domain: 0x1
[    0.728591] DMAR-IR: IOAPIC id 12 under DRHD base  0xc5ffc000 IOMMU 6
[    0.728594] DMAR-IR: IOAPIC id 11 under DRHD base  0xb87fc000 IOMMU 5
[    0.728596] DMAR-IR: IOAPIC id 10 under DRHD base  0xaaffc000 IOMMU 4
[    0.728598] DMAR-IR: IOAPIC id 18 under DRHD base  0xfbffc000 IOMMU 3
[    0.728601] DMAR-IR: IOAPIC id 17 under DRHD base  0xee7fc000 IOMMU 2
[    0.728603] DMAR-IR: IOAPIC id 16 under DRHD base  0xe0ffc000 IOMMU 1
[    0.728606] DMAR-IR: IOAPIC id 15 under DRHD base  0xd37fc000 IOMMU 0
[    0.728609] DMAR-IR: IOAPIC id 8 under DRHD base  0x9d7fc000 IOMMU 7
[    0.728612] DMAR-IR: IOAPIC id 9 under DRHD base  0x9d7fc000 IOMMU 7
[    0.728614] DMAR-IR: HPET id 0 under DRHD base 0x9d7fc000
[    0.728617] DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
[    0.728618] DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
[    0.730404] DMAR-IR: Enabled IRQ remapping in xapic mode
[    2.098098] DMAR: dmar6: Using Queued invalidation
[    2.098105] DMAR: dmar5: Using Queued invalidation
[    2.098109] DMAR: dmar4: Using Queued invalidation
[    2.098113] DMAR: dmar3: Using Queued invalidation
[    2.098118] DMAR: dmar2: Using Queued invalidation
[    2.098132] DMAR: dmar1: Using Queued invalidation
[    2.098136] DMAR: dmar0: Using Queued invalidation
[    2.098140] DMAR: dmar7: Using Queued invalidation
[    2.121043] DMAR: Intel(R) Virtualization Technology for Directed I/O
[    4.194077] megaraid_sas 0000:5e:00.0: DMAR: 32bit DMA uses non-identity mapping
[    5.116904] qla2xxx 0000:86:00.0: DMAR: 32bit DMA uses non-identity mapping
[    8.033480] qla2xxx 0000:86:00.1: DMAR: 32bit DMA uses non-identity mapping
[   10.672322] qla2xxx 0000:d8:00.0: DMAR: 32bit DMA uses non-identity mapping
[   13.565232] qla2xxx 0000:d8:00.1: DMAR: 32bit DMA uses non-identity mapping


af:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:1db6] (rev a1)
Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:124a]
Flags: fast devsel, IRQ 233, NUMA node 1
Memory at ed000000 (32-bit, non-prefetchable) [disabled] [size=16M]
Memory at 7e000000000 (64-bit, prefetchable) [disabled] [size=32G]
Memory at 7e800000000 (64-bit, prefetchable) [disabled] [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [ac0] Designated Vendor-Specific <?>
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau

When I try to install the NVIDIA driver in the VM, I get (from the log):

-> Kernel messages:
[    2.900871] [drm] pci: virtio-vga detected at 0000:04:00.0
[    2.911600] virtio-pci 0000:04:00.0: vgaarb: deactivate vga console
[    2.911782] [drm] features: -virgl +edid -resource_blob -host_visible
[    2.912406] [drm] number of scanouts: 1
[    2.912411] [drm] number of cap sets: 0
[    2.912940] [drm] Initialized virtio_gpu 0.1.0 0 for virtio11 on minor 0
[    2.915344] fbcon: Deferring console take-over
[    2.915347] virtio_gpu virtio11: [drm] fb0: virtio_gpudrmfb frame buffer device
[    2.949411] cryptd: max_cpu_qlen set to 1000
[    2.996650] AVX2 version of gcm_enc/dec engaged.
[    2.996690] AES CTR mode by8 optimization enabled
[    6.220654] fbcon: Taking over console
[    6.221065] virtio_gpu virtio11: [drm] drm_plane_enable_fb_damage_clips() not called
[    6.221134] Console: switching to colour frame buffer device 160x50
[  267.451831] nvidia: loading out-of-tree module taints kernel.
[  267.451845] nvidia: module license 'NVIDIA' taints kernel.
[  267.451851] Disabling lock debugging due to kernel taint
[  267.502765] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[  267.514734] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[  267.514740] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:06:00.0)
[  267.516654] nvidia: probe of 0000:06:00.0 failed with error -1
[  267.516678] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  267.516679] NVRM: None of the NVIDIA devices were initialized.
[  267.517839] nvidia-nvlink: Unregistered Nvlink Core, major device number 235
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

The error is clear:
This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:06:00.0)

Right now the only way I found on the internet is to add the realloc option to the GRUB options:
GRUB_CMDLINE_LINUX_DEFAULT="iommu=pt intel_iommu=on rd.driver.blacklist=nouveau nouveau.modeset=0 vfio-pci.ids=10de:1db6 kvm.ignore_msrs=1 tsx=on tsx_async_abort=off pcie_acs_override=downstream,multifunction nofb nomodeset pci=realloc"
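If it helps anyone, a sketch for applying the change and re-checking whether BAR0 actually got an address afterwards (06:00.0 is the guest address from the installer log above):

sudo update-grub
sudo reboot
# then, inside the VM:
sudo lspci -vv -s 06:00.0 | grep -i 'Memory at'
dmesg | grep -i -e BAR -e NVRM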

@stgraber maybe you came across a similar problem?

Thanks

Mateusz