More GPUs than assigned

Hi,

I have an issue with the container and its assigned devices.

The server has several GPUs and I want to assign just one of them to the container. I do this for other containers and it works just fine, but I'm running into issues with this one.

The NVIDIA drivers are fine.

How I do it:
lxc config device add <container> gpu2 gpu id=2
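
For completeness, these are the kinds of checks that should show what the container actually ends up with (the container name is a placeholder; lxc config device show only lists devices added directly to the container, not ones coming from profiles):

lxc config device show <container>
lxc exec <container> -- ls -l /dev/nvidia*
lxc exec <container> -- nvidia-smi -L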

lxc config show
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20211021)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20211021"
  image.type: squashfs
  image.version: "20.04"
  volatile.base_image: 5fc94479f588171282beb094da96bb83eb51420d6cf13b223c737d1fda9169cd
  volatile.eth0.host_name: veth1d232190
  volatile.eth0.hwaddr: 00:16:3e:0a:17:fb
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 1a283105-f789-4c6c-bd68-3f3514495238
devices:
  gpu2:
    id: "2"
    type: gpu
ephemeral: false
profiles:
-
stateful: false
description: ""

When I run nvidia-smi on the server I get the output below (the output is the same inside the container, where there should be just one GPU):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M10           Off  | 00000000:88:00.0 Off |                  N/A |
| N/A   36C    P0    16W /  53W |      0MiB /  8129MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M10           Off  | 00000000:89:00.0 Off |                  N/A |
| N/A   36C    P0    16W /  53W |      0MiB /  8129MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M10           Off  | 00000000:8A:00.0 Off |                  N/A |
| N/A   25C    P0    16W /  53W |      0MiB /  8129MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M10           Off  | 00000000:8B:00.0 Off |                  N/A |
| N/A   28C    P0    16W /  53W |      0MiB /  8129MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M10           Off  | 00000000:B1:00.0 Off |                  N/A |
| N/A   36C    P0    16W /  53W |      0MiB /  8129MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M10           Off  | 00000000:B2:00.0 Off |                  N/A |
| N/A   36C    P0    16W /  53W |      0MiB /  8129MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla M10           Off  | 00000000:B3:00.0 Off |                  N/A |
| N/A   24C    P0    16W /  53W |      0MiB /  8129MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla M10           Off  | 00000000:B4:00.0 Off |                  N/A |
| N/A   30C    P0    16W /  53W |      0MiB /  8129MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

When I list /dev inside the container I also see all the GPUs. Has anyone encountered this before, or where am I going wrong?

I've restarted the container; the only thing left to restart is the server itself.

Can you show:

  • lxc config show --expanded NAME
  • lxc info --resources
  • lxc exec NAME -- ls -lh /dev

Hi, thank you for your quick response.

I've run all of those; here you go (the config, then the resources, then the /dev listing):

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20211021)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20211021"
  image.type: squashfs
  image.version: "20.04"
  linux.kernel_modules: overlay, nf_nat
  security.nesting: "true"
  volatile.base_image: 5fc94479f588171282beb094da96bb83eb51420d6cf13b223c737d1fda9169cd
  volatile.eth0.host_name: veth1d232190
  volatile.eth0.hwaddr: 00:16:3e:0a:17:fb
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 1a283105-f789-4c6c-bd68-3f3514495238
devices:
  aadisable:
    path: /sys/module/apparmor/parameters/enabled
    source: /dev/null
    type: disk
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  fuse:
    path: /dev/fuse
    type: unix-char
  gpu2:
    id: "2"
    type: gpu
  root:
    path: /
    pool: default
    type: disk
  student-gpu-storage:
    path: /mnt/coldnfsshare/lely/containers/student-gpu
    source: /mnt/coldnfsshare/lely/containers/student-gpu
    type: disk
ephemeral: false
profiles:
-
stateful: false
description: ""

The resources are rather long. :frowning:

CPUs (x86_64):
Socket 0:
Vendor: GenuineIntel
Name: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
Caches:
- Level 1 (type: Data): 33kB
- Level 1 (type: Instruction): 33kB
- Level 2 (type: Unified): 1MB
- Level 3 (type: Unified): 12MB
Cores:
- Core 0
Frequency: 1189Mhz
Threads:
- 0 (id: 0, online: true, NUMA node: 0)
- 1 (id: 16, online: true, NUMA node: 0)
- Core 1
Frequency: 2092Mhz
Threads:
- 0 (id: 1, online: true, NUMA node: 0)
- 1 (id: 17, online: true, NUMA node: 0)
- Core 2
Frequency: 800Mhz
Threads:
- 0 (id: 18, online: true, NUMA node: 0)
- 1 (id: 2, online: true, NUMA node: 0)
- Core 3
Frequency: 799Mhz
Threads:
- 0 (id: 19, online: true, NUMA node: 0)
- 1 (id: 3, online: true, NUMA node: 0)
- Core 4
Frequency: 800Mhz
Threads:
- 0 (id: 20, online: true, NUMA node: 0)
- 1 (id: 4, online: true, NUMA node: 0)
- Core 5
Frequency: 800Mhz
Threads:
- 0 (id: 21, online: true, NUMA node: 0)
- 1 (id: 5, online: true, NUMA node: 0)
- Core 6
Frequency: 803Mhz
Threads:
- 0 (id: 22, online: true, NUMA node: 0)
- 1 (id: 6, online: true, NUMA node: 0)
- Core 7
Frequency: 802Mhz
Threads:
- 0 (id: 23, online: true, NUMA node: 0)
- 1 (id: 7, online: true, NUMA node: 0)
Frequency: 1010Mhz (min: 800Mhz, max: 3000Mhz)
Socket 1:
Vendor: GenuineIntel
Name: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
Caches:
- Level 1 (type: Data): 33kB
- Level 1 (type: Instruction): 33kB
- Level 2 (type: Unified): 1MB
- Level 3 (type: Unified): 12MB
Cores:
- Core 0
Frequency: 822Mhz
Threads:
- 0 (id: 24, online: true, NUMA node: 1)
- 1 (id: 8, online: true, NUMA node: 1)
- Core 1
Frequency: 800Mhz
Threads:
- 0 (id: 25, online: true, NUMA node: 1)
- 1 (id: 9, online: true, NUMA node: 1)
- Core 2
Frequency: 803Mhz
Threads:
- 0 (id: 10, online: true, NUMA node: 1)
- 1 (id: 26, online: true, NUMA node: 1)
- Core 3
Frequency: 800Mhz
Threads:
- 0 (id: 11, online: true, NUMA node: 1)
- 1 (id: 27, online: true, NUMA node: 1)
- Core 4
Frequency: 800Mhz
Threads:
- 0 (id: 12, online: true, NUMA node: 1)
- 1 (id: 28, online: true, NUMA node: 1)
- Core 5
Frequency: 867Mhz
Threads:
- 0 (id: 13, online: true, NUMA node: 1)
- 1 (id: 29, online: true, NUMA node: 1)
- Core 6
Frequency: 800Mhz
Threads:
- 0 (id: 14, online: true, NUMA node: 1)
- 1 (id: 30, online: true, NUMA node: 1)
- Core 7
Frequency: 800Mhz
Threads:
- 0 (id: 15, online: true, NUMA node: 1)
- 1 (id: 31, online: true, NUMA node: 1)
Frequency: 811Mhz (min: 800Mhz, max: 3000Mhz)

Memory:
NUMA nodes:
Node 0:
Free: 1.97GB
Used: 66.75GB
Total: 68.72GB
Node 1:
Free: 4.12GB
Used: 64.59GB
Total: 68.72GB
Free: 118.95GB
Used: 18.49GB
Total: 137.44GB

GPUs:
Card 0:
NUMA node: 0
Vendor: ASPEED Technology, Inc. (1a03)
Product: ASPEED Graphics Family (2000)
PCI address: 0000:03:00.0
Driver: ast (5.4.0-65-generic)
DRM:
ID: 0
Card: card0 (226:0)
Control: controlD64 (226:0)
Card 1:
NUMA node: 1
Vendor: NVIDIA Corporation (10de)
Product: GM107GL [Tesla M10] (13bd)
PCI address: 0000:88:00.0
Driver: nvidia (450.142.00)
DRM:
ID: 1
Card: card1 (226:1)
Render: renderD128 (226:128)
NVIDIA information:
Architecture: 5.0
Brand: Tesla
Model: Tesla M10
CUDA Version: 11.0
NVRM Version: 450.142.00
UUID: GPU-667ffb18-a1c2-d7ef-4ba6-fa02165826a6
Card 2:
NUMA node: 1
Vendor: NVIDIA Corporation (10de)
Product: GM107GL [Tesla M10] (13bd)
PCI address: 0000:89:00.0
Driver: nvidia (450.142.00)
DRM:
ID: 2
Card: card2 (226:2)
Render: renderD129 (226:129)
NVIDIA information:
Architecture: 5.0
Brand: Tesla
Model: Tesla M10
CUDA Version: 11.0
NVRM Version: 450.142.00
UUID: GPU-eb1bdbde-4c1f-8233-603c-5e61e2e9edc3
Card 3:
NUMA node: 1
Vendor: NVIDIA Corporation (10de)
Product: GM107GL [Tesla M10] (13bd)
PCI address: 0000:8a:00.0
Driver: nvidia (450.142.00)
DRM:
ID: 3
Card: card3 (226:3)
Render: renderD130 (226:130)
NVIDIA information:
Architecture: 5.0
Brand: Tesla
Model: Tesla M10
CUDA Version: 11.0
NVRM Version: 450.142.00
UUID: GPU-90c282ad-b4b3-616e-d838-90cf109b6bbd
Card 4:
NUMA node: 1
Vendor: NVIDIA Corporation (10de)
Product: GM107GL [Tesla M10] (13bd)
PCI address: 0000:8b:00.0
Driver: nvidia (450.142.00)
DRM:
ID: 4
Card: card4 (226:4)
Render: renderD131 (226:131)
NVIDIA information:
Architecture: 5.0
Brand: Tesla
Model: Tesla M10
CUDA Version: 11.0
NVRM Version: 450.142.00
UUID: GPU-1e26493c-2347-8ec2-cb00-95090a24656a
Card 5:
NUMA node: 1
Vendor: NVIDIA Corporation (10de)
Product: GM107GL [Tesla M10] (13bd)
PCI address: 0000:b1:00.0
Driver: nvidia (450.142.00)
DRM:
ID: 5
Card: card5 (226:5)
Render: renderD132 (226:132)
NVIDIA information:
Architecture: 5.0
Brand: Tesla
Model: Tesla M10
CUDA Version: 11.0
NVRM Version: 450.142.00
UUID: GPU-a6deae74-6cc1-1126-d9c5-6087a9458dff
Card 6:
NUMA node: 1
Vendor: NVIDIA Corporation (10de)
Product: GM107GL [Tesla M10] (13bd)
PCI address: 0000:b2:00.0
Driver: nvidia (450.142.00)
DRM:
ID: 6
Card: card6 (226:6)
Render: renderD133 (226:133)
NVIDIA information:
Architecture: 5.0
Brand: Tesla
Model: Tesla M10
CUDA Version: 11.0
NVRM Version: 450.142.00
UUID: GPU-3030c09c-acc7-1d09-09d1-34a0153276d0
Card 7:
NUMA node: 1
Vendor: NVIDIA Corporation (10de)
Product: GM107GL [Tesla M10] (13bd)
PCI address: 0000:b3:00.0
Driver: nvidia (450.142.00)
DRM:
ID: 7
Card: card7 (226:7)
Render: renderD134 (226:134)
NVIDIA information:
Architecture: 5.0
Brand: Tesla
Model: Tesla M10
CUDA Version: 11.0
NVRM Version: 450.142.00
UUID: GPU-a864d026-f895-a741-f6a1-fd5214f95cae
Card 8:
NUMA node: 1
Vendor: NVIDIA Corporation (10de)
Product: GM107GL [Tesla M10] (13bd)
PCI address: 0000:b4:00.0
Driver: nvidia (450.142.00)
DRM:
ID: 8
Card: card8 (226:8)
Render: renderD135 (226:135)
NVIDIA information:
Architecture: 5.0
Brand: Tesla
Model: Tesla M10
CUDA Version: 11.0
NVRM Version: 450.142.00
UUID: GPU-3e05ec37-c280-bc9a-5e63-baff54e1f12f

NICs:
Card 0:
NUMA node: 0
Vendor: Intel Corporation (8086)
Product: Ethernet Controller 10-Gigabit X540-AT2 (1528)
PCI address: 0000:01:00.0
Driver: ixgbe (5.1.0-k)
Ports:
- Port 0 (ethernet)
ID: enp1s0f0
Address: ac:1f:6b:ba:ec:c6
Supported modes: 100baseT/Full, 1000baseT/Full, 10000baseT/Full
Supported ports: twisted pair
Port type: twisted pair
Transceiver type: internal
Auto negotiation: true
Link detected: true
Link speed: 10000Mbit/s (full duplex)
SR-IOV information:
Current number of VFs: 0
Maximum number of VFs: 63
Card 1:
NUMA node: 0
Vendor: Intel Corporation (8086)
Product: Ethernet Controller 10-Gigabit X540-AT2 (1528)
PCI address: 0000:01:00.1
Driver: ixgbe (5.1.0-k)
Ports:
- Port 0 (ethernet)
ID: enp1s0f1
Address: ac:1f:6b:ba:ec:c7
Supported modes: 100baseT/Full, 1000baseT/Full, 10000baseT/Full
Supported ports: twisted pair
Port type: twisted pair
Transceiver type: internal
Auto negotiation: true
Link detected: false
SR-IOV information:
Current number of VFs: 0
Maximum number of VFs: 63

Disks:
Disk 0:
NUMA node: 0
ID: sda
Device: 8:0
Model: INTEL SSDSC2KB48
Type: scsi
Size: 480.10GB
Read-Only: false
Removable: false
Partitions:
- Partition 1
ID: sda1
Device: 8:1
Read-Only: false
Size: 1.05MB
- Partition 2
ID: sda2
Device: 8:2
Read-Only: false
Size: 480.10GB
Disk 1:
NUMA node: 0
ID: sdb
Device: 8:16
Model: INTEL SSDSC2KB48
Type: scsi
Size: 480.10GB
Read-Only: false
Removable: false
Disk 2:
NUMA node: 0
ID: sr0
Device: 11:0
Model: Virtual CDROM
Type: cdrom
Size: 958.40MB
Read-Only: false
Removable: true

crw--w---- 1 root tty 136, 0 Oct 25 11:35 console
lrwxrwxrwx 1 root root 11 Oct 25 11:34 core -> /proc/kcore
drwxr-xr-x 2 root root 80 Oct 25 11:34 dri
lrwxrwxrwx 1 root root 13 Oct 25 11:34 fd -> /proc/self/fd
crw-rw-rw- 1 nobody nogroup 1, 7 Feb 5 2021 full
crw-rw-rw- 1 root root 10, 229 Oct 25 11:34 fuse
lrwxrwxrwx 1 root root 12 Oct 25 11:34 initctl -> /run/initctl
lrwxrwxrwx 1 root root 28 Oct 25 11:34 log -> /run/systemd/journal/dev-log
drwxr-xr-x 2 nobody nogroup 60 Sep 15 02:11 lxd
drwxrwxrwt 2 nobody nogroup 40 Feb 5 2021 mqueue
drwxr-xr-x 2 root root 60 Oct 25 11:34 net
crw-rw-rw- 1 nobody nogroup 1, 3 Feb 5 2021 null
drwxr-xr-x 2 root root 40 Oct 25 11:35 nvidia-caps
crw-rw-rw- 1 root root 195, 254 Oct 25 11:34 nvidia-modeset
crw-rw-rw- 1 root root 236, 0 Oct 25 11:34 nvidia-uvm
crw-rw-rw- 1 root root 236, 1 Oct 25 11:34 nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Oct 25 11:34 nvidia0
crw-rw-rw- 1 root root 195, 1 Oct 25 11:34 nvidia1
crw-rw-rw- 1 root root 195, 2 Oct 25 11:34 nvidia2
crw-rw-rw- 1 root root 195, 3 Oct 25 11:34 nvidia3
crw-rw-rw- 1 root root 195, 4 Oct 25 11:34 nvidia4
crw-rw-rw- 1 root root 195, 5 Oct 25 11:34 nvidia5
crw-rw-rw- 1 root root 195, 6 Oct 25 11:34 nvidia6
crw-rw-rw- 1 root root 195, 7 Oct 25 11:34 nvidia7
crw-rw-rw- 1 root root 195, 255 Oct 25 11:34 nvidiactl
crw-rw-rw- 1 root root 5, 2 Oct 25 12:06 ptmx
drwxr-xr-x 2 root root 0 Oct 25 11:34 pts
crw-rw-rw- 1 nobody nogroup 1, 8 Feb 5 2021 random
drwxrwxrwt 2 root root 40 Oct 25 11:34 shm
lrwxrwxrwx 1 root root 15 Oct 25 11:34 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root 15 Oct 25 11:34 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root 15 Oct 25 11:34 stdout -> /proc/self/fd/1
crw-rw-rw- 1 nobody nogroup 5, 0 Oct 8 12:37 tty
crw-rw-rw- 1 nobody nogroup 1, 9 Feb 5 2021 urandom
crw-rw-rw- 1 nobody nogroup 1, 5 Feb 5 2021 zero

As I said before, I already went through these and didn't find anything useful; I hope you'll have better luck.

Thanks in advance for your response. If you don't find anything, maybe rebooting the node will help…

What’s the LXD version in use there?

The version is 4.0.7.

That's odd. Did you try using pci= with the PCI address instead of id=, to see if that somehow works better?
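
Roughly something like this; the device name and the PCI address are only examples (substitute whichever card you actually want, using the addresses from the lxc info --resources output above):

lxc config device remove <container> gpu2
lxc config device add <container> gpu2 gpu pci=0000:89:00.0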

Great idea, it worked! Thanks a lot. I removed the GPU device and re-added it with the PCI address.

Thanks a lot once more. O:)