Physical NIC attached to a container randomly losing connection

Hi! I’m running Incus on an Ubuntu 24.04 server, with kernel 6.8.0, and I have configured a physical nic on one of my containers using the following commands:

$ incus launch images:ubuntu/24.04 caddy-container
$ incus config device add caddy-container lan nic \
nictype=physical parent=enp4s0 name=enp4s0
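To double-check that the device was added the way I intended, I print what Incus recorded:

```shell
# Show the device entries Incus stored for the container; the "lan"
# device should list nictype: physical and parent: enp4s0.
incus config device show caddy-container
```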

And then configured DHCPv4, via netplan, in the container:

$ incus exec caddy-container -- su --login ubuntu
$ sudo cat /etc/netplan/10-lxc.yaml
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
      dhcp-identifier: mac
    enp4s0:
      dhcp4: true

And after a restart of the container, I can see that it gets the IP:

$ incus ls
+------------------------------------------+---------+----------------------------+------------------------------------------------+-----------------+-----------+
|                   NAME                   |  STATE  |            IPV4            |                      IPV6                      |      TYPE       | SNAPSHOTS |
+------------------------------------------+---------+----------------------------+------------------------------------------------+-----------------+-----------+
| caddy-container                          | RUNNING | 10.177.116.10 (eth0)       | fd42:771d:9b36:84ca:216:3eff:feb1:e6c2 (eth0)  | CONTAINER       | 0         |
|                                          |         | 10.1.1.237 (enp4s0)        | fd06:3f72:6d6a::4ec (enp4s0)                   |                 |           |
|                                          |         |                            | fd06:3f72:6d6a:0:1ac0:4dff:febb:5e6b (enp4s0)  |                 |           |
...

And that the host no longer sees enp4s0, as expected, since the physical interface is moved into the container’s network namespace:

$ ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 18:c0:4d:bb:5e:6a brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.235/24 metric 100 brd 10.1.1.255 scope global dynamic enp3s0
       valid_lft 31205sec preferred_lft 31205sec
    inet6 fd06:3f72:6d6a::849/128 scope global dynamic noprefixroute
       valid_lft 27982sec preferred_lft 27982sec
    inet6 fd06:3f72:6d6a:0:1ac0:4dff:febb:5e6a/64 scope global mngtmpaddr noprefixroute
       valid_lft forever preferred_lft 604542sec
    inet6 fe80::1ac0:4dff:febb:5e6a/64 scope link
       valid_lft forever preferred_lft forever
4: incusbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 10:66:6a:54:71:10 brd ff:ff:ff:ff:ff:ff
    inet 10.177.116.1/24 brd 10.177.116.255 scope global incusbr0
       valid_lft forever preferred_lft forever
    inet6 fd42:771d:9b36:84ca::1/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::1266:6aff:fe54:7110/64 scope link
       valid_lft forever preferred_lft forever
6: veth309cf0fc@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master incusbr0 state UP group default qlen 1000
    link/ether 22:38:7b:79:3d:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0
8: veth0ff22abf@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master incusbr0 state UP group default qlen 1000
    link/ether ba:c3:ed:27:8d:fd brd ff:ff:ff:ff:ff:ff link-netnsid 1
10: vethdbe78812@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master incusbr0 state UP group default qlen 1000
    link/ether 9a:b3:85:0a:41:ce brd ff:ff:ff:ff:ff:ff link-netnsid 2
12: veth85f7f7ad@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master incusbr0 state UP group default qlen 1000
    link/ether be:34:b5:b9:c5:6c brd ff:ff:ff:ff:ff:ff link-netnsid 3
14: vethd9e274d0@if13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master incusbr0 state UP group default qlen 1000
    link/ether ea:e8:dc:1f:e2:77 brd ff:ff:ff:ff:ff:ff link-netnsid 4
16: veth7b237818@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master incusbr0 state UP group default qlen 1000
    link/ether 16:d6:95:24:d2:c5 brd ff:ff:ff:ff:ff:ff link-netnsid 5
18: vethd621de92@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master incusbr0 state UP group default qlen 1000
    link/ether c6:cf:91:92:b2:96 brd ff:ff:ff:ff:ff:ff link-netnsid 6

Everything seems to work fine, but from time to time I lose connectivity to the container over the dedicated NIC.

I’ve tried restarting the container itself and then the incus service, but the interface does not come up:

$ incus ls
+------------------------------------------+---------+----------------------------+------------------------------------------------+-----------------+-----------+
|                   NAME                   |  STATE  |            IPV4            |                      IPV6                      |      TYPE       | SNAPSHOTS |
+------------------------------------------+---------+----------------------------+------------------------------------------------+-----------------+-----------+
| caddy-container                          | RUNNING | 10.177.116.10 (eth0)       | fd42:771d:9b36:84ca:216:3eff:feb1:e6c2 (eth0)  | CONTAINER       | 0         |
|                                          |         |                            |                                                |                 |           |
|                                          |         |                            |                                                |                 |           |
...

The only thing that works is restarting the host; then everything works again.

So, how can I debug what’s going wrong?

Thanks!

Anything in the host’s log around the time this happens?
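Something along these lines should show it (assuming a systemd journal; swap in your NIC’s driver name if it isn’t igb):

```shell
# Kernel messages mentioning the NIC's driver around the failure window.
journalctl -k --grep igb --since "2 hours ago"
# Incus's own log for the same window.
journalctl -u incus --since "2 hours ago"
# Or watch live while reproducing the problem:
sudo dmesg -wT
```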

OK, this just happened again, and it seems to be some kind of problem with the NIC itself :cry:

Jan 18 12:36:36 fractal kernel: igb 0000:04:00.0 enp4s0: PCIe link lost
Jan 18 12:36:36 fractal kernel: ------------[ cut here ]------------
Jan 18 12:36:36 fractal kernel: igb: Failed to read reg 0xc030!
Jan 18 12:36:36 fractal kernel: WARNING: CPU: 2 PID: 264001 at drivers/net/ethernet/intel/igb/igb_main.c:750 igb_rd32+0x93/0xb0 [igb]
Jan 18 12:36:36 fractal kernel: Modules linked in: tls xt_MASQUERADE xt_tcpudp xt_mark nft_compat nft_reject_bridge nf_reject_ipv4 nf_reject_ipv6 nft_reject nf_conntrack_bridge nft_ct nft_meta_bridge veth nft_masq nft_chain_nat nf_nat nf_conntrack n>
Jan 18 12:36:36 fractal kernel:  raid1 raid0 uas usb_storage crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic igb ghash_clmulni_intel ahci sha256_ssse3 xhci_pci i2c_algo_bit sha1_ssse3 libahci xhci_pci_renesas dca 8250_dw aesni_intel cr>
Jan 18 12:36:36 fractal kernel: CPU: 2 PID: 264001 Comm: kworker/2:1 Tainted: P           O       6.8.0-90-generic #91-Ubuntu
Jan 18 12:36:36 fractal kernel: Hardware name: GIGABYTE G431-MM0-OT/MJ11-EC1-OT, BIOS F09 09/14/2021
Jan 18 12:36:36 fractal kernel: Workqueue: events igb_watchdog_task [igb]
Jan 18 12:36:36 fractal kernel: RIP: 0010:igb_rd32+0x93/0xb0 [igb]
Jan 18 12:36:36 fractal kernel: Code: c7 c6 bb 55 52 c0 e8 ec 09 a3 c2 48 8b bb 28 ff ff ff e8 60 e0 49 c2 84 c0 74 c1 44 89 e6 48 c7 c7 88 62 52 c0 e8 9d 47 c2 c1 <0f> 0b eb ae b8 ff ff ff ff 31 d2 31 f6 31 ff e9 99 ac d7 c2 66 0f
Jan 18 12:36:36 fractal kernel: RSP: 0018:ffffcf00920dfd88 EFLAGS: 00010246
Jan 18 12:36:36 fractal kernel: RAX: 0000000000000000 RBX: ffff89970f0d0f38 RCX: 0000000000000000
Jan 18 12:36:36 fractal kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jan 18 12:36:36 fractal kernel: RBP: ffffcf00920dfd98 R08: 0000000000000000 R09: 0000000000000000
Jan 18 12:36:36 fractal kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000c030
Jan 18 12:36:36 fractal kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff89970f7d2b40
Jan 18 12:36:36 fractal kernel: FS:  0000000000000000(0000) GS:ffff899dfb300000(0000) knlGS:0000000000000000
Jan 18 12:36:36 fractal kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 18 12:36:36 fractal kernel: CR2: 0000555d61c1fd80 CR3: 00000001364ce000 CR4: 00000000003506f0
Jan 18 12:36:36 fractal kernel: Call Trace:
Jan 18 12:36:36 fractal kernel:  <TASK>
Jan 18 12:36:36 fractal kernel:  igb_update_stats+0x93/0x880 [igb]
Jan 18 12:36:36 fractal kernel:  igb_watchdog_task+0x134/0x8e0 [igb]
Jan 18 12:36:36 fractal kernel:  ? srso_return_thunk+0x5/0x5f
Jan 18 12:36:36 fractal kernel:  process_one_work+0x184/0x3a0
Jan 18 12:36:36 fractal kernel:  worker_thread+0x306/0x440
Jan 18 12:36:36 fractal kernel:  ? srso_return_thunk+0x5/0x5f
Jan 18 12:36:36 fractal kernel:  ? _raw_spin_lock_irqsave+0xe/0x20
Jan 18 12:36:36 fractal kernel:  ? __pfx_worker_thread+0x10/0x10
Jan 18 12:36:36 fractal kernel:  kthread+0xf2/0x120
Jan 18 12:36:36 fractal kernel:  ? __pfx_kthread+0x10/0x10
Jan 18 12:36:36 fractal kernel:  ret_from_fork+0x47/0x70
Jan 18 12:36:36 fractal kernel:  ? __pfx_kthread+0x10/0x10
Jan 18 12:36:36 fractal kernel:  ret_from_fork_asm+0x1b/0x30
Jan 18 12:36:36 fractal kernel:  </TASK>
Jan 18 12:36:36 fractal kernel: ---[ end trace 0000000000000000 ]---

Yup, it shows that the card dropped off the PCIe bus, so it definitely looks like a hardware issue :slight_smile:
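If a full reboot is currently the only thing that recovers it, a PCI remove/rescan from the host is sometimes enough to re-initialize a device that fell off the bus, so it may be worth trying before replacing the card. This is only a sketch: `0000:04:00.0` is the device address from your log, and the container should be stopped first so Incus isn’t holding the interface:

```shell
# Stop the container that owns the passed-through NIC.
incus stop caddy-container
# Drop the dead device from the PCI tree, then rescan the bus so the
# kernel re-probes it and the igb driver re-binds.
echo 1 | sudo tee /sys/bus/pci/devices/0000:04:00.0/remove
echo 1 | sudo tee /sys/bus/pci/rescan
incus start caddy-container
```

If the device comes back after a rescan but keeps dropping off, that still points at the card, the slot, or the board rather than anything Incus is doing.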