Percentage of containers won't boot anymore

I’m unsure what triggered this, but about 10% of our containers won’t start anymore.

There are no hard errors in the logs, just a line like:

start - start.c:signal_handler:466 - Container init process 34820 exited

When I mount the container and check the filesystem itself, no logs inside the container have been updated, which suggests the container never actually boots.

I am unsure how to diagnose this further. Is there any way to get more log output?

The lxc info --show-log output can be found here, and the server environment from lxc info is:

  driver: lxc | qemu
  driver_version: 4.0.10 | 5.2.0
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.4.0-81-generic
  lxc_features:
    cgroup2: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "18.04"
  project: default
  server: lxd
  server_clustered: true
  server_name: procyon
  server_pid: 74283
  server_version: "4.17"
  storage: ceph
  storage_version: 16.2.4
  storage_supported_drivers:
  - name: cephfs
    version: 16.2.4
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.41.0
    remote: false
  - name: zfs
    version: 0.8.3-1ubuntu12.10
    remote: false
  - name: ceph
    version: 16.2.4
    remote: true
  - name: btrfs
    version: 5.4.1
    remote: false

Does anybody have any idea how to get more log output?

I might be wrong, but in the log you have this line:

lxc redacted 20210825101948.865 ERROR network - network.c:lxc_ovs_delete_port:2757 - Failed to delete "veth96de0aa8" from openvswitch bridge "enp216s0": ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)

Could something be wrong here? It seems to be the only error in the transition from “RUNNING” to “STOPPING”.

Hi Maran,
That’s probably not related, but maybe journalctl -p err can give you a clue about the host layer.
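For example, something along these lines (the timestamps are just placeholders, adjust them to bracket one of the failed starts):

journalctl -p err -b
journalctl -p err --since "2021-08-25 10:15" --until "2021-08-25 10:25"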
Regards.

That was my initial idea; however, this seems to come in after the child PID for systemd had already exited, so I figured it was a red herring.

I checked that, and only the openvswitch thing comes up in there:

ovs|00002|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)

I did an apt install openvswitch-switch and the error changed.

lxc box 20210827113236.104 ERROR    network - network.c:lxc_ovs_delete_port:2757 - Failed to delete "veth03aeb117" from openvswitch bridge "enp216s0": ovs-vsctl: no port named veth03aeb117

It seems odd to have an openvswitch error if you’re not using it. Are any config values in the container / profile perhaps incorrect?
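For instance, something like this should show whether any instance or profile config references an OVS bridge (the instance and profile names are placeholders):

lxc config show <instance> --expanded | grep -iE 'ovs|ovn|parent'
lxc profile show <profile>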

Can you post the output of dpkg -l | grep -i openvswitch and ovs-vsctl show?

Sure thing:

root@procyon:~# dpkg -l | grep -i openvswitch
ii  openvswitch-common                     2.9.8-0ubuntu0.18.04.2                          amd64        Open vSwitch common components
ii  openvswitch-switch                     2.9.8-0ubuntu0.18.04.2                          amd64        Open vSwitch switch implementations
root@procyon:~# ovs-vsctl show
a6100b39-dbe2-4917-9e3c-016357bf22f7
    ovs_version: "2.9.8"

Just as a side note, I had never heard about openvswitch until today, so I never actively did anything with it on purpose. The only reason I installed it was to see if the error would change.

That’s the first thing I tried to figure out, but I couldn’t find anything unique to the containers that won’t boot anymore.

Strange, :thinking:. I would have expected that you created another bridge and attached it to openvswitch, but the output you posted does not contain any such information.
Just a guess: if you have never heard of or used openvswitch, can you purge it like this?
sudo apt purge openvswitch-switch openvswitch-common

I purged it and we are back to the original error now.

lxc box 20210827125755.398 TRACE    network - network.c:lxc_delete_network_priv:3637 - Restored interface "veth3a91878e" to its initial mtu "1450"
lxc box 20210827125755.407 ERROR    network - network.c:lxc_ovs_delete_port:2757 - Failed to delete "vethd423a83d" from openvswitch bridge "enp216s0": ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
lxc box 20210827125755.407 WARN     network - network.c:lxc_delete_network_priv:3675 - Failed to remove port "vethd423a83d" from openvswitch bridge "enp216s0"
lxc box 20210827125755.407 INFO     network - network.c:lxc_delete_network_priv:3677 - Removed port "vethd423a83d" from openvswitch bridge "enp216s0"

Can you install openvswitch-switch and openvswitch-common again and make sure the service is running (systemctl status openvswitch-switch.service)?
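Roughly something like this (the enable --now step is just a suggestion in case the service isn’t set to start automatically):

sudo apt install openvswitch-switch openvswitch-common
sudo systemctl enable --now openvswitch-switch
systemctl status openvswitch-switch --no-pager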

Please can you show the output of lxc config show <instance> --expanded for a couple of the affected instances?

Sure thing:

config:
  image.architecture: amd64
  image.description: ubuntu 18.04 LTS amd64 (release) (20201211.1)
  image.label: release
  image.os: ubuntu
  image.release: bionic
  image.serial: "20201211.1"
  image.type: squashfs
  image.version: "18.04"
  limits.cpu: "2"
  limits.memory: 5000MB
  limits.memory.enforce: hard
  limits.memory.swap: "false"
  security.nesting: "true"
  user.user-data: |
    #cloud-config
    phone_home:
      url: https://redacted/api/v1/capsules/$INSTANCE_ID/finish_deployment
      tries: 15
      post:
        - hostname
        - instance_id
    #package_upgrade: true
    packages:
      - apache2
      - unzip
      - ruby2.5
      - build-essential
      - libsqlite3-dev
      - sqlite3
      - ruby-dev
    timezone: Europe/Amsterdam
  volatile.base_image: 95c0e536d361eb5ac953ad343e0342c2f615e4aea714ca8a64126a228b809cae
  volatile.eth0.hwaddr: 00:16:3e:ee:b5:f0
  volatile.eth1.hwaddr: 00:16:3e:62:c6:56
  volatile.eth1.name: eth1
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: STOPPED
  volatile.uuid: 7d6de2b2-2873-4fe3-8189-cdfb4f401e75
devices:
  aa-web-32930:
    connect: tcp:0.0.0.0:4567
    listen: tcp:redacted:32930
    nat: "true"
    type: proxy
  deluge-45470:
    connect: tcp:0.0.0.0:9418
    listen: tcp:redacted:45470
    nat: "true"
    type: proxy
  deluge-listen_port-25053:
    connect: tcp:0.0.0.0:25053
    listen: tcp:redacted:25053
    nat: "true"
    type: proxy
  eth0:
    ipv4.address: 240.8.0.246
    name: eth0
    network: lxdfan0
    type: nic
  eth1:
    ipv6.address: 2a0a:7000:1337:1c:9032:c230:e79f:3ff9
    nictype: routed
    parent: enp216s0
    type: nic
  plex-37839:
    connect: tcp:0.0.0.0:32400
    listen: tcp:redacted:37839
    nat: "true"
    type: proxy
  portainer-27698:
    connect: tcp:0.0.0.0:56772
    listen: tcp:redacted:27698
    nat: "true"
    type: proxy
  qbtest-26233:
    connect: tcp:0.0.0.0:56443
    listen: tcp:redacted:26233
    nat: "true"
    type: proxy
  root:
    limits.max: 150iops
    path: /
    pool: ceph-erasure
    size: 750GB
    type: disk
  rutorrent-connection_port-26591:
    connect: tcp:0.0.0.0:26591
    listen: tcp:redacted:26591
    nat: "true"
    type: proxy
  shadowsock-23814:
    connect: tcp:0.0.0.0:8388
    listen: tcp:redacted:23814
    nat: "true"
    type: proxy
  ssh-40338:
    connect: tcp:0.0.0.0:22
    listen: tcp:redacted:40338
    nat: "true"
    type: proxy
  temp-46360:
    connect: tcp:0.0.0.0:8080
    listen: tcp:redacted:46360
    nat: "true"
    type: proxy
ephemeral: false
profiles:
- bysh
stateful: false
description: ""

config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20210812)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20210812"
  image.type: squashfs
  image.version: "20.04"
  limits.cpu: "8"
  limits.memory: 20000MB
  limits.memory.enforce: hard
  limits.memory.swap: "false"
  security.nesting: "true"
  user.user-data: |
    #cloud-config
    phone_home:
      url: https://redacted/api/v1/capsules/$INSTANCE_ID/finish_deployment
      tries: 15
      post:
        - hostname
        - instance_id
    #package_upgrade: true
    packages:
      - apache2
      - unzip
      - ruby
      - build-essential
      - libsqlite3-dev
      - sqlite3
      - ruby-dev
    timezone: Europe/Amsterdam
  volatile.base_image: fab57376cf04b817d43804d079321241ce98d3b5c2296f1a41541de6c100ab09
  volatile.eth0.hwaddr: 00:16:3e:aa:7d:e1
  volatile.eth1.hwaddr: 00:16:3e:a3:4e:1c
  volatile.eth1.name: eth1
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: STOPPED
  volatile.uuid: b79bd25a-c07e-49c5-b85d-b2b2a563d2fa
devices:
  aa-web-44002:
    connect: tcp:0.0.0.0:4567
    listen: tcp:redacted:44002
    nat: "true"
    type: proxy
  eth0:
    ipv4.address: 240.8.0.106
    name: eth0
    network: lxdfan0
    type: nic
  eth1:
    ipv6.address: 2a0a:7000:1337:fd:c4a7:387:c522:b95c
    nictype: routed
    parent: enp216s0
    type: nic
  filebrowser-37705:
    connect: tcp:0.0.0.0:6990
    listen: tcp:redacted:37705
    nat: "true"
    type: proxy
  jackett-34290:
    connect: tcp:0.0.0.0:6631
    listen: tcp:redacted:34290
    nat: "true"
    type: proxy
  ombi-20481:
    connect: tcp:0.0.0.0:7968
    listen: tcp:redacted:20481
    nat: "true"
    type: proxy
  plex-port-32244:
    connect: tcp:0.0.0.0:32400
    listen: tcp:redacted:32244
    nat: "true"
    type: proxy
  radarr-20905:
    connect: tcp:0.0.0.0:6917
    listen: tcp:redacted:20905
    nat: "true"
    type: proxy
  root:
    limits.max: 300iops
    path: /
    pool: ceph-erasure
    size: 4500GB
    type: disk
  rutorrent-44733:
    connect: tcp:0.0.0.0:2530
    listen: tcp:redacted:44733
    nat: "true"
    type: proxy
  rutorrent-connection_port-36284:
    connect: tcp:0.0.0.0:36284
    listen: tcp:redacted:36284
    nat: "true"
    type: proxy
  sonarr-40859:
    connect: tcp:0.0.0.0:6930
    listen: tcp:redacted:40859
    nat: "true"
    type: proxy
  ssh-21730:
    connect: tcp:0.0.0.0:22
    listen: tcp:redacted:21730
    nat: "true"
    type: proxy
  syncthing-46874:
    connect: tcp:0.0.0.0:5631
    listen: tcp:redacted:46874
    nat: "true"
    type: proxy
  tautulli-38982:
    connect: tcp:0.0.0.0:5479
    listen: tcp:redacted:38982
    nat: "true"
    type: proxy
  vnc-25943:
    connect: tcp:0.0.0.0:5946
    listen: tcp:redacted:25943
    nat: "true"
    type: proxy
ephemeral: false
profiles:
- bysh_2004
stateful: false
description: ""

Thanks, and lxc network show lxdfan0 please?

As well as ip a and ip r on the host.

This sounds quite similar to this earlier issue:

It involved routed NICs as well, but on that issue I never heard back from the OP after we identified from the logs that there did indeed appear to be an OVS switch in use.
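If you want to double-check whether enp216s0 really is an OVS bridge on that host, something like this should tell you (it only works while the openvswitch tools are installed):

sudo ovs-vsctl list-br
sudo ovs-vsctl br-exists enp216s0 && echo "enp216s0 is an OVS bridge"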

Here it is:

config:
  bridge.mode: fan
  fan.underlay_subnet: x.x.x.0/24
  ipv4.nat: "true"
description: ""
name: lxdfan0
type: bridge
used_by:
- /1.0/redacted
...
managed: true
status: Created
locations:
- procyon

The ip a output was a bit too long, so I put it here.

ip r
default via x.x.x.1 dev enp216s0 proto static 
x.x.x.0/22 dev enp216s0 proto kernel scope link src x.x.x.8 
240.0.0.0/8 dev lxdfan0 proto kernel scope link src 240.8.0.1 

Thanks, and also sudo bridge link show and sudo ovs-vsctl show please.