LXC commands are hung


(Jon Clayton) #1

I tried to shut down a nested container, which was taking forever, but it seems to have brought down the LXD daemon service.

I can’t run any commands, and systemctl stop snap.lxd.daemon.service is stuck in a deactivating state:

snap.lxd.daemon.service - Service for snap application lxd.daemon
   Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
   Active: deactivating (stop) since Mon 2018-11-26 17:09:11 GMT; 2min 19s ago
  Process: 24392 ExecStart=/usr/bin/snap run lxd.daemon (code=killed, signal=TERM)
 Main PID: 24392 (code=killed, signal=TERM);         : 27012 (daemon.stop)
    Tasks: 8
   Memory: 3.6M
      CPU: 195ms
   CGroup: /system.slice/snap.lxd.daemon.service
           └─control
             ├─27012 /bin/sh /snap/lxd/9600/commands/daemon.stop
             └─27038 lxd shutdown

Nov 26 17:10:13 gns3vm lxd.daemon[7426]:   2: fd:  10: blkio
Nov 26 17:10:13 gns3vm lxd.daemon[7426]:   3: fd:  11: pids
Nov 26 17:10:13 gns3vm lxd.daemon[7426]:   4: fd:  12: cpu,cpuacct
Nov 26 17:10:13 gns3vm lxd.daemon[7426]:   5: fd:  13: memory
Nov 26 17:10:13 gns3vm lxd.daemon[7426]:   6: fd:  14: cpuset
Nov 26 17:10:13 gns3vm lxd.daemon[7426]:   7: fd:  15: freezer
Nov 26 17:10:13 gns3vm lxd.daemon[7426]:   8: fd:  16: hugetlb
Nov 26 17:10:13 gns3vm lxd.daemon[7426]:   9: fd:  17: devices
Nov 26 17:10:13 gns3vm lxd.daemon[7426]:  10: fd:  18: name=systemd
Nov 26 17:10:13 gns3vm lxd.daemon[7426]: lxcfs.c: 105: do_reload: lxcfs: reloaded

I think normally rebooting would solve this but I really would like to get it back up and running without having to reboot.

All containers seem to still be responding fine, but I can’t run any lxc commands; they just hang :frowning:

Any ideas on what to try? Possibly manually killing some processes?

root@gns3vm:/home/jclayton# systemctl status | grep lxd
       ├─24500 lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
       ├─24501 lxd waitready
       ├─24502 /bin/sh /snap/lxd/9600/commands/daemon.start
       ├─29124 /bin/sh /snap/lxd/9600/commands/daemon.start
       ├─29235 lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
       ├─29236 lxd waitready
       ├─29237 /bin/sh /snap/lxd/9600/commands/daemon.start
       ├─62176 lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
       ├─62178 lxd waitready
       ├─62179 /bin/sh /snap/lxd/9600/commands/daemon.start
       │ │   ├─lxd.service
       │ │   │ └─5533 /usr/bin/lxd --group lxd --logfile=/var/log/lxd/lxd.log
       │ │   ├─lxd-bridge.service
       │ │   │ └─5516 lxd-bridge-proxy --addr=[fe80::1%lxdbr0]:13128
       │ │ │ │   ├─lxd.service
       │ │ │ │   │ └─41802 /usr/lib/lxd/lxd --group lxd --logfile=/var/log/lxd/lxd.log
       │ │ │ ├─ 6060 [lxc monitor] /var/snap/lxd/common/lxd/containers kafka-connect3
       │ │ │ ├─37810 [lxc monitor] /var/snap/lxd/common/lxd/containers kafka-connect2
       │ │ │ ├─38278 [lxc monitor] /var/snap/lxd/common/lxd/containers kafka-connect5
       │ │ │ ├─44934 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
       │ │ │ ├─48819 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
       │ │ │ ├─53998 /bin/sh /snap/lxd/9600/commands/daemon.start
       │ │ │ ├─54121 lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
       │ │ │ ├─54251 dnsmasq --strict-order --bind-interfaces --pid-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.182.241.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.182.241.2,10.182.241.254,1h -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd
       │ │ │ ├─55728 [lxc monitor] /var/snap/lxd/common/lxd/containers haproxy
       │ │ │ ├─64200 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
       │ │ │ ├─65033 [lxc monitor] /var/snap/lxd/common/lxd/containers kafka-connect4
       │ │ │ ├─65058 [lxc monitor] /var/snap/lxd/common/lxd/containers grafana
       │ │ │ ├─66506 [lxc monitor] /var/snap/lxd/common/lxd/containers influxdb
       │ │ │ └─71373 [lxc monitor] /var/snap/lxd/common/lxd/containers kafka-connect1
       │ │ │ ├─snap-lxd-9550.mount
       │ │ │ │ └─3254 snapfuse /var/lib/snapd/snaps/lxd_9550.snap /snap/lxd/9550 -o ro,nodev,allow_other,suid
       │ │ │ ├─snap-lxd-9564.mount
       │ │ │ │ └─46412 snapfuse /var/lib/snapd/snaps/lxd_9564.snap /snap/lxd/9564 -o ro,nodev,allow_other,suid
       │ │ │ ├─snap-lxd-9437.mount
       │ │ │ │ └─44142 snapfuse /var/lib/snapd/snaps/lxd_9437.snap /snap/lxd/9437 -o ro,nodev,allow_other,suid
       │ │ │ ├─snap-lxd-9263.mount
       │ │ │ │ └─48154 snapfuse /var/lib/snapd/snaps/lxd_9263.snap /snap/lxd/9263 -o ro,nodev,allow_other,suid
       │ │ │ ├─snap-lxd-9412.mount
       │ │ │ │ └─44405 snapfuse /var/lib/snapd/snaps/lxd_9412.snap /snap/lxd/9412 -o ro,nodev,allow_other,suid
       │ │ │ ├─snap-lxd-9239.mount
       │ │ │ │ └─64142 snapfuse /var/lib/snapd/snaps/lxd_9239.snap /snap/lxd/9239 -o ro,nodev,allow_other,suid
       │ │ │ ├─snap-lxd-9354.mount
       │ │ │ │ └─16180 snapfuse /var/lib/snapd/snaps/lxd_9354.snap /snap/lxd/9354 -o ro,nodev,allow_other,suid
       │ │ │ ├─snap-lxd-9277.mount
       │ │ │ │ └─27589 snapfuse /var/lib/snapd/snaps/lxd_9277.snap /snap/lxd/9277 -o ro,nodev,allow_other,suid
       │ │ │ ├─snap-lxd-9600.mount
       │ │ │ │ └─53674 snapfuse /var/lib/snapd/snaps/lxd_9600.snap /snap/lxd/9600 -o ro,nodev,allow_other,suid
       │ │ │ ├─snap-lxd-9510.mount
       │ │ │ │ └─51907 snapfuse /var/lib/snapd/snaps/lxd_9510.snap /snap/lxd/9510 -o ro,nodev,allow_other,suid
       │ │ │ ├─snap-lxd-9298.mount
       │ │ │ │ └─61805 snapfuse /var/lib/snapd/snaps/lxd_9298.snap /snap/lxd/9298 -o ro,nodev,allow_other,suid
       │ │ │ └─snap-lxd-9210.mount
       │ │ │   └─63017 snapfuse /var/lib/snapd/snaps/lxd_9210.snap /snap/lxd/9210 -o ro,nodev,allow_other,suid
       │ │   ├─lxd.service
       │ │   │ └─16981 /usr/lib/lxd/lxd --group lxd --logfile=/var/log/lxd/lxd.log
       │ ├─ 8552 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
       │ ├─10197 [lxc monitor] /var/snap/lxd/common/lxd/containers gitlab
       │ ├─12751 [lxc monitor] /var/snap/lxd/common/lxd/containers eve-ng
       │ ├─18287 [lxc monitor] /var/snap/lxd/common/lxd/containers librenms
       │ ├─18957 [lxc monitor] /var/snap/lxd/common/lxd/containers openvpn
       │ ├─33410 [lxc monitor] /var/snap/lxd/common/lxd/containers opennti
       │ ├─41042 [lxc monitor] /var/snap/lxd/common/lxd/containers splunk
       │ ├─47307 [lxc monitor] /var/snap/lxd/common/lxd/containers radius
       │ ├─47742 /snap/lxd/current/bin/lxd forkstart telegraf1 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/telegraf1/lxc.conf
       │ ├─47759 [lxc monitor] /var/snap/lxd/common/lxd/containers telegraf1
       │ ├─49797 [lxc monitor] /var/snap/lxd/common/lxd/containers fluentd
       │ ├─61993 [lxc monitor] /var/snap/lxd/common/lxd/containers telemetry
       │ ├─63941 [lxc monitor] /var/snap/lxd/common/lxd/containers elastiflow
       │ └─71525 [lxc monitor] /var/snap/lxd/common/lxd/containers zabbix
       │ ├─snap.lxd.daemon.service
       │ │   ├─27012 /bin/sh /snap/lxd/9600/commands/daemon.stop
       │ │   └─27038 lxd shutdown
           │ ├─10993 sudo lxd --debug --group lxd
           │ ├─10998 lxd --debug --group lxd
             └─30193 grep --color=auto lxd

(Stéphane Graber) #2

Can you show ps fauxww?


(Jon Clayton) #3

Sure, I think it’s some stuck process from restarting the new nested container (telegraf1):

Here is ps fauxww: https://paste.ee/p/raSYJ

Cheers,
Jon


(Jon Clayton) #4

Also, services seem to be running okay now, but lxc still hangs when running any command.


(Stéphane Graber) #5

You have a lot of LXD-related processes stuck in D state (uninterruptible sleep, usually I/O wait).
There’s a good chance that your kernel is having a bad day.

Can you show dmesg?

Those processes are likely to be impossible to kill so there’d be no way to recover from this other than rebooting the system, but maybe dmesg will tell us more about what’s going on.
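
For anyone who wants to check for the same symptom, here is a minimal sketch of how D-state processes could be picked out of ps output. It is not part of the original diagnosis: the function names find_d_state and current_d_state are made up for illustration, and it assumes a procps-style ps that accepts -eo pid,stat,args and prints a single header row.

```python
import subprocess


def find_d_state(ps_output):
    """Return (pid, stat, command) for every process whose state starts with 'D'.

    Expects text in the layout produced by `ps -eo pid,stat,args`:
    one header row, then one process per line.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, stat, rest of the command line
        if len(parts) < 3:
            continue  # ignore blank or truncated lines
        pid, stat, command = parts
        if stat.startswith("D"):  # D = uninterruptible sleep
            stuck.append((int(pid), stat, command))
    return stuck


def current_d_state():
    """Run ps on the local machine and return its D-state processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_d_state(result.stdout)


if __name__ == "__main__":
    for pid, stat, command in current_d_state():
        print(pid, stat, command)
```

A process that stays in this list across several runs is most likely blocked inside the kernel, which is why signals (including SIGKILL) have no effect on it until the kernel call returns.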


(Jon Clayton) #6

I’m using file-backed ZFS at the moment (.img file), which isn’t ideal. Could that have something to do with it?

dmesg:

https://raw.githubusercontent.com/bodleytunes/paste-stuff/master/dmesg


(Stéphane Graber) #7

No, looks like a network namespace kernel bug.
I’m afraid the only way you can recover from this is by rebooting, and I’d advise doing it sooner rather than later as your kernel is clearly misbehaving.


(Jon Clayton) #8

Yeah will reboot.

Would updating the kernel be of any use? I currently have the kernel locked against upgrades, as I patched it to allow the Linux bridge to forward link-local frames.

Cheers!
Jon.


(Stéphane Graber) #9

It certainly wouldn’t hurt. Kernel updates do tend to include a large number of bugfixes, so this may be something that got fixed.


(Jon Clayton) #10

OK cheers. :slight_smile:


(Jon Clayton) #11

Not sure if it’s directly related, but I seem to still have an issue, this time with a hung monitor on that same nested container. It’s exactly the same output as on this thread: https://github.com/lxc/lxd/issues/4468

However, I’m running a 4.15 kernel, which is fairly recent :(:thinking: