Container sometimes ends up in the STOPPED state after a container reboot

Rarely, rebooting a container by executing reboot from inside it will simply power it off, and we need to start it up again with lxc start.

Subsequent reboots work fine, so this is very hard to reproduce.

For example, I just now rebooted an Ubuntu 22.04 container (8 weeks of uptime) on a 20.04 host with LXD 5.13 on btrfs, and it went straight to STOPPED.

This has been happening for years, I believe all the way back to LXD 3.x; it's just been so hard to reproduce that I haven't gotten around to reporting it until now.

How could I possibly diagnose it or get any meaningful information about this issue?

That can happen if there's a storage or device stop issue while the container shuts down, though that should normally result in an error in lxd.log.
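For example, something along these lines should surface any such errors on a snap-based install (the log path and service name are the ones mentioned later in this thread; the grep patterns are just a rough filter):

grep -E 'level=(error|warning)' /var/snap/lxd/common/lxd/logs/lxd.log | tail -n 20
journalctl -u snap.lxd.daemon --since "1 hour ago" | grep -iE 'error|warn'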

If it’s somewhat reproducible with a particular container, you could run lxc monitor --pretty in a separate terminal while running reboot to catch any useful debug output.
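For example (a rough sketch; <instance> is a placeholder for the container's name):

lxc monitor --pretty

and then, in another terminal:

lxc exec <instance> -- reboot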

Just had it happen again, but there's nothing about it in lxd.log; the last error/warn entries are from 3 days ago. I checked both snap.lxd.daemon.service and /var/snap/lxd/common/lxd/logs/lxd.log.

The container ends up in the unwanted STOPPED state after issuing a reboot inside the container itself.

Subsequent reboots work normally.

This is a 22.04 host and client on kernel 5.15.0-58-generic with LXD 5.13 using btrfs.

Thankfully I have been able to reproduce this with another container on the same host with lxc monitor attached, but I'm not sure whether it contains anything helpful.

All container names except the container in question are censored. https://haste.rys.pw/idihafadij.sql

I've been having the same issue with LXC 4.0.2 (no LXD) on ALT Linux (kernel 5.10.176). It happened from time to time in the past, but after I changed limits as described in a post on the Linux Containers forum, it happens almost every time now.

lxc-monitor shows nothing useful:

'ts-print' changed state to [STARTING]
'ts-print' changed state to [RUNNING]
'ts-print' exited with status [0]
'ts-print' changed state to [STOPPING]
'ts-print' changed state to [STOPPED]
'ts-print' changed state to [STARTING]
'ts-print' changed state to [RUNNING]
'ts-print' changed state to [ABORTING]
'ts-print' changed state to [STOPPING]
'ts-print' changed state to [STOPPED]

Can you enable debug logging in LXD:

sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon

And then the next time it happens get the contents of:

/var/snap/lxd/common/lxd/logs/lxd.log
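Once the logs have been captured, the setting can presumably be turned back off the same way:

sudo snap set lxd daemon.debug=false; sudo systemctl reload snap.lxd.daemon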

lxc.conf of problematic container - https://haste.rys.pw/xehejokoro.ini
lxc.log of problematic container - https://cloud.rys.pw/s/D4r6aezgeCSA2yT

Excerpt with the issue:

lxc 20230601135455.311 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd:521 - Opened new command socket connection fd 34 for command "get_devpts_fd"
lxc ansible 20230601135455.319 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc ansible 20230601135455.319 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc ansible 20230601135919.273 ERROR    conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 127
lxc ansible 20230601135919.277 ERROR    start - ../src/src/lxc/start.c:lxc_end:944 - Failed to run "lxc.hook.stop" hook
lxc ansible 20230601135919.942 ERROR    conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 127
lxc ansible 20230601135919.942 ERROR    start - ../src/src/lxc/start.c:lxc_end:985 - Failed to run lxc.hook.post-stop for container "ansible"
lxc ansible 20230601135919.942 WARN     start - ../src/src/lxc/start.c:lxc_end:987 - Container will be stopped instead of rebooted
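Exit status 127 conventionally means the shell could not find the command, so it looks like whatever the lxc.hook.stop and lxc.hook.post-stop hooks point at is missing at shutdown time, and when those hooks fail LXC falls back to stopping instead of rebooting, as the last line says. A rough way to see which hook commands are configured (assuming the snap layout, where the rendered lxc.conf sits next to the instance log; it's the same file linked above):

grep -E '^lxc\.hook\.' /var/snap/lxd/common/lxd/logs/ansible/lxc.conf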

I also captured the requested lxd.log, but I think I'd have to go through it carefully to comb out unwanted information; let me know if that is necessary, or I can send it privately.

Please show lxc config show <instance> --expanded

It will be useful to see the debug log output for the time when the instance is stopping.

# lxc config show ansible --expanded                                                               
architecture: x86_64
config:
  limits.cpu: "6"
  limits.memory: 12GB
  linux.kernel_modules: overlay, nf_nat
  security.nesting: "true"
  volatile.base_image: b9cba741ef621f14e6e8c306c7f87637a59d7d09c1a3ec8ef717ab0d6e43430a
  volatile.cloud-init.instance-id: a6d9e1b8-ffd8-4407-b172-f61bd570fb8b
  volatile.eth0.host_name: veth1f06ea24
  volatile.eth0.hwaddr: de:ad:be:ef:fe:eb
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: d547a824-cee2-4087-a6e3-4741c8e78341
  volatile.uuid.generation: d547a824-cee2-4087-a6e3-4741c8e78341
devices:
  aadisable:
    path: /sys/module/apparmor/parameters/enabled
    source: /dev/null
    type: disk
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  fuse:
    path: /dev/fuse
    type: unix-char
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- default
- docker
stateful: false
description: ""

A bit late, but here's the censored debug log; it does not seem to have much for 15:59:19, which is when the stop error occurred.

time="2023-06-01T15:57:46+02:00" <OK Status log for a different container>
time="2023-06-01T15:59:19+02:00" level=debug msg="Instance process stopped" PID=2935457 err="<nil>" exitStatus=129 instance=ansible interactive=true project=default
time="2023-06-01T15:59:19+02:00" level=debug msg="Websocket: Sending barrier message" address=@
time="2023-06-01T15:59:19+02:00" level=debug msg="Exec control handler finished" PID=2935457 instance=ansible interactive=true project=default
time="2023-06-01T15:59:19+02:00" level=debug msg="Websocket: Got normal closure" address=@
time="2023-06-01T15:59:19+02:00" level=debug msg="Exec mirror websocket finished" PID=2935457 instance=ansible interactive=true number=0 project=default
time="2023-06-01T15:59:19+02:00" level=debug msg="Websocket: Stopped mirror" address=@
time="2023-06-01T15:59:19+02:00" level=debug msg="Updated metadata for operation" class=websocket description="Executing command" operation=220bdf08-9fb4-493b-9056-357d5667a644 project=default
time="2023-06-01T15:59:19+02:00" level=debug msg="Success for operation" class=websocket description="Executing command" operation=220bdf08-9fb4-493b-9056-357d5667a644 project=default
time="2023-06-01T15:59:19+02:00" level=debug msg="Event listener server handler stopped" listener=b3429230-4ba6-4695-914d-69a5c209d3fe local=/var/snap/lxd/common/lxd/unix.socket remote=@
time="2023-06-01T15:59:22+02:00" level=debug msg="Handling API request" ip=@ method=GET protocol=unix url=/1.0 username=root
time="2023-06-01T15:59:22+02:00" level=debug msg="WriteJSON\n\t{\n\t\t\"type\": \"sync\",\n\t\t\"status\": \"Success\",\n\t\t\"status_code\": 200,\n\t\t\"operation\": \"\",\n\t\t\"err<CENSORED>
time="2023-06-01T15:59:22+02:00" level=debug msg="Handling API request" ip=@ method=GET protocol=unix url="/1.0/instances?filter=&recursion=2" username=root
<instance usage logs for different containers>
time="2023-06-01T15:59:22+02:00" level=debug msg="GetInstanceUsage started" driver=btrfs instance=ansible pool=default project=default
time="2023-06-01T15:59:22+02:00" level=debug msg="GetInstanceUsage finished" driver=btrfs instance=ansible pool=default project=default
<instance usage logs for different containers>
time="2023-06-01T15:59:22+02:00" level=debug msg="WriteJSON\n\t{\n\t\t\"type\": \"sync\",\n\t\t\"status\": \"Success\",\n\t\t\"status_code\": 200,\n\t\t\"op
time="2023-06-01T15:59:26+02:00" level=debug msg="Matched trusted cert" fingerprint=f5a50134f60a78d6d55e6e694d209922f15ef1531194b799aaf5fb4ad54b3946 subject="CN=127.0.0.1,OU=IT,O=Company,L=Prague,ST=Czech Republic,C=CZ,1.2.840.113549.1.9.1=#0c0b6974406c6f67696f2e637a"
...
time="2023-06-01T15:59:26+02:00" level=debug msg="Matched trusted cert" fingerprint=f5a50134f60a78d6d55e6e694d209922f15ef1531194b799aaf5fb4ad54b3946 subject="CN=127.0.0.1,OU=IT,O=Company,L=Prague,ST=Czech Republic,C=CZ,1.2.840.113549.1.9.1=#0c0b6974406c6f67696f2e637a"
...
time="2023-06-01T15:59:30+02:00" level=debug msg="Handling API request" ip=@ method=GET protocol=unix url=/1.0 username=root
time="2023-06-01T15:59:30+02:00" level=debug msg="WriteJSON\n\t{\n\t\t\"type\": \"sync\",\n\t\t\"status\": \"Success\",\n\t\t\"status_code\": 200,\n\t\t\"operation\": \"\",\n\t\t\"erro<CENSORED>
time="2023-06-01T15:59:30+02:00" level=debug msg="Handling API request" ip=@ method=GET protocol=unix url="/1.0/instances?filter=&recursion=2" username=root
<instance usage logs for different containers>
time="2023-06-01T15:59:30+02:00" level=debug msg="GetInstanceUsage started" driver=btrfs instance=ansible pool=default project=default
time="2023-06-01T15:59:30+02:00" level=debug msg="GetInstanceUsage finished" driver=btrfs instance=ansible pool=default project=default
<instance usage logs for different containers>
time="2023-06-01T15:59:30+02:00" level=debug msg="WriteJSON\n\t{\n\t\t\"type\": \"sync\",\n\t\t\"status\": \"Success\",\n\t\t\"status_code\": 200,\n\t\t\"operation\": \"\",\n\t\t\"err<CENSORED>
time="2023-06-01T15:59:31+02:00" level=debug msg="Handling API request" ip=@ method=GET protocol=unix url=/1.0 username=root
time="2023-06-01T15:59:31+02:00" level=debug msg="WriteJSON\n\t{\n\t\t\"type\": \"sync\",\n\t\t\"status\": \"Success\",\n\t\t\"status_code\": 200,\n\t\t\"operation\": \"\",\n\t\t\"err<CENSORED>
time="2023-06-01T15:59:31+02:00" level=debug msg="Handling API request" ip=@ method=GET protocol=unix url="/1.0/instances?filter=&recursion=2" username=root
time="2023-06-01T15:59:31+02:00" level=debug msg="GetInstanceUsage started" driver=btrfs instance=ansible pool=default project=default
time="2023-06-01T15:59:31+02:00" level=debug msg="GetInstanceUsage finished" driver=btrfs instance=ansible pool=default project=default
<instance usage logs for different containers>
time="2023-06-01T15:59:31+02:00" level=debug msg="WriteJSON\n\t{\n\t\t\"type\": \"sync\",\n\t\t\"status\": \"Success\",\n\t\t\"status_code\": 200,\n\t\t\"operation\": \"\",\n\t\t\"err<CENSORED>

That error just shows that an lxc exec session ended, which isn't the same as the container itself stopping, although the two could be related.

Anything else I can provide?

I found something interesting: if I use "lxc-start -F" to start a container in foreground mode, reboot works fine every time. I don't know what this means, but maybe it will help.
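In case it helps with a future repro, a sketch of capturing that foreground run with full logging (using the container name from the lxc-monitor output above; the log path is arbitrary):

lxc-start -n ts-print -F -o /tmp/ts-print.log -l TRACE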