Lost a container (failed to start), exit status 1

derek423 · February 6, 2023, 3:46pm

Had a strange issue with a container that has been running Zimbra email for the past seven months with no issues. While I was cleaning up some log files in /var/log inside the container, I started getting the error “cannot open shared object file: No such file or directory”, and eventually no command would run without it. So I exited back to the host, where everything appeared to be fine: services were still running and clients were still connecting. But for some reason I could no longer open a shell in the container with the usual "lxc shell ", so I attempted to restart the container, as it had been running for about 93 days. It failed to restart; it just hung. I tried this a few times.

So I went into panic mode. I was also not able to revert to the latest snapshot; I got the same error: “Error: Failed to run: /snap/lxd/current/bin/lxd forkstart zcs-old /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/zcs-old/lxc.conf: exit status 1
Try lxc info --show-log zcs-old for more info”

Here is what the log showed:
Name: zcs-old
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/12/03 21:44 EST
Last Used: 2023/02/06 10:27 EST

Log:

lxc zcs-old 20230206152746.471 WARN conf - …/src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc zcs-old 20230206152746.472 WARN conf - …/src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc zcs-old 20230206152746.472 WARN conf - …/src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc zcs-old 20230206152746.472 WARN conf - …/src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc zcs-old 20230206152746.472 WARN cgfsng - …/src/src/lxc/cgroups/cgfsng.c:fchowmodat:1619 - No such file or directory - Failed to fchownat(40, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc zcs-old 20230206152746.593 ERROR start - …/src/src/lxc/start.c:start:2197 - No such file or directory - Failed to exec “/sbin/init”
lxc zcs-old 20230206152746.593 ERROR sync - …/src/src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 7)
lxc zcs-old 20230206152746.598 WARN network - …/src/src/lxc/network.c:lxc_delete_network_priv:3631 - Failed to rename interface with index 0 from “eth0” to its initial name “veth47682ebe”
lxc zcs-old 20230206152746.598 ERROR lxccontainer - …/src/src/lxc/lxccontainer.c:wait_on_daemonized_start:878 - Received container state “ABORTING” instead of “RUNNING”
lxc zcs-old 20230206152746.598 ERROR start - …/src/src/lxc/start.c:__lxc_start:2107 - Failed to spawn container “zcs-old”
lxc zcs-old 20230206152746.598 WARN start - …/src/src/lxc/start.c:lxc_abort:1036 - No such process - Failed to send SIGKILL via pidfd 41 for process 3157696
lxc 20230206152751.736 ERROR af_unix - …/src/src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20230206152751.736 ERROR commands - …/src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command “get_state”
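
The entries after the first ERROR in the log above look like follow-on failures, so the first ERROR (the failed exec of /sbin/init) is probably the one worth chasing. As a rough sketch, and not anything LXD itself provides, a few lines of Python can pull that entry out of `lxc info --show-log` output. The regex and the names `Entry`, `parse_log`, and `first_error` are assumptions made up for illustration, based only on the line layout visible in this log.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed layout, inferred from the log in this thread:
#   lxc [container] <timestamp> <LEVEL> <module> - <rest of message>
LOG_LINE = re.compile(
    r"^lxc\s+(?:(?P<name>\S+)\s+)?(?P<ts>\d{14}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<module>\S+)\s+-\s+(?P<rest>.*)$"
)


class Entry(NamedTuple):
    name: Optional[str]  # container name; absent on some lines
    timestamp: str
    level: str
    module: str
    message: str


def parse_log(text: str) -> List[Entry]:
    """Parse log text into entries, skipping lines that do not match."""
    entries = []
    for line in text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m:
            entries.append(
                Entry(m["name"], m["ts"], m["level"], m["module"], m["rest"])
            )
    return entries


def first_error(entries: List[Entry]) -> Optional[Entry]:
    """Return the earliest ERROR entry, or None if there is none."""
    return next((e for e in entries if e.level == "ERROR"), None)


# A few lines copied from the log in this thread.
SAMPLE = """\
lxc zcs-old 20230206152746.472 WARN conf - …/src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc zcs-old 20230206152746.593 ERROR start - …/src/src/lxc/start.c:start:2197 - No such file or directory - Failed to exec “/sbin/init”
lxc zcs-old 20230206152746.593 ERROR sync - …/src/src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 7)
lxc 20230206152751.736 ERROR af_unix - …/src/src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
"""

if __name__ == "__main__":
    err = first_error(parse_log(SAMPLE))
    if err is not None:
        print(err.module, "-", err.message)
```

Run against the sample, the first ERROR comes from the `start` module and carries the "Failed to exec" message, which points at the container's root filesystem rather than at LXD.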

By this time I had already run a restore from the latest backup file, so the instance is back up and everything is rolling along fine now. I have renamed the troubled container to “zcs-old” and assigned it a different profile for troubleshooting purposes.

I don’t know if I fat-fingered a command while I was cleaning up old log files or what. I have never encountered this before; I do cleanup operations all the time and have never had a problem. Scratching my head on this one. Thought I would share the error info here to look for clues.

Oh, and I am running LXD 5.0.2 stable on Ubuntu 20.04 LTS, on a tiny server with 16GB of RAM, along with about 61 other containers that all run web APIs/apps and various services. It has been in production since March of 2019.
Thanks!

tomp · February 9, 2023, 8:56am

What storage pool type are you using? Sounds like your storage system is failing.

DJ423 · February 9, 2023, 12:20pm

ZFS, on /dev/sda, an SSD that’s about three years old. FWIW, it’s been running fine since the restore.

zpool list
NAME    SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
pool1   464G  35.5G  429G  -        -         17%   7%   1.00x  ONLINE  -
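
The listing above can also be checked mechanically. What follows is a minimal sketch that parses `zpool list` text into dictionaries and flags any pool that is not ONLINE or is above a capacity threshold. The function names and the 80% threshold are illustrative assumptions, not ZFS guidance taken from this thread.

```python
from typing import Dict, List

# The exact output posted in this thread.
SAMPLE = """\
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
pool1 464G 35.5G 429G - - 17% 7% 1.00x ONLINE -
"""


def parse_zpool_list(text: str) -> List[Dict[str, str]]:
    """Turn whitespace-separated `zpool list` output into one dict per pool."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


def pools_needing_attention(
    pools: List[Dict[str, str]], max_cap_percent: int = 80
) -> List[str]:
    """Names of pools that are not ONLINE or exceed the (assumed) threshold."""
    flagged = []
    for pool in pools:
        cap = int(pool["CAP"].rstrip("%"))
        if pool["HEALTH"] != "ONLINE" or cap > max_cap_percent:
            flagged.append(pool["NAME"])
    return flagged


if __name__ == "__main__":
    pools = parse_zpool_list(SAMPLE)
    print(pools_needing_attention(pools))
```

For the output posted here nothing is flagged. An ONLINE summary does not by itself rule out damaged files; `zpool status` reports per-device read, write, and checksum error counts, which is the more telling view when a failing disk is suspected.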

Thanks

DJ423 · February 11, 2023, 1:05pm

Update:

I think I somehow damaged the filesystem in the container. When I tried to look inside the export tar file, it wouldn’t even extract; it fails at a large database file in the image. Something tells me I deleted some files, causing the container to crash and be unable to boot. The restored container has been running fine since, so I am chalking this one up to user error.
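
One way to see where an export breaks, and whether it still contains an init binary (the start log complained about /sbin/init), is to walk the archive member by member instead of extracting it. This is a minimal sketch using Python's standard `tarfile` module; `scan_archive`, its return keys, and the archive filename in the usage comment are hypothetical, and the internal layout of the export is not assumed beyond paths ending in `sbin/init`.

```python
import tarfile
import zlib
from typing import Any, Dict


def scan_archive(path: str, wanted_suffix: str = "sbin/init") -> Dict[str, Any]:
    """Read every member of a tar archive and report where reading fails.

    Regular files are read in full (and discarded) so that damage inside
    a large file is detected. Members whose path ends with `wanted_suffix`
    are recorded whatever their type, since init is often a symlink.
    """
    found = []
    members_read = 0
    last_good = None
    try:
        # "r:*" lets tarfile detect the compression, if any.
        with tarfile.open(path, "r:*") as tar:
            for member in tar:
                if member.isfile():
                    stream = tar.extractfile(member)
                    while stream.read(1 << 20):  # 1 MiB chunks
                        pass
                if member.name.endswith(wanted_suffix):
                    found.append(member.name)
                members_read += 1
                last_good = member.name
    except (tarfile.TarError, OSError, EOFError, zlib.error) as exc:
        return {
            "ok": False,
            "members_read": members_read,
            "last_good": last_good,
            "found": found,
            "error": repr(exc),
        }
    return {
        "ok": True,
        "members_read": members_read,
        "last_good": last_good,
        "found": found,
        "error": None,
    }


# Usage (placeholder filename, not taken from this thread):
# result = scan_archive("zcs-old-export.tar.gz")
# print(result)
```

If `ok` comes back false, `last_good` names the last member that read cleanly, so the member after it is where the archive is damaged. An empty `found` list on an otherwise readable archive would suggest the init binary was already missing when the export was taken.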