Lost a container, (failed to start) exit status 1

I had a strange issue with a container that has been running Zimbra email for the past seven months without problems. While I was cleaning up some log files in /var/log inside the container, I started getting the error “cannot open shared object file: No such file or directory”, and eventually no command would run without it. So I exited back to the host, where everything appeared fine: services were still running and clients were still connecting. But for some reason I could no longer open a shell in the container with the usual "lxc shell ", so I tried to restart the container, which had been up for about 93 days. It failed to restart and just hung. I tried this a few times.

So I went into panic mode. I was also unable to revert to the latest snapshot, and I got the same error: “Error: Failed to run: /snap/lxd/current/bin/lxd forkstart zcs-old /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/zcs-old/lxc.conf: exit status 1
Try lxc info --show-log zcs-old for more info”

Here is what the log showed:
Name: zcs-old
Type: container
Architecture: x86_64
Created: 2022/12/03 21:44 EST
Last Used: 2023/02/06 10:27 EST


lxc zcs-old 20230206152746.471 WARN conf - …/src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc zcs-old 20230206152746.472 WARN conf - …/src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc zcs-old 20230206152746.472 WARN conf - …/src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc zcs-old 20230206152746.472 WARN conf - …/src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc zcs-old 20230206152746.472 WARN cgfsng - …/src/src/lxc/cgroups/cgfsng.c:fchowmodat:1619 - No such file or directory - Failed to fchownat(40, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc zcs-old 20230206152746.593 ERROR start - …/src/src/lxc/start.c:start:2197 - No such file or directory - Failed to exec “/sbin/init”
lxc zcs-old 20230206152746.593 ERROR sync - …/src/src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 7)
lxc zcs-old 20230206152746.598 WARN network - …/src/src/lxc/network.c:lxc_delete_network_priv:3631 - Failed to rename interface with index 0 from “eth0” to its initial name “veth47682ebe”
lxc zcs-old 20230206152746.598 ERROR lxccontainer - …/src/src/lxc/lxccontainer.c:wait_on_daemonized_start:878 - Received container state “ABORTING” instead of “RUNNING”
lxc zcs-old 20230206152746.598 ERROR start - …/src/src/lxc/start.c:__lxc_start:2107 - Failed to spawn container “zcs-old”
lxc zcs-old 20230206152746.598 WARN start - …/src/src/lxc/start.c:lxc_abort:1036 - No such process - Failed to send SIGKILL via pidfd 41 for process 3157696
lxc 20230206152751.736 ERROR af_unix - …/src/src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20230206152751.736 ERROR commands - …/src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command “get_state”
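
In a log like this, the first ERROR line is often the most useful one, since the errors after it tend to be fallout from the same failure. A minimal Python sketch for pulling it out, assuming the line layout seen above (`lxc [container] timestamp LEVEL subsystem - message`); the name `first_error` is made up for illustration:

```python
import re

# Layout assumed from the log above:
#   lxc [container] <timestamp> <LEVEL> <subsystem> - <message>
# The container name is optional because the last two lines omit it.
LINE_RE = re.compile(
    r"^lxc (?:(?P<container>\S+) )?"
    r"(?P<timestamp>\d{14}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<subsystem>\S+) - "
    r"(?P<message>.*)$"
)


def first_error(log_text):
    """Return the parsed fields of the first ERROR line, or None if there is none."""
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERROR":
            return match.groupdict()
    return None
```

Fed the log above, this picks the `start` line ending in Failed to exec “/sbin/init”, which is the one to chase; the `sync`, `lxccontainer`, and `af_unix` errors come after it.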

By this time I had already run a restore from the latest backup file, so the instance is back up and everything is rolling along fine now. I have renamed the troubled container to “zcs-old” and assigned it a different profile for troubleshooting purposes.

I don’t know if I fat-fingered a command while I was cleaning up old log files or what. I have never encountered this before; I do cleanup operations all the time and have never had a problem. Scratching my head on this one. Thought I would share the error info here looking for clues.
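
One clue in the log is the Failed to exec “/sbin/init” line with "No such file or directory". On exec, that error can mean the file itself is gone, a symlink along the way points at nothing, or the program's dynamic loader is missing. The earlier “cannot open shared object file” messages also suggest files went missing inside the container. A rough Python sketch for checking whether a path still resolves inside a container's root filesystem, following symlinks relative to that root rather than the host's. It assumes the rootfs is mounted somewhere on the host; `resolve_in_rootfs` is a hypothetical helper, not part of LXD.

```python
import os


def resolve_in_rootfs(rootfs, path, max_links=40):
    """Resolve `path` as the container would see it.

    Returns the matching host path, or None if any component is missing.
    Absolute symlink targets are treated as relative to `rootfs`.
    """
    pending = [p for p in path.split("/") if p]
    resolved = []  # components below rootfs that are already verified
    links_followed = 0
    while pending:
        part = pending.pop(0)
        if part == ".":
            continue
        if part == "..":
            if resolved:
                resolved.pop()
            continue
        candidate = os.path.join(rootfs, *resolved, part)
        if os.path.islink(candidate):
            links_followed += 1
            if links_followed > max_links:
                return None  # probably a symlink loop
            target = os.readlink(candidate)
            if target.startswith("/"):
                resolved = []  # absolute target: restart at the container root
            pending = [p for p in target.split("/") if p] + pending
            continue
        if not os.path.lexists(candidate):
            return None
        resolved.append(part)
    return os.path.join(rootfs, *resolved)
```

For example, `resolve_in_rootfs("/mnt/zcs-old/rootfs", "/sbin/init")` returning `None` would mean init, or something it links to, is no longer there. The mount point is a placeholder.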

Oh and I am running LXD 5.0.2 stable, on Ubuntu 20.04 LTS, on a tiny server with 16GB of RAM, along with about 61 other containers that all run web API/apps and various services. It has been in production since March of 2019.

What storage pool type are you using? Sounds like your storage system is failing.

ZFS, on /dev/sda, an SSD that’s about three years old. FWIW, it’s been running fine since the restore.

zpool list
pool1 464G 35.5G 429G - - 17% 7% 1.00x ONLINE -
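
The `zpool list` row was pasted without its header. Assuming the default column order of current OpenZFS releases (NAME, SIZE, ALLOC, FREE, CKPOINT, EXPANDSZ, FRAG, CAP, DEDUP, HEALTH, ALTROOT), a small Python sketch can label the fields. The column list is an assumption about this ZFS version, not something shown in the thread, and `label_zpool_list_row` is a made-up name:

```python
# Assumed default `zpool list` columns; adjust if your ZFS version differs.
ZPOOL_LIST_COLUMNS = [
    "NAME", "SIZE", "ALLOC", "FREE", "CKPOINT", "EXPANDSZ",
    "FRAG", "CAP", "DEDUP", "HEALTH", "ALTROOT",
]


def label_zpool_list_row(row):
    """Map one whitespace-separated `zpool list` row onto the assumed column names."""
    fields = row.split()
    if len(fields) != len(ZPOOL_LIST_COLUMNS):
        raise ValueError(
            f"expected {len(ZPOOL_LIST_COLUMNS)} fields, got {len(fields)}"
        )
    return dict(zip(ZPOOL_LIST_COLUMNS, fields))
```

Read that way, the row above says the pool is 7% full, 17% fragmented, and reports ONLINE health. Note that `zpool list` only summarises pool health; `zpool status` is where per-device read, write, and checksum error counts show up.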



I think I somehow damaged the filesystem in the container. When I tried to look inside the export tar file, it wouldn’t even extract; it fails at a large database file in the image. Something tells me I deleted some files, causing the container to crash and then fail to boot. The restored container has been running fine since, so I am chalking this one up to user error.
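
To see where an export tarball stops being readable without unpacking it to disk, one option is to stream through every member and note where reading fails. A minimal Python sketch using the standard `tarfile` module; `check_tar` is a hypothetical helper name and the archive path in the usage note is a placeholder:

```python
import tarfile
import zlib


def check_tar(archive_path):
    """Read every member of a tar archive without extracting anything.

    Returns (members_read_ok, member_being_read_at_failure, error_text).
    The last two are None when the whole archive reads cleanly.
    """
    current = None
    checked = 0
    try:
        # "r:*" lets tarfile detect plain, gzip, bzip2, or xz archives.
        with tarfile.open(archive_path, "r:*") as tar:
            for member in tar:
                current = member.name
                if member.isfile():
                    stream = tar.extractfile(member)
                    # Read and discard the data in 1 MiB chunks.
                    while stream.read(1024 * 1024):
                        pass
                checked += 1
    except (tarfile.TarError, OSError, EOFError, zlib.error) as exc:
        # xz-specific decode errors are not covered by this sketch.
        return checked, current, str(exc)
    return checked, None, None
```

Something like `check_tar("/path/to/zcs-old-export.tar.gz")` would then report how many members read cleanly and which one was being read when it broke. A failure on the same large file every time points at the archive or the data behind it rather than at the extraction step.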