Cannot start LXC container copied from snapshot, CPU usage rises to 100%

Hi all,

I did the following:

lxc snapshot AAA clean-snapshot
lxc copy AAA/clean-snapshot BBB
lxc start BBB

The start command hangs, with CPU utilization at 100% for a kworker process, and container BBB ends up in the ERROR state.

It happens quite often.

I can start the container after a host reboot, which is OK for now as I’m just learning LXD. However, this is not acceptable if I need to put it into a production environment.

I am using Ubuntu bionic and the latest version of LXD, with all patches applied.

What information should I gather to solve the problem?
Would anyone help me by giving me some insight into solving it?
I searched the net and got lost…

Whenever you get a container in an error state, you have a valuable case to debug: it becomes possible to figure out what went wrong and to report it so it gets fixed.

If you can figure out the steps that bring a container into an error state, that is even more valuable. It is equally fine, for example, to establish that performing a given sequence of steps produces the error state in 1 out of 20 repetitions.

There are some standard commands to retrieve info from problematic containers. I am on mobile now and cannot locate them. Perhaps someone else can help.
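From memory, they are probably along these lines (please double-check the exact flags before relying on them):

# container state and the tail of its log
lxc info BBB --show-log
# the full, expanded configuration the container runs with
lxc config show BBB --expanded
# kernel messages around the time of the failed start
sudo dmesg | tail -n 50
# LXD daemon log (unit name assumes the snap package)
journalctl -u snap.lxd.daemon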

Is the LXD host a VPS or bare metal? If the former, the problem might be due to high levels of overcommitment (especially memory overcommitment) or other virtualization overheads, depending mostly on the CPU specs (number of cores) and on the hypervisor (type/settings) underlying the LXD host.

The LXD host is bare metal: an HP 6200 Pro MT PC with 8 GB of RAM and an Intel® Core™ i7-2600 CPU @ 3.40 GHz.

Thanks, Simon

I created a bridge interface on the host and customized a profile for the containers so that they can access the same network. Regarding my previous post: container AAA was indeed stopped before I took the snapshot.
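For reference, the customization was roughly the following (br0 and lanprofile are example names for my bridge and profile):

# start from the default profile and point its NIC at the host bridge
lxc profile copy default lanprofile
lxc profile device set lanprofile eth0 nictype bridged
lxc profile device set lanprofile eth0 parent br0
# launch containers with the customized profile
lxc launch ubuntu:18.04 AAA -p lanprofile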

Recently I found that the problem occurs whenever I change an IP address in a container's netplan configuration. Stopping the container via lxc stop works fine. However, when I start the container again, it hangs. When I then reboot the LXD host, I have to wait about 10 minutes for an LXD process to unmount the problematic container. I can start the container again after the reboot.

It seems to me that whenever the IP address of any container changes, the problem arises, which doesn't make sense to me. I am searching for commands to get more information about the problematic situation.

The symptoms described could match a kernel bug, in which case the output of dmesg and ps fauxww may be able to confirm it.
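If it helps, a simple way to capture both outputs for attaching to a report would be something like:

# save kernel messages and the full process tree to files
sudo dmesg > dmesg.txt
ps fauxww > ps.txt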

It could also be that the newly created container has a uid/gid map which differs from the one in the snapshot, in which case LXD would need to rewrite the uid/gid of every file in the container on startup. Depending on the number of files in that container and the type of storage in use, this can take a while.
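A hedged sketch of how one could compare the two maps, assuming LXD 3.x and its volatile.* bookkeeping keys:

# the uid/gid map the container's files currently carry on disk
lxc config get BBB volatile.last_state.idmap
# the map LXD will apply on next start; if the two differ,
# LXD remaps every file in the container at startup
lxc config get BBB volatile.idmap.next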

@Robonoob Did you find what your issue was, and a fix?
@simos In my case it is repeatable. What data would be helpful and how do I obtain it?
@stgraber I find myself in a similar situation: I’m using LXD 3.11 on top of Ubuntu 18.04 on a DigitalOcean Droplet.

  • It could also be that the newly created container has a uid/gid map which differs from the one in the snapshot…
    How can I check this and avoid it if it’s the case?

  • The documentation states: “the directory backend is to be considered as a last resort option.
    It does support all main LXD features, but is terribly slow and inefficient as it can’t perform
    instant copies or snapshots and so needs to copy the entirety of the container’s filesystem every time.”

Could that be causing a block of 2-3 minutes or more, during which the server becomes almost unresponsive, if several containers were created over a short period of time?

I am creating new containers based on an existing, stopped container. Is that a good approach? Would using a snapshot as the origin help, or would converting it to an image (as sketched below) help?
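Concretely, the image route I have in mind would be something like this (container names reuse the thread's AAA/BBB examples; my-base is a placeholder alias):

# publish the stopped container (or one of its snapshots) as a local image
lxc publish AAA --alias my-base
# create new containers from that image
lxc launch my-base BBB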

In our case, the automatic upgrade to LXD 3.12 seems to have fixed this…