Can't start a container after running lxc copy --stateless --refresh

I’m copying a container from a remote to the local host with this command:

lxc copy --stateless --refresh remote:c1 c1

The command runs fine but then the container won’t start.

If I delete the container on the local host and just run

lxc copy --stateless remote:c1 c1

then everything works fine, but I’d like to include --refresh so I can get incremental copies going.
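
For reference, the workflow I’m after is a one-time seed copy followed by periodic refreshes of the same target, roughly (same names as above):

# initial seed copy
lxc copy --stateless remote:c1 c1

# later, incremental updates of the existing copy
lxc copy --stateless --refresh remote:c1 c1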

lxc info --show-log c1 shows the following; the key line seems to be:

lxc c1 20190329110109.406 ERROR    dir - storage/dir.c:dir_mount:198 - No such file or directory - Failed to mount "/var/snap/lxd/common/lxd/containers/c1/rootfs" on "/var/snap/lxd/common/lxc/"

Here’s the whole output:

lxc c1 20190329110109.388 WARN     conf - conf.c:lxc_map_ids:2970 - newuidmap binary is missing
lxc c1 20190329110109.388 WARN     conf - conf.c:lxc_map_ids:2976 - newgidmap binary is missing
lxc c1 20190329110109.394 WARN     conf - conf.c:lxc_map_ids:2970 - newuidmap binary is missing
lxc c1 20190329110109.394 WARN     conf - conf.c:lxc_map_ids:2976 - newgidmap binary is missing
lxc c1 20190329110109.406 ERROR    dir - storage/dir.c:dir_mount:198 - No such file or directory - Failed to mount "/var/snap/lxd/common/lxd/containers/c1/rootfs" on "/var/snap/lxd/common/lxc/"
lxc c1 20190329110109.406 ERROR    conf - conf.c:lxc_mount_rootfs:1351 - Failed to mount rootfs "/var/snap/lxd/common/lxd/containers/c1/rootfs" onto "/var/snap/lxd/common/lxc/" with options "(null)"
lxc c1 20190329110109.406 ERROR    conf - conf.c:lxc_setup_rootfs_prepare_root:3498 - Failed to setup rootfs for
lxc c1 20190329110109.406 ERROR    conf - conf.c:lxc_setup:3551 - Failed to setup rootfs
lxc c1 20190329110109.406 ERROR    start - start.c:do_start:1282 - Failed to setup container "c1"
lxc c1 20190329110109.406 ERROR    sync - sync.c:__sync_wait:62 - An error occurred in another process (expected sequence number 5)
lxc c1 20190329110109.406 WARN     network - network.c:lxc_delete_network_priv:2589 - Operation not permitted - Failed to remove interface "eth0" with index 30
lxc c1 20190329110109.406 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:864 - Received container state "ABORTING" instead of "RUNNING"
lxc c1 20190329110109.407 ERROR    start - start.c:__lxc_start:1975 - Failed to spawn container "c1"
lxc c1 20190329110109.408 WARN     conf - conf.c:lxc_map_ids:2970 - newuidmap binary is missing
lxc c1 20190329110109.408 WARN     conf - conf.c:lxc_map_ids:2976 - newgidmap binary is missing
lxc 20190329110109.414 WARN     commands - commands.c:lxc_cmd_rsp_recv:132 - Connection reset by peer - Failed to receive response for command "get_state"

Does anyone know what the problem might be?

Can you show:

  • ls -lh /var/snap/lxd/common/mntns/var/snap/lxd/common/lxd/containers
  • ls -lh /var/snap/lxd/common/mntns/var/snap/lxd/common/lxd/storage-pools/default/containers/

On the remote

This command

ls -lh /var/snap/lxd/common/mntns/var/snap/lxd/common/lxd/containers

Gives output for each container, e.g.

lrwxrwxrwx 1 root root 59 Dec 10 10:07 haproxy -> /var/snap/lxd/common/lxd/storage-pools/2/containers/haproxy

While this command

ls -lh /var/snap/lxd/common/mntns/var/snap/lxd/common/lxd/storage-pools/default/containers/

Returns

ls: cannot access '/var/snap/lxd/common/mntns/var/snap/lxd/common/lxd/storage-pools/default/containers/': No such file or directory

On the host

ls -lh /var/snap/lxd/common/mntns/var/snap/lxd/common/lxd/containers

Returns

total 0

And

ls -lh /var/snap/lxd/common/mntns/var/snap/lxd/common/lxd/storage-pools/default/containers/

Also returns

total 0

Ah, so I wonder if the difficulty here is that the storage pool is named differently on source and target. In a normal one-time migration this gets reshuffled a bit for you, but when running a refresh the config is re-synced and something may be going wrong there.

Any chance you can try in a setup where the storage pools have the same names on both ends? That’d confirm the hypothesis and would make the issue easier to reproduce.
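
If it helps, a quick way to compare the pool names on both ends is something like this (the remote name is just a placeholder):

lxc storage list
lxc storage list remote: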

I’m not having any luck reproducing this issue with a similar setup of differently named storage pools; things work just fine here.

What storage backends are in use on source and destination?

I’m not having any luck reproducing this issue with a similar setup of differently named storage pools; things work just fine here.

Hmm… the host pool is three-way-mirror and the remote is one-way. I guess you’re saying that having different names shouldn’t be a problem?

Do you want me to try when both storage pools have the same name?

What storage backends are in use on source and destination?

They’re both zfs, but zfs was set up before lxd was initialised. Both zfs configurations should be the same, aside from the fact that three-way-mirror has 3 disks in it and one-way has just one…

Do you think this issue is related to the fact that both lxd instances were connected to existing zfs pools, rather than letting lxd set up zfs itself?

When initialising lxd, I answered these 2 questions as below…

Create a new ZFS pool? (yes/no) [default=yes]: no
Name of the existing ZFS pool or dataset: three-way-mirror

Do you think this might play a role in the problem?
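
(For reference, the non-interactive equivalent of attaching lxd to that existing zpool, assuming the lxd pool is called default as in the paths above, would be roughly:

lxc storage create default zfs source=three-way-mirror

i.e. lxd uses the existing zpool/dataset rather than creating one itself.)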

That part shouldn’t matter, no, but I don’t think I tried with zfs on both ends, so I should try that.

Testing with the same name on both ends would be useful, yes.

ok, cool, I’ll try with both pools having the same name…

Ok, I tried with both pools having the same name and everything seems to have worked… I wonder what the problem might have been?

FYI - I was able to test this. When I used zpools of the same name (source, target), the container copied successfully with the --refresh option, and it was separately refreshed (updated) successfully (with minor changes I made deliberately). I.e. it works as advertised.

HOWEVER: when I tried the same setup with a differently named zpool, it failed to copy at all with --refresh, giving the following error:

Error: Invalid devices: Device validation failed “root”: The “default” storage pool doesn’t exist

So I have a new rule based on this lesson: use the same name when creating zpools on servers that you want to use for lxc container copies. I have never made that a habit before now, but I will going forward.

THANK YOU. (And thank you for the --refresh option!!!)

You don’t have to use the same pool name, but if you don’t, you need to use a profile on each side which contains the root device rather than the container directly having it.

That way the exact container config of the source will work on the target.
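
A rough sketch of that, assuming the default profile and the names used in this thread (adjust the pool name per side):

# on each host, give the profile a root disk pointing at that host's pool
lxc profile device add default root disk path=/ pool=three-way-mirror

# and remove the container-local root device, if it has one, so the profile's is used
lxc config device remove c1 root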

Alternatively, passing --storage NAME as an argument to lxc copy may also work.

As it was easy for me to do, I tried the following two commands:

lxc copy c1 remote:c1-backup --refresh
lxc copy c1 remote:c1-backup2 --refresh --storage pool2

The first copy worked (the zpool names matched between source and destination). For the second copy I switched to a second pool available on the target, and it failed with this error:

Error: Failed instance creation:

So that seems to answer that. I will try the profile approach when I get more time and will post an update here. Personally I am OK keeping pool names the same going forward to keep this simple (and, more importantly, functioning), which is probably a good standardization idea anyway. FYI I am running lxd 4.2.

Thanks.

This may be an issue because of snapshots; could you try with --instance-only to validate this hypothesis?
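
i.e. something along these lines (reusing the names from the earlier post):

lxc copy c1 remote:c1-backup2 --refresh --storage pool2 --instance-only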

I am seeing this as well. Snapshots seem to be the issue as --instance-only works.

Let me know if you want me to open an issue with exact steps to reproduce. It was not among the open issues, at least.

Can you check if it still happens on 4.4? I remember fixing a related issue, so maybe we got lucky and fixed that too :)

And if you do hit it on 4.4, let me know what storage backends are in use on source and destination.

Yes, still happens on 4.4.

The case is identical to the reports above: I am running ZFS across all hosts, and all hosts have a pool called 'default'. Copying between those works. One host also has a pool called 'backuplxd'; copying with snapshots to that one fails with Error transferring instance data: Create instance: Find parent volume: No such object, despite passing --storage backuplxd, unless I use --instance-only.
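
To spell out the two invocations (the remote and instance names here are placeholders):

# fails with the error above when the instance has snapshots
lxc copy c1 backuphost:c1 --refresh --storage backuplxd

# works
lxc copy c1 backuphost:c1 --refresh --storage backuplxd --instance-only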

I “think” we’ve got all of those under control now with the last branch I sent :)