Lxc copy --refresh error

cyroreal · October 18, 2019, 2:47am

Hi all,

I am trying to use --refresh to back up my containers, but when I run:

lxc copy container1 remote:container1-bkp --refresh

I get the following error:

Error: Failed instance creation: Post https://myservername.com:8443/1.0/instances: EOF

I can take a snapshot of container1 and copy that snapshot to the remote server without problems.

I can also run lxc list remote: to list the containers on the remote server.

LXD snap 3.18 on both sides
ZFS storage on both sides

As a final question, is there any documentation showing how to use --refresh properly? Should I copy a snapshot first and then refresh, or should I use --refresh on a running container from the very first copy?

Thanks for any help.

pvincent · October 18, 2019, 11:44am

I’m also interested in knowing more about this --refresh copy option.

To me, this feature looks very promising for speeding up synchronization between containers, especially remote ones.

You’ve probably noticed that you cannot copy to a running container (without CRIU). In that case the error message is usually very helpful, and it is not the one you posted, so that is probably not your problem.

I use this feature for backups, though I’m still wondering whether that is a good idea. Every day, a daily cron script runs copy --refresh for all of my production containers (local) towards stopped containers (remote), and it has worked so far. The first time, I had to take a snapshot and copy that snapshot to the remote LXD machine. After that initialization, copy --refresh seems to synchronize the live containers properly, and it runs fast (a kind of rsync, maybe?).
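A daily job of the kind described above could look roughly like the following sketch. It only wraps the lxc copy ... --refresh command already shown in this thread; the function name refresh_backups, the LXC variable and the -bkp suffix on the target name are assumptions made for illustration, and the targets are assumed to have been initialized once beforehand.

```shell
#!/bin/sh
# Command used to talk to LXD; overridable (assumption, handy for dry runs).
LXC=${LXC:-lxc}

# refresh_backups REMOTE NAME...
# Runs "lxc copy NAME REMOTE:NAME-bkp --refresh" for every container name
# and returns non-zero if at least one refresh failed.
refresh_backups() {
    remote=$1
    shift
    failed=0
    for name in "$@"; do
        if ! "$LXC" copy "$name" "$remote:$name-bkp" --refresh; then
            echo "refresh failed for $name" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example call from a cron script (hypothetical remote and container names):
#   refresh_backups remote container1 container2
```

Continuing with the next container after a failure, instead of stopping, keeps one problematic container from blocking the backup of all the others.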

cyroreal · October 29, 2019, 8:33am

Hi all,

Actually I just found out that the refresh command is updating the remote container. However I am still receiving the following error message after the copy process:

root@cyrofilho:~# lxc copy ns1 quantatech:ns1 --refresh
Error: Failed instance creation: Error transferring container data: exit status 23
root@cyrofilho:~#

The copy process runs very slowly, between 10 and 500 KB/s, while copying a snapshot to the same destination server runs at around 34 MB/s. Is this normal?

Should I trust that the remote container is being updated correctly, even with the above error message? I started a copy of the remote container to test it, and it ran fine with the latest data.

Please help me figure this out. I really need to back up my containers remotely using the refresh option to save bandwidth and time. Is it ready for production, or should I use zfs send/receive for now?

Thanks for any help.

stgraber · October 29, 2019, 2:27pm

Exit status 23 is "Partial transfer due to error" according to the rsync manual, so something didn’t transfer properly.
You’ll want to look at lxd.log on both the source and the target server; one of the two should have a more detailed error, including the rsync output, that tells you which file things blew up on.

The speed difference is possibly due to only transferring the difference, so spending more time going through files, comparing their file info and hash before transferring just a tiny bit of data.

There is also the difference that your initial copy was likely done using zfs send/receive but refresh updates can only be done through rsync, so the difference in protocol could also explain things working quite differently.
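To find the relevant entries quickly, something like the sketch below can pull the failed rsync runs out of a log file. It assumes the failures are logged with the text "Rsync send failed" and an "exit status N" field, and that each line starts with a t=<timestamp> field; the function name rsync_failures is made up for this example, and the snap log path in the comment is an assumption about a default snap install.

```shell
#!/bin/sh
# rsync_failures LOGFILE
# Prints "<timestamp> rsync-exit=<code>" for every line in LOGFILE where
# LXD reported a failed rsync send.
rsync_failures() {
    grep 'Rsync send failed' "$1" |
        sed -n 's/^t=\([^ ]*\) .*exit status \([0-9][0-9]*\).*$/\1 rsync-exit=\2/p'
}

# Example (path is an assumption for the snap package):
#   rsync_failures /var/snap/lxd/common/lxd/logs/lxd.log
```

The full log line still has to be read to see which file rsync complained about; this only narrows down where to look.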

bodleytunes · October 29, 2019, 3:03pm

Why can’t refresh use ZFS send? When I use syncoid (zfs send wrapper) to replicate containers that have already been sent with lxc copy it finishes in an instant as ZFS is much faster at sending the diff than rsync is (or seems to be).

Just a bit confused as to why it reverts to using rsync over ZFS?

stgraber · October 30, 2019, 1:32am

It’s not impossible to do but it’s just not done at this point.

Our migration protocol is fairly simple and doesn’t allow much back and forth between source and target. More back and forth would be needed to first determine exactly which snapshots need to be transferred: the source sends the full list, the target filters that list based on what it already has and sends it back, then the source has to figure out the nearest snapshot for each and send those. After that, a new temporary snapshot would need to be made on the source, sent, restored on the target and deleted on both sides.

We would also need to add some fs details in the migration protocol so that part of the negotiation would be ensuring that both sides are actually the same base dataset as otherwise send/receive just can’t work.

Today, it’d actually be perfectly fine to do:

1. Copy a container from a remote server with zfs on both sides (uses send/receive)
2. Move the target container to a btrfs pool (converts everything to subvolumes)
3. Move back to the zfs pool (converts everything back to datasets and snapshots)
4. Do a refresh from the source

In this case, even though it’s still the same container and the same snapshots, the dataset itself isn’t the same on source and target, so send/receive cannot work at all. Since we use rsync, that’s fine, but if we were to support zfs, we’d need the extra data in the migration protocol so that we can detect it and switch to rsync for such cases.
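The snapshot negotiation described above (target filters the source’s list, source picks the nearest snapshot to send from) can be illustrated with a small sketch. This is not LXD’s migration code, only an illustration of the list-filtering step; the function names and the one-name-per-line file format are assumptions, and it compares names only, so it cannot detect the "same names, different dataset" case from the example.

```shell
#!/bin/sh
# Both functions take two files with one snapshot name per line:
#   $1 = snapshots on the source, oldest first
#   $2 = snapshots on the target, any order

# Newest source snapshot that the target already has; this would be the
# base for an incremental send. Prints nothing if there is none.
latest_common_snapshot() {
    grep -Fx -f "$2" "$1" | tail -n 1
}

# Source snapshots the target does not have yet, oldest first.
snapshots_missing_on_target() {
    grep -Fxv -f "$2" "$1"
}
```

A real implementation would also have to compare dataset identity on both sides and fall back to rsync when the datasets do not share a common origin.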

cyroreal · November 4, 2019, 7:28am

This is what I found in lxd.log:

t=2019-11-04T01:38:41-0500 lvl=info msg="Freezing container" created=2019-04-23T20:54:01-0400 ephemeral=false name=ns1 project=default used=2019-10-17T21:23:47-0400
t=2019-11-04T01:38:41-0500 lvl=info msg="Froze container" created=2019-04-23T20:54:01-0400 ephemeral=false name=ns1 project=default used=2019-10-17T21:23:47-0400
t=2019-11-04T01:41:01-0500 lvl=eror msg="Rsync send failed: /var/snap/lxd/common/lxd/containers/ns1/: exit status 23: rsync: delete_file: rmdir(.zfs/snapshot) failed: Operation not permitted (1)\nrsync: delete_file: rmdir(.zfs/shares) failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.1]\n"
t=2019-11-04T01:41:01-0500 lvl=info msg="Unfreezing container" created=2019-04-23T20:54:01-0400 ephemeral=false name=ns1 project=default used=2019-10-17T21:23:47-0400
t=2019-11-04T01:41:01-0500 lvl=info msg="Unfroze container" created=2019-04-23T20:54:01-0400 ephemeral=false name=ns1 project=default used=2019-10-17T21:23:47-0400

Can you please help? I would like to know whether my backups are working. From the error, it looks like it cannot delete the snapshot after the copy?

Thank you.

stgraber · November 5, 2019, 3:49am

Sounds like you have the ZFS snapdir enabled (visible) on your volumes. This isn’t something that LXD would ever do itself, and it may be causing the behavior you’re seeing.

Can you show the output of zfs get snapdir?

cyroreal · November 5, 2019, 7:49am

Thanks for the help. I use snapdir so I can access my snapshot files using the directory .zfs and then copy any file I need from a particular snapshot. Do I have to disable it for the lxd dataset?

NAME                                PROPERTY  VALUE    SOURCE
dados                               snapdir   visible  local
dados/lxd                           snapdir   visible  inherited from dados
dados/lxd/containers                snapdir   visible  inherited from dados
dados/lxd/containers/lucsim-zimbra  snapdir   visible  inherited from dados
dados/lxd/containers/ns1            snapdir   visible  inherited from dados
dados/lxd/containers/ns2            snapdir   visible  inherited from dados
dados/lxd/custom                    snapdir   visible  inherited from dados
dados/lxd/custom-snapshots          snapdir   visible  inherited from dados
dados/lxd/deleted                   snapdir   visible  inherited from dados
dados/lxd/images                    snapdir   visible  inherited from dados
dados/lxd/snapshots                 snapdir   visible  inherited from dados

Thanks.

cyroreal · November 5, 2019, 8:51am

Just executed:

zfs set snapdir=hidden dados/lxd

And the problem is fixed!!!
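For anyone hitting the same rsync error, a quick way to spot affected datasets is to filter the zfs get snapdir output for the value visible. The helper below is a small sketch; the name visible_snapdirs is made up, and it assumes the usual four-column NAME PROPERTY VALUE SOURCE layout with no spaces in dataset names.

```shell
#!/bin/sh
# visible_snapdirs
# Reads "zfs get snapdir" table output on stdin and prints the names of
# the datasets whose snapdir property is set to "visible".
visible_snapdirs() {
    awk '$2 == "snapdir" && $3 == "visible" { print $1 }'
}

# Example, using the pool layout from this thread:
#   zfs get -r snapdir dados/lxd | visible_snapdirs
# An empty result after "zfs set snapdir=hidden dados/lxd" would mean the
# child datasets picked up the new value by inheritance.
```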

The only problem I have right now is the very low transfer speed when using --refresh. Compared with copying a snapshot, the difference is unbelievable: around 30 to 100 KB/s for the refresh versus around 30 MB/s for the snapshot copy. Is there something wrong here?

Thank you very much for your help.

cyroreal · November 5, 2019, 10:51am

Well actually I have a Zimbra container. This container is still giving me errors when using --refresh:

t=2019-11-05T05:01:30-0500 lvl=info msg="Freezing container" created=2019-10-13T13:47:28-0400 ephemeral=false name=lucsim-zimbra project=default used=2019-10-17T21:27:36-0400
t=2019-11-05T05:01:30-0500 lvl=info msg="Froze container" created=2019-10-13T13:47:28-0400 ephemeral=false name=lucsim-zimbra project=default used=2019-10-17T21:27:36-0400
t=2019-11-05T05:32:47-0500 lvl=info msg="Unfreezing container" created=2019-10-13T13:47:28-0400 ephemeral=false name=lucsim-zimbra project=default used=2019-10-17T21:27:36-0400
t=2019-11-05T05:32:47-0500 lvl=info msg="Unfroze container" created=2019-10-13T13:47:28-0400 ephemeral=false name=lucsim-zimbra project=default used=2019-10-17T21:27:36-0400
t=2019-11-05T05:34:37-0500 lvl=eror msg="Rsync send failed: /var/snap/lxd/common/lxd/containers/lucsim-zimbra/: exit status 24: file has vanished: "/var/snap/lxd/common/lxd/containers/lucsim-zimbra/rootfs/opt/zimbra/data/tmp/zmcontrol.error.Fn3r_"\nfile has vanished: "/var/snap/lxd/common/lxd/containers/lucsim-zimbra/rootfs/opt/zimbra/data/tmp/zmcontrol.status.CFGO4"\nrsync warning: some files vanished before they could be transferred (code 24) at main.c(1183) [sender=3.1.1]\n"

Does --refresh snapshot the container before sending the files over?

Thank you.
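For reference, the two rsync exit codes seen in this thread mean different things: the rsync manual lists 23 as "Partial transfer due to error" and 24 as "Partial transfer due to vanished source files". The sketch below extracts the code from an error message and prints that meaning. The function name explain_refresh_error is invented for this example, and it assumes the "exit status N" in the message is rsync’s exit code, as it was for the errors above.

```shell
#!/bin/sh
# explain_refresh_error MESSAGE
# Looks for "exit status N" in MESSAGE and prints what the rsync manual
# says about that exit code (only 23 and 24 are spelled out here).
explain_refresh_error() {
    code=$(printf '%s\n' "$1" |
        sed -n 's/.*exit status \([0-9][0-9]*\).*/\1/p' | head -n 1)
    case "$code" in
        23) echo "rsync 23: partial transfer due to error" ;;
        24) echo "rsync 24: partial transfer due to vanished source files" ;;
        "") echo "no rsync exit status found" ;;
        *)  echo "rsync $code: see the EXIT VALUES section of the rsync manual" ;;
    esac
}

# Example:
#   explain_refresh_error "Error: Failed instance creation: Error transferring container data: exit status 23"
```

Code 24 here points at temporary files under /opt/zimbra/data/tmp that disappeared while rsync was running, which is a different situation from the snapdir permission errors behind code 23.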

usrflo · May 7, 2021, 7:16am

In "Lxc copy --refresh workaround: efficient incremental ZFS snapshot sync with send/receive" I posted a script for regular, efficient sync of containers to hot/cold standby clones based on ZFS storage volumes. It is working very well, and I will use it in production.