Strange behaviour on lxc copy

Hi,

I’ve set up a simple backup script to back up containers between my two LXD hosts. The pseudocode is roughly:
get list of remotes
get list of running containers on each remote
copy each remote container to local
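
Something like this (a simplified sketch, not the actual script; the pool name and the way remotes and containers are listed are illustrative):

#!/bin/sh
# Sketch of the backup loop described above. A real script would also
# filter out "local" and any public image remotes.
for remote in $(lxc remote list --format csv | cut -d, -f1); do
    # -c ns = name and status columns; keep only RUNNING containers.
    for c in $(lxc list "$remote:" -c ns --format csv | awk -F, '$2 == "RUNNING" {print $1}'); do
        lxc copy "$remote:$c" "$c" --stateless --instance-only \
            --config=boot.autostart=false --storage default
    done
done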

This all appears to work, insomuch as the commands complete without error and the containers are copied. However, on one of the hosts the copy results in processes being left behind, e.g.:
zfs send -c -L -i srv/lxd/containers/<container>@migration-8e5ab99c-65f3-4523-9a19-0794acd4a7f5 srv/lxd/containers/<container>@migration-42a7d169-0e2d-47f7-88ca-4c2bf195ade6

The command that is used is:
lxc copy <source> <target> --stateless --instance-only --config=boot.autostart=false --storage <pool>
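
For reference, my understanding of the flags used:

# --stateless                     plain copy, no runtime state is transferred
# --instance-only                 copy the instance without its snapshots
# --config=boot.autostart=false   set a config key on the new copy so it won't autostart
# --storage <pool>                place the copy on the named local storage pool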

The processes are left behind on the remote host, one for each container.
Both hosts are running Ubuntu 20.04.1 and LXD 4.7.
On the host which works as expected there is a single storage pool (on rpool), whilst on the other (where the processes are left behind) there are two pools.

I have a log produced using --debug, but it doesn’t appear to show any errors. I can provide it if that would be helpful.

I’d be grateful for any help in understanding why this is happening and what I can do to fix it!

many thanks

Can you enable debug logging mode and then tail the logs on both sides while you perform a copy?

So on both systems do:

sudo snap set lxd daemon.debug=true
sudo systemctl reload snap.lxd.daemon
sudo tail -f /var/snap/lxd/common/lxd/logs/lxd.log

Then perform the copy and attach the logs to this thread.
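
Once you’re done, debug logging can be turned back off the same way:

sudo snap set lxd daemon.debug=false
sudo systemctl reload snap.lxd.daemon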

Thanks
Tom

Thanks for the quick response, and sorry about the delay in providing the information you suggested. Gathering it was not as straightforward as I imagined, because changing the settings and reloading LXD caused the behaviour to change.

Reloading the daemon causes the lingering zfs send processes to go away:
sudo systemctl **reload** snap.lxd.daemon

The first time the containers are copied after the reload, the zfs send command looks like:
zfs send -c -L srv/lxd/containers/alpha@migration-06d8d03f-b6a1-40ac-8899-2a2c661dd263

This first copy works without a lingering `zfs send` process.

The copy of the 2nd container started with a similar-looking process (i.e. with one snapshot):
zfs send -c -L srv/lxd/containers/alpha@migration-d3694b80-d76b-4d90-9744-9ab2089784d2

but then changed at the end to have two snapshots. It was this process that stayed around:
zfs send -c -L -i srv/lxd/containers/alpha@migration-d3694b80-d76b-4d90-9744-9ab2089784d2 srv/lxd/containers/alpha@migration-9ff24d4a-28f5-4e6e-bc58-9c206e4dc0ce
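
My reading of the ZFS side (not confirmed from the LXD code): the first form sends a full stream of a single snapshot, while the -i form sends an incremental stream between two snapshots, i.e. an initial full sync followed by a final incremental sync. That seems to match the MultiSync:true / FinalSync:true pair in the remote log below:

# initial sync: full stream of the first migration snapshot
zfs send -c -L <dataset>@migration-<uuid1>
# final sync: incremental stream from the first snapshot to the second
zfs send -c -L -i <dataset>@migration-<uuid1> <dataset>@migration-<uuid2>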

This is the log for the local host in the case where the zfs send process remains:

t=2020-11-17T14:31:33+0000 lvl=dbug msg="New task Operation: c4c08115-e459-4340-b844-00f62f601d5b"
t=2020-11-17T14:31:33+0000 lvl=dbug msg="Started task operation: c4c08115-e459-4340-b844-00f62f601d5b"
t=2020-11-17T14:31:33+0000 lvl=info msg="Deleting container" created=2020-11-17T13:57:53+0000 ephemeral=false name=alpha project=default used=1970-01-01T01:00:00+0100
t=2020-11-17T14:31:33+0000 lvl=dbug msg="\n\t{\n\t\t\"type\": \"async\",\n\t\t\"status\": \"Operation created\",\n\t\t\"status_code\": 100,\n\t\t\"operation\": \"/1.0/operations/c4c08115-e459-4340-b844-00f62f601d5b\",\n\t\t\"error_code\": 0,\n\t\t\"error\": \"\",\n\t\t\"metadata\": {\n\t\t\t\"id\": \"c4c08115-e459-4340-b844-00f62f601d5b\",\n\t\t\t\"class\": \"task\",\n\t\t\t\"description\": \"Deleting container\",\n\t\t\t\"created_at\": \"2020-11-17T14:31:33.879318624Z\",\n\t\t\t\"updated_at\": \"2020-11-17T14:31:33.879318624Z\",\n\t\t\t\"status\": \"Running\",\n\t\t\t\"status_code\": 103,\n\t\t\t\"resources\": {\n\t\t\t\t\"containers\": [\n\t\t\t\t\t\"/1.0/containers/alpha\"\n\t\t\t\t]\n\t\t\t},\n\t\t\t\"metadata\": null,\n\t\t\t\"may_cancel\": false,\n\t\t\t\"err\": \"\",\n\t\t\t\"location\": \"none\"\n\t\t}\n\t}"
t=2020-11-17T14:31:33+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0/operations/c4c08115-e459-4340-b844-00f62f601d5b username="uid=1004"
t=2020-11-17T14:31:33+0000 lvl=dbug msg="DeleteInstance started" driver=zfs instance=alpha pool=default project=default
t=2020-11-17T14:31:33+0000 lvl=dbug msg="Deleting instance volume" driver=zfs instance=alpha pool=default project=default volName=alpha
t=2020-11-17T14:31:34+0000 lvl=dbug msg="DeleteInstance finished" driver=zfs instance=alpha pool=default project=default
t=2020-11-17T14:31:34+0000 lvl=info msg="Deleted container" created=2020-11-17T13:57:53+0000 ephemeral=false name=alpha project=default used=1970-01-01T01:00:00+0100
t=2020-11-17T14:31:34+0000 lvl=dbug msg="Success for task operation: c4c08115-e459-4340-b844-00f62f601d5b"
t=2020-11-17T14:31:34+0000 lvl=dbug msg="Event listener finished: 32219f6a-e242-4556-b58d-17a25ec54b61"
t=2020-11-17T14:31:34+0000 lvl=dbug msg="Disconnected event listener: 32219f6a-e242-4556-b58d-17a25ec54b61"
t=2020-11-17T14:31:34+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username="uid=1004"
t=2020-11-17T14:31:35+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0/events username="uid=1004"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="New event listener: 261ff797-4eac-4a58-a6ce-f0296f61672d"
t=2020-11-17T14:31:35+0000 lvl=dbug msg=Handling ip=@ method=POST protocol=unix url=/1.0/instances username="uid=1004"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="\n\t{\n\t\t\"architecture\": \"x86_64\",\n\t\t\"config\": {\n\t\t\t\"boot.autostart\": \"false\",\n\t\t\t\"boot.autostart.priority\": \"100\",\n\t\t\t\"security.privileged\": \"true\",\n\t\t\t\"volatile.base_image\": \"6385fd2c715116b231e3e93b9e3d6cbc8c46aac5f1de7a8560a80ebd6bcaa394\"\n\t\t},\n\t\t\"devices\": {\n\t\t\t\"eth0\": {\n\t\t\t\t\"name\": \"eth0\",\n\t\t\t\t\"nictype\": \"bridged\",\n\t\t\t\t\"parent\": \"br0\",\n\t\t\t\t\"type\": \"nic\"\n\t\t\t},\n\t\t\t\"root\": {\n\t\t\t\t\"path\": \"/\",\n\t\t\t\t\"pool\": \"default\",\n\t\t\t\t\"size\": \"10GB\",\n\t\t\t\t\"type\": \"disk\"\n\t\t\t}\n\t\t},\n\t\t\"ephemeral\": false,\n\t\t\"profiles\": [\n\t\t\t\"default\"\n\t\t],\n\t\t\"stateful\": false,\n\t\t\"description\": \"\",\n\t\t\"name\": \"alpha\",\n\t\t\"source\": {\n\t\t\t\"type\": \"migration\",\n\t\t\t\"certificate\": \"-----BEGIN CERTIFICATE----
-----END CERTIFICATE-----\\n\",\n\t\t\t\"base-image\": \"6385fd2c715116b231e3e93b9e3d6cbc8c46aac5f1de7a8560a80ebd6bcaa394\",\n\t\t\t\"mode\": \"pull\",\n\t\t\t\"operation\": \"https://foxtrot:8443/1.0/operations/1d5105fe-f854-49a3-8f9e-b55e68ab7452\",\n\t\t\t\"secrets\": {\n\t\t\t\t\"control\": \"4f464b97d8726918ed815bf47d0abcf2f9214531886cdfae8eecc2a2db5dead8\",\n\t\t\t\t\"fs\": \"faa6c61d1a831226ffe351c04c51252441e7b7d5ca3d2d717257146fae8fd974\"\n\t\t\t},\n\t\t\t\"instance_only\": true,\n\t\t\t\"container_only\": true\n\t\t},\n\t\t\"instance_type\": \"\",\n\t\t\"type\": \"container\"\n\t}"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Responding to instance create"
t=2020-11-17T14:31:35+0000 lvl=info msg="Creating container" ephemeral=false name=alpha project=default
t=2020-11-17T14:31:35+0000 lvl=info msg="Created container" ephemeral=false name=alpha project=default
t=2020-11-17T14:31:35+0000 lvl=dbug msg="New task Operation: 6f0396ab-6abf-4dd5-89d4-aed398ba5616"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Started task operation: 6f0396ab-6abf-4dd5-89d4-aed398ba5616"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="\n\t{\n\t\t\"type\": \"async\",\n\t\t\"status\": \"Operation created\",\n\t\t\"status_code\": 100,\n\t\t\"operation\": \"/1.0/operations/6f0396ab-6abf-4dd5-89d4-aed398ba5616\",\n\t\t\"error_code\": 0,\n\t\t\"error\": \"\",\n\t\t\"metadata\": {\n\t\t\t\"id\": \"6f0396ab-6abf-4dd5-89d4-aed398ba5616\",\n\t\t\t\"class\": \"task\",\n\t\t\t\"description\": \"Creating container\",\n\t\t\t\"created_at\": \"2020-11-17T14:31:35.214948919Z\",\n\t\t\t\"updated_at\": \"2020-11-17T14:31:35.214948919Z\",\n\t\t\t\"status\": \"Running\",\n\t\t\t\"status_code\": 103,\n\t\t\t\"resources\": {\n\t\t\t\t\"containers\": [\n\t\t\t\t\t\"/1.0/containers/alpha\"\n\t\t\t\t],\n\t\t\t\t\"instances\": [\n\t\t\t\t\t\"/1.0/instances/alpha\"\n\t\t\t\t]\n\t\t\t},\n\t\t\t\"metadata\": null,\n\t\t\t\"may_cancel\": false,\n\t\t\t\"err\": \"\",\n\t\t\t\"location\": \"none\"\n\t\t}\n\t}"
t=2020-11-17T14:31:35+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0/operations/6f0396ab-6abf-4dd5-89d4-aed398ba5616 username="uid=1004"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="CreateInstanceFromMigration started" args="{Name:alpha Description: Config:map[] Snapshots:[] MigrationType:{FSType:ZFS Features:[compress]} TrackProgress:false Refresh:false Live:true VolumeSize:0 ContentType:}" driver=zfs instance=alpha pool=default project=default
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Using volume size from root disk config" driver=zfs pool=default size=10GB
t=2020-11-17T14:35:49+0000 lvl=dbug msg="CreateInstanceFromMigration finished" args="{Name:alpha Description: Config:map[] Snapshots:[] MigrationType:{FSType:ZFS Features:[compress]} TrackProgress:false Refresh:false Live:true VolumeSize:0 ContentType:}" driver=zfs instance=alpha pool=default project=default
t=2020-11-17T14:35:49+0000 lvl=dbug msg="Got error reading migration control socket read tcp 192.168.11.11:42140->192.168.11.12:8443: use of closed network connection"
t=2020-11-17T14:35:49+0000 lvl=dbug msg="Success for task operation: 6f0396ab-6abf-4dd5-89d4-aed398ba5616"
t=2020-11-17T14:35:49+0000 lvl=dbug msg="Event listener finished: 261ff797-4eac-4a58-a6ce-f0296f61672d"
t=2020-11-17T14:35:49+0000 lvl=dbug msg="Disconnected event listener: 261ff797-4eac-4a58-a6ce-f0296f61672d"

And this is the matching log from the remote host:

t=2020-11-17T14:31:33+0000 lvl=dbug msg="Found cert" name=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:33+0000 lvl=dbug msg=Handling ip=192.168.11.11:42120 method=GET protocol=tls url=/1.0 username=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:33+0000 lvl=dbug msg="Found cert" name=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:33+0000 lvl=dbug msg="Found cert" name=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:33+0000 lvl=dbug msg=Handling ip=192.168.11.11:42122 method=GET protocol=tls url="/1.0/instances?recursion=1" username=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:34+0000 lvl=dbug msg="Found cert" name=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:34+0000 lvl=dbug msg=Handling ip=192.168.11.11:42132 method=GET protocol=tls url=/1.0 username=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:34+0000 lvl=dbug msg="Found cert" name=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Found cert" name=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:35+0000 lvl=dbug msg=Handling ip=192.168.11.11:42134 method=GET protocol=tls url=/1.0/instances/alpha username=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Found cert" name=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:35+0000 lvl=dbug msg=Handling ip=192.168.11.11:42136 method=GET protocol=tls url=/1.0/events username=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:35+0000 lvl=dbug msg="New event listener: e79ad27a-7aef-4631-a7cc-7c4d6c78f6e6"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Found cert" name=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:35+0000 lvl=dbug msg=Handling ip=192.168.11.11:42138 method=POST protocol=tls url=/1.0/instances/alpha username=25330df5081b13a960b704c10e922ea0e47dc95947ddcd322aaecb68944c6b55
t=2020-11-17T14:31:35+0000 lvl=dbug msg="\n\t{\n\t\t\"name\": \"\",\n\t\t\"migration\": true,\n\t\t\"live\": false,\n\t\t\"instance_only\": true,\n\t\t\"container_only\": true,\n\t\t\"target\": null\n\t}"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="New websocket Operation: 1d5105fe-f854-49a3-8f9e-b55e68ab7452"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Started websocket operation: 1d5105fe-f854-49a3-8f9e-b55e68ab7452"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="\n\t{\n\t\t\"type\": \"async\",\n\t\t\"status\": \"Operation created\",\n\t\t\"status_code\": 100,\n\t\t\"operation\": \"/1.0/operations/1d5105fe-f854-49a3-8f9e-b55e68ab7452\",\n\t\t\"error_code\": 0,\n\t\t\"error\": \"\",\n\t\t\"metadata\": {\n\t\t\t\"id\": \"1d5105fe-f854-49a3-8f9e-b55e68ab7452\",\n\t\t\t\"class\": \"websocket\",\n\t\t\t\"description\": \"Migrating container\",\n\t\t\t\"created_at\": \"2020-11-17T14:31:35.14657045Z\",\n\t\t\t\"updated_at\": \"2020-11-17T14:31:35.14657045Z\",\n\t\t\t\"status\": \"Running\",\n\t\t\t\"status_code\": 103,\n\t\t\t\"resources\": {\n\t\t\t\t\"containers\": [\n\t\t\t\t\t\"/1.0/containers/alpha\"\n\t\t\t\t],\n\t\t\t\t\"instances\": [\n\t\t\t\t\t\"/1.0/instances/alpha\"\n\t\t\t\t]\n\t\t\t},\n\t\t\t\"metadata\": {\n\t\t\t\t\"control\": \"4f464b97d8726918ed815bf47d0abcf2f9214531886cdfae8eecc2a2db5dead8\",\n\t\t\t\t\"fs\": \"faa6c61d1a831226ffe351c04c51252441e7b7d5ca3d2d717257146fae8fd974\"\n\t\t\t},\n\t\t\t\"may_cancel\": true,\n\t\t\t\"err\": \"\",\n\t\t\t\"location\": \"none\"\n\t\t}\n\t}"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Allowing untrusted GET" ip=192.168.11.11:42140 url="/1.0/operations/1d5105fe-f854-49a3-8f9e-b55e68ab7452/websocket?secret=4f464b97d8726918ed815bf47d0abcf2f9214531886cdfae8eecc2a2db5dead8"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Connected websocket Operation: 1d5105fe-f854-49a3-8f9e-b55e68ab7452"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Handled websocket Operation: 1d5105fe-f854-49a3-8f9e-b55e68ab7452"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Allowing untrusted GET" ip=192.168.11.11:42142 url="/1.0/operations/1d5105fe-f854-49a3-8f9e-b55e68ab7452/websocket?secret=faa6c61d1a831226ffe351c04c51252441e7b7d5ca3d2d717257146fae8fd974"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Connected websocket Operation: 1d5105fe-f854-49a3-8f9e-b55e68ab7452"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="Handled websocket Operation: 1d5105fe-f854-49a3-8f9e-b55e68ab7452"
t=2020-11-17T14:31:35+0000 lvl=dbug msg="MigrateInstance started" args="&{Name:alpha Snapshots:[] MigrationType:{FSType:ZFS Features:[compress]} TrackProgress:true MultiSync:true FinalSync:false Data:<nil> ContentType:}" driver=zfs instance=alpha pool=array project=default
t=2020-11-17T14:31:36+0000 lvl=dbug msg="Updated metadata for websocket Operation: 1d5105fe-f854-49a3-8f9e-b55e68ab7452"

<<<Lines repeated every second deleted>>>

t=2020-11-17T14:35:49+0000 lvl=dbug msg="Updated metadata for websocket Operation: 1d5105fe-f854-49a3-8f9e-b55e68ab7452"
t=2020-11-17T14:35:49+0000 lvl=dbug msg="MigrateInstance finished" args="&{Name:alpha Snapshots:[] MigrationType:{FSType:ZFS Features:[compress]} TrackProgress:true MultiSync:true FinalSync:false Data:map[filesystem:srv/lxd/containers/alpha@migration-d3694b80-d76b-4d90-9744-9ab2089784d2] ContentType:}" driver=zfs instance=alpha pool=array project=default
t=2020-11-17T14:35:49+0000 lvl=dbug msg="MigrateInstance started" args="&{Name:alpha Snapshots:[] MigrationType:{FSType:ZFS Features:[compress]} TrackProgress:true MultiSync:true FinalSync:true Data:map[filesystem:srv/lxd/containers/alpha@migration-d3694b80-d76b-4d90-9744-9ab2089784d2] ContentType:}" driver=zfs instance=alpha pool=array project=default
t=2020-11-17T14:35:49+0000 lvl=dbug msg="Event listener finished: e79ad27a-7aef-4631-a7cc-7c4d6c78f6e6"
t=2020-11-17T14:35:49+0000 lvl=dbug msg="Disconnected event listener: e79ad27a-7aef-4631-a7cc-7c4d6c78f6e6"

When you say 2nd copy, do you mean you copy over the first copy, or do you copy it again to a different instance? Or do you use refresh mode?

Could you please show the command you run for each example?

I originally used refresh but stopped, as the size of the copied container was significantly larger than the original and I ran out of disk space.

For each container in the backup script I run the following commands:

lxc delete <container>
lxc copy <remote>:<container> <container> --stateless --instance-only --config=boot.autostart=false --storage default

It appears that the first time after a reload these commands don’t result in a lingering zfs send, but subsequently they do. However, I am not convinced the first copy succeeding cleanly happens consistently; it may be a red herring.

The pattern of two zfs sends per container, first with one snapshot and then with two, is consistent.

I just turned on daemon debugging with the tail still running and saw these messages:

t=2020-11-17T16:14:49+0000 lvl=info msg="Received 'terminated signal', waiting for all operations to finish"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelling websocket Operation: f40fbd3d-97ba-4efa-b2c2-e22dfd4d65e0"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelled websocket Operation: f40fbd3d-97ba-4efa-b2c2-e22dfd4d65e0"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelled websocket Operation: f40fbd3d-97ba-4efa-b2c2-e22dfd4d65e0"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelling websocket Operation: ac1c8166-db92-4aa1-b0b6-b6b42d6c02cf"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelled websocket Operation: ac1c8166-db92-4aa1-b0b6-b6b42d6c02cf"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelled websocket Operation: ac1c8166-db92-4aa1-b0b6-b6b42d6c02cf"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelling websocket Operation: 9becb5b7-b535-4466-9d4e-1d29f9b76175"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelled websocket Operation: 9becb5b7-b535-4466-9d4e-1d29f9b76175"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelled websocket Operation: 9becb5b7-b535-4466-9d4e-1d29f9b76175"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelling websocket Operation: 1d5105fe-f854-49a3-8f9e-b55e68ab7452"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelled websocket Operation: 1d5105fe-f854-49a3-8f9e-b55e68ab7452"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelled websocket Operation: 1d5105fe-f854-49a3-8f9e-b55e68ab7452"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelling websocket Operation: f52e16a3-e48d-4e82-b5f4-46a038619c46"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelled websocket Operation: f52e16a3-e48d-4e82-b5f4-46a038619c46"
t=2020-11-17T16:14:50+0000 lvl=dbug msg="Cancelled websocket Operation: f52e16a3-e48d-4e82-b5f4-46a038619c46"
t=2020-11-17T16:14:51+0000 lvl=info msg="All running operations finished, shutting down"
t=2020-11-17T16:14:51+0000 lvl=dbug msg="Cancel ongoing or future gRPC connection attempts"
t=2020-11-17T16:14:51+0000 lvl=info msg="Starting shutdown sequence"
t=2020-11-17T16:14:51+0000 lvl=info msg="Closing the database"
t=2020-11-17T16:14:51+0000 lvl=info msg="Stop database gateway"
t=2020-11-17T16:14:51+0000 lvl=info msg="Stopping REST API handler:"
t=2020-11-17T16:14:51+0000 lvl=info msg=" - closing socket" socket=[::]:8443
t=2020-11-17T16:14:51+0000 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-11-17T16:14:51+0000 lvl=info msg="Stopping /dev/lxd handler:"
t=2020-11-17T16:14:51+0000 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-11-17T16:14:51+0000 lvl=dbug msg="Not unmounting temporary filesystems (containers are still running)"

Sorry, that final log was me turning off debugging.

You mention snapshots. Do the containers you’re sending have snapshots? How many do they have?

I was referring to the migration zfs snapshots taken by LXD. The containers do also have 26 LXD snapshots, created using the ‘autosnap’ feature:

config:
  snapshots.expiry: 2d
  snapshots.schedule: 15 8-20 * * *
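
(That schedule is ordinary cron syntax: minute 15 of every hour from 08:00 to 20:00, i.e. 13 snapshots a day, which with the 2-day expiry accounts for the 26 snapshots. For illustration, these keys can be set per container like so:)

lxc config set <container> snapshots.schedule "15 8-20 * * *"
lxc config set <container> snapshots.expiry 2d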

OK thanks.

And is anything problematic appearing in dmesg (to rule out any issues with the lower-level I/O)?
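
For example, something like:

sudo dmesg -T | tail -n 100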

Nothing that I can see. Pretty much only `apparmor="ALLOWED"` messages and a few `kauditd_printk_skb: callbacks suppressed` lines.

Can you update to LXD 4.8, which is in the stable channel now (if not already), and retest? There were quite a few storage-related changes in that release, so it would be a good starting point to see whether this has been resolved.

Hi, it looks like the snap was automatically updated to 4.8 last night, so the runs I posted about today were already using 4.8.

# snap info lxd
name:      lxd
summary:   System container manager and API
publisher: Canonical✓
store-url: https://snapcraft.io/lxd
contact:   https://github.com/lxc/lxd/issues
license:   unset
description: |
  **LXD is a system container manager**
  
  With LXD you can run hundreds of containers of a variety of Linux
  distributions, apply resource limits, pass in directories, USB devices
  or GPUs and setup any network and storage you want.
  
  LXD containers are lightweight, secure by default and a great
  alternative to running Linux virtual machines.
  
  
  **Run any Linux distribution you want**
  
  Pre-made images are available for Ubuntu, Alpine Linux, ArchLinux,
  CentOS, Debian, Fedora, Gentoo, OpenSUSE and more.
  
  A full list of available images can be found here: https://images.linuxcontainers.org
  
  Can't find the distribution you want? It's easy to make your own images too, either using our
  `distrobuilder` tool or by assembling your own image tarball by hand.
  
  
  **Containers at scale**
  
  LXD is network aware and all interactions go through a simple REST API,
  making it possible to remotely interact with containers on remote
  systems, copying and moving them as you wish.
  
  Want to go big? LXD also has built-in clustering support,
  letting you turn dozens of servers into one big LXD server.
  
  
  **Configuration options**
  
  Supported options for the LXD snap (`snap set lxd KEY=VALUE`):
     - ceph.builtin: Use snap-specific ceph configuration [default=false]
     - criu.enable: Enable experimental live-migration support [default=false]
     - daemon.debug: Increases logging to debug level [default=false]
     - daemon.group: Group of users that can interact with LXD [default=lxd]
     - daemon.preseed: A YAML configuration to feed `lxd init` on initial start
     - lxcfs.pidfd: Start per-container process tracking [default=false]
     - lxcfs.loadavg: Start tracking per-container load average [default=false]
     - lxcfs.cfs: Consider CPU shares for CPU usage [default=false]
     - openvswitch.builtin: Run a snap-specific OVS daemon [default=false]
     - shiftfs.enable: Enable shiftfs support [default=auto]
  
  Documentation: https://lxd.readthedocs.io
commands:
  - lxd.benchmark
  - lxd.buginfo
  - lxd.check-kernel
  - lxd.lxc
  - lxd.lxc-to-lxd
  - lxd
  - lxd.migrate
services:
  lxd.activate: oneshot, enabled, inactive
  lxd.daemon:   simple, enabled, active
snap-id:      J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:     latest/stable
refresh-date: yesterday at 22:20 GMT
channels:
  latest/stable:    4.8         2020-11-16 (18324) 72MB -
  latest/candidate: 4.8         2020-11-16 (18324) 72MB -
  latest/beta:      ↑                                   
  latest/edge:      git-4d6203a 2020-11-15 (18320) 72MB -
  4.8/stable:       4.8         2020-11-16 (18324) 72MB -
  4.8/candidate:    4.8         2020-11-16 (18324) 72MB -
  4.8/beta:         ↑                                   
  4.8/edge:         ↑                                   
  4.7/stable:       4.7         2020-11-12 (18251) 72MB -
  4.7/candidate:    4.7         2020-11-12 (18251) 72MB -
  4.7/beta:         ↑                                   
  4.7/edge:         ↑                                   
  4.6/stable:       4.6         2020-10-13 (17738) 72MB -
  4.6/candidate:    4.6         2020-10-14 (17777) 72MB -
  4.6/beta:         ↑                                   
  4.6/edge:         ↑                                   
  4.0/stable:       4.0.4       2020-11-03 (18150) 71MB -
  4.0/candidate:    4.0.4       2020-11-04 (18165) 71MB -
  4.0/beta:         ↑                                   
  4.0/edge:         git-21b37b8 2020-11-14 (18299) 71MB -
  3.0/stable:       3.0.4       2019-10-10 (11348) 55MB -
  3.0/candidate:    3.0.4       2019-10-10 (11348) 55MB -
  3.0/beta:         ↑                                   
  3.0/edge:         git-81b81b9 2019-10-10 (11362) 55MB -
  2.0/stable:       2.0.12      2020-08-18 (16879) 38MB -
  2.0/candidate:    2.0.12      2020-08-17 (16879) 38MB -
  2.0/beta:         ↑                                   
  2.0/edge:         git-82c7d62 2020-08-17 (16867) 39MB -
installed:          4.8                    (18324) 72MB -

OK thanks.

Can you advise whether, when the processes are left behind, there are any active sockets left behind as well?

Could you provide the output of ss -tnp where the socket mentions lxd or zfs?
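
For example (the grep filter is just a suggestion):

ss -tnp | grep -E 'lxd|zfs'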

There was nothing left open tagged with lxd or zfs, nor were there any sockets open at all between the two hosts.

This is on the remote host (ip = …12). There are no connections to the other host (ip = …11):

State          Recv-Q      Send-Q           Local Address:Port              Peer Address:Port      Process                                                          
ESTAB          0           0                192.168.11.12:59648            192.168.11.22:389        users:(("sssd_be",pid=5550,fd=22))                              
ESTAB          0           0                192.168.11.12:22              192.168.11.161:61807      users:(("sshd",pid=2910714,fd=4),("sshd",pid=2910467,fd=4))     
ESTAB          0           0                192.168.11.12:445             192.168.11.161:58677      users:(("smbd",pid=291070,fd=40))                               
ESTAB          0           0                192.168.11.12:37428            192.168.11.21:49153      users:(("winbindd",pid=15419,fd=27))                            
CLOSE-WAIT     1           0                192.168.11.12:55498            192.168.11.21:389        users:(("winbindd",pid=15511,fd=27))                            
ESTAB          0           0                192.168.11.12:50042            192.168.11.21:49152      users:(("winbindd",pid=15419,fd=26))                            
CLOSE-WAIT     1           0                192.168.11.12:53382            192.168.11.21:445        users:(("winbindd",pid=15419,fd=24))  

Sorry, I meant run it with sudo or as root, in case you were running as a non-root user.

Sorry, did you mean the ss command or the lxc copy? The ss command was run with sudo.