Failed creating instance on target: Problem with zfs receive error

I'm not sure if this is part of a larger problem yet (still doing some monitoring), but I have been getting failures when migrating containers to another LXD host. Both VM hosts are on the same VLAN (vlan99), and the LXD hosts and containers run on the same VLAN (vlan50).

I am running:
LXD 5.0.2 on both source and target, with small ZFS pools (64 GB)
Ubuntu 22.04 LTS VMs (16 GB RAM, 4 cores)

VMs running on XCP-ng 8.2 hosts

Connected to storage backend: TrueNAS SCALE 22.12 (ZFS)

Here are some of the errors from target and source:

Error on target (LXD1)
Error: Failed instance creation: Error transferring instance data: Failed creating instance on target: Problem with zfs receive: ([exit status 1 read tcp 192.168.50.10:36360->192.168.50.11:8443: read: connection reset by peer]) cannot receive new filesystem stream: incomplete stream
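The connection reset makes it look like the stream is being cut off mid-transfer. To rule LXD itself out, I'm going to try a raw zfs send/receive between the two hosts, roughly like this (the dataset name comes from the errors below; the SSH transport, the LXD1 hostname, and the throwaway destination dataset are just placeholders):

# on the source: snapshot the container's dataset and stream it to the target over SSH
zfs snapshot default/containers/zcs@migtest
zfs send default/containers/zcs@migtest | ssh root@LXD1 zfs receive pool1/zcs-send-test
# clean up the test artifacts on both sides afterwards
zfs destroy default/containers/zcs@migtest
ssh root@LXD1 zfs destroy -r pool1/zcs-send-test

If that reproduces the reset, it points at ZFS or the network path rather than LXD.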

lxc monitor on source (LXD2)
location: none
metadata:
context:
args: '&{IndexHeaderVersion:1 Name:zcs Snapshots:[] MigrationType:{FSType:ZFS
Features:[migration_header compress]} TrackProgress:true MultiSync:false FinalSync:false
Data: ContentType: AllowInconsistent:false Refresh:false Info:0xc000406398
VolumeOnly:false}'
instance: zcs
project: default
response: '{StatusCode:200 Error: Refresh:0xc002215663}'
version: "1"
level: info
message: Received migration index header response
timestamp: "2023-03-08T19:27:04.748580901-05:00"
type: logging

Note: This container is around 5 GB.

dmesg on the target; this seems to crop up during the migration tasks:

[ 8822.769928] INFO: task receive_writer:14300 blocked for more than 120 seconds.
[ 8822.777322]       Tainted: P           O      5.15.0-67-generic #74-Ubuntu
[ 8822.785443] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8822.794893] task:receive_writer  state:D stack:    0 pid:14300 ppid:     2 flags:0x00004000
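In case it's useful, the kernel stack of the blocked task can be grabbed like this (pid taken from the message above; needs root, and the sysrq variant needs sysrq enabled):

cat /proc/14300/stack
# or dump all blocked (D-state) tasks into dmesg
echo w > /proc/sysrq-trigger
dmesg | tail -n 50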

Error on target when deleting a damaged container:
Error: Error deleting storage volume: Failed to run: zfs destroy -r default/containers/zcs: exit status 1 (cannot destroy 'default/containers/zcs': dataset is busy)
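For anyone trying to reproduce, the busy state can be checked with the usual non-destructive commands (dataset name from the error above):

# any leftover snapshots or clones under the damaged container?
zfs list -t all -r default/containers/zcs
# is it still mounted somewhere, including other mount namespaces?
zfs get mounted,mountpoint default/containers/zcs
grep containers/zcs /proc/*/mounts 2>/dev/null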

Now, when I instead ran lxc export and imported the tarball on the target (something I saw Tom suggest in another post), everything went fine, with no errors at all. I am not logging any disk errors so far, and everything runs fine otherwise. It's only when I migrate containers larger than about 4 GB that I get the error; smaller containers migrate fine.
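For completeness, the workaround is just the standard export/import, roughly as follows (container and tarball names as used in the update below; the tarball still has to be copied between hosts by hand):

# on the source: export the container to a tarball
lxc export zcs zcs-back.tar.gz
# copy the tarball over, then on the target:
lxc import zcs-back.tar.gz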

If anyone has seen this, or if it is a clue to a larger problem, please let me know; I am just trying to rule out any hardware issues. I normally only move containers around when doing maintenance on one of the VM hosts they run on. All containers run fine 24/7, and I only just started seeing this, so I thought I would share.

Thank you

Update: Looks like importing is not working now.

When I import a container over 5 GB, it hangs at 100%:

ansible@LXD1:~$ lxc import zcs-back.tar.gz
Importing instance: 100% (6.41MB/s)
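While it hangs, something like this shows whether the receive side is wedged in uninterruptible sleep (D state), matching the hung-task messages in dmesg:

# look for zfs receive processes and the receive_writer kernel thread stuck in D state
ps -eo pid,stat,wchan:32,args | grep -E '[z]fs (recv|receive)'
ps -eo pid,stat,comm | grep '[r]eceive_writer'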

I have created a new ZFS pool on a separate virtual disk, thinking the ext4 file system backing the default loop file was causing the issue.
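(For reference, the new pool was created on the extra disk with something along these lines; the device name matches what shows up in the zpool output below.)

# create a second ZFS-backed LXD storage pool directly on the extra virtual disk
lxc storage create pool1 zfs source=/dev/xvdb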

Here is the output of zpool status:

  pool: default
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        default                                       ONLINE       0     0     0
          /var/snap/lxd/common/lxd/disks/default.img  ONLINE       0     0     0

errors: No known data errors

  pool: pool1
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          xvdb      ONLINE       0     0     0

errors: No known data errors

Output of dmesg:

[11117.424498] INFO: task vdev_autotrim:638 blocked for more than 120 seconds.
[11117.431743]       Tainted: P           O      5.15.0-67-generic #74-Ubuntu
[11117.441361] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11117.450284] task:vdev_autotrim   state:D stack:    0 pid:  638 ppid:     2 flags:0x00004000
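Since vdev_autotrim is also getting stuck, I'm tempted to turn autotrim off on both pools as a test; this is just a guess that it's related, using the standard zpool commands:

# check whether autotrim is enabled, then disable it on both pools as a test
zpool get autotrim default pool1
zpool set autotrim=off default
zpool set autotrim=off pool1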