Lxc copy: Error transferring instance data: exit status 22

I am trying to copy containers from an old 3.0.2 deb based install to a newer clustered 4.0.1 snap based install. I keep getting the following error:

lxc copy c1 ceph:c1
Error: Failed container creation: Error transferring instance data: exit status 22

The lxd.log shows the following:

lvl=eror msg="Rsync send failed: /var/lib/lxd/containers/c1/: exit status 22: ERROR: buffer overflow in recv_rules [Receiver]\nrsync error: error allocating core memory buffers (code 22) at util2.c(112) [Receiver=3.1.2]\n" t=2020-05-05T12:22:19-0700

Any idea what is happening?

I just upgraded to 3.0.3 on the source server since it was available through apt. Now I get the following errors:

Error: Failed container creation: Error transferring instance data: exit status 2

t=2020-05-05T12:48:46-0700 lvl=eror msg="Rsync send failed: /var/lib/lxd/containers/c1/: exit status 2: [Receiver] Invalid dir index: -1 (-101 - -101)\nrsync error: protocol incompatibility (code 2) at flist.c(2634) [Receiver=3.1.2]\n"

The source server is using a ZFS storage pool and the target server is using a Ceph RBD storage pool.

Hmm, it looks like something is going wrong with the migration negotiation, that or quite a different rsync version.

Depending on whether your source server can take the downtime, upgrading it to the 4.0 snap would likely work around this issue.

snap install lxd
lxd.migrate

Should take care of moving the data and cleaning things up, then you’ll be dealing with the same version on source and target.

Source server:

$ lsb_release -d
Description: Ubuntu 18.04.1 LTS
$ rsync --version | head -n1
rsync version 3.1.2 protocol version 31
$ lxc version
Client version: 3.0.3
Server version: 3.0.3

Destination servers (3 in an LXD cluster):

$ lsb_release -d
Description: Ubuntu 18.04.4 LTS
$ rsync --version | head -n1
rsync version 3.1.2 protocol version 31
$ lxc version
Client version: 4.0.1
Server version: 4.0.1

Although, in the case of the destination servers, that might not be the correct way to check the rsync version: isn’t the rsync that LXD actually uses bundled with the LXD snap (via its Snap core dependency)?
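
If the snap does ship its own rsync (an assumption on my part; the path below is a guess and the binary may need the snap’s bundled libraries to run directly), something like this on a destination node would show the version LXD actually uses:

$ # Check for and query the rsync bundled inside the LXD snap
$ ls -l /snap/lxd/current/bin/rsync
$ /snap/lxd/current/bin/rsync --version | head -n1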

I was able to move the container using a publish command on the source. In this case the ‘ceph’ remote is one of the 3 servers in the destination cluster:

lxc publish c1 ceph: --alias c1

Followed by a launch command on the destination:

lxc launch c1 c1
lxc image delete c1

However, this has a few ugly side effects:

  1. The ‘volatile.base_image’ and ‘image.*’ configuration values are all lost because the published image is given its own fingerprint. I don’t want to lose this information because it is quite useful to have for reference. I was able to manually restore these values on the destination container, but I am not sure whether doing so is safe.
  2. Using publish to move the container and then deleting the image after launching the container at the destination leaves a zombie image in the RBD pool. I know this is expected behavior given how LXD clones containers from the base image, but it isn’t ideal:

    $ sudo rbd ls lxd/
    container_c1

    zombie_image_aeeacfe6b70321d45e1f0f7560cc8e513e75b8dec539840c6a7a5ef5e4e953d4_ext4

  3. The publish approach also causes the container to lose its efficient snapshot copy of its true base image ‘9879a79ac2b208c05af769089f0a6c3cbea8529571e056c82e96f1468cd1f610’ as published on https://cloud-images.ubuntu.com/releases (the parent check sketched right after this list is how I verified that).
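
For reference, this is how I checked whether a container’s RBD image is still an efficient clone of a base image (the ‘lxd’ pool name matches the rbd ls output above; a cloned image shows a parent line, a full copy does not):

$ # Show the clone parent (if any) of the container's RBD image
$ sudo rbd info lxd/container_c1 | grep parent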

I have 40-50 containers that I need to migrate, and running into the above issues that many times isn’t workable.

The source is a single server storing the containers in a ZFS pool. The destination is a 3 node cluster storing the containers in a CEPH RBD pool.

Is there any way to move the containers without losing their efficient snapshot copies? Any suggestions on how to best do this migration?
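
For scale, once a plain ‘lxc copy’ works again, I was hoping to script the bulk move with something like this (just a sketch; it assumes the containers can be copied as-is or are stopped first, and that ‘ceph:’ is the remote pointing at the destination cluster):

$ # Copy every container on the source to the destination cluster
$ for c in $(lxc list -c n --format csv); do lxc copy "$c" "ceph:$c"; done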

@stgraber

I decided to try copying a container using the push mode:

$ lxc copy test-copy ceph: --mode push --verbose
Transferring container: test-copy: 142B (19B/s)

It just hangs in the above state indefinitely, even though the following error shows up in ‘lxc monitor’ output on the target cluster:

metadata:
  class: websocket
  created_at: "2020-05-08T11:18:54.187185969-07:00"
  description: Creating container
  err: 'Error transferring instance data: exit status 2'
  id: 11809c5b-cae4-40a6-9c1e-95adfd5eb2ee
  location: node03
  may_cancel: false
  metadata:
    create_instance_from_image_unpack_progress: 'Unpack: 100% (3.90GB/s)'
    progress:
      percent: "100"
      speed: "3900497512"
      stage: create_instance_from_image_unpack
  resources:
    containers:
    - /1.0/containers/test-copy
    instances:
    - /1.0/instances/test-copy
  status: Failure
  status_code: 400
  updated_at: "2020-05-08T11:18:57.590873186-07:00"

Both push and pull modes encounter the same ‘exit status 2’ error. The hang above is a bug in its own right, because the failure should have propagated out and terminated the stuck command. Beyond that, the underlying failure also looks like a bug: rsyncing from LXD 3.0.3 on Ubuntu 18.04 LTS to the LXD 4.0.1 snap on Ubuntu 18.04 LTS should involve close enough rsync versions for this to work, right?

The source LXD daemon reports the following in its log:

lvl=eror msg="Rsync send failed: /var/lib/lxd/containers/test-copy/: exit status 2: [Receiver] Invalid dir index: -1 (-101 - -101)\nrsync error: protocol incompatibility (code 2) at flist.c(2634) [Receiver=3.1.2]\n"
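
To narrow down whether the problem is the distro rsync or the one bundled in the snap, I may try reproducing the transfer outside of LXD with something along these lines (a rough sketch, not the exact options LXD uses; the target path is made up, it assumes root can ssh to node03, and invoking the snap’s bundled rsync directly may not work without its bundled libraries):

$ # Distro rsync on both ends (both report 3.1.2)
$ sudo rsync -aHAX --dry-run /var/lib/lxd/containers/test-copy/ node03:/tmp/rsync-test/
$ # Same transfer, but pointing the receiver at the rsync shipped in the LXD snap
$ sudo rsync -aHAX --dry-run --rsync-path=/snap/lxd/current/bin/rsync /var/lib/lxd/containers/test-copy/ node03:/tmp/rsync-test/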

I tried searching for “[Receiver] Invalid dir index: -1 (-101 - -101)”. Maybe it is related to this?: