LXD rsync live migration error

We’re trying to do live migration with very simple newly created containers, but the move never completes.

Looking at /var/snap/lxd/common/lxd/logs/lxd.log, we get rsync errors on the sending host:

ephemeral=false lvl=info msg="Creating container" name=rjtTestServer t=2018-09-19T14:45:10+0100
ephemeral=false lvl=info msg="Created container" name=rjtTestServer t=2018-09-19T14:45:10+0100
lvl=warn msg="Unable to update backup.yaml at this time" name=rjtTestServer t=2018-09-19T14:45:10+0100
lvl=warn msg="Unable to update backup.yaml at this time" name=rjtTestServer t=2018-09-19T14:45:10+0100
ephemeral=false lvl=info msg="Creating container" name=rjtTestServer/deleteMe t=2018-09-19T14:45:10+0100
ephemeral=false lvl=info msg="Created container" name=rjtTestServer/deleteMe t=2018-09-19T14:45:10+0100
action=start created=2018-09-19T14:45:10+0100 ephemeral=false lvl=info msg="Starting container" name=rjtTestServer stateful=false t=2018-09-19T14:45:34+0100 used=1970-01-01T01:00:00+0100
action=start created=2018-09-19T14:45:10+0100 ephemeral=false lvl=info msg="Started container" name=rjtTestServer stateful=false t=2018-09-19T14:45:34+0100 used=1970-01-01T01:00:00+0100
actionscript=false created=2018-09-19T14:45:10+0100 ephemeral=false features=1 lvl=info msg="Migrating container" name=rjtTestServer predumpdir= statedir= stop=false t=2018-09-19T14:45:53+0100 used=2018-09-19T14:45:34+0100
actionscript=false created=2018-09-19T14:45:10+0100 ephemeral=false features=0 lvl=info msg="Migrating container" name=rjtTestServer predumpdir= statedir=/tmp/lxd_checkpoint_685803994 stop=false t=2018-09-19T14:46:04+0100 used=2018-09-19T14:45:34+0100
actionscript=false created=2018-09-19T14:45:10+0100 ephemeral=false features=0 lvl=info msg="Migrated container" name=rjtTestServer predumpdir= statedir=/tmp/lxd_checkpoint_685803994 stop=false t=2018-09-19T14:46:04+0100 used=2018-09-19T14:45:34+0100
lvl=eror msg="Rsync send failed: /tmp/lxd_checkpoint_685803994/: exit status 2: [Receiver] Invalid dir index: -1 (-101 - -101)\nrsync error: protocol incompatibility (code 2) at flist.c(2630) [Receiver=3.1.1]\n" t=2018-09-19T14:46:04+0100

We don’t get errors on the receiving host.
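
For anyone comparing the logs on both hosts, here is a minimal Python sketch that pulls only the error-level entries out of key=value log lines like the ones above. The function names and the trimmed sample lines are illustrative assumptions, not part of LXD; note that the level is spelled `eror` in these logs.

```python
import re

# Matches key=value pairs where the value is either a double-quoted string
# (backslash escapes allowed) or a run of non-space characters (may be empty).
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_line(line):
    """Parse one key=value log line into a dict, stripping outer quotes."""
    fields = {}
    for key, value in PAIR.findall(line):
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[key] = value
    return fields


def errors(lines):
    """Yield (timestamp, message) for entries logged at level 'eror'."""
    for line in lines:
        fields = parse_line(line)
        if fields.get("lvl") == "eror":
            yield fields.get("t"), fields.get("msg")


# Sample lines adapted from the log above (the error message is trimmed).
sample = [
    'ephemeral=false lvl=info msg="Created container" name=rjtTestServer t=2018-09-19T14:45:10+0100',
    'lvl=warn msg="Unable to update backup.yaml at this time" name=rjtTestServer t=2018-09-19T14:45:10+0100',
    'lvl=eror msg="Rsync send failed: /tmp/lxd_checkpoint_685803994/: exit status 2" t=2018-09-19T14:46:04+0100',
]

found = list(errors(sample))
for when, message in found:
    print(when, message)
```

To run it against a real log, pass an open file object for lxd.log to `errors()` instead of the sample list.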

Elsewhere in the logs it does repeatedly complain about

lvl=warn msg="Unable to update backup.yaml at this time" name=testchimg t=2018-09-19T14:38:01+0100

but that may be unrelated.

Both hosts are 18.04.1 new builds, as is the container.

First time posting here, apologies for any breaches of protocol.

Richard

Same LXD version on source and destination?

Both are snap installed 3.0.2.

Hmm, odd, we’ll need to try to reproduce that.
What storage backend is that? Does it happen consistently, and did you try with a very simple container image like Alpine?

The error matches what I’d expect if one server were on 3.0.1 and the other on 3.0.2, or if one were on 3.0.2 with our cherry-picks and the other without. That is what happened before we added logic to detect rsync feature mismatches.
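
To illustrate the general idea of detecting a feature mismatch, here is a rough Python sketch: each side advertises the optional rsync features it supports, and only the features both sides list are used. This is not LXD's actual code. The function names, the feature names, and the mapping to flags are all assumptions made for the example.

```python
def negotiate(local_features, remote_features):
    """Return the features both sides advertise, in local order.

    A peer that advertises nothing (None) is treated as supporting
    no optional features at all.
    """
    remote = set(remote_features or [])
    return [name for name in local_features if name in remote]


def rsync_args(features):
    """Map negotiated feature names to rsync flags (hypothetical mapping)."""
    flag_for = {
        "xattrs": "--xattrs",
        "delete": "--delete",
        "compress": "--compress",
    }
    return [flag_for[name] for name in features if name in flag_for]


# Hypothetical feature lists for a newer sender and two kinds of receiver.
sender = ["xattrs", "delete", "compress"]
old_receiver = None                      # advertises no features
partial_receiver = ["delete", "xattrs"]  # lacks "compress"

print(rsync_args(negotiate(sender, old_receiver)))
print(rsync_args(negotiate(sender, partial_receiver)))
```

Without a step like this, a sender could pass flags the receiving rsync does not expect, which is one way to end up with a protocol incompatibility error.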

Alpine gives roughly the same error.

The storage backend is ZFS. We’ve yet to manage a successful live migration. We’re using the stable snap branch 3.0.

actionscript=false created=2018-09-20T13:56:02+0100 ephemeral=false features=1 lvl=info msg="Migrating container" name=testAlpine predumpdir= statedir= stop=false t=2018-09-20T13:59:51+0100 used=2018-09-20T13:56:03+0100
actionscript=false created=2018-09-20T13:56:02+0100 ephemeral=false features=0 lvl=info msg="Migrating container" name=testAlpine predumpdir= statedir=/tmp/lxd_checkpoint_696747950 stop=false t=2018-09-20T13:59:52+0100 used=2018-09-20T13:56:03+0100
actionscript=false created=2018-09-20T13:56:02+0100 ephemeral=false features=0 lvl=info msg="Migrated container" name=testAlpine predumpdir= statedir=/tmp/lxd_checkpoint_696747950 stop=false t=2018-09-20T13:59:52+0100 used=2018-09-20T13:56:03+0100
lvl=eror msg="Rsync send failed: /tmp/lxd_checkpoint_696747950/: exit status 2: [Receiver] Invalid dir index: -1 (-101 - -101)\nrsync error: protocol incompatibility (code 2) at flist.c(2630) [Receiver=3.1.1]\n" t=2018-09-20T13:59:52+0100

Can you show lxc info for the source and destination server?

Looks like I can’t attach files, so I’ve gone with a long post, sorry:

Host 3:

config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- candid_authentication
- candid_config
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 10.163.254.33:8443
  - 192.168.248.110:8443
  - 192.168.122.1:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIFVTCCAz2gAwIBAgIQExnufo8Jn65F11gQP/JPBTANBgkqhkiG9w0BAQsFADA5
    <snip>
    3BezUmyqzVdMJnMSri8UovKkTTM/4kriHZw9SNAS77tedIveyY7C1dZzuNLrgTnT
    UIPWOtJiEOcdpEYLPDHDDVjJmwhzXWPfIA==
    -----END CERTIFICATE-----
  certificate_fingerprint: <snip>
  driver: lxc
  driver_version: 3.0.2
  kernel: Linux
  kernel_architecture: x86_64
  kernel_version: 4.15.0-34-generic
  server: lxd
  server_pid: 27824
  server_version: 3.0.2
  storage: zfs
  storage_version: 0.7.5-1ubuntu16.3
  server_clustered: false
  server_name: frp-vmhost3

And host 4:

config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- candid_authentication
- candid_config
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 10.163.254.34:8443
  - 192.168.248.149:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIFUDCCAzigAwIBAgIRAPc9dUjEEp/Ld7S7/1XIgfwwDQYJKoZIhvcNAQELBQAw
    <snip>
    8KorvAlvSmsxS4sqf6YjSNTur3jZMuSyjGivbMwJjxKjyHS9SbKABYZolV4wQzZo
    X1ehaKAo5cbbthGtY1Pbrp7CrFU=
    -----END CERTIFICATE-----
  certificate_fingerprint: <snip>
  driver: lxc
  driver_version: 3.0.2
  kernel: Linux
  kernel_architecture: x86_64
  kernel_version: 4.15.0-34-generic
  server: lxd
  server_pid: 2518
  server_version: 3.0.2
  storage: zfs
  storage_version: 0.7.5-1ubuntu16.3
  server_clustered: false
  server_name: frp-vmhost4
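
For comparing two such outputs quickly, here is a small Python sketch that reports only the fields that differ. It assumes the `environment` sections have already been loaded into dictionaries (for example with a YAML parser); the helper name is made up, and the values are copied from a subset of the two outputs above.

```python
def diff_environment(a, b):
    """Return {key: (a_value, b_value)} for keys whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Subset of the environment fields from the two hosts above.
host3 = {
    "driver_version": "3.0.2",
    "kernel_version": "4.15.0-34-generic",
    "server_version": "3.0.2",
    "storage": "zfs",
    "storage_version": "0.7.5-1ubuntu16.3",
    "server_name": "frp-vmhost3",
}
host4 = {
    "driver_version": "3.0.2",
    "kernel_version": "4.15.0-34-generic",
    "server_version": "3.0.2",
    "storage": "zfs",
    "storage_version": "0.7.5-1ubuntu16.3",
    "server_name": "frp-vmhost4",
}

differences = diff_environment(host3, host4)
for key, (left, right) in differences.items():
    print(f"{key}: {left} != {right}")
```

On this subset only `server_name` differs, which is consistent with both hosts reporting the same LXD, kernel and ZFS versions.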

Does anyone have any ideas?

Just got the same error with the latest lxc and lxd from git on RHEL 7 with rsync 3.1.2. I will look into what is needed to fix this.

@stgraber If I revert the following commit it works again:

commit 7dfc2939ac278d8436ff4b892599c795c5482007
Author: Stéphane Graber <stgraber@ubuntu.com>
Date: Thu Aug 23 19:27:31 2018 -0400

global: Advertise rsync features

Closes #4962

Somehow the pre-dump loop seems to have problems with the additional information on the channel.

@stgraber - does release 3.0.3 change this at all?

The CRIU migration issue was resolved in LXD 3.7 and the same fix is present in 3.0.3.
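
As a quick way to check a host against that statement, here is a Python sketch that tests whether a `server_version` string is new enough. The function names are made up, and it assumes versions are plain dotted integers and that releases after 3.7 keep the fix.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.0.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_migration_fix(version):
    """True if the version is 3.0.3 or later on the 3.0 branch, or 3.7 and up."""
    parts = parse_version(version)
    parts = parts + (0,) * (3 - len(parts))  # pad '3.7' to (3, 7, 0)
    major, minor, patch = parts[:3]
    if (major, minor) == (3, 0):
        return patch >= 3
    return (major, minor) >= (3, 7)


for candidate in ["3.0.2", "3.0.3", "3.6", "3.7"]:
    print(candidate, has_migration_fix(candidate))
```

Both hosts in this thread report 3.0.2, so by this check they would need to move to 3.0.3 or later.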

Thanks - we’ll check it out.