LXD rsync live migration error


(Richard Trouncer) #1

We’re trying to do live migration with very simple newly created containers, but the move never completes.

Looking at /var/snap/lxd/common/lxd/logs/lxd.log, we see rsync errors on the sending host:

ephemeral=false lvl=info msg="Creating container" name=rjtTestServer t=2018-09-19T14:45:10+0100
ephemeral=false lvl=info msg="Created container" name=rjtTestServer t=2018-09-19T14:45:10+0100
lvl=warn msg="Unable to update backup.yaml at this time" name=rjtTestServer t=2018-09-19T14:45:10+0100
lvl=warn msg="Unable to update backup.yaml at this time" name=rjtTestServer t=2018-09-19T14:45:10+0100
ephemeral=false lvl=info msg="Creating container" name=rjtTestServer/deleteMe t=2018-09-19T14:45:10+0100
ephemeral=false lvl=info msg="Created container" name=rjtTestServer/deleteMe t=2018-09-19T14:45:10+0100
action=start created=2018-09-19T14:45:10+0100 ephemeral=false lvl=info msg="Starting container" name=rjtTestServer stateful=false t=2018-09-19T14:45:34+0100 used=1970-01-01T01:00:00+0100
action=start created=2018-09-19T14:45:10+0100 ephemeral=false lvl=info msg="Started container" name=rjtTestServer stateful=false t=2018-09-19T14:45:34+0100 used=1970-01-01T01:00:00+0100
actionscript=false created=2018-09-19T14:45:10+0100 ephemeral=false features=1 lvl=info msg="Migrating container" name=rjtTestServer predumpdir= statedir= stop=false t=2018-09-19T14:45:53+0100 used=2018-09-19T14:45:34+0100
actionscript=false created=2018-09-19T14:45:10+0100 ephemeral=false features=0 lvl=info msg="Migrating container" name=rjtTestServer predumpdir= statedir=/tmp/lxd_checkpoint_685803994 stop=false t=2018-09-19T14:46:04+0100 used=2018-09-19T14:45:34+0100
actionscript=false created=2018-09-19T14:45:10+0100 ephemeral=false features=0 lvl=info msg="Migrated container" name=rjtTestServer predumpdir= statedir=/tmp/lxd_checkpoint_685803994 stop=false t=2018-09-19T14:46:04+0100 used=2018-09-19T14:45:34+0100
lvl=eror msg="Rsync send failed: /tmp/lxd_checkpoint_685803994/: exit status 2: [Receiver] Invalid dir index: -1 (-101 - -101)\nrsync error: protocol incompatibility (code 2) at flist.c(2630) [Receiver=3.1.1]\n" t=2018-09-19T14:46:04+0100

We don’t get errors on the receiving host.

Elsewhere in the logs, it repeatedly complains about:

lvl=warn msg="Unable to update backup.yaml at this time" name=testchimg t=2018-09-19T14:38:01+0100

but that may be unrelated.

Both hosts are new Ubuntu 18.04.1 builds, as is the container.

First time posting here, apologies for any breaches of protocol.

Richard


(Stéphane Graber) #2

Same LXD version on source and destination?


(Ed McDonagh) #3

Both are 3.0.2, installed from the snap.


(Stéphane Graber) #4

Hmm, odd, we’ll need to try to reproduce that.
What storage backend is that? Does it happen consistently, and did you try with a very simple container image like Alpine?

The error matches what I’d expect if one server was on 3.0.1 and the other on 3.0.2, or if one was on 3.0.2 with our cherry-picks and the other without. That is what happened before we added logic to detect rsync feature mismatches.
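
For readers unfamiliar with the idea, detecting rsync feature mismatches can be pictured roughly as below. This is only a minimal sketch of the general technique, not LXD's actual code: the feature names, the mapping to rsync options, the fallback for a peer that advertises nothing, and the function names are all assumptions made for illustration. Each side advertises what it supports, and the sender enables only the options both sides listed.

```python
# Illustrative only: feature names and the flag mapping are assumptions,
# not taken from the LXD source.
FEATURE_FLAGS = {
    "xattrs": "--xattrs",
    "delete": "--delete",
    "compress": "--compress-level=2",
}


def negotiate(local, remote):
    """Return the features both peers advertised, in a stable order."""
    if remote is None:
        # Assumption: a peer that predates feature advertisement sends
        # nothing, so fall back to no optional features at all.
        return []
    return sorted(set(local) & set(remote))


def rsync_args(features):
    """Map negotiated feature names to rsync command-line options."""
    return [FEATURE_FLAGS[name] for name in features if name in FEATURE_FLAGS]


if __name__ == "__main__":
    common = negotiate(["xattrs", "delete", "compress"], ["xattrs", "delete"])
    print(common, rsync_args(common))
```

If the two ends skip this step and each picks options on its own, the sender and receiver can end up speaking slightly different dialects, which is the kind of situation where rsync reports a protocol incompatibility.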


(Richard Trouncer) #5

Alpine gives roughly the same error.

The storage backend is ZFS. We’ve yet to manage a successful live migration. We’re using the stable snap branch 3.0.

actionscript=false created=2018-09-20T13:56:02+0100 ephemeral=false features=1 lvl=info msg="Migrating container" name=testAlpine predumpdir= statedir= stop=false t=2018-09-20T13:59:51+0100 used=2018-09-20T13:56:03+0100
actionscript=false created=2018-09-20T13:56:02+0100 ephemeral=false features=0 lvl=info msg="Migrating container" name=testAlpine predumpdir= statedir=/tmp/lxd_checkpoint_696747950 stop=false t=2018-09-20T13:59:52+0100 used=2018-09-20T13:56:03+0100
actionscript=false created=2018-09-20T13:56:02+0100 ephemeral=false features=0 lvl=info msg="Migrated container" name=testAlpine predumpdir= statedir=/tmp/lxd_checkpoint_696747950 stop=false t=2018-09-20T13:59:52+0100 used=2018-09-20T13:56:03+0100
lvl=eror msg="Rsync send failed: /tmp/lxd_checkpoint_696747950/: exit status 2: [Receiver] Invalid dir index: -1 (-101 - -101)\nrsync error: protocol incompatibility (code 2) at flist.c(2630) [Receiver=3.1.1]\n" t=2018-09-20T13:59:52+0100
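
When lxd.log gets long, a small script can pull out just the error entries. The following is a minimal sketch, assuming the log keeps the key=value layout shown above with double-quoted values for messages; the function names are made up for this example.

```python
import re

# key=value pairs; a value is either a double-quoted string (which may
# contain backslash escapes) or a run of non-space characters, possibly empty.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')
RECEIVER = re.compile(r"\[Receiver=([\d.]+)\]")


def parse_line(line):
    """Split one log line into a dict of its key=value fields."""
    fields = {}
    for key, value in PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # drop the surrounding quotes
        fields[key] = value
    return fields


def error_entries(lines):
    """Return the parsed entries whose level is 'eror' (LXD's spelling)."""
    return [f for f in map(parse_line, lines) if f.get("lvl") == "eror"]


def receiver_version(message):
    """Extract the rsync version reported by the receiving side, if any."""
    match = RECEIVER.search(message)
    return match.group(1) if match else None


if __name__ == "__main__":
    with open("/var/snap/lxd/common/lxd/logs/lxd.log") as log:
        for entry in error_entries(log):
            print(entry.get("t"), receiver_version(entry.get("msg", "")))
```

Run against the lines above, this would report a single error entry whose message names rsync 3.1.1 on the receiving side.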

(Stéphane Graber) #6

Can you show lxc info for the source and destination server?


(Ed McDonagh) #7

Looks like I can’t attach, so I’ve gone with a long post, sorry:

Host 3:

config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- candid_authentication
- candid_config
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 10.163.254.33:8443
  - 192.168.248.110:8443
  - 192.168.122.1:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIFVTCCAz2gAwIBAgIQExnufo8Jn65F11gQP/JPBTANBgkqhkiG9w0BAQsFADA5
    <snip>
    3BezUmyqzVdMJnMSri8UovKkTTM/4kriHZw9SNAS77tedIveyY7C1dZzuNLrgTnT
    UIPWOtJiEOcdpEYLPDHDDVjJmwhzXWPfIA==
    -----END CERTIFICATE-----
  certificate_fingerprint: <snip>
  driver: lxc
  driver_version: 3.0.2
  kernel: Linux
  kernel_architecture: x86_64
  kernel_version: 4.15.0-34-generic
  server: lxd
  server_pid: 27824
  server_version: 3.0.2
  storage: zfs
  storage_version: 0.7.5-1ubuntu16.3
  server_clustered: false
  server_name: frp-vmhost3

And host 4:

config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- candid_authentication
- candid_config
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 10.163.254.34:8443
  - 192.168.248.149:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIFUDCCAzigAwIBAgIRAPc9dUjEEp/Ld7S7/1XIgfwwDQYJKoZIhvcNAQELBQAw
    <snip>
    8KorvAlvSmsxS4sqf6YjSNTur3jZMuSyjGivbMwJjxKjyHS9SbKABYZolV4wQzZo
    X1ehaKAo5cbbthGtY1Pbrp7CrFU=
    -----END CERTIFICATE-----
  certificate_fingerprint: <snip>
  driver: lxc
  driver_version: 3.0.2
  kernel: Linux
  kernel_architecture: x86_64
  kernel_version: 4.15.0-34-generic
  server: lxd
  server_pid: 2518
  server_version: 3.0.2
  storage: zfs
  storage_version: 0.7.5-1ubuntu16.3
  server_clustered: false
  server_name: frp-vmhost4
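
Comparing two listings this long by eye is error-prone, so a short script can do it instead. This is a minimal sketch that assumes each server's info has already been loaded into a Python dict with the same shape as the output above (for example from the JSON the LXD REST API returns for /1.0); the function name and the set of keys treated as "expected to differ" are choices made for this example.

```python
# Keys that will differ between any two hosts and so say nothing about
# a version mismatch (an assumption made for this sketch).
EXPECTED_TO_DIFFER = {
    "addresses",
    "certificate",
    "certificate_fingerprint",
    "server_pid",
    "server_name",
}


def diff_server_info(a, b):
    """Report API extensions and environment values that differ."""
    ext_a = set(a.get("api_extensions", []))
    ext_b = set(b.get("api_extensions", []))
    env_a = a.get("environment", {})
    env_b = b.get("environment", {})
    env_diff = {
        key: (env_a.get(key), env_b.get(key))
        for key in sorted(set(env_a) | set(env_b))
        if key not in EXPECTED_TO_DIFFER and env_a.get(key) != env_b.get(key)
    }
    return {
        "only_in_a": sorted(ext_a - ext_b),
        "only_in_b": sorted(ext_b - ext_a),
        "environment": env_diff,
    }
```

An empty result on all three keys means the two servers advertise the same extensions and report the same versions, which is what one would want to confirm before looking elsewhere for the cause.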

(Richard Trouncer) #8

Does anyone have any ideas?


#9

I just got the same error with the latest LXC and LXD from git on RHEL 7 with rsync 3.1.2. I will look into what is necessary to fix this.


#10

@stgraber If I revert the following commit it works again:

commit 7dfc2939ac278d8436ff4b892599c795c5482007
Author: Stéphane Graber <stgraber@ubuntu.com>
Date: Thu Aug 23 19:27:31 2018 -0400

global: Advertise rsync features

Closes #4962

Somehow the pre-dump loop seems to have problems with the additional information on the channel.


(Ed McDonagh) #11

@stgraber - does release 3.0.3 change this at all?


(Stéphane Graber) #12

The CRIU migration issue was resolved in LXD 3.7 and the same fix is present in 3.0.3.
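
As a quick way to check a fleet of hosts against that statement, the rule can be written as a small helper. This is a minimal sketch: the function name is made up, it assumes plain dotted version strings such as the server_version values shown earlier in the thread, and it assumes that releases after 3.7 keep the fix.

```python
def has_criu_migration_fix(version):
    """True if the version is 3.0.3 or later on the 3.0 branch, or 3.7 or later."""
    parts = tuple(int(p) for p in version.split("."))
    major_minor = parts[:2]
    if major_minor == (3, 0):
        # 3.0.x branch: the fix is present from 3.0.3 onwards.
        patch = parts[2] if len(parts) > 2 else 0
        return patch >= 3
    # Feature releases: fixed in 3.7; later releases are assumed to keep it.
    return major_minor >= (3, 7)


if __name__ == "__main__":
    for v in ("3.0.2", "3.0.3", "3.6", "3.7"):
        print(v, has_criu_migration_fix(v))
```

Both hosts in this thread report 3.0.2, so by this rule they would need the update.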


(Ed McDonagh) #13

Thanks - we’ll check it out.