Results of Testing on LXD 3.21 - much better, but something still not right

Version 3.21 seems to recover better from a power/server outage, whether intentional or not. But it still has issues if you only have 1 unit up. I hope my testing highlights the issues.

On a test cluster of four units: q1, q2, q3, q4

I first brought up the cluster and upgraded all units to the latest snap LXD.
Brought down and up cluster member q4: no problem.
Brought down and up cluster members q3 and q4: no problem.
Brought down and up cluster members q2, q3 and q4: problem, LXD on q1 gets stuck.
- To bring q2 back I had to do systemctl restart snap.lxd.daemon on q1 and q2 several times, but got it working.
- To get q3 back into the cluster I had to do systemctl restart snap.lxd.daemon on everything; even then it did not reconnect for about 10 minutes.
- After q3 was connected, q4 joined the cluster by itself.

After this I brought the whole cluster down by doing a shutdown at each individual terminal.
I then brought each unit back up with a 5 minute delay between them; none of the 4 units could get LXD working by itself. Some had the error "cannot fetch node config from database: driver: bad connection", others had "Job for snap.lxd.daemon.unix.socket failed".
I tried systemctl restart snap.lxd.daemon like before, to no avail.
Then after running my reset script (see below) about 2 times, it started working fine.

While 3.21 is much better at recovering from a lost cluster or from a shutdown, if it goes under 2 nodes the cluster becomes confused and you need to restart LXD on all the nodes. If you start from 0 or 1 node being online, it looks like something happens with unix.socket and it needs to be restarted; once LXD is restarted it kicks in.
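For reference, "restart LXD on all the nodes" here just means running the restart on each unit, or in a loop over ssh (a sketch, assuming the q1..q4 hostnames resolve and root ssh access is set up):

for h in q1 q2 q3 q4; do
  ssh root@$h systemctl restart snap.lxd.daemon
done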

Unfortunately my understanding of the startup process of LXD is limited, otherwise I would know where to look further.

I can continue to do further testing if someone gives me a better idea of where to look in the startup process. To me it seems like it could be a timing issue on initial boot.
It could also be in the cluster voting process; it seems to hang, and it doesn't time out and try again.
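For the next round of testing, one way to watch the voting during startup on each node (a sketch; daemon.debug and the log path are the standard snap LXD ones):

snap set lxd daemon.debug=true
systemctl restart snap.lxd.daemon

# Follow the cluster/dqlite activity while the daemon starts
tail -f /var/snap/lxd/common/lxd/logs/lxd.log | grep -i dqlite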

Let me know what else I can do to help solve this problem.

tony


For reference, the reset script:

echo 'Resetting Networking'
systemctl restart systemd-networkd
sleep 10
echo 'Stopping Socket'
systemctl stop snap.lxd.daemon.unix.socket
sleep 10
echo 'Starting Unix Socket'
systemctl start snap.lxd.daemon.unix.socket
sleep 10
echo 'Stopping lxd'
systemctl stop snap.lxd.daemon
echo 'Starting lxd'
systemctl restart snap.lxd.daemon
sleep 10
lxc cluster list
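A possible refinement (a sketch, untested): instead of a fixed sleep after restarting networking, wait until this node's cluster address is actually assigned before touching LXD. The address below is only an example; each node would use its own cluster.https_address:

# Example address only; substitute this node's cluster.https_address
ADDR=10.0.0.100
echo 'Waiting for cluster address'
until ip -4 addr show | grep -q "$ADDR"; do
  sleep 2
done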

This is expected. If you shut down 2 out of the 3 database nodes, the cluster API becomes unavailable. It's the price to pay in order to be sure your LXD configuration data survives the loss of any one node.
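To make the arithmetic explicit: with 3 database (voter) nodes, the cluster needs a majority of them reachable, i.e. floor(3/2) + 1 = 2. With 2 of the 3 down only 1 voter is left, which is below quorum, so the global database cannot elect a leader and the clustered API blocks until a second voter returns.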

Next week I’ll try to reproduce this. If you have the relevant lxd.log files to share, that might be helpful too.

Why? Can't it go into a read-only mode and give a message? That would be better than just being stuck. In read-only mode it would be nice if you could start and stop individual containers and access them; you just couldn't add or delete them. It should also be ready for a second node to come online.

I will send you log files. Any in particular you're interested in?


BTW, you can test this with just two machines. The problem seems to be with cluster members 1 and 2.


Here is some additional info.
When you bring back the second cluster member and try to do lxc list or anything else, you get nothing, but something is still going on in the background.

If I do the following on both machines, it gets going:
root@Q0:/home/ic2000# pkill -9 -f "lxd --logfile"
root@Q0:/home/ic2000# lxd --debug --group lxd
DBUG[02-22|11:40:52] Connecting to a local LXD over a Unix socket
DBUG[02-22|11:40:52] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
DBUG[02-22|11:40:52] Got response struct from LXD
DBUG[02-22|11:40:52]
{
“config”: {
“cluster.https_address”: “10.0.0.100:8443”,
“core.https_address”: “10.0.0.100:8443”,
“core.trust_password”: true
},
“api_extensions”: [
“storage_zfs_remove_snapshots”,
“container_host_shutdown_timeout”,
“container_stop_priority”,
“container_syscall_filtering”,
“auth_pki”,
“container_last_used_at”,
“etag”,
“patch”,
“usb_devices”,
“https_allowed_credentials”,
“image_compression_algorithm”,
“directory_manipulation”,
“container_cpu_time”,
“storage_zfs_use_refquota”,
“storage_lvm_mount_options”,
“network”,
“profile_usedby”,
“container_push”,
“container_exec_recording”,
“certificate_update”,
“container_exec_signal_handling”,
“gpu_devices”,
“container_image_properties”,
“migration_progress”,
“id_map”,
“network_firewall_filtering”,
“network_routes”,
“storage”,
“file_delete”,
“file_append”,
“network_dhcp_expiry”,
“storage_lvm_vg_rename”,
“storage_lvm_thinpool_rename”,
“network_vlan”,
“image_create_aliases”,
“container_stateless_copy”,
“container_only_migration”,
“storage_zfs_clone_copy”,
“unix_device_rename”,
“storage_lvm_use_thinpool”,
“storage_rsync_bwlimit”,
“network_vxlan_interface”,
“storage_btrfs_mount_options”,
“entity_description”,
“image_force_refresh”,
“storage_lvm_lv_resizing”,
“id_map_base”,
“file_symlinks”,
“container_push_target”,
“network_vlan_physical”,
“storage_images_delete”,
“container_edit_metadata”,
“container_snapshot_stateful_migration”,
“storage_driver_ceph”,
“storage_ceph_user_name”,
“resource_limits”,
“storage_volatile_initial_source”,
“storage_ceph_force_osd_reuse”,
“storage_block_filesystem_btrfs”,
“resources”,
“kernel_limits”,
“storage_api_volume_rename”,
“macaroon_authentication”,
“network_sriov”,
“console”,
“restrict_devlxd”,
“migration_pre_copy”,
“infiniband”,
“maas_network”,
“devlxd_events”,
“proxy”,
“network_dhcp_gateway”,
“file_get_symlink”,
“network_leases”,
“unix_device_hotplug”,
“storage_api_local_volume_handling”,
“operation_description”,
“clustering”,
“event_lifecycle”,
“storage_api_remote_volume_handling”,
“nvidia_runtime”,
“container_mount_propagation”,
“container_backup”,
“devlxd_images”,
“container_local_cross_pool_handling”,
“proxy_unix”,
“proxy_udp”,
“clustering_join”,
“proxy_tcp_udp_multi_port_handling”,
“network_state”,
“proxy_unix_dac_properties”,
“container_protection_delete”,
“unix_priv_drop”,
“pprof_http”,
“proxy_haproxy_protocol”,
“network_hwaddr”,
“proxy_nat”,
“network_nat_order”,
“container_full”,
“candid_authentication”,
“backup_compression”,
“candid_config”,
“nvidia_runtime_config”,
“storage_api_volume_snapshots”,
“storage_unmapped”,
“projects”,
“candid_config_key”,
“network_vxlan_ttl”,
“container_incremental_copy”,
“usb_optional_vendorid”,
“snapshot_scheduling”,
“container_copy_project”,
“clustering_server_address”,
“clustering_image_replication”,
“container_protection_shift”,
“snapshot_expiry”,
“container_backup_override_pool”,
“snapshot_expiry_creation”,
“network_leases_location”,
“resources_cpu_socket”,
“resources_gpu”,
“resources_numa”,
“kernel_features”,
“id_map_current”,
“event_location”,
“storage_api_remote_volume_snapshots”,
“network_nat_address”,
“container_nic_routes”,
“rbac”,
“cluster_internal_copy”,
“seccomp_notify”,
“lxc_features”,
“container_nic_ipvlan”,
“network_vlan_sriov”,
“storage_cephfs”,
“container_nic_ipfilter”,
“resources_v2”,
“container_exec_user_group_cwd”,
“container_syscall_intercept”,
“container_disk_shift”,
“storage_shifted”,
“resources_infiniband”,
“daemon_storage”,
“instances”,
“image_types”,
“resources_disk_sata”,
“clustering_roles”,
“images_expiry”,
“resources_network_firmware”,
“backup_compression_algorithm”,
“ceph_data_pool_name”,
“container_syscall_intercept_mount”,
“compression_squashfs”,
“container_raw_mount”,
“container_nic_routed”,
“container_syscall_intercept_mount_fuse”,
“container_disk_ceph”,
“virtual-machines”,
“image_profiles”,
“clustering_architecture”,
“resources_disk_id”,
“storage_lvm_stripes”,
“vm_boot_priority”,
“unix_hotplug_devices”,
“api_filtering”,
“instance_nic_network”,
“clustering_sizing”
],
“api_status”: “stable”,
“api_version”: “1.0”,
“auth”: “trusted”,
“public”: false,
“auth_methods”: [
“tls”
],
“environment”: {
“addresses”: [
“10.0.0.100:8443”
],
“architectures”: [
“x86_64”,
“i686”
],
“certificate”: "-----BEGIN CERTIFICATE-----\nMIIB+D TOOK IT OUT
“driver”: “lxc”,
“driver_version”: “3.2.1”,
“kernel”: “Linux”,
“kernel_architecture”: “x86_64”,
“kernel_features”: {
“netnsid_getifaddrs”: “false”,
“seccomp_listener”: “false”,
“seccomp_listener_continue”: “false”,
“shiftfs”: “false”,
“uevent_injection”: “false”,
“unpriv_fscaps”: “true”
},
“kernel_version”: “4.15.0-88-generic”,
“lxc_features”: {
“cgroup2”: “false”,
“mount_injection_file”: “true”,
“network_gateway_device_route”: “true”,
“network_ipvlan”: “true”,
“network_l2proxy”: “true”,
“network_phys_macvlan_mtu”: “true”,
“network_veth_router”: “true”,
“seccomp_notify”: “true”
},
“project”: “default”,
“server”: “lxd”,
“server_clustered”: true,
“server_name”: “Q0”,
“server_pid”: 8691,
“server_version”: “3.21”,
“storage”: “zfs”,
“storage_version”: “0.7.5-1ubuntu16.7”
}
}
EROR[02-22|11:40:52] Failed to start the daemon: LXD is already running
INFO[02-22|11:40:52] Starting shutdown sequence
DBUG[02-22|11:40:52] Not unmounting temporary filesystems (containers are still running)
Error: LXD is already running

root@Q9:/home/ic2000# lxc cluster list
+------+-------------------------+----------+--------+-------------------+--------------+
| NAME | URL                     | DATABASE | STATE  | MESSAGE           | ARCHITECTURE |
+------+-------------------------+----------+--------+-------------------+--------------+
| Q0   | https://10.0.0.100:8443 | YES      | ONLINE | fully operational | x86_64       |
+------+-------------------------+----------+--------+-------------------+--------------+
| Q9   | https://10.0.0.109:8443 | YES      | ONLINE | fully operational | x86_64       |
+------+-------------------------+----------+--------+-------------------+--------------+

IT WORKS AFTER A KICK

I could not reproduce this. What I did:

  1. Create a cluster of 4 nodes
  2. Stop all nodes (using lxd shutdown)
  3. Start node 1
  4. Wait 5 minutes
  5. Start node 2
  6. Run lxc cluster list

The lxc cluster list command worked right away after node 2 was restarted and a quorum of 2 nodes was hence restored.

If you think you can reproduce this issue, please do the following:

  1. Turn on debug mode with snap set lxd daemon.debug=true
  2. Turn off all your 4 nodes. Double check that LXD is not running on any node (e.g. ps aux | grep lxd returns nothing)
  3. rm the /var/snap/lxd/common/lxd/logs/lxd.log file on all nodes. Or just mv it if you wish to keep a copy.
  4. Start node 1
  5. Wait 5 minutes
  6. Start node 2
  7. Run lxc cluster list

If the lxc cluster list command does not work, please send me the lxd.log files from those two nodes, or paste them here.
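If it helps, a quick way to pull both logs onto one machine (a sketch; node1 and node2 are placeholders for whatever the two nodes are called, and it assumes root ssh access):

for h in node1 node2; do
  scp root@$h:/var/snap/lxd/common/lxd/logs/lxd.log ./lxd-$h.log
done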

They all do the same… 12 retries

t=2020-02-27T11:21:45-0500 lvl=info msg=“LXD 3.21 is starting in normal mode” path=/var/snap/lxd/common/lxd
t=2020-02-27T11:21:45-0500 lvl=info msg=“Kernel uid/gid map:”
t=2020-02-27T11:21:45-0500 lvl=info msg=" - u 0 0 4294967295"
t=2020-02-27T11:21:45-0500 lvl=info msg=" - g 0 0 4294967295"
t=2020-02-27T11:21:45-0500 lvl=info msg=“Configured LXD uid/gid map:”
t=2020-02-27T11:21:45-0500 lvl=info msg=" - u 0 1000000 1000000000"
t=2020-02-27T11:21:45-0500 lvl=info msg=" - g 0 1000000 1000000000"
t=2020-02-27T11:21:45-0500 lvl=info msg=“Kernel features:”
t=2020-02-27T11:21:45-0500 lvl=info msg=" - netnsid-based network retrieval: no"
t=2020-02-27T11:21:45-0500 lvl=info msg=" - uevent injection: no"
t=2020-02-27T11:21:45-0500 lvl=info msg=" - seccomp listener: no"
t=2020-02-27T11:21:45-0500 lvl=info msg=" - seccomp listener continue syscalls: no"
t=2020-02-27T11:21:45-0500 lvl=info msg=" - unprivileged file capabilities: yes"
t=2020-02-27T11:21:45-0500 lvl=info msg=" - cgroup layout: hybrid"
t=2020-02-27T11:21:45-0500 lvl=warn msg=" - Couldn’t find the CGroup memory swap accounting, swap limits will be ignored"
t=2020-02-27T11:21:45-0500 lvl=info msg=" - shiftfs support: disabled"
t=2020-02-27T11:21:45-0500 lvl=info msg=“Initializing local database”
t=2020-02-27T11:21:45-0500 lvl=dbug msg=“Initializing database gateway”
t=2020-02-27T11:21:45-0500 lvl=dbug msg=“Start database node” address=10.0.0.101:8443 id=1 role=voter
t=2020-02-27T11:21:45-0500 lvl=eror msg=“Cannot listen on https socket, skipping…” err=“listen tcp 10.0.0.101:8443: bind: cannot assign requested address”
t=2020-02-27T11:21:45-0500 lvl=info msg=“Starting /dev/lxd handler:”
t=2020-02-27T11:21:45-0500 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-02-27T11:21:45-0500 lvl=info msg=“REST API daemon:”
t=2020-02-27T11:21:45-0500 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-02-27T11:21:45-0500 lvl=info msg=“Initializing global database”
t=2020-02-27T11:21:45-0500 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.101:8443/internal/database: dial tcp 10.0.0.1$
t=2020-02-27T11:21:45-0500 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.103:8443/internal/database: dial tcp 10.0.0.1$
t=2020-02-27T11:21:45-0500 lvl=dbug msg=“Dqlite: connection failed err=no available dqlite leader server found attempt=0”
t=2020-02-27T11:21:45-0500 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.101:8443/internal/database: dial tcp 10.0.0.1$
t=2020-02-27T11:21:45-0500 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.103:8443/internal/database: dial tcp 10.0.0.1$
t=2020-02-27T11:21:45-0500 lvl=dbug msg=“Dqlite: connection failed err=no available dqlite leader server found attempt=1”
t=2020-02-27T11:21:46-0500 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.101:8443/internal/database: dial tcp 10.0.0.1$
t=2020-02-27T11:21:46-0500 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.103:8443/internal/database: dial tcp 10.0.0.1$
t=2020-02-27T11:21:46-0500 lvl=dbug msg=“Dqlite: connection failed err=no available dqlite leader server found attempt=2”
t=2020-02-27T11:21:46-0500 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.101:8443/internal/database: dial tcp 10.0.0.1$
t=2020-02-27T11:21:46-0500 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.103:8443/internal/database: dial tcp 10.0.0.1$
t=2020-02-27T11:21:46-0500 lvl=dbug msg=“Dqlite: connection failed err=no available dqlite leader server found attempt=3”
t=2020-02-27T11:21:47-0500 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.101:8443/internal/database: dial tcp 10.0.0.1$
t=2020-02-27T11:21:47-0500 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.103:8443/internal/database: dial tcp 10.0.0.1$
t=2020-02-27T11:21:47-0500 lvl=dbug msg=“Dqlite: connection failed err=no available dqlite leader server found attempt=4”

Now it is even stranger. Trying another reboot after snap set lxd daemon.debug=false

At this point all 4 servers are toast; the only way I know how to fix it is to reinstall the LXD snap.

This means they can't listen on that port. It's a critical error. If you see that, nothing is going to work.

You should investigate why that happens.
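A couple of quick checks when that bind error shows up (a sketch; 10.0.0.101 is taken from the log above, substitute the failing node's cluster address):

# Is the address actually configured on an interface yet?
ip -4 addr show | grep 10.0.0.101

# Is something else already listening on port 8443?
ss -ltnp | grep 8443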

That has been the source of the problem from day one… That is why I do:
systemctl restart systemd-networkd
sleep 10
echo 'Stopping Socket'
systemctl stop snap.lxd.daemon.unix.socket
sleep 10
echo 'Starting Unix Socket'
systemctl start snap.lxd.daemon.unix.socket

and then restart LXD, and it mostly works.

LXD hangs on startup and takes unix.socket down with it: a catch-22.
My guess is that LXD is starting faster than networking and then gets stuck on the port before the network is ready.
Is it possible to set the LXD snap to not start automatically, so that I can then start it manually with something like lxd --group lxd?
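One way to check whether that race is real (a sketch, assuming systemd-networkd is what brings the interfaces up, as in the reset script) is to compare when networking finished with when LXD started on the current boot:

# Show the ordering chain that led to snap.lxd.daemon starting
systemd-analyze critical-chain snap.lxd.daemon.service

# Interleave networkd and LXD messages from the current boot
journalctl -b -u systemd-networkd -u snap.lxd.daemon --no-pager | head -n 50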

BTW, this new problem, after all the experimenting, looks like a damaged database. How can I fix it, or is it best just to blow away the whole cluster again?

root@Q1:/home/ic2000# lxd --group lxd
WARN[02-27|12:24:30] - Couldn’t find the CGroup memory swap accounting, swap limits will be ignored
EROR[02-27|12:24:30] Failed to start the daemon: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000003054-0000000000003076: entries batch 61 starting at byte 673280: entries count in preamble is zero
Error: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000003054-0000000000003076: entries batch 61 starting at byte 673280: entries count in preamble is zero

You can use systemd overrides to change the boot ordering; use systemctl edit snap.lxd.daemon and set something like:

[Unit]
After=whatever.service

where whatever.service is something that will already have started by the time LXD should be allowed to come online.

Yes, an ordering issue where networking is not ready when LXD starts is indeed one possibility.
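For the networking case specifically, a concrete version of that override might look like this (a sketch, assuming systemd-networkd is managing the interfaces, as the reset script above suggests):

# systemctl edit snap.lxd.daemon
[Unit]
Wants=network-online.target
After=network-online.target systemd-networkd-wait-online.service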

You can probably disable the service with:

systemctl disable snap.lxd.daemon

and then try to start the daemon manually, but @stgraber might know a bit more about the details than me in this regard.
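There is also a snap-native way to keep the daemon from auto-starting, which might be worth trying as well (a sketch; either approach should let you start LXD by hand for debugging):

# Stop the LXD services and keep them from starting at boot
snap stop --disable lxd

# Start the daemon by hand with debug output, as done earlier in the thread
lxd --debug --group lxd

# Later, restore automatic startup
snap start --enable lxd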

You should be able to solve this by:

rm /var/snap/lxd/common/database/global/0000000000003054-0000000000003076
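A slightly more cautious variant (a sketch): stop LXD first and move the segment aside instead of deleting it outright, so it can be restored if removing it turns out not to help:

# Path as given above
SEG=/var/snap/lxd/common/database/global/0000000000003054-0000000000003076
systemctl stop snap.lxd.daemon
mv "$SEG" "$SEG.bak"
systemctl start snap.lxd.daemon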