Unusual Issues - Related to 5.1 Release?

For the last 9 hours or so I’ve had some unusual issues when building clusters from scratch. For one thing, about 10 hours ago the nodes were installing 5.1, but a few hours later they were back to installing 5.0, so I had to do some quick work to get clusters to form while it was a mystery whether 5.0 or 5.1 would be installed. At the moment 5.0 is being installed by default.

Then Orabuntu-LXC started getting this error about 9 hours ago (and it is still happening) when trying to create the container. It succeeds eventually, but not before several retries:

==============================================
Launch Oracle LXD Seed Container...           
==============================================

==============================================
(takes a minute or two ... patience) ...      
==============================================

Creating oel84d10
Error: Failed instance creation: Post "http://unix.socket/1.0/instances": net/http: timeout awaiting response headers
Launch of LXD container encountered an unexpected issue.  Retrying launch ...

Creating oel84d10
Error: Failed instance creation: Post "http://unix.socket/1.0/instances": net/http: timeout awaiting response headers
Launch of LXD container encountered an unexpected issue.  Retrying launch ...

Creating oel84d10
Error: Failed instance creation: Post "http://unix.socket/1.0/instances": net/http: timeout awaiting response headers
Launch of LXD container encountered an unexpected issue.  Retrying launch ...

Creating oel84d11
Error: Failed instance creation: Post "http://unix.socket/1.0/instances": net/http: timeout awaiting response headers
Launch of LXD container encountered an unexpected issue.  Retrying launch ...

Creating oel84d12

The Orabuntu-LXC software keeps retrying for a while to see if it can get a successful result, and it also increments the name of the container as part of that retry. As you can see, it tries the same container name 3 times, then it starts incrementing the name to see if that will help. I have not seen this fail-safe code active for quite a while, so I am wondering if there is something going on with container launch as well. The output is cut off in the copy and paste, but the oel84d12 launch did succeed on the first try without any further drama.
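For reference, that fail-safe boils down to something like the following sketch (this is an illustration of the pattern only, not the actual Orabuntu-LXC code; the image alias, name prefix, and retry limits are made up):

#!/bin/bash
# Sketch of the retry logic described above: retry the same container name a
# few times, then start incrementing the numeric suffix. Image alias, prefix,
# and limits are hypothetical.
image="images:oraclelinux/8"   # hypothetical image alias
prefix="oel84d"
suffix=10
tries=0

while [ "${suffix}" -le 15 ]; do
    name="${prefix}${suffix}"
    echo "Creating ${name}"
    if lxc launch "${image}" "${name}"; then
        echo "Launched ${name}"
        break
    fi
    echo "Launch of LXD container encountered an unexpected issue.  Retrying launch ..."
    tries=$((tries + 1))
    if [ "${tries}" -ge 3 ]; then
        tries=0
        suffix=$((suffix + 1))   # this name failed 3 times, move on to the next one
    fi
    sleep 10
done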

Thanks

Gil

Can you please post the LXD server logs so we can see why LXD is restarting?

Thanks @tomp. Sorry, dumb question: where do I find the logs you want to see?

Are you running the snap package?

Yes snap

The logs are in /var/snap/lxd/common/lxd/logs/lxd.log
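For the snap package, a couple of commands that may also help while watching for restarts (paths assume the snap install, and systemd):

# Follow the daemon log while reproducing the failure
tail -f /var/snap/lxd/common/lxd/logs/lxd.log

# The snap daemon's systemd journal often shows why the service restarted
journalctl -u snap.lxd.daemon -n 100 --no-pager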

Thank you. Here is everything in there with a “log” extension:

[root@r8sv1 ~]# cd /var/snap/lxd/common/lxd/logs
[root@r8sv1 logs]# ls -lrt
total 8
-rw-------. 1 root root 1268 May  3 08:18 lxd.log.1
-rw-r--r--. 1 root root    0 May  3 08:18 dnsmasq.lxdbr0.log
-rw-------. 1 root root  704 May  3 08:18 lxd.log
drwxr-xr-x. 2 root root   21 May  3 08:22 oel84d10
drwxr-xr-x. 2 root root   21 May  3 08:22 oel84d11
drwxr-xr-x. 2 root root  116 May  3 08:59 maestro
drwxr-xr-x. 2 root root  116 May  3 08:59 violin1
drwxr-xr-x. 2 root root  116 May  3 09:00 violin2
drwxr-xr-x. 2 root root  116 May  3 09:02 ora84d10
drwxr-xr-x. 2 root root  116 May  3 09:03 ora84d11
drwxr-xr-x. 2 root root  116 May  3 09:03 oel84d12
[root@r8sv1 logs]# cat lxd.log
time="2022-05-03T08:18:43-05:00" level=warning msg="AppArmor support has been disabled because of lack of kernel support"
time="2022-05-03T08:18:43-05:00" level=warning msg=" - AppArmor support has been disabled, Disabled because of lack of kernel support"
time="2022-05-03T08:18:43-05:00" level=warning msg=" - Couldn't find the CGroup blkio.weight, disk priority will be ignored"
time="2022-05-03T08:18:43-05:00" level=warning msg="Instance type not operational" driver=qemu err="KVM support is missing (no /dev/kvm)" type=virtual-machine
time="2022-05-03T08:18:44-05:00" level=warning msg="Failed to initialize fanotify, falling back on fsnotify" err="Failed to initialize fanotify: invalid argument"
[root@r8sv1 logs]# date
Tue May  3 09:48:32 CDT 2022
[root@r8sv1 logs]# cat lxd.log.1
time="2022-05-03T08:17:39-05:00" level=warning msg="AppArmor support has been disabled because of lack of kernel support"
time="2022-05-03T08:17:39-05:00" level=warning msg=" - AppArmor support has been disabled, Disabled because of lack of kernel support"
time="2022-05-03T08:17:39-05:00" level=warning msg=" - Couldn't find the CGroup blkio.weight, disk priority will be ignored"
time="2022-05-03T08:17:39-05:00" level=warning msg="Instance type not operational" driver=qemu err="KVM support is missing (no /dev/kvm)" type=virtual-machine
time="2022-05-03T08:17:40-05:00" level=warning msg="Failed to initialize fanotify, falling back on fsnotify" err="Failed to initialize fanotify: invalid argument"
time="2022-05-03T08:17:43-05:00" level=warning msg="Failed adding member event listener client" err="Unable to connect to: 0.0.0.0:443 ([dial tcp 0.0.0.0:443: connect: connection refused])" local="10.209.53.1:8443" remote=0.0.0.0
time="2022-05-03T08:18:40-05:00" level=warning msg="Failed to update instance types: Get \"https://us.lxd.images.canonical.com/meta/instance-types/aws.yaml\": context canceled"
time="2022-05-03T08:18:40-05:00" level=warning msg="Could not handover member's responsibilities" err="Failed to transfer leadership: No online voter found"
[root@r8sv1 logs]# cat dnsmasq.lxdbr0.log 
[root@r8sv1 logs]#

Do you see the lxd process changing (indicating LXD is restarting)?

Not sure, but I would guess that the presence of lxd.log.1 indicates there was a restart?

ps aux | grep lxd
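For example (a sketch, assuming the snap package), comparing the lxd daemon’s PID and start time over a couple of samples will show whether it is being restarted:

# If the PID or start time of the lxd daemon changes between samples,
# the daemon has restarted in the meantime.
ps -o pid,lstart,cmd -C lxd
sleep 60
ps -o pid,lstart,cmd -C lxd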

As you’re using clusters, you need to make sure all members are running the same snap revision; see snap info lxd.

[root@r8sv1 logs]# ps -aux | grep lxd
root      622014  0.0  0.0 298932  1476 ?        Sl   08:17   0:03 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
root      622928  0.0  0.0   2624    80 ?        Ss   08:18   0:00 /bin/sh /snap/lxd/23001/commands/daemon.start
root      623047  1.1  0.9 2033668 75132 ?       Sl   08:18   1:04 lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
lxd       623135  0.0  0.0   7428   376 ?        Ss   08:18   0:00 dnsmasq --keep-in-foreground --strict-order --bind-interfaces --except-interface=lo --pid-file= --no-ping --interface=lxdbr0 --dhcp-rapid-commit --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.119.162.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.119.162.2,10.119.162.254,1h -s lxd --interface-name _gateway.lxd,lxdbr0 -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd -g lxd
root      681662  0.0  0.2 1273640 18104 ?       Ss   08:59   0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers maestro
root      681882  0.0  0.2 1273640 16060 ?       Ss   08:59   0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers violin1
root      682511  0.0  0.1 1200164 15120 ?       Ss   08:59   0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers violin2
root      690622  0.0  0.1 1273640 13312 ?       Ss   09:02   0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers ora84d10
root      694073  0.0  0.1 1273640 13184 ?       Ss   09:03   0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers ora84d11
root      695055  0.0  0.1 1273640 13096 ?       Ss   09:03   0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers oel84d12
root      733830  0.0  0.0  12136  1136 pts/1    S+   09:55   0:00 grep --color=auto lxd
[root@r8sv1 logs]#

Yes, I did actually check ps -aux :slight_smile:

This was the install of the first node in the cluster. As mentioned in the original post :slight_smile: I did run into issues with some of the nodes pulling down LXD 5.0 and some pulling 5.1. It was odd because the first node pulled 5.1 and then the second node pulled 5.0, when intuition would suggest the other way round. Anyway, yes, I’ve been watching that carefully since 5.1 came out, because it did cause some hiccups in cluster deployments.

You may be affected by phased rollouts. What does snap info lxd on each member show?
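For example (the hostnames below are placeholders, not ones from this thread), something like this compares what each member is actually running and nudges a lagging member forward:

# Show the installed revision and tracked channel on each cluster member
# (hostnames are hypothetical).
for host in node1 node2 node3 node4 node5; do
    echo "== ${host} =="
    ssh "${host}" "snap info lxd | grep -E 'tracking:|installed:'"
done

# On a member still holding the old revision, refresh it explicitly,
# pinning the channel the other members are tracking.
snap refresh lxd --channel=latest/stable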

Yes, I’ve been checking snap info lxd, and Orabuntu-LXC runs that command automatically as part of the cluster deployments. And yes, it has been a bit unpredictable. All these odd little issues that cropped up in the last 24 hours or so may just go away by themselves once 5.1 “settles in”, so to speak. Right now Orabuntu-LXC is deploying the second node on Red Hat 8, so we’ll see how that goes. I’m testing for the upcoming release of Orabuntu-LXC v7.0.0-alpha ELENA, and the testing consists of having Orabuntu-LXC auto-deploy 5-node LXD clusters.

Orabuntu-LXC retries the cluster formation using a preseed several times, so I can work around the issue of phased rollouts by upgrading a wayward node to 5.1 on the fly (the preseed retry waits 2 minutes before each attempt), after which the preseed succeeds.
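The workaround amounts to something like this sketch (the file name, retry count, and snap refresh step are assumptions, not the actual Orabuntu-LXC code):

# Retry the cluster-join preseed, refreshing the snap first so the joining
# member matches the revision of the existing members. Path and retry count
# are hypothetical.
preseed=/tmp/lxd-join.yaml

for attempt in 1 2 3 4 5; do
    if lxd init --preseed < "${preseed}"; then
        echo "Cluster join succeeded on attempt ${attempt}"
        break
    fi
    echo "Join attempt ${attempt} failed; refreshing snap and retrying in 2 minutes ..."
    snap refresh lxd
    sleep 120
done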

It was more the failure of the container launch that I found surprising. Other container launches later in the Orabuntu-LXC flow were working fine. It just seemed to be that first one after the cluster had formed that kept consistently running into issues.
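One mitigation I may try for that first post-join launch (just an idea on my part, not something confirmed in this thread) is to wait for the daemon to report itself ready before launching:

# Block until the LXD daemon reports ready (or give up after 5 minutes)
lxd waitready --timeout=300

# Lightweight check that the local API socket is answering
lxc query /1.0 >/dev/null && echo "LXD API is responding"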

Thanks for your help @tomp! I will post the results of the 5-node build much later today (in about 12 hours). Bye for now …


I’m having the same problem. What command should I end up running to fix it?