Hi everyone!
I've run into a serious failure in one of my lxd clusters. In short: all three machines ran out of free disk space in /var and all three lxd instances broke. This is kind of my fault, and I'm not here to get help on how to monitor my machines and so on; that's not the purpose of my post.
Despite this, I was able to copy my containers to a fourth machine and re-install a brand new lxd cluster on these three machines.
This part did not go without trouble, and I think some things are not clear in the install process. Let me explain.
On the three machines, at the first install attempt, /etc/environment set variables such as http_proxy, https_proxy, ftp_proxy and no_proxy. And, since these machines can't reach the outside without a proxy, I also had the first machine set up with lxd's core config keys: core.proxy_http, core.proxy_https and core.proxy_ignore_hosts. This led to trouble, at least with a brand new lxd-3.19 installation.
So I ended up throwing all of this away: with no proxy settings at all, the three machines came back online and I just had to move my containers back onto the cluster.
That said, I also have two other clusters which were installed maybe more than a year ago.
The fact is that I had entries in my proxy log file complaining that the machines of these two clusters were trying to talk to each other via the proxy. That shouldn’t happen. So, I checked the config variables.
http_proxy and https_proxy (both in the shell env), core.proxy_http and core.proxy_https are set up the way they should be, the env var no_proxy is set to "localhost, 127.0.0.1, .mydomain.com, 10.0.0.0/24", and lxd's core.proxy_ignore_hosts config key has the same value.
Since the 6 machines are on network 10.0.0.0/24, I expected this to work out of the box: the machine with IP 10.0.0.1 should talk directly to the machine with IP 10.0.0.2 without going through the proxy. It does not, according to the logs on my proxy machine and the tcpdump I ran on each machine. Why is that?
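To make my expectation concrete, here is a minimal Python sketch of the matching I assumed would happen. It is not lxd's code, and I don't know which of the two behaviours lxd (or whatever library it uses underneath) actually implements; the function names bypasses_proxy and suffix_only_match are made up for this illustration. The first one understands CIDR entries, the second one only compares strings.

```python
import ipaddress


def _parse_network(entry):
    """Return an ip_network for 'a.b.c.d' or 'a.b.c.d/nn', else None."""
    try:
        return ipaddress.ip_network(entry, strict=False)
    except ValueError:
        return None


def _parse_address(host):
    """Return an ip_address if host is an IP literal, else None."""
    try:
        return ipaddress.ip_address(host)
    except ValueError:
        return None


def bypasses_proxy(host, no_proxy):
    """CIDR-aware matching (hypothetical helper, what I expected to happen)."""
    address = _parse_address(host)
    for raw in no_proxy.split(","):
        entry = raw.strip()  # tolerate "a, b" lists written with spaces
        if not entry:
            continue
        network = _parse_network(entry)
        if network is not None:
            # IP or CIDR entry: only meaningful when host is an IP literal.
            if address is not None and address in network:
                return True
            continue
        # Host name or domain suffix entry such as ".mydomain.com".
        suffix = entry.lstrip(".").lower()
        name = host.lower()
        if name == suffix or name.endswith("." + suffix):
            return True
    return False


def suffix_only_match(host, no_proxy):
    """String-only matching (hypothetical helper): CIDR entries never match."""
    entries = [e.strip() for e in no_proxy.split(",") if e.strip()]
    return any(host == e or host.endswith(e) for e in entries)


NO_PROXY = "localhost, 127.0.0.1, .mydomain.com, 10.0.0.0/24"
```

With my no_proxy value, bypasses_proxy("10.0.0.2", NO_PROXY) is true while suffix_only_match("10.0.0.2", NO_PROXY) is false, so part of my question is which kind of matching applies here.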
Since I couldn't work out what was wrong, I commented out the lines in /etc/environment, unset lxd's config keys and restarted all the lxd daemons, with no success.
The core.* keys no longer exist, as reported by “lxc config show” and lxd sql global “select * from config”.
So, there might be something left somewhere, but I couldn’t figure out clearly what it is.
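One place where something could be left over is the environment of the running daemon itself, since on Linux /proc/<pid>/environ shows the environment a process was started with. Below is a small sketch of how I could check that; the helper names are made up, the PID has to be looked up separately, and reading another user's /proc entry normally needs root. Whether lxd's proxy behaviour really depends on these variables is an assumption on my side.

```python
from pathlib import Path

# Variable names I care about (compared case-insensitively).
PROXY_NAMES = {"http_proxy", "https_proxy", "ftp_proxy", "no_proxy"}


def proxy_vars(raw):
    """Extract proxy-related variables from NUL-separated environ bytes."""
    found = {}
    for item in raw.split(b"\0"):
        if b"=" not in item:
            continue  # skip empty trailing chunk or malformed entries
        key, _, value = item.partition(b"=")
        name = key.decode(errors="replace")
        if name.lower() in PROXY_NAMES:
            found[name] = value.decode(errors="replace")
    return found


def proxy_vars_of_pid(pid):
    """Read /proc/<pid>/environ (Linux only) and filter it with proxy_vars."""
    return proxy_vars(Path(f"/proc/{pid}/environ").read_bytes())
```

Calling proxy_vars_of_pid with the PID of the lxd process on each machine would show whether the restarted daemons still carry the old proxy settings.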
I went to /var/snap/lxd/common/lxd/databases/global and found a db.bin file in there.
Running sqlite3 db.bin “select * from config” on it, I found that the core.* config keys I mentioned earlier are still in there. Can I safely remove these entries from this file? Should I stop lxd beforehand and restart it afterwards?
I don't want to break my two clusters, so I'd rather not touch anything without being sure.
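In the meantime, to keep looking without touching the live file, I can query a copy of it. This is only a sketch under assumptions: it presumes the file opens as a plain SQLite database with a config table (which is what my sqlite3 command suggested), and the function name read_config_rows is made up. It copies the file to a temporary directory and opens the copy read-only, so nothing under /var/snap/lxd is modified.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def read_config_rows(db_path):
    """Return all rows of the config table, read from a temporary copy."""
    with tempfile.TemporaryDirectory() as tmp:
        copy = Path(tmp) / "db-copy.bin"
        shutil.copy2(db_path, copy)  # work on a copy, never on the original
        # Open the copy read-only via an SQLite URI.
        conn = sqlite3.connect(copy.as_uri() + "?mode=ro", uri=True)
        try:
            return conn.execute("SELECT * FROM config").fetchall()
        finally:
            conn.close()
```

For example, read_config_rows("/var/snap/lxd/common/lxd/databases/global/db.bin") should list the same rows I saw with the sqlite3 command line tool, without any risk of writing to the file.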
Any help or explanation would be much appreciated!
Thanks in advance.