Last Snap Refresh has left my LXD cluster barely functioning again - "unix.socket: connect: connection refused"

Tony_Anytime · November 29, 2020, 6:03am

On all machines …
lxc --version
4.8

My containers are running…
Nothing else works
Afraid of rebooting

lxc cluster list
Error: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused

None of the usual kickstarters seem to work:
pkill -9 -f "lxd --logfile"
systemctl restart snap.lxd.daemon

I eventually got all of them to report Started after snap start lxd.daemon.

But all lxc commands still go to la la land on three machines.
One of the four machines gives Error: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: no such file or directory

Where do I start to get this running without breaking the running containers?

thanks

tony

stgraber · November 30, 2020, 3:04am

journalctl -u snap.lxd.daemon -n 300
systemctl -a | grep lxd
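
If text is easier to attach than screenshots, the output of the two commands above can be captured to files. This is only a minimal sketch and not something from the thread: the function name collect_lxd_logs and the output file names are made up for illustration, and --no-pager is added so the output is not cut off by a pager.

```shell
# Hypothetical helper: save the requested diagnostics as text files.
# Usage: collect_lxd_logs [output-directory]
collect_lxd_logs() {
    outdir=${1:-.}
    host=$(uname -n)    # node name, to tell the machines apart
    mkdir -p "$outdir" || return 1

    # Last 300 journal lines for the LXD daemon unit.
    journalctl -u snap.lxd.daemon -n 300 --no-pager \
        > "$outdir/$host-journal.txt" 2>&1 || true

    # Every systemd unit mentioning lxd, including inactive ones.
    { systemctl -a --no-pager | grep lxd; } \
        > "$outdir/$host-units.txt" 2>&1 || true

    # Print the paths of the files that were written.
    printf '%s\n' "$outdir/$host-journal.txt" "$outdir/$host-units.txt"
}
```

Running collect_lxd_logs /tmp/lxd-debug as root on each machine would leave two files per node that can be attached in full.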

Tony_Anytime · November 30, 2020, 4:33pm

Hope this helps. It looks to me like it didn’t upgrade fully and the database got stuck… But then it could be something completely different.

BTW, the machines are Q1 to Q4; last I remember, Q1 was not a database manager. They were all running fine before the upgrade and had upgraded to 4.7 by themselves just fine. I had updated/rebooted the servers without problems a couple of days before.

I can send you the full journalctl output as a file, but let me know where to send it.

Thanks

Tony

stgraber · November 30, 2020, 5:14pm

Sounds like at least one of them didn’t refresh causing some issues.
Can you look at snap list on all of them to see what revision they’re on?
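
Checking each machine by hand works, but a small loop can make a revision mismatch easier to spot. A rough sketch, assuming SSH access to every node; the host names Q1 to Q4 are the ones mentioned earlier in the thread, while both function names are hypothetical.

```shell
# Read `snap list` output on stdin and print "version revision" for lxd.
parse_lxd_snap() {
    awk '$1 == "lxd" { print $2, $3 }'
}

# Assumption: each host is reachable over SSH and has snap installed.
# Usage: check_lxd_revisions Q1 Q2 Q3 Q4
check_lxd_revisions() {
    for host in "$@"; do
        info=$(ssh "$host" snap list lxd 2>/dev/null | parse_lxd_snap)
        # "unknown" means the host was unreachable or lxd was not listed.
        printf '%s: %s\n' "$host" "${info:-unknown}"
    done
}
```

Any node whose line differs from the others is the one that has not refreshed.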

stgraber · November 30, 2020, 5:15pm

Your screenshots don’t actually show the bottom of some of the journal output, so it’s hard to tell exactly what state they’re in; I just made my comment above based on the limited information visible.

Tony_Anytime · November 30, 2020, 6:14pm

One of the servers does show it is using LXD 4.7, even though lxd/lxc version showed 4.8.
I ran snap refresh on it and it seems to be twirling away for many minutes.
Sending journal file pics separately.

Tony_Anytime · November 30, 2020, 6:18pm

stgraber · November 30, 2020, 6:28pm

Ok, on the one that’s stuck, do ps fauxww | grep lxd.*logfile to find the current LXD process and use kill -9 <PID> to kill it. That should unstick the refresh at which point they’ll all line up on the same version again and should be much happier.
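
The same lookup can be done without the grep process matching itself. This is a sketch only: lxd_pids is a hypothetical helper that reads ps aux output and prints the PID of any process whose command is lxd followed by --logfile. It assumes the usual ps aux layout, where the command starts in field 11.

```shell
# Read `ps aux` output on stdin; print PIDs of processes started as
# "lxd --logfile ...". Field 11 is the command, field 12 its first argument.
lxd_pids() {
    awk '$11 ~ "(^|/)lxd$" && $12 == "--logfile" { print $2 }'
}

# Typical use, as root on the stuck node:
#   ps aux | lxd_pids     # review the PID first
#   kill -9 <PID>         # then kill it by hand
```

Keeping the kill as a separate manual step leaves a chance to confirm the PID before anything is terminated.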

Tony_Anytime · November 30, 2020, 6:32pm

Still shows 4.7

stgraber · November 30, 2020, 6:33pm

What does a snap refresh lxd do now?

stgraber · November 30, 2020, 6:33pm

Also, can you check all 3 other systems with ps aux | grep lxd.*logfile to confirm they have LXD running and just waiting for the update to go through?

Tony_Anytime · November 30, 2020, 6:33pm

this is server

Tony_Anytime · November 30, 2020, 6:35pm

ll

stgraber · November 30, 2020, 6:36pm

Ok, try systemctl start snap.lxd.daemon on the one which isn’t running right now.

stgraber · November 30, 2020, 6:37pm

Never mind, that’s the broken one…

Tony_Anytime · November 30, 2020, 6:38pm

Seems to have taken
root@Q2:/home/ic2000# ps aux | grep lxd.*logfile
root 11113 33.0 0.1 2720292 56064 ? Sl 13:37 0:00 lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd --debug

stgraber · November 30, 2020, 6:39pm

Hmm, but that’s probably still the old LXD so it won’t stay up long before exiting due to not being the right version.

Unless the refresh finally unstuck and happened on that box?

Tony_Anytime · November 30, 2020, 6:39pm

But I don’t think they are talking to each other

Tony_Anytime · November 30, 2020, 6:40pm

Yep it died

stgraber · November 30, 2020, 6:42pm

Weird. Q2 not wanting to upgrade is definitely the issue: none of the others will process any request until Q2 has updated to 4.8.

Maybe try:

systemctl daemon-reload
systemctl stop snap.lxd.daemon.service
systemctl stop snap.lxd.daemon.unix.socket
snap refresh lxd

See if clearing systemd’s mind about the state of things helps?
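
Once the refresh goes through, it may be worth confirming that every node reports the same version, since the cluster will not serve requests while one member lags behind. A minimal sketch of that comparison: versions_agree is a hypothetical helper that reads lines of the form "host version" (gathered however is convenient) and groups hosts by version.

```shell
# Read "host version" lines on stdin. Print one line per distinct version
# with the hosts running it. Exit status 0 only if exactly one version is seen.
versions_agree() {
    awk '
        { seen[$2] = seen[$2] " " $1 }
        END {
            n = 0
            for (v in seen) { n++; printf "%s:%s\n", v, seen[v] }
            exit (n == 1 ? 0 : 1)
        }
    '
}

# Example with hand-entered values:
#   printf "Q1 4.8\nQ2 4.7\n" | versions_agree || echo "versions differ"
```

More than one output line means some member still needs the refresh.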