Here we go again...Upgrade to 3.17 Causing - Error: Get http://unix.socket/1.0: EOF

Tony_Anytime · September 9, 2019, 2:20pm

I started upgrading my servers from 3.16 to 3.17 and I am getting this error in cluster.
So wondering what is the best way to proceed, upgrade all and reboot, or is there something else wrong. I don’t want all my machines to become useless again and start a chain of reinstalling LXD on all my servers again. Presently Larry(LOU) , CHEMP and CURLYJOE on 3.17, have a problem, and JOE, MOE which are on 3.16 seem fine. Thanks in advance for your wisdom.

systemctl stop lxd.service lxd.socket
Failed to stop lxd.service: Unit lxd.service not loaded.
Failed to stop lxd.socket: Unit lxd.socket not loaded.
root@LOU:/home/ic2000# lxd --debug --group lxd
DBUG[09-09|09:56:42] Connecting to a local LXD over a Unix socket
DBUG[09-09|09:56:42] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
INFO[09-09|09:56:48] LXD 3.17 is starting in normal mode path=/var/snap/lxd/common/lxd
INFO[09-09|09:56:48] Kernel uid/gid map:
INFO[09-09|09:56:48] - u 0 0 4294967295
INFO[09-09|09:56:48] - g 0 0 4294967295
INFO[09-09|09:56:48] Configured LXD uid/gid map:
INFO[09-09|09:56:48] - u 0 1000000 1000000000
INFO[09-09|09:56:48] - g 0 1000000 1000000000
WARN[09-09|09:56:48] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[09-09|09:56:48] Kernel features:
INFO[09-09|09:56:48] - netnsid-based network retrieval: no
INFO[09-09|09:56:48] - uevent injection: no
INFO[09-09|09:56:48] - seccomp listener: no
INFO[09-09|09:56:48] - unprivileged file capabilities: yes
INFO[09-09|09:56:48] - shiftfs support: no
INFO[09-09|09:56:48] Initializing local database
DBUG[09-09|09:56:48] Initializing database gateway
DBUG[09-09|09:56:48] Start database node id=2 address=64.71.77.80:8443
DBUG[09-09|09:56:48] Connecting to a local LXD over a Unix socket
DBUG[09-09|09:56:48] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
DBUG[09-09|09:57:00] Detected stale unix socket, deleting
DBUG[09-09|09:57:00] Detected stale unix socket, deleting
INFO[09-09|09:57:00] Starting /dev/lxd handler:
INFO[09-09|09:57:00] - binding devlxd socket socket=/var/snap/lxd/common/lxd/devlxd/sock
INFO[09-09|09:57:00] REST API daemon:
INFO[09-09|09:57:00] - binding Unix socket socket=/var/snap/lxd/common/lxd/unix.socket
INFO[09-09|09:57:00] - binding TCP socket socket=64.7xx.80:8443
INFO[09-09|09:57:00] Initializing global database
DBUG[09-09|09:57:01] Found cert name=0
WARN[09-09|09:57:05] Dqlite client proxy TLS -> Unix: read tcp 64.xx.80:51906->64.xx.32:8443: use of closed network connection
DBUG[09-09|09:57:05] Dqlite: server connection failed err=failed to establish network connection: Head https://64.xx.32:8443/internal/database: context deadline exceeded address=64.xxx.32:8443 attempt=0
DBUG[09-09|09:57:05] Found cert name=0
DBUG[09-09|09:57:05] Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=64.xxx.80:8443 attempt=0
DBUG[09-09|09:57:05] Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=64.xx.13:8443 attempt=0
DBUG[09-09|09:57:05] Dqlite: connection failed err=no available dqlite leader server found attempt=0
DBUG[09-09|09:57:06] Found cert name=0
DBUG[09-09|09:57:06] Found cert name=0
DBUG[09-09|09:57:07] Found cert name=0
DBUG[09-09|09:57:07] Found cert name=0
DBUG[09-09|09:57:07] Found cert name=0
DBUG[09-09|09:57:07] Found cert name=0
DBUG[09-09|09:57:08] Found cert name=0
DBUG[09-09|09:57:09] Found cert name=0
DBUG[09-09|09:57:10] Found cert name=0
WARN[09-09|09:57:10] Dqlite client proxy TLS -> Unix: read tcp 64.xx.80:51920->64.xxx.32:8443: use of closed network connection
DBUG[09-09|09:57:10] Dqlite: server connection failed err=failed to establish network connection: Head https://64.xx.32:8443/internal/database: context deadline exceeded address=64.xxx.32:8443 attempt=1
DBUG[09-09|09:57:10] Dqlite: server connection failed err=failed to establish network connection: Head https://64.71.77.80:8443/internal/database: context deadline exceeded address=64.xx.80:8443 attempt=1
DBUG[09-09|09:57:10] Dqlite: server connection failed err=failed to establish network connection: Head https://64.xxx.13:8443/internal/database: context deadline exceeded address=64.xxxx.13:8443 attempt=1
DBUG[09-09|09:57:10] Dqlite: connection failed err=no available dqlite leader server found attempt=1
DBUG[09-09|09:57:10] Failed connecting to global database (attempt 0): failed to create dqlite connection: no available dqlite leader server found
DBUG[09-09|09:57:11] Found cert name=0
DBUG[09-09|09:57:11] Replace current raft nodes with [{ID:1 Address:64.xx.32:8443} {ID:2 Address:64.xx.80:8443} {ID:4 Address:64.xxx.13:8443}]
DBUG[09-09|09:57:11] Partial node list heartbeat received, skipping full update
DBUG[09-09|09:57:11] Found cert name=0
DBUG[09-09|09:57:11] Replace current raft nodes with [{ID:1 Address:64.xxx.32:8443} {ID:2 Address:64.xx.80:8443} {ID:4 Address:64.xxx.13:8443}]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x926824]

goroutine 243 [running]:
sync.(*RWMutex).RLock(…)
/snap/go/4301/src/sync/rwmutex.go:48
github.com/lxc/lxd/lxd/db.(*Cluster).Transaction(0x0, 0xc000719dc0, 0x0, 0x0)
/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/db/db.go:330 +0x34
github.com/lxc/lxd/lxd/cluster.MaybeUpdate(0xc000719e48, 0x0, 0x0)
/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/cluster/upgrade.go:68 +0x95
main.(*Daemon).NodeRefreshTask(0xc0001eed80, 0xc00022ec80)
/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/daemon.go:1339 +0x4d0
created by github.com/lxc/lxd/lxd/cluster.(*Gateway).HandlerFuncs.func1
/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/cluster/gateway.go:218 +0xe49

TomvB · September 9, 2019, 2:25pm

Same here. I can’t use my servers right now… High prio issue.
3 nodes.

stgraber · September 9, 2019, 2:47pm

Can you show on an affected node:

rm /var/snap/lxd/common/lxd/unix.socket
lxd --debug --group lxd

So far the only issue we’ve seen is people with bad database causing the schema migration to fail and needing a small patch to clean things up.

stgraber · September 9, 2019, 4:06pm

From your logs, can you figure out if you were also getting a segfault/panic like the original reporter or something else?

stgraber · September 9, 2019, 4:07pm

For the original reporter. LXD 3.16 -> 3.17 includes a database schema change, for it to actually kick off, you need all the members to refresh to 3.17. This normally gets triggered automatically for you though that actually is right in the code that paniced so it’s likely the issue.

My recommendation is for you to manually run snap refresh lxd on all your systems, that will bring them all up to 3.17 at which point they can startup and process the upgrade.

stgraber · September 9, 2019, 4:10pm

@tomp will take a look at that upgrade code to make sure there’s nothing we missed.

For testing, you can easily deploy a small cluster of 3 nodes using snap install lxd --channel=3.16/stable and switch them to 3.17 with snap refresh lxd --channel=stable. If the issue happens on all 3.16 to 3.17 triggers, that should do the trick.

fwaggle · September 9, 2019, 4:11pm

@stgraber

We get a repeating core-dump on lxd, we’re not using clustering:

INFO[09-09|16:08:47] LXD 3.17 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[09-09|16:08:47] Kernel uid/gid map: 
INFO[09-09|16:08:47]  - u 0 0 4294967295 
INFO[09-09|16:08:47]  - g 0 0 4294967295 
INFO[09-09|16:08:47] Configured LXD uid/gid map: 
INFO[09-09|16:08:47]  - u 0 1000000 1000000000 
INFO[09-09|16:08:47]  - g 0 1000000 1000000000 
INFO[09-09|16:08:47] Kernel features: 
INFO[09-09|16:08:47]  - netnsid-based network retrieval: no 
INFO[09-09|16:08:47]  - uevent injection: no 
INFO[09-09|16:08:47]  - seccomp listener: no 
INFO[09-09|16:08:47]  - unprivileged file capabilities: yes 
[New LWP 25889]
INFO[09-09|16:08:47]  - shiftfs support: no 
INFO[09-09|16:08:47] Initializing local database 
DBUG[09-09|16:08:47] Initializing database gateway 
DBUG[09-09|16:08:47] Start database node                      id=1 address=
[New LWP 25919]

Thread 23 "lxd" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 25919]
0x00007f901f8d3156 in ?? () from target:/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f901f8d3156 in ?? () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f901fb63346 in ?? () from target:/snap/lxd/11824/lib/libdqlite.so.0
#2  0x00007f901fb6437c in vfsFileWrite () from target:/snap/lxd/11824/lib/libdqlite.so.0
#3  0x00007f901fb5a953 in ?? () from target:/snap/lxd/11824/lib/libdqlite.so.0
#4  0x00007f901e6bad06 in snapshotRestore (r=r@entry=0x34bcbf8, snapshot=0x7f9008004570) at src/snapshot.c:31
#5  0x00007f901e6bb029 in raft_start (r=0x34bcbf8) at src/start.c:133
#6  0x00007f901fb5f42d in ?? () from target:/snap/lxd/11824/lib/libdqlite.so.0
#7  0x00007f902023b6ba in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007f901f88c41d in clone () from target:/lib/x86_64-linux-gnu/libc.so.6

Any advice on that? Snap is inoperable because of lxd failing to start.

stgraber · September 9, 2019, 4:14pm

That looks like a proper dqlite bug, @freeekanayaka will need to investigate it.
Can you send me a tarball of your /var/snap/lxd/common/lxd/database to stgraber at ubuntu dot com?

Once you’ve made the backup tarball so we can reproduce/debug this further. You should rollback to 3.16 with snap refresh lxd --channel=3.16/stable which should get you back online until we figure it out.

stgraber · September 9, 2019, 4:14pm

Oh and in your case since it’s crashing during a filesystem write, can you check that your filesystem isn’t full or out of inodes?

fwaggle · September 9, 2019, 4:21pm

@stgraber disk isn’t full (37% capacity) or out of inodes (at 8% capacity). Reverting to 3.16 has a similar issue. The database is fairly huge but I’m compressing it now.

fwaggle · September 9, 2019, 4:26pm

snap refresh lxd --channel=3.16/stable suffers a similar issue, the snap command is hung, lxd process continually restarting.

Actually thinking about it, I tried to be clever and run lxd in gdb, so the above trace might be a red herring. Here’s what it’s currently dumping to syslog:

Sep  9 16:24:34 lxd-host-303 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
Sep  9 16:24:34 lxd-host-303 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Sep  9 16:24:34 lxd-host-303 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
Sep  9 16:24:34 lxd-host-303 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 127.
Sep  9 16:24:34 lxd-host-303 systemd[1]: Stopped Service for snap application lxd.daemon.
Sep  9 16:24:34 lxd-host-303 systemd[1]: Started Service for snap application lxd.daemon.
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: => Preparing the system
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Loading snap configuration
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Setting up mntns symlink (mnt:[4026532656])
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Setting up kmod wrapper
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Preparing /boot
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Preparing a clean copy of /run
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Preparing a clean copy of /etc
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Setting up ceph configuration
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Setting up LVM configuration
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Rotating logs
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Setting up ZFS (0.7)
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Escaping the systemd cgroups
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ====> Detected cgroup V1
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Escaping the systemd process resource limits
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: ==> Disabling shiftfs on this kernel (auto)
Sep  9 16:24:34 lxd-host-303 kernel: [ 3768.093578] new mount options do not match the existing superblock, will be ignored
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]: mount namespace: 7
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]: hierarchies:
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:   0: fd:   8: rdma
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:   1: fd:   9: perf_event
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:   2: fd:  10: memory
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:   3: fd:  11: cpuset
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:   4: fd:  12: freezer
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:   5: fd:  13: cpu,cpuacct
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:   6: fd:  14: devices
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:   7: fd:  15: pids
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:   8: fd:  16: blkio
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:   9: fd:  17: hugetlb
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:  10: fd:  18: net_cls,net_prio
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:  11: fd:  19: name=systemd
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]:  12: fd:  20: unified
Sep  9 16:24:34 lxd-host-303 lxd.daemon[18560]: lxcfs.c: 152: do_reload: lxcfs: reloaded
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: => Re-using existing LXCFS
Sep  9 16:24:34 lxd-host-303 lxd.daemon[33104]: => Starting LXD
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: fatal error: unexpected signal during runtime execution
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: [signal SIGSEGV: segmentation violation code=0x2 addr=0x7f8d7932d000 pc=0x7f8d80420156]
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: runtime stack:
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: runtime.throw(0x12c5914, 0x2a)
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/snap/go/4301/src/runtime/panic.go:617 +0x72
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: runtime.sigpanic()
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/snap/go/4301/src/runtime/signal_unix.go:374 +0x4a9
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: goroutine 39 [syscall, locked to thread]:
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: runtime.cgocall(0x101be00, 0xc000310f00, 0x0)
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/snap/go/4301/src/runtime/cgocall.go:128 +0x5b fp=0xc000310ed0 sp=0xc000310e98 pc=0x40eaeb
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: github.com/canonical/go-dqlite/internal/bindings._Cfunc_dqlite_run(0x7f8d6c000940, 0x7f8d00000000)
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011_cgo_gotypes.go:354 +0x49 fp=0xc000310f00 sp=0xc000310ed0 pc=0x898bd9
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: github.com/canonical/go-dqlite/internal/bindings.(*Server).Run.func1(0x7f8d6c000940, 0x0)
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/build/lxd/parts/lxd/go/src/github.com/canonical/go-dqlite/internal/bindings/server.go:265 +0x56 fp=0xc000310f38 sp=0xc000310f00 pc=0x89b906
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: github.com/canonical/go-dqlite/internal/bindings.(*Server).Run(0x7f8d6c000940, 0x1319ba8, 0x0)
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/build/lxd/parts/lxd/go/src/github.com/canonical/go-dqlite/internal/bindings/server.go:265 +0x2f fp=0xc000310fa0 sp=0xc000310f38 pc=0x899f5f
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: github.com/canonical/go-dqlite.(*Server).run(0xc000312690)
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/build/lxd/parts/lxd/go/src/github.com/canonical/go-dqlite/server.go:234 +0x54 fp=0xc000310fd8 sp=0xc000310fa0 pc=0x8ae104
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: runtime.goexit()
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/snap/go/4301/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000310fe0 sp=0xc000310fd8 pc=0x468921
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: created by github.com/canonical/go-dqlite.(*Server).Start
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/build/lxd/parts/lxd/go/src/github.com/canonical/go-dqlite/server.go:126 +0x54
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: goroutine 1 [select]:
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: github.com/canonical/go-dqlite.(*Server).Start(0xc000312690, 0x14dbc40, 0xc000318660, 0x1, 0x14bb9c0)
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/build/lxd/parts/lxd/go/src/github.com/canonical/go-dqlite/server.go:136 +0x160
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: github.com/lxc/lxd/lxd/cluster.(*Gateway).init(0xc000350000, 0xc0003169a0, 0xc0003001e0)
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/cluster/gateway.go:588 +0x4a4
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: github.com/lxc/lxd/lxd/cluster.NewGateway(0xc000316020, 0xc000340070, 0xc00028b820, 0x2, 0x2, 0x0, 0x23, 0xc000241180)
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/cluster/gateway.go:59 +0x1d0
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: main.(*Daemon).init(0xc0001d6b40, 0x4108c0, 0x60)
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/daemon.go:615 +0x7ac
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: main.(*Daemon).Init(0xc0001d6b40, 0xc0001d6b40, 0xc0002a2240)
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: #011/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/daemon.go:496 +0x2f
Sep  9 16:24:38 lxd-host-303 lxd.daemon[33104]: main.(*cmdDaemon).Run(0xc00019cf80, 0xc000241b80, 0xc000182840, 0x0, 0x4, 0x0, 0x0)

stgraber · September 9, 2019, 4:40pm

Ah so it probably didn’t get to refresh due to it being stuck in that restart loop.

You can try:

systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket

This may hang, if it does, in a separate terminal, locate and kill any lxd process you find.
Once that returns, lxd should be properly stopped, then try the snap refresh lxd --channel=3.16/stable, that should then let you replace the snap and have it start up as 3.16.

fwaggle · September 9, 2019, 4:41pm

The above syslog output was what 3.16 was doing, it was bootlooping (spamming that same message repeatedly). I brought the instance up by fishing out a copy of the dqlite db from before it upgraded, and swapped the entire database dir out for the broken one (I’ll have to see if any containers are missing) between startup attempts.

Did you still want a copy of the db?

Tony_Anytime · September 9, 2019, 5:31pm

I basically have two types of servers here…they are all on 3.17
Basically it lloks like they can’t talk to dbserver
Or is it possible to go back to 3.16

Tony_Anytime · September 9, 2019, 5:48pm

There seems to be more than one LXD trying to run here.

Tony_Anytime · September 9, 2019, 5:51pm

You really should have started your own topic, I believe yours is unrelated to mine and having both together causes confusion that does not help.

Tony_Anytime · September 9, 2019, 5:51pm

Please check out my two screen shots below.

stgraber · September 9, 2019, 5:57pm

Can you show ps fauxww?

Tony_Anytime · September 9, 2019, 6:02pm

this is most of it

Tony_Anytime · September 9, 2019, 6:14pm

Anyway to read database in each node to see which one is in better condition?. I am afraid when one goes bad, it spreads it to all.