Cannot restart LXD service after crash due to full disk

Hi,

I set the ZFS pool size bigger than it should have been, and it filled up the disk (I don’t know why, as the system only had about 5 containers doing barely anything). Maybe because the images had the auto_update flag set?

So after everything crashed (nothing works when the disk is full, and the load average hit 400), I rebooted the system in rescue mode, added a second hard disk, and then did as indicated here:

That way I could bring the ZFS image back to a “normal” size and the server could breathe again.
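From memory, the checks in rescue mode were along these lines (a sketch, not my exact session):

zpool list                          # pool size and free space
zfs list -o name,used,avail,quota   # which datasets filled the disk
df -h /                             # confirm the root filesystem has space again
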
I then rebooted the server, but now the lxd service doesn’t start.
The output of journalctl -u lxd is:

Nov 12 10:35:47 vps210744 systemd[1]: Starting LXD - main daemon...                                                                     
Nov 12 10:35:48 vps210744 lxd[1735]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-11-12T10:35:48-0500
Nov 12 10:35:48 vps210744 lxd[1735]: err="listen tcp 10.10.0.1:8443: bind: cannot assign requested address" lvl=eror msg="cannot listen on https socket, skipping..." t=2018-11-12T10:35:48-0500
Nov 12 10:35:48 vps210744 lxd[1735]: panic: txn not found                                                                               
Nov 12 10:35:48 vps210744 lxd[1735]: trace:                                                                                   
Nov 12 10:35:48 vps210744 lxd[1735]: 2018-11-12 10:35:48.05968: fsm: restore=565 start                                         
Nov 12 10:35:48 vps210744 lxd[1735]: 2018-11-12 10:35:48.05969: fsm: restore=565 database size: 4096                                    
Nov 12 10:35:48 vps210744 lxd[1735]: 2018-11-12 10:35:48.05969: fsm: restore=565 wal size: 267832                             
Nov 12 10:35:48 vps210744 lxd[1735]: 2018-11-12 10:35:48.05983: fsm: restore=565 filename: db.bin                              
Nov 12 10:35:48 vps210744 lxd[1735]: 2018-11-12 10:35:48.05983: fsm: restore=565 transaction ID:                                        
Nov 12 10:35:48 vps210744 lxd[1735]: 2018-11-12 10:35:48.06138: fsm: restore=565 open follower: db.bin                        
Nov 12 10:35:48 vps210744 lxd[1735]: 2018-11-12 10:35:48.06219: fsm: restore=565 done                                           
Nov 12 10:35:48 vps210744 lxd[1735]: 2018-11-12 10:35:48.08189: fsm: term=1 index=566 cmd=frames txn=565 pages=3 commit=1 start          
Nov 12 10:35:48 vps210744 lxd[1735]: 2018-11-12 10:35:48.08824: fsm: term=1 index=566 cmd=frames txn=565 pages=3 commit=1 unregister txn
... 
Nov 12 10:45:48 vps210744 lxd[2550]: 2018-11-12 10:45:48.43646: fsm: term=1 index=593 cmd=frames txn=592 pages=4 commit=1 done
Nov 12 10:45:48 vps210744 lxd[2550]: 2018-11-12 10:45:48.43647: fsm: term=1 index=594 cmd=undo txn=592 start
Nov 12 10:45:48 vps210744 lxd[2550]: goroutine 29 [running]:
Nov 12 10:45:48 vps210744 lxd[2550]: github.com/CanonicalLtd/dqlite/internal/trace.(*Tracer).Panic(0xc42024c3c0, 0x10162ea, 0xd, 0x0, 0x0, 0x0)
Nov 12 10:45:48 vps210744 lxd[2550]:         /build/lxd-0FDBXp/lxd-3.0.1/obj-x86_64-linux-gnu/src/github.com/CanonicalLtd/dqlite/internal/trace/tracer.go:59 +0x12c
Nov 12 10:45:48 vps210744 lxd[2550]: github.com/CanonicalLtd/dqlite/internal/replication.(*FSM).applyUndo(0xc4202e4000, 0xc42024c3c0, 0xc4203f9138, 0x1, 0xc42024c280)
Nov 12 10:45:48 vps210744 lxd[2550]:         /build/lxd-0FDBXp/lxd-3.0.1/obj-x86_64-linux-gnu/src/github.com/CanonicalLtd/dqlite/internal/replication/fsm.go:335 +0x200
Nov 12 10:45:48 vps210744 lxd[2550]: github.com/CanonicalLtd/dqlite/internal/replication.(*FSM).apply(0xc4202e4000, 0xc42024c280, 0xc4204a8810, 0xc41a03f426, 0x1a03f42600000003)
Nov 12 10:45:48 vps210744 lxd[2550]:         /build/lxd-0FDBXp/lxd-3.0.1/obj-x86_64-linux-gnu/src/github.com/CanonicalLtd/dqlite/internal/replication/fsm.go:116 +0x4b6
Nov 12 10:45:48 vps210744 lxd[2550]: github.com/CanonicalLtd/dqlite/internal/replication.(*FSM).Apply(0xc4202e4000, 0xc4204a8810, 0x0, 0x0)
Nov 12 10:45:48 vps210744 lxd[2550]:         /build/lxd-0FDBXp/lxd-3.0.1/obj-x86_64-linux-gnu/src/github.com/CanonicalLtd/dqlite/internal/replication/fsm.go:81 +0xaf
Nov 12 10:45:48 vps210744 lxd[2550]: github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc4204386c0)
Nov 12 10:45:48 vps210744 lxd[2550]:         /build/lxd-0FDBXp/lxd-3.0.1/obj-x86_64-linux-gnu/src/github.com/hashicorp/raft/fsm.go:57 +0x15a
Nov 12 10:45:48 vps210744 lxd[2550]: github.com/hashicorp/raft.(*Raft).runFSM(0xc4201fcdc0)
Nov 12 10:45:48 vps210744 lxd[2550]:         /build/lxd-0FDBXp/lxd-3.0.1/obj-x86_64-linux-gnu/src/github.com/hashicorp/raft/fsm.go:120 +0x2fa
Nov 12 10:45:48 vps210744 lxd[2550]: github.com/hashicorp/raft.(*Raft).(github.com/hashicorp/raft.runFSM)-fm()
Nov 12 10:45:48 vps210744 lxd[2550]:         /build/lxd-0FDBXp/lxd-3.0.1/obj-x86_64-linux-gnu/src/github.com/hashicorp/raft/api.go:506 +0x2a
Nov 12 10:45:48 vps210744 lxd[2550]: github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc4201fcdc0, 0xc420291b20)
Nov 12 10:45:48 vps210744 lxd[2550]:         /build/lxd-0FDBXp/lxd-3.0.1/obj-x86_64-linux-gnu/src/github.com/hashicorp/raft/state.go:146 +0x53
Nov 12 10:45:48 vps210744 lxd[2550]: created by github.com/hashicorp/raft.(*raftState).goFunc
Nov 12 10:45:48 vps210744 lxd[2550]:         /build/lxd-0FDBXp/lxd-3.0.1/obj-x86_64-linux-gnu/src/github.com/hashicorp/raft/state.go:144 +0x66
Nov 12 10:45:48 vps210744 systemd[1]: lxd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

This address (10.10.0.1) is the default internal address of the lxdbr0 bridge, but as the lxd service doesn’t start, I assume the bridge hasn’t been created yet and the address is not available.
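Two quick checks that should confirm this (assuming the bridge name is indeed lxdbr0):

ip addr show lxdbr0        # reports the device does not exist while LXD is down
ss -tlnp | grep 8443       # nothing listening on the LXD API port either
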

How can I bring this back to life? Any ideas? Thank you very much.

I don’t have the answer, but the error “panic: txn not found” is referenced here:

What LXD version is that? Things have been made a bit more reliable with dqlite lately, which may or may not prevent such issues when running out of disk space.

@freeekanayaka may also be able to fix your database or provide steps to get it back online if you can send us a tarball of /var/lib/lxd/database.
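For example, something like this should do it (a sketch; adjust the path if your install differs):

tar -czf lxd-database.tar.gz -C /var/lib/lxd database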

Hi,

Thanks to both @rkelleyrtp and @stgraber for replying.

@stgraber:
It is 3.0.1 (the default apt version).

I can send the tarball, sure.
Where to?
Thanks

You can send to stgraber@ubuntu.com and I’ll forward it to @freeekanayaka.


Sent it.

Many many thanks.

And forwarded to Free.


Hi there,

It works again after simply deleting (well, renaming) the logs.db file.
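For anyone finding this later, the steps were roughly as follows (the path is from my 3.0.1 apt install, so double-check yours, and keep the renamed file around as a backup):

systemctl stop lxd.service lxd.socket    # stop the socket too, so it isn’t reactivated
mv /var/lib/lxd/database/global/logs.db /var/lib/lxd/database/global/logs.db.bak
systemctl start lxd
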

No need to fix the database then, I assume? Everything works fine now.

Many thanks anyway, @stgraber, @freeekanayaka

Oh, okay. Deleting that file causes the database to be restored from the last snapshot, so you may have lost some data in the process.