No Solution Yet: Problem after Upgrade of Ubuntu 18.04 (emergency)

Hmm, no, and a failure to back up should either have been fatal or at least have resulted in a log entry, so that's pretty confusing.

Pretty sure, yes, this isn't a snapped installation, so we're talking 3.0.x here, and all the logs above show a startup failure caused by a storage config problem.

Hmm, if this was 3.0.0 it may predate the backup feature? I can't remember when we added the unconditional .bak of the clustered database.
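If the backup is there, it should show up next to the database files; the paths below assume the default non-snap layout:
ls -l /var/lib/lxd/database/
ls -l /var/lib/lxd/database/global/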

Odd timing I suppose.

I upgraded all servers, but then only rebooted one, joe. And this might have caused the problem.
Remember, this is Ubuntu 18.04; I just did a major upgrade and perhaps not everything took.
lxc version
Client version: 3.0.3
Server version: unreachable
Not a snap installation, I believe
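If it helps, the daemon status and log on joe can be checked with something like this (paths assume the apt packaging, not the snap):
systemctl status lxd
journalctl -u lxd --since today
tail -n 50 /var/log/lxd/lxd.log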

Once stgraber gets this sorted for ya, I would set up a snap cluster to migrate to. Pretty happy with it so far; I ran into some issues with my Ceph storage pool on the 18.04 apt version when I was setting that up.

I tried the snap in an earlier version and lost my zpool twice. I wish there was a way to uncluster this cluster and then recluster it into a new one. I could do that with my 4 servers. But I think right now it is an all-or-nothing deal.

I would avoid ZFS like the plague; way too easy to lose data IMO, especially if you are using loop devices.

Yeah, this is my fear: if I lose LXD, I can lose my data. What would you recommend instead?

Anything really. I set up a Ceph pool and it is working well with ~90 containers hitting it.

My current setup is 4 LXD servers and 4 Ceph servers with SSD storage.

Dumping the database seems to indicate that the source config is actually there:

sqlite> select * from storage_pools;
1|local|zfs||1
sqlite> select * from storage_pools_config;
2|1|1|size|100GB
3|1|1|source|/var/lib/lxd/disks/local.img
4|1|1|zfs.pool_name|local
5|1|2|size|100GB
6|1|2|source|/var/lib/lxd/disks/local.img
7|1|2|zfs.pool_name|local
8|1|3|size|100GB
9|1|3|source|/var/lib/lxd/disks/local.img
10|1|3|zfs.pool_name|local
11|1|4|size|100GB
12|1|4|source|/var/lib/lxd/disks/local.img
13|1|4|zfs.pool_name|local

so I’m not totally sure what’s going on.
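For reference, that dump was taken with LXD stopped, using the sqlite3 CLI directly against the global database; the path is my assumption for the default apt layout:
sqlite3 /var/lib/lxd/database/global/db.bin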

/var/lib/lxd/disks/local.img

Does that exist? Previous ZFS adventures led me to believe it does not.

If it does you might be able to remount it manually.
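Something along these lines, with the pool name and path taken from your dump (run on one node first):
zpool import -d /var/lib/lxd/disks
zpool import -d /var/lib/lxd/disks local
zpool status local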

Yes, they exist on all servers. But no containers are live on joe because LXD is not running.

I would make a backup of that file everywhere before doing anything else, if you have the disk space.
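For example, something like this on each server (the destination is just a suggestion; --sparse keeps the copy small since the image is a sparse file):
cp --sparse=always /var/lib/lxd/disks/local.img /root/local.img.bak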

I noticed you had a WAN IP set up as the cluster IP. Can you telnet to that and get a response?

Hopefully it is a static IP.
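Something like this, with your actual address in place of <cluster-ip> (8443 is the default LXD port, unless you changed it):
telnet <cluster-ip> 8443
openssl s_client -connect <cluster-ip>:8443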

Doing it again, just in case.
Here is the thing: I don't care about this cluster member as much, except that all the others are not working either. Simply turning this one off does nothing. Is it holding the other members down?

Tony, the problem is not one node, it is all nodes.

So is it a corrupt database, or servers stuck in between upgrades? Or the zpool?
What do you think so far?

The database does not seem corrupted, at first sight, that’s why I’m scratching my head.

The data is probably OK since the file still exists; I would say a version mismatch, most likely. My nodes didn't come back until they were all upgraded.
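For anyone hitting this later, on 18.04 with the apt packages that meant roughly the following on every node, then restarting the daemon once they all matched (stock package names assumed):
apt-get update
apt-get install --only-upgrade lxd lxd-client
systemctl restart lxd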