No Solution Yet: Problem after Upgrade of Ubuntu 18.04 (emergency)

Hmm, no, and a failure to back up should either have been fatal or at least have resulted in a log entry, so that's pretty confusing.

Pretty sure, yes, this isn't a snapped installation, so we're talking 3.0.x here, and all the logs above show a startup failure caused by a storage config problem.

Hmm, if this was 3.0.0 it may predate the backup feature? I can't remember when we added the unconditional .bak of the clustered database.
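If the backup is there, it should show up next to the database files; the paths below assume the default non-snap layout:
ls -l /var/lib/lxd/database/
ls -l /var/lib/lxd/database/global/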

Odd timing I suppose.

I upgraded all servers, but then only rebooted one, joe. And this might have caused the problem.
Remember, this is Ubuntu 18.04; I just did a major upgrade and perhaps not everything took.
lxc version
Client version: 3.0.3
Server version: unreachable
Not a snap installation, I believe
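If it helps, the daemon status and log on joe can be checked with something like this (paths assume the apt packaging, not the snap):
systemctl status lxd
journalctl -u lxd --since today
tail -n 50 /var/log/lxd/lxd.log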

Once stgraber gets this sorted for ya, I would set up a snap cluster to migrate to. Pretty happy with it so far; I ran into some issues with my Ceph storage pool on the 18.04 apt version when I was setting that up.

I tried the snap in an earlier version and lost my zpool twice. I wish there was a way to uncluster this cluster and then recluster it into a new one. I could do that with my 4 servers. But I think right now it is an all-or-nothing deal.

I would avoid ZFS like the plague; way too easy to lose data IMO, especially if you are using loop devices.

Yeah, this is my fear: if I lose LXD, I can lose my data. What would you recommend instead?

Anything really. I set up a Ceph pool and it is working well with ~90 containers hitting it.

My current setup is 4 LXD servers and 4 Ceph servers with SSD storage.

Dumping the database seems to indicate that the source config is actually there:

sqlite> select * from storage_pools;
1|local|zfs||1
sqlite> select * from storage_pools_config;
2|1|1|size|100GB
3|1|1|source|/var/lib/lxd/disks/local.img
4|1|1|zfs.pool_name|local
5|1|2|size|100GB
6|1|2|source|/var/lib/lxd/disks/local.img
7|1|2|zfs.pool_name|local
8|1|3|size|100GB
9|1|3|source|/var/lib/lxd/disks/local.img
10|1|3|zfs.pool_name|local
11|1|4|size|100GB
12|1|4|source|/var/lib/lxd/disks/local.img
13|1|4|zfs.pool_name|local

so I’m not totally sure what’s going on.
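For reference, that dump was taken with LXD stopped, using the sqlite3 CLI directly against the global database; the path is my assumption for the default apt layout:
sqlite3 /var/lib/lxd/database/global/db.bin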

/var/lib/lxd/disks/local.img

Does that exist? Previous ZFS adventures led me to believe it does not.

If it does you might be able to remount it manually.
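Something along these lines, with the pool name and path taken from your dump (run on one node first):
zpool import -d /var/lib/lxd/disks
zpool import -d /var/lib/lxd/disks local
zpool status local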

Yes, they exist on all servers. But no containers are live on joe because LXD is not running.

I would make a backup of that file everywhere before doing anything else, if you have the disk space.
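For example, something like this on each server (the destination is just a suggestion; --sparse keeps the copy small since the image is a sparse file):
cp --sparse=always /var/lib/lxd/disks/local.img /root/local.img.bak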

I noticed you had a WAN IP set up as the cluster IP. Can you telnet to that and get a response?

Hopefully it is a static IP.
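Something like this, with your actual address in place of <cluster-ip> (8443 is the default LXD port, unless you changed it):
telnet <cluster-ip> 8443
openssl s_client -connect <cluster-ip>:8443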

Doing it again, just in case.
Here is the thing: I don't care about this cluster member as much, except that all the others are not working either. Simply turning this one off does nothing. Is it holding the other members down?

Tony, the problem is not one node, it is all nodes.

So is it a corrupt database, or servers stuck in between upgrades? Or the zpool?
What do you think so far?

The database does not seem corrupted, at first sight, that’s why I’m scratching my head.

The data is probably OK since the file still exists; I would say a version mismatch, most likely. My nodes didn't come back until they were all upgraded.
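For anyone hitting this later, on 18.04 with the apt packages that meant roughly the following on every node, then restarting the daemon once they all matched (stock package names assumed):
apt-get update
apt-get install --only-upgrade lxd lxd-client
systemctl restart lxd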