[LXD] New disaster recovery tool

PR to prevent underscores in new project names:

@stgraber the existing lxd import tool has a client-side restriction to only allow running it as the root user. Should this also be maintained for the new lxd recover tool?

No, this can be dropped. I’m not even completely sure why we had it for lxd import in the first place (unless it was actually performing direct disk access at some point?).

@stgraber cool, thanks. IIRC, internalImport() was writing an “importing” file to the instance’s root volume to prevent it being deleted on failure. That might be the reason.

The only other thing I can think of (that may still be relevant) is the probing of supported storage pool drivers using storageDrivers.SupportedDrivers(), which I am using to populate the possible storage driver option question, and which lxd import used too. See

Ah yeah, for the supported drivers thing, I wonder if we shouldn’t just expose that as a comma separated list in /1.0 so both lxd init and lxd recover can just fetch it over the API and avoid direct probing. What do you think?

I’m happy to expose it over an API endpoint, the function itself comes with a warning “This can take a long time if a driver is not supported.” so perhaps /1.0 isn’t the most appropriate place for it? Although the slow parts should only occur on first call of that function (as the storage driver should then cache the features/version supported).

Yeah, I definitely don’t want to hit a slow path with each /1.0 call, but having it done on daemon startup with the result kept in memory would be fine and in line with other things we expose through /1.0.
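A minimal Go sketch of that startup-caching idea, just to illustrate the shape of it. The names here (probeDrivers, SupportedDrivers) are hypothetical stand-ins, not LXD’s actual code; the real probe inspects the host for each driver’s tools and kernel support:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// probeDrivers is a stand-in for storageDrivers.SupportedDrivers(),
// which can be slow on first call while each driver checks for its
// tools and kernel support. Here it just returns a fixed list.
func probeDrivers() []string {
	return []string{"btrfs", "dir", "lvm", "zfs"}
}

var (
	driversOnce   sync.Once
	cachedDrivers string
)

// SupportedDrivers returns the comma-separated driver list, probing
// only once (e.g. at daemon startup) so that later /1.0 requests
// never hit the slow path.
func SupportedDrivers() string {
	driversOnce.Do(func() {
		cachedDrivers = strings.Join(probeDrivers(), ",")
	})
	return cachedDrivers
}

func main() {
	fmt.Println(SupportedDrivers())
}
```

With this shape, lxd init and lxd recover could both read the cached list over the API rather than probing locally.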


I think we have a var that contains that already that we can just expose indeed.

That would be good too as it would remove the (for me, annoying) pause during lxd init as it probes for available storage engines. This way it will have been done already when LXD starts up.

Yeah, that pause is a bit annoying :wink:

The PR that implements returning available storage driver info in the API is here:

Well now I’m “salty”: Feature Request: API Returning Supported Storage Drivers · Issue #5955 · lxc/lxd · GitHub :wink:

Handy addition though!


Ah yeah, it’s a bit different from your original ask, as it’s not listing what the API supports so much as what the system supports, but for you that’s probably equivalent in this case :slight_smile:

For us, it will allow lxd recover and lxd init to build up a list of local and remote storage drivers without having to use the internal logic they currently rely on.

Something occurred to me today: a storage pool (such as LVM) can have volumes on it that use a different filesystem than the default, that is, the pool’s default, or the LXD default if the pool DB record is missing as well.

In these cases we are in a chicken-and-egg situation: we cannot mount the instance volume without knowing which filesystem it uses, yet we need to mount it in order to read the backup.yaml file (whose volume config records the filesystem to use).

And for custom volumes it’s worse, because there is no config file at all.

In cases like this, I can see recovery being blocked for all volumes, because the mount failure for the first such volume would fail the entire recovery process.

Perhaps we should skip over volumes that cannot be mounted and present a list of volumes that are considered unknown but cannot be recovered, and then allow the ones that can be recovered to proceed?

Or we could have a function which guesses the filesystem :slight_smile:
We only really support 3 filesystems (ext4, xfs or btrfs), so even if we had to iterate until one of them mounts properly, it wouldn’t be too bad. But we should also be able to just use auto and let the kernel superblock parser figure it out then just record whatever ended up being used.


Fair enough, I’ll look at integrating that into the relevant storage drivers’ mount logic.

This is my approach to providing the option to probe the filesystem type when mounting a volume without the original DB records available:


The PRs for this feature are here: