[LXD] Adding support for instance rebuild

gabrielmougard · May 16, 2023, 8:41pm


Project	LXD
Status	Implemented
Author(s)	`@gabrielmougard`
Approver(s)	`@stgraber @tomp`
Release	5.15
Internal ID	LX046

Abstract

This adds support for instance rebuild.

Rationale

There are multiple reasons that can justify to rebuild an instance.

In rebuilding a LXD instance (keeping the same instance configurations), you can be sure to have the same exact instance as the one you initially created. It could be useful to start with a fresh instance filesystem in case you want to reset it.
Rebuilding the instance also gives you the opportunity to update its base image while keeping the instance configuration. It can be useful in case of operating system updates.
Lastly, if you want to update the applications running in the instance or you want to fix a bug after it has been deployed, rebuilding the LXD instance can be an effective way to fix the issue. You can make the necessary changes to the code or configuration files, rebuild the instance, and then deploy the updated version.

Specification

Design

The instance could be rebuilt in three different ways:

We could rebuild it as empty
We could rebuild it with the original instance’s image.
Or, we could rebuild it with a different base image.

In order to do that, we would need to replace its root disk with a fresh copy of the same or alternate image.

A simple approach would be to just erase the content of the underlying storage volume of the container and unpack the new rootfs from the same or alternate image. For that we could we could delete and re-create the underlying instance volume using the appropriate image unpacker. Also, we chose to prevent a rebuild if an instance has snapshots.

In the case where we would choose a different image from the orignal one, we would also want to check that the existing volume can hold the new rootfs. If this is not the case, we might need to grow the volume before overwriting it.

API changes

A new field in InstancesPost will be added:

type InstancesPost struct {
  ...
  Rebuild bool `json:"rebuild" yaml:"rebuild"`
}

CLI changes

The rebuild will be added and will be as follow:
lxc rebuild [[<remote>:]<image>...] [<remote>:]<instance> [--empty]

Here are some use cases supposing we have an existing instance called c1 which is in a stopped state:

lxc rebuild c1
- Rebuild c1 using its original image
lxc rebuild c1 --empty
- Rebuild c1 as empty (in the same fashion as lxc init c1 --empty)
lxc rebuild images:ubuntu/jammy c1
- Rebuild c1 using a different image

Database changes

No datababse changes.

tomp · May 17, 2023, 7:18am

We probably only need NewSource and not InstancePost in this new InstanceRebuildPost struct, because my understanding is, that we are not allowing the instance config to be changed as part of this request.

Please can you also indicate what the API endpoint/URL will be for this new handler you envisage?

tomp · May 17, 2023, 7:18am

I expect all storage drivers will need to be catered for. Certainly you can develop and test them one at a time though. But the approach will need to cater for all of them so best get them all done before merging any of them to ensure you don’t have to backtrack later.

tomp · May 17, 2023, 7:32am

I think the principle here is sound, as it will allow reverting on failure and not losing the original.
It also allows a clean new volume to be created, rather than writing over the old one.
One downside to that is that it will require more space than writing over the old one, but I think the benefits out weigh the the drawbacks, as long as we document the design choices and why.

What are your thoughts around snapshots on the instance being rebuilt?
Will those get destroyed so the instance is effectively brand new?
Or will they get kept? If so, this may present some challenges for the new volume approach, as we would need to explore what each storage driver allows us to do with regards to snapshots.

One potential issue on that topic I can see is ZFS. As ZFS doesn’t allow us to delete datasets when there are snapshots associated to it, e.g.

lxc init images:ubuntu/jammy c1 -s zfs
lxc snapshot c1
lxc snapshot c1

sudo zfs list -t all | grep /c1
zfs/containers/c1                                                                                   104K  19.8G      236M  legacy
zfs/containers/c1@snapshot-snap0                                                                   13.5K      -      236M  -
zfs/containers/c1@snapshot-snap1                                                                   13.5K      -      236M  -

sudo zfs destroy zfs/containers/c1
cannot destroy 'zfs/containers/c1': filesystem has children
use '-r' to destroy the following datasets:
zfs/containers/c1@snapshot-snap1
zfs/containers/c1@snapshot-snap0

So the plan of creating a new volume will not work for ZFS if the instance has snapshots and we are expecting to retain them.

gabrielmougard · May 17, 2023, 8:20am

Regarding the snapshots, I don’t have a strong opinion. I think we can keep them if the image used during the rebuild is the same as the original one. If the image is different, maybe we can erase them all as it might cause issue during the restore (not quite sure about that but it looks dangerous)

gabrielmougard · May 17, 2023, 8:24am

Regarding ZFS, can’t we move the snapshots in an other temporary dataset and reattach them after (just like the initial tmp approach for the instance) to allow zfs destroy to take place ?

tomp · May 17, 2023, 8:25am

Those will come through the URL though, which is why I asked.

tomp · May 17, 2023, 8:26am

I don’t think so because they are inextricably linked with the original dataset (because of the CoW nature of these snapshots).

gabrielmougard · May 17, 2023, 8:28am

Regarding the API endpoint/URL, something like this one:

POST /1.0/instances/rebuild?project=<PROJECT>&name=<INSTANCE_NAME>

tomp · May 17, 2023, 8:36am

That’s not particularly in keeping with the existing URL that uses POST /1.0/instances/<instance_name>?project=<project_name> for modifying an existing instance.

I suspect we’ll end up adding a Rebuild bool field to InstancePost type on that existing endpoint rather than introduce a new endpoint. A bit like we do with the Migration bool field in that type.

gabrielmougard · May 17, 2023, 8:36am

Ok. Then, I’m tempted to say that the rebuild also deletes the associated snapshots (I don’t see a particular use case for now, plus the fact that zfs does not allow us to do that probably means that it is a dangerous behavior to restore an instance with a potentially different image from its snapshot)

Else, there is also the choice to say that in case of an instance relying on a ZFS storage, the snaps are deleted as part of this ZFS limitation. Else, for other storage backends, we could keep the snaps. But it seems inconsistent (not the same end result for different storage backends) so I’d prefer the first choice.

gabrielmougard · May 17, 2023, 8:40am

Ok. Else, I also had this idea: PUT /1.0/instances/{name}/rebuild?project=<PROJECT>
What do you think of it ?

tomp · May 17, 2023, 10:27am

Lets go with snapshots being deleted for now and see if @stgraber concurs.

tomp · May 17, 2023, 10:34am

I think that adding POST /1.0/instances/{instance_name}/rebuild?project=<project_name> would be most consistent with existing API endpoints, the nearest one I could find being the endpoint to trigger a refresh of an image:

https://linuxcontainers.org/lxd/docs/master/api/#/images/images_refresh_post

POST /1.0/images/{fingerprint}/refresh

ruskofd · May 17, 2023, 1:24pm

This looks really really interesting, this could helps a lot for OS updates by doing them in a more “immutable” way. This also open doors for other interesting things like refreshing a group of instances in a rolling upgrade manner more easily.

gabrielmougard · May 23, 2023, 5:21pm

@tomp Instead of using the tmp folder approach, can’t we create a local temporary copy of the volume and its snapshot in a new ZFS dataset (with a name like <original_dataset>-save for example) ? That way we can use the send+receive logic that is quite useful to move things around. As for the other storage drivers, we could have the same approach with their own logic.

tomp · May 25, 2023, 10:34am

Thinking about this, given that the rebuild process is necessarily destructive, I don’t think we need to make a temporary copy of the instance to allow a revert. As this will require potentially a lot of extra disk space.

gabrielmougard · May 25, 2023, 10:59am

Ok it makes sense

tomp · May 30, 2023, 8:58am

What URL will the new endpoint be at?

gabrielmougard · May 31, 2023, 7:15pm

Here is the related PR : https://github.com/lxc/lxd/pull/11687