Repeatable LXD installations

ghomem · August 18, 2021, 2:30pm

Thank you for all the answers. All things considered I find it a bit disappointing that a super powerful and sophisticated solution such as LXD is being delivered in a way that is rather OPS and IaC unfriendly. The recommended way lacks repeatability and fine grained control of security update cycles.

We will probably move along the following lines:

repeatability: we are forced to accept that installations are not repeatable which therefore will force a full test every time we deploy a new cluster or standalone server; the corresponding effort will be reflected in couple of ways over time, rather then in planned feature update cycles
control of updates: we will implement a workaround that prevents automatic updates, probably based on the suggestions written here:

and share the results with the community.

Problem 1) would be easily fixed if older channels were kept for a reasonable amount of time. On the replies written here, there seems to be an implicit assumption that OPS teams will not do “the right thing” if they are not forced to, “the right thing” being updating their system once updates are available. Because “the right thing” depends on risk/benefit analysis which is particular to each project where lxd is used, I find this "one fits all " assumption a bit too radical. It kind of says that the security risk from not updating immediately is always higher than the risk of unpredictable behaviour on a production system, arising from slight differences in the software. My experience with risk analysis tells me that it really depends on the exposure and impact, among other things.

Problem 2) is a serious design problem with snap not related to lxd. It needs to be fixed, unfortunately by means of workaround, for any critical infrastructure component that is delivered this way.

We will share our experience.

Thanks everyone.

tomp · August 18, 2021, 2:35pm

On that point, snapd does support the ability to defer updates for up to 60 days and also specify the times when these updates occur, so it doesn’t have to be immediately updated as the software is released.

https://snapcraft.io/blog/how-to-make-snaps-and-configuration-management-tools-work-together

That doesn’t help with your repeatability of freshly installed systems goal though. Only snap proxy really helps with that.

https://docs.ubuntu.com/snap-store-proxy/en/overrides

Also, note, that LXD clusters all have to run the same LXD version, so they should be updated together close after one and other (irrespective of whether using snap or not).

ghomem · August 18, 2021, 3:08pm

The point of immediately here was about being forced to adopt newer versions on fresh installs almost after they are released (only an older one is kept). This assumes that OPS teams will always be wrong in their management, that they must be forced to use the latest or they will never update.

In the context of snap refreshes, 60 days being reasonable is also an assumption. Whether 60 days is too much or too little depends on the context. In the blog article I linked above a former Snap Advocate of Canonical, includes a cron job that postpones the refresh everyday in order to not have a deadline. I guess that it says a lot.

I am aware of Snap Proxy - it just seems to me contra natura to have to do that. It also introduces other potential problems.

Also, note, that LXD clusters all have to run the same LXD version, so they should be updated
together close after one and other (irrespective of whether using snap or not).

Right. We will take this into account. Thanks.

ghomem · September 6, 2021, 8:21pm

So this is what we decided to do:

retest latest 4.0/stable and re-validate (4.0.7 seems to work for all our scenarios)
install via SNAP in a way that updates are never applied automatically
develop a script that updates the snap in the moment we want to

Step 2 is a declarative way of doing this:

snap download lxd --channel=4.0 --basename=lxd --target-directory=/root/snap
snap install /root/snap/lxd.snap --dangerous

Thank you everyone for the multiple comments on this matter.

ghomem · February 23, 2023, 1:16am

I would like to do a gentle follow up in the matter of reproducibility and immutability of the installations. The workaround above seems to not work with the latest LXD snap and the need for reproducible and immutability of the installations remains. Can you advise?

Thanks in advance.

tomp · February 23, 2023, 2:07am

Can you expand on what isn’t working anymore? This may need to be referred to the snap team if its a snap specific problem.

ghomem · February 23, 2023, 2:12am

We’d like to install an LTS version of LXD and make sure it never gets updated automatically (we will manage the updates in the right moments). Previously an unnelegant workaround was necessary (see above).

tomp · February 23, 2023, 2:19am

Yes, and which part has stopped working? Is there an error?

"The workaround above seems to not work with the latest LXD snap "

ghomem · February 23, 2023, 2:21am

Yes, I can no longer install the local snap file. The error is this

This worked fine on previous servers, which have been running without problems for a long time.

tomp · February 23, 2023, 2:23am

Can you look in dmesg output for DENIED entries to do with apparmor?

tomp · February 23, 2023, 2:26am

As an aside, the official way to do this with snap is to use the snap store proxy.

See the proxy section in Managing the LXD snap

It allows you to pin a certain snap revision so all of your machines get same version.

ghomem · February 23, 2023, 2:55am

Meanwhile I found something recent:

https://merlijn.sebrechts.be/blog/2022-11-10-turn-off-snap-updates/

I am repeating the installation and will provide feedback ASAP.

ghomem · February 23, 2023, 3:22am

Messages from apparmor.

The problem seems to go away if we remove the --dangerous option, which is the past was necessary. Things changing behaviour inside an LTS version

ghomem · April 3, 2023, 3:09pm

Seems that without the --dangerous option (which worked before and now does not…) we also need to run snap ack. However it seems that by running this we are enabling automatic refreshes to be performed which defeats the purpose of this exercise.

As a matter of fact a Snap refresh just killed one of our existing test LXD servers that had been installed with snap ack by updating from 4.x stable to 5.x (VM terminal access stopped working)

This is breaking business continuity for serious users of LXD - we simply can’t have production servers updated in an uncontrolled manner. I am aware of the SNAP proxy but that adds unnecessary complexity - the point is running a reliable virtualization server, not a server that workarounds SNAP problems.

Would you be able to suggest a simple solution to not have our LXD servers updated except when we want to?

Is updating to snap 2.58 the simplest option?

tomp · April 3, 2023, 4:58pm

If you are using the 5.0/stable LTS channel then it won’t auto upgrade between major versions.

tomp · April 3, 2023, 4:58pm

You could use snap hold too and then manage when you get security updates manually.

ghomem · April 3, 2023, 4:59pm

We are using 4.0/stable since we don’t have everything certified for 5.0 yet.

tomp · April 3, 2023, 4:59pm

So 4.0/stable won’t switch to 5.x automatically.

ghomem · April 3, 2023, 5:00pm

Well but it somehow did!

We are going to try the latest snapd so that --hold can be used. I will report the result. Thank you.

ghomem · April 12, 2023, 6:20pm

The aftermath of this:

the new --hold flag works
due to the confusion introduced by the change of snap behaviour (inside an LTS version!) we had two machines that took unintended LXD upgrades (from 4.0.X to 5.12) and plus interruption of running production systems

After the unintended upgrades to 5.12 we can’t downgrade even to 5.0.X because the lxd recover command does not recognize the machines. On version 5.12:

sudo lxc exec my-vm bash

fails with

Error: Failed to connect to lxd-agent

This worked 100% before the upgrade. I tried to upgrade lxd inside the VM but that did not work either. Any ideas on how to solve?