Introducing `incus-deploy`

stgraber · April 16, 2024, 10:12pm

Hello,

For those following my live streams, you’ll already know about my recent work on an Ansible & Terraform based deployment tool aimed at making it very easy to deploy and maintain Incus clusters.

While it’s still the early days of this effort, it’s now able to deploy a full Incus cluster including Ceph storage and OVN networking, effectively replicating the environment one may have built using MicroCloud over on the LXD side.

This is obviously quite a different take on it as this uses stock packages for everything, no snaps and no specific distro requirements (though it’s only been tested on Ubuntu 22.04 and adding others will need some work). It’s also a lot more customizable and can be integrated in someone’s existing deployment tooling.

The repository for this has now been moved over to the LXC organization.

And here is a full demo of it building a 5-machine cluster:

What the screencast above shows is basically:

Using Terraform on an Incus system to create a new project (dev-incus-deploy), then an extra network (br-ovn-test) to be used for OVN ingress, and then 5 virtual machines acting as our clustered servers. Each of those gets 5 disks attached, for use by a mix of local and Ceph storage.
Quickly looking at the Ansible configuration to show what networks and storage pools will be created on the resulting cluster.
Running Ansible to deploy everything on the systems.
Then entering one of the servers and throwing a few instances at the cluster to make sure everything is behaving.

For actual deployments, the Terraform step would be replaced either by slightly different Terraform run against a bare metal provisioning tool or by setting the machines up by hand, but as it stands it makes it easy to experiment.
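
As a rough sketch of the test workflow described above: the directory names `terraform/` and `ansible/` are assumptions about the repository layout, and the `tofu` binary (OpenTofu) is assumed in place of `terraform`. The `run_in` helper is hypothetical and only prints the commands unless `APPLY=1` is set, so check the repository's own instructions before running anything for real.

```shell
#!/bin/sh
# Sketch only: prints each command, and executes it only when APPLY=1.
# The terraform/ and ansible/ directory names are assumptions, not
# confirmed by the post above.
set -eu

run_in() {
    # Usage: run_in <directory> <command> [args...]
    dir=$1
    shift
    echo "+ (cd $dir && $*)"
    if [ "${APPLY:-0}" = "1" ]; then
        ( cd "$dir" && "$@" )
    fi
}

deploy_test_cluster() {
    # 1. Create the project, the extra OVN network and the 5 VMs.
    run_in terraform tofu init
    run_in terraform tofu apply
    # 2. Deploy Incus, Ceph and OVN onto those VMs.
    run_in ansible ansible-playbook deploy.yaml
}

deploy_test_cluster
```

Without `APPLY=1` this just lists the three steps, which is a cheap way to review the order before pointing it at a real Incus system.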

Next steps:

Add support for Ubuntu 20.04 LTS
Add support for clustered LVM as an alternative to Ceph
Add the ability to deploy Grafana, Prometheus and Loki (monitoring stack)
Add the ability to deploy OpenFGA and Keycloak (authorization stack)
Improve the Terraform/Ansible integration so we can do a full test deployment without having to manually tweak the Ansible inventory

I’d also like to repeat that I’m no expert in Ansible or Terraform, pretty far from that, so anyone who’s interested in contributing to this effort is most welcome to do so!

The goal is really to keep this both very easy to get started with while offering sufficient options that production users don’t need to re-invent the wheel and are motivated to contribute back anything that will be useful to the wider Incus community.

jarrodu · April 17, 2024, 9:36am

I got everything up and running on a Raspberry Pi 5 running Bookworm with an NVMe drive.

Pinging across instances running on different cluster members worked as expected. However, I am not sure about Ceph.

I tried to do a live migration with a running container. I have an instance, d1, running on server02. It is using the default profile.

config: {}
description: Default Incus profile
devices:
  eth0:
    name: eth0
    network: default
    type: nic
  root:
    path: /
    pool: remote
    type: disk
name: default
used_by:
- /1.0/instances/d1
- /1.0/instances/d2

Here is the default network config.

config:
  bridge.mtu: "1442"
  ipv4.address: 10.217.93.1/24
  ipv4.nat: "true"
  ipv6.address: fd42:8da5:84f4:8605::1/64
  ipv6.nat: "true"
  network: UPLINK
  volatile.network.ipv4.address: 172.31.254.10
  volatile.network.ipv6.address: fd00:1e4d:637d:1234:216:3eff:fe9e:dd69
description: ""
name: default
type: ovn
used_by:
- /1.0/instances/d1
- /1.0/instances/d2
- /1.0/profiles/default
managed: true
status: Created
locations:
- server01
- server02
- server03

And the remote storage pool config.

config:
  ceph.cluster_name: ceph
  ceph.osd.pg_num: "32"
  ceph.osd.pool_name: incus_baremetal
  ceph.user.name: admin
  volatile.pool.pristine: "true"
description: ""
name: remote
driver: ceph
used_by:
- /1.0/images/3d4c0917478e358f1743679766d2961facc729483a754d49e0d7720de5475116
- /1.0/instances/d1
- /1.0/instances/d2
- /1.0/profiles/default
status: Created
locations:
- server03
- server01
- server02

I tried moving the instance to server03.

root@server01:~# incus move d1 --target server03
Error: Migration operation failure: Instance move to destination failed: Error transferring instance data: Failed migration on target: Error from migration control source: Failed migration on source: migration dump failed
(00.014727) Error (criu/namespaces.c:460): Can't dump nested uts namespace for 6575
(00.014733) Error (criu/namespaces.c:721): Can't make utsns id
Error (criu/util.c:627): execvp("iptables-restore", ...) failed: No such file or directory
(00.018275) Error (criu/util.c:642): exited, status=1
Error (criu/util.c:627): execvp("ip6tables-restore", ...) failed: No such file or directory
(00.020918) Error (criu/util.c:642): exited, status=1
(00.021633) Error (criu/cr-dump.c:2098): Dumping FAILED.

I expected the move to work and the instance uptime to be correct. I might be doing it wrong, but the Tofu and Ansible applied cleanly.

If I turn off the instance, then the move works.

victoitor · April 17, 2024, 1:08pm

Try live migrating VMs to test Ceph.

If it works, the issue is probably not Ceph, but it might just be live migrating containers which I know is tricky.
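
A minimal sketch of such a test, under a few assumptions: the `images:debian/12` image and the instance name `v1` are only examples, the VM's root disk comes from the default profile (so it lands on the `remote` Ceph pool shown above), and `migration.stateful=true` is set at creation time because that is the instance option that allows stateful operations on VMs. Whether that alone is enough on a given platform is not something this thread confirms. The `run` helper is hypothetical and only prints the commands unless `APPLY=1` is set.

```shell
#!/bin/sh
# Sketch only: prints each command, and executes it only when APPLY=1.
set -eu

run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

test_vm_live_migration() {
    # Usage: test_vm_live_migration <name> <source member> <target member>
    name=$1
    src=$2
    dst=$3
    # Create the VM on the source member with stateful migration allowed,
    # then start it.
    run incus init images:debian/12 "$name" --vm --target "$src" -c migration.stateful=true
    run incus start "$name"
    # Move it while it is running, then show name, state and location.
    run incus move "$name" --target "$dst"
    run incus list "$name" -c nsL
}

test_vm_live_migration v1 server02 server03
```

If the move succeeds, running `incus exec v1 -- uptime` before and after should show an uptime that keeps counting rather than resetting.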

stgraber · April 17, 2024, 1:52pm

Yeah, the error you’re showing above is the expected error for live migration of containers at this stage. This is just CRIU being unhappy as usual.

chenchen · April 19, 2024, 7:38am

I also have a very similar Ansible project to deploy “incus & ovn & ceph”. The main difference between the two projects is that mine supports deploying Incus, Ceph, and OVN on a single server and then adding two servers at a time, so the final number of servers in the cluster is 1, 3, 5, 7 and so on.

When I saw “incus-deploy”, I considered whether to abandon my Ansible project. So I would like to ask whether “incus-deploy” will support scaling out a cluster in the future, for example from 5 servers to 7. After all, I believe scaling out is a complex operation. For example, each new server needs a certain number of OSDs to keep the Ceph cluster highly available.

stgraber · April 19, 2024, 4:50pm

Growing clusters should already work fine but I plan on doing more tests on it.

The special case of starting with a single server would likely need some extra work though as Ceph will not provide you with working pools unless you have a minimum of 3 disks on 3 different servers. Crush maps can be changed to allow single-server operations but that’s more custom work needed to handle that out of the box.
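
To make the “3 disks on 3 different servers” rule concrete, here is a small sketch that checks a planned disk layout before deploying. The input format (one `<server> <disk>` pair per line) and both function names are made up for this example and are not part of incus-deploy.

```shell
#!/bin/sh
# Sketch: check a planned layout against the "3 disks on 3 different
# servers" rule. The "<server> <disk>" input format is hypothetical.
set -eu

count_osd_hosts() {
    # Count distinct servers that contribute at least one disk.
    awk 'NF >= 2 { seen[$1] = 1 } END { n = 0; for (h in seen) n++; print n }'
}

ceph_layout_ok() {
    # Succeeds when at least 3 different servers each provide a disk.
    [ "$(count_osd_hosts)" -ge 3 ]
}

layout='server01 /dev/sdb
server02 /dev/sdb
server03 /dev/sdb'

if printf '%s\n' "$layout" | ceph_layout_ok; then
    echo "layout ok"
else
    echo "need disks on at least 3 different servers"
fi
```

Three disks on a single server would fail this check, which matches the single-server case described above.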

But going from say 3 machines to 10, that I’d expect to work fine and if not functional now, to be pretty trivial to sort out. The way this should work is by simply adding the extra machines and roles to the Ansible inventory and running deploy.yaml again. deploy.yaml is designed to be safe to re-run so it should then only deploy the additional systems and reconfigure the existing ones if the new systems have roles within ceph or ovn.

stgraber · April 26, 2024, 11:31am

xarufagem · May 3, 2024, 9:38am

Hello, and thanks for all the work and love put into these scripts, it’s kind of you to share.

I had started a ‘minimal’ Ansible playbook, designed to set up the Zabbly sources for Incus/Kernel/ZFS and install them on fresh installs, repurposing hardware!

Your playbooks are way smarter than mine, and easier to adapt thanks to the vars added through the Ansible logic. I’m even newer to Ansible.

Though, trying to run it “out of the box” on Ubuntu 22.04 failed.
I had to use the Ceph repository to get reef installed, otherwise an error happened.

L314-317 : handlers:
  - name: Enable msgr2
    shell:
      cmd: ceph mon enable-msgr2

It looks like this command is not available if Ceph is installed from the distro package, or maybe it is because I installed recommends? Running the latest reef release sorted this out.

I can see some logic in ceph.yaml regarding the Ceph version to deploy. In that case, does the Ceph version on the ‘control node’ need to be the same as the target version being deployed with Ansible?

I ask because a fair number of commands run locally on the ‘control node’, which then configures the instances using ‘locally generated’ files.

Thanks again for all of your work, amazing !
/joen

stgraber · May 3, 2024, 1:07pm

The version on the management node shouldn’t matter too much as we just need two very basic Ceph tools.

I’ll have to do a test with distro packaging. I’m surprised that the msgr2 stuff isn’t there yet, but it should be easy to add a condition to handle that.

stgraber · May 3, 2024, 1:37pm

I just did a full deploy with ceph_release: distro on Ubuntu 22.04 LTS and it succeeded here (Ceph 17.2.7 from Ubuntu packages), no issue with enable-msgr2.