[Solved] Cluster - Shared ceph pool or init incomplete?


#1

[Short version]

I installed lxd from apt on the first host, had problems copying from an old lxd host to the new lxd host.

So I installed the snap version, problems copying containers were resolved.

Once my base OS upgrades were completed on the other machines I installed lxd only from snap.

Snap doesn’t install requirements or perform some configuration that apt does.

The solution was to remove the snap, install the apt versions, purge them, and then reinstall the snap version.

Everything is working as expected. As for the question of whether the pool is shared:

Yes, it is shared. Thanks for your work on this project, everyone!

[/short version]

Is the ceph pool shared now?

If it isn't: during lxd init for additional lxd members joining an existing cluster, you are not prompted to create a new ceph pool on each host.

Is that on purpose or do I need to manually create the storage pool now?

How do you specify which pool is for which cluster host or does it just know automatically?

Assuming everything should be working, I tried to move a container using lxc move container --target new_host. The move appeared to work, but the container fails to start:

lxc container 20181217063106.757 WARN conf - conf.c:lxc_map_ids:2917 - newuidmap binary is missing
lxc container 20181217063106.758 WARN conf - conf.c:lxc_map_ids:2923 - newgidmap binary is missing
lxc container 20181217063106.852 WARN conf - conf.c:lxc_map_ids:2917 - newuidmap binary is missing
lxc container 20181217063106.852 WARN conf - conf.c:lxc_map_ids:2923 - newgidmap binary is missing
lxc container 20181217063106.876 ERROR dir - storage/dir.c:dir_mount:198 - No such file or directory - Failed to mount "/var/snap/lxd/common/lxd/containers/container/rootfs" on "/var/snap/lxd/common/lxc/"
lxc container 20181217063106.876 ERROR conf - conf.c:lxc_mount_rootfs:1326 - Failed to mount rootfs "/var/snap/lxd/common/lxd/containers/container/rootfs" onto "/var/snap/lxd/common/lxc/" with options "(null)"
lxc container 20181217063106.876 ERROR conf - conf.c:lxc_setup_rootfs_prepare_root:3445 - Failed to setup rootfs for
lxc container 20181217063106.876 ERROR conf - conf.c:lxc_setup:3498 - Failed to setup rootfs
lxc container 20181217063106.876 ERROR start - start.c:do_start:1263 - Failed to setup container "container"
lxc container 20181217063106.876 ERROR sync - sync.c:__sync_wait:62 - An error occurred in another process (expected sequence number 5)
lxc container 20181217063106.876 WARN network - network.c:lxc_delete_network_priv:2589 - Operation not permitted - Failed to remove interface "eth0" with index 10
lxc container 20181217063106.876 ERROR lxccontainer - lxccontainer.c:wait_on_daemonized_start:842 - Received container state "ABORTING" instead of "RUNNING"
lxc container 20181217063106.877 ERROR start - start.c:__lxc_start:1939 - Failed to spawn container "container"
lxc container 20181217063106.877 WARN conf - conf.c:lxc_map_ids:2917 - newuidmap binary is missing
lxc container 20181217063106.877 WARN conf - conf.c:lxc_map_ids:2923 - newgidmap binary is missing
lxc 20181217063106.880 WARN commands - commands.c:lxc_cmd_rsp_recv:132 - Connection reset by peer - Failed to receive response for command "get_state"

The directory it’s referencing for the rootfs is empty, no links or anything.

This is with the candidate snap branch.

driver: lxc
driver_version: 3.0.3
kernel: Linux
kernel_architecture: x86_64
kernel_version: 4.15.0-42-generic
server: lxd
server_pid: 1593
server_version: "3.7"
storage: ""
storage_version: ""
server_clustered: true
server_name: lxdhost
project: default


(Stéphane Graber) #2

When using CEPH with a LXD cluster, all nodes must have access to the same CEPH cluster and they’ll all be using the same CEPH pool from that cluster.

You can add more than one CEPH pool to your LXD cluster, though, and then choose on a per-container basis which pool each will use, but a LXD cluster always assumes all hosts are identical when it comes to storage and network setup.
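To illustrate the per-container choice, a rough sketch (the second pool's name, its backing OSD pool, and the container names here are made up):

```shell
# Create a second LXD pool backed by a different ceph OSD pool
# (pool name "fast" and OSD pool "ssd-osd-pool" are hypothetical)
lxc storage create fast ceph source=ssd-osd-pool

# Pick the pool per container at launch time with -s/--storage
lxc launch ubuntu:18.04 c1 -s remote
lxc launch ubuntu:18.04 c2 -s fast
```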


#3

Ok, so it should be working then.

I notice that force reuse is not configured by default. It sounds like you're saying that the ceph pool is shared, so should force reuse be configured?

Also, last night I was able to issue storage commands; this morning 'lxc storage' hangs. Nothing shows up in the logs at all.

All hosts have access to the same ceph cluster.


(Stéphane Graber) #4

When using LXD clustering, you don’t need to do anything on a per-node basis when it comes to using a CEPH cluster.

So long as all nodes can talk to the same CEPH cluster, you can do:

lxc storage create my-pool ceph source=NAME-OF-OSD-POOL

And that should be it. The force reuse flag isn't needed unless you're dealing with a dirty OSD pool (pre-existing and not empty), which wouldn't be recommended. The force reuse flag exists pretty much only for data recovery after a disaster, as it lets you re-attach a pool to a new, empty LXD and get the containers back into the database with lxd import.
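As a rough sketch of that recovery path (assuming ceph.osd.force_reuse is the config key behind the force reuse flag, with made-up pool and container names):

```shell
# Disaster recovery only: re-attach a pre-existing, non-empty OSD pool
# to a fresh, empty LXD. The key name ceph.osd.force_reuse is my
# assumption for the "force reuse" flag discussed above.
lxc storage create remote ceph source=old-osd-pool ceph.osd.force_reuse=true

# Bring the containers found on that pool back into the LXD database
lxd import some-container
```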


#5

I get an error on the new node:

lxc storage create remote ceph source=lxd-3.8
Error: Config key 'source' is node-specific

I tried using the same value as the other host and a different value, I get the same error message.
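For reference, the clustering docs suggest that node-specific keys such as source are meant to be staged on each member with --target before the pool is created cluster-wide; a sketch with assumed host names:

```shell
# Stage the node-specific "source" key on every cluster member first
lxc storage create remote ceph source=lxd-3.8 --target host1
lxc storage create remote ceph source=lxd-3.8 --target host2

# Then run it once more without --target to actually create the pool
lxc storage create remote ceph
```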

Was the hanging of the commands due to lxd being out of date? I switched back to the candidate branch and it started working again. However, before that I did a snap refresh and everything was up to date. It seems like it might have reverted to stable on its own?

I tried to move a container again and start it:

lxd.daemon[1134]: t=2018-12-17T11:45:59-0600 lvl=eror msg="Failed to mount RBD storage volume for container \"container\": %!s()"
lxd.daemon[1134]: t=2018-12-17T11:45:59-0600 lvl=warn msg="Unable to update backup.yaml at this time" name=container rootfs=/var/snap/lxd/common/lxd/containers/container/rootfs


#6

So some background on this, I created my ceph storage cluster and configured a single lxd host. I moved all the containers from the old lxd hosts to this new one.

I've now reinstalled Ubuntu 18.04 on the remaining nodes in the cluster and am attempting to add them to the single host I set up earlier.

Are you saying that adding lxd instances to an existing cluster with containers is not recommended? The pool was not used by anything previously.

Pool has ~70 containers, all running from that single host. Working surprisingly well.


#7

I think this might have something to do with the user that is used to interact with ceph.

From my normal user account I can't check ceph status; I have to sudo. The docs say the default user is admin, but no such user exists on the system.

Some more info:

Original
/var/snap/lxd/common/lxd/containers/container$ sudo ls -lah
total 8.0K
drwx--x--x  2 root root 4.0K Dec 16 22:53 .
drwx--x--x 66 root root 4.0K Dec 16 22:53 ..

ls of ..
lrwxrwxrwx 1 root root 62 Dec 16 22:53 container -> /var/snap/lxd/common/lxd/storage-pools/remote/containers/container

Moved To
/var/snap/lxd/common/lxd/containers/container$ sudo ls -lah
total 8.0K
drwx--x--x 2 1000000 1000000 4.0K Dec 17 00:30 .
drwx--x--x 3 root    root    4.0K Dec 17 00:30 ..

ls of ..
lrwxrwxrwx 1 root root 62 Dec 17 00:30 container -> /var/snap/lxd/common/lxd/storage-pools/remote/containers/container

Ok, thinking through this. I moved the container before I tried to issue the command to create the storage pool on the new host. Is it possible that this broke the pool on the new host?

I also notice that in the logs with trying to start the container, the mount says:
20181217193156.485 ERROR dir - storage/dir.c:dir_mount:198 - No such file or directory - Failed to mount

Shouldn’t that read storage/ceph.c or is that normal?

Update:

It appears that on the first host, the containers are set to uid/gid 0. On the new host the uid/gid are set to 1000000.
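That difference looks like one host writing the container's files unshifted (root stays uid 0) while the other shifts them into an unprivileged range; a minimal sketch of the shift arithmetic, assuming the range starts at 1000000:

```shell
#!/bin/sh
# Sketch: an unprivileged idmap offsets container ids by the start of
# root's subordinate id range on the host (1000000 assumed here)
base=1000000          # assumed start of the subordinate range
container_uid=0       # root inside the container
host_uid=$((base + container_uid))
echo "$host_uid"      # prints 1000000
```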

I added a subuid/subgid entry for root on the new host starting at 1000000, but it didn't change anything. The subuid/subgid entry on the first host is set as 165536:65536, though.

I chown’d the container on the new host to 0:0 and no effect.
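For what it's worth, the map LXD believes it last applied can be inspected and compared against the host's subordinate ranges; a sketch (container name as in the logs above):

```shell
# The idmap LXD last applied to this container's on-disk files
lxc config get container volatile.last_state.idmap

# The subordinate id ranges granted to root on this host
grep '^root:' /etc/subuid /etc/subgid
```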

Can I do that, or do I need to?

I tried creating the pool on a host before doing a lxd init:

sudo lxc storage create remote ceph source=lxd-3.8
If this is your first time running LXD on this machine, you should also run: lxd init
To start your first container, try: lxc launch ubuntu:18.04

Error: Failed to run: ceph --name client.admin --cluster ceph osd pool create lxd-3.8 32: Error initializing cluster client: Error('error calling conf_read_file: error code 22',)

I see that it’s trying to use that client.admin username. Does it matter that the account doesn’t exist on the local system or is that just for ceph internal auth?

Update: never mind, I see that client.admin corresponds to the keyring created by default by ceph-deploy. I'm pretty certain what's happening is that it can't read the keyring file for some reason. Permissions, possibly?

Update 2: using sudo I can successfully run the same command that it says fails… I assume this means that when running sudo lxd init, at some point it switches to the lxd user, and that user doesn't have permission to read the keyring.
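A quick way to test that theory (the paths are the ceph-deploy defaults; whether the daemon can read them is exactly what's in question):

```shell
# Check who can read ceph.conf and the admin keyring
ls -l /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring

# Compare the same call with and without sudo
ceph --name client.admin --cluster ceph status
sudo ceph --name client.admin --cluster ceph status
```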

There is a small difference between the machines: on the first one I installed lxd from apt, then purged it and installed from the snap.

The base is a network boot with only ssh selected during package selection.

I’m going to assume also that apt did something that snap does not do and investigate down that path.

Another update: I've given up on determining exactly what's different. I installed lxd, lxd-tools, and lxcfs from apt, purged them, installed from the snap, and then did an init, and it worked.

Final update: the issue is resolved. The mistake was not installing from apt first; for whatever reason the snap doesn't install or configure everything required for this to work. Probably a bug?

I would also recommend changing the last question in the lxd init process to “What is the ceph pool name for the ‘remote’ storage pool?”