Not sure why I'm having so much trouble, but I've purged and rebooted and I still can't get my cluster to work.
root@debian-dell:~# microcloud init
Please choose the address MicroCloud will be listening on [default=10.30.2.2]:
Scanning for eligible servers...
Press enter to end scanning for servers
Found "debian-dell2" at "10.30.2.3"
Found "debian-nuc" at "10.30.2.4"
Ending scan
Initializing a new cluster
Local MicroCloud is ready
Local MicroCeph is ready
Local LXD is ready
Awaiting cluster formation...
<< it just gets stuck here and then times out…
2023-02-17T10:15:56-08:00 microcloud.daemon[2956]: time="2023-02-17T10:15:56-08:00" level=warning msg="microcluster database is uninitialized"
2023-02-17T10:16:15-08:00 microcloud.daemon[2956]: time="2023-02-17T10:16:15-08:00" level=error msg="Failed to parse join token" error="Failed to parse token map: invalid character 'i' looking for beginning of value" name=debian-dell2
<< This is from another node on my network (same subnet)
Timed out waiting for a response from all cluster members
Cluster initialization is complete
Would you like to add additional local disks to MicroCeph? (yes/no) [default=yes]: Select from the available unpartitioned disks:
Space to select; Enter to confirm; Esc to exit; Type to filter results.
Up/Down to move; Right to select all; Left to select none.
+-------------+----------------+-----------+------+----------------------------------------+
| LOCATION | MODEL | CAPACITY | TYPE | PATH |
+-------------+----------------+-----------+------+----------------------------------------+
> [ ] | debian-dell | EDGE SE847 SSD | 465.76GiB | sata | /dev/disk/by-id/wwn-0x588891410006496d |
[ ] | debian-dell | EDGE SE847 SSD | 465.76GiB | sata | /dev/disk/by-id/wwn-0x5888914100071325 |
+-------------+----------------+-----------+------+----------------------------------------+
Error: Failed to confirm disk selection: Failed to confirm selection: interrupt
<< and this is from the node where I ran microcloud init… it's almost as if it can't reach the other nodes?
UPDATE:
For some reason, on Debian it is necessary to do a "snap restart microcloud" before running init. If you keep seeing a token-parsing error in "snap logs microcloud", just keep restarting the snap until the only debug message left is that the database has not been initialized… it works after that.
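A rough way to script that check on each node (plain snap commands; the grep pattern is just the error string from my logs, so adjust it if yours differs):

# Restart the microcloud snap until the token-parsing error stops showing
# up in its recent log output, then confirm that only the "database is
# uninitialized" warning remains.
while sudo snap logs microcloud -n=50 | grep -q "Failed to parse join token"; do
    sudo snap restart microcloud
    sleep 10
done
sudo snap logs microcloud -n=20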
I started over multiple times, and the same failure kept appearing randomly on the other nodes. The errors I got are related to the join tokens.
Feb 19 00:18:40 microcloud2 microcloud.daemon[1483]: time="2023-02-19T00:18:40Z" level=error msg="Failed to handle join token" error="Failed to join \"MicroCloud\" cluster: Failed to join cluster with the given join token" name=microcloud2
Feb 19 00:18:31 microcloud3 microcloud.daemon[2628]: time="2023-02-19T00:18:31Z" level=error msg="Failed to parse join token" error="Failed to parse token map: invalid character 'r' looking for beginning of value" name=microcloud3
Finally, the initialization completed successfully and now I have it up and running.
ahmad@microcloud1:~$ sudo -i
root@microcloud1:~# microcloud init
Please choose the address MicroCloud will be listening on [default=192.168.0.35]:
Scanning for eligible servers...
Press enter to end scanning for servers
Found "microcloud2" at "192.168.0.36"
Found "microcloud3" at "192.168.0.37"
Ending scan
Initializing a new cluster
Local MicroCloud is ready
Local MicroCeph is ready
Local LXD is ready
Awaiting cluster formation...
Peer "microcloud2" has joined the cluster
Peer "microcloud3" has joined the cluster
Cluster initialization is complete
Would you like to add additional local disks to MicroCeph? (yes/no) [default=yes]:
Select from the available unpartitioned disks:
Select which disks to wipe:
Adding 3 disks to MicroCeph
MicroCloud is ready
root@microcloud1:~#
Summary of issues and solutions:
1. If "microcloud init" fails with "Timed out waiting for a response from all cluster members" and one or more of the nodes fails to handle and/or parse the join token (see the errors I posted in my previous comment), cancel the current run, wipe all the snaps on all nodes, and run it again; eventually it will succeed. To do the wipe, I executed what nkrapf suggested (a rough sketch of a typical purge follows this list).
2. If you are using qemu-kvm VMs and the additional disks attached to the VMs use the virtio disk bus, then when adding the disks to Ceph they do not get a path under /dev/disk/by-id/, so nothing shows up to select. To overcome this, wipe all the snaps on all nodes as mentioned in point 1, re-add the disks to the VMs with SCSI as the bus type, and run microcloud init again; eventually it should succeed (see the virsh sketch after this list).
3. If you are experiencing timeouts waiting for the cluster to form, the key is to go to each node in the cluster and keep issuing "snap restart microcloud" until you no longer see the token-parsing error in the logs. Once no node shows that error any more, run "microcloud init".
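For reference, the wipe in point 1 usually comes down to purging and reinstalling the three snaps on every node. nkrapf's exact commands aren't quoted here, so treat this as my own assumption of the typical sequence and double-check it before running it (--purge also deletes the snaps' data):

# Remove the stack in reverse dependency order, then reinstall it
sudo snap remove --purge microcloud
sudo snap remove --purge microceph
sudo snap remove --purge lxd
sudo snap install lxd
sudo snap install microceph
sudo snap install microcloud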
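And for point 2, if the VMs are managed with libvirt, re-attaching a data disk on the SCSI bus can be done with virsh. The VM name, image path, target and serial below are placeholders, and the guest needs a (virtio-)SCSI controller defined:

# Detach the virtio disk, then re-attach it on the SCSI bus with a serial
# so it gets a stable entry under /dev/disk/by-id/ inside the guest.
virsh detach-disk microcloud2 vdb --persistent
virsh attach-disk microcloud2 /var/lib/libvirt/images/microcloud2-osd.qcow2 sda \
    --targetbus scsi --driver qemu --subdriver qcow2 --serial microcloud2-osd --persistent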
Please choose the address MicroCloud will be listening on [default=10.20.0.11]:
Scanning for eligible servers...
Press enter to end scanning for servers
Found "infra2" at "10.20.0.12"
Ending scan
Initializing a new cluster
Error: Failed to bootstrap local MicroCloud: Post "http://control.socket/cluster/control": dial unix /var/snap/microcloud/common/state/control.socket: connect: connection refused
mother@infra1:~$ sudo systemctl start snap.microcloud.daemon.service
mother@infra1:~$ systemctl status snap.microcloud.daemon.service
× snap.microcloud.daemon.service - Service for snap application microcloud.daemon
Loaded: loaded (/etc/systemd/system/snap.microcloud.daemon.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sat 2023-02-25 12:50:59 UTC; 1s ago
Process: 84222 ExecStart=/usr/bin/snap run microcloud.daemon (code=exited, status=1/FAILURE)
Main PID: 84222 (code=exited, status=1/FAILURE)
CPU: 113ms
Feb 25 12:50:59 infra1 systemd[1]: snap.microcloud.daemon.service: Scheduled restart job, restart counter is at 5.
Feb 25 12:50:59 infra1 systemd[1]: Stopped Service for snap application microcloud.daemon.
Feb 25 12:50:59 infra1 systemd[1]: snap.microcloud.daemon.service: Start request repeated too quickly.
Feb 25 12:50:59 infra1 systemd[1]: snap.microcloud.daemon.service: Failed with result 'exit-code'.
Feb 25 12:50:59 infra1 systemd[1]: Failed to start Service for snap application microcloud.daemon.
Any options here? Do I need to tweak the systemd service?
Thank you, that is what I did yesterday, and in the end I ended up just reinstalling microcloud since it still had the previous address that LXD was listening on configured.
I'm experiencing an issue where I am unable to start VMs that I've copied from an existing LXD host to my MicroCloud cluster. I get 'Error: Failed setting up disk device "root": Couldn't find a keyring entry' when I try to start them.
josh@lxd00:~$ lxc copy lxdtest1:homeassistant homeassistant -s remote
josh@lxd00:~$ lxc config device remove homeassistant eth0
Device eth0 removed from homeassistant
josh@lxd00:~$ lxc profile apply homeassistant lan2
Profiles lan2 applied to homeassistant
josh@lxd00:~$ lxc start homeassistant
Error: Failed setting up disk device "root": Couldn't find a keyring entry
Try `lxc info --show-log homeassistant` for more info
josh@lxd00:~$ lxc info --show-log homeassistant
Name: homeassistant
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Location: lxd02
Created: 2023/02/26 11:46 CST
Error: open /var/snap/lxd/common/lxd/logs/homeassistant/qemu.log: no such file or directory
josh@lxd00:~$
Basically MicroCeph uses ceph.keyring whereas LXD expects ceph.client.admin.keyring. Both are valid paths, so we're now expanding the LXD lookup logic to match that of Ceph itself.
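Until that lands, one possible workaround (my own sketch; the directory below is where the MicroCeph snap keeps its conf by default, so verify the paths on your nodes before copying anything) is to expose the same keyring under the name LXD currently looks for:

# Assumed default MicroCeph conf path; LXD currently looks for the
# ceph.client.admin.keyring name next to ceph.conf.
sudo cp /var/snap/microceph/current/conf/ceph.keyring \
        /var/snap/microceph/current/conf/ceph.client.admin.keyring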
Can I fool it by having partitions instead of disks? I have three Lenovo Thinkcentre machines, each with a 1TB NVMe, no space to add more disks to them.
Not currently; MicroCloud as it stands today looks for full disks and actively skips any partitioned ones.
But @masnax is working on quite a few improvements in that area, and one thing we're looking at doing is letting you add additional entries to what's auto-detected, which could then be used to force it to use partitions.
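In the meantime, a quick way to see which block devices count as whole, unpartitioned disks (and would therefore show up in the selector) is a plain lsblk listing:

# Devices of TYPE "disk" with no partition children underneath are the candidates
lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINT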
I'm experiencing an issue when I try to move a container instance from my local ZFS pool to the remote Ceph pool.
josh@lxd01:~$ lxc move motioneye -s remote
Error: Migration operation failure: Create instance from copy: Create instance volume from copy failed: [Rsync send failed: motioneye, /var/snap/lxd/common/lxd/storage-pools/local/backup.1302756649/: [exit status 11 read unix @lxd/cfa31ccf-c74e-4c78-8569-6f0df2743511->@: use of closed network connection] (rsync: write failed on "/var/snap/lxd/common/lxd/storage-pools/remote/containers/lxd-move-of-116d4b8f-15f0-4cac-b117-5985164bfae9/rootfs/var/log/journal/82c548939e714e58afcf80b967b33ddd/system@e258b1e601694d89a989f76224c056b1-0000000000d06b47-0005f3f71cf0ba7e.journal": No space left on device (28)
rsync error: error in file IO (code 11) at receiver.c(374) [receiver=3.1.3]
) Rsync receive failed: /var/snap/lxd/common/lxd/storage-pools/remote/containers/lxd-move-of-116d4b8f-15f0-4cac-b117-5985164bfae9/: [exit status 11] ()]
josh@lxd01:~$
All my nodes have storage.images_volume and storage.backups_volume set to a volume on the Ceph storage. The Ceph cluster has ~100TB available. This container's root disk is <5GiB in size. The root disk of this node has ~40GiB available.
I'm not sure why this keeps failing with "No space left". I even tried setting storage.images_volume and storage.backups_volume to a volume on the local storage but got the same error.
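In case it helps narrow things down, these are the checks I would run to see which side is actually running out of space during the move (standard lxc commands; "remote" and "local" are the pool names from the output above):

# Space reported for the target Ceph pool from LXD's point of view
lxc storage info remote
# Default size cap applied to new volumes on the pool, if set
# (block-backed pools such as Ceph create new volumes at this size)
lxc storage get remote volume.size
# Same check for the source pool the temporary backup is staged on
lxc storage info local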