Incus host on Google Compute Engine boots to emergency mode

I have an Incus host running on a Debian 12 instance on GCE. It was working fine, but now it gets stuck in emergency mode after a reboot.

The serial port output:


[  OK  ] Reached target network-online.target - Network is Online.
[  OK  ] Reached target nss-lookup.… - Host and Network Name Lookups.

You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.

Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.


Looking through the serial port logs, there are no "error" entries, but a filesystem-related job times out. I'm guessing this is the attached ZFS disk (non-root).

[   ***] Job dev-disk-by\x2duuid-6b3277db\x2…tart running (1min 28s / 1min 30s)
[    **] Job dev-disk-by\x2duuid-6b3277db\x2…tart running (1min 29s / 1min 30s)
[     *] Job dev-disk-by\x2duuid-6b3277db\x2…tart running (1min 29s / 1min 30s)
[    **] Job dev-disk-by\x2duuid-6b3277db\x2…tart running (1min 30s / 1min 30s)
[ TIME ] Timed out waiting for device …277db-5548-43c1-8dd5-179c5c2e4dae.
[DEPEND] Dependency failed for syst…277db-5548-43c1-8dd5-179c5c2e4dae.
[DEPEND] Dependency failed for medi…ta\x2d1.mount - /media/data-1.
[DEPEND] Dependency failed for loca…s.target - Local File Systems.
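
For reference, the \x2d sequences in those DEPEND lines are just systemd's unit-name escaping; encoding the mount point by hand shows it lines up with the failing unit:

   # run anywhere with systemd; this only confirms the unit/path mapping
   systemd-escape -p --suffix=mount /media/data-1
   # prints: media-data\x2d1.mount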

The first time this happened it was due to a "shielded_vm_integrity" event:

lateBootReportEvent: {
   actualMeasurements: [11]
   policyEvaluationPassed: false
   policyMeasurements: [3]
}

I have now disabled the vTPM, Integrity Monitoring, and Secure Boot options. I no longer get the integrity event; however, the machine still drops into emergency mode.

I would like to recover this instance if possible. Any ideas on how to address this? Even connecting to the console would be a step in the right direction.
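
In case it matters, I believe interactive serial console access can be enabled with something along these lines (instance name and zone are placeholders), though with the root account locked I suspect sulogin will still refuse to open the emergency shell:

   gcloud compute instances add-metadata incus-host --zone=us-central1-a \
       --metadata serial-port-enable=TRUE
   gcloud compute connect-to-serial-port incus-host --zone=us-central1-a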

Welcome!

There is a mention of a mount point not being available, /media/data-1. Is that the storage pool for Incus? Can you check the state of that mount point from the GCE environment?
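
Something like this should at least show whether the disk backing that mount point is still attached (instance name and zone are placeholders):

   gcloud compute instances describe incus-host --zone=us-central1-a --format="yaml(disks)"
   gcloud compute disks list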

Thank you, @simos!

Hmm, that’s likely the storage pool for Incus. When I configured the pool, I believe I used:
incus storage create pd-standard zfs source=/dev/sdb

I’ve since learned that one shouldn’t use /dev/sdX on GCE, as the device mapping may change. I can boot a recovery VM instance and mount the root disk; is there a way I can “fix” the Incus storage pool? Or even just remove it? I hadn’t begun using it (effectively) yet.

Thanks!

Feel free to delete it and create a new one.
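
Once the host boots again, and assuming nothing references the pool yet, that should just be:

   incus storage list               # confirm the pool name and that it is unused
   incus storage delete pd-standard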

Hi @jarrodu, beginner question: how can I delete the pool from an Incus installation on a mounted disk? Would I install Incus on the recovery VM and then somehow point it at the configuration files on the mount?

My thinking is:

  • boot a recovery VM
  • mount the “original” root disk
  • install Incus with parameters pointing to the mounted config? (rough sketch below)
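
Roughly what I had in mind (instance, disk, and device names are guesses):

   # the broken instance has to be stopped (or its boot disk detached) first
   gcloud compute instances create recovery-vm --zone=us-central1-a
   gcloud compute instances attach-disk recovery-vm --disk=incus-host --zone=us-central1-a
   # on the recovery VM: find and mount the original root filesystem
   lsblk
   sudo mount /dev/sdb1 /mnt        # assuming the old root partition shows up as sdb1
   ls /mnt/var/lib/incus/           # where I believe the Incus state lives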

Or is there a better way?

I don’t think I understood your situation until I reread your post.

You are working in the cloud, so why not just destroy the whole thing and start over? I suspect anything else is going to get complicated.

In the cloud, instances are cheap and everything can be automated. It is not uncommon to redeploy instances on every change.

If you want to try to rescue the server, could you share your fstab and/or the relevant .mount files for the failing disks?

The error seems pretty straightforward: a disk that previously existed isn’t there anymore, so the mount job times out. If you can find out what the UUID changed to (assuming the filesystem is still there), you might be able to just change it.
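
From a recovery VM with the old root disk mounted at /mnt (paths here are assumptions), that check would look something like:

   sudo blkid                       # is the old UUID still present on any device?
   grep data-1 /mnt/etc/fstab       # the entry systemd is waiting on
   # if the filesystem is gone for good (e.g. the disk became a zpool), removing
   # the line or adding the "nofail" mount option will let the host boot again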

Thanks all, I deleted everything and started over. GCE provides stable symlinks under /dev/disk/by-id/, which I will use as the block device when creating the pool.
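
For anyone finding this later, the symlinks look roughly like this (the device name is whatever was set when the disk was created; "data-1" here is just an example):

   ls -l /dev/disk/by-id/
   # lrwxrwxrwx ... google-data-1 -> ../../sdb
   incus storage create pd-standard zfs source=/dev/disk/by-id/google-data-1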
