Failure to start VM / saving config failed (disk full)

Here it is

admin@host:~$ lxc config show gooo --expanded
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 18.04 LTS amd64 (release) (20180808)
  image.label: release
  image.os: ubuntu
  image.release: bionic
  image.serial: "20180808"
  image.version: "18.04"
  security.nesting: "true"
  volatile.base_image: 7e8633da9dfc800230c7330cf04e9f284e82e26ddbc1757448c29c25db80f1e4
  volatile.eth0.hwaddr: 08:00:27:bc:91:63
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: STOPPED
  volatile.uuid: 8162435d-aa67-4630-bb7f-63528e4b697c
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: Medggg
    type: disk
ephemeral: false
profiles:
- default
- mse-lan-profile
stateful: false
description: ""

I have not tried a reboot yet, as it is a live server. I will schedule one ASAP.

Thanks, let me know.

Also, can you show lxc info <instance> too?

I was able to get a reboot slot from the users. Still the same issue.

admin@host:~$ lxc info gooo
Name: gooo
Location: none
Remote: unix://
Architecture: x86_64
Created: 2018/08/13 19:59 CEST
Status: Stopped
Type: container
Profiles: default, mse-lan-profile
Snapshots:
  snap-2021-02-25 (taken at 2021/02/25 00:29 CET) (stateless)
  snap-2022-11-18 (taken at 2022/11/18 23:41 CET) (stateless)
  snap-2022-11-25 (taken at 2022/11/25 23:05 CET) (stateless)
  snap-2022-11-26 (taken at 2022/11/26 01:44 CET) (stateless)

I'm still wondering why I don't see any message or error other than this saving-config problem.

Have you looked in /var/snap/lxd/common/lxd/logs/lxd.log?

Also anything in sudo dmesg?
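For example, something along these lines (just a suggestion, adjust the patterns as needed) should surface any storage- or container-related kernel messages:

sudo dmesg -T | grep -iE 'btrfs|loop|lxc|lxd|apparmor'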

Let's take a look at lxc storage info <pool> too, to get an idea of the sizing.

Nothing special after the reboot:

t=2022-11-28T11:01:03+0100 lvl=info msg="LXD is starting" mode=normal path=/var/snap/lxd/common/lxd version=4.0.9
t=2022-11-28T11:01:03+0100 lvl=info msg="Kernel uid/gid map:"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - u 0 0 4294967295"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - g 0 0 4294967295"
t=2022-11-28T11:01:03+0100 lvl=info msg="Configured LXD uid/gid map:"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - u 0 1000000 1000000000"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - g 0 1000000 1000000000"
t=2022-11-28T11:01:03+0100 lvl=info msg="Kernel features:"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - closing multiple file descriptors efficiently: yes"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - netnsid-based network retrieval: yes"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - pidfds: yes"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - core scheduling: yes"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - uevent injection: yes"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - seccomp listener: yes"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - seccomp listener continue syscalls: yes"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - seccomp listener add file descriptors: yes"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - attach to namespaces via pidfds: yes"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - safe native terminal allocation : yes"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - unprivileged file capabilities: yes"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - cgroup layout: cgroup2"
t=2022-11-28T11:01:03+0100 lvl=warn msg=" - Couldn't find the CGroup network priority controller, network priority will be ignored"
t=2022-11-28T11:01:03+0100 lvl=info msg=" - shiftfs support: disabled"
t=2022-11-28T11:01:04+0100 lvl=info msg="Initializing local database"
t=2022-11-28T11:01:04+0100 lvl=info msg="Set client certificate to server certificate" fingerprint=6b08b9b7b909c92e92bbf1c8b5efcc649d7ddade33c74dfb06e87542ceb94e2e
t=2022-11-28T11:01:04+0100 lvl=info msg="Starting database node" id=1 local=1 role=voter
t=2022-11-28T11:01:04+0100 lvl=info msg="Starting /dev/lxd handler:"
t=2022-11-28T11:01:04+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2022-11-28T11:01:04+0100 lvl=info msg="REST API daemon:"
t=2022-11-28T11:01:04+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2022-11-28T11:01:04+0100 lvl=info msg=" - binding TCP socket" socket=[::]:8443
t=2022-11-28T11:01:04+0100 lvl=info msg="Initializing global database"
t=2022-11-28T11:01:04+0100 lvl=info msg="Connecting to global database"
t=2022-11-28T11:01:04+0100 lvl=info msg="Connected to global database"
t=2022-11-28T11:01:04+0100 lvl=info msg="Initialized global database"
t=2022-11-28T11:01:05+0100 lvl=info msg="Firewall loaded driver" driver=nftables
t=2022-11-28T11:01:05+0100 lvl=info msg="Initializing storage pools"
t=2022-11-28T11:01:05+0100 lvl=info msg="Initializing daemon storage mounts"
t=2022-11-28T11:01:05+0100 lvl=info msg="Loading daemon configuration"
t=2022-11-28T11:01:05+0100 lvl=info msg="Initializing networks"
t=2022-11-28T11:01:06+0100 lvl=info msg="Pruning leftover image files"
t=2022-11-28T11:01:06+0100 lvl=info msg="Done pruning leftover image files"
t=2022-11-28T11:01:06+0100 lvl=info msg="Starting device monitor"
t=2022-11-28T11:01:06+0100 lvl=info msg="Started seccomp handler" path=/var/snap/lxd/common/lxd/seccomp.socket
t=2022-11-28T11:01:06+0100 lvl=info msg="Pruning expired images"
t=2022-11-28T11:01:06+0100 lvl=info msg="Done pruning expired images"
t=2022-11-28T11:01:06+0100 lvl=info msg="Pruning expired instance backups"
t=2022-11-28T11:01:06+0100 lvl=info msg="Done pruning expired instance backups"
t=2022-11-28T11:01:06+0100 lvl=info msg="Updating images"
t=2022-11-28T11:01:06+0100 lvl=info msg="Expiring log files"
t=2022-11-28T11:01:06+0100 lvl=info msg="Done updating images"
t=2022-11-28T11:01:06+0100 lvl=info msg="Done expiring log files"
t=2022-11-28T11:01:06+0100 lvl=info msg="Updating instance types"
t=2022-11-28T11:01:06+0100 lvl=info msg="Done updating instance types"
t=2022-11-28T11:01:06+0100 lvl=info msg="Daemon started"

And before the reboot it was not much more…

And dmesg does not have much either, nothing but a simple:

[ 41.708215] audit: type=1400 audit(1669629665.585:116): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd_dnsmasq-lxdbr0_</var/snap/lxd/common/lxd>" pid=2839 comm="apparmor_parser"

Odd…

Is that including trying to start the container?

Does lxc info --show-log gooo show anything useful after a start failure?

Looking at Containers cannot start: saving config file for the container failed · Issue #6118 · lxc/lxd · GitHub and Error: Common start logic: saving config file for the container failed · Issue #7406 · lxc/lxd · GitHub, it suggests you may be out of pool storage space.

Nope, nothing in the show-log output and nothing in the log file, just the error at start time without any other indication.

admin@host:~$ lxc start gooo
Error: saving config file for the container failed
Try `lxc info --show-log gooo` for more info
admin@host:~$

and

admin@host:~$ lxc info --show-log gooo
Name: gooo
Location: none
Remote: unix://
Architecture: x86_64
Created: 2018/08/13 19:59 CEST
Status: Stopped
Type: container
Profiles: default, mse-lan-profile
Snapshots:
  snap-2021-02-25 (taken at 2021/02/25 00:29 CET) (stateless)
  snap-2022-11-18 (taken at 2022/11/18 23:41 CET) (stateless)
  snap-2022-11-25 (taken at 2022/11/25 23:05 CET) (stateless)
  snap-2022-11-26 (taken at 2022/11/26 01:44 CET) (stateless)

Log:


admin@host:~$

I had a space issue after the last snapshot and upgrade, but I increased the LVM volume quota and performed the required resize2fs on the LVM volume. I fear that the storage pool might not have seen the size increase, but I have not yet found how to fix that either.

admin@host:~$ lxc storage info Meddd
info:
  description: ""
  driver: btrfs
  name: Meddd
  space used: 39.74GiB
  total space: 41.00GiB
used by:
  instances:
  - gooo
  - gooo
  profiles:
  - default

I added 50GB to the LVM volume, so I was expecting the total space to be close to 100GB.
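For reference, the LVM/ext side of that grow was roughly the following (a sketch with hypothetical VG/LV names, the real ones on my host differ):

sudo lvextend -L +50G /dev/vg0/lxd-disks   # grow the logical volume (hypothetical VG/LV names)
sudo resize2fs /dev/vg0/lxd-disks          # grow the ext filesystem that holds the .img files
df -h /var/snap/lxd/common/lxd/disks       # check the extra space is visible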

I suspect you'll need to resize the btrfs on top of the LVM; I don't think resize2fs will do that.

The problem is that it is a btrfs image file. There seems to be no link between the image file's size (real or theoretical) on the LVM ext filesystem partition and the btrfs device size seen once that file is mounted. I tried a fallocate on the img file, but it does not sync with the btrfs device size, so quite logically any attempt such as "btrfs filesystem resize max" fails.
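A quick way to see the different views of the size (pool image path as on my host, loopN being whatever losetup reports):

stat -c %s /var/snap/lxd/common/lxd/disks/Medd.img        # size of the backing file
sudo losetup -j /var/snap/lxd/common/lxd/disks/Medd.img   # which loop device backs it
sudo btrfs filesystem show                                # "devid ... size" is what btrfs thinks the device size is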

Surprisingly, most of the documentation on btrfs resizing only applies to a regular filesystem, not to one backed by an image file.

Still digging …

Specifically the last command:

sudo btrfs filesystem resize max <LXD_lib_dir>/storage-pools/<pool_name>/

Once you’ve grown the LVM and the image file.
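Growing the image file itself would be something along these lines (not tested on your exact setup, adjust the size):

sudo fallocate -l <new_size> <LXD_lib_dir>/disks/<pool_name>.img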

The LVM has been grown and the image file grown as well, but the resize is not working:

admin@host:~# sudo btrfs filesystem resize max /var/snap/lxd/common/lxd/storage-pools/Medd/
ERROR: not a btrfs filesystem: /var/snap/lxd/common/lxd/storage-pools/Medd/

Which is quite logical, as this path is not actually a mount point.

I also tried running
sudo losetup -c <loop_device>
with the same result.
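(Side note for anyone hitting the same wall: the snap keeps the pool mounted in its own mount namespace, so the host path is not a mount point. To see where the pool is actually mounted, something like the following should work; I'm assuming the snapd namespace file is at /run/snapd/ns/lxd.mnt:)

sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- findmnt -t btrfs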

Maybe (not tested and never done myself) you could create a new/second img file and use btrfs device add to enlarge your btrfs volume.

From my understanding this would mean split storage… I simply want to expand my storage capacity, and I find myself stuck on such a simple and basic thing.

The How to manage storage pools - LXD documentation page does not seem to help either.

Having a LUKS/LVM/EXTFS/IMG-file/Btrfs FS stack does not seem to help.
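For anyone with a similar stack, the layering can be inspected from the host with something like:

lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINT   # LUKS / LVM / ext layers
losetup -l                                  # the loop device backed by the pool .img file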

OK, after some long hours of deep diving, I got lxc back online…

First I had to resize the LVM volume and ensure the ext filesystem was in sync with the change.

At that point you then have to do the following:

Allocate the new maximum file size (here 60GB):
fallocate -l 60G /var/snap/lxd/common/lxd/disks/Medd.img

Find the loop device backing the image file:
losetup -j /var/snap/lxd/common/lxd/disks/Medd.img

For the loop device reported (say loop26 here), make it pick up the new size of the backing file:
sudo losetup -c /dev/loop26

Grow the btrfs filesystem to the new device size, using the pool's mount point as seen inside LXD's mount namespace (hence the mntns prefix in the path):
sudo btrfs filesystem resize max /var/snap/lxd/common/mntns/var/snap/lxd/common/lxd/storage-pools/Medd/

Finally, force the update at the LXC level:
lxc storage set Medd size=60GB
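To double-check, re-query the pool (the total space should now reflect the new size) and start the instance again:

lxc storage info Medd
lxc start gooo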

Here are a few things that would be nice to improve in lxc:

  • Log any storage capacity shortage in the lxc log
  • Give more details (a verbose mode) on why it is failing to start a VM
  • Explicitly confirm a capacity shortage when a start fails because of one
  • Provide a way to resize the underlying structure directly, all in one go; LXC knows where the .img file is and how it is configured. If such a resize is not possible, it should advise on the way forward
  • Update the page How to manage storage pools - LXD documentation with this procedure for actual pool resizing on an LVM/*fs/img file/loop/btrfs stack

Thanks @tomp & others for helping on this one.


Glad you got it working.

We have an item on our roadmap for this cycle (I think) to support resizing image-backed storage pools via the pool’s size setting.

As for improving error messages, I suspect that has already been done: you’re running a pretty old version of LXD (the 4.0 LTS series only gets security fixes now), and we’ve been using error wrapping a lot more over the last few years to include more information in error messages.

I would strongly recommend you upgrade to the 5.0 LTS series (5.0/stable snap channel).
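The channel switch itself is a single snap command, for example:

sudo snap refresh lxd --channel=5.0/stable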

See Managing the LXD snap