VM vanishes from lxc list after starting

lxd 5.11-ad0b61e 24483 latest/stable canonical✓ -

An images:ubuntu/jammy VM was initiated through the API, without the wait option.
The image is pulled:

lxc ls
| jam3    | STOPPED |      |      | VIRTUAL-MACHINE | 0
zfs list
zp3/pl2/virtual-machines/101_jam3                                                              6.97M  93.0M     6.97M  legacy
zp3/pl2/virtual-machines/101_jam3.block                                                         10M  6.04T      10M  -

After starting, the instance disappears from lxc list.
lxd log:

time="2023-03-20T16:43:56+01:00" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used zp3/pl2/virtual-machines/101_jam3.block: exit status 1 (cannot open 'zp3/pl2/virtual-machines/101_jam3.block': dataset does not exist)" instance=jam3 instanceType=virtual-machine project=101

time="2023-03-20T16:48:55+01:00" level=error msg="Failed to advertise vsock address" err="Failed sending VM sock address to lxd-agent: Failed to fetch https://custom.socket/1.0: 401 Unauthorized" instance=jam3 instanceType=virtual-machine project=101

time="2023-03-20T16:53:46+01:00" level=warning msg="Failed getting instance metrics" err="dial unix /var/snap/lxd/common/lxd/logs/101_jam3/qemu.monitor: connect: no such file or directory" instance=jam3 project=101

It seems LXD runs the zfs get query too early, because running the same query manually a while later gives the correct final size of the block device:

zfs get -H -p -o value used zp3/pl2/virtual-machines/101_jam3.block
427563008

But by that moment the instance has already vanished from lxc list.
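To illustrate the race on the client side (this is not LXD's own logic), one could poll until the dataset exists before trusting the usage query. `wait_for_dataset` below is a hypothetical helper, with the dataset name taken from the transcript above:

```shell
# Hypothetical helper, not part of LXD: poll until a ZFS dataset exists
# before querying its usage, instead of querying right after creation.
wait_for_dataset() {
  local dataset="$1" tries="${2:-30}"
  local i
  for i in $(seq "$tries"); do
    # `zfs list` exits non-zero while the dataset does not exist yet.
    if zfs list -H -o name "$dataset" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Usage:
# wait_for_dataset zp3/pl2/virtual-machines/101_jam3.block \
#   && zfs get -H -p -o value used zp3/pl2/virtual-machines/101_jam3.block
```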

By the way, those ZFS datasets persist as ghosts and can't be removed:

sudo zfs destroy zp3/pl2/virtual-machines/101_jam3.block
cannot destroy 'zp3/pl2/virtual-machines/101_jam3.block': dataset is busy

sudo zfs destroy zp3/pl2/virtual-machines/101_jam2
cannot destroy 'zp3/pl2/virtual-machines/101_jam2': dataset is busy

I have accumulated a lot of those ghost datasets, waiting for the server to be powered off so they can be tossed away.

Sounds like it could be related to LXC ls not showing proper names - #24 by tomp

Can you try the latest/edge channel and see if you can provide reproducer steps?

It has partially worked. The instance hasn't disappeared from the list, even after a fast start.
lxd git-fb5257c 24636 latest/edge canonical✓ -

Steps:

  1. Deleted the jammy image
  2. Initiated images:ubuntu/jammy --vm (API, no wait)
lxc list
| jam4    | STOPPED |      |      | VIRTUAL-MACHINE
lxc image list
|       | 8eb253c8ed68 | no     | Ubuntu jammy amd64 (20230320_07:43) | x86_64       | VIRTUAL-MACHINE | 264.26MB | Mar 20, 2023 at 8:57pm (UTC)

Delayed start (after 20 s): only one set of errors (below), then things returned to normal.

time="2023-03-20T21:53:53+01:00" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used zp3/pl2/virtual-machines/101_jam4.block: exit status 1 (cannot open 'zp3/pl2/virtual-machines/101_jam4.block': dataset does not exist)" instance=jam4 instanceType=virtual-machine project=101

time="2023-03-20T21:54:43+01:00" level=error msg="Failed to advertise vsock address" err="Failed sending VM sock address to lxd-agent: Failed to fetch https://custom.socket/1.0: 401 Unauthorized" instance=jam4 instanceType=virtual-machine project=101

Fast start (5 s after init):

level=error msg="Failed to advertise vsock address" err="Failed sending VM sock address to lxd-agent: Failed to fetch https://custom.socket/1.0: 401 Unauthorized" instance=jam4 instanceType=virtual-machine project=101

level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used zp3/pl2/virtual-machines/101_jam4.block: exit status 1 (cannot open 'zp3/pl2/virtual-machines/101_jam4.block': dataset does not exist)" instance=jam4 instanceType=virtual-machine project=101

The API state change to "running" timed out, and the errors above kept logging in a loop.
After manually starting the instance again, it went back to normal.

It would also be much appreciated if you could hint at a solution for that. Any idea how to determine the process that is blocking the dataset?
Right now only a server reboot helps, but an abrupt reboot would harm some other running instances.

LXD 5.12 includes a very recent packaging change that should hopefully resolve the long-standing issue with ZFS losing references to its mount namespace, which may help here.
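On the "dataset is busy" question: for filesystem datasets, a stale mount held in another process's mount namespace is the usual culprit; for the `.block` zvols it would be an open handle on the `/dev/zd*` device instead (checkable with lsof). A hedged sketch for the filesystem case, assuming a standard Linux /proc layout; `holders_of_dataset` is a hypothetical helper:

```shell
# Hedged sketch: list processes whose mount namespace still holds a mount
# of a given ZFS filesystem dataset, which keeps `zfs destroy` failing
# with "dataset is busy". Each /proc/<pid>/mounts reflects that process's
# own mount namespace, so grepping them finds the hidden holders.
holders_of_dataset() {
  local dataset="$1"
  grep -l "^$dataset " /proc/*/mounts 2>/dev/null \
    | cut -d/ -f3 \
    | while read -r pid; do
        # Symlinks like /proc/self are filtered out by ps failing quietly.
        ps -o pid=,comm= -p "$pid" 2>/dev/null
      done
}

# Usage (dataset name from the transcript above):
# holders_of_dataset zp3/pl2/virtual-machines/101_jam2
```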

Please can you explain further what you are doing, as I am really not following, I'm afraid.
If you are making direct API calls, can you show the precise API calls you are making?
What is a “fast start”?

You are right, that wasn't very clear.

POST /1.0/instances?project=101

{
  "name": "jam4",
  "type": "virtual-machine",
  "profiles": ["vm2"],
  "source": {
    "type": "image",
    "protocol": "simplestreams",
    "alias": "ubuntu/jammy/default",
    "server": "https://images.linuxcontainers.org"
  }
}
and get:

[reason:protected] => OK [statusCode:protected] => 200
{"type":"virtual-machine","profiles":["vm2"],"source":{"type":"image","protocol":"simplestreams","alias":"ubuntu/jammy/default","server":"https://images.linuxcontainers.org"}},"create":{"id":"e90590f6-cf45-45ae-bf29-b35df4948452","class":"task","description":"Creating instance","created_at":"2023-03-21T18:50:08.146026554+01:00","updated_at":"2023-03-21T18:50:08.146026554+01:00","status":"Running","status_code":103,"resources":{"instances":["/1.0/instances/jam4"]},"metadata":null,"may_cancel":false,"err":"","location":"none"}}

lxc list:

jam4 | STOPPED | | | VIRTUAL-MACHINE | 0 | vm2 | 101

Then start it:

PUT /1.0/instances/jam4/state?project=101

{
  "action": "start",
  "force": false,
  "stateful": false,
  "timeout": 60
}

If the start comes immediately after creation, it hangs and produces the error logs.
If I wait a while (>1 min) and then start, it works.
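The race can be avoided on the client side by blocking on the create operation before issuing the start, using the operation's `/wait` endpoint instead of a fixed delay. A sketch assuming the snap's unix socket path and that the create response's `operation` field holds a path like `/1.0/operations/<uuid>`; `lxd_api`, `extract_operation`, and `create_then_start` are hypothetical helpers:

```shell
# Hypothetical helpers: wait for the create operation to finish before
# starting the instance, instead of starting right after POST returns.
LXD_SOCKET="/var/snap/lxd/common/lxd/unix.socket"

lxd_api() {  # usage: lxd_api METHOD PATH [JSON-BODY]
  curl -s --unix-socket "$LXD_SOCKET" -X "$1" ${3:+-d "$3"} "http://lxd$2"
}

# Pull the "operation" path (e.g. /1.0/operations/<uuid>) out of a response.
extract_operation() {
  sed -n 's/.*"operation":"\([^"]*\)".*/\1/p'
}

create_then_start() {  # usage: create_then_start jam4 101
  local name="$1" project="$2" op
  op=$(lxd_api POST "/1.0/instances?project=$project" \
        '{"name":"'"$name"'","type":"virtual-machine","profiles":["vm2"],"source":{"type":"image","protocol":"simplestreams","alias":"ubuntu/jammy/default","server":"https://images.linuxcontainers.org"}}' \
       | extract_operation)
  # Block until the image unpack / create task completes (or times out).
  lxd_api GET "$op/wait?timeout=120&project=$project" >/dev/null
  lxd_api PUT "/1.0/instances/$name/state?project=$project" \
    '{"action":"start","force":false,"stateful":false,"timeout":60}'
}
```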