After lxd upgrade to 3.x: "Error: Failed container creation: No root device could be found."

lxd

(Brian Candler) #1

A few days ago I updated some Ubuntu 16.04 (with ZFS) boxes to lxd 3.0.1 from backports. The upgrade seemed to go OK: I saw it made a bunch of changes and got rid of my old /etc/default/lxd-bridge. Existing containers are still running. But now I find that I can’t create new containers:

root@nuc1:~# lxc launch ubuntu:16.04 snf-image
Creating snf-image
Error: Failed container creation: No root device could be found.

I’ve tried looking around, and don’t see anything obviously wrong:

root@nuc1:~# lxc image list
+-------+--------------+--------+---------------------------------------------+--------+----------+-----------------------------+
| ALIAS | FINGERPRINT  | PUBLIC |                 DESCRIPTION                 |  ARCH  |   SIZE   |         UPLOAD DATE         |
+-------+--------------+--------+---------------------------------------------+--------+----------+-----------------------------+
|       | f2228450779f | no     | ubuntu 16.04 LTS amd64 (release) (20180703) | x86_64 | 157.56MB | Jul 6, 2018 at 1:07pm (UTC) |
+-------+--------------+--------+---------------------------------------------+--------+----------+-----------------------------+

root@nuc1:~# lxc storage list
+---------+-------------+--------+---------+---------+
|  NAME   | DESCRIPTION | DRIVER | SOURCE  | USED BY |
+---------+-------------+--------+---------+---------+
| default |             | zfs    | zfs/lxd | 9       |
+---------+-------------+--------+---------+---------+

root@nuc1:~# lxc storage show default
config:
  source: zfs/lxd
  zfs.pool_name: zfs/lxd
description: ""
name: default
driver: zfs
used_by:
- /1.0/containers/xxxx   << snipped >>
...
status: Created
locations:
- none

root@nuc1:~# zfs list zfs/liblxd
NAME         USED  AVAIL  REFER  MOUNTPOINT
zfs/liblxd   159M  39.2G   159M  /var/lib/lxd
root@nuc1:~# zfs list -r zfs/lxd
NAME                                                                                      USED  AVAIL  REFER  MOUNTPOINT
zfs/lxd                                                                                  16.5G  39.2G    96K  none
zfs/lxd/containers                                                                       15.8G  39.2G    96K  none
zfs/lxd/containers/builder                                                               15.3M
... goes on to list other containers, then images

I also saw that newer lxd/lxd-client packages have been released since then, so I’ve just upgraded to those too, but with the same result. “systemctl restart lxd” didn’t make a difference either.

Looking in /var/log/lxd/lxd.log:

lvl=warn msg="Failed to update instance types: Get https://images.linuxcontainers.org/meta/instance-types/.yaml: lookup images.linuxcontainers.org on 10.12.255.1:53: read udp 10.12.255.11:43020->10.12.255.1:53: i/o timeout" t=2018-07-06T14:25:56+0100
ephemeral=false lvl=info msg="Creating container" name=snf-image t=2018-07-06T14:25:58+0100
created=2018-07-06T14:25:58+0100 ephemeral=false lvl=info msg="Deleting container" name=snf-image t=2018-07-06T14:25:58+0100 used=1970-01-01T01:00:00+0100
created=2018-07-06T14:25:58+0100 ephemeral=false lvl=info msg="Deleted container" name=snf-image t=2018-07-06T14:25:58+0100 used=1970-01-01T01:00:00+0100
ephemeral=false lvl=eror msg="Failed creating container" name=snf-image t=2018-07-06T14:25:58+0100

I think the error about failing to resolve against 10.12.255.1 (which happens when I do systemctl restart lxd) is a different problem to investigate separately: I can definitely resolve using dig @10.12.255.1 images.linuxcontainers.org. Furthermore, if I run tcpdump while restarting lxd, I can see the queries being answered.

14:29:27.844268 IP (tos 0x0, ttl 64, id 3133, offset 0, flags [DF], proto UDP (17), length 72)
    10.12.255.11.57492 > 10.12.255.1.53: 39723+ AAAA? images.linuxcontainers.org. (44)
14:29:27.844812 IP (tos 0x0, ttl 64, id 3134, offset 0, flags [DF], proto UDP (17), length 72)
    10.12.255.11.55622 > 10.12.255.1.53: 1940+ A? images.linuxcontainers.org. (44)
14:29:27.846230 IP (tos 0x0, ttl 64, id 24796, offset 0, flags [none], proto UDP (17), length 603)
    10.12.255.1.53 > 10.12.255.11.57492: 39723 3/13/15 images.linuxcontainers.org. CNAME canonical.images.linuxcontainers.org., canonical.images.linuxcontainers.org. AAAA 2001:67c:1562::41, canonical.images.linuxcontainers.org. AAAA 2001:67c:1560:8001::21 (575)
14:29:27.847933 IP (tos 0x0, ttl 64, id 24797, offset 0, flags [none], proto UDP (17), length 547)
    10.12.255.1.53 > 10.12.255.11.55622: 1940 3/13/13 images.linuxcontainers.org. CNAME canonical.images.linuxcontainers.org., canonical.images.linuxcontainers.org. A 91.189.91.21, canonical.images.linuxcontainers.org. A 91.189.88.37 (519)

But the primary problem is not being able to create containers, with “No root device could be found.”

Any suggestions for where else I can look?

Thanks … Brian.


apt history from initial upgrade:

Start-Date: 2018-07-01  22:20:03
Commandline: apt-get install -t xenial-backports lxd lxd-client python-pylxd
Install: xdelta3:amd64 (3.0.8-dfsg-1ubuntu2, automatic), liblxc-common:amd64 (3.0.1-0ubuntu1~16.04.1, automatic)
Upgrade: lxd:amd64 (2.0.11-0ubuntu1~16.04.4, 3.0.1-0ubuntu1~16.04.1), liblxc1:amd64 (2.0.8-0ubuntu1~16.04.2, 3.0.1-0ubuntu1~16.04.1), lxd-client:amd64 (2.0.11-0ubuntu1~16.04.4, 3.0.1-0ubuntu1~16.04.1), lxcfs:amd64 (2.0.8-0ubuntu1~16.04.2, 3.0.1-0ubuntu2~16.04.1)
Remove: lxc-common:amd64 (2.0.8-0ubuntu1~16.04.2)
End-Date: 2018-07-01  22:21:50

And subsequent upgrade:

Start-Date: 2018-07-06  14:15:36
Commandline: apt-get dist-upgrade
Upgrade: ... lxd:amd64 (3.0.1-0ubuntu1~16.04.1, 3.0.1-0ubuntu1~16.04.2), ... lxd-client:amd64 (3.0.1-0ubuntu1~16.04.1, 3.0.1-0ubuntu1~16.04.2), ...
End-Date: 2018-07-06  14:15:54

(Brian Candler) #2

Sorry for the quick follow-up, but I think I’ve fixed it.

After finding and reading this post about changes to storage in lxd 2.15, I found that none of my profiles had a ‘root’ device: e.g.

root@nuc1:~# lxc profile show default
config:
  environment.http_proxy: ""
  user.network_mode: ""
description: Default LXD profile
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
name: default
used_by:
- <list of containers>

I edited it to add

devices:
  ...
  root:
    path: /
    pool: default
    type: disk

and now it appears to be happy. If I was supposed to have read somewhere that I needed to do that, unfortunately I missed it.
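(For anyone else hitting this: the same root device can also be added without opening an editor, via the standard `lxc profile device add` subcommand. This is a sketch assuming, as in my setup, that both the profile and the storage pool are named `default` — adjust the names to match `lxc profile list` and `lxc storage list` on your host.)

```shell
# Add a root disk device to the "default" profile,
# backed by the storage pool named "default"
lxc profile device add default root disk path=/ pool=default

# Confirm the device now appears under "devices:"
lxc profile show default
```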


(Brian Candler) #3

Update: the other problem, of resolving images.linuxcontainers.org, is reproducible:

root@nuc1:~# lxc launch images:debian/jessie/amd64 snf-image-jessie
Creating snf-image-jessie
Error: Failed container creation: Get https://images.linuxcontainers.org/streams/v1/index.json: lookup images.linuxcontainers.org on 10.12.255.1:53: read udp 10.12.255.11:46962->10.12.255.1:53: i/o timeout

However, I believe it’s something to do with DNSSEC, because if I change my resolver to 8.8.8.8 instead of the Mikrotik router, it works.
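(One way to probe the DNSSEC theory, sketched here with the addresses from the output above — 10.12.255.1 being the Mikrotik resolver — is to repeat the lookup with the DNSSEC/DO bit set, since some resolvers mishandle EDNS0 queries even when plain queries succeed:)

```shell
# Query the Mikrotik resolver asking for DNSSEC records (sets the DO bit);
# a resolver that mishandles EDNS0/DNSSEC may time out or return FORMERR here
dig +dnssec images.linuxcontainers.org @10.12.255.1

# Compare against a known-good validating public resolver
dig +dnssec images.linuxcontainers.org @8.8.8.8
```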

If so, I don’t know where in the stack this validation is being done. The Linux built-in resolver library doesn’t care, since it works when I’m using 10.12.255.1 as my cache:

root@nuc1:~# ping images.linuxcontainers.org
PING canonical.images.linuxcontainers.org (91.189.88.37) 56(84) bytes of data.
64 bytes from naiad.canonical.com (91.189.88.37): icmp_seq=1 ttl=54 time=10.3 ms
64 bytes from naiad.canonical.com (91.189.88.37): icmp_seq=2 ttl=54 time=10.3 ms
^C
--- canonical.images.linuxcontainers.org ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 10.353/10.357/10.361/0.004 ms

So I wonder if lxd itself is doing extra validation, or using a different resolver library (perhaps Go’s built-in DNS resolver, bypassing the system one?).
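(If the Go resolver is the culprit, that is testable: the Go runtime honours the `GODEBUG=netdns` environment variable, which can force the cgo/glibc resolver instead of the pure-Go one. A sketch, assuming the daemon runs from a systemd unit named `lxd.service` as it does on these Xenial boxes:)

```shell
# Create a systemd drop-in forcing the Go runtime inside lxd to use
# the cgo (glibc) resolver rather than Go's built-in DNS resolver
mkdir -p /etc/systemd/system/lxd.service.d
cat > /etc/systemd/system/lxd.service.d/godebug.conf <<'EOF'
[Service]
Environment=GODEBUG=netdns=cgo
EOF

systemctl daemon-reload
systemctl restart lxd
```

If lookups against 10.12.255.1 start working after this, the pure-Go resolver (and its handling of the responses) is the likely culprit; `GODEBUG=netdns=cgo+1` additionally logs which resolver is chosen.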