Mysterious fail loop when installing Docker inside LXD

rask · April 16, 2020, 9:47pm

My first time testing out LXD, and I decided to take the “simplest” of tasks and get a Kubernetes node running inside it.

I’m having trouble installing Docker properly. The strangest thing is that I managed to install it well a few days back, but now with a separate setup I get all the way up to apt-get install which runs through as such:

$ lxc exec mylxd bash
root@mylxd:~$ apt-get install docker-ce
Reading package lists... Done
... snip ...
The following NEW packages will be installed:
  aufs-tools cgroupfs-mount containerd.io docker-ce docker-ce-cli libltdl7 pigz
0 upgraded, 7 newly installed, 0 to remove and 4 not upgraded.
Need to get 85.8 MB of archives.
After this operation, 385 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 https://download.docker.com/linux/ubuntu bionic/stable amd64 containerd.io amd64 1.2.13-1 [20.1 MB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 pigz amd64 2.4-1 [57.4 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 aufs-tools amd64 
... snip ...
Setting up docker-ce-cli (5:19.03.8~3-0~ubuntu-bionic) ...
Setting up pigz (2.4-1) ...
Setting up docker-ce (5:19.03.8~3-0~ubuntu-bionic) ...
Created symlink /etc/systemd/system/multi-user.target.wants/docker.service → /lib/systemd/system/docker.service.
Created symlink /etc/systemd/system/sockets.target.wants/docker.socket → /lib/systemd/system/docker.socket.

And that’s it. After that last symlink it just freezes, my CPU usage keeps jumping up and down, and my hard drive fills up at 250-500 MB/sec until I run out of space. The data is being written to the container storage pool, but I can’t fathom where in there it ends up or why. My PC gets quite sluggish when this problem appears, so I can’t do any proper debugging or monitoring.

The only thing that works after this is lxc stop --force <container>. lxc delete <container> frees up all the disk space the error condition stole from me.

I’m using the 4.* snap version of LXD. Here’s how I launch my container:

$ lxc launch ubuntu:18.04 mylxd -s mypool
$ lxc config set mylxd limits.cpu 4
$ lxc config set mylxd limits.memory 3GB
$ lxc config set mylxd security.privileged true
$ lxc config set mylxd security.nesting true
$ lxc config set mylxd linux.kernel_modules 'xt_conntrack,ip_tables,ip6_tables,netlink_diag,nf_nat,overlay'
$ lxc config set mylxd raw.lxc 'lxc.apparmor.profile=unconfined\nlxc.cap.drop= \nlxc.cgroup.devices.allow=a\nlxc.mount.auto=proc:rw sys:rw'

The mypool is a basic dir storage pool. Once the container is running, I exec into it and attempt to install docker.

Need help with the following:

If someone has a hunch about what is going wrong, it would be neat to know!
Where can I start looking for information on what is filling my disk when this occurs? Some log getting flooded perhaps?
Where can I find proper logs that might show what is happening, instead of just seeing a frozen terminal? The logs at /var/snap/lxd/common/... are quite empty.
I need pointers on how to help you help me; I’m not sure what I could provide to make this easier to debug and solve.
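On the question of what is filling the disk: one low-tech approach is to rank the entries under the container's root filesystem by size while the problem is happening (or right after `lxc stop --force`, before deleting the container). The following is a minimal sketch, not an LXD tool. The function name `largest_children` is made up for illustration, and the path in the usage note is an assumption about where the snap keeps a dir pool.

```python
import os
from pathlib import Path


def largest_children(root, top=10):
    """Return the `top` largest immediate children of `root` as (bytes, path) pairs.

    Sizes are apparent file sizes (st_size) summed recursively, so sparse
    files may look larger than the disk space they actually occupy.
    """
    results = []
    for child in Path(root).iterdir():
        if child.is_symlink():
            continue  # do not follow links out of the tree
        total = 0
        if child.is_file():
            total = child.stat().st_size
        elif child.is_dir():
            # os.walk silently skips directories it cannot read
            for dirpath, _dirnames, filenames in os.walk(child):
                for name in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
        results.append((total, str(child)))
    results.sort(reverse=True)
    return results[:top]
```

Run as root on the host, something like `largest_children("/var/snap/lxd/common/lxd/storage-pools/mypool/containers/mylxd/rootfs")` would be the starting point, assuming that is where the pool lives on this machine. Calling it again on the biggest entry narrows things down one level at a time.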

Some googling showed that sometimes systemd-resolved goes into a loop in certain situations, could that relate to this as Docker presumably does something to the network configs?

rask · April 17, 2020, 8:29am

Okay, I did the sensible thing and went through the steps again manually, so I guess the error happens when I use Python’s subprocess tooling to provision the container. Here are the steps I took this time, after which Docker installed just fine:

$ lxc storage create testpool1 dir
$ lxc launch ubuntu:18.04 dockertest -s testpool1
> Creating dockertest
> Starting dockertest
$ lxc config edit dockertest

In config I added:

config:
  ...
  limits.cpu: "4"
  limits.memory: 3GB
  linux.kernel_modules: xt_conntrack,ip_tables,ip6_tables,netlink_diag,nf_nat,overlay
  raw.lxc: "lxc.apparmor.profile=unconfined\nlxc.cap.drop= \nlxc.cgroup.devices.allow=a\nlxc.mount.auto=proc:rw
    sys:rw"
  security.nesting: "true"
  security.privileged: "true"
  ...

Then

$ lxc restart dockertest
$ lxc exec dockertest bash
@ curl -s https://download.docker.com/linux/debian/gpg | apt-key add -
@ apt-get update
@ apt-get install apt-transport-https ca-certificates
@ add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
@ apt-get update
@ apt-cache policy docker-ce
@ apt-get install docker-ce

So I will now trace what exactly is going on in the Python provisioned container, as the steps are essentially the same, with the addition of doing some Kubernetes setups in between.
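The provisioning script itself is not shown in this thread, so the following is only a sketch of how such steps could be driven from Python in a way that is easier to trace. The helper names (`lxc_config_commands`, `lxc_exec_command`, `run_step`) and the timeout value are hypothetical; the settings are the ones listed above. Each step is echoed before it runs, and a step that hangs raises `subprocess.TimeoutExpired` instead of freezing the script silently.

```python
import subprocess

# Settings copied from the posts above. Note that "\n" in a Python string is
# a real newline, as it is in the double-quoted YAML shown earlier, whereas a
# single-quoted shell argument passes a literal backslash followed by "n".
SETTINGS = {
    "limits.cpu": "4",
    "limits.memory": "3GB",
    "security.privileged": "true",
    "security.nesting": "true",
    "raw.lxc": "\n".join([
        "lxc.apparmor.profile=unconfined",
        "lxc.cap.drop= ",
        "lxc.cgroup.devices.allow=a",
        "lxc.mount.auto=proc:rw sys:rw",
    ]),
}


def lxc_config_commands(name, settings):
    """Build one `lxc config set` argv list per setting (no shell involved)."""
    return [["lxc", "config", "set", name, key, value]
            for key, value in settings.items()]


def lxc_exec_command(name, argv):
    """Wrap a command so that it runs inside the container via `lxc exec`."""
    return ["lxc", "exec", name, "--", *argv]


def run_step(argv, timeout=300):
    """Echo the step, run it, and raise on a non-zero exit or a timeout."""
    print("+", " ".join(argv), flush=True)
    return subprocess.run(argv, check=True, timeout=timeout)


# Example use (not executed here):
# for cmd in lxc_config_commands("mylxd", SETTINGS):
#     run_step(cmd)
# run_step(lxc_exec_command("mylxd", ["apt-get", "install", "-y", "docker-ce"]))
```

Whether the quoting difference mentioned in the comment plays any role in the original failure is not established here; it is simply one place where a scripted run and a hand-typed run can diverge without it being obvious.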

EDIT:

Okay, it seems that a full reboot of my PC made something work better, and now my original setup works as well. Strange. Could it be something related to continuously starting, stopping, deleting, and restarting a container with the same name, storage, and bridge even though the container is a fresh one every time?

simos · April 17, 2020, 9:25am

A quick note: check whether shiftfs is enabled. It makes launching containers much faster, but may break Docker. See the thread “Trying out shiftfs”.

I did not see any reference to the host OS. Shall I assume it is Ubuntu 18.04 LTS?

To run Docker in a container, you would need security.nesting: true. It should not be necessary to also enable security.privileged or disable AppArmor.

There is an issue with Docker not getting the FS it needs, depending on the storage driver (dir, ZFS, btrfs). I recollect that the best option for Docker is btrfs, as the container can then use something better than overlayfs. dir is the slowest of all over the lifecycle of a container; the other two support copy-on-write, which I think is necessary for Kubernetes.

rask · April 17, 2020, 10:45am

lxc info says shiftfs: "false" which presumably means it is off. And yes, host is 18.04.

Will try running unprivileged, though I’m not sure whether Kubernetes requires a privileged container.

I’m using dir pools as this is a development environment thingy, where developers can try using a local Kubernetes to run software (a bit like Minikube), and the other filesystems seem to require dedicated partitions, which is a little much for someone who just knows how to write and compile code. Please do say if I could use btrfs without major alterations to the host’s disk setup! In all the guides that I’ve seen about running k8s inside LXD, people are using dir.

I am currently running into a problem where I’ve put the container’s root FS on a dir pool (the -s flag for lxc launch), but inside the container Docker somehow sees it as a vfs FS, which is not supported.
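To confirm which storage driver Docker actually picked inside the container, `docker info` can be asked for just that field. A small sketch, assuming Docker is already installed and running in the container; the function names are made up for illustration.

```python
import subprocess


def docker_driver_command(container):
    """argv that prints Docker's storage driver from inside an LXD container."""
    return ["lxc", "exec", container, "--",
            "docker", "info", "--format", "{{.Driver}}"]


def docker_storage_driver(container):
    """Run the command above and return the driver name reported by Docker."""
    result = subprocess.run(docker_driver_command(container), check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()
```

Calling `docker_storage_driver("mylxd")` on the host returns whatever name Docker reports, which makes it easy to compare a dir-backed container against one on another pool type.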

Thanks!

EDIT:

I read up a bit on btrfs and learned that you can in fact use a loopback to fake a partition: https://www.excamera.com/sphinx/article-btrfs.html

My question is whether LXD does this automatically when creating storage, or whether I should create this loopback myself first, and then somehow define it when creating a new btrfs storage pool in LXD? The documentation on this is a bit vague.

EDIT2:

Okay, it seems new storage pools default to being loop-backed; see “Noob question: How to create additional storage? (loop file based, dir, etc.)”.

Will try running with a btrfs pool and see what happens.
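For reference, creating a loop-backed btrfs pool and launching onto it needs no manual loopback setup. The sketch below only builds the two commands, in the same style as a subprocess-based provisioner; the pool name `dockerpool`, the container name `dockertest`, and the optional size are placeholders, not values from this thread.

```python
def storage_create_command(pool, driver="btrfs", size=None):
    """argv for `lxc storage create`; with no source given, LXD backs the
    pool with a loop file. `size` is optional, e.g. "20GB"."""
    argv = ["lxc", "storage", "create", pool, driver]
    if size is not None:
        argv.append(f"size={size}")
    return argv


def launch_command(image, name, pool):
    """argv for launching a container with its root disk on `pool`."""
    return ["lxc", "launch", image, name, "-s", pool]


# Example (placeholder names):
# storage_create_command("dockerpool", size="20GB")
# launch_command("ubuntu:18.04", "dockertest", "dockerpool")
```

Each list can be passed straight to `subprocess.run(..., check=True)`.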

EDIT3:

Okay btrfs works well for LXD+Docker, but it is not officially supported by Kubernetes, so will have to see how this works.