Container creation on directory storage using BeeGFS stuck

Ubuntu 18.04 - Linux procyon 5.4.0-81-generic #91~18.04.1-Ubuntu SMP Fri Jul 23 13:36:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Running the latest BeeGFS: 7.1.3.

Latest is 7.3.1, but I’m not sure if it can run on Ubuntu 18.04.

I have a newer setup and can't duplicate your problem because of this, so I'll have to wait and see whether that can be worked around, or maybe others will know why you're seeing this problem.

Sorry, it's early and I haven't had my coffee yet; it is 7.1.3. BeeGFS itself is working fine, and I can read and write files to it from the host.

Turns out I had to remove open-vm-tools, because that seems to have been "locking" a kernel module.

After that I was able to deploy a VM w/o problems: Ubuntu 20.04 with BeeGFS 7.3.0 (I left additional notes about the process in that other post, but there's nothing special about it; it just worked).

(screenshot attachment: Selection_650)

Yeah, VMs work, but sadly those don't support proxy devices, which I heavily use. I don't want to give them all a dedicated IPv4, just IPv6.

A bit off-topic, but what kind of settings (stripe/chunk size) are you using for the VMs?

They do in nat=true mode if that helps.
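For reference, a NAT-mode proxy device looks roughly like this; the instance name, device name, and addresses here are illustrative, and `nat=true` requires the instance NIC to have a static IP (assuming `eth0` is inherited from the default profile):

```shell
# Pin a static IPv4 on the bridge NIC (required for nat=true)
lxc config device override c1 eth0 ipv4.address=192.168.1.10

# Forward a host port to the container via NAT rules instead of a
# userspace proxy process
lxc config device add c1 web proxy nat=true \
    listen=tcp:192.0.2.10:8080 connect=tcp:0.0.0.0:80
```

In `nat=true` mode the forwarding is done with firewall rules rather than a forwarding process inside the VM, which is why it works for virtual machines too.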

Interesting question - I’m not, to be honest. I’ve been using BeeGFS for other stuff (mostly analytics-like workloads), but LXD was on my “to experiment with” list, so when I saw your post I thought that was an excuse to do it right away…

As you can see from my screenshot (and I'm sure your own environment), this looks (relatively) promising: one big file and a few small, static files. That's good.

$ sudo dir -lat /mnt/beegfs/lxd-pool/virtual-machines/natural-locust
total 1679212
-rw-r--r-- 1 root root 10737418240 Sep  1 15:39 root.img
-rw------- 1 lxd  root      131072 Sep  1 15:18 qemu.nvram
d--x------ 4 root root          10 Sep  1 15:18 .
-rw-r--r-- 1 root root         692 Sep  1 15:18 agent-client.crt
-rw------- 1 root root         288 Sep  1 15:18 agent-client.key
-rw-r--r-- 1 root root         721 Sep  1 15:18 agent.crt
-rw------- 1 root root         288 Sep  1 15:18 agent.key
-r-------- 1 root root        2105 Sep  1 15:18 backup.yaml
dr-x------ 6 lxd  root          10 Sep  1 15:18 config
drwx--x--x 3 root root           1 Sep  1 15:18 ..
-rw-r--r-- 1 root root         295 Aug 10 07:29 metadata.yaml
drwxr-xr-x 2 root root           1 Aug 10 07:29 templates

But inside the VM, IO requests are small, so I probably wouldn't run important OLTP DBs in these VMs (many small, random writes and reads).

I think anything to do with large files and sequential IO should be fine, as well as VMs that aren't too busy in terms of small IO requests and don't get modified a lot (DHCP, DNS, app servers, web servers, and such).

So I’d likely use small stripes for VM workloads with small/random IO.
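For what it's worth, BeeGFS striping is set per directory with beegfs-ctl; a sketch of what a small-chunk setup for a VM directory could look like (the chunk size and target count here are illustrative, not a recommendation from this thread):

```shell
# Inspect the current stripe pattern of the directory
beegfs-ctl --getentryinfo /mnt/beegfs/lxd-pool/virtual-machines

# Use a smaller chunk size and stripe across 4 targets; this only
# affects files created under the directory from now on (existing
# files keep their old pattern)
beegfs-ctl --setpattern --chunksize=128k --numtargets=4 \
    /mnt/beegfs/lxd-pool/virtual-machines
```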

I have a very small environment in my lab, so I may not be able to obtain production-worthy insights from it, but I'll keep a few VMs on it and try to observe. (I also have Nomad in this environment, and that's another to-do for when I find time: explore whether Nomad and LXD can be used together.)


I did not know this, I will experiment with this :slight_smile:


Back on topic: the container part is still not working, so if anybody has any ideas, please let me know.

Maybe submit more details?

My NAT config (on for IPv4):

$ lxc network show lxdbr0
config:
  bridge.mtu: "1450"
  ipv4.address: 192.168.1.1/24
  ipv4.dhcp.ranges: 192.168.1.2-192.168.1.5
  ipv4.nat: "true"
  ipv6.address: none

I've found an interesting problem: when lxc commands are executed from a BeeGFS path, the command hangs for a long while and then exits with this message:

$ cd /mnt/beegfs
$ sudo lxc start natural-locust
cannot stat path of the current working directory: Communication error on send

The same command executes as usual from $HOME or other paths, so it’s easy to work around that behavior.

Edit: I don't know if this will help, but I measured the time-to-timeout: 9m13s.

I am issuing the commands from my home folder; I waited 30 minutes, but the timeout never happened for me. Could you try starting a container instead of a VM and let me know if that works on your stack?

Creating a container hangs in this state for a while (it hasn’t timed out yet, but it probably will after some time):

$ sudo lxc launch ubuntu:22.04 tainer --storage beegfs
Creating tainer
Retrieving image: Unpack: 100% (949.31MB/s) 

I assume that in your case creating a container outside of BeeGFS does work, even on the same LXD host?
I’ll try that after the command above times out.
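For anyone following along, a dir-backed pool on a BeeGFS mount would be created along these lines (the pool name matches the thread; the source path is illustrative):

```shell
# Create a directory-backed storage pool on the BeeGFS mount
lxc storage create beegfs dir source=/mnt/beegfs/lxd-pool

# Then launch an instance against it
lxc launch ubuntu:22.04 tainer --storage beegfs
```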

Awesome, this at least confirms it's not an issue specific to my setup. I wonder if it will ever time out.

Seems to be waiting for something that doesn't exist or is busy:

futex(0x11f3778, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x11f3698, FUTEX_WAKE_PRIVATE, 1) = 1
read(3, "\211\tkeepalive", 4096)        = 11
futex(0xc000050948, FUTEX_WAKE_PRIVATE, 1) = 1
write(6, "\0", 1)                       = 1
write(3, "\212\211\345A\300\23\216$\245c\204-\251e\200", 15) = 15
read(3, 0xc000187000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 140723108802120) = 0
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc000084548, FUTEX_WAKE_PRIVATE, 1) = 1
sched_yield()                           = 0
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=71658, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 0
futex(0x11f3698, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x11f3768, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
getpid()                                = 71658
tgkill(71658, 71736, SIGURG)            = 0
futex(0x11f3698, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x11f3790, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x11f3698, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL

Tried on a local (ext4) pool; the container can be created and used.

Tried again on BeeGFS, got this log (and then the log stops and nothing happens with the create command until interrupted with (CTRL+C)x3).

DEBUG  [2022-09-02T10:40:22Z] GetInstanceUsage started                      instance=natural-locust project=default
DEBUG  [2022-09-02T10:41:03Z] Handling API request                          ip=@ method=GET protocol=unix url=/1.0 username=root
DEBUG  [2022-09-02T10:41:03Z] Handling API request                          ip=@ method=GET protocol=unix url=/1.0/storage-pools/beegfs username=root
DEBUG  [2022-09-02T10:41:03Z] Handling API request                          ip=@ method=GET protocol=unix url=/1.0/events username=root
DEBUG  [2022-09-02T10:41:03Z] Event listener server handler started         id=5a834c90-cd5e-4047-9a54-5ee651dfb3b8 local=/var/snap/lxd/common/lxd/unix.socket remote=@
DEBUG  [2022-09-02T10:41:03Z] Handling API request                          ip=@ method=POST protocol=unix url=/1.0/instances username=root
DEBUG  [2022-09-02T10:41:03Z] Responding to instance create                
DEBUG  [2022-09-02T10:41:03Z] Started operation                             class=task description="Creating instance" operation=1849e6a9-0842-4456-80e6-d8071ef02b5a project=default
DEBUG  [2022-09-02T10:41:03Z] New operation                                 class=task description="Creating instance" operation=1849e6a9-0842-4456-80e6-d8071ef02b5a project=default
DEBUG  [2022-09-02T10:41:03Z] Connecting to a remote simplestreams server   URL="https://cloud-images.ubuntu.com/releases"
DEBUG  [2022-09-02T10:41:03Z] Handling API request                          ip=@ method=GET protocol=unix url=/1.0/operations/1849e6a9-0842-4456-80e6-d8071ef02b5a username=root
DEBUG  [2022-09-02T10:41:04Z] Acquiring lock for image                      fingerprint=a3ca829e3a556de86db0106615add923a55d1619dd838cf1f127c13fc06c0611
DEBUG  [2022-09-02T10:41:04Z] Lock acquired for image                       fingerprint=a3ca829e3a556de86db0106615add923a55d1619dd838cf1f127c13fc06c0611
DEBUG  [2022-09-02T10:41:04Z] Image already exists in the DB                fingerprint=1c72c70f037b2fd8ce2db8ac00cbbe82e8c6091eed1f2e08831bc1365ae4dcf2
DEBUG  [2022-09-02T10:41:04Z] Acquiring lock for image                      fingerprint=1c72c70f037b2fd8ce2db8ac00cbbe82e8c6091eed1f2e08831bc1365ae4dcf2
DEBUG  [2022-09-02T10:41:04Z] Lock acquired for image                       fingerprint=1c72c70f037b2fd8ce2db8ac00cbbe82e8c6091eed1f2e08831bc1365ae4dcf2
DEBUG  [2022-09-02T10:41:04Z] Instance operation lock created               action=create instance=bee1 project=default reusable=false
INFO   [2022-09-02T10:41:04Z] Creating instance                             ephemeral=false instance=bee1 instanceType=container project=default
DEBUG  [2022-09-02T10:41:04Z] Adding device                                 device=eth0 instance=bee1 instanceType=container project=default type=nic
DEBUG  [2022-09-02T10:41:04Z] Adding device                                 device=root instance=bee1 instanceType=container project=default type=disk
INFO   [2022-09-02T10:41:04Z] Created container                             ephemeral=false instance=bee1 instanceType=container project=default
DEBUG  [2022-09-02T10:41:04Z] CreateInstanceFromImage started               instance=bee1 project=default
INFO   [2022-09-02T10:41:04Z] Image unpack started                          imageFile=/var/snap/lxd/common/lxd/images/1c72c70f037b2fd8ce2db8ac00cbbe82e8c6091eed1f2e08831bc1365ae4dcf2 volName=bee1
DEBUG  [2022-09-02T10:41:04Z] Running filler function                       dev= driver=dir path=/var/snap/lxd/common/lxd/storage-pools/beegfs/containers/bee1 pool=beegfs
DEBUG  [2022-09-02T10:41:04Z] Updated metadata for operation                class=task description="Creating instance" operation=1849e6a9-0842-4456-80e6-d8071ef02b5a project=default
DEBUG  [2022-09-02T10:41:34Z] Instance operation lock finished              action=create err="Instance \"create\" operation timed out after 30s" instance=bee1 project=default reusable=false

This is good stuff; I had too many log entries to see it clearly, but this gives me a code path to start looking at. I will spin up a dev environment and see if I can trace where things go wrong. Appreciate the help.

Ok, so a bit of news: after adding tons of debug statements to the code, I discovered it's probably just the fact that my BeeGFS setup is too slow.

It got 'stuck' on the command unsquashfs -f -d /var/lib/lxd/storage-pools/beegfs-mnt/containers/maran/rootfs -n -da 94 -fr 94 -p 1 /var/lib/lxd/images/95f82d22bea6a0d20dc3c06a59929b768abd0c7243d5298c3973f1808688d117.rootfs; however, when I run this command under strace, I can see it's simply working on extracting everything; it's just taking ages.

Side note: I did this from a VM running on my development machine, which is connected to the BeeGFS server not directly but via the internet, so this is bound to be slow. I will see if I can replicate this on an LXD instance that is directly connected to the BeeGFS server so I can rule out bandwidth issues.

Hmm, that sounds strange. I don't think it's a slowness issue: my BeeGFS is on SSDs, and I looked 5 minutes ago; the container launch command I executed days ago was still in that "Unpacking" state.

Secondly, there’s nothing on the filesystem after several days of unpacking that minimal 200 MB image.

Thirdly, for a VM, when I create one (which works), there are just 5-10 files (one big VM file and 5 smaller config-related files). I don't know if unpacked containers look different (I don't use them), but in any case, even on a slow BeeGFS it shouldn't take more than 30 seconds if it takes 5 seconds on ext4/xfs.

I suspect it may be something related to how LXD references BeeGFS, or maybe whether symbolic links or other features are required for containers to work.

sysACLsEnabled is false by default (it's available to paying BeeGFS customers only) and I don't have it enabled. The other is sysCreateHardlinksAsSymlinks; I don't know if that's required for LXD, so I won't try enabling it, but by default that one is false as well. The first option seems more relevant here, but I'd like to know what the devs think is more likely, or how to troubleshoot.
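For reference, both options live in the BeeGFS client config; a minimal sketch of the relevant lines with their shipped defaults (assuming the stock file location):

```
# /etc/beegfs/beegfs-client.conf (defaults shown)
sysACLsEnabled               = false
sysCreateHardlinksAsSymlinks = false
```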

I think this is complicated enough to be posted to the Github issues :slight_smile:

You were right: FATAL ERROR: create_inode: failed to create hardlink, because Operation not permitted.

I've made changes to BeeGFS to create symlinks when hardlinks are requested; this got me past that issue. Sadly, the next issue was:

Error: Failed instance creation: Failed creating instance from image: Unpack failed: Failed to run: unsquashfs -f -d /var/snap/lxd/common/lxd/storage-pools/beegfs/containers/unified-tortoise/rootfs -n /var/snap/lxd/common/lxd/images/95f82d22bea6a0d20dc3c06a59929b768abd0c7243d5298c3973f1808688d117.rootfs: Process exited with non-zero value 2 (write_xattr: failed to write xattr security.capability for file /var/snap/lxd/common/lxd/storage-pools/beegfs/containers/unified-tortoise/rootfs/usr/bin/mtr-packet because extended attributes are not supported by the destination filesystem
Ignoring xattrs in filesystem
To avoid this error message, specify -no-xattrs)

I changed the LXD code to supply the -no-xattrs flag, after which the launch worked; however, I can't exec into the container, as lxc exec just sits there doing nothing.

At this point, I'm afraid I need some help from the devs to know whether or not running LXD without xattrs and hardlinks is even supported.
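For anyone hitting this later: before pointing an LXD dir pool at a network filesystem, it's easy to probe whether the mount supports the two features that bit us here (hardlinks and extended attributes). This is a hypothetical helper, not something from LXD itself; it checks user-namespace xattrs only, since security.* xattrs additionally need privileges, and os.setxattr is Linux-only:

```python
import os
import tempfile

def probe_fs_features(path):
    """Check whether the filesystem at `path` supports hardlinks and
    user extended attributes (both are exercised by unsquashfs during
    LXD's container image unpack)."""
    results = {}
    with tempfile.TemporaryDirectory(dir=path) as d:
        src = os.path.join(d, "src")
        with open(src, "w") as f:
            f.write("x")

        # Hardlink support (failed here on BeeGFS with EPERM)
        try:
            os.link(src, os.path.join(d, "lnk"))
            results["hardlinks"] = True
        except OSError:
            results["hardlinks"] = False

        # xattr support in the user namespace
        try:
            os.setxattr(src, "user.probe", b"1")
            results["xattrs"] = True
        except OSError:
            results["xattrs"] = False
    return results

print(probe_fs_features("/tmp"))
```

Running it against the BeeGFS mount instead of /tmp would show both features missing in the setup described above.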


That’s cool.

Similar problem on GlusterFS here (no workaround, but some hints, though).

@stgraber that thread had no conclusion, is there anything we could try here?