Ubuntu 18.04 - Linux procyon 5.4.0-81-generic #91~18.04.1-Ubuntu SMP Fri Jul 23 13:36:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Running latest BeeGFS: 7.1.3.
Latest is 7.3.1, but I’m not sure if it can run on Ubuntu 18.04.
I have a newer setup but can’t duplicate your problem because of this, so I’ll have to wait and see if either that can be worked around, or maybe others will know why you’re seeing this problem.
Sorry, it’s early and I haven’t had my coffee yet; it is 7.1.3. BeeGFS itself is working fine, and I read and write files to it from the host.
Turns out I had to remove open-vm-tools, because that seems to have been “locking” a kernel module.
After that I was able to deploy a VM w/o problems - Ubuntu 20.04 with BeeGFS 7.3.0 (I left additional notes about the process in that other post, but there’s nothing special about it, it just worked).
Yeah VM’s work but sadly those don’t support proxy devices which I heavily use. I don’t want to give them all a dedicated IPv4, just IPv6.
A bit off-topic but what kind of settings (stripe/chunksize) are you using for the VMs?
They do in nat=true mode, if that helps.
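For what it’s worth, a NAT-mode proxy device needs the instance to have a static IP; a rough sketch (the instance name, device name, port, and addresses are all made up for illustration):

```shell
# Give the instance a static IPv4 on its NIC so NAT-mode proxying can
# target it (instance name, device name and addresses are hypothetical).
lxc config device override natural-locust eth0 ipv4.address=192.168.1.10

# Forward a host port to the same port on the instance via NAT.
lxc config device add natural-locust web proxy \
    nat=true \
    listen=tcp:203.0.113.10:80 \
    connect=tcp:0.0.0.0:80
```

In NAT mode the traffic is redirected with firewall rules instead of being relayed by a userspace process, which is why the instance needs a static address.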
Interesting question - I’m not, to be honest. I’ve been using BeeGFS for other stuff (mostly analytics-like workloads), but LXD was on my “to experiment with” list, so when I saw your post I thought that was an excuse to do it right away…
As you can see from my screenshot (and I’m sure your own environment), this looks (relatively) promising - one big file and a few small, static files. That’s good.
$ sudo dir -lat /mnt/beegfs/lxd-pool/virtual-machines/natural-locust
total 1679212
-rw-r--r-- 1 root root 10737418240 Sep 1 15:39 root.img
-rw------- 1 lxd root 131072 Sep 1 15:18 qemu.nvram
d--x------ 4 root root 10 Sep 1 15:18 .
-rw-r--r-- 1 root root 692 Sep 1 15:18 agent-client.crt
-rw------- 1 root root 288 Sep 1 15:18 agent-client.key
-rw-r--r-- 1 root root 721 Sep 1 15:18 agent.crt
-rw------- 1 root root 288 Sep 1 15:18 agent.key
-r-------- 1 root root 2105 Sep 1 15:18 backup.yaml
dr-x------ 6 lxd root 10 Sep 1 15:18 config
drwx--x--x 3 root root 1 Sep 1 15:18 ..
-rw-r--r-- 1 root root 295 Aug 10 07:29 metadata.yaml
drwxr-xr-x 2 root root 1 Aug 10 07:29 templates
But inside the VM, IO requests are small, so I probably wouldn’t run important OLTP DBs in these VMs (many small, random writes and reads).
I think anything to do with large files & sequential IO should be fine, also VMs that aren’t too busy in terms of small IO requests, and don’t get modified a lot (DHCP, DNS, App servers, Web servers, and such).
So I’d likely use small stripes for VM workloads with small/random IO.
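If you want to experiment, BeeGFS stripe settings can be set per directory and affect newly created files; a sketch, assuming the pool path from the listing above (the chunk size and target count are just illustrative values, not a recommendation):

```shell
# Show the current stripe pattern of the LXD pool directory.
beegfs-ctl --getentryinfo /mnt/beegfs/lxd-pool/virtual-machines

# Use a smaller chunk size for new files created in this directory
# (values here are illustrative).
beegfs-ctl --setpattern --chunksize=128k --numtargets=4 \
    /mnt/beegfs/lxd-pool/virtual-machines
```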
I have a very small environment in my lab, so I may not be able to obtain production-worthy insights from it, but I’ll keep a few VMs on it and try to observe. (I also have Nomad in this environment, and that’s another to-do when I find time - explore whether Nomad & LXD can be used together.)
I did not know this; I will experiment with it.
Back on topic: the container part is still not working, so if anybody has any ideas please let me know.
Maybe submit more details?
My NAT config (on for IPv4):
$ lxc network show lxdbr0
config:
bridge.mtu: "1450"
ipv4.address: 192.168.1.1/24
ipv4.dhcp.ranges: 192.168.1.2-192.168.1.5
ipv4.nat: "true"
ipv6.address: none
I’ve found an interesting problem - when lxc commands are executed from a BeeGFS path, the command hangs for a long while, and then exits with this message.
$ cd /mnt/beegfs
$ sudo lxc start natural-locust
cannot stat path of the current working directory: Communication error on send
The same command executes as usual from $HOME or other paths, so it’s easy to work around that behavior.
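That workaround can be wrapped up so you never have to think about it; a minimal sketch (the function name is made up):

```shell
# Run any command from a safe, non-BeeGFS working directory.
# The subshell keeps the caller's cwd unchanged.
safe_cwd_run() {
    ( cd /tmp && "$@" )
}

# Usage: safe_cwd_run sudo lxc start natural-locust
```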
Edit: I don’t know if this will help, but I measured the time-to-timeout - 9m13s.
I am issuing the commands from my home folder; I waited 30 minutes but a timeout never happens for me. Could you try starting a container instead of a VM and let me know if that works on your stack?
Creating a container hangs in this state for a while (it hasn’t timed out yet, but it probably will after some time):
$ sudo lxc launch ubuntu:22.04 tainer --storage beegfs
Creating tainer
Retrieving image: Unpack: 100% (949.31MB/s)
I assume in your case creating a container outside of BeeGFS does work, even on the same LXD host?
I’ll try that after the command above times out.
Awesome, this at least confirms it’s not an issue specific to my setup. I wonder if it will ever time out.
It seems to be waiting for something that doesn’t exist or is busy:
futex(0x11f3778, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x11f3698, FUTEX_WAKE_PRIVATE, 1) = 1
read(3, "\211\tkeepalive", 4096) = 11
futex(0xc000050948, FUTEX_WAKE_PRIVATE, 1) = 1
write(6, "\0", 1) = 1
write(3, "\212\211\345A\300\23\216$\245c\204-\251e\200", 15) = 15
read(3, 0xc000187000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 140723108802120) = 0
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc000084548, FUTEX_WAKE_PRIVATE, 1) = 1
sched_yield() = 0
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=71658, si_uid=0} ---
rt_sigreturn({mask=[]}) = 0
futex(0x11f3698, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x11f3768, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
getpid() = 71658
tgkill(71658, 71736, SIGURG) = 0
futex(0x11f3698, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x11f3790, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x11f3698, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x11f2168, FUTEX_WAIT_PRIVATE, 0, NULL
Tried on local (ext4) pool, container can be created and used.
Tried again on BeeGFS, got this log (and then the log stops and nothing happens with the create command until it’s interrupted with Ctrl+C three times):
DEBUG [2022-09-02T10:40:22Z] GetInstanceUsage started instance=natural-locust project=default
DEBUG [2022-09-02T10:41:03Z] Handling API request ip=@ method=GET protocol=unix url=/1.0 username=root
DEBUG [2022-09-02T10:41:03Z] Handling API request ip=@ method=GET protocol=unix url=/1.0/storage-pools/beegfs username=root
DEBUG [2022-09-02T10:41:03Z] Handling API request ip=@ method=GET protocol=unix url=/1.0/events username=root
DEBUG [2022-09-02T10:41:03Z] Event listener server handler started id=5a834c90-cd5e-4047-9a54-5ee651dfb3b8 local=/var/snap/lxd/common/lxd/unix.socket remote=@
DEBUG [2022-09-02T10:41:03Z] Handling API request ip=@ method=POST protocol=unix url=/1.0/instances username=root
DEBUG [2022-09-02T10:41:03Z] Responding to instance create
DEBUG [2022-09-02T10:41:03Z] Started operation class=task description="Creating instance" operation=1849e6a9-0842-4456-80e6-d8071ef02b5a project=default
DEBUG [2022-09-02T10:41:03Z] New operation class=task description="Creating instance" operation=1849e6a9-0842-4456-80e6-d8071ef02b5a project=default
DEBUG [2022-09-02T10:41:03Z] Connecting to a remote simplestreams server URL="https://cloud-images.ubuntu.com/releases"
DEBUG [2022-09-02T10:41:03Z] Handling API request ip=@ method=GET protocol=unix url=/1.0/operations/1849e6a9-0842-4456-80e6-d8071ef02b5a username=root
DEBUG [2022-09-02T10:41:04Z] Acquiring lock for image fingerprint=a3ca829e3a556de86db0106615add923a55d1619dd838cf1f127c13fc06c0611
DEBUG [2022-09-02T10:41:04Z] Lock acquired for image fingerprint=a3ca829e3a556de86db0106615add923a55d1619dd838cf1f127c13fc06c0611
DEBUG [2022-09-02T10:41:04Z] Image already exists in the DB fingerprint=1c72c70f037b2fd8ce2db8ac00cbbe82e8c6091eed1f2e08831bc1365ae4dcf2
DEBUG [2022-09-02T10:41:04Z] Acquiring lock for image fingerprint=1c72c70f037b2fd8ce2db8ac00cbbe82e8c6091eed1f2e08831bc1365ae4dcf2
DEBUG [2022-09-02T10:41:04Z] Lock acquired for image fingerprint=1c72c70f037b2fd8ce2db8ac00cbbe82e8c6091eed1f2e08831bc1365ae4dcf2
DEBUG [2022-09-02T10:41:04Z] Instance operation lock created action=create instance=bee1 project=default reusable=false
INFO [2022-09-02T10:41:04Z] Creating instance ephemeral=false instance=bee1 instanceType=container project=default
DEBUG [2022-09-02T10:41:04Z] Adding device device=eth0 instance=bee1 instanceType=container project=default type=nic
DEBUG [2022-09-02T10:41:04Z] Adding device device=root instance=bee1 instanceType=container project=default type=disk
INFO [2022-09-02T10:41:04Z] Created container ephemeral=false instance=bee1 instanceType=container project=default
DEBUG [2022-09-02T10:41:04Z] CreateInstanceFromImage started instance=bee1 project=default
INFO [2022-09-02T10:41:04Z] Image unpack started imageFile=/var/snap/lxd/common/lxd/images/1c72c70f037b2fd8ce2db8ac00cbbe82e8c6091eed1f2e08831bc1365ae4dcf2 volName=bee1
DEBUG [2022-09-02T10:41:04Z] Running filler function dev= driver=dir path=/var/snap/lxd/common/lxd/storage-pools/beegfs/containers/bee1 pool=beegfs
DEBUG [2022-09-02T10:41:04Z] Updated metadata for operation class=task description="Creating instance" operation=1849e6a9-0842-4456-80e6-d8071ef02b5a project=default
DEBUG [2022-09-02T10:41:34Z] Instance operation lock finished action=create err="Instance \"create\" operation timed out after 30s" instance=bee1 project=default reusable=false
This is good stuff; I had too many log entries to see it clearly but this gives me a codepath to start looking at. I will spin up a dev environment and see if I can trace where things go wrong. Appreciate the help.
OK, a bit of news: after adding tons of debug statements to the code, I discovered it’s probably just the fact that my BeeGFS setup is too slow.
It got ‘stuck’ on the command unsquashfs -f -d /var/lib/lxd/storage-pools/beegfs-mnt/containers/maran/rootfs -n -da 94 -fr 94 -p 1 /var/lib/lxd/images/95f82d22bea6a0d20dc3c06a59929b768abd0c7243d5298c3973f1808688d117.rootfs
However, when I run this command under strace, I can see it’s simply working on extracting everything; it’s just taking ages.
Side note: I did this from a VM running on my development machine, which is not connected to the BeeGFS server directly but over the internet, so this is bound to be slow. I will see if I can replicate this on an LXD instance that is directly connected to the BeeGFS server, so I can rule out bandwidth issues.
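One way to separate “slow” from “stuck” is to run the same unpack step by hand on both filesystems and time it (a sketch; the image path is the one from the command above, and the destination directories are made up):

```shell
IMG=/var/lib/lxd/images/95f82d22bea6a0d20dc3c06a59929b768abd0c7243d5298c3973f1808688d117.rootfs

# Unpack to BeeGFS and to a local filesystem; compare wall-clock times.
time sudo unsquashfs -f -n -d /mnt/beegfs/unsquash-test "$IMG"
time sudo unsquashfs -f -n -d /tmp/unsquash-test "$IMG"
```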
Hmm, that sounds strange. I don’t think it’s a slowness issue - my BeeGFS is on SSDs, and I just looked 5 min ago: the container launch command I executed days ago was still in that “Unpacking” state.
Secondly, there’s nothing on the filesystem after several days of unpacking that minimal 200 MB image.
Thirdly, for a VM, when I create one (which works), there are just 5-10 files (one big VM file and 5 smaller config-related files). I don’t know if unpacked containers look different (I don’t use them), but in any case, even on a slow BeeGFS it shouldn’t take more than 30 seconds if it takes 5 seconds on ext4/xfs.
I suspect it may be something related to how LXD references BeeGFS, or maybe that symbolic links or other features are required for containers to work.
sysACLsEnabled is false by default (ACL support is available to paying BeeGFS customers only) and I don’t have it enabled. The other option is sysCreateHardlinksAsSymlinks; I don’t know if that’s required for LXD, so I won’t try enabling it, but by default that one is false as well. The first option seems more relevant here, but I’d like to know what the devs think is more likely, or how to troubleshoot.
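To double-check what the running config actually says, something like this should work (a sketch assuming the default config file locations; depending on the version, these options live in the client and/or metadata server config):

```shell
grep -E 'sysACLsEnabled|sysCreateHardlinksAsSymlinks' \
    /etc/beegfs/beegfs-client.conf /etc/beegfs/beegfs-meta.conf
```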
I think this is complicated enough to be posted to the GitHub issues.
You were right: FATAL ERROR: create_inode: failed to create hardlink, because Operation not permitted.
I’ve made changes to BeeGFS to create symlinks when hardlinks are requested; this got me past that issue. Sadly, the next issue was:
Error: Failed instance creation: Failed creating instance from image: Unpack failed: Failed to run: unsquashfs -f -d /var/snap/lxd/common/lxd/storage-pools/beegfs/containers/unified-tortoise/rootfs -n /var/snap/lxd/common/lxd/images/95f82d22bea6a0d20dc3c06a59929b768abd0c7243d5298c3973f1808688d117.rootfs: Process exited with non-zero value 2 (write_xattr: failed to write xattr security.capability for file /var/snap/lxd/common/lxd/storage-pools/beegfs/containers/unified-tortoise/rootfs/usr/bin/mtr-packet because extended attributes are not supported by the destination filesystem
Ignoring xattrs in filesystem
To avoid this error message, specify -no-xattrs)
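If it helps, whether the mount supports xattrs at all can be checked without LXD; a sketch (the test file path is made up; user-namespace xattrs need no special privileges, unlike the security.capability one from the error):

```shell
# Create a scratch file on the BeeGFS mount, then set and read back
# a user-namespace extended attribute on it.
touch /mnt/beegfs/xattr-test
setfattr -n user.test -v 1 /mnt/beegfs/xattr-test \
    && getfattr -n user.test /mnt/beegfs/xattr-test
rm /mnt/beegfs/xattr-test
```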
I changed the LXD code to supply the -no-xattrs flag, after which the launch worked; however, I can’t exec into the container, as lxc exec just sits there doing nothing.
At this point I am afraid I need some help from the devs to know whether or not running lxd without xattrs and hardlinks is even supported.
That’s cool.
Similar problem on GlusterFS here (no workaround, but some hints).
@stgraber that thread had no conclusion, is there anything we could try here?