Fixing forgejo-runner's LXC logging

Hello LXC community! I’m a Forgejo contributor who is working on solving a bug with the Forgejo Actions runner where running jobs in LXC containers causes a loss of random chunks of log data. I’m reaching out here for two possible pieces of help…

For context, the Forgejo Actions runner allows running continuous integration scripts on Forgejo repositories, very similar to the more well-known GitHub Actions. Typically these occur in docker containers, but Forgejo has enhanced this capability with LXC containers as well. We’ve been struggling with an issue where, when a CI action is running and failing, it may output a lot of logs in a short time window. On our LXC implementation we find that sometimes these logs are missing chunks of data.

I believe that the root cause of this issue is that we run lxc-attach with a target command, and when lxc-attach pipes stdout around it allows data from “short writes” to stdout to be discarded (lxc/src/lxc/terminal.c at cba4ce7ef2ec09176de0aaca80f94636bcffabf2 · lxc/lxc · GitHub). As Forgejo’s runner starts subcommands in a pty, lxc-attach has a small buffer to write into and, as fast as we can read that buffer, we see data loss intermittently. The same would happen with the --pty-log option; as our use-case requires us to stream data into another application, we’d need to connect this to a FIFO which would have some limited buffering capability.

I’ve currently minimized the amount of data loss by attempting to read the pty buffer as quickly as we can to keep it as-available as possible for lxc-attach to write to. But minimizing data loss isn’t as great as eliminating it.

My new plan to address this issue has two steps:

  • I’ve introduced a new automated test which attempts to verify as much of the LXC environment as possible – eg. that the correct user, cwd, linux namespaces, cgroup, pty, are all “operating as expected”.
  • Now I’m proposing to replace lxc-attach with nsenter so that we can avoid the short-write behaviour of lxc-attach. With the automated test unchanged, the hope is that there’s a guarantee that all the relevant isolation, security, and other attributes of running commands within the container are maintained.

So, the two pieces of help that I’m looking for are…

  • I’m reasonably confident that replacing lxc-attach with nsenter (Making sure you're not a bot!) is fixing our log problem, but I’m less confident that we’re doing everything we need to do to provide the same execution environment under nsenter that we previously had. If someone with more LXC experience could review this idea and give me some feedback, it’d be greatly appreciated.
  • I’m operating under the assumption that the lxc-attach behaviour with short writes is as-designed and as-intended, which is why I’m taking the step of working around it. I’d love to hear a confirmation of this (and possibly an explanation? perhaps intentionally not providing backpressure to the running command, for some reason?). Either way we’d probably have to continue down the nsenter path in the short-term, but if there was a pathway to fixing this in lxc-enter then there may be different options in the long-term.

Appreciate any time that could be spared to help out with this.

2 Likes

@amikhalitsyn

Hi @mfenniak ,

thanks for doing this research and sharing results with us.

The problem you describe is very similar to lxc-attach drops output with lots of console output · Issue #4546 · lxc/lxc · GitHub, what we want to do is to merge a community member proposed fix (lxc/file_utils: ensure complete data write in lxc_write_nointr by DreamConnected · Pull Request #4597 · lxc/lxc · GitHub) and if you could validate that this helps that would be awesome!

Upd: please, can you try to revert terminal: make a terminal FDs non-blocking · lxc/lxc@715fb4e · GitHub , build LXC from source and test if this helps to solve the issue for you? (More details lxc/file_utils: ensure complete data write in lxc_write_nointr by DreamConnected · Pull Request #4597 · lxc/lxc · GitHub)

P.S. Forgejo is amazing piece of software. Thanks for working on it :wink:

Kind regards,

Alex

I do have some testing results to share, but I’d also like to ask if you could comment on the temporary strategy of replacing lxc-attach with nsenter mentioned in my first post? I’m happy to help verify we could fix this issue, but forgejo-runner will also need a functional workaround that we could release sooner rather than later, and not require a long release cycle to get into the Debian releases we use for LXC execution. :slightly_smiling_face:

Test results are negative, indicating that the problem still exists with the proposed patches:

For clarity, I’ll describe my testing procedure in detail:

forgejo-runner has a test called TestRunnerLXC/Large_Fast_Logs which runs long echo commands in an LXC container in a loop, and verifies that all the lines are captured by forgejo-runner. On my Debian 13 system (lxc-attach –version => 6.0.4) this test fails intermittently, so I trigger its failure by running in a loop 100 times:

$ go test -timeout 30m -count 100 -failfast -run TestRunnerLXC/Lo ./internal/app/run
...
Error: "[...]" should have 2400 item(s), but has 2149
...

In order to test the revert, I’ve pulled down the lxc repo @ cba4ce7ef, and built it. I’ve modified my forgejo-runner by changing the path to lxc-attach here to the repo location + /build/src/lxc/tools/lxc-attach. I’ve then run the same test command as above and verified that the test still fails on HEAD / commit cba4ce7ef – failure in about 30 seconds (each test loop takes about 7 seconds, so, this is pretty frequent).

Then, I reverted 715fb4eff, rebuilt, and re-ran the test loop. I began writing a response saying that everything worked with that revert… but then after about 6 minutes, it failed. The failure was of the same nature, a missing chunk of log data in the middle of the runner’s output.

Then, I pulled down PR 4597, and checked out commit 1e1ed6e4a, built it, and reran the test. Failure occurred in 20 seconds.

P.S. Forgejo is amazing piece of software. Thanks for working on it :wink:

Thanks! :slightly_smiling_face:

2 Likes

oh, HUGE thanks for doing this experimentation. Let’s do the following, I’m going to play around with this stuff and make a reproducer in a form of LXC CI test for this. Then we can debug it and include this test in our normal CI procedures to ensure that this will never regress again.

2 Likes

I’ll highlight one thing that makes the issue worse for forgejo-runner, and therefore might be relevant for writing a LXC CI test: we run lxc-attach and attach a pty to it. I believe that a pty has only a 4kB buffer, which causes lxc-attach to short-write to stdout more easily – without a pty attached we don’t experience the issue.

1 Like

Hi @mfenniak,

I checked PR Making sure you're not a bot! And issue Making sure you're not a bot! The test content of issue 1258 cannot be reproduced. Is it possible that the cloud environment affects I/O latency?

Can I ask if you want to restore the commit 715fb4e when applying PR 4597? Because PR 4597 needs to submit 715fb4e.

This patch cannot be replicated on my Android device, and my current device (popos 24.04 LTS (based on ubuntu 24.04 LTS)) also cannot be replicated. Can you provide more detailed lxc logs? This will help me troubleshoot problems.

I’m sorry I just registered and can’t upload the video.

I’m running these tests on my local workstation, in a Debian 13 KVM. As it’s a VM, the performance won’t be native, but it is a high performance workstation.

The test that I performed on your PR did include 715fb4e. I checked out the commit 1e1ed6e, with no modifications. Today I performed a clean build.

With lxc-attach built from commit 1e1ed6e, reproducing the issue with forgejo-runner’s test, and lxc-attach with –logpriority=WARN

lxc-attach f3a8d27db9d7d4c2 20260120211039.335 WARN     terminal - ../src/lxc/terminal.c:lxc_terminal_ptx_io:357 - Short write on terminal r:1024 != w:831
lxc-attach f3a8d27db9d7d4c2 20260120211039.335 WARN     terminal - ../src/lxc/terminal.c:lxc_terminal_ptx_io:357 - Short write on terminal r:224 != w:-1

I’ve uploaded a trace-level log as well to a GitHub gist since I am also not permitted to attach files. But I cannot put in a link to that either: “Sorry you cannot post a link to that host.”. If these restrictions could be relaxed it would be appreciated. :broken_heart:

The trace-level has many more “Short write” messages, some with w:-1 and others without. I don’t see any relevant messages otherwise.

1 Like

can we please unhide / unflag this :heart:

Fixed. The forum anti-spam is a bit trigger happy when dealing with new users and posts containing a bunch of links :slight_smile:

Though that’s with good reasons as we’re being hammered by bots trying to post scam links…

2 Likes

here’s the gist link

top post is still hidden

Oops, fixed that one now too.

2 Likes

Hi @mfenniak ,

Just now, I tried to simulate your environment for some tests. Is your connection to the VM ssh based?

If so, I use the adb shell to simulate remote connections and limit the CPU to 50% (my arm64 8 cores CPU). Limit the bandwidth to 8Mbps (1Mb/s). It was not until the CPU was fully loaded for 8 minutes that the short write was finally repeated.

Let’s simply reproduce:

  1. First write (1024 bytes):
    -Remaining space of terminal buffer: 1024 bytes
    -Actual write: 831 bytes (193 bytes left in the buffer)

  2. Continue to try to write the remaining 193 bytes:
    -write() may return EAGAIN (buffer is full)
    -Enter lxc_wait_for_io_ready to wait

  3. During the waiting period:
    -Terminal disconnected
    -poll() detected POLLHUP/POLLERR
    -return -EPIPE

  4. All subsequent writes:
    -Connection marked as disconnected
    -write() directly returns -1 (EPIPE)

In logcat, Try again is returned first. 5 seconds later, poll() detects a timeout and fd has disconnected the Broken pipe. Finally, all subsequent write() returns -1 and set errno to I/O error. However, in the syslog, lxc does not return errno for this section. I cannot make specific judgments.

PID: 16420 CMD: lxc-attach ubuntu -- bash -c 'for i in $(seq 1000000); do echo "A...A"; done | nl'
W/lxc     (16420): ../src/lxc/terminal.c:352: lxc_terminal_ptx_io - Try again - Short write on terminal r:209 != w:208
W/lxc     (16420): ../src/lxc/terminal.c:352: lxc_terminal_ptx_io - Try again - Short write on terminal r:1024 != w:0
W/lxc     (16420): ../src/lxc/terminal.c:352: lxc_terminal_ptx_io - Try again - Short write on terminal r:1024 != w:0
W/lxc     (16420): ../src/lxc/terminal.c:352: lxc_terminal_ptx_io - Broken pipe - Short write on terminal r:1024 != w:-1
W/lxc     (16420): ../src/lxc/terminal.c:352: lxc_terminal_ptx_io - I/O error - Short write on terminal r:1024 != w:-1
I/lxc     (16420): ../src/lxc/terminal.c:410: lxc_terminal_peer_io_handler - I/O error - Terminal client on fd 16 has exited
W/lxc     (16420): ../src/lxc/terminal.c:792: lxc_terminal_delete - I/O error - Failed to set old terminal settings
PID: 15503 CMD: lxc-attach ubuntu -- bash ; ./test.sh -timeout 30m -count 100000 -Type base64 -length 1000
W/lxc     (15503): ../src/lxc/terminal.c:352: lxc_terminal_ptx_io - Try again - Short write on terminal r:1024 != w:248
W/lxc     (15503): ../src/lxc/terminal.c:352: lxc_terminal_ptx_io - Try again - Short write on terminal r:1024 != w:0
W/lxc     (15503): ../src/lxc/terminal.c:352: lxc_terminal_ptx_io - Try again - Short write on terminal r:1024 != w:0
W/lxc     (15503): ../src/lxc/terminal.c:352: lxc_terminal_ptx_io - Broken pipe - Short write on terminal r:1024 != w:-1
W/lxc     (15503): ../src/lxc/terminal.c:352: lxc_terminal_ptx_io - I/O error - Short write on terminal r:1024 != w:-1
I/lxc     (15503): ../src/lxc/terminal.c:410: lxc_terminal_peer_io_handler - I/O error - Terminal client on fd 16 has exited
W/lxc     (15503): ../src/lxc/terminal.c:792: lxc_terminal_delete - I/O error - Failed to set old terminal settings

I think my tests of the LXC commits may not be following a correct testing procedure. I want to clarify this:

  • I’m checking out LXC, and running make all initially, and meson compile -C buildfor rebuilds.
  • I’ve modified my reproduction so that rather than running /usr/bin/lxc-attach, I’m running /somedir/lxc/build/src/lxc/tools/lxc-attach, built from make all.
    • w/ added –logpriority=TRACE –logfile=/somedir/lxc-attach.log

In order to answer some of @DreamConnected’s questions, I’ve been modifying PR #4597 and attempting to add more logging. However, I’m finding that some of my logging modifications are not effective. For example, if I change the log message “Short write on terminal” to “Really short write on terminal”, I’m still getting “Short write on terminal” in my from my reproduction approach.

Is it possible that this approach isn’t effective because either the build needs to be installed, or some updated binary/library needs to be present inside the container? If I’m not able to make simple log message changes, it raises questions as to whether I’ve effectively tested either #4597 or reverting the non-blocking change.

There is no SSH involved, but the reproduction case is more complex than that as I am encountering this error with the forgejo-runner CI coordinator.

Specifically, my UI terminal (alacritty) on the VM is executing go test, which is creating an LXC container, and then it is running (with a pty attached) sudo→ lxc-attach. The pty master is flushed into memory of the go process as quickly as the system is able into a 100MiB in-memory buffer. This process ends when it receives an error (or an EOF theoretically, but it’s always an error when logged) which is “read /dev/ptmx: input/output error”. The “terminal” being reported in the lxc-attach log will be the pty that forgejo-runner creates.

The output that the reproducing test misses is random, not consistently the end. In seq 2400, sometimes I get line 2400 but observe missing lines 150-181. Sometimes I’m missing the end.

Yeah, it looks like. You need to enforce lxc-attach to use a proper liblxc.so library. You can modify LD_LIBRARY_PATH environment variable to point to your build directory so it will pick up a freshly-built one instead of a system package provided one.

1 Like

Fix lxc-attach: fix data corruption during heavy IO on PTS by mihalicyn · Pull Request #4633 · lxc/lxc · GitHub was merged

@mfenniak if you can test/verify it that would be awesome

1 Like

@amikhalitsyn I’ve just completed a verification of PR 4633 and I am no longer able to reproduce any lost logs. Looks great. :+1: :tada:

Haven’t exactly figured out why my in-tree build wasn’t usable for testing – LD_LIBRARY_PATH was a good idea but updating it didn’t make any change. I switched to building and installing .deb packages on my test system with the changes, then validating that the changes are effective, which is good enough for me. I’m pretty confident that this approach tested PR 4633 correctly on my workload.

My apologies to @DreamConnected for time wasted on my inaccurate test results.

3 Likes