OS: Linux Mint 18.3 Sylvia
LXD Version: 3.18
LXC Version: 3.18
Which LXD: /snap/bin/lxd
Which LXC: /snap/bin/lxc
Summary
I’m working on optimizing my development environment. I have a ROS development container configured. When I launch my project with roslaunch, spawning each node/process takes noticeably longer than it does when launching from the host.
I decided to investigate with sysbench to see if I could locate the bottleneck. I went through each basic test: CPU, disk I/O, and memory. I found a difference in memory test execution time between what runs on the host versus in the container.
user@host ~ $ sysbench --test=memory run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1K
Memory transfer size: 102400M
Memory operations type: write
Memory scope type: global
Threads started!
Done.
Operations performed: 104857600 (3131320.06 ops/sec)
102400.00 MB transferred (3057.93 MB/sec)
Test execution summary:
total time: 33.4867s
total number of events: 104857600
total time taken by event execution: 27.0143
per-request statistics:
min: 0.00ms
avg: 0.00ms
max: 0.14ms
approx. 95 percentile: 0.00ms
Threads fairness:
events (avg/stddev): 104857600.0000/0.00
execution time (avg/stddev): 27.0143/0.00
user@container ~ $ sysbench --test=memory run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1K
Memory transfer size: 102400M
Memory operations type: write
Memory scope type: global
Threads started!
Done.
Operations performed: 104857600 (2476192.03 ops/sec)
102400.00 MB transferred (2418.16 MB/sec)
Test execution summary:
total time: 42.3463s
total number of events: 104857600
total time taken by event execution: 33.9804
per-request statistics:
min: 0.00ms
avg: 0.00ms
max: 0.55ms
approx. 95 percentile: 0.00ms
Threads fairness:
events (avg/stddev): 104857600.0000/0.00
execution time (avg/stddev): 33.9804/0.00
There is a difference here, however I’m not entirely sure whether this is a good lead or not.
What could be causing the container to spawn system processes slower than the host?
So I guess it depends exactly what the process spawning involves.
It could be I/O, which is likely set up differently for your container than for your host.
Or it could be scheduling related, especially if you have some limits in place on your container.
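For example, any limits inherited from profiles and the storage driver backing the container (which affects I/O behaviour) can be checked with something like this; `mycontainer` is a placeholder for your container name:

```shell
# Effective container config, including keys inherited from profiles;
# any CPU/memory/I/O limits would show up as limits.* keys
lxc config show mycontainer --expanded | grep -i 'limits\.'

# Which storage driver backs the container (dir, zfs, lvm, ...)
lxc storage list
```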
Right… roslaunch is firing up nearly 100 processes. There is quite a bit going on with the system while the programs are running… logging data is definitely one of those things.
How does one verify the I/O configuration on the container and the host?
It could be scheduling related too…
However, there are currently no limits in place on my container… that I’m aware of.
I’ve been using Docker for my ROS development environment for the past year.
I’ve been wanting to switch to an OS level containerization, hence why I’m migrating to LXC.
The important note is that I see the same issue when launching dozens of processes inside the Docker container as well. The amount of time it takes to launch these processes is slower inside the container than on my host.
I suspect that the issue lies at the kernel level because of this fact…
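One way to isolate raw process-spawn overhead from ROS itself is to time a burst of trivial fork/exec cycles and run the same loop on the host and in the container; a minimal sketch (the count of 200 is arbitrary):

```shell
# Time N trivial fork/exec cycles; compare the result on host vs. container
N=200
start=$(date +%s%N)
for i in $(seq "$N"); do /bin/true; done
end=$(date +%s%N)
echo "$N spawns in $(( (end - start) / 1000000 )) ms"
```

If the per-spawn cost is consistently higher inside the container, that points at syscall/scheduling overhead rather than anything ROS-specific.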
One interesting thing worth trying may be to use a privileged container (security.privileged=true) to see if maybe it’s some of the spectre/meltdown mitigations that are applied to user namespaces which are causing a syscall slowdown.
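Something along these lines should do it (container name is illustrative):

```shell
# Mark the container privileged and restart it so the change takes effect
lxc config set mycontainer security.privileged true
lxc restart mycontainer
```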
Interesting: setting the container to privileged caused an error.
I’m assuming that you can set container security privileges on existing containers…
Below is what I did…
user@host ~ $ lxc restart lvm-testing
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart lvm-testing /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/lvm-testing/lxc.conf:
Try `lxc info --show-log lvm-testing` for more info
Checking out the log…
user@host ~ $ lxc info --show-log lvm-testing
Name: lvm-testing
Location: none
Remote: unix://
Architecture: x86_64
Created: 2019/12/04 02:34 UTC
Status: Stopped
Type: persistent
Profiles: rtk-dev
Log:
lxc lvm-testing 20191205155613.632 ERROR conf - conf.c:run_buffer:352 - Script exited with status 1
lxc lvm-testing 20191205155613.632 ERROR conf - conf.c:lxc_setup:3653 - Failed to run mount hooks
lxc lvm-testing 20191205155613.632 ERROR start - start.c:do_start:1321 - Failed to setup container "lvm-testing"
lxc lvm-testing 20191205155613.632 ERROR sync - sync.c:__sync_wait:62 - An error occurred in another process (expected sequence number 5)
lxc lvm-testing 20191205155613.632 WARN network - network.c:lxc_delete_network_priv:3377 - Failed to rename interface with index 11 from "eth0" to its initial name "veth247fe952"
lxc lvm-testing 20191205155613.632 ERROR lxccontainer - lxccontainer.c:wait_on_daemonized_start:873 - Received container state "ABORTING" instead of "RUNNING"
lxc lvm-testing 20191205155613.633 ERROR start - start.c:__lxc_start:2039 - Failed to spawn container "lvm-testing"
lxc 20191205155613.796 WARN commands - commands.c:lxc_cmd_rsp_recv:135 - Connection reset by peer - Failed to receive response for command "get_state"
Looks like it’s trying to rename my network interface?
Okay, after doing some troubleshooting here, I found that I could create a simple privileged container without using “my-profile”.
So I know my LXD/LXC config is fine; there’s something in “my-profile” that’s causing issues here.
So I did some more troubleshooting, and found that the line
nvidia.runtime: "true"
is the culprit here…
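In case anyone else hits this, the key can be dropped from the profile with something like the following (`rtk-dev` is the profile shown in the log above; adjust names to your setup):

```shell
# Remove the nvidia runtime key, which conflicts with
# security.privileged=true in this setup, then restart
lxc profile unset rtk-dev nvidia.runtime
lxc restart lvm-testing
```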
This of course breaks my ability to use CUDA libraries and OpenGL applications… =(
Update: I must have my OpenGL libs working in order for the project to build, so I’m unable to verify whether running a privileged container would resolve the speed issue here.
I managed to find a workaround to get GPU pass-through working inside a privileged container.
I had to manually map the Nvidia driver and some libraries from the host to the container.
Now I have OpenGL working inside my privileged container. I do not have the CUDA libraries working, since my project doesn’t utilize CUDA at the moment…
I’m able to run glxgears inside it just fine, and my project is able to run as well.
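Roughly, the mapping looked like the following; the library path is just one example, and the exact set of files depends on your host’s driver version, so treat this as a sketch rather than the full list:

```shell
# Pass the GPU device nodes through to the privileged container
lxc config device add lvm-testing gpu gpu

# Bind-mount one of the host's NVIDIA userspace libraries read-only;
# repeat for each library the driver needs (paths vary per host)
lxc config device add lvm-testing libgl disk \
    source=/usr/lib/x86_64-linux-gnu/libGL.so.1 \
    path=/usr/lib/x86_64-linux-gnu/libGL.so.1
```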