OS: Linux Mint 18.3 Sylvia
LXD Version: 3.18
LXC Version: 3.18
Which LXD: /snap/bin/lxd
Which LXC: /snap/bin/lxc
Summary
I’m working on optimizing my development environment. I have a ROS development container configured. When I launch my project with roslaunch, spawning each node/process takes noticeably longer than it does when launching from the host.
I decided to investigate with sysbench to see if I could locate the bottleneck. I went through each basic test: CPU, disk I/O, and memory. I found a difference in memory test execution time between what runs on the host versus in the container.
user@host ~ $ sysbench --test=memory run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1K
Memory transfer size: 102400M
Memory operations type: write
Memory scope type: global
Threads started!
Done.
Operations performed: 104857600 (3131320.06 ops/sec)
102400.00 MB transferred (3057.93 MB/sec)
Test execution summary:
total time: 33.4867s
total number of events: 104857600
total time taken by event execution: 27.0143
per-request statistics:
min: 0.00ms
avg: 0.00ms
max: 0.14ms
approx. 95 percentile: 0.00ms
Threads fairness:
events (avg/stddev): 104857600.0000/0.00
execution time (avg/stddev): 27.0143/0.00
user@container ~ $ sysbench --test=memory run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1K
Memory transfer size: 102400M
Memory operations type: write
Memory scope type: global
Threads started!
Done.
Operations performed: 104857600 (2476192.03 ops/sec)
102400.00 MB transferred (2418.16 MB/sec)
Test execution summary:
total time: 42.3463s
total number of events: 104857600
total time taken by event execution: 33.9804
per-request statistics:
min: 0.00ms
avg: 0.00ms
max: 0.55ms
approx. 95 percentile: 0.00ms
Threads fairness:
events (avg/stddev): 104857600.0000/0.00
execution time (avg/stddev): 33.9804/0.00
There is a difference here, however I’m not entirely sure whether this is a good lead or not.
What could be causing the container to spawn system processes slower than the host?
So I guess it depends exactly what the process spawning involves.
It could be I/O, which is likely set up differently for your container than for your host.
Or it could be scheduling related, especially if you have some limits in place on your container.
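For example, any limits inherited from profiles and the storage driver backing the container (which affects I/O behaviour) can be checked with something like this; `mycontainer` is a placeholder for your container name:

```shell
# Effective container config, including keys inherited from profiles;
# any CPU/memory/I/O limits would show up as limits.* keys
lxc config show mycontainer --expanded | grep -i 'limits\.'

# Which storage driver backs the container (dir, zfs, lvm, ...)
lxc storage list
```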
Right… roslaunch is firing up nearly 100 processes. There is quite a bit going on with the system while the programs are running… logging data is definitely one of those things.
How does one verify the I/O configuration on the container and the host?
It could be scheduling related too…
However, there are currently no limits in place on my container… that I’m aware of.
I’ve been using Docker for my ROS development environment for the past year.
I’ve been wanting to switch to an OS level containerization, hence why I’m migrating to LXC.
The important note is that I see the same issue when launching dozens of processes inside the Docker container as well. The amount of time it takes to launch these processes is slower inside the container than on my host.
I suspect that the issue lies at the kernel level because of this fact…
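One way to isolate raw process-spawn overhead from ROS itself is to time a burst of trivial fork/exec cycles and run the same loop on the host and in the container; a minimal sketch (the count of 200 is arbitrary):

```shell
# Time N trivial fork/exec cycles; compare the result on host vs. container
N=200
start=$(date +%s%N)
for i in $(seq "$N"); do /bin/true; done
end=$(date +%s%N)
echo "$N spawns in $(( (end - start) / 1000000 )) ms"
```

If the per-spawn cost is consistently higher inside the container, that points at syscall/scheduling overhead rather than anything ROS-specific.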
One interesting thing worth trying may be to use a privileged container (security.privileged=true) to see if maybe it’s some of the spectre/meltdown mitigations that are applied to user namespaces which are causing a syscall slowdown.
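Something along these lines should do it (container name is illustrative):

```shell
# Mark the container privileged and restart it so the change takes effect
lxc config set mycontainer security.privileged true
lxc restart mycontainer
```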
Interesting: setting the container to privileged caused an error.
I’m assuming that you can set container security privileges on existing containers…
Below is what I did…
user@host ~ $ lxc restart lvm-testing
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart lvm-testing /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/lvm-testing/lxc.conf:
Try `lxc info --show-log lvm-testing` for more info
Checking out the log…
user@host ~ $ lxc info --show-log lvm-testing
Name: lvm-testing
Location: none
Remote: unix://
Architecture: x86_64
Created: 2019/12/04 02:34 UTC
Status: Stopped
Type: persistent
Profiles: rtk-dev
Log:
lxc lvm-testing 20191205155613.632 ERROR conf - conf.c:run_buffer:352 - Script exited with status 1
lxc lvm-testing 20191205155613.632 ERROR conf - conf.c:lxc_setup:3653 - Failed to run mount hooks
lxc lvm-testing 20191205155613.632 ERROR start - start.c:do_start:1321 - Failed to setup container "lvm-testing"
lxc lvm-testing 20191205155613.632 ERROR sync - sync.c:__sync_wait:62 - An error occurred in another process (expected sequence number 5)
lxc lvm-testing 20191205155613.632 WARN network - network.c:lxc_delete_network_priv:3377 - Failed to rename interface with index 11 from "eth0" to its initial name "veth247fe952"
lxc lvm-testing 20191205155613.632 ERROR lxccontainer - lxccontainer.c:wait_on_daemonized_start:873 - Received container state "ABORTING" instead of "RUNNING"
lxc lvm-testing 20191205155613.633 ERROR start - start.c:__lxc_start:2039 - Failed to spawn container "lvm-testing"
lxc 20191205155613.796 WARN commands - commands.c:lxc_cmd_rsp_recv:135 - Connection reset by peer - Failed to receive response for command "get_state"
Looks like it’s trying to rename my network interface?
Okay, after doing some troubleshooting here, I found that I could create a simple privileged container without using “my-profile”.
So I know my LXD/LXC config is fine; there’s something in “my-profile” that’s causing issues here.
So I did some more troubleshooting, and found that the line
nvidia.runtime: "true"
is the culprit here…
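In case anyone else hits this, the key can be dropped from the profile with something like the following (`rtk-dev` is the profile shown in the log above; adjust names to your setup):

```shell
# Remove the nvidia runtime key, which conflicts with
# security.privileged=true in this setup, then restart
lxc profile unset rtk-dev nvidia.runtime
lxc restart lvm-testing
```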
This of course breaks my ability to use CUDA libraries and OpenGL applications… =(
Update: I must have my OpenGL libs working in order for the project to build, so I’m unable to verify whether running a privileged container would resolve the speed issue here.
I managed to find a workaround to get GPU pass-through working inside a privileged container.
I had to manually map the Nvidia driver and some libraries from the host to the container.
Now I have OpenGL working inside my privileged container. I do not have the CUDA libraries working, since my project doesn’t utilize CUDA at the moment…
I’m able to run glxgears inside it just fine, and my project is able to run as well.
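Roughly, the mapping looked like the following; the library path is just one example, and the exact set of files depends on your host’s driver version, so treat this as a sketch rather than the full list:

```shell
# Pass the GPU device nodes through to the privileged container
lxc config device add lvm-testing gpu gpu

# Bind-mount one of the host's NVIDIA userspace libraries read-only;
# repeat for each library the driver needs (paths vary per host)
lxc config device add lvm-testing libgl disk \
    source=/usr/lib/x86_64-linux-gnu/libGL.so.1 \
    path=/usr/lib/x86_64-linux-gnu/libGL.so.1
```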