Online cpu numbers in /sys/devices/system/cpu/online and sched_getaffinity masks

I just wasted two days debugging a java8 app where java8 incorrectly determines the number of CPUs: it looks at /sys/devices/system/cpu/online and sched_getaffinity, but those show the host’s CPU numbering, not the container’s, e.g.

[root@int9662 ~]# fgrep -c processor /proc/cpuinfo
8
[root@int9662 ~]# cat /sys/devices/system/cpu/online
0,2,5,7,11,17,19,31

so obviously this is “impossible”: if the system has 8 cores, they are numbered 0-7, not 11, 17 or whatnot. jdk8 simply discards the “impossible” CPUs, which either makes it run slow or, if the count ends up as 0, makes it bomb in all sorts of mysterious ways.

Ok, for java it is fairly easy to fix by setting -XX:ActiveProcessorCount, but there must be other legacy software doing similar things, so would it be possible to fake all of this? I.e. if limits.cpu=n for a container, then /sys/devices/system/cpu/online contains 0-(n-1) and sched_getaffinity returns the corresponding bitmap?

Any ideas @stgraber ?

Is that the LXD snap?
If not, what version of LXCFS are you using and where did you get it from?

There’s a bug in LXCFS 5.0.0 which could explain that. It’s fixed upstream and cherry-picked in the snap but won’t otherwise be available until 5.0.1 in a few weeks.

yeah, it’s a snap

# snap list lxd
Name  Version        Rev    Tracking    Publisher   Notes
lxd   5.0.0-b0287c1  22923  5.0/stable  canonical✓  in-cohort

so whatever version of lxcfs comes with that

Can you show:

  • cat /sys/devices/system/cpu/online (again as it may change)
  • cat /proc/self/status
  • ls -lh /sys/devices/system/cpu/

yeah, it changes all the time as lxd rebalances the cores. That’s what bit me originally: the machines were not loaded enough and all the containers had at least one cpu in the first 8, but as soon as containers started being moved away from the first 8 cores, the fun started :slight_smile:

# cat /sys/devices/system/cpu/online
2,4,7,11,14,25,27,30


# cat /proc/self/status
Name:   cat
Umask:  0022
State:  R (running)
Tgid:   21702
Ngid:   0
Pid:    21702
PPid:   21672
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 256
Groups:
NStgid: 21702
NSpid:  21702
NSpgid: 21702
NSsid:  21672
VmPeak:   224412 kB
VmSize:   224412 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:       708 kB
VmRSS:       708 kB
RssAnon:              68 kB
RssFile:             640 kB
RssShmem:              0 kB
VmData:      312 kB
VmStk:       132 kB
VmExe:        32 kB
VmLib:      1964 kB
VmPTE:        64 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1
Threads:        1
SigQ:   59/514030
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        2
Speculation_Store_Bypass:       thread force mitigated
Cpus_allowed:   4a004894
Cpus_allowed_list:      2,4,7,11,14,25,27,30
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        0
nonvoluntary_ctxt_switches:     0


# ls -lh /sys/devices/system/cpu/
total 0
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 cpu11
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 cpu14
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 cpu2
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 cpu25
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 cpu27
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 cpu30
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 cpu4
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 cpu7
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 cpufreq
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 cpuidle
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 hotplug
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 intel_pstate
-r--r--r-- 1 nobody nobody   1 May  4 11:18 isolated
-r--r--r-- 1 nobody nobody   5 May  4 11:18 kernel_max
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 microcode
-r--r--r-- 1 nobody nobody 714 May  4 11:18 modalias
-r--r--r-- 1 nobody nobody  33 May  4 11:18 nohz_full
-r--r--r-- 1 nobody nobody   1 May  4 11:18 offline
-r--r--r-- 1 nobody nobody   5 May  4 11:18 online
-r--r--r-- 1 nobody nobody   5 May  4 11:18 possible
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 power
-r--r--r-- 1 nobody nobody   5 May  4 11:18 present
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 smt
-rw-r--r-- 1 nobody nobody   0 May  4 11:18 uevent
drwxr-xr-x 2 nobody nobody   0 May  4 11:18 vulnerabilities

Right, so the output above is unfortunately correct :slight_smile:

We correctly report the list of CPUs you’re pinned on through the online file and we filter the CPU list to match that.

It does leave gaps, which can be seen as problematic by poorly written software. In practice you can also have gaps on physical hardware if you set a specific CPU socket or core offline.

Because /proc/cpuinfo is just a text file that’s primarily read by users, we re-order things in it through LXCFS, but for the rest we must keep it aligned with the kernel, as otherwise calls to sched_setaffinity() for pinning would fail.

You should be able to do umount /sys/devices/system/cpu as a workaround though.

fair enough, unmounting indeed works. I now have enough workarounds, I’ll stop bugging you :wink:

# java Cpus
Number of processors available to this JVM: 3
# umount /sys/devices/system/cpu
# java Cpus
Number of processors available to this JVM: 8

actually, the problem is not the gaps, it’s the range: a physical machine will never have CPU ids outside the range of the total number of CPUs, so this will only happen in a container of some sort. The software is, technically speaking, not poorly written, it is just not container-aware.

Ideally you would put 0-(n-1) in /sys/devices/system/cpu/online and then renumber the cpus in /sys/devices/system/cpu/ (and elsewhere) to be cpu0-cpu(n-1), if you want to mimic physical hardware as closely as possible, of course.

Maybe I’m missing something, but here is a physical system:

root@delmak:~# fgrep -c processor /proc/cpuinfo
44
root@delmak:~# cat /sys/devices/system/cpu/online 
0-10,12-44
root@delmak:~# 

How does that differ from your container?

my container has fewer cpus, i.e. fewer processor entries in /proc/cpuinfo and fewer entries in /sys/devices/system/cpu/, and because I’ve set limits.cpu lower, the various syscalls return fewer as well:

# cat sysconf.c
#include <unistd.h>
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
        /* sysconf() returns long, so use %ld; get_nprocs*() return int */
        printf("sysconf(_SC_NPROCESSORS_CONF): %ld\n", sysconf(_SC_NPROCESSORS_CONF));
        printf("sysconf(_SC_NPROCESSORS_ONLN): %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
        printf("get_nprocs_conf(): %d\n", get_nprocs_conf());
        printf("get_nprocs(): %d\n", get_nprocs());
        return 0;
}
# ./sysconf
sysconf(_SC_NPROCESSORS_CONF): 8
sysconf(_SC_NPROCESSORS_ONLN): 8
get_nprocs_conf(): 8
get_nprocs(): 8

so if some legacy software wants to be smart and shuffle thread/process affinities around itself, it will bomb like jvm8 does, as having a cpu core with an ID bigger than the total number of cores in the system is an impossible situation on a physical machine or a full VM.

Did you read my previous message?

root@delmak:~# fgrep -c processor /proc/cpuinfo
44
root@delmak:~# cat /sys/devices/system/cpu/online 
0-10,12-44
root@delmak:~#

This is a physical system, not a container.

It is absolutely possible on a real system to have an id higher than the total.

For example:

root@delmak:~# nproc
45
root@delmak:~# cat /sys/devices/system/cpu/online
0-44
root@delmak:~# for i in $(seq 0 20); do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
root@delmak:~# nproc
24
root@delmak:~# cat /sys/devices/system/cpu/online
21-44
root@delmak:~# 

Hehe, I read it, but got an off-by-one error when parsing it :wink: