candlerb
(Brian Candler)
October 4, 2018, 11:10am
1
This is a really weird one.
I’m inside an lxd container. Process with pid 486 does not exist, but “kill -0” returns as if it does.
root@ix-oxidized:~# ps auxwww | grep 486
root 31649 0.0 0.0 14620 968 ? S+ 11:02 0:00 grep --color=auto 486
root@ix-oxidized:~# kill -0 9999
bash: kill: (9999) - No such process
root@ix-oxidized:~# kill -0 486
root@ix-oxidized:~# echo $?
0
Checking with strace:
root@ix-oxidized:~# strace -f kill -0 486 2>&1 | grep kill
execve("/bin/kill", ["kill", "-0", "486"], [/* 12 vars */]) = 0
kill(486, SIG_0) = 0
root@ix-oxidized:~#
Why does this matter? Well, I’m trying to start an application, and there is a stale pidfile:
# cat /home/oxidized/.config/oxidized/pid
486
The way the application decides whether it’s safe to delete this pidfile is by doing a kill -0, which is a standard pattern.
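For readers who have not seen the pattern, here is a minimal sketch of a kill -0 pidfile check. It is not the application's actual code; the function name pidfile_is_stale is made up for illustration.

```shell
# Hypothetical helper: exit status 0 means "the pidfile looks stale".
pidfile_is_stale() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1        # no readable pidfile, nothing to clean up
    pid=$(cat "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 0 ;;         # empty or non-numeric content: treat as stale
    esac
    # kill -0 sends no signal; it only runs the existence and permission checks.
    # Caveats: it fails (so this reports "stale") if the process exists but
    # belongs to another user, and it succeeds if the ID is merely a thread ID
    # of some live process.
    ! kill -0 "$pid" 2>/dev/null
}
```

A caller would then do something like: pidfile_is_stale /home/oxidized/.config/oxidized/pid && rm that file.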
On the outer host, everything is fine:
root@ix-mon2:~# kill -0 486
bash: kill: (486) - No such process
… but then again, I have no idea what pid in the outer host would have mapped to pid 486 inside the container’s pid namespace.
The host is Ubuntu 16.04.5 with kernel 4.15.0-34-generic, lxd 3.0.1 and lxcfs 3.0.1.
Any clues for what I can look for? I’ll leave it in this broken state for as long as I can.
Thanks, Brian.
candlerb
(Brian Candler)
October 4, 2018, 11:24am
2
Now it gets really weird. If I do ls /proc there is no 486. But ls /proc/486 shows it exists!
root@ix-oxidized:~# ls /proc | grep 486
root@ix-oxidized:~# ls /proc/486
attr comm fd map_files net pagemap schedstat stat timerslack_ns
autogroup coredump_filter fdinfo maps ns patch_state sessionid statm uid_map
auxv cpuset gid_map mem numa_maps personality setgroups status wchan
cgroup cwd io mountinfo oom_adj projid_map smaps syscall
clear_refs environ limits mounts oom_score root smaps_rollup task
cmdline exe loginuid mountstats oom_score_adj sched stack timers
root@ix-oxidized:~# ls -l /proc/486/exe
lrwxrwxrwx 1 root root 0 Oct 4 11:15 /proc/486/exe -> /usr/lib/policykit-1/polkitd
root@ix-oxidized:~# cat /proc/486/cmdline
/usr/lib/policykit-1/polkitd--no-debug
So… at least I know what the proc is, maybe. What if I search for it by name?
root@ix-oxidized:~# ps auxwww | grep polkitd
root 480 0.0 0.0 277088 5988 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
root 31774 0.0 0.0 14620 1088 ? S+ 11:20 0:00 grep --color=auto polkitd
root@ix-oxidized:~# ls -l /proc/480/exe
lrwxrwxrwx 1 root root 0 Sep 29 12:13 /proc/480/exe -> /usr/lib/policykit-1/polkitd
root@ix-oxidized:~# ls -l /proc/*/exe | grep polkitd
lrwxrwxrwx 1 root root 0 Sep 29 12:13 /proc/480/exe -> /usr/lib/policykit-1/polkitd
So 486 is still not visible, but it might be a stale child of 480.
Unfortunately, when I look on the host, lots of containers are running polkitd:
root@ix-mon2:~# ps auxwww | grep polkitd
root 2864 0.0 0.0 277088 5628 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
root 10579 0.0 0.0 14228 972 pts/10 S+ 12:21 0:00 grep --color=auto polkitd
100000 10959 0.0 0.0 277088 5948 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
100000 14142 0.0 0.0 277088 5996 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
100000 18295 0.0 0.0 277088 5768 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
100000 20713 0.0 0.0 277088 5640 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
100000 23559 0.0 0.0 277088 5988 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
100000 24746 0.0 0.0 277088 5900 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
100000 25991 0.0 0.0 277088 5900 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
100000 27926 0.0 0.0 277088 5784 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
root 29526 0.0 0.0 277088 5664 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
Presumably one of those is pid 480 inside the affected container, but I don’t know if another one is the missing pid 486. I know it’s not a zombie (Z) anyway.
Your 486 might be a thread ID. Use the -L option of the ps command to list those (column LWP).
candlerb
(Brian Candler)
October 4, 2018, 11:43am
4
Aha, yes that’s it:
root@ix-oxidized:~# ps auxwww -L | grep 486
root 480 486 0.0 3 0.0 277088 5988 ? Ssl Sep29 0:00 /usr/lib/policykit-1/polkitd --no-debug
root 31951 31951 0.0 1 0.0 14620 936 ? S+ 11:40 0:00 grep --color=auto 486
Does that mean that using kill -0 to test for the existence of a process with a given pid is actually incorrect? What would be the right way to do it?
I guess grepping the output of ps (without -L) or checking the existence of /proc/PID should do the trick if you only want to consider “real” processes. I’m afraid I don’t know a less ugly solution.
Incidentally, I wasn’t aware of the kill -0 trick. Oddly enough, on my system man kill says that 0 is a “useful signal” but doesn’t say what it does, and man 7 signal doesn’t mention signal 0 at all. So that may be an undocumented feature with implementation-defined behaviour.
candlerb
(Brian Candler)
October 4, 2018, 2:12pm
6
Sadly, checking for /proc/PID doesn’t work unless you list the entire /proc directory and look for the pid there. If you stat /proc/<PID> for a thread ID, you find it exists, even though it’s not in the directory listing.
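A rough sketch of the directory-listing check, mirroring the ls /proc | grep test from my earlier post. The function name is made up, and it assumes a Linux /proc where thread IDs are left out of the listing, as observed above.

```shell
# Hypothetical helper: succeeds only if the ID appears as an entry in the
# /proc directory listing. A plain stat of /proc/<ID> is not enough,
# because that also succeeds for thread IDs.
pid_listed_in_proc() {
    # -F fixed string, -x whole-line match, -q quiet
    ls /proc | grep -Fxq -- "$1"
}
```

With the numbers from this thread, pid_listed_in_proc 480 would succeed and pid_listed_in_proc 486 would fail.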
As for documentation, try man 2 kill. On my system it says:
If sig is 0, then no signal is sent, but error checking is still performed; this can be used to check for the existence of a process ID or process group ID.
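One possible alternative that avoids listing all of /proc, offered as a sketch rather than a tested fix: on Linux, /proc/<ID>/status has a Tgid field naming the thread group leader, so it equals the ID only when the ID is a real process rather than one of its threads. The function name is_real_process is made up.

```shell
# Hypothetical helper: succeeds if the ID is a thread group leader, i.e. a
# process rather than a thread. Assumes a Linux /proc.
is_real_process() {
    id=$1
    status="/proc/$id/status"
    [ -r "$status" ] || return 1                  # no such process or thread
    # For thread 486 of process 480, Tgid would be 480 and Pid 486.
    tgid=$(awk '/^Tgid:/ { print $2 }' "$status" 2>/dev/null)
    [ "$tgid" = "$id" ]
}
```

Like any check of this kind it is racy: the process can exit, or the ID can be reused, between the check and whatever you do next.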