Host processes stuck in D state

Our hosts run Ubuntu 16.04.5 LTS with kernel 4.4.0-138-generic.

We very often see systemd-logind on the host stuck in D state, and when that happens all the services in the containers hosted on that node are affected. There is nothing obvious in the logs, but we noticed that this has been happening quite often since we integrated the auth services with AD via IPA. In some cases the hosts themselves are fine but the containers (especially CentOS ones) are affected, and the process in D state is always systemd-logind.

When containers are affected, we are forced to reboot the host and all of its containers to get back to a normal state. Rebooting only the affected containers does not help: when we try to reboot an affected container, it ends up in Error state.

Could you please help with this?

Anything relevant in dmesg?

Indefinite uninterruptible D state is usually a sign of a kernel issue.
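
If the kernel's hung task detector is enabled, it logs a line of the form "INFO: task NAME:PID blocked for more than N seconds." followed by a stack trace, so that is the first thing to look for in dmesg. A rough filter, assuming dmesg is readable by your user (hung_task_lines is just a helper name made up for this sketch):

```shell
# Keep only hung task detector reports from kernel log text on stdin.
# "|| true" keeps the function from failing when nothing matches.
hung_task_lines() {
    grep -E 'blocked for more than [0-9]+ seconds' || true
}

# dmesg may need root, depending on kernel.dmesg_restrict.
dmesg 2>/dev/null | hung_task_lines
```

No output here only means the detector did not report anything; it does not rule out a stuck task.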

Thanks for the reply. This is what I see in dmesg:

[    2.791940] megaraid_sas 0000:02:00.0: Init cmd success
[    2.844055] megaraid_sas 0000:02:00.0: INIT adapter done
[   44.165069] kvm [11447]: vcpu0 unhandled rdmsr: 0x34
[   44.165171] kvm [11447]: vcpu0 unhandled rdmsr: 0x606
[   47.926645] kvm [11447]: vcpu0 unhandled rdmsr: 0x611
[   47.926739] kvm [11447]: vcpu0 unhandled rdmsr: 0x639
[   47.926823] kvm [11447]: vcpu0 unhandled rdmsr: 0x641
[   47.926907] kvm [11447]: vcpu0 unhandled rdmsr: 0x619
[   47.971932] kvm [11447]: vcpu0 unhandled rdmsr: 0x611
[   47.972042] kvm [11447]: vcpu0 unhandled rdmsr: 0x639
[   47.972133] kvm [11447]: vcpu0 unhandled rdmsr: 0x641
[   47.972289] kvm [11447]: vcpu0 unhandled rdmsr: 0x619
[  185.736449] CIFS VFS: Malformed UNC in devname.
[  437.099541] audit: type=1400 audit(1543816816.983:305): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-jenkins-master00_</var/lib/lxd>" name="/" pid=105151 comm="(resolved)" flags="rw, rslave"
[  529.245544] perf interrupt took too long (2538 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[ 1681.854209] audit: type=1400 audit(1543818061.735:310): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-jenkins-master00_</var/lib/lxd>" name="/" pid=131328 comm="(resolved)" flags="rw, rslave"
[ 1921.967811] audit: type=1400 audit(1543818301.870:311): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-nat01_</var/lib/lxd>" name="/" pid=166114 comm="(nntrackd)" flags="rw, rslave"
[ 3696.203510] audit: type=1400 audit(1543820076.184:319): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-jenkins-master00_</var/lib/lxd>" name="/" pid=73367 comm="(resolved)" flags="rw, rslave"
[ 3785.891632] audit: type=1400 audit(1543820165.876:320): apparmor="DENIED" operation="capable" namespace="root//lxd-icinga2-master00_<var-lib-lxd>" profile="/usr/sbin/ntpd" pid=45466 comm="ntpd" capability=1  capname="dac_override"
[ 4299.855414] perf interrupt took too long (5139 > 5000), lowering kernel.perf_event_max_sample_rate to 25000

Ok, that doesn’t look too bad.

What does cat /proc/PID/stack show, replacing PID with the PID of any of the processes stuck in D state?
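
If several processes are stuck, something along these lines would dump the stack for all of them in one go. This is only a sketch: stat_state is a helper name made up here, it takes the state letter from the field after the command name in /proc/PID/stat, and reading /proc/PID/stack normally requires root.

```shell
#!/bin/sh
# Print the state letter from the contents of /proc/PID/stat on stdin.
# The command name sits in parentheses and may contain spaces, so strip
# everything through the last ") " before taking the first field.
stat_state() {
    sed -e 's/^.*) //' | cut -d' ' -f1
}

# Walk every numeric /proc entry and dump the kernel stack of tasks in state D.
for dir in /proc/[0-9]*; do
    [ -r "$dir/stat" ] || continue            # process already gone, or no /proc
    state=$(stat_state < "$dir/stat")
    [ "$state" = "D" ] || continue
    pid=${dir#/proc/}
    echo "== PID $pid ($(cat "$dir/comm" 2>/dev/null)) =="
    cat "$dir/stack" 2>/dev/null || echo "(could not read stack; try as root)"
done
```

Run it on the host as root. Container processes show up in the host's /proc too, under their host PIDs.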