Package installation failing with cloud-init on CentOS 7

lhprojects · March 19, 2021, 2:53pm

Right now I’m trying to determine what is at fault for package installation failing at boot time with cloud-init in a CentOS 7 container: is it cloud-init, LXC, or CentOS 7 itself? The issue is that yum fails to resolve hosts, although when I run lxc exec yum install openssh-server mere seconds later, it completes successfully. I’ll run a test to see whether this is reproducible on CentOS 7 VMs as well.

I’ve just run a few tests in an LXC container:

LXCN=centos7-dev
10:43:30 root@linux-iqu9 ~ lxc launch local:centos7 centos7-dev --profile default --verbose --console < /tmp/rh_config.yml
10:43:47 root@linux-iqu9 ~ lxc exec $LXCN ip a 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:d0:ef:57 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.103.13/24 brd 192.168.103.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fed0:ef57/64 scope link 
       valid_lft forever preferred_lft forever
10:43:55 root@linux-iqu9 ~ lxc exec $LXCN ip r
default via 192.168.103.1 dev eth0 
169.254.0.0/16 dev eth0 scope link metric 1014 
192.168.103.0/24 dev eth0 proto kernel scope link src 192.168.103.13
10:43:57 root@linux-iqu9 ~ lxc exec $LXCN cat /etc/resolv.conf

192.168.99.1
10:44:01 root@linux-iqu9 ~ lxc exec $LXCN curl -v google.com
curl: (6) Could not resolve host: google.com; Unknown error
10:44:10 root@linux-iqu9 ~ lxc exec $LXCN curl -v google.com
curl: (6) Could not resolve host: google.com; Unknown error
10:44:18 root@linux-iqu9 ~ lxc exec $LXCN curl -v google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

Looking at the timestamps, network connectivity seems to be missing for about 20 seconds, although the network configuration is set up correctly. What might be the cause of the delay?
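
To put a number on the delay instead of reading it off prompt timestamps, a probe can be retried once per second until it first succeeds. This is only a sketch: the function name seconds_until_ok is made up for illustration, and using getent hosts as the probe inside the container is an assumption, not something taken from the transcript above.

```shell
# Run a probe command once per second until it succeeds, then print how many
# attempts failed before the first success (roughly the delay in seconds).
# Usage: seconds_until_ok MAX_TRIES COMMAND [ARGS...]
# (seconds_until_ok is a hypothetical helper name.)
seconds_until_ok() {
    _max="$1"
    shift
    _n=0
    while [ "$_n" -lt "$_max" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "$_n"
            return 0
        fi
        _n=$((_n + 1))
        sleep 1
    done
    echo "gave up after $_max attempts" >&2
    return 1
}

# Example (not run here): probe DNS resolution inside the container.
# seconds_until_ok 60 lxc exec centos7-dev -- getent hosts google.com
```

Running the example right after lxc launch returns should give a figure that can be compared against the bridge settings discussed later in the thread.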

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 70:85:c2:80:ce:a7 brd ff:ff:ff:ff:ff:ff
3: enp5s0.2@enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 70:85:c2:80:ce:a7 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.201/24 brd 192.168.100.255 scope global dynamic noprefixroute enp5s0.2
       valid_lft 5878sec preferred_lft 5878sec
    inet6 fe80::cee2:f43:148a:bbb2/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
4: enp5s0.5@enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master virbr1 state UP group default qlen 1000
    link/ether 70:85:c2:80:ce:a7 brd ff:ff:ff:ff:ff:ff
5: vpnbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:09:12:69:03:66 brd ff:ff:ff:ff:ff:ff
    inet 172.16.2.4/24 brd 172.16.2.255 scope global noprefixroute vpnbridge
       valid_lft forever preferred_lft forever
    inet6 fe80::e2a9:78:dd23:4542/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
6: virbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ea:10:c1:c2:60:68 brd ff:ff:ff:ff:ff:ff
    inet 192.168.103.201/24 brd 192.168.103.255 scope global dynamic noprefixroute virbr1
       valid_lft 5908sec preferred_lft 5908sec
    inet6 fe80::ffd8:67c4:13bc:b842/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
7: enp5s0.1024@enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vpnbridge state UP group default qlen 1000
    link/ether 70:85:c2:80:ce:a7 brd ff:ff:ff:ff:ff:ff
21: vethb4d7efd0@if20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master virbr1 state UP group default qlen 1000
    link/ether fe:6c:f4:2c:c9:88 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Linux linux-iqu9 5.12.0-rc3-1-default #1 SMP Thu Mar 18 16:11:17 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

lhprojects · March 23, 2021, 1:31pm

I will redirect this question to the LXC GitHub.

lhprojects · March 23, 2021, 1:41pm

It’s likely I won’t get an answer; if I don’t, that’s fine. It only affects the tooling that I use.

stgraber · March 23, 2021, 2:18pm

This isn’t really an LXD issue (I closed it), nor is it even a container issue, so we may not be the best people to answer this.

I’d probably recommend looking at the timestamps in journalctl -b 0 to see what happened around that time.
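
As a rough sketch of that suggestion, the journal for the current boot can be narrowed to the window in which resolution failed. The helper name boot_journal is invented here, and the time window in the example is an assumption based on the prompt timestamps earlier in the thread; the container's clock or timezone may differ from the host's, so the window may need adjusting.

```shell
# Show a container's journal for the current boot with precise timestamps,
# limited to a time window.
# Usage: boot_journal CONTAINER SINCE UNTIL
# (boot_journal is a hypothetical helper name.)
boot_journal() {
    lxc exec "$1" -- journalctl -b 0 -o short-precise --no-pager \
        --since "$2" --until "$3"
}

# Example (not run here):
# boot_journal centos7-dev "10:43:30" "10:44:30"
```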

All that LXD cares about is spawning the init process in the container, once that’s done it considers it started. When you get functional network connectivity isn’t something that LXD gets notified about so there isn’t much we can do on our side.

There are both systemd and cloud-init commands available to block until they’re done starting up, using those should avoid this kind of issue.
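
A minimal sketch of the cloud-init side of this, assuming the image ships a cloud-init version that has the status subcommand with --wait (older CentOS 7 images may not). The helper name wait_for_cloud_init is made up. The systemd counterpart, systemctl is-system-running --wait, needs a newer systemd than CentOS 7 provides as far as I know, so it is only mentioned in a comment.

```shell
# Block until cloud-init inside the container reports that it has finished.
# Usage: wait_for_cloud_init CONTAINER
# (wait_for_cloud_init is a hypothetical helper name.)
wait_for_cloud_init() {
    lxc exec "$1" -- cloud-init status --wait
}

# On distributions with a recent systemd, a similar wait would be:
#   lxc exec CONTAINER -- systemctl is-system-running --wait
# (assumed unavailable on CentOS 7's systemd).

# Example (not run here):
# wait_for_cloud_init centos7-dev && lxc exec centos7-dev -- curl -v google.com
```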

lhprojects · March 23, 2021, 4:20pm

Thank you. If I understand correctly, the 20-second delay is not an LXC issue but a kernel issue? In that case I can take it up with the kernel folks. I know you say it isn’t an LXD issue, but this isn’t the first time I’ve come across this problem with LXC, where there is no network connectivity for a period of time. I didn’t bother investigating it before, because I set the network config.

lhprojects · March 23, 2021, 4:54pm

I opened a ticket at https://bugzilla.kernel.org/show_bug.cgi?id=212409 and will open one with the openSUSE team later.

lhprojects · March 23, 2021, 5:18pm

I spoke with someone who knows a bit more about the kernel than I do; they said it is an STP issue:

Disable STP
Set /sys/class/net/virbr1/bridge/forward_delay to 0 (virbr1 being my bridge)
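
A hedged sketch of how that could be checked and applied on the host. The helper names bridge_stp_report and disable_bridge_stp are invented for illustration. The second function uses the ip link bridge options stp_state and forward_delay rather than writing to sysfs directly, and turns STP off first because the kernel may reject a forward delay of 0 while STP is still enabled. If the bridge is managed by another tool (NetworkManager, libvirt), that tool may overwrite runtime changes.

```shell
# Print a bridge's STP state and forward delay as exposed in sysfs.
# Usage: bridge_stp_report BRIDGE [SYSFS_NET_ROOT]
# (bridge_stp_report is a hypothetical helper name.)
bridge_stp_report() {
    _dir="${2:-/sys/class/net}/$1/bridge"
    printf 'stp_state=%s forward_delay=%s\n' \
        "$(cat "$_dir/stp_state")" "$(cat "$_dir/forward_delay")"
}

# Turn STP off, then set the forward delay to 0 (needs root).
# Usage: disable_bridge_stp BRIDGE
# (disable_bridge_stp is a hypothetical helper name.)
disable_bridge_stp() {
    ip link set dev "$1" type bridge stp_state 0 &&
        ip link set dev "$1" type bridge forward_delay 0
}

# Example (not run here):
# bridge_stp_report virbr1
# disable_bridge_stp virbr1
```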

stgraber · March 23, 2021, 8:17pm

Since you’re bridging out of the system, it can be any number of things. STP could be an issue, but the host bridge MAC address changing is another common one.

Linux bridges take the lowest MAC address of all their members unless they were directly configured with a static MAC. If a lower MAC gets added to the bridge, the bridge will change MAC, which then will trigger STP causing further delays but will also require new ARP queries for the traffic to hit the host properly.
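
To see which address the bridge would pick under that rule, the MACs of its member ports can be compared. This is a sketch only: lowest_member_mac is a made-up helper name, and it assumes the usual sysfs layout where /sys/class/net/BRIDGE/brif lists member ports and /sys/class/net/PORT/address holds each port's MAC.

```shell
# Print the lowest MAC address among a bridge's member ports.
# Usage: lowest_member_mac BRIDGE [SYSFS_NET_ROOT]
# (lowest_member_mac is a hypothetical helper name.)
lowest_member_mac() {
    _root="${2:-/sys/class/net}"
    for _port in "$_root/$1/brif"/*; do
        # Skip the unexpanded glob when the bridge has no members.
        [ -e "$_port" ] || continue
        cat "$_root/$(basename "$_port")/address"
    done | tr 'A-F' 'a-f' | LC_ALL=C sort | head -n 1
}

# Example (not run here):
# lowest_member_mac virbr1
# cat /sys/class/net/virbr1/address
```

If the two values in the example change whenever a container's veth device joins the bridge, pinning a static MAC on the bridge (for instance with ip link set dev virbr1 address followed by a chosen address) would be one way to test whether the MAC change is behind the delay.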