Why does stopping and restarting a container cause IP routing breakage?

I have a new issue. When I stop a container (for export, for example) and then start it again, it works fine internally but the IP no longer functions. I have to copy it so it gets a new IP, and then it works.

Anyone have any ideas?

Thanks

This shouldn’t happen, and indeed doesn’t happen in a local test I just tried.

Can you give some more info:

  • Which LXD version are you running
  • Which container distro/version are you running
  • The exact command sequence you run to produce this issue
  • The output of ip a and ip r inside the container and on the host before and after the issue occurs
  • The expanded config of the container using lxc config show <container> --expanded
  • The network configuration of any parent network used by the container using lxc network show <network>
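For reference, all of the above can be gathered with something like the following (the container and network names are placeholders, substitute your own):

lxc exec <container> -- ip a
lxc exec <container> -- ip r
ip a          # on the host
ip r          # on the host
lxc config show <container> --expanded
lxc network show <network>
snap list lxd          # reports the LXD version when installed via snap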

Thanks

I am using the latest one. I believe the problem started with version 4.0.
I am using 18.04 LTS, latest, and so are the containers.
I will post the rest of the info tomorrow when I get a chance. Thank you in advance for your time.

Sorry for the delay getting this data back to you, I have been busy.

Inside Container

 ip r
default via 240.18.0.1 dev eth0 proto dhcp src 240.18.0.52 metric 100 
240.0.0.0/8 dev eth0 proto kernel scope link src 240.18.0.52 
240.18.0.1 dev eth0 proto dhcp scope link src 240.18.0.52 metric 100 
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
99: eth0@if100: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:3c:35:da brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 240.18.0.52/8 brd 240.255.255.255 scope global dynamic eth0
       valid_lft 2751sec preferred_lft 2751sec
    inet6 fe80::216:3eff:fe3c:35da/64 scope link 
       valid_lft forever preferred_lft forever
lxc config show PHP-OAI-2020-MAY2 --expanded
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 18.04 LTS amd64 (release) (20190424)
  image.label: release
  image.os: ubuntu
  image.release: bionic
  image.serial: "20190424"
  image.version: "18.04"
  volatile.base_image: 5b72cf46f628b3d60f5d99af48633539b2916993c80fc5a2323d7d841f66afbe
  volatile.eth0.host_name: vethe40e1464
  volatile.eth0.hwaddr: 00:16:3e:3c:35:da
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxdfan0
    type: nic
  root:
    path: /
    pool: local
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

Outside

ip r
default via 84.17.40.62 dev enp1s0 proto static metric 100 
84.17.40.0/26 dev enp1s0 proto kernel scope link src 84.17.40.18 metric 100 
169.254.0.0/16 dev enp1s0 scope link metric 1000 
240.0.0.0/8 dev lxdfan0 proto kernel scope link src 240.18.0.1 
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:30:48:cb:f6:be brd ff:ff:ff:ff:ff:ff
    inet 84.17.40.18/26 brd 84.17.40.63 scope global noprefixroute enp1s0
       valid_lft forever preferred_lft forever
    inet6 fe80::18ae:83c8:3f8b:3230/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: enp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 00:30:48:cb:f6:bf brd ff:ff:ff:ff:ff:ff
4: lxdfan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 02:61:99:ed:f2:d8 brd ff:ff:ff:ff:ff:ff
    inet 240.18.0.1/8 scope global lxdfan0
       valid_lft forever preferred_lft forever
    inet6 fe80::f032:1dff:fe37:bce5/64 scope link 
       valid_lft forever preferred_lft forever
5: lxdfan0-mtu: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1450 qdisc noqueue master lxdfan0 state UNKNOWN group default qlen 1000
    link/ether f2:32:1d:37:bc:e5 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::f032:1dff:fe37:bce5/64 scope link 
       valid_lft forever preferred_lft forever
6: lxdfan0-fan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master lxdfan0 state UNKNOWN group default qlen 1000
    link/ether 82:38:2b:9b:26:c1 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::8038:2bff:fe9b:26c1/64 scope link 
       valid_lft forever preferred_lft forever
8: veth7805085f@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master lxdfan0 state UP group default qlen 1000
    link/ether 7e:65:83:d1:a1:d2 brd ff:ff:ff:ff:ff:ff link-netnsid 0
10: vethc0da3c8f@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master lxdfan0 state UP group default qlen 1000
    link/ether 06:5c:9c:aa:1e:ff brd ff:ff:ff:ff:ff:ff link-netnsid 1
12: vetha872ae08@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master lxdfan0 state UP group default qlen 1000

It goes on for a while, but I believe the most important parts are above.

 lxc cluster show Q1
roles: []
architecture: x86_64
server_name: Q1
url: https://84.17.40.18:8443
database: false
status: Online
message: fully operational

BTW the problem happens on all servers in the cluster. Nginx gives a 502 error when reaching the container, but a wget from within the container works fine.
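(To make the two checks concrete: assuming nginx runs on the host and proxies to the container’s fan IP, the failing and working cases look roughly like this; the address is the one from the output below.)

curl -I http://240.18.0.52/                                     # host -> container IP, the path nginx proxies over
lxc exec PHP-OAI-2020-MAY2 -- wget -qO- http://127.0.0.1/       # inside the container: the web server itself answers
lxc exec PHP-OAI-2020-MAY2 -- wget -qO- http://example.com/     # inside the container: outbound traffic also works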

Now I am shutting down the container, copying it and starting the new container, and it will be fine.
New Container

root@PHP-OAI-2020-MAY10:~# ip r
default via 240.18.0.1 dev eth0 proto dhcp src 240.18.0.49 metric 100 
240.0.0.0/8 dev eth0 proto kernel scope link src 240.18.0.49 
240.18.0.1 dev eth0 proto dhcp scope link src 240.18.0.49 metric 10
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
101: eth0@if102: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:75:1d:fb brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 240.18.0.49/8 brd 240.255.255.255 scope global dynamic eth0
       valid_lft 2708sec preferred_lft 2708sec
    inet6 fe80::216:3eff:fe75:1dfb/64 scope link 
       valid_lft forever preferred_lft forever
 lxc config show PHP-OAI-2020-MAY10
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 18.04 LTS amd64 (release) (20190424)
  image.label: release
  image.os: ubuntu
  image.release: bionic
  image.serial: "20190424"
  image.version: "18.04"
  volatile.base_image: 5b72cf46f628b3d60f5d99af48633539b2916993c80fc5a2323d7d841f66afbe
  volatile.eth0.host_name: veth7509004d
  volatile.eth0.hwaddr: 00:16:3e:75:1d:fb
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
devices: {}
ephemeral: false
profiles:
- default
stateful: false
description: ""

I don’t see anything here, hope you can. I wonder if it is an IP/DHCP type problem.
Your thoughts are welcomed in advance.

I’ve gone through your post and used the three backticks feature of markdown to make your post easier to digest. This is a useful feature for future posts containing blocks of config text or command output.

So the containers still have IPs and routing, suggesting DHCP is working fine.

I’m not clear at this stage what the problem is that you’re describing.

Please can you give the step-by-step commands you run to recreate the problem, as well as the commands you run to confirm the problem is happening (i.e. ping, wget, curl to nginx, etc.).
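For example, a minimal sequence of roughly this shape (names and addresses are placeholders) is the kind of thing that would help:

lxc list <container>                  # note the IP while everything works
curl -I http://<container-ip>/        # from the nginx host: works
lxc stop <container>
lxc start <container>
lxc list <container>                  # IP after the restart
curl -I http://<container-ip>/        # from the nginx host: fails?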

Thanks

One thing I would say is that the size of your fan subnet inside the container looks too large.

Looking at @stgraber’s post here:

The NIC interface inside the container should have a subnet size of /16, but yours has a subnet size of /8.
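(For context: with a 240.0.0.0/8 overlay on a /24 underlay, the fan maps each host’s last underlay octet into the second overlay octet, so 84.17.40.18 becomes the 240.18.0.0/16 subnet and 84.17.40.19 becomes 240.19.0.0/16. A quick way to check what prefix length the container actually received, with the container name as a placeholder:)

lxc exec <container> -- ip -4 addr show dev eth0      # expect a /16 here, not /8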

Can you show the output of lxc network show lxdfan0 please.

lxc network show lxdfan0
config:
  bridge.mode: fan
  fan.underlay_subnet: 84.17.40.0/24
description: ""
name: lxdfan0
type: bridge
used_by:
- /1.0/instances/AI-GENIE-2020-mar6
- /1.0/instances/AI-GENIE-2020-mar6-bk
- /1.0/instances/CHAT
- /1.0/instances/DIASPORA-2020-mar6
- /1.0/instances/DIASPORA-2020-mar6-bk
- /1.0/instances/ELMONSTRUO-PHP-2020-mar6
- /1.0/instances/EMPODERATE-2020-mar6
- /1.0/instances/ENLACES24-ADSERVER-REVIVE-2020-APR27
- /1.0/instances/ENLACES24-ADSERVER-REVIVE-2020-APR30
- /1.0/instances/ENLACES24-ADSERVER-REVIVE-2020-mar6
- /1.0/instances/ENLACES24-ADSERVER-REVIVE-2020-mar6-bk
- /1.0/instances/ENLACES24-ANALY-2020-2020-APR27
(MANY MORE in between, about a third of these RUNNING)
- /1.0/instances/WP-WARHAPPENS-2019-09-3-NEW-2020-mar6-bk
- /1.0/instances/WP-WARHAPPENS-2020-APR26
- /1.0/instances/WP-WARHAPPENS-2020-APR26-bk
- /1.0/instances/WP-WARHAPPENS-2020-APR29
- /1.0/instances/WW-CCNESNOTICIAS-2020-mar28
- /1.0/instances/lxdMosaic2020B2
managed: true
status: Created
locations:
- Q1
- Q3
- Q2
- Q4

Some clarifications.

All the configurations are default. All four cluster servers were installed using exactly the defaults in lxd init.
The problem happens on all servers.
I have always had problems when I turn off all four servers; LXD has a hard time starting correctly, and there are countless threads of mine on this in the forum. This problem has gotten better with version 4+ but it is not solved.
Yesterday I ended up having to reboot 2 of the 4 machines and run my LXD reset script a couple of times on all servers to get it working again. Script below.
Initially only one LXD was down, but because it was stuck LXD didn’t promote any of the other servers to a database server.
Perhaps this had to do with an LXD refresh, but the end result is that it died mysteriously.
Ok, so back to the original problem.
On some servers, containers that had been running started correctly and worked fine.
On server Q1, none started; I had to start them manually, and the IP changed. I had to update nginx with the new IP and it worked fine after that. On Q2, no IP was showing up on the containers, but a reboot fixed that.
On servers Q3 and Q4, I started the containers manually but the IP had changed by 1. On these I had to copy the container to a new one on the same server, start it, and put the new IP in nginx, and then it worked fine.
Within the broken container, I can still do a wget, apt update, apt upgrade, etc. It reaches outside just fine. I could also do wget 127.0.0.1 to check the webserver within the container. But a wget to the container IP would give a 500 error.
I have tried many things within the container, including upgrades, reboot, etc.
The only way to make it work is to copy it, and then it works fine with the new IP.

Very strange… this problem only started recently and I feel it must have something to do with version 4.

Is there an easy way to force a container to use a static IP? It is almost like access to the container is being blocked. BTW, I also tried all this with the firewall disabled.
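(In case it helps, a hedged sketch of how a static address can usually be pinned on an LXD-managed bridge: override the profile NIC and set ipv4.address so dnsmasq always hands out the same lease. The container name and address below are placeholders.)

lxc stop <container>
lxc config device override <container> eth0 ipv4.address=240.18.0.52
lxc start <container>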

Thanks in Advance

My LXD jumpstart script
#!/bin/bash
# Restart host networking, then bounce the LXD snap socket and daemon.
echo 'Resetting Networking'
systemctl restart systemd-networkd
sleep 10
echo 'Stopping Socket'
systemctl stop snap.lxd.daemon.unix.socket
sleep 10
echo 'Starting Unix Socket'
systemctl start snap.lxd.daemon.unix.socket
sleep 10
echo 'Stopping lxd'
systemctl stop snap.lxd.daemon
echo 'Starting lxd'
systemctl restart snap.lxd.daemon
sleep 10
# Confirm the cluster members are back online
lxc cluster list

OK thanks for that. I’m still having a hard time understanding the steps you take to reproduce this issue.

What I’m really looking for is a step-by-step list of the commands you run to reproduce this issue for a single container, along with example commands that work before the restart and don’t afterwards. This will help me try to reproduce it and understand what exactly isn’t working, and the tests you are running to reach that conclusion.

Just to ensure it’s nothing in the default config, and to give you an idea of what I’m looking for, I spun up a 3 node cluster this morning using 3 VMs, with 1 container on each host using the fan network.

So I have three nodes in the cluster: v1, v2, v3, with a container running on each: c1, c2, c3.
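(For reference, the test containers were launched onto specific cluster members with something along these lines; the image alias is an assumption:)

lxc launch ubuntu:18.04 c1 --target v1
lxc launch ubuntu:18.04 c2 --target v2
lxc launch ubuntu:18.04 c3 --target v3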

lxc ls
+------+---------+----------------------+------+-----------+-----------+----------+
| NAME |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+------+---------+----------------------+------+-----------+-----------+----------+
| c1   | RUNNING | 240.161.0.107 (eth0) |      | CONTAINER | 0         | v1       |
+------+---------+----------------------+------+-----------+-----------+----------+
| c2   | RUNNING | 240.163.0.126 (eth0) |      | CONTAINER | 0         | v2       |
+------+---------+----------------------+------+-----------+-----------+----------+
| c3   | RUNNING | 240.214.0.209 (eth0) |      | CONTAINER | 0         | v3       |
+------+---------+----------------------+------+-----------+-----------+----------+

They are connected to the fan network like yours:

lxc network show lxdfan0
config:
  bridge.mode: fan
  fan.underlay_subnet: 10.238.31.0/24
description: ""
name: lxdfan0
type: bridge
used_by:
- /1.0/instances/c1
- /1.0/instances/c2
- /1.0/instances/c3
managed: true
status: Created
locations:
- v1
- v2
- v3

Now for the tests:

First, let’s check comms are working OK between c1 and the other containers:

lxc exec c1 -- ping -c 3 240.163.0.126
PING 240.163.0.126 (240.163.0.126) 56(84) bytes of data.
64 bytes from 240.163.0.126: icmp_seq=1 ttl=64 time=0.278 ms
64 bytes from 240.163.0.126: icmp_seq=2 ttl=64 time=0.325 ms
64 bytes from 240.163.0.126: icmp_seq=3 ttl=64 time=0.617 ms

lxc exec c1 -- ping -c 3 240.214.0.209
PING 240.214.0.209 (240.214.0.209) 56(84) bytes of data.
64 bytes from 240.214.0.209: icmp_seq=1 ttl=64 time=0.743 ms
64 bytes from 240.214.0.209: icmp_seq=2 ttl=64 time=0.507 ms
64 bytes from 240.214.0.209: icmp_seq=3 ttl=64 time=0.442 ms

OK, so now let’s restart the c1 container and check if it still works:

lxc restart c1

Test again:

lxc exec c1 -- ping -c 3 240.161.0.107
PING 240.161.0.107 (240.161.0.107) 56(84) bytes of data.
64 bytes from 240.161.0.107: icmp_seq=1 ttl=64 time=0.015 ms
64 bytes from 240.161.0.107: icmp_seq=2 ttl=64 time=0.039 ms
64 bytes from 240.161.0.107: icmp_seq=3 ttl=64 time=0.050 ms

lxc exec c1 -- ping -c 3 240.214.0.209
PING 240.214.0.209 (240.214.0.209) 56(84) bytes of data.
64 bytes from 240.214.0.209: icmp_seq=1 ttl=64 time=0.503 ms
64 bytes from 240.214.0.209: icmp_seq=2 ttl=64 time=0.460 ms
64 bytes from 240.214.0.209: icmp_seq=3 ttl=64 time=0.587 ms

So restarting the c1 container doesn’t appear to have broken routing between the fan nodes, at least in this basic test.
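(A closer match to your nginx scenario would be an HTTP check across the fan after the restart; a sketch, assuming a test web server is listening on port 80 inside c1:)

lxc exec c2 -- curl -sI http://240.161.0.107/      # container -> container across the fan
curl -sI http://240.161.0.107/                     # host -> container, the path nginx would use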

Proper testing takes time as you know… so I will be doing it tomorrow, hopefully.
Meanwhile take a look at this…

Q1  ip r
default via 84.17.40.62 dev enp1s0 proto static metric 100 
84.17.40.0/26 dev enp1s0 proto kernel scope link src 84.17.40.18 metric 100 
169.254.0.0/16 dev enp1s0 scope link metric 1000 
240.0.0.0/8 dev lxdfan0 proto kernel scope link src 240.18.0.1 
Q2 ip r 
default via 84.17.40.62 dev enp1s0 proto static metric 100 
84.17.40.0/26 dev enp1s0 proto kernel scope link src 84.17.40.19 metric 100 
169.254.0.0/16 dev enp1s0 scope link metric 1000 
240.0.0.0/8 dev lxdfan0 proto kernel scope link src 240.19.0.1 

The first line is my server and the last line is lxdfan… what is the third line? I am not familiar with this 169.254.0.0 IP range.

169.254.0.0/16 is the IPv4 link-local range; it is normal, and is used by IPv4 for automatic addressing when no other IPs are available.
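(If you want to confirm where that route goes, you can ask the kernel directly; it will show the link-local range being handled on enp1s0 with no gateway involved:)

ip route get 169.254.1.1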