No internet access from inside an instance

Alexandr · September 5, 2022, 8:12am

Hi

I have two instances; their configs are below:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20220824)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20220824"
  image.type: squashfs
  image.version: "20.04"
  limits.cpu: "8"
  limits.memory: 24GB
  user.network-config: |
    version: 2
    ethernets:
        eth0:
            addresses:
            - 100.100.100.7/32
            nameservers:
                addresses:
                - 8.8.8.8
                search: []
            routes:
            -   to: 0.0.0.0/0
                via: 169.254.0.1
                on-link: true
  volatile....
  ............
devices:
  eth0:
    ipv4.address: 100.100.100.7
    limits.max: 100Mbit
    nictype: routed
    parent: eno1
    type: nic
  root:
    path: /
    pool: default
    size: 30GB
    type: disk
ephemeral: false
profiles:
- default
- routed_100.100.100.7
stateful: false
description: ""

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20220824)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20220824"
  image.type: squashfs
  image.version: "20.04"
  limits.cpu: "8"
  limits.memory: 24GB
  user.network-config: |
    version: 2
    ethernets:
        eth0:
            addresses:
            - 100.100.100.11/32
            nameservers:
                addresses:
                - 8.8.8.8
                search: []
            routes:
            -   to: 0.0.0.0/0
                via: 169.254.0.1
                on-link: true
  volatile....
  ............
devices:
  eth0:
    ipv4.address: 100.100.100.11
    limits.max: 100Mbit
    nictype: routed
    parent: eno1
    type: nic
  root:
    path: /
    pool: default
    size: 30GB
    type: disk
ephemeral: false
profiles:
- default
- routed_100.100.100.11
stateful: false
description: ""

Both instances were created by the same script, run on the same host, and have public IPs; the only difference is the last octet (7 vs. 11).

The problem is that I do not have internet access from the instance with IP …7: ping google.com fails with "Temporary failure in name resolution".
Instance …11 is OK.

Instance …11 is newly created.
Instance …7 was previously created and deleted on another host, on the same network (and behind the same router).

I also created …8 and …12, where …12 is newly created and …8 has the same history as …7.
The result is the same: …12 is OK, …8 has no access.

Is there some kind of cache behind this, or is it something else?

Would be very grateful for any help.

rkelleyrtp · September 5, 2022, 10:41am

Curious - why are your containers using a “/32” netmask? Also, why is your default route through 169.254.0.1? Seems like you may have a typo in your profile. Compare this output with a known working container. Pay attention specifically to the netmask value.

tomp · September 5, 2022, 10:48am

They are using the routed NIC type, for which this configuration is normal. See Instance configuration - LXD documentation

tomp · September 5, 2022, 10:49am

Can you ping 169.254.0.1 and 8.8.8.8 from the problem containers?

rkelleyrtp · September 5, 2022, 10:50am

Thanks Tom. I learned something new today!

Alexandr · September 5, 2022, 10:56am

Both pings OK

tomp · September 5, 2022, 3:30pm

So it's just a DNS issue by the sounds of it.
What does resolvectl status eth0 show inside the container?

Alexandr · September 5, 2022, 4:04pm

Bad container

root@hvvv7:~# resolvectl status eth0
Link 27 (eth0)
      Current Scopes: none
DefaultRoute setting: no
       LLMNR setting: yes
MulticastDNS setting: no
  DNSOverTLS setting: no
      DNSSEC setting: no
    DNSSEC supported: no

Good container

root@h802459511:~# resolvectl status eth0
Link 31 (eth0)
      Current Scopes: DNS
DefaultRoute setting: yes
       LLMNR setting: yes
MulticastDNS setting: no
  DNSOverTLS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
  Current DNS Server: 8.8.8.8
         DNS Servers: 127.0.0.1
                      8.8.8.8

Ahhhhha
How can this be if I created both containers using the same script?

And how can I check for or fix this in the script?
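One possible check, sketched under the assumption that the two resolvectl outputs above are representative: a script can look at the "Current Scopes" line and treat a link without the DNS scope as misconfigured. The helper name dns_scope_ok is hypothetical.

```shell
#!/bin/sh
# dns_scope_ok (hypothetical helper): reads the output of
# `resolvectl status <iface>` on stdin and exits 0 only if the
# "Current Scopes" line lists DNS, as in the good container above.
dns_scope_ok() {
    awk -F': *' '
        /Current Scopes:/ { if ($2 ~ /(^| )DNS( |$)/) found = 1 }
        END { exit !found }
    '
}

# Usage on the host (instance name h7 is only an example):
#   lxc exec h7 -- resolvectl status eth0 | dns_scope_ok \
#       || echo "h7: eth0 has no DNS scope" >&2
```

This only detects the symptom; it does not explain why the network config was not applied.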

tomp · September 5, 2022, 4:18pm

I wonder if cloud-init is failing in the problem container or something else is interfering with it?

Alexandr · September 5, 2022, 4:52pm

Wow!

I see /etc/netplan/50-cloud-init.yaml in the good container:

network:
    ethernets:
        eth0:
            addresses: ['100.100.100.11/32']
            nameservers:
                addresses: [127.0.0.1, 8.8.8.8]
            routes:
              - on-link: true
                to: 0.0.0.0/0
                via: 169.254.0.1
    version: 2

And in the bad container, the /etc/netplan/ directory is completely empty!

How can this be if I created both containers with the same script?
Is something out of sync?

And how do I fix it?

tomp · September 6, 2022, 7:55am

Are they running the same image?

Also I understand that cloud-init only uses config from LXD on first boot, so if the container had booted previously it wouldn’t be applied.
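Because the config is only applied on first boot, a creation script can wait for cloud-init to finish and then confirm that the expected netplan file exists. A minimal sketch: the helper name has_cloud_init_netplan is hypothetical, the file name 50-cloud-init.yaml is the one seen in the good container, and it assumes the image ships cloud-init.

```shell
#!/bin/sh
# has_cloud_init_netplan (hypothetical helper): reads a directory
# listing (one file name per line) on stdin and exits 0 only if the
# netplan file written by cloud-init is present.
has_cloud_init_netplan() {
    grep -q '^50-cloud-init\.yaml$'
}

# Usage on the host (instance name h7 is only an example):
#   # Block until cloud-init has finished its first-boot run.
#   lxc exec h7 -- cloud-init status --wait
#   # Then verify that it actually wrote the network config.
#   lxc exec h7 -- ls /etc/netplan | has_cloud_init_netplan \
#       || echo "h7: cloud-init did not write a netplan file" >&2
```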

Alexandr · September 6, 2022, 9:28am

All containers are created with the same command:

lxc launch ubuntu:20.04 $CONTAINER_NAME --profile default --profile $PROFILE_NAME </dev/null

But I need to delete some containers from time to time, with the ability to re-create them in the future.
And now I see that a recreated container always has an empty /etc/netplan.
This is a big problem for me.

tomp · September 6, 2022, 10:34am

Out of interest, does it also occur when using images:ubuntu/20.04 rather than ubuntu:20.04?

Alexandr · September 6, 2022, 10:57am

Now /etc/netplan has something in it, but I think it’s not what I need:

lxc launch images:ubuntu/20.04 h7 --profile default --profile routed_7 </dev/null

root@h7:~# dir /etc/netplan
10-lxc.yaml
root@h7:~# cat /etc/netplan/10-lxc.yaml
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
      dhcp-identifier: mac

tomp · September 6, 2022, 11:12am

OK, try using the new cloud-init keys; see cloud-init - LXD documentation

The LXD/cloud-init keys have recently changed, and different cloud-init versions may take different config options from LXD.
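As an illustration of what switching keys could look like, the sketch below rewrites the older user.* key names to their cloud-init.* counterparts in a dumped profile. It assumes an LXD release that supports cloud-init.network-config, cloud-init.user-data and cloud-init.vendor-data; the linked documentation is the authority on which keys a given version accepts. The helper name rename_cloud_init_keys is hypothetical.

```shell
#!/bin/sh
# rename_cloud_init_keys (hypothetical helper): reads profile or
# instance YAML on stdin and renames the old user.* cloud-init keys
# to the cloud-init.* names, leaving every other line untouched.
rename_cloud_init_keys() {
    sed -E 's/^([[:space:]]*)user\.(network-config|user-data|vendor-data):/\1cloud-init.\2:/'
}

# Usage on the host (profile name routed_7 is taken from this thread):
#   lxc profile show routed_7 > routed_7.yaml
#   rename_cloud_init_keys < routed_7.yaml > routed_7.new.yaml
#   # Review routed_7.new.yaml by hand, then apply it:
#   lxc profile edit routed_7 < routed_7.new.yaml
```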

Alexandr · September 6, 2022, 11:40am

Now it has become quite interesting.

I deleted the instance from my previous post and recreated it the way I did before.

Voilà

root@h7:~# dir /etc/netplan
10-lxc.yaml
root@h7:~# cat /etc/netplan/10-lxc.yaml
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
      dhcp-identifier: mac
root@h7:~# exit
logout
root@host:~# lxc stop h7
root@host:~# lxc delete h7
root@host:~# lxc launch ubuntu:20.04 h7 --profile default --profile routed_7 </dev/null
Creating h7
Starting h7
root@host:~# lxc shell h7
root@h7:~# dir /etc/netplan
50-cloud-init.yaml
root@h7:~# cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource.  Changes
# to it will not persist across an instance reboot.  To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        eth0:
            addresses:
            - 100.100.100.7/32
            nameservers:
                addresses:
                - 8.8.8.8
                search: []
            routes:
            -   on-link: true
                to: 0.0.0.0/0
                via: 169.254.0.1
    version: 2

And now it works as expected.

tomp · September 6, 2022, 1:22pm

Interesting. Perhaps it's an intermittent issue; try creating a few fresh instances and see if you can reproduce it.

Alexandr · September 7, 2022, 1:10pm

Yes, it’s definitely an intermittent issue!

I tried dozens of sequences of create/delete/recreate/… today, in different combinations, on the same host as yesterday, and on 2 other hosts.
Today it works as it should!

I have no idea why it didn’t work for the last 2 days.
The host has been neither rebooted nor upgraded for many days, yet today there are no errors.

So I’ll add some kind of diagnostics for situations like these (which I should do anyway), and let’s consider the issue resolved for now.
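Since the failure was intermittent, such diagnostics might poll a health check a few times after launch before reporting a problem. A minimal sketch; the helper names retry and resolves are hypothetical, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# retry (hypothetical helper): run a command up to ATTEMPTS times,
# sleeping DELAY seconds between tries. Exits 0 on the first success
# and 1 if every attempt fails.
# Usage: retry ATTEMPTS DELAY COMMAND [ARGS...]
retry() {
    _retry_max=$1
    _retry_delay=$2
    shift 2
    _retry_n=1
    while :; do
        "$@" && return 0
        [ "$_retry_n" -ge "$_retry_max" ] && return 1
        _retry_n=$((_retry_n + 1))
        sleep "$_retry_delay"
    done
}

# Usage on the host (instance name h7 is only an example):
#   # Succeeds when name resolution works inside the instance.
#   resolves() { lxc exec "$1" -- getent hosts google.com >/dev/null; }
#   retry 10 3 resolves h7 || echo "h7: DNS is still not working" >&2
```

A failed check could then trigger whatever recovery the script prefers, such as logging the state of /etc/netplan or recreating the instance.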

Thomas, many thanks for your time!