Network issues - How to troubleshoot?

What system are you using to configure eth0 on the host? Is it networkd, /etc/network/interfaces or netplan?

Please can you show the configuration you are using, and confirm that no other systems are also in use (potentially configuring conflicting settings).
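
For example, something like this would show the usual places that other network configuration could be coming from (paths assume an Ubuntu host using systemd-networkd, so treat it as a sketch):

ls /etc/netplan/ /run/netplan/ /etc/systemd/network/ 2>/dev/null
cat /etc/network/interfaces 2>/dev/null
networkctl status eth0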

If your ISP is routing your /64 directly to your host’s IP without the need for NDP, then you won’t need proxy NDP enabled at all (and I believe if you’re using netplan it configures networkd, which will disable the accept_ra and proxy_ndp settings anyway).

Your ISP uses a static route for the default gateway, so you should not need accept_ra enabled on the host anyway (only in the containers).

You could set these to off as well, as you won’t use them (but since your ISP doesn’t send RAs they won’t be being used anyway):

net.ipv6.conf.all.use_tempaddr = 0
net.ipv6.conf.default.use_tempaddr = 0
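
For example (a sketch; the drop-in file name under /etc/sysctl.d/ is just an example):

# check the current values
sysctl net.ipv6.conf.all.use_tempaddr net.ipv6.conf.default.use_tempaddr

# persist them across reboots
printf 'net.ipv6.conf.all.use_tempaddr = 0\nnet.ipv6.conf.default.use_tempaddr = 0\n' | sudo tee /etc/sysctl.d/99-no-tempaddr.conf
sudo sysctl --system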

I am using netplan as shown in our tutorial. I have not actively set up anything else. In the netplan config networkd is specified as the renderer.

I only have the configuration mentioned here in my netplan. I have no other networking configurations. I am happy to check if you tell me what to look for. But ifupdown is not installed.

So eth0 and lxdbr0 should be set to accept_ra = 0 ?

How do I check if there is a problem with the built in dnsmasq of lxdbr0?

Could net.ipv6.icmp.ratelimit = 1000 be too small?

So LXD sets accept_ra to 2 on all interfaces (https://github.com/lxc/lxd/blob/182567f046e0debb5cae4f98f3cd50292628b43f/lxd/network/network.go#L591-L607); this is to allow the host to continue to receive router advertisements when LXD switches the node into routed mode (which it does by default, as routing is enabled).
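
If you want to see what accept_ra is currently set to on each interface, something along these lines will show it:

# 2 = accept RAs even with forwarding enabled, 1 = accept, 0 = ignore
grep . /proc/sys/net/ipv6/conf/*/accept_ra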

Have you got any router advertisement daemons running inside any of your containers by any chance (dnsmasq or radvd come to mind)?

It would be interesting if you shut your containers down, restarted the node, and checked whether the expiring route 2a02:c207:1234:1234::/64 dev lxdbr0 proto kernel metric 256 expires 3344sec pref medium appears when none of the containers have ever been started (just LXD bringing up the interface).
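
Something like this should do it (a sketch; it assumes no container is set to autostart on boot):

lxc stop --all
sudo reboot
# after the reboot, before starting any container:
ip -6 route show dev lxdbr0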

Okay, thanks for the explanation about accept_ra.

I have dnsmasq in the container yes:

root@s1c5:~# ps aux | grep dnsmasq
root     16066  0.0  0.0  16176  1036 ?        S+   15:13   0:00 grep --color=auto dnsmasq

I have since restarted the host and the expires option does not appear on the lxdbr0 route. But last time I did the same and it came back, so I guess it is only a matter of time and, as you are indicating, a matter of something happening in the container?

It will be interesting to see whether it doesn’t appear after some time if the containers aren’t running, and whether it goes on to appear once the containers are running.

Let me let the containers run for a while. Maybe it happens only under load, as I was mostly working on the WordPress sites running inside the containers. Or when the resource limits hit 100% too often, or there are too many network requests overall?

Had another short outage. This time the routes stayed stable, ip -6 r showed no expires on lxdbr0, and the container didn’t lose its address or route. So I guess the container does not respond due to overload? How can I check this?
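
(I suppose something like this would at least show whether the container is actually under load when it happens? Just my guess:)

lxc info s1c5          # shows CPU, memory and network usage for the instance
lxc exec s1c5 -- uptime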

I also noticed that accessing the container not only on 443 with HTTPS but also on another port with a second HTTPS connection could perhaps be part of the issue?

Would you like to see more tcpdumps? As far as my little knowledge goes I do not see anything unusual.

Could missing Keep-Alive be the issue? I use Cloudflare and get 522 errors when this kind of outage occurs.

  • I used https://www.giftofspeed.com/check-keep-alive/ to check keep-alive and it is turned on for the HTTPS connection to the WordPress instance inside the container and for the HTTPS connection to Webmin, which also runs inside the container but on a different port. So that is not it.

Are you seeing the same thing with ping rather than http?

Sorry, but I do not understand the question.

I am having expires problems again on lxdbr0. :slight_smile:
This time a beefier container went down when I added a WordPress user. Before that I did updates on the OS and WP and created a new WP site.

Maybe this container has problems as its name was changed after creation from c1 to s1c1?
I have really no idea why they keep going down.

Should I switch to DHCPv6?
Could it be a cgroup issue, with too many IOPS when the database gets going?

Try setting the networking inside the containers to use a static IPv6 gateway (the address of the lxdbr0 bridge) and a static IPv6 address, to avoid router advertisement issues.
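
Inside a container that would look roughly like this with netplan (a sketch only; the container address, gateway and nameservers are placeholders to replace with your real values):

network:
  version: 2
  ethernets:
    eth0:
      accept-ra: false
      addresses:
        - 2a02:c207:1234:1234::100/64
      gateway6: 2a02:c207:1234:1234::1
      nameservers:
        addresses: [2606:4700:4700::1111, 2606:4700:4700::1001]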

It’s still strange that your lxdbr0 subnet route on the host has an expiry; something is indeed very odd on your host.

I meant do you only see HTTP timeouts or do you see ping timeouts too?

Thanks!
So setting up netplan inside the container or running some lxc commands?

Last time I pinged a container while it was down over HTTPS, and it answered.

lxc seems not to be aware of any eth0 on the container:

user@s1:~$ sudo lxc config device get s1c1 eth0 ipv6.address
Error: The device doesn't exist
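
(If eth0 comes from the default profile rather than being defined on the instance itself, I guess something like this would show where it lives and let me set the address on the instance? Just my assumption:)

lxc config show s1c1 --expanded    # shows devices inherited from profiles
lxc config device override s1c1 eth0 ipv6.address=2a02:c207:1234:1234::100    # placeholder address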

Could this old issue hold an answer?

I set up netplan inside the container to have the static IPv6 address and the IPv6 address of lxdbr0 as gateway6.
Also the same Cloudflare nameservers inside the container.
And I set accept-ra: no as well.

This eliminated the expiring of the IPv6 address inside the container (ip -6 r).
Now I will restart the host and hope for the best.
(IPv4 still has valid_lft 2896sec preferred_lft 2896sec, which the IPv6 also had.)

@michacassola ip -6 r doesn’t show addresses, it shows routes, so you must have been getting an expired route rather than an expired address (which may have also removed the address).

I understood that the expiring route was on the LXD host, not in the container, as earlier in this thread (Network issues - How to troubleshoot?) you posted the output of ip -6 r with a default gateway of fe80::1, which is what your ISP requires on the host.

The line:

2a02:c207:1234:1234::/64 dev lxdbr0 proto kernel metric 256 expires 3344sec pref medium is not normal, as it suggests the route has been learned through RA or DHCPv6, but it should be static (as it is the network of your bridge).
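
For comparison, I would expect the bridge’s subnet route to look like this, i.e. the same line but with no expires on it:

2a02:c207:1234:1234::/64 dev lxdbr0 proto kernel metric 256 pref medium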

If you can get me a login to the host I could take a look and see what is wrong.

Then there is the expiry of the route. The routes did show some time limits inside the container as well.
And lxdbr0’s route also always had the expires on it when a container went down.

Everything I wrote before is regarding the routes.

That is why I have now set accept-ra to no in the containers’ netplan configs, hoping that it will keep the routes static.

Thank you very much! I will make a user and key for you and send you a private message.

You can use https://launchpad.net/~tomparrott/+sshkeys

Hey @tomp, hope you are fine these days!

I am sad to say that I still experience short network outages every once in a while.
A friend of mine who also checks out one of the WordPress sites in the containers notices the outages too. It’s a Cloudflare 522 error (host not reachable).

Just a short recap of all the things I do to be able to check them off one by one.

  • My ISP statically routes the IPv6 /64 subnet to the host (they told me that)
  • I use IPv6 only for my WordPress sites in the containers, meaning AAAA records in DNS and on my containers, and let Cloudflare handle the IPv4 compatibility
  • I configure networking with netplan (networkd) and with the LXD built-in lxdbr0
  • I use the bridge lxdbr0 to have ingress and egress limits on the containers
    - Here are all the limits from my sudo lxd init --preseed < EOF for my smallest plan:

      config:
        limits.cpu: "1"
        limits.cpu.allowance: 50%
        limits.disk.priority: "1"
        limits.memory: 1792MB
        limits.memory.swap: "false"
      description: ""
      devices:
        eth0:
          limits.max: 50Mbit
          nictype: bridged
          parent: lxdbr0
          type: nic
        root:
          path: /
          pool: default
          size: 20GiB
          type: disk
  • I use ufw on the host and in the containers to block ports besides the HTTP/S ports and some others (I used to think the problem was due to UDP being blocked, but it still persists with UDP enabled)
  • I run WordOps (wordops.net) on ports 80 and 443 (although I use my own ufw script after install) but through Cloudflare
  • I run Webmin (www.webmin.com) on port 8443 also through Cloudflare

What could it be?

  • Should I not use one of the cgroup limits? @stgraber Maybe the container just drops the traffic because of one of those rules? How would I monitor/troubleshoot that? Maybe when Cloudflare loads stuff into its cache it gets dropped because the traffic is limited?
  • Is it a possible issue with networkd and netplan? How would I check that out?
  • What could impact lxdbr0 in such a way that its route gets the expires attribute it always gets after some time?
  • Maybe the routed NIC type would be better? Maybe the bridge has too much to do? Could you guys implement "limits.max" for that NIC type please? @stgraber
  • How can I troubleshoot the upper-level apps like WordOps (it’s a very fancy LEMP stack with dual nginx, used both as a reverse proxy and as the web server) or Webmin? How can I make sure it’s LXD related rather than upper-level-app related?

Thanks in advance for all the help and a special thank you for all the help already given!

I think you need to try and narrow down the issue, as at the moment it’s still rather vague.

It’s perfectly possible there is some sort of intermittent routing issue between your ISP and Cloudflare (in my earlier traces I saw some strange intermittent issues when using MTR to your host that I hadn’t seen before).

To narrow it down further I suggest:

  • Record how often and for how long these issues occur. To do this you would need some sort of monitoring system; there are free ones like UptimeRobot (although I am not sure whether a 5-minute check resolution will catch it) and thinkbroadband’s ping monitoring.
  • It would be useful to know whether this is affecting both ping and HTTP, or just HTTP. Monitoring both the host’s IP and the container’s IP will also help to narrow down a container issue vs a general routing problem (see the sketch after this list).
  • You could try the routed NIC type, as that removes the native bridge and the limits from the equation (narrowing down the problem further).
  • I would set up a separate container that is not loaded and also monitor that, to see if it is only busy containers that have the issue.
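
A minimal logging sketch for the second point (the addresses and the log path are placeholders; run it e.g. every minute from cron on a separate machine):

#!/bin/sh
# log IPv6 ping and HTTPS reachability with timestamps
HOST=2a02:c207:1234:1234::1          # placeholder: host address
CONTAINER=2a02:c207:1234:1234::100   # placeholder: container address
for ip in "$HOST" "$CONTAINER"; do
    if ping -6 -c 1 -W 2 "$ip" >/dev/null 2>&1; then
        echo "$(date -Is) ping $ip ok"
    else
        echo "$(date -Is) ping $ip FAIL"
    fi
done >> /var/log/reachability.log
# HTTPS check against the container (-k because the certificate is for the domain, not the IP)
if curl -g -6 -k -s -o /dev/null -m 5 "https://[$CONTAINER]/"; then
    echo "$(date -Is) http ok" >> /var/log/reachability.log
else
    echo "$(date -Is) http FAIL" >> /var/log/reachability.log
fi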

Yeah, I need to narrow it down for sure.

  1. Recording the issue is elusive. I have 5-minute StatusCake tests and 1-minute PingBreak tests on the sites and on the servers’ Webmin panel logins. They only give me warnings when the site is not reachable. So no luck recording anything. Could netplan help? I shall try.
  2. Last time it happened I could not access Webmin through Cloudflare on the domain, but it was possible to access it on the IPv6 address directly, and the status page showed me that there was barely any load on the container.
  3. I was thinking of doing that, but I do not know how to reproduce the problem exactly, so testing proves difficult. But it is on the schedule. :slight_smile: (Have you guys talked about limits on the routed NIC type yet?)
  4. See 2. (Can be ruled out in my opinion)

Thanks for your help!

I think @stgraber added routed NIC limits a while back; the docs confirm they are supported now:

https://linuxcontainers.org/lxd/docs/master/instances#nictype-routed
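
A sketch of what switching one container over could look like (the device name, parent interface and address are placeholders; an instance-level device with the same name overrides the profile’s bridged one):

lxc config device add s1c1 eth0 nic nictype=routed parent=eth0 ipv6.address=2a02:c207:1234:1234::100 limits.max=50Mbit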
