Try setting the networking inside the containers to use a static ipv6 gateway (the address of the lxdbr0 bridge) and a static IPv6 address to avoid router advertisement issues.
It’s still strange that your lxdbr0 subnet route on the host has an expiry; something is indeed very odd on your host.
I meant do you only see HTTP timeouts or do you see ping timeouts too?
I set up netplan inside the container with a static IPv6 address and the IPv6 address of lxdbr0 as gateway6.
Also the same nameservers of cloudflare inside the container.
And I set accept-ra: no also.
This eliminated the expiring of the ipv6 address inside the container (ip -6 r).
Now I will restart the host and hope for the best.
(ipv4 still has valid_lft 2896sec preferred_lft 2896sec what the ipv6 also had)
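For anyone following along, the static setup described above could look roughly like the sketch below. It is not the exact config used in this thread: the function name write_static_netplan, the interface name eth0, the file name in the usage comment and the 2001:db8 documentation addresses are all assumptions. The two nameservers are Cloudflare's public IPv6 resolvers.

```shell
# write_static_netplan FILE ADDRESS/PREFIX GATEWAY
# Writes a netplan file with a static IPv6 address, a static gateway
# and router advertisements switched off (eth0 is an assumed NIC name).
write_static_netplan() {
    out=$1
    addr=$2
    gw=$3
    cat > "$out" <<EOF
network:
  version: 2
  ethernets:
    eth0:
      dhcp6: false
      accept-ra: false
      addresses:
        - "$addr"
      gateway6: "$gw"
      nameservers:
        addresses:
          - "2606:4700:4700::1111"
          - "2606:4700:4700::1001"
EOF
}

# Usage inside the container (hypothetical file name and addresses,
# the gateway being the address of lxdbr0):
#   write_static_netplan /etc/netplan/60-static-ipv6.yaml \
#       2001:db8:1::10/64 2001:db8:1::1 && netplan apply
```

After applying it, ip -6 r inside the container should show whether the routes still carry an expires value.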
@michacassola ip -6 r doesn’t show addresses, it shows routes, so you must have been getting an expired route rather than an expired address (which may have also removed the address).
I understood that the expiring route was on the LXD host, not in the container, as in the thread “Network issues - How to troubleshoot?” you posted the output of ip -6 r with a default gateway of fe80::1, which is what your ISP requires on the host.
The line:
2a02:c207:1234:1234::/64 dev lxdbr0 proto kernel metric 256 expires 3344sec pref medium
is not normal, as it suggests the route has been learned through RA or DHCPv6, but it should be static (as it is the network of your bridge).
If you can get me a login to the host I could take a look and see what is wrong.
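To spot that kind of line quickly, a small filter over the route listing can help. This is only a sketch: the function name expiring_kernel_routes is made up here, and all it does is print routes that are marked proto kernel yet still carry an expires timer.

```shell
# expiring_kernel_routes: read `ip -6 route` output on stdin and print
# only the routes that are "proto kernel" but still have an "expires" timer.
expiring_kernel_routes() {
    awk '/proto kernel/ && /expires/'
}

# Usage on the LXD host:
#   ip -6 route show dev lxdbr0 | expiring_kernel_routes
```

No output means no kernel route on that device currently has an expiry.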
Then about the expiry of the route: the routes did show some time limits inside the container.
And lxdbr0’s route also always had the expires value on it when a container went down.
Everything I wrote before is regarding the routes.
That is why I have now set accept-ra to no in the containers’ netplan configs, hoping that it will keep the routes static.
Thank you very much! I will make a user and key for you and send you a private message.
I am sad to say that I still experience short network outages every once in a while.
A friend of mine who also checks out one of the wordpress sites in the containers also notices the outages. It’s a Cloudflare 522 Error, Host not reachable.
Just a short recap of all the things I do to be able to check them off one by one.
My ISP statically routes the ipv6/64 subnet to the host (they told me that)
I use IPv6 only on my WordPress sites in the containers, meaning AAAA records in DNS on my containers, and let Cloudflare handle the IPv4 compatibility
I implement Networking with netplan(networkd) and with the LXD built in lxdbr0.
I use the bridge lxdbr0 to have ingress and egress limits on the containers
- Here are all the limits from my sudo lxd init --preseed < EOF for my smallest plan
I use ufw on the host and in the containers to block ports besides the http/s ports and some others (I used to think the problem was due to UDP being blocked, but it still persists with UDP enabled)
I run WordOps (wordops.net) on ports 80 and 443 (although I use my own ufw script after install) but through Cloudflare
I run Webmin (www.webmin.com) on port 8443 also through Cloudflare
What could it be?
Should I not use one of the cgroups limits? @stgraber Maybe the container just drops the traffic because of one of those rules? How would I monitor/troubleshoot for that? Maybe when Cloudflare loads stuff into its cache it gets dropped as the traffic is limited?
Is it a possible issue with networkd and netplan? How would I check that out?
What could impact lxdbr0 in such a way that its route gets the expires value it always gets after some time?
Maybe the routed NIC type would be better? Maybe the bridge has too much to do? Could you guys implement the “limits.max” for that NIC type please? @stgraber
How can I troubleshoot the upper level apps like WordOps (it’s a very fancy LEMP stack with dual nginx, which is used as a reverse proxy as well as the webserver) or Webmin? How can I make sure whether it’s LXD related or related to the upper level apps?
Thanks in advance for all the help and a special thank you for all the help already given!
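On the question of how to see whether the limits are dropping traffic: one thing that might be worth checking is the qdisc counters on the host-side veth of the container, assuming the ingress/egress limits of a bridged NIC are enforced with tc there. The sketch below only summarises the dropped and overlimits counters from tc -s qdisc output; the function name qdisc_drops and the interface name vethXXXX are placeholders.

```shell
# qdisc_drops: read `tc -s qdisc show dev IFACE` output on stdin and print
# one line per qdisc with its "dropped" and "overlimits" counters.
qdisc_drops() {
    awk '
        $1 == "qdisc" { name = $2 " " $3 }      # e.g. "htb 1:"
        /dropped/ {
            d = ""; o = ""
            line = $0
            gsub(/[(),]/, " ", line)             # strip punctuation around the counters
            n = split(line, f, " ")
            for (i = 1; i < n; i++) {
                if (f[i] == "dropped")    d = f[i + 1]
                if (f[i] == "overlimits") o = f[i + 1]
            }
            print name, "dropped=" d, "overlimits=" o
        }'
}

# Usage on the host (vethXXXX stands for the host-side veth of the container):
#   tc -s qdisc show dev vethXXXX | qdisc_drops
```

If the dropped counter climbs during an outage, the limits are a likely suspect; if it stays flat, they probably are not.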
I think you need to try and narrow down the issue, as at the moment it’s still rather vague.
It’s perfectly possible there is some sort of intermittent routing issue between your ISP and Cloudflare (in my earlier traces I saw some strange intermittent issues when using MTR to your host that I hadn’t seen before).
To narrow it down further I suggest:
Record how often and for how long these issues occur. To do this you would need some sort of monitoring system (there are free ones like uptimerobot (although I’m not sure its 5m check resolution will catch it) and thinkbroadband ping monitoring).
It would be useful to know if this is affecting both ping and HTTP or just HTTP. So monitoring both the host’s IP as well as the container’s IP will also help to narrow down a container issue vs a general routing problem.
You could try routed nic type as that will remove the native bridge and limits from the equation (narrowing down the problem further).
I would set up a separate container that is not loaded and also monitor that to see if it is only busy containers that have the issue.
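If the hosted monitoring services are too coarse, the same checks can be run from any machine with IPv6 connectivity using a small loop. The sketch below is only an illustration of the suggestions above: the function names probe and monitor_once, the log file and the 2001:db8 addresses are assumptions, and it relies on ping and curl being installed.

```shell
# probe NAME CMD...: run CMD once and print a timestamped "up" or "down" line.
probe() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        state=up
    else
        state=down
    fi
    printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$name" "$state"
}

# monitor_once HOST_IP CONTAINER_IP URL
# Checks ping to the host, ping to the container and HTTP to the container,
# so a container-only problem can be told apart from a routing problem.
monitor_once() {
    probe host-ping      ping -6 -c 1 -W 2 "$1"
    probe container-ping ping -6 -c 1 -W 2 "$2"
    probe container-http curl -6 -sS -o /dev/null --max-time 5 "$3"
}

# Usage (hypothetical addresses and log file), one round every 10 seconds:
#   while :; do
#       monitor_once 2001:db8:1::1 2001:db8:1::10 'http://[2001:db8:1::10]/' >> probe.log
#       sleep 10
#   done
```

Comparing the three columns around an outage shows whether only HTTP fails, only the container fails, or the host is unreachable as well.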
Recording the issue is elusive. I have 5 min StatusCake tests and 1 min PingBreak tests on the sites and on the servers’ Webmin panel logins. They only give me warnings when the site is not reachable. So no luck recording anything. Could netplan help? I shall try.
Last time it happened I could not access Webmin through Cloudflare on the domain, but it was possible to access it on the ipv6 address directly and the status page showed me that there was barely any load on the container.
I was thinking of doing that, but I do not know how to reproduce the problem exactly so testing proves difficult. But it is on the schedule. (You guys talked about limits on the routed nic type yet?)
Wow, thanks! You guys are the best. Thanks @stgraber and thanks @tomp.
My ISP moved my VPS to another host system to make sure that it is not host system related. Now I will keep on using and testing. If it happens again I will try out the routed NIC type to see if it might be related to the managed bridge.
When running tcpdump on the host’s eth0 I see a lot of packets being sent to the server when I just hit the save button inside a program once… So I thought that the connection interruption comes from the queue being too small and me sending too many packets for it. It appears logical that my connection then gets dropped, but why does the server need 5 minutes or more to give Cloudflare back the connection to my IP address while others can still connect without problem? What do you think?
To make ip link set eth0 txqueuelen 10000 persistent in Ubuntu 20.04 with netplan and systemd I read you need a udev rule:
Make file: /lib/udev/rules.d/60-persistent-txqueuelen.rules
Put in the file: KERNEL=="eth[01]", RUN+="/sbin/ip link set %k txqueuelen 10000"
Run: sudo udevadm trigger
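The steps above could be scripted roughly as follows. This is a sketch, not a tested recipe for this server: the function names are made up, the added SUBSYSTEM=="net" match is an extra safety assumption, and the usage comment writes to /etc/udev/rules.d, which is the usual place for local rules, rather than /lib/udev/rules.d.

```shell
# write_txqueuelen_rule FILE QLEN
# Writes a udev rule that sets the transmit queue length of eth0/eth1.
write_txqueuelen_rule() {
    rule_file=$1
    qlen=$2
    printf 'SUBSYSTEM=="net", KERNEL=="eth[01]", RUN+="/sbin/ip link set %%k txqueuelen %s"\n' \
        "$qlen" > "$rule_file"
}

# show_txqueuelen IFACE: print the queue length the kernel currently uses.
show_txqueuelen() {
    cat "/sys/class/net/$1/tx_queue_len"
}

# Usage as root (hypothetical rule file name):
#   write_txqueuelen_rule /etc/udev/rules.d/60-persistent-txqueuelen.rules 10000
#   udevadm trigger
#   show_txqueuelen eth0
```

Reading tx_queue_len after a reboot confirms whether the rule really persisted.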
Regarding the “Server Changes” before that, do I need to set them? How do I check which values the kernel is using right now?
Hey @tomp , I read here (might be old info) that netplan still might have bugs with ipv6 so I would like to directly setup systemd-networkd. Anything I should add to this config that might explain the expiring route as it was missing?
Is it normal for the containers to receive router advertisements and have expiring routes?
1234:1234:1234:8614::/64 dev eth0 proto ra metric 100 expires 3439sec pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
default via fe80::216:3eff:fe15:deb9 dev eth0 proto ra metric 100 expires 1639sec mtu 1500 pref medium
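To keep an eye on listings like the one above, the routes learned from router advertisements and their remaining lifetimes can be pulled out with a short filter. A sketch only; the function name ra_route_lifetimes is an assumption.

```shell
# ra_route_lifetimes: read `ip -6 route` output on stdin and print, for each
# "proto ra" route, its destination and the remaining lifetime in seconds.
ra_route_lifetimes() {
    awk '/proto ra/ {
        for (i = 1; i < NF; i++) {
            if ($i == "expires") {
                v = $(i + 1)
                sub(/sec$/, "", v)    # "3439sec" becomes "3439"
                print $1, v
            }
        }
    }'
}

# Usage inside the container:
#   ip -6 route | ra_route_lifetimes
```

Running it a few times in a row shows whether the lifetimes get refreshed by new advertisements or count down towards zero.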
I am sorry, I am not sure what was happening on the host.
One container showed a downtime on an uptime checking service for those 5 minutes from 17:56 to 18:02. So again very selectively.
I have since noticed that my office’s ISP has a DNS that does not support IPv6. I have since switched my office DNS to Cloudflare’s and Google’s DNS as well. Could this possibly have had anything to do with my selective outages?
When the container does not have a netplan config set up, is it normal for it to have expiring routes as it gets those RAs from lxdbr0?