What is the expected behavior when running two containers with the same 'listen' value?

We had someone unintentionally bring online an original container and a copy of the container. Both copy and original have proxy devices configured to listen on the same host IP/Port. This caused some confusion for a bit until we realized what had happened.

I looked through error logs, but didn’t find any entries which indicated that this was caught/flagged. I’m sure there are valid use cases for this sort of setup, but aside from scraping the output of lxc list and then processing each container (e.g., via lxc config show XYZ) to look for conflicts, what options do we have for detecting this situation so that our monitoring setup can alert us?

Thanks in advance for your help.

Setup:

  • Ubuntu 18.04
  • LXD 3.0.3
  • Kernel 4.15.0-55-generic

As chance would have it, I was re-working the proxy device code today to move it into our new device interface package. https://github.com/lxc/lxd/pull/6011

Looking at the code, whilst there is validation to check that proxy devices only forward connections to IP addresses owned by the container the device is set up on, there don’t currently appear to be any checks that the listen IP/port combination isn’t already taken by another proxy device (either on the same or a different container).

It would appear that the behaviour in this case is undefined, and most likely it will be a case of the first device ‘winning’ and the others failing to start.

@stgraber do you see any problems with adding some checks for this sort of thing as I port the code?

So I’m not sure what’s the best way to do this.

Doing it at configuration time seems wrong, as you could legitimately have multiple containers set up to listen on the same port and just make sure not to start two of them at the same time.

So I guess what we could do is, on start, resolve the listen address (if it’s on the host side), then check whether something is currently bound to it and fail if so. That would also catch anything else (outside of LXD) binding that address/port combination.
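The start-time check described above could be sketched like this. This is purely illustrative: the real implementation would live in LXD’s Go code, and the helper name here is made up for the sketch, not part of any LXD API.

```python
import socket

def listen_addr_free(host: str, port: int) -> bool:
    """Return True if nothing is currently bound to host:port (TCP).

    Illustrative sketch only; not an LXD function.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return True
    except OSError:
        # EADDRINUSE (or a permission error) means we can't take it.
        return False
    finally:
        s.close()
```

Note that a probe like this is inherently racy (another process could grab the port between the check and the real bind), so it catches misconfiguration early rather than guaranteeing exclusivity.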

Good point, I’ll have a think. Currently forkproxy is started after container start, so this might be a good reason to move proxy start into startCommon() so that a failure can prevent the container from starting. That would also need a way for forkproxy to wait until the container has started so it can bind inside the container.

@stgraber, @tomp

Thanks for pushing forward with a permanent fix for this.

Once the fix is in, will this be back-ported to the LTS releases (e.g., Ubuntu 18.04 & the 3.0.x branch)?

For the time being, is there a (relatively) easy way to query the proxy devices associated with each container aside from the output of lxc list? We’re looking to build Nagios checks for running containers with duplicate proxy devices, and any pointers there would be great.

Thanks.

Yeah, most fixes get backported to our stable-3.0 branch which then is used to release the 3.0.x point releases that are uploaded to Ubuntu 18.04.

We tagged 3.0.4 recently and that’s yet to make it to 18.04, so it will be a little while before we do a 3.0.5 though.

If you’re interested in doing this on a per-container basis, your best bet is parsing the output (yaml) of lxc config show --expanded NAME.

I don’t remember if 3.0.x has lxc query, if it does, then you may be able to do some parsing of lxc query /1.0/containers?recursion=1 which will get you the data you need for all containers in a single request. For those, you’ll want to look at the expanded_devices entry.
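For the Nagios side, a detection check along these lines might work. This is a sketch under the assumption that the per-container JSON returned by lxc query /1.0/containers?recursion=1 carries name, status, and expanded_devices fields; the helper name is mine, not part of any LXD tooling.

```python
import collections
import json
import subprocess

def find_duplicate_listens(containers):
    """Group running containers' proxy devices by their 'listen' value
    and return only the values claimed by more than one device."""
    listens = collections.defaultdict(list)
    for c in containers:
        if c.get("status") != "Running":
            continue
        for dev_name, dev in c.get("expanded_devices", {}).items():
            if dev.get("type") == "proxy" and "listen" in dev:
                listens[dev["listen"]].append((c["name"], dev_name))
    return {addr: users for addr, users in listens.items() if len(users) > 1}

# Typical usage on a host where lxc query is available:
#   data = json.loads(subprocess.check_output(
#       ["lxc", "query", "/1.0/containers?recursion=1"]))
#   conflicts = find_duplicate_listens(data)
#   # A Nagios check would exit CRITICAL if conflicts is non-empty.
```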


@stgraber ideally forkproxy wouldn’t fork until it had established the listeners, so that any errors during startup could be caught and we could detect that it failed to start. Unfortunately it double-forks before attempting to set up the listeners. I’ve tried checking that the process is running after it’s started, but it runs just long enough before exiting to be detected as running OK.

Looking into this further, the issue is that forkproxy starts straight away and can then take several seconds to exit when trying to bind to a port that is already in use.

So I was thinking: if I add a log line of just “Started” to the forkproxy output once it enters its main event loop, then we can poll for that on the LXD side every 1s and, once found, consider the device started OK. If an “Error” line is found (these are already being logged), or no “Started” line appears within a set period of time (say 10s?), then we can also treat that as a failure of forkproxy to start.
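The polling scheme described above might look roughly like this. It’s an illustrative Python sketch of the logic only; the actual implementation would be in LXD’s Go code, and the function name and log-line formats here are assumptions taken from the description.

```python
import time

def wait_for_forkproxy(read_log, timeout=10.0, interval=1.0):
    """Poll the forkproxy log until a 'Started' line appears (success),
    an 'Error' line appears (failure), or the timeout expires (failure).

    read_log is a callable returning the current log contents as a string.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        lines = read_log().splitlines()
        if any(line.startswith("Error") for line in lines):
            return False
        if any(line.startswith("Started") for line in lines):
            return True
        time.sleep(interval)
    return False
```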

I suppose we could have some kind of state file that we can track, separate from the log; that would probably work.

Yes, that would be nicer than parsing log files; I’ll add that.