UID remapping process that can recover from killing lxd

joelhockey · September 9, 2019, 10:59am

Do you have any suggestions about how we could have a reliable process for doing a container UID remap where lxd could be killed at any point in the process?

I am currently seeing that if lxd is stopped during the uid remap process which happens during StartContainer, then when the container is subsequently started again, the uids/gids can become corrupted.

My current thoughts are to have a process such as:
1/ look for ‘temp’ copy and delete it
2/ copy ‘container’ to ‘temp’
3/ uid remap ‘temp’ (call StartContainer)
4/ delete ‘container’
5/ rename ‘temp’ to ‘container’

If lxd is stopped at any point in steps 1, 2, or 3, then the process will safely recover the next time it starts.

I think this approach would still fail if lxd is stopped during step 4 or 5; however, these should be very fast steps.

I have tried the approach:
1/ look for ‘temp’ snapshot and restore ‘container’ from it if it exists
2/ create ‘temp’ snapshot
3/ uid remap ‘container’ (call StartContainer)
4/ delete ‘temp’ snapshot

I found that if lxd was stopped during step 1, the container would not recover.

Maybe I could extend the first suggested process to be:
1/ if ‘container’ does not exist, but ‘container.to.be.deleted’ does, then rename ‘container.to.be.deleted’ to ‘container’
2/ look for ‘temp’ copy and delete it
3/ copy ‘container’ to ‘temp’
4/ uid remap ‘temp’ (call StartContainer)
5/ rename ‘container’ to ‘container.to.be.deleted’
6/ rename ‘temp’ to ‘container’
7/ delete ‘container.to.be.deleted’
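
To check the reasoning behind the 7 steps, here is a minimal sketch that replays them against an in-memory model and "kills lxd" at every point in turn. Everything in it is hypothetical: `FakeStore` is not the LXD API, rename is assumed to be atomic, and a fully remapped container is assumed not to be shifted a second time. The sketch also deletes a leftover ‘container.to.be.deleted’ when ‘container’ exists, which the steps as listed do not spell out; without that, step 5 would collide with the leftover after a kill during step 7.

```python
class Crash(Exception):
    """Raised to simulate lxd being killed."""


class FakeStore:
    """In-memory stand-in for the container store (hypothetical, not the LXD API)."""

    def __init__(self, crash_at=None):
        self.containers = {"container": "unshifted"}
        self.crash_at = crash_at  # tick number at which to "kill lxd"
        self.ticks = 0

    def _tick(self):
        # Each tick marks a point where lxd could be killed.
        if self.ticks == self.crash_at:
            raise Crash()
        self.ticks += 1

    def exists(self, name):
        return name in self.containers

    def rename(self, old, new):
        self._tick()  # killed just before the rename
        if new in self.containers:
            raise RuntimeError("rename target already exists: " + new)
        self.containers[new] = self.containers.pop(old)  # assumed atomic

    def delete(self, name):
        self._tick()  # killed just before the delete
        self.containers[name] = "half-deleted"
        self._tick()  # killed mid-delete
        del self.containers[name]

    def copy(self, src, dst):
        self._tick()  # killed just before the copy
        self.containers[dst] = "partial copy"
        self._tick()  # killed mid-copy
        self.containers[dst] = self.containers[src]

    def remap(self, name):
        if self.containers[name] == "shifted":
            return  # assumption: a fully remapped container is not shifted again
        self._tick()  # killed just before the remap
        self.containers[name] = "half-shifted"
        self._tick()  # killed mid-remap
        self.containers[name] = "shifted"


def remap_with_recovery(store):
    if not store.exists("container") and store.exists("container.to.be.deleted"):
        store.rename("container.to.be.deleted", "container")  # step 1
    elif store.exists("container.to.be.deleted"):
        store.delete("container.to.be.deleted")  # leftover cleanup (sketch's addition)
    if store.exists("temp"):
        store.delete("temp")  # step 2
    store.copy("container", "temp")  # step 3
    store.remap("temp")  # step 4
    store.rename("container", "container.to.be.deleted")  # step 5
    store.rename("temp", "container")  # step 6
    store.delete("container.to.be.deleted")  # step 7


def crash_test():
    """Kill the process at each tick in turn, restart once, record the final state."""
    results = []
    crash_at = 0
    while True:
        store = FakeStore(crash_at)
        try:
            remap_with_recovery(store)
            return results  # this run finished before reaching the crash point
        except Crash:
            store.crash_at = None  # lxd restarts and is not killed again
            remap_with_recovery(store)
            results.append(dict(store.containers))
        crash_at += 1


if __name__ == "__main__":
    for state in crash_test():
        print(state)
```

In this model, every kill point followed by a single restart ends with only ‘container’ present and fully shifted. It does not cover a second kill during recovery, or a rename that is not atomic in the real storage backend.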

That feels a bit convoluted.

stgraber · September 9, 2019, 11:38am

That does seem rather complicated but also seems correct.
It’s unfortunate that there is no way for us to continue a remap, but depending on the maps involved, there is really no way for us to figure out what may have been shifted already…
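
A small sketch of why an interrupted remap cannot always be resumed. It assumes a single contiguous map of 65536 IDs and made-up host bases (100000 for the old map, 150000 for the new one); real idmaps can have several ranges, and the function name is hypothetical. When the old and new host ranges overlap, the same on-disk ID fits both readings:

```python
def possible_container_ids(host_id, old_base, new_base, size=65536):
    """Return the container IDs that an on-disk host ID could stand for mid-remap."""
    candidates = {}
    if old_base <= host_id < old_base + size:
        # The file may still carry its ID under the old map.
        candidates["not yet shifted"] = host_id - old_base
    if new_base <= host_id < new_base + size:
        # Or it may already have been rewritten for the new map.
        candidates["already shifted"] = host_id - new_base
    return candidates


if __name__ == "__main__":
    # Overlapping ranges: 160000 is ambiguous, 120000 is not.
    print(possible_container_ids(160000, 100000, 150000))
    print(possible_container_ids(120000, 100000, 150000))
    # Disjoint ranges: no ID is ambiguous.
    print(possible_container_ids(160000, 100000, 1000000))
```

With the overlapping maps, host ID 160000 could be container ID 60000 that has not been shifted yet, or container ID 10000 that already has been, and nothing on disk distinguishes the two.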

Ultimately the solution for this is to not deal with remapping at all and use shiftfs to avoid the whole problem, then hopefully soon use an in-VFS equivalent of shiftfs so we don’t need that separate overlay at all.

I’m assuming the problems for you are that 1) you’re using LXD LTS, which lacks shiftfs, and 2) you don’t have shiftfs in your kernel.

joelhockey · September 9, 2019, 8:46pm

Hi Stéphane,

Yes, I’m with the Chrome OS Crostini team, and we would like to use shiftfs, but we do have the restrictions you mention for now.

I’ll go with the last 7-step process.

joelhockey · September 10, 2019, 12:13pm

I have implemented some code at https://chromium-review.googlesource.com/c/chromiumos/platform/tremplin/+/1795683. Any feedback is welcome.

I have noticed that when you copy a container, the new container starts with the new hostname. But when you rename a container, it keeps the old hostname.

Is that expected?

I needed to add some code to update the hostname of the copy back to the original hostname to make things work.