Do you have any suggestions for a reliable way to do a container UID remap when lxd could be killed at any point during the process?
I am currently seeing that if lxd is stopped during the uid remap, which happens in StartContainer, the uids/gids can be corrupted the next time the container is started.
My current thoughts are to have a process such as:
1/ look for ‘temp’ copy and delete it
2/ copy ‘container’ to ‘temp’
3/ uid remap ‘temp’ (call StartContainer)
4/ delete ‘container’
5/ rename ‘temp’ to ‘container’
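If lxd is stopped at any point during steps 1, 2 or 3, the process will recover safely the next time it starts, because ‘container’ has not been touched yet.
I think this approach would still fail if lxd is stopped during step 4 or 5: once ‘container’ has been deleted, step 1 on the next start would throw away ‘temp’, the only remaining copy. However, these should be very fast steps.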
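To check which crash points the first process survives, here is a minimal Python sketch that models it with plain directories. It assumes a directory stands in for the container's rootfs, that `remap` is a hypothetical placeholder for the uid/gid shift StartContainer does, and that a crash can only happen between steps, not partway through one. On restart the whole process is rerun from step 1.

```python
import os
import shutil
import tempfile


class Crash(Exception):
    """Simulates lxd being killed just before a given step."""


def remap(path):
    # Hypothetical stand-in for the uid/gid shift done by StartContainer:
    # it appends a marker to every top-level file in the tree.
    for name in os.listdir(path):
        with open(os.path.join(path, name), "a") as f:
            f.write(":remapped")


def first_process(root, crash_before=None):
    """Run steps 1-5, raising Crash just before step `crash_before`."""
    container = os.path.join(root, "container")
    temp = os.path.join(root, "temp")
    steps = [
        lambda: shutil.rmtree(temp, ignore_errors=True),  # 1/ delete any stale 'temp'
        lambda: shutil.copytree(container, temp),         # 2/ copy 'container' to 'temp'
        lambda: remap(temp),                              # 3/ remap 'temp'
        lambda: shutil.rmtree(container),                 # 4/ delete 'container'
        lambda: os.rename(temp, container),               # 5/ rename 'temp' to 'container'
    ]
    for number, step in enumerate(steps, start=1):
        if number == crash_before:
            raise Crash(number)
        step()


def run(crash_before=None):
    """Crash before step `crash_before`, restart once, return the file's contents."""
    root = tempfile.mkdtemp()
    try:
        os.makedirs(os.path.join(root, "container"))
        with open(os.path.join(root, "container", "file"), "w") as f:
            f.write("data")
        try:
            first_process(root, crash_before)
        except Crash:
            first_process(root)  # lxd restarts and reruns the process from step 1
        with open(os.path.join(root, "container", "file")) as f:
            return f.read()
    except FileNotFoundError:
        return None  # 'container' is gone and the restart deleted 'temp'
    finally:
        shutil.rmtree(root)


for crash_before in [None, 1, 2, 3, 4, 5]:
    print(crash_before, run(crash_before))
```

In this model, crashing before any of steps 1 to 4 still ends with a single clean remap, while crashing before step 5 returns `None`: the restart's step 1 deletes ‘temp’, which by then is the only copy left.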
I have also tried this approach:
1/ look for ‘temp’ snapshot and restore ‘container’ from it if it exists
2/ create ‘temp’ snapshot
3/ uid remap ‘container’ (call StartContainer)
4/ delete ‘temp’ snapshot
I found that if lxd was stopped during step 1, the container would not recover.
Maybe I could extend the first process like this:
1/ if ‘container’ does not exist, but ‘container.to.be.deleted’ does, then rename ‘container.to.be.deleted’ to ‘container’
2/ look for ‘temp’ copy and delete it
3/ copy ‘container’ to ‘temp’
4/ uid remap ‘temp’ (call StartContainer)
5/ rename ‘container’ to ‘container.to.be.deleted’
6/ rename ‘temp’ to ‘container’
7/ delete ‘container.to.be.deleted’
That feels a bit convoluted.
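Here is a minimal Python sketch of the extended process, modelled with plain directories, a hypothetical `remap` placeholder for the StartContainer uid/gid shift, and crashes that only happen between steps. It makes two assumptions that are not in the list: the applied idmap is recorded in a hypothetical `idmap` file inside the tree, so the rename in step 6 commits the data and the record together and a restart can tell the remap is already done; and a leftover ‘container.to.be.deleted’ next to an existing ‘container’ is deleted in step 1, which covers a crash during step 7.

```python
import os
import shutil
import tempfile

TARGET_IDMAP = "idmap-v2"  # hypothetical label for the idmap the container should end up with


class Crash(Exception):
    """Simulates lxd being killed just before a given step."""


def read(path, name):
    with open(os.path.join(path, name)) as f:
        return f.read()


def remap(path):
    # Hypothetical stand-in for StartContainer's uid/gid shift. It also records the
    # applied idmap inside the tree, so step 6's rename commits data and record together.
    with open(os.path.join(path, "file"), "a") as f:
        f.write(":remapped")
    with open(os.path.join(path, "idmap"), "w") as f:
        f.write(TARGET_IDMAP)


def start_container(root, crash_before=None):
    """Run steps 1-7, raising Crash just before step `crash_before`."""
    container = os.path.join(root, "container")
    temp = os.path.join(root, "temp")
    old = os.path.join(root, "container.to.be.deleted")

    def step(number):
        if number == crash_before:
            raise Crash(number)

    step(1)  # 1/ 'container' missing but the old copy present: undo the half-done swap
    if not os.path.exists(container) and os.path.exists(old):
        os.rename(old, container)
    # Assumption beyond the list: an old copy next to 'container' means step 6
    # committed but step 7 did not finish, so finish deleting it.
    shutil.rmtree(old, ignore_errors=True)
    step(2)
    shutil.rmtree(temp, ignore_errors=True)  # 2/ delete any stale 'temp'
    if read(container, "idmap") == TARGET_IDMAP:
        return  # remap already committed; nothing left to do
    step(3)
    shutil.copytree(container, temp)  # 3/ copy 'container' to 'temp'
    step(4)
    remap(temp)  # 4/ remap 'temp'
    step(5)
    os.rename(container, old)  # 5/ move 'container' aside
    step(6)
    os.rename(temp, container)  # 6/ commit: 'temp' becomes 'container'
    step(7)
    shutil.rmtree(old)  # 7/ delete the old copy


def run(crash_before=None):
    """Crash before step `crash_before`, start again, return (file contents, entries)."""
    root = tempfile.mkdtemp()
    try:
        container = os.path.join(root, "container")
        os.makedirs(container)
        with open(os.path.join(container, "file"), "w") as f:
            f.write("data")
        with open(os.path.join(container, "idmap"), "w") as f:
            f.write("idmap-v1")
        try:
            start_container(root, crash_before)
        except Crash:
            pass
        start_container(root)  # lxd restarts; after a clean run this is a no-op
        return read(container, "file"), sorted(os.listdir(root))
    finally:
        shutil.rmtree(root)


for crash_before in [None, 1, 2, 3, 4, 5, 6, 7]:
    print(crash_before, run(crash_before))
```

In this model every crash point, including no crash at all, ends with the file remapped exactly once and only ‘container’ left behind. The in-tree idmap record is what lets a restart tell a committed remap from a pending one; without some such marker, a restart after a crash just before step 7 would copy the already-remapped ‘container’ and shift it a second time.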