When running an isolated container and mounting a filesystem from the host, root in the container appears to bypass filesystem permissions on the host, much to my surprise. Can somebody confirm whether this is expected behaviour, and if so, why?
I’m not sure of the exact security logic here, but assuming that user:user on the host matches the 1000:1000 mapping, you have now exposed that user/group to the container as part of its map. So while the container is isolated, that particular uid/gid is not; it is mapped straight through.
Root in the container has privileges over all uids/gids that are available in its namespace, which in this case would also extend to 1000/1000 due to raw.idmap.
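To make the mapping a bit more concrete, here is a minimal Python sketch of how a host id is translated into the id seen inside the namespace. The tuple layout follows /proc/<pid>/uid_map (container start, host start, length). The base of 1000000 and the exact ranges in EXAMPLE_MAP are assumptions for illustration, not values taken from this thread, and host_to_container is a hypothetical helper, not an LXD API.

```python
# Ids with no mapping are displayed as the overflow id, which is
# 65534 (nobody/nogroup) by default on Linux.
OVERFLOW_ID = 65534


def host_to_container(host_id, idmap):
    """Translate a host uid/gid to the id seen inside the namespace.

    idmap is a list of (container_start, host_start, length) tuples,
    the same layout as /proc/<pid>/uid_map.
    """
    for container_start, host_start, length in idmap:
        if host_start <= host_id < host_start + length:
            return container_start + (host_id - host_start)
    return OVERFLOW_ID


# Hypothetical map: an isolated range based at 1000000 (assumed), with
# host id 1000 passed straight through, as a "both 1000 1000" entry
# in raw.idmap would do.
EXAMPLE_MAP = [
    (0, 1000000, 1000),      # container 0..999    -> host 1000000..1000999
    (1000, 1000, 1),         # container 1000      -> host 1000 (mapped through)
    (1001, 1001001, 64535),  # container 1001..    -> host 1001001..
]

if __name__ == "__main__":
    for host_id in (0, 1000, 1000000):
        print(host_id, "->", host_to_container(host_id, EXAMPLE_MAP))
```

Under these assumptions, host id 1000 is visible as 1000 in the container, so container root has privileges over it, while anything owned by an unmapped host id (host root, for instance) shows up as nobody/nogroup.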
Quite honestly, the observed behaviour is really unexpected.
If I now have a nested mount on the host, e.g. /test/subdir, both owned by 1000:1000 on the host, with the idmap above, but only add a disk device for /test, I see this in the container:
drwxr-x--- 3 1000 1000 3 Jun 2 21:08 /test
drwxr-x--- 3 nobody nogroup 3 Jun 2 21:08 /test/subdir
which makes sense. And root in the container can’t write to subdir, which again makes perfect sense.
However, in the container, rmdir /test/subdir does work, and it unmounts /test/subdir on the host. Now that, again, is really surprising.
I guess I’ll have to check the mount propagation settings…
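For checking propagation, one option is to read the optional fields in /proc/self/mountinfo, where tags such as shared:N or master:N mark a mount's propagation type, and a mount with no such tags is private. Below is a rough Python sketch of that; propagation and mounts_under are hypothetical helper names, and /test is simply the path from my example above.

```python
def propagation(mountinfo_line):
    """Return (mount_point, propagation_tags) for one mountinfo line.

    Optional fields sit between the mount options (field 6) and the
    "-" separator. No optional fields means the mount is private.
    """
    fields = mountinfo_line.split()
    sep = fields.index("-")
    # Keep only the tag name, e.g. "shared:12" -> "shared".
    tags = [field.split(":")[0] for field in fields[6:sep]]
    return fields[4], tags or ["private"]


def mounts_under(prefix, path="/proc/self/mountinfo"):
    """List (mount_point, tags) for every mount at or below prefix."""
    results = []
    with open(path) as handle:
        for line in handle:
            mount_point, tags = propagation(line)
            if mount_point == prefix or mount_point.startswith(prefix.rstrip("/") + "/"):
                results.append((mount_point, tags))
    return results


if __name__ == "__main__":
    # Run on the host and again inside the container, then compare.
    for mount_point, tags in mounts_under("/test"):
        print(mount_point, ",".join(tags))
```

Running it on both sides should show whether /test and /test/subdir are shared, slave (master:N) or private in each mount namespace.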