Docker container fails sometimes with "symlink /proc/mounts ... file exists" with fuse-overlayfs

Hello,

I am running LXD with the following details:

# lxc info
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
...
    -----END CERTIFICATE-----
  certificate_fingerprint: af209b4d143705216083f43aae6cbf9b3f90443b3f473cef481527aa6d071db3
  driver: lxc | qemu
  driver_version: 5.0.2 | 8.0.0
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "true"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.19.0-45-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "22.04"
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: ...
  server_pid: 2852889
  server_version: "5.14"
  storage: zfs | btrfs
  storage_version: 2.1.5-1ubuntu6 | 5.16.2
  storage_supported_drivers:
  - name: cephfs
    version: 17.2.5
    remote: true
  - name: cephobject
    version: 17.2.5
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.47.0
    remote: false
  - name: zfs
    version: 2.1.5-1ubuntu6
    remote: false
  - name: btrfs
    version: 5.16.2
    remote: false
  - name: ceph
    version: 17.2.5
    remote: true

Inside a container, I am running Docker with fuse-overlayfs as the storage driver. This works quite well, but I found issues with some images. For example, ubuntu:16.04 always fails:

# docker run ubuntu:16.04
docker: Error response from daemon: symlink /proc/mounts /var/lib/docker/fuse-overlayfs/2653fe53a0905d2bd4cdac608ca3468c6cb91472f28ae0a48b9bf13743f19ea8-init/merged/etc/mtab: file exists.
See 'docker run --help'.

The image ghcr.io/gvenzl/oracle-xe:11 is a more interesting case:

# docker run ghcr.io/gvenzl/oracle-xe:11
CONTAINER: starting up...
Oracle Database SYS and SYSTEM passwords have to be specified at first database startup.
Please specify a password either via the $ORACLE_PASSWORD variable, e.g. '-e ORACLE_PASSWORD=<password>'
or set the $ORACLE_RANDOM_PASSWORD environment variable to any value, e.g. '-e ORACLE_RANDOM_PASSWORD=yes'.

root@t:~/docker-test# docker run ghcr.io/gvenzl/oracle-xe:11
docker: Error response from daemon: symlink /proc/mounts /var/lib/docker/fuse-overlayfs/9ec3cd008b47b7dbde255c06916a9d8ee524e2efa283cec38f218a4fe260b692-init/merged/etc/mtab: file exists.
See 'docker run --help'.

root@t:~/docker-test# docker run ghcr.io/gvenzl/oracle-xe:11
docker: Error response from daemon: symlink /proc/mounts /var/lib/docker/fuse-overlayfs/fa8ce84ec1220c0f3c0368a2363e84a7f41110686db9e8b0a7d1faf20a1c2177-init/merged/etc/mtab: file exists.
See 'docker run --help'.

root@t:~/docker-test# docker run ghcr.io/gvenzl/oracle-xe:11
docker: Error response from daemon: symlink /proc/mounts /var/lib/docker/fuse-overlayfs/2f560c6956cfd5025d1f91a55faaa35f758cc57e19f725cdb7ad6d081c4900c2-init/merged/etc/mtab: file exists.
See 'docker run --help'.

root@t:~/docker-test# docker run ghcr.io/gvenzl/oracle-xe:11
docker: Error response from daemon: symlink /proc/mounts /var/lib/docker/fuse-overlayfs/36b87d3e9f286785587d2f9f9681635e9e236b7705b29f0281f6e92a35c7cc51-init/merged/etc/mtab: file exists.
See 'docker run --help'.

root@t:~/docker-test# docker run ghcr.io/gvenzl/oracle-xe:11
CONTAINER: starting up...
Oracle Database SYS and SYSTEM passwords have to be specified at first database startup.
Please specify a password either via the $ORACLE_PASSWORD variable, e.g. '-e ORACLE_PASSWORD=<password>'
or set the $ORACLE_RANDOM_PASSWORD environment variable to any value, e.g. '-e ORACLE_RANDOM_PASSWORD=yes'.

root@t:~/docker-test# docker run ghcr.io/gvenzl/oracle-xe:11
CONTAINER: starting up...
Oracle Database SYS and SYSTEM passwords have to be specified at first database startup.
Please specify a password either via the $ORACLE_PASSWORD variable, e.g. '-e ORACLE_PASSWORD=<password>'
or set the $ORACLE_RANDOM_PASSWORD environment variable to any value, e.g. '-e ORACLE_RANDOM_PASSWORD=yes'.

root@t:~/docker-test# docker run ghcr.io/gvenzl/oracle-xe:11
CONTAINER: starting up...
Oracle Database SYS and SYSTEM passwords have to be specified at first database startup.
Please specify a password either via the $ORACLE_PASSWORD variable, e.g. '-e ORACLE_PASSWORD=<password>'
or set the $ORACLE_RANDOM_PASSWORD environment variable to any value, e.g. '-e ORACLE_RANDOM_PASSWORD=yes'.

root@t:~/docker-test# docker run ghcr.io/gvenzl/oracle-xe:11
CONTAINER: starting up...
Oracle Database SYS and SYSTEM passwords have to be specified at first database startup.
Please specify a password either via the $ORACLE_PASSWORD variable, e.g. '-e ORACLE_PASSWORD=<password>'
or set the $ORACLE_RANDOM_PASSWORD environment variable to any value, e.g. '-e ORACLE_RANDOM_PASSWORD=yes'.

As you can see, the image sometimes starts correctly and sometimes fails with the same error as ubuntu:16.04.
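To put a number on how often this happens, the runs could be repeated in a loop. This is only a rough sketch: count_daemon_errors is a helper name I made up, and it relies on docker run exiting with status 125 when the error comes from the Docker daemon itself (as far as I know that is the documented behaviour), so the image's own non-zero exit for the missing password is not counted as a failure.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and count how many runs
# ended with exit status 125 (assumed to mean "Docker daemon error").
# Usage: count_daemon_errors N command [args...]
count_daemon_errors() {
    total=$1
    shift
    daemon_errors=0
    i=0
    while [ "$i" -lt "$total" ]; do
        rc=0
        # Discard the output; only the exit status matters here.
        "$@" >/dev/null 2>&1 || rc=$?
        if [ "$rc" -eq 125 ]; then
            daemon_errors=$((daemon_errors + 1))
        fi
        i=$((i + 1))
    done
    # Print "<daemon errors>/<total runs>".
    echo "$daemon_errors/$total"
}
```

For example, count_daemon_errors 20 docker run --rm ghcr.io/gvenzl/oracle-xe:11 would print how many of 20 runs failed at the daemon level.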

I tried updating fuse-overlayfs to the latest git master, but this did not change anything.
I then tried to reproduce the issue on a native Linux installation with fuse-overlayfs (so without LXD), but was unable to reproduce it there.

So at this point I am pretty clueless about how this could be solved. Any ideas?

Container config:

# lxc config show t
architecture: x86_64
config:
  image.architecture: x86_64
  image.description: Ubuntu 22.04 LTS server (20230518)
  image.os: ubuntu
  image.release: jammy
  security.nesting: "true"
  security.syscalls.intercept.mknod: "true"
  security.syscalls.intercept.setxattr: "true"
  volatile.base_image: 549a42b59cab0e73421759479633d5a9356f6dea2ff1f26c9ff52bf6fc59e6ef
  volatile.cloud-init.instance-id: bb487f92-3799-4867-9452-24d2c395f91e
  volatile.eth0.host_name: veth910164a6
  volatile.eth0.hwaddr: 00:16:3e:bb:12:db
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: d1e63f13-770e-4e31-96cf-82939b6fc10e
  volatile.uuid.generation: d1e63f13-770e-4e31-96cf-82939b6fc10e
devices: {}
ephemeral: false
profiles:
- default
stateful: false
description: ""

Have you checked what happens if you do the same on the host without LXD?

As far as I understand, you just need to specify environment variables to run this image, which is clear from the error message.

Yes, without LXD I am unable to reproduce the issue.

Sorry for not being clear here. Yes, to use the image we need to specify the environment variables. But if you look at the output above, the image only runs sometimes, showing the error message you’re referring to. In the other cases, the error is the same “symlink… file exists” error.

So I did some more testing with this and found out that this only happens when using shiftfs. If shiftfs is disabled, I can’t reproduce the error (but then the startup time for a container is obviously way longer).
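To double-check whether shiftfs is really in play inside the LXD container, something like the following could be used. It is a minimal sketch, assuming that a shiftfs-backed mount shows up with filesystem type shiftfs in the third field of /proc/mounts; has_fstype is a made-up helper name, not an LXD or Docker command.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any mount of the given filesystem type
# is listed in a mounts table (default: /proc/mounts).
# Usage: has_fstype FSTYPE [MOUNTS_FILE]
has_fstype() {
    fstype=$1
    mounts_file=${2:-/proc/mounts}
    # Field 3 of each /proc/mounts line is the filesystem type.
    awk -v t="$fstype" '$3 == t { found = 1 } END { exit !found }' "$mounts_file"
}
```

Inside the container, has_fstype shiftfs && echo "shiftfs in use" would then report whether any mount is shifted this way.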

I compiled the steps to reproduce the issue:

# Make sure shiftfs is enabled to reproduce
snap set lxd shiftfs.enable=true
systemctl reload snap.lxd.daemon

# Create container and allow nesting
lxc launch ubuntu:22.04 docker
lxc config set docker security.nesting=true

# Enter bash from the container
lxc exec docker -- bash

# Install docker and fuse-overlayfs
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
apt update && apt install fuse-overlayfs -y

# Force docker to use fuse-overlayfs
echo '{"storage-driver": "fuse-overlayfs"}' > /etc/docker/daemon.json
systemctl restart docker

# Verify that fuse-overlayfs works
docker info | grep Storage

# Try to run this image (the easiest one to reproduce the issue with)
docker run ubuntu:16.04

Then I always get

# docker run ubuntu:16.04
docker: Error response from daemon: symlink /proc/mounts /var/lib/docker/fuse-overlayfs/36d1e7ad5eb15acc7c6adebabd2b8595997b37c552f282df51b46f767c2d6328-init/merged/etc/mtab: file exists.

So I am still not sure where the issue really comes from. Is it something fuse-overlayfs does not handle correctly? shiftfs? LXD?