Lxc exec results in "Error: Failed to retrieve PID of executing child process"

Yeah, my theory had been that close_range() isn’t available, so LXD falls back to close_inherited(), and that there’s a bug in that fallback. But it seems there isn’t.
At least I can’t reproduce the error on Ubuntu with a 5.3 kernel, LXD 4.14 and LXC 4.0.9. I’ve now installed openSUSE Leap 15.3, but I can’t seem to find snapd to install. I need to figure that out.

This is what I did:

2021-06-07 10:55:20 zypper ar --refresh https://download.opensuse.org/repositories/system:/snappy/openSUSE_Leap_15.3 snappy
2021-06-07 10:55:30 zypper --gpg-auto-import-keys ref
2021-06-07 10:56:06 zypper dup --from snappy
2021-06-07 10:56:18 zypper in snapd
2021-06-07 10:56:42 systemctl enable --now snapd
2021-06-07 10:56:57 reboot
2021-06-07 11:20:39 snap install lxd

FWIW the same message is in the separate openSUSE 15.3 box with lxd installed from the default repos:

surveyor:/var/log/lxd/opensuse # cat forkexec.log
Aborting attach to prevent leaking file descriptors into container
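
For anyone who wants to check whether they are hitting the same thing, here is a minimal sketch that looks for that message in an instance's forkexec.log. The function name is made up for illustration, and the log path is simply the one from the post above (distro package); the snap package may keep its logs elsewhere.

```shell
#!/bin/sh
# Hypothetical helper (not part of LXD): succeeds if the given forkexec.log
# contains the "Aborting attach" message quoted in this thread.
has_attach_abort() {
    log_file="$1"
    # -q: no output, -F: treat the message as a fixed string, not a regex.
    # A missing or unreadable file counts as "message not found".
    grep -qF "Aborting attach to prevent leaking file descriptors" "$log_file" 2>/dev/null
}

# Example (not run automatically; path as seen above for a container named "opensuse"):
#   has_attach_abort /var/log/lxd/opensuse/forkexec.log && echo "fd-leak abort seen"
```

If the function succeeds for your instance, the failing exec is most likely the same file descriptor cleanup problem discussed here rather than something inside the container.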

Hm, confused now

leap2:~ # uname -a
Linux leap2 5.3.18-lp152.78-default #1 SMP Tue Jun 1 14:53:21 UTC 2021 (556d823) x86_64 x86_64 x86_64 GNU/Linux
leap2:~ # lxc list
+------+-------+------+------+------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+-------+------+------+------+-----------+
leap2:~ # lxc ^C
leap2:~ # lxc launch images:alpine/edge alp1
Creating alp1
Starting alp1
leap2:~ # lxc shell alp1
alp1:~#
  driver: qemu | lxc
  driver_version: 5.2.0 | 4.0.9
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "false"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.3.18-lp152.78-default
  lxc_features:
    cgroup2: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "false"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: openSUSE Leap
  os_version: "15.3"
  project: default
  server: lxd
  server_clustered: false
  server_name: leap2
  server_pid: 4316
  server_version: "4.15"
  storage: btrfs
  storage_version: 4.15.1

Does an upgrade to 4.15 fix the issue for you?

lxd has been updated to 4.15.

I did notice that your particular kernel is a different build than the default from the current leap download page, however.

hydra:~ # uname -a
Linux hydra 5.3.18-57-default #1 SMP Wed Apr 28 10:54:41 UTC 2021 (ba3c2e9) x86_64 x86_64 x86_64 GNU/Linux
hydra:~ # lxc launch images:alpine/edge alp1
Creating alp1
Starting alp1
hydra:~ # lxc shell alp1
Error: Failed to retrieve PID of executing child process
hydra:~ # lxc info | egrep 'version|^\s*os'
api_version: "1.0"
driver_version: 4.0.9 | 5.2.0
kernel_version: 5.3.18-57-default
os_name: openSUSE Leap
os_version: "15.3"
server_version: "4.15"
storage_version: 4.15.1

I confirm there is a problem on Leap 15.3

[admin@naunas] ~   
❯ lxc exec atlas -- /bin/bash  
Error: Failed to retrieve PID of executing child process

[admin@naunas] ~  
❯ uname -a
Linux naunas 5.3.18-57-default #1 SMP Wed Apr 28 10:54:41 UTC 2021 (ba3c2e9) x86_64 x86_64 x86_64 GNU/Linux

[admin@naunas] ~  
❯ rpm -qa | grep lxd
lxd-4.15-lp153.88.1.x86_64
lxd-bash-completion-4.15-lp153.88.1.noarch

[admin@naunas] ~  
❯ lxc info --show-log atlas
Name: atlas
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/12/13 06:58 UTC
Status: Running
Type: container
Profiles: default
Pid: 5056
Ips:
  lo:   inet    127.0.0.1
  lo:   inet6   ::1
Resources:
  Processes: 18
  CPU usage:
    CPU usage (in seconds): 34
  Memory usage:
    Memory (current): 71.42MB
    Memory (peak): 249.84MB
  Network usage:
    eth0:
      Bytes received: 1.26kB
      Bytes sent: 90B
      Packets received: 21
      Packets sent: 1
    lo:
      Bytes received: 152.00kB
      Bytes sent: 152.00kB
      Packets received: 3040
      Packets sent: 3040

Log:

lxc atlas 20210609055540.561 ERROR    utils - utils.c:lxc_can_use_pidfd:1793 - Invalid argument - Kernel does not support waiting on processes through pidfds
lxc atlas 20210609055540.584 WARN     cgfsng - cgroups/cgfsng.c:fchowmodat:1296 - No such file or directory - Failed to fchownat(43, memory.oom.group, 500000001, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )

[admin@naunas] ~  
❯ lxc info
config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - XXX.XXX.XXX.XXX:8443
  architectures:
  - x86_64
  - i686
  certificate: | XXX
  certificate_fingerprint: XXX
  driver: qemu | lxc
  driver_version: 5.2.0 | 4.0.9
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "false"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.3.18-57-default
  lxc_features:
    cgroup2: "true"
    devpts_fd: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: openSUSE Leap
  os_version: "15.3"
  project: default
  server: lxd
  server_clustered: false
  server_name: naunas
  server_pid: 1944
  server_version: "4.15"
  storage: dir
  storage_version: "1"

But on Tumbleweed, LXD works normally:

[werwolf@power] ~   
❯ lxc exec atlas -- bash
[root@atlas ~]# exit
exit

[werwolf@power] ~   
❯ uname -a
Linux power 5.12.9-1-default #1 SMP Thu Jun 3 07:44:58 UTC 2021 (f17eb01) x86_64 x86_64 x86_64 GNU/Linux

[werwolf@power] ~   
❯ rpm -qa | grep lxd    
lxd-4.14-2.1.x86_64
lxd-bash-completion-4.14-2.1.noarch

Clarification: on Tumbleweed, exec works normally, but for some reason the automatic assignment of IP addresses to the container does not work.

Yeah you need to use the package in the distro – it’s in the default repos. I think you could in principle use snapd (there is a package for it IIRC) but I’ve never tried it.

I haven’t yet updated my server to Leap 15.3 so I haven’t run into this particular issue yet, but it works on Leap 15.2 and Tumbleweed (with the same package) so I agree this points towards a kernel version or some other system package version issue.

@Joshua_Newman Sorry for not responding to the BZ you opened – I did take a look at it when you first opened it, but I’ve been chasing down other LXD issues on openSUSE, so I didn’t get around to looking into this one deeply.

I’m having the same issue on a fresh install of openSUSE Leap 15.3. Looking forward to a fix or workaround!

@brauner

Well, OK. This is an unexpected twist… It seems that the openSUSE kernel has my close_range() syscall backported but not CLOSE_RANGE_UNSHARE (which was in the first version). So that’s why this is all messed up.

See

please.

lxc exec works with the fix now, thank you!!

Containers I create are not getting DHCP leases, but this might be an error on my part; not sure yet.

If you’re running the snap package it could well be related to one of these:

I’m using the packages in the openSUSE official repositories, except for the lxd package which I built myself to include the lxc exec fix, but I’ll read those just in case anything might apply.

Check dnsmasq is running, and if so then check your firewall isn’t preventing DHCP requests.
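
A rough sketch of those two checks, in case it helps. Everything specific here is an assumption on my part: the helper name is made up, `lxdbr0` is only LXD's usual default bridge name, and the firewall commands assume firewalld; the thread does not say which bridge or firewall was involved.

```shell
#!/bin/sh
# Hypothetical helper: reads `ss -uln` output on stdin and succeeds if
# something is listening on UDP port 67 (the DHCP server port).
has_dhcp_listener() {
    # Column 4 of `ss -uln` is "Local Address:Port"; match a port of exactly 67.
    awk '$4 ~ /:67$/ { found = 1 } END { exit (found ? 0 : 1) }'
}

# Example checks (not run automatically; bridge name and firewalld are assumptions):
#   pgrep -x dnsmasq >/dev/null && echo "dnsmasq is running"
#   ss -uln | has_dhcp_listener || echo "nothing is listening on UDP port 67"
#   firewall-cmd --get-zone-of-interface=lxdbr0
#   firewall-cmd --zone=trusted --change-interface=lxdbr0 --permanent
#   firewall-cmd --reload
```

If dnsmasq is running and listening but containers still get no lease, the firewall zone of the bridge is the next thing to look at.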

Ahhh, it was the firewall! Definitely did not expect that.

Thank you very much!

Perfect, that did the trick.
Thanks much for the very quick turnaround @brauner!

I’ve backported the patch to the openSUSE packages, it should appear in Leap in a little while (in the meantime, you can use the devel packages in Virtualization:containers).
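
For reference, a sketch of how adding that devel repository might look, following the same URL pattern as the snappy repo earlier in the thread. The helper name, the `virt-containers` alias, and the `openSUSE_Leap_15.3` directory name for this project are assumptions; please check the actual path on download.opensuse.org before using it.

```shell
#!/bin/sh
# Hypothetical helper: build a download.opensuse.org repository URL from an
# OBS project name and a distribution directory name.
obs_repo_url() {
    project="$1"   # e.g. Virtualization:containers
    distro="$2"    # e.g. openSUSE_Leap_15.3 (assumed directory name)
    # Each ":" in the project name becomes ":/" in the URL path,
    # as in system:snappy -> system:/snappy above.
    project_path=$(printf '%s' "$project" | sed 's|:|:/|g')
    printf 'https://download.opensuse.org/repositories/%s/%s\n' "$project_path" "$distro"
}

# Example (run as root; not executed automatically; alias is made up):
#   zypper ar --refresh "$(obs_repo_url Virtualization:containers openSUSE_Leap_15.3)" virt-containers
#   zypper --gpg-auto-import-keys ref
#   zypper in --from virt-containers lxd
```

zypper may ask you to confirm a vendor change when switching the lxd package to the devel repository.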