wrkilu
(wrkilu)
January 8, 2023, 8:36pm
1
Hi,
Host server: latest CentOS 7 (7.9.2009)
LXD container: also latest CentOS 7
LXD package version on the host: 5.9-9879096, installed via snap.
After some time the container partially breaks. I can still ping any Internet IP from inside it, so the network appears to work, but the df command (and others) returns errors like the following:
df: ‘/proc/cpuinfo’: Transport endpoint is not connected
df: ‘/proc/diskstats’: Transport endpoint is not connected
df: ‘/proc/loadavg’: Transport endpoint is not connected
df: ‘/proc/meminfo’: Transport endpoint is not connected
df: ‘/proc/slabinfo’: Transport endpoint is not connected
df: ‘/proc/stat’: Transport endpoint is not connected
df: ‘/proc/swaps’: Transport endpoint is not connected
df: ‘/proc/uptime’: Transport endpoint is not connected
df: ‘/sys/devices/system/cpu/online’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/blkio’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/cpu’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/cpuset’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/devices’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/freezer’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/hugetlb’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/memory’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/net_cls’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/perf_event’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/pids’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/systemd’: Transport endpoint is not connected
And on the host server there is this error in dmesg:
[309274.787698] lxcfs[29285]: segfault at 0 ip 00007f96c28d03ce sp 00007f96c15dcc38 error 4 in libc-2.31.so[7f96c2848000+178000]
Does anybody have an idea how to solve this issue?
Thank you.
stgraber
(Stéphane Graber)
January 8, 2023, 11:22pm
2
Hmm, yeah, that’s an LXCFS crash.
Does that happen repeatedly?
To recover from this you’d need to:
systemctl reload snap.lxd.daemon
lxc restart --all
That will reload LXD (and restart LXCFS), then restart all the containers on the system.
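A quick way to confirm the recovery worked is something like the sketch below; `<container>` is a placeholder for your container’s name, and the single lxcfs process is an assumption based on the snap packaging:
```
# A fresh lxcfs should be running again after the reload...
pgrep -a lxcfs

# ...and the lxcfs-backed /proc files inside a restarted container
# should be readable again instead of returning ENOTCONN.
lxc exec <container> -- cat /proc/uptime
```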
amikhalitsyn
(Aleksandr Mikhalitsyn)
January 9, 2023, 1:22pm
3
Looks related: lxcfs crash on lxd 5.9 rev 24164 · Issue #573 · lxc/lxcfs · GitHub
@wrkilu you need to set up your core_pattern the same way as I’ve described here:
opened 01:45PM - 21 Dec 22 UTC
Due to https://discuss.linuxcontainers.org/t/number-of-cpus-reported-by-proc-stat-fluctuates-causing-issues/15780 we are running LXD 5.9 revision 24164. After running for a few days, lxcfs crashed:
```
Dec 21 05:33:41 kernel: show_signal_msg: 14 callbacks suppressed
Dec 21 05:33:41 kernel: lxcfs[3219179]: segfault at 0 ip 00007f8084afdf81 sp 00007f8084a2e780 error 6 in libc-2.31.so[7f8084a94000+178000]
Dec 21 05:33:41 kernel: Code: 00 00 4c 89 ef 4c 89 4c 24 08 e8 3a 68 00 00 48 89 e9 4c 89 e2 48 89 ee 48 8d 05 2a d2 15 00 4c 89 ef 48 89 84 24 e8 00 00 00 <c6> 45 00 00 e8 06 7e 00 00 89 d9 4c 89 fa 4c 89 f6 4c 89 ef e8 c6
```
This is Ubuntu 22.04.1 running kernel 5.15.0-56-generic on an AMD EPYC 7702P (128-thread) system with 512 GB RAM.
```
lxc info
config:
  core.https_address: '[::]:8443'
  core.trust_password: true
  images.auto_update_interval: "0"
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - ...:8443
  architectures:
  - x86_64
  - i686
  certificate: ...
  certificate_fingerprint: ...
  driver: qemu | lxc
  driver_version: 7.1.0 | 5.0.1
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.15.0-56-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "22.04"
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: server.domain.com
  server_pid: 1657650
  server_version: "5.9"
  storage: zfs
  storage_version: 2.1.4-0ubuntu0.1
  storage_supported_drivers:
  - name: zfs
    version: 2.1.4-0ubuntu0.1
    remote: false
  - name: btrfs
    version: 5.4.1
    remote: false
  - name: ceph
    version: 15.2.17
    remote: true
  - name: cephfs
    version: 15.2.17
    remote: true
  - name: cephobject
    version: 15.2.17
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.45.0
    remote: false
```
As requested, further information:
```
service apport status
● apport.service - LSB: automatic crash report generation
     Loaded: loaded (/etc/init.d/apport; generated)
     Active: active (exited) since Sat 2022-12-17 08:45:41 CET; 4 days ago
       Docs: man:systemd-sysv-generator(8)
        CPU: 27ms

Dec 17 08:45:40 server.domain.com systemd[1]: Starting LSB: automatic crash report generation...
Dec 17 08:45:41 server.domain.com apport[3908]:  * Starting automatic crash report generation: apport
Dec 17 08:45:41 server.domain.com apport[3908]:    ...done.
Dec 17 08:45:41 server.domain.com systemd[1]: Started LSB: automatic crash report generation.
```
```
cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E
```
```
ls -la /var/crash
total 8
drwxrwsrwt 2 root whoopsie 4096 Nov 15 06:25 .
drwxr-xr-x 15 root root 4096 Nov 13 20:56 ..
```
```
ls -la /var/lib/apport/coredump/
total 8
drwxr-xr-x 2 root root 4096 Oct 27 2021 .
drwxr-xr-x 3 root root 4096 Oct 27 2021 ..
```
```
cat /var/log/apport.log
ERROR: apport (pid 3551728) Wed Dec 21 05:33:41 2022: host pid 5646 crashed in a separate mount namespace, ignoring
```
Unfortunately no dumps are available, and the LXD log shows nothing of interest around the time of the crash:
```
journalctl -u snap.lxd.daemon
Dec 17 08:49:16 server.domain.com lxd.daemon[5419]: => LXD is ready
Dec 17 09:09:32 server.domain.com lxd.daemon[5660]: time="2022-12-17T09:09:32+01:00" level=warning msg="Detected poll(POLLNVAL) event: exiting"
Dec 21 12:22:29 server.domain.com systemd[1]: Stopping Service for snap application lxd.daemon...
Dec 21 12:22:29 server.domain.com lxd.daemon[1626376]: => Stop reason is: host shutdown
```
Please tell me if I should modify any configuration to catch the next possible crash.
That is needed to catch the core dump. BTW, you can check your current /proc/sys/kernel/core_pattern; if we are lucky, you may already have a core dump collected in /var/crash/...
amikhalitsyn
(Aleksandr Mikhalitsyn)
January 9, 2023, 2:09pm
4
@wrkilu could you also check your kernel logs for a line with:
kernel: Code
It should follow the line with the “segfault” info. Please post it too.
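For example, something like this should print both lines together (a rough sketch):
```
# Print each lxcfs segfault line plus the "Code:" line that the
# kernel logs right after it.
dmesg -T | grep -A1 'lxcfs.*segfault'
```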
wrkilu
(wrkilu)
January 9, 2023, 2:33pm
5
@stgraber
I ran “lxc stop test1 -f”, then “systemctl reload snap.lxd.daemon”.
The container started again and worked for about 3 hours, and then the problem came back…
@amikhalitsyn
There are no other important lines in dmesg around this segfault:
[424483.934921] lxdbr0: port 1(veth4aec45e0) entered forwarding state
[431674.370712] lxcfs[1237]: segfault at 0 ip 00007f89b23d83ce sp 00007f89b09a1c38 error 4 in libc-2.31.so[7f89b2350000+178000]
[436810.186476] logflags DROP IN=enp4s0 OUT=enp4s0 MAC=54:04:a6:f1:77:83:30:b6:4f:d8:00:d2:08:00
I should also mention that this container (I don’t have others yet) has a 1 GB RAM limit, and after “systemctl reload snap.lxd.daemon” it started seeing the host’s full RAM (16 GB). Then I rebooted it from inside and it came up with 1 GB again. And then, as I wrote, after about 3 hours it got these lxcfs errors.
wrkilu
(wrkilu)
January 9, 2023, 2:38pm
6
On the host:
cat /proc/sys/kernel/core_pattern
core
/var/crash is empty
stgraber
(Stéphane Graber)
January 9, 2023, 2:57pm
7
The memory reporting behavior you described is expected for the crash you’re experiencing. You need to reload LXD to have LXCFS restored, at which point restarting a container will have it use the new LXCFS instance and so report memory consumption correctly. Restarting the container prior to restarting LXCFS will leave it seeing the memory information of the host system.
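A quick way to tell which state a container is in (a hedged sketch, using the test1 container mentioned above):
```
# With a healthy LXCFS the container sees its 1 GB limit; a container
# restarted before LXCFS was restored sees the host's 16 GB instead.
lxc exec test1 -- grep MemTotal /proc/meminfo   # inside the container
grep MemTotal /proc/meminfo                     # on the host, for comparison
```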
Now, it’d be nice if we could indeed grab a core out of this thing, since you seem to have it in a state that’s mostly reproducible…
Any idea what may be happening inside your container at the time of the LXCFS crash?
If we can figure that out, then we could probably grab both strace and gdb output of the running lxcfs just as it crashes, which should give us what we need to sort this out.
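Something along these lines should do it (a rough sketch; note that strace and gdb both rely on ptrace, so attach only one of them at a time):
```
# Find the PID of the running lxcfs (the snap's instance).
pgrep -a lxcfs

# Option 1: log all syscalls (across threads) until the crash.
sudo strace -f -tt -o /tmp/lxcfs.strace -p "$(pgrep -o lxcfs)"

# Option 2: wait inside gdb for the SIGSEGV, then run: bt full
sudo gdb -p "$(pgrep -o lxcfs)" -ex continue
```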
amikhalitsyn
(Aleksandr Mikhalitsyn)
January 9, 2023, 3:13pm
8
Please change it to:
echo '|/bin/sh -c $@ -- eval exec cat > /var/crash/core-%e.%p' > /proc/sys/kernel/core_pattern
and try to repeat the actions which led to the crash.
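To check that the new pattern actually produces dumps before the real crash happens, you can crash a throwaway process (a hedged sketch; depending on the kernel, a non-zero core size limit may also be needed):
```
# Crash a harmless background process; a file named roughly
# core-sleep.<pid> should then appear under /var/crash.
ulimit -c unlimited
sleep 60 &
kill -SEGV $!
ls -la /var/crash/
```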
wrkilu
(wrkilu)
January 9, 2023, 3:19pm
9
@stgraber
It’s not doing anything yet; it’s a clean OS.
@amikhalitsyn
OK, I’ve run that command on the host.
Let’s wait for the next crash; hopefully we’ll get a crash dump.
wrkilu
(wrkilu)
January 10, 2023, 7:31pm
10
Still no crash so far. I’ll write here when it occurs.
wrkilu
(wrkilu)
January 11, 2023, 12:39am
12
I should add that the kernel on the host server is 3.10. Isn’t that too old? Maybe that’s the reason?
amikhalitsyn
(Aleksandr Mikhalitsyn)
January 11, 2023, 9:07am
13
No, userspace should work without crashes on any supported kernel. 3.10 (RHEL 7) is not ideal, but it’s okay.
amikhalitsyn
(Aleksandr Mikhalitsyn)
January 11, 2023, 10:36am
14
The same issue as reported yesterday: Handle NULL in releasedir by deleriux · Pull Request #575 · lxc/lxcfs · GitHub
(gdb) bt
#0 __strcmp_sse2_unaligned ()
at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:31
#1 0x00005577569c4508 in lxcfs_releasedir (path=0x0, fi=0x7f9242d6ac80)
at ../src/src/lxcfs.c:774
#2 0x00007f92441122b7 in ?? ()
#3 0x0000000000000007 in ?? ()
#4 0x0000000000000000 in ?? ()
(gdb) p *(struct fuse_file_info*)0x7f9242d6ac80
$1 = {flags = 0, writepage = 0, direct_io = 0, keep_cache = 0, flush = 0,
nonseekable = 0, flock_release = 0, cache_readdir = 0, padding = 0, padding2 = 0,
fh = 140266048610592, lock_owner = 0, poll_events = 0}
(gdb) p/x ((struct fuse_file_info*)0x7f9242d6ac80)->fh
$4 = 0x7f923c006120
(gdb) x/8xg 0x7f923c006120
0x7f923c006120: 0x00007f923c005610 0x00007f923c004460
0x7f923c006130: 0x0000000000000000 0x6770757800000000
0x7f923c006140: 0x0000000000000000 0x00007f9200000000
0x7f923c006150: 0x0000000000000040 0x00000000000000a5
(gdb) x/s 0x00007f923c005610
0x7f923c005610: "systemd"
(gdb) x/s 0x00007f923c004460
0x7f923c004460: "lxc.payload.complexupgrade/system.slice/systemd-sysusers.service"
Thanks, @wrkilu, for providing us with the core dump! I think it makes sense to continue catching core dumps for the LXCFS process. I suspect we have two different bugs, because in lxcfs crash on lxd 5.9 rev 24164 · Issue #573 · lxc/lxcfs · GitHub we crashed on a write (!), but in your case we crashed on a read.
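For reference, a backtrace like the one above can be produced by loading the collected core against the matching lxcfs binary, roughly like this (the binary path under the snap and the core file name are assumptions and may differ per system):
```
# Open the core with the matching binary and print a full backtrace.
sudo gdb /snap/lxd/current/bin/lxcfs /var/crash/core-lxcfs.<pid> -ex 'bt full'
```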
wrkilu
(wrkilu)
January 11, 2023, 11:52am
15
No problem, man. It’s me who should be thanking you, for LXD, not the other way around!
I still think LXD is awesome; many thanks to all of you maintainers!
Should I attach the second crash when it occurs?
amikhalitsyn
(Aleksandr Mikhalitsyn)
January 11, 2023, 11:55am
16
Should I attach the second crash when it occurs?
Yep, every piece of information may be valuable for debugging. I think we will release a new hotfix version of LXCFS soon, just to address this particular crash that you’ve already caught. I’ll notify you.
wrkilu
(wrkilu)
January 15, 2023, 6:58pm
17
There still hasn’t been another crash on my server.
Another question: when will you release the hotfix? Or… is there a way to downgrade LXD in snap to an older, known-good version?
amikhalitsyn
(Aleksandr Mikhalitsyn)
January 15, 2023, 10:43pm
18
I think the fix will be released this week. I can say that it makes no sense to downgrade, because this is not a regression. It’s an interesting question why you’ve started facing this issue (it may be related to our recent fix enabling direct I/O mode for lxcfs, but that is the only correct behavior).
wrkilu
(wrkilu)
January 16, 2023, 8:19am
19
This is my first container on a dedicated server with CentOS 7. I need virtualization, so I installed LXD and the problems occurred. I haven’t even installed any services in it; it has only an SSH server, nothing else. And it crashes randomly. So I’m simply waiting for a good version so that I can go ahead and install services in it…
amikhalitsyn
(Aleksandr Mikhalitsyn)
January 17, 2023, 10:55am
20
@wrkilu you can try sudo snap refresh lxd --channel=latest/candidate
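Afterwards you can confirm which revision and channel you ended up on, e.g.:
```
# Shows the installed LXD revision and the channel being tracked.
snap list lxd
```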