Slow ZFS performance on LXD instances

I have been testing I/O performance on instances and recently tried using a partition created through lxd init instead of a loop device to see if performance would improve. To my surprise, performance remained much slower than native.

I ran the following test on host (native ext4):

root@lxd01:~$ dd if=/dev/urandom of=/root/input bs=128k count=75k
76800+0 records in
76800+0 records out
10066329600 bytes (10 GB, 9.4 GiB) copied, 56.7984 s, 177 MB/s
root@lxd01:~# sync; echo 3 > /proc/sys/vm/drop_caches
root@lxd01:~# dd if=/root/input of=/root/test bs=128k count=75k conv=fdatasync
76800+0 records in
76800+0 records out
10066329600 bytes (10 GB, 9.4 GiB) copied, 113.324 s, 88.8 MB/s

And then the same for the container (ZFS partition):

root@lxd01:~# lxc exec container-test -- dd if=/dev/urandom of=/root/input bs=128k count=75k
76800+0 records in
76800+0 records out
10066329600 bytes (10 GB, 9.4 GiB) copied, 262.79 s, 38.3 MB/s
root@lxd01:~# sync; echo 3 > /proc/sys/vm/drop_caches
root@lxd01:~# lxc exec container-test -- dd if=/root/input of=/root/test bs=128k count=75k conv=fdatasync
76800+0 records in
76800+0 records out
10066329600 bytes (10 GB, 9.4 GiB) copied, 355.324 s, 28.3 MB/s

I was expecting some performance difference between ext4 and ZFS, but the ZFS numbers are much worse than I had anticipated. Does anyone else have a similar experience?
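
As a side note: I realise that dd reading from /dev/urandom can be limited by the random generator itself rather than by the disk, so the absolute numbers may be misleading. A rough alternative, assuming fio is installed (the job name below is arbitrary), would be:

# 10 GiB of sequential 128k writes with an fsync at the end; run it on the host
# and inside the container and compare the reported bandwidth
fio --name=seqwrite --directory=/root --rw=write --bs=128k --size=10G --end_fsync=1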

Output of lxc config show:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20201210)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20201210"
  image.type: squashfs
  image.version: "20.04"
  limits.cpu: "4"
  limits.memory: 4GB
  limits.memory.enforce: hard
  volatile.base_image: e0c3495ffd489748aa5151628fa56619e6143958f041223cb4970731ef939cb6
  volatile.eth0.host_name: vethb0f298bb
  volatile.eth0.hwaddr: 00:16:3e:3c:36:37
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 441ff04e-f3e2-4fb9-a3c2-63f627a16a9d
devices:
  root:
    path: /
    pool: partition
    size: 70GB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

Output of zpool status:

  pool: partition
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	partition   ONLINE       0     0     0
	  sda4      ONLINE       0     0     0

errors: No known data errors

Some extra information: the machine I'm running on is set up with RAID 6 across 4 SATA disks. Any input is appreciated.

Hi,

I have the same issue. Did you resolve it?

OP, I have some ideas, but I'd like a bit more information first. Could you run two commands on the host and report their output back? Both have to do with your ZFS configuration.

  1. zfs get all <name of lxd zpool>

  2. arc_summary

Thanks

Hi,

Number 1:

root@esx:~# zfs get all lxdpool
NAME PROPERTY VALUE SOURCE
lxdpool type filesystem -
lxdpool creation Fri Dec 30 6:40 2022 -
lxdpool used 1.25T -
lxdpool available 523G -
lxdpool referenced 24K -
lxdpool compressratio 1.00x -
lxdpool mounted no -
lxdpool quota none default
lxdpool reservation none default
lxdpool recordsize 128K default
lxdpool mountpoint legacy local
lxdpool sharenfs off default
lxdpool checksum on default
lxdpool compression off default
lxdpool atime off local
lxdpool devices on local
lxdpool exec on local
lxdpool setuid on local
lxdpool readonly off default
lxdpool zoned off default
lxdpool snapdir hidden default
lxdpool aclmode discard default
lxdpool aclinherit restricted default
lxdpool createtxg 1 -
lxdpool canmount on default
lxdpool xattr sa local
lxdpool copies 1 default
lxdpool version 5 -
lxdpool utf8only off -
lxdpool normalization none -
lxdpool casesensitivity sensitive -
lxdpool vscan off default
lxdpool nbmand off default
lxdpool sharesmb off default
lxdpool refquota none default
lxdpool refreservation none default
lxdpool guid 17894017618164977837 -
lxdpool primarycache all default
lxdpool secondarycache all default
lxdpool usedbysnapshots 0B -
lxdpool usedbydataset 24K -
lxdpool usedbychildren 1.25T -
lxdpool usedbyrefreservation 0B -
lxdpool logbias latency default
lxdpool objsetid 54 -
lxdpool dedup off default
lxdpool mlslabel none default
lxdpool sync standard default
lxdpool dnodesize legacy default
lxdpool refcompressratio 1.00x -
lxdpool written 24K -
lxdpool logicalused 1.24T -
lxdpool logicalreferenced 12K -
lxdpool volmode default default
lxdpool filesystem_limit none default
lxdpool snapshot_limit none default
lxdpool filesystem_count none default
lxdpool snapshot_count none default
lxdpool snapdev hidden default
lxdpool acltype posix local
lxdpool context none default
lxdpool fscontext none default
lxdpool defcontext none default
lxdpool rootcontext none default
lxdpool relatime on local
lxdpool redundant_metadata all default
lxdpool overlay on default
lxdpool encryption off default
lxdpool keylocation none default
lxdpool keyformat none default
lxdpool pbkdf2iters 0 default
lxdpool special_small_blocks 0 default

Number 2:

root@esx:~# arc_summary

ZFS Subsystem Report Mon Jan 02 13:15:39 2023
Linux 5.10.0-20-amd64 2.0.3-9
Machine: esx (x86_64) 2.0.3-9
ARC status: HEALTHY
Memory throttle count: 0
ARC size (current): 27.9 % 17.5 GiB
Target size (adaptive): 28.0 % 17.5 GiB
Min size (hard limit): 6.2 % 3.9 GiB
Max size (high water): 16:1 62.8 GiB
Most Frequently Used (MFU) cache size: 31.5 % 5.3 GiB
Most Recently Used (MRU) cache size: 68.5 % 11.6 GiB
Metadata cache size (hard limit): 75.0 % 47.1 GiB
Metadata cache size (current): 1.8 % 864.7 MiB
Dnode cache size (hard limit): 10.0 % 4.7 GiB
Dnode cache size (current): 2.9 % 140.9 MiB
ARC hash breakdown:
Elements max: 5.7M
Elements current: 11.2 % 641.8k
Collisions: 17.8M
Chain max: 6
Chains: 11.9k
ARC misc:
Deleted: 124.8M
Mutex misses: 2.3k
Eviction skips: 6.5M
ARC total accesses (hits + misses): 814.2M
Cache hit ratio: 93.9 % 764.4M
Cache miss ratio: 6.1 % 49.8M
Actual hit ratio (MFU + MRU hits): 93.6 % 761.9M
Data demand efficiency: 99.0 % 509.9M
Data prefetch efficiency: 15.2 % 51.7M
Cache hits by cache type:
Most frequently used (MFU): 87.9 % 671.8M
Most recently used (MRU): 11.8 % 90.1M
Most frequently used (MFU) ghost: 0.3 % 2.4M
Most recently used (MRU) ghost: 0.3 % 2.4M
Cache hits by data type:
Demand data: 66.0 % 504.8M
Demand prefetch data: 1.0 % 7.9M
Demand metadata: 32.8 % 250.9M
Demand prefetch metadata: 0.1 % 895.0k
Cache misses by data type:
Demand data: 10.3 % 5.1M
Demand prefetch data: 88.0 % 43.8M
Demand metadata: 1.4 % 720.5k
Demand prefetch metadata: 0.3 % 145.1k
DMU prefetch efficiency: 477.2M
Hit ratio: 5.9 % 28.1M
Miss ratio: 94.1 % 449.2M
L2ARC not detected, skipping section
Solaris Porting Layer (SPL):
spl_hostid 0
spl_hostid_path /etc/hostid
spl_kmem_alloc_max 1048576
spl_kmem_alloc_warn 65536
spl_kmem_cache_kmem_threads 4
spl_kmem_cache_magazine_size 0
spl_kmem_cache_max_size 32
spl_kmem_cache_obj_per_slab 8
spl_kmem_cache_reclaim 0
spl_kmem_cache_slab_limit 16384
spl_max_show_tasks 512
spl_panic_halt 0
spl_schedule_hrtimeout_slack_us 0
spl_taskq_kick 0
spl_taskq_thread_bind 0
spl_taskq_thread_dynamic 1
spl_taskq_thread_priority 1
spl_taskq_thread_sequential 4
Tunables:
dbuf_cache_hiwater_pct 10
dbuf_cache_lowater_pct 10
dbuf_cache_max_bytes 18446744073709551615
dbuf_cache_shift 5
dbuf_metadata_cache_max_bytes 18446744073709551615
dbuf_metadata_cache_shift 6
dmu_object_alloc_chunk_shift 7
dmu_prefetch_max 134217728
ignore_hole_birth 1
l2arc_feed_again 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom 2
l2arc_headroom_boost 200
l2arc_meta_percent 33
l2arc_mfuonly 0
l2arc_noprefetch 1
l2arc_norw 0
l2arc_rebuild_blocks_min_l2size 1073741824
l2arc_rebuild_enabled 1
l2arc_trim_ahead 0
l2arc_write_boost 8388608
l2arc_write_max 8388608
metaslab_aliquot 524288
metaslab_bias_enabled 1
metaslab_debug_load 0
metaslab_debug_unload 0
metaslab_df_max_search 16777216
metaslab_df_use_largest_segment 0
metaslab_force_ganging 16777217
metaslab_fragmentation_factor_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_preload_enabled 1
metaslab_unload_delay 32
metaslab_unload_delay_ms 600000
send_holes_without_birth_time 1
spa_asize_inflation 24
spa_config_path /etc/zfs/zpool.cache
spa_load_print_vdev_tree 0
spa_load_verify_data 1
spa_load_verify_metadata 1
spa_load_verify_shift 4
spa_slop_shift 5
vdev_file_logical_ashift 9
vdev_file_physical_ashift 9
vdev_removal_max_span 32768
vdev_validate_skip 0
zap_iterate_prefetch 1
zfetch_array_rd_sz 1048576
zfetch_max_distance 8388608
zfetch_max_idistance 67108864
zfetch_max_streams 8
zfetch_min_sec_reap 2
zfs_abd_scatter_enabled 1
zfs_abd_scatter_max_order 10
zfs_abd_scatter_min_size 1536
zfs_admin_snapshot 0
zfs_allow_redacted_dataset_mount 0
zfs_arc_average_blocksize 8192
zfs_arc_dnode_limit 0
zfs_arc_dnode_limit_percent 10
zfs_arc_dnode_reduce_percent 10
zfs_arc_evict_batch_limit 10
zfs_arc_eviction_pct 200
zfs_arc_grow_retry 0
zfs_arc_lotsfree_percent 10
zfs_arc_max 0
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_limit 0
zfs_arc_meta_limit_percent 75
zfs_arc_meta_min 0
zfs_arc_meta_prune 10000
zfs_arc_meta_strategy 1
zfs_arc_min 0
zfs_arc_min_prefetch_ms 0
zfs_arc_min_prescient_prefetch_ms 0
zfs_arc_p_dampener_disable 1
zfs_arc_p_min_shift 0
zfs_arc_pc_percent 0
zfs_arc_shrink_shift 0
zfs_arc_shrinker_limit 10000
zfs_arc_sys_free 0
zfs_async_block_max_blocks 18446744073709551615
zfs_autoimport_disable 1
zfs_checksum_events_per_second 20
zfs_commit_timeout_pct 5
zfs_compressed_arc_enabled 1
zfs_condense_indirect_commit_entry_delay_ms 0
zfs_condense_indirect_vdevs_enable 1
zfs_condense_max_obsolete_bytes 1073741824
zfs_condense_min_mapping_bytes 131072
zfs_dbgmsg_enable 1
zfs_dbgmsg_maxsize 4194304
zfs_dbuf_state_index 0
zfs_ddt_data_is_special 1
zfs_deadman_checktime_ms 60000
zfs_deadman_enabled 1
zfs_deadman_failmode wait
zfs_deadman_synctime_ms 600000
zfs_deadman_ziotime_ms 300000
zfs_dedup_prefetch 0
zfs_delay_min_dirty_percent 60
zfs_delay_scale 500000
zfs_delete_blocks 20480
zfs_dirty_data_max 4294967296
zfs_dirty_data_max_max 4294967296
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_dirty_data_sync_percent 20
zfs_disable_ivset_guid_check 0
zfs_dmu_offset_next_sync 0
zfs_expire_snapshot 300
zfs_fallocate_reserve_percent 110
zfs_flags 0
zfs_free_bpobj_enabled 1
zfs_free_leak_on_eio 0
zfs_free_min_time_ms 1000
zfs_history_output_max 1048576
zfs_immediate_write_sz 32768
zfs_initialize_chunk_size 1048576
zfs_initialize_value 16045690984833335022
zfs_keep_log_spacemaps_at_export 0
zfs_key_max_salt_uses 400000000
zfs_livelist_condense_new_alloc 0
zfs_livelist_condense_sync_cancel 0
zfs_livelist_condense_sync_pause 0
zfs_livelist_condense_zthr_cancel 0
zfs_livelist_condense_zthr_pause 0
zfs_livelist_max_entries 500000
zfs_livelist_min_percent_shared 75
zfs_lua_max_instrlimit 100000000
zfs_lua_max_memlimit 104857600
zfs_max_async_dedup_frees 100000
zfs_max_log_walking 5
zfs_max_logsm_summary_length 10
zfs_max_missing_tvds 0
zfs_max_nvlist_src_size 0
zfs_max_recordsize 1048576
zfs_metaslab_fragmentation_threshold 70
zfs_metaslab_max_size_cache_sec 3600
zfs_metaslab_mem_limit 75
zfs_metaslab_segment_weight_enabled 1
zfs_metaslab_switch_threshold 2
zfs_mg_fragmentation_threshold 95
zfs_mg_noalloc_threshold 0
zfs_min_metaslabs_to_flush 1
zfs_multihost_fail_intervals 10
zfs_multihost_history 0
zfs_multihost_import_intervals 20
zfs_multihost_interval 1000
zfs_multilist_num_sublists 0
zfs_no_scrub_io 0
zfs_no_scrub_prefetch 0
zfs_nocacheflush 0
zfs_nopwrite_enabled 1
zfs_object_mutex_size 64
zfs_obsolete_min_time_ms 500
zfs_override_estimate_recordsize 0
zfs_pd_bytes_max 52428800
zfs_per_txg_dirty_frees_percent 5
zfs_prefetch_disable 0
zfs_read_history 0
zfs_read_history_hits 0
zfs_rebuild_max_segment 1048576
zfs_reconstruct_indirect_combinations_max 4096
zfs_recover 0
zfs_recv_queue_ff 20
zfs_recv_queue_length 16777216
zfs_recv_write_batch_size 1048576
zfs_removal_ignore_errors 0
zfs_removal_suspend_progress 0
zfs_remove_max_segment 16777216
zfs_resilver_disable_defer 0
zfs_resilver_min_time_ms 3000
zfs_scan_checkpoint_intval 7200
zfs_scan_fill_weight 3
zfs_scan_ignore_errors 0
zfs_scan_issue_strategy 0
zfs_scan_legacy 0
zfs_scan_max_ext_gap 2097152
zfs_scan_mem_lim_fact 20
zfs_scan_mem_lim_soft_fact 20
zfs_scan_strict_mem_lim 0
zfs_scan_suspend_progress 0
zfs_scan_vdev_limit 4194304
zfs_scrub_min_time_ms 1000
zfs_send_corrupt_data 0
zfs_send_no_prefetch_queue_ff 20
zfs_send_no_prefetch_queue_length 1048576
zfs_send_queue_ff 20
zfs_send_queue_length 16777216
zfs_send_unmodified_spill_blocks 1
zfs_slow_io_events_per_second 20
zfs_spa_discard_memory_limit 16777216
zfs_special_class_metadata_reserve_pct 25
zfs_sync_pass_deferred_free 2
zfs_sync_pass_dont_compress 8
zfs_sync_pass_rewrite 2
zfs_sync_taskq_batch_pct 75
zfs_trim_extent_bytes_max 134217728
zfs_trim_extent_bytes_min 32768
zfs_trim_metaslab_skip 0
zfs_trim_queue_limit 10
zfs_trim_txg_batch 32
zfs_txg_history 100
zfs_txg_timeout 5
zfs_unflushed_log_block_max 262144
zfs_unflushed_log_block_min 1000
zfs_unflushed_log_block_pct 400
zfs_unflushed_max_mem_amt 1073741824
zfs_unflushed_max_mem_ppm 1000
zfs_unlink_suspend_progress 0
zfs_user_indirect_is_special 1
zfs_vdev_aggregate_trim 0
zfs_vdev_aggregation_limit 1048576
zfs_vdev_aggregation_limit_non_rotating 131072
zfs_vdev_async_read_max_active 3
zfs_vdev_async_read_min_active 1
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_max_active 10
zfs_vdev_async_write_min_active 2
zfs_vdev_cache_bshift 16
zfs_vdev_cache_max 16384
zfs_vdev_cache_size 0
zfs_vdev_default_ms_count 200
zfs_vdev_default_ms_shift 29
zfs_vdev_initializing_max_active 1
zfs_vdev_initializing_min_active 1
zfs_vdev_max_active 1000
zfs_vdev_max_auto_ashift 16
zfs_vdev_min_auto_ashift 9
zfs_vdev_min_ms_count 16
zfs_vdev_mirror_non_rotating_inc 0
zfs_vdev_mirror_non_rotating_seek_inc 1
zfs_vdev_mirror_rotating_inc 0
zfs_vdev_mirror_rotating_seek_inc 5
zfs_vdev_mirror_rotating_seek_offset 1048576
zfs_vdev_ms_count_limit 131072
zfs_vdev_nia_credit 5
zfs_vdev_nia_delay 5
zfs_vdev_queue_depth_pct 1000
zfs_vdev_raidz_impl cycle [fastest] original scalar sse2 ssse3 avx2 avx512f avx512bw
zfs_vdev_read_gap_limit 32768
zfs_vdev_rebuild_max_active 3
zfs_vdev_rebuild_min_active 1
zfs_vdev_removal_max_active 2
zfs_vdev_removal_min_active 1
zfs_vdev_scheduler unused
zfs_vdev_scrub_max_active 3
zfs_vdev_scrub_min_active 1
zfs_vdev_sync_read_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_write_min_active 10
zfs_vdev_trim_max_active 2
zfs_vdev_trim_min_active 1
zfs_vdev_write_gap_limit 4096
zfs_vnops_read_chunk_size 1048576
zfs_zevent_cols 80
zfs_zevent_console 0
zfs_zevent_len_max 384
zfs_zevent_retain_expire_secs 900
zfs_zevent_retain_max 2000
zfs_zil_clean_taskq_maxalloc 1048576
zfs_zil_clean_taskq_minalloc 1024
zfs_zil_clean_taskq_nthr_pct 100
zil_maxblocksize 131072
zil_nocacheflush 0
zil_replay_disable 0
zil_slog_bulk 786432
zio_deadman_log_all 0
zio_dva_throttle_enabled 1
zio_requeue_io_start_cut_in_line 1
zio_slow_io_ms 30000
zio_taskq_batch_pct 75
zvol_inhibit_dev 0
zvol_major 230
zvol_max_discard_blocks 16384
zvol_prefetch_bytes 131072
zvol_request_sync 0
zvol_threads 32
zvol_volmode 1
VDEV cache disabled, skipping section
ZIL committed transactions: 28.5M
Commit requests: 2.1M
Flushes to stable storage: 2.1M
Transactions to SLOG storage pool: 0 Bytes 0
Transactions to non-SLOG storage pool: 133.2 GiB 2.7M
root@esx:~#

P.S. I have 4 SSD drives in a RAID 10 (ZFS) where I store all VMs and containers.

Hi @ckruijntjens,
If you have enough RAM, you can increase the zfs_arc_min and zfs_arc_max values in zfs.conf under /etc/modprobe.d; don't forget to execute sudo update-initramfs -u -k all after changing the configuration.
You also have L2ARC and SLOG options if you have a faster NVMe disk; with one of those you can multiply these speeds.
My zfs values:

options zfs zfs_arc_min=1000000000
options zfs zfs_arc_max=2147483648
options zfs enable-xattr
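
If you want to check what the module is actually using at runtime, the live values are exposed in /proc and /sys (a quick sketch; these paths are standard for OpenZFS on Linux):

# configured module parameters (0 means the built-in default is in effect)
cat /sys/module/zfs/parameters/zfs_arc_min /sys/module/zfs/parameters/zfs_arc_max

# current ARC target, limits and size, in bytes
grep -E '^(c|c_min|c_max|size) ' /proc/spl/kstat/zfs/arcstats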

P.S. Don't disable the sync parameter in ZFS; the default value on my pool is as follows.
NAME PROPERTY VALUE SOURCE
zpool sync standard default
Regards.

I run this on Debian.

Do I need to create the zfs.conf file? There is no zfs.conf in /etc/modprobe.d/.

I added the options.

In a VM that is stored on the ZFS pool, an apt upgrade goes very slowly. Any idea?

@ckruijntjens, can you post the output of lxc config show <problematic_vm> --expanded, and of ps fauxww run inside the container?
Regards.

After you change the ZFS configuration and update the initramfs, you need to reboot the host. Note that I posted those zfs_arc_min and zfs_arc_max values for 16GB of RAM; you may need to increase them for your setup.
Regards.

root@esx:/home/downloads# lxc config show warehouse --expanded
architecture: x86_64
config:
  boot.autostart: "true"
  image.architecture: amd64
  image.description: Debian bullseye amd64 (20221214_05:25)
  image.os: Debian
  image.release: bullseye
  image.serial: "20221214_05:25"
  image.type: squashfs
  image.variant: cloud
  security.nesting: "true"
  security.privileged: "true"
  volatile.base_image: 832e37fe9c74647b560eec03dfa7343d6854dbc53ad574d3be2d8edbfd026217
  volatile.cloud-init.instance-id: 3797f4f8-18e7-4eae-abc6-17f0043d4692
  volatile.eth0.host_name: veth1afb9215
  volatile.eth0.hwaddr: 00:16:3e:56:c3:9a
  volatile.idmap.base: "0"
  volatile.idmap.current: '
  volatile.idmap.next: '
  volatile.last_state.idmap: '
  volatile.last_state.power: RUNNING
  volatile.uuid: 984af78c-85f9-4f36-ba14-5f671a3829a1
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: bridge1
    type: nic
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

Well, for the 2nd command I need to install software (160MB). This is going to take an hour to install on the ZFS pool.

When I change zfs_arc_min and zfs_arc_max, there is no change in speed.

@ckruijntjens, what is your pool name? Please post the lxc storage ls command output.
For the second command, ps fauxww, you don't need to install any packages.
Regards.

root@esx:/home/downloads# lxc storage ls
+---------+--------+---------+-------------+---------+---------+
|  NAME   | DRIVER | SOURCE  | DESCRIPTION | USED BY |  STATE  |
+---------+--------+---------+-------------+---------+---------+
| default | zfs    | lxdpool |             | 18      | CREATED |
+---------+--------+---------+-------------+---------+---------+

Here is the output of the 2nd command.

root@warehouse:~# ps fauxww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 18324 7288 ? Ss 20:26 0:00 /sbin/init
root 75 0.0 0.0 61996 12588 ? Ss 20:26 0:00 /lib/systemd/systemd-journald
avahi 171 0.0 0.0 7344 2812 ? Ss 20:27 0:00 avahi-daemon: running [warehouse.local]
avahi 183 0.0 0.0 7152 348 ? S 20:27 0:00 _ avahi-daemon: chroot helper
root 172 0.0 0.0 6748 1628 ? Ss 20:27 0:00 /usr/sbin/cron -f
message+ 173 0.0 0.0 8268 3080 ? Ss 20:27 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root 174 0.0 0.0 233504 4516 ? Ssl 20:27 0:00 /usr/libexec/polkitd --no-debug
root 175 0.0 0.0 14004 5156 ? Ss 20:27 0:00 /lib/systemd/systemd-logind
root 176 0.0 0.0 14620 3680 ? Ss 20:27 0:00 /sbin/wpa_supplicant -u -s -O /run/wpa_supplicant
odoo 178 0.9 0.1 290004 132216 ? Ssl 20:27 0:08 /usr/bin/python3 /usr/bin/odoo --config /etc/odoo/odoo.conf --logfile /var/log/odoo/odoo-server.log
root 182 0.0 0.0 5480 1280 pts/0 Ss+ 20:27 0:00 /sbin/agetty -o -p – \u --noclear --keep-baud console 115200,38400,9600 vt220
root 190 0.0 0.0 13356 4748 ? Ss 20:27 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root 1590 0.0 0.0 14952 6924 ? Ss 20:43 0:00 _ sshd: root@pts/1
root 1612 0.0 0.0 7164 2828 pts/1 Ss 20:43 0:00 _ -bash
root 1617 0.0 0.0 9884 2284 pts/1 R+ 20:43 0:00 _ ps fauxww
root 191 0.0 0.0 314776 7204 ? Ssl 20:27 0:00 /usr/sbin/ModemManager
postgres 200 0.0 0.0 213280 21220 ? Ss 20:27 0:00 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file=/etc/postgresql/13/main/postgresql.conf
postgres 202 0.0 0.0 213380 7552 ? Ss 20:27 0:00 _ postgres: 13/main: checkpointer
postgres 203 0.0 0.0 213280 6748 ? Ss 20:27 0:00 _ postgres: 13/main: background writer
postgres 204 0.0 0.0 213280 8904 ? Ss 20:27 0:00 _ postgres: 13/main: walwriter
postgres 205 0.0 0.0 213816 6088 ? Ss 20:27 0:00 _ postgres: 13/main: autovacuum launcher
postgres 206 0.0 0.0 67948 4496 ? Ss 20:27 0:00 _ postgres: 13/main: stats collector
postgres 207 0.0 0.0 213836 5520 ? Ss 20:27 0:00 _ postgres: 13/main: logical replication launcher
postgres 465 0.0 0.0 219736 25248 ? Ss 20:27 0:00 _ postgres: 13/main: odoo odoo [local] idle
postgres 483 0.0 0.0 214224 10588 ? Ss 20:27 0:00 _ postgres: 13/main: odoo postgres [local] idle
postgres 484 0.0 0.0 214224 9544 ? Ss 20:27 0:00 _ postgres: 13/main: odoo postgres [local] idle
postgres 488 0.0 0.0 217564 24764 ? Ss 20:28 0:00 _ postgres: 13/main: odoo odoo [local] idle
postgres 489 0.0 0.0 217460 24472 ? Ss 20:28 0:00 _ postgres: 13/main: odoo odoo [local] idle
postgres 490 0.0 0.0 217556 24840 ? Ss 20:28 0:00 _ postgres: 13/main: odoo odoo [local] idle
postgres 491 0.0 0.0 217540 23704 ? Ss 20:28 0:00 _ postgres: 13/main: odoo odoo [local] idle
Debian-+ 473 0.0 0.0 18628 3696 ? Ss 20:27 0:00 /usr/sbin/exim4 -bd -q30m
root 1593 0.0 0.0 15188 6888 ? Ss 20:43 0:00 /lib/systemd/systemd --user
root 1594 0.0 0.0 21444 2584 ? S 20:43 0:00 _ (sd-pam)

ckruijntjens,

From what I can tell (and I am self-taught, so take this with a grain of salt), you have a few things going on in ZFS that, unless there is a very specific use case for them, you probably don't want to be doing. All of the instructions below pertain to the host.

TL;DR

Remove the layered ext4/ZFS setup by backing up/migrating the containers, wiping the ZFS pool drives, and re-creating the pool through LXD; then enable compression, schedule regular scrubs, and add an L2ARC and a SLOG.

Discussion

Just like we can tune our containers through their config for their use case, we can tune our zpool(s) to theirs. For instance, if you had two pools, one made of NVMe drives and one made of SATA SSDs, you could tune the first for running containers/VMs and the second for bulk storage. By tweaking zpool settings you can optimize performance for what the containers actually do. You stated in your PS that you have allocated four SSDs to one zpool, which is good, because ZFS prefers whole disks over partitions. IMO, OSes and storage should be on different pools, but I will try to address things generally.

SSDs Tip

Make sure you run trim regularly on your SSDs to keep them in tip-top shape. While an SSD is performing trim, drive (and by consequence ZFS/LXD) performance suffers until it completes, so schedule trim for times when the drives are least likely to be in use.
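
For example (the pool name is a placeholder; zpool trim and the autotrim property are available from ZFS 0.8 onwards, so your 2.0.3 has them):

# one-off manual trim of all free space in the pool
sudo zpool trim <name of lxd zpool>

# or let ZFS trim freed blocks continuously in the background
sudo zpool set autotrim=on <name of lxd zpool>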

ZFS Housekeeping Tip

With SSDs and your usage, 'waste' is unlikely to be what is slowing you down, but it's good practice to implement regular scrubbing, like trim above, as soon as possible: a scrub reads every block in the pool, verifies its checksum, and repairs any silent corruption it can. You can safely run:

sudo zpool scrub <name of lxd zpool>

NOTE: a scrub does not delete anything; it is purely a data-integrity pass. You can also set zpool scrub to run automatically; check the zpool-scrub manpage for details.
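
For example, a cron entry along these lines (pool name and schedule are placeholders) scrubs every Sunday at 03:00:

# /etc/cron.d/zfs-scrub
0 3 * * 0 root /usr/sbin/zpool scrub <name of lxd zpool>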

ZFS Compression

If you run zfs get all <name of lxd zpool> | grep compress you'll see the same output as below, which means that you are not running any compression on your zpool; that is most likely not what you want.

lxdpool compressratio 1.00x
lxdpool compression off default
lxdpool refcompressratio 1.00x -

ZFS is optimized through compression. At first that might seem like backwards thinking because compression costs CPU cycles, but in nearly every x86_64 use case, not using compression will actually slow ZFS down, since more data has to be read from and written to disk. A use case for leaving compression off would be something like a single-core Atom processor running your network router with some ZFS disks attached; there the CPU overhead may matter more than the disk savings compared to the importance of routing traffic.

ZFS compresses data as it moves between the ARC and storage. Several algorithms are available, but imho lz4 optimizes the system best; there are articles discussing this topic better than I can, and the people at OpenZFS seem to support the same conclusion. You're running ZFS 2.0.3, so lz4 compression is an option for you. Assuming that you want to use lz4 compression, you can switch it on with:

sudo zfs set compression=lz4 <name of lxd zpool>

Compression only applies to data written after it is enabled, which means your older containers will need to be migrated off the zpool and back onto it in order to take advantage of lz4 compression. I have more to say on migration in the ZFS Storage section below.
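
Once it is on, you can verify that newly written data is actually being compressed (a quick check; -r just recurses into the per-container datasets):

sudo zfs get -r compression,compressratio <name of lxd zpool>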

ZFS Cache

ARC status: HEALTHY
ARC size (current): 27.9 % 17.5 GiB
Target size (adaptive): 28.0 % 17.5 GiB
Min size (hard limit): 6.2 % 3.9 GiB
Max size (high water): 16:1 62.8 GiB
Cache hit ratio: 93.9 % 764.4M
Cache miss ratio: 6.1 % 49.8M
Actual hit ratio (MFU + MRU hits): 93.6 % 761.9M
Data demand efficiency: 99.0 % 509.9M

ZFS is awesome, but it wants a lot of RAM. A general rule, regardless of parity settings, is to make at least 1GB of RAM available for every TB of disk storage; so if you had four 10TB drives for ZFS, regardless of your RAID settings, you would want an additional 40GB of RAM just for ZFS to play with. If your OS then needed 4GB of RAM, you would want at least 44GB in total. Ideally you're using ECC RAM, which is an added layer of insurance.

ZFS ARC

The ZFS read cache is called the ARC, and it lives in RAM; when data is not in the ARC, reads have to go to disk (or to an L2ARC device if you have one). You will want to compare your max ARC size to your available RAM and to the total TB of disk storage on your zpool.

Your ARC hit ratio is 93.9%, which means the remaining ~6% of reads go to disk; ideally you want a ratio as close to 100% as possible. If your RAM is maxed out and you're still running slow, then you may want to set up a SLOG device; that specifically helps synchronous writes, which otherwise have to commit to the ZIL on the main pool.

By default, ZFS sets the ARC Min to 1/32 of system RAM; this means on a 128GB RAM system, about a 4GB minimum has been established. Your system is set to: 3.9GB.

By default, ZFS sets the ARC Max to 1/2 of system RAM; this means on a 128GB or so of RAM, about 64GB has been established. Your system is set to: 62.8GB.

I suspect that you have 128GB of RAM in your system; if that is the case, then unless your four SSDs are larger than 15TB each, which I doubt, that should be plenty. If you have less than 128GB of RAM, then you are configured incorrectly. ARC settings are in bytes, so some math is involved. You can safely and temporarily reconfigure your ARC settings with the following (where X is the number of GB that you want):

ARC Min
echo "$((X * 1024*1024*1024))" | sudo tee /sys/module/zfs/parameters/zfs_arc_min

ARC Max
echo "$((X * 1024*1024*1024))" | sudo tee /sys/module/zfs/parameters/zfs_arc_max

(Piping through sudo tee is needed because with sudo echo ... > file the redirection would run as your unprivileged user.)

Once you have established your optimal ARC settings, you can make the temporary values permanent by running the following:

arcmin=$(cat /sys/module/zfs/parameters/zfs_arc_min)
arcmax=$(cat /sys/module/zfs/parameters/zfs_arc_max)
echo "options zfs zfs_arc_min=${arcmin}" | sudo tee -a /etc/modprobe.d/zfs.conf
echo "options zfs zfs_arc_max=${arcmax}" | sudo tee -a /etc/modprobe.d/zfs.conf

If you are running ZFS on your root filesystem you will also have to run sudo update-initramfs -u -k all and reboot your system.

NOTE: the difference between zfs_arc_max and zfs_arc_min has to be greater than 10% for L2ARC to work correctly.
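
After the reboot, a quick sanity check that the values took effect (sketch):

cat /sys/module/zfs/parameters/zfs_arc_min
cat /sys/module/zfs/parameters/zfs_arc_max
arc_summary | grep -A 3 'ARC size'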

ZFS L2ARC

You're not running an L2ARC; adding one is recommended from a performance perspective. You can use a more performant SSD or NVMe partition for L2ARC:

sudo zpool add <name of lxd pool> cache <l2arc device>
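
Once added, the cache device shows up in its own section of zpool status and you can watch it warm up (pool name is a placeholder):

sudo zpool status <name of lxd pool>        # look for the "cache" section
sudo zpool iostat -v <name of lxd pool> 5   # per-vdev I/O, including the cache device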

ZFS Storage

The ZIL and SLOG sit between the ARC/L2ARC and the zpool(s) on the write path.

ZFS SLOG

A SLOG is a dedicated device that holds the ZIL instead of the main pool, which speeds up synchronous writes. If you have a fast drive to spare then:

sudo zpool add <name of lxd pool> log <slog device>

NOTE: you can use a whole drive or just a partition of one.
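
If you want some protection for in-flight synchronous writes, you can also mirror the SLOG across two devices instead of using a single one (device names are placeholders):

sudo zpool add <name of lxd pool> log mirror <slog device 1> <slog device 2>

Either way, the device then appears under a "logs" section in zpool status.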

Layering Filesystems

The OP posted that he is layering ZFS on top of ext4. You can technically do that, but IMO it is usually unnecessary and it will hurt performance tremendously, because every write now goes through two separate filesystems. I cannot think of a use case where running one filesystem on top of another is better, from an optimization perspective, than just running ext4 or ZFS directly.

If your containers matter to you, your very best bet is to migrate them to a whole new temporary device, perhaps the drive that you intend to use for the SLOG. Once you're confident that you have safely backed up your containers, destroy the ZFS pool and wipe the drives completely. Then re-implement ZFS storage through LXD. You can do the LXD storage setup via lxd init or you can use this tutorial (make sure to follow the "ZFS" directions; I only mention that because I was pretty slow on that, lol). Also, I was aided by this info. You can add an existing ZFS pool to LXD, but if you are starting over it really is easier to set it up through LXD.
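
A rough sketch of that flow with plain LXD commands (instance, path, and device names here are only examples):

# 1. back up each container to a tarball on a drive outside the pool
lxc export warehouse /mnt/backup/warehouse.tar.gz

# 2. after destroying the old pool and wiping the disks, create a new ZFS pool
#    through LXD on a whole disk (for a multi-disk RAID 10 layout, create the
#    zpool yourself first and pass its name as source= instead)
lxc storage create lxdpool zfs source=/dev/sdb

# 3. restore the container onto the new pool
lxc import /mnt/backup/warehouse.tar.gz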

Once you're done rebuilding your zpool through LXD, go back and set your compression to your preferred algorithm, set your ARC values, tell your system to scrub your LXD zpool regularly (weekly, or whatever you want), and set up your SLOG through ZFS. Your zpool ought to perform noticeably better after these tweaks.

Good Luck!


@ckruijntjens, can you format the command output please? It is hard to read as posted.
It looks like you are running a PostgreSQL server inside the container; what is the CPU load of the container? Why are you running a privileged container, is there any reason?
Regards.

Hi,

Thank you for the info. I think I'm going back to btrfs.

ZFS is way too complicated, I think? I only have 4 SSD drives, so I wanted them all in the pool, but I run machines and storage on the same drives.

Is your use case a SOHO setup?

@ckruijntjens, if you are running a simple container with a PostgreSQL database, why are you running a privileged container? It seems to me that the problem is not related to the filesystem, whether that is btrfs or ZFS; you need to debug the problem somewhere else.
Can you post the lsblk command output?
Regards.

Hi,

What do you mean by SOHO?

Small Office, Home Office