Weekly status for the week of the 18th of October to the 24th of October.
Introduction
This past week has been a busy one for LXD. It has gained two OVN networking features (network-to-network routing and network source address spoof protection), as well as fanotify support for filesystem event watchers and partial support for VM stateful migrations. In addition, there has been a focus on improving cluster failover reliability.
LXD
@stgraber has added a new video covering deployment of LXD clusters with Juju:
https://www.youtube.com/watch?v=JgNVAvcXR9Q
New features:
- OVN network-to-network routing (peering). This is one of our roadmap items and you can read more about its purpose and design in the design specification document.
- OVN network source address spoofing protection. As part of implementing network-to-network peering we wanted to ensure that asymmetric routing was not possible, so we added router security policy rules that only allow traffic from known source addresses of the network to be routed down the peer connection. We have also extended this validation so that it applies to all egress traffic from the network. This means OVN networks now prevent instance NICs on the network from sending traffic with a source address outside of the known allowed subnets for that network. Note, this does not prevent instance NICs from assuming an IP of another instance NIC on the same network.
- VM stateful migrations. Initial support for VM stateful migrations has been added when using clustering (with the --target flag) or when moving a VM between local storage pools. This action causes the VM to be stopped statefully (i.e. with its memory state saved to a file), migrated and then started back up again with its memory state restored.
- Ability to specify sysctl settings for containers. When using LXD containers you can now specify that certain sysctl settings are applied inside the container on start using the linux.sysctl.* settings.
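As a rough illustration of the linux.sysctl.* naming, the sketch below maps such keys to the /proc/sys files that the corresponding sysctl names refer to. It only shows the standard sysctl name-to-path convention; the function name and the sample config are hypothetical, and this is not how LXD itself applies the settings.

```python
# Prefix taken from the setting name mentioned above.
SYSCTL_PREFIX = "linux.sysctl."


def sysctl_targets(config):
    """Map linux.sysctl.* keys to the /proc/sys paths of the named sysctls.

    Assumes the usual convention that dots in a sysctl name separate
    path components under /proc/sys.
    """
    targets = {}
    for key, value in config.items():
        if not key.startswith(SYSCTL_PREFIX):
            continue  # not a sysctl key, ignore it
        name = key[len(SYSCTL_PREFIX):]
        targets["/proc/sys/" + name.replace(".", "/")] = value
    return targets


# Hypothetical container config with one sysctl key and one unrelated key.
example = {
    "linux.sysctl.net.ipv4.ip_forward": "1",
    "limits.cpu": "2",
}
print(sysctl_targets(example))
```

Only the key carrying the linux.sysctl. prefix is picked up; other instance settings are left alone.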
Improvements:
- Fanotify support for filesystem event monitoring. LXD now supports (and prefers) using fanotify to watch for filesystem events. These are used to support automatic hot-plugging of devices into instances.
Bug fixes:
- Bridged NICs now prevent the use of the ipv{n}.address settings when connected to an unmanaged bridge. This avoids confusion where a static IP is set but cannot take effect because the NIC is not connected to LXD’s DHCP server.
- A recent regression in listing active managed bridge DHCP leases has been fixed.
- lxd-p2c gains support for passing an existing certificate.
- The cluster member scheduler.instance config is no longer auto-filled when adding a new member.
Clustering failover fixes:
There’s been a focus on improving the reliability of clustering fail-over this past week. It was observed that if the LXD dqlite leader abruptly became unreachable by the other cluster members (perhaps because it went offline or there was a network issue), and there was still data in the TCP send queue of the DB connection, the remaining cluster members could block for up to 15 minutes before failing the ongoing query and recovering. During that time all operations that required DB access on those members would block. To all intents and purposes, this prevented fail-over from occurring for up to 15 minutes. The reason is that the data in the TCP send queue prevented the normal TCP keep-alive timers from taking effect, so the OS’s TCP re-transmission timers took precedence. By default these keep trying to re-transmit the data to the unreachable server for 15 minutes.
At the same time we observed that the LXD event connection from the unreachable server was also hanging around, blocked, for 15 minutes.
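The 15 minute figure is consistent with the default Linux retransmission back-off. The arithmetic can be sketched as follows, assuming the kernel defaults of tcp_retries2 = 15 and a retransmission timeout that doubles from 200 ms up to a 120 s cap. In practice the timeout starts from the measured round-trip time, so this is a hypothetical lower bound rather than an exact value, and the function name is purely illustrative.

```python
def retransmission_budget(retries=15, rto_min=0.2, rto_max=120.0):
    """Total seconds spent waiting before the kernel gives up on a connection.

    There are retries + 1 timer expiries: the first timeout plus one per
    retransmission. Each wait doubles until it reaches rto_max.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


seconds = retransmission_budget()
print(f"{seconds:.1f} s (~{seconds / 60:.1f} min)")  # 924.6 s (~15.4 min)
```

The first ten waits (0.2 s up to 102.4 s) add up to 204.6 s, and the remaining six are capped at 120 s each, giving 924.6 s in total.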
Resolving these issues required several fixes:
- Events web socket API now sends heartbeats to connected clients and expects replies. If the replies do not come in time then the socket connection is closed down.
- DB queries (which will go to the leader server) now have a 10s timeout, which is implemented as a TCP read deadline. This means that when a remote leader server becomes unreachable, ongoing queries will block for up to 10s before detecting that the connection is broken and allowing a re-connection attempt (possibly to a new leader server) to proceed. This effectively shortens the cluster fail-over time from 15 minutes to 10s.
- The Dqlite proxy subsystem inside LXD (that handles incoming DB connections and outgoing Raft connections for other cluster members) is now using the TCP_USER_TIMEOUT connection setting to set the maximum time that a connection can remain open with unacknowledged sent data. This means that if data is stuck in the TCP send queue for too long, the socket will be forcefully closed (preventing connections and goroutines from hanging around for up to 15 minutes).
- Retry cluster transactions once on query timeout so that if the query timed out due to a leader election it will retry automatically once the leader election has finished.
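Three of the fixes above (the read deadline, TCP_USER_TIMEOUT and the single retry) can be sketched in a few lines. This is a minimal Linux-only illustration and not LXD's implementation, which is written in Go. The function names are hypothetical, and the 10 000 ms user timeout is an assumed value, since the text above does not state the one LXD uses.

```python
import socket

# Linux option number for TCP_USER_TIMEOUT, used as a fallback in case the
# running Python build does not export the constant.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def configure_db_socket(sock, deadline_s=10.0, user_timeout_ms=10_000):
    """Bound how long a socket may block or hold unacknowledged data.

    settimeout() applies to every blocking socket operation, which is
    broader than a read-only deadline. TCP_USER_TIMEOUT makes the kernel
    close the connection once sent data has stayed unacknowledged for
    user_timeout_ms (an assumed value).
    """
    sock.settimeout(deadline_s)
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, user_timeout_ms)


def run_with_one_retry(query):
    """Call query(); if it times out, call it exactly one more time."""
    try:
        return query()
    except (TimeoutError, socket.timeout):
        # A second timeout propagates to the caller.
        return query()


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
configure_db_socket(sock)
print(sock.gettimeout(), sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT))
sock.close()
```

Retrying only once keeps the worst-case delay bounded at roughly twice the deadline, while still covering the case where the first attempt failed because a leader election was in progress.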
LXC
New features:
- You can now specify how many RX and TX queues are configured with veth NICs using the veth.n_rxqueues and veth.n_txqueues settings respectively. This allows for distributing traffic over multiple CPU cores.
Improvements:
- Detect and prevent the rootfs being over-mounted using the lxc.mount.entry setting, as this causes confusion during container setup.
- Handle kernels without or not using SMT.
Bug fixes:
- Support restoring containers with pre-created veth devices (CRIU).
Distrobuilder
Improvements:
- The rootfs-http downloader now supports local files with a file:// prefix.
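To illustrate what a file:// prefix means for a downloader, here is a minimal sketch that distinguishes local files from remote URLs by scheme. It is not distrobuilder's code (which is written in Go), and the function name and sample paths are hypothetical.

```python
from urllib.parse import unquote, urlparse


def classify_source(url):
    """Return ("local", path) for file:// URLs and ("remote", url) for HTTP(S)."""
    parsed = urlparse(url)
    if parsed.scheme == "file":
        # file:///tmp/rootfs.tar.xz -> /tmp/rootfs.tar.xz
        return ("local", unquote(parsed.path))
    if parsed.scheme in ("http", "https"):
        return ("remote", url)
    raise ValueError(f"unsupported source URL: {url!r}")


print(classify_source("file:///tmp/rootfs.tar.xz"))
print(classify_source("https://example.com/rootfs.tar.xz"))
```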
Bug fixes:
- Various fixes for the Oracle image.
Dqlite (database)
Bug fixes:
- A statement leak that was causing an assert to be hit in leader__close has been fixed.
- Fix page numbers leak.
Dqlite (Go bindings)
Bug fixes:
- Fixes an issue that was preventing storing and retrieving a sql.NullTime.
YouTube channel
We’ve started a YouTube channel with live streams covering LXD releases and its use in the wider ecosystem.
You may want to give it a watch and/or subscribe for more content in the coming weeks.
https://www.youtube.com/lxd-videos
Contribute to LXD
Ever wanted to contribute to LXD but not sure where to start?
We’ve recently gone through some effort to properly tag issues suitable for new contributors on GitHub: Easy issues for new contributors
Upcoming events
- Nothing to report this week
Ongoing projects
The list below covers feature or refactoring work that will span several weeks or months and can’t be tied directly to a single GitHub issue or pull request.
- Distrobuilder Windows support
- Virtual networks in LXD
- Various kernel work
- Stable release work for LXC, LXCFS and LXD
Upstream changes
The items listed below are highlights of the work which happened upstream over the past week and which will be included in the next release.
LXD
- Network: OVN network to network routing (peering)
- lxd: Replace inotify with fsnotify/fanotify
- lxd: fixes
- Network: Add OVN router security policy to prevent address spoofing
- lxd/api/cluster: fix comment on clusterGet clusterPut
- Network: Fix bridge leases
- lxd-p2c: Allow passing existing certificate
- doc: Fixes
- lxd/fsmonitor/drivers: Add missing FAN_MARK_FILESYSTEM
- Instance: Add ability to perform stateful instance pool migration
- Instance: Add ability to perform stateful instance cluster member migration
- Instance: Renames IsMigratable to CanMigrate
- Don’t autofill cluster config.
- Add linux.sysctl.* configuration keys
- DB: Adds 10s timeout to Transaction
- Cluster: Add dqlite proxy timeout and event stream heartbeats
- NIC: Prevent use of static IPs on bridged NIC connected to unmanaged bridge
- Events: Moves blocking reader into heartbeat function
- lxd/fsmonitor/drivers: Log warning instead of failing
- Retry cluster transactions once if context deadline exceeded
- Cluster: Replaces dqliteProxy idle timeout with TCP_USER_TIMEOUT
- seccomp: Pass the caller TGID to pidfd_open instead of TID
LXC
- conf: fixes
- Riscv64
- conf: verify that rootfs is stable after setting up mounts
- criu: support restoring containers with pre-created veth devices
- conf: allow users to specify that they want a cgroup2 layout on a hybrid host
- Make number of rx and tx queues configurable for veths
- doc: Update Japanese lxc.container.conf(5) and common options
- conf: handle kernels without or not using SMT
LXCFS
- lxcfs: fixes
- doc: guide for reload share libary file
- sysfs: fix cpumasks
- build: fixes
- meson: Include lxcfs_fuse.h into source files
Distrobuilder
- sources/oracle: Run yum with --skip-broken
- ubuntu.yaml: add releases hirsute, impish, jammy
- sources: Fix Oracle 7 for aarch64
- sources: Fix Oracle install script
- sources/rootfs: Support local image files
Dqlite (RAFT library)
- Nothing to report this week
Dqlite (database)
- gateway: Finalize stmt when query_barrier_cb reports failure
- Coverity fixes
- fsm: Fix page_numbers leak
Dqlite (Go bindings)
LXD Charm
- lxd: convert PosixPath to plain string
- Only override the source key for storage pools
- Put cluster connection info in the app data bag
Distribution work
This section is used to track the work done in downstream Linux distributions to ship the latest LXC, LXD and LXCFS as well as work to get various software to work properly inside containers.
Ubuntu
- Nothing to report this week
Snap
- snapcraft: Simplified cohort handling
- lxd: Cherry-pick upstream bugfixes
- lxcfs: Bump to 4.0.11
- lxc: Bump to 4.0.11
- lxc: Cherry-pick upstream bugfixes