This looks like a bug in the raft implementation, triggered by the disk I/O errors we are seeing. I'm working on a fix.
Hi, regarding the disk I/O errors, I ran the following test:
(1) Create a test table and populate it:

```bash
lxd sql global "CREATE TABLE test (id int, val varchar, PRIMARY KEY (id));"
for (( ID=1; ID <= 100; ID++ )); do
    lxd sql global "INSERT INTO test (id, val) VALUES ($ID, $(date +%s%N));"
done
```
(2) Then ran a mass update of the records:

```bash
for (( TC=1; TC <= 10; TC++ )); do
    echo "Test $TC"
    for (( ID=1; ID <= 100; ID++ )); do
        lxd sql global "UPDATE test SET val=$(date +%s%N) WHERE id=$ID;"
        RES=$?
        if [[ $RES != 0 ]]; then
            echo "Result code is $RES"
            continue
        fi
    done
done
```
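Since "database is locked" errors are expected on a busy cluster, the update loop could retry the statement instead of skipping the row. A minimal sketch of such a wrapper (the `retry` helper and the simulated `flaky` command are illustrative, not part of LXD):

```bash
#!/bin/sh
# Retry a command up to N times with a short delay between attempts;
# returns 0 on the first success, 1 if all attempts fail.
retry() {
    attempts=$1; shift
    i=1
    while :; do
        "$@" && return 0
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep 1
    done
}

# Simulated flaky command standing in for the lxd sql call:
# it fails twice, then succeeds on the third attempt.
count_file=$(mktemp)
echo 0 > "$count_file"
flaky() {
    n=$(cat "$count_file")
    n=$((n + 1))
    echo "$n" > "$count_file"
    [ "$n" -ge 3 ]
}

if retry 5 flaky; then
    echo "update succeeded after $(cat "$count_file") attempts"  # prints: update succeeded after 3 attempts
else
    echo "update failed"
fi
rm -f "$count_file"
```

In the real loop, `retry 5 lxd sql global "UPDATE …"` would replace the bare `lxd sql global` call, so a transient lock no longer leaves a row un-updated.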
The only error reported was "Error: Failed to exec query: database is locked" (which is expected on a running cluster), but this output:
```
Test 7
Rows affected: 1
...
Rows affected: 0
Rows affected: 1
Rows affected: 1
Rows affected: 1
Rows affected: 1
Rows affected: 0
Rows affected: 1
Rows affected: 1
Rows affected: 1
Rows affected: 0
...
```
looks strange: "Rows affected: 0". Could LXD not change the record? And no error was reported…
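To spot these silent failures automatically, the loop's output could be captured and scanned for zero-row updates. A small sketch over a sample transcript (the sample text here is illustrative; in the real test `$output` would be the captured output of the update loop):

```bash
# Sample transcript standing in for the captured loop output.
output='Test 7
Rows affected: 1
Rows affected: 0
Rows affected: 1
Rows affected: 0'

# grep -c counts the lines reporting zero affected rows.
zero=$(printf '%s\n' "$output" | grep -c '^Rows affected: 0$')
echo "silent zero-row updates: $zero"   # prints: silent zero-row updates: 2
```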
With a local database the test worked as expected.
@solo Still working on a fix for the issue I saw, though it might not be a permanent solution: these I/O errors (as observed by our system) should be severe and highly exceptional, while in your case they happen very frequently.
In the meantime, can you try running the other nodes on ext4 instead of BTRFS? It would be an interesting experiment. (I don't see the I/O errors on the node running on ext4.)
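For checking which filesystem actually backs /var on each node, something like this could be used (assuming GNU coreutils `df`; the path is the one discussed above):

```bash
# Print the mount backing /var; the Type column shows the
# filesystem (e.g. ext4 or btrfs).
df -T /var
```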
Well, I moved /var to a separate ext4 partition on the nodes. So far everything looks fine (I am checking by cyclically launching and destroying a lot of containers).