This looks like a bug in the raft implementation, triggered by the disk I/O errors we are seeing. I'm working on a fix.
Hi, regarding the disk I/O errors, I ran the following test:
(1) Create a test table and populate it:

```bash
lxd sql global "CREATE TABLE test (id int, val varchar, PRIMARY KEY (id));"
for (( ID=1; ID <= 100; ID++ )); do
    lxd sql global "INSERT INTO test (id, val) VALUES ($ID, $(date +%s%N));"
done
```
(2) Then ran a mass update of the records:

```bash
for (( TC=1; TC <= 10; TC++ )); do
    echo "Test $TC"
    for (( ID=1; ID <= 100; ID++ )); do
        lxd sql global "UPDATE test SET val=$(date +%s%N) WHERE id=$ID;"
        RES=$?
        if [[ $RES != 0 ]]; then
            echo "Result code is $RES"
            continue
        fi
    done
done
```
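Since "database is locked" errors are expected on a busy cluster, the update loop could retry the statement instead of skipping the row. A minimal sketch of such a wrapper (the `retry` helper and the simulated `flaky` command are illustrative, not part of LXD):

```bash
#!/bin/sh
# Retry a command up to N times with a short delay between attempts;
# returns 0 on the first success, 1 if all attempts fail.
retry() {
    attempts=$1; shift
    i=1
    while :; do
        "$@" && return 0
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep 1
    done
}

# Simulated flaky command standing in for the lxd sql call:
# it fails twice, then succeeds on the third attempt.
count_file=$(mktemp)
echo 0 > "$count_file"
flaky() {
    n=$(cat "$count_file")
    n=$((n + 1))
    echo "$n" > "$count_file"
    [ "$n" -ge 3 ]
}

if retry 5 flaky; then
    echo "update succeeded after $(cat "$count_file") attempts"  # prints: update succeeded after 3 attempts
else
    echo "update failed"
fi
rm -f "$count_file"
```

In the real loop, `retry 5 lxd sql global "UPDATE …"` would replace the bare `lxd sql global` call, so a transient lock no longer leaves a row un-updated.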
The only error reported was "Error: Failed to exec query: database is locked" (which is expected on a running cluster), but this output:
```
Test 7
Rows affected: 1
...
Rows affected: 0
Rows affected: 1
Rows affected: 1
Rows affected: 1
Rows affected: 1
Rows affected: 0
Rows affected: 1
Rows affected: 1
Rows affected: 1
Rows affected: 0
...
```
looks strange: "Rows affected: 0". Could LXD not change the record? And no error was reported…
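To spot these silent failures automatically, the loop's output could be captured and scanned for zero-row updates. A small sketch over a sample transcript (the sample text here is illustrative; in the real test `$output` would be the captured output of the update loop):

```bash
# Sample transcript standing in for the captured loop output.
output='Test 7
Rows affected: 1
Rows affected: 0
Rows affected: 1
Rows affected: 0'

# grep -c counts the lines reporting zero affected rows.
zero=$(printf '%s\n' "$output" | grep -c '^Rows affected: 0$')
echo "silent zero-row updates: $zero"   # prints: silent zero-row updates: 2
```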
With a local database the test worked as expected.
@solo Still working on a fix for the issue I saw, though it might not be a permanent solution: these I/O errors (as observed by our system) should be severe and highly exceptional, while in your case they happen very frequently.
In the meantime, can you try running the other nodes on ext4 instead of BTRFS? It would be an interesting experiment. (I don't see the I/O errors on the node running on ext4.)
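For checking which filesystem actually backs /var on each node, something like this could be used (assuming GNU coreutils `df`; the path is the one discussed above):

```bash
# Print the mount backing /var; the Type column shows the
# filesystem (e.g. ext4 or btrfs).
df -T /var
```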
Well, I moved /var to a separate ext4 partition on the nodes. So far everything looks fine (I am checking by cyclically launching and destroying a lot of containers).