Cluster blocked due to node error (?)

Looks like a bug in the raft implementation that is triggered by the disk I/O errors we are seeing. I’m fixing it.


Hi, regarding the disk I/O errors, I ran the following test:
(1) Create a test table

lxd sql global "CREATE TABLE test (id INT, val VARCHAR, PRIMARY KEY (id));"
for (( ID=1; ID <= 100; ID++ )); do
  lxd sql global "INSERT INTO test (id, val) VALUES ($ID, '$(date +%s%N)');"
done

(2) Then ran a mass update of the records

for (( TC=1; TC <= 10; TC++ )); do
  echo "Test $TC"
  for (( ID=1; ID <= 100; ID++ )); do
    lxd sql global "UPDATE test SET val='$(date +%s%N)' WHERE id=$ID;"
    RES=$?
    if [[ $RES != 0 ]]; then
      echo "Result code is $RES"
      continue
    fi
  done
done

The only error detected was “Error: Failed to exec query: database is locked” (which is expected, since the cluster is running), but this output:

Test 7 
Rows affected: 1
 ...
Rows affected: 0
Rows affected: 1
Rows affected: 1
Rows affected: 1
Rows affected: 1
Rows affected: 0
Rows affected: 1
Rows affected: 1
Rows affected: 1
Rows affected: 0
...

looks strange: “Rows affected: 0”. LXD could not update the record? And no error was reported…
With a local database the test worked as expected.
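Since `lxd sql` exits 0 even when an UPDATE matched no rows, one way to catch these silent no-ops in the loop above is to check the command's output text as well as its exit code. A minimal sketch (`run_update` is a hypothetical helper; the grep pattern assumes the "Rows affected: N" output format shown above):

```shell
# Hypothetical wrapper: run a query and flag updates that touched no rows.
run_update() {
  local out
  out=$(lxd sql global "$1") || { echo "query failed: $1" >&2; return 1; }
  if echo "$out" | grep -q "Rows affected: 0"; then
    echo "silent no-op: $1" >&2
    return 2
  fi
}
```

Calling this from the inner loop instead of `lxd sql global` directly would make the zero-row updates stand out without otherwise changing the test.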

I believe @mbordere has found an issue and is working on a fix.

@solo I'm still working on a fix for the issue I saw, though it might not be a permanent solution. These I/O errors (as observed by our system) should be considered severe and highly exceptional, whereas in your case they happen very frequently.

In the meantime, can you try running the other nodes on ext4 instead of BTRFS? It would be an interesting experiment. (I don't see the I/O errors on the node running on ext4.)

Well, I moved /var onto a separate ext4 partition on the nodes. So far everything looks to be working (I'm checking by cyclically launching and destroying a lot of containers).
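For reference, a cyclic launch/destroy check can be sketched as a small loop like the one below (the image alias `ubuntu:22.04`, the `stress-` name prefix, and the `stress_cycle` helper name are my illustrative choices, not from the thread):

```shell
# Hypothetical stress loop: repeatedly launch and delete containers
# to exercise the cluster database under churn.
stress_cycle() {
  local count=$1 i
  for (( i=1; i <= count; i++ )); do
    lxc launch ubuntu:22.04 "stress-$i" || { echo "launch $i failed" >&2; return 1; }
    lxc delete --force "stress-$i"      || { echo "delete $i failed" >&2; return 1; }
  done
}
```

Stopping on the first failed launch or delete keeps the node's error logs easy to correlate with the exact iteration that triggered them.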