Cluster blocked due to node error (?)

I’m not sure how to do it with the snap; I think @solo mentioned he was building from source?

Oh yes, good spot. In that case, @solo, are you also using up-to-date dqlite and raft libraries (built using make deps from LXD)?

Also, if you’ve built from source, you’re not getting any of the cherry-picks, so it might be better to try building from the main lxd branch so you’ve got the latest patches.
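Roughly, and assuming the standard steps from the LXD README (the exact environment exports that make deps prints can vary per machine), that looks like:

git clone https://github.com/lxc/lxd
cd lxd
# "make deps" builds the bundled raft and dqlite libraries and prints
# CGO_* / LD_LIBRARY_PATH export lines to copy into your shell
make deps
# after exporting those variables, build LXD itself
make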

Of course, I updated the dqlite and raft libraries.

Doesn’t really matter, as long as I can see the output somewhere.

You can never be sure without asking.

I got logs with trace settings; where should I put them?

You can upload them somewhere and send me the link via mail (mathieu.bordere@canonical.com) if you don’t want to post them publicly, or compress them and send them as an attachment if they’re not too large.

Sent the link by email…

@solo can you upload the logs from the 4th node too?

From a first look, I see disk I/O errors around the same time on all the nodes (I’ve edited the timestamps to seconds instead of nanoseconds); the status should be 0 when no errors occur. See the grep excerpt below, and the filter sketch after it.
Are they sharing the same drive? I think something might be up with your disk.

mathieu@linda:public-20211124T152038Z-001 $ grep -nir -A2 uv_append

public/lxd_trace200.log:16188:LIBRAFT   1637765410 src/uv_append.c:197 write: 
public/lxd_trace200.log-16189-LIBRAFT   1637765410 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace200.log-16190-LIBRAFT   1637765410 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace201.log:16021:LIBRAFT   1637765420 src/uv_append.c:197 write: 
public/lxd_trace201.log-16022-LIBRAFT   1637765420 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace201.log-16023-LIBRAFT   1637765420 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace201.log:16194:LIBRAFT   1637765422 src/uv_append.c:197 write: 
public/lxd_trace201.log-16195-LIBRAFT   1637765422 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace201.log-16196-LIBRAFT   1637765422 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace201.log:16733:LIBRAFT   1637765491 src/uv_append.c:197 write: 
public/lxd_trace201.log-16734-LIBRAFT   1637765491 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace201.log-16735-LIBRAFT   1637765491 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:3947:LIBRAFT   1637764307 src/uv_append.c:197 write: 
public/lxd_trace202.log-3948-LIBRAFT   1637764307 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-3949-LIBRAFT   1637764307 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:13164:LIBRAFT   1637765410 src/uv_append.c:197 write: 
public/lxd_trace202.log-13165-LIBRAFT   1637765410 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-13166-LIBRAFT   1637765410 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:13350:LIBRAFT   1637765421 src/uv_append.c:197 write: 
public/lxd_trace202.log-13351-LIBRAFT   1637765421 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-13352-LIBRAFT   1637765421 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:13371:LIBRAFT   1637765424 src/uv_append.c:197 write: 
public/lxd_trace202.log-13372-LIBRAFT   1637765424 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-13373-LIBRAFT   1637765424 src/uv_send.c:218 connection available -> write message
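A quick way to pull out just the failing completions (assuming every such trace line ends with its status code, as in the excerpt above) is something like:

# list every follower I/O completion whose status is non-zero
grep -r "I/O completed on follower" public/ | grep -v "status 0$"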

What is the error there, @mbordere?

It’s a generic RAFT_IOERR, set here → https://github.com/canonical/raft/blob/b71e3038944b34bed650c3612bfbe07f01ea6aa7/src/uv_writer.c#L20. The original error code is lost; I will add it to the error message in the future. PR here → https://github.com/canonical/raft/pull/251/files
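If you want to decode a status value yourself, the numeric codes come straight from raft’s public header. A hypothetical lookup in a canonical/raft checkout (header path assumed):

# scan the error-code defines; RAFT_IOERR should appear with value 18,
# matching the "status 18" lines in the traces above
grep -n "#define RAFT_" include/raft.h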

Yeah, adding the status text to the message would be great 🙂

Also, “I/O completed on follower” is somewhat misleading if there was an error, right?

No, all nodes are different physical machines. It’s strange that an error would occur simultaneously on different disks. Or is it only the error messages that are delivered simultaneously?

Strange indeed. Can you also upload the log from the other node, please?

I’m trying, but so far everything is working properly (

I managed to break the cluster; the traces are in the same place…

Do you see anything suspicious in sudo dmesg on the servers suggesting disk issues?
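For example (these grep patterns are just common kernel disk-error markers, not an exhaustive list):

# human-readable timestamps, filtered for typical storage error messages
sudo dmesg -T | grep -iE "i/o error|blk_update_request|ata[0-9]|nvme"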

No, do you recommend any specific tests?