Cluster blocked due to node error (?)

I’m not sure how to do it with the snap; I think @solo mentioned he was building from source?

Oh yes, good spot. In that case, @solo, are you also using up-to-date dqlite and raft libraries (built using make deps from LXD)?

Also, if you’ve built from source, you’re not getting any of the cherry-picks, so it might be better to try building from the main lxd branch so you’ve got the latest patches.
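Roughly, and assuming the standard steps from the LXD README (the exact environment exports that make deps prints can vary per machine), that looks like:

git clone https://github.com/lxc/lxd
cd lxd
# "make deps" builds the bundled raft and dqlite libraries and prints
# CGO_* / LD_LIBRARY_PATH export lines to copy into your shell
make deps
# after exporting those variables, build LXD itself
make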

Of course, I updated the dqlite and raft libraries.

Doesn’t really matter, as long as I can see the output somewhere.

You can never be sure without asking.

I got logs with trace settings; where should I put them?

You can upload them somewhere and send me the link via mail (mathieu.bordere@canonical.com) if you don’t want to post them publicly, or compress them and send them as an attachment if they’re not too large.

Sent the link by email…

@solo can you upload the logs from the 4th node too?

From a first look, I see disk I/O errors around the same time on all the nodes (I’ve edited the timestamps to seconds instead of nanoseconds); the status should be 0 when no errors occur. See the grep excerpt below, and the filter sketch after it.
Are they sharing the same drive? I think something might be up with your disk.

mathieu@linda:public-20211124T152038Z-001 $ grep -nir -A2 uv_append

public/lxd_trace200.log:16188:LIBRAFT   1637765410 src/uv_append.c:197 write: 
public/lxd_trace200.log-16189-LIBRAFT   1637765410 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace200.log-16190-LIBRAFT   1637765410 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace201.log:16021:LIBRAFT   1637765420 src/uv_append.c:197 write: 
public/lxd_trace201.log-16022-LIBRAFT   1637765420 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace201.log-16023-LIBRAFT   1637765420 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace201.log:16194:LIBRAFT   1637765422 src/uv_append.c:197 write: 
public/lxd_trace201.log-16195-LIBRAFT   1637765422 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace201.log-16196-LIBRAFT   1637765422 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace201.log:16733:LIBRAFT   1637765491 src/uv_append.c:197 write: 
public/lxd_trace201.log-16734-LIBRAFT   1637765491 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace201.log-16735-LIBRAFT   1637765491 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:3947:LIBRAFT   1637764307 src/uv_append.c:197 write: 
public/lxd_trace202.log-3948-LIBRAFT   1637764307 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-3949-LIBRAFT   1637764307 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:13164:LIBRAFT   1637765410 src/uv_append.c:197 write: 
public/lxd_trace202.log-13165-LIBRAFT   1637765410 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-13166-LIBRAFT   1637765410 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:13350:LIBRAFT   1637765421 src/uv_append.c:197 write: 
public/lxd_trace202.log-13351-LIBRAFT   1637765421 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-13352-LIBRAFT   1637765421 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:13371:LIBRAFT   1637765424 src/uv_append.c:197 write: 
public/lxd_trace202.log-13372-LIBRAFT   1637765424 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-13373-LIBRAFT   1637765424 src/uv_send.c:218 connection available -> write message
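A quick way to pull out just the failing completions (assuming every such trace line ends with its status code, as in the excerpt above) is something like:

# list every follower I/O completion whose status is non-zero
grep -r "I/O completed on follower" public/ | grep -v "status 0$"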

What is the error there, @mbordere?

It’s a generic RAFT_IOERR, set here → https://github.com/canonical/raft/blob/b71e3038944b34bed650c3612bfbe07f01ea6aa7/src/uv_writer.c#L20. The original error code is lost; I will add it to the error message in the future. PR here → https://github.com/canonical/raft/pull/251/files
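If you want to decode a status value yourself, the numeric codes come straight from raft’s public header. A hypothetical lookup in a canonical/raft checkout (header path assumed):

# scan the error-code defines; RAFT_IOERR should appear with value 18,
# matching the "status 18" lines in the traces above
grep -n "#define RAFT_" include/raft.h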

Yeah, adding the status text to the message would be great 🙂

Also, “I/O completed on follower” is somewhat misleading if there was an error, right?

No, all nodes are different physical machines. It’s strange that an error would occur simultaneously on different disks. Or is it only the error messages that are delivered simultaneously?

Strange indeed. Can you also upload the log from the other node, please?

I’m trying, but so far everything is working properly (

I managed to break the cluster; the traces are in the same place…

Do you see anything suspicious in sudo dmesg on the servers suggesting disk issues?
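For example (these grep patterns are just common kernel disk-error markers, not an exhaustive list):

# human-readable timestamps, filtered for typical storage error messages
sudo dmesg -T | grep -iE "i/o error|blk_update_request|ata[0-9]|nvme"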

No, do you recommend any specific tests?