I’m not sure how to do it with the snap, I think @solo mentioned he was building from source?
Oh yes, good spot. In that case @solo, are you also using up-to-date dqlite and raft libraries (using make deps from LXD)?
Also if you’ve built from source, you’re not getting any of the cherry-picks, so might be better to try building from the main lxd branch so you’ve got the latest patches.
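For reference, a rough sketch of what a from-source build along those lines might look like, assuming a Go toolchain and the usual build dependencies are already in place (exact steps may differ for your checkout):

git clone https://github.com/lxc/lxd
cd lxd
make deps    # builds the bundled dqlite and raft libraries
# export the CGO_CFLAGS / CGO_LDFLAGS / LD_LIBRARY_PATH values that make deps prints
make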
Of course, I updated the dqlite and raft libraries.
Doesn’t really matter, as long as I can see the output somewhere.
You can never be sure without asking.
I got logs with the trace settings enabled; where should I put them?
You can upload them somewhere and send me the link by mail at mathieu.bordere@canonical.com if you don’t want to post them publicly, or compress them and send them as an attachment if they’re not too large.
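Something like this would bundle them into one attachment (the file names here are just placeholders):

tar -czf lxd-traces.tar.gz lxd_trace*.log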
Sent the link by email…
@solo can you upload the logs from the 4th node too?
From a first look, I see disk I/O errors around the same time on all the nodes (I’ve edited the timestamps down to seconds instead of nanoseconds). The status should be 0 when no errors occur.
Are they sharing the same drive? I think something might be up with your disk.
mathieu@linda:public-20211124T152038Z-001 $ grep -nir -A2 uv_append
public/lxd_trace200.log:16188:LIBRAFT 1637765410 src/uv_append.c:197 write:
public/lxd_trace200.log-16189-LIBRAFT 1637765410 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace200.log-16190-LIBRAFT 1637765410 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace201.log:16021:LIBRAFT 1637765420 src/uv_append.c:197 write:
public/lxd_trace201.log-16022-LIBRAFT 1637765420 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace201.log-16023-LIBRAFT 1637765420 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace201.log:16194:LIBRAFT 1637765422 src/uv_append.c:197 write:
public/lxd_trace201.log-16195-LIBRAFT 1637765422 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace201.log-16196-LIBRAFT 1637765422 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace201.log:16733:LIBRAFT 1637765491 src/uv_append.c:197 write:
public/lxd_trace201.log-16734-LIBRAFT 1637765491 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace201.log-16735-LIBRAFT 1637765491 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:3947:LIBRAFT 1637764307 src/uv_append.c:197 write:
public/lxd_trace202.log-3948-LIBRAFT 1637764307 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-3949-LIBRAFT 1637764307 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:13164:LIBRAFT 1637765410 src/uv_append.c:197 write:
public/lxd_trace202.log-13165-LIBRAFT 1637765410 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-13166-LIBRAFT 1637765410 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:13350:LIBRAFT 1637765421 src/uv_append.c:197 write:
public/lxd_trace202.log-13351-LIBRAFT 1637765421 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-13352-LIBRAFT 1637765421 src/uv_send.c:218 connection available -> write message
--
public/lxd_trace202.log:13371:LIBRAFT 1637765424 src/uv_append.c:197 write:
public/lxd_trace202.log-13372-LIBRAFT 1637765424 src/replication.c:828 I/O completed on follower: status 18
public/lxd_trace202.log-13373-LIBRAFT 1637765424 src/uv_send.c:218 connection available -> write message
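As a sanity check, the epoch seconds in those trace lines can be converted to wall-clock time to confirm the events really line up across nodes, e.g. with GNU date:

date -d @1637765410    # local time for one of the trace timestamps above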
What is the error there, @mbordere?
It’s a generic RAFT_IOERR, set here → https://github.com/canonical/raft/blob/b71e3038944b34bed650c3612bfbe07f01ea6aa7/src/uv_writer.c#L20
The original error code is lost; I’ll add it to the error message in the future, PR here → https://github.com/canonical/raft/pull/251/files
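If you have a local checkout of canonical/raft, one quick way to see the mapping is to grep for the symbol (the paths below assume the upstream repository layout):

grep -n "RAFT_IOERR" include/raft.h src/uv_writer.c    # definition and where the writer falls back to it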
Yeah, adding the status text to the message would be great.
Also, "I/O completed on follower" is somewhat misleading if there was an error, right?
No, all the nodes are different physical machines. It’s strange that the error occurs simultaneously on different disks, or is it just that the error messages are delivered at the same time?
Strange indeed, can you also upload the log from the other node please?
I’m trying, but for now everything is working properly (
I managed to break the cluster, the traces are in the same place…
Do you see anything suspicious in sudo dmesg on the servers suggesting disk issues?
No, do you recommend any specific tests?
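For what it’s worth, typical first checks look something like this (assuming smartmontools is installed; /dev/sda is a placeholder for whichever drive backs the LXD data directory):

sudo dmesg -T | grep -iE 'i/o error|ata|nvme'
sudo smartctl -H /dev/sda          # overall SMART health verdict
sudo smartctl -l error /dev/sda    # SMART error log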