What causes "Error: disk I/O error"?

I have an LXD cluster composed of 32 nodes, connected using the LXD clustering feature.

It was working fine, as far as I remember.

I didn’t touch it for two days, and when I ran
lxc ls today I got
Error: disk I/O error.

Then I checked the journal log and found entries like this (really long — hundreds of them):

May 23 19:05:12 node00 lxd[12790]: lvl=warn msg="failed to rollback transaction after error (failed to fecth nodes: disk I/O error): cannot rollback - no transaction is active" t=2018-05-23T19:05:12+0000
May 23 19:05:13 node00 lxd[12790]: lvl=warn msg="failed to rollback transaction after error (failed to fecth nodes: disk I/O error): cannot rollback - no transaction is active" t=2018-05-23T19:05:13+0000
May 23 19:05:23 node00 lxd[12790]: lvl=warn msg="Failed to get current cluster nodes: failed to fecth nodes: disk I/O error" t=2018-05-23T19:05:23+0000

What can cause this disk I/O problem?

I’ve run a S.M.A.R.T. disk test and couldn’t find any errors.

Also, sudo zpool status doesn’t show any problems either.

The log below shows the very start of the problem:

May 22 18:25:12 node00 lxd[12790]: lvl=warn msg="Raft: Failed to contact 2 in 1.51905941s" t=2018-05-22T18:25:12+0000
May 23 18:06:26 node00 lxd[12790]: lvl=warn msg="Raft: Failed to contact 2 in 1.500121466s" t=2018-05-23T18:06:26+0000
May 23 18:06:26 node00 lxd[12790]: lvl=warn msg="Raft: Failed to contact 2 in 1.560505846s" t=2018-05-23T18:06:26+0000
May 23 18:06:26 node00 lxd[12790]: lvl=warn msg="Raft: Failed to contact 3 in 1.500327995s" t=2018-05-23T18:06:26+0000
May 23 18:06:26 node00 lxd[12790]: lvl=warn msg="Raft: Failed to contact quorum of nodes, stepping down" t=2018-05-23T18:06:26+0000
May 23 18:06:32 node00 lxd[12790]: lvl=warn msg="failed to rollback transaction after error (failed to fecth nodes: disk I/O error): cannot rollback - no transaction is active" t=2018-05-23T18:06:32+0000
May 23 18:06:32 node00 lxd[12790]: lvl=warn msg="failed to rollback transaction after error (failed to fecth nodes: disk I/O error): cannot rollback - no transaction is active" t=2018-05-23T18:06:32+0000
May 23 18:06:33 node00 lxd[12790]: lvl=warn msg="failed to rollback transaction after error (failed to fecth nodes: disk I/O error): cannot rollback - no transaction is active" t=2018-05-23T18:06:33+0000
May 23 18:06:33 node00 lxd[12790]: lvl=warn msg="failed to rollback transaction after error (failed to fecth nodes: disk I/O error): cannot rollback - no transaction is active" t=2018-05-23T18:06:33+0000
May 23 18:06:34 node00 lxd[12790]: lvl=warn msg="failed to rollback transaction after error (failed to fecth nodes: disk I/O error): cannot rollback - no transaction is active" t=2018-05-23T18:06:34+0000

Help me!

Hello, the “disk I/O error” message is misleading: it’s actually a Raft replication issue, most probably related to the same problem as:

and

https://github.com/lxc/lxd/issues/4548

I’m most probably going to work on a mitigation for this issue today, so it should get committed soonish and released to users later on. Until then, “lxc list” is going to be subject to this kind of glitch.
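For context on why losing Raft contact surfaces as a database error: LXD keeps its cluster state in a Raft-replicated database, and a leader that cannot reach a majority of the voting members steps down, after which queries against that database fail — which is what the “Failed to contact quorum of nodes, stepping down” line followed by the “disk I/O error” flood in the logs shows. A minimal sketch of the majority rule (the three-database-node figure is an assumption about the default setup, and the function names are mine, not LXD’s):

```python
def quorum(n_voters: int) -> int:
    """Smallest majority among n_voters Raft voting members."""
    return n_voters // 2 + 1

def has_quorum(n_voters: int, reachable: int) -> bool:
    """A leader keeps leadership only while it can reach a majority,
    counting itself as one reachable member."""
    return reachable >= quorum(n_voters)

# With 3 database nodes (assumed default), the leader plus one peer
# is a majority, so the cluster keeps serving queries:
print(has_quorum(3, 2))  # True

# "Failed to contact 2" and "Failed to contact 3" means only the
# leader itself is reachable; it steps down, and database queries
# start surfacing as "disk I/O error":
print(has_quorum(3, 1))  # False
```

This also explains why a per-node disk check (S.M.A.R.T., zpool) comes back clean: the error is about replication between nodes, not the local disk.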

Could this be related to HW issues? Like an unstable network due to a faulty NIC.
It seems that not everybody runs into this problem.

It could be, but most probably not. I recommend trying again once the upcoming fixes get released.

I had the same issue after restarting the networking service on my nodes (Ubuntu 16.04); I had to add some routes to interfaces. After restarting the LXD service I got another error: Error: Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: no such file or directory, which is related to https://github.com/lxc/lxd/issues/4436