STATE=ERROR, but container appears to be running

JLR83 · July 11, 2022, 3:36am

lxc list --fast shows the containers as running:

[22:31:48] john@moneta:~$ lxc list --fast
+---------+---------+--------------+----------------------+----------+-----------+
|  NAME   |  STATE  | ARCHITECTURE |      CREATED AT      | PROFILES |   TYPE    |
+---------+---------+--------------+----------------------+----------+-----------+
| flash   | RUNNING | x86_64       | 2021/02/12 02:32 UTC | base     | CONTAINER |
|         |         |              |                      | data     |           |
+---------+---------+--------------+----------------------+----------+-----------+
| forward | RUNNING | x86_64       | 2021/02/12 02:35 UTC | base     | CONTAINER |
|         |         |              |                      | data     |           |
+---------+---------+--------------+----------------------+----------+-----------+

Running lxc list -c nsD takes a long time, but I can run lxc shell flash and access the container.

Any clue as to what is messed up?
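
To put a number on the delay, the two invocations can be timed side by side. This is only a rough sketch: elapsed_seconds is a hypothetical helper name, not part of LXD, and it relies on date +%s, so it only has whole-second resolution.

```shell
# Hypothetical helper (not part of LXD): run a command, discard its output,
# and print how many whole seconds it took.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example calls, assuming the lxc client is installed:
#   elapsed_seconds lxc list --fast
#   elapsed_seconds lxc list -c nsD
```

A large gap between the two numbers would suggest that the extra columns, rather than the basic instance list, account for the wait.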

cemzafer · July 11, 2022, 4:20am

Hi,
You can view the container's log with the following command:
lxc info <container_name> --show-log
You can also watch for the error live: run lxc monitor --type=logging --pretty, then restart the container that is in the ERROR state and observe the log.
Regards.

tomp · July 11, 2022, 7:35am

If you’re running LXD 5.3 you’re likely affected by this issue: Database error: "sql: transaction has already been committed or rolled back"

There is a fix for the instance list speed regression merged and it should be deployed to the latest/stable channel soon.

JLR83 · July 11, 2022, 3:25pm

Yes – that error is all over the logs. Thank you – I will await the update/fix

JLR83 · July 27, 2022, 5:16pm

I have updated to 5.4 and stopped all the containers and rebooted the server, but the long delay to “lxc list” is still there. What logs / info ought I provide?

tomp · July 27, 2022, 7:06pm

Are you running a cluster?

If so, can you identify the leader using “lxc cluster list”, then try running “lxc ls” on that machine and see if it is still slow?

Can you provide the output of both commands too please.

tomp · July 27, 2022, 7:07pm

Do you still see errors in the logs?

JLR83 · July 28, 2022, 1:05pm

My server is not part of a cluster.
My logs are filled with errors like these:
`WARNING[2022-07-28T08:00:52-05:00] Transaction timed out. Retrying once err="Failed to begin transaction: context deadline exceeded" member=1`

`DEBUG [2022-07-28T08:01:12-05:00] Database error err="Failed to fetch from "instance_snapshot_config" table: sql: Rows are closed"`

`DEBUG [2022-07-28T08:01:12-05:00] Database error err="sql: transaction has already been committed or rolled back"`
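
When the same messages repeat many times, a per-message count is easier to share than the raw log. A minimal sketch, assuming the real log file uses plain double quotes around the err value; summarize_errors is a hypothetical helper name, and the log path in the usage comment is an assumption that depends on how LXD was installed.

```shell
# Hypothetical helper: read LXD log lines on stdin and print each distinct
# err="..." message with the number of times it occurs, most frequent first.
summarize_errors() {
    # The greedy match keeps inner quotes (e.g. around a table name) intact.
    grep -o 'err=".*"' | sort | uniq -c | sort -rn
}

# Example call; the path is an assumption (typical for the snap package):
#   summarize_errors < /var/snap/lxd/common/lxd/logs/lxd.log
```

The timestamps are dropped on purpose so that identical errors collapse into one line.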

tomp · July 28, 2022, 1:14pm

How many instances do you have?

I’ve just tried locally with 512 instances and lxc list returns in a couple of seconds.
Are the instances running or stopped? Does that make a difference?

JLR83 · July 28, 2022, 1:25pm

Thank you for your help.
The server has 21 running instances and 6 stopped instances (AMD 5900X w/128GB, very low load).
What may be significant is that each instance has between 90 and 100 snapshots.
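
The snapshot counts can be checked from the client. A minimal sketch, assuming the S column of lxc list (number of snapshots) and its CSV output format; total_snapshots is a hypothetical helper name. On an affected server this listing may itself be slow.

```shell
# Hypothetical helper: read "name,snapshot_count" CSV lines on stdin
# and print the total number of snapshots across all instances.
total_snapshots() {
    awk -F, '{ total += $2 } END { print total + 0 }'
}

# Example calls, assuming the lxc client is installed:
#   lxc list -c nS --format csv                     # per-instance counts
#   lxc list -c nS --format csv | total_snapshots   # overall total
```

With 27 instances at 90 to 100 snapshots each, the total would land somewhere between 2,430 and 2,700.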

tomp · July 28, 2022, 1:26pm

Thanks, yes, I suspect that is significant. I’ll take another look at the DB queries and see if there are inefficiencies related to snapshots, as the instance list (without snapshots) seems fine now.

tomp · July 28, 2022, 1:37pm

I’ve opened https://github.com/lxc/lxd/issues/10707 to investigate this with some ideas as to what the cause is.