I’ve got an application that works well with standalone LXD, and I wanted to give clustering a shot.
To boot a container I make a sequence of API calls like:
- `POST /1.0/instances` (create the instance)
- `GET /1.0/operations//wait` (wait for the creation to succeed)
- `PUT /1.0/instances//state` (start the instance)
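For reference, the shapes of those requests look roughly like this. This is just an illustrative sketch: the helper names and the image alias are mine, and the real calls of course go over the LXD API socket rather than plain HTTP.

```python
def create_body(name: str, image_alias: str) -> dict:
    # Body for POST /1.0/instances; the image source here is illustrative.
    return {"name": name, "source": {"type": "image", "alias": image_alias}}

def wait_path(async_response: dict) -> str:
    # POST /1.0/instances returns a background operation; its "operation"
    # field is the path to poll, and appending /wait blocks until it's done.
    return async_response["operation"] + "/wait"

def start_body(timeout: int = 30) -> dict:
    # Body for PUT /1.0/instances/<name>/state to start the instance.
    return {"action": "start", "timeout": timeout}

resp = {"type": "async",
        "operation": "/1.0/operations/1689171d-ef42-40f5-99b6-a27cc6cc6917"}
print(wait_path(resp))
# /1.0/operations/1689171d-ef42-40f5-99b6-a27cc6cc6917/wait
```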
This all works fine in standalone LXD, but in a cluster the last call fails whenever the container gets created on one of the other nodes in the cluster:
```
{'error': '',
 'error_code': 0,
 'metadata': {'class': 'task',
              'created_at': '2020-06-17T19:45:59.014813571Z',
              'description': 'Starting container',
              'err': "Common start logic: Storage start: Failed to run: zfs mount deeper-end/containers/X1nzMoezkbvP01UrhB: cannot open 'deeper-end/containers/X1nzMoezkbvP01UrhB': dataset does not exist",
              'id': '1689171d-ef42-40f5-99b6-a27cc6cc6917',
              'location': 'ip-10-0-1-53',
              'may_cancel': False,
              'metadata': {'container_progress': 'Remapping container filesystem'},
              'resources': {'containers': ['/1.0/containers/X1nzMoezkbvP01UrhB']},
              'status': 'Failure',
              'status_code': 400,
              'updated_at': '2020-06-17T19:45:59.026857479Z'},
 'operation': '',
 'status': 'Success',
 'status_code': 200,
 'type': 'sync'}
```
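One thing that tripped me up is that the wrapper is a successful sync response (`status_code` 200) while the actual failure is buried in the embedded operation. A small (hypothetical) helper I use to surface that:

```python
def operation_failed(response: dict) -> bool:
    # The outer envelope can say 'Success' even when the embedded
    # operation under 'metadata' carries its own failure status.
    op = response.get("metadata", {})
    return op.get("status") == "Failure" or op.get("status_code", 0) >= 400

response = {
    "status": "Success",
    "status_code": 200,
    "metadata": {"status": "Failure",
                 "status_code": 400,
                 "err": "dataset does not exist"},
}
print(operation_failed(response))  # True
```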
If I retry the start call it succeeds, which makes me think this could be a race?
Notably, `lxc launch --target <some_node>` does work, but based on the traffic it looks like it makes an additional `GET /1.0/instances/` before the call to start the instance (https://github.com/lxc/lxd/blob/c66ca9c060a776e39490669782712ff8e4225301/lxc/init.go#L347), so maybe that gives the ZFS changes enough time to stick?
Has anyone else seen this? I could always just retry the start call, but I would have thought that waiting for the create operation to succeed should be sufficient.
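If it helps anyone, the workaround I have in mind is just a retry with backoff around the start call. Sketch only: `start_fn` stands in for whatever issues the `PUT /1.0/instances/<name>/state` request and raises on failure.

```python
import time

def start_with_retry(start_fn, attempts: int = 3, delay: float = 0.5):
    # Retry the start call a few times; start_fn is hypothetical and
    # should raise RuntimeError when the operation reports a failure.
    for i in range(attempts):
        try:
            return start_fn()
        except RuntimeError:
            if i == attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(delay * (2 ** i))  # simple exponential backoff
```

Ugly, but it papers over the race until I understand what's actually going on.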
I’m using LXD 4.0.1.