I’ve got an application that works well with standalone LXD, and I wanted to give clustering a shot.
To boot a container I make a sequence of API calls like:
- `POST /1.0/instances` (create the instance)
- `GET /1.0/operations//wait` (wait for the creation to succeed)
- `PUT /1.0/instances//state` (start the instance)
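For reference, the shapes of those requests look roughly like this. This is just an illustrative sketch: the helper names and the image alias are mine, and the real calls of course go over the LXD API socket rather than plain HTTP.

```python
def create_body(name: str, image_alias: str) -> dict:
    # Body for POST /1.0/instances; the image source here is illustrative.
    return {"name": name, "source": {"type": "image", "alias": image_alias}}

def wait_path(async_response: dict) -> str:
    # POST /1.0/instances returns a background operation; its "operation"
    # field is the path to poll, and appending /wait blocks until it's done.
    return async_response["operation"] + "/wait"

def start_body(timeout: int = 30) -> dict:
    # Body for PUT /1.0/instances/<name>/state to start the instance.
    return {"action": "start", "timeout": timeout}

resp = {"type": "async",
        "operation": "/1.0/operations/1689171d-ef42-40f5-99b6-a27cc6cc6917"}
print(wait_path(resp))
# /1.0/operations/1689171d-ef42-40f5-99b6-a27cc6cc6917/wait
```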
This all works fine in standalone LXD, but in a cluster the last call fails whenever the container gets created on one of the other nodes in the cluster:
```
{'error': '',
 'error_code': 0,
 'metadata': {'class': 'task',
              'created_at': '2020-06-17T19:45:59.014813571Z',
              'description': 'Starting container',
              'err': "Common start logic: Storage start: Failed to run: zfs mount deeper-end/containers/X1nzMoezkbvP01UrhB: cannot open 'deeper-end/containers/X1nzMoezkbvP01UrhB': dataset does not exist",
              'id': '1689171d-ef42-40f5-99b6-a27cc6cc6917',
              'location': 'ip-10-0-1-53',
              'may_cancel': False,
              'metadata': {'container_progress': 'Remapping container filesystem'},
              'resources': {'containers': ['/1.0/containers/X1nzMoezkbvP01UrhB']},
              'status': 'Failure',
              'status_code': 400,
              'updated_at': '2020-06-17T19:45:59.026857479Z'},
 'operation': '',
 'status': 'Success',
 'status_code': 200,
 'type': 'sync'}
```
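One thing that tripped me up is that the wrapper is a successful sync response (`status_code` 200) while the actual failure is buried in the embedded operation. A small (hypothetical) helper I use to surface that:

```python
def operation_failed(response: dict) -> bool:
    # The outer envelope can say 'Success' even when the embedded
    # operation under 'metadata' carries its own failure status.
    op = response.get("metadata", {})
    return op.get("status") == "Failure" or op.get("status_code", 0) >= 400

response = {
    "status": "Success",
    "status_code": 200,
    "metadata": {"status": "Failure",
                 "status_code": 400,
                 "err": "dataset does not exist"},
}
print(operation_failed(response))  # True
```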
If I retry the start call it succeeds, which makes me think this could be a race?
Notably, `lxc launch --target <some_node>` does work, but based on the traffic it looks like it makes an additional `GET /1.0/instances/` before the call to start the instance (https://github.com/lxc/lxd/blob/c66ca9c060a776e39490669782712ff8e4225301/lxc/init.go#L347), so maybe that gives the ZFS changes enough time to stick?
Has anyone else seen this? I could always just retry the start call, but I would have thought that waiting for the create operation to succeed should be sufficient.
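If it helps anyone, the workaround I have in mind is just a retry with backoff around the start call. Sketch only: `start_fn` stands in for whatever issues the `PUT /1.0/instances/<name>/state` request and raises on failure.

```python
import time

def start_with_retry(start_fn, attempts: int = 3, delay: float = 0.5):
    # Retry the start call a few times; start_fn is hypothetical and
    # should raise RuntimeError when the operation reports a failure.
    for i in range(attempts):
        try:
            return start_fn()
        except RuntimeError:
            if i == attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(delay * (2 ** i))  # simple exponential backoff
```

Ugly, but it papers over the race until I understand what's actually going on.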
I’m using LXD 4.0.1.