Does clustering change operation wait / start logic?

I’ve got an application that works well with standalone lxd and I wanted to give clustering a shot.

To boot a container I make a sequence of api calls like:
POST /1.0/instances (to create the instance)
GET /1.0/operations/&lt;operation_id&gt;/wait (to wait for the creation to succeed)
PUT /1.0/instances/&lt;name&gt;/state (to start the instance)
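For context, here is the sequence sketched as (method, path, body) tuples. The payload shapes follow my reading of the LXD REST API; the name, fingerprint, and operation id are placeholder values (in practice the operation id comes back in the response to the create call):

```python
def boot_sequence(name, fingerprint, operation_id):
    """The three API calls described above, as (method, path, body) tuples.

    name, fingerprint and operation_id are placeholders; the operation id
    is taken from the 'operation' field of the create response.
    """
    return [
        # 1. create the instance from an image
        ('POST', '/1.0/instances',
         {'name': name, 'source': {'type': 'image', 'fingerprint': fingerprint}}),
        # 2. block until the creation operation finishes
        ('GET', f'/1.0/operations/{operation_id}/wait', None),
        # 3. start the instance
        ('PUT', f'/1.0/instances/{name}/state',
         {'action': 'start', 'timeout': 30}),
    ]
```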

This all works fine in standalone lxd, but in a cluster I’m seeing that last call fail whenever the container gets created on a different cluster node than the one I’m talking to.

{'error': '',
 'error_code': 0,
 'metadata': {'class': 'task',
              'created_at': '2020-06-17T19:45:59.014813571Z',
              'description': 'Starting container',
              'err': "Common start logic: Storage start: Failed to run: zfs mount deeper-end/containers/X1nzMoezkbvP01UrhB: cannot open 'deeper-end/containers/X1nzMoezkbvP01UrhB': dataset does not exist",
              'id': '1689171d-ef42-40f5-99b6-a27cc6cc6917',
              'location': 'ip-10-0-1-53',
              'may_cancel': False,
              'metadata': {'container_progress': 'Remapping container filesystem'},
              'resources': {'containers': ['/1.0/containers/X1nzMoezkbvP01UrhB']},
              'status': 'Failure',
              'status_code': 400,
              'updated_at': '2020-06-17T19:45:59.026857479Z'},
 'operation': '',
 'status': 'Success',
 'status_code': 200,
 'type': 'sync'}

If I retry the start call it succeeds, which makes me think it could be a race?

Notably, lxc launch --target &lt;some_node&gt; does work, but based on the traffic it looks like it makes an additional GET /1.0/instances/&lt;name&gt; request before the call to start the instance, so maybe that gives enough time for the zfs changes to stick?

Has anyone else seen this? I could always just retry the call to start the instance, but I would have thought waiting for the create operation to succeed should be sufficient.
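If retrying does turn out to be the workaround, a minimal retry-with-backoff wrapper is enough; this is a generic sketch where `fn` stands in for whatever function issues the PUT /1.0/instances/&lt;name&gt;/state request:

```python
import time

def retry(fn, attempts=5, delay=0.5):
    """Call fn() until it returns without raising.

    Sketch of the workaround discussed above: if the start call fails
    because the ZFS dataset isn't visible yet, try again a few times
    with exponential backoff before giving up.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            time.sleep(delay)
            delay *= 2
```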

I’m using lxd 4.0.1

I think adding the target parameter to your API requests will fix this (specifically on POST /1.0/instances).

As I understand it, LXD balances containers across the cluster “round robin” style, and the image you are referencing may not exist on the target host.

With respect to the target parameter, I’m specifically interested in testing out the round robin behavior for load balancing. I was also under the impression that lxd syncs the images between all of the nodes in the cluster.

I’m still leaning toward a race as the culprit. If I modify the logic to insert a 1 second sleep after the wait call and before the start call, then the start call succeeds.

I’m going to revisit my wait call to make sure that’s working as expected.

Yeah, this was an error on my end.
The way I was constructing the url for the wait call was by pulling the operation key out of the response to the create call and appending /wait to it.

I think in some release the operation URL must have changed to have a ?project=default parameter appended to the end of it, which made the wait URL I was generating invalid.
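For anyone who hits the same thing: the robust fix is to splice /wait into the path component of the operation URL rather than appending it to the raw string, so any query string (like ?project=default) survives. A small sketch using the operation id from the error output above as an example:

```python
from urllib.parse import urlsplit, urlunsplit

def wait_url(operation):
    """Append /wait to the *path* of an operation URL, preserving the query.

    Naively doing operation + '/wait' breaks as soon as the operation URL
    carries a query string such as '?project=default'.
    """
    parts = urlsplit(operation)
    return urlunsplit(parts._replace(path=parts.path.rstrip('/') + '/wait'))

op = '/1.0/operations/1689171d-ef42-40f5-99b6-a27cc6cc6917?project=default'
print(wait_url(op))
# -> /1.0/operations/1689171d-ef42-40f5-99b6-a27cc6cc6917/wait?project=default
```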

Sorry for the false alarm