[LXD] Scriptlet based instance placement scheduler

Project: LXD
Status: Implemented
Author(s): @tomp
Approver(s): @stgraber
Release: 5.11
Internal ID: LX033

Abstract

Allow users to provide a Starlark scriptlet that decides on the cluster target at instance creation time.

This scriptlet would be provided with information about the new requested instance as well as a list of candidate cluster members (online and compatible with the request) and details about them and their existing instances to make the decision.

It can specify where the instance will be created, or prevent the instance from being created at all.

Rationale

This allows for custom logic to control instance placement, rather than the very basic placement logic that LXD currently has (cluster member with fewest instances).

Specification

Design

The instance placement scriptlet would be provided to LXD by way of a global configuration option called instances.placement.scriptlet. This would be stored in the global database and available to all cluster members.

When a request for a new instance comes to POST /1.0/instances without a target URL parameter specified, the instance placement scriptlet (if defined) would be executed on the cluster member that the request arrived at.

The scriptlet environment will be able to access the following info:

  • Instance create request (including expanded config and devices from profiles).
  • Profiles due to be used.
  • Reason for instance request:
    • New instance request.
    • Temporary instance migration during evacuation.
    • Temporary instance relocation migration while a cluster member is marked as dead.
  • Ability to retrieve candidate cluster members and their config. Cluster member candidates will be filtered by:
    • Online.
    • Correct architecture (by resolving image arch).
    • Available to restricted project (member groups).
  • Ability to retrieve cluster member’s state metrics (including system load, storage pool state etc).
  • Ability to retrieve cluster member’s resources info.
  • Ability to retrieve the expected resources required for the instance.

API changes

A new server config key called instances.placement.scriptlet will be added along with an API extension called instances_placement_scriptlet.

There will be new struct types added:

// InstancePlacement represents the instance placement request.
//
// API extension: instances_placement_scriptlet.
type InstancePlacement struct {
	api.InstancesPost `yaml:",inline"`

	Reason  string `json:"reason"`
	Project string `json:"project"`
}

// InstanceResources represents the required resources for an instance.
//
// swagger:model
//
// API extension: instances_placement_scriptlet.
type InstanceResources struct {
	CPUCores     uint64 `json:"cpu_cores"`
	MemorySize   uint64 `json:"memory_size"`
	RootDiskSize uint64 `json:"root_disk_size"`
}

Scriptlet definition

The instance placement scriptlet will be expected to contain a function called instance_placement.
This function will be called by LXD when an instance is to be created, and will be provided with the instance creation request (equivalent to the InstancePlacement struct above) as the request argument to the function. This will include a Reason field that will indicate the reason for the request, which can be new, evacuation or relocation.

As Starlark has no concept of exceptions, the instance_placement function instead reports an error by returning a non-None value. This value will be returned as an error response to the caller of LXD’s REST API. This allows the scriptlet to block the creation of the instance if needed.

The scriptlet can also optionally control which cluster member the instance should be created on.
To do this it can call the set_target function with the member_name parameter indicating the cluster member it wants.

Functions that will be available to the scriptlet:

  • log_info(*messages): Add a log entry to LXD’s log at info level.
    • messages is one or more message arguments.
  • log_warn(*messages): Add a log entry to LXD’s log at warn level.
    • messages is one or more message arguments.
  • log_error(*messages): Add a log entry to LXD’s log at error level.
    • messages is one or more message arguments.
  • set_target(member_name): Set the cluster member where the instance should be created.
    • member_name is the name of the cluster member the instance should be created on.
  • get_cluster_member_state(member_name): Get the cluster member’s state. Returns an object with the cluster member’s state equivalent to api.ClusterMemberState.
    • member_name is the name of the cluster member to get state for.
  • get_cluster_member_resources(member_name): Get information about resources on the cluster member. Returns an object with the resource info equivalent to api.Resources.
    • member_name is the name of the cluster member to get resource info for.
  • get_instance_resources(): Get information about the resources the instance will require. Returns an object with the resources info equivalent to api.InstanceResources.

The scriptlet must implement:

  • instance_placement(request, candidate_members)
    • request will be an object containing an expanded representation of shared/api/scriptlet.InstancePlacement.
    • candidate_members will be a list of cluster member objects representing shared/api.ClusterMember entries.

Example implementation:

def instance_placement(request, candidate_members):
        log_info("instance_placement started: ", request)

        if request.name == "foo":
                log_error("Invalid name supplied: ", request.name)
                return "Invalid name"

        set_target(candidate_members[0].server_name)
        return # No error

Storing a scriptlet in LXD can be achieved by creating a file for the scriptlet, e.g. instancePlacement.star and then using the following command:

lxc config set instances.placement.scriptlet "$(cat instancePlacement.star)"

Example of a scriptlet error returned to the caller:

lxc init images:ubuntu/jammy foo
Creating foo
Error: Failed instance creation: Failed instance placement scriptlet: Failed with return value: "Invalid name"

CLI changes

No CLI changes are expected.

Database changes

No DB changes are expected.

Upgrade handling

As this is a new feature, no upgrade handling is required.

Further information

When the instances.placement.scriptlet setting is changed, the new value will be compiled to check that it is a valid Starlark program. If successful then a cached compiled program will be held in memory ready to be run for each new instance creation request to avoid having to load it from the database and compile it every time.

Because each cluster member needs its own in-memory compiled cache of the program, when the instances.placement.scriptlet setting is changed we will rely upon the existing mechanism that notifies the other cluster members to refresh their local compiled cache of the program…
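
The validate-then-cache behaviour can be illustrated with a small sketch. This is not LXD's implementation (LXD is written in Go and compiles Starlark); here Python's built-in compile() stands in for the Starlark compiler, and the ScriptletCache class and its method names are hypothetical. The point it shows is that a new value is compiled before it replaces the cached program, so an invalid scriptlet is rejected and the previous one keeps being used.

```python
import threading


class ScriptletCache:
    """Hypothetical per-member cache of the compiled placement program."""

    def __init__(self):
        self._lock = threading.Lock()
        self._program = None

    def set(self, source):
        # Compile first: a SyntaxError propagates to the caller and the
        # previously cached program is left untouched.
        program = None
        if source:
            program = compile(source, "instances.placement.scriptlet", "exec")
        with self._lock:
            self._program = program

    def run(self, request, candidate_members, builtins):
        with self._lock:
            program = self._program
        if program is None:
            return None  # no scriptlet configured
        env = dict(builtins)  # fresh globals for every run
        exec(program, env)
        return env["instance_placement"](request, candidate_members)
```

In this picture, the cluster notification mentioned above would simply cause each member to call set() again with the new value read from the database.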

Would you ever consider allowing the script to make network requests? I can imagine a use case where I’d like to ask a cloud provider’s API for information about where my instance should be placed.

From what I can tell, Starlark doesn’t support external communications. I think that is because it’s designed to run embedded inside other applications and to prevent scriptlets from blocking the system for unexpected amounts of time.

It does allow for the application to provide functions into it, so LXD will be providing some functions to access cluster and resource information.

I’m not quite following the use-case of querying an external system for where an instance should be placed? This feature is only for placing instances within existing cluster members, not for provisioning new cluster members (which would make sense to involve the hosting provider at that point).

Here is a scenario - perhaps niche - that comes to mind.

Let’s say I have an LXD cluster entirely within a single AWS VPC (in us-east-1), and I have 10 cluster members spread across 3 subnets. In AWS, a subnet is deployed within a single “availability zone”. My subnets could be in AZs us-east-1a, us-east-1b, and us-east-1c.

Some AWS managed services, like RDS, require you to provision subnets across multiple availability zones to enable High Availability configurations. Traffic within a single AZ also has lower latency.

Currently, we have cluster member groups which could be assigned based on something like AZ/subnet, and presumably this information will be passed to our Starlark script. And honestly I love this feature as proposed, because it’s a gentle evolution of what exists.

I just want to point out that any scheduling algorithm that needs information from my business domain - e.g. not metrics LXD tracks, but my own customer database, my own service topology - is probably best served by a dedicated external service of my own design.


I am reminded of Envoy’s global rate limit service integration. The idea is you can configure a service that gets called on every request to determine whether rate limits are being hit.

The more I think about it, the more it seems like a Starlark script isn’t the place for this, even if the functionality could be exposed, but it could be implemented more like a webhook that gets called on POST /1.0/instances, and a response header or body can be used for placement information, or perhaps that response can be passed to Starlark, etc.

Yeah, we’ve considered webhooks for this in the past, though part of the issue with them was around which machine should call them, how to handle errors/timeouts/…

I think there’s still room to add webhooks to LXD in the future though. Maybe expose a function to the scriptlet that’s specifically meant for doing a generic webhook call or something?

Suppose your placement callback accepted a generic context-ish object where I could hang my own properties, as well as member info. Maybe something like:

def get_placement_dynamic(client : HTTPClient, userinfo : Object, members : []Member) -> ID:
    u = userinfo
    result = client.do("GET", u.my_url, ca=u.my_tls.ca, timeout=10)
    return result.ID

def get_placement_static(userinfo : Object, members : []Member) -> ID:
    # only a pure function is allowed here
    return something

Of course, this all depends on what’s elegant in Starlark.

FWIW, this proposal can (and probably should) be implemented with pure functions that don’t make network requests. But I wanted to flag this use-case, because perhaps your design can be extended later.

I agree. We do plan on providing the cluster members and their config to the Starlark scriptlet.
This will include group membership and any failure domains they are part of (it sounds like you could use those to ensure LXD spreads its own cluster roles across the availability zones correctly).

Additionally almost all LXD entities (including cluster members) have support for custom user config fields (starting with user.*) and those will also be made available to the scriptlet.

So you could mark each cluster member as being part of a particular AZ or subnet (or anything you like really) and then use that for placement logic.

@tomp I was thinking, as the scheduler will be used for:

  • New instances
  • Temporary instance location during evacuation
  • Temporary instance location while a cluster member is marked as dead

It may be worth passing in an additional field to the function to indicate which of those scenarios we’re dealing with. This would allow administrators to configure a different policy (possibly less strict) when dealing with emergency placement.

Ah yes that sounds like a good idea. Added to spec.

Pull request for this work is here:

Here is an example of a script targeting the server with the most free memory. The Starlark // operator floors the result. It would be nice if the request always included the memory and CPU requirements of the instance; when the instance is created using defaults, that information is absent from the data.

def instance_placement(reason, request, candidate_members):

        # targets the server with the highest available free memory
        # assumes there will be at least one candidate

        server_target = None
        memory_best = -1

        for candidate in candidate_members: 

            server = candidate["server_name"]
            state = get_cluster_member_state(server)
            memory_free = state["sysinfo"]["free_ram"] // (1024 * 1024) # in MB floored
            loads = state["sysinfo"]["load_averages"]

            log_info(server, " - free memory: ", memory_free, "MB, load: ", loads)

            if not server_target or memory_free > memory_best:
                server_target = server
                memory_best = memory_free

        log_info("targeting ", server_target, " for ", reason, " instance ", request["name"])

        set_target(server_target)
        return

Nice!

We don’t know the CPU or memory requirements of an instance, I’m afraid. It very much depends on what will be run inside of it.

An understanding of what an instance requires is part of what the scriptlet itself could bring, based on the user’s requirements (perhaps based on custom user config in the request or the image being used).

If the instance or profiles have limits applied they will be in the request config though.

I meant the creation specs. If I set the values or use an instance type (-t), I see the info in the request while there is nothing for the default (haven’t updated the omit thing :-)).

{\"architecture\": \"x86_64\", \"config\": {}, \"devices\": {\"root\": {\"path\": \"/\", \"pool\": \"remote\", \"type\": \"disk\"}, \"eth0\": {\"type\": \"nic\", \"name\": \"eth0\", \"nictype\": \"bridged\", \"parent\": \"br0\"}}, \"ephemeral\": False, \"profiles\": [\"default\"], \"restore,omitempty\": \"\", \"stateful\": False, \"description\": \"\", \"name\": \"test2\", \"source\": {\"type\": \"image\", \"certificate\": \"\", \"alias,omitempty\": \"ubuntu/22.04\", \"fingerprint,omitempty\": \"\", \"properties,omitempty\": {}, \"server,omitempty\": \"https://images.linuxcontainers.org\", \"secret,omitempty\": \"\", \"protocol,omitempty\": \"simplestreams\", \"base-image,omitempty\": \"\", \"mode,omitempty\": \"pull\", \"operation,omitempty\": \"\", \"secrets,omitempty\": {}, \"source,omitempty\": \"\", \"live,omitempty\": False, \"instance_only,omitempty\": False, \"container_only,omitempty\": False, \"refresh,omitempty\": False, \"project,omitempty\": \"\", \"allow_inconsistent\": False}, \"instance_type\": \"\", \"type\": \"virtual-machine\"}"

vs

{\"architecture\": \"x86_64\", \"config\": {\"limits.cpu\": \"4\", \"limits.memory\": \"8192MB\"}, \"devices\": {\"root\": {\"path\": \"/\", \"pool\": \"remote\", \"type\": \"disk\"}, \"eth0\": {\"type\": \"nic\", \"name\": \"eth0\", \"nictype\": \"bridged\", \"parent\": \"br0\"}}, \"ephemeral\": False, \"profiles\": [\"default\"], \"restore,omitempty\": \"\", \"stateful\": False, \"description\": \"\", \"name\": \"test3\", \"source\": {\"type\": \"image\", \"certificate\": \"\", \"alias,omitempty\": \"ubuntu/22.04\", \"fingerprint,omitempty\": \"\", \"properties,omitempty\": {}, \"server,omitempty\": \"https://images.linuxcontainers.org\", \"secret,omitempty\": \"\", \"protocol,omitempty\": \"simplestreams\", \"base-image,omitempty\": \"\", \"mode,omitempty\": \"pull\", \"operation,omitempty\": \"\", \"secrets,omitempty\": {}, \"source,omitempty\": \"\", \"live,omitempty\": False, \"instance_only,omitempty\": False, \"container_only,omitempty\": False, \"refresh,omitempty\": False, \"project,omitempty\": \"\", \"allow_inconsistent\": False}, \"instance_type\": \"c4-m8\", \"type\": \"virtual-machine\"}"

So it looks like in that case there is no config from the profile, so it’s empty (and so no limits are applied). What does “lxc config show (instance) --expanded” show for the instance once it’s created?

Also what does “lxc profile show default” show?

The instance created with the parameters lists them, the one without is empty so this reflects what we have in the request. In my setup, the default profile does not contain any value but there is a default somewhere as the VM is created with 1GB of RAM and 1 CPU.

I am thinking the placement script could receive the result of LXD processing parameters, profile(s) and hidden defaults.

Ah right I see, you’re talking about VMs.

I thought you were using containers (I see the instance type is vm now).

Yes there is a driver level default used for VMs without any limits specified.

At the moment that is applied after the placement script has run, inside the VM driver.

I should think we can pull that logic up to earlier in the process (or at least simulate it) so it’s available to the script.

Maybe we can offer a function which returns the expected resource usage of the instance? This would parse the request and get us a bytes amount of limits.memory, CPU core count for limits.cpu, bytes amount for the root disk, …

And this function could be made aware of the fact that VMs have a default minimum size?

Yeah that would also make it easier for the script to use the manually supplied limits because they wouldn’t have to worry about parsing the human readable sizes we accept in config.
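
To illustrate what a script would otherwise have to do itself, here is a rough sketch of parsing a value such as the 8192MB seen in the request above. The parse_size helper is hypothetical, and the unit table is an assumption (decimal multipliers for kB/MB/GB/TB, binary for KiB/MiB/GiB/TiB) that should be checked against LXD’s own size parser; it also ignores anything beyond plain integers.

```python
# Multipliers written out longhand because Starlark has no ** operator.
_UNITS = {
    "": 1,
    "B": 1,
    "kB": 1000,
    "MB": 1000 * 1000,
    "GB": 1000 * 1000 * 1000,
    "TB": 1000 * 1000 * 1000 * 1000,
    "KiB": 1024,
    "MiB": 1024 * 1024,
    "GiB": 1024 * 1024 * 1024,
    "TiB": 1024 * 1024 * 1024 * 1024,
}

_LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def parse_size(value):
    """Return the size in bytes, or None if the value is not understood."""
    digits = value.rstrip(_LETTERS)  # leading numeric part
    suffix = value[len(digits):]     # trailing unit, possibly empty
    if not digits.isdigit() or suffix not in _UNITS:
        return None
    return int(digits) * _UNITS[suffix]
```

With a function on the LXD side returning bytes and core counts directly, none of this would need to live in the scriptlet.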

@stgraber something like this?

// InstanceResources represents the required resources for an instance.
//
// API extension: instances_placement_scriptlet.
type InstanceResources struct {
	CPUCores     uint64 `json:"cpu_cores"`
	MemorySize   uint64 `json:"memory_size"`
	RootDiskSize uint64 `json:"root_disk_size"`
}

@stgraber @egelinas I’ve updated the spec to reflect the changes added in https://github.com/lxc/lxd/pull/11180#issuecomment-1396849493