[LXD] Object storage (S3 API)

tomp · July 11, 2022, 2:15pm


Project	LXD
Status	Implemented
Author(s)	@tomp
Approver(s)	@stgraber
Release	LXD 5.6
Internal ID	LX020

Abstract

Implement a new object storage management API which will let us allocate object storage buckets within storage pools and provide access to them using an S3 API.

The goal is to provide feature parity with most public clouds by providing an object storage feature that many workloads have come to expect.

We aim to provide a way to create buckets of a specific size (so we can apply project quotas) and provide a URL and credentials back to the user to allow them access to the bucket.

Actual S3 data access calls will not be done through the main LXD API.

Rationale

To provide a solution for object storage in LXD, both for the distributed setups using Ceph and for local usage.

Ceph

For the Ceph case, we will utilize an externally configured RADOS Gateway (radosgw). LXD will manage user and bucket creation, and will then provide the user with the URL of the specified RADOS Gateway along with the credentials to access the bucket.

Local

For the local case we will support enabling a LXD listener that will proxy requests to a per-bucket on-demand MinIO process (see more on the reasoning behind this below). LXD will deal with starting/stopping the MinIO process and setting up users and buckets. LXD will then provide to the user the listener URL along with the credentials to access the bucket.

Specification

Design

At a top level we plan on introducing a new storage pool entity type called “bucket”.

The bucket name will be restricted to valid domain characters and can only be up to 63 characters long.
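
As a rough illustration of the naming rule above, here is a minimal Go sketch. The spec does not define "valid domain characters" precisely, so the regular expression below (lowercase letters, digits and hyphens, with no leading or trailing hyphen) is an assumption, and `validateBucketName` is a hypothetical helper rather than LXD's actual validator.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxBucketNameLen is the 63 character limit stated in the spec.
const maxBucketNameLen = 63

// bucketNameRe is an ASSUMED reading of "valid domain characters":
// lowercase letters, digits and hyphens, starting and ending with a
// letter or digit.
var bucketNameRe = regexp.MustCompile(`^[a-z0-9]([a-z0-9-]*[a-z0-9])?$`)

// validateBucketName is a hypothetical helper, not LXD's real validator.
func validateBucketName(name string) error {
	if name == "" {
		return fmt.Errorf("bucket name is empty")
	}

	if len(name) > maxBucketNameLen {
		return fmt.Errorf("bucket name %q is longer than %d characters", name, maxBucketNameLen)
	}

	if !bucketNameRe.MatchString(name) {
		return fmt.Errorf("bucket name %q contains invalid characters", name)
	}

	return nil
}

func main() {
	for _, name := range []string{"foo", "my-bucket-01", "Bad_Name", "-leading-hyphen"} {
		fmt.Printf("%q valid=%v\n", name, validateBucketName(name) == nil)
	}
}
```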

We expect any user with storage access within a project to request bucket creation via the LXD API or CLI tool, which will respond with two sets of S3 access credentials (for read/write and read only).

We will not allow bucket creation via the S3 API itself.

Access to the bucket via the S3 API will be managed using keys. A key will have a role that can be either read-only or admin.

Ceph

For Ceph we plan to introduce a new storage pool type called cephobject (like we did for the cephfs type) which only supports “bucket” entities (like cephfs only supports custom filesystem volumes).

Bucket names for cephobject pools will be unique per storage pool, because each pool is expected to use its own radosgw endpoint that is configured to use a separate tenant/zone group.

The cephobject pools will support an optional pool level setting called cephobject.bucket.name_prefix that will be prepended to the name of all newly created buckets. The bucket name prefix will be accounted for as part of the name length limit. This can be used to isolate LXD created buckets if using a radosgw endpoint that is used by other applications.
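
To make the interaction between the prefix and the length limit concrete, here is a small Go sketch. The function name `prefixedBucketName` is hypothetical; the only facts taken from the spec are that the prefix is prepended and that it counts towards the 63 character limit.

```go
package main

import "fmt"

// maxBucketNameLen is the 63 character limit stated in the spec.
const maxBucketNameLen = 63

// prefixedBucketName returns the name that would be used on the radosgw
// side. The prefix counts towards the length limit, so a long prefix
// reduces the space left for the user-chosen part of the name.
func prefixedBucketName(prefix, name string) (string, error) {
	full := prefix + name
	if len(full) > maxBucketNameLen {
		return "", fmt.Errorf("bucket name %q with prefix %q is %d characters, limit is %d",
			name, prefix, len(full), maxBucketNameLen)
	}

	return full, nil
}

func main() {
	full, err := prefixedBucketName("lxd-", "foo")
	fmt.Println(full, err)
}
```

With a four character prefix such as `lxd-`, the user-chosen part of the name can be at most 59 characters long.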

The reason for the new storage pool type is that the underlying Ceph rados gateway that provides the S3 API requires the use of several dedicated OSD pools (similar to the cephfs pool type). By contrast, the existing ceph storage pool type creates all of its volumes inside a single designated OSD pool, so supporting radosgw buckets on a ceph storage pool would mean that certain entities would exist outside of the designated OSD pool for that LXD storage pool. This was deemed confusing and undesirable, so to keep things conceptually aligned we will use a new storage pool type for Ceph radosgw object storage.

An additional reason for implementing a new storage pool type for Ceph object storage is that LXD will rely on a radosgw already being set up, and on being told the existing radosgw endpoint address, in order to use the S3 API to create buckets. A radosgw can be configured to use a particular tenant and/or zone group, so it will be possible to have multiple cephobject storage pools, each one configured to use a different radosgw endpoint.

The cephobject storage pool type will still rely on the Ceph /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring files being present on each LXD server to be able to access the Ceph monitors (like we do for ceph and cephfs pool types) in order to use the radosgw-admin tool to manage radosgw users and buckets. In fact the only thing that LXD will directly use the radosgw endpoint for is to create buckets (which cannot be done via the radosgw-admin command).

The cephobject storage pool type will have the following config options:

cephobject.cluster_name - Name of the Ceph cluster that contains the radosgw.
cephobject.user.name - The Ceph user to use when using radosgw-admin to create the lxd-admin radosgw user.
cephobject.radosgw.endpoint - scheme://host:port to use to communicate with the radosgw S3 API. The scheme is included to support both HTTP and HTTPS radosgw endpoints. This URL will be used by LXD to create buckets and will also be given out to users so they can access the buckets.
cephobject.radosgw.endpoint_cert_file - File containing certificate of radosgw endpoint for LXD to verify when using HTTPS to connect to it.
cephobject.bucket.name_prefix - Optional prefix to prepend to new buckets.
user.* - Custom user config.
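
As an illustration of the scheme://host:port format expected by cephobject.radosgw.endpoint, the following Go sketch checks a candidate value with the standard library. `validateEndpoint` is a hypothetical helper and the hostname is a placeholder; how LXD actually validates this option is not stated here.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateEndpoint is a hypothetical check for the
// cephobject.radosgw.endpoint option: it requires an http or https
// scheme and a non-empty host part.
func validateEndpoint(endpoint string) (*url.URL, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return nil, fmt.Errorf("invalid endpoint %q: %w", endpoint, err)
	}

	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("endpoint %q must use the http or https scheme", endpoint)
	}

	if u.Host == "" {
		return nil, fmt.Errorf("endpoint %q has no host", endpoint)
	}

	return u, nil
}

func main() {
	// radosgw.example.com is a placeholder hostname.
	for _, e := range []string{"https://radosgw.example.com:8443", "radosgw.example.com:8080", "ftp://radosgw.example.com"} {
		_, err := validateEndpoint(e)
		fmt.Printf("%-36s ok=%v\n", e, err == nil)
	}
}
```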

The user layout for radosgw buckets will be as follows:

A lxd-admin user, created when the storage pool is created (if it doesn’t already exist) using the radosgw-admin user create command. This will be used to create S3 buckets via the radosgw API endpoint. If it does already exist, then its existing S3 credentials will be used.
A user named after the bucket name. The user will be created using radosgw-admin user create and will have the --max-buckets=-1 setting used to prevent them from being able to create their own buckets.
Sub-users of the bucket user will be created with read/write and read only permissions. These users will each have their own access and secret keys that will be used by applications to access the bucket.

When a new bucket is requested via the LXD API, LXD will use the lxd-admin user to create the bucket via the radosgw S3 API, and then use the radosgw-admin bucket link command to change the owner of the bucket to the associated bucket’s new user.

In this way we can have the bucket be owned by the bucket’s user, but still prevent the user from creating their own buckets. It will be possible for the user to delete their own bucket, but they will not be able to recreate it.

Because the bucket is owned by the bucket user, it will be possible for the bucket user to set the S3 policy on the bucket, for example to make the bucket publicly accessible.
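
The order of operations for a new Ceph bucket described above can be sketched as a list of steps. This Go snippet only builds and prints the steps and does not run anything. The `--max-buckets=-1` flag comes from the spec; the `--uid`, `--display-name` and `--bucket` flags are the usual radosgw-admin options but should be checked against the installed version, and using the bucket name as the display name is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// step is one action in the assumed Ceph bucket creation sequence.
type step struct {
	Tool string   // "radosgw-admin", or "s3" for a call to the radosgw S3 API
	Args []string // command arguments, or a short description for S3 calls
}

// cephBucketSteps lists, in order, what the spec describes for a new bucket.
func cephBucketSteps(bucket string) []step {
	return []step{
		// Per-bucket user that is not allowed to create buckets itself.
		{"radosgw-admin", []string{"user", "create", "--uid=" + bucket, "--display-name=" + bucket, "--max-buckets=-1"}},
		// The bucket is created over S3 as lxd-admin, because
		// radosgw-admin cannot create buckets.
		{"s3", []string{"CreateBucket", bucket, "(as lxd-admin)"}},
		// Ownership of the bucket is then handed to the per-bucket user.
		{"radosgw-admin", []string{"bucket", "link", "--bucket=" + bucket, "--uid=" + bucket}},
	}
}

func main() {
	for i, s := range cephBucketSteps("foo") {
		fmt.Printf("%d. %s %s\n", i+1, s.Tool, strings.Join(s.Args, " "))
	}
}
```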

Local storage

For the local object storage we plan to add the new “bucket” entity type to all of the existing local storage pool types (dir, btrfs, lvm, and zfs). This will use a volume for each bucket and use MinIO to provide the S3 API and object storage on top of the volume.

Bucket names for local object storage will be unique per cluster member.

We were originally planning to embed MinIO inside LXD and expose it through a new LXD listener. We then wanted to delegate certain buckets to certain mounted storage pool volumes inside MinIO’s config.

Alas, this is not currently possible with MinIO because it does not support embedding and it does not support mapping buckets to a specific directory. Instead it only supports a single top-level directory and then manages the bucket storage inside that directory. During our research it was observed, however, that each bucket was created as a sub-directory below the MinIO main directory. So we also tried the approach of mounting the storage pool bucket volumes into the MinIO main directory. However, MinIO has explicit checks for cross-device mounts inside its main directory and refuses to start. This is due to the way that MinIO relies on atomic renames and so does not support cross-device bind mounts.

So in order to work around the limitations of MinIO the current plan is to create a LXD listener that reverse proxies S3 requests to dynamic MinIO processes, with one process being run for each bucket. Although MinIO does appear to start up and shut down quickly, its initial resident memory is about 100MB per process, so we will not want to be consuming that much memory for each bucket. Instead LXD will dynamically start MinIO when a bucket is requested, and then stop the process when it has been idle for several minutes. This is similar to how LXD’s forkfile process operates.
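
The start-on-demand and stop-when-idle behaviour can be sketched with a small tracker that records when each bucket was last used. This is a minimal sketch, not LXD's implementation: the `idleTracker` type, the five minute timeout and the bucket names are all assumptions, and the actual starting and stopping of processes is left out.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// idleTracker records when each bucket's backend process was last used.
type idleTracker struct {
	mu       sync.Mutex
	timeout  time.Duration
	lastUsed map[string]time.Time
}

func newIdleTracker(timeout time.Duration) *idleTracker {
	return &idleTracker{timeout: timeout, lastUsed: map[string]time.Time{}}
}

// touch marks a bucket as used at time now. A real implementation would
// also start the backend process here if it is not already running.
func (t *idleTracker) touch(bucket string, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastUsed[bucket] = now
}

// reap forgets and returns (sorted) the buckets idle for at least the
// timeout. A real implementation would stop their processes.
func (t *idleTracker) reap(now time.Time) []string {
	t.mu.Lock()
	defer t.mu.Unlock()

	var idle []string
	for bucket, used := range t.lastUsed {
		if now.Sub(used) >= t.timeout {
			idle = append(idle, bucket)
			delete(t.lastUsed, bucket)
		}
	}

	sort.Strings(idle)
	return idle
}

// demo uses fixed timestamps so the result does not depend on the clock.
func demo() (first, second []string) {
	start := time.Date(2022, 7, 11, 14, 0, 0, 0, time.UTC)
	t := newIdleTracker(5 * time.Minute) // assumed timeout
	t.touch("foo", start)
	t.touch("bar", start.Add(4*time.Minute))

	first = t.reap(start.Add(5 * time.Minute))  // only foo has been idle for 5 minutes
	second = t.reap(start.Add(9 * time.Minute)) // bar has now been idle for 5 minutes
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println(first, second)
}
```

Passing the current time in as a parameter keeps the tracker easy to test; a real listener would call `touch` on each proxied request and `reap` from a periodic timer.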

The LXD listener address will be specified by the cluster member specific core.object_address global setting. It will be an HTTPS listener using the LXD server’s own certificate or cluster certificate (like the API).
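
A reverse proxy that routes by bucket name could look roughly like the following Go sketch. It assumes path-style S3 requests, where the first path segment is the bucket name (consistent with S3 URLs of the form https://host:port/foo). It uses plain HTTP test servers for brevity, whereas the listener described above would use HTTPS. The function names and the static backend map are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// bucketFromPath extracts the bucket name from a path-style S3 request
// path, for example "/foo/some/object" gives "foo".
func bucketFromPath(p string) string {
	name, _, _ := strings.Cut(strings.TrimPrefix(p, "/"), "/")
	return name
}

// newBucketProxy forwards each request to the backend registered for
// its bucket, and answers 404 for unknown buckets.
func newBucketProxy(backends map[string]*url.URL) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		target, ok := backends[bucketFromPath(r.URL.Path)]
		if !ok {
			http.Error(w, "unknown bucket", http.StatusNotFound)
			return
		}

		httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
	})
}

// status performs a GET request and returns the HTTP status code.
func status(u string) (int, error) {
	resp, err := http.Get(u)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	return resp.StatusCode, nil
}

// demo starts a stand-in backend and a front proxy on loopback, then
// requests one known and one unknown bucket.
func demo() (known, unknown int, err error) {
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "backend saw %s", r.URL.Path)
	}))
	defer backend.Close()

	target, err := url.Parse(backend.URL)
	if err != nil {
		return 0, 0, err
	}

	front := httptest.NewServer(newBucketProxy(map[string]*url.URL{"foo": target}))
	defer front.Close()

	known, err = status(front.URL + "/foo/object.txt")
	if err != nil {
		return 0, 0, err
	}

	unknown, err = status(front.URL + "/other/object.txt")
	return known, unknown, err
}

func main() {
	fmt.Println(demo())
}
```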

LXD will configure each MinIO process to listen on a random high port on the local loopback address, and set the root user to lxd-admin and a random password upon each start up.
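
One way to pick a random free loopback port and a fresh root password on each start is sketched below. The `minio server --address` option and the `MINIO_ROOT_USER` / `MINIO_ROOT_PASSWORD` environment variables are MinIO's standard settings, but how LXD actually invokes MinIO is not described here, so `minioCommand`, the data directory path and the password length are assumptions. Nothing is executed by this sketch.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net"
)

// freeLoopbackPort asks the kernel for an unused TCP port on 127.0.0.1.
// The port is released before returning, so another process could grab
// it before MinIO binds to it; a real implementation needs to handle that.
func freeLoopbackPort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()

	return l.Addr().(*net.TCPAddr).Port, nil
}

// randomPassword returns n random bytes, hex encoded (2*n characters).
func randomPassword(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}

	return hex.EncodeToString(buf), nil
}

// minioCommand builds (but does not run) a hypothetical command line
// and environment for one per-bucket MinIO process.
func minioCommand(dataDir string, port int, password string) (args []string, env []string) {
	args = []string{"minio", "server", "--address", fmt.Sprintf("127.0.0.1:%d", port), dataDir}
	env = []string{"MINIO_ROOT_USER=lxd-admin", "MINIO_ROOT_PASSWORD=" + password}
	return args, env
}

func main() {
	port, err := freeLoopbackPort()
	if err != nil {
		panic(err)
	}

	password, err := randomPassword(16)
	if err != nil {
		panic(err)
	}

	// The data directory is a placeholder path.
	args, _ := minioCommand("/path/to/bucket/volume", port, password)
	fmt.Println(args)
}
```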

When LXD is stopped or reloaded all running MinIO processes will be stopped until their associated bucket is requested again.

LXD will create a MinIO bucket and a user for each bucket, along with service accounts (like ceph radosgw sub-users) for that user with S3 policies applied to restrict them so they only see their associated bucket. The policy will also prevent writing for read-only service accounts and neither of these service accounts will be able to create buckets.
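
To give an idea of what such a policy might look like, here is a Go sketch that generates an S3 policy document restricted to a single bucket. The exact action lists LXD applies are not given in the spec, so the actions below are illustrative assumptions; the point is that only the one bucket appears as a resource, that read-only policies omit write actions, and that bucket creation is never granted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statement and policy model the standard S3 policy document layout.
type statement struct {
	Effect   string   `json:"Effect"`
	Action   []string `json:"Action"`
	Resource []string `json:"Resource"`
}

type policy struct {
	Version   string      `json:"Version"`
	Statement []statement `json:"Statement"`
}

// bucketPolicy builds a policy limited to one bucket. The action lists
// are ASSUMED for illustration and are not taken from LXD.
func bucketPolicy(bucket string, readOnly bool) policy {
	actions := []string{"s3:GetObject", "s3:ListBucket"}
	if !readOnly {
		actions = append(actions, "s3:PutObject", "s3:DeleteObject")
	}

	return policy{
		Version: "2012-10-17",
		Statement: []statement{{
			Effect: "Allow",
			Action: actions,
			Resource: []string{
				"arn:aws:s3:::" + bucket,        // the bucket itself (listing)
				"arn:aws:s3:::" + bucket + "/*", // objects inside the bucket
			},
		}},
	}
}

func main() {
	out, err := json.MarshalIndent(bucketPolicy("foo", true), "", "  ")
	if err != nil {
		panic(err)
	}

	fmt.Println(string(out))
}
```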

Project feature

A new project feature called features.storage.buckets will be added. This will default to true for new projects. A DB patch will be added to apply features.storage.buckets=true to all existing projects that have features.storage.volumes enabled.

API changes

A new API extension will be added called storage_buckets with the following API endpoints and structures added:

Create and edit a storage bucket

POST /1.0/storage-pools/<pool_name>/buckets
PUT /1.0/storage-pools/<pool_name>/buckets/<bucket_name>

Using the following new API structures respectively:

type StorageBucketsPost struct {
	StorageBucketPut `yaml:",inline"`

	// Bucket name
	// Example: foo
	//
	// API extension: storage_buckets
	Name string `json:"name" yaml:"name"`
}

type StorageBucketPut struct {
	// Storage bucket configuration map (refer to doc/storage-buckets.md)
	// Example: {"size": "50GiB"}
	//
	// API extension: storage_buckets
	Config map[string]string `json:"config" yaml:"config"`

	// Description of the storage bucket
	// Example: My custom bucket
	//
	// API extension: storage_buckets
	Description string `json:"description" yaml:"description"`
}
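
For reference, this is roughly what a bucket creation request body looks like when built from these structures. The sketch copies the two struct definitions (without comments) so that it is self-contained; the values are the examples from the field comments, and `exampleBody` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the structures above, so the example is self-contained.
type StorageBucketPut struct {
	Config      map[string]string `json:"config" yaml:"config"`
	Description string            `json:"description" yaml:"description"`
}

type StorageBucketsPost struct {
	StorageBucketPut `yaml:",inline"`

	Name string `json:"name" yaml:"name"`
}

// exampleBody returns the JSON body for POST /1.0/storage-pools/<pool_name>/buckets.
func exampleBody() (string, error) {
	req := StorageBucketsPost{
		StorageBucketPut: StorageBucketPut{
			Config:      map[string]string{"size": "50GiB"},
			Description: "My custom bucket",
		},
		Name: "foo",
	}

	// encoding/json flattens the embedded struct, so config and
	// description appear at the top level next to name.
	body, err := json.Marshal(req)
	return string(body), err
}

func main() {
	body, err := exampleBody()
	if err != nil {
		panic(err)
	}

	fmt.Println(body)
	// {"config":{"size":"50GiB"},"description":"My custom bucket","name":"foo"}
}
```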

Delete a storage bucket

DELETE /1.0/storage-pools/<pool_name>/buckets/<bucket_name>

List storage buckets

GET /1.0/storage-pools/<pool_name>/buckets
GET /1.0/storage-pools/<pool_name>/buckets/<bucket_name>

Returns a list or single record (respectively) of this new StorageBucket structure:

type StorageBucket struct {
	StorageBucketPut `yaml:",inline"`

	// Bucket name
	// Example: foo
	//
	// API extension: storage_buckets
	Name string `json:"name" yaml:"name"`

	// Bucket S3 URL
	// Example: https://127.0.0.1:8080/foo
	//
	// API extension: storage_buckets
	S3URL string `json:"s3_url" yaml:"s3_url"`

	// What cluster member this record was found on
	// Example: lxd01
	//
	// API extension: storage_buckets
	Location string `json:"location" yaml:"location"`
}

Create and edit storage bucket keys

POST /1.0/storage-pools/<pool_name>/buckets/<bucket_name>/keys
PUT /1.0/storage-pools/<pool_name>/buckets/<bucket_name>/keys/<key_name>

Using the following new API structures respectively:

type StorageBucketKeysPost struct {
	StorageBucketKeyPut `yaml:",inline"`

	// Key name
	// Example: my-read-only-key
	//
	// API extension: storage_buckets
	Name string `json:"name" yaml:"name"`
}

type StorageBucketKeyPut struct {
	// Description of the storage bucket key
	// Example: My read-only bucket key
	//
	// API extension: storage_buckets
	Description string `json:"description" yaml:"description"`

	// Whether the key can perform write actions or not.
	// Example: read-only
	//
	// API extension: storage_buckets
	Role string `json:"role" yaml:"role"`

	// Access key
	// Example: 33UgkaIBLBIxb7O1
	//
	// API extension: storage_buckets
	AccessKey string `json:"access-key" yaml:"access-key"`

	// Secret key
	// Example: kDQD6AOgwHgaQI1UIJBJpPaiLgZuJbq0
	//
	// API extension: storage_buckets
	SecretKey string `json:"secret-key" yaml:"secret-key"`
}

Delete a storage bucket key

DELETE /1.0/storage-pools/<pool_name>/buckets/<bucket_name>/keys/<key_name>

List storage bucket keys

GET /1.0/storage-pools/<pool_name>/buckets/<bucket_name>/keys
GET /1.0/storage-pools/<pool_name>/buckets/<bucket_name>/keys/<key_name>

Returns a list or single record (respectively) of this new StorageBucketKey structure:

// StorageBucketKey represents the fields of a LXD storage pool bucket key
//
// swagger:model
//
// API extension: storage_buckets.
type StorageBucketKey struct {
	StorageBucketKeyPut `yaml:",inline"`

	// Key name
	// Example: my-read-only-key
	//
	// API extension: storage_buckets
	Name string `json:"name" yaml:"name"`
}

CLI changes

Add bucket sub-command to the storage pool sub-command:

lxc storage bucket ls <pool>
lxc storage bucket create <pool> <bucket_name> [key=value...]
lxc storage bucket show <pool> <bucket_name>
lxc storage bucket set <pool> <bucket_name> <key>=<value>...
lxc storage bucket delete <pool> <bucket_name>
lxc storage bucket key create <pool> <bucket_name> <key_name> [--role=[admin,read-only]] [--access-key=<access_key>] [--secret-key=<secret_key>]
lxc storage bucket key edit <pool> <bucket_name> <key_name> 
lxc storage bucket key delete <pool> <bucket_name> <key_name>

The --role flag value will default to read-only if not specified.
The --access-key and --secret-key flag values will be randomly generated if not specified.
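
The defaulting rules for new keys can be sketched as follows. The role default and the random generation of missing keys come from the text above; the key lengths (16 and 32 characters) and the alphanumeric alphabet are assumptions based on the example values in the API structures, and `applyKeyDefaults` is a hypothetical helper.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// bucketKey holds the user-supplied values for a new key.
type bucketKey struct {
	Name, Role, AccessKey, SecretKey string
}

// randomAlnum returns n random alphanumeric characters.
func randomAlnum(n int) (string, error) {
	const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

	out := make([]byte, n)
	max := big.NewInt(int64(len(alphabet)))
	for i := range out {
		idx, err := rand.Int(rand.Reader, max)
		if err != nil {
			return "", err
		}

		out[i] = alphabet[idx.Int64()]
	}

	return string(out), nil
}

// applyKeyDefaults fills in the role and any missing credentials.
// The lengths 16 and 32 are ASSUMED from the example values.
func applyKeyDefaults(k bucketKey) (bucketKey, error) {
	if k.Role == "" {
		k.Role = "read-only"
	}

	if k.Role != "admin" && k.Role != "read-only" {
		return k, fmt.Errorf("invalid role %q", k.Role)
	}

	var err error
	if k.AccessKey == "" {
		k.AccessKey, err = randomAlnum(16)
		if err != nil {
			return k, err
		}
	}

	if k.SecretKey == "" {
		k.SecretKey, err = randomAlnum(32)
		if err != nil {
			return k, err
		}
	}

	return k, nil
}

func main() {
	k, err := applyKeyDefaults(bucketKey{Name: "my-read-only-key"})
	fmt.Println(k.Role, len(k.AccessKey), len(k.SecretKey), err)
}
```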

Valid bucket config keys are:

size - sets the maximum size of the bucket in bytes.
user.* - Custom user specified keys.
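
Since the API example uses a value such as 50GiB for size, here is a minimal Go sketch of turning such a string into a byte count. It is not LXD's parser: it only understands plain integers and the KiB, MiB, GiB and TiB suffixes, and it does not guard against overflow.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSize converts strings such as "50GiB" or "1024" into bytes.
// Only a few binary suffixes are handled in this sketch.
func parseSize(s string) (int64, error) {
	units := []struct {
		suffix string
		mult   int64
	}{
		{"KiB", 1 << 10},
		{"MiB", 1 << 20},
		{"GiB", 1 << 30},
		{"TiB", 1 << 40},
	}

	num := strings.TrimSpace(s)
	mult := int64(1)
	for _, u := range units {
		if strings.HasSuffix(num, u.suffix) {
			mult = u.mult
			num = strings.TrimSuffix(num, u.suffix)
			break
		}
	}

	n, err := strconv.ParseInt(strings.TrimSpace(num), 10, 64)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid size %q", s)
	}

	return n * mult, nil
}

func main() {
	n, err := parseSize("50GiB")
	fmt.Println(n, err) // 53687091200 <nil>
}
```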

Database changes

There will be three new tables added called storage_buckets, storage_buckets_config and storage_buckets_keys.

CREATE TABLE "storage_buckets" (
	id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
	name TEXT NOT NULL,
	storage_pool_id INTEGER NOT NULL,
	node_id INTEGER,
	description TEXT NOT NULL,
	project_id INTEGER NOT NULL,
	UNIQUE (node_id, name),
	FOREIGN KEY (storage_pool_id) REFERENCES "storage_pools" (id) ON DELETE CASCADE,
	FOREIGN KEY (node_id) REFERENCES "nodes" (id) ON DELETE CASCADE,
	FOREIGN KEY (project_id) REFERENCES "projects" (id) ON DELETE CASCADE
);

CREATE UNIQUE INDEX storage_buckets_unique_storage_pool_id_node_id_name ON "storage_buckets" (storage_pool_id, IFNULL(node_id, -1), name);

CREATE TABLE "storage_buckets_config" (
	id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
	storage_bucket_id INTEGER NOT NULL,
	key TEXT NOT NULL,
	value TEXT NOT NULL,
	UNIQUE (storage_bucket_id, key),
	FOREIGN KEY (storage_bucket_id) REFERENCES "storage_buckets" (id) ON DELETE CASCADE
);

CREATE TABLE "storage_buckets_keys" (
	id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
	storage_bucket_id INTEGER NOT NULL,
	name TEXT NOT NULL,
	access_key TEXT NOT NULL,
	secret_key TEXT NOT NULL,
	role TEXT NOT NULL,
	UNIQUE (storage_bucket_id, name),
	FOREIGN KEY (storage_bucket_id) REFERENCES "storage_buckets" (id) ON DELETE CASCADE
);
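
The unique index on (storage_pool_id, IFNULL(node_id, -1), name) is worth a short illustration. In SQLite, NULL values are treated as distinct in unique indexes, so without the IFNULL two rows with a NULL node_id and the same name would not conflict. The Go sketch below models that index in memory. It assumes that a NULL node_id represents a bucket that is not tied to a cluster member (for example on a cephobject pool), and it does not model the table-level UNIQUE (node_id, name) constraint.

```go
package main

import "fmt"

// bucketRow models the columns that take part in the unique index.
type bucketRow struct {
	StoragePoolID int64
	NodeID        *int64 // nil models SQL NULL
	Name          string
}

// uniqueKey mirrors (storage_pool_id, IFNULL(node_id, -1), name).
type uniqueKey struct {
	pool, node int64
	name       string
}

func indexKey(r bucketRow) uniqueKey {
	node := int64(-1) // IFNULL(node_id, -1)
	if r.NodeID != nil {
		node = *r.NodeID
	}

	return uniqueKey{r.StoragePoolID, node, r.Name}
}

// insert rejects rows that would violate the modelled unique index.
func insert(idx map[uniqueKey]bool, r bucketRow) error {
	k := indexKey(r)
	if idx[k] {
		return fmt.Errorf("bucket %q already exists", r.Name)
	}

	idx[k] = true
	return nil
}

func nodeID(id int64) *int64 { return &id }

// demo reports, for each insert in turn, whether it was accepted.
func demo() []bool {
	idx := map[uniqueKey]bool{}
	rows := []bucketRow{
		{1, nodeID(1), "foo"}, // local pool, member 1
		{1, nodeID(2), "foo"}, // same name on member 2: accepted
		{1, nodeID(1), "foo"}, // duplicate on member 1: rejected
		{2, nil, "foo"},       // NULL node_id: accepted
		{2, nil, "foo"},       // duplicate with NULL node_id: rejected
	}

	var ok []bool
	for _, r := range rows {
		ok = append(ok, insert(idx, r) == nil)
	}

	return ok
}

func main() {
	fmt.Println(demo()) // [true true false true false]
}
```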

Upgrade handling

This is a new feature, so no upgrade handling is required.

Further information

For local buckets we did also briefly consider SeaweedFS after a community suggestion. However, it was deemed to be too heavyweight and complex (being more like Ceph, with the S3 API being a front end on top of existing distributed collection and volume concepts) for the basic local storage scenario we needed it for. Like MinIO, it too did not appear to support embedding.

roka · July 12, 2022, 3:21pm

For object storage, take a look at SeaweedFS. Besides better S3 compatibility than MinIO, it also has a filer which can mount a shared file system. The Apache 2.0 license of SeaweedFS is also good for integrating it with LXD.

tomp · July 15, 2022, 3:08pm

@stgraber do you think we need the cephobject pools to have a config setting that will allow the user to specify a location of the CA cert to use for accessing the radosgw endpoint (if they are using HTTPS with a certificate not in the OS level trust store)? Or would we just expect the admin to add any custom CA certs into the system trust store?

stgraber · July 15, 2022, 3:42pm

Oh right, because we actually need to have LXD use the S3 API now…
We may want a config key to point to the expected cert or CA then.

Initially the plan was to never use S3 ourselves, only radosgw-admin and then just provide the S3 API URL to the user for them to consume, making this “not our problem”.

But if we now need to use S3 ourselves, that makes things a bit different…

tomp · July 15, 2022, 3:43pm

Yep, radosgw-admin cannot create buckets frustratingly (it can remove them though), nor can the admin API (https://docs.ceph.com/en/latest/radosgw/adminops/).

tomp · July 15, 2022, 3:54pm

Added cephobject.radosgw.endpoint_cert_file option.

stgraber · July 21, 2022, 7:37pm

Should mention the why? which in this case is feature parity with most public clouds and so a feature that workloads may come to expect.

stgraber · July 21, 2022, 7:38pm

Doesn’t have to be an admin, can be anyone with storage access within a project.

stgraber · July 21, 2022, 7:46pm

Where do we stand on being able to have multiple keys valid at the same time?
I’d prefer that we have a STORAGE/buckets/NAME/keys endpoint which would give us:

GET STORAGE/buckets/NAME/keys => List key names
POST STORAGE/buckets/NAME/keys => Create a new key (takes in name, description, read-only flag)
GET STORAGE/buckets/NAME/keys/KEYNAME => Get the access key + secret
DELETE STORAGE/buckets/NAME/keys/KEYNAME => Revoke the key

This would allow for multiple applications interacting with the same bucket and would allow for easy revocation and rotation of keys.

tomp · July 21, 2022, 8:08pm

That should be fine. Ceph radosgw can have multiple named sub-users. I would need to re-run my verification program using a sub-user for write access, as currently I’m using the single main user key for write access, and a single read-only sub-user. But in principle should work fine.

For MinIO there are users, and those users have Service Accounts (which have access & secret keys). The service accounts appear to be where we can apply user level S3 access policies (read, write, which buckets are accessible etc).

One possible wrinkle is that in MinIO the service accounts don’t have a name (unlike the radosgw sub-users). So LXD would need to keep an internal mapping of name to Access Key in MinIO.

tomp · July 22, 2022, 12:26pm

Updated

tomp · July 22, 2022, 12:26pm

Updated

stgraber · July 23, 2022, 3:33am

Okay, then I think we should implement an API similar to that I described above.

tomp · July 25, 2022, 10:05am

I’ve updated the spec to cover this now.

tomp · July 25, 2022, 11:02am

Confirmed using a fullaccess sub-user instead of the main bucket user works the same.
Updated script here: “updates ceph tests to use subuser for write operations” (tomponline/ceph-examples@451aeb5 on GitHub).

tomp · July 25, 2022, 1:20pm

tomp · July 25, 2022, 1:31pm

Do we want projects to have a feature to support project-level buckets or should we use the existing project-level custom volume setting?

stgraber · July 25, 2022, 1:49pm

We can make a new feature and enable it by default even for existing pools (will need a patch).

That’s mostly so folks don’t get confused because of the naming. In practice I expect just about everyone to always use the two combined.

tomp · July 25, 2022, 1:54pm

Same question, but with API permissions/role, should we create a storage-buckets permission or re-use storage-volumes?

stgraber · July 25, 2022, 2:13pm

Let’s reuse storage-volumes for now, as those permissions aren’t user visible but a detail within RBAC. We can clean those up when we switch to a different RBAC stack.

Stéphane