| Project     | LXD         |
| ----------- | ----------- |
| Status      | Implemented |
| Author(s)   | @tomp       |
| Approver(s) | @stgraber   |
| Release     | LXD 5.6     |
| Internal ID | LX020       |
Abstract
Implement a new object storage management API which will let us allocate object storage buckets within storage pools and provide access to them using an S3 API.
The goal is to provide feature parity with most public clouds by providing an object storage feature that many users have come to expect.
We aim to provide a way to create buckets of a specific size (so we can apply project quotas) and provide a URL and credentials back to the user to allow them access to the bucket.
Actual S3 data access calls will not be done through the main LXD API.
Rationale
To provide a solution for object storage in LXD, both for the distributed setups using Ceph and for local usage.
Ceph
For the Ceph case, we will utilize an externally configured rados gateway (radosgw). LXD will manage user and bucket creation. LXD will then provide to the user the URL of the specified radosgw along with the credentials to access the bucket.
Local
For the local case we will support enabling a LXD listener that will proxy requests to a per-bucket on-demand MinIO process (see more on the reasoning behind this below). LXD will deal with starting/stopping the MinIO process and setting up users and buckets. LXD will then provide to the user the listener URL along with the credentials to access the bucket.
Specification
Design
At the top level we plan on introducing a new storage pool entity type called "bucket".
The bucket name will be restricted to valid domain characters and can only be up to 63 characters long.
We expect any user with storage access within a project to request bucket creation via the LXD API or CLI tool, which will respond with two sets of S3 access credentials (for read/write and read only).
We will not allow bucket creation via the S3 API itself.
Access to the bucket via the S3 API will be managed using keys. A key will have a role of either read-only or admin.
Ceph
For Ceph we plan to introduce a new storage pool type called `cephobject` (like we did for the `cephfs` type) which only supports "bucket" entities (in the same way that `cephfs` only supports custom filesystem volumes).
Bucket names for `cephobject` pools will be unique per storage pool, because each pool is expected to use its own radosgw endpoint that is configured to use a separate tenant/zone group.
The `cephobject` pools will support an optional pool-level setting called `cephobject.bucket.name_prefix` that will be prepended to the name of all newly created buckets. The bucket name prefix will be accounted for as part of the name length limit. This can be used to isolate LXD-created buckets when using a radosgw endpoint that is shared with other applications.
The reason for the new storage pool type is that the underlying Ceph rados gateway that provides the S3 API requires the use of several dedicated OSD pools (similar to the `cephfs` pool type), whereas the existing `ceph` storage pool type creates all of its volumes inside a single designated OSD pool. Supporting radosgw buckets on a `ceph` storage pool would therefore mean that certain entities would exist outside of the designated OSD pool for that LXD storage pool. This was deemed confusing and undesirable, so to keep things aligned conceptually we will use a new storage pool type for Ceph radosgw object storage.
An additional reason for implementing a new storage pool type for Ceph object storage is that LXD will rely on a radosgw already being set up and being told the address of that existing radosgw endpoint in order to use the S3 API to create buckets. A radosgw can be configured to use a particular tenant and/or zone group, so it will be possible to have multiple `cephobject` storage pools, each one configured to use a different radosgw endpoint.
The `cephobject` storage pool type will still rely on the Ceph `/etc/ceph/ceph.conf` and `/etc/ceph/ceph.client.admin.keyring` files being present on each LXD server to be able to access the Ceph monitors (as we do for the `ceph` and `cephfs` pool types) in order to use the `radosgw-admin` tool to manage radosgw users and buckets. In fact the only thing that LXD will use the radosgw endpoint for directly is creating buckets (which cannot be done via the `radosgw-admin` command).
The `cephobject` storage pool type will have the following config options:

- `cephobject.cluster_name` - Name of the Ceph cluster that contains the radosgw.
- `cephobject.user.name` - The Ceph user to use when running `radosgw-admin` to create the `lxd-admin` radosgw user.
- `cephobject.radosgw.endpoint` - `scheme://host:port` to use to communicate with the radosgw S3 API. The scheme is included to support both HTTP and HTTPS radosgw endpoints. This URL will be used both by LXD to create buckets and given out to users to access their buckets.
- `cephobject.radosgw.endpoint_cert_file` - File containing the certificate of the radosgw endpoint for LXD to verify when connecting to it over HTTPS.
- `cephobject.bucket.name_prefix` - Optional prefix to prepend to the names of new buckets.
- `user.*` - Custom user config.
The user layout for radosgw buckets will be as follows:

- A `lxd-admin` user created when the storage pool is created (if it doesn't already exist) using the `radosgw-admin user create` command. This user will be used to create S3 buckets via the radosgw API endpoint. If it does already exist, then its existing S3 credentials will be used.
- A user named after the bucket. This user will be created using `radosgw-admin user create` with the `--max-buckets=-1` setting to prevent them from being able to create their own buckets.
- Sub-users of the bucket user created with read/write and read-only permissions. These sub-users will each have their own access and secret keys that will be used by applications to access the bucket.
When a new bucket is requested via the LXD API, LXD will use the `lxd-admin` user to create the bucket via the radosgw S3 API, and then use the `radosgw-admin bucket link` command to change the owner of the bucket to that bucket's own user.
In this way the bucket can be owned by the bucket's user while still preventing that user from creating their own buckets. It will be possible for the user to delete their own bucket, but they will not be able to recreate it.
Because the bucket is owned by the bucket user, it will be possible for the bucket user to set the S3 policy on the bucket, for example to make the bucket publicly accessible.
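To illustrate the flow, here is a minimal sketch in Go of what the create sequence could look like, assuming the minio-go client is used to talk to the radosgw S3 API and `radosgw-admin` is shelled out to. The function and parameter names are illustrative only and not the final implementation.

```go
// Hedged sketch: create a radosgw bucket as the lxd-admin user, then hand
// ownership to the per-bucket radosgw user via radosgw-admin.
package main

import (
	"context"
	"fmt"
	"os/exec"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// createCephBucket is illustrative only; names and wiring are not the final LXD code.
// hostPort and https would be derived from cephobject.radosgw.endpoint.
func createCephBucket(ctx context.Context, hostPort, adminAccessKey, adminSecretKey, bucketName string, https bool) error {
	// Connect to the radosgw S3 API with the lxd-admin user's credentials.
	client, err := minio.New(hostPort, &minio.Options{
		Creds:  credentials.NewStaticV4(adminAccessKey, adminSecretKey, ""),
		Secure: https,
	})
	if err != nil {
		return err
	}

	// Create the bucket as lxd-admin (bucket creation cannot be done via radosgw-admin).
	if err := client.MakeBucket(ctx, bucketName, minio.MakeBucketOptions{}); err != nil {
		return err
	}

	// Change the owner of the new bucket to the bucket's own radosgw user.
	out, err := exec.CommandContext(ctx, "radosgw-admin", "bucket", "link",
		"--bucket="+bucketName, "--uid="+bucketName).CombinedOutput()
	if err != nil {
		return fmt.Errorf("bucket link failed: %w (%s)", err, out)
	}

	return nil
}

func main() {
	// Example values only.
	err := createCephBucket(context.TODO(), "radosgw.example.com:7480", "ACCESS", "SECRET", "foo", false)
	if err != nil {
		panic(err)
	}
}
```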
Local storage
For the local object storage we plan to add the new "bucket" entity type to all of the existing local storage pool types (dir, btrfs, lvm, and zfs). This will use a volume for each bucket and use MinIO to provide the S3 API and object storage on top of the volume.
Bucket names for local object storage will be unique per cluster member.
We were originally planning to embed MinIO inside LXD and expose it through a new LXD listener. We then wanted to delegate certain buckets to certain mounted storage pool volumes inside MinIO's config.
Alas this is not currently possible with MinIO, because it does not support embedding and it does not support mapping buckets to specific directories. Instead it only supports a single top-level directory and then manages the bucket storage inside that directory. During our research it was observed, however, that each bucket is created as a sub-directory below the MinIO main directory. So we also tried the approach of mounting the storage pool bucket volumes into the MinIO main directory. However MinIO has explicit checks for cross-device mounts inside its main directory and refuses to start, because it relies on atomic renames and so does not support cross-device bind mounts.
So in order to work around the limitations of MinIO, the current plan is to create a LXD listener that reverse proxies S3 requests to dynamic MinIO processes, with one process being run for each bucket. Although MinIO does appear to start up and shut down quickly, its initial resident memory is about 100MB per process, so we do not want to consume that much memory for every bucket. Instead LXD will dynamically start MinIO when a bucket is requested, and then stop the process once it has been idle for several minutes. This is similar to how LXD's `forkfile` process operates.
The LXD listener address will be specified by the cluster member specific `core.object_address` global setting. It will be an HTTPS listener using the LXD server's own certificate or cluster certificate (like the API).
LXD will configure each MinIO process to listen on a random high port on the local loopback address, and to set the root user to `lxd-admin` with a random password on each start up.
When LXD is stopped or reloaded all running MinIO processes will be stopped until their associated bucket is requested again.
LXD will create a MinIO bucket and a user for each bucket, along with service accounts (like Ceph radosgw sub-users) for that user with S3 policies applied to restrict them to their associated bucket. The policy will also prevent writing for the read-only service account, and neither service account will be able to create buckets.
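For illustration, below is a minimal sketch of how a per-bucket MinIO process could be spawned on a random loopback port. The function name, the bucket volume path and the overall wiring are hypothetical, not the final LXD implementation.

```go
// Hedged sketch: start a MinIO process for a single bucket on a random loopback port.
package main

import (
	"fmt"
	"net"
	"os"
	"os/exec"
)

// startMinioForBucket is illustrative only; the real LXD implementation may differ.
func startMinioForBucket(bucketVolumePath, rootPassword string) (*exec.Cmd, string, error) {
	// Ask the kernel for a free loopback TCP port.
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, "", err
	}
	addr := l.Addr().String()
	l.Close() // Release the port so MinIO can bind it (small race, acceptable for a sketch).

	// Run MinIO against the mounted bucket volume with lxd-admin as the root user.
	cmd := exec.Command("minio", "server", "--address", addr, bucketVolumePath)
	cmd.Env = append(os.Environ(),
		"MINIO_ROOT_USER=lxd-admin",
		"MINIO_ROOT_PASSWORD="+rootPassword,
	)

	if err := cmd.Start(); err != nil {
		return nil, "", err
	}

	// The LXD object listener would now reverse proxy S3 requests for this bucket to addr.
	return cmd, addr, nil
}

func main() {
	// Path and password are example values only.
	cmd, addr, err := startMinioForBucket("/path/to/bucket-volume", "random-password")
	if err != nil {
		panic(err)
	}
	fmt.Println("MinIO listening on", addr, "pid", cmd.Process.Pid)
}
```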
Project feature
A new project feature called `features.storage.buckets` will be added. This will default to true for new projects. A DB patch will be added to apply `features.storage.buckets=true` to all existing projects that have `features.storage.volumes` enabled.
API changes
A new API extension will be added called `storage_buckets`, with the following API endpoints and structures added:
Create and edit a storage bucket
POST /1.0/storage-pools/<pool_name>/buckets
PUT /1.0/storage-pools/<pool_name>/buckets/<bucket_name>
Using the following new API structures respectively:
type StorageBucketsPost struct {
StorageBucketPut `yaml:",inline"`
// Bucket name
// Example: foo
//
// API extension: storage_buckets
Name string `json:"name" yaml:"name"`
}
type StorageBucketPut struct {
// Storage bucket configuration map (refer to doc/storage-buckets.md)
// Example: {"size": "50GiB"}
//
// API extension: storage_buckets
Config map[string]string `json:"config" yaml:"config"`
// Description of the storage bucket
// Example: My custom bucket
//
// API extension: storage_buckets
Description string `json:"description" yaml:"description"`
}
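As a usage illustration, the following minimal sketch creates a bucket through this endpoint over the local LXD unix socket using plain net/http. The pool name, bucket name and size are example values.

```go
// Hedged sketch: request a new bucket on the "default" pool via the LXD API.
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net"
	"net/http"
)

func main() {
	// Talk to the local LXD daemon over its unix socket.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return net.Dial("unix", "/var/lib/lxd/unix.socket")
			},
		},
	}

	// Matches the StorageBucketsPost structure above.
	body, _ := json.Marshal(map[string]any{
		"name": "foo",
		"config": map[string]string{
			"size": "50GiB",
		},
	})

	resp, err := client.Post("http://lxd/1.0/storage-pools/default/buckets",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	fmt.Println("status:", resp.Status)
}
```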
Delete a storage bucket
DELETE /1.0/storage-pools/<pool_name>/buckets/<bucket_name>
List storage buckets
GET /1.0/storage-pools/<pool_name>/buckets
GET /1.0/storage-pools/<pool_name>/buckets/<bucket_name>
Returns a list or single record (respectively) of this new `StorageBucket` structure:
type StorageBucket struct {
StorageBucketPut `yaml:",inline"`
// Bucket name
// Example: foo
//
// API extension: storage_buckets
Name string `json:"name" yaml:"name"`
// Bucket S3 URL
// Example: https://127.0.0.1:8080/foo
//
// API extension: storage_buckets
S3URL string `json:"s3_url" yaml:"s3_url"`
// What cluster member this record was found on
// Example: lxd01
//
// API extension: storage_buckets
Location string `json:"location" yaml:"location"`
}
Create and edit storage bucket keys
POST /1.0/storage-pools/<pool_name>/buckets/<bucket_name>/keys
PUT /1.0/storage-pools/<pool_name>/buckets/<bucket_name>/keys/<key_name>
Using the following new API structures respectively:
type StorageBucketKeysPost struct {
StorageBucketKeyPut `yaml:",inline"`
// Key name
// Example: my-read-only-key
//
// API extension: storage_buckets
Name string `json:"name" yaml:"name"`
}
type StorageBucketKeyPut struct {
// Description of the storage bucket key
// Example: My read-only bucket key
//
// API extension: storage_buckets
Description string `json:"description" yaml:"description"`
// Whether the key can perform write actions or not.
// Example: read-only
//
// API extension: storage_buckets
Role string `json:"role" yaml:"role"`
// Access key
// Example: 33UgkaIBLBIxb7O1
//
// API extension: storage_buckets
AccessKey string `json:"access-key" yaml:"access-key"`
// Secret key
// Example: kDQD6AOgwHgaQI1UIJBJpPaiLgZuJbq0
//
// API extension: storage_buckets
SecretKey string `json:"secret-key" yaml:"secret-key"`
}
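A key could then be created in the same way. The short sketch below follows the same unix-socket pattern as the bucket creation example above; the key name and role are example values, and the access/secret keys are omitted so that LXD generates them.

```go
// Hedged sketch: create a read-only key for bucket "foo" on pool "default".
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net"
	"net/http"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return net.Dial("unix", "/var/lib/lxd/unix.socket")
			},
		},
	}

	// Matches the StorageBucketKeysPost structure above.
	body, _ := json.Marshal(map[string]any{
		"name": "my-read-only-key",
		"role": "read-only",
	})

	resp, err := client.Post("http://lxd/1.0/storage-pools/default/buckets/foo/keys",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The returned StorageBucketKey record contains the generated access and secret keys.
	fmt.Println("status:", resp.Status)
}
```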
Delete a storage bucket key
DELETE /1.0/storage-pools/<pool_name>/buckets/<bucket_name>/keys/<key_name>
List storage bucket keys
GET /1.0/storage-pools/<pool_name>/buckets/<bucket_name>/keys
GET /1.0/storage-pools/<pool_name>/buckets/<bucket_name>/keys/<key_name>
Returns a list or single record (respectively) of this new `StorageBucketKey` structure:
// StorageBucketKey represents the fields of a LXD storage pool bucket key
//
// swagger:model
//
// API extension: storage_buckets.
type StorageBucketKey struct {
StorageBucketKeyPut `yaml:",inline"`
// Key name
// Example: my-read-only-key
//
// API extension: storage_buckets
Name string `json:"name" yaml:"name"`
}
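Once a bucket and a key exist, applications access the bucket with any S3 client using the returned S3 URL and the key's access/secret key pair. The following minimal sketch uses the minio-go client; the URL and credential values are the example values from the structures above.

```go
// Hedged sketch: upload an object using the S3 URL and key returned by LXD.
package main

import (
	"context"
	"net/url"
	"strings"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Values as returned by the LXD storage bucket and key APIs (examples only).
	s3URL := "https://127.0.0.1:8080/foo"
	accessKey := "33UgkaIBLBIxb7O1"
	secretKey := "kDQD6AOgwHgaQI1UIJBJpPaiLgZuJbq0"

	u, err := url.Parse(s3URL)
	if err != nil {
		panic(err)
	}
	bucket := strings.TrimPrefix(u.Path, "/")

	// For an HTTPS listener using the LXD server/cluster certificate, a custom
	// Transport trusting that certificate may also be needed.
	client, err := minio.New(u.Host, &minio.Options{
		Creds:  credentials.NewStaticV4(accessKey, secretKey, ""),
		Secure: u.Scheme == "https",
	})
	if err != nil {
		panic(err)
	}

	// Store a small object in the bucket.
	data := strings.NewReader("hello")
	_, err = client.PutObject(context.TODO(), bucket, "hello.txt", data, int64(data.Len()), minio.PutObjectOptions{})
	if err != nil {
		panic(err)
	}
}
```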
CLI changes
Add a `bucket` sub-command to the storage pool sub-command:
lxc storage bucket ls <pool>
lxc storage bucket create <pool> <bucket_name> [key=value...]
lxc storage bucket show <pool> <bucket_name>
lxc storage bucket set <pool> <bucket_name> <key>=<value>...
lxc storage bucket delete <pool> <bucket_name>
lxc storage bucket key create <pool> <bucket_name> <key_name> [--role=[admin,read-only]] [--access-key=<access_key>] [--secret-key=<secret_key>]
lxc storage bucket key edit <pool> <bucket_name> <key_name>
lxc storage bucket key delete <pool> <bucket_name> <key_name>
The `--role` flag value will default to read-only if not specified.
The `--access-key` and `--secret-key` flag values will be randomly generated if not specified.
Valid bucket config keys are:

- `size` - Sets the maximum size of the bucket in bytes.
- `user.*` - Custom user-specified keys.
Database changes
There will be three new tables added, called `storage_buckets`, `storage_buckets_config` and `storage_buckets_keys`.
CREATE TABLE "storage_buckets" (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
name TEXT NOT NULL,
storage_pool_id INTEGER NOT NULL,
node_id INTEGER,
description TEXT NOT NULL,
project_id INTEGER NOT NULL,
UNIQUE (node_id, name),
FOREIGN KEY (storage_pool_id) REFERENCES "storage_pools" (id) ON DELETE CASCADE,
FOREIGN KEY (node_id) REFERENCES "nodes" (id) ON DELETE CASCADE,
FOREIGN KEY (project_id) REFERENCES "projects" (id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX storage_buckets_unique_storage_pool_id_node_id_name ON "storage_buckets" (storage_pool_id, IFNULL(node_id, -1), name);
CREATE TABLE "storage_buckets_config" (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
storage_bucket_id INTEGER NOT NULL,
key TEXT NOT NULL,
value TEXT NOT NULL,
UNIQUE (storage_bucket_id, key),
FOREIGN KEY (storage_bucket_id) REFERENCES "storage_buckets" (id) ON DELETE CASCADE
);
CREATE TABLE "storage_buckets_keys" (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
storage_bucket_id INTEGER NOT NULL,
name TEXT NOT NULL,
access_key TEXT NOT NULL,
secret_key TEXT NOT NULL,
role TEXT NOT NULL,
UNIQUE (storage_bucket_id, name),
FOREIGN KEY (storage_bucket_id) REFERENCES "storage_buckets" (id) ON DELETE CASCADE
);
Upgrade handling
This is a new feature, so no upgrade handling is required.
Further information
For local buckets we also briefly considered SeaweedFS after a community suggestion; however, it was deemed too heavyweight and complex (being more like Ceph, with the S3 API acting as a front end on top of its existing distributed collection and volume concepts) for the basic local storage scenario we needed it for. Like MinIO, it did not appear to support embedding.