Can I train an AI model using an Incus cluster?

I have built an Incus cluster with three computers:

Computer 1 (bootstrap server):
System: Ubuntu 20.04
GPU: Nvidia GeForce RTX 4090 * 2

Computer 2:
System: Ubuntu 20.04
GPU: Nvidia GeForce RTX 4090 D * 2

Computer 3:
System: Ubuntu 20.04
GPU: Nvidia GeForce RTX 4090 D * 2

I created a container on the bootstrap server running Ubuntu 20.04 (container image), and installed the GPU driver, CUDA, and the rest of the deep-learning environment in it. Can this container use the GPUs from the other servers (computers 2 and 3)? If an Incus cluster supports this, what should I do to achieve it?

The actual ML part is handled by the framework you're using; Incus is typically just used to host the containers/VMs doing the work. An Incus cluster can't make a remote GPU appear local to a single container, so multi-machine training means running a container on each host (each with its local GPUs passed through) and letting a distributed training framework coordinate them.

Here’s the TensorFlow guide to multi-worker training, which might be of interest for getting started.
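For multi-worker training, each container tells TensorFlow about the others through the `TF_CONFIG` environment variable. A minimal sketch, assuming one container per machine (the IP addresses and port are made-up examples; set `index` to 0, 1, or 2 depending on which container it is):

```shell
# Hypothetical container addresses; replace with your containers' real IPs.
# "index" is this worker's position in the list (0 here, i.e. computer 1).
export TF_CONFIG='{
  "cluster": {"worker": ["10.100.0.11:12345", "10.100.0.12:12345", "10.100.0.13:12345"]},
  "task": {"type": "worker", "index": 0}
}'
```

With that in place, a training script using `tf.distribute.MultiWorkerMirroredStrategy` started in all three containers will coordinate across the machines.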

Ray is also gaining popularity.
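With Ray, you start a head node in one container and point the others at it. A rough sketch, assuming Ray is installed in a container on each machine (the head IP is a made-up example):

```shell
HEAD_IP=10.100.0.11   # made-up example address of computer 1's container

# In the container on computer 1 (the head node):
ray start --head --port=6379

# In the containers on computers 2 and 3 (workers):
ray start --address="$HEAD_IP:6379"

# From any node, confirm the cluster sees all three hosts and their GPUs:
ray status
```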

I’ve used Incus/LXD for both of these and it works.

Here are some details that might help:

Here are two notable videos from Stéphane using LXD:

Misc commands:

  • `lxc config device add ContainerName gpu gpu gputype=physical`
  • `lxc config set ContainerName nvidia.runtime=true`
  • Inside the container, install pciutils so the GPU is observable by tools like Ollama:
    • `sudo apt install pciutils`
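Since your cluster runs Incus rather than LXD, the same steps work with the `incus` client; a sketch (the container name `ml` and device name `gpu0` are just examples):

```shell
CT=ml  # example container name; substitute your own

# Pass a physical GPU on this host through to the container,
# and enable the Nvidia runtime so the host driver is exposed inside it
incus config device add "$CT" gpu0 gpu gputype=physical
incus config set "$CT" nvidia.runtime=true

# Inside the container, install pciutils and check the GPU is visible
incus exec "$CT" -- apt install -y pciutils
incus exec "$CT" -- nvidia-smi
```

Remember this only passes through GPUs that are physically in that host; you'd repeat it for a container on each of the three machines.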

Does this help?
Chuck