Can I train an AI model using an Incus cluster?

I have built an Incus cluster with three computers:

Computer 1 (bootstrap server):
System: Ubuntu 20.04
GPU: Nvidia GeForce RTX 4090 * 2

Computer 2:
System: Ubuntu 20.04
GPU: Nvidia GeForce RTX 4090 D * 2

Computer 3:
System: Ubuntu 20.04
GPU: Nvidia GeForce RTX 4090 D * 2

I created a container on the bootstrap server running Ubuntu 20.04 (container image), and installed the GPU driver, CUDA, and the rest of the deep-learning environment in it. Can this container use the GPUs from the other servers (computers 2 and 3)? If an Incus cluster supports this, what should I do to achieve it?

The actual ML part is handled by the framework you're using; Incus is typically just used to host the containers/VMs doing the work. An Incus cluster can't make a remote GPU appear local to a single container, so multi-machine training means running a container on each host (each with its local GPUs passed through) and letting a distributed training framework coordinate them.

Here’s the TensorFlow guide to multi-worker training, which might be of interest for getting started.
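For multi-worker training, each container tells TensorFlow about the others through the `TF_CONFIG` environment variable. A minimal sketch, assuming one container per machine (the IP addresses and port are made-up examples; set `index` to 0, 1, or 2 depending on which container it is):

```shell
# Hypothetical container addresses; replace with your containers' real IPs.
# "index" is this worker's position in the list (0 here, i.e. computer 1).
export TF_CONFIG='{
  "cluster": {"worker": ["10.100.0.11:12345", "10.100.0.12:12345", "10.100.0.13:12345"]},
  "task": {"type": "worker", "index": 0}
}'
```

With that in place, a training script using `tf.distribute.MultiWorkerMirroredStrategy` started in all three containers will coordinate across the machines.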

Ray is also gaining popularity.
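With Ray, you start a head node in one container and point the others at it. A rough sketch, assuming Ray is installed in a container on each machine (the head IP is a made-up example):

```shell
HEAD_IP=10.100.0.11   # made-up example address of computer 1's container

# In the container on computer 1 (the head node):
ray start --head --port=6379

# In the containers on computers 2 and 3 (workers):
ray start --address="$HEAD_IP:6379"

# From any node, confirm the cluster sees all three hosts and their GPUs:
ray status
```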

I’ve used Incus/LXD for both of these and it works.

Here are some details that might help:

Here are two notable videos from Stéphane using LXD:

Misc commands:

  • `lxc config device add ContainerName gpu gpu gputype=physical`
  • `lxc config set ContainerName nvidia.runtime=true`
  • Inside the container, install pciutils so the GPU is observable by tools like Ollama:
    • `sudo apt install pciutils`
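Since your cluster runs Incus rather than LXD, the same steps work with the `incus` client; a sketch (the container name `ml` and device name `gpu0` are just examples):

```shell
CT=ml  # example container name; substitute your own

# Pass a physical GPU on this host through to the container,
# and enable the Nvidia runtime so the host driver is exposed inside it
incus config device add "$CT" gpu0 gpu gputype=physical
incus config set "$CT" nvidia.runtime=true

# Inside the container, install pciutils and check the GPU is visible
incus exec "$CT" -- apt install -y pciutils
incus exec "$CT" -- nvidia-smi
```

Remember this only passes through GPUs that are physically in that host; you'd repeat it for a container on each of the three machines.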

Does this help?
Chuck