ROCm and PyTorch on AMD APU or GPU (AI)

qkiel · April 19, 2024, 9:19pm

AMD ROCm is officially supported only on a few consumer-grade GPUs, mainly Radeon RX 7900 GRE and above. But ROCm consists of many things: compilers, runtime libraries, AI-related libraries, etc. Often we just need a subset of this for our purposes. Fortunately, we don’t even need the DKMS module to use LLMs, which means we can install ROCm in a container and run any model using llama.cpp or Ollama on almost any AMD GPU, including APUs.

This guide is the basis for subsequent tutorials on how to run highly dangerous, potentially world-ending AI in 100% secure and guaranteed AI-proof Incus containers:

Note: I’m using AMD 5600G APU, but most of what you see here also applies to discrete GPUs. Whenever something is APU specific, I will mark it as such.

Table of Contents

Preparing container
ROCm
Environment variables
VRAM (for APUs only)
PyTorch (optional)

Preparing container

On the host I’m using vanilla Ubuntu 22.04 with the HWE kernel, without the additional amdgpu driver or DKMS module. The containers are also Ubuntu 22.04, and require access to the GPU. If you intend to run GUI applications in them, use a GUI profile; otherwise only pass your GPU. The value 44 in gid=44 is the GID of the video group in the container:

incus launch images:ubuntu/jammy/cloud <container_name>
incus config device add <container_name> gpu gpu gid=44

If you have two GPUs, you should only pass one to avoid confusion. To do this, first check the PCI addresses of the available GPUs and use the pci= option:

incus info --resources | grep "PCI address: " -B 4
incus config device add <container_name> gpu gpu gid=44 pci="0000:XX:XX.X"
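If you prefer to read the PCI addresses straight from sysfs on the host, a minimal sketch could look like this. It assumes the usual /sys/class/drm layout, where each cardN/device symlink points at the PCI device; list_gpu_pci is a hypothetical helper name, not an Incus command.

```shell
# Print the PCI address of every DRM card found under a sysfs-style directory.
# list_gpu_pci is a hypothetical helper; the default path assumes a standard Linux sysfs.
list_gpu_pci() {
    drm_dir="${1:-/sys/class/drm}"
    for card in "$drm_dir"/card*; do
        # Skip connector entries such as card0-DP-1
        case "$(basename "$card")" in *-*) continue ;; esac
        [ -e "$card/device" ] || continue
        # The device symlink resolves to a directory named after the PCI address
        basename "$(readlink -f "$card/device")"
    done
}

list_gpu_pci
```

Each printed address can be used as the value of pci= in the command above.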

We also need access to /dev/kfd, owned by the render group (GID 110 in the container):

incus config device add <container_name> dev_kfd unix-char source=/dev/kfd path=/dev/kfd gid=110
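The GIDs 44 and 110 may differ in other images, so it is worth checking them before adding the devices. A minimal sketch, assuming a standard /etc/group file inside the container; gid_of_group is a hypothetical helper name:

```shell
# Print the numeric GID of a group, reading an /etc/group-style file.
# gid_of_group is a hypothetical helper, not part of Incus.
gid_of_group() {
    awk -F: -v name="$1" '$1 == name { print $3 }' "${2:-/etc/group}"
}

# Run inside the container:
#   gid_of_group video
#   gid_of_group render
```

The same information is available from the host with `incus exec <container_name> -- getent group video render`.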

Now let’s log in to the container using the default ubuntu user:

incus exec <container_name> -- sudo --login --user ubuntu

Make sure the ubuntu user is in the video and render groups:

groups
sudo usermod -a -G render,video $LOGNAME

This may require a restart for it to take effect.
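To see whether the new membership is already active in the current session, something like the following sketch can help. missing_groups is a hypothetical helper name; it only compares names and changes nothing.

```shell
# Print every required group that is absent from a space-separated group list.
# Usage: missing_groups "<current groups>" <required group>...
missing_groups() {
    have=" $1 "
    shift
    for g in "$@"; do
        case "$have" in
            *" $g "*) ;;          # already a member
            *) echo "$g" ;;       # not a member yet
        esac
    done
}

# id -nG lists the groups of the running shell session
missing=$(missing_groups "$(id -nG)" video render)
if [ -n "$missing" ]; then
    echo "Not active in this session yet:" $missing
fi
```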

ROCm

We need to decide which version of ROCm we’re going to install. Even though ROCm 6.0+ has discontinued support for GCN-based cards like the one in my 5600G APU, I still use the latest version. I just had to set two environment variables instead of one, which I describe in the next section.

To run ROCm, we need to download and install the AMD Linux drivers in a container. apt reports about 30 GB of additional disk space for this, so don’t be surprised, although the space actually used can be considerably smaller. The most up-to-date link can be found on the official website (also look here). At the time of writing, it was 6.1.60100-1:

sudo apt install wget
wget https://repo.radeon.com/amdgpu-install/6.1/ubuntu/jammy/amdgpu-install_6.1.60100-1_all.deb
sudo apt install ./amdgpu-install_6.1.60100-1_all.deb

Now let’s install ROCm, but without DKMS module:

sudo amdgpu-install --usecase=rocm --no-dkms

If ROCm 6.1 doesn’t work for you, you can try the last 5.7 version:

wget https://repo.radeon.com/amdgpu-install/5.7.3/ubuntu/jammy/amdgpu-install_5.7.50703-1_all.deb
sudo apt install ./amdgpu-install_5.7.50703-1_all.deb
sudo amdgpu-install --usecase=rocm --no-dkms

The update procedure via the installation script is exactly the same as when installing ROCm for the first time.

After installation, we have access to rocm-smi, rocminfo and clinfo commands, which should detect our GPU. To check which version of ROCm you have, use the command apt show rocm-libs -a.
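A quick way to see which gfx target the runtime detects is to filter the rocminfo output. This is a rough sketch that assumes rocminfo prints target names of the form gfxNNN (for example in its Name fields); gfx_targets is a hypothetical helper name.

```shell
# Extract the unique gfx target names from rocminfo-style output on stdin.
gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

if command -v rocminfo >/dev/null 2>&1; then
    rocminfo | gfx_targets
else
    echo "rocminfo not found" >&2
fi
```

The detected name is useful in the next section, when choosing the override value.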

Environment variables

Before we can run an application that depends on ROCm, we need to present our GPU as supported. This requires setting HSA_OVERRIDE_GFX_VERSION environment variable (values are taken from here):

for GCN 5th gen based GPUs and APUs HSA_OVERRIDE_GFX_VERSION=9.0.0
for RDNA 1 based GPUs and APUs HSA_OVERRIDE_GFX_VERSION=10.1.0
for RDNA 2 based GPUs and APUs HSA_OVERRIDE_GFX_VERSION=10.3.0
for RDNA 3 based GPUs and APUs HSA_OVERRIDE_GFX_VERSION=11.0.0

I’m not entirely sure if this also applies to discrete GPUs, but on ROCm 6.1 my APU requires an additional environment variable HSA_ENABLE_SDMA=0 (skip it when using ROCm 5.7). Both can be added to the .profile file:

echo "export HSA_OVERRIDE_GFX_VERSION=9.0.0" >> ~/.profile
echo "export HSA_ENABLE_SDMA=0" >> ~/.profile
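The override values listed earlier in this section can be looked up from the gfx name that rocminfo reports. The sketch below is not an official mapping: the assignment of gfx prefixes to architectures (gfx90x for GCN 5th gen, gfx101x for RDNA 1, gfx103x for RDNA 2, gfx110x for RDNA 3) is an assumption you should verify for your chip, and override_for_gfx is a hypothetical helper name.

```shell
# Map a gfx target name to an HSA_OVERRIDE_GFX_VERSION value.
# Assumption: the prefix-to-architecture mapping below; verify it for your GPU.
override_for_gfx() {
    case "$1" in
        gfx90*)  echo "9.0.0"  ;;   # GCN 5th gen (Vega), e.g. gfx90c
        gfx101*) echo "10.1.0" ;;   # RDNA 1
        gfx103*) echo "10.3.0" ;;   # RDNA 2
        gfx110*) echo "11.0.0" ;;   # RDNA 3
        *)       return 1 ;;        # unknown target: print nothing
    esac
}

# Example:
#   export HSA_OVERRIDE_GFX_VERSION="$(override_for_gfx gfx90c)"
```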

VRAM (for APUs only)

The most foolproof method for running LLMs is to assign a fixed amount of VRAM to your APU. This amount will be subtracted from your RAM. On my ASRock board, the VRAM value can be set in the UEFI/BIOS, and it should be at least about 0.5 GiB larger than the downloaded model you will use.

UEFI/BIOS -> Advanced -> AMD CBS -> NBIO -> GFX -> iGPU -> UMA_SPECIFIED
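The sizing rule (model size plus roughly 0.5 GiB) is easy to compute from the model file. A minimal sketch; min_vram_mib is a hypothetical helper name and the model path in the comment is only a placeholder:

```shell
# Smallest VRAM size in MiB for a model of the given size in bytes:
# the model size rounded up to whole MiB, plus 512 MiB (0.5 GiB) of headroom.
min_vram_mib() {
    echo $(( ($1 + 1048575) / 1048576 + 512 ))
}

min_vram_mib 4294967296    # a 4 GiB model, prints 4608

# With a real file (placeholder path, GNU stat):
#   min_vram_mib "$(stat -c %s ~/models/model.gguf)"
```

Pick the next UEFI/BIOS option that is at least as large as the printed value.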

Some laptops do not have access to this option. In that case you can try UniversalAMDFormBrowser. With this tool, you can access and modify the AMD PBS/AMD CBS menu. Simply extract UniversalAMDFormBrowser.zip to a FAT32-formatted USB stick and boot from it (disable Secure Boot first).

Ollama (unless compiled by hand) and LM Studio will not work without sufficiently large VRAM, but reserving a fixed amount of VRAM is a bit of a waste because the RAM is permanently reduced even if we don’t use the model.

Fortunately, other apps such as llama.cpp and Stable Diffusion GUIs like Fooocus can use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU. Ollama can also be compiled with just two changed lines of code to support UMA, so we don’t have to assign a fixed amount of VRAM to the APU. In that case set iGPU to UMA_AUTO in UEFI/BIOS:

UEFI/BIOS -> Advanced -> AMD CBS -> NBIO -> GFX -> iGPU -> UMA_AUTO

llama.cpp supports UMA natively, and you can read about it in this tutorial. But for Fooocus and other PyTorch GUIs for Stable Diffusion (more about Fooocus in this tutorial) we have to use force-host-alloction-APU. It uses LD_PRELOAD to load its own versions of the functions hipMalloc and hipFree before the ROCm runtime, so it can intercept those function calls and forward them to hipHostMalloc and hipHostFree.

You can compile force-host-alloction-APU once in a container with ROCm, and copy it to other containers, but it has to be recompiled for every new ROCm version. First, check if hipcc is installed:

apt list --installed hipcc
sudo apt install git
git clone https://github.com/segurac/force-host-alloction-APU
cd force-host-alloction-APU
CUDA_PATH=/usr/ HIP_PLATFORM="amd" hipcc forcegttalloc.c -o libforcegttalloc.so -shared -fPIC
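Since the preload trick only works if the library really defines hipMalloc and hipFree, it can be worth checking the exported symbols after compiling. A rough sketch, assuming GNU nm from binutils is installed; defined_functions is a hypothetical helper name.

```shell
# List the function symbols defined by a shared library,
# given `nm -D --defined-only` output on stdin.
defined_functions() {
    awk '$2 == "T" || $2 == "W" { print $3 }'
}

lib=./libforcegttalloc.so
if [ -f "$lib" ] && command -v nm >/dev/null 2>&1; then
    # Expect both hipMalloc and hipFree to be listed
    nm -D --defined-only "$lib" | defined_functions | grep -E '^hip(Malloc|Free)$' \
        || echo "hipMalloc/hipFree are not exported by $lib"
fi
```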

Now we can start Fooocus or other PyTorch apps with LD_PRELOAD. Notice a ./ before libforcegttalloc.so:

LD_PRELOAD=~/force-host-alloction-APU/./libforcegttalloc.so python3 ~/Fooocus/entry_with_update.py --listen --port 8888 --always-high-vram

PyTorch (optional)

You should only install PyTorch if your application explicitly requires it. You don’t need it for llama.cpp, Ollama or LM Studio. Links to the latest versions are on the official website. Generally you want PyTorch for the version of ROCm you have installed. At the time of writing, for ROCm 6.1 that was a nightly 6.0 (soon it will be nightly 6.1), but we need to install pip first:

sudo apt install python3 python3-pip
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0

For ROCm 5.7:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
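The two index URLs above follow the same pattern, so they can be built from the ROCm version, and a one-line Python check afterwards shows whether PyTorch sees the GPU. This is a sketch under the assumption that a wheel index exists for your ROCm version; torch_index_url and check_torch are hypothetical helper names.

```shell
# Build the pip index URL for a ROCm version.
# Pass "nightly" as the second argument for pre-release wheels.
torch_index_url() {
    if [ "${2:-}" = "nightly" ]; then
        echo "https://download.pytorch.org/whl/nightly/rocm$1"
    else
        echo "https://download.pytorch.org/whl/rocm$1"
    fi
}

# Print the HIP version PyTorch was built against and whether a GPU is visible.
# ROCm builds of PyTorch expose the GPU through the torch.cuda API.
check_torch() {
    python3 -c 'import torch; print(torch.version.hip, torch.cuda.is_available())'
}

# Example:
#   pip3 install torch torchvision torchaudio --index-url "$(torch_index_url 5.7)"
#   check_torch
```

If check_torch prints None for the HIP version, the installed wheel is not a ROCm build.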

That’s all. As I said, this tutorial is the basis for subsequent tutorials on running generative models in Incus. You will find the links at the beginning of this post. If you have any questions, feel free to ask. Feedback, corrections and tips are greatly appreciated.

P.S.
Incus is such a wonderful tool for doing experiments like this. Without it, I would have messed up my system many times and potentially unleashed rogue AI. So thank you for developing Incus and saving the world.

simos · May 9, 2024, 5:19pm

Thanks a lot for this. I went through and tried the tutorial.

Some notes.

When you run the command sudo apt install ./amdgpu-install_6.1.60100-1_all.deb
you get the following. But no, it does not use up 30.4 GB of additional disk space. It takes up about 8GB of additional space.
```
0 upgraded, 286 newly installed, 0 to remove and 6 not upgraded.
Need to get 1759 MB of archives.
After this operation, 30.4 GB of additional disk space will be used.
```
There’s a typo (extra ") in the command incus config device add <container_name> gpu gpu gid=44" .
It appears that rocminfo shows the fixed amount of VRAM, under Pool Info. Is that true? I get 512MB, 512MB and 64KB respectively in those three pools.

simos · May 9, 2024, 7:01pm

When you configure the BIOS/UEFI settings of your motherboard to pre-allocate more VRAM for the AMD iGPU (in this case a change from the default 512MB to the maximum 16GB),

rocminfo shows for HSA Agent 2 (GPU):

@@ -123,7 +123,7 @@
   Pool Info:               
     Pool 1                   
       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
-      Size:                    524288(0x80000) KB                 
+      Size:                    16777216(0x1000000) KB             
       Allocatable:             TRUE                               
       Alloc Granule:           4KB                                
       Alloc Recommended Granule:2048KB                             
@@ -131,7 +131,7 @@
       Accessible by all:       FALSE                              
     Pool 2                   
       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
-      Size:                    524288(0x80000) KB                 
+      Size:                    16777216(0x1000000) KB             
       Allocatable:             TRUE                               
       Alloc Granule:           4KB                                
       Alloc Recommended Granule:2048KB

The rocm-smi output has a column VRAM%. This is the percentage of the VRAM compared to the total RAM of your system. If you have a system with 64GB of RAM and allocated 16GB for the iGPU, you would get a figure of about 25% here.

qkiel · May 9, 2024, 8:35pm

I’m glad the tutorial was useful :]

You’re right, apt reports 30.4GB, but it takes up less disk space. I’m not sure how much of this reduction is due to compression (I use ZFS).
Typo fixed
In my case, rocm-smi in the VRAM% column shows the actual percentage of VRAM used. This is usually around 410 MiB of the 512 MiB available in idle, and I see ~80% there.
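For reference, the arithmetic behind that figure, plus a way to read the raw counters without rocm-smi. This is only a sketch: it assumes the amdgpu driver exposes mem_info_vram_used and mem_info_vram_total (in bytes) under sysfs, and vram_percent is a hypothetical helper name.

```shell
# Integer percentage of VRAM in use: used * 100 / total.
vram_percent() {
    [ "$2" -gt 0 ] || return 1
    echo $(( $1 * 100 / $2 ))
}

vram_percent 410 512    # prints 80

# Assumption: amdgpu exposes these byte counters for each card.
for dev in /sys/class/drm/card*/device; do
    if [ -r "$dev/mem_info_vram_used" ] && [ -r "$dev/mem_info_vram_total" ]; then
        used=$(cat "$dev/mem_info_vram_used")
        total=$(cat "$dev/mem_info_vram_total")
        echo "$dev: $(vram_percent "$used" "$total")%"
    fi
done
```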

eliranwong · June 19, 2024, 1:40pm

Thanks @qkiel for your posts. They gave me a good start for running GUIs with Incus. I found your GUI profile for 24.04 a bit complicated, I’m afraid, and it did not work in my testing. I made one, which I think is simpler. It also works with microphone and camera. Anyone interested may read my notes. Any help to improve them further is welcome. They are located at GitHub - eliranwong/incus_container_gui_setup: Setup of running GUI applications in a incus container

simos · June 19, 2024, 5:52pm

Welcome Eliran!

Thanks for writing this!

I suggest that you write a new post/tutorial similar to @qkiel’s at Incus / LXD profile for GUI apps: Wayland, X11 and Pulseaudio. There, you can present your take on how to get GUI apps to run in an Incus container.

Another suggestion is to add how someone can figure out whether they are actually using PipeWire or PulseAudio. I.e. run pactl info and check what it says on this line Server Name: PulseAudio (on PipeWire 0.3.48).
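A minimal sketch of that check, assuming pactl is installed and that PipeWire identifies itself in the Server Name line as in the example above; audio_server is a hypothetical helper name.

```shell
# Classify the sound server from `pactl info` output on stdin.
audio_server() {
    name=$(grep '^Server Name:' || true)
    case "$name" in
        *PipeWire*)                echo "pipewire" ;;
        *PulseAudio*|*pulseaudio*) echo "pulseaudio" ;;
        *)                         echo "unknown" ;;
    esac
}

if command -v pactl >/dev/null 2>&1; then
    pactl info | audio_server
fi
```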

In addition to that, I notice that you do not cover Wayland. I suppose NVidia.
It would be good at some point to add that support as well. Perhaps someone else may work on this once you write the tutorial post.

Finally, I notice that you use the blockquote environment in Markdown. I suggest using the preformatted text environment instead. When you paste a command from the blockquote environment into a terminal, it is pasted as three lines instead of one.

eliranwong · June 19, 2024, 9:10pm

Thanks for the suggestions. I will work on that when I am free. I am also curious about the performance of running AI inference inside the container. I ran a small test to compare inference speed between the host and the container on the same machine. I recorded the result at:

github.com

eliranwong/MultiAMDGPU_AIDev_Ubuntu/blob/main/igpu_only.md#further-comparison-with-running-inside-an-incus-container

# Tested on an iGPU-only device

Tested device: Beelink GTR6 (Ryzen 9 6900HX CPU + integrated Radeon 680M GPU + 64GB RAM)

Followed https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu/blob/main/README.md for installing ROCm 6.0.2.

Environment variables:

```
export ROCM_HOME=/opt/rocm-6.0.2
export LD_LIBRARY_PATH=/opt/rocm-6.0.2/include:/opt/rocm-6.0.2/lib:$LD_LIBRARY_PATH
export PATH=$HOME/.local/bin:/opt/rocm-6.0.2/bin:/opt/rocm-6.0.2/llvm/bin:$PATH
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```

# Compile llama.cpp from source:

Compiled with ROCM:

> make LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1030 -j$(lscpu | grep '^Core(s)' | awk '{print $NF}')

This file has been truncated.

It is very encouraging that the running speed inside the container is very close to that of the host.

eliranwong · June 23, 2024, 6:22pm

May I ask how I can post a tutorial at https://discuss.linuxcontainers.org/c/tutorials/16?

I wrote a tutorial “Docker + Perplexica + Snap + Firefox in a GUI-enabled Incus Container”

I just realised I am not allowed to post a tutorial yet, so I placed the tutorial on GitHub at https://github.com/eliranwong/incus_container_gui_setup/blob/main/ai_tutorials/docker_perplexica.md

stgraber · June 23, 2024, 6:25pm

You can post it in the General or Incus section and we can move it for you.

eliranwong · June 23, 2024, 6:56pm

Thanks for the reply. I’ve just added it to the “Incus” section.