AI tutorial: ROCm and PyTorch on AMD APU or GPU

AMD ROCm is officially supported only on a few consumer-grade GPUs, mainly Radeon RX 7900 GRE and above. But ROCm consists of many things: compilers, runtime libraries, AI-related libraries, etc. Often we need only a subset of these for our purposes. Fortunately, we don’t even need the DKMS module to use LLMs, which means we can install ROCm in a container and run any model using llama.cpp or Ollama on almost any AMD GPU, including APUs.

This guide is the basis for subsequent tutorials on how to run highly dangerous, potentially world-ending AI in 100% secure and guaranteed AI-proof Incus containers:

Note: I’m using an AMD 5600G APU, but most of what you see here also applies to discrete GPUs. Whenever something is APU-specific, I will mark it as such.

Table of Contents

  • Preparing container
  • ROCm
  • Environment variables
  • VRAM (for APUs only)
  • PyTorch (optional)

Preparing container

On the host I’m using vanilla Ubuntu 22.04 with the HWE kernel, without the additional amdgpu driver or DKMS module. The containers are also Ubuntu 22.04 and require access to the GPU. If you intend to run GUI applications in them, use a GUI profile; otherwise only pass your GPU. The value 44 in gid=44 is the GID of the video group in the container:

incus launch images:ubuntu/jammy/cloud <container_name>
incus config device add <container_name> gpu gpu gid=44

If you have two GPUs, pass only one to avoid confusion. To do this, first check the PCI addresses of the available GPUs, then use the pci= option:

incus info --resources | grep "PCI address: " -B 4
incus config device add <container_name> gpu gpu gid=44 pci="0000:XX:XX.X"

We also need access to /dev/kfd. The value 110 in gid=110 is the GID of the render group in the container:

incus config device add <container_name> dev_kfd unix-char source=/dev/kfd path=/dev/kfd gid=110
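The GIDs 44 and 110 are the values from my container and may differ in yours. A minimal sketch for looking them up, to be run inside the container (group_gid is a helper name made up for this example):

```shell
# Hypothetical helper: print the numeric GID of a group (empty if the group does not exist).
group_gid() {
    getent group "$1" | cut -d: -f3
}

# Show the GIDs to use for gid= on the gpu and /dev/kfd devices.
for g in video render; do
    gid="$(group_gid "$g")"
    echo "$g: ${gid:-not found}"
done
```

From the host, the same lookup can be done with incus exec <container_name> -- getent group render.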

Now let’s log in to the container using the default ubuntu user:

incus exec <container_name> -- sudo --login --user ubuntu
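Once logged in, it may be worth confirming that the device nodes actually arrived in the container. A small sketch, assuming the GPU shows up under /dev/dri as card and renderD nodes (the exact node numbers vary between systems; check_node is just an illustrative helper):

```shell
# Illustrative helper: report whether a device node exists.
check_node() {
    if [ -e "$1" ]; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}

check_node /dev/kfd
# Card and render node numbers differ between systems, so check whatever is present.
for node in /dev/dri/card* /dev/dri/renderD*; do
    check_node "$node"
done

# The group column should show video and render (or their numeric GIDs).
ls -l /dev/kfd /dev/dri 2>/dev/null || true
```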

Make sure the ubuntu user is in the video and render groups:

groups
sudo usermod -a -G render,video $LOGNAME

You may need to restart the container (or log out and back in) for the group change to take effect.
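To confirm that the membership is active in the current session, something like the following should work (in_group is an illustrative helper, not part of any tool):

```shell
# Illustrative helper: succeed if the current user belongs to the given group.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

for g in video render; do
    if in_group "$g"; then
        echo "$g: ok"
    else
        echo "$g: not a member yet"
    fi
done
```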

ROCm

We need to decide which version of ROCm to install. Although ROCm 6.0+ dropped support for GCN-based GPUs like the one in my 5600G APU, I still use the latest version. I just had to set two environment variables instead of one, which I describe in the next section.

To run ROCm, we need to download and install the AMD Linux drivers in the container. apt estimates around 30 GB of additional disk space, so don’t be surprised; the space actually used can be considerably smaller. The most up-to-date link can be found on the official website (also look here). At the time of writing, it was 6.1.60100-1:

sudo apt install wget
wget https://repo.radeon.com/amdgpu-install/6.1/ubuntu/jammy/amdgpu-install_6.1.60100-1_all.deb
sudo apt install ./amdgpu-install_6.1.60100-1_all.deb

Now let’s install ROCm, but without the DKMS module:

sudo amdgpu-install --usecase=rocm --no-dkms

If ROCm 6.1 doesn’t work for you, you can try the last 5.7 release:

wget https://repo.radeon.com/amdgpu-install/5.7.3/ubuntu/jammy/amdgpu-install_5.7.50703-1_all.deb
sudo apt install ./amdgpu-install_5.7.50703-1_all.deb
sudo amdgpu-install --usecase=rocm --no-dkms

The update procedure via the installation script is exactly the same as when installing ROCm for the first time.

After installation, we have access to the rocm-smi, rocminfo and clinfo commands, which should detect our GPU. To check which version of ROCm you have, use the command apt show rocm-libs -a.
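A quick way to see what the runtime detected is to pull the gfx target names out of the rocminfo output. This is only a sketch: it assumes GPU agents are listed with names of the form gfx followed by hex digits (for example gfx90c), and gfx_targets is a made-up helper name:

```shell
# Illustrative helper: extract the unique gfx target names from rocminfo-style text on stdin.
gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

if command -v rocminfo >/dev/null 2>&1; then
    rocminfo | gfx_targets
else
    echo "rocminfo not found; is ROCm installed and on PATH?"
fi
```

If nothing is printed, the GPU was probably not detected, and the device and group setup from the previous section is the first thing to recheck.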

Environment variables

Before we can run an application that depends on ROCm, we need to present our GPU as supported. This requires setting the HSA_OVERRIDE_GFX_VERSION environment variable (values are taken from here):

  • for GCN 5th gen based GPUs and APUs HSA_OVERRIDE_GFX_VERSION=9.0.0
  • for RDNA 1 based GPUs and APUs HSA_OVERRIDE_GFX_VERSION=10.1.0
  • for RDNA 2 based GPUs and APUs HSA_OVERRIDE_GFX_VERSION=10.3.0
  • for RDNA 3 based GPUs and APUs HSA_OVERRIDE_GFX_VERSION=11.0.0
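If you know your gfx target name, the list above can be written as a small lookup. Treat this as a sketch: the override values come from the list, but which gfx names belong to which generation is my assumption, and override_for is a hypothetical helper:

```shell
# Illustrative mapping from a gfx target name to an override value.
# The gfx names per generation are an assumption; the values come from the list above.
override_for() {
    case "$1" in
        gfx900|gfx902|gfx909|gfx90c) echo "9.0.0" ;;   # GCN 5th gen
        gfx101[0-9])                 echo "10.1.0" ;;  # RDNA 1
        gfx103[0-9])                 echo "10.3.0" ;;  # RDNA 2
        gfx110[0-9])                 echo "11.0.0" ;;  # RDNA 3
        *) echo "unknown target: $1" >&2; return 1 ;;
    esac
}

override_for gfx90c   # prints 9.0.0
```

Targets that are not in the table make the function fail instead of guessing a value.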

I’m not entirely sure if this also applies to discrete GPUs, but on ROCm 6.1 my APU requires an additional environment variable HSA_ENABLE_SDMA=0 (skip it when using ROCm 5.7). Both can be added to the ~/.profile file:

echo "export HSA_OVERRIDE_GFX_VERSION=9.0.0" >> ~/.profile
echo "export HSA_ENABLE_SDMA=0" >> ~/.profile
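Running the echo commands more than once appends duplicate lines. If you expect to repeat this step, a guarded variant might look like the sketch below (add_export is a hypothetical helper; the optional third argument is only there so the target file can be changed):

```shell
# Illustrative helper: append "export NAME=VALUE" to a profile file
# only if that exact line is not already present.
add_export() {
    line="export $1=$2"
    file="${3:-$HOME/.profile}"
    touch "$file"
    grep -qxF "$line" "$file" || echo "$line" >> "$file"
}

add_export HSA_OVERRIDE_GFX_VERSION 9.0.0
add_export HSA_ENABLE_SDMA 0
```

Note that it only detects exact duplicates. If you later change the value, a second line is appended and the later one wins when the file is sourced. Log in again for the variables to be picked up.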

VRAM (for APUs only)

The most foolproof method for running LLMs is to assign a fixed amount of VRAM to your APU. This amount will be subtracted from your RAM. On my ASRock board, the VRAM value can be set in the UEFI/BIOS, and it should be at least about 0.5 GiB larger than the downloaded model you will use.

UEFI/BIOS -> Advanced -> AMD CBS -> NBIO -> GFX -> iGPU -> UMA_SPECIFIED
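As a worked example of the rule of thumb above (model size plus roughly 0.5 GiB), here is a small sketch that turns a model file size into a minimum VRAM figure. The helper name min_vram_mib and the model path are placeholders:

```shell
# Illustrative helper: given a model size in bytes, print the suggested minimum
# VRAM in MiB (size rounded up to whole MiB, plus 512 MiB of headroom).
min_vram_mib() {
    echo $(( ($1 + 1048575) / 1048576 + 512 ))
}

# Example: a 4 GiB model file.
min_vram_mib 4294967296   # prints 4608

# For a real file (the path is a placeholder):
# min_vram_mib "$(stat -c %s /path/to/model.gguf)"
```

If your firmware only offers fixed steps, pick the next available size above the result.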

Some laptops do not expose this option. In that case you can try UniversalAMDFormBrowser. With this tool, you can access and modify the AMD PBS/AMD CBS menu. Simply extract UniversalAMDFormBrowser.zip to a FAT32-formatted USB stick and boot from it (disable Secure Boot first).

Ollama (unless compiled by hand) and LM Studio will not work without sufficiently large VRAM, but reserving a fixed amount of VRAM is a bit of a waste because the RAM is permanently reduced even if we don’t use the model.

Fortunately, other apps such as llama.cpp and Stable Diffusion GUIs like Fooocus can use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU. Ollama can also be compiled with just two changed lines of code to support UMA, so we don’t have to assign a fixed amount of VRAM to the APU. In that case, set iGPU to UMA_AUTO in UEFI/BIOS:

UEFI/BIOS -> Advanced -> AMD CBS -> NBIO -> GFX -> iGPU -> UMA_AUTO

llama.cpp supports UMA natively, and you can read about it in this tutorial. But for Fooocus and other PyTorch GUIs for Stable Diffusion (more about Fooocus in this tutorial) we have to use force-host-alloction-APU. It is loaded via LD_PRELOAD so that its own hipMalloc and hipFree are found before the ones in the ROCm runtime. It thereby intercepts those calls and forwards them to hipHostMalloc and hipHostFree.

You can compile force-host-alloction-APU once in a container with ROCm, and copy it to other containers, but it has to be recompiled for every new ROCm version. First, check if hipcc is installed:

apt list --installed hipcc
sudo apt install git
git clone https://github.com/segurac/force-host-alloction-APU
cd force-host-alloction-APU
CUDA_PATH=/usr/ HIP_PLATFORM="amd" hipcc forcegttalloc.c -o libforcegttalloc.so -shared -fPIC

Now we can start Fooocus or other PyTorch apps with LD_PRELOAD. Notice a ./ before libforcegttalloc.so:

LD_PRELOAD=~/force-host-alloction-APU/./libforcegttalloc.so python3 ~/Fooocus/entry_with_update.py --listen --port 8888 --always-high-vram

PyTorch (optional)

You should only install PyTorch if your application explicitly requires it. You don’t need it for llama.cpp, Ollama or LM Studio. Links to the latest versions are on the official website. Generally you want the PyTorch build that matches the version of ROCm you have installed. At the time of writing, for ROCm 6.1 that was the nightly build for ROCm 6.0 (soon to be the nightly for 6.1). We need to install pip first:

sudo apt install python3 python3-pip
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0

For ROCm 5.7:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
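After installing, a short check can tell you whether PyTorch was built for ROCm and whether it sees the GPU. This is a sketch (torch_check is a made-up helper); it relies on torch.version.hip, which is None on non-ROCm builds, and on torch.cuda.is_available(), which ROCm builds reuse for AMD GPUs:

```shell
# Illustrative helper: report the HIP version PyTorch was built with
# and whether a GPU is visible to it.
torch_check() {
    if ! python3 -c 'import torch' >/dev/null 2>&1; then
        echo "torch is not importable"
        return 1
    fi
    python3 -c 'import torch; print("hip:", torch.version.hip); print("gpu available:", torch.cuda.is_available())'
}

torch_check || true
```

If the GPU is not reported as available, make sure the environment variables from the earlier section are set in the shell you run this from.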

That’s all. As I said, this tutorial is the basis for subsequent tutorials on running generative models in Incus. You will find the links at the beginning of this post. If you have any questions, feel free to ask. Feedback, corrections and tips are greatly appreciated.

P.S.
Incus is such a wonderful tool for doing experiments like this. Without it, I would have messed up my system many times and potentially unleashed rogue AI. So thank you for developing Incus and saving the world.

Thanks a lot for this. I went through and tried the tutorial.

Some notes.

  1. When you run the command sudo apt install ./amdgpu-install_6.1.60100-1_all.deb
    you get the following. But no, it does not use up 30.4 GB of additional disk space. It takes up about 8GB of additional space.

    0 upgraded, 286 newly installed, 0 to remove and 6 not upgraded.
    Need to get 1759 MB of archives.
    After this operation, 30.4 GB of additional disk space will be used.
    
  2. There’s a typo (extra ") in the command incus config device add <container_name> gpu gpu gid=44" .

  3. It appears that rocminfo shows the fixed amount of VRAM, under Pool Info. Is that true? I get 512MB, 512MB and 64KB respectively in those three pools.

When you configure the BIOS/UEFI settings of your motherboard to pre-allocate more VRAM for the AMD iGPU (in this case a change from the default 512MB to the maximum 16GB),

  1. rocminfo shows for HSA Agent 2 (GPU):
@@ -123,7 +123,7 @@
   Pool Info:               
     Pool 1                   
       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
-      Size:                    524288(0x80000) KB                 
+      Size:                    16777216(0x1000000) KB             
       Allocatable:             TRUE                               
       Alloc Granule:           4KB                                
       Alloc Recommended Granule:2048KB                             
@@ -131,7 +131,7 @@
       Accessible by all:       FALSE                              
     Pool 2                   
       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
-      Size:                    524288(0x80000) KB                 
+      Size:                    16777216(0x1000000) KB             
       Allocatable:             TRUE                               
       Alloc Granule:           4KB                                
       Alloc Recommended Granule:2048KB                                      
  2. The rocm-smi output has a column VRAM%. This is the percentage of the VRAM compared to the total RAM of your system. If you have a system with 64GB of RAM and allocated 16GB for the iGPU, you would get a figure of about 25% here.

I’m glad the tutorial was useful :]

  1. You’re right, apt reports 30.4GB, but it takes up less disk space. I’m not sure how much of this reduction is due to compression (I use ZFS).
  2. Typo fixed
  3. In my case, rocm-smi in the VRAM% column shows the actual percentage of VRAM used. This is usually around 410 MiB of the 512 MiB available in idle, and I see ~80% there.
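As a quick sanity check of those numbers, 410 MiB used out of 512 MiB works out to about 80%, which matches what I see in the VRAM% column. A trivial sketch (vram_percent is just an illustrative helper):

```shell
# Illustrative helper: integer percentage of used versus total memory.
vram_percent() {
    echo $(( $1 * 100 / $2 ))
}

vram_percent 410 512   # prints 80
```

To compare against raw numbers on your own system, rocm-smi --showmeminfo vram should print the used and total amounts.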