Based on my experiments with CUDA so far, for GPU processing it seems best to use the same CUDA version on the host and in the containers, with GPU passthrough, to get the best performance.
When the CUDA versions on the host and in the container mismatch, I get errors from NVCC and from some Python libraries that depend on it, especially when using Triton or Ray for a large language model inference server. I haven't looked into the details of running different CUDA versions on the host and in a container yet.
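For reference, the matched-version setup I mean looks roughly like this. It is just a sketch: it assumes Docker with the NVIDIA Container Toolkit installed for `--gpus` passthrough, and the image tag is only an example, pick one whose CUDA major version matches your host driver.

```shell
# Host driver supports CUDA 12.x; choose a container image with the same
# CUDA major version so nvcc and Python CUDA libraries agree.
# Requires the NVIDIA Container Toolkit for --gpus passthrough.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If `nvidia-smi` inside the container reports the GPU, passthrough is working and the container's toolkit version matches what the host exposes.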
I think I solved the problem with a pretty nasty but efficient hack.
I installed the newest CUDA on the host and passed it through to the containers in the usual way. In the containers that need an older CUDA 11.x, I just brutally symlinked the 12.x libraries to the 11.x names and voilà! The older software seems to find everything it needs in the newer libraries, and PyTorch built for CUDA 11.x works as it should.
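The hack can be sketched like this. To keep it harmless, the demo below creates placeholder files in a scratch directory; in a real container you would run the `ln -s` lines against the actual CUDA libraries (typically under `/usr/local/cuda/lib64` or `/usr/lib/x86_64-linux-gnu`), and the exact 11.x sonames the old software requests may differ from the ones shown here.

```shell
# Scratch directory standing in for the container's CUDA library path.
libdir=$(mktemp -d)

# Pretend these are the CUDA 12.x runtime libraries shipped in the container:
touch "$libdir/libcudart.so.12" "$libdir/libcublas.so.12"

# Point the CUDA 11.x sonames that older builds (e.g. PyTorch cu11 wheels)
# ask the dynamic linker for at the newer 12.x libraries:
ln -s "$libdir/libcudart.so.12" "$libdir/libcudart.so.11.0"
ln -s "$libdir/libcublas.so.12" "$libdir/libcublas.so.11"

ls -l "$libdir"
```

This works only as long as the 12.x libraries still export every symbol the 11.x-era software calls; there is no ABI guarantee across CUDA major versions, so it can break silently on some workloads.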