SR-IOV InfiniBand device: persistent (or at least consistent) port_guid and node_guid

wondering if anyone out there is using multiple partition keys with sriov infiniband devices?

for example, i have several VMs, each with an sriov device, but all in different partitions with different pkeys. when I reboot the host server, the VFs are gone, and upon creating them again there is no guarantee that the same virtual function (e.g. v0-v15) will be used for the same VM. this makes it difficult to work with guids and pkeys on the fabric manager side, since things can change on every host boot.

i am using the kernel inbox driver.

Hey there,

My infiniband experience is very very limited but it looks like the Mellanox driver lets us assign arbitrary GUIDs on a VF, so in theory we could generate one for the Infiniband device and then assign it every time.
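
A minimal sketch of what "generate one and assign it every time" could look like: derive the GUID from the instance name, so the same instance always gets the same value. The helper name `stable_guid`, the fixed `02` prefix and the SHA-256 scheme are all assumptions made for illustration, not something Incus does today, and you would still need to check that the result doesn't collide with anything else on your fabric.

```shell
# Hypothetical helper: derive a stable 64-bit GUID from an instance name.
# The "02" prefix and the hashing scheme are arbitrary choices for this sketch.
stable_guid() {
    name="$1"
    # Take the first 14 hex digits (7 bytes) of the SHA-256 of the name.
    hex=$(printf '%s' "$name" | sha256sum | cut -c1-14)
    # Prepend the fixed byte and insert a colon between each byte.
    printf '02%s\n' "$hex" | sed 's/../&:/g; s/:$//'
}
```

Calling `stable_guid my-vm` prints an 8-byte, colon-separated value that stays the same across host reboots as long as the instance name doesn't change.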

I don’t see anything pkey related for Infiniband VFs though so I don’t know if we can mess with that part.

there seems to be some info stored about the sriov device in volatile vars:
volatile.eth1.last_state.created: "false"
volatile.eth1.last_state.pci.parent: 0000:3d:00.0
volatile.eth1.last_state.vf.id: "1"

does it mean lxd/incus will try to use the same vf.id the next time around?

something like this (guaranteeing the same vf id), but as a persistent config setting, would do the job. i’m trying to avoid using the mellanox driver, so on boot i am setting the VF guids with the mlx5_core driver (ip-link) and then unbinding the devices so they are available to lxd/incus… but if there were a mechanism to guarantee the same guids for a VM’s sriov device with the mellanox driver, i would switch to it.

Well, I found a Mellanox guide which showed some sysfs interface to change the GUID on a VF, so we should be able to extend the Infiniband SR-IOV logic in Incus to store the original GUID in a last_state key, then apply a specific one and finally restore the original when the instance goes down.

That’s pretty similar to the way MAC addresses are handled on ethernet VFs.

The last_state data is only used to know how to restore the interface when the instance goes down; it doesn’t guarantee that the same VF will be used.

What you could do as a workaround, though, is do the VF setup by hand and then, rather than using an infiniband device, use a raw pci device, giving it the PCI address of your SR-IOV VF. That guarantees that the instance will always use the same VF. But that’s a very manual process.
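
As a rough sketch of that workaround: the kernel exposes each VF as a `virtfnN` symlink under the parent PCI device in sysfs, so the VF's PCI address can be looked up from the parent address and VF number. The helper name `vf_pci_address` and its optional sysfs-root argument are made up for this example.

```shell
# Hypothetical helper: print the PCI address of VF number $2 under the
# physical function $1. $3 optionally overrides the sysfs root (for testing).
vf_pci_address() {
    pf="$1"; vfid="$2"; root="${3:-/sys}"
    link="$root/bus/pci/devices/$pf/virtfn$vfid"
    # Each VF shows up as a symlink pointing at its own PCI device directory.
    [ -L "$link" ] || { echo "no such VF: $link" >&2; return 1; }
    basename "$(readlink "$link")"
}
```

The result could then be passed to a pci device, for example `incus config device add my-vm ib0 pci address="$(vf_pci_address 0000:3d:00.0 1)"`, where `my-vm` and `ib0` are placeholder names.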

https://enterprise-support.nvidia.com/s/article/howto-configure-sr-iov-for-connect-ib-connectx-4-with-kvm--infiniband-x#jive_content_id_II_Enable_SRIOV_on_the_MLNX_OFED_driver

Note that this is all OFED stuff which I usually try to avoid and just rely on the mainline mlx4/mlx5 drivers, so before looking at extending Incus, we’ll probably want to see what’s actually available in the mainline kernel.

Seems like the three things we could control would be:

  • port guid
  • node guid
  • policy state

Does that make sense to you? I’d likely need to do more research to understand the meaning of those three and what would be appropriate values, assuming we don’t simply expose all three as config keys and let you sort it out directly :)

Yes this makes sense, and would help a lot in my current setup.

I’ve only been experimenting with mlx5 based nics.

If you’re keen to avoid relying on OFED (i would also prefer to avoid it), or as another possibility in case OFED is not installed, there is a somewhat long-winded way to adjust those settings using only the mainline mlx5_core pci driver. You need to bind the VF to the mlx5_core pci driver, then you can set the guids using ip link set dev $physdev vf $vfid [port_guid|node_guid] $guid (presumably there are associated sysfs interfaces to achieve the same). Once the guids are set, you can unbind the device from the mlx5_core pci driver and it becomes free for use by another driver (e.g. vfio-pci) while retaining the guids. There is also a ‘state’ setting for the policy state.
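
The sequence described above can be sketched roughly as follows. The function names, the `DRY_RUN` switch and the choice of `state auto` are assumptions made for this example; the `ip link` options and the driver bind/unbind files under /sys/bus/pci/drivers are the standard kernel interfaces.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Write VALUE to FILE, or only print the equivalent command when DRY_RUN=1.
write() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "echo $1 > $2"; else printf '%s\n' "$1" > "$2"; fi
}

# Usage: set_vf_guids PF_NETDEV VF_ID VF_PCI_ADDR NODE_GUID PORT_GUID
set_vf_guids() {
    pf="$1"; vfid="$2"; vfpci="$3"; node="$4"; port="$5"
    drv=/sys/bus/pci/drivers/mlx5_core
    # Bind the VF to mlx5_core unless it is already bound.
    [ -e "$drv/$vfpci" ] || write "$vfpci" "$drv/bind"
    # Set both GUIDs and the policy state through the PF's netdev.
    run ip link set dev "$pf" vf "$vfid" node_guid "$node"
    run ip link set dev "$pf" vf "$vfid" port_guid "$port"
    run ip link set dev "$pf" vf "$vfid" state auto
    # Unbind so another driver (e.g. vfio-pci) can claim the VF.
    write "$vfpci" "$drv/unbind"
}
```

Running it with `DRY_RUN=1` first, e.g. `DRY_RUN=1 set_vf_guids ib0 1 0000:3d:00.1 <node_guid> <port_guid>`, prints the commands without touching the device, which is handy before wiring it into a boot-time unit.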

Yeah, that should be fine as we’re pretty used to the bind/unbind dance by now.