Multi-GPU Passthrough Issue
I recently purchased an ASUS ESC4000 G4S with 2x Intel Xeon Gold 6148s. I was hoping to move my 4x RTX A4000s out of the two servers they were previously running in and into this single server, but I'm hitting an issue with two of the cards. Starting a VM with either of them fails with:
kvm: -device vfio-pci,host=0000:3c:00.1,id=hostpci0.1,bus=ich9-pcie-port-1,addr=0x0.1: vfio 0000:3c:00.1: Failed to set up TRIGGER eventfd signaling for interrupt INTX-0: VFIO_DEVICE_SET_IRQS failure: Transport endpoint is not connected
stopping swtpm instance (pid 349614) due to QEMU startup error
kvm: -device vfio-pci,host=0000:5f:00.1,id=hostpci0.1,bus=ich9-pcie-port-1,addr=0x0.1: vfio 0000:5f:00.1: Failed to set up TRIGGER eventfd signaling for interrupt INTX-0: VFIO_DEVICE_SET_IRQS failure: Transport endpoint is not connected
stopping swtpm instance (pid 349341) due to QEMU startup error
The other two cards work without issue when passed to a VM. Below is the output of every check I know of. The two cards that are causing issues show
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
which I believe is why they're throwing the error above. All four cards are identical, and they all worked when I had them in separate servers with two cards each. Any advice on how to fix this would be greatly appreciated.
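To compare the MSI state of all eight functions at a glance instead of reading each lspci block, something like the following might help. It is only a minimal sketch: msi_summary is a hypothetical helper name (not part of pciutils), and it assumes `lspci -v` style text on stdin with one header line per function. Note that in the output further down, the audio functions of the two working cards also report Enable-, so that flag alone may not be what separates the working cards from the failing ones.

```shell
# msi_summary (hypothetical helper): read `lspci -v` style text on stdin and
# print one line per PCI function: "<address> irq=<n> msi=<on|off|none>".
msi_summary() {
    awk '
        # A function header starts at column 0 with [domain:]bus:dev.fn
        /^([0-9a-f][0-9a-f][0-9a-f][0-9a-f]:)?[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7] / {
            if (dev != "") print dev, "irq=" irq, "msi=" msi
            dev = $1; irq = "?"; msi = "none"
            next
        }
        # Pull the IRQ number (possibly negative) out of the Flags line.
        /Flags:/ && match($0, /IRQ -?[0-9]+/) {
            irq = substr($0, RSTART + 4, RLENGTH - 4)
        }
        # The MSI capability line carries Enable+ or Enable-.
        /MSI: Enable[+]/ { msi = "on" }
        /MSI: Enable-/   { msi = "off" }
        END { if (dev != "") print dev, "irq=" irq, "msi=" msi }
    '
}
```

On the host it could be fed with `lspci -v -d 10de: | msi_summary`, which limits the listing to NVIDIA (vendor 10de) functions.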
lspci | grep NVIDIA
3c:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
3c:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
5f:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
5f:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
af:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
af:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
d8:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
d8:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
dmesg | grep -i vfio
[ 4.918792] VFIO - User Level meta-driver version: 0.3
[ 4.927064] vfio-pci 0000:3c:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[ 4.927318] vfio-pci 0000:5f:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[ 4.944746] vfio-pci 0000:af:00.0: vgaarb: deactivate vga console
[ 4.944752] vfio-pci 0000:af:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[ 4.945001] vfio-pci 0000:d8:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[ 4.945092] vfio_pci: add [10de:24b0[ffffffff:ffffffff]] class 0x000000/00000000
[ 5.048131] vfio_pci: add [10de:228b[ffffffff:ffffffff]] class 0x000000/00000000
[ 535.097820] vfio-pci 0000:d8:00.1: enabling device (0140 -> 0142)
[50851.955554] vfio-pci 0000:af:00.1: enabling device (0140 -> 0142)
[160210.758559] vfio-pci 0000:3c:00.1: enabling device (0140 -> 0142)
[160210.758690] vfio-pci 0000:3c:00.1: PCI INT B: not connected
[180513.956181] vfio-pci 0000:3c:00.1: PCI INT B: no GSI
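The QEMU error mentions INTX-0, and the last two kernel lines above report that INT B on 3c:00.1 is not connected and has no GSI. To list every function the kernel has flagged this way, a filter along these lines could be used. It is a rough sketch under assumptions: intx_problems is a hypothetical helper name, and it only matches the two message forms seen in this log ("not connected" and "no GSI").

```shell
# intx_problems (hypothetical helper): read kernel log text on stdin and print
# each PCI address once if the kernel reported its legacy INTx pin as
# "not connected" or "no GSI".
intx_problems() {
    awk '
        /PCI INT [A-D]: (not connected|no GSI)/ {
            for (i = 1; i <= NF; i++) {
                # The address field looks like "0000:3c:00.1:" (trailing colon).
                if ($i ~ /^[0-9a-f][0-9a-f][0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7]:$/) {
                    addr = substr($i, 1, length($i) - 1)
                    if (!(addr in seen)) { seen[addr] = 1; print addr }
                }
            }
        }
    '
}
```

Running `dmesg | intx_problems` on the host would print one address per affected function. This only identifies which functions are affected; it does not by itself say why the pin is unrouted.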
dmesg | grep 'remapping'
[ 1.283502] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 1.285801] DMAR-IR: Enabled IRQ remapping in x2apic mode
dmesg | grep -E "DMAR|IOMMU"
[ 0.011828] ACPI: DMAR 0x000000006D398E18 0002B0 (v01 ALASKA A M I 00000001 INTL 20091013)
[ 0.011873] ACPI: Reserving DMAR table memory at [mem 0x6d398e18-0x6d3990c7]
[ 0.500660] DMAR: IOMMU enabled
[ 1.283397] DMAR: Host address width 46
[ 1.283398] DMAR: DRHD base: 0x000000d37fc000 flags: 0x0
[ 1.283415] DMAR: dmar0: reg_base_addr d37fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.283419] DMAR: DRHD base: 0x000000e0ffc000 flags: 0x0
[ 1.283427] DMAR: dmar1: reg_base_addr e0ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.283429] DMAR: DRHD base: 0x000000ee7fc000 flags: 0x0
[ 1.283437] DMAR: dmar2: reg_base_addr ee7fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.283440] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[ 1.283444] DMAR: dmar3: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.283446] DMAR: DRHD base: 0x000000aaffc000 flags: 0x0
[ 1.283450] DMAR: dmar4: reg_base_addr aaffc000 ver 1:0 cap 8d2078c106f0466 ecap f020de
[ 1.283452] DMAR: DRHD base: 0x000000b87fc000 flags: 0x0
[ 1.283456] DMAR: dmar5: reg_base_addr b87fc000 ver 1:0 cap 8d2078c106f0466 ecap f020de
[ 1.283458] DMAR: DRHD base: 0x000000c5ffc000 flags: 0x0
[ 1.283461] DMAR: dmar6: reg_base_addr c5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.283463] DMAR: DRHD base: 0x0000009d7fc000 flags: 0x1
[ 1.283467] DMAR: dmar7: reg_base_addr 9d7fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 1.283469] DMAR: RMRR base: 0x0000006f2d5000 end: 0x0000006f2e5fff
[ 1.283471] DMAR: ATSR flags: 0x0
[ 1.283475] DMAR: ATSR flags: 0x0
[ 1.283476] DMAR: RHSA base: 0x0000009d7fc000 proximity domain: 0x0
[ 1.283478] DMAR: RHSA base: 0x000000aaffc000 proximity domain: 0x0
[ 1.283479] DMAR: RHSA base: 0x000000b87fc000 proximity domain: 0x0
[ 1.283480] DMAR: RHSA base: 0x000000c5ffc000 proximity domain: 0x0
[ 1.283481] DMAR: RHSA base: 0x000000d37fc000 proximity domain: 0x1
[ 1.283482] DMAR: RHSA base: 0x000000e0ffc000 proximity domain: 0x1
[ 1.283483] DMAR: RHSA base: 0x000000ee7fc000 proximity domain: 0x1
[ 1.283485] DMAR: RHSA base: 0x000000fbffc000 proximity domain: 0x1
[ 1.283487] DMAR-IR: IOAPIC id 12 under DRHD base 0xc5ffc000 IOMMU 6
[ 1.283489] DMAR-IR: IOAPIC id 11 under DRHD base 0xb87fc000 IOMMU 5
[ 1.283491] DMAR-IR: IOAPIC id 10 under DRHD base 0xaaffc000 IOMMU 4
[ 1.283492] DMAR-IR: IOAPIC id 18 under DRHD base 0xfbffc000 IOMMU 3
[ 1.283493] DMAR-IR: IOAPIC id 17 under DRHD base 0xee7fc000 IOMMU 2
[ 1.283495] DMAR-IR: IOAPIC id 16 under DRHD base 0xe0ffc000 IOMMU 1
[ 1.283496] DMAR-IR: IOAPIC id 15 under DRHD base 0xd37fc000 IOMMU 0
[ 1.283498] DMAR-IR: IOAPIC id 8 under DRHD base 0x9d7fc000 IOMMU 7
[ 1.283499] DMAR-IR: IOAPIC id 9 under DRHD base 0x9d7fc000 IOMMU 7
[ 1.283501] DMAR-IR: HPET id 0 under DRHD base 0x9d7fc000
[ 1.283502] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 1.285801] DMAR-IR: Enabled IRQ remapping in x2apic mode
[ 2.445130] DMAR: No SATC found
[ 2.445132] DMAR: IOMMU feature coherent inconsistent
[ 2.445134] DMAR: IOMMU feature coherent inconsistent
[ 2.445136] DMAR: dmar6: Using Queued invalidation
[ 2.445147] DMAR: dmar5: Using Queued invalidation
[ 2.445151] DMAR: dmar4: Using Queued invalidation
[ 2.445154] DMAR: dmar3: Using Queued invalidation
[ 2.445158] DMAR: dmar2: Using Queued invalidation
[ 2.445167] DMAR: dmar1: Using Queued invalidation
[ 2.445171] DMAR: dmar0: Using Queued invalidation
[ 2.445175] DMAR: dmar7: Using Queued invalidation
[ 2.601752] DMAR: Intel(R) Virtualization Technology for Directed I/O
lspci -n -s 3c:00 -v
3c:00.0 0300: 10de:24b0 (rev a1) (prog-if 00 [VGA controller])
Subsystem: 17aa:14ad
Flags: fast devsel, IRQ 30, NUMA node 0, IOMMU group 5
Memory at b7000000 (32-bit, non-prefetchable) [size=16M]
Memory at 1b000000000 (64-bit, prefetchable) [size=32G]
Memory at 1b800000000 (64-bit, prefetchable) [size=32M]
I/O ports at 7000 [size=128]
Expansion ROM at b8000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
3c:00.1 0403: 10de:228b (rev a1)
Subsystem: 17aa:14ad
Flags: fast devsel, IRQ -2147483648, NUMA node 0, IOMMU group 5
Memory at b8080000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [160] Data Link Feature <?>
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
lspci -n -s 5f:00 -v
5f:00.0 0300: 10de:24b0 (rev a1) (prog-if 00 [VGA controller])
Subsystem: 17aa:14ad
Flags: fast devsel, IRQ 255, NUMA node 0, IOMMU group 2
Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
Memory at 1f000000000 (64-bit, prefetchable) [size=32G]
Memory at 1f800000000 (64-bit, prefetchable) [size=32M]
I/O ports at 9000 [size=128]
Expansion ROM at c5000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
5f:00.1 0403: 10de:228b (rev a1)
Subsystem: 17aa:14ad
Flags: fast devsel, IRQ 255, NUMA node 0, IOMMU group 2
Memory at c5080000 (32-bit, non-prefetchable) [disabled] [size=16K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [160] Data Link Feature <?>
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
lspci -n -s af:00 -v
af:00.0 0300: 10de:24b0 (rev a1) (prog-if 00 [VGA controller])
Subsystem: 17aa:14ad
Flags: bus master, fast devsel, latency 0, IRQ 184, NUMA node 1, IOMMU group 10
Memory at ed000000 (32-bit, non-prefetchable) [size=16M]
Memory at 2b000000000 (64-bit, prefetchable) [size=32G]
Memory at 2b800000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
Expansion ROM at ee000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
af:00.1 0403: 10de:228b (rev a1)
Subsystem: 17aa:14ad
Flags: bus master, fast devsel, latency 0, IRQ 181, NUMA node 1, IOMMU group 10
Memory at ee080000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [160] Data Link Feature <?>
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
lspci -n -s d8:00 -v
d8:00.0 0300: 10de:24b0 (rev a1) (prog-if 00 [VGA controller])
Subsystem: 17aa:14ad
Flags: bus master, fast devsel, latency 0, IRQ 185, NUMA node 1, IOMMU group 8
Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
Memory at 2f000000000 (64-bit, prefetchable) [size=32G]
Memory at 2f800000000 (64-bit, prefetchable) [size=32M]
I/O ports at f000 [size=128]
Expansion ROM at fb000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
d8:00.1 0403: 10de:228b (rev a1)
Subsystem: 17aa:14ad
Flags: bus master, fast devsel, latency 0, IRQ 183, NUMA node 1, IOMMU group 8
Memory at fb080000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [160] Data Link Feature <?>
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel