> It already supports a number of obscure options (you can make QEMU claim to support a CPU feature regardless of whether the host CPU supports it, really?), so adding one more would fit in just fine.
> Nope. “there are no plans to address it further or fix it in an upcoming release”.
I could see that being the response of an individual open-source developer working for free. But that was IBM saying that, and people pay big bucks to IBM to fix things like this.
It's a bug filed against RHEL 7 originally, by someone working at Red Hat, who suggested that we add the qemu disable-io feature to libvirt. There was no customer case behind either the original RHEL 7 bug or this cloned RHEL 8 bug, so we simply didn't think it was important to implement this. Five years after the original bug was filed, with no customer coming along and nobody having done the work upstream, the bug was auto-closed.
However if someone came along and did the work upstream to fix it, I'm sure that would be accepted.
Or if a customer turned up who wanted this, that would also be implemented.
As Red Hat becomes more commercial, it's imperative we don't let them be stewards of open source anymore. Too many ties to their corporate strategy.
For example they took ownership of X11 only so they can let it die in favor of their preferred Wayland. While Wayland is not bad, it's not covering everything.
But anyway I don't really care anymore, I'm less and less invested in the Linux ecosystem. It's too commercial now, I just stick with the BSDs <3
Because, if you read the report, such an option is not needed:
- if you disable IO port allocation and plug in a card that requires it, that card cannot possibly work
- if you don't disable it but use only cards that don't require IO ports, you might get an error in your dmesg but the card will still work just fine
So, why would you need to specify this option in the first place?
Back before libvirt made it trivial, I used QEMU/KVM directly to map PCI devices to VMs. It's a little tricky because you must first unmap the device from the host/hypervisor, and you need to unmap the whole bus that the device is on. So if there are other PCI devices on the same bus as that device you want to map, they must all go along, which is often impossible for things like the USB controller for your keyboard/mouse.
These days, instead of crafting a custom script to launch QEMU/KVM for PCI mapping, it's just a few clicks in virt-manager. Note that the first time you launch a VM with a mapped PCI device, the launch will often fail with an error, but it will work on a subsequent retry and thereafter.
Also, I've tinkered with lots of VMs over the past 15 years and I've NEVER had a need for more than 14 buses. Hopefully I never will.
Not sure it will work though: I need to add an option to a `pcie-root-port` command-line argument managed by libvirt.
I can try skipping creating `pcie-root-port`s by libvirt completely, and add them manually using options passthrough, but I'm not sure the rest of libvirt won't throw a fit when it finds other devices that refer to these (unknown to libvirt) PCIe slots.
I'm curious to know more about the VM host machine that they plugged 15 e1000 cards into to test this limitation. And even more curious about the non-test environment in which somebody ran into this limitation.
I can only imagine trying to passthrough 20 nvme devices to a guest, but it seems like a very weird configuration.
On IaaS providers, you get "local scratch NVMe" presented to the guest as individual fixed-sized disks — presumably because they're being IOMMU-pass-through'ed from the host (or a JBOD direct-attached to the host.)
The sizes for these disks were standardized several generations ago, so they're at least presented to the guest as 375G slices (I'm guessing they might actually be partitions of a larger disk nowadays.) To get "decent" amounts of local scratch storage for e.g. a serverless data-warehouse instance, you need "all you can get" of these small volumes — which on at least AWS and GCP, is 24 of them (equalling ~9TB.)
And that's just one guest. The host might have several such guests.
(To be clear, neither AWS nor GCP is likely to be using libvirt anywhere in their stack. This is just to demonstrate the use-case.)
Probably not normal partitions but NVMe namespaces instead, since that will also allow them to balance IOPS and such so that one customer doesn't affect another as much.
If I'm not wrong, the pre-allocation of I/O ranges in PCIe bridges is needed only if you intend to hot-plug devices that were not present in the first enumeration. But in VMs the hardware is known from the start, and PCIe enumeration can assign I/O ranges only if the devices underneath actually need them. Is there a reason why hot-plugging is needed in VMs?
Isn't the cloud notoriously worse about hotplugging anything than on-prem systems are? For example, vSphere supports hot adding CPUs and RAM to VMs, but Azure doesn't.
Have you got it working with PCI or PCIe? PCI devices attached to the top-level bus do not request I/O ports unless they need to, and if they do, they request only a small slice.
QEMU also allows one to put 8 static PCIe devices into a single "multifunction PCIe device", so it requests 4K I/O ports per 8 devices, giving a bigger headroom. The downside, of course, is that all these 8 devices lose individual hotpluggability, and can only be added/removed en masse.
The biggest problem is hotplug slots, each taking 4K I/O ports unless told otherwise in a way libvirt does not support as I described in the article.
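To put rough numbers on the two comments above, here is a back-of-the-envelope sketch. It assumes the 64K x86 I/O port space and the 4K-per-hotplug-slot figure quoted in this thread; the function names and the `reserved_ports` parameter are made up for illustration, and the 8K value used in the usage note is only a placeholder, not a figure taken from QEMU or libvirt.

```python
# Rough I/O port budget for hotplug-capable PCIe slots.
# Assumptions: 64K x86 I/O port space, 4K window per hotplug slot
# (the figure quoted in the thread). Names here are illustrative only.

IO_SPACE_PORTS = 64 * 1024       # x86 I/O ports are addressed with 16 bits
WINDOW_PORTS = 4 * 1024          # I/O window claimed by each hotplug slot
FUNCTIONS_PER_DEVICE = 8         # functions in one multifunction PCIe device


def max_io_windows(reserved_ports=0):
    """How many 4K windows fit once `reserved_ports` are taken by other devices."""
    return (IO_SPACE_PORTS - reserved_ports) // WINDOW_PORTS


def max_devices(windows, multifunction=False):
    """Devices reachable through `windows` slots, optionally packed 8 per slot."""
    per_window = FUNCTIONS_PER_DEVICE if multifunction else 1
    return windows * per_window
```

With nothing reserved, `max_io_windows()` gives 16 slots. If other devices took 8K of ports (a placeholder value), `max_io_windows(0x2000)` gives 14, and packing eight functions per slot with `max_devices(14, multifunction=True)` gives 112 devices, at the cost of individual hotplug as noted above.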
Author here. As correctly guessed in other comments: cloud infrastructure.
To make public IPs and volumes hotpluggable without a guest agent running inside every VM one has to manage them in a way guest OS will handle hotplug using regular mechanisms. For volumes it's PCIe storage hotplug, for public IPs it's PCIe network card hotplug.
If a VM is used as a Kubernetes worker, a couple of dozen volumes and public IPs attached is not an unlikely situation.
It’s not a common use-case but I could see it being useful for sharing hardware that requires exclusive access like GPUs/ML accelerators.
Currently if you need GPUs they come with the instance itself meaning you need to boot your VM from scratch, do the work and then shut it down to relinquish the GPU.
With hot-plug you could have continuously running VMs that only attach/detach GPUs as needed, no longer taking the overhead of a full cold boot/shutdown every time.
I ran into this on FreeNAS which uses Bhyve. Not sure if it's FreeNAS' way of doing things, but adding a virtual disk using VirtIO creates a separate SATA controller.
I tried forwarding quad NVMe's and couldn't get it working until I discovered I was hitting this limitation between the existing disks and VirtIO network card.
Perhaps I am slightly misremembering and it was incidental to the NVMe's, but it did fail due to this 14 PCIe device limit because the virtual disks did not share a controller, and I had to change to using Bhyve's AHCI driver for some disks to get the VM running again.
I even did a test adding one disk at a time until the VM stopped booting.
They stayed fixed because they were fixed devices in a simple computer. Basic keyboard support, legacy interrupt controller, legacy timers, VGA… stuff that still to this day to an extent makes a PC actually “a PC”, and that may even still be used to various extents in early stages of the boot chain.
In early computers, most device resources were fixed, especially critical stuff like keyboard and interrupt controller. Sometimes devices were jumpered, but even then you’d have the choice between a few well known ranges.
There wasn’t any configuration/negotiation protocol in the early days, it was literally defined by how the wires were connected, and a few fixed logic gates. For compatibility, it had to stay that way. x86 PCs have a lot of legacy cruft.
But early boot is also where use of that legacy stuff usually ends with modern operating systems. Pretty much all these devices have been replaced by additional modern variants that are now mostly just using regular MMIO as well (I don’t think x86 I/O ports have relevant advantages, would appreciate to be told otherwise). For devices that are supposed to work on other machines than PCs (nowadays that mostly means ARM stuff), it can even get in the way, since they don’t know about this weird I/O port address space.
So modern OSes of course prefer the newer device variants (most are also decades old by now) of keyboard, interrupt controller, etc., and since those don’t tend to use I/O ports for the aforementioned reasons, modern OSes don’t use them too much in general anymore.
> I don’t think x86 I/O ports have relevant advantages, would appreciate to be told otherwise
It's far outside the mainstream, but the x86 task state segment lets user-level tasks do I/O on specific ports, with single-port granularity. You can map memory for a task only at a page level, so you could potentially allow user-space drivers finer-grained access to devices. Of course, more or less nothing uses this.
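For anyone who hasn't seen it, the mechanism is the I/O permission bitmap in the TSS: one bit per port, where a clear bit means the port is allowed when the task lacks I/O privilege. Below is a toy model of just the bitmap layout, not of any real kernel API; the class and method names are made up for illustration.

```python
# Toy model of the x86 TSS I/O permission bitmap: one bit per I/O port,
# bit clear = access allowed, bit set = access denied.
# Class and method names are hypothetical, for illustration only.

class IoPermissionBitmap:
    PORTS = 65536  # size of the x86 I/O port space

    def __init__(self):
        # Start with every port denied (all bits set).
        self.bits = bytearray([0xFF] * (self.PORTS // 8))

    def allow(self, first, last):
        """Grant access to the inclusive port range first..last."""
        for port in range(first, last + 1):
            self.bits[port >> 3] &= ~(1 << (port & 7)) & 0xFF

    def allowed(self, port):
        """True if the bit for this single port is clear."""
        return ((self.bits[port >> 3] >> (port & 7)) & 1) == 0


bitmap = IoPermissionBitmap()
bitmap.allow(0x60, 0x64)  # e.g. only the PS/2 keyboard/mouse ports
```

The full bitmap is 8192 bytes and can open up a window as small as one port, while memory-mapped access can only be granted a whole page at a time.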
> For devices that are supposed to work on other machines than PCs (nowadays that mostly means ARM stuff), it can even get in the way, since they don’t know about this weird I/O port address space.
PCI host bridges are supposed to offer a way to interact with I/O ports if it's not something natural for the CPU. Whether or not that happens regularly, I'm not really sure.
Because it's used mostly for commands, so each device used very few ports. 128 bytes is already a pretty large size for the I/O port area of a PCI device, and a lot of them fit in 64k.
A lot of hardware has migrated from using I/O ports to memory-mapped I/O, and instead of fixed I/O addresses ACPI or a similar mechanism provides the OS with the directory of memory addresses to talk to.
For example, instead of PS/2 keyboard/mouse at I/O ports 0x0060-0x0064, ACPI provides the OS with the memory address to talk to a USB controller, and the USB controller does not use I/O ports at all.
Most of this hardware is gone. The easiest way to see them at all is to boot a VM in QEMU and specifically ask for these ancient devices to be present.
> Nope. “there are no plans to address it further or fix it in an upcoming release”.
<https://bugzilla.redhat.com/show_bug.cgi?id=1408810>
Though, reading the closing comment, this is really CLOSED-WONTFIXYET, as in no plans.
Maybe it'd be nice to introduce a WONTFIXYET. Might be useful to fossick among features abandoned that someday become feasible.
Now if your CTO golfs with an exec at IBM, you might get somewhere.
It appears that the VM works fine with some devices but refuses to boot with others (esp. e100).
So the answer might be more nuanced than it seems?
No need, libvirt can pass arbitrary options to QEMU.
https://libvirt.org/kbase/qemu-passthrough-security.html
Correct. I regularly use VMs with more than 14 statically configured PCI devices using QEMU with libvirt without having to resort to qemu:cmdline.
Have a look at a list of the most common I/O ports: https://wiki.osdev.org/I/O_Ports#The_list
Most of this hardware is gone. The easiest way to see them at all is to boot a VM in QEMU and specifically ask for these ancient devices to be present.