(PCIE) Peripheral Component Interconnect [Express]

Introduction

Peripheral Component Interconnect (PCI) and its serial cousin, PCI Express (PCIe), are busses that let components be added to an existing system without too much headache.

In the older days of ISA and EISA busses, the wires were physically connected to certain places, such as the I/O bus and/or MMIO. Furthermore, the interrupt request vectors would be specified when the board was made. Some configuration was possible: for example, I could set a jumper or “dual-inline package” (DIP) switches to change the I/O address or IRQ vector.

Switch position settings to set IRQ and I/O addresses of a device.

With more and more boards being manufactured that do different things, this became untenable. The PCI bus solved a lot of these issues by moving much of the configuration to run time. This means that the operating system can enumerate the PCI bus and set the addresses and other options. It also means that the I/O addresses are programmable by the operating system! This gave rise to the “plug-n-play” system.

This bus ran parallel wires at 33 MHz or 66 MHz depending on a run-time setting. The wires were physically connected from the device to the endpoint, such as the CPU or DMA controller through a chipset “south bridge”. As more and more devices were added, including those soldered directly to the motherboard, a better bus was needed. This is where PCI Express came into play.

PCI express (PCIe) changed the parallel nature into a serial nature. It also changed the connections between devices and the host. Now, PCIe is more like a “star” network topology. Devices are connected together through a PCIe host, and older PCI cards can still be connected through a PCIe bridge. PCI express also has different signaling wires depending on the application.

PCIe uses differential signaling, meaning two wires send out the same information 180 degrees out of phase. So, if one wire is sending a 1, the other is sending a 0, and vice-versa. The thought is that since the wires are so close together, both wires will be affected by interference in the same way. So, you can separate the signal from the noise just by subtracting the values of the two wires.

The signals that come from PCIe come in one of the following widths: 1x, 2x, 4x, 8x, or 16x. There are options for more, but 16x is the widest in consumer computers nowadays. Read these as 1x (one times) or 16x (sixteen times): the number multiplies the number of lanes. Each lane carries one differential pair in each direction, so a 16x link has sixteen differential pairs, or 32 wires, per direction. Since the two wires of each pair carry the same data (just the opposite), this allows 16 different signals all at once in each direction.

Signal Routing

PCI and PCIe are both “busses”, which just means a central system to connect all of our devices. PCI express can be thought of as the “Internet of devices”. In other words, we have “computers” and “routers” in the Internet sense, which are called “type-0 devices” and “type-1 bridges” in the PCIe sense.

PCI Connections

A type-0 device is given its name by the “header” type, which you will see below when you go to configure the device. Every device connected to PCIe requires (1) PCI configuration access mechanism (CAM) registers and (2) device-specific registers.

Configuration Registers

We access the CAM through MMIO. This memory address is designed into the computer and system boards. For the QEMU virt machine, the CAM is connected to 0x3000_0000. Any memory address between 0x3000_0000 and 0x3FFF_FFFF will be sent to the PCI subsystem by the memory controller. This is programmed into the memory controller when the hardware is fabricated (hard coded) or through some sort of non-volatile RAM, which is sometimes called firmware.

Address Selection and Signal Routing

When the memory controller sees the 0x3xxx_xxxx memory address, it forwards it to the PCI “root port”, which is the PCI subsystem. The root port then dissects the bits of this memory address into two pieces: a bus number (8 bits, from 27 to 20) and a device number (5 bits, from 19 to 15). Think of the bus number as a “subnet address” on the Internet (like 192.168.1.xxx) and the device number as the host part of the IP address (like xxx.xxx.xxx.125). These two pieces of information form a PCI address.

Example PCI Address

A PCI address is formed by dissecting the MMIO address:

Example address: 0x3076_b000

In binary:
0011 0000 0111 0110 1011 0000 0000 0000

Regroup the binary digits:
0011 [00000111]  [01101] [  011  ]  [   0000   ]  [0000 00] 00
     [ Bus #  ]  [Dev #] [Func # ]  [Ext Reg # ]  [ Reg # ]

The memory address above points to bus #7, device #13. The function number (#3 above), extended register, and register fields are for a specific register on a device. The last two bits are used for other purposes and are not used for identifying a device on the PCI bus.
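
The regrouping above can be sketched in C as a handful of shifts and masks. This is a minimal illustration, not part of any driver: the struct EcamFields type and the ecam_decode name are made up for this example, and the extended register and register fields are folded into one 10-bit value.

```c
#include <stdint.h>

/* Hypothetical helper type: the fields packed into an ECAM address. */
struct EcamFields {
    uint8_t  bus;      /* bits 27:20 */
    uint8_t  device;   /* bits 19:15 */
    uint8_t  function; /* bits 14:12 */
    uint16_t reg;      /* bits 11:2 (extended register + register) */
};

/* Split an MMIO address inside the ECAM window into its fields. */
static struct EcamFields ecam_decode(uint64_t addr)
{
    struct EcamFields f;
    f.bus      = (addr >> 20) & 0xff;
    f.device   = (addr >> 15) & 0x1f;
    f.function = (addr >> 12) & 0x7;
    f.reg      = (addr >> 2)  & 0x3ff;
    return f;
}
```

Calling ecam_decode(0x3076b000) gives bus 7, device 13, and function 3, matching the hand calculation.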

A device can have multiple “functions”. For example, consider a graphics card. It can have an audio output, HDMI output, or even a USB-C input/output. This would be three functions under the same graphics card device.

We will NOT be exploring any multifunction devices for this class, so all function numbers will be 0.

PCI Network Topology

So, at this point, the memory controller forwarded a load/store to the PCI root port based on the memory address. The root port now receives this and separates out the bus number and device number. It then passes these values to all devices directly under it. I use the term directly because we have yet to touch on the type-1 bridges. So, the bridges are directly connected to the root port, but we cannot see anything behind the bridge until we configure it. Think of the bridge as a router in your house. Unless you configure port forwarding or a DMZ, anyone on the outside cannot see the devices behind your router.

A type-1 device connected to PCI is called a bridge. A bridge’s job is to forward signals from a primary bus to a secondary bus. The primary bus is the bus that the bridge is actually connected to. In the figure above, the primary bus for the bridge would be bus 0. We get to configure the secondary bus. As long as we pick a unique bus number, we can assign it whatever we want. However, the best thing to do is just increase the bus number of every bridge you meet.

Until you configure the primary and secondary bus numbers for the bridge above, you will NOT be able to see the devices behind it, that is, the two devices under bus 1. This is why we start at bus 0, device 0 and use a doubly-nested for-loop to go through all the devices on one bus before switching to the next. Since the bridge above is on bus 0, we will configure it before we actually start enumerating the devices on bus 1. Since the bridge is responsible for forwarding signals from bus 0 to bus 1, the bridge must be configured; otherwise, the signals won’t be forwarded.

Bus Numbering for Bridges

What is a BAR?

A BAR is a base address register. This is a register in the PCI configuration access mechanism address space. Think of the BARs as memory pointers. The memory address a BAR points to connects the memory controller to a device’s actual registers. Recall that devices have (1) PCI registers and (2) device-specific registers. We have yet to actually connect to #2. We can only do so by configuring the BARs.

One thing to note is a BAR is NOT what we use to access a device. Instead, a BAR stores a memory address we can then use to access a device’s registers. So, if we write 0x4041_0000 into a BAR, we can then load and store to the memory address 0x4041_0000 to access the device-specific registers.

Devices have PCI registers AND device-specific registers.

We cannot access the device-specific registers until we give a memory address to a base-address-register located in the PCI configuration space.

BAR Requirements (PCI specification chapter 3.2)

We will take a look at the device-specific registers in the Virtual I/O lecture.

BAR Memory Pointer Routing

When we write a memory address into a BAR, the PCI subsystem blocks out a space in the memory controller. This is the power of the PCI system. We can write essentially any memory address in the BAR and then use the memory controller to access anything behind it…which will be the device-specific registers.

The device itself will tell you which BARs it uses and for which purposes. So, we have to know which kind of device we’re driving (via vendor id and device id, see below).

Again, a BAR is a memory pointer. To access registers behind it, we load and store using the memory address the BAR points to and NOT the BAR itself.

The output below (from info pci) shows a block device. Every type-0 device can have up to 6, 32-bit BARs (memory pointers). However, many are not used. We can also have BARs that store 64-bit memory addresses. In that case we use BAR[x] to store the lower 32-bits and BAR[x+1] to store the upper 32-bits. For example, the output below shows BAR[4] is a 64-bit BAR, meaning that BAR[4] stores the lower 32-bits and BAR[5] stores the upper 32-bits of a 64-bit memory address.

  Bus  3, device   2, function 0:
    SCSI controller: PCI device 1af4:1042
      PCI subsystem 1af4:1100
      IRQ 0, pin A
      BAR1: 32 bit memory at 0x00000000 [0x00000fff].
      BAR4: 64 bit prefetchable memory at 0x40320000 [0x40323fff].
      id "blk3"
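
The 64-bit pairing described above can be sketched as follows. The function name bar64_address is an assumption for this example. The low four bits of a memory BAR hold type flags rather than address bits, so the sketch masks them off before combining the halves.

```c
#include <stdint.h>

/* Combine BAR[x] (low half) and BAR[x+1] (high half) into one address.
   The low four bits of the low half are flags, not address bits. */
static uint64_t bar64_address(uint32_t bar_lo, uint32_t bar_hi)
{
    return ((uint64_t)bar_hi << 32) | (bar_lo & ~0xfULL);
}
```

For the block device shown, BAR[4] holding 0x4032000c and BAR[5] holding 0 would yield 0x4032_0000, assuming the flag bits read back as 0xc (64-bit, prefetchable).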

Recall that we can essentially put any memory address in the BAR. The qemu virt machine reserves memory addresses 0x4000_0000 through 0x7FFF_FFFF for PCI device-specific registers (the layout used in this lecture only needs 0x4000_0000 through 0x4FFF_FFFF). It is important to note that 0x4000_0000 is NOT a RAM address. Instead, it is an address the memory controller forwards to PCI so that it can read from (load) or write to (store) a device-specific register on the hardware.

Above, I put the bus number starting at bit 20 and the device number starting at bit 16 (1 bit over from where it is in ECAM). The device above is on bus #3 and it is device #2. This is why the memory address pointed to by BAR4 is 0x4032_0000.
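
A minimal sketch of that address layout is shown below. The names PCIE_MMIO_BASE and bar_address_for are assumptions for this example. Because the device number is placed at bit 16 and the bus number at bit 20, the sketch only keeps four bits of the device number so the two fields cannot overlap; that restriction is a choice made here, not something PCI requires.

```c
#include <stdint.h>

#define PCIE_MMIO_BASE 0x40000000UL /* start of the virt PCI MMIO window */

/* Pick a BAR address from the bus (bit 20 and up) and device (bits 19:16).
   Assumes device numbers below 16 so the fields do not collide. */
static uint64_t bar_address_for(uint8_t bus, uint8_t device)
{
    return PCIE_MMIO_BASE |
           ((uint64_t)bus << 20) |
           ((uint64_t)(device & 0xf) << 16);
}
```

For bus 3, device 2 this gives 0x4032_0000, the address shown for BAR4. Each device gets 64 KiB under this scheme, so a device whose BARs need more space would require a different layout.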

Notice that the end memory address of BAR4 is 0x4032_3FFF. Each BAR is a pointer to a memory address which in turn connects to a device-specific register. These registers can be many different sizes. We will discuss below the mechanism for determining how much memory space each BAR needs. It involves writing -1 into the BAR and seeing what comes out.

Configuration

The configuration of PCI is its power. The PCI bus has a configuration access mechanism (CAM), and PCIe extends this to a much larger address space per function (from 256 bytes to 4096 bytes) called the enhanced configuration access mechanism (ECAM).

The ECAM for the QEMU virt is at MMIO address 0x3000_0000. Each PCI host, bridge, and device has an ECAM, so to start configuring, we need to enumerate all of the devices attached to PCI.

To do so, there is some terminology we need to know. PCI devices are organized by bus, slot (device), and function. The bus identifies which host or bridge the device is connected to. Each bus has a number of slots (aka devices) where the device actually connects. Then there is a function, which is an addressable unit of the device itself. Most PCI(e) devices have only one function (function 0). However, if bit 7 of the header_type field is set to 1, then it is a multifunction device. A multifunction device can have up to 8 functions (0 – 7), and they must be enumerated like the busses and devices.
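
The header_type test can be sketched as two small helpers. The function names are assumptions for this example. Bit 7 is the multifunction flag described above, and the remaining seven bits give the header layout (0 for a device, 1 for a bridge).

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit 7 of header_type marks a multifunction device. */
static bool pci_is_multifunction(uint8_t header_type)
{
    return (header_type & 0x80) != 0;
}

/* The low seven bits select the header layout (type 0 or type 1). */
static uint8_t pci_header_layout(uint8_t header_type)
{
    return header_type & 0x7f;
}
```

An enumeration loop would only need to probe functions 1 through 7 of a slot when pci_is_multifunction returns true for function 0.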

Enumerating the PCI Bus

To enumerate the bus, we start with bus 0, slot 0 and keep incrementing until the address space is over. This is called the brute force method, but it’s only done once at boot time for non-hot-plug devices. We will not be covering hot plug devices in this course. A hot plug device is a device that can be plugged in or taken out while the computer and OS are running.

For virt, we have up to 256 busses (up to 8 bits for bus number), and each bus can have up to 32 devices (5 bits for device number). The base address of each ECAM is given by the following diagram.

Recall that the ECAM starts at 0x3000_0000, so this would be bus 0, device 0, function 0. For ECAM, the bus number starts at bit 20, the device number starts at bit 15, the function number starts at bit 12, the extended register number starts at bit 8, and the register number starts at bit 2. So, we can calculate the bus and device using the following function.

#define MMIO_ECAM_BASE 0x30000000
static volatile struct EcamHeader *pcie_get_ecam(uint8_t bus,
                                                 uint8_t device,
                                                 uint8_t function,
                                                 uint16_t reg) 
{
    // Widen each field to 64 bits before shifting so no bits
    // are lost, and mask each field to its width.
    uint64_t bus64 = bus & 0xff;
    uint64_t device64 = device & 0x1f;
    uint64_t function64 = function & 0x7;
    uint64_t reg64 = reg & 0x3ff;

    // Finally, put the address together
    return (volatile struct EcamHeader *)
                 (MMIO_ECAM_BASE |     // base 0x3000_0000
                 (bus64 << 20) |       // bus number A[(20+n-1):20] (up to 8 bits)
                 (device64 << 15) |    // device number A[19:15]
                 (function64 << 12) |  // function number A[14:12]
                 (reg64 << 2));        // register number A[11:2]
}

Now, we can use our function to determine the memory address of the header.

int bus;
int device;

// There are a MAXIMUM of 256 busses
// although some implementations allow for fewer.
// Minimum # of busses is 1
for (bus = 0;bus < 256;bus++) {
   for (device = 0;device < 32;device++) {
      // EcamHeader is defined below
      volatile struct EcamHeader *ec = pcie_get_ecam(bus, device, 0, 0);
      // Vendor ID 0xffff means "invalid"
      if (ec->vendor_id == 0xffff) continue;
      // If we get here, we have a device.
      printf("Device at bus %d, device %d (MMIO @ 0x%08lx), class: 0x%04x\n",
              bus, device, (unsigned long)ec, ec->class_code);
      printf("   Device ID    : 0x%04x, Vendor ID    : 0x%04x\n",
              ec->device_id, ec->vendor_id);
   }
}

Enhanced Configuration Access Mechanism (ECAM)

The configuration layout is based on the header type, but the first 16 bytes are the same for all devices.

First 16 bytes of configuration space header

struct EcamHeader {
    uint16_t vendor_id;
    uint16_t device_id;
    uint16_t command_reg;
    uint16_t status_reg;
    uint8_t revision_id;
    uint8_t prog_if;
    union {
        uint16_t class_code;
        struct {
            uint8_t class_subcode;
            uint8_t class_basecode;
        };
    };
    uint8_t cacheline_size;
    uint8_t latency_timer;
    uint8_t header_type;
    uint8_t bist;
    union {
        struct {
            uint32_t bar[6];
            uint32_t cardbus_cis_pointer;
            uint16_t sub_vendor_id;
            uint16_t sub_device_id;
            uint32_t expansion_rom_addr;
            uint8_t  capes_pointer;
            uint8_t  reserved0[3];
            uint32_t reserved1;
            uint8_t  interrupt_line;
            uint8_t  interrupt_pin;
            uint8_t  min_gnt;
            uint8_t  max_lat;
        } type0;
        struct {
            uint32_t bar[2];
            uint8_t  primary_bus_no;
            uint8_t  secondary_bus_no;
            uint8_t  subordinate_bus_no;
            uint8_t  secondary_latency_timer;
            uint8_t  io_base;
            uint8_t  io_limit;
            uint16_t secondary_status;
            uint16_t memory_base;
            uint16_t memory_limit;
            uint16_t prefetch_memory_base;
            uint16_t prefetch_memory_limit;
            uint32_t prefetch_base_upper;
            uint32_t prefetch_limit_upper;
            uint16_t io_base_upper;
            uint16_t io_limit_upper;
            uint8_t  capes_pointer;
            uint8_t  reserved0[3];
            uint32_t expansion_rom_addr;
            uint8_t  interrupt_line;
            uint8_t  interrupt_pin;
            uint16_t bridge_control;
        } type1;
        struct {
            uint32_t reserved0[9];
            uint8_t  capes_pointer;
            uint8_t  reserved1[3];
            uint32_t reserved2;
            uint8_t  interrupt_line;
            uint8_t  interrupt_pin;
            uint8_t  reserved3[2];
        } common;
    };
};

The PCI configuration space is in little-endian format, so the first two bytes are the Vendor ID (low byte first), followed by the two-byte Device ID. For the QEMU virt purposes, the host’s vendor ID is 0x1b36, whereas each virtio device’s vendor ID is 0x1af4. The device ID combined with the class code tells us what kind of device is connected.

The common parts of the header have the following meanings. There are a lot of fields, and many we will not use. The ones we will use are in bold.

vendor_id – The vendor ID of the device. 0x0000 and 0xffff mean the device is NOT connected (and should be skipped).
device_id – The device ID given to this device. This will identify which driver should configure the device.
command_reg – The command register (detailed below).
status_reg – The status register (detailed below).
revision_id – Device specific revision information (generally not used).
prog_if – Programmable interface (generally not used).
class_code – The class identifier. For example, base class (upper 8 bits) 0x09 is input, and sub class (lower 8 bits) 0x80 is “other”.
cacheline_size – The system cache line size in units of 32-bit words (for bus master devices).
latency_timer – The number of PCI bus clocks required for bus mastering (for bus master devices).
header_type – The type of header (Type 0 – device, Type 1 – pci-to-pci bridge).
bist – Built-in Self Test (BIST).

Type 0 headers contain the following fields.

bar[6] – Base address registers. Programmable MMIO addresses to place up to 6, 32-bit or 3, 64-bit registers. The registers are specific to the device, including I/O and configuration. The OS will write the MMIO address to link these registers. For 64-bit registers, bar[n] is the low 32 bits of the address and bar[n+1] is the high 32 bits of the address.
cardbus_cis_pointer – Cardbus (PCMCIA) bus specification pointer.
sub_vendor_id – The vendor id of the attached subsystem. This is additional information to the vendor id.
sub_device_id – The device id of the attached subsystem. This is additional information to the device id.
expansion_rom_addr – The address for expansion ROM.
capes_pointer – The capability pointer to the head of the capability linked list (described below).
interrupt_line – The interrupt vector that the device is connected to. For virt, this is wired to 0.
interrupt_pin – The interrupt pin that the device will trigger. PCI has four interrupt pins: INTA#, INTB#, INTC#, and INTD#. 0 means the interrupt is not connected, 1 = INTA#, 2 = INTB#, 3 = INTC#, and 4 = INTD#.
min_gnt – Minimum grant time.
max_lat – Maximum latency.

Command Register

The command register allows us to send some commands to the PCI device for configuration purposes (not I/O).

This is a read/write register, and it is per-device. If we use MMIO (which we will), we want to ensure Memory Space is 1 so that the PCI device can respond to MMIO reads/writes. We will not be using PIO, so the I/O Space bit should be set to 0.

Bus Mastering is only important if we need a particular PCI device to write to RAM. This is usually necessary for MSI/MSI-X writes, which use a physical MMIO address to signal a message.

Make sure that you set the command register BEFORE loading or storing to the address assigned to the BARs! All devices should have bit 1 set, and all bridges should have bits 1 and 2 set. NOTE: The “info pci” command in QEMU will not display the addresses you store into the BARs until the command register is set to accept memory space requests (bit index 1).

Keep reading further for additional considerations when writing to this register.

Status Register

The status register gives a response to a command and has the following structure.

The status register is a read/write register, but we only write 1 to the bits we want to reset. Writing a 0 into a bit will leave it unchanged.

PCI Bridges

PCI bridges are like a network switch. A PCI bridge uses a type 1 header, and it forwards communication to and from a separate, secondary bus. We will not be able to see any device behind a bridge until we set up the bridge.

Remember first to set the memory space and bus master bits (1 and 2) in the command register before setting these fields.

There are a few fields we have to concern ourselves with here. The first few fields are the: (1) primary bus number, (2) secondary bus number, and (3) subordinate bus number. A bridge is attached to a primary bus, usually bus 0, and it forwards requests to and from the secondary bus, which we have to enumerate. Whatever value we assign into the secondary bus number will be the bus number for all devices behind the bridge. Finally, the subordinate bus number is the highest bus number that will be controlled by this bridge. If there are bridges behind other bridges, this is when we will need to set the subordinate bus number. The subordinate bus number must be >= the secondary bus number. For example, if we are enumerating a bridge with three bridges behind it, the subordinate bus number would be secondary + 3. This implies that any nested bridges must have sequential bus numbers.
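
The numbering rule can be sketched on an in-memory model of the bridge tree. The struct BridgeNode type and assign_bus_numbers function are hypothetical and exist only for this illustration; real code would read and write the type 1 header fields instead. Each bridge takes the next free bus number as its secondary bus, numbers everything behind it, and then records the highest number handed out as its subordinate bus.

```c
#include <stdint.h>

/* Hypothetical model of a bridge tree (not a hardware structure). */
struct BridgeNode {
    struct BridgeNode *children;  /* bridges directly behind this one */
    int                nchildren;
    uint8_t            secondary;
    uint8_t            subordinate;
};

/* Number this bridge and everything behind it, depth first.
   Returns the next unused bus number. */
static uint8_t assign_bus_numbers(struct BridgeNode *b, uint8_t next_bus)
{
    b->secondary = next_bus++;
    for (int i = 0; i < b->nchildren; i++) {
        next_bus = assign_bus_numbers(&b->children[i], next_bus);
    }
    /* Highest bus number reachable through this bridge. */
    b->subordinate = next_bus - 1;
    return next_bus;
}
```

For a bridge given secondary bus 1 with three bridges directly behind it, the children receive busses 2, 3, and 4, and the outer bridge ends up with subordinate bus 4, which is secondary + 3.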

Finally, there are four fields we will need to set here. (1) memory and prefetchable memory base and (2) memory and prefetchable memory limit. In this case, the base is the lowest memory addresses that can be forwarded through the bridge, and the limit is the highest memory addresses that can be forwarded through the bridge.

The memory addresses we store in these fields are only the upper 16-bits of the memory address. So, if we want to allow the secondary bus to use the MMIO addresses 0x40000000 through 0x4fffffff, then we would set the memory base to 0x4000 (upper 16 bits) and the limit to 0x4fff (upper 16 bits).

Even though 16-bits of the memory address are stored in this register, only the upper 12 bits are used. This means that only what we set in bits 20 and above will actually be identified by the bridge.

From the PCI-express Bridge Specification (chapter 5.2)

However, if we choose our memory a little bit better, we can shift the bus number into the 20th bit, which is the first addressable bit on the bridge. For example, bus 1 would only need to forward memory transactions from 0x4010_0000 through 0x401F_FFFF. The following output of info pci shows the fourth PCI bridge connected to the root port.

Bus  0, device   4, function 0:
    PCI bridge: PCI device 1b36:000c
      IRQ 0, pin A
      BUS 0.
      secondary bus 4.
      subordinate bus 4.
      IO range [0xf000, 0x0fff]
      memory range [0x40400000, 0x404fffff]
      prefetchable memory range [0x40400000, 0x404fffff]
      BAR0: 32 bit memory at 0x00000000 [0x00000fff].
      id "bridge4"

Notice that only memory addresses between 0x4040_0000 and 0x404F_FFFF will be forwarded to devices (and other bridges) behind this bridge. Therefore, if we set a BAR on a device behind this bridge to 0x4030_0000, the device will never hear memory transactions since the bridge is not configured to forward those addresses. Note that this bridge does have BAR0, but we are not required to configure BARs on bridges.
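
The forwarding decision can be sketched from the memory_base and memory_limit register values. The function names are assumptions for this example. Only the upper 12 bits of each 16-bit register count, the low 20 bits of the base are treated as zeros, and the low 20 bits of the limit are treated as ones, which is why the window above ends at 0x404f_ffff.

```c
#include <stdint.h>
#include <stdbool.h>

/* First address forwarded: register bits 15:4 become address bits 31:20. */
static uint32_t bridge_window_base(uint16_t memory_base)
{
    return (uint32_t)(memory_base & 0xfff0) << 16;
}

/* Last address forwarded: the low 20 bits are taken as all ones. */
static uint32_t bridge_window_limit(uint16_t memory_limit)
{
    return ((uint32_t)(memory_limit & 0xfff0) << 16) | 0xfffff;
}

/* True if the bridge would forward a memory transaction at addr. */
static bool bridge_forwards(uint16_t memory_base, uint16_t memory_limit,
                            uint32_t addr)
{
    return addr >= bridge_window_base(memory_base) &&
           addr <= bridge_window_limit(memory_limit);
}
```

With memory_base 0x4040 and memory_limit 0x404f, an access to 0x4041_0000 is forwarded while an access to 0x4030_0000 is not.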

Each BAR can be prefetchable or not, which is indicated by bit 3 of the BAR. We need to set the memory addresses for both memory base and prefetchable memory base.

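// Command register bits: memory space is bit 1, bus master is bit 2.
// (COMMAND_REG_BUSMASTER is a name chosen here for bit 2.)
#define COMMAND_REG_MMIO      (1 << 1)
#define COMMAND_REG_BUSMASTER (1 << 2)

static void pcie_setup_bridge(volatile struct EcamHeader *ec, uint16_t bus) {
    // Next bus number to hand out. Bus 0 is the root bus.
    static uint8_t subordinate = 1;

    // Give each secondary bus its own 1 MiB window: 0x40N0_0000 - 0x40NF_FFFF.
    uint64_t addrst = 0x40000000 | ((uint64_t)subordinate << 20);
    uint64_t addred = addrst + ((1 << 20) - 1);

    // Bridges need both memory space and bus master enabled.
    ec->command_reg = COMMAND_REG_MMIO | COMMAND_REG_BUSMASTER;
    // Only the upper 16 bits of each address are stored.
    ec->type1.memory_base = addrst >> 16;
    ec->type1.memory_limit = addred >> 16;
    ec->type1.prefetch_memory_base = addrst >> 16;
    ec->type1.prefetch_memory_limit = addred >> 16;
    ec->type1.primary_bus_no = bus;
    ec->type1.secondary_bus_no = subordinate;
    ec->type1.subordinate_bus_no = subordinate;

    subordinate += 1;
}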

We can figure out the bus through the ECAM address, but the uint16_t bus passed into this function allows us to set the bridge’s primary bus number. The primary bus of most bridges is the bus that it was found on. We set both memory and prefetch_memory. Which of these is accessed is controlled by bit 3 of each BAR.

The PCI bridges will hear all memory accesses from the root port. This is why we have to specify the memory base and memory limit. This is mainly for the bridges to forward data we load or store in the memory addresses in the BARs, so with some careful planning, we can only forward the portion of the data required.

Capabilities

The capes_pointer holds an offset, measured from the top of the header, where a linked list of capabilities begins. Each capability has a unique identifier (ID), and the structure at the offset is based on the capability. The capabilities linked list allows us to see what sort of things each device can do. An important capability ID is 0x09, which is the “Vendor-specific capability”. We will be looking at these capabilities to determine which base address register (BAR) is connected to which part of the device.

The capabilities all have a common 2-byte sequence; however, each capability can have an expanded structure. PCI devices that have a capabilities linked-list will have the status_reg bit 4 (Capabilities List) set to 1. If this bit is 0, then there are no capabilities, and the capes_pointer should be considered invalid…although it will most likely be 0.

The first byte is the capability ID, and then the next byte is the offset to the next capability. All offsets are based on the top of the ECAM (the address of the vendor ID field). The last capability will have the next capability set to 0, signaling there are no more capabilities.

Again, each capability has its own structure, which we will only know after reading the capability ID.

struct Capability {
    uint8_t id;
    uint8_t next;
};
// Make sure there are capabilities (bit 4 of the status register).
// ec points to this function's ECAM header (from pcie_get_ecam).
if (0 != (ec->status_reg & (1 << 4))) {
   unsigned char capes_next = ec->common.capes_pointer;
   while (capes_next != 0) {
      // Offsets are relative to the top of the ECAM header.
      unsigned long cap_addr = (unsigned long)ec + capes_next;
      volatile struct Capability *cap = (volatile struct Capability *)cap_addr;
      switch (cap->id) {
         case 0x09: /* Vendor Specific */
         {
             /* ... */
         }
         break;
         case 0x10: /* PCI-express */
         {
         }
         break;
         default:
            printf("Unknown capability ID 0x%02x (next: 0x%02x)\n", cap->id, cap->next);
         break;
      }
      capes_next = cap->next;
   }
}

Interrupts

For fast moving devices, such as those connected to PCIe x16, signaling an interrupt for every data transfer will get expensive, and end up slowing the device. PCI and PCIe can function using message signaled interrupts (MSI) or “extended” message signaled interrupts (MSI-X).

An MSI or MSI-X is a place in memory where the PCI device signals a “message” that would normally cause an interrupt. The operating system can look at a field called Pending Bit Array or PBA. If an interrupt is pending, it can then handle the interrupt as normal.

MSI-X is exposed as a capability (Capability ID = 0x11), and MSI is exposed as a capability (Capability ID = 0x05). For typical cases, we use the more advanced MSI-X if it is available over MSI.

NOTE: MSI/MSI-X is NOT supported by RISC-V’s PLIC. There is an AIA (advanced interrupt architecture) that supports MSI/MSI-X. With the legacy PLIC, each PCIe device is assigned one of the interrupt numbers 32 through 35 based on its bus and slot. The following formula determines the interrupt number. Note that more than one PCIe device might be connected to the same interrupt!

\(\text{IRQ} = 32 + [(\text{bus} + \text{slot}) \bmod 4]\)

INT#

Our emulating software codes the values of the interrupt of the PCIe devices based on the bus and slot of the device. This means that the interrupt_pin field in the ECAM is not valid and will usually be 0.

Since multiple devices can trigger on the same interrupt pin, we have to ask each device on that interrupt pin if it caused the interrupt.
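
The IRQ formula is small enough to sketch directly. The function name pcie_plic_irq is an assumption for this example.

```c
#include <stdint.h>

/* PLIC interrupt number for a PCIe device on the virt machine:
   IRQ = 32 + ((bus + slot) mod 4). */
static uint32_t pcie_plic_irq(uint8_t bus, uint8_t slot)
{
    return 32u + (((uint32_t)bus + slot) % 4u);
}
```

Bus 3, slot 2 and bus 1, slot 0 both map to IRQ 33, which is exactly the sharing case where the handler has to ask each device whether it raised the interrupt.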

Each device has a special way of “interrupting” in this system. For most devices, those handlers will be in Virtio.

MSI

MSI (capability ID = 0x05) 32-bit Message Address Structure
(message control bits 8 = 0, 7 = 0)

MSI (capability ID = 0x05) 64-bit Message Address Structure (message control bits 8 = 0, 7 = 1)

MSI (capability ID = 0x05) 32-bit Per-Vector Message Address Structure
(message control bits 8 = 1, 7 = 0)

MSI (capability ID = 0x05) 64-bit Per-Vector Message Address Structure
(message control bits 8 = 1, 7 = 1)

The message control register for MSI has the following bits:

Bits	R/W	Description
15:9	RO	RESERVED
8	RO	Per-vector masking capable. (0 = No, 1 = Yes)
7	RO	64-bit address capable. (0 = No, 1 = Yes)
6:4	RW	Multiple messages enable. 0b000 – 1 message; 0b001 – 2 messages; 0b010 – 4 messages; 0b011 – 8 messages; 0b100 – 16 messages; 0b101 – 32 messages; 0b110 and 0b111 – Reserved.
3:1	RO	Multiple messages capable. Fields are the same as multiple messages enable.
0	RW	MSI Enable (set to 1 to enable MSI, set to 0 to disable MSI).

Message Control Register Bit Fields for MSI
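
The bit fields in the table can be unpacked with shifts and masks. The struct MsiControl type and msi_decode_control function are names made up for this sketch. The message counts are decoded as powers of two, and the reserved encodings 0b110 and 0b111 are not rejected here.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical decoded view of the MSI message control register. */
struct MsiControl {
    bool     enabled;         /* bit 0 */
    unsigned capable_msgs;    /* bits 3:1, decoded to a message count */
    unsigned enabled_msgs;    /* bits 6:4, decoded to a message count */
    bool     addr64;          /* bit 7 */
    bool     per_vector_mask; /* bit 8 */
};

static struct MsiControl msi_decode_control(uint16_t mc)
{
    struct MsiControl c;
    c.enabled         = (mc & 1) != 0;
    c.capable_msgs    = 1u << ((mc >> 1) & 0x7);
    c.enabled_msgs    = 1u << ((mc >> 4) & 0x7);
    c.addr64          = ((mc >> 7) & 1) != 0;
    c.per_vector_mask = ((mc >> 8) & 1) != 0;
    return c;
}
```

The addr64 and per_vector_mask results select which of the four message address structures above applies to the device.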

MSI-X

PCI/PCIe uses EITHER MSI or MSI-X, but NOT both. Both can be provided as capabilities; however, software (the OS) must choose one or none. If neither MSI nor MSI-X is chosen, the PCI infrastructure will use the pin-based interrupt system. For RISC-V this means it will use the PLIC interrupts between 32 and 35.

The message control register allows us to configure the MSI-X. The table offset contains the offset of the table from the address stored in the BAR selected by the BIR (BAR Indicator Register) (i.e., ec->type0.bar[bir]).

The message control register is different for MSI-X as well. It is still 16-bits, but it contains the following fields.

Bits	R/W	Description
15	RW	MSI-X enable (1 = enabled, 0 = disabled)
14	RW	Function mask (1 = all vectors masked, 0 = masking based on each vector’s masked bit).
13:11	RO	Reserved
10:0	RO	Table size, encoded as N – 1 (the number of table entries is msg_control[10:0] + 1).

Message Control Register Structure for MSI-X
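
A minimal sketch of reading these fields follows. All function names are assumptions for this example. The last two helpers split the 32-bit table offset register: in the PCI specification its low three bits hold the BIR and the remaining bits hold the 8-byte-aligned offset, a detail not spelled out in the text above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit 15: MSI-X enable. */
static bool msix_enabled(uint16_t mc) { return ((mc >> 15) & 1) != 0; }

/* Bit 14: function mask (all vectors masked when set). */
static bool msix_function_masked(uint16_t mc) { return ((mc >> 14) & 1) != 0; }

/* Bits 10:0 hold the table size minus one. */
static unsigned msix_table_entries(uint16_t mc) { return (mc & 0x7ffu) + 1; }

/* Low three bits of the table offset register select the BAR. */
static unsigned msix_bir(uint32_t table_reg) { return table_reg & 0x7u; }

/* The remaining bits are the byte offset into that BAR's region. */
static uint32_t msix_table_offset(uint32_t table_reg) { return table_reg & ~0x7u; }
```

The table itself would then be found at the address programmed into bar[msix_bir(reg)] plus msix_table_offset(reg).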

We can set bit 0 (the mask bit) in the vector control to mask (turn off) interrupts to that vector. However, if the bit is cleared (reset), then messages can be posted there. We give a 64-bit address in the message address field, and a message in the message data field. When an interrupt is signaled, the device’s function will write the data into the memory given by the memory address. The message address upper field stores the upper 32 bits of a 64-bit address [63:32], and the message address field stores the lower 32 bits of a 64-bit address [31:0].

The PCI specification allows us to write the address in one store (a 64-bit doubleword) or in two separate stores (32-bit words). However, the vector must be masked before any changes to the message data field or to the message address field are made.
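
The mask-then-modify rule can be sketched as follows. The struct MsixTableEntry type and msix_program_vector function are names chosen for this example. The field order follows the MSI-X table entry layout in the PCI specification, and the mask bit is taken to be bit 0 of vector control. Unmasking at the end is a choice made in this sketch; a driver may prefer to leave the vector masked until its handler is ready.

```c
#include <stdint.h>

/* One 16-byte MSI-X table entry. */
struct MsixTableEntry {
    uint32_t msg_addr;        /* message address [31:0] */
    uint32_t msg_addr_upper;  /* message address [63:32] */
    uint32_t msg_data;        /* value the device writes to signal */
    uint32_t vector_control;  /* bit 0 = mask */
};

/* Mask the vector, update address and data with 32-bit stores, unmask. */
static void msix_program_vector(volatile struct MsixTableEntry *e,
                                uint64_t addr, uint32_t data)
{
    e->vector_control |= 1u;               /* mask before any change */
    e->msg_addr       = (uint32_t)addr;
    e->msg_addr_upper = (uint32_t)(addr >> 32);
    e->msg_data       = data;
    e->vector_control &= ~1u;              /* allow messages again */
}
```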

Base Address Registers (BARs)

Each base address register points to a place in memory where the register on the device is mapped. Since we don’t know the size beforehand, we have to ask the BAR what size it needs. This is done by writing all 1s for each bit in the BAR, then reading the value back out to see what it gives us. All bits of 1 are “necessary”, whereas all bits of 0 are wildcards. We can determine the size that it asks for by masking the last four bits, inverting the bits, and then adding 1 (two’s complement).
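
The size calculation can be sketched as a pure function of the value read back after writing all 1s. The name bar_size_from_readback is an assumption for this example, and the sketch only covers a single 32-bit memory BAR; a 64-bit BAR would need both halves combined first.

```c
#include <stdint.h>

/* Size requested by a 32-bit memory BAR, given the value read back
   after writing 0xffffffff to it. Returns 0 for an unused BAR. */
static uint64_t bar_size_from_readback(uint32_t readback)
{
    uint32_t mask = readback & ~0xfu;  /* drop the four flag bits */
    if (mask == 0) {
        return 0;                      /* BAR is not implemented */
    }
    return (uint64_t)(uint32_t)~mask + 1;  /* invert, then add 1 */
}
```

A readback of 0xffffc00c yields 0x4000 bytes, and 0xfffff000 yields 0x1000 bytes, which line up with the BAR4 and BAR1 ranges in the info pci output earlier.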

Recall that type 0 headers have 6 BAR register fields. Not all may be used. If this is the case, we will get all 0s when we write to it. However, it is the capabilities linked-list that tells us what these bars are used for. In other words, the BARs are for the device, not the PCI (or PCIe) bus.

The QEMU virt system leaves a region of physical memory available for mapping these base address registers, starting at 0x4000_0000 up to, but not including, 0x8000_0000. We can see this by looking at the QEMU source code (hw/riscv/virt.c).

    [VIRT_PCIE_ECAM] =   { 0x30000000,    0x10000000 },
    [VIRT_PCIE_MMIO] =   { 0x40000000,    0x40000000 },
    [VIRT_DRAM] =        { 0x80000000,           0x0 },

Each BAR must be mapped using a certain size, which is different for each device. I recommend numbering the devices by the hex digit represented by bits 23:20, which gives each one a 1 MB window. For example, bus index 0 is at 0x400x_xxxx, bus index 1 is at 0x401x_xxxx, bus index 2 is at 0x402x_xxxx, and so forth. This allows for 16 devices at once, which is far more than you’ll need. NOTE: Do NOT forget to map the addresses pointed to by the BARs in the MMU. Even though the BAR points to a physical address, the OS accesses it through a virtual address, so it must be mapped properly in a page table.
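As a rough sketch of the numbering scheme above, the window base for a given index can be computed by placing the index in bits 23:20 on top of the MMIO base. The macro and function names here are made up for illustration.

```c
#include <stdint.h>

#define PCIE_MMIO_BASE 0x40000000u  /* VIRT_PCIE_MMIO base from virt.c */

/* Put a 4-bit index in bits 23:20, so each index owns a 1 MB window:
 * 0 -> 0x4000_0000, 1 -> 0x4010_0000, 2 -> 0x4020_0000, ... */
static uint32_t bar_window_base(unsigned index)
{
    return PCIE_MMIO_BASE | ((uint32_t)(index & 0xF) << 20);
}
```

A device whose BARs need more than 1 MB in total would not fit this scheme, so treat it as a convenience for small virtual devices.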

Some BARs are 64 bits and others are 32 bits. This can be determined by looking at BAR[2:1]: a value of 0b00 means a 32-bit BAR, and a value of 0b10 means a 64-bit BAR. Bit index 0 (the least significant bit) determines whether this is a memory-mapped BAR (0) or a PIO-mapped BAR (1). Our machine does not have a PIO bus, so we can only use MMIO BARs.
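These checks translate directly into bit tests. A small sketch with hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0: 0 = memory-mapped (MMIO) BAR, 1 = PIO-mapped BAR. */
static bool bar_is_pio(uint32_t bar)
{
    return (bar & 0x1) != 0;
}

/* BAR[2:1] == 0b10 marks a 64-bit memory BAR. */
static bool bar_is_64bit(uint32_t bar)
{
    return !bar_is_pio(bar) && ((bar >> 1) & 0x3) == 0x2;
}

/* BAR[2:1] == 0b00 marks a 32-bit memory BAR. */
static bool bar_is_32bit(uint32_t bar)
{
    return !bar_is_pio(bar) && ((bar >> 1) & 0x3) == 0x0;
}
```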

Determining BAR Mapping Length

The base address registers just contain the memory address where the device’s mapped region starts. However, the data contained at this memory address varies in size. We can see the size of each structure by looking at the capability list.

We can determine the amount of address space and alignment a BAR needs by first disabling the BAR via the command register (make sure I/O and memory are 0), and then by writing all 1s into the BAR. We can then read back the BAR field. Anything that is still a 1 is a valid portion of the BAR. However, anything that is 0 means that when we pass it a memory address, it should be aligned by this amount.

We can use the alignment to figure out the size that the BAR maps to by taking the two’s complement of the value after masking the last four bits, -(BAR & ~0xFUL). We mask off the last four bits of the BAR since those bits are used for a 1-bit “prefetchable” field, a 2-bit “size” field, and a 1-bit “memory-space identifier” field.
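The size calculation can be sketched as below. The function names are hypothetical, and the inputs are the values read back after writing all 1s to the BAR. Note that the two’s complement has to be taken at the width of the BAR, so a 64-bit BAR combines both halves first.

```c
#include <stdint.h>

/* Size in bytes requested by a 32-bit memory BAR.
 * `readback` is the BAR value read after writing all 1s; 0 means unused. */
static uint32_t bar32_size(uint32_t readback)
{
    uint32_t masked = readback & ~0xFu;  /* drop the four flag bits */
    return ~masked + 1u;                 /* two's complement */
}

/* Same calculation for a 64-bit BAR, which spans two BAR fields. */
static uint64_t bar64_size(uint32_t readback_lo, uint32_t readback_hi)
{
    uint64_t masked = (((uint64_t)readback_hi << 32) | readback_lo) & ~0xFULL;
    return ~masked + 1ULL;
}
```

For example, a readback of 0xFFFF_C000 gives a size of 0x4000 bytes (16 KB), which is also the required alignment.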

Assigning BARs

When you find a type 0 device, you need to look at the BARs to assign them an MMIO space address. If a BAR still reads 0 (all 0s) after all 1s are written to it, then it is an unused BAR.

Planning the bridges and using bits 27:20 as the bus number and bits 18:16 as the device number makes forwarding easy. We have all the space from 0x4000_0000 through 0x4FFF_FFFF.
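Writing the chosen address into a BAR might look like the following sketch. The function name is made up, and it assumes that a 64-bit BAR occupies two consecutive BAR fields, with the lower 32 bits in the first field and the upper 32 bits in the next one.

```c
#include <stdint.h>

/* Write `addr` into bars[idx] and return how many BAR fields were used.
 * The caller must make sure idx + 1 is a valid field for a 64-bit BAR. */
static int bar_assign(volatile uint32_t *bars, int idx, uint64_t addr)
{
    uint32_t type = (bars[idx] >> 1) & 0x3;  /* BAR[2:1] */

    bars[idx] = (uint32_t)addr;              /* address bits [31:0] */
    if (type == 0x2) {
        bars[idx + 1] = (uint32_t)(addr >> 32);  /* address bits [63:32] */
        return 2;                            /* skip the upper half next */
    }
    return 1;
}
```

The return value lets an enumeration loop step over the upper half of a 64-bit BAR instead of treating it as a separate BAR.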

PCI-express Devices

Devices connected as PCIe will have the capability ID of 0x10. This will then expose a much larger configuration space. For the QEMU virt, we will not use many of these capabilities as they have to do with link negotiation and power management.

The first field we see is the PCI Express Capabilities Register, which has the following structure.

The rest of the registers deal with actual hardware, and they don’t make much sense for a virtual device. The only reason we care about the PCIe configuration is for the interrupt message number (for MSI/MSI-X).
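Pulling out the interrupt message number is a single shift and mask. This sketch assumes the field sits in bits 13:9 of the 16-bit PCI Express Capabilities Register; the function name is hypothetical.

```c
#include <stdint.h>

/* Interrupt message number: assumed to be bits 13:9 of the
 * PCI Express Capabilities Register. */
static unsigned pcie_interrupt_message_number(uint16_t pcie_caps)
{
    return (pcie_caps >> 9) & 0x1F;
}
```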

Resizable BARs

The PCI express specification version 2.1, ratified in 2008, allows you to resize the BAR address space in software. This is important, because if there is support for it, you can map a block of memory to a BAR address. Let’s see an example.

As you can see above, BAR0 is 16 gigabytes. It just so happens that this particular GPU has 16 gigabytes of VRAM. So what have I essentially done? I mapped the GPU memory into the CPU memory controller. I can now dereference a memory address using a load or a store, and I’m reading or writing directly to GPU memory.

Without a resizable BAR, I would be required to make a transfer command for blocks of memory at a time, which you will do with the VirtIO GPU. Unfortunately, at this point, the VirtIO GPU device does not support resizable BARs. So, you get to see it the hard way.

Another benefit reveals itself if we think of it as a memory size issue. Consider a system that only has 8 gigabytes of memory. If I want to transfer 16 gigabytes of data, even though we usually won’t transfer the entire memory of the GPU, then I would have to do it in multiple transfer commands. What happens when 6 of those 8 gigabytes are taken by Windows and applications?

The resizable BAR capability ID is 0x0015.

The capability register has the following format.

Each of the bits from 23:4 indicates a BAR address size that is possible, with the size index shifted left four places. For example, bit 4 indicates support for a 1 MB BAR address space, whereas bit 23 indicates support for a 512 GB BAR address space.

We first read the sizes that the BAR can be resized to, and then we write the chosen BAR size into bits 12:8. The value is the power of two of the size in megabytes. For example, if I write 3 here, I have an 8 MB BAR address space. If I write 19 here, I have a 512 GB BAR address space.
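The encoding above can be sketched with three small helpers. The names are hypothetical, and the size argument is the power of two in megabytes (0 for 1 MB up to 19 for 512 GB).

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the capability register advertises a BAR of 2^size_log2_mb MB.
 * Size index 0 lives in bit 4, size index 19 in bit 23. */
static bool rebar_supports(uint32_t cap_reg, unsigned size_log2_mb)
{
    return size_log2_mb <= 19 && ((cap_reg >> (size_log2_mb + 4)) & 1);
}

/* Return the control register value with bits 12:8 set to the new size. */
static uint32_t rebar_set_size(uint32_t ctrl_reg, unsigned size_log2_mb)
{
    return (ctrl_reg & ~(0x1Fu << 8)) | ((uint32_t)(size_log2_mb & 0x1F) << 8);
}

/* Convert the size index to bytes: 2^size_log2_mb megabytes. */
static uint64_t rebar_size_bytes(unsigned size_log2_mb)
{
    return (uint64_t)1 << (size_log2_mb + 20);
}
```

A caller would check `rebar_supports` first and only then write the result of `rebar_set_size` back to the control register.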

Recap

As you can tell, there is a lot of information above. This is the tradeoff with making a bus like PCI generic and easy: it makes our lives a little bit harder. Here are the steps, boiled down, to eventually talk to a PCI device.

Enumerate the bus from bus 0 through 255 at MMIO address 0x3000_0000.
For each bus, enumerate the devices on the bus from device 0 through 31.
Check the vendor ID. If it is 0xffff, it is not an attached device, skip and go to the next device.
Otherwise, enumerate the BARs by checking if they’re 64 or 32 bits.
- You do not need to set anything in a bridge’s BARs
- Also recall BARs set to all 0s are not used, so they don’t need an address.
Enumerate the capabilities linked list.
Check how much space is required for an address.
Double check the BAR is actually used by the device.
- Write all 1s (-1UL) to the BAR.
- Read back the address from the same BAR and clear bits 3:0.
- The value of the two’s complement of the readback will be the amount of address space needed.
Give the BARs in the capabilities list an empty chunk of memory starting with 0x4000_0000.
- Be careful when aligning here. Make sure you don’t give a BAR the same address as another.
Communicate with the BAR’s address as if they contain the device registers.

When checking and changing a BAR, make sure the command bit index 1 (memory space) is cleared. Only set this bit after all of the BARs are set. If you set up a device and then the memory space bit is cleared, that device will no longer be “present”, and you will have to reinitialize it.
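The first few steps of the recap can be sketched as a scan over the configuration space. This is a simplified illustration with hypothetical names: it assumes the usual ECAM layout (bus in bits 27:20, device in bits 19:15, function in bits 14:12 of the offset), looks only at function 0, and stops at counting devices rather than sizing and assigning BARs.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset of a function's 4 KB configuration space inside ECAM
 * (assumed layout: bus << 20 | device << 15 | function << 12). */
static size_t ecam_offset(unsigned bus, unsigned dev, unsigned fn)
{
    return ((size_t)bus << 20) | ((size_t)dev << 15) | ((size_t)fn << 12);
}

/* Count attached devices (function 0 only) on buses [0, nbuses).
 * `ecam` is the mapped ECAM base, 0x3000_0000 on the QEMU virt machine. */
static unsigned pci_count_devices(const volatile uint8_t *ecam, unsigned nbuses)
{
    unsigned found = 0;

    for (unsigned bus = 0; bus < nbuses; bus++) {
        for (unsigned dev = 0; dev < 32; dev++) {
            /* The vendor ID is the first 16 bits of the header. */
            const volatile uint16_t *vendor =
                (const volatile uint16_t *)(ecam + ecam_offset(bus, dev, 0));

            if (*vendor == 0xFFFF)
                continue;  /* nothing attached, go to the next device */

            found++;       /* a kernel would size and assign BARs here */
        }
    }
    return found;
}
```

On the real machine, `nbuses` would be 256, and the body of the loop would continue with the BAR and capability steps listed above.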

Communicating with PCI(e) Devices

Setup takes time, but luckily it only has to be done once. Now that we’ve set up the base address registers, and we know the capabilities and type of each device, we can now forward this information to the correct driver. Recall that PCI relays three pieces of useful information in this regard: (1) vendor id, (2) device id, and (3) class id. The class is the broadest, most general area where we can start forwarding the information to a device driver. Then, the device abstraction can determine which specific driver to use based on the device and vendor id.

Device drivers should never touch the PCI configuration or BARs directly. Instead, they should go through the PCI subsystem to read, write, or otherwise configure a device.

The data at each BAR means something different for each device. You can see that PCI setup for all devices is the same until we get to the capabilities and finally the BARs.

Drivers

When we enumerate the PCI bus, we have to forward the device we found based on the vendor ID and device ID to a driver that knows how to talk to that device. We can create a list of vendor IDs and device IDs and then a callback function that will be used to handle the device.
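One possible shape for such a list is a table of IDs paired with a callback. The following is a sketch under that assumption; the type names, the callback signature, and `example_probe` are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Callback invoked when a matching device is found (hypothetical signature). */
typedef int (*pci_probe_fn)(uint8_t bus, uint8_t dev);

struct pci_driver {
    uint16_t     vendor_id;
    uint16_t     device_id;
    pci_probe_fn probe;
};

/* Placeholder driver entry point; a real one would set up the device. */
static int example_probe(uint8_t bus, uint8_t dev)
{
    (void)bus;
    (void)dev;
    return 0;  /* 0 = success in this sketch */
}

/* Return the callback registered for (vendor, device), or NULL if none. */
static pci_probe_fn pci_find_driver(const struct pci_driver *table, size_t count,
                                    uint16_t vendor_id, uint16_t device_id)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i].vendor_id == vendor_id &&
            table[i].device_id == device_id)
            return table[i].probe;
    }
    return NULL;
}
```

During enumeration, the PCI subsystem would call `pci_find_driver` with the IDs read from the header and invoke the returned callback if it is not NULL.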

The command register’s bit index 1 allows the PCI root or bridge to forward requests to a specific device. However, if we disable this, the device itself may reset, so only change the command register when changing BARs.
