Contents

  1. Overview
  2. Videos
  3. Virtio System
  4. PCI Transport
  5. Virtqueues
  6. Debugging

Overview

Virtual I/O devices, or virtio for short, are paravirtualized devices: the guest knows the device is virtual rather than real hardware. Usually, a virtual machine will emulate real hardware, so the registers and setup are the same for actual hardware versus emulated hardware. However, there are a lot of things that don’t need to be done on virtual hardware. For example, we don’t have to set power states or clocking for emulated hardware. So, virtio devices only require the setup that actually matters for a virtual device.

Be careful when reading from and writing to virtio devices. Requests are performed asynchronously, so the request and its memory must stay resident until the device responds. In other words, do not use local, stack-based variables. Instead, use kmalloc/kzalloc for memory that needs to outlive the function that submitted the request.

Virtio Specification Reference: (please notify me if this link breaks): https://docs.oasis-open.org/virtio/virtio/v1.1/virtio-v1.1.html


Videos


Virtio System

The virtio architecture has three layers: (1) the backend or transport layer, (2) the virtio layer, and (3) the device layer. The transport layer in our case will be PCI express. The backend tells us, through vendor-specific capabilities, what the virtio system understands and where to find it. The virtio layer consists of several circular rings that are used to read data from and write data to the virtio device. Finally, the device layer tells us what the data we read and write actually means.

This means we will have three sections in our kernel: (1) PCIe and associated functions, (2) virtio and associated functions, and (3) device-specific functions. Each layer should be agnostic of the others.


PCI Transport

PCI transport will use the vendor ID 0x1AF4, and the device ID depends on the virtio device type. The following table describes the device types.

Type             Virtio Type #   PCI Device ID
Invalid          0               0x1040
Network card     1               0x1041
Block device     2               0x1042
Console          3               0x1043
Entropy device   4               0x1044
Wireless LAN     10              0x104a
GPU device       16              0x1050
Input device     18              0x1052

VirtIO Devices and Types
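Notice the pattern in the table: each modern virtio PCI device ID is just 0x1040 plus the virtio type number. A small helper can recover the type from the ID; this is a sketch, and the function name is my own.

```c
#include <stdint.h>

/* Modern (non-transitional) virtio devices use PCI device ID 0x1040 + type.
   Returns the virtio device type, or 0 (invalid) if the ID is out of the
   modern virtio range. The helper name is hypothetical. */
static inline uint16_t virtio_type_from_device_id(uint16_t device_id)
{
    if (device_id < 0x1040 || device_id > 0x107f)
        return 0; /* not a modern virtio device ID */
    return device_id - 0x1040;
}
```

For example, a device ID of 0x1042 maps to type 2, a block device, matching the table above.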

Most of the virtio interface is revealed to the PCI bus through the capabilities structure. Virtio will use the capability id 0x09, which means vendor specific. Each virtio vendor-specific capability has the following structure.

struct virtio_pci_cap {
   u8 cap_vndr;   /* Generic PCI field: PCI_CAP_ID_VNDR */
   u8 cap_next;   /* Generic PCI field: next ptr. */
   u8 cap_len;    /* Generic PCI field: capability length */
   u8 cfg_type;   /* Identifies the structure. */
   u8 bar;        /* Which BAR to find it. */
   u8 padding[3]; /* Pad to full dword. */
   u32 offset;    /* Offset within bar. */
   u32 length;    /* Length of the structure, in bytes. */
};

The first two fields, cap_vndr and cap_next, are standard for all PCI capabilities. Recall from PCI(e) that cap_next is an offset from the top of the ECAM header.

You can see that each capability has a type and a BAR. The BAR is a base address register that has to be programmed, which just means given a free memory address (we use addresses in the range 0x4000_0000 to 0x4FFF_FFFF). BARs can be either 32 or 64 bits wide. You can tell whether a BAR is 64 bits by masking the lower bits; see PCI(e) for how to do this. The capability type can be one of the following.

        /* Common configuration */
#define VIRTIO_PCI_CAP_COMMON_CFG 1
        /* Notifications */
#define VIRTIO_PCI_CAP_NOTIFY_CFG 2
        /* ISR Status */
#define VIRTIO_PCI_CAP_ISR_CFG 3
        /* Device specific configuration */
#define VIRTIO_PCI_CAP_DEVICE_CFG 4
        /* PCI configuration access */
#define VIRTIO_PCI_CAP_PCI_CFG 5

Generally, a virtio device will have multiple capabilities with an id of 0x09. After we see 0x09, we have to look at the type, which will be 1 through 5. The following is debug output from a 0x1042 device (block device).

 --> Capability @ 0xc8 (0x09)
       u8 cap_vndr = 0x09
       u8 cap_next = 0xb4
       u8 cap_len  = 0x14
       u8 cfg_type = 0x05
       u8 bar      = 0x00
       u32 offset  = 0x00000000
       u32 length  = 0x00000000
  --> Capability @ 0xb4 (0x09)
       u8 cap_vndr = 0x09
       u8 cap_next = 0xa4
       u8 cap_len  = 0x14
       u8 cfg_type = 0x02
       u8 bar      = 0x04
       u32 offset  = 0x00003000
       u32 length  = 0x00001000
  --> Capability @ 0xa4 (0x09)
       u8 cap_vndr = 0x09
       u8 cap_next = 0x94
       u8 cap_len  = 0x10
       u8 cfg_type = 0x04
       u8 bar      = 0x04
       u32 offset  = 0x00002000
       u32 length  = 0x00001000
  --> Capability @ 0x94 (0x09)
       u8 cap_vndr = 0x09
       u8 cap_next = 0x84
       u8 cap_len  = 0x10
       u8 cfg_type = 0x03
       u8 bar      = 0x04
       u32 offset  = 0x00001000
       u32 length  = 0x00001000
  --> Capability @ 0x84 (0x09)
       u8 cap_vndr = 0x09
       u8 cap_next = 0x7c
       u8 cap_len  = 0x10
       u8 cfg_type = 0x01
       u8 bar      = 0x04
       u32 offset  = 0x00000000
       u32 length  = 0x00001000

You can see that all of these capabilities use BAR 4, except config type 5, which uses BAR 0. However, config type 1 is at BAR 4 offset [0x00000000], whereas config type 3 is at BAR 4 offset [0x00001000]. So, BAR 4 points to the top of the configuration space, and the offset field locates each config type within it. This layout might change in the future, so we need a formal way of determining where each configuration structure lives.

The formal way is to record the configuration type, BAR, and offset into a separate virtio structure as we scan the capabilities.
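As a sketch of that idea (the struct and function names are my own), we can walk the capability list once and record the BAR, offset, and length for each of the five config types. Here the 256-byte config space is modeled as a plain byte array so the walk itself is easy to see; a real driver would read through its PCIe ECAM accessors instead.

```c
#include <stdint.h>
#include <string.h>

/* One record per config type; caps[0] is unused, index by cfg_type 1..5. */
struct virtio_cap_info {
    uint8_t  bar;
    uint32_t offset;
    uint32_t length;
};

static void scan_virtio_caps(const uint8_t *cfg, struct virtio_cap_info caps[6])
{
    /* The capabilities pointer lives at offset 0x34 of the standard header. */
    uint8_t off = cfg[0x34];
    while (off != 0) {
        uint8_t id   = cfg[off];        /* cap_vndr */
        uint8_t next = cfg[off + 1];    /* cap_next */
        if (id == 0x09) {               /* vendor specific: virtio */
            uint8_t cfg_type = cfg[off + 3];
            if (cfg_type >= 1 && cfg_type <= 5) {
                caps[cfg_type].bar = cfg[off + 4];
                memcpy(&caps[cfg_type].offset, cfg + off + 8, 4);
                memcpy(&caps[cfg_type].length, cfg + off + 12, 4);
            }
        }
        off = next; /* a hardened driver would also guard against loops */
    }
}
```

After the scan, caps[VIRTIO_PCI_CAP_COMMON_CFG] tells you exactly which BAR and offset hold the common configuration, independent of how QEMU happens to order things.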


Common Configuration (Type 1)

struct virtio_pci_common_cfg {
   u32 device_feature_select; /* read-write */
   u32 device_feature; /* read-only for driver */
   u32 driver_feature_select; /* read-write */
   u32 driver_feature; /* read-write */
   u16 msix_config; /* read-write */
   u16 num_queues; /* read-only for driver */
   u8 device_status; /* read-write */
   u8 config_generation; /* read-only for driver */
   /* About a specific virtqueue. */
   u16 queue_select; /* read-write */
   u16 queue_size; /* read-write */
   u16 queue_msix_vector; /* read-write */
   u16 queue_enable; /* read-write */
   u16 queue_notify_off; /* read-only for driver */
   u64 queue_desc; /* read-write */
   u64 queue_driver; /* read-write */
   u64 queue_device; /* read-write */
};

This is the most relevant capability. We will be configuring devices through device_status here. Then for the virtqueue (described below), we have to enable the queue by setting queue_enable to 1 AFTER we set the physical addresses properly for the three sections below.

We have three sections to set physical addresses. The descriptor table’s physical address is set in queue_desc. The available ring’s physical address is set in queue_driver. Finally, the used ring’s physical address is set in queue_device.


Notification Structure (Type 2)

struct virtio_pci_notify_cap {
   struct virtio_pci_cap cap;
   u32 notify_off_multiplier; /* Multiplier for queue_notify_off. */
};

#define BAR_NOTIFY_CAP(offset, queue_notify_off, notify_off_multiplier)   ((offset) + (queue_notify_off) * (notify_off_multiplier))

When we read a virtio PCI capability with the field cfg_type = 2, we know we have the notify capability. The notify capability contains an extra 4 bytes, which hold the notify offset multiplier. We multiply a queue's queue_notify_off by this multiplier to locate that queue's notify register within the BAR-mapped region. We write to this register to tell the device we have made a request.

The queue_notify_off comes from the common configuration (type 1) in the BAR-mapped address. We select the queue we are interested in by setting the queue_select field. Then, we read queue_notify_off to determine the offset for that given queue.
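Putting the pieces together (the helper name is mine, and the example values below are assumptions, not guaranteed QEMU behavior), the notify register address for a queue is the notify capability's BAR base, plus its offset field, plus queue_notify_off times the multiplier:

```c
#include <stdint.h>

/* Sketch: compute the address of a queue's notify register.
   notify_bar_base  - MMIO address programmed into the notify capability's BAR
   cap_offset       - the offset field from the type 2 capability
   queue_notify_off - read from the common config after setting queue_select
   multiplier       - notify_off_multiplier from the type 2 capability */
static inline uint64_t virtio_notify_addr(uint64_t notify_bar_base,
                                          uint32_t cap_offset,
                                          uint16_t queue_notify_off,
                                          uint32_t multiplier)
{
    return notify_bar_base + cap_offset +
           (uint64_t)queue_notify_off * multiplier;
}
```

Writing the queue's number (as a u16) to this address is what notifies the device.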


Interrupt Service Routine (ISR) Status (Type 3)

struct virtio_pci_isr_cap {
    union {
       struct {
          unsigned queue_interrupt: 1;
          unsigned device_cfg_interrupt: 1;
          unsigned reserved: 30;
       };
       unsigned int isr_cap;
    };
};

ISR capabilities are for interrupts. However, we don’t use interrupts if we use MSI-X (message signaled interrupts). If we happen to disable MSI-X, then the device will write 1 to either queue_interrupt or device_cfg_interrupt so that our driver can differentiate between a queue being updated or the configuration being updated.

If we don’t use MSI-X, we are required to read this register to clear the interrupt. Otherwise, no further interrupts can be delivered. On QEMU's virt machine, the PCI express interrupts can be 32, 33, 34, or 35. See PCIE for more information on how to determine the interrupt.

Until the AIA interrupt controller (APLIC) is approved and implemented, MSI/MSI-X does not function on QEMU for RISC-V.

When we receive an interrupt from the PLIC with a claim ID of 32, 33, 34, or 35, we know it came from PCIe. More than one device may be listening on each IRQ, and on QEMU RISC-V the IRQ number is based on the bus and slot of the PCIe device.

To determine which device raised the interrupt, we have to look at one of these two fields. If queue_interrupt is 1, the interrupt was caused by the device responding to the VirtQ. If device_cfg_interrupt is 1, the device changed its configuration, and that was the reason the interrupt occurred.
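A minimal sketch of the non-MSI-X path (the function name is mine): in a real driver, 'raw' comes from a single read of the BAR-mapped ISR status register, and that read itself is what clears the interrupt; here we just classify the bits.

```c
#include <stdint.h>

struct virtio_pci_isr_cap {
    union {
        struct {
            unsigned queue_interrupt : 1;
            unsigned device_cfg_interrupt : 1;
            unsigned reserved : 30;
        };
        uint32_t isr_cap;
    };
};

/* Sketch: decide what a virtio interrupt means.
   Returns a bitmask: 1 = a virtqueue was updated, 2 = config changed. */
static int virtio_isr_classify(uint32_t raw)
{
    struct virtio_pci_isr_cap isr;
    isr.isr_cap = raw;
    int what = 0;
    if (isr.queue_interrupt)
        what |= 1; /* drain the used ring */
    if (isr.device_cfg_interrupt)
        what |= 2; /* re-read the device-specific configuration */
    return what;
}
```

This assumes little-endian, LSB-first bitfield layout, which holds for RISC-V with GCC/Clang.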


Device Specific (Type 4)

There is no standard device-specific structure, hence “device specific”.


PCI Configuration (Type 5)

This is an alternative mechanism for accessing the configuration structures above through PCI configuration space instead of through a memory-mapped BAR.


Base Address Registers

Each of the types above will have a base address register in the bar field. This is how you will access the data in the bar.

switch (cap->cfg_type) {
   case VIRTIO_PCI_CAP_COMMON_CFG: /* 1 */
   break;
   case VIRTIO_PCI_CAP_NOTIFY_CFG: /* 2 */
   break;
   case VIRTIO_PCI_CAP_ISR_CFG:    /* 3 */
   break;
   case VIRTIO_PCI_CAP_DEVICE_CFG: /* 4 */
   break;
   case VIRTIO_PCI_CAP_PCI_CFG:    /* 5 */
   break;
   default:
      printf("Unknown virtio capability %d\n", cap->cfg_type);
   break;
}

Many of the configurations will use the same BAR. So, you need to be able to detect whether you have already allocated MMIO space for a given BAR, for example with a global table of programmed BARs.

Since we assigned BARs during PCI init, you can get the MMIO address of each VirtIO register by reading the value back from the BAR. Recall that each capability tells you which BAR to use via the bar field in the capability structure.
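Here is a hedged sketch of reading a programmed BAR back (the helper name is mine). The six 32-bit BARs live at offset 0x10 of a type-0 ECAM header; for simplicity they are passed as an array here. Bit 0 distinguishes I/O from memory BARs, and bits [2:1] = 0b10 mark a 64-bit memory BAR whose upper half lives in the next BAR slot.

```c
#include <stdint.h>

/* Sketch: recover the MMIO base address from a programmed memory BAR.
   'bars' points at the six 32-bit BARs of a type-0 ECAM header. */
static uint64_t bar_address(const uint32_t *bars, int idx)
{
    uint32_t lo = bars[idx];
    if (lo & 0x1)
        return 0; /* I/O space BAR, not memory mapped */
    if (((lo >> 1) & 0x3) == 0x2) /* 64-bit memory BAR */
        return ((uint64_t)bars[idx + 1] << 32) | (lo & ~0xfu);
    return lo & ~0xfu; /* 32-bit memory BAR: mask off the type bits */
}
```

Adding the capability's offset field to this base address gives the location of each configuration structure.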


Virtqueues

A virtqueue is a data structure used to read and write data between the driver (the software we’re writing in the OS) and the device (block, tablet, GPU, etc.). Each device type has a specific number of queues; most have only one. The GPU, for example, has two: one for sending control commands and another for sending cursor updates.

Each virtual queue has a specific structure. The general structure of a request is to fill one of these queues with a data structure understood by the device. After we fill in the structure, we set the available ring (described below) to point to the structure (or structures) we just filled. We then notify the device that we made a change through the queue notify register. We write the queue’s number in this register, and the device will then read the available ring to see what we did.

Finally, a virtqueue is split into three parts: (1) an array of descriptors, (2) a ring array of available descriptor heads, and (3) a ring array of used descriptor heads.

The array of descriptors, the available head ring, and the used head ring must be physically contiguous! These virtio devices only speak in physical memory addresses.


Descriptors

A descriptor is a fundamental unit of “work”, whether it be to read from the device or write to the device. A descriptor has four fields as defined below.

struct virtq_desc {
   /* Address (guest-physical). */
   u64 addr;
   /* Length. */
   u32 len;
   /* This marks a buffer as continuing via the next field. */
   #define VIRTQ_DESC_F_NEXT   1
   /* This marks a buffer as device write-only (otherwise device read-only). */
   #define VIRTQ_DESC_F_WRITE     2
   /* This means the buffer contains a list of buffer descriptors. */
   #define VIRTQ_DESC_F_INDIRECT   4
   /* The flags as indicated above. */
   u16 flags;
   /* Next field if flags & NEXT */
   u16 next;
};

There will be a number of descriptors, which is negotiated by the driver through the queue_size field of the common configuration. The absolute maximum size is 1024, which is hardcoded into QEMU directly.
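As a sketch of how descriptors chain together (the helper name and the request/response split are illustrative, not a specific device's protocol), here is a two-descriptor chain: a device-readable buffer followed by a device-writable one.

```c
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  1
#define VIRTQ_DESC_F_WRITE 2

struct virtq_desc {
    uint64_t addr;  /* guest-PHYSICAL address of the buffer */
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Sketch: build a two-descriptor chain starting at index 'head'.
   The first buffer is read by the device, the second is written by it. */
static void desc_chain_two(struct virtq_desc *desc, uint16_t head,
                           uint64_t req_pa, uint32_t req_len,
                           uint64_t resp_pa, uint32_t resp_len)
{
    desc[head].addr  = req_pa;
    desc[head].len   = req_len;
    desc[head].flags = VIRTQ_DESC_F_NEXT;      /* chain continues at .next */
    desc[head].next  = head + 1;

    desc[head + 1].addr  = resp_pa;
    desc[head + 1].len   = resp_len;
    desc[head + 1].flags = VIRTQ_DESC_F_WRITE; /* device write-only */
    desc[head + 1].next  = 0;                  /* end of chain */
}
```

Only the head index goes into the available ring; the device follows the NEXT flags from there.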


Available (Driver) and Used (Device) Rings

The available ring, now known as the driver ring, allows the driver to notify the device, whereas the device will place things into the used ring, now known as the device ring, for the driver (us) to read. The available ring is a data structure that contains the following.

struct virtq_avail {
   #define VIRTQ_AVAIL_F_NO_INTERRUPT      1
   u16 flags;
   u16 idx;
   u16 ring[ /* Queue Size */ ];
   u16 used_event; /* Only if VIRTIO_F_EVENT_IDX */
};

When we want to write something to the device, we fill out a descriptor (or chain of descriptors), and then place the head descriptor's index into the next free slot of the ring. The next free slot is specified by idx (mod the queue size). After we place something into the ring, we increment idx. Then, once we write the queue number into the queue notify register, the device will keep reading from the ring until it catches up to idx.
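A sketch of that submission step (the helper is mine): write the head index into the slot selected by idx mod the queue size, then bump idx.

```c
#include <stdint.h>

struct virtq_avail {
    uint16_t flags;
    uint16_t idx;      /* shared with the device; only ever incremented */
    uint16_t ring[];   /* queue_size entries */
};

/* Sketch: publish a descriptor chain whose head index is 'head'. */
static void avail_push(struct virtq_avail *avail, uint16_t qsize, uint16_t head)
{
    avail->ring[avail->idx % qsize] = head;
    /* a real driver needs a write memory barrier here so the device never
       sees idx move before the ring slot is filled */
    avail->idx += 1; /* u16: wraps naturally past 65535 */
}
```

Note that only the ring slot is modded; idx itself keeps counting up and wraps on its own.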

Conversely, we can look at the used ring to see what the device is trying to tell us. If we don’t set up MSI-X, the device will notify us by sending an interrupt through the PLIC. After we claim the interrupt, we know which device is trying to notify us. We then read the used ring to see what it is telling us.

struct virtq_used {
   #define VIRTQ_USED_F_NO_NOTIFY  1
   u16 flags;
   u16 idx;
   struct virtq_used_elem ring[ /* Queue Size */];
   u16 avail_event; /* Only if VIRTIO_F_EVENT_IDX */
};
/* le32 is used here for ids for padding reasons. */
struct virtq_used_elem {
   /* Index of start of used descriptor chain. */
   u32 id;
   /* Total length of the descriptor chain which was used (written to) */
   u32 len;
};

Just like the device needs to read available elements in the ring until it catches up with the idx field, we have to keep reading used elements from the ring until our internal index is idx. The idx is circular, so after it exceeds 65535, it wraps to 0 since it is a u16 (16-bit unsigned value).

We can use the id field of the used element to tell us which descriptor it is responding to. We look at the id field, then use that as the index into the descriptor array. We can then see what we requested and what the device is responding to. Be careful with the id field: unless we negotiate VIRTIO_F_IN_ORDER, we can receive responses out of order, which means we have to handle ids that may not arrive sequentially.

We will need to keep our own internal index for both the used and available ring. The one in the virtq_available and virtq_used structures is shared between the device and driver. So, it will update even if we haven’t noticed any data coming in or out.
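A sketch of draining the used ring with our own internal index (names are mine): compare our index with the device's shared idx, process each new element, and mod only the ring slot. For simplicity this version collects the responded-to descriptor ids into an array instead of dispatching them.

```c
#include <stdint.h>

struct virtq_used_elem {
    uint32_t id;   /* head of the descriptor chain this responds to */
    uint32_t len;  /* total bytes the device wrote */
};

struct virtq_used {
    uint16_t flags;
    uint16_t idx;  /* written by the device */
    struct virtq_used_elem ring[];
};

/* Sketch: collect the id of every new used element into 'out'.
   Returns the updated internal index; caller stores it for next time. */
static uint16_t used_drain(const struct virtq_used *used, uint16_t qsize,
                           uint16_t my_idx, uint32_t *out, int *n_out)
{
    int n = 0;
    while (my_idx != used->idx) {
        const struct virtq_used_elem *e = &used->ring[my_idx % qsize];
        out[n++] = e->id; /* look up desc[id] to see what this answers */
        my_idx++;         /* u16: wraps past 65535 on its own; do not mod */
    }
    *n_out = n;
    return my_idx;
}
```

Each returned id indexes back into the descriptor table, which is how out-of-order responses are matched to requests.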

The device and driver can negotiate the queue_size, which is the number of descriptors, the size of the available ring, and the size of the used ring. Since C doesn’t have templates, we have a choice: (1) make multiple structures for different queue sizes or (2) use pointers to point to each section (descriptors, available, and used).

Ring Sizing

When we actually look at the value in the ring, we need to mod it by the queue size, which is given by the common PCI configuration queue_size. We can negotiate the queue size when we allocate the descriptor table, driver ring, and device ring.

When we look at the index values, such as ack_idx, at_idx, device->idx, and driver->idx, they are all 2-byte uint16_t values. However, unlike the ring slots, we do NOT mod the idx fields. Instead, these indices naturally wrap after 65535 (2^16 - 1).


Allocating Memory

We can control the queue size, but it must be a power of two. The order of setting up the device is important: neither the device nor its queues can be “live” (enabled) until everything is set up properly.

The virtio specification lays out the following 8 steps to initialize a device.

  1. Reset the device by writing 0 into the device_status field.
  2. Set the ACKNOWLEDGE status bit (bit 0). The guest OS noticed the device.
  3. Set the DRIVER status bit (bit 1). The guest OS knows how to drive the device.
  4. Read the device feature bits, and write the subset of feature bits understood by the OS and driver to the device. During this step, the driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device before accepting it.
  5. Set the FEATURES_OK status bit (bit 3). The driver MUST NOT accept new feature bits after this step.
  6. Re-read the device status to ensure the FEATURES_OK bit is still set. Otherwise, the device does not support the features we requested.
  7. Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup, reading and possibly writing the device’s virtio configuration space, and population of the virtqueues.
  8. Set the DRIVER_OK status bit. At this point the device is “live”.

The virtio system uses physical memory addresses. So, they must be translated before setting them. The following pseudocode essentially completes the 8 steps above. However, you should be doing the “checking” between setting bits.

volatile VirtioPciDevice *priv = pcie_rng_device; // pseudocode: our per-device structure

// The specification asks that we check stuff in between 
// each of the following steps.
priv->common_cfg->device_status = VIRTIO_DEV_RESET; // RESET (0)
priv->common_cfg->device_status = VIRTIO_DEV_STATUS_ACKNOWLEDGE; // ACKNOWLEDGE (1)
priv->common_cfg->device_status |= VIRTIO_DEV_STATUS_DRIVER; // DRIVER (2)
// Read features...etc
priv->common_cfg->device_status |= VIRTIO_DEV_STATUS_FEATURES_OK; // FEATURES_OK (8)

// Setup queue 0
priv->common_cfg->queue_select = 0;
u16 qsize = priv->common_cfg->queue_size;

// Descriptor table
virt = (uint64_t)kzalloc(16 * qsize);
vq.desc = (struct VirtqDescriptor *)virt;
priv->common_cfg->queue_desc = mmu_translate(kernel_mmu_table, virt);

// Driver ring (aka available ring)
virt = (uint64_t)kzalloc(6 + 2 * qsize);
vq.driver = (struct VirtqAvail *)virt;
priv->common_cfg->queue_driver = mmu_translate(kernel_mmu_table, virt);

// Device ring (aka used ring)
virt = (uint64_t)kzalloc(6 + 8 * qsize);
vq.device = (struct VirtqUsed *)virt;
priv->common_cfg->queue_device = mmu_translate(kernel_mmu_table, virt);

// Enable the queue (AFTER setting it up!)
priv->common_cfg->queue_enable = 1;
// Make the device LIVE
priv->common_cfg->device_status |= VIRTIO_DEV_STATUS_DRIVER_OK; // DRIVER_OK (4)

The amount of memory we allocate is based on the queue size. The 16 * qsize for the descriptor table, for example, is because each descriptor table entry is 16 bytes, and there are queue_size entries in the table. For the driver ring, we allocate 6 + 2 * qsize bytes because there are three u16s (flags, idx, used_event), and each element in the ring is a u16 (2 bytes). The device ring is much the same, except each ring element is a UsedElem structure, which is two u32s, or 8 bytes, giving 6 + 8 * qsize.
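The same arithmetic, written with sizeof so the magic numbers are visible (the struct definitions are repeated here so the sketch is self-contained; the trailing u16 in each ring is used_event/avail_event):

```c
#include <stddef.h>
#include <stdint.h>

struct virtq_desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct virtq_used_elem { uint32_t id; uint32_t len; };

/* Sketch: bytes needed for each virtqueue section at a given queue size. */
static size_t desc_table_bytes(uint16_t qsize)
{
    return sizeof(struct virtq_desc) * qsize;  /* 16 * qsize */
}

static size_t driver_ring_bytes(uint16_t qsize)
{
    /* flags + idx + ring[qsize] + used_event, all u16 */
    return sizeof(uint16_t) * (3 + qsize);     /* 6 + 2 * qsize */
}

static size_t device_ring_bytes(uint16_t qsize)
{
    /* flags + idx + avail_event (u16 each) + ring[qsize] of 8-byte elems */
    return 3 * sizeof(uint16_t)
         + sizeof(struct virtq_used_elem) * qsize; /* 6 + 8 * qsize */
}
```

Computing the sizes this way means a changed struct definition cannot silently disagree with a hand-written constant.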

Lastly, when we allocate memory, we have to make sure that the memory is physically contiguous. If our allocator (kmalloc) only guarantees virtually contiguous pages, there may be a break in the physical addresses those pages map to, and the device would read or write the wrong memory.


Debugging

You can use the QEMU console to help debug your virtio drivers.

For example, to see if a device has been recognized, you can type info virtio, which prints something like the following.

(qemu) info virtio
/machine/peripheral/blk1/virtio-backend [virtio-blk]
/machine/peripheral/gpu/virtio-backend [virtio-gpu]
/machine/peripheral/rng/virtio-backend [virtio-rng]
/machine/peripheral/tablet/virtio-backend [virtio-input]
/machine/peripheral/keyboard/virtio-backend [virtio-input]

With these names, such as /machine/peripheral/blk1/virtio-backend, we can get more detailed information.

(qemu) info virtio-status /machine/peripheral/blk1/virtio-backend
/machine/peripheral/blk1/virtio-backend:
  device_name:             virtio-blk
  device_id:               2
  vhost_started:           false
  bus_name:                (null)
  broken:                  false
  disabled:                false
  disable_legacy_check:    false
  started:                 true
  use_started:             true
  start_on_kick:           false
  use_guest_notifier_mask: true
  vm_running:              true
  num_vqs:                 4
  queue_sel:               0
  isr:                     0
  endianness:              little
  status:
        VIRTIO_CONFIG_S_ACKNOWLEDGE: Valid virtio device found,
        VIRTIO_CONFIG_S_DRIVER: Guest OS compatible with device,
        VIRTIO_CONFIG_S_FEATURES_OK: Feature negotiation complete,
        VIRTIO_CONFIG_S_DRIVER_OK: Driver setup and ready
  Guest features:

        VIRTIO_BLK_F_FLUSH: Flush command supported,
        VIRTIO_BLK_F_WRITE_ZEROES: Write zeroes command supported,
        VIRTIO_BLK_F_DISCARD: Discard command supported,
        VIRTIO_BLK_F_TOPOLOGY: Topology information available,
        VIRTIO_BLK_F_BLK_SIZE: Block size of disk available,
        VIRTIO_BLK_F_GEOMETRY: Legacy geometry available,
        VIRTIO_BLK_F_SEG_MAX: Max segments in a request is seg_max
  Host features:
        VIRTIO_RING_F_EVENT_IDX: Used & avail. event fields enabled,
        VIRTIO_RING_F_INDIRECT_DESC: Indirect descriptors supported,
        VIRTIO_F_RING_RESET: Driver can reset a queue individually,
        VIRTIO_F_VERSION_1: Device compliant for v1 spec (legacy),
        VIRTIO_F_ANY_LAYOUT: Device accepts arbitrary desc. layouts,
        VIRTIO_F_NOTIFY_ON_EMPTY: Notify when device runs out of avail. descs. on VQ
        VHOST_USER_F_PROTOCOL_FEATURES: Vhost-user protocol features negotiation supported,
        VIRTIO_BLK_F_CONFIG_WCE: Cache writeback and writethrough modes supported,
        VIRTIO_BLK_F_FLUSH: Flush command supported,
        VIRTIO_BLK_F_WRITE_ZEROES: Write zeroes command supported,
        VIRTIO_BLK_F_DISCARD: Discard command supported,
        VIRTIO_BLK_F_MQ: Multiqueue supported,
        VIRTIO_BLK_F_TOPOLOGY: Topology information available,
        VIRTIO_BLK_F_BLK_SIZE: Block size of disk available,
        VIRTIO_BLK_F_GEOMETRY: Legacy geometry available,
        VIRTIO_BLK_F_SEG_MAX: Max segments in a request is seg_max
  Backend features:

(qemu)

You can also observe the virtqueues by using info virtio-queue-status.

(qemu) info virtio-queue-status /machine/peripheral/blk1/virtio-backend 0
/machine/peripheral/blk1/virtio-backend:
  device_name:          virtio-blk
  queue_index:          0
  inuse:                0
  used_idx:             690
  signalled_used:       0
  signalled_used_valid: false
  last_avail_idx:       690
  shadow_avail_idx:     690
  VRing:
    num:          256
    num_default:  256
    align:        4096
    desc:         0x00000000801fc680
    avail:        0x00000000801fd680
    used:         0x00000000801fd888
(qemu)

Finally, you can view an element in a queue using info virtio-queue-element as shown below.

(qemu) info virtio-queue-element /machine/peripheral/blk1/virtio-backend 0
/machine/peripheral/blk1/virtio-backend:
  device_name: virtio-blk
  index:   22
  desc:
    descs:
        addr 0x801ff350 len 1 (write)
  avail:
    flags: 0
    idx:   690
    ring:  22
  used:
    flags: 0
    idx:   690
(qemu)

Trap Vector

Recall that when you notify the device, it will respond to your request via an interrupt. This means that after an external interrupt is triggered via the PLIC, the HART configured for that particular interrupt will jump to the memory address in the stvec register. If this register is not configured properly, you will get an instruction page fault, since the hart will try to fetch its next instruction from memory address 0.

Recall that you can type info registers to print out the registers and observe the stvec register.

(qemu) info registers

CPU#0
 V      =   0
 pc       0000000080014e24
 mhartid  0000000000000000
 mstatus  8000000a000060aa
 hstatus  0000000200000000
 vsstatus 0000000a00000000
 mip      0000000000000000
 mie      0000000000000aaa
 mideleg  0000000000001666
 hideleg  0000000000000000
 medeleg  000000000000b1f7
 hedeleg  0000000000000000
 mtvec    0000000080002f10
 stvec    0000000080018140

... SNIP ...