## Introduction

We learned about the nomenclature of computer components, looked at the power of numbers in binary, and looked at digital logic and how we can put these together to produce a result. Now, we will see how we can put some of these together to process data. I use the term process because we can add, subtract, multiply, divide, and so forth.

The most basic central processing unit can be divided into three parts: (1) control unit, (2) arithmetic and logic unit (ALU), and (3) register file. The control unit is responsible for retrieving an instruction, decoding it, gathering the operands by reading the registers or the immediate value, and then sending them to the arithmetic and logic unit (or floating-point unit).

All of the screenshots and information can be found in the riscv-spec PDF document in chapters 24 and 25.

## RISC-V Register File

The chart above shows 32 integer registers, denoted with an x0 through x31 and 32 floating point registers, denoted with an f0 through f31. These are the direct register names. All of the registers can be used as source and destination. However, in order for your code to work with C or C++, there must be a standard. This standard is called the application binary interface or ABI.

All integer registers are 64 bits for a 64-bit machine. All floating point registers are 32 bits for a single-precision-only machine, or 64 bits for machines that support double precision. In this class, both the integer registers and floating point registers are 64 bits.

### Application Binary Interface

The ABI register names have a specific purpose. For example, the a0 through a7 registers are the argument registers, t0 through t6 are temporary registers. There are other types of registers, such as the sp (stack pointer) and s0 – s11, which are the saved registers.

In order for our code to work correctly, we must follow these rules. If everyone follows these rules, we can coexist with C or C++ or even assembly language someone else already wrote.

The ABI is generally about functions. So, let’s take a C++ function as an example.

int func(int a, char b, short c, int *d);

The prototype above has three parts: (1) function name, (2) parameter list, and (3) return type. If we look at what this would look like in RISC-V assembly, we would see something like this:

a0 func(a0, a1, a2, a3);

As we can see above, we return a value by putting it into the register a0. This is why the chart shows a0 as both an argument register and the return register. We can also see that the first parameter is a0, the second is a1, the third is a2, and the fourth is a3. No matter if the parameter is a char, int, short, or even a pointer, it still goes into the register–we just use a portion of the register instead of the full thing.
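The register assignment described above can be sketched in a few lines of Python. This is only an illustration: the helper name `assign_args` is hypothetical, and it assumes every parameter is an integer or pointer type (floating point parameters use a different set of registers) and that there are at most eight of them.

```python
# Argument registers a0 through a7, in the order the ABI hands them out.
ARG_REGS = [f"a{i}" for i in range(8)]

def assign_args(params):
    """Map integer/pointer parameter names to argument registers (hypothetical helper)."""
    if len(params) > len(ARG_REGS):
        raise ValueError("this sketch only handles up to eight register arguments")
    return {name: reg for name, reg in zip(params, ARG_REGS)}

# int func(int a, char b, short c, int *d);
mapping = assign_args(["a", "b", "c", "d"])
return_register = ARG_REGS[0]  # the return value comes back in a0
```

Note that the parameter type never changes which register is chosen, only how much of the register is meaningful.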

There is a difference between caller saved and callee saved registers. A function is free to overwrite any caller saved register without regard for its previous value, so a caller that still needs that value must save it before making the call. A function may also use a callee saved register, provided it restores the original value of the register before the function returns.

The final special register we will talk about is called the stack pointer. Local variables are stored on the stack, which is a contiguous piece of memory that grows from high memory to low memory addresses. The stack pointer initially points to the very bottom of the stack. So, if we want to allocate space on the stack, we subtract the number of bytes we want from the stack pointer. The stack pointer must be aligned to 16 bytes. This means that the stack pointer must always be a multiple of 16. So, if we need 3 bytes from the stack, we must subtract 16 bytes. If we need 12 bytes from the stack, we must subtract 16. Alignment is used to make the memory controller’s job easier.

The following shows how to allocate enough bytes for an integer and three chars, which is a total of 7 bytes. So, we need to round 7 up to the nearest multiple of 16, which is 16 itself.

# Allocate
addi sp, sp, -16
# integer at bytes sp+0, sp+1, sp+2, and sp+3
# char at byte sp+4
# char at byte sp+5
# char at byte sp+6
# bytes 7, 8, 9, 10, 11, 12, 13, 14, 15 are all empty.

# Deallocate
addi sp, sp, 16
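
The rounding rule for stack allocations can be written as a small calculation. The function name `stack_bytes` is made up for this sketch; it simply rounds a byte count up to the next multiple of 16.

```python
STACK_ALIGN = 16  # the stack pointer must stay a multiple of 16

def stack_bytes(needed):
    """Round a requested byte count up to the next multiple of 16."""
    return (needed + STACK_ALIGN - 1) // STACK_ALIGN * STACK_ALIGN

# An integer plus three chars needs 7 bytes, so we subtract 16 from sp.
frame = stack_bytes(7)
```

Anything from 1 to 16 bytes costs 16 bytes of stack, and a request for 17 bytes would cost 32.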

## Instruction Formats

An instruction is simply a string of 0s and 1s that contains the information needed to tell the CPU which instruction you want to execute and the parameters for that instruction. In RISC-V, each standard instruction is 32 bits. This means that all the information to instruct the CPU needs to fit within 32 bits. RISC-V also contains a compressed format that allows for 16-bit instructions.

The portion of the instruction that tells the CPU what instruction to execute is known as the operation code (opcode). In RISC-V, this is the last (least significant) 7 bits of the instruction, bits 6 through 0, and it tells the CPU whether to add, subtract, multiply, jump, branch, and so forth. For all 32-bit instructions, the last two bits of the 7-bit opcode are always $$11_2$$. So, in all, there is effectively a 5-bit opcode.

The table of opcodes that RISC-V understands is listed below. An opcode might not narrow down to a single instruction. Instead, it might narrow down to a smaller subset of instructions. In the latter case, it requires a subopcode to further narrow down the instruction.

As you can see in the figure above, the three-bit column is made up of bits 4, 3, and 2 of the instruction, whereas the two-bit row is made up of bits 6 and 5. We talked about a multiplexor in the digital logic portion, and as you can see, the 5-bit opcode above is the selector for a multiplexor.
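
To make the row and column lookup concrete, here is a minimal sketch that pulls the three pieces out of an instruction word. The function name `split_opcode` is hypothetical; the bit positions are the ones described above.

```python
def split_opcode(instr):
    """Return (low two bits, row bits 6:5, column bits 4:2) of a 32-bit instruction."""
    opcode = instr & 0x7F          # bits 6 through 0
    low2 = opcode & 0b11           # 0b11 for every 32-bit instruction
    column = (opcode >> 2) & 0b111 # bits 4, 3, and 2
    row = (opcode >> 5) & 0b11     # bits 6 and 5
    return low2, row, column

# 0xFF018413 has opcode 0b001_0011: row 0b00, column 0b100.
low2, row, column = split_opcode(0xFF018413)
```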

If we delve deeper into an instruction, we can see that it can contain additional information, such as the source operands and where we want the CPU to store the result. There are several different types of instructions, however each instruction is still 32 bits. The type of instruction just distinguishes how the bits will be interpreted.

### Register Type (R-type) Instruction Format

As you can see above, the opcode is always bits 0 through 6 (7 bits). The R-type stands for register type, meaning that the source operands are rs1 (register source 1) and rs2 (register source 2) and the destination is denoted by rd (register destination). We also talked about subopcodes, which we can see in the R-type with funct3, which is a 3-bit extension to the opcode, and funct7, which is a 7-bit extension to the opcode. When combined, this forms a 17-bit opcode that tells the CPU what instruction to run.
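
As a sketch of how the R-type fields pack into one word, the following helper shifts each field into place. The name `encode_r` is invented for illustration, and the field positions (funct7 at bit 25, rs2 at bit 20, rs1 at bit 15, funct3 at bit 12, rd at bit 7) are taken from the R-type layout in the RISC-V specification rather than from the text above.

```python
def encode_r(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack the six R-type fields into a 32-bit instruction word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

OP = 0b0110011  # the OP row/column of the opcode table

# add a0, a1, a2  ->  rd = x10, rs1 = x11, rs2 = x12, funct3 = 0, funct7 = 0
add_word = encode_r(0b0000000, 12, 11, 0b000, 10, OP)
# sub differs from add only in funct7
sub_word = encode_r(0b0100000, 12, 11, 0b000, 10, OP)
```

The add and sub words share the same opcode and funct3, which shows why funct7 is needed as an additional subopcode.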

### Four Register Type (R4-type) Instruction Format

This instruction format is mainly used for floating point instructions that require four different registers encoded in one instruction.

You can see that we need this when encoding one destination and three source registers, such as the instruction fmadd.s rd, rs1, rs2, rs3. This performs the following: rd = rs1 * rs2 + rs3.

### Immediate Type (I-type) Instruction Format

An immediate type uses one register-based source operand (rs1), but the second operand is known as an immediate, which is a small integer literally encoded in the instruction itself. If you remember, all registers are 64 bits for RISC-V64, but for the I-type instruction, we have a 12-bit immediate in bits 31 through 20. This means we will need to widen the 12-bit immediate to a 64-bit value. In RISC-V, the I-type immediate is sign-extended.
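
Widening a 12-bit immediate by sign extension can be sketched as follows. The helper name `sign_extend` is hypothetical, and Python integers are unbounded, so the last line masks the result to show the 64-bit register pattern.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

imm = sign_extend(0xFF0, 12)           # 12-bit pattern with the sign bit set
as_64_bits = imm & 0xFFFF_FFFF_FFFF_FFFF  # what the 64-bit register would hold
```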

The LOAD set of instructions are also I-type instructions. They have the following format:

l<b[u]|h[u]|w[u]|d>  rd, signed_imm(rs1)

Since the immediates of all LOAD instructions are signed, you must sign-extend the immediate. This immediate is directly added to the value rs1 to form an effective memory address. This then becomes the memory address requested by the memory controller.

Do NOT confuse LBU and LB. They both still have a signed immediate. The difference is that LBU will zero-extend the value returned by the memory controller, whereas LB will sign-extend the value returned by the memory controller.

You can also see that all of the LOAD instructions have the same opcode. So, the funct3 becomes the data size and signedness. The leftmost bit of funct3 is 0 if the instruction sign-extends, and it will be 1 if the instruction zero-extends. The lower (rightmost) two bits describe the number of bytes to load. $$00_2$$ is 1 byte, $$01_2$$ is 2 bytes, $$10_2$$ is 4 bytes, and $$11_2$$ is 8 bytes.
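
The funct3 rule for loads is compact enough to express directly. This is a small sketch with a made-up function name, `decode_load_funct3`, that follows the description above.

```python
def decode_load_funct3(funct3):
    """Return (bytes loaded, zero_extends) for a LOAD instruction's funct3."""
    zero_extends = bool(funct3 & 0b100)  # leftmost bit: 1 means zero-extend
    size_bytes = 1 << (funct3 & 0b011)   # 00 -> 1, 01 -> 2, 10 -> 4, 11 -> 8
    return size_bytes, zero_extends

lb = decode_load_funct3(0b000)
lbu = decode_load_funct3(0b100)
```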

### Store Type (S-type) Instruction Format

The instruction format above has the STORE opcode, so the store selection (store byte, store half, store word, store doubleword) is all based on the 3-bit subopcode funct3. The important part about this instruction format is to differentiate between register source 1 (rs1) and register source 2 (rs2).

The store has a format like the following.

sh  a0, -4(sp)    # sh  src, offset(base)

The instruction above will take the halfword (16 bits) from the register a0 and store it in the memory location pointed to by sp + -4. You can see from the instruction above that the rs2 is the source register. In other words, it’s the value you want to store into memory. Secondly, you can see that the rs1 register is the base register, which holds the base memory address that the offset is added to.

Lastly, the offset is a 12-bit value, but in the store format, it is split between bits 31-25 and bits 11-7.

You can see from the SB, SH, and SW instructions above, that the opcode is the same, but the subopcode (funct3) determines the number of bytes that will be stored.

### Branch Type (B-type) Instruction Format

A branch instruction is an instruction that compares the values in two registers and jumps to a section of code based on the condition. For example, BNE stands for “branch if not equal”. So, in this case, we need two registers to compare, a condition code, and an offset to go to. The immediate field in the encoded instruction is the signed-offset.

The important part about a branch instruction is that the offset is PC-relative. This means that it is a signed-offset that is added to the program counter (PC). Recall that the program counter contains the address of the instruction the CPU will execute.

So, in this instruction, we can see the condition code is encoded in the funct3 field, which is a three-bit extended opcode. We can also see the immediate is somewhat odd. Notice that bit 0 is not stored in the immediate. This means that every branch offset has to be a multiple of 2. The reason is that the smallest instruction is 2 bytes (a compressed instruction). Therefore, every instruction address has a 0 in the one’s digit, and there is no need to store that bit.

When a branch instruction is “taken”, meaning the condition is true, then the immediate is added to the program counter (PC). Otherwise, the program counter moves to the very next instruction.
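
The following sketch reassembles the scattered branch immediate and picks the next program counter. The function names are hypothetical. The bit positions come from the B-type layout in the RISC-V specification (imm[12] in bit 31, imm[10:5] in bits 30 through 25, imm[4:1] in bits 11 through 8, imm[11] in bit 7), and the not-taken case assumes a 4-byte, uncompressed branch.

```python
def b_imm(instr):
    """Rebuild the signed, PC-relative offset from a B-type instruction word."""
    imm = ((instr >> 31) & 0x1) << 12
    imm |= ((instr >> 7) & 0x1) << 11
    imm |= ((instr >> 25) & 0x3F) << 5
    imm |= ((instr >> 8) & 0xF) << 1   # bit 0 is never stored; it is always 0
    return imm - (1 << 13) if imm & (1 << 12) else imm

def next_pc(pc, instr, taken):
    """Taken: add the offset to the PC. Not taken: fall through to pc + 4."""
    return pc + b_imm(instr) if taken else pc + 4

BNE_BACK_4 = 0xFEB51EE3  # bne a0, a1, with an offset of -4
offset = b_imm(BNE_BACK_4)
```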

### Upper Type (U-type) Instruction Format

Since we can only encode an instruction using a maximum of 32 bits, we sometimes need two instructions to write a full 32-bit or 64-bit value into a register. The upper-type format helps here by dedicating a larger portion of the instruction, 20 bits, to the immediate.

Notice that the immediate only stores bits 31 through 12. These instructions will set bits 31 through 12 of the register to the given immediate. Recall that with immediate-type instructions, the lower immediate is 12 bits. So, we can store the upper 20 bits using load upper immediate (lui), and then the lower 12 bits using an instruction such as addi.
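
Splitting a 32-bit constant into a lui part and an addi part can be sketched as below. One detail that is easy to miss: addi sign-extends its 12-bit immediate, so when bit 11 of the constant is set, the upper part has to be bumped by one to compensate. The helper name `split_hi_lo` is invented for this sketch.

```python
def split_hi_lo(value):
    """Split a 32-bit value into (20-bit lui immediate, signed 12-bit addi immediate)."""
    value &= 0xFFFF_FFFF
    hi20 = ((value + 0x800) >> 12) & 0xFFFFF  # round up when the low part is negative
    lo12 = value & 0xFFF
    if lo12 >= 0x800:
        lo12 -= 0x1000                        # addi will sign-extend this
    return hi20, lo12

def rebuild(hi20, lo12):
    """What lui followed by addi produces, truncated to 32 bits."""
    return ((hi20 << 12) + lo12) & 0xFFFF_FFFF

simple = split_hi_lo(0x12345678)
tricky = split_hi_lo(0x12345FFF)
```

In the second case the lower 12 bits become -1, so the upper immediate is 0x12346 rather than 0x12345.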

### Jump Type (J-type) Instruction Format

We saw that the branch instructions are PC-relative, but two registers eat up some of the offset. With the jump instructions, we still have a PC-relative, signed offset, but now we have more space. In fact, we now have 20 bits instead of 12.

Notice that most of the instruction encoding is dedicated to storing the lower 20 bits, except the one’s place. Again, just like with the branch instruction, the immediate is at least a multiple of 2 since all instructions are at least 2 bytes.

### Encoding Instructions Example

Encoding is also known as “assembling”, where human-readable instructions are turned into 0s and 1s.

sh  a0, -8(sp)

The registers a0 and sp are called ABI-format registers. These have a number between 0-31, which we can see at the register chart. The a0 register is x10 and the sp register is x2, so 10 (or 0b01010) and 2 (or 0b00010).

We then look at the SH (store halfword) instruction:

We need to fill in imm, rs2, and rs1. rs2 is the “source register”, which in this case is the A0 register, which we converted to X10. The rs1 register is the “base register”, or the base of the memory address, which is the SP register or X2. The immediate is broken into two pieces, bits 11 through 5 are on the left and bits 4 through 0 are on the right. So, we know this will be a 12-bit immediate (11-0). Our immediate (offset) is -8, so we need to write -8 using 12 bits:

-8 = ~(8) + 1
-8 = ~0b0000_0000_1000 + 1
-8 = 0b1111_1111_1000

The upper 7 bits will be 0b1111_111, and the lower 5 bits will be 0b1_1000.

Now to convert our registers. Each register field (rs2 and rs1) uses 5 bits, which allows numbers from 0 – 31. So, X10 using 5 bits is 0b01010 and X2 using 5 bits is 0b00010.

Now we have all of the pieces of information we need to create a full 32-bit instruction.

imm[11:5] = 0b1111_111
rs2       = 0b0_1010
rs1       = 0b0_0010
funct3    = 0b001
imm[4:0]  = 0b1_1000
opcode    = 0b010_0011

Put it all together:
1111_111  0_1010  0_0010  001  1_1000  010_0011

Group into fours:
1111 1110 1010 0001 0001 1100 0010 0011

Convert to hex
F    E    A    1    1    C    2    3

So,
sh  a0, -8(sp)

is,
0xFEA1_1C23
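
The same hand encoding can be double-checked with a short script. The function name `encode_s` is hypothetical; it simply places the fields listed above into their S-type positions.

```python
STORE = 0b0100011  # opcode shared by every store instruction

def encode_s(imm, rs2, rs1, funct3, opcode=STORE):
    """Pack an S-type instruction, splitting the 12-bit immediate in two."""
    imm &= 0xFFF                      # keep the 12-bit two's complement pattern
    return (((imm >> 5) << 25)        # imm[11:5] goes to bits 31-25
            | (rs2 << 20)
            | (rs1 << 15)
            | (funct3 << 12)
            | ((imm & 0x1F) << 7)     # imm[4:0] goes to bits 11-7
            | opcode)

# sh a0, -8(sp): rs2 = a0 = x10, rs1 = sp = x2, funct3 = 0b001
word = encode_s(-8, rs2=10, rs1=2, funct3=0b001)
print(hex(word))
```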

### Decoding Instructions Example

Decoding an instruction is also called “disassembling”, where we convert the 0s and 1s into the assembly instructions we can read.

Convert: 0xFF01_8413

The first thing we need to do is to convert to binary so we can look at the 7-bit opcode:

0xFF01_8413 = 0b1111_1111_0000_0001_1000_0100_0[001_0011]
Opcode = 0b001_0011

We know this is a 32-bit instruction since the last two bits of the opcode are 0b11. Now, we need to look at the chart to see which type of instruction this is.

Our bits 6:5 are 0b00 and our bits 4:2 are 0b100. The last two bits are cut off since this chart is only for opcodes with the last two bits of 0b11.

Looking at the chart, we can see that column 0b100 and row 0b00 gives us the OP-IMM, meaning it is an operation that decodes an immediate, such as xori, addi, ori, srli, and so forth. If this was an R-type, it would simply be the OP category, and lastly OP-FP means “floating point”, such as fadd, fmul, etc.

Now that we know this is an immediate, we can decode the portions of an I-type instruction:

The instruction formats table above has the I-type, which means we need to decode: imm[11:0], rs1, funct3, rd, and we already have the opcode, so getting our original number:

0xFF01_8413 = 0b1111_1111_0000_0001_1000_0100_0001_0011
0b[1111_1111_0000] [00011] [000]   [01000] [001_0011]
imm[11:0]        rs1   funct3    rd     opcode

imm[11:0] means a 12-bit immediate value, rs1 means the “source” register, and rd means the “destination” register. We can now convert the values:

imm[11:0] = 0b1111_1111_0000 (sign bit is 1)
-(~0b1111_1111_0000 + 1) = -(0b0000_0001_0000) = -16.

rs1 = 0b00011 = x3 = gp
rd  = 0b01000 = x8 = s0

Now we have to look at the opcode and funct3 to determine which instruction this is:

So, our funct3 is 000, and our opcode is 001_0011, which matches the ADDI instruction. Now, to put it all together, we get:

addi s0, gp, -16
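
The decoding steps can also be sketched in code. The helper names below are invented for illustration, and the disassembler only recognizes ADDI, which is all this example needs.

```python
# ABI names for x0 through x31.
ABI_NAMES = (["zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2", "s0", "s1"]
             + [f"a{i}" for i in range(8)]
             + [f"s{i}" for i in range(2, 12)]
             + [f"t{i}" for i in range(3, 7)])

def decode_i(instr):
    """Split an I-type instruction word into its fields."""
    imm = (instr >> 20) & 0xFFF
    if imm & 0x800:            # sign bit of the 12-bit immediate
        imm -= 0x1000
    return {
        "imm": imm,
        "rs1": (instr >> 15) & 0x1F,
        "funct3": (instr >> 12) & 0x7,
        "rd": (instr >> 7) & 0x1F,
        "opcode": instr & 0x7F,
    }

def disassemble_addi(instr):
    """Turn an ADDI instruction word back into assembly text."""
    f = decode_i(instr)
    if f["opcode"] != 0b0010011 or f["funct3"] != 0b000:
        raise ValueError("this sketch only understands addi")
    return f"addi {ABI_NAMES[f['rd']]}, {ABI_NAMES[f['rs1']]}, {f['imm']}"

text = disassemble_addi(0xFF018413)
```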

## Instruction Pipeline

To execute an instruction, several things need to take place. An oscillator is used as a clock to keep moving the program counter. So, for a single-cycle CPU, the instruction must be fetched, decoded, executed, and the result must be stored all within one cycle of the clock.

Instead of using a single-cycle, we can pipeline, which essentially splits these into five stages. The five stages are (in order): (1) instruction fetch (IF), (2) instruction decode (ID), (3) execute (EXE), (4) memory (MEM), and (5) write-back (WB). This five stage pipeline is known as the 5-stage RISC pipeline. This is mainly an academic pipeline, as many pipelines are much longer and some are shorter.

### (1) Instruction Fetch (IF)

The instruction fetch stage will read an instruction from RAM. Some systems have a separate instruction memory versus data memory, so the IF stage will read an instruction from the instruction RAM. However, most systems have an integrated RAM, where instructions and data are in the same bank of RAM.

The instruction fetch stage gets the address (in RAM) of the instruction to fetch from the program counter register, abbreviated PC. The PC contains the memory address from which 4 bytes are loaded. Those 4 bytes contain the instruction in one of the formats we discussed above.

### (2) Instruction Decode (ID)

The instruction decode stage needs to read the operands from the register file. However, some instructions have an encoded immediate, such as the addi instruction. In this case, the decode stage needs to widen the immediate.

### (3) Execute (EXE)

The execute stage uses a function unit called the arithmetic and logic unit (ALU) or the floating point unit (FPU) depending on which instruction was executed. This is where add actually adds the numbers or xor actually performs the exclusive or operation on the two operands.

### (4) Memory (MEM)

This stage is reserved for the load and store instructions. Those instructions that require something from RAM or something to RAM will do so at this stage. If we think back to a store instruction, we store the value at *(base + offset). So, the execute stage will add the offset to the base, then at the memory stage (this stage), the value will actually get stored. Same thing happens with a load instruction.

### (5) Write-back (WB)

The write-back stage is where any destination register (rd) gets updated with the actual result. This is interesting because the value is actually calculated at stage three (execute), but we don’t actually store the result until this stage, stage five.

There are issues that arise from the way pipelining is structured. For example, take a look at the following RISC-V code.

addi a0, zero, 10
sub  t0, a0, a1

If we look at the pipeline stages above, we can see the following.

addi a0, zero, 10   IF - ID - EXE - MEM - WB
sub  t0, a0, a1          IF - ID* - EXE - MEM - WB

Notice that the value of a0 is not known until stage five (writeback) of the addi instruction. However, the sub instruction needs the value of a0 during its stage two (instruction decode). So, if nothing is done, then the a0 will NOT be zero + 10. It will be whatever was in a0 before the addi instruction.

This is known as a read-after-write data hazard, since the problem is encountered when we try to read a register after it was previously written. This is due to when the register is written.

#### Operand Forwarding

We actually know the value of a0 above after the execute stage of the addi instruction. So, we can forward the output of the execute stage in the addi instruction to the execute stage of the sub instruction. This way, the pipeline can continue without any issues. This technique is known as operand forwarding. Without operand forwarding, we would have to stall the other stages of the sub instruction until the addi instruction had a chance to finish, which would give us the following.

addi a0, zero, 10   IF - ID - EXE - MEM - WB
sub  t0, a0, a1          IF - XXX - XXX - XXX - ID - EXE - MEM - WB

The XXX are known as no operations (nops for short), also known as pipeline stalls, also known as pipeline bubbles. They have no real effect except to wait until the register a0 was written before the sub instruction was executed. We want to avoid pipeline stalls as much as possible. You can see above with all the stalls, the pipeline really has no effect. We might as well run the addi instruction to completion and then execute the sub separately. Thus, this resembles a single-cycle CPU more than it resembles a pipelined CPU.
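
The cost of the stalls can be counted with a little arithmetic. This sketch assumes an ideal five-stage pipeline that starts one instruction per cycle unless it is stalled; the function name `total_cycles` is made up for illustration.

```python
STAGES = 5  # IF, ID, EXE, MEM, WB

def total_cycles(num_instructions, stall_cycles=0):
    """Cycles to finish a sequence on an ideal pipeline plus any stall cycles."""
    return STAGES + (num_instructions - 1) + stall_cycles

with_forwarding = total_cycles(2)        # addi and sub overlap fully
with_stalls = total_cycles(2, 3)         # three bubbles before sub can decode
unpipelined = 2 * STAGES                 # run addi to completion, then sub
```

With three bubbles, the two instructions take 9 cycles, barely better than the 10 it would take to run them back to back with no overlap at all.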

### Control Hazard

Another issue that can arise from pipelining is known as a control hazard, also known as a branch hazard. These arise from branch instructions, such as the following.

beq a0, a1, here
add t0, t1, t2
here:
sub t0, t1, t2

The issue is that until the execute stage, we have no idea whether to fetch the add instruction or the sub instruction. If we load the add instruction and the branch is NOT taken, then there are no issues. However, if we load the add instruction and then the branch is taken, we have to flush the pipeline to remove all remnants of the add instruction.

#### Branch Prediction

One way to help minimize the effect of branch hazards is to load the most likely instruction. Sometimes a pipeline stall to flush a branch is unavoidable. However, we can use a branch predictor to help predict what is the most likely outcome of a branch instruction.

I will not go into detail on how a branch predictor is designed, but a branch predictor stores the result of every branch up to some limit. When the branch is executed next, the CPU consults the latest branch. If the branch was previously taken, the CPU will load the branch taken. Obviously, just because we took the branch previously doesn’t mean we will again. So, as I mentioned above, flushing the pipeline is not always avoidable.
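
A minimal sketch of the "do what we did last time" idea follows. It is not how any particular CPU implements prediction. The class name is hypothetical, the table is unbounded for simplicity, and it assumes a branch that has never been seen is predicted not taken.

```python
class LastOutcomePredictor:
    """Remembers the most recent outcome of each branch, keyed by its address."""

    def __init__(self):
        self.history = {}

    def predict(self, pc):
        # Assumption: an unseen branch is predicted not taken.
        return self.history.get(pc, False)

    def update(self, pc, taken):
        self.history[pc] = taken

def count_mispredictions(predictor, pc, outcomes):
    """Run a sequence of actual outcomes and count the wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1  # each miss would flush the pipeline
        predictor.update(pc, taken)
    return misses

# A loop branch that is taken three times, then falls through.
misses = count_mispredictions(LastOutcomePredictor(), 0x100, [True, True, True, False])
```

The predictor misses on the first iteration and again when the loop exits, which illustrates why flushing the pipeline is not always avoidable.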

## Instruction Set Architecture (ISA)

The way instructions are laid out, the register types, and the addressing modes of a CPU define its instruction set architecture or ISA.

### Branch Instructions

Many of the branch instructions that you could think of, such as bgt, are pseudoinstructions, listed below. If a branch is not taken, meaning the condition is false, then the program counter is incremented by 4, which moves to the next instruction after the branch instruction.

### Jump Instructions

Jump instructions are typically used by programmers by using pseudo instructions, such as call (function call), j (unconditional jump), and ret (return from function call).

### Memory Instructions

The important aspect is that the load instructions will sign-extend unless you use the load that ends with a u (unsigned). For example, lb sign extends whereas lbu zero-extends.

All offsets are byte offsets. Unlike C++, assembly does not scale the offset by the data size (which C++ does automatically as part of pointer arithmetic).
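
To illustrate the scaling that assembly leaves to the programmer, here is a small sketch. The function name and the choice of a0 as the base register are only for illustration.

```python
def element_offset(index, element_size):
    """Byte offset of array element `index`, which C++ computes implicitly."""
    return index * element_size

# For an int array (4-byte elements), element 3 lives 12 bytes past the base.
offset = element_offset(3, 4)
instruction = f"lw t0, {offset}(a0)"
```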

### Floating-point Instructions

Floating point instructions are executed using the floating-point unit (FPU). So, integer registers must be translated into a floating point register by using the conversion instructions, such as fcvt. The fcvt instruction takes the data sizes separated by dots, such as: fcvt.s.w. This is in the format fcvt.dest.src. The source data type for fcvt.s.w is a word (int) and the destination is a single precision floating point (float). So, if I wanted to take a single-precision floating point number and convert it to an integer, I would use fcvt.w.s.

The fcvt instructions will convert to or from the IEEE-754 format into an integer format. However, we can move directly without any conversion using the fmv (floating point move). So, fmv.x.w will move an IEEE-754 floating point number into an integer register. Unless you know what to look for, the integer register will be nonsensical since it is still in IEEE-754 format.

The possibilities for a source or destination are: D (double precision), S (single precision), W (32-bit integer), or L (64-bit integer). You can convert between a double and a single precision value using fcvt.d.s which is from a single-precision floating point into a double-precision floating point.

Unlike the fcvt instruction, the fmv instruction performs no conversion. It is just a way to copy the raw bits from a floating point register to an integer register or vice-versa.
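
The difference between converting and moving can be imitated in Python with the standard `struct` module. The function names mirror the instruction names but are only stand-ins, and the conversion here simply truncates, which is a simplification of the rounding behavior of the real instruction.

```python
import struct

def fmv_x_w(value):
    """Copy the raw IEEE-754 single-precision bits into an integer, no conversion."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def fcvt_w_s(value):
    """Convert a float to an integer value (simplified: truncates toward zero)."""
    return int(value)

converted = fcvt_w_s(1.0)  # the number one
moved = fmv_x_w(1.0)       # the bit pattern of 1.0
print(converted, hex(moved))
```

The moved value looks nonsensical as an integer because it is still the IEEE-754 encoding of 1.0.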

Floating point comparisons are interesting, such as feq, flt, and fle. The destination register is an integer register, such as a0, t0, s0, etc. Then the operands, rs1 and rs2, are both floating point registers, such as fa0, ft0, or fs0. The integer register will be set to the value 1 if the comparison is true, or it will be set to the value 0 if the comparison is false.

The following shows how to branch comparing floating point numbers.

fle.s t0, fa0, fs0
bne t0, zero, yes_it_is_less_than_or_equal

As you can see, we can use the bne or beq instruction with the zero operand since the only two values that t0 can be set to in this case are 0 and 1. So, if we want to test if the floating point comparison was true, we would use bne since t0 would be 1 and NOT EQUAL to 0 in this case. Otherwise, if we want to test if the floating point comparison was false, we would use beq, because recall that t0 would be 0 if fle.s is false.

There are only three comparison instruction formats (feq, fle, and flt). Everything else can be derived from them. For example, the pseudo-instruction fgt will flip the operands and then use the flt instruction to compare them. For example,

fgt.s t0, fa0, fs0
flt.s t0, fs0, fa0

The two instructions are identical. Notice that the operands are simply switched so we can use the real instruction. Even though we have three floating point comparison instructions, we can derive fne, fge, and fgt.
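
The derivations can be sketched with plain functions that return 1 or 0 like the real comparisons do. The Python functions only model the results; swapping the operands gives fgt and fge, and inverting feq gives fne.

```python
# The three real comparisons, each producing 1 (true) or 0 (false).
def feq(a, b): return int(a == b)
def flt(a, b): return int(a < b)
def fle(a, b): return int(a <= b)

# Derived comparisons.
def fgt(a, b): return flt(b, a)      # a > b   is the same as  b < a
def fge(a, b): return fle(b, a)      # a >= b  is the same as  b <= a
def fne(a, b): return feq(a, b) ^ 1  # flip the 0/1 result of feq

results = (fgt(2.0, 1.0), fge(1.0, 1.0), fne(1.0, 1.0))
```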

### Pseudo Instructions

RISC-V is truly a reduced instruction set computer (RISC), so many of the common instructions have to use already existing instructions in a clever way. For example, there is no such thing as a pure jump instruction. Instead, we have the jump-and-link. However, we can emulate the jump instruction by using the zero register for rd. Recall that the zero register is hardwired to zero, so any writes to it are discarded.

To make the programmer’s job easier, the assembler, whose job it is to take assembly instructions and turn them into machine code, will allow the programmer to use fake instructions known as pseudo instructions.

Recall that in the floating-point section, we also saw the pseudo instructions fgt (floating point greater than). We can also derive fge (floating point greater than or equal to). These pseudo instructions are available in RARS and in GAS (GNU assembler).