UCB CS61C
Info
Study notes based on the UCB CS61C Summer 2020 & Fall 2020 online videos, along with the accompanying textbooks.
- Abstraction (Layers of Representation/Interpretation)
- Moore's Law
- Principles of Locality/Memory Hierarchy
- Parallelism
- Performance Measurement & Improvement
- Dependability via Redundancy
Great Idea #1: Levels of Representation & Interpretation¶
Number Representation¶
- Number Bases
We use \(d\times Base^i\) to represent a number; in computers, \(Base=2\), so we can also use bits to represent numbers.

- Signed Representations
- Sign and Magnitude
- Biased Notation
- One's Complement
- Two's Complement
- Overflow
- most significant bit (MSB): the leftmost bit
- least significant bit (LSB): the rightmost bit
- overflow: the proper result cannot be represented by the hardware bits
- Sign Extension
- Sign and Magnitude: move the sign bit to the new MSB, fill the bits in between with 0's
- One/Two's Complement: fill with MSB
Sidenote
The RISC-V word is 32 bits long.
Quick computation for \(-x\): since \(x+\bar{x}=-1\), we have \(-x=\bar{x}+1\).
Foundation of complement method:
Consider subtracting a large number from a small one: the subtraction would try to borrow from a string of leading 0s, so the result ends up with a string of leading 1s.
This gives us the idea that leading 0s mean positive, and leading 1s mean negative.
See CS:APP Chapter 2 for more.
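A quick sanity check of the identity in C (a minimal sketch, assuming the two's-complement `int` that RISC-V uses):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t x = 61;
    // x + ~x is all 1s, i.e. -1 in two's complement,
    // so -x must equal ~x + 1
    printf("x + ~x = %d\n", x + ~x); // -1
    printf("~x + 1 = %d\n", ~x + 1); // -61
    printf("-x     = %d\n", -x);     // -61
    return 0;
}
```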
High-Level Language Program (e.g. C)¶
C: Introduction¶
Basic Concepts¶
- Compilation
- Comparison
- C: compilers map C programs into architecture-specific machine code
- Java: converts to architecture-independent bytecode (run by JVM)
- Python: directly interprets the code
- Advantage
- Excellent runtime performance: compiler optimizes the code for the given architecture
- Fair compilation time: only modified files need recompilation
- Disadvantage
- Compiled files are architecture-specific
- "Edit \(\to\) Compile \(\to\) Run [repeat]" iteration cycle can be slow
- Variable Types
C Memory Management¶
The picture below shows the C memory layout.
From top to bottom, we have:
- Stack
- local variables, grows downward
- consists of procedure frames, uses a stack pointer (SP) to tell where the current stack frame is
- Heap
  - space requested via `malloc()` and used with pointers
  - resizes dynamically, grows upward
- Static Data
  - global and static variables, does not grow or shrink
  - string literals \(\in\) static data
    - `char *str = "hi";` belongs to static data, but `char str[] = "hi";` does not
  - size does not change, but sometimes data can
    - string literals cannot
- Code
- loaded when program starts, does not change
- typically read only
When it comes to memory management in C, there can be tons of memory problems. For a better understanding of that, refer to CS11L.
Basic Allocation Strategy: K&R Section 8.7
Each contiguous region of space allocated by a call to `malloc()` is called a block.
To manage those blocks, we use a linked list with each node pointing to a block.
When `free()` is called, we insert that block into the list and coalesce it with adjacent free blocks.
When `malloc()` is called, we first search for the blocks large enough for the allocation, then choose one using one of the methods below (a code sketch follows the list):

- Best-fit: choose the smallest block that fits. This leaves the fewest stranded small blocks, but gives the slowest service.
- First-fit: choose the first block that fits. This is fast, but tends to concentrate small blocks at the beginning of the list.
- Next-fit: like first-fit, but resume the search from where we last left off. Also fast, and it does not concentrate small blocks at the front.
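Below is a minimal first-fit sketch over such a free list. The `block_t` layout and function names are hypothetical simplifications, not K&R's actual implementation (which uses a circular list with sizes measured in header-sized units):

```c
#include <stddef.h>
#include <stdio.h>

// Hypothetical free-list node: each free block carries a small header.
typedef struct block {
    size_t size;        // usable payload bytes in this block
    struct block *next; // next free block in the list
} block_t;

static block_t *free_list; // head of the free list

// First-fit: unlink and return the first free block that is large enough.
static void *ff_alloc(size_t size) {
    block_t **prev = &free_list;
    for (block_t *b = free_list; b != NULL; prev = &b->next, b = b->next) {
        if (b->size >= size) {      // first fit wins
            *prev = b->next;        // unlink from the free list
            return (void *)(b + 1); // payload starts right after the header
        }
    }
    return NULL; // no fit: a real allocator would ask the OS for more memory
}

int main(void) {
    // Seed the list with one 64-byte block carved from a static buffer.
    static _Alignas(block_t) unsigned char arena[sizeof(block_t) + 64];
    block_t *b = (block_t *)arena;
    b->size = 64;
    b->next = NULL;
    free_list = b;

    printf("%p\n", ff_alloc(32)); // succeeds: 64 >= 32
    printf("%p\n", ff_alloc(32)); // NULL: the only block is gone
    return 0;
}
```

Best-fit would instead scan the whole list and remember the smallest fitting block; next-fit would keep the search cursor across calls.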
Floating Point¶
See CS:APP Ch.2 Floating Point.
Single Precision¶
Use a normalized, base-2 scientific representation: \((-1)^{Sign}\times 1.\mathrm{Significand}_2\times 2^{\mathrm{Exponent}-127}\)
Split a 32-bit word into 3 fields: Sign (1 bit), Exponent (8 bits), and Significand (23 bits).
Here, we use biased notation (\(bias=-127\)) for the exponent so that the ordering of bit patterns matches the ordering of values (linearity), while still being able to represent very small numbers.
Double Precision¶
Everything stays the same as Single Precision except the length of each field.
Special Cases¶

For single precision (double precision is analogous, with exponent field 0-2047):

| Exponent | Significand | Object |
|---|---|---|
| 0 | 0 | \(\pm 0\) |
| 0 | nonzero | denormalized number |
| 1-254 | anything | normalized floating point |
| 255 | 0 | \(\pm\infty\) |
| 255 | nonzero | NaN |
Limitations¶
- Overflow & Underflow
- Rounding occurs when the result runs off the end of the significand
- FP addition is not associative due to rounding errors (see the example after this list)
- As a result, parallel execution strategies that work for integer data may not work for FP
- FP cannot represent all integers
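A tiny C demonstration of the non-associativity mentioned above (the specific values are just an illustrative assumption):

```c
#include <stdio.h>

int main(void) {
    // 1.0 is far below the rounding granularity of floats near 1e20,
    // so it disappears when added to a huge value first
    float a = 1.0e20f, b = -1.0e20f, c = 1.0f;
    printf("(a + b) + c = %f\n", (a + b) + c); // 1.000000
    printf("a + (b + c) = %f\n", a + (b + c)); // 0.000000
    return 0;
}
```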
Assembly Language Program (e.g. RISC-V)¶
RISC-V: Introduction¶
- Mainstream Instruction Set Architecture: x86, ARM architectures, RISC-V
- Complex/Reduced Instruction Set Computing:
- Early trend: CISC
- difficult to learn
- less work for the compiler
- complicated hardware runs more slowly
- Later began to dominate: RISC
- easier to build fast hardware
- let software do the complicated operations by composing simpler ones
Assembly Language¶
- RISC Design Principles: smaller is faster, keep it simple.
- RISC-V Registers: `s0`-`s11`, `t0`-`t6`, `x0`
- RISC-V Instructions
  - Arithmetic: `add`, `sub`, `addi`
  - Data Transfer: `lw`, `sw`
    - Memory is byte-addressed
  - Control Flow: `beq`, `bne`, `j`
- Six Steps of Calling a Function
- Place arguments
- Jump to function
- Create local storage (Prologue)
- Perform desired task
- Place return value and clean up storage (Epilogue)
- Jump back to caller
- Calling Conventions
- Caller-saved registers (Volatile Registers)
- Callee-saved registers (Saved Registers)
- RISC-V Instruction Formats
- The Stored Program concept is very powerful.
Machine Language Program (RISC-V)¶
Translation vs. Interpretation¶
Interpreter: Directly executes a program in the source language.
Translator: Converts a program from the source language to an equivalent program in another language.
Translation vs. Interpretation:
- Interpreter is slower, but the code is smaller
- Interpreter provides instruction-set independence
- Translated/compiled code is almost always more efficient and therefore gives higher performance
- Translation/compilation helps "hide" the program "source" from users
Translation¶
C translation steps are as follows: C program \(\rightarrow\) Compiler \(\rightarrow\) Assembly program \(\rightarrow\) Assembler \(\rightarrow\) Object file \(\rightarrow\) Linker \(\rightarrow\) Executable \(\rightarrow\) Loader \(\rightarrow\) Memory.
Compiler¶
- Input: Higher-level language (HLL) code
- Output: Assembly language code (may contain pseudo-instructions)
Assembler¶
- Input: Assembly language code
- Output: Object code, information tables \(\rightarrow\) Object file
Assembler does this in three steps:
- Reads and uses directives
  - Assembly directives: give directions to the assembler, but do not produce machine instructions (`.text`, `.data`, `.global sym`, etc.)
- Replaces pseudo-instructions (`mv`, `li`, etc.)
- Produces machine language
  - "Forward Reference" problem
To solve "Forward Reference" problem, we make two passes over the program.
- Pass 1:
- Expands pseudo instructions
- Remember positions of labels
- Take out comments, empty lines, etc
- Error checking
- Pass 2:
- Use label positions to generate relative addresses
- Outputs the object file
For jumps to external labels and references to data, the assembler cannot determine the addresses yet, so it creates two tables for the linker to resolve them.
- Symbol Table
- List of "items" that may be used by other files
- Includes labels and data
- Relocation Table
  - List of "items" this file will need the address of later (currently undetermined)
  - Includes:
    - Any external label jumped to (`jal` or `jalr`)
    - Any piece of data
The object file format is listed below from top to bottom:
- object file header
- text segment
- data segment
- relocation table
- symbol table
- debugging information
Linker¶
- Input: Object code files, information tables
- Output: Executable code
There are three steps for the linker:
- Place code and data modules symbolically in memory
- Determine the addresses of data and instruction labels
- Patch both the internal and external references
Note that the linker enables separate compilation of files, which saves a lot of compilation time.
Info
The text segment starts at address `0000 0000 0040 0000`hex and the data segment at `0000 0000 1000 0000`hex.
Three types of addresses:

- PC-Relative Addressing (`beq`, `bne`, `jal`) \(\rightarrow\) never relocate
- External Function Reference (usually `jal`) \(\rightarrow\) always relocate
- Static Data Reference (often `auipc` and `addi`) \(\rightarrow\) always relocate
  - `lw`, `sw` to variables in the static area, relative to the global pointer
Loader¶
- Input: Executable code
- Output: \<program is run>
The loader follows these steps:
- Reads executable file's header to determine size of text and data segments
- Creates an address space large enough for text and data
- Copies the instructions and data from executable file into memory
- Copies arguments passed to the program onto the stack
- Initializes machine registers
- Jumps to start-up routine that copies program's arguments from stack to registers and sets the PC
Quote
Virtually every problem in computer science can be solved by another level of indirection.
David Wheeler
Great Idea #2: Moore's Law¶
Logic Circuit Description¶
Synchronous Digital Systems (SDS):
- Synchronous
- All operations coordinated by a central clock
- Digital
- Represent all values with two discrete values: 0/1, high/low, true/false
Switches and Transistors¶
- Switches
- The basic elements of physical implementations
- Transistors \(\rightarrow\) hardware switches
- Modern digital systems designed in CMOS
- MOS transistors act as voltage-controlled switches
CMOS Networks¶
- In a word,
- high voltage \(\rightarrow\) N-channel ON, P-channel OFF
- low voltage \(\rightarrow\) N-channel OFF, P-channel ON
- \(V_{GS} = V_{Gate} - V_{Source}\); \(V_{SG} = V_{Source} - V_{Gate}\); \(V_{TH} = V_{Threshold}\)
In abstraction, we can simply view chips composed of transistors and wires as block diagrams.
Combinational Digital Logic¶
Combinational Logic Gates¶
Digital systems consist of two basic types of circuits:
- Combinational Logic (CL) e.g. ALUs
- Sequential Logic (SL) e.g. Registers
Boolean Algebra¶
Translate truth table to boolean algebra:
- Sum of Products (SoP)
- Product of Sums (PoS)
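For example, for the XOR truth table (output 1 exactly on input rows 01 and 10), summing over the true rows gives the SoP form \(y=\bar{a}b+a\bar{b}\), while multiplying over the false rows gives the PoS form \(y=(a+b)(\bar{a}+\bar{b})\).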
Circuit Simplification¶
Hardware can have delays, so we should simplify boolean expressions \(\rightarrow\) smaller transistor networks \(\rightarrow\) smaller circuit delays \(\rightarrow\) faster hardware.
Sequential Digital Logic¶
In combinational digital logic, systems are memoryless. However, suppose we want to sum up a series of numbers whose count is unknown in advance; this is impossible to implement with memoryless circuits. Thus, we now introduce registers.
Register Details¶
Inside a register are a number of flip-flops (FFs). Here we're talking about the most common type of flip-flop, the edge-triggered D-type flip-flop.

Looking at a single FF, `d` is referred to as the input and `q` as the output.

On the rising edge of the clock, the input `d` is sampled and transferred to the output `q`. At all other times, `d` is simply ignored.

Like combinational logic circuits, a FF cannot capture a new value instantaneously, and time is also needed to transfer the input internally. As a result, the input `d` must be stable for a window around the clock edge: for some time before the edge (the setup time) and for some time after it (the hold time). Also, after the FF has successfully captured the new input, it takes a small amount of time for the new value to appear at the output. This delay is called the clk-to-q delay.
Pipelining--Adding Registers to Improve Performance¶
Consider the example in the slides: we can infer that the maximum clock frequency is limited by the propagation delay of the critical path.

To resolve that issue, we can employ a technique called pipelining, which in essence means adding more registers. Notably, a signal now passes through the same path (ignoring the added registers) in more cycles (while the length of each cycle shrinks), but in the long run this improves performance (throughput).
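Concretely (a standard timing constraint, stated here as a reminder rather than taken from the slides), the clock period must satisfy \(t_{clock}\ge t_{clk\text{-}to\text{-}q}+t_{CL,max}+t_{setup}\), where \(t_{CL,max}\) is the longest combinational delay between two registers. Pipelining shortens \(t_{CL,max}\), which is exactly why the clock can run faster.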
Finite State Machines¶
To represent sequential logic, finite state machines (FSMs) are a good fit. Conversely, with combinational logic and registers, any FSM can be implemented in hardware.
Functional Units (a.k.a. Execution Unit)¶
Take a look at the examples in the slides.
Hardware Architecture Description¶
CPU¶
Datapath and Control¶
A CPU involves two parts:
- Datapath: contains the hardware necessary to perform operations required by the processor
- Control: decides what each piece of datapath should do
The picture above shows the basic architecture of a single-cycle RISC-V RV32I datapath.
Single Cycle CPU: All stages of an instruction completed within one long clock cycle.
Processor design principles:
- Five steps:
- Analyze instruction set \(\rightarrow\) datapath requirements
- Select set of datapath components & establish clock methodology
- Assemble datapath meeting the requirements
- Analyze implementation of each instruction to determine setting of control points that effects the register transfer
- Assemble the control logic
- Formulate Logic Equations
- Design Circuits
- Determine control signals
Control construction options:
- ROM
  - Can be easily reprogrammed
  - Popular when designing control logic manually
- Combinatorial Logic
  - Not easily re-programmable, because changes require modifying the hardware
  - Likely less expensive, more complex
Great Idea #3: Principle of Locality¶
Caches¶
Principle of Locality: programs access only a small portion of the full address space at any instant of time.

- Temporal Locality (locality in time)
- Spatial Locality (locality in space)
In a typical memory hierarchy, from top to bottom we have:

- RegFile
- First-Level Cache (consists of Instr Cache & Data Cache) \(\rightarrow\) `L1$`
- Second-Level Cache (SRAM) \(\rightarrow\) `L2$`
- Main Memory (DRAM)
- Secondary Memory (Disk or Flash)

Everything above Main Memory consists of on-chip components.
Fully Associative Caches¶
Fully associative: each memory block can map anywhere in the cache.
In the cache:

- Each cache slot contains the actual data block (\(8\times 2^{\mathrm{Offset}}\) bits)
- `Tag` field as identifier (\(\mathrm{Tag}\) bits)
- `Valid` bit indicates whether the cache slot was filled in (\(1\) bit)
- Any necessary replacement management bits ("LRU (Least Recently Used) bits", a variable \(\#\) of bits)

Total bits in cache = \(\#\, \mathrm{slots}\times(8\times 2^{\mathrm{Offset}}+\mathrm{Tag}+1+?)\) bits
Direct-Mapped Caches¶
Direct-mapped cache: each memory address is associated with one possible block within the cache
Within the cache:
- Index: which row/block
- \(\# \mathrm{blocks}/ \mathrm{cache}=\dfrac{\mathrm{bytes/ cache}}{\mathrm{bytes/ block}}\)
- Offset: which byte within the block
- Tag: distinguish between all the memory addresses that map to the same location
- \(\mathrm{tag \, length=addr \, length - offset - index}\)
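As a sketch of how these fields are carved out of an address, assume a 1 KiB direct-mapped cache with 16-byte blocks and 32-bit addresses, so \(\mathrm{offset}=4\) bits, \(\mathrm{index}=\log_2(1024/16)=6\) bits, and \(\mathrm{tag}=32-6-4=22\) bits:

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 4 // 16-byte blocks
#define INDEX_BITS  6 // 64 blocks = 1 KiB / 16 B

int main(void) {
    uint32_t addr   = 0x12345678; // arbitrary example address
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```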
N-Way Set Associative Caches¶
Pros:
- avoids a lot of conflict misses
- hardware cost isn't that bad: only need N comparators
A 1-way set-associative cache is just a direct-mapped cache; an M-way set-associative cache where M equals the number of blocks is a fully associative cache.
Writes, Block Sizes, Misses¶
- Effective
- Hit rate (HR)
- Miss rate (MR)
- Fast
- Hit time (HT)
- Miss penalty (MP)
Handle cache hits:

- Read hits (`I$` & `D$`)
  - Fastest possible scenario
- Write hits (`D$`)
  - Write-Through Policy: always write to cache and memory (through the cache)
    - Write buffer
  - Write-Back Policy: write only to cache, update memory when the block is removed
    - Dirty bit
Handle cache misses:

- Read misses (`I$` & `D$`)
  - Stall, fetch from memory, put in cache
- Write misses (`D$`)
  - Write Allocate Policy: bring the block into the cache after a write miss
  - No Write Allocate Policy: only change main memory
  - Write allocate \(\leftrightarrow\) write-back; every slot needs an extra `dirty` bit
  - No write allocate \(\leftrightarrow\) write-through
Block size tradeoff:
- Benefits of larger block size
- Spatial Locality
- Very applicable with Stored-Program Concept
- Works well for sequential array accesses
- Drawbacks of larger block size
- Larger miss penalty
- If block size is too big relative to cache size, then there are too few blocks \(\rightarrow\) Higher miss rate
Types of cache misses:
- Compulsory Misses
- Conflict Misses (Direct-Mapped Cache only)
- Solution 1: Make the cache size bigger
- Solution 2: Multiple distinct blocks can fit in the same cache Index
- Capacity Misses (primary for Fully Associative Cache)
How to categorize misses: run the address stream through an infinite cache (the misses that remain are compulsory), then through a fully associative cache of the same total size (the additional misses are capacity), then through the actual cache (the additional misses are conflict).
Block Replacement Policy¶
- LRU (Least Recently Used)
- FIFO
- Random
Average Memory Access Time¶
AMAT = Hit Time \(+\) Miss Penalty \(\times\) Miss Rate
Abstract
Why not "Hit Time \(\times\) Hit Rate \(+\) Miss Time \(\times\) Miss Rate"?
AMAT = Hit Time \(\times\) (1 \(-\) Miss Rate) + (Miss Penalty \(-\) Hit Time) \(\times\) Miss Rate
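As a worked example with assumed numbers: with a hit time of 1 cycle, a miss penalty of 50 cycles, and a miss rate of 2%, \(\text{AMAT}=1+0.02\times 50=2\) cycles.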
Abstract
Big Idea: If something is expensive but we want to do it repeatedly, do it once and cache the result
Cache Design Choices:
- size of cache: speed vs. capacity
- block size
- write policy
- associativity choice of N
- block replacement policy
- 2nd level cache
- 3rd level cache
OS¶
Basics¶
What does OS do?
- OS is the first thing that runs when computer starts
- Finds and controls all devices in the machine in a general way
- Starts services
- Loads, runs and manages programs
What does the core of OS do?
- Provides isolation between running processes
- Provides interaction with the outside world
What does OS need from hardware?
- Memory translation
- Protection and privilege
- Split the processor into at least two modes: "User" and "Supervisor"
- Lesser privilege cannot change its memory mapping
- Traps & Interrupts
- A way of going into supervisor mode on demand
What happens at boot?
- BIOS[^1]: Finds a storage device and loads the first sector
- Bootloader: Loads the OS kernel from disk into a location in memory and jumps into it
- OS Boot: Initializes services, drivers, etc.
- Init: Launches an application that waits for input in a loop (e.g., Terminal/Desktop/...)
OS Functions¶
Refer to the lecture slides.
Virtual Memory¶
- Processes use virtual addresses
- Memory uses physical addresses
Address space = set of addresses for all available memory locations.
Responsibilities of memory manager:
- Map virtual to physical addresses
- Protection: isolate memory between processes
- Swap memory to disk: give illusions of larger memory by storing some contents on disk
Physical memory (DRAM) is broken into pages:
Paged memory translation:
- OS keeps track of active processes
- Memory manager extracts page number from virtual address
- Looks up page address in page table
- Computes physical memory address from sum of
- page address and
- offset (from virtual address)
Note
Physical addresses may have more or fewer bits than virtual addresses.
A page table is too large to fit in a cache, so we have to store it in memory (DRAM, which itself acts like a cache for disk). To minimize the performance penalty, we can:
- Transfer blocks (not words) between DRAM and caches
- Use a cache for frequently accessed page table entries (the Translation Lookaside Buffer, TLB)
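As a toy sketch of the translation itself, assuming 4 KiB pages, a 32-bit virtual address space, and a hypothetical one-level page table (real systems use multi-level tables and a TLB):

```c
#include <stdint.h>

#define PAGE_BITS 12         // 4 KiB pages
#define NUM_PAGES (1u << 20) // 2^32 / 2^12 virtual pages

// Hypothetical page table entry: physical page number plus a valid bit.
typedef struct { uint32_t ppn; int valid; } pte_t;
static pte_t page_table[NUM_PAGES];

// Translate a virtual address to a physical one (no protection checks).
uint64_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;              // virtual page number
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1); // offset within page
    if (!page_table[vpn].valid)
        return (uint64_t)-1; // page fault: the OS would swap the page in
    return ((uint64_t)page_table[vpn].ppn << PAGE_BITS) | offset;
}
```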
Refer to slides.
I/O¶
Interface options:
- Special I/O instructions
- Memory mapped I/O
- Use normal load/store instructions
I/O polling:
- Device registers:
- Control register
- Data register
- Processor reads from control register in loop
- then loads from (input) / writes to (output) data register
- Procedure called "polling"
Polling wastes processor resources, so there is an alternative called "interrupts":

- No I/O activity: nothing to do
- Lots of I/O: expensive - thrashing caches, VM, saving/restoring state
When the data rate is low, we use interrupts. When the data rate is high, we start with interrupts, then switch to Direct Memory Access (DMA).

DMA allows I/O devices to directly read/write main memory; a new piece of hardware called the DMA engine handles the transfer so that the CPU can execute other instructions.
- Incoming data
- Receive interrupt from device
- CPU takes interrupt, initiates transfer
- Device/DMA engine handle the transfer
- CPU is free to execute other things
- Upon completion, device/DMA engine interrupts the CPU again
- Outgoing data
- CPU decides to initiate transfer, confirms that external device is ready
- CPU begins transfer
- Device/DMA engine handle the transfer
- Device/DMA engine interrupt the CPU again to signal completion
To plug in the DMA engine in the memory hierarchy, there are two extremes:
- Between `L1$` and CPU
  - Pro: free coherency
  - Con: trashes the CPU's working set with transferred data
- Between Last-level cache and main memory
- Pro: don't mess with cache
- Con: need to explicitly manage coherency
Great Idea #4: Parallelism¶
Pipelining¶
Intro¶
Since we have to set the clock period long enough to fit the slowest instruction (`lw`), the maximum clock frequency is limited, and there will always be hardware sitting idle while performing other instructions.
As an analogy to doing laundry, we can use pipelining to increase throughput of entire workload.
To align the five instruction stages (IF, ID, EX, MEM, WB), we place a register between each pair of adjacent stages as a buffer.
Hazards¶
With pipelining, we successfully improve on single-cycle performance; however, this also leads to a class of problems called hazards.
A hazard is a situation that prevents starting the next instruction in the next clock cycle.
- Structural hazard: a required resource is busy
- Solution 1: stalling
- Solution 2: add more hardware
- Can always solve a structural hazard by adding more hardware
- Regfile Structural Hazards
- Solution 1: build Regfile with independent read and write ports
- Solution 2: double pumping
- Read and Write during same clock cycle is okay (write first, then read)
- Memory Structural Hazards
- Solution: build memory with separate instruction/data units/caches
- RISC ISAs designed to avoid such hazards
- Data hazard: data dependency between instructions
  - R-type instructions
    - Solution 1: stalling
    - Solution 2: forwarding (a.k.a. bypassing)
  - Load instructions
    - Solution 1: stalling = inserting a `nop`
    - Solution 2: let the compiler/assembler put an unrelated instruction in that slot (the slot after a load is called the load delay slot)
- Control hazard: flow of execution depends on previous instruction
- Solution 1: stalling
- Solution 2: move branch comparator to ID stage
- Solution 3: branch prediction - guess outcome of a branch, fix afterwards if necessary
Parallelism¶
Flynn's Taxonomy¶
- Choice of hardware and software parallelism are independent
- Flynn's Taxonomy is for parallel hardware
Instruction-Level¶
To improve performance, Intel's SIMD instructions:
- Fetch one instruction, do the work of multiple instructions
- MMX (MultiMedia eXtension)
- SSE (Streaming SIMD Extension)
In the evolution of x86 SIMD, it started with MMX, getting new instructions, new and wider registers, more parallelism.
SSE extended the architecture with eight 128-bit data registers, called XMM registers.
Note
In 64-bit address architectures, 16 such 128-bit registers (XMM0-XMM15) are available.
Programming in C, we can use Intel SSE intrinsics, which are C functions and macros that give direct access to assembly-level instructions, including SSE instructions.
| Operations | Intrinsics | Corresponding SSE instructions |
|---|---|---|
| Vector data type | `__m128d` | |
| Load and store operations | `_mm_load_pd`, `_mm_store_pd` | `MOVAPD` / aligned, packed double |
| Arithmetic | `_mm_add_pd`, `_mm_mul_pd` | `ADDPD` / add, packed double; `MULPD` / multiply, packed double |
2x2 Matrix Multiply

```c
#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in comments as v1 = [a | b]
// where v1 is a variable of type __m128d and a, b are doubles
int main(void) {
    // allocate A,B,C aligned on 16-byte boundaries
    double A[4] __attribute__ ((aligned (16)));
    double B[4] __attribute__ ((aligned (16)));
    double C[4] __attribute__ ((aligned (16)));
    int lda = 2;
    int i = 0;
    // declare several 128-bit vector variables
    __m128d c1, c2, a, b1, b2;

    // Initialize A, B, C for example
    /* A = (note column order!)
       1 0
       0 1
    */
    A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;
    /* B = (note column order!)
       1 3
       2 4
    */
    B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;
    /* C = (note column order!)
       0 0
       0 0
    */
    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

    // use aligned loads to set
    // c1 = [c_11 | c_21]
    c1 = _mm_load_pd(C + 0 * lda);
    // c2 = [c_12 | c_22]
    c2 = _mm_load_pd(C + 1 * lda);

    for (i = 0; i < 2; i++) {
        /* a =
           i = 0: [a_11 | a_21]
           i = 1: [a_12 | a_22]
        */
        a = _mm_load_pd(A + i * lda);
        /* b1 =
           i = 0: [b_11 | b_11]
           i = 1: [b_21 | b_21]
        */
        b1 = _mm_load1_pd(B + i + 0 * lda);
        /* b2 =
           i = 0: [b_12 | b_12]
           i = 1: [b_22 | b_22]
        */
        b2 = _mm_load1_pd(B + i + 1 * lda);
        /* c1 =
           i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
           i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21]
        */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
        /* c2 =
           i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
           i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22]
        */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    }

    // store c1,c2 back into C for completion
    _mm_store_pd(C + 0 * lda, c1);
    _mm_store_pd(C + 1 * lda, c2);

    // print C
    printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
    return 0;
}
```
To improve RISC-V performance, add SIMD instructions (and hardware) – V extension:
`vadd vd, vs1, vs2`
Thread-Level¶
Multicore¶
Multiprocessor execution model:

- Each processor (core) executes its own instructions
- Separate resources:
  - datapath
  - highest-level caches
- Shared resources:
  - memory (DRAM)
  - often the 3rd-level cache
- Advantages:
  - simplifies communication in programs via shared variables
- Drawbacks:
  - does not scale well
    - "slow" memory shared by too many "customers" (cores)
    - may become a bottleneck
- Nomenclature:
  - "multiprocessor microprocessor"
  - multicore processor
Thread¶
- Thread: "thread of execution", is a single stream of instructions.
- With a single core, a single CPU can execute many threads by time sharing.
- Each thread has:
- dedicated program counter
- separate registers
- accesses the shared memory
- Each physical core provides one:
- hardware threads that actively execute instructions
- each executes one "hardware instructions"
- OS multiplexes multiple:
- software threads onto the available hardware threads
- all threads except those mapped to hardware threads are waiting
Multithreading¶
When an active thread encounters a cache miss, we switch it out and execute another thread. However, saving the current state and loading the new state can be expensive, so we ask the hardware for help; Moore's Law tells us that transistors are plentiful.

This led to hardware-assisted software multithreading: we simply keep two copies of the PC and registers inside the processor hardware, which makes it look identical to two processors. We now have hyper-threading (a.k.a. simultaneous multithreading, SMT), meaning both threads can be active simultaneously.

PS: MIMD is thread-level parallelism, while SIMD is instruction-level.
OpenMP's Fork-Join Model:
Note that OpenMP threads are operating-system (software) threads, and the OS will multiplex the requested OpenMP threads onto the available hardware threads. Hopefully each gets a real hardware thread to run on, so there will be no OS-level time-multiplexing. However, other tasks will compete for hardware threads, so be careful when timing results for projects.
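A minimal OpenMP fork-join sketch (the computation is an arbitrary toy example; compile with `-fopenmp`):

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N];

int main(void) {
    double sum = 0.0;
    // Fork: the master thread spawns a team of software threads;
    // join: all threads synchronize at the end of the parallel region.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i;
        sum += a[i]; // the reduction clause avoids a data race on sum
    }
    printf("sum = %g, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```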
Synchronization¶
To solve the problem that different threads could be "racing" for shared resources, which would lead to nondeterministic results, computers use locks to control access; locks rely on hardware synchronization instructions.

- Solution: atomic read/write (read and write in a single instruction)
  - Uses the R-type instruction format, like `add`, `and`, ...
  - e.g., `amoadd.w rd, rs2, (rs1)`
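C does not expose `amoadd.w` directly, but a minimal C11 spinlock sketch shows the same atomic read-modify-write idea (standard `<stdatomic.h>`, not the course's RISC-V sequence):

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void) {
    // Atomic test-and-set: returns the old value and sets the flag in
    // one indivisible step; loop until we are the one who set it.
    while (atomic_flag_test_and_set(&lock)) {
        /* spin: another thread holds the lock */
    }
}

void release(void) {
    atomic_flag_clear(&lock); // atomically reset the flag to "free"
}
```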
Deadlock: a system state in which no progress is possible.
- Solution: elapsed wall clock time
Shared Memory and Caches¶
SMP: (Shared Memory) Symmetric Multiprocessor
How does hardware keep caches coherent?
- Each cache tracks state of each block in cache:
- Shared: up-to-date data, other caches may have a copy
- Modified: up-to-date data, changed (dirty), no other cache has a copy, OK to write, memory out-of-date (i.e., write back)
- Exclusive: up-to-date data, no other cache has a copy, OK to write, memory up-to-date
- Owner: up-to-date data, other caches may have a copy (they must be in Shared state)
Coherency Tracked by Cache Block:
- false sharing: block ping-pongs between two caches even though processors are accessing disjoint variables
- 4th "C" of cache misses: Coherence Misses
Request-Level¶
Google query-serving architecture:
Data-Level¶
Two kinds:
- Data in memory
- Data on many disks
MapReduce¶
Design goals:
- Scalability to large data volumes
- Cost-efficiency
- Simplified data processing on large clusters
MapReduce is centralized:

- The master assigns map and reduce tasks to "worker" servers
- As soon as a map task finishes, the worker server can be assigned a new map or reduce task
- To tolerate faults, the master reassigns a task if a worker server "dies"
In conclusion, MapReduce is a wonderful abstraction for programming thousands of machines. It hides details of machine failures, and it is file-based.
Spark¶
Apache Spark is a fast, general engine for large-scale data processing.

It improves on MapReduce by combining lazy evaluation with memory-based (in-memory) data processing.
Great Idea #5: Performance Measurement & Improvement¶
CPU Performance Analysis¶
The `lw` instruction needs to do all five stages of work, so it is the bottleneck that determines the single-cycle clock period.
"Iron law" of processor performance:
Amdahl's Law¶
- Speed up due to enhancement \(E\) : \(\text{Speedup w/E}=\dfrac{\text{Exec time w/o E}}{\text{Exec time w/E}}\)
- Example:
- \(E\) does not affect a portion of \(s(s<1)\) of a task
- It does accelerate the remainder \((1-s)\) by a factor \(P(P>1)\)
- \(\text{Exec time w/E}=\text{Exec time w/o E}\times[s+(1-s)/P]\)
- \(\text{Speedup w/E}=1/[s+(1-s)/P]<\dfrac{1}{s}\)
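As a worked example with assumed numbers: if \(s=0.1\) and \(P=10\), then \(\text{Speedup w/E}=1/[0.1+0.9/10]=1/0.19\approx 5.3\), well below the bound \(1/s=10\).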
Great Idea #6: Dependability via Redundancy¶
Refer to slides.
[^1]: BIOS: Basic Input Output System