Computer Systems

ParallelProgramming

Taken from here Why Parallelism? A parallel computer is a collection of compute modules which work together to perform a task efficiently In principle, more processors means more power However, a major bottleneck is communication speed Think of a room of people working together on a calculation. Information needs to travel between them for them to cooperate efficiently. Better communication == faster results If communication takes longer than the task at hand, then the upsides of parallelism go away Some processors might have easier tasks than others, or are more efficient than other nodes. So even if you have a bunch of people, the task moves as slow as the hardest subtask We want to evenly distribute the work amongst all processors Given the above, we want to write parallel programs that scale well Each piece can safely be done in parallel Keep all the processors busy Keep synchronization/ comms to minimum Superscalar Processors A program is really just a series of instructions encoded in binary that a particular CPU can understand. The rate at which these instructions are executed is defined by the clock For simplicity, assume we have a CPU that executes exactly one instruction per cycle At it’s core then, CPU need to: Fetch/Decode: Figure out what instruction to execute Execute: Load the required data and do the thing Write Back: Update he internal state of the system Suppose that we have the program: $a = x^{2}+y^{2}+z^{2}$ The above a the sequential version of the program What happens if you could execute two instructions at the same time? You could run $x^{2}$ and $y^{2}$ in parallel, then add them in parallel with doing the z multiplication, then add up the final results. So now it takes 3 clock cycles This is the idea behind instruction level parallelism (ILP)/ superscalar processors The key idea: some instructions can be run at the same time without interfering with each other. Hence, you can run multiple constructions at the same time! A real world example: the Pentium 4 Single Thread Wall Unfortunately, the ILP gains seems to empirically saturate after 4 instructions In addition, there are hard power limits based on the frequency and voltages (EMI and all that) If chips get too hot, the don’t function anymore The way forward is multiple processors working in tandem Or alternatively, specialized processors for specific workloads Memory There’s the old not-very-funny C joke: “The world is just a giant array of undifferentiated bits” In the programmer’s mental model, memory is exactly this Since most types of memory are not stored on the processor, you need to go fetch it from somewhere This takes a long time if you are getting this from the disk (Memory access times can be 100’s of cycles) Caches Caches act as a way to help with these latencies: By providing an intermediate storage of highly accessed values, you reduce latency Caches should not alter the correctness of a program, only the performance Cache has two methods of “data locality”: Spatial locality: nearby addresses are more likely to be accessed (think array iteration) Temporal locality: high frequency addresses will more likely be accessed in the future Cache can have some hierarchy (L1,L2, L3) it could be implemented as DRAM (variable cycle count) but the important part is that whatever the caching system should be opaque (most) programmers The smallest unit of a cache is a “line” which has some width. If you get 1 address from memory, the adjacent memory values are put into a cache line. Subsequent memory addresses check the cache first before going to memory There are a bunch of policies for how the cache decides what data is high frequency. For now, just assume that a cache of size N bytes stores the last N addresses accessed New values follow LRU policy (least recently used) ie. the oldest access gets replaced first As a concrete example, say you iterate through addresses 0x0 and 0xF with a cache width of 8 bytes and a line width 4 bytes. The above shows the access pattern You get a cache miss every 4 bytes, and the least recently used cache address gets evicted if the cache is full Modern CPUs make predictions about what data will be fetched (so called “pre-fetching”). If the CPU guesses wrong, you can get reduced performance Stalling Suppose that you miss the cache, and are now spending a very long time getting memory from DRAM/disk While you are waiting for the access, you can swap to a different program and do work there There might be a period where the first thread finished stalling and is in a runnable state, but is not being ran by the processor This is OK if aiming for a high throughput system. However, there is a “no free lunch theorem” trade off between throughput (ie. pipelining a bunch of work) and latency (ie. reducing the time each task takes ) A CPU’s utilization rate is defined by the time spend doing actual work instead of stalling. This heavily depends on the number if instructions you can run before stalling. Adding more threads once you are at 100% utilization rate doesn’t increase the utilization. It just delays when you get back to the original task Moving Past Single Core Since you hit a wall in single core performance, one way to increase performance is to use the transistors to make multiple cores instead of making the ILP logic properly Another way is to use SIMD (Single Instruction Multiple Data) You have one register with a bunch of bits in them (for concreteness, say 256 bits). You then subdivide each subsection of this large register (say 64 bit chunks). Each chunk then gets passed into their own ALU Hence the name: a single instruction can work on the multiple chunks of data in a register For SIMD, you need to ensure that each subsequent loop executes the same set of instructions across the same elements (called “coherent execution” or “instruction stream coherence”) Coherent execution is NOT a requirement for parallelism across multiple cores If a program lacks instruction stream coherence, it is called “divergent execution” On a GPU, you have N instances of a program running in parallel on a large number of different cores. Divergent execution in this context can cause massive performance hits SIMD can be expressed at the compiler level (via AVX2 instructions on intel for instance) or can be done implicitly by the hardware Branches Remember that you are assuming that all iterations of the loop you are optimizing are independent of each other. How do you deal with conditional branches (re: if/else kind of statements)? Idea: Have all ALU cores try to run the if branch. If the condition is satisfied for that particular iteration, enable the computation, otherwise disable the ALU core. After that, repeat for the else branch, but disable the other set of registers Latency Issues Think of driving down the highway. If you wanted to increase the throughput (ie. how many cars per hour can move along the highway), you need to either: Increase the speed of each car Increase the number of lanes Memory is the same. There is some maximum throughput/bandwidth which you can transfer data on a channel Memory Latency is defined as the time it takes to transfer any one item You can increase the throughput of a system while maintaining a fixed latency by pipelining your system (ie. when one subprocess finishes, immediately begin a new subprocess) This approach is latency limited: you can only go as fast as the slowest subprocess Incidentally, this is how multiplication takes “a single cycle”. This refers to instruction throughput, not latency Ie. it takes like 4 cycles to run a multiply instruction end to end, but only 1 cycle assuming proper pipelining To go to the extreme, this of a GPU. The max bandwidth is in the TB/sec. Transferring enough memory to saturate this bandwidth is physically impossible However, even a 1 percent utilization rate is still much faster than an 8 core CPU Modern computing therefore is constrained by the memory bandwidth! Performant programs fetch data from memory less often Doing more math instead of more memory requests can be beneficial to performance (up to a point) Parallel Programming Idioms/ ISPC ispc is the Intel SPMD Program Compiler which provides programming primitives Calling a ISPC function spawns a “gang” of ISPC program instances which run concurrently Each instance has it’s own copy of local variables Some keywords unique to ISPC: uniform: type modifier which asserts that all instances use the same value for this variable (non necesssary for correctness) programCount: the number of simultaneously executing instances in a gang (uniform value) programIndex: the id of the current instance in the gang foreach: a parallel loop keyword. This demands that work required to execute all of these loops be done by the gang as a whole, and not each instance It’s the compiler’s job to assign iterations to program instances in the gang There are a bunch of caveats where this will either be undefined or give a compiler error If multiple iterations of a loop body can write to the same memory location (race conditions are bad) Cannot return many copies of a variable which expects one return value cannot have a uniform value which gets assigned a different value across instances The above program interleaves the assignment of program instances (Ex: instance 0 works on indices 0, 8, 16 …) Temporally, at time t=0, each instance is working on a contiguous set of indices, so you can run a packed vector load You could also do “blocked” assignment, where each instance does a contiguous set of loop indices With this setup, you need to access disparate memory locations at t=0 Requires “gather” instructions (vgatherdps) to implement (which can be costly SIMD) If you need to talk between instances, ISPC gives you some intrinsics: uniform int64 reduce_add(int32 x);: compute sum of a variable across all program instances in a gang uniform int32 reduce_min(int32 a);: compute the min of all values in a gang int32 broadcast(int32 value, uniform int index);: Broadcast a value from one instance to all instances in a gang int32 rotate(int32 value, uniform int offset);: For all i, pass value from instance i to instance i+offset% programCount There is a difference between the semantics of an abstraction (ie. what should something do?) and the implementation (ie. how is something implemented) A great deal of caution should be made to disentangle an implementation from the semantics For ISPC, the abstraction is that the compiler spawns programCount logical instruction streams The implementation is the compiler emitting SIMD instructions with the appropriate masking There is some leakage with the uniform keyword (in that you need to think about if a value is shared across all instances) This abstraction/implementation is not unique! What if you didn’t want to provide fine-grain control of what happens in each instance? You would remove programIndex and programCount and be forced to use uniform outside the foreach loop You could also disallow array indexing and force the programmer too think about collections of data (think numpy!)

OperatingSystems

Notes following this MIT course and is based on the xv6 toy operating system. Operating System Interfaces Processes and Memory Forking Exec Operating System Organization Abstracting Physical Resources CPU Modes Processes xv6 startup Security Page Tables Kernel Address space Xv6 Implementation Physical Memory Allocation Xv6 Physical Memory Implementation Process Address Space Sbrk (Xv6) Exec (Xv6) Real World Paging Traps and System Calls RISC-V traps User Space Traps Syscalls Kernel Space Traps Page Fault Exceptions Interrupts ad Device Drivers Console Input Console Output Timer Interrupts Locking Races Spinlocks Using Locks Deadlocks Spinlocks Scheduling Context Switching Scheduler Sleep and Wakeup Semaphores Lost Wake-Up Problem Pipes Process Locks Real World Scheduling File System Buffer Cache Layer Logging Layer Block Allocator Inode Layer Directory Layer Path Layer File Descriptor Layer Operating System Interfaces Operating systems serve a number of purposes Abstracts away lower level hardware, so that programs don’t have to care about what disk you are writing to Allows multiple programs to run “at the same time” Provide well defined interface for programs to interact with each other For the interface, we want a simple interface that is easy to implement, but allows high sophisticated operations to happen UNIX philosophy of composing many simple programs together helps keep this problem trackable Can divide operating system into kernel space and user space kernel space is the program which provides services to other programs. It’s given hardware level privileges which user programs don’t have access to Each running program under the kernel’s manager (called a process) gets it’s own memory which houses the instructions (ie. the CPU assembly instructions), the data (variables created when running the program) and the stack (organizes program procedure calls) if process needs a kernel service, it invokes a system call, which briefly transfers ownership to the kernel, which executes the syscall, and then returns ownership to the user program Called user space and kernel space Typically, people use a shell to allow user input. The shell resides in user space, so you have flexibility in what shell you run Processes and Memory For each running process, there is some user-space memory (instructions, data, stack) and kernel-space memory (which the user’s can’t access) One of the kernel-space things is the PID (process identifier) Xv6 has it’s own custom syscalls, which are somewhat stripped down versions of the UNIX ones Unless otherwise stated, all of these calls return 0 for no error and -1 if there’s an error Call Sig Desc int fork() Create a process, return child’s PID. Makes an exact copy of user-space. returns in both parent and child process. Returns child’s PID in parent process, and 0 in child process int exit(int status) Terminate the current process; Releases all held resources. Exit status reported to wait(). No return. int wait(int *status) Wait for a child to exit; exit status in *status (out parameter); returns child PID. If caller has no children, immediately returns -1. If parent don’t care about child exit status, pass 0 address (ie. (int *) 0) to wait int kill(int pid) Terminate process PID. Returns 0, or -1 for error. int getpid() Return the current process’s PID. int sleep(int n) Pause for n clock ticks. int exec(char *file, char *argv[]) Load a file and execute it with arguments; only returns if error. This replaces the current process’s memory with that of the file. For Xv6, the ELF format is used. char *sbrk(int n) Grow process’s memory by n bytes. Returns start of new memory. int open(char *file, int flags) Open a file; flags indicate read/write; returns an fd (file descriptor). int write(int fd, char *buf, int n) Write n bytes from buf to file descriptor fd; returns n. int read(int fd, char *buf, int n) Read n bytes into buf; returns number read; or 0 if end of file. int close(int fd) Release open file fd. int dup(int fd) Return a new file descriptor referring to the same file as fd. int pipe(int p[]) Create a pipe, put read/write file descriptors in p[0] and p[1]. int chdir(char *dir) Change the current directory. int mkdir(char *dir) Create a new directory. int mknod(char *file, int, int) Create a device file. int fstat(int fd, struct stat *st) Place info about an open file into *st. int stat(char *file, struct stat *st) Place info about a named file into *st. int link(char *file1, char *file2) Create another name (file2) for the file file1. int unlink(char *file) Remove a file. Forking As a small case study, look at the following C snippet int pid = fork(); if(pid > 0){ printf("parent: child=%d\n", pid); pid = wait((int *) 0); printf("child %d is done\n", pid); } else if(pid == 0){ printf("child: exiting\n"); exit(0); } else { printf("fork error\n"); } After the fork(), the pid let’s you determine what the parent and the child do NOTE: While after forking, the memory and register content of both processes is the same, they located in different locations which can evolve independently from each other Don’t expect the output to be the same. Whichever process gets to printf first will output first (parent and child output could also be interleaved) Exec Another case study of exec char *argv[3]; argv[0] = "echo"; argv[1] = "hello"; argv[2] = 0; exec("/bin/echo", argv); printf("exec error\n"); If echo is functioning properly, then the last printf will never get executed Argv[0] is typically reserved for the program name Operating System Organization An operating system has three requirements multiplexing: multiple processes should be able to access the same hardware resources (time-sharing) isolation: each process should not be subject to the bugs of another process (ie. if one process goes under, the others keep running) interaction: processes should be able to interact with each other (eg. like UNIX pipes) xv86 runs on a multi-core RISC-V processor, and is written in “LP64 C” ,which means longs and pointers are 64 bits, but integers are 32 bits. Abstracting Physical Resources In theory, you don’t need an operating system. Just provide a library to interact with the hardware resources, then allow programs to link against this library real-time systems and embedded devices do this for speed reason The problem with this approach is that it requires each program to trust each other (ie. yield control of the CPU time to achieve multiplexing, called coperative time-sharing) and not have any bugs. Both of these are ludicrously naive UNIX uses a file system to abstract the hardware so that programs don’t have to fully trust each other This hardware abstraction also allows transparent switching between hardware CPUs (save registers when you switch, load them back in when you come back) CPU Modes Three modes of CPU machine mode: instructions have full privilege privilege mode: only the kernel can run in this mode. If a user program tries to execute one of these instructions, it’s gets terminated by the kernel user mode: most restricted mode of cpu Need to use a special function to transition to privilege mode Operating systems need to make a choice about what mode it will run in UNIX makes the choice of putting everything in privilege mode (monolithic kernel) Another choice is to have most of the code in user space though (called a microkernel), since if you mess up in kernel mode, everything crashes Processes A process is stored in a struct at kernel/proc.h inn xv86. This info lets the process act like it has it’s own CPU, and act like it has it’s own memory Each process has a thread of control, which holds the state needed to execute the process (p->thread) Each process has two stacks: a user space stack(p->stack), and a kernel space stack (p->kstack) in RISC-V, ecall is used to enter kernel space, and sret is used to return to user space p->state indicates wheather the process is allocaated, ready to run, currently running, waiting for I/O, or exiting p->pagetable holds the process table info, formatted for RISC-V The pagetable is responsible for translating virtual addresses (ie. the addreses a CPU instruction manipulates) to a physical address (an address that the CPU chip sends to main memory) xv6 startup CPU first reads and execute the bootloader from ROM, which loads the rest of the kernel The RISC-V starts with paging hardware disabled The bootloader puts the kernel at 0x80000000 (place here b/c I/O devices are in the lower range), and the starts executing at _entry at _entry, there is a start function which does things only machine mode: acts like it returns from supervisor mode (even though you are running in machine mode) sets the return address to that of main disables the virtual address space delegates all interrupts and exceptions to supervisor mode and then programs the clock chip to generate interrupts in the main function it creates the first user process the first syscall called is exec, replaces the current process with the shell the args to exec get put in a0 and a1, and puts the syscall number in a7 syscall number matches the entries in the syscalls array, a table of function pointers syscall retrieves the syscall number from a7 The first syscall is SYS_exec once sys_exec returns, it’s return value is put in p->trapframe->a0 This causes the original userspace call to return that value Security The kernel assumes that user code is malicious (ie. it will try it’s best to crash the system) The kernel space code is assumed to be safe\ Page Tables Xv6 uses page tables to implement isolation. Maps virtual addresses (address that RISC-V instruction manipulates) to physical addresses (address that CPU sends to main memory) the top page of the address houses the trampoline page and the trapframe page, each of 4096 bytes the trampoline page contains the code to transition in and out of the kernel the trapframe page houses the process’s user registers To elaborate on the mapping: only the bottom 39 bits out of the 64 bits are used for virtual address in RISC-V, there are $2^{27}$ page table entries (PTE). each PTE contains a 44-bit physical page number (PPN) These flags are used to state now that page can be used PTE_V indicates wheather a PTE exists PTE_R indicates if you can read the page PTE_W controls if you can write to a page PTE_U controls if user mode instructions can access the page The top 27 bits of the 39 bits are used to index into the page table to find a PTE You can then construct a 56-bit physical address whose top 44 bits come from the PPN in the PTE, and whose bottom 12 bits are from the virtual address RISC-V uses a hierarchy of pagetables, since that uses less data (for example, three sections of 9 bits are used to index 3 levels of page tables) Kernel Address space Each process gets one page table. The kernel gets it’s own page table to describe it’s address space The kernel uses direct mapping (ie. virtual addresses are equal to physical addresses). The cases where direct mapping isn’t used by the kernel are The trampoline page: it’s mapped to the top of the address space. The user page tables also have this same mapping. This allows you to easily transfers registers between layers The kernel stack pages: each process gets it’s own kernel stack, along with an unmapped guard page whose PTE is invalid. This helps guard against stack overflows Xv6 Implementation The central datastructure is pagetable_t, which is a pointer to a RISC-V root page-table page This can either be a kernel page table, or a preprocess page table Early on in booting, the kernel’s page table gets constructed prior to paging being enabled The root page-table pge gets allocated, then gets populated with the translations the kernel needs, then the kernel stack for each process gets allocated (leaving space for the guard pages) and the appropriate flags are set Physical Memory Allocation The kernel is responsible for allocating and freeing physical memory at runtime for page tables, user memory, kernel stacks, and pipe buffers xv6 uses physical memory from the end of ther kernel to PHYSTOP for run-time allocation Allocation consists of removing a page from the linked list, while freeing is adding a page to the list Xv6 Physical Memory Implementation the main data structure of the allocator is a free list of physical memory pages each entry in the list is a struct run. The struct stores each run structure in the free page itself, since nothing else is stored there The struct has a spin lock on it, which means you need to acquire and release the lock (don’t worry about this for now…) The free list is initialized to hold every page between the end of the kernel and and PHYSTOP Normal operating systems can figure out how much to allocate via some hardware configurations. Xv6 just assumes 128 Mb of RAM a PTE can only refer to a physical address that is aligned to the 4096-byte boundary, so PGROUNDUP macro does this alignment The allocator treats addresses in two different ways As integers in order to perform arithmetic As pointers to read and write memory The aforementioned reasons is why there are so many C type casts Xv6 sets all freed memory to 1 via kfree (to try and catch dangling references) kfree repends the page to the free list (so the list grows head first) Process Address Space Each process gets it’s own page table. When processes get switched, it also changes page tables When a process asks for more memory, xv6 uses kalloc to allocate physical pages, and then populates the page table with the correct PTEs. The OS also sets the correct permissions on the page This schema has some nice advantages Different process page tables translate user addresses to different pages of physical memory. This allows each process to have a private user memory Each process sees its memroy as having contiguous virtual addresses starting at 0, which physical memory is non-contiguous The kernel maps a page with trampoline code at the top of user address space, which means that a single page of physical memory shows up in add address spaces (hence, allows you to share between processes) Referring to above figure, we note that the user stack is a single page, followed by a guard page, which acts as a way to detect a stack overflow. Xv6 just panics, but a real OS would try to allocate more memory for the user stack Sbrk (Xv6) Sbrk is used by a process to grow or shrink memory. Depending on the sign of allocation, sbrk either allocates or deallocates a page at a time as needed (so if you allocate 1 byte, 1 whole page will be allocated) Exec (Xv6) Exec creates the user part of an address space If does this by reading a file stored in the file system in the form of an ELF file It first reads the first 4 bytes to check of the magic number matches ‘0x7F’, ‘E’, ‘L’, ‘F’. If it matches, it assumes that the file is formatted correctly A new empty page table is allocated, then each ELF segment is allocated with uvmalloc then loaded with loadseg walkaddr is used to find the physical addresses of the allocated memory exec then allocates and initializes one page for the user stack, along with a guard page to protect against overflow and too many arguments If any program segment is invalid, the program image is freed, and exec returns -1. Exec is risky, since addresses in the ELF file can refer to the kernel. So user programs can in theory gain kernel privileges Real World Paging real world operating systems deviate from Xv6 drastically Real OS’s don’t crash if a user stack overflows. They actually try to save a page=fault exception Xv6 assumes that physical RAM is always at 0x8000000. The real world can’t make this assumption. Real OS’s need to be able to map arbitrary hardware physical memory layouts into predictable kernel virtual address layouts On machines with large amounts of memory, you might want to use “super pages” (ie. make pages 4 Mb instead of 4kB) to avoid excess page-table manipulation the lack of malloc prevents the kernel from using data structures with require dynamic allocation Real kernels would allow allocation of variable size blocks instead of a hard-coded 4096 bytes Traps and System Calls “Traps” refer to any event which causes the CPU to set aside ordinary execution of instructions Could be a system call Could be an exception (ie. an instruction has done something illegal) Could be a device interrupt: when a device signals that it needs attention Traps often should be transparent to other processes (ie. other code shouldn’t know that they happened) The overall structure of trap handling is a trap forces control to transfer to the kernel the kernel saves registers and other state so that execution can be resumed later the kernel executes the appropriate handler code the original code resumes where it left off “Handler” refers to code which that processes a trap The first handler instructions written in assembly are called a “vector” RISC-V traps RISC-V CPUs have a set of control registers that the kernel writes to to tell the CPU how to handle traps the kernel can read from to find out about how a trap occurred The most important registers are: stvec: the address of it’s trap handler. RISC-V jumps to this address to handle the trap sepc: When a trap occurs, RISC-V save the program counter here the sret instruction copies sepc to pc. The kernel can write to sepc to change where sret goes scause: RISC-V puts a number here that describes the reason for the trap sscratch: the trap handler code uses this to help overwrite user registers before saving them (ie. a scratch pad) sstatus: the SIE bit in sstatus controls weather device interrupts are enabled. If the kernel clears SIE, RISC-V will defer device interrupts until the kernel sets SIE. The SPP bit indicates whether a trap came from user mode or supervisor mode and controls to what mode rset returns You can’t read or write these registers in user mode. A similar set of registers exist for traps handled in machine mode (xV86 only uses these for timer interrupts) RISC-V does the following sequence: if the trap is a device interrupt, ssstatus SIE is cleared. Don’t do any of the following Disable interrupts by clearing the SIE bit in sstatus Copy the pc to sepc Save the current mode in the SPP bit in status set scause to reflect the trap’s cause set the mode to supervisor copy stvec to the pc Start executing in supervisor mode In theory, you could skip some of these steps. In practice, you want to maintain process isolation (ie. prevent user code from running kernel level instructions) Note that the CPU doesn’t do the following operations. It’s the kernel’s job to do these tasks. The CPU does this to allow greater flexibility for the OS (for instance, some OS’s omit the page table switch in some situations for performance reasons) Doesn’t switch to the kernel page table Doesn’t switch to a stack in the kernel Doesn’t save any registers other than the pc User Space Traps In RISC-V, the hardware doesn’t switch page tables on a forced trap. This means that the trap handler address must have a valid mapping in the user page table and in the kernel page table each process automatically gets assigned a trampoline page, which is located at the top of the virtual address space above any user code Cruicially, it is located at the same address in user and kernel space, which makes the transition between the two easy The csrw instruction copies a register over to sscratch the TRAPFRAME is saved just below the TRAMPOLINE virtual address Once all of the registers are copied, you switch to the kernel page table, then figures out what trap occurred, then dispatches the correct subroutine Once the interrupt routine finishes, the RISC-V reloads the old program counter, then switches to the original user page table Syscalls the arguments for exec are placed in a0 and a1, and places the system call number in a7. This system call number indexes into a syscall array, which is an array of function pointers After the associated syscall returns, the return value is recorded in p->trapframe->a0 C calling convention on RISC-V places return values in a0 Conventionally, negative numbers are errors, positive number are success Kernel Space Traps Upon entering the kernel, all 32 registers are pushed to the stack for later restoration Importantly, they are saved on the interrupted kernel thread. So if the trap causes a switch to a different thread, then these interrupted threads will be preserved If an interrupt was due to a timer interrupt, and a process’s kernel thread is running, then the current thread yields to allow other threads to run Once a trap is done, it returns to the original code There in theory can be a race condition when the ‘stvec’ register is still set ot uservec instead of kernelvec. If an interrupt occurred in the interim, that would be bad RISC-V disables interrupts when it starts to take a trap, and they don’t get enabled until after it sets stvec Page Fault Exceptions For xv86, if an exception occurs in user space, the kernel kills the faulting process. If an exception happens in the kernel, the kernel panics. Real kernels can do more interesting things For instance, copy-on-write (COW) fork. The basic idea is that when you create a fork, you need to copy over the memory space of the parent to the child. Xv6 allocate a separate chunk of physical memory for the child. it would be nice if you could use the same physical memory, but a naive implementation would cause the two processes to step over each other the CPU can raise a page-fault exception if a virtual address has no mapping in the page table, or PTE_V is cleared, or a process doesn’t have the correct permissions Initially, parent and child shares all physical pages, but both are read-only. If a page-fault from an attempted write occurred, then a new page of physical memory is allocated for that process, but now with write enabled. Subsequent writes are unimpeded Many use cases of fork() doesn’t allocate new memory, so this save a bunch of page allocations Lazy allocation is another use case. If an application demands memory, the kernel keeps track of the increase in size, but doesn’t allocate any physical memory until a page fault occurs If an application asks for more memory than it needs, you save on allocating all those pages Even if the application massively grows the address space, you can spread out the cost of allocating that space over time This does incur the overhead of more paging requests, but you can allocate a small number of pages on page fault There is demand paging. Naively (like xv6 does), one loads in all of the application text and data into memory before starting the application. You can instead create a user page table with all PTEs marked invalid. Once the program starts and encounters a page for the first time, the kernel allocates that page Programs might need more memory than there is in RAM. the kernel splits the data between RAM and disk, and marks all the disk pages as invalid. If a disk page gets hit, then a new page is allocated in RAM, the page from the disk is copied over, and the PTE is pointed to RAM (this process is called ‘paging in’. ‘paging out’ goes the other way) If there is no free physical RAM, then the kernel must free a physical page (or evict it) from RAM to the disk, and mark the page as invalid again. This is expensive eviction is infrequent if an application only uses a subset of their memory pages which can all fit in RAM Most computers have little to no free memory (think multiplexing user on a cloud computer). When free physical memory is scarce, allocation is expensive Interrupts ad Device Drivers A “driver” refers to code in an operating system that manages a particular device. This means setting the device up, handling interrupts, and mediate processes waiting for I/O from the device Importantly this need to be done in an thread safe concurrent fashion Typically, you can split a driver into two halves The top half runs in a process’s kernel thread. This is called via system calls such as read and write that want to perform device I/O The bottom half executes at interrupt time. On interrupt, the bottom figures out what was done, wakes up a waiting process if needed, and then tells the hardware to start work on any queued operation Console Input The console driver takes inputs from the UART serial-port hardware and stores it in a line buffer. User processes then use the read syscall to fetch these lines from the console The UART appears to software via a set of memory-mapped control registers. That is, certain addresses map to the hardware instead of RAM Console Output The device driver maintains some output buffer so that writing processes don’t need to wait for the UART to finish sending You need to wait if the buffer is full though This is a general pattern: You decouple the device activity from the process activity via buffering and interrupts The console driver can process input even when no process is waiting to read it Similarly, processes can send output without having to wait for the device This decoupling technique is called I/O concurrency. It’s useful for slow devices (like UART) or if you need to do immediate action Timer Interrupts Xv6 needs timer interrupts to maintain the current time, and switch amoung compute-bound processes These are just roll-over counters with a fixed increment frequency and well defined thresholds for rollover (stimecmp in Xv6) Timer interrupts only increment time for one CPU (prevents time from passing faster if therer are multiple CPUs) The kernel code may be moved from one CPU to another without warning Locking Most kernels allow the execution of multiple activities Multiple CPUs can execute independently Multiple CPUs share physical RAM, which allows data structures to be shared between them A CPU, even on a single processor system, can switch between multiple threads of execution A device interrupt handler can interrupt at any time You need to manage this parallelism in order to ensure correctness of program execution (concurrency control techniques) Locks provide mutual exclusion, ensuring that only one CPU at a time can hold the lock Upside: Easy to understand Downside: Serializes concurrent operations Races Imagine that you have your bog standard linked list struct element { int data; struct element *next; }; struct element *list = 0; void push(int data) { struct element *l; l = malloc(sizeof *l); l->data = data; l->next = list; list = l; } What happens if two CPUs execute push on the same linked list? Only the one that occurs latter in time will be successfully realized. The first one will get lost. This is an example of a race Races occur when simultaneous memory access occurs with at least one being a write The outcome of races can depend on The machine code generated by the compiler The timing of the two CPUs involved How the two CPU memory operations are ordered by the memory system Locks are one way of avoiding races. Locks ensure mutual exclusion, so that only one CPU can execute the offending lines of code at a time These sections are often called critical sections Locks protect data by maintaining invariants on the data In the critical sections, these invariants might be violated, but at the end, the invariants should hold again Look at linked lists: The invariant is that the head of the list points to the first element, and the next field points to the next element push temporarily violates this invariant when it updates the pointers. Locks ensure that is section is run by only one CPU Locks decrease performance by serializing concurrent code If multiple processes need a lock at the same time, then they are in contention with each other A big part of kernel design is to avoid locks when possible via clever data structures and algorithms Locks also need to be placed correctly. If they are two wide, they can be serializing code which doesn’t need to be Linked lists again: if you locked the entirety of push, then even malloc would be serialized when it doesn’t need to be Spinlocks One implementation of locks is a spinlock. You have one bit of data which represents if a lock has been acquired or not. Naively, one might implement this like void acquire(struct spinlock *lk) // does not work! { for(;;) { if(lk->locked == 0) { lk->locked = 1; break; } } } The problem occurs at the if statement. If two CPUs both reach line 25, they both might see that lk->locked is 0, which means both of them acquire the lock. This violates mutual exclusion To get around this, most multi-core CPUs have some sort of atomic read and swap instruction which does the following in an uninterruptible way (using ampswap r, a in RISC-V as an example): reads the value at memory address a writes the contents of register r to address a puts the read value into r in Xv6, acquiring and releasing a lock basically boils down to using this instruction To acquire a lock, a CPU continually spams the ampswap instruction by writing a 1 to lk->locked If you get a 1 in the return value, then you know that some other CPU has the lock If you get a 0, then this means that no other CPU was holding the lock previously. Hence, the current CPU now has exclusive ownership of the lock and can move on with life Release works in a similar way, but spams 0 instead Using Locks It can be difficult to figure out what data and invariants each lock should protect Some principle to help guide you: Any time a variable can be written by one CPU at the same time that another CPU can read or write it, a lock should be used to keep the two operations from overlapping If an invariant involves multiple memory locations, typically all of these locations need to be protected by a single lock to ensure the invariance is maintained The above are rules for when locks are necessary, but they say nothing about when they are unnecessary You can divide locking into “coarse-locking” and “fine-locking” to handle different speed requirements For instance, in Xv6, there is a single free list with a single global lock (coarse grain) For the file system, each file has it’s own lock, so that processes manipulating different files don’t need to wait for each Deadlocks Locks need to be acquired and released in the same order, otherwise, you get deadlocks For instance, if one CPU acquires B then A, and the other acquires A then B, then both CPUs will be waiting for the other to finish and will stall For Xv6, there are well defined lock ordering which must be maintained daf In general, it’s difficult to maintain global deadlocking avoiding order For instance, module M1 might call module M2, but the lock order requires the lock in M2 to be acquired before M1 This sets constraints on how fine grained that you can set your locks One way that you might be able to avoid deadlocks is with re-entrant locks (or recursive locks) This means that if a process attempts to acquire a lock it already has, then the kernel just allows it instead of panicking struct spinlock lock; int data = 0; // protected by lock f() { acquire(&lock); if(data == 0){ call_once(); h(); data = 1; } release(&lock); } g() { aquire(&lock); if(data == 0){ call_once(); data = 1; } release(&lock); } h() { ... } Take a look at the above code, you would expect that call_once() would be called once, either by f or by g, but not by both If re-entrant locks are allowed and h happens to call g, call_once will be called twice Without recursive locks, you would get a deadlock. Either bug is bad, but a deadlock is easier to track down since the kernel panics on deadlock, but silently fails with reentrant locks If you wanted to use reentrant locks, then you would need to keep track of the depth of the lock You can get deadlocks with interrupts as well For instance, suppose that a CPU acquires a lock for sys_sleep, and the CPU gets a timer interrupt. The interrupt will try to get the lock, won’t be able to get it, and then will wait for it to be released. sys_sleep won’t every release until the timer interrupt is handled To avoid this, if a spinlock is used by an interrupt handler, a CPU must naver hold the lock while interrupts are enabled Xv6 is more conservative: if a CPU acquires any lock, then interrupts are disabled on that CPU You can also uncover deadlocks when the compiler does out of order execution Mentally, you can think of assembly code are running in the order that the code is written is This is not the case: compiler might reorder your code, omit extraneous memory stores and loads, run independent sections of code in parallel etc. That last one isn’t perfect. You can get code that works sequentially, but fails miserably when done out of order You can tell a compiler not to re-order via memory barriers (__sync_syncrhonize() in C library) Spinlocks Sometimes, a lock needs to be held for a long time Say in the file system, a file is held for tens of milliseconds while read and write operations happen Spinlocks are not good for this situation since a lot of CPU time is wasted spinning the lock Another drawback is that a process cannot yield the CPU while retaining the spick lock ie. Another process wants to use the CPU while your waiting for the slow disk if you yield with a spinlock, then you run into the possibility that another thread of the CPU tries to acquire the spinlock. You then have both threads waiting for each other to finish (deadlock!) The solution to this is sleep-locks This consists of a locked field which is protected by a spinlock, and acquiresleep’s call which does an atomic sleep on the CPU, yielding as needed Since sleep locks keep interrupts enabled, the cannot be used in interrupt handlers Scheduling Operating systems typically run more processes than the computer has CPUs, so you need some way to time-share the CPUs amongst the processes Ideally, this sharing should be transparent to the user A common way to do this is to give the illusion that each process has it’s own CPU by multiplexing processes (ie. the work of a process gets shared between CPUs) There are a bunch of challenges that you need to handle to get this working First, you need to be able to save and restore CPU registers (this needs to be done in assembly) How do you force switches in a way that is transparent to user processes? Xv6 uses the standard technique in which hardware timer interrupts drive context switches all CPUs switch amoung the same set of processes, so you need a locking plan to avoid races A process’s memory and other resources must be freed when the process exits, but it can’t do all of this itself for various reasons One of these is that it can’t free ints own kernel stack while still using it Each CPU must remember which process it is executing so that syscalls affect the correct process kernel Finally, sleep and wakeup allow a process to give up the CPU and wait to be woken up by another process or interrupt Context Switching The above shows how you switch from one user process to another a trap from user space to the old process’s kernel thread a context switch to the current CPU’s scheduler thread a context switch to a new process’s kernel thread a trap return to the new user-level process Xv6 has dedicated threads which execute the scheduler because it’s not safe for the scheduler to execute on any process’s kernel stack Another CPU might wake the process up and run it, and it would be bad to have the same stack on two different CPUs Hence, each CPU has it’s own scheduler thread to cope with situations in which more than one CPU is running a process Upon a context switch, the CPU scheduler thread will save the old process’s registers, stack and program counter, and then load the new process’s registers, stack and program counter The function swtch does this save and restore. It only focuses on saving and restoring as set of 32 RISC-V registers, collectively called “contexts” swtch only saves the callee registers. It does not save the program counter; it save the ra register (ie. the return address from which swtch was called) After that, swtch restores registers from the new context, which holds the register value of a previous swtch call. Upon return, the instrutions point to the restored RA (ie. the instruction which called swtch). It also returns on the new thread’s stack Scheduler Basic ideas is that if a process wants to give up the CPU, it must acquire its own process lock release any other locks it is holding update its own process state Disables interrupts then call the scheduler Each CPU runs a scheduler in it’s own dedicated thread The CPU checks that the above sequence was run, then context switches to the scheduler thread The scheduler then dispatches to any waiting processes NOTE: the process lock is held across context switches This must be done since context switching does not preserve invariants. The lock can only be released if the scheduler clears the cpu’s process the new process’s kernel is completely up and running The scheduler in Xv6 is very simple It loops through the process table and looks for a process that is RUNNABLE, sets that process to RUNNING, then context switches and waits for it to yields. It then continues on Sleep and Wakeup What if you need two threads to communicate with each other? Xv6 uses a mechanism called sleep and wakeup Sleep allows a kernel thread to wait for a specific event A thread can call wakeup to indicate that threads waiting on a specific event should resume Semaphores As an example of the utility of Sleep and Wakeup, let’s try and implement a higher-level syncrhonization mechanism called a semaphore. Semaphores maintain a spinlock and a reference count “Producers” can increment this count when ready “Consumers” busy wait for this count to be non-zeroo, then decrement the counter and return A naive implementation might look like 100 struct semaphore { 101 struct spinlock lock; 102 int count; 103 }; 104 105 void 106 V(struct semaphore *s) 107 { 108 acquire(&s->lock); 109 s->count += 1; 110 release(&s->lock); 111 } 112 113 void 114 P(struct semaphore *s) 115 { 116 while(s->count == 0) 117 ; 118 acquire(&s->lock); 119 s->count -= 1; 120 release(&s->lock); 121 } This implementation is expensive, since the consumer is busy waiting for potentially long periods of time To avoid the busy wait would require the consumer to yield the CPU, and only resume after function V increments the counter Suppose that there exists some functions sleep(chan) and wakeup(chan) sleep(chan) puts the calling process to sleep and releases the CPU for other work wakeup(chan) wakes all processes that are in sleep calls with the same channel number. If no processes are waiting on chan, wakeup does nothing The above allows us to change the semaphore implementation as follows 200 void 201 V(struct semaphore *s) 202 { 203 acquire(&s->lock); 204 s->count += 1; 205 wakeup(s); 206 release(&s->lock); 207 } 208 209 void 210 P(struct semaphore *s) 211 { 212 while(s->count == 0) 213 sleep(s); 214 acquire(&s->lock); 215 s->count -= 1; 216 release(&s->lock); 217 } The consumer is no longer busy waiting (it gives up the CPU instead) Lost Wake-Up Problem The above sleep and wakeup implementation runs into the “Lost Wake-Up” Problem Suppose that P finds that s->count ==0 on line 212. While P is between lines 212 and 213, V runs on another CPU, which changes s->count to be nonzero and calls wakeup, which finds no processes and does nothing. P executes lines 213, but missed the previous wake-up, so it continuously sleeps. You need to wait for V to send another wakeup, which might never happen This happens because the invariant that P sleeps only when s->count==0 is violated by V running at just the wrong moment Naively fixing this by moving the acquire call creates a deadlock THe actual fix is to change the interfact of sleep to be sleep(chan, cond_lock), where cond_lock is the semaphore’s spinlock. This allows the lock to be released only after the process has be properly put to sleep This forces a concurrent V to wait until P as finished sleeping Hence, the proper implementation is 400 void 401 V(struct semaphore *s) 402 { 403 acquire(&s->lock); 404 s->count += 1; 405 wakeup(s); 406 release(&s->lock); 407 } 408 409 void 410 P(struct semaphore *s) 411 { 412 acquire(&s->lock); 413 while(s->count == 0) 414 sleep(s, &s->lock); 415 s->count -= 1; 416 release(&s->lock); 417 } Pipes A pipe struct consists of a spinlock a data buffer nread and nwrite keep track of the number of bytes read/written The buffer is a circular buffer (ie. the next byte after $buf[PIPESIZE-1]$ is $buf[0]$). nread and nwrite do not wrap around This convention allows ups to distinguish between a full buffer (nwrite=nread+PIPESIZE) from an empty buffer(nwrite==nread) To index the buffer, you need to use an index of nread % PIPESIZE instead of nread The pipe works by busy-waiting for the lock to free If writing, the pipe writes to the buffer until it’s out of data, or the buffer is full. If the buffer gets full, then wakeup is called to alert any sleeping readers that there is data to be read The write pipe then sleeps and waits for a wakeup call indicating that the pipe has space sleep releases the pipe’s lock during this process If reading, then after acquiring the lock, the pipe reads from the buffer until the buffer is empty or the desired bytes have been recovered If the buffer is empty, then you read all the available bytes, then send a wakeup to all sleeping write pipes Process Locks p->lock must be held while reading or writing any of the following proc fields p-> state, p->chan, p->killed p->xstate, x->pid This is because these fields can be used by other processes On a higher level, p->lock also prevents race conditions (along with p->state) when allocating new processes Conceals a process from view while it is being created or destroyed Prevent a parent’s wait from collecting a process that has set its state to ZOMBIE but has not yet yielded the CPU It prevents another CPU’s scheduler from deciding to run a yielding process after it sets its state to RUNNABLE but before it finishes swtch Ensures only one CPU’s scheduler decides to run a RUNNABLE process prevents a timer interrupt from causing a process to yield while it is in swtch Prevents wakeup from overlooking a process that is calling sleep but hasn’t yielded the CPU yet Prevent the victim process of kill from exiting makes kill check and write p->state in an atomic fashion Real World Scheduling Xv6 uses round-robin shedueling, where each process is run in turn. Real schedulers might implement some priority scheme so that some processes are given preferential treatment. This is a hard thing to do (things like priority inversion and convoys can happen) sleep and wakeup is a simple synchronization method, but it’s not the only one: for example, the Linux kernel uses an explicit process queue Linear scans of the processes during wakeup is inefficient. There are a variety of data structures which could be more efficient. In the context of threading libraries, this structure is referred to as a condition variable sleep is mapped onto wait wakeup is mapped onto signal wakeup currently wakes up all sleeping processes on a particular channel. This causes all the processes to wake up and rush to grab the lock This sort of behavior is called “a thundering herd” A more sane way might be to have a way to wakeup only one process on a channel, or broadcast a wake up to all sleeping processes on a channel File System A file system’s purpose is to organize and store data; it allows one to share data amoung users and applications, as well as persistence so that data is still available after a reboot There are lots of challenges a file system must handle You need on-disk data structures to represent the tree of named directories, the identities of the blocks that hold a file’s content, and to record the free areas of the disk Support crash recovery: If a crash occurs, the file system should still work/ be self-consistent Different processes must be able to operate on the file system at the same time, so the file system must manage invariants The file system needs an in-memory cache of hot blocks, since disk access is orders of magnitude sower than accessing memory Xv6 uses a 7 layer file system. From bottom up the first layer is the disk the buffer cache layer caches disk blocks and synchronizes access to them, ensuring that only one kernel process at a time can modify the data stored in any particular block Logging layer allows higher layer to wrap updates as atomic transactions. This ensures that the blocks are updated atomically in the face of crashes The inode layer provides individual files Each inode as a unique i-number and some blocks which hold the file data The directory layer implements each directory as a special kind of inode whose content is a sequence of directory entries, which each contains a file’s name and i-number The pathname layer provides hierarchical path names and resolves then with recursive lookup The file description layer creates a top-level file system interface for user-level programs Disk hardware traditionally presents data on the disk in 512-bit wide blocks called sectors. These sectors are sequentially layed out For instance, sector 0 is the first 512 bits, sector 1 is the next 512 bits etc. Xv6 holds recently read sectors in memory This data sometimes is out of sync with the disk (perhaps it hasn’t read from the disk, or maybe the cached data has been updated by the software, but not the hardware) Xv6 splits the file system up into several blocks: Block 0 is the boot sector. The filesystem doesn’t touch this Block 1 is the superblock. It contains meta data bout the file system (file system size in blocks, the number of data blocks, the number of inodes, and the number of blocks in the log) Block 2 holds the rest of the file system Buffer Cache Layer Has two jobs Synchronize access to disk blocks to ensure that only one copy of a block is in memory and that only one kernel thread at a time uses that copy Cache popular blocks so that they don’t need to be re-read from the slow disk The interface of Xv6 uses bread obtains a buf struct which contains a copy of the block which can be read or modified in memory bwrite pushes this modified buffer to the appropriate block on the disk brelse frees the buffer once it’s done The buffer cache uses a doubly linked list of buffers This gets initialized with NBUF empty buf structs Each buffer has a valid field, which states wheather the buffer contains a copy of the block, and a disk field, which indicates that the buffer content has been handed to the disk If you reach the end of the list and you ask for a block that is not in the cache, the buffer cache boots out the least recently used buffer for the new block bget scans the buffer list for a buffer with the given device and sector numbers. If there are none in the cache, then the least recently used block in the cache is forcibly cleared, made invalid so that bread is forced to go to disk, then presented as the return from bread brelse releases the sleep lock on a buffer, then moves that buffer to the front of the linked list. This orders the list from front to back by how recently the buffer was released Logging Layer Suppose that you only perform a subset of writes before the computer crashes The crash may leave an allocated but unreference content block The crash might also leave an inode with a reference to a content block If xv6 supports multiple users, then the latter would allow the old file’s owner to read and write block in the new file, owned by a different user Xv6 performs logging by keeping a seperate log which describes all the write a syscall wants to perform. Once all of them are queued up, the syscall writes a commit record to the disk indicating that the log contains a complete operation. The syscall then copies the writes over to the disk. After the writes complete, the log on the disk gets erased Imagine a crash happens If a log is marked as containing a complete operation, then the recovery code copies the writes to where they belong If a log is not marked as complete, the recovery code ignores the log In either case, the log gets erased by the recovery code Let’s think about why this works: If the crash occurs before the operation commits, then the log in the disk will not be marked as complete. The recovery code will ignore it and the state of the disk will be as though the operation didn’t happen If the crash occurs after the operation commits, then recovery will replay all of the operation’s writes It might even repeat some that the operation has started prior to the crash The above makes disk writing atomic: either all the writes appear on the disk, or none of them appear On Xv6, the log resides in the superblock It contains a header block, followed by a sequence of updated block copies The header contains an array of sector numbers (one for each logged block) and the count of log blocks The count is either zero (indicating no blocks) or non-zero (indicatin the log contains a completed transaction with the indicated number of logged blocks) Xv6 write the header block when a transaction commits. The counter gets set to 0 after copying the logged blocks to the file system So a crash midway through a transation will result in a count of zero A crash after a commit will have a non-zero count The log can accumulate logs across many syscalls. To avoid splitting a syscall across transactions, the logging system only commits when no file-system syscalls are happening This batching of transactions is known as group commit The log has a fixed amount of disk space. If a syscall asks for too many blocks, then you need to split these blocks up so that everything fits (for instance, write() could asks for many blocks) Block Allocator We have a free pool of disk blocks which need to be allocated Xv6 block allocator uses a bitmap on disk. A zero bit in the mask indicates the block is free. Otherwise, it is in use Mkfs sets the bits corresponding to the boot sector, superblock, log blocks, inode blocks and bitmask blocks The block allocate provides two functions balloc allocates a new disk block. This loops over all possible blocks (from 0 to sb.size), looking for a free block. Once found, the bitmap is updated and the block is returned There is an optimization where the loop is split into two parts. The outer loop reads seach block of bitmap bits. The inner loop checks all Bits-Per-Block (BPB) in a single bitmap block A race might occur if two processes try to allocate a block at the same time, but the buffer cache prevents this by only letting one process use any one bitmap block at a time bfree finds the right bitmap block and clears the right bit Inode Layer Inode has two meanings The on-disk data structure containing a file’s size and list of data block numbers An in-memory inode, which contains a copy of the on-disk inode as well as extra information needed within the kernel The on-disk inodes are in a contiguous area called the inode blocks. Each inode is the same size. Each inode has a i-number which states where in this contiguous array that it lives at struct dinode contains a type field which distinguishes between files, directories, and special files (ie. devices). A type of zero indicates that the inode is free nlink field counts the number of directory entries that refer to this inode. Needed to know when the on-disk inode and its data blocks should be freed size field counting the number of bytes of content in the file addrs array records the block number of the disk blocks holding the file’s content The active inodes in memory are kept in an itable a struct inode is only stored if there is a C pointer referring to that inode the refs field in inode keeps track of these C pointers. Once the number of pointers hits 0, the kernel discards the inode from memory There are 4 locking mechanisms in the inode code itable.lock protects the invariant that an inode is present in the inode table at most once, as well as the invariant that an in-memory inode’s ref field counts the number of in-memory pointers to the inode Each in-memory inode has a lock field containing a sleep-lock, which ensure exclusive access to the inode’s fields a non-zero ref field forces the system to maintain that inode in the table nlink counts the number of directory entries that refer to a file. Xv6 won’t free an inode if the link count is greater than 0 iget() returns an inode pointer that is gaurenteed to be valid until the matching iput() is called In words, iget() provides non-exclusive access to the the inode, which some processes depend on (think knowing if a file is open or not) If you want the inode structure to be populated with the most recent update of the inode, then you need to call ilock() and iunlock() to get exclusive ownership, and then read from the disk This seperation helps avoid deadlock in some scenarios (say directory lookup) Directory Layer Directories are implemented internally like a file (ie. inodes) Directories have a linked list of names and inodes numbers. Directories with inode number zero are free the function dirlookup() searches a directory for an entry with a given name. If it finds one, it returns a pointer to the correesponding inode and sets *poff to the byte offset of the entry within the directory The function dirlink() write a new directory entry into the specified directory by looping through the directory entries; If an unallocated entry is found, the inode is created at that offset Path Layer Path name lookup involves sucessive calls to dirlookup. The starting point is either the root directory or the current directory After locking the current directory pointer, you scan through the entries, set the current directory to the found directory, and then interate on that new directory There are some concurrency issues: One kernel thread is looking up a pathname while another kernel thread may be changing the directory tree by unlinking a directory. A potential risk is that a lookup may be searching a directory that has been deleted by another thread Locking a directory when doing path lookup and reference counting inodes are some deadlock prevention mechanisms File Descriptor Layer Everything is a file in UNIX. Everything! This layer helps acheive this uniformity Each process gets its own table of open files (file descriptors) struct file is a wrapper around inode or pipe, plus an I/O offset each open() cll create a new open file Multiple processes can have the same open file, but different I/O offsets A single file can appear multiple times in one process’s file table; it can also appear across multiple processes This could happen with dup or fork All open files are kept in a global file table (ftable) Filealloc scans the file table for an unreferenced file and returns a new reference Filedup increments the reference count fileclose decrements this counter. When the counter hits 0, the underlying pipe or inode is released

Fundamentals of Computer Systems

Compilation of notes for Fundamentals of Computer Systems class for Spring 2023. Problem Solving Tips Administrative Stuff Exams Grades Lecture 0 Lecture 1 CPU Models Addition in different bases Definitions Modular Arithmetic Integer format with word size restriction Unsigned binary numbers Binary Addition Algorithm (BAA) Negative numbers Signed Magnitude Representation 1’s Complement Problems with Signed Magnitude and 1’s Complement 2’s complement Intuition behind 2’s complement Lecture 2 Why 2’s complement? Easy subtraction Detecting Overflow Floating Points IEEE standard for a 32 bit word Doubles Underflow Boolean Algebra Boolean Algebra Identities Distributive Law Proof DeMorgan’s Theorem Consensus Theorem circuit Representation of Boolean Algebra Coverting Circuits to Booleans NAND and NOR XOR Duals Lecture 3 Sum of Products (SoP) Product of Sums (PoS) Convert SoP to PoS Minterms Maxterms Karnaugh Map K-map Notation Summary of Simplification with k-maps 2-bit multiplier Don’t Care Conditions Drawing Circuits Lecture 4 Standard Circuits Enabler Decoder Circuit Decoder With Enable MUX (Multiplexer) Representing Functions with Decoders and MUXes MUX trick Shifter Circuit Barrel Shift left w/ Wraparound L-R Shift Circuit with Rollout Unsigned Adder Circuit Half-Adder Full-Adder Signed Adder/Subtractor (2’s-C) Ripple Carry Adder Optimizing Ripple Carry (Carry Lookahead) Merge with known 0’s Code Converter Contraction Example 1 Lecture 5 Latch Intuition SR Latch SR Latch with Control D Latch with control Latches Can’t be Clocked Flip Flops JK Flip Flop (JKFF) Flip Flop Table Trigger Types Sequential Circuit State Machines Registers Register MUXing Shift Register Ripple Counter PLA (Programmable Logic Devices) MIPS Programming Parts of a Program Computer Hardware CPU Memory Clock ISA (Instruction Set Architecture) Programming in MIPS Instruction Types Memory Pointers $pc sp Constants Sign-extend Pseudoinstructions Things to Remember Assembly Code ALU Details of Memory Single Memory Cell Coincident Selection Extending Memory Why is memory not clocked? Architectures Single Cycle Architectures (SCA) Control Control Flow Pipelines MIPS Dependencies Hazards Implementing Hazard Correction Caching Locality Blocks Cache Hits and Misses Direct mapping Fully Associative Caching Set Associative Caching Writing to Main Memory Virtual Memory Problem Solving Tips Break down problems into manageable sub-problems If you get stuck, take a break, then slowly and methodically think through the sticking point (ie. break it down more. Make sure you are reading the problem correctly) Write in your own words what you need to do Remember that everything happens in parallel, and you only keep the results that you want Administrative Stuff Exams Midterm: On March 8th, 6-9 pm (tentative) Check for conflicts End-of-term exam: Either April 27, evening (6-9 pm) or May 5 in the evening Fill in exam conflicts (by 1/24) Acceptable excuses: conflicting classes and religious conflicts Fill in P-credit office hours (By 1/21) Go to web form The top row indicates the start time of the office hours (e.g. 12 means 12-1. 12:30 means 12:30-1) Need at least 10 hours selected Email for help: 3827-staff[at]googlegroups.com Include 3827 in subject header. Otherwise, it will get overlooked Use EdStem for class updates instead of Courseworks, as well as to ask questions b/c it will be faster than email Grades P Credit has a score between 0 to .3 P = 0 you never attend P = .3 you attend every P-cred and actively participate The final grade is G = 100P+(1-P)(10H+max(30M+60F,45M+45*F)) H is the percent correct on the HW M is the percent correct on the midterm F is the percent correct on the final Lecture 0 Work on HW0 (Not graded) Why should I care about this course? Shows how computers run “under the hood” Learn “parallel thinking” Humans think sequentially Circuits run multiple “threads” at once Useful to understand these ideas in many domains Systems Programming Languages (Compilers and Interpreters) Quantum Computing ML/AI Lecture 1 CPU Models CPU: “brain” of the computer Control Unit: D oes calculations on data in datapath Memory: Stores data (for later use) I/O: Interface to the outside (disk, network, monitor, keyboard, mouse, etc.) Simpler model of CPU: Discrete inputs (over time) Discrete Information Processing System communicates with System state and output something based on input System State Remembers previous information Addition in different bases humans: 10 digits: {0,1,2,3,4,5,6,7,8,9} $(4537.8)_{10} = 4*10^{3}+5*10^{2}+3*10{1}+7*10^{0}+8*10^{-1}$ Shifting everything to the left is multiplication by 10 computers: 2 digits: {0,1} $(1011.1)_{2} = 1*2^{3}+0*2^{2}+1*2^{1}+1*2^{0}+1*2^{-1} = 11.5$ Shifting everything to the left multiplies by 2 Hex: 16 digits {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} Hex numbers specified as 0x26BA $0x26BA = 2*16^{3}+6*16^{2}+11*16^{1}+10*16^{0} = (9914)_{10}$ We use hex since it is more human readable and is easy to convert to and from binary a base-2 4 digit can be multiplied by 16 by shifting 4 to the left ($1100 \rightarrow 1100\ 0000$) a base-16 1 digit can be multiplied by 16 by shifting 1 to the left ($0xD \rightarrow 0xD0$) To convert binary to hex, break hex into groups of 4 and convert each group to corresponding binary number $0010\ 1011 \rightarrow 0x2B$ For hex to binary, just convert each hex digit to the corresponding group in binary $0x2B \rightarrow 0010\ 1011$ Definitions True = 1, False = 0 Everything eventually converted to either 1 or 0 Numbers: 19 base 10 = 10011 base 2 Letters: each character is assigned to a 8 bit number Bits: a single binary digit (0 or 1) The leftmost bit is the highest order bit The rightmost bit is the lowest order bit Byte: A group of 8 bits. e.g. (10110010) Word: a grouping of bits computer-architecture dependent The number of bits that the computer architecture can process at once 64-bit word architectures expect data in 64-bit groupings The number of bits used by an architecture is called the word size Common notation for k-bit value: $b_{k-1}b_{k-2}…b_{1}b_{0}$ Modular Arithmetic Def: X mod Y = remainder (X/Y) remainder constrained between 0 and Y-1 Alternative definition: remainder constrained between $-\frac{Y}{2}$ and $\frac{Y}{2}-1$ 100 mod 9 = 1 -10 mod 3 = -1 mod 3 = 2 mod 3 X mod Y = Z mod Y iff $X = Z+KY$ for some integer K Integer format with word size restriction For a given number of digits, how many possible values can you represent? $b^{d}$, where b is the base and d is the number of digits For 2 bit word size: {00, 01, 10, 11} As humans, we can assign whatever value to a distinct bit-sequence: 2 bit word size: {00: .234, 01:i, 10: 234, 11: asdf} Nonsensical , but allowed Unsigned binary numbers Positive integers of a fixed wordsize e.g. 1111 = 15 for a 4-bit word Binary Addition Algorithm (BAA) 1+1 = 0 with a carry of 1 eg. wordsize = 5, add 11110 and 10101 (30+21) 11110+10101 = 10011 if restricted to 5 bits Called overflow: when a result can’t fit within the wordsize constraint The correct answer needs 6 bits (110011) The answer may not be correct because of overflow, but it is correct up to mod $b^{d}$ Negative numbers Signed Magnitude Representation Let the highest order bit denote the sign of the number (0 is positive and 1 is negative) e.g. 0011 = 3 e.g. 1011 = -3 e.g. 1000 = 0000 = 0 To negate, simply flip the highest order bit 1’s Complement Again, highest order bit indicates sign (0 is positive and 1 is negative) To negate a value, flip all the bits 0010 = 2 1101 = -2 e.g. wordsize of 8, value is 11101001. 1’s Complement rep We know its negative (1 in highest order) Negate all the bits and read off positive value, then slap on negative sign Using this, 11101001 = -20 000000 = 111111 are both 0 in 1’s complement Problems with Signed Magnitude and 1’s Complement 0101+1101 = 0010 (after dropping carry of 1) Unsigned binary: 5+8 = 2 (correct mod 16) Signed magnitude: 5+ (-5) = 0 (not 2, even up to mod 16) 1’s complement: 5+(-2) = 3 (not 2, even up to mod 16) TL;DR Binary Addition Algorithm doesn’t work 2’s complement highest order bit is still sign To negate a number, flip all the bits (like 1’s complement) then add 1 in binary (eg a 4 bit word size, you add 0001) 0010 = 2 so 1101+0001 = 1110 = -2 1101 = -3 so 0010+0001 = 0011 = 3 (works for negative) 0 is unique in 2’s complement (0000 = 1111+0001 = 0000) “100000” maps to itself in 2’s complement (largest negative number of the word size) 2’s complement always work with BAA (meaning that with any overflow, the result is off by mod $2^{wordsize}$) Intuition behind 2’s complement Let X be some word of size k Negate all the bits of X and call it Y e.g. X = 100101, so Y = 011010 $X+Y$ using BAA always equals a a word of size k whose entries are entirely 1s $X+Y$ = 111111 Adding the the decimal number 1 to $X+Y$ (ie 000001) causes overflow which yields a word of only zeros. So we can think of the operation of flipping all the bits and adding decimal 1 to a string as the negation of X Lecture 2 The range of unsigned integers is 0 to $2^{k}-1$ The range of signed integers is $-2k^{k-1}$ to $2k^{k-1}$ A string if bits is useless unless you know what representation they represent (unsigned, signed ,ASCII, etc.) Representation: How you interpret the bits Operations: do something to the bits (independent of representation) Why 2’s complement? Easy subtraction Say that we want to find 001110-010101 in signed magnitude representation (14-21). It the same procedure as in base 10, but in base 2 (including carries) Instead, make the representation the 2’s complement Negate 21, then apply BAA. You get the correct answer: 001110-010101=001110+101011=111001 (14-21=7) Detecting Overflow For 2’s complement: Look at the final 2 carries from the BAA algorithm. If the carries are different, you have overflow $00$ case: $5+1=6\rightarrow 0101+0001=0110=6$ $11$ case: $-5-3=-8\rightarrow 1011+1101=1000=-8$ $10$ case (overflow): $-2+-7=-9\rightarrow 1110+1001=0111=7$ $01$ case (overflow): $7+7=14\rightarrow 0111+0111=1110=-6$ Overflow just means that the result can’t be represented with the word size. It does not mean that the result is too big Floating Points Scientific notation: -7.776E3. The number to the left of the E is called the mantissa. The number to the right is the exponent In binary: $-1.10\times2^{01111}= -1.5*2^{7}$ IEEE standard for a 32 bit word highest order bit (31th) is the sign 0 is positive, 1 is negative exponent is bit 30 to 23 represent the value as an unsigned binary, then subtract 127 (the bias) Fraction (the rest of the bits) represents 1.XXXXXX. The “1.” is implicit for all floating point numbers, including zero. The mantissa represent the numbers after the decimal To approximate zero, you set all the bits to zero, which corresponds to $1.0*2^{-127}$, which is effectively zero Doubles A double uses two 32-bit words There is one sign bit, 11 exponent bits, and 52 fraction bits The first word has the sign bit, the exponent bits and the rest are the higher order bits of the fraction The second word is the rest of the fraction The bias is 1023 Underflow Say that we have a number X = 1E-83. Squaring X, we get Y = 1E-166. This cannot be represented as a float. This is referred to as underflow, which is a subcategory of overflow Boolean Algebra Either 0 or 1 (False or True) Can represent as a variable (X,Y,A,B) Complement of X is denoted as $\bar{X}$. This is just flipping a bit Literal a boolean variable or its complement $\cdot$ represents AND (can be omitted) + represents OR Left to right precedence. parentheses give priority to operations You daisy chain operations on multiple values by constructing a truth table where each column represents an unsigned integer Expressions is a set of literals (possibly with repeats) combined with logic operations Equations expression1=expression2 Functions: Takes in some literals and outputs an expression Boolean Algebra Identities $X+XY=X$ $X(X+Y)=X$ Distributive Law Proof $(X+Y)(X+Z) = XX+XZ+YX+YZ$ Apply $XX=X$, $X+XZ = X$, $X+YX=X$ sequentially to show final result DeMorgan’s Theorem Split the big bar at the operator Chance AND to OR and OR to AND Flip-flop literal 0 and 1 to each other Replace function F with the complement $\bar{F}$ and vice versa Consensus Theorem $XY+\bar{XZ}+YZ=XY+\bar{X}Z$ If $YZ = 0$, then trivially you can drop $YZ$ If $YZ=1$ both $Y$ and $Z$ are 1. $XY+\bar{X}Z+1 = 1$, so LHS is 1 Either $X$ or $\bar{X}$. Since $Y=Z=1$, either $XY$ or $\bar{X}Z$ is 1, so RHS is 1 circuit Representation of Boolean Algebra The depth of a circuit refers to how the maximum layers are between the input and output If you want to have multiple inputs, put more lines as the input on the left a k-input gate has a depth of $log_{2}(k)$ Coverting Circuits to Booleans Start from the RHS and move left As you pass each gate, write down the number of inputs in their own section Repeat until you get to your inputs NAND and NOR You invert the output of AND and OR respectively These are universal gates XOR XOR is denoted as $\otimes$ (either the first or the second is true, but not both) Used for parity control (how many of the inputs are true) $Y\otimes 0 = Y$ $Y\otimes 1 = \bar{Y}$ $X_{1}\otimes X_{2}\otimes…X_{k}=1$ when an odd number of $X_{i}=1$ Duals All booleans expressions have duals. TO construct the dual: Swap the ANDs and ORs, swap the literals 0 and 1 for each and rearrange the parentheses as needed You can take the complement of a function by taking the dual, then negating all the literals Lecture 3 Sum of Products (SoP) Any function can be represented as a product of sums (ie. ORing ANDS together) Do all the ANDs before the ORs To calculate, expand out all the terms, then remove duplicates G = X(Y+Z)(X+Y+Z) = XY+XZ Product of Sums (PoS) Any function can be represented as a sum of products (ie. ANDing many ORs together) Do all the ORs before the ANDs Convert SoP to PoS onvert function $F$ to SoP. Take the compliment of SoP by taking the dual and then complementing the literals. This is PoS Minterms Minterm is a product term in which all variable appear exactly once (either complemented or uncomplemented) You can refer to each minterm by the unsigned int that they represent (mX) A function can be described as a sum of its minterms You can evauluate when the function is true by looking at the minterms You can marginalize on a literal by including the minterms that include literal and its complement Minterm expansion is not a simplified form Maxterms The dual of minterms (ie a sum in which all variables appear exactly once, either complemented or uncomplemented) Denoted as MX The product expansion of a function in terms of maxterms Whenever the function evaluates to zero, these inputs correspond to a particular maxterm in the expansion Karnaugh Map Say that you have $F = X+\bar{X}Y$. How do you simplify this? Replace $X = X+XY$ and turn the crank to get $F=X+Y$ Not immediately obvious. Need an automatic tool to help Not the simplest form, but good enough Expand the truth table into a 2D grid indexed Y horizontally and X vertically (origin from the top-left and moves right and down) Each value of the input literal tells you where on the Karnaugh Map Let the vertical axis be X, and the horizontal be YZ Gray encoding: order values of YZ such that only one bit changes at a time Y goes on top, Z goes on bottom Minterm product terms correspond to a $2^{i} \times 2^{j}$ rectangle (including wraparounds) To find the simplest form you can derive from the k-map, you draw the largest rectangles you can (incluing wraparounds) and then relate these rectangles to products K-map Notation Implicant: a product term comprised of minterms for which the function F evaluates to 1 (ie. a box whose dimensions are powers of 2) Prime Implicant: there exists no other implicant that can contain this implicant Essential Prime Implicant: this implicant covers at least one minterm that is not covered by any other prime implicant Summary of Simplification with k-maps Identify all the prime and essential prime implicants Include all the essential PIs If any 1-valued minterms are uncovered by EPIs, choose PIs that are “big” and do a good job covering Heuristic: Choose the PIs that minimize overlap with one another and with EPIs 2-bit multiplier You have 2 2 bit numbers. You need 4 bits to represent the largest product Each bit in the output can be written as a k-map, which can be simplified to write what inputs correspond to this particular output bit getting turned on Don’t Care Conditions These are circumstances where the output doesn’t matter Going to the 2-bit multiplier, say that you know that some set of inputs will never occur. Hence you can ignore the output of those inputs when building a circuit Denote X as the outputs where you don’t care about the output In the k-map, replace the don’t care conditions with X. You can optionally cover X’s with EPI’s (ie. you can include X’s in larger boxes to simplify circuitry) Drawing Circuits Denote a circuit as a blob so you don’t have to redraw a circuit over and over again When an input changes, it takes time for the output to get updated Lecture 4 Standard Circuits NB: Think about circuits as having constant input Enabler Takes in data D (k bits in parallel), as well as an enable/disable trigger E. There is a single output of k bits If E=1, the output equals the input If E=0, output equals k 0’s Implement enabler with k AND gates. Each input has it’s own AND gate, and the other pin is the Enable input Decoder Circuit Takes in a k-bit input and $2^{k}$ 1-bit outputs The selector indicates as an unsigned binary which output = 1. All other outputs are zero Write truth table out. Each row is a minterm. Recall that each minterm evaluates for 1 for a unique combination of literals. You can write decoders with a combination of NOTs and ANDs Decoder With Enable Slap the two previous circuits together Push the output of the decoder through k AND gates where every AND gate has a input connected to the enable pin MUX (Multiplexer) Takes $2^{k}$ 1-bit data values as input. Has a k-bit selector that indicates in unsigned binary which data bit is output You push all the data values in as input simultaneously, and only the one you select is the output. All other inputs are dropped You have a selector part that chooses which input to select You pass all this data to an enabler, whose output gets passed to an OR gate Representing Functions with Decoders and MUXes Use a Decoder and OR together the minterms that evaluate to 1 of your function Use a Multiplexer where you feed in the output of your function. The input to the multiplexer is the values that your literals you take. The output is then your function MUX trick Look at your truth table in pairs of rows. You can select on A and B as the selector input. You pass the C input as one of the the $2^{k}$ 1-bit inputs and it’s complement Shifter Circuit k-bit data value enter as input l-bit selector in unsigned binary tells how many bits to shift k-bit output where bits are shifted by l-bits (can wraparound or drop out) Barrel Shift left w/ Wraparound Say you want to shift over by 2 with wrap around You use MUXes. Each output bit of the MUX goes to the output of the Shifter. All the MUXes have the same shift bit as their selector L-R Shift Circuit with Rollout ...