Compilers

Referencing “Modern Compiler Implementation in ML” by Andrew W. Appel. Lexical Analysis This refers to breaking up the input source code into individual words/tokens A “token” is a sequence of characters that can be treated as a unit in the grammer of a programming language Some common tokens which are typical in programming languages: ID: think names of any kind (variable, function classes etc.) NUM: Think integers REAL: Think floating point RESERVED: Words which are reserved in the programming language. “If” is a typical one These reserved words cannot be used as identifiers (ID) macros and preprocessor directives are done prior to lexing Only “weak” languages require a macro preprocessor (what does that mean?) Some of these token hae semantic values attached to them (think identifiers and literals), which give additional information Regular Expressions You could describe the lexical rules of a programming language in a normal language That’s very verbose and not precise A language can be though of as a set of strings A string is a finite sequence of symbols the symbols are drawn from a finite alphabet A language can be a finite or infinite set The Pascal language is an infinite set, since it includes all possible Pascal programs The language of C reserved words is the set of all alphabetic strings that cannot be used as identifiiers in the C programming language In this viewpoint, we just need to figure out if a given string is within the language set or not Regular expressions provide a compact notation to represent a set of strings Symbol: each symbol the alphabet of the language, the regular expression $\mathbf{a}$ denotes the language containing just the string “a” Alternation: given two regular expressions M and N, the alternation operator ("|") creates a new regular expression $M | N$. A string is in this new language if it was in language M or language N Concatenation: Given regular expressions M and N, $M\dot N$ creates a new regular expression. A string is in $M \dot N$ if said string the concatenation of any two strings in the sub languages (so $(a|b) \dot a$ is the language of strings “aa” and “ba”) $\epsilon$ represents the languages whose only elements is the empty string Repetition: Given a regular expression M, the Kleene closure M* defines a new language. A string is in M* if said string is a concatenation of zero or more strings, all of which are in M (so $((a|b)\dot a)*$ is an infinite set of strings $("", “aa”, “ba”,“aaaa”,“baaa”, “aaba”, …)$) The above are the base operations. You can make some notational short hand: $[abcd] = (a| b|c|d )$ $[b-g] = [bcdefg]$ $M? = (M| \epsilon)$ $M+ = (M\cdot M*)$ There are also some convenience things: a “.” represents any single character except newline A string wrapped in quotes (re: “a.+*” stands for itself literally) Using the above, we can write some common tokens as: Regex Tokens if $250 [a-z][a-z0-9]* (ID) [0-9]+ (NUM) ([0-9]+"."[0-9]*) ([0-9]*"."[0-9]+) A given lexical specification is complete if it always matches some initial substring of the input Can achieve this by having a rule that matches any single character There is some ambiguity in these rules, like what if there are multiple matches? Lex uses two disambiguation rules Accept the longest match which satisfies the regex rules If multiple regular expressions match a longest substring, then the first regular expression determines it’s type This implies that the order you list the regular expressions in matters As an example, “if8” will get matched to an identifier, while “if” gets matched to the IF token Deterministic Finite Automata (DFA) Need to be able to implement regex in a computer program. One representation which lends itself well to this is a finite automaton Consists of a finite set of states, along which edges (labelled by a symbol) which transitions between each state One state is denoted as the start state, while there can be multiple states which are final states (re: a regex expression needs to end on one of these) In a deterministic finite automaton (DFA), no two edges leaving from the same state are labelled with the same symbol a DFA can take in a string, and either accept or reject said string. From the start state, each character in the string induces a transition to some other state. If there is no valid transition available, or if after consuming the string, the DFA is not in a designated final state, then the string is rejected You can convert this graphical representation of the DFA into a transition matrix This is a 2D array which is subscripted by the state number (re: all possible states) and the input character (re: all of the alphabet of the language) Each entry in the matrix tells you which state you need to jump to There is a “dead” state that just loops to itself on all characters (used to encode the absence of an edge) You also need a “finality” array, which maps state number to tokens If you want the longest match for a given input string, you also need to keep track of the last final state the DFA was in, as well as the associated character position Nondeterministic Finite Automata (NFA) Unlike DFAs, NFAs can have edges which share the same symbol. In addition, special $\epsilon$ edges can be traversed without consuming any input The non-determinism comes from the fact that you need to pick which branch you go down It’s easy to convert a regex into a NFA. Just follow the above rules. At the end, you should end up with some NFA with a tail and a head Each symbol gets converted into a tail (re: transition) and a head (re: a state) Same for an $\epsilon$ edge Concatenations are just chains Alternations are just split paths with an epsilon edge buffer Repetition is just a feedback loop This is not necessarily the most efficient realization of said regular expression, but it works Each regular expression is either a primitive (a single symbol or $\epsilon$) or it is made from smaller expressions. Ditto for NFAs Going from a NFA to DFA is harder In order to get the longest match, you need to be able to traverse all of the parallel $\epsilon$ edges at a given node (This applies recursively) This describes the notion of an $\epsilon$-closure Let edge(s,c) be the set of all NFA states reachable by following a single edge with label c from state s For a set of states S, closure(S) is the set of states that can be reached from a state in S without consuming any of the input (re: you cross all the reachable $\epsilon$ edges) This can be represented mathematically. closure(S) is the smallest set T such that $T = S \cup (\cup_{s\in T} edge(s, \epsilon)) $ T can be calculated in a straight forward manner: while $T_{new} \neq T$, let $T_{new} = T \cup (\cup_{s\in T} edge(s, \epsilon))$ This converges since T can only grow every iteration. There is a finite number of states, so the algorithm must terminate. If the cardinality doesn’t grow ($T_{new}= T$), this means that T includes $(\cup_{s\in T} edge(s, \epsilon))$ Suppose that you have a set d of NFA states you need to visit. Define a function DFAedge(d,c) where closure( $\cup_{s\in d}$ edge(s,c)) Can think of this as traversing the edges associated with a character, and then traversing all of the epsilons You can simulate an NFA via the following algorithm let d be the closure of your start state $s_{1}$ for each character in you input string, calculate DFAedge(d,$c_{i}$), then update you set d The above outlines how to construct a DFA using closures and DFAedge $\Sigma$ denotes the alphabet of the language In words, you start with the set generated by the closure of the start state. Iterate through this set. If a character matches one of the edges, check if the connecting state is already in the set. If it is, set that the associated element in the transition matrix to the next state. Otherwise, append the state to the queue, and set the transition matrix element. Do this until you have gone through all of the states This algorithm doesn’t reach unreachable states, which prevents exponential blowups This is not an optimal automaton

Date Created: June 11, 2025 | | Last Modified: May 13, 2026 ||

GLox

Date Created: July 6, 2024 | | Last Modified: May 13, 2026 ||

Building (Yet Another) CHIP8 emulator

TL;DR Why? Great, How Do You Get Started? Environment Setup Invaluable Resources “Hardware” CPU Data Buses Interface Display Memory Sound Keyboard Tying It All Together Fetch-Decode-Execute Windows Update So… Now What? TL;DR If you want to see it in action, click here to play the WASM version. It’s currently not as fast as the desktop app, but it works! Why? Why not? But in all seriousness, I eventually want to make a GBA emulator. However, I’ve always been put it off since reading the hardware documentation on GBATEK makes me go cross-eyed. ...

Date Created: April 27, 2024 | | Last Modified: May 13, 2026 ||

Advanced Programming

Basic CLI commands Vim Cheatsheet Standard Library Functions Compiling and Linking Gcc Header Files Lecture 2 Make Git Lecture 3 Data Types in C Sizeof() Computer Representation of Numbers Signed versus Unsigned Operator Gotchas Precedence Left shift Short Circuit Pre and Post Increment Undefined Behavior Expressions vs Statements Storage Classes Automatic Variables (Local) Static Variables Memory Address Space Stack Data Pointers Void Pointers Faux Pass by Reference NULL pointers Pointer Errata Dangling Pointers Arrays Pointer Arithmetic GUT of Pointers and Arrays Heap Heap Safety Strings String Literals argv Constant Pointers Function Pointers Parsing Function Pointers Structs Unions Typdef Libraries IO Standard I/O Stdout Stdin Stderr Redirection Pipes Files Blocking and Reading Buffering File Seeking Formatted IO Inspecting binary files Endianness Network byte order Forks Reaping Children exec() UNIX Users and Groups Shell Scripts TCP/IP netcat File Descriptors Sockets API Connect Bind, Listen, Accept Send and Receive HTTP/1.0 URLs Command line HTTP requests GET Requests (Client) GET Requests (Server) Status Codes Class Information Course Homepage ...

Date Created: August 21, 2023 | | Last Modified: May 13, 2026 ||