Here is a compendium of my notes that I took to review for Machine Learning. This was meant to be a quick refresher on concepts and to put everything in one place. For a less cursory version, see the background section under Resources at the course website.

Probability and Statistics

Basic Concepts

Axioms of Probability

  • Probability Measure $P: \mathbb{F} \rightarrow \mathbb{R}$ such that
    • $P(A) \geq 0$ for all $A \in \mathbb{F}$
    • $P(\Omega) = 1$
    • If $A_{1},A_{2}…$ are disjoint events, then
      • $P(\cup A_{i}) = \Sigma_{i} P(A_{i})$

Definitions

  • Sample space $\Omega$: The set of all possible outcomes
  • Event space $\mathbb{F}$: the set of events, where each event $A \in \mathbb{F}$ is a subset of $\Omega$
  • A random variable is a quantity that has an uncertain value.
    • Examples include coin flips (discrete), daily temperature (continuous)
  • An ensemble sampled from a random variable forms a probability distribution
  • The probability density/mass function (pdf, pmf) denotes the probability distribution of a continuous or discrete random variable respectively. The integral/sum must be 1.
  • Joint probability p(x,y) denotes the probability of the pair (x,y) occurring. Can be a mix of discrete and continuous, and x and y can also be vectors.
  • Marginalization refers to the “summing out” of a variable from a probability distribution ($p(x) = \int p(x,y) dy$ for instance). This removes dependence of distribution on summed variables. This can be extended to discrete ($\int \rightarrow \Sigma$) or to multiple variables ($p(x,y) = \Sigma_w \int p(w,x,y,z) dz$ ).
  • Conditional Probability is the probability of x given a fixed value of y, written as $p(x|y = y^{*})$. Like taking a cross section of a CAD model
    • $p(x|y = y^{*})$ is a relative probability, since it does not necessarily sum to 1. Need to normalize so that the sum is 1.
    • $p(x,y) = p(x|y)p(y) \rightarrow p(x,y) = p(y|x)p(x)$ from symmetry
  • Chain Rule: $p(x_{1},x_{2},…,x_{k}) = \Pi_{i=1}^{k} p(x_{i}|x_{1},…,x_{i-1})$
    • Note: when $i=1$, the conditioning set is empty, so the factor is just $p(x_{1})$; trivial, but important to point out
    • In words: the joint probability is the product of the probability of the first, times the probability of the second given the first, etc.
    • $p(x|y)=p(x)$ and $p(y|x)=p(y)$ are the conditions for x and y to be independent.
      • Substituting into the conditional probability definition yields $p(x,y)= p(x)p(y)$
  • Conditional Independence: $p(x_{1}|x_{2},x_{3}) = p(x_{1}|x_{2})$ and $p(x_{3}|x_{1},x_{2}) = p(x_{3}|x_{2})$ are the conditions. Note the symmetry in $x_{1}$ and $x_{3}$
    • In words: $x_{1}$ is conditionally independent of $x_{3}$ given $x_{2}$. Or, if we know $x_{2}$, then $x_{1}$ gives no information about $x_{3}$.
  • Variance of a population: $\sigma^{2} = \frac{\Sigma(x-\mu)^{2}}{n} = \frac{\Sigma x^{2}}{n}-\mu^{2}$
  • Variance of a sample: $s^{2} = \frac{\Sigma(x-\bar x)^{2}}{n-1} = \frac{\Sigma x^{2}-\frac{(\Sigma x)^{2}}{n}}{n-1}$
  • Standard Deviation = $\sqrt{s^{2}}$
  • Standard Error = $\frac{s}{\sqrt{n}}$
  • Null Hypothesis: There is no significant difference between two populations
  • Type I error: Probability of rejecting a true null hypothesis
    • Equals significance level $\alpha$
  • Type II error: Probability of failing to reject a false null hypothesis

Bayes’ Rule

  • $p(x|y) = \frac{p(y|x) p(x)}{p(y)}$
    • posterior: $p(x|y)$ or what we know about x given y
    • prior: $p(x)$ or what we know about x before looking at y
    • likelihood: $p(y|x)$ or how likely the observation y is under each possible value of x
    • evidence: $p(y)$ or the overall probability of observing y (the normalizer)
      • Typical example: a diagnostic test for Covid is 95% accurate. This means that the test correctly identifies the true state of the patient 95% of the time (ie. if she has Covid, the test returns positive; if she doesn’t, it returns negative). Here X is the random variable of the true Covid state of the patient and Y is the random variable of the test result. Suppose the infection rate of Covid is 1 in 1000.
        • What is $p(Covid|Pos)$? Not 95%.
        • $p(Covid|Pos) = \frac{p(Pos|Covid)p(Covid)}{p(Pos|Covid)p(Covid)+p(Pos|NoCovid)p(NoCovid)}$
        • Plugging in values yields about 2%
      • A more conceptual way of thinking about Bayes’ rule is that it tells you how to eliminate impossible outcomes given some information about the world.
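
As a sanity check, here is the calculation from the example above in a few lines of Python; the 95% accuracy and 1-in-1000 prevalence come from the example, nothing else is assumed.

```python
# Posterior p(Covid | Positive) via Bayes' rule, using the numbers from the example above.
p_covid = 1 / 1000            # prior: infection rate
p_pos_given_covid = 0.95      # test returns positive when the patient has Covid
p_pos_given_no_covid = 0.05   # false-positive rate (1 - accuracy)

# evidence p(Pos), obtained by marginalizing over the true state
p_pos = p_pos_given_covid * p_covid + p_pos_given_no_covid * (1 - p_covid)

p_covid_given_pos = p_pos_given_covid * p_covid / p_pos
print(f"p(Covid | Pos) = {p_covid_given_pos:.3f}")   # ~0.019, i.e. about 2%
```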

Expectation

  • $E[f(x)] =\int f(x) \cdot p(x)dx$ ($\int \rightarrow \Sigma$ for discrete)
    • p(x) is the weight assigned to each value of f(x)
      Common choices of f(x) and the name of the corresponding expectation:
      • $x$: mean $\mu_x$
      • $x^k$: k-th moment about 0
      • $(x-\mu_x)^k$: k-th central moment
      • $(x-\mu_x)^2$: variance (spread)
      • $(x-\mu_x)^3$: skew (lean)
      • $(x-\mu_x)^4$: kurtosis (peakedness)
      • $(x-\mu_x)(y-\mu_y)$: covariance of x and y
    • Expectation Manipulation Rules. Let a be a constant
      • $E[a] = a$
      • $E[a\cdot x] = aE[x]$
      • $E[x + y] = E[x]+E[y]$
      • $E[x \cdot y] = E[x]\cdot E[y]$ if x,y are independent
    • Variance Manipulation Rules
      • $Var[a\cdot x+b] = a^{2}\cdot Var[x]$
      • $Var[x+y] = Var[x]+Var[y]$ if x,y are independent

Common Probability Distributions

Bernoulli Distribution

  • Simple success or failure experiment
  • let $\lambda$ denote the probability of success. x=0 denotes failure and x=1 denotes success
    • $Bern_{x}(\lambda) = \lambda^{x}(1-\lambda)^{1-x}$
      • $p(x=0) = 1-\lambda$
      • $p(x=1) = \lambda$
      • $E[x] = \lambda$
      • $Var[x] = \lambda(1-\lambda)$

Binomial Distribution

  • Given a sequence of $N$ independent Bernoulli experiments, the Binomial distribution enumerates the probability for $m$ successes in the sequence
  • $Bin_{m}(N,\lambda) = {N \choose m} \lambda^{m}(1-\lambda)^{N-m}$
  • $E[x] = N\lambda$
  • $Var[x] = N\lambda(1-\lambda)$
  • For a fixed $N\lambda$, the Binomial converges to Poisson as $N \rightarrow \infty$
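
A quick sketch of that convergence using scipy.stats, with an arbitrary fixed rate of 3 expected events:

```python
# As N grows with N*lambda held fixed, Binomial(N, lambda) pmf values approach Poisson(N*lambda).
from scipy import stats

rate = 3.0                                    # fixed N * lambda (arbitrary)
for N in (10, 100, 1000):
    lam = rate / N
    print(N,
          stats.binom.pmf(2, N, lam),         # P(m = 2 successes) under the Binomial
          stats.poisson.pmf(2, rate))         # same probability under the Poisson limit
```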

Poisson Distribution

  • Consider independent events that happen at an average rate $\lambda$ over time (ie. $\lambda$ has “units” of number of events per time)
  • The Poisson distribution is discrete, and gives the probability of a given number of events $k$ occurring in a fixed time interval
  • $p(k,\lambda) = \frac{\lambda^{k}e^{-\lambda}}{k!} = Pois_{k}(\lambda)$
  • $E[x] = \lambda$
  • $Var[x] = \lambda$

Categorical Distribution

  • Suppose you run a single experiment with $K$ possible outcomes indexed by $k \in \{1,2,…,K\}$. Let $\lambda =[\lambda_{1},\lambda_{2},…,\lambda_{K}]$ with $\Sigma_{k=1}^{K} \lambda_{k} = 1$
  • $p(x=k) = \lambda_{k}$
  • $Cat_{x}(\lambda) = p(x)$
  • $E[\mathbb{1}(x=k)] = \lambda_{k}$
  • $Var[\mathbb{1}(x=k)] = \lambda_{k}(1-\lambda_{k})$ (the indicator of outcome k is Bernoulli($\lambda_{k}$))

Multinomial Distribution

  • The Multinomial distribution gives the probability of observing a particular combination of counts across the categories
  • Generalization of the Categorical Distribution to N trials and of the Binomial Distribution to K outcomes.
  • Let $m = [m_1, m_2, …, m_K]$ denote the observed counts in each category (total number of categories is $K$) over a sequence of $N$ trials
  • Let $\lambda = [\lambda_1, \lambda_2, …, \lambda_K]$ denote the probability of each category (subject to the normalization $\lambda_1+ \lambda_2 + … + \lambda_K = 1$)
  • $p(m) = {N \choose m_1\, m_2\, …\, m_K}\lambda_1^{m_1}\lambda_2^{m_2}…\lambda_K^{m_K}$
  • $E[m_{k}] = N\lambda_{k}$
  • $Var[m_k] = N\lambda_{k}(1-\lambda_{k})$

Gaussian (Normal) Distribution

  • Useful because a Gaussian is fully specified by only two moments and because the central limit theorem (CLT) holds
  • CLT states that the mean of independently drawn random variables is normally distributed (given enough samples), irrespective of the original distribution
  • $E[x] = \mu$
  • $Var[x] = \sigma^{2}$
  • $p(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{\frac{-(x-\mu)^{2}}{2\sigma^{2}}}$
  • Standard normal distribution ($\mu=0$, $\sigma = 1$)

Multivariate Gaussian Distribution

  • Let $d$ be the dimension of the space
  • Let $\mu$ be a d-dimensional vector (mean vector)
  • Let $\Sigma$ be a $d \times d$ symmetric and positive semi-definite matrix (covariance matrix)
  • $p(x) = \frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}} exp(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu))$
  • $E[x] = \mu$
  • $Cov[x] = \Sigma$
  • Alternatively, $p(x) = \frac{|\beta|^{\frac{1}{2}}}{(2\pi)^{\frac{d}{2}}} exp(-\frac{1}{2}(x-\mu)^{T}\beta(x-\mu))$ where $\beta = \Sigma^{-1}$ is the precision matrix
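
A minimal sketch, with an arbitrary 2-D mean and covariance, that evaluates the density formula above by hand and checks it against scipy.stats.multivariate_normal:

```python
# Evaluate the multivariate Gaussian density by the formula above and compare with scipy.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

d = len(mu)
diff = x - mu
manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
         np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
print(manual, multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # the two values agree
```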

Laplace distribution

  • $p(x) = \frac{1}{2\gamma}exp(-\frac{|x-\mu|}{\gamma})$
  • Places sharp peak at arbitrary point $\mu$ with spread of $\gamma$

Dirac Delta Distribution

  • $p(x) = \delta(x-\mu)$
  • Clusters all mass of the distribution function into a single point
  • Infinitely high, infinitely narrow peak
  • Exists to be integrated
  • Can be extended to an arbitrary number of points (empirical distribution):
    • $p(x) = \frac{1}{m}\Sigma_{i=1}^{m} \delta(x-x_{i})$

Mixtures of Distributions

  • Define a collection of component distributions and a probability distribution P(c) that selects which component is drawn from.
  • $P(x) = \Sigma_{i} P(c=i) P(x |c=i)$
  • c is a latent variable, or a random variable that can’t be observed directly
  • Common examples are the empirical distribution and Gaussian mixture models, where the components are Gaussians.
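
A small sketch of ancestral sampling from a two-component Gaussian mixture (the weights, means, and standard deviations are made-up illustrative values): draw the latent c from P(c), then draw x from the selected Gaussian.

```python
# Ancestral sampling from a two-component Gaussian mixture:
# draw the latent component c from P(c), then draw x from the chosen Gaussian.
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])   # P(c = i)
means = np.array([-2.0, 3.0])
stds = np.array([0.5, 1.0])

c = rng.choice(len(weights), size=10_000, p=weights)   # latent component per sample
x = rng.normal(means[c], stds[c])                      # x | c
print(x.mean())   # close to 0.3*(-2) + 0.7*3 = 1.5
```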

Common Functions

Logistic sigmoid $\sigma(x)$

  • $\sigma(x) = \frac{1}{1+exp(-x)}$
  • Defined on $\mathbb{R}$ and confined between 0 and 1
  • Saturates as $x \rightarrow \pm \infty$ (ie insensitive to changes in x)
  • Maximum slope at x=0
  • Useful to produce parameter of a Bernoulli distribution
  • Properties:
    • $\sigma(x) = \frac{exp(x)}{exp(x)+exp(0)}$
    • $\frac{d}{dx}\sigma(x) = \sigma(x)(1-\sigma(x))$
    • $1-\sigma(x) = \sigma(-x)$

Softplus Function $\zeta(x)$

  • $\zeta(x) = log(1+exp(x))$
  • Defined on $\mathbb{R}$ and confined between 0 and $\infty$
  • Useful to produce the $\sigma$ or $\beta$ parameter of a normal distribution, since it is always positive
  • A smoother version of $f(x) = max(0,x)$
  • Properties:
    • $log(\sigma(x)) = -\zeta(-x)$
    • $\frac{d}{dx} \zeta(x) = \sigma(x)$
    • $\forall x \in (0,1), \sigma^{-1}(x) = log(\frac{x}{1-x})$
    • $\forall x>0, \zeta^{-1}(x) = log(exp(x)-1)$
    • $\zeta(x) = \int_{-\infty}^{x} \sigma(y) dy$
    • $\zeta(x)-\zeta(-x) = x$
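
A short numerical check of a few of the properties listed above, in plain NumPy (nothing here is specific to any library's built-in sigmoid/softplus):

```python
# Numerically check a few of the sigmoid/softplus identities listed above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))   # log(1 + exp(x))

x = np.linspace(-5, 5, 11)
print(np.allclose(1 - sigmoid(x), sigmoid(-x)))          # 1 - sigma(x) = sigma(-x)
print(np.allclose(np.log(sigmoid(x)), -softplus(-x)))    # log sigma(x) = -zeta(-x)
print(np.allclose(softplus(x) - softplus(-x), x))        # zeta(x) - zeta(-x) = x
```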

Information Theory

Self-Information

  • General Idea: knowing something unlikely has happened provides more information than knowing something likely has happened. Information theory quantifies this statement
  • Properties of Information
    • Likely events have low information (extreme: guaranteed events contain no information)
    • Less likely events should have more information
    • Independent events should have additive information (ie. tossing a coin twice contains twice as much information as tossing a coin once)
  • The above motivate the definition of self-information. Let x be an event. Then $I(x) = -log(P(x))$ is the self-information of the event. The log is typically base 2 or base e, but in theory could have any base you want; a change in base is effectively a change in scale/units.
  • NOTE: by convention, $0 \cdot log(0)$ is treated as 0, consistent with $\lim_{x\rightarrow 0^{+}} x\, log(x) = 0$

Shannon Entropy

  • Extends I(x) to determine amount of uncertainty in a probability distribution:
    • $H(x) = H(P) = E[I(x)]$, or the expectation value of I(x)
  • In words, $H(x)$ gives the expected/average amount of information that you can gain from drawing from the probability distribution. Entropy is maximized for a uniform probability distribution (this has a very direct connection to entropy in the thermodynamic sense)
  • When x is continuous, Shannon entropy is known as differential entropy

Kullback-Leibler (KL) divergence

  • a way of quantifying the difference in information between two distributions $P(x)$ and $Q(x)$
    • $D_{KL}(P||Q) = E_{x\sim P}[log(P(x))-log(Q(x))]$
  • In words (for the discrete case), $D_{KL}$ is the extra information needed to send a message using the code P instead of the code Q.
  • KL divergence is non-negative
  • It is zero iff P and Q are the same distribution
  • NOT symmetric in P and Q (not a true measure of distance)

Cross-entropy

  • $H(P,Q) = H(P)+D_{KL}(P||Q) = -E_{x\sim P}[log(Q(x))]$
  • Minimizing cross-entropy w.r.t. Q is the same as minimizing KL divergence
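
A small sketch with two made-up discrete distributions, verifying the identity $H(P,Q) = H(P) + D_{KL}(P||Q)$ numerically:

```python
# Entropy, KL divergence, and cross-entropy for two small discrete distributions (base-e logs),
# checking H(P, Q) = H(P) + D_KL(P || Q).
import numpy as np

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.4, 0.4, 0.2])

H_P = -np.sum(P * np.log(P))                  # Shannon entropy of P
D_KL = np.sum(P * (np.log(P) - np.log(Q)))    # expectation taken under P
H_PQ = -np.sum(P * np.log(Q))                 # cross-entropy

print(H_P, D_KL, H_PQ, np.isclose(H_PQ, H_P + D_KL))
```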

Exponential Distribution

  • Let $p(x;\lambda)$ be a piecewise function such that $p(x;\lambda)=0$ for $x<0$ and $p(x;\lambda) = \lambda\, exp(-\lambda x)$ for $x \geq 0$, where $\lambda$ is the decay parameter
  • Serves to put a sharp peak at x=0

Chi-squared Distribution

Basics

  • Chi-squared is a continuous distribution of the sum of the squares of k standard normally distributed random variables.
  • k is called the number of “degrees of freedom”
  • $q = \Sigma_{i=1}^{k} x_{i}^{2}$ ($\chi^{2}$ parameter)
  • $p(x) = \frac{x^{\frac{k}{2}-1}e^{-\frac{x}{2}}}{2^{\frac{k}{2}}\Gamma(\frac{k}{2})}$
  • $E[x] = k$
  • $Var[x] = 2k$

Goodness of Fit (GOF)

  • $\chi^{2} = \Sigma \frac{(O-E)^{2}}{E} $
  • degree of freedom (dof) = number of categories -1
  • Look up the p-value in a $\chi^{2}$ table. Whether you accept or reject the null hypothesis is determined by whether the p-value is greater than (don’t reject the null) or less than (reject the null) the significance level $\alpha$
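
A sketch of a goodness-of-fit test using scipy.stats.chisquare; the die-roll counts are invented for illustration:

```python
# Goodness-of-fit test: 100 die rolls against a fair-die expectation.
from scipy.stats import chisquare

observed = [18, 22, 16, 14, 12, 18]
expected = [100 / 6] * 6
stat, p_value = chisquare(observed, f_exp=expected)   # dof = 6 - 1 = 5
print(stat, p_value)   # reject the null (fair die) only if p_value < alpha
```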

Independence

  • Check whether categorical data is independent
  • You are given observed data table O
  • expected data table E is calculated cellwise
    • For a given row and column, the cell value is $E = \frac{(\text{row total})(\text{column total})}{(\text{grand total})}$
    • Calculate $\chi^{2}$ as normal
    • dof = (#rows-1)(#cols-1)
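
A sketch of the same test using scipy.stats.chi2_contingency, which also returns the expected table computed with the row/column-total formula above; the counts are invented:

```python
# Test of independence on a 2x2 contingency table; scipy also returns the expected table,
# built cellwise from (row total)(column total)/(grand total).
from scipy.stats import chi2_contingency

observed = [[30, 10],
            [20, 40]]
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(stat, p_value, dof)   # dof = (2 - 1)(2 - 1) = 1
print(expected)
```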

T-test

One Sample

  • Tests whether the mean of a normally distributed population is different from a specific value
  • Let $\bar{x}$ be the sample mean, s the sample standard deviation, and $\mu_{0}$ the specific mean value to be tested
  • $t = \frac{\bar{x}-\mu_{0}}{\frac{s}{\sqrt{n}}}$
  • dof = n-1
  • Using a lookup table, find the corresponding p-value, with the following caveats depending on what your alternative hypothesis is:
    • $\mu>\mu_{0} \rightarrow$ read table as given
    • $\mu<\mu_{0} \rightarrow$ if the t-statistic is negative, read the table as though the t-value were positive
      • If the t-statistic has the incorrect sign, read the $1-p$ value instead
    • $\mu\neq\mu_{0} \rightarrow$ read table as given, but double the p-value to take into account both tails
  • If p-value is less than $\alpha$, reject the null hypothesis

Two Sample

Test whether the means of two populations are significantly different from one another

Paired
  • This means that each value in the first group has a one-to-one mapping to a value in the second group
  • Subtract values in a pairwise manner and run one sample t-test with $\mu_{0}$ = 0
Unpaired
  • $t = \frac{\bar x_{1} - \bar x_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}$
  • dof = $(n_{1}-1)+(n_{2}-1)$
  • Read off p-value and compare to significance level
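
A sketch of the three t-test variants via scipy.stats on synthetic data (the means, spreads, and sample sizes are arbitrary):

```python
# One-sample, paired, and unpaired t-tests on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=5.2, scale=1.0, size=30)
b = rng.normal(loc=4.8, scale=1.0, size=30)

print(stats.ttest_1samp(a, popmean=5.0))   # H0: the mean of a equals 5.0
print(stats.ttest_rel(a, b))               # paired: same as ttest_1samp(a - b, 0)
print(stats.ttest_ind(a, b))               # unpaired, two-sided by default
```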

Linear Algebra

Types of Objects in Linear Algebra

  • Scalars: A number or a 1x1 matrix.
    • Can be confined to various domains eg. ($s \in \mathbb{R}$) is a real-valued scalar and ($n \in \mathbb{N}$) defines a natural number scalar
  • Vectors: An array of numbers. Each number is identified by its index.
    • typically thought of as nx1 matrices (ie. column vectors).
      • $x = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}$
      • Can also think of vectors as points embedded in n-dimensional space, with each element representing a coordinate on a different axis
    • Suppose you want to index only certain elements of a vector. Create a set S = {1,3,6} and write $x_{S}$ to denote elements $x_1, x_3, x_6$. $x_{-S}$ refers to the complement of $x_{S}$, or all the elements in x excluding $x_1, x_3, x_6$
  • Matrices
    • 2D array of numbers that uses 2 indices.
      • if A has m rows and n columns, then $A \in \mathbb{R}^{m \times n}$
      • $A_{i,j}$ denotes the element in the ith row and jth column.
        • You can take slices with a “:”, namely $A_{i,:}$ denotes the horizontal slice at row i and $A_{:,j}$ denotes the vertical slice at col j
      • Can be explicitly written out by elements:
        • $\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ \end{bmatrix}$
  • Tensors
    • n-dimensional arrays that can have an arbitrary number of indices (like the Levi-Civita tensor $\epsilon_{ijk}$)

Matrix Operations

  • Transpose: taking the mirror image across the diagonal
    • Can also think about it as swapping the rows and columns, or more algebraically $(A^{T})_{i,j} = A_{j,i}$
    • Can transpose a column vector into a row vector. So another way of writing a column vector is $\begin{bmatrix} x_{1} & x_{2} & x_{3} \end{bmatrix}^{T}$
    • When transposing a product of matrices, you reverse the order, and then transpose individually ie.
      • $(AB)^{T} = B^{T}A^{T}$
  • Multiplication
  • Can multiply matrices by each other. If A is a m × n and B is a n × p matrix, then C is a m × p matrix.
    • $C = AB$
    • $C_{i,j} = \Sigma_{k} A_{i,k}B_{k,j}$
    • Matrix multiplication is associative, distributive, but NOT commutative (ie. AB=BA is not always true)
  • Can multiply scalars by matrices: scale each element in matrix by the scalar c
  • Hadamard Product ($A \odot B$)
    • Element-wise multiplication between two same sized matrices
  • Dot Product: Between two vectors of the same size (say x and y), the dot product is defined as $x^{T}y$
    • The dot product is commutative: $x^{T}y = y^{T}x$
  • Addition
    • Can add matrices together as long as their dimensions align
    • Can add scalars to matrices: Let A be an mxn matrix. Create a mxn matrix whose entries are entirely ones and scale it by the scalar, then add to A
    • Broadcasting: Adding a vector to a matrix
      • let C and A be mxn matrices and b be a mx1 vector
        • C = A+b where b is implicitly added to each column of A
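
A short NumPy sketch of the operations above on small illustrative matrices:

```python
# The operations above in NumPy.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])
b = np.array([[10.0],
              [20.0]])                       # m x 1 column vector

print(A.T)                                   # transpose
print(A @ B)                                 # matrix product (A @ B != B @ A in general)
print(np.allclose((A @ B).T, B.T @ A.T))     # (AB)^T = B^T A^T
print(A * B)                                 # Hadamard (element-wise) product
print(A + b)                                 # broadcasting: b is added to each column of A
```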

Central Problem of Linear Algebra

  • We want to solve $Ax = b$
    • Various techniques exist to state whether this is possible, how to do it if it is possible, or how close we can get

Gaussian Elimination

  • Use elementary row operations (equivalently, left-multiplication by elimination matrices) on the augmented matrix $A|b$ to try and simplify the problem as much as possible
  • Through taking linear combinations of the rows, you can simplify the problem while maintaining the row space (and thus the solution) of the problem

Reduced Row Echelon Form (RREF)

  • Defined as a matrix with the following properties
    • The pivot of any row is 1
      • The pivot is the first non-zero entry of a row
    • The pivot of any given row is always to the right of the pivot of the row above it
      • It does not need to be in adjacent column, it just needs to be to the right
    • The pivot is the only non-zero entry in its column
  • The end goal of Gaussian elimination is to get the augmented matrix into RREF
Reading of Solutions of Ax=b from RREF
  • For each ith column in the RREF that does not have a pivot (ie. the free variables), define a variable $\lambda_{i}$.
    • First set all $\lambda$’s to 0 and solve for x
    • Then cycle through the set of $\lambda$, setting only one of them to 1 and the rest of the free variables to 0, and solve for x
  • The solution is then a linear combination of the vectors found, with the vector found when all $\lambda$ equal zero having a prefactor of 1, and the rest of the vectors having a prefactor of their associated $\lambda$
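
A sketch using SymPy (chosen here because it has an exact rref routine) that reads the solution set off the RREF of an augmented matrix; the system is made up and has one free variable:

```python
# Read the solution set of Ax = b off the RREF of the augmented matrix.
import sympy as sp

A = sp.Matrix([[1, 2, 1],
               [2, 4, 0]])
b = sp.Matrix([3, 2])

rref_matrix, pivot_cols = A.row_join(b).rref()   # RREF of [A | b]
print(rref_matrix)
print(pivot_cols)            # pivot columns; the remaining columns of A are free variables

x1, x2, x3 = sp.symbols('x1 x2 x3')
print(sp.linsolve((A, b), x1, x2, x3))   # parametric solution, one parameter per free variable
```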

LU Decomposition

  • We can write any square matrix (possibly after row exchanges) as the product of a lower triangular matrix and an upper triangular matrix
    • $M = LU$
  • Steps to get LU Decomposition
    • Perform Gaussian elimination until you get an upper triangular matrix. This is U
    • As you do row operations, keep track of the elimination matrices used along the way
    • Explicitly, you start from $A=LU$
      • $A=LU \rightarrow E_{0}E_{1}…E_{n-1}A = U$
      • So $L^{-1} = E_{0}E_{1}…E_{n-1} \rightarrow L = E_{n-1}^{-1}…E_{1}^{-1}E_{0}^{-1}$
    • The inverses of the elimination matrices are simple: you just add the row back instead of subtracting it
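
A sketch using scipy.linalg.lu on an arbitrary matrix; note that SciPy also returns a permutation matrix P, since it pivots rows for stability, so the factorization is A = PLU:

```python
# LU factorization of an arbitrary square matrix.
import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
P, L, U = lu(A)                     # P accounts for any row exchanges (pivoting)
print(L)                            # unit lower triangular
print(U)                            # upper triangular
print(np.allclose(A, P @ L @ U))    # True
```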

LDU Decomposition

  • The goal of LDU is to have ones on the main diagonals of L and U.
    • D is a diagonal matrix
    • This is accomplished by scaling each row of U such that the diagonal element in each row is 1
    • You place the scaling factor in the corresponding slot in D

Identity and Inverses

  • The identity matrix is defined as $A_{i,j} = \delta_{i,j}$, where $\delta$ is the Kronecker delta. More explicitly, A has ones along the main diagonal (top left to bottom right) and zeros everywhere else
    • $IA = AI = A$
  • The Inverse matrix of a square matrix A is defined as $AA^{-1} = A^{-1}A = I$
    • Assuming $A^{-1}$ exists, it solves the fundamental problem of linear algebra: $Ax=b \rightarrow x = A^{-1}b$

Vector Spaces

  • A vector space over some field $\mathbb{F}$ (think $\mathbb{R}$, $\mathbb{C}$, $\mathbb{Q}$) is a set $V$ with two operations + and · that satisfies the following properties (for all $u,v,w \in V$ and $c,d \in \mathbb{F}$)
    • Closure under addition: $u+v \in V$
    • Commutativity of addition: $u+v=v+u$
    • Associativity of addition: $(u+v)+w=u+(v+w)$
    • Zero: $\exists\ 0_{v} \in V$ such that $u+0_{v} = u \ \forall u \in V$
    • Additive inverse: for every $u \in V$ there exists $w \in V$ such that $u+w = 0_{v}$
    • Closure under scalar multiplication: $c\cdot v \in V$
    • Distributivity of scalars over scalar addition: (c+d)·v = c·v+d·v
    • Distributivity of scalars over vector addition: c·(u+v) = c·u+c·v
    • Associativity of scalar multiplication: (cd)·v = c·(d·v)
    • Existence of unity: 1·v = v for all v ∈ V where $1 \in \mathbb{F}$

Subspaces

  • A subspace U is a subset of V that satisfies:
    • $\alpha_{1}u_{1}+\alpha_{2}u_{2} \in U$ for all $\alpha_{1},\alpha_{2} \in \mathbb{R}$ and $u_{1},u_{2} \in U$
  • For any subset $S \subset V$, span(S) is a subspace of V
Orthogonal Complements
  • Let U and V be subspaces of vector space W
    • The sum of U and V is: $U+V = span(U\cup V) = \{u+v|u\in U,v\in V\}$
    • The direct sum of U and V is: $U\oplus V = span(U\cup V) = \{u+v|u\in U,v\in V\}$, assuming that $U\cap V = \{0_{w}\}$
      • Let $w = u+v \in U\oplus V$. There is only one way to write w as the sum of a vector in U and a vector in V
  • Given a subspace U of W, we define:
    • $U^{\perp} = \{w\in W|w\cdot u = 0,\ \forall u \in U\}$
    • This is the orthogonal complement
    • Let U be a subspace of a finite dimensional vector space W. Then the set $U^{\perp}$ is a subspace of W and $W = U\oplus U^{\perp}$

Linear Transformations

  • Define a map $L: V \rightarrow W$. This map is linear if the following conditions hold:
    • $L(u+v) = L(u)+L(v)$
    • $L(cv) = cL(v)$
  • Alternatively, you can combine the two conditions:
    • $L(ru+sv) = rL(u)+sL(v)$

Diagonalization

  • Given a linear transformation, how can we write it as a matrix?
    • Simplest example: $L: V\rightarrow V$ and we have a basis consisting of the linearly independent eigenvectors of L and the corresponding eigenvalues. Then L can be written as a diagonal matrix whose entries are the eigenvalues
      • the matrix for L in the basis S is diagonal iff S is a basis of eigenvectors for L
Change of Basis
  • A change of basis refers to writing one basis S in terms of another basis T
    • Look at each vector in S, and figure out a linear combination of T that creates S. This linear combination goes into the associated column of the matrix P
  • Two matrices M and N are similar if there exists an invertible matrix P such that
    • $N = P^{-1}MP$
    • A matrix M is diagonalizable if there exists an invertible matrix P and a diagonal matrix D such that
      • $D = P^{-1}MP$
    • What this means is that, to rewrite a matrix M in terms of a different basis, perform the transformation $M \rightarrow P^{-1}MP$ with the appropriate change of basis matrix P
    • If you know the eigendecomposition of a matrix M, then you can diagonalize M by performing a change of basis to the eigenvector basis

Linear Dependence and Span

  • One can think about the product $Ax$ as the following:
    • each component of x scales the corresponding column of A. The scaled columns are then summed together
      • $Ax = \Sigma_{i} x_{i}A_{:,i}$
    • This is one way to think about a linear combination. Given a set of vectors, you scale each vector by some factor and sum the scaled vectors
    • span denotes the set of all possible linear combinations
  • One can reformulate the fundamental problem $Ax=b$ as determining whether some combination of the columns equals b, or equivalently, does b lie in the span of the column space of A?
  • For a given set of vectors, we say the set is linearly independent if no nontrivial combination (ie. with at least one nonzero coefficient) yields the zero vector

Basis and Dimension

  • Dimension refers to the number of components necessary to describe a vector
  • Let V be a vector space. Then a set S is a basis for V if S is linearly independent and V = span(S)
    • In $\mathbb{R^{n}}$, the standard basis is the columns of the identity matrix
    • A basis is not unique
      • Although each vector has a unique representation in a given basis
  • $v_{1},v_{2},…,v_{n}$ form a basis for V iff V = span($v_{1},v_{2},…,v_{n}$) and the vectors are linearly independent

Norms

  • The norm is a function that gives a metric of the size of a vector. It must satisfy the following properties:
    • $f(x) = 0 \rightarrow x = 0$
    • $f(x+y) \leq f(x) + f(y)$ (triangle inequality)
    • $ \forall \alpha \in \mathbb{R}, f(\alpha x) = |\alpha|f(x)$
  • The $L^{p}$ norm is defined as (for $p \in \mathbb{R}, p \geq 1$)
    • $||x||_{p} = (\Sigma_{i}|x_{i}|^{p})^{\frac{1}{p}}$
  • The squared $L^{2}$ norm is useful since mathematically and computationally, it is easier to work with (derivative is simple, don’t need a square root)
  • The $L^{1}$ norm is useful when the difference between zero and nonzero elements is important
    • The $L^{1}$ norm increases by $\epsilon$ as x goes from 0 to $\epsilon$
  • The $L^{\infty}$ norm denotes the absolute value of the largest element in the array
    • $||x||_{\infty} = max_{i}|x_{i}|$
  • There also exists a notion of norm for a matrix. The most common is the Frobenius norm:
    • $||A||_{F} = \sqrt{\Sigma_{i,j}A^{2}_{i,j}}$
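
A quick sketch of these norms with numpy.linalg.norm on illustrative inputs:

```python
# The norms above via numpy.linalg.norm.
import numpy as np

x = np.array([3.0, -4.0, 0.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.norm(x, 1))        # L1 norm: sum of absolute values = 7
print(np.linalg.norm(x, 2))        # L2 (Euclidean) norm = 5
print(np.linalg.norm(x, np.inf))   # L-infinity norm: largest absolute value = 4
print(np.linalg.norm(A, 'fro'))    # Frobenius norm = sqrt(sum of squared entries)
```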

Orthogonal Bases

  • a basis is orthogonal if all the vectors are perpendicular to each other
    • $v_{i}\cdot v_{j} = 0$ for $i \neq j$
      • Orthonormal bases impose the additional condition that the norm of each basis vector is 1, ie. $v_{i}\cdot v_{j} = \delta_{ij}$
    • In an orthonormal basis, any vector can be written as
      • $v = \Sigma_{i}(v\cdot u_{i})u_{i}$
    • Change of basis matrices between orthonormal bases are orthogonal matrices ie.
      • $P^{-1} = P^{T}$

Gram-Schmidt and Orthogonal Complements

Eigendecomposition

  • Solve the matrix equation $Av = \lambda v$, where $\lambda$ is a constant.
    • $\lambda$ is called the eigenvalue and $v$ is called the eigenvector
    • Eigenvectors are NOT unique. This is because we can think of an eigenvector as any vector whose direction is unchanged after being acted on by a matrix. Hence, eigenvectors are typically normalized for convenience.
    • Multiple eigenvectors can have the same eigenvalue
    • Real matrices can have complex eigenvalues
  • The eigendecomposition is not possible for all matrices
  • Given the eigendecomposition of a matrix (ie $\lambda = [\lambda_{1},\lambda_{2},…,\lambda_{n}]$ and V is a matrix whose columns are the eigenvectors), you can write the matrix as:
    • $A = V\, diag(\lambda)V^{-1}$
    • Typically, the eigenvalues are sorted in descending order
    • Make sure that each eigenvector is in the column that corresponds to its eigenvalue
  • For a real symmetric matrix, the eigendecomposition is:
    • $Q\Lambda Q^{T}$, where $Q$ is an orthogonal matrix
    • The eigendecomposition is guaranteed to exist for a real, symmetric matrix
  • A matrix is singular iff any of the eigenvalues are zero
  • Let $f(x) = x^{T}Ax$ subject to $||x||_{2}=1$. Whenever x is equal to an eigenvector of A, f equals the corresponding eigenvalue
    • The minimum and maximum of f within the constraint region (ie. the surface of a n-dimensional unit ball) is the minimum and maximum eigenvalue respectively
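
A sketch of the eigendecomposition of a small symmetric matrix with NumPy (eigh is used since the matrix is symmetric), confirming $A = Q\Lambda Q^{T}$:

```python
# Eigendecomposition of a small real symmetric matrix and reconstruction A = Q diag(lambda) Q^T.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, Q = np.linalg.eigh(A)     # eigh: for symmetric matrices; columns of Q are eigenvectors
print(eigvals)                                      # [1., 3.]
print(np.allclose(A, Q @ np.diag(eigvals) @ Q.T))   # A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))              # Q is orthogonal
```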

Singular Value Decomposition (SVD)

  • Every real matrix has a singular value decomposition. Suppose A is an mxn matrix. Let U be a mxm orthogonal matrix, D be a mxn diagonal matrix (NOTE: not necessarily square), and V be a orthogonal nxn matrix. Then
    • $A = UDV^{T}$
    • The columns of U are the left-singular vectors, the columns of V are the right-singular vectors, and the diagonal elements are the singular values
    • U can be found by taking the eigenvectors of $AA^{T}$, V can be found by taking the eigenvectors of $A^{T}A$, and the singular values are the square roots of the eigenvalues of $A^{T}A$ or $AA^{T}$ (whichever has the smaller dimension is less work; their nonzero eigenvalues agree).
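
A sketch of the SVD of an arbitrary 2x3 matrix with NumPy, reconstructing $A = UDV^{T}$ and checking that the squared singular values match the eigenvalues of $AA^{T}$:

```python
# SVD of a 2x3 matrix and reconstruction A = U D V^T.
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # compact form
print(s)                                           # singular values, descending
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True
print(s**2)                                        # eigenvalues of A @ A.T (the smaller Gram matrix)
```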

Moore-Penrose Pseudoinverse

  • For non-square matrices, there may exist a matrix $B$ that solves $Ax = y$ by left-multiplication:
    • $Ax=y \rightarrow x = By$
  • If you have a tall matrix, there may exist no solution, and if you have a wide matrix, there could be multiple solutions
  • Define the matrix:
    • $A^{+} = \lim_{\alpha \rightarrow 0}(A^{T}A+\alpha I)^{-1}A^{T}$
  • The practical formula for computing this matrix is
    • $A^{+} = VD^{+}U^{T}$
      • $U,V,D$ are from the SVD of A. $D^{+}$ can be computed by taking the reciprocal of the nonzero elements and then transposing
  • If A has more columns than rows, then $x=A^{+}y$ is one of the possible solutions
    • This solution has the minimum Euclidean norm $||x||_{2}$ among all possible solutions
  • If A has more rows than columns, the x from the pseudoinverse minimizes $||Ax-y||_{2}$ (ie. it is the least-squares solution, the x for which Ax is as close to y as possible in Euclidean norm)
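
A sketch using np.linalg.pinv on an arbitrary tall system, checking that the pseudoinverse solution agrees with the least-squares solution:

```python
# Pseudoinverse solution of an overdetermined (tall) system: x = A+ y minimizes ||Ax - y||_2.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # more rows than columns: generally no exact solution
y = np.array([1.0, 2.0, 2.0])

x = np.linalg.pinv(A) @ y
print(x)                                                       # least-squares solution
print(np.allclose(x, np.linalg.lstsq(A, y, rcond=None)[0]))    # agrees with lstsq
```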

QR Decomposition

Gram-Schmidt

  • For each column vector in matrix M, let the first column be your first basis vector
    • The second column can be made orthogonal by subtracting off its projection onto the first column
      • $v^{\perp} = v-\frac{u \cdot v}{u \cdot u}u$
    • Rinse and repeat for the rest of the columns, subtracting off the projections of the established basis vectors
    • Normalize every vector at the end
  • Place all the normalized basis vectors in an orthogonal matrix Q
  • The R matrix can be figured out with the following fact:
    • The (i,j) entry of the upper triangular matrix R equals the dot product of the i-th column of Q with the j-th column of M
  • Gram-Schmidt is quite numerically unstable, so it is good for visualization but impractical
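
In practice one calls a library QR routine (NumPy's np.linalg.qr, which relies on Householder reflections rather than classical Gram-Schmidt) precisely because of that instability; a short sketch on an arbitrary matrix:

```python
# QR decomposition of an arbitrary matrix.
import numpy as np

M = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Q, R = np.linalg.qr(M)
print(np.allclose(M, Q @ R))               # M = QR
print(np.allclose(Q.T @ Q, np.eye(3)))     # Q has orthonormal columns
print(np.allclose(np.tril(R, -1), 0))      # R is upper triangular
```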

Trace

  • The trace gives the sum of all diagonal entries of a matrix
    • $Tr(A) = \Sigma_{i}A_{i,i}$
  • Useful to replace summations
    • Frobenius norm can be rewritten as
      • $||A||_{F} = \sqrt{Tr(AA^{T})}$
  • $Tr(A) = Tr(A^{T})$
  • The trace of a square matrix is invariant under cyclic permutation of matrices
    • $Tr(ABC)=Tr(CAB)=Tr(BCA)$
    • This holds even if A and B are not square matrices, as long as the final product is a square matrix

Determinant

  • Denoted det(A), it is only defined for square matrices. det(A) equals the product of all eigenvalues of the matrix
  • $|det(A)|$ can be thought of as how the volume of the space changes upon action by the matrix A
    • If det(A) = 0, then the space collapses to a lower dimensional one
      • det(A) = 0 also means that the matrix is not invertible
    • If det(A) = 1, then the transformation preserves volume
  • The determinant changes sign upon swapping either two rows or two columns
  • Scaling a row by $\lambda$ changes the determinant by a factor of $\lambda$
  • Adding a multiple of one row to another row does not change the value of the determinant
  • det(AB) = det(A) det(B)
  • det($A^{T}$) = det(A)
  • det($A^{-1}$) = $\frac{1}{det(A)}$

Kernel, Range, Nullity, Rank

Mapping definitions

  • Let $f: S \rightarrow T$ be a map from S to T.
    • S is the domain of the function
    • T is the codomain of the function
    • the set $ran(f) = im(f) = f(S) = \{f(s)|s\in S\} \subset T$ is called the range or image of f
      • You can think of the range as the part of the codomain T that S actually maps to
      • the image of $L(V)$ is a subspace of W
    • for some subset U of T, then $f^{-1}(U) = \{s\in S|f(s) \in U\}\subset S$
      • The preimage of a set $U$ is the set of all elements of $S$ which maps to $U$
    • The function f is one-to-one if different elements of S always map to different elements of T (for $x\neq y, f(x)\neq f(y)$)
      • also called injective
    • The function f is onto if every element of T is mapped to by some element of S (ie. $\forall t \in T, \ \exists s \in S \rightarrow f(s)= t$)
      • also called surjective
    • functions that are both injective and surjective are called bijective

Kernel

  • Let $L: V\rightarrow W$ be a linear transformation. The kernel of L is defined as
    • $ker\ L = \{v\in V|Lv=0_{w}\}$
    • In words, the kernel is all $v \in V$ such that the linear transformation maps onto the zero vector in $W$
    • A linear transformation L is injective iff $ker\ L = \{0_{V}\}$
      • ie. only the zero vector in $V$ maps to the zero vector in $W$
      • $ker\ L$ is a subspace of V

Rank and Nullity

  • The rank of a linear transformation $L$ is the dimension of its image, written as $rank(L) = dim\ L(V) = dim\ Im(L)$
  • The nullity is the dimension of the kernel, written as $null\ L = dim\ ker\ L$
  • $dim\ V = dim\ ker\ L + dim\ L(V)$

Least Squares

  • Solving the equation $M^{T}MX = M^{T}V$, where M is a rectangular matrix
    • Used when there exists no perfect solution to $MX=V$. Finds the best possible solution
  • For polynomial fits, you construct a Vandermonde matrix from your data and solve for the coefficients of the polynomial
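
A sketch of a quadratic least-squares fit on invented data, building the Vandermonde matrix with np.vander and solving the least-squares problem (equivalent to the normal equations $M^{T}MX = M^{T}V$) with np.linalg.lstsq:

```python
# Quadratic fit: build the Vandermonde matrix and solve the least-squares problem.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0])    # roughly 2x^2 + 1 plus noise

M = np.vander(x, N=3, increasing=True)       # columns: 1, x, x^2
coeffs, residuals, rank, sv = np.linalg.lstsq(M, y, rcond=None)
print(coeffs)                                # [c0, c1, c2], approximately [1, 0, 2]
```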