Here is a compendium of my notes that I took to review for Machine Learning. This was meant to be a quick refresher on concepts and to put everything in one place. For a less cursory version, see the background section under Resources at the course website.

Probability and Statistics

Basic Concepts

Axioms of Probability

  • Probability Measure $P: \mathbb{F} \rightarrow \mathbb{R}$ such that
    • $P(A) \geq 0$ for all $A \in \mathbb{F}$
    • $P(\Omega) = 1$
    • If $A_{1},A_{2}…$ are disjoint events, then
      • $P(\cup A_{i}) = \Sigma_{i} P(A_{i})$

Definitions

  • Sample space $\Omega$: The set of all possible outcomes
  • Event space $\mathbb{F}$: the set of events, where each event $A \in \mathbb{F}$ is a subset of $\Omega$
  • A random variable is a quantity that has an uncertain value.
    • Examples include coin flips (discrete), daily temperature (continuous)
  • An ensemble sampled from a random variable forms a probability distribution
  • The probability density/mass function (pdf, pmf) denotes the probability distribution of a continuous or discrete random variable respectively. The integral/sum must be 1.
  • Joint probability p(x,y) denotes the probability of the pair (x,y) occurring. Can be a mix of discrete and continuous, and x and y can also be vectors.
  • Marginalization refers to the “summing out” of a variable from a probability distribution ($p(x) = \int p(x,y) dy$ for instance). This removes dependence of distribution on summed variables. This can be extended to discrete ($\int \rightarrow \Sigma$) or to multiple variables ($p(x,y) = \Sigma_w \int p(w,x,y,z) dz$ ).
  • Conditional Probability is the probability of x given a fixed value of y, written as $p(x|y = y^{*})$. Like taking a cross section of a CAD model
    • $p(x|y = y^{*})$ is a relative probability, since it does not necessarily sum to 1. Need to normalize so that the sum is 1.
    • $p(x,y) = p(x|y)p(y) \rightarrow p(x,y) = p(y|x)p(x)$ from symmetry
  • Chain Rule: $p(x_{1},x_{2},…,x_{k}) = \Pi_{i=1}^{k} p(x_{i}|x_{1},…,x_{i-1})$
    • Note: when $i=1$, the conditioning set is empty, so the factor is just $p(x_{1})$; trivial, but important to point out
    • In words: the joint probability is the product of the probability of the first, times the probability of the second given the first, etc.
    • $p(x|y)=p(x)$ and $p(y|x)=p(y)$ are the conditions for x and y to be independent.
      • Substituting into the conditional probability definition yields $p(x,y)= p(x)p(y)$
  • Conditional Independence: $p(x_{1}|x_{2},x_{3}) = p(x_{1}|x_{2})$ and $p(x_{3}|x_{1},x_{2}) = p(x_{3}|x_{2})$ are the conditions. Note the symmetry in $x_{1}$ and $x_{3}$
    • In words: $x_{1}$ is conditionally independent of $x_{3}$ given $x_{2}$. Or, if we know $x_{2}$, then $x_{1}$ gives no information about $x_{3}$.
  • Variance of a population: $\sigma^{2} = \frac{\Sigma(x-\mu)^{2}}{n} = \frac{\Sigma x^{2}}{n}-\mu^{2}$
  • Variance of a sample: $s^{2} = \frac{\Sigma(x-\bar x)^{2}}{n-1} = \frac{\Sigma x^{2}-\frac{(\Sigma x)^{2}}{n}}{n-1}$
  • Standard Deviation = $\sqrt{s^{2}}$
  • Standard Error = $\frac{s}{\sqrt{n}}$
  • Null Hypothesis: There is no significant difference between two populations
  • Type I error: Probability of rejecting a true null hypothesis
    • Equals significance level $\alpha$
  • Type II error: Probability of failing to reject a false null hypothesis

Bayes’ Rule

  • $p(x|y) = \frac{p(y|x) p(x)}{p(y)}$
    • posterior: $p(x|y)$ or what we know about x given y
    • prior: $p(x)$ or what we know about x before looking at y
    • likelihood: $p(y|x)$ or how likely the observation y is under each possible value of x
    • evidence: $p(y)$ or the overall probability of observing y (the normalizer)
      • Typical example: a diagnostic test for Covid is 95% accurate. This means that the test correctly identifies the true state of the patient 95% of the time (ie. if she has Covid, the test returns positive; if she doesn’t, it returns negative). Here X is the random variable of the true Covid state of the patient and Y is the random variable of the test result. Suppose the infection rate of Covid is 1 in 1000.
        • What is $p(Covid|Pos)$? Not 95%.
        • $p(Covid|Pos) = \frac{p(Pos|Covid)p(Covid)}{p(Pos|Covid)p(Covid)+p(Pos|NoCovid)p(NoCovid)}$
        • Plugging in values yields about 2%
      • A more conceptual way of thinking about Bayes’ rule is that it tells you how to eliminate impossible outcomes given some information about the world.
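
As a sanity check, here is the calculation from the example above in a few lines of Python; the 95% accuracy and 1-in-1000 prevalence come from the example, nothing else is assumed.

```python
# Posterior p(Covid | Positive) via Bayes' rule, using the numbers from the example above.
p_covid = 1 / 1000            # prior: infection rate
p_pos_given_covid = 0.95      # test returns positive when the patient has Covid
p_pos_given_no_covid = 0.05   # false-positive rate (1 - accuracy)

# evidence p(Pos), obtained by marginalizing over the true state
p_pos = p_pos_given_covid * p_covid + p_pos_given_no_covid * (1 - p_covid)

p_covid_given_pos = p_pos_given_covid * p_covid / p_pos
print(f"p(Covid | Pos) = {p_covid_given_pos:.3f}")   # ~0.019, i.e. about 2%
```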

Expectation

  • $E[f(x)] =\int f(x) \cdot p(x)dx$ ($\int \rightarrow \Sigma$ for discrete)
    • p(x) is the weight assigned to each value of f(x)
      Common choices of f(x) and the name of the corresponding expectation:
      • $x$: mean $\mu_x$
      • $x^k$: k-th moment about 0
      • $(x-\mu_x)^k$: k-th central moment
      • $(x-\mu_x)^2$: variance (spread)
      • $(x-\mu_x)^3$: skew (lean)
      • $(x-\mu_x)^4$: kurtosis (peakedness)
      • $(x-\mu_x)(y-\mu_y)$: covariance of x and y
    • Expectation Manipulation Rules. Let a be a constant
      • $E[a] = a$
      • $E[a\cdot x] = aE[x]$
      • $E[x + y] = E[x]+E[y]$
      • $E[x \cdot y] = E[x]\cdot E[y]$ if x,y are independent
    • Variance Manipulation Rules
      • $Var[a\cdot x+b] = a^{2}\cdot Var[x]$
      • $Var[x+y] = Var[x]+Var[y]$ if x,y are independent

Common Probability Distributions

Bernoulli Distribution

  • Simple success or failure experiment
  • let $\lambda$ denote the probability of success. x=0 denotes failure and x=1 denotes success
    • $Bern_{x}(\lambda) = \lambda^{x}(1-\lambda)^{1-x}$
      • $p(x=0) = 1-\lambda$
      • $p(x=1) = \lambda$
      • $E[x] = \lambda$
      • $Var[x] = \lambda(1-\lambda)$

Binomial Distribution

  • Given a sequence of $N$ independent Bernoulli experiments, the Binomial distribution enumerates the probability for $m$ successes in the sequence
  • $Bin_{m}(N,\lambda) = {N \choose m} \lambda^{m}(1-\lambda)^{N-m}$
  • $E[x] = N\lambda$
  • $Var[x] = N\lambda(1-\lambda)$
  • For a fixed $N\lambda$, the Binomial converges to Poisson as $N \rightarrow \infty$
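
A quick sketch of that convergence using scipy.stats, with an arbitrary fixed rate of 3 expected events:

```python
# As N grows with N*lambda held fixed, Binomial(N, lambda) pmf values approach Poisson(N*lambda).
from scipy import stats

rate = 3.0                                    # fixed N * lambda (arbitrary)
for N in (10, 100, 1000):
    lam = rate / N
    print(N,
          stats.binom.pmf(2, N, lam),         # P(m = 2 successes) under the Binomial
          stats.poisson.pmf(2, rate))         # same probability under the Poisson limit
```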

Poisson Distribution

  • Consider independent events that happen at an average rate $\lambda$ over time (ie. $\lambda$ has “units” of number of events per time)
  • The Poisson distribution is discrete, and gives the probability of a given number of events $k$ occurring in a fixed time interval
  • $p(k,\lambda) = \frac{\lambda^{k}e^{-\lambda}}{k!} = Pois_{k}(\lambda)$
  • $E[x] = \lambda$
  • $Var[x] = \lambda$

Categorical Distribution

  • Suppose you run a single experiment with $K$ possible outcomes indexed by $k \in \{1,2,…,K\}$. Let $\lambda =[\lambda_{1},\lambda_{2},…,\lambda_{K}]$ with $\Sigma_{k=1}^{K} \lambda_{k} = 1$
  • $p(x=k) = \lambda_{k}$
  • $Cat_{x}(\lambda) = p(x)$
  • $E[\mathbb{1}(x=k)] = \lambda_{k}$
  • $Var[\mathbb{1}(x=k)] = \lambda_{k}(1-\lambda_{k})$ (the indicator of outcome k is Bernoulli($\lambda_{k}$))

Multinomial Distribution

  • The Multinomial distribution gives the probability of observing a particular combination of counts across the categories
  • Generalization of the Categorical Distribution to N trials and of the Binomial Distribution to K outcomes.
  • Let $m = [m_1, m_2, …, m_K]$ denote the observed counts in each category (total number of categories is $K$) over a sequence of $N$ trials
  • Let $\lambda = [\lambda_1, \lambda_2, …, \lambda_K]$ denote the probability of each category (subject to the normalization $\lambda_1+ \lambda_2 + … + \lambda_K = 1$)
  • $p(m) = {N \choose m_1\, m_2\, …\, m_K}\lambda_1^{m_1}\lambda_2^{m_2}…\lambda_K^{m_K}$
  • $E[m_{k}] = N\lambda_{k}$
  • $Var[m_k] = N\lambda_{k}(1-\lambda_{k})$

Gaussian (Normal) Distribution

  • Useful because a Gaussian is fully specified by only two moments and because the central limit theorem (CLT) holds
  • CLT states that the mean of independently drawn random variables is normally distributed (given enough samples), irrespective of the original distribution
  • $E[x] = \mu$
  • $Var[x] = \sigma^{2}$
  • $p(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{\frac{-(x-\mu)^{2}}{2\sigma^{2}}}$
  • Standard normal distribution ($\mu=0$, $\sigma = 1$)

Multivariate Gaussian Distribution

  • Let $d$ be the dimension of the space
  • Let $\mu$ be a d-dimensional vector (mean vector)
  • Let $\Sigma$ be a $d \times d$ symmetric and positive semi-definite matrix (covariance matrix)
  • $p(x) = \frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}} exp(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu))$
  • $E[x] = \mu$
  • $Cov[x] = \Sigma$
  • Alternatively, $p(x) = \frac{|\beta|^{\frac{1}{2}}}{(2\pi)^{\frac{d}{2}}} exp(-\frac{1}{2}(x-\mu)^{T}\beta(x-\mu))$ where $\beta = \Sigma^{-1}$ is the precision matrix
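
A minimal sketch, with an arbitrary 2-D mean and covariance, that evaluates the density formula above by hand and checks it against scipy.stats.multivariate_normal:

```python
# Evaluate the multivariate Gaussian density by the formula above and compare with scipy.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

d = len(mu)
diff = x - mu
manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
         np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
print(manual, multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # the two values agree
```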

Laplace distribution

  • $p(x) = \frac{1}{2\gamma}exp(-\frac{|x-\mu|}{\gamma})$
  • Places sharp peak at arbitrary point $\mu$ with spread of $\gamma$

Dirac Delta Distribution

  • $p(x) = \delta(x-\mu)$
  • Clusters all mass of the distribution function into a single point
  • Infinitely high, infinitely narrow peak
  • Exists to be integrated
  • Can be extended to an arbitrary number of points (empirical distribution):
    • $p(x) = \frac{1}{m}\Sigma_{i=1}^{m} \delta(x-x_{i})$

Mixtures of Distributions

  • Define a collection of component distributions and a probability distribution P(c) that selects which component is drawn from.
  • $P(x) = \Sigma_{i} P(c=i) P(x |c=i)$
  • c is a latent variable, or a random variable that can’t be observed directly
  • Common examples are the empirical distribution and Gaussian mixture models, where the components are Gaussians.
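
A small sketch of ancestral sampling from a two-component Gaussian mixture (the weights, means, and standard deviations are made-up illustrative values): draw the latent c from P(c), then draw x from the selected Gaussian.

```python
# Ancestral sampling from a two-component Gaussian mixture:
# draw the latent component c from P(c), then draw x from the chosen Gaussian.
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])   # P(c = i)
means = np.array([-2.0, 3.0])
stds = np.array([0.5, 1.0])

c = rng.choice(len(weights), size=10_000, p=weights)   # latent component per sample
x = rng.normal(means[c], stds[c])                      # x | c
print(x.mean())   # close to 0.3*(-2) + 0.7*3 = 1.5
```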

Common Functions

Logistic sigmoid $\sigma(x)$

  • $\sigma(x) = \frac{1}{1+exp(-x)}$
  • Defined on $\mathbb{R}$ and confined between 0 and 1
  • Saturates as $x \rightarrow \pm \infty$ (ie insensitive to changes in x)
  • Maximum slope at x=0
  • Useful to produce parameter of a Bernoulli distribution
  • Properties:
    • $\sigma(x) = \frac{exp(x)}{exp(x)+exp(0)}$
    • $\frac{d}{dx}\sigma(x) = \sigma(x)(1-\sigma(x))$
    • $1-\sigma(x) = \sigma(-x)$

Softplus Function $\zeta(x)$

  • $\zeta(x) = log(1+exp(x))$
  • Defined on $\mathbb{R}$ and confined between 0 and $\infty$
  • Useful to produce the $\sigma$ or $\beta$ parameter of a normal distribution, since it is always positive
  • A smoother version of $f(x) = max(0,x)$
  • Properties:
    • $log(\sigma(x)) = -\zeta(-x)$
    • $\frac{d}{dx} \zeta(x) = \sigma(x)$
    • $\forall x \in (0,1), \sigma^{-1}(x) = log(\frac{x}{1-x})$
    • $\forall x>0, \zeta^{-1}(x) = log(exp(x)-1)$
    • $\zeta(x) = \int_{-\infty}^{x} \sigma(y) dy$
    • $\zeta(x)-\zeta(-x) = x$
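
A short numerical check of a few of the properties listed above, in plain NumPy (nothing here is specific to any library's built-in sigmoid/softplus):

```python
# Numerically check a few of the sigmoid/softplus identities listed above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))   # log(1 + exp(x))

x = np.linspace(-5, 5, 11)
print(np.allclose(1 - sigmoid(x), sigmoid(-x)))          # 1 - sigma(x) = sigma(-x)
print(np.allclose(np.log(sigmoid(x)), -softplus(-x)))    # log sigma(x) = -zeta(-x)
print(np.allclose(softplus(x) - softplus(-x), x))        # zeta(x) - zeta(-x) = x
```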

Information Theory

Self-Information

  • General Idea: knowing something unlikely has happened provides more information than knowing something likely has happened. Information theory quantifies this statement
  • Properties of Information
    • Likely events have low information (extreme: guaranteed events contain no information)
    • Less likely events should have more information
    • Independent events should have additive information (ie. tossing a coin twice contains twice as much information as tossing a coin once)
  • The above motivate the definition of self-information. Let x be an event. Then $I(x) = -log(P(x))$ is the self-information of the event. The log is typically base 2 or base e, but in theory could have any base you want; a change in base is effectively a change in scale/units.
  • NOTE: by convention, $0 \cdot log(0)$ is treated as 0, consistent with $\lim_{x\rightarrow 0^{+}} x\, log(x) = 0$

Shannon Entropy

  • Extends I(x) to determine amount of uncertainty in a probability distribution:
    • $H(x) = H(P) = E[I(x)]$, or the expectation value of I(x)
  • In words, $H(x)$ gives the expected/average amount of information that you can gain from drawing from the probability distribution. Entropy is maximized for a uniform probability distribution (this has a very direct connection to entropy in the thermodynamic sense)
  • When x is continuous, Shannon entropy is known as differential entropy

Kullback-Leibler (KL) divergence

  • a way of quantifying the difference in information between two distributions $P(x)$ and $Q(x)$
    • $D_{KL}(P||Q) = E_{x\sim P}[log(P(x))-log(Q(x))]$
  • In words (for the discrete case), $D_{KL}$ is the extra information needed to send a message using the code P instead of the code Q.
  • KL divergence is non-negative
  • It is zero iff P and Q are the same distribution
  • NOT symmetric in P and Q (not a true measure of distance)

Cross-entropy

  • $H(P,Q) = H(P)+D_{KL}(P||Q) = -E_{x\sim P}[log(Q(x))]$
  • Minimizing cross-entropy w.r.t. Q is the same as minimizing KL divergence
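
A small sketch with two made-up discrete distributions, verifying the identity $H(P,Q) = H(P) + D_{KL}(P||Q)$ numerically:

```python
# Entropy, KL divergence, and cross-entropy for two small discrete distributions (base-e logs),
# checking H(P, Q) = H(P) + D_KL(P || Q).
import numpy as np

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.4, 0.4, 0.2])

H_P = -np.sum(P * np.log(P))                  # Shannon entropy of P
D_KL = np.sum(P * (np.log(P) - np.log(Q)))    # expectation taken under P
H_PQ = -np.sum(P * np.log(Q))                 # cross-entropy

print(H_P, D_KL, H_PQ, np.isclose(H_PQ, H_P + D_KL))
```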

Exponential Distribution

  • Let $p(x;\lambda)$ be a piecewise function such that $p(x;\lambda)=0$ for $x<0$ and $p(x;\lambda) = \lambda\, exp(-\lambda x)$ for $x \geq 0$, where $\lambda$ is the decay parameter
  • Serves to put a sharp peak at x=0

Chi-squared Distribution

Basics

  • Chi-squared is a continuous distribution of the sum of the squares of k standard normally distributed random variables.
  • k is called the number of “degrees of freedom”
  • $q = \Sigma_{i=1}^{k} x_{i}^{2}$ ($\chi^{2}$ parameter)
  • $p(x) = \frac{x^{\frac{k}{2}-1}e^{-\frac{x}{2}}}{2^{\frac{k}{2}}\Gamma(\frac{k}{2})}$
  • $E[x] = k$
  • $Var[x] = 2k$

Goodness of Fit (GOF)

  • $\chi^{2} = \Sigma \frac{(O-E)^{2}}{E} $
  • degree of freedom (dof) = number of categories -1
  • Look up the p-value in a $\chi^{2}$ table. Whether you accept or reject the null hypothesis is determined by whether the p-value is greater than (don’t reject the null) or less than (reject the null) the significance level $\alpha$
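
A sketch of a goodness-of-fit test using scipy.stats.chisquare; the die-roll counts are invented for illustration:

```python
# Goodness-of-fit test: 100 die rolls against a fair-die expectation.
from scipy.stats import chisquare

observed = [18, 22, 16, 14, 12, 18]
expected = [100 / 6] * 6
stat, p_value = chisquare(observed, f_exp=expected)   # dof = 6 - 1 = 5
print(stat, p_value)   # reject the null (fair die) only if p_value < alpha
```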

Independence

  • Check whether categorical data is independent
  • You are given observed data table O
  • expected data table E is calculated cellwise
    • For a given row and column, the cell value is $E = \frac{(\text{row total})(\text{column total})}{(\text{grand total})}$
    • Calculate $\chi^{2}$ as normal
    • dof = (#rows-1)(#cols-1)
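
A sketch of the same test using scipy.stats.chi2_contingency, which also returns the expected table computed with the row/column-total formula above; the counts are invented:

```python
# Test of independence on a 2x2 contingency table; scipy also returns the expected table,
# built cellwise from (row total)(column total)/(grand total).
from scipy.stats import chi2_contingency

observed = [[30, 10],
            [20, 40]]
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(stat, p_value, dof)   # dof = (2 - 1)(2 - 1) = 1
print(expected)
```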

T-test

One Sample

  • Tests whether the mean of a normally distributed population is different from a specific value
  • Let $\bar{x}$ be the sample mean, s the sample standard deviation, and $\mu_{0}$ the specific mean value to be tested
  • $t = \frac{\bar{x}-\mu_{0}}{\frac{s}{\sqrt{n}}}$
  • dof = n-1
  • Using a lookup table, find the corresponding p-value, with the following caveats depending on what your alternative hypothesis is:
    • $\mu>\mu_{0} \rightarrow$ read table as given
    • $\mu<\mu_{0} \rightarrow$ if the t-statistic is negative, read the table as though the t-value were positive
      • If the t-statistic has the incorrect sign, read the $1-p$ value instead
    • $\mu\neq\mu_{0} \rightarrow$ read table as given, but double the p-value to take into account both tails
  • If p-value is less than $\alpha$, reject the null hypothesis

Two Sample

Test whether the means of two populations are significantly different from one another

Paired
  • This means that each value in the first group has a one-to-one mapping to a value in the second group
  • Subtract values in a pairwise manner and run one sample t-test with $\mu_{0}$ = 0
Unpaired
  • $t = \frac{\bar x_{1} - \bar x_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}$
  • dof = $(n_{1}-1)+(n_{2}-1)$
  • Read off p-value and compare to significance level
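
A sketch of the three t-test variants via scipy.stats on synthetic data (the means, spreads, and sample sizes are arbitrary):

```python
# One-sample, paired, and unpaired t-tests on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=5.2, scale=1.0, size=30)
b = rng.normal(loc=4.8, scale=1.0, size=30)

print(stats.ttest_1samp(a, popmean=5.0))   # H0: the mean of a equals 5.0
print(stats.ttest_rel(a, b))               # paired: same as ttest_1samp(a - b, 0)
print(stats.ttest_ind(a, b))               # unpaired, two-sided by default
```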

Linear Algebra

Types of Objects in Linear Algebra

  • Scalars: A number or a 1x1 matrix.
    • Can be confined to various domains eg. ($s \in \mathbb{R}$) is a real-valued scalar and ($n \in \mathbb{N}$) defines a natural number scalar
  • Vectors: An array of numbers. Each number is identified by its index.
    • typically thought of as nx1 matrices (ie. column vectors).
      • $x = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}$
      • Can also think of vectors as points embedded in n-dimensional space, with each element representing a coordinate on a different axis
    • Suppose you want to index only certain elements of a vector. Create a set S = {1,3,6} and write $x_{S}$ to denote elements $x_1, x_3, x_6$. $x_{-S}$ refers to the complement of $x_{S}$, or all the elements in x excluding $x_1, x_3, x_6$
  • Matrices
    • 2D array of numbers that uses 2 indices.
      • if A has m rows and n columns, then $A \in \mathbb{R}^{m \times n}$
      • $A_{i,j}$ denotes the element in the ith row and jth column.
        • You can take slices with a “:”, namely $A_{i,:}$ denotes the horizontal slice at row i and $A_{:,j}$ denotes the vertical slice at col j
      • Can be explicitly written out by elements:
        • $\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ \end{bmatrix}$
  • Tensors
    • n-dimensional arrays that can have an arbitrary number of indices (like the Levi-Civita tensor $\epsilon_{ijk}$)

Matrix Operations

  • Transpose: taking the mirror image across the diagonal
    • Can also think about it as swapping the rows and columns, or more algebraically $(A^{T})_{i,j} = A_{j,i}$
    • Can transpose a column vector into a row vector. So another way of writing a column vector is $\begin{bmatrix} x_{1} & x_{2} & x_{3} \end{bmatrix}^{T}$
    • When transposing a product of matrices, you reverse the order, and then transpose individually ie.
      • $(AB)^{T} = B^{T}A^{T}$
  • Multiplication
  • Can multiply matrices by each other. If A is a m × n and B is a n × p matrix, then C is a m × p matrix.
    • $C = AB$
    • $C_{i,j} = \Sigma_{k} A_{i,k}B_{k,j}$
    • Matrix multiplication is associative, distributive, but NOT commutative (ie. AB=BA is not always true)
  • Can multiply scalars by matrices: scale each element in matrix by the scalar c
  • Hadamard Product ($A \odot B$)
    • Element-wise multiplication between two same sized matrices
  • Dot Product: Between two vectors of the same size (say x and y), the dot product is defined as $x^{T}y$
    • The dot product is commutative: $x^{T}y = y^{T}x$
  • Addition
    • Can add matrices together as long as their dimensions align
    • Can add scalars to matrices: Let A be an mxn matrix. Create a mxn matrix whose entries are entirely ones and scale it by the scalar, then add to A
    • Broadcasting: Adding a vector to a matrix
      • let C and A be mxn matrices and b be a mx1 vector
        • C = A+b where b is implicitly added to each column of A
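
A short NumPy sketch of the operations above on small illustrative matrices:

```python
# The operations above in NumPy.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])
b = np.array([[10.0],
              [20.0]])                       # m x 1 column vector

print(A.T)                                   # transpose
print(A @ B)                                 # matrix product (A @ B != B @ A in general)
print(np.allclose((A @ B).T, B.T @ A.T))     # (AB)^T = B^T A^T
print(A * B)                                 # Hadamard (element-wise) product
print(A + b)                                 # broadcasting: b is added to each column of A
```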

Central Problem of Linear Algebra

  • We want to solve $Ax = b$
    • Various techniques exist to state whether this is possible, how to do it if it is possible, or how close we can get

Gaussian Elimination

  • Use elementary row operations (equivalently, left-multiplication by elimination matrices) on the augmented matrix $A|b$ to try and simplify the problem as much as possible
  • Through taking linear combinations of the rows, you can simplify the problem while maintaining the row space (and thus the solution) of the problem

Reduced Row Echelon Form (RREF)

  • Defined as a matrix with the following properties
    • The pivot of any row is 1
      • The pivot is the first non-zero entry of a row
    • The pivot of any given row is always to the right of the pivot of the row above it
      • It does not need to be in adjacent column, it just needs to be to the right
    • The pivot is the only non-zero entry in its column
  • The end goal of Gaussian elimination is to get the augmented matrix into RREF
Reading of Solutions of Ax=b from RREF
  • For each ith column in the RREF that does not have a pivot (ie. the free variables), define a variable $\lambda_{i}$.
    • First set all $\lambda$’s to 0 and solve for x
    • Then cycle through the set of $\lambda$, setting only one of them to 1 and the rest of the free variables to 0, and solve for x
  • The solution is then a linear combination of the vectors found, with the vector found when all $\lambda$ equal zero having a prefactor of 1, and the rest of the vectors having a prefactor of their associated $\lambda$
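
A sketch using SymPy (chosen here because it has an exact rref routine) that reads the solution set off the RREF of an augmented matrix; the system is made up and has one free variable:

```python
# Read the solution set of Ax = b off the RREF of the augmented matrix.
import sympy as sp

A = sp.Matrix([[1, 2, 1],
               [2, 4, 0]])
b = sp.Matrix([3, 2])

rref_matrix, pivot_cols = A.row_join(b).rref()   # RREF of [A | b]
print(rref_matrix)
print(pivot_cols)            # pivot columns; the remaining columns of A are free variables

x1, x2, x3 = sp.symbols('x1 x2 x3')
print(sp.linsolve((A, b), x1, x2, x3))   # parametric solution, one parameter per free variable
```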

LU Decomposition

  • We can write any square matrix (possibly after row exchanges) as the product of a lower triangular matrix and an upper triangular matrix
    • $M = LU$
  • Steps to get LU Decomposition
    • Perform Gaussian elimination until you get an upper triangular matrix. This is U
    • As you do row operations, keep track of the elimination matrices used along the way
    • Explicitly, you start from $A=LU$
      • $A=LU \rightarrow E_{0}E_{1}…E_{n-1}A = U$
      • So $L^{-1} = E_{0}E_{1}…E_{n-1} \rightarrow L = E_{n-1}^{-1}…E_{1}^{-1}E_{0}^{-1}$
    • The inverses of the elimination matrices are simple: you just add the row back instead of subtracting it
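
A sketch using scipy.linalg.lu on an arbitrary matrix; note that SciPy also returns a permutation matrix P, since it pivots rows for stability, so the factorization is A = PLU:

```python
# LU factorization of an arbitrary square matrix.
import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
P, L, U = lu(A)                     # P accounts for any row exchanges (pivoting)
print(L)                            # unit lower triangular
print(U)                            # upper triangular
print(np.allclose(A, P @ L @ U))    # True
```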

LDU Decomposition

  • The goal of LDU is to have ones on the main diagonals of L and U.
    • D is a diagonal matrix
    • This is accomplished by scaling each row of U such that the diagonal element in each row is 1
    • You place the scaling factor in the corresponding slot in D

Identity and Inverses

  • The identity matrix is defined as $A_{i,j} = \delta_{i,j}$, where $\delta$ is the Kronecker delta. More explicitly, A has ones along the main diagonal (top left to bottom right) and zeros everywhere else
    • $IA = AI = A$
  • The Inverse matrix of a square matrix A is defined as $AA^{-1} = A^{-1}A = I$
    • Assuming $A^{-1}$ exists, it solves the fundamental problem of linear algebra: $Ax=b \rightarrow x = A^{-1}b$

Vector Spaces

  • A vector space over some field $\mathbb{F}$ (think $\mathbb{R}$, $\mathbb{C}$, $\mathbb{Q}$) is a set $V$ with two operations + and · that satisfies the following properties (for all $u,v,w \in V$ and $c,d \in \mathbb{F}$)
    • Closure under addition: $u+v \in V$
    • Commutativity of addition: $u+v=v+u$
    • Associativity of addition: $(u+v)+w=u+(v+w)$
    • Zero: $\exists\ 0_{v} \in V$ such that $u+0_{v} = u \ \forall u \in V$
    • Additive inverse: for every $u \in V$ there exists $w \in V$ such that $u+w = 0_{v}$
    • Closure under scalar multiplication: $c\cdot v \in V$
    • Distributivity of scalars over scalar addition: (c+d)·v = c·v+d·v
    • Distributivity of scalars over vector addition: c·(u+v) = c·u+c·v
    • Associativity of scalar multiplication: (cd)·v = c·(d·v)
    • Existence of unity: 1·v = v for all v ∈ V where $1 \in \mathbb{F}$

Subspaces

  • A subspace U is a subset of V that satisfies:
    • $\alpha_{1}u_{1}+\alpha_{2}u_{2} \in U$ for all $\alpha_{1},\alpha_{2} \in \mathbb{R}$ and $u_{1},u_{2} \in U$
  • For any subset $S \subset V$, span(S) is a subspace of V
Orthogonal Complements
  • Let U and V be subspaces of vector space W
    • The sum of U and V is: $U+V = span(U\cup V) = \{u+v|u\in U,v\in V\}$
    • The direct sum of U and V is: $U\oplus V = span(U\cup V) = \{u+v|u\in U,v\in V\}$, assuming that $U\cap V = \{0_{w}\}$
      • Let $w = u+v \in U\oplus V$. There is only one way to write w as the sum of a vector in U and a vector in V
  • Given a subspace U of W, we define:
    • $U^{\perp} = \{w\in W|w\cdot u = 0,\ \forall u \in U\}$
    • This is the orthogonal complement
    • Let U be a subspace of a finite dimensional vector space W. Then the set $U^{\perp}$ is a subspace of W and $W = U\oplus U^{\perp}$

Linear Transformations

  • Define a map $L: V \rightarrow W$. This map is linear if the following conditions hold:
    • $L(u+v) = L(u)+L(v)$
    • $L(cv) = cL(v)$
  • Alternatively, you can combine the two conditions:
    • $L(ru+sv) = rL(u)+sL(v)$

Diagonalization

  • Given a linear transformation, how can we write it as a matrix?
    • Simplest example: $L: V\rightarrow V$ and we have a basis consisting of the linearly independent eigenvectors of L and the corresponding eigenvalues. Then L can be written as a diagonal matrix whose entries are the eigenvalues
      • the matrix for L in the basis S is diagonal iff S is a basis of eigenvectors for L
Change of Basis
  • A change of basis refers to writing one basis S in terms of another basis T
    • Look at each vector in S, and figure out a linear combination of T that creates S. This linear combination goes into the associated column of the matrix P
  • Two matrices M and N are similar if there exists an invertible matrix P such that
    • $N = P^{-1}MP$
    • A matrix M is diagonalizable if there exists an invertible matrix P and a diagonal matrix D such that
      • $D = P^{-1}MP$
    • What this means is that, to rewrite a matrix M in terms of a different basis, perform the transformation $M \rightarrow P^{-1}MP$ with the appropriate change of basis matrix P
    • If you know the eigendecomposition of a matrix M, then you can diagonalize M by performing a change of basis to the eigenvector basis

Linear Dependence and Span

  • One can think about the product $Ax$ as the following:
    • each component of x scales the corresponding column of A. The scaled columns are then summed together
      • $Ax = \Sigma_{i} x_{i}A_{:,i}$
    • This is one way to think about a linear combination. Given a set of vectors, you scale each vector by some factor and sum the scaled vectors
    • span denotes the set of all possible linear combinations
  • One can reformulate the fundamental problem $Ax=b$ as determining whether some combination of the columns equals b, or equivalently, does b lie in the span of the column space of A?
  • For a given set of vectors, we say the set is linearly independent if no nontrivial combination (ie. with at least one nonzero coefficient) yields the zero vector

Basis and Dimension

  • Dimension refers to the number of components necessary to describe a vector
  • Let V be a vector space. Then a set S is a basis for V if S is linearly independent and V = span(S)
    • In $\mathbb{R^{n}}$, the standard basis is the columns of the identity matrix
    • A basis is not unique
      • Although each vector has a unique representation in a given basis
  • $v_{1},v_{2},…,v_{n}$ form a basis for V iff V = span($v_{1},v_{2},…,v_{n}$) and the vectors are linearly independent

Norms

  • The norm is a function that gives a metric of the size of a vector. It must satisfy the following properties:
    • $f(x) = 0 \rightarrow x = 0$
    • $f(x+y) \leq f(x) + f(y)$ (triangle inequality)
    • $ \forall \alpha \in \mathbb{R}, f(\alpha x) = |\alpha|f(x)$
  • The $L^{p}$ norm is defined as (for $p \in \mathbb{R}, p \geq 1$)
    • $||x||_{p} = (\Sigma_{i}|x_{i}|^{p})^{\frac{1}{p}}$
  • The squared $L^{2}$ norm is useful since mathematically and computationally, it is easier to work with (derivative is simple, don’t need a square root)
  • The $L^{1}$ norm is useful when the difference between zero and nonzero elements is important
    • The $L^{1}$ norm increases by $\epsilon$ as x goes from 0 to $\epsilon$
  • The $L^{\infty}$ norm denotes the absolute value of the largest element in the array
    • $||x||_{\infty} = max_{i}|x_{i}|$
  • There also exists a notion of norm for a matrix. The most common is the Frobenius norm:
    • $||A||_{F} = \sqrt{\Sigma_{i,j}A^{2}_{i,j}}$
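
A quick sketch of these norms with numpy.linalg.norm on illustrative inputs:

```python
# The norms above via numpy.linalg.norm.
import numpy as np

x = np.array([3.0, -4.0, 0.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.norm(x, 1))        # L1 norm: sum of absolute values = 7
print(np.linalg.norm(x, 2))        # L2 (Euclidean) norm = 5
print(np.linalg.norm(x, np.inf))   # L-infinity norm: largest absolute value = 4
print(np.linalg.norm(A, 'fro'))    # Frobenius norm = sqrt(sum of squared entries)
```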

Orthogonal Bases

  • a basis is orthogonal if all the vectors are perpendicular to each other
    • $v_{i}\cdot v_{j} = 0$ for $i \neq j$
      • Orthonormal bases impose the additional condition that the norm of each basis vector is 1, ie. $v_{i}\cdot v_{j} = \delta_{ij}$
    • In an orthonormal basis, any vector can be written as
      • $v = \Sigma_{i}(v\cdot u_{i})u_{i}$
    • Change of basis matrices between orthonormal bases are orthogonal matrices ie.
      • $P^{-1} = P^{T}$

Gram-Schmidt and Orthogonal Complements

Eigendecomposition

  • Solve the matrix equation $Av = \lambda v$, where $\lambda$ is a constant.
    • $\lambda$ is called the eigenvalue and $v$ is called the eigenvector
    • Eigenvectors are NOT unique. This is because we can think of an eigenvector as any vector whose direction is unchanged after being acted on by a matrix. Hence, eigenvectors are typically normalized for convenience.
    • Multiple eigenvectors can have the same eigenvalue
    • Real matrices can have complex eigenvalues
  • The eigendecomposition is not possible for all matrices
  • Given the eigendecomposition of a matrix (ie $\lambda = [\lambda_{1},\lambda_{2},…,\lambda_{n}]$ and V is a matrix whose columns are the eigenvectors), you can write the matrix as:
    • $A = V\, diag(\lambda)V^{-1}$
    • Typically, the eigenvalues are sorted in descending order
    • Make sure that each eigenvector is in the column that corresponds to its eigenvalue
  • For a real symmetric matrix, the eigendecomposition is:
    • $Q\Lambda Q^{T}$, where $Q$ is an orthogonal matrix
    • The eigendecomposition is guaranteed to exist for a real, symmetric matrix
  • A matrix is singular iff any of the eigenvalues are zero
  • Let $f(x) = x^{T}Ax$ subject to $||x||_{2}=1$. Whenever x is equal to an eigenvector of A, f equals the corresponding eigenvalue
    • The minimum and maximum of f within the constraint region (ie. the surface of a n-dimensional unit ball) is the minimum and maximum eigenvalue respectively
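
A sketch of the eigendecomposition of a small symmetric matrix with NumPy (eigh is used since the matrix is symmetric), confirming $A = Q\Lambda Q^{T}$:

```python
# Eigendecomposition of a small real symmetric matrix and reconstruction A = Q diag(lambda) Q^T.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, Q = np.linalg.eigh(A)     # eigh: for symmetric matrices; columns of Q are eigenvectors
print(eigvals)                                      # [1., 3.]
print(np.allclose(A, Q @ np.diag(eigvals) @ Q.T))   # A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))              # Q is orthogonal
```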

Singular Value Decomposition (SVD)

  • Every real matrix has a singular value decomposition. Suppose A is an mxn matrix. Let U be a mxm orthogonal matrix, D be a mxn diagonal matrix (NOTE: not necessarily square), and V be a orthogonal nxn matrix. Then
    • $A = UDV^{T}$
    • The columns of U are the left-singular vectors, the columns of V are the right-singular vectors, and the diagonal elements are the singular values
    • U can be found by taking the eigenvectors of $AA^{T}$, V can be found by taking the eigenvectors of $A^{T}A$, and the singular values are the square roots of the eigenvalues of $A^{T}A$ or $AA^{T}$ (whichever has the smaller dimension is less work; their nonzero eigenvalues agree).
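
A sketch of the SVD of an arbitrary 2x3 matrix with NumPy, reconstructing $A = UDV^{T}$ and checking that the squared singular values match the eigenvalues of $AA^{T}$:

```python
# SVD of a 2x3 matrix and reconstruction A = U D V^T.
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # compact form
print(s)                                           # singular values, descending
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True
print(s**2)                                        # eigenvalues of A @ A.T (the smaller Gram matrix)
```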

Moore-Penrose Pseudoinverse

  • For non-square matrices, there may exist a matrix $B$ that solves $Ax = y$ by left-multiplication:
    • $Ax=y \rightarrow x = By$
  • If you have a tall matrix, there may exist no solution, and if you have a wide matrix, there could be multiple solutions
  • Define the matrix:
    • $A^{+} = \lim_{\alpha \rightarrow 0}(A^{T}A+\alpha I)^{-1}A^{T}$
  • The practical formula for computing this matrix is
    • $A^{+} = VD^{+}U^{T}$
      • $U,V,D$ are from the SVD of A. $D^{+}$ can be computed by taking the reciprocal of the nonzero elements and then transposing
  • If A has more columns than rows, then $x=A^{+}y$ is one of the possible solutions
    • This solution has the minimum Euclidean norm $||x||_{2}$ among all possible solutions
  • If A has more rows than columns, the x from the pseudoinverse minimizes $||Ax-y||_{2}$ (ie. it is the least-squares solution, the x for which Ax is as close to y as possible in Euclidean norm)
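
A sketch using np.linalg.pinv on an arbitrary tall system, checking that the pseudoinverse solution agrees with the least-squares solution:

```python
# Pseudoinverse solution of an overdetermined (tall) system: x = A+ y minimizes ||Ax - y||_2.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # more rows than columns: generally no exact solution
y = np.array([1.0, 2.0, 2.0])

x = np.linalg.pinv(A) @ y
print(x)                                                       # least-squares solution
print(np.allclose(x, np.linalg.lstsq(A, y, rcond=None)[0]))    # agrees with lstsq
```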

QR Decomposition

Gram-Schmidt

  • For each column vector in matrix M, let the first column be your first basis vector
    • The second column can be made orthogonal by subtracting off its projection onto the first column
      • $v^{\perp} = v-\frac{u \cdot v}{u \cdot u}u$
    • Rinse and repeat for the rest of the columns, subtracting off the projections of the established basis vectors
    • Normalize every vector at the end
  • Place all the normalized basis vectors in an orthogonal matrix Q
  • The R matrix can be figured out with the following fact:
    • The (i,j) entry of the upper triangular matrix R equals the dot product of the i-th column of Q with the j-th column of M
  • Gram-Schmidt is quite numerically unstable, so it is good for visualization but impractical
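
In practice one calls a library QR routine (NumPy's np.linalg.qr, which relies on Householder reflections rather than classical Gram-Schmidt) precisely because of that instability; a short sketch on an arbitrary matrix:

```python
# QR decomposition of an arbitrary matrix.
import numpy as np

M = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Q, R = np.linalg.qr(M)
print(np.allclose(M, Q @ R))               # M = QR
print(np.allclose(Q.T @ Q, np.eye(3)))     # Q has orthonormal columns
print(np.allclose(np.tril(R, -1), 0))      # R is upper triangular
```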

Trace

  • The trace gives the sum of all diagonal entries of a matrix
    • $Tr(A) = \Sigma_{i}A_{i,i}$
  • Useful to replace summations
    • Frobenius norm can be rewritten as
      • $||A||_{F} = \sqrt{Tr(AA^{T})}$
  • $Tr(A) = Tr(A^{T})$
  • The trace of a square matrix is invariant under cyclic permutation of matrices
    • $Tr(ABC)=Tr(CAB)=Tr(BCA)$
    • This holds even if A and B are not square matrices, as long as the final product is a square matrix

Determinant

  • Denoted det(A), it is only defined for square matrices. det(A) equals the product of all eigenvalues of the matrix
  • $|det(A)|$ can be thought of as how the volume of the space changes upon action by the matrix A
    • If det(A) = 0, then the space collapses to a lower dimensional one
      • det(A) = 0 also means that the matrix is not invertible
    • If det(A) = 1, then the transformation preserves volume
  • The determinant changes sign upon swapping either two rows or two columns
  • Scaling a row by $\lambda$ changes the determinant by a factor of $\lambda$
  • Adding a multiple of one row to another row does not change the value of the determinant
  • det(AB) = det(A) det(B)
  • det($A^{T}$) = det(A)
  • det($A^{-1}$) = $\frac{1}{det(A)}$

Kernel, Range, Nullity, Rank

Mapping definitions

  • Let $f: S \rightarrow T$ be a map from S to T.
    • S is the domain of the function
    • T is the codomain of the function
    • the set $ran(f) = im(f) = f(S) = \{f(s)|s\in S\} \subset T$ is called the range or image of f
      • You can think of the range as the part of the codomain T that S actually maps to
      • the image of $L(V)$ is a subspace of W
    • for some subset U of T, then $f^{-1}(U) = \{s\in S|f(s) \in U\}\subset S$
      • The preimage of a set $U$ is the set of all elements of $S$ which maps to $U$
    • The function f is one-to-one if different elements of S always map to different elements of T (for $x\neq y, f(x)\neq f(y)$)
      • also called injective
    • The function f is onto if every element of T is mapped to by some element of S (ie. $\forall t \in T, \ \exists s \in S \rightarrow f(s)= t$)
      • also called surjective
    • functions that are both injective and surjective are called bijective

Kernel

  • Let $L: V\rightarrow W$ be a linear transformation. The kernel of L is defined as
    • $ker\ L = \{v\in V|Lv=0_{w}\}$
    • In words, the kernel is all $v \in V$ such that the linear transformation maps onto the zero vector in $W$
    • A linear transformation L is injective iff $ker\ L = \{0_{V}\}$
      • ie. only the zero vector in $V$ maps to the zero vector in $W$
      • $ker\ L$ is a subspace of V

Rank and Nullity

  • The rank of a linear transformation $L$ is the dimension of its image, written as $rank(L) = dim\ L(V) = dim\ Im(L)$
  • The nullity is the dimension of the kernel, written as $null\ L = dim\ ker\ L$
  • $dim\ V = dim\ ker\ L + dim\ L(V)$

Least Squares

  • Solving the equation $M^{T}MX = M^{T}V$, where M is a rectangular matrix
    • Used when there exists no perfect solution to $MX=V$. Finds the best possible solution
  • For polynomial fits, you construct a Vandermonde matrix from your data and solve for the coefficients of the polynomial
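
A sketch of a quadratic least-squares fit on invented data, building the Vandermonde matrix with np.vander and solving the least-squares problem (equivalent to the normal equations $M^{T}MX = M^{T}V$) with np.linalg.lstsq:

```python
# Quadratic fit: build the Vandermonde matrix and solve the least-squares problem.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0])    # roughly 2x^2 + 1 plus noise

M = np.vander(x, N=3, increasing=True)       # columns: 1, x, x^2
coeffs, residuals, rank, sv = np.linalg.lstsq(M, y, rcond=None)
print(coeffs)                                # [c0, c1, c2], approximately [1, 0, 2]
```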