Here is a compendium of my notes that I took to review for Machine Learning. This was meant to be a quick refresher on concepts and to put everything in one place. For a less cursory version, see the background section under Resources at the course website.
- Probability and Statistics
- Linear Algebra
- Types of Objects in Linear Algebra
- Matrix Operations
- Central Problem of Linear Algebra
- LU Decomposition
- Identity and Inverses
- Vector Spaces
- Linear Transformations
- Linear Dependence and Span
- Basis and Dimension
- Norms
- Orthogonal Bases
- Gram-Schmidt and Orthogonal Complements
- Eigendecomposition
- Singular Value Decomposition (SVD)
- Moore-Penrose Pseudoinverse
- QR Decomposition
- Trace
- Determinant
- Kernel, Range, Nullity, Rank
- Least Squares
Probability and Statistics
Basic Concepts
Axioms of Probability
- Probability Measure $P: \mathbb{F} \rightarrow \mathbb{R}$ such that
- $P(A) \geq 0$ for all $A \in \mathbb{F}$
- $P(\Omega) = 1$
- If $A_{1},A_{2}…$ are disjoint events, then
- $P(\cup A_{i}) = \Sigma_{i} P(A_{i})$
Definitions
- Sample space $\Omega$: The set of all possible outcomes
- Event space $\mathbb{F}$: the set of events, where each event $A$ is a subset of $\Omega$
- A random variable is a quantity that has an uncertain value.
- Examples include coin flips (discrete), daily temperature (continuous)
- An ensemble sampled from a random variable forms a probability distribution
- The probability density/mass function (pdf, pmf) denotes the probability distribution of a continuous or discrete random variable respectively. The integral/sum must be 1.
- Joint probability p(x,y) denotes the probability of pair (x,y) occurring. Can be a mix of discrete and continuous. X and y can also be vectors.
- Marginalization refers to the “summing out” of a variable from a probability distribution ($p(x) = \int p(x,y) dy$ for instance). This removes dependence of distribution on summed variables. This can be extended to discrete ($\int \rightarrow \Sigma$) or to multiple variables ($p(x,y) = \Sigma_w \int p(w,x,y,z) dz$ ).
- Conditional probability is the probability of x given a fixed value of y, written as p(x|y = y*). Like taking a cross section of a CAD model
- The slice p(x, y = y*) is only a relative probability, since it does not sum/integrate to 1; dividing by p(y = y*) normalizes it
- $p(x,y) = p(x|y)p(y) \rightarrow p(x,y) = p(y|x)p(x)$ from symmetry
- Chain Rule: $p(x_{1},x_{2},\dots,x_{k}) = \Pi_{i=1}^{k} p(x_{i}|x_{1},\dots,x_{i-1})$
- Note: when $i=1$, the conditioning set is empty, so the factor is simply $p(x_{1})$
- In words: the joint probability is the product of the probability of the first, times the probability of the second given the first, etc.
- $p(x|y)=p(x)$ and $p(y|x)=p(y)$ are the conditions for x and y to be independent.
- Substitute into conditional probability definition yields $p(x,y)= p(x)p(y)$
- Conditional Independence: $p(x_{1}|x_{2},x_{3}) = p(x_{1}|x_{2})$ and $p(x_{3}|x_{1},x_{2}) = p(x_{3}|x_{2})$ are the conditions. Note the symmetry in $x_{1}$ and $x_{3}$
- In words: $x_{1}$ is conditionally independent of $x_{3}$ given $x_{2}$. Or, if we know $x_{2}$, then $x_{1}$ gives no information about $x_{3}$.
- Variance of a population $\sigma^{2} = \frac{\Sigma(x-\mu)^{2}}{n} = \frac{\Sigma x^{2}}{n}-\mu^{2}$
- Variance of sample: $s^{2} = \frac{\Sigma(x-\bar x)^{2}}{n-1} = \frac{\Sigma x^{2}-\frac{(\Sigma x)^{2}}{n}}{n-1}$
- Standard Deviation = $\sqrt{s^{2}}$
- Standard Error = $\frac{s}{\sqrt{n}}$
- Null Hypothesis: There is no significant difference between two populations
- Type I error: Probability of rejecting a true null hypothesis
- Equals significance level $\alpha$
- Type II error: Probability of failing to reject a false null hypothesis
Bayes' Rule
- $p(x|y) = \frac{p(y|x) p(x)}{p(y)}$
- posterior: $p(x|y)$ or what we know about x given y
- prior: $p(x)$ or what we know about x before looking at y
- likelihood: $p(y|x)$ or how likely the observed y is under each possible value of x
- Evidence: $p(y)$ knowledge of y
- Typical example: a diagnostic test for Covid is 95% accurate. This means that the test correctly identifies the true state of the patient 95% of the time (ie. if she has Covid, the test returns positive; if she doesn't, it returns negative). Here X is the random variable of the true Covid state of the patient and Y is the random variable of the test result. Suppose the infection rate of Covid is 1 in 1000.
- What is $p(Covid|Pos)$? Not 95%.
- $p(Covid|Pos) = \frac{p(Pos|Covid)p(Covid)}{p(Pos|Covid)p(Covid)+p(Pos|NoCovid)p(NoCovid)}$
- Plugging in values yields about 2%
- A more conceptual way of thinking about Bayes' rule is that it tells you how to eliminate impossible outcomes given some information about the world.
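As a quick numeric check of the Covid example above, here is a minimal Python sketch. The 95% accuracy and 1-in-1000 prevalence are the numbers assumed in the example:

```python
# Posterior probability of Covid given a positive test, via Bayes' rule.
p_covid = 1 / 1000           # prior: infection rate
p_pos_given_covid = 0.95     # test accuracy from the example (sensitivity)
p_pos_given_no_covid = 0.05  # false positive rate (1 - accuracy)

# Evidence: total probability of a positive test (marginalizing over the true state).
p_pos = p_pos_given_covid * p_covid + p_pos_given_no_covid * (1 - p_covid)

p_covid_given_pos = p_pos_given_covid * p_covid / p_pos
print(f"p(Covid | Pos) = {p_covid_given_pos:.4f}")  # ~0.0187, i.e. about 2%
```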
Expectation
- $E[f(x)] =\int f(x) \cdot p(x)dx$ ($\int \rightarrow \Sigma$ for discrete)
- p(x) is the weight assigned to each value of f(x)
| Function | Expectation |
| --- | --- |
| $x$ | mean $\mu_x$ |
| $x^k$ | k-th moment about 0 |
| $(x-\mu_x)^k$ | k-th central moment |
| $(x-\mu_x)^2$ | variance (spread) |
| $(x-\mu_x)^3$ | skew (lean) |
| $(x-\mu_x)^4$ | kurtosis (peakedness) |
| $(x-\mu_x)(y-\mu_y)$ | covariance of x and y |

- Expectation Manipulation Rules. Let a be a constant
- $E[a] = a$
- $E[a\cdot x] = aE[x]$
- $E[x + y] = E[x]+E[y]$
- $E[x \cdot y] = E[x]\cdot E[y]$ if x,y are independent
- Variance Manipulation Rules
- $Var[a\cdot x+b] = a^{2}\cdot Var[x]$
- $Var[x+y] = Var[x]+Var[y]$ if x,y are independent
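A minimal Monte Carlo sketch (assuming NumPy is installed) that illustrates the expectation and variance manipulation rules above on independent samples; the distributions and constants are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, -2.0
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)   # E[x] = 1, Var[x] = 4
y = rng.exponential(scale=0.5, size=1_000_000)       # independent of x, E[y] = 0.5

# E[a*x] = a*E[x]  and  E[x + y] = E[x] + E[y]
print(np.mean(a * x), a * np.mean(x))
print(np.mean(x + y), np.mean(x) + np.mean(y))

# E[x*y] = E[x]*E[y] only because x and y are independent
print(np.mean(x * y), np.mean(x) * np.mean(y))

# Var[a*x + b] = a^2 * Var[x]  and  Var[x + y] = Var[x] + Var[y] (independence)
print(np.var(a * x + b), a**2 * np.var(x))
print(np.var(x + y), np.var(x) + np.var(y))
```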
Common Probability Distributions
Bernoulli Distribution
- Simple success or failure experiment
- let $\lambda$ denote the probability of success. x=0 denotes failure and x=1 denotes success
- $Bern_{x}(\lambda) = \lambda^{x}(1-\lambda)^{1-x}$
- $p(x=0) = 1-\lambda$
- $p(x=1) = \lambda$
- $E[x] = \lambda$
- $Var[x] = \lambda(1-\lambda)$
Binomial Distribution
- Given a sequence of $N$ independent Bernoulli experiments, the Binomial distribution enumerates the probability for $m$ successes in the sequence
- $Bin_{m}(N,\lambda) = {N \choose m} \lambda^{m}(1-\lambda)^{N-m}$
- $E[x] = N\lambda$
- $Var[x] = N\lambda(1-\lambda)$
- For a fixed $N\lambda$, the Binomial converges to Poisson as $N \rightarrow \infty$
Poisson Distribution
- Consider independent events that happen at an average rate $\lambda$ over time (ie. $\lambda$ has “units” of number of events per time)
- Poisson distribution is discrete, and give the probability of a given number of events $k$ occurring in a fixed time interval
- $p(k,\lambda) = \frac{\lambda^{k}e^{-\lambda}}{k!} = Pois_{k}(\lambda)$
- $E[x] = \lambda$
- $Var[x] = \lambda$
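To see the Binomial-to-Poisson convergence claim concretely, here is a small sketch (assuming SciPy is installed) that holds $N\lambda$ fixed while $N$ grows:

```python
import numpy as np
from scipy.stats import binom, poisson

rate = 3.0                      # fixed N * lambda
k = np.arange(0, 11)            # event counts to compare

for N in (10, 100, 10_000):
    p = rate / N                # per-trial success probability shrinks as N grows
    max_gap = np.max(np.abs(binom.pmf(k, N, p) - poisson.pmf(k, rate)))
    print(f"N={N:>6}: max |Binomial - Poisson| over k<=10 is {max_gap:.5f}")
```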
Categorical Distribution
- Suppose you run a single experiment with $K$ possible outcomes indexed by $k \in \{1,2,\dots,K\}$. Let $\lambda =[\lambda_{1},\lambda_{2},\dots,\lambda_{K}]$ with $\Sigma_{k=1}^{K} \lambda_{k} = 1$
- $p(x=k) = \lambda_{k}$
- $Cat_{x}(\lambda) = p(x)$
- $E[x=k] = \lambda_{k}$
- $Var[x=k] = \lambda_{k}(1-\lambda_{k})$
Multinomial Distribution
- The multinomial distribution gives the probability of observing a particular combination of counts across the categories
- Generalization of Categorical Distribution to N trials and Binomial Distribution to K outcomes.
- Let $m = [m_1 m_2 … m_K]$ denote the observed counts in each category (total number of categories is $K$) over a sequence of $N$ trials
- Let $\lambda = [\lambda_1 \lambda_2 … \lambda_K]$ denote the probability of each category (subject to normalization of $\lambda_1+ \lambda_2 + … + \lambda_K = 1$)
- $p(m) = {N \choose m_1\ m_2\ \dots\ m_K}\lambda_1^{m_1}\lambda_2^{m_2}\cdots\lambda_K^{m_K}$
- $E[m_{k}] = N\lambda_{k}$
- $Var[m_k] = N\lambda_{k}(1-\lambda_{k})$
Gaussian (Normal) Distribution
- Useful because a Gaussian is fully specified by only two moments and because the central limit theorem (CLT) holds
- CLT states that the mean of independently drawn random variables is normally distributed (given enough samples), irrespective of the original distribution
- $E[x] = \mu$
- $Var[x] = \sigma^{2}$
- $p(x) = \frac{1}{\sqrt{2\pi} \sigma}e^{\frac{-(x-\mu)^{2}}{2\sigma^{2}}}$
- Standard normal distribution ($\mu=0$, $\sigma = 1$)
Multivariate Gaussian Distribution
- Let $D$ be the dimension of the space
- Let $\mu$ be a D-dimensional vector (mean vector)
- Let $\Sigma$ be a DxD symmetric and positive semi-definite matrix (covariance matrix)
- $p(x) = \frac{1}{(2\pi)^{\frac{D}{2}}|\Sigma|^{\frac{1}{2}}} exp(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu))$
- $E[x] = \mu$
- $Var[x] = \Sigma$
- Alternatively, $p(x) = \frac{|\beta|^{\frac{1}{2}}}{(2\pi)^{\frac{D}{2}}} exp(-\frac{1}{2}(x-\mu)^{T}\beta(x-\mu))$ where $\beta = \Sigma^{-1}$ is a precision matrix
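A minimal sketch of the density formula above, evaluated directly with NumPy and cross-checked against scipy.stats.multivariate_normal (both libraries assumed available; the mean, covariance, and query point are made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate normal density at a single point x."""
    D = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # symmetric, positive definite
x = np.array([0.3, 0.7])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```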
Laplace distribution
- $p(x) = \frac{1}{2\gamma}exp(-\frac{|x-\mu|}{\gamma})$
- Places sharp peak at arbitrary point $\mu$ with spread of $\gamma$
Dirac Delta Distribution
- $p(x) = \delta(x-\mu)$
- Clusters all mass of the distribution function into a single point
- Infinitely high, infinitely narrow peak
- Exists to be integrated
- Can be extended to an arbitrary number of points (empirical distribution):
- $p(x) = \frac{1}{m}\Sigma_{i=1}^{m} \delta(x-x_{i})$
Mixtures of Distributions
- Define a cluster of distributions. Define a probability distribution P(c) that selects which distribution is drawn from.
- $P(x) = \Sigma_{i} P(c=i) P(x |c=i)$
- c is a latent variable, or a random variable that can’t be observed directly
- Common types of mixture models are the empirical distribution and Gaussian mixture models, where the cluster consists of Gaussians.
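A minimal sketch of ancestral sampling from a two-component Gaussian mixture: first draw the latent component c from P(c), then draw x from the selected Gaussian. NumPy is assumed and the component weights/parameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.3, 0.7])        # P(c = i)
means = np.array([-2.0, 3.0])         # mean of each Gaussian component
stds = np.array([0.5, 1.0])           # std of each Gaussian component

n = 10_000
c = rng.choice(len(weights), size=n, p=weights)   # sample the latent variable c
x = rng.normal(loc=means[c], scale=stds[c])       # sample x | c

# Mixture mean should match sum_i P(c=i) * mu_i = 0.3*(-2) + 0.7*3 = 1.5
print(x.mean())
```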
Common Functions
Logistic sigmoid $\sigma(x)$
- $\sigma(x) = \frac{1}{1+exp(-x)}$
- Defined on $\mathbb{R}$ and confined between 0 and 1
- Saturates as $x \rightarrow \pm \infty$ (ie insensitive to changes in x)
- Maximum slope at x=0
- Useful to produce parameter of a Bernoulli distribution
- Properties:
- $\sigma(x) = \frac{exp(x)}{exp(x)+exp(0)}$
- $\frac{d}{dx}\sigma(x) = \sigma(x)(1-\sigma(x))$
- $1-\sigma(x) = \sigma(-x)$
Softplus Function $\zeta(x)$
- $\zeta(x) = log(1+exp(x))$
- Defined on $\mathbb{R}$ and confined between 0 and $\infty$
- Produces $\sigma$ or $\beta$ of normal distribution
- A smoother version of $f(x) = max(0,x)$
- Properties:
- $log(\sigma(x)) = -\zeta(-x)$
- $\frac{d}{dx} \zeta(x) = \sigma(x)$
- $\forall x \in (0,1), \sigma^{-1}(x) = log(\frac{x}{1-x})$
- $\forall x>0, \zeta^{-1}(x) = log(exp(x)-1)$
- $\zeta(x) = \int_{-\infty}^{x} \sigma(y) dy$
- $\zeta(x)-\zeta(-x) = x$
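A short numeric sketch (assuming NumPy is installed) checking several of the sigmoid/softplus identities above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))   # log(1 + exp(x))

x = np.linspace(-5, 5, 11)

# 1 - sigma(x) = sigma(-x)
print(np.allclose(1 - sigmoid(x), sigmoid(-x)))

# d/dx softplus(x) = sigma(x), checked with a central finite difference
h = 1e-6
print(np.allclose((softplus(x + h) - softplus(x - h)) / (2 * h), sigmoid(x)))

# softplus(x) - softplus(-x) = x
print(np.allclose(softplus(x) - softplus(-x), x))

# inverse sigmoid (logit): sigma^{-1}(p) = log(p / (1 - p))
p = sigmoid(x)
print(np.allclose(np.log(p / (1 - p)), x))
```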
Information Theory
Self-Information
- General Idea: knowing that something unlikely has happened provides more information than knowing that something likely has happened. Information theory quantifies this statement
- Properties of Information
- Likely events have low information (extreme: guaranteed events contain no information)
- Less likely events should have more information
- Independent events should have additive information (ie. tossing a coin twice contains twice as much information as tossing a coin once)
- The above motivate the definition of self-information. Let x be an event. Then $I(x) = -log(P(x))$ is the self-information of the event. The log typically is base 2 or base e, but in theory could have any base you want; a change in base is effectively a change in scale/units.
- NOTE: by definition, $lim_{x\rightarrow 0 } x log(x) = 0$
Shannon Entropy
- Extends I(x) to determine amount of uncertainty in a probability distribution:
- $H(x) = H(P) = E[I(x)]$, or the expectation value of I(x)
- In words, $H(x)$ gives the expected/average amount of information that you can gain from drawing from the probability distribution. Entropy is maximized for a uniform probability distribution (this has a very direct connection to entropy in the thermodynamic sense)
- When x is continuous, Shannon entropy is known as differential entropy
Kullback-Leibler (KL) divergence
- a way of quantifying the difference in information between two distributions $P(x)$ and $Q(x)$
- $D_{KL}(P||Q) = E_{x\sim P}[log(P(x))-log(Q(x))]$
- In words (for the discrete case), $D_{KL}$ is the extra information needed to send a message using the code P instead of the code Q.
- KL divergence is non-negative
- Equals 0 iff P and Q are the same distribution
- NOT symmetric in P and Q (not a true measure of distance)
Cross-entropy
- $H(P,Q) = H(P)+D_{KL}(P||Q) = -E[log(Q(x))]$
- Minimizing cross-entropy w.r.t. Q is the same as minimizing KL divergence
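A minimal sketch for discrete distributions (NumPy assumed; natural log, so units are nats) verifying $H(P,Q) = H(P)+D_{KL}(P||Q)$ and the asymmetry of KL; P and Q are made up:

```python
import numpy as np

P = np.array([0.6, 0.3, 0.1])   # "true" distribution
Q = np.array([0.2, 0.3, 0.5])   # "model" distribution

entropy = -np.sum(P * np.log(P))                 # H(P)
kl_pq = np.sum(P * (np.log(P) - np.log(Q)))      # D_KL(P || Q), expectation under P
kl_qp = np.sum(Q * (np.log(Q) - np.log(P)))      # D_KL(Q || P)
cross_entropy = -np.sum(P * np.log(Q))           # H(P, Q)

print(entropy, kl_pq, cross_entropy)
print(np.isclose(cross_entropy, entropy + kl_pq))   # True: H(P,Q) = H(P) + D_KL(P||Q)
print(np.isclose(kl_pq, kl_qp))                     # False: KL is not symmetric
```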
Exponential Distribution
- Let $p(x,\lambda)$ be a piecewise function such that $p(x<0,\lambda)=0$ and $p(x\geq 0,\lambda) = \lambda exp(-\lambda x)$, where $\lambda$ is the decay parameter
- Serves to put a sharp peak at x=0
Chi-squared Distribution
Basics
- Chi-squared is a continuous distribution of the sum of the squares of k standard normally distributed random variables.
- k is called the number of “degrees of freedom”
- $q = \Sigma_{i=1}^{k} x_{i}^{2}$ ($\chi^{2}$ parameter)
- $p(x) = \frac{x^{\frac{k}{2}-1}e^{-\frac{x}{2}}}{2^{\frac{k}{2}}\Gamma(\frac{k}{2})}$
- $E[x] = k$
- $Var[x] = 2k$
Goodness of Fit (GOF)
- $\chi^{2} = \Sigma \frac{(O-E)^{2}}{E} $
- degree of freedom (dof) = number of categories -1
- Look up the corresponding p-value in a table. Decide whether to reject the null hypothesis by comparing the p-value to the significance level $\alpha$: greater than $\alpha$ means don't reject the null, less than $\alpha$ means reject it
Independence
- Check whether categorical data is independent
- You are given observed data table O
- expected data table E is calculated cellwise
- For a given row and column, the cell value is $E = \frac{(\text{row total})(\text{column total})}{(\text{grand total})}$
- Calculate $\chi^{2}$ as normal
- dof = (#rows-1)(#cols-1)
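A minimal sketch of the independence test on a made-up 2x3 contingency table (SciPy assumed), computing the expected table cellwise as described above and cross-checking against scipy.stats.chi2_contingency:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

O = np.array([[20, 30, 50],     # observed counts (rows x columns), made up
              [30, 40, 30]])

row_totals = O.sum(axis=1, keepdims=True)
col_totals = O.sum(axis=0, keepdims=True)
grand_total = O.sum()

E = row_totals * col_totals / grand_total        # expected counts under independence
chi2_stat = np.sum((O - E) ** 2 / E)
dof = (O.shape[0] - 1) * (O.shape[1] - 1)
p_value = chi2.sf(chi2_stat, dof)                # upper-tail probability

print(chi2_stat, dof, p_value)

stat, p, d, _ = chi2_contingency(O)              # should match the manual computation
print(stat, p, d)
```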
T-test
One Sample
- Tests whether the mean of a normally distributed population is different from a specific value
- Let $\bar{x}$ be the sample mean and $\mu_{0}$ be the specific mean value to be tested
- $t = \frac{\bar{x}-\mu_{0}}{\frac{s}{\sqrt{n}}}$
- dof = n-1
- Using a lookup table, find the corresponding p-value, with the following caveats depending on your alternative hypothesis:
- $\mu>\mu_{0} \rightarrow$ read the table as given
- $\mu<\mu_{0} \rightarrow$ if the t-statistic is negative, read the table as though the t-value were positive
- If the t-statistic has the incorrect sign, read the $1-p$ value instead
- $\mu\neq\mu_{0} \rightarrow$ read the table as given, but double the p-value to take both tails into account
- If p-value is less than $\alpha$, reject the null hypothesis
Two Sample
Test whether the means of two populations are significantly different from one another
Paired
- This means that each value in first group has a one to one mapping to the value in the second group
- Subtract values in a pairwise manner and run one sample t-test with $\mu_{0}$ = 0
Unpaired
- $t = \frac{\bar x_{1} - \bar x_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}$
- dof = $(n_{1}-1)+(n_{2}-1)$
- Read off p-value and compare to significance level
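A minimal sketch of the one-sample and unpaired two-sample t statistics, computed from the formulas above and cross-checked against scipy.stats (SciPy assumed; the data is randomly generated for illustration):

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind

rng = np.random.default_rng(0)
x1 = rng.normal(loc=5.2, scale=1.0, size=30)
x2 = rng.normal(loc=4.8, scale=1.0, size=25)

# One-sample t-test against mu_0 = 5
mu0 = 5.0
t_one = (x1.mean() - mu0) / (x1.std(ddof=1) / np.sqrt(len(x1)))
print(t_one, ttest_1samp(x1, mu0).statistic)                 # should agree

# Unpaired two-sample t statistic in the form given above (the unequal-variance / Welch form)
t_two = (x1.mean() - x2.mean()) / np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
print(t_two, ttest_ind(x1, x2, equal_var=False).statistic)   # should agree
```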
Linear Algebra
Types of Objects in Linear Algebra
- Scalars: A number or a 1x1 matrix.
- Can be confined to various domains eg. ($s \in \mathbb{R}$) is a real-valued scalar and ($n \in \mathbb{N}$) defines a natural number scalar
- Vectors: An array of numbers. Each number is identified by its index.
- typically thought of as nx1 matrices (ie. column vectors).
- $x = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}$
- Can also think of vectors as points embedded in n-dimensional space, with each element representing a coordinate on a different axis
- Suppose you want to index only certain elements of a vector. Create a set S = {1,3,6} and write $x_{S}$ to denote elements $x_1, x_3, x_6$. $x_{-S}$ refers to the complement of $x_{S}$, ie. all the elements of x excluding $x_1, x_3, x_6$
- Matrices
- 2D array of numbers that uses 2 indices.
- if A has m rows and n columns, then $A \in \mathbb{R}^{m \times n}$
- $A_{i,j}$ denotes the element in the ith row and jth column.
- You can take slices with a “:”, namely $A_{i,:}$ denotes the horizontal slice at row i and $A_{:,j}$ denotes the vertical slice at col j
- Can be explicitly written out by elements:
- $\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ \end{bmatrix}$
- Tensors
- n-dimensional arrays that can have an arbitrary number of indices (like the Levi-Civita tensor $\epsilon_{ijk}$)
Matrix Operations
- Transpose: taking the mirror image across the diagonal
- Can also think about it as swapping the rows and columns, or more algebraically $A_{i,j} = A_{j,i}$
- Can transpose a column vector into a row vector. So another way of writing a column vector is $\begin{bmatrix} x_{1} & x_{2} & x_{3} \end{bmatrix}^{T}$
- When transposing a product of matrices, you reverse the order, and then transpose individually ie.
- $(AB)^{T} = B^{T}A^{T}$
- Multiplication
- Can multiply matrices by each other. If A is a m × n and B is a n × p matrix, then C is a m × p matrix.
- $C = AB$
- $C_{i,j} = \Sigma_{k} A_{i,k}B_{k,j}$
- Matrix multiplication is associative, distributive, but NOT commutative (ie. AB=BA is not always true)
- Can multiply scalars by matrices: scale each element in matrix by the scalar c
- Hadamard Product ($A \odot B$)
- Element-wise multiplication between two same sized matrices
- Dot Product: Between two vectors of the same size (say x and y), the dot product is defined as $x^{T}y$
- The dot product is commutative: $x^{T}y = y^{T}x$
- Addition
- Can add matrices together as long as their dimensions align
- Can add scalars to matrices: Let A be an mxn matrix. Create a mxn matrix whose entries are entirely ones and scale it by the scalar, then add to A
- Broadcasting: Adding a vector to a matrix
- let C and A be mxn matrices and b be a mx1 vector
- C = A+b where b is implicitly added to each column of A
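A short NumPy sketch (NumPy assumed) of the operations above: matrix product, transpose-of-a-product rule, Hadamard product, dot product, and broadcasting; the matrices are made up:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # 2x3
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, -1.0]])              # 3x2
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

C = A @ B                                # matrix product: (2x3)(3x2) -> 2x2
print(np.allclose((A @ B).T, B.T @ A.T)) # (AB)^T = B^T A^T

print(A * A)                             # Hadamard product: elementwise, same-shape matrices
print(x @ y, y @ x)                      # dot product is commutative

b = np.array([[10.0], [20.0]])           # 2x1 column vector
print(A + b)                             # broadcasting: b is added to each column of A
```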
Central Problem of Linear Algebra
- We want to solve $Ax = b$
- Various techniques exist to state whether this is possible, how to do it if it is possible, or how close we can get
Gaussian Elimination
- Use matrix multiplication on an augmented matrix $A|b$ to try and simplify the problem as much as possible
- Through taking linear combinations of the rows, you can simplify the problem while maintaining the row space (and thus the solution) of the problem
Reduced Row Echelon Form (RREF)
- Defined as a matrix with the following properties
- The pivot of any row is 1
- The pivot is the first non-zero entry of a row
- The pivot of any given row is always to the right of the pivot of the row above it
- It does not need to be in adjacent column, it just needs to be to the right
- The pivot is the only non-zero entry in its column
- The pivot of any row is 1
- The end goal of Gaussian elimination is to get the augmented matrix into RREF
Reading of Solutions of Ax=b from RREF
- For the ith column in RREF that does not have a pivot (ie. the free variables), define a variable $\lambda_{i}$.
- For the set of $\lambda$, first set all lambda’s to 0 and solve x
- Then cycle through the set of $\lambda$, setting one of them to 1 and the rest of the free variables to 0, and solve for x
- The solution is then a linear combination of the vectors found, with the vector with $\lambda$ all equal to zero having a prefactor of 1, and the rest of the vectors having a prefactor of their associated $\lambda$
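A minimal sketch using SymPy (assumed installed) to reduce an augmented matrix to RREF and identify the pivot columns; the free-variable recipe above then reads the general solution off the result. The system is made up:

```python
from sympy import Matrix

# Augmented matrix [A | b] for an underdetermined system (made-up example)
Ab = Matrix([[1, 2, 1, 3],
             [2, 4, 0, 2],
             [0, 0, 1, 2]])

rref_matrix, pivot_cols = Ab.rref()
print(rref_matrix)   # reduced row echelon form
print(pivot_cols)    # indices of pivot columns; the remaining columns of A are free variables
```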
LU Decomposition
- Many square matrices can be written as the product of a lower triangular matrix and an upper triangular matrix (if row exchanges are needed, a permutation matrix is included as well)
- $M = LU$
- Steps to get LU Decomposition
- Perform Gaussian elimination until you get an upper triangular matrix. This is U
- As you do row operations, keep track of the elimination matrices used along the way
- Explicitly, you start from $A=LU$
- $A=LU \rightarrow E_{0}E_{1}…E_{n-1}A = U$
- So $L^{-1} = E_{0}E_{1}…E_{n-1} \rightarrow L = E_{n-1}^{-1}…E_{1}^{-1}E_{0}^{-1}$
- The inverses of the elimination matrices are simple: you just add the row multiple back instead of subtracting it
LDU Decomposition
- The goal of LDU is to have ones on the main diagonals of L and U.
- D is a diagonal matrix
- This is accomplished by scaling each row of U such that the diagonal element in each row is 1
- You place the scaling factor in the corresponding slot in D
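A minimal sketch using scipy.linalg.lu (SciPy assumed); note that SciPy returns a permuted factorization $A = PLU$, since row exchanges may be needed in general. The matrix is made up:

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 1.0, 1.0],
              [4.0, -6.0, 0.0],
              [-2.0, 7.0, 2.0]])

P, L, U = lu(A)                   # A = P @ L @ U
print(L)                          # lower triangular with unit diagonal
print(U)                          # upper triangular
print(np.allclose(A, P @ L @ U))  # True
```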
Identity and Inverses
- The identity matrix is defined as $A_{i,j} = \delta_{i,j}$, where $\delta$ is the Kronecker delta. More explicitly, A has ones along the main diagonal (top left to bottom right) and zeros everywhere else
- I*A = A
- The Inverse matrix of a square matrix A is defined as $AA^{-1} = A^{-1}A = I$
- Assuming $A^{-1}$ exists, it solves the fundamental problem of linear algebra: $Ax=b \rightarrow x = A^{-1}b$
Vector Spaces
- A vector space over some field $\mathbb{F}$ (think $\mathbb{R}$ or $\mathbb{C}$) is a set $V$ with two operations + and · that satisfies the following properties (for all $u,v,w \in V$ and $c,d \in \mathbb{F}$)
- Closure under addition: $u+v \in V$
- Commutativity of addition: $u+v=v+u$
- Associativity of addition: $(u+v)+w=u+(v+w)$
- Zero: $\exists 0_{v} \in V$ such that $u+0_{v} = u \ \forall u \in V$
- Additive inverse: for every $u \in V$ there exists $w \in V$ such that $u+w = 0_{v}$
- Multiplicative closure: $c\cdot u \in V$ for all $c \in \mathbb{F}$ and $u \in V$
- Distributivity of scalars over scalar addition: (c+d)·v = c·v+d·v where $c,d \in \mathbb{F}$ and $v \in V$
- Distributivity of scalars over vector addition: c·(u+v) = c·u+c·v where $c \in \mathbb{F}$ and $u,v \in V$
- Associativity of scalar multiplication: (cd)·v = c· (d·v) where $c,d \in \mathbb{F}$ and $v \in V$
- Existence of Unity: 1 · v = v for all v ∈ V where $1 \in \mathbb{F}$
Subspaces
- A subspace U is a subset of V that satisfies:
- $\alpha_{1}u_{1}+\alpha_{2}u_{2} \in U$ for all $\alpha_{1},\alpha_{2} \in \mathbb{R}$ and $u_{1},u_{2} \in U$
- For any subset $S \subset V$, span(S) is a subspace of V
Orthogonal Complements
- Let U and V be subspaces of vector space W
- The sum of U and V is: $U+V = span(U\cup V) = \{u+v|u\in U,v\in V\}$
- The direct sum of U and V is: $U\otimes V = span(U\cup V) = \{u+v|u\in U,v\in V\}$, assuming that $U\cap V = \{0_{w}\}$
- Let $w = u+v \in U\otimes V$. There is only one way to write w as a the sum of a vector in U and a vector in V
- Given a subspace U of W, we define:
- $U^{\perp} = \{w\in W|w\cdot u = 0,\ \forall u \in U\}$
- This is the orthogonal complement
- Let U be a subspace of a finite dimensional vector space W. Then the set $U^{\perp}$ is a subspace of W and $W = U\oplus U^{\perp}$
Linear Transformations
- Define a map $L: V \rightarrow W$. This map is linear if the following conditions hold:
- $L(u+v) = L(u)+L(v)$
- $L(cv) = cL(v)$
- Alternatively, you can combine the two conditions:
- $L(ru+sv) = rL(u)+sL(v)$
Diagonalization
- Given a linear transformation, how can we write it as a matrix?
- Simplest example: $L: V\rightarrow V$ and we have a basis consisting of the linearly independent eigenvectors of L and the corresponding eigenvalues. Then L can be written as a diagonal matrix whose entries are the eigenvalues
- the matrix for L in the basis S is diagonal iff S is a basis of eigenvectors for L
Change of Basis
- A change of basis refers to writing one basis S in terms of another basis T
- Look at each vector in S, and figure out a linear combination of T that creates S. This linear combination goes into the associated column of the matrix P
- Two matrices are similar if there exists an invertible matrix P such that
- $N = P^{-1}MP$
- A matrix M is diagonalizable if there exists an invertible matrix P and a diagonal matrix D such that
- $D = P^{-1}MP$
- What this means is that, to rewrite a matrix M in terms of a different basis, perform the transformation $M \rightarrow P^{-1}MP$ with the appropriate change of basis matrix P
- If you know the eigendecomposition of a matrix M, then you can diagonalize M by performing a change of basis to the eigenvector basis
Linear Dependence and Span
- One can think about the product $Ax$ as the following:
- each component of x scales the corresponding column of A. The scaled columns are then summed together
- $Ax = \Sigma_{i} x_{i}A_{:,i}$
- This is one way to think about a linear combination. Given a set of vectors, you scale each vector by some factor and sum the scaled vectors
- span denotes the set of all possible linear combinations
- One can reformulate the fundamental problem $Ax=b$ as determining whether some combination of the columns equals b, or equivalently, does b lie in the span of the column space of A?
- For a given set of vectors, we say the set is linearly independent if no nontrivial linear combination (ie. one with at least one nonzero coefficient) yields the zero vector
Basis and Dimension
- Dimension refers to the number of components necessary to describe a vector
- Let V be a vector space. Then a set S is a basis for V if S is linearly independent and V = span(S)
- In $\mathbb{R^{n}}$, the standard basis is the columns of the identity matrix
- A basis is not unique
- Although each vector has a unique representation in a given basis
- A set {$v_{1},v_{2},\dots,v_{n}$} is a basis for V iff V = span($v_{1},v_{2},\dots,v_{n}$) and the vectors are linearly independent
Norms
- The norm is a function that gives a metric of the size of a vector. It must satisfy the following properties:
- $f(x) = 0 \rightarrow x = 0$
- $f(x+y) \leq f(x) + f(y)$ (triangle inequality)
- $ \forall \alpha \in \mathbb{R}, f(\alpha x) = |\alpha|f(x)$
- The $L^{p}$ norm is defined as (for $p \in \mathbb{R}, p \geq 1$)
- $||x||_{p} = (\Sigma_{i}|x_{i}|^{p})^{\frac{1}{p}}$
- The squared $L^{2}$ norm is useful since mathematically and computationally, it is easier to work with (derivative is simple, don’t need a square root)
- The $L^{1}$ norm is useful when the difference between zero and nonzero elements is important
- $L^{1}$ norm increases by $\epsilon$ as x goes from 0 to $\epsilon$
- The $L^{\infty}$ norm denotes the absolute value of the largest element in the array
- $||x||_{\infty} = \max_{i}|x_{i}|$
- There also exists a notion of norm for a matrix. The most common is the Frobenius norm:
- $||A||_{F} = \sqrt{\Sigma_{i,j}A^{2}_{i,j}}$
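A short sketch of the norms above via numpy.linalg.norm (NumPy assumed), including the trace identity for the Frobenius norm mentioned later in the Trace section:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.norm(x, 1))        # L1 norm: sum of absolute values -> 7
print(np.linalg.norm(x, 2))        # L2 (Euclidean) norm -> 5
print(np.linalg.norm(x, np.inf))   # L-infinity norm: largest absolute value -> 4
print(np.linalg.norm(A, 'fro'))    # Frobenius norm of a matrix
print(np.sqrt(np.trace(A @ A.T)))  # same value via the trace identity
```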
Orthogonal Bases
- a basis is orthogonal if all the vectors are perpendicular to each other
- $v_{i}\cdot v_{j} = 0$ for $i \neq j$ (for an orthonormal basis, $v_{i}\cdot v_{j} = \delta_{ij}$)
- Orthonormal bases impose the additional condition that the norm of each basis vector is 1
- In an orthonormal basis, any vector can be written as
- $v = \Sigma_{i}(v\cdot u_{i})u_{i}$
- Change of basis matrices between orthonormal bases are orthogonal matrices ie.
- $P^{-1} = P^{T}$
Gram-Schmidt and Orthogonal Complements
- See Gram-Schmidt under QR Decomposition and Orthogonal Complements under Vector Spaces
Eigendecomposition
- Solve the matrix equation $Av = \lambda v$, where $\lambda$ is a constant.
- $\lambda$ is called the eigenvalue and $v$ is called the eigenvector
- Eigenvectors are NOT unique. This is because we can think of an eigenvector as any vector whose direction is unchanged after being acted on by a matrix. Hence, eigenvectors are typically normalized for convenience.
- Multiple eigenvectors can have the same eigenvalue
- Real matrices can have complex eigenvalues
- The eigendecomposition is not possible for all matrices
- Given the eigendecomposition of a matrix (ie $\lambda = [\lambda_{1},\lambda_{2},\dots,\lambda_{n}]$ and V is a matrix whose columns are the eigenvectors), you can write the matrix as:
- $A = V diag(\lambda)V^{-1}$
- Typically, the eigenvalues are sorted in descending order
- Make sure that each eigenvector is in the column corresponding to its eigenvalue
- For a real symmetric matrix, the eigendecomposition is:
- $Q\Lambda Q^{T}$, where $Q$ is an orthogonal matrix
- The eigendecomposition is guaranteed to exist for a real, symmetric matrix
- A matrix is singular iff any of the eigenvalues are zero
- Let $f(x) = x^{T}Ax$ subject to $||x||_{2}=1$. Whenever x is equal to an eigenvector of A, f equals the corresponding eigenvalue
- The minimum and maximum of f within the constraint region (ie. the surface of an n-dimensional unit ball) are the minimum and maximum eigenvalues respectively
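A minimal NumPy sketch of the eigendecomposition above, including the symmetric special case and the determinant-as-product-of-eigenvalues fact; the matrices are made up:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, V = np.linalg.eig(A)                                   # columns of V are eigenvectors
print(np.allclose(A, V @ np.diag(eigvals) @ np.linalg.inv(V)))  # A = V diag(lambda) V^{-1}

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])                                      # real symmetric
w, Q = np.linalg.eigh(S)                                        # Q is orthogonal
print(np.allclose(S, Q @ np.diag(w) @ Q.T))                     # S = Q Lambda Q^T
print(np.linalg.det(S), np.prod(w))                             # det equals product of eigenvalues
```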
Singular Value Decomposition (SVD)
- Every real matrix has a singular value decomposition. Suppose A is an mxn matrix. Let U be a mxm orthogonal matrix, D be a mxn diagonal matrix (NOTE: not necessarily square), and V be a orthogonal nxn matrix. Then
- $A = UDV^{T}$
- The columns of U are the left-singular vectors, the columns of V are the right-singular vectors, and the diagonal elements are the singular values
- U can be found by taking the eigenvectors of $AA^{T}$, V can be found by taking the eigenvectors of $A^{T}A$, and the singular values are the square roots of the nonzero eigenvalues of $A^{T}A$ or $AA^{T}$ (both give the same nonzero eigenvalues; in practice use whichever has the smaller dimension).
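A short NumPy sketch of the SVD and its connection to the eigenvalues of $AA^{T}$; the matrix is made up:

```python
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])            # 2x3, so U is 2x2 and V is 3x3

U, s, Vt = np.linalg.svd(A)                # s holds the singular values
D = np.zeros_like(A)                       # mxn "diagonal" matrix
D[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ D @ Vt))          # A = U D V^T

# Singular values are square roots of the nonzero eigenvalues of A A^T (or A^T A)
eigvals = np.linalg.eigvalsh(A @ A.T)
print(np.sort(np.sqrt(eigvals)), np.sort(s))
```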
Moore-Penrose Pseudoinverse
- For non-square matrices, there may exist a matrix $B$ that solves $Ax = y$ by left-multiplication:
- $Ax=y \rightarrow x = By$
- If you have a tall matrix, there may exist no solution, and if you have a wide matrix, there could be multiple solutions
- Define the matrix:
- $A^{+} = \lim_{\alpha \rightarrow 0}(A^{T}A+\alpha I)^{-1}A^{T}$
- The practical formula for computing this matrix is
- $A^{+} = VD^{+}U^{T}$
- $U,V,D$ are from the SVD of A. $D^{+}$ can be computed by taking the reciprocal of the nonzero elements and then transposing
- If A has more columns than rows, then $x=A^{+}y$ is one of the possible solutions
- This solution has the minimum Euclidean norm $||x||_{2}$ among all possible solutions
- If A has more rows than columns, the x from the pseudoinverse minimizes $||Ax-y||_{2}$ (ie. it is the least-squares solution, the x for which Ax is as close to y as possible in Euclidean norm)
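A minimal sketch with np.linalg.pinv (NumPy assumed) showing the two regimes described above on made-up matrices:

```python
import numpy as np

# Wide matrix (more columns than rows): many solutions; pinv picks the minimum-norm one
A_wide = np.array([[1.0, 2.0, 3.0]])
y = np.array([6.0])
x_min_norm = np.linalg.pinv(A_wide) @ y
print(A_wide @ x_min_norm, np.linalg.norm(x_min_norm))   # solves exactly, smallest ||x||_2

# Tall matrix (more rows than columns): generally no exact solution; pinv gives least squares
A_tall = np.array([[1.0], [1.0], [1.0]])
y2 = np.array([1.0, 2.0, 4.0])
x_ls = np.linalg.pinv(A_tall) @ y2
print(x_ls)                                              # mean of y2, minimizing ||Ax - y||^2
```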
QR Decomposition
Gram-Schmidt
- For each column vector in matrix M, let the first column be your first basis vector
- The second column can be made orthogonal to the first by subtracting off the projection of column 2 onto column 1
- $v^{\perp} = v-\frac{u \cdot v}{u \cdot u}u$
- Rinse and repeat for the rest of the columns, subtracting off the projections of the established basis vectors
- Normalize every vector at the end
- Place all the normalized basis vectors in a orthogonal matrix Q
- The R matrix can be figured out with the following fact:
- The (i,j) entry of the upper triangular matrix R equals the dot product of the i-th column of Q with the j-th column of M
- Gram-Schmidt is quite numerically unstable, so it is good for visualization but impractical
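A minimal sketch of classical Gram-Schmidt as described above, compared against np.linalg.qr only through the product Q R (individual column signs can differ between implementations). NumPy is assumed and the matrix is made up:

```python
import numpy as np

def gram_schmidt_qr(M):
    """Classical Gram-Schmidt QR of a full-column-rank matrix (illustrative, not numerically robust)."""
    m, n = M.shape
    Q = np.zeros((m, n))
    for j in range(n):
        v = M[:, j].copy()
        for i in range(j):                      # subtract projections onto earlier basis vectors
            v -= (Q[:, i] @ M[:, j]) * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)         # normalize at the end
    R = Q.T @ M                                 # R_{i,j} = q_i . m_j, upper triangular
    return Q, R

M = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
Q, R = gram_schmidt_qr(M)
print(np.allclose(Q @ R, M))                    # reconstructs M
print(np.allclose(Q.T @ Q, np.eye(2)))          # columns are orthonormal

Q2, R2 = np.linalg.qr(M)                        # library QR for comparison
print(np.allclose(Q2 @ R2, M))
```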
Trace
- The trace gives the sum of all diagonal entries of a matrix
- $Tr(A) = \Sigma_{i}A_{i,i}$
- Useful to replace summations
- Frobenius norm can be rewritten as
- $||A||_{F} = \sqrt{Tr(AA^{T})}$
- $Tr(A) = Tr(A^{T})$
- The trace of a product of square matrices is invariant under cyclic permutation of the matrices
- $Tr(ABC)=Tr(CAB)=Tr(BCA)$
- This holds even if A and B are not square matrices, as long as the final product is a square matrix
Determinant
- Denoted as det(A), the determinant is only defined for square matrices. det(A) equals the product of all eigenvalues of the matrix
- $|Det(A)|$ can be thought of as how the volume of the space changes upon action by the matrix A
- If Det(A) = 0, then the space collapsed to a lower dimensional one
- Det(A) = 0 also means that the matrix is not-invertible
- If Det(A) = 1, then the transformation preserves volume
- Determinant changes sign upon swapping either two rows or two columns
- Scaling a row by $\lambda$ changes the determinant by a factor of $\lambda$
- Adding a multiple of one row to another row does not change the value of the determinant
- det(AB) = det(A) det(B)
- det($A^{T}$) = det(A)
- det($A^{-1}$) = $\frac{1}{det(A)}$
Kernel, Range, Nullity, Rank
Mapping definitions
- Let $f: S \rightarrow T$ be a map from S to T.
- S is the domain of the function
- T is the codomain of the function
- the set $ran(f) = im(f) = f(S) = \{f(s)|s\in S\} \subset T$ is called the range or image of f
- You can think of the range as the part of the codomain T that S actually maps to
- the image of $L(V)$ is a subspace of W
- For a subset U of T, the preimage of U is $f^{-1}(U) = \{s\in S|f(s) \in U\}\subset S$
- In words, the preimage of a set $U$ is the set of all elements of $S$ which map into $U$
- The function f is one-to-one if different elements of S always map to different elements of T (for $x\neq y, f(x)\neq f(y)$)
- also called injective
- The function f is onto if every element of T is mapped to by some element of S (ie. $\forall t \in T, \ \exists s \in S$ such that $f(s)= t$)
- also called surjective
- functions that are both injective and surjective are called bijective
Kernel
- Let $L: V\rightarrow W$ be a linear transformation. The kernel of L is defined as
- $ker\ L = \{v\in V|Lv=0_{w}\}$
- In words, the kernel is all $v \in V$ such that the linear transformation maps onto the zero vector in $W$
- A linear transformation L is injective iff $ker\ L = \{0_{v}\}$
- ie. only the zero vector in $V$ maps to the zero vector in $W$
- $ker\ L$ is a subspace of V
Rank and Nullity
- The rank of a linear transformation $L$ is the dimension of its image, written as $rank(L) = dim\ L(V) = dim\ Im(L)$
- The nullity is the dimension of the kernel, written as $null\ L = dim\ ker\ L$
- Rank-nullity theorem: $dim\ V = dim\ ker\ L + dim\ L(V)$
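A quick numeric check of the rank-nullity theorem (NumPy and SciPy's null_space assumed available; the matrix is made up with a deliberately repeated row):

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],     # multiple of the first row, so rank is 2
              [1.0, 0.0, 1.0]])

rank = np.linalg.matrix_rank(A)
nullity = null_space(A).shape[1]   # number of basis vectors for the kernel
print(rank, nullity, rank + nullity == A.shape[1])   # rank + nullity = dim of the domain
```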
Least Squares
- Solving the equation $M^{T}MX = M^{T}V$, where M is a rectangular matrix
- Used when there exists no perfect solution to $MX=V$. Finds the best possible solution
- For polynomial fits, you construct a Vandermonde matrix from your data and solve for the coefficients of the polynomial
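A minimal sketch of polynomial least squares using a Vandermonde matrix and np.linalg.lstsq (NumPy assumed), with noisy made-up data; the normal-equations form from above is shown for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2.0 - 3.0 * x + 0.5 * x**2 + rng.normal(scale=0.01, size=x.shape)  # noisy quadratic

M = np.vander(x, N=3, increasing=True)        # columns: 1, x, x^2
coeffs, *_ = np.linalg.lstsq(M, y, rcond=None)
print(coeffs)                                 # close to [2, -3, 0.5]

# Equivalent normal-equations form M^T M c = M^T y (fine here, but less stable in general)
print(np.linalg.solve(M.T @ M, M.T @ y))
```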