Data Science For Physicists

Linear Regression You have a set of data $(x_{i},y_{i},\sigma_[i])$, where x is the independent variable, y is the dependent variable, and $\sigma$ is the uncertainty in the dependent variable You want to fit a line to the data ($\hat{y} = mx+b$) You also want the uncertainties of the parameters Why would you want to do this? To know the parameters of the model To extrapolate/interpolate values of the dataset that you don’t have How you do this depends on your viewpoint a frequentist approach would try to maximize the likelihood estimate (information theory) Suppose that you have a probability density for the data $y_{i}$ ranging from 1 to N Treat this density as a function of the parameters. What parameters maximizes this density? A Bayesian approach would be posterior estimates (Cox theorems) Likelihood Function The model we are trying to fit for Linear Regression is $y=mx+b+\pi(0,\sigma)$ We assume that x has no uncertainty that the noise function $\pi$ is Gaussian with mean 0 (why? Central Limit Theorem, and it’s easy to work with) Assume that your variances are correct (big assumption) Each data point is independent from each other (iid) $N = \frac{1}{\sqrt{2\pi \sigma^{2}}} exp(-\frac{(y-\mu)^{2}}{2\sigma_{2}})$ (1D Gaussian) With this, the likelihood function can be written as $\mathbb{L} = \frac{-1}{2}(\vec{Y}-\vec{X}\cdot \vec{a})^{T}C^{-1}(\vec{Y}-\vec{X}\cdot \vec{a})$ Y is the data vector X is the design matrix (first column is the independent variables, 2nd column is all ones) a is the parameter vector $C^{-1}$ is a diagonal matrix, whose diagonals are $\frac{1}{\sigma_{i}^{2}}$ the most likely values of m and b are those which maximize this likelihood (ie. when the derivative w.r.t. a is 0) Dig through this pamphlet to do that calculation Freqentist Week Some Definitions aleatoric uncertainty (noise) and epistemic uncertainty (systematic error or uncertainty in the form of model assumptions) The lower bound on your total uncertainty is the inverse of the Fisher matrix (ie. covariance matrix) This lower bounds is reached only if all of the assumptions are correct (which is normally pretty naive) Freqentist Rules You can never have a measure in the parameter space, only the data space “parameters are set by God. The data are created by us” - D.W.H. This means that statements like “It is more probable that $H_{0}>69$ than $H_{0}<69$” are OK, but “the data are more consistent with $H_{0}>69$ than $H_{0}<69$” are not OK p-value of p<0.05: If the null model were true, we’d get data this “deviant” is less than 0.05 percent of the time The null model is the model where the thing that you are trying to prove is false. So if the likelihood of the null hypothesis is low, then the chance of your hypothesis is high What is $\chi^2?$ You have some data (a set of unordered vectors) and a model that tries to describe this data We can defined $\chi^{2} = \sigma_{i=1}^{N} (\frac{y_{i}-f(x_{i},\theta)}{\sigma_{i}})^{2}$ You can also relate the chi-squared to the log-likelihood via $ln(p(y|\theta)) = \frac{-1}{2}\chi^{2}+const$ This assumes a zero-mean gaussian with known variances, It also assumes that the noise distributions of each point are iid What value do you expect for $\chi^{2}$? If your model describes your data well, then you expect that each data point is offset from the model by the variance of that point. so $\frac{y_{i}-f(x_{i},\theta)}{\sigma_{i}}=1$, so naively, you expect that $\chi^{2} \approx N$ In reality, $\chi^{2} \approx N -dof = N_{dof}$, where dof are the number of degrees of freedom. The handwaving explanation for this is that you aren’t using the true parameters, you are using the best-fit parameters. This causes you to overfit a bit, which juices the $\chi^{2}$ numbers, hence the $-dof$ In the limit of large $N_{dof}$, the $\chi^{2}$ approaches a Gaussian with mean $N_{dof}$ and variance of $2N_{dof}$ A consequence of this is that for sufficiently large N, all models get rejected (“all models are wrong; some are useful” ) If you have multiple models, then you can talk about $\Delta \chi^{2}$ Nuisance Parameters Some parameters in a model you care about. Others you need to include, but don’t want to For example, instrumentation might introduce these extra parameters Suppose that our likelihood function distribution goes like $L = \ln p([y]|\theta,\alpha)$, where $\alpha$ is/are your nuisance parameter(s) You can define a function $max_{\alpha} L(\theta, \alpha) = L_{p}(\theta)$ (ie. you find what $\alpha$ maximizes the likelihood) You didn’t have to specify a prior on $\alpha$ $L_{p}$ is called the profile log likelihood This profiling removes the nuisance parameters from the likelihood, so you can just maximize $L_{p}$ instead of L Uncertainty Estimation Robustness A “robust” method is insensitive to your noise model The previous uncertainty estimates where not robust (ie. they heavily depended on the noise being Gaussian) Fair Sampling Assume that $x_{i}$ is drawn from some distribution $p(x)$. Then your sampling of x is fair if $\sigma_{i=1}^{k} f(x_{i}) \approx \int f(x) p(x) dx$ $p(x)$ must satisfy $\int p(x) dx = 1$ and $p(x) \geq 0$ Bootstrapping Assume that your data $y_{i}$ are fair samples drawn from $p(z|\theta)$ Assumes that your data is “similar”: ie. has similar errors Create an estimator $\hat{\theta}(y_{i})$ ranging from i=1 to i=N Draw K sets of N values from $y_{i}$ with replacement Estimate $\hat{\theta}$ for each of the K samples The variance of $\theta$ then becomes $\frac{1}{K}\Sigma_{j=1}^{K} (\hat{\theta_{j}}-\hat{\theta})^{2}$, where $\hat{\theta}$ if the estimator when using all the data The covariance matrix is $C_{\theta} = \frac{1}{K}\Sigma_{i=1}^{k} (\hat{\theta_{j}}-\hat{\theta})(\hat{\theta_{j}}-\hat{\theta})^{T}$ Jackknifing Assume that your data $y_{i}$ are fair samples drawn from $p(z|\theta)$ Assumes that your data is “similar”: ie. has similar errors Create an estimator $\hat{\theta}(y_{i})$ ranging from i=1 to i=N, and that the estimator holds on a smaller subset of y Take all of the data from y, but drop one datapoint and call this subsample $Y_{j}$ (so you have a subsample of size N-1). Do this for all possible choices of j Define $\hat{\theta}_{j}$ for all of the subsamples $\sigma_{\theta}^{2} = \frac{N-1}{N} \Sigma_{j=1}^{N} (\theta_{j}-<\theta>_{j})^{2}$ $<\theta>_{j}$ is the mean across all of the jackknife estimators Bayesian Statistics Bayes is a theory of prediction, frequentism is a theory of measurement Based on Cox theorems In Bayes, our likelihood function is a probability distribution $p(y|\theta, \alpha, I)$, where $\theta$ are your parameters of interest, $\alpha$ are your nuisancee parameters, and I are your assumptions All of Bayes depends on having a proper set of priors which represents your belief about $\theta$, $\alpha$ and I before your see y Bayes Rule: $p(\theta,\alpha| y, I) = \frac{p(\theta \alpha|I)p(y|\theta, \alpha, I)}{Z}$ where Z is the normalization (or marginalization) equal to $p(y|I)$ Marginalization Suppose that we have some prior $p(\theta,\alpha | I)$ which is seperable ($p(\theta|I) p(\alpha|I) = p(\theta,\alpha | I)$) This can arise if your belief about different set of parameters are distinct from each other (ex. the assumptions about the Higgs mass are seperate from the assumptions about the calorimeter sensitivity) You can integrate out the nuisance parameters $\alpha$. This is called marginalization The marginal likelihood is given by $p(y|\theta, I) = \int p(y|\theta, \alpha, I) p(\alpha | I) d\alpha$ The marginal posterior is $p(\theta|y,I) = \int p(\theta,\alpha|y,I) d\alpha$ MCMC Markov Chain Monte Carlo Universal way of calculating posterior distributions Kind of a crappy way of do so, since no nice convergence theorems Suppose that we have some posterior distribution $p(\Omega| y,I)$ A good sampling of this posterior should be able to approximate the posterior Metropolis Hastings We define some probability distribution $q(\Omega’| \Omega)$ called the proposal distribution (ie. a distribution which generates a new sample given an old sample) We assume that $q(\Omega’| \Omega) = q(\Omega| \Omega’)$ (called detailed balance) Suppose that after drawing from q, you calculate the change in the log-likelihood between the old and the new sample Roll a uniform random number r between 0 and 1. Accept the new sample if $ln(r) < ln(\delta)$ Otherwise, you keep the previous sample In the limit of infinite samples, you will get a fair sampling of the posterior Nearby samples are not independent This is modulated by the “correlation time”, which roughly speaking, is the number of samples you have to take before you get a new independent sample This correlation time contributes to the number of truly independent samples: $N_{eff} = \frac{K}{T}$, where K is the total samples drawn and T is the correlation time How do you choose q? You need to choose a q which equally trades off between exploration and exploitation You could try to measure the autocorrelation time, but that’s hard to compute efficiently You could also try to aim for a target acceptance rate (~0.4) Model Selection Suppose that you have some set of models that you want to test You can select which model you want via AIC and BIC (ie. likelihood ratios) in a frequentist approach With a Bayesian approach, you can find your full marginalized likelihood (integrate over your model parameters) This approach is heavily dependent on your prior (since $p(Y|I) = \int p(y|\theta, I) p(\theta|I) d\theta $) K-fold Cross Validation You leave k datapoints out of your training data. You have your model predict the held out data points. The model which yields the best prediction is empirically the best Appropriate for choosing “nuisance models” Model independent In pseudocode: You hold out k data points Run your model on this subsample of the data You predict the outputs of the withheld data points For frequentists, you predict the probability of getting the withheld data given the subsampled model For Bayesian, you calculate the fully marginalized log likelihood Interpolation AKA overparameterization (p>n) Want to fit to points in between your data There are a couple of types of interpolation linear bilinear cubic For polynomial interpolation, you have roughly the same number of parameters as number of points When interpolation, “sharp edges” in your dataset has a tendency to get smooth out by the interpolation Interpolations are also “too faithful” ie. they stick to the data too much Lot’s of data is “band-limited”, which means that there is a fundamental distance scale below which you can’t get good resolution (think a camera) Noise can in principle have very high frequency noise 3 primary methods for interpolation methods exact pass through the data (think all the flavors of splines) Some wizardry allows these algorithms to run in polynomial time fit a parameterized model to some data with noise cubic time kernel methods (ie. you define some metric, and take a weighted average of the closest neighbors based on the kernel) Over Parameterized Models Dealing with p parameters and n data points where p >n To prevent the values of these parameters from getting too large, we add some regularization Idea is that under some norm, we want to find $argmin ||a||$ such that $y = Xa$ Common regularization schemes are The L2 norm: $\Sigma_{i=1}^{p} a_{i}^{2}$ Called ridge regression The L1 norm: $\Sigma_{i=1}^{p} |a_{i}|$ called lasso regression Has a tendency to zero out irrelevant parameters We have the following model for overparameterization: $Q = ||y-Xa||^{2} + \lambda ||a||$ We take the limit $\lambda\rightarrow 0 $, which yields the argmin on the norm of a Can think of the problem in a Bayesian sense $Q = \chi^{2}+\lambda||a||_{2}^{2}$ $(y-Xa)^{T}C^{-1}(y-Xa)+a^{T}\Lambda a$ This can be solved by minimizing with respect to a The predicted values are then $\hat{y’} = X’\Lambda^{-1}X^{T}(X\Lambda^{-1}X^{T} +C)^{-1} \vec{y}$ where ’ denotes the extrapolated x,y points Gaussian Processes (GPs) Kernels We can view the overparameterized solution as $K’(K+C)^{-1}\vec{y}$, where K is called the kernel kernel matrices have the form $K_{ij} = k(t_{i}, t_{j})$. Typically, most problem have the form $K_{ij} = k(|t_{i}-t_{j}|)$ At $\delta t=0$, K describes the variance of the GP There is also some $\tau$ which describes the correlation time Some common forms for the kernel $K_{ij} = V exp(\frac{-|\delta t|^{2}}{\tau^{2}})$ (squared exponential) $K_{ij} = V exp(\frac{-|\delta t|}{\tau})$ Kernel functions must be semi-positive definite Hard to do in general. Easy way to generate a kernel is to have some strictly positive function in frequency space, then Fourier transform it to real space How do you choose a kernel function? Compute the likelihood function of the data This is done via calculating $\ln p(y_{i}|\mu(t),K)$. Written out: $L = \frac{-1}{2}(\ln \det{2\pi(K+C)})-\frac{1}{2}(\vec{y}-\vec{\mu})^{T}(K+C)^{-1}(\vec{y}-\vec{\mu})$ This allows you to compare different kernel functions, and use the Neyman-Pearson lemma to select the MAP This assumes that the distribution is Gaussian, which is not always a valid assumption Use cross validation Leave one out to get: $\hat{y_{i}} = K_{i}(K_{/i}+C_{/i})^{-1}y_{/i}$ and $\hat{\sigma_{i}^{2}} = k(0)-K_{i}(K_{/i}+C_{/i})^{-1}K_{i}^{T}$ You can then compare this against the actual data and choose the kernel which maximizes that likelihood PCA Imagine that we have our design matrix. We want to decompose $X= USV^{T}$ where S is your singular value matrix and U and $V^{T}$ are your left and right eigenvectors This decomposition creates a spectrum where each eigenvector is weighted by their associated singular value. You can thus get a good approximation of your data by only keeping the top q eigenvectors When is PCA Good? What properties are needed for a good PCA decomposition? When the data is “well aligned” The variation in the data space eis the variation you care about Silly example: trying to differentiate between smiley faces images and sad face images at different positions You want to place all of your faces at the center of your image, so that you don’t waste a PCA eigenvector on locating the average locations of all of the faces When the data has high signal to noise When there are few outliers Assuming that the above holds, then PCA will help de-noise your data (assuming that you choose a sufficient q eigenvectors) You can use leave one out cross validation to help determine your best q Interpolate over missing/bad data Detect anomalies/outliers Misc Writing a Data Analysis Paper Trivial Things Given the paper, people within your field should be able to reproduce your work The writing should be unambiguous The notation should not reuse the same symbol Be consistent with naming things Try to mix language and symbols for instance “wavevector $\vec{k}$. Not “wavevector”. Not $\vec{k}$ Non-trivial Things You should have a list of explicit assumptions and choices Visualize your results in the space of your data Frame your discussion section by the assumptions Causal Separation You have some data, and you want to know what portions of the data are explained by what mechanism Example: We have n cells, observed at t time slices, and you use m microscopes Suppose that over time, the size of the cells appear to grow. Is this due to the cells actually growing, or is it due to the different calibrations for all of the microscopes We have a set of observations $y_{l}\pm \sigma_{l}$ in cell i @ time j with microscope k Some microscopes have a smaller size compared to others. How do you decoupled the effect of the microscope from the actual cell size increase You could calibrate all of your images with an object of known size (external calibration) You could also interleave which microscope you observes the cells with at different times (ie. prevent any covariance buildup between the observables) This is hard to do. A sufficient (but worse) substitution is randomly assignment This problem can then be encoded as a design matrix Baby Machine Learning Two main groups of methods: unsupervised and supervised Unsupervised: Find patterns in your data PCA Kernel density estimation K-means Supervised: Presupposes that your data has some labels associated with it linear regression Gaussian Processes K-nearest neighbors SVM (support vector machines) All of the above can be kernelized (ie. do a transformation to your data to linearize it in some basis) You split your data into 3 partitions: the training set: what you learn your model parameters on the validation set: what you learn your optimal hyperparameters on the test set: what you evaluate your final model against K-Nearest Neighbors Choose the k “closest” feature vectors in the training set to test your feature vector, and then assign the “average” of those k labels With regards to “closest”, you need to define a metric distance in feature space that obeys The triangle inequality and must be non-negative With regards to “average”, you can do a lot of different polling strategies (depends on your problem) You gain biases at the “edges” of the label boundaries K-Means We have a bunch of features and no labels We define k points in our feature space (our centroids) We ask for each feature vector: which of the centroids is the feature vector closest to? After doing the assignment for all of the data, we move all of the centroids to the average of the nearby vectors We iterate on this process until convergence Data Management data and housekeeping data should be stored in the same data format data: your actual measurements housekeeping data (metadata): data surrounding your measurements Don’t store these as a filename or directory tree (seems silly, but it happens) The code that reads in the data should be considered as housekeeping data To store in binary or store in text binary data is more compact, is lossless, and is inscrutable to humans text data takes up more space, can lose precision on conversion, and is human-readable Security: Don’t use .pkl files. Just don’t

Date Created: January 19, 2025 | | Last Modified: May 13, 2026 ||

Machine Learning Notes

Compilation of notes for Machine Learning class for Spring 2023. Administrative Stuff Topic 1 Workflow of Classification Problems (Supervised Learning) Statistical Approach Classifier Maximum Likelihood Estimation Example Convergence Naive Bayes Classifier How do you quantify the quality of a classifier? Approaches to Classification Generative Approach Discriminative Approach Topic 2 Nearest K-Neighbor Importance of Closeness for k-Nearest Neighbors Distances Similarities Issues with the nearest neighbor k-NN Optimality Proof Practical Considerations of k-nearest neighbor Finding the k-th nearest neighbors takes time Metric of Closeness is Sometimes Unclear Derivative w.r.t. Matrix Not Saving All the Training Data Topic 3 Overfitting the Training Data Decision Boundaries Linear Classifier Finding The Weights Perceptron Perceptron Mistake Bound Perceptron Mistake Bound Proof Nonlinear Regression Quadratic Boundaries Proof that All Data is Linear Separable in some Space Kernel Trick (to Deal with Computation) Topic 4 SVM Setup Constrained Optimization Approaches to Optimization Projection Cost Functions Dual Problem Convexity SVM as Convex Optimization Topic 5 Parametric Regression Types of Loss Linear Regression Proof of existence Column Picture Statistical View Ridge Regression Lasso Regression Logistic Regression Why a Sigmoid? Back to Regression Optimal $L_{2}$ Regressor Non-linear Regression Consistency Tneorem Topic 7 (We skipped Learning Theory) k-means Algorithm (At Least One Way to Do it) Choosing K Clustering via Probabilistic Mixture Modeling MLE Estimation Expectation Maximization (EM) Algorithm Topic 8 PCA Projector PCA (What does it have to do with eigenvalues?) Topic 9 Graphical Models Inference Questions Learning Questions Directed Models Undirected Models Bayesian Networks Time Series Model Markov Chain PageRank Administrative Stuff All homework is submitted via Gradescope. Written parts must be PDFs and be typeset. Plots and data analysis from programming part must be included in written part Programming parts are mostly turned in as a separate assignment on Gradescope 40 % HW, 30% Midterm, 30% Final Topic 1 Inputs encoded as vectors ($x \in X$): $\vec{x} = <x_{1},x_{2},…x_{n}>$ Can think of each component as a measurement that you make Output is another vector ($y \in Y$): $\vec{y} = <y_{1},y_{2},…y_{m}>$ Goal is to figure out the function that maps x to y You only have a limited number of samples from X You need to make predictions from this limited sample size Workflow of Classification Problems (Supervised Learning) $(\vec{x_{1}},y_{1}),(\vec{x_{2}},y_{2})… \in X \times Y$ Example: $X = \mathbb{R^{d}}$ and $Y = {0,1}$ for a binary classification from a image with d pixels Assumption: there exists a “relatively simple” function $f^{*}: X \rightarrow Y$ such that $f^{*}(\vec{x_{i}}) = y_{i}$ for most i $*$ in ML implies some optimum value (in this case, an optimal function) Most includes that possibility that there is some fundamental noise in your data Say someone wrote a number that kind of looks like 7 and kinda looks like 1. For the same image, the true $y_{i}$ could be either 7 or 1 (depending on the person’s intentions). Hence, to force $f^{*}$ be a correct for all inputs misses this uncertainty in the true output Learning task: given $n$ examples from the data, from $\hat{f} \approx f^{*}$ $\hat{f}$ denotes a function that tries its best to be optimal Goal: $\hat{f}$ gives mostly correct prediction on unseen examples Statistical Approach $(\vec{x_{1}},y_{1}),(\vec{x_{2}},y_{2})…(\vec{x_{n}},y_{n})$ samples are drawn from the underlying distribution For simplicity, we assume each sample is drawn independently from the same underlying distribution is (called i.i.d. assumption) The learning algorithm draws $\hat{f}$ from a pool of models $\mathbb{F}$ that maximizes the label agreement with the training data How do we select $f \in \mathbb{F}$? Maximum likelihood (best fits the data) Maximum a posteriori (best fits the data but incorporates prior assumptions) Optimization of ’loss’ criterion (best discriminates the labels) Classifier We are given a joint input/output space ($X \times Y$) The data is distributed like $D(X \times Y)$ You have a region where your inputs can reside in ($X \times Y$) and D tells you the density of inputs on the space You can visualize this space by either: Taking slices at a fixed variable value ($P[Y,X=x]$) or the conditional distribution Adding up all the slices along a variable ($P[Y]$) or the marginal distribution A classifier is a measurable function of the type $f: x \in Y$ Ex1 (Constant): $f_{1}(\vec{x}) = y$ for some fixed $y \in Y$. Ex2(Threshold): $f_{2}(x) = \begin{cases} 0\ if \ x \geq 5 \\ 1 \ otherwise \end{cases}$ Ex3 (Majority class): $f_{3}(x) = arg\ max_{y\in Y} P[Y=y]$ This asks: What argument in Y yields the maximum probability? Implicitly, this probability is the marginalized distribution on X More explicitly, you can write probability as: $P[Y=y] = P_{y \approx D}|_{Y}[Y=y]$ In words, you draw a (x,y) pair from D, and you only pay attention to the y part None of these are particularly good/generally applicable classifiers Ex4 (Sample Average Class): $\frac{1}{n}\Sigma \mathbb{I}[f(x_{i}) = y_{i}]$ $\mathbb{I}$ is the indicator function that maps a boolean expression (like $f(x_{i}) = y_{i}$) to either 0 (false) or 1 (true) This is from a sample Ex5 (Population Average): $\mathbb{E}_{(x,y)\in D}[\mathbb{I}[f(x)=y]]= P[f(x)=y]$ This is on the entire population. Essentially a probability that states how often this classifier f gets the output correct called the accuracy function acc(f) Let’s look at the following classifier: $f(\vec{x})= arg\ max_{y\in Y}\ P[y=Y|X=\vec{x}]$ In words, find y that maximizes the probability given a particular x (called Bayes classifier) This is the best classifier. But most of the time, you can’t actually compute $P[y=Y|X=\vec{x}]$ for all inputs. You try your best to estimate these probabilities Proof: Assume binary classification for output space (for simplicity) Let $f(\vec{x})= arg\ max_{y\in Y}\ P[y=Y|X=\vec{x}]$. Let $g(\vec{x})\rightarrow \{0,1\}$. Then $P_{(\vec{x},y)}[g(\vec{x})=y] \leq P_{(\vec{x},y)}[f(\vec{x})=y]$ Let’s look at a single example (Fix $\vec{x}\in X$). Then for any classifier h $P[h(\vec{x})=y|X=\vec{x}]=P[h(\vec{x})=1|X=\vec{x}]+P[h(\vec{x})=0|X=\vec{x}]$ Since $h(\vec{x})=1$ is a constant, we can use $\mathbb{I}$ to rewrite above $\mathbb{I}[h(\vec{x})=1]P[Y=1|X=\vec{x}]+\mathbb{I}[h(\vec{x})=0]P[Y=0|X=\vec{x}] =\mathbb{I}[h(\vec{x})=1]\eta(\vec{x})+\mathbb{I}[h(\vec{x})=0](1- \eta(\vec{x}))$ Where $\eta(\vec{x}) := P[Y=1|X=\vec{x}]$ Hence we need to show that $P[f(\vec{x})=y|X=\vec{x})]-P[g(\vec{x})=y|X=\vec{x})] \geq 0$ This equals $\eta\vec{x}-\mathbb{I}[g(\vec{x})=1]]+(1-\eta(\vec{x}))[\mathbb{I}[f(\vec{x})=0]-\mathbb{I}[g(\vec{x})=0]] = (2\eta(\vec{x})-1)[\mathbb{I}[f(\vec{x})=1]-\mathbb{I}[g(\vec{x})=1]]$ You use $\mathbb{I}[f(\vec{x})=0] = 1 -\mathbb{I}[f(\vec{x})=1]$ If both classifiers guess the same output, there is no degradation in the performance in either classifier What happens if the outputs are different? In either case$[\mathbb{I}[f(\vec{x})=1]-\mathbb{I}[g(\vec{x})=1]]=\pm 1$ depending on if $f(x)$ is correct or $g(x)$ is correct, Say $f(x)$ returns 1: This means that probability of $f(x)$ selecting 1 is greater than 0.5, hence $P[Y=1|X-\vec{x}] =\eta(\vec{x}) \geq 0.5$ so $(2\eta(\vec{x})-1)[\mathbb{I}[f(\vec{x})=1]-\mathbb{I}[g(\vec{x})=1]] \geq 0$ A similar argument holds for $f(x)$ returning 0 (ie. the bayes classifier fails and the alternate classifier works) Since we know that this works for a fixed example, we can integrate across all $x\in X$ to prove the original theorem $P_{(\vec{x},y)}[g(\vec{x})=y] \leq P_{(\vec{x},y)}[f(\vec{x})=y]$ Want to show $\mathbb{E}[\mathbb{I}[P_{(\vec{x},y)}[g(\vec{x})=y]]] - \mathbb{E}[\mathbb{I}[P_{(\vec{x},y)}[f(\vec{x})=y]]] \geq 0$ $\mathbb{E}[\alpha(\vec{x},y)] = \int_{(\vec({x},y)\in X\times Y)} \alpha(\vec{x},y)P((\vec{x},y))d((\vec{x},y))$ So we can rewrite the expectation values using linearity: $\alpha(x,y)=\mathbb{I}[P_{(\vec{x},y)}[f(\vec{x})=y]]$ and $\beta(x,y)=\mathbb{I}[P_{(\vec{x},y)}[g(\vec{x})=y]]$: $\mathbb{E}[\mathbb{I}[P_{(\vec{x},y)}[g(\vec{x})=y]]] - \mathbb{E}[\mathbb{I}[P_{(\vec{x},y)}[f(\vec{x})=y]]] = \int (\alpha(\vec{x},y)-\beta(\vec{x},y)) p(\vec{x},y)d(x,y)=\int_{x}[\int_{y} (\alpha-\beta) p(y|x)dy]dx$ The inner integral is the probability we calculated earlier (the one that was always greater than zero). Hence the double integral is always greater than or equal to 0 Since we can’t normally acheive this classifier, how can we estimate it? $f(\vec{x})= argmax_{y\in Y} P[Y=y|X=\vec{x}]= argmax_{y\in Y}\frac{P[X=\vec{x}|Y=y]\cdot P[Y=y]}{P[X=\vec{x}]}$ denominator independent of Y. This means that $argmax_{y\in Y}\frac{P[X=\vec{x}|Y=y]\cdot P[Y=y]}{P[X=\vec{x}]} = argmax_{y\in Y}P[X=\vec{x}|Y=y]\cdot P[Y=y]$ The value of the function on the LHS is not the same as the RHS, but the y that maximizes the LHS and RHS is the same Instead of figuring out an infinite number of estimates (1 for each $\vec{x}$ in $\R^{d}$ space) you instead make 1 estimate for each $y\in Y$, which is a much smaller space Instead of an infinite number of Bernoulli distributions, you have a finite number of multidimensional distributions $P[Y=y]$ is called the Class Prior or the prior $P[X=\vec{x}|Y=y]$ is called the Class Conditional or the Conditional $P[Y=y|X=\vec{x}]$ is called the Class Posterior or the Posterior Baye’s classifier is not always suitable in asymmetric solutions Ex: cancer detection. Bayes Classifier would always predict nobody has cancer. Would have a high accuracy, but is absolutely useless Maximum Likelihood Estimation We are given some data $\vec{x_{1}}… \vec{x_{n}} \in X$ independently draw Say we have a model class $P = \{p_{\theta}|\theta\in\Theta\}$ where $\Theta$ is the parameter space For Gaussians, $<\mu,\sigma>\in \Theta$ and $p_{\theta} = \frac{1}{\sqrt{2\pi\sigma^{2}}}exp(\frac{-1}{2}\frac{(x-\mu)^{2}}{\sigma^{2}})$ Find the model that best fits the data Define the likelihood function $\mathbb{L}(\theta|X)= P(X|\theta)=P(x_{1},…,x_{n}|\theta)=\Pi_{i=1}^{n}P(\vec{x_{i}}|\theta) = \Pi_{i=1}^{n}p_{\theta}(\vec{x_{i}})$ The likelihood function defines how probable (or how likely) is the data given the model $p_{\theta}$ we want to find $\theta$ that maximizes the likelihood function: $argmax_{\theta\in\Theta}\mathbb{L} = argmax_{\theta\in\Theta}\Pi_{i=1}^{n}p_{\theta}(\vec{x_{i}})$ Say for instance, that we have some data and we want to fit a Gaussian to it: Find $argmax_{\mu,\sigma^{2}}\Pi_{i=1}^{n}p_{\mu,\sigma^{2}}(\vec{x_{i}})$ A trick to avoid using the product rule too much: $argmax_{\theta}\mathbb{L}(\theta|X)= argmax_{\theta} log(\mathbb{L}(\theta|X))$ This turns the product into sums and eliminates the exp of the Guassian. After doing this manipulation, we get: find $argmax_{\mu,\sigma^{2}} \Sigma_{i=1}^{n}[\frac{-1}{2}log(2\pi\sigma^{2})-\frac{(x_{i}-\mu)^{2}}{2\sigma^{2}}]$ Maximize w.r.t $\mu$ to and solve for the stationary points yield $\mu_{ML} = \frac{1}{n}\Sigma_{i=1}^{n}x_{i}$ This is an unbiased estimator that converges to the true $\mu$ as the number of samples go to infinity Can also take w.r.t. $\sigma^{2}$: $\sigma_{ML}^{2}=\frac{1}{n}\Sigma_{i=1}^{n}(x_{i}-\mu)^{2}$ We can find the sample variance by plugging in $\mu_{ML}$ for the true $\mu$ (This is where the $\frac{1}{n-1}$ thing comes from) Example Say that we want to estimate biological gender (male or female) based on weight and height The distribution is spread like $D[X\times Y]$ The Bayes classifier is thus: $f(\vec{x}) = argmax_{y\in\{male,female\}} P[X=\vec{x}|Y=y]\cdot P[Y=y]$ $\hat{P}[Y=male]=$ fraction of training data labelled as male $\hat{P}[Y=female]=$ fraction of training data labelled as female $\hat{P}[X|Y=male]=p_{\theta(male)}(X)$ Use MLE with only male data $\hat{P}[X|Y=female]=p_{\theta(female)}(X)$ Use MLE with only female data Say that the conditional follows a 2D Gaussian distribution Another way to put this: We have a bunch of labelled points in $\mathbb{R^{2}}$ Look only at the male labels. Fit a 2D Gaussian to this data (Find some mean center and the covariance matrix) Do the same with the female labels Once you have gotten your distributions, say that you get a new data point. The distribution (male or female) that yields the higher probability is your Bayes classifier’s output Convergence How fast does our sample converges to the population? Define an error function $E = \int(P(x)-\hat{P}(x))^{2}dx \leq n^{\frac{-2}{2+d}} \approx n^{\frac{-1}{d}}$ If the dimensionality of the problem gets larger (d goes up), then the convergence becomes much slower (you need many many more samples in a higher dimensional space to get the same threshold as a lower dimensional space) Say that you have a tolerance of $\epsilon$ on the error. Then you need $n \geq \epsilon^{d}$ samples Naive Bayes Classifier $f(\vec{x}) = argmax_{y\in Y} P[X=\vec{x}|Y=y]\cdot p[Y=y] = argmax_{y\in Y}\Sigma_{y=1}^{d}P[X^{j}=x^{j}|Y=y]\cdot P[Y=y]$ In words: each feature of the input (the $x_{i}$) are independent from each other given marginalization on the output You break a d-dimensional problem into d 1-dimensional problem, where each 1-d problem requires $\frac{1}{\sqrt{n}}$ How do you quantify the quality of a classifier? If you know the population distribution, then just run the classifier with that data If you have a sample distribution, then you have a biased qualifier since you chose your classifying function from the training data Split the training data into training and testing data The testing data provides an unbiased means of testing your model Approaches to Classification Generative Approach You assume a model of the underlying distribution. When you get a new point, you find which model is the best Advantages The model gives you an interpretation of how data gets generated from a population Since you have a interpretation of the underlying distribution, you can generate simulated data Disadvantages You need to pick a model This is overkill if you just need a classification result, which costs time and money You also are more prone to errors since the sole focus is not classification Discriminative Approach you don’t assume an underlying model. Given a new point, you directly figure out the classification Advantages Typically you get better classification accuracy Disadvantages You gain no understanding of the population Topic 2 Nearest K-Neighbor Importance of Closeness for k-Nearest Neighbors If you have improper units (say that you measure human height in lightyears and human weight in micrograms), this can stretch and compress the axes and exaggerates the importance of a particular feature Choosing the correct closeness metric allows to to scale the axes to avoid this problem Distances If $X\in \R^{d}$, then an example of a distance metric is the $L^{p}$ norm ($(\Sigma_{i=1}^{d} (\vec{x_{i}}-\vec{\hat{x_{j}}})^{p})^\frac{1}{p}$): $p =2$ is Euclidean distance $p =1$ Manhattan distance or cityblock distance $p=\infty$ max distance $p=0$ count ’non-zero’ distance (ignore $\frac{1}{p}$ power in norm definition) Edit distance (genomics): Domain specific: the number of mutations needed to convert one sequence to another Kendell-Tau distance (to compare rankings) What is the minimum number of adjacent pairwise swaps needed to convert 1 ranking to another? Similarities take the distance and construct the function $\frac{1}{1+||\vec{x_{1}}-\vec{x_{2}}||_{p}}$ Cosine similarity: Find the cosine between the two vector Issues with the nearest neighbor Sensitive to noise in data, so labelling is unstable Solution: look at the kth nearest labels to make this stable k-NN Optimality Stone’s Theorem: Theorem 1: for a fixed k, as $n \rightarrow \infty$, k-NN classification error converges to no more than twice Bayes classifier error Theorem 2: if $k\rightarrow\infty$, $n\rightarrow \infty$, but $\frac{k}{n}\rightarrow0$ (the rate of growth goes to zero ie. n grows faster than k), then the k-NN classifier converges to Bayes classifier Why $\frac{k}{n}$? Say that n is fixed an k increases. Eventually, you are averaging over the entire dataset. Hence, if you increase n as well at a rate faster than k, then you are still sampling a small local region of the dataset Proof Let k=1 Notation: P[e] = NN error rate = $1-P_{x,y\in D_{n}}[[f(x)_{n}=y]]$ ie. error = 1-accuracy $D_{n} = <X_{n},Y_{n}>$ = labeled data (size n) The sample data that you have drawn from the underlying distribution $x_{n}$ = nearest neighbor of $x_{t}$ in $D_{n}$ $y_{n}$ = label of nearest neighbor $x_{t}$ and $y_{t}$ are a fixed test point of $x_{t}$ What is the error at a fixed test point $x_{t}$? $lim_{n\rightarrow\infty} P_{y_{t},D_{n}}[e|x_{t}]$ There is some randomness in $y_{t}$ and $D_{n}$ Recall that $P[A] = \Sigma_{b\in B} P[A,B=b] = \Sigma_{b\in B} P[A|B=b]P[B=b]$ If you condition on C, you get: $P[A|C] = \Sigma_{b\in B} P[A,B=b|C] = \Sigma_{b\in B} P[A|B=b,C]P[B=b|C]$ Using the appropriate substitutions, can write: $lim_{n\rightarrow\infty} P_{y_{t},D_{n}}[e|x_{t}] = \int_{n\rightarrow\infty} P_{yt}[e|x_{t}X_{n}]P[X_{n}|x_{t}]dX_{n} = lim_{n\rightarrow\infty}\int P_{yt}[e|x_{t}n_{n}]P[n_{n}|x_{t}] $ In words, the second equality states that you fix your dataset You then write out what nearest neighbor means: $lim_{n\rightarrow\infty}P_{y_{t},D_{n}}[e|x_{t}] = lim_{n\infty}[1-\Sigma_{y\in Y} P(y_{t}=y,y_{n}=y|x_{t},x_{n})]P[x_{n}x_{t}] dx = lim_{n\infty}[1-\Sigma_{y\in Y} P(y_{t}=y|x_{t})P(y_{n}=y|x_{t})]P[x_{n}|x_{t}] dx$ The first is the definition of k-NN (the output of the nearest neighbor is the same as the test sample) The second equality uses independence (ie. truth level labels of the test point and neighbor points are conditionally independent) You take the limit w.r.t. $n\rightarrow\infty$. $P[y_{t}=y|x_{t}]=P[y_{n}=y|x_{n}]$. When you integrate $P[x_{n}|x_{t}] dx$ you get 1, hence $lim_{n\rightarrow\infty}P_{y_{t},D_{n}}[e|x_{t}] = 1-\Sigma_{y\in Y}P^{2}(y_{t}=y|x_{t})$ If Bayes classifier return $y*$ at point $x_{t}$, then $1-\Sigma P^{2}(y_{t}=y|x_{t}) \leq 1- P^{2}(y_{t}=y*|x_{t})$ or $\leq 2(1-P(y_{t}-y*|x_{t})) = 2P^{*}[e|x_{t}]$ Practical Considerations of k-nearest neighbor Finding the k-th nearest neighbors takes time $O(nd)$ where n is the number of samples and d is the dimensionality of each sample Simplify to $\R$. Then you can sort your data in a binary search tree Look up in $O(log\ n)$ Preprocessing is $O(n\ log\ n)$ Instead of thinking in terms of midpoints, think in terms of regions After you divide up $\R$, you can put your data sample into one of the bins Problem: the hard threshold can misclassify the nearest neighbor. Instead, it gives you a near neighbor (in practice ,this distinction doesn’t really matter if you have enough data) Solution: Soften the boundary, where if your data fall in the overlap region, then you traverse both branches of the tree To extend to higher dimensions, take the projection of your data onto one of your dimensions, then do the median. Recursively repeat this process, select which coordinate you project onto each time You can use the same coordinate, although if you overuse a particular coordinate, than your fills (bins) can be elongated along 1 direction You can mitigate this problem by selecting your coordinate based on the one that has the smaller variance Problem/Caveat: What if a different direction is better to subdivide the data? Perform PCA, which calculates the direction of maximum variance, and then project onto this direction Cons: need to do an eigenvalue decomposition, which has a big online overhead This data structure is called a k-d tree Used in databases (boolean selections select regions of the space to reduce the search problem This problem can also be solved by a locality-sensitive hash map Metric of Closeness is Sometimes Unclear Say that you have a particular task and you gather data with some sort of features Some features are not relevant for the task at hand (ie. these features are noisy). However, for other tasks, these noisy features might be of great importance You could also have highly correlated features that can distort the Euclidean distance Solution: we could re-weight the contribution of each feature: $\rho(\vec{x_{1}},\vec{x_{2}};\vec{w}) = [(\vec{x_{1}}-\vec{x_{2}})^{T}W(\vec{x_{1}}-\vec{x_{2}})]^{\frac{1}{2}}$ where $W$ is a diagonal matrix How do we learn what $W$ is? Create two sets, one similar and one dissimilar $S = [(\vec{x_{i}},\vec{x_{j}})|y_{i}=y_{j}]$ $D = [(\vec{x_{i}},\vec{x_{j}})|y_{i}\neq y_{j}]$ Define a cost function $\Psi(\vec{w}) = \lambda\Sigma_{(\vec{x_{i}},\vec{x_{j}})\in S} \rho(\vec{x_{1}},\vec{x_{2}};\vec{w})-(1-\lambda)\Sigma_{(\vec{x_{i}},\vec{x_{j}})\in D} \rho(\vec{x_{1}},\vec{x_{2}};\vec{w})$ $\lambda$ controls which force dominates In words, we want to find the weights that minimize the distance between similar points and maximize the distance between dissimilar points We can simplify this by squaring $\rho$ in our cost function to avoid annoying square root functions Derivative w.r.t. Matrix What is $\frac{d}{dW}(\lambda\Sigma_{i,j}(x_{i}-x_{j})^{T}W(x_{i}-x_{j}))$ where W is a matrix We can think of the function as a map from $\R^{d\times d}\rightarrow \R$ The derivative must have the same dimension as the domain (ie $R^{d\times d}$) Hence, we can intuit that the derivative is $\Sigma_{i,j}(x_{i}-x_{j})(x_{i}-x_{j})^{T}$ (this is correct) Not Saving All the Training Data Instead of saving all the training data, you save the classification of the region where the data lies You also try to make the classification regions as large as possible. Namely, you decide your selection cuts to maximally reduce label uncertainty (ie. as you decide your cut to maximize the purity of each region at each step (up to a certain point to avoid overfitting)) You can quantify label uncertainty as follows Classification Error: $u(C) = 1- max_{y}p_{y}$ $p_{y}$ is the fraction of training data labelled as y in cell C Entropy: $u(C) = \Sigma_{(y\in Y)} p_{y} \log \frac{1}{p_{y}}$ Units of bits This is linearly additive for independent events Gini index: a measure of how “mixed” a population is $u(C) = \Sigma_{(y\in Y)}p_{y}^{2}$ For homogeneity, we take $1-u(C)$ Whatever metric you use, you want to $argmax_{F,T}[u(C)-(p_{L}u(C_{L})+p_{R}u(C_{R}))]$ Namely, choose a feature F and threshold T that maximizes the fraction of the same labels in a region You keep your features and thresholds parallel to axes to maintain interpretability This is a greedy algorithm Finding the optimal decision tree is NP-hard Uncertainty estimates are unstable Tree complexity is dependent of data geometry Topic 3 Overfitting the Training Data As you increase the complexity of a model, you can drive the training error down to zero in the limit This does not mean that this generalizes to the population You can take a cut of your training data (call this your validation data) to tune how complicated you make your model to avoid this overfitting Decision Boundaries Linear Classifier A linear Classifier is defined as a classifier whose decision boundary is a line For simplicity, say that we have a binary classification of -1,1. Recall definition of linearity: $g(a+b) = g(a)+g(b)$ and $g(ca) = cg(a)$ you can drop $g(ca) = cg(a)$ assumption and examine affine decision boundaries (As an abuse of notation, people call these affine decision boundaries linear boundaries) Let g be a decision boundary such that: In 1D, we get $g(x) = w_{1}x+w_{0} = 0$ where x is from your data that satisfied these conditions In general $g(\vec{x}) = \vec{w}\cdot\vec{x}+w_{0}=0$ Define a linear classifier such that $f(x) = 1$ if $g(x) \geq 0$ and $f(x) = -1$ if $g(x) \leq 0$ or $f(\vec(x) = sign(g(\vec{x}))$ Let $g(x) = (\vec{w},w_{0})\cdot(\vec{x},1)$ where you call the extra 1 the bias. This is called “lifting” the data because all of your data gets assigned a value of 1 in a new dimension. This extra degree of freedom allows an affine classifier to truely become a linear classifier Finding The Weights We want to minimize the training error: $argmin_{\vec{w}} = \frac{1}{n}\Sigma_{i}^{N}\bold{1}[sign(\vec{w}\cdot\vec{x_i})\neq y_i]$ The above equals $argmax_{\vec{w}} \Sigma_{x_{i},y_{i}=1}\bold{1}[\vec{x_i}\cdot \vec{w} \le 0] + \Sigma_{x_{i},y_{i}=-1}\bold{1}[\vec{x_i}\cdot \vec{w} \ge 0]$ This is NP-hard to solve (NP-hard for the approximation as well) This is not an absolute statement: not every instance of your data is hard to solve For data where it is NP-hard, you probably don’t want to use a linear classifier in the first place For now, we assume that we have linear separable data (ie. the problem is not NP-hard) Say that our data has a linear dicision boundary that perfectly separates the training data We can define the margin $\gamma$ as the shortest distance between the boundary and the closest data point Perceptron Given: labelled training data Initialize $\vec{w}^{(0)} = 0$ for t = 1,2,3… If $\exist (\vec{x},y) \in S$ such that $sign(\vec{w}^{(t-1)}\cdot \vec{x}) \neq y$ Update $\vec{w}^{(t)} = \vec{w}^{(t-1)}+y\vec{x}$ Remember that $y=\pm 1$ terminate when there is no such training sample that you misclassify This algorithm might not terminate because it is too greedy Perceptron Mistake Bound Assume that there is a unit length $\vec{w}^{*}$ Let $R = max_{\vec{x}\in S} |\vec{x}|$ be the radius and $\gamma$ be the margin The number of iterations T is bounded by $|\frac{R}{\gamma}|^2$ Perceptron Mistake Bound Proof How far is $\vec{w}$ from $\vec{w}^{*}$ We want $\vec{w}$ to converge to $\vec{w}^{*}$ We can use the dot product as a measure of this. We want to decrease the dot product by reducing the angle and not increasing the lengths Let’s say that the algorithm made a mistake at time t. Then: $\vec{w}^{(t)}\cdot \vec{w}^{*} = (\vec{w}^{(t-1)}+y\vec{x})\cdot \vec{w}^{*} \geq \vec{w}^{(t)}\cdot \vec{w}^{*}+\gamma$ $|\vec{w}^{t}|^2=|\vec{w}^{t-1}yx|^2 = |\vec{w}^{t-1}|^2+2y(\vec{w}^{t-1}\cdot x)+|y\vec{x}|^2 \leq |\vec{w}^{t-1}|^2+R^{2}$ The above inequalities hold after every iteration. After iteration T: $T\gamma \leq \vec{w}^{T}\cdot \vec{w}^{*} \leq |\vec{w}^{T}||\vec{w}^{*}| \leq R\sqrt{T}$ Rearrange to get that $T\leq (\frac{R}{\gamma})^{2}$ This means that the perceptron is an online algorithm: it only looks at a small subset of the data (1 at a time). Nonlinear Regression Say that we have some data that is not conducive to linear boundaries Quadratic Boundaries For a quadratic term, our quadratic boundary can be defined as $g(\vec{x}) = w_1x^1_2+w_2x^2_2+w_3x_1x_2+w_4x_1+w_5x_2+w_0$ Suppose that we only look at boundaries of the form $g(\vec{x}) = w_1x_1^2+w_2x_2^2+w_0 = \Sigma_{p+q\leq 2} w^{p,q}x_1^p x_2^q$ Make the change of coordinates $\chi_1 = x_1^{2}$ and $\chi_2 = x_2^{2}$ to make this boundary a linear boundary In function notation: $\phi(x_1,x_2) \rightarrow (x_1^{2},x_2^{2})$ This is called a feature transformation For the general case we have $\phi(x_1,x_2) \rightarrow (x_1^2,x_2^2,x_1x_2,x_1,x_2,1)$ For d dimensions, we have that $g(\vec{x}) \Sigma_{i,j=1}^{d} \Sigma_{p+q\leq 2} w_{i,j}^{p,q} x_{i}^{p}x_{j}^{q}$ Transforming the data into higher dimensions takes effort. We don’t want to do this unless we know that the problem will become linear seperable Proof that All Data is Linear Separable in some Space No label information is given to use ($Y$ is unknown) What we are given is n distinct points S = $\vec{x_1},\vec{x_2},… \vec{x_n}$ There exists a feature transformation (kernel) such that for any labelling of S is linearly separable in the transformed space Consider the mapping into $\mathbb{R}^n$ $\phi(x_i) \rightarrow <0,0,… ,1,0,0,….0>$ or in words, you map the ith data point to the ith component of the $\mathbb{R}^n$ vector Then, the decision boundary induced by linear weighting $\vec{w^{*}} = <y_1,y_2… y_n>$ perfectly seperates the data because $\forall \vec{x}_i \rightarrow sign(\phi(\vec{x}_i)\cdot \vec{w}^{*}) = sign(y_i) = y_i$ This doesn’t work in practice since we don’t know a priori how large out data will be Kernel Trick (to Deal with Computation) Explicitly working with generic Kernel space takes time $\Omega(n) Suppose that taking the dot product $\dot(\vec{x_1}\cdot \vec{x_2})$ is somehow easy to do For example, the generic quadratic boundary make the transformation like the standard transformation, but scale the cross terms up by $\sqrt$ This goes as $O(d^2)$ The dot product calculation is of $O(d)$: $(1+\vec{x_i}\vec{x_j})^2$ This also extend to polynomial transformations $\phi(\vec{x_i})\cdot \phi(\vec{x_i}) = (1+\vec{x_i}\vec{x_j})^2$ with is also linear in d If you have a machine learning algorithm that only works with dot products, then you can run it quickly Topic 4 SVM This is a linear classifier that is a stable solution that gives the largest margin $\gamma$ This is kernelizable, which provides an implicit means of non-linear classification Central idea: We have 2 hyperplanes that are parallel to each other that both correctly classify the data. We want to maximize the distance between them Setup Define two decision boundaries $\vec{w}\cdot\vec{x}-b=1$ and $\vec{w}\cdot\vec{x}-b=-1$ The Training data is correctly classified if $y_{i}(\vec{w_i}\cdot\vec{x_i}-b) \geq 1$ This rolls up the above two decision boundaries into one The distance between the two hyperplanes is $\frac{2}{|\vec{w}|}$ This is a constrained optimization problem Can’t just take derivatives since the minimum subject to the constraint might have a non-zero gradient We have N constraints (1 for each i) The standard machinery for optimization finds minimums. So instead of maximizing $\frac{2}{|\vec{w}|}$, we minimize $\frac{1}{2}|\vec{w}|^2$ with the same constraints We also want to be able to ignore outliers that destroy the linearity of the data. (called slack) We can introduce slack $\zeta_i$ to the constraints as N additional unknowns, with a constrain that $\zeta_i \geq 0$ (slack less than zero is weird since it makes the constraint more stringent, so we don’t do that) $y_{i}(\vec{w_i}\cdot\vec{x_i}-b) \geq 1-\zeta_i$ We try to minimize the slack (since setting all $\zeta_i = \infty$ is a trivial, but bad, solution in the normal setup) Minimize $\frac{1}{2}|\vec{w}|^2+C\Sigma_{i=1}^{n} \zeta_i$ Constrained Optimization We want to minimize $\frac{1}{2}|\vec{w}|^2+C\Sigma_{i=1}^{n} \zeta_i$ with constraints $y_{i}(\vec{w_i}\cdot\vec{x_i}-b) \geq 1-\zeta_i$ and $\zeta_i \geq 0 $ Say that we want to minimize $f(\vec{x})$, where $\vec{x}$ is from the unknown parameter space $\mathbb{R}_d$, subject to constrains $g_i(\vec{x}) \leq 0$ for $1\leq i \leq n$ We also assume the problem is feasible (there is at least 1 value that satisfies all the constraints) Approaches to Optimization Projection Steps start with a feasible solution $x_0$ find $x_1$ that has a slightly lower objective value if $x_1$ violates the constraints, project back to the constraints We can visualize $f(\vec{x_i})$ as some weird surface The constraints cut boundaries in $\mathbb{R}^d$. The intersection of the constraints is non-zero. We want to find the lowest point on f while staying in this union of the constraints Cost Functions We create a new function that has the same value in the constraint region, but has a necessarily higher value elsewhere Let $L(\vec{x},\lambda) = f(\vec{x})+\Sigma_{i=1}^{n}\lambda_{i} g_{i}(\vec{x})$ Since our constraints are $g_i \leq 0$, in the feasible region we have that $L(\vec{x},\lambda) \leq f(\vec(x))$ $max_{\lambda_i\geq 0} L(\vec{x},\lambda) \leq f(\vec{x})$ If x is infeasible, then $L(\vec{x},\lambda) = \infty$ If x is feasible, then $L(\vec{x},\lambda) = f(\vec{x})$ The optimal value is then $p^{*} = min_{\vec{x}}max_{\lambda_i \geq 0 } L(\vec{x},\vec{\lambda})$ x is unconstrained, and the constrains on $\lambda$ is much less stringent Dual Problem We want to find $p^{*} = min_{\vec{x}}max_{\lambda_i \geq 0 } L(\vec{x},\vec{\lambda})$ (the primal problem) Let’s look at a different problem Let $x^*$ be a minimum feasible solution over f $f(x^{*}) = p^{*}$ For all $\lambda_i \geq 0$ $min_{\vec{x}} L(\vec{x},\vec{\lambda}) \leq L(x^{*},\vec{\lambda}) \leq f(x^{*}) = p^{*}$ Hence, $d^{*} = max_{\lambda_{i}\geq 0}min_{\vec{x}}L(\vec{x},\vec{\lambda}) \leq p^{*}$ The upshot: we can easily find the optimum of $min_{\vec{x}}L(\vec{x},\vec{\lambda})$ if L is differentiable The downside: $d^{*}$ is not the value you wanted to find. You wanted to find $p^{*}$ This is called the minimax inequality When is $p^{*} = d^{*}$? Convexity A function $f: \mathbb{R}\rightarrow\mathbb{R^{d}}$ is called convex iff for any two points $x,x’$ and $\beta\in[0,1]$ $f(\beta\vec{x}+(1-\beta)\vec{x’}) \leq \beta f(\vec{x})+(1-\beta)f(\vec{x’})$ In words, we have two points x and x’. we look at f(x) and f(x’), which we denote as the endpoints. Draw the shortest line between the two endpoints What convexity states is that any y between x and x’ must satisfy that f(y) is below the line Happy functions are convex (Direct quote) Not a dichotomy (affine functions are convex and concave) A set $S \in \mathbb{R^d}$ is called convex iff for any points $x, x’ \in S$ and any $\beta \in [0,1]$ $\beta\vec{x}(1-\beta)\vec{x} \in S$ A constrained optimization is called a convex optimization problem if the objective function $f(\vec{x})$ is a convex function the feasible set induced by the constraints g is a convex set These are nice problems since if a local optima exists, it is a global optima If your optimization problem is convex, and if there exists a feasible point s.t. $\forall i, g_{i}(\vec{x}) <0$ or $g_i{\vec{x}} \leq$ whenever $g_{i}$ is affine NOTE: These are sufficient to state that $d^{*} = p^{*}$ SVM as Convex Optimization we are minimizing a quadratic function with linear constraints. Hence SVM is Convex Our Lagrangian is $L(\vec{w},b,\vec{\alpha}) = \frac{1}{2}||\vec{w}||^2+\Sigma_{i=1}^{n}\alpha_{i}(1=y_i(\vec{w}\cdot\vec{x_i}-b))$ Primal: $p^{*} = min_{\vec{w},b}max_{\alpha_i\geq 0} L$ Dual: $d^{*} = max_{\alpha_i\geq 0}min_{\vec{w},b} L$ Taking partials $\frac{\partial}{\partial\vec{w}}L = \vec{w}-\Sigma_{i=1}^{n}\alpha_{i}y_{i}\vec{x_i}$ At the stationary point: $\vec{w}=\Sigma_{i=1}^{n}\alpha_{i}y_{i}\vec{x_i}$ Like the perceptron Taking partials $\frac{\partial}{\partial b}L = \Sigma_{i=1}^{n}\alpha_{i}y_{i} = 0$ $min_{\vec{w},b} L(\vec{w},b,\vec{\alpha}) = \Sigma_{i=1}^{n}\alpha_i-\frac{1}{2}\Sigma_{i,j} \alpha_i \alpha_j y_i y_j(x_i\cdot x_j)$ $\frac{1}{2}w^Tw = (\Sigma_i \alpha_iy_ix_i)^{T}(\Sigma_j\alpha_iy_ix_i)$ Hence, we have converted the primal form of the problem to the dual form of the problem: Maximize $ \Sigma_{i=1}^{n}\alpha_i-\frac{1}{2}\Sigma_{i,j} \alpha_i \alpha_j y_i y_j(x_i\cdot x_j)$ Subject to the constraint $\Sigma_{i=1}^{n} \alpha_i y_i = 0$ $\alpha_i \geq 0$ Note that the function only accesses the data via dot products. This is a kernalized version of the problem, which makes non-linear boundary problems tractable Why support vector machine? Only a few points define the boundary. You could throw away all other data points except these select points. Said points are called support vectors. Topic 5 So far, we have been only dealing with classification. Now we deal with other outputs We could output a real number (energy of particle in a detector) We could also output more esoteric things like a graph (pose estimation) Sentence structure estimation (grammer tree) We will only focus real number output Parametric Regression We assume a form on the underlying distrubution and find the best candidate that minimizes the error/loss Types of Loss Absolute error Prefers to make a single large mistake Quadratic error: makes many small mistakes Penalizes higher error more severely is differentiable Linear Regression $\hat{f} = \vec{w}\cdot \vec{x}+ w_0$ We can incorporate $w_0$ to w via lifting Want to find $min_{w,w_0}\mathbb{E}_{\vec{x},y}[L(\hat{f(x)},y)]$ Equivalently, $argmin_w \frac{1}{n}\Sigma_{i=1}^{n}(\vec{w}\cdot\vec{x_i}-y_i)^2$ (ordinary least square (MSE)) Can take square root to find root mean square error, which gives the correct units. argmin doesn’t care which square error you minimize Let’s represent all of our data as a data matrix X, that is an nxd matrix (n rows for n data points and d columns for d dimensions) Represent w as a dx1 and let Y be a nx1 matrix with the associated y values Using the above, we can also reframe our problem in matrix notation: $argmin_{w}|X\vec{w}-\vec{y}|^2$ taking partial w.r.t. $\vec{w}$ to get $\Sigma 2X^T(X\vec{w}-\vec{y})$ The transpose is succh that the dimensionality of the matrix is the same as the output (ie. 1) Solving for the stationary points to get that $X^TX\vec{w} = X^T\vec{y}$ $\vec{w} = (X^TX)^{\dag}X\vec{w}$ This system will always have a solution (not necessarily a unique one) Poorly behaved (due to overfitting) when we have limited data Proof of existence let $\vec{y} = \vec{y_{col}}+\vec{y_{null}}$ via orthogonal decomposition of y Acting $X^T$ on y kills of the nullspace. Hence $X^T\vec{y} = X^T\vec{y_{col}}$ We can write $\vec{y_{col}}$ as a linear combination of the columns of x. Let the weights be equal to w. recall that Xw takes a linear combination of the columns. This means that the LHS and RHS are equal. Hence, because we have found a $\vec{w}$ that solves the system, it is consistent The upshot of this is that systems of the form $A^TAx = A^Tb$ always has a least 1 solution Column Picture View columns of X as primary data (ie. look at distribution along each feature (d point in an n dimensional space) instead of looking at each data point (n points in a d dimensional space)) Need to mimimize $|\vec{y}-\Sigma_{i=1}^{d}w_ix_i^|^2$ where $x^_i$ is the ith column of A $\hat{y }= X\vec{w_o} = \Sigma_{i=1}^{d}w_{oi}x^*_i$ $\hat{y}$ is the orthogonal projection onto the span of the column space of X Statistical View We have some input x draw from out underlying distribution D The true y is computed via some fixed $w\cdot x$ Some Gaussian noise of the form N(0,$\sigma^2$) corrupts the y to the value that you observe $y = w\cdot x +\epsilon$ Solve with MLE $\log \mathbb{L}(w|S) = \Sigma_{i=1}^{n} \log(\frac{1}{\sqrt{s\pi \sigma^2}}exp(\frac{-(w\cdot x_i-y_i)^2}{2\sigma^2}))$ Argmax w.r.t. w to get something that is proportional to the mean squared loss Ridge Regression We have multiply highly correlated features. We need to find a stable relation between input and response Minimize $|X\vec{w}-y|^2+\lambda|\vec{w}|^{2}$ The first part is standard ols The second part is the regularization part (ie. this tries to keep all the w’s small and around the same values) $\lambda$ is the hyperparameter that dictates the trade off between the reconstruction error and trying to regularize If you lean to much to regularization, you get very small w’s that give you garbage Minimizing w.r.t. w yields $2x^T(Xw-y)+2\lambda w = 0$ Rearranging to get that $w = (X^TX+\lambda\mathbb{I})^{-1}X^Ty$ $(X^TX+\lambda\mathbb{I})$ is positive definite, so inverse always exists Equivalent to the optimization problem: minimize $|Xw-y|^2$ subject to $|w|^2\leq B$ Lasso Regression We have few input features that dictate/control the outcome minimize $|Xw-y|^2$ subject to $|w|_1\leq B$, where we are using the L1 norm The L1 norm encourages sparse solutions (corners of L1 ball are “sticky”) Logistic Regression fit a sigmoid function to your data $f(x) = \frac{1}{1+e^{w\cdot x}}$ can get a binary prediction via sign(2f(x)-1) Why a Sigmoid? Denote the probability P(Y=1|X=x). This gives us the probability of labelling a point as 1 given a particular x Let’s instead look at the ‘odds’ of getting y=1 for a given x odds(P) = $\frac{P}{1-P}$ For an event with P=0.9, odds =9 but for an event P = 0.1, odds = 0.11 This is very asymmetric, which is conducive to our problem of logistic regression take the log of the odds function and call this logit $logit(P) = log(\frac{P}{1-P})$ Notice that logit(P) = -logit(1-P), which is symmetric. Let logit(P) = $w\cdot x$ as our modelling assumption Rearranging the above to get that $P = \frac{1}{1+e^{-w\cdot x}}$, which is the sigmoid function Back to Regression The likelihood L is give as $L = \Pi_{i}^{n} P((x_i,y_i)|w_{i}) \propto \Pi_{i=1}^{n} P(y_i|x_i, w)$ We dropped one part of bayes rule because the distribution in x is independent of w $\Pi_{i=1}^{n} P(y_i|x_i, w) = \Pi_{i=1}^{n} P(y_i=1|x_1,w)^{y_i}(1-P(y_i=1|x_i,w))^{1-y_i}$ This is just a Binomial MLE log-likelihood is as follows: $log(L) = \Sigma_{i=1}^{n} y_{i} log(\frac{P_{x_i}}{1-P_{x_i}})+\Sigma_{i=1}^{n} log(1-P_{x_i})$ $log(L) = \Sigma_{i=1}^{n}y_{i}(w\cdot x_i)+\Sigma_{i=1}^{n} -log(1+exp(w\cdot x_i))$ No closed form solution. Need something like gradient ascent to solve this Optimal $L_{2}$ Regressor For a regressor, we have that $f^{*}(x) = \mathbb{E}[Y|X=x]$ We are taking an average across all possible values of x of the population We need to define how we measure our error prior to defining out the optimal classifier/regressor We want to prove that for any regressor g(x), we have that $\mathbb{E}[|f^{*}(x)-y|^2]\leq \mathbb{E}[|g(x)-y|^2] $ Recall that $f^{*}(x) = \mathbb{E}[Y|X=x]$ Consider the $L_2$ error of g(x) $\mathbb{E}|g(x)-y|^2 = \mathbb{E}|g(x)+f^{*}(x)-f^{*}(x)-y|^2 = \mathbb{E}|g(x)-f^{*}(x)|^2+\mathbb{E}|f^{*}(x)-y|^2+2\mathbb{E}|(g(x)-f^{*})(f^{*}-y)|$ The cross term can be expanded by conditionalizing on x: $2\mathbb{E}|(g(x)-f^{*})(f^{*}-y)| = 2\mathbb{E_{x}}[\mathbb{E_{y|x}}[(g(x)-f^{*}(x))(f^{*}-yf^{*}-y)| X=x]]$ We can pull out functions of x from the inner expectation value b/c x is a constant for this expectation value. We then substitute in the definition of $f^{*}$ to get that the cross term is 0 for all x $2\mathbb{E_{x}}[\mathbb{E_{y|x}}[(g(x)-f^{*}(x))(f^{*}-yf^{*}-y)| X=x]]$ Hence $E|g-y|^2 = \int_{x} |g-f^{*}|^2 \mu (dx)+\mathbb{E}|f^{*}-y|^2$ The above is minimized when g(x) is $f^{*}(x)$, which then means $f^{*}$ is optimal Notice that there is some innate variance in our data that cannot be optimized away The integral term is called the squared bias of the $L_2$ error For L1 error, we the median instead Non-linear Regression If we have too thin of a slice, then we might not have any data to perform regression on We can increase the size of our slice up to some bandwidth in order to prevent reaching the high bias region Let $\hat{y} = \Sigma_{i=1}^{n} w_i{x}y_{i}$ We take into account all points in our sum. We can then fudge the weights w to select for different points. The function that chooses these weights is called the kernel function K Gaussian kernel: $K(x,x’) = exp(\frac{-|x-x’|^2}{h})$ Box kernel: $\mathbb{1[|x-x’| \leq h]}$ Triangle kernel: $[1-(\frac{1}{h})|x-x’|]_{+}$ We then define $w_i(x) = \frac{K(x,x_i)}{\Sigma_{j=1}^{n} K(x,x_j)}$ Consistency Tneorem As $n \rightarrow \infty$, $h\rightarrow 0$, $hn\rightarrow \infty$, then $\mathbb{E}|\hat{f}-f^{*}|^2\rightarrow 0$ $\hat{f} = \Sigma_i^{n} \frac{K(x,x_i)}{\Sigma_{j}^{n} K(x,x_j)}$ Topic 7 (We skipped Learning Theory) Unsupervised learning only has the data and no labels The goal is to discover the underlying structure of the data Can also be used to reduce dimensionality of data for faster processing Increase signal to noise ratio k-means Basic idea: Finding the group is the same as finding the center of the group that it belongs to. Say that we have k possible groups with centers $\vec{c_1},\vec{c_2},…$ We want to minimize $\Sigma_{i=1}^{n} min |\vec{x_i}-\vec{c_j}|^2$ ie. find the closest centers for a given point This is NP-hard, even for the simple non-trivial case d=2 and k=2 Algorithm (At Least One Way to Do it) LLoyd’s Technique/Algorithm Initialize the centers randomly Assign each data point to the closest center Update the centers to the mean of their associated data points Repeat until bored or until no changes happen Choosing K Try them all (k ranges from 1 to n), and calculate the cost function the cost function $\Sigma_{i=1}^{n} min_{j = 1,2..k} |x_i-c_j|^2$ monotonically decreases to zero For some datasets, you could see large drops in the cost, which would be indicative of an optimal k value You also could not choose K Top Down: Partition data into 2 groups, recurse on each part, repeat until you cannot partition any more Bottom Up: Each cluster is a point. Merge each cluster in closest pairs, and recurve until you get a single cluster The definition of closest cluster called linkages (complete is pairwised average, single is using a single point, and average computes means of cluster and then compares these center of masses). Called Ward’s method Clustering via Probabilistic Mixture Modeling Given $X\in \mathbb{R^d}$ and a number of intended clusters k Assume a joint probability distribution (X,C) over the joint space $\mathbb{R^d} \times k$ $C \in \mathbb{R}^k$ denotes a probability vector (ie. entries sum to 1) that denotes the weighting of which cluster you are coming from $P(C=i) = \pi_i$ X(C=i) is some multivariate distribution (takes Gaussian for simplicity, could be anything) For Gaussians, the parameter space is $\theta = (\pi_1,\vec{\mu_1},\Sigma_1,…\pi_n,\vec{\mu_n},\Sigma_n)$ The values of $c_i$ are hidden from us $P(\vec{x}|\vec{\theta}) = \Sigma_{j=1}^{k} \pi_{j} = \frac{1}{\sqrt{(2\pi)^d det(\Sigma_j)}}exp(\frac{-1}{2}(\vec{x}-\vec{\mu_j})^T\Sigma_{j}^{-1}(\vec{x}-\vec{\mu_j}))$ MLE Estimation $\theta_{MLE} = argmax \Sigma_{i=1}^{n} ln [\Sigma_{j=1}^{k} ]\frac{1}{\sqrt{(2\pi)^d det(\Sigma_j)}}exp(\frac{-1}{2}(\vec{x_i}-\vec{\mu_j})^T\Sigma_{j}^{-1}(\vec{x_i}-\vec{\mu_j}))$ Can’t put ln inside inner Sigma. Shouldn’t do this because MLE answer does not give a desirable outcome Expectation Maximization (EM) Algorithm Idea: Initialize the parameters arbitrarily Given the current setting of parameters, find the best (soft) assignment of data samples to the clusters (Expectation Step) This means that you assign the data samples to your clusters with some probabilities for each cluster Update all the parameters with respect to the current (soft) assignment that maximizes the likelihood Keep doing this until you get bored, or until you get a degenerate solution. If you have a degenerate solution, you restart the algorithm with some different random seed of initial parameters In math terms: Initialize $\theta = <\pi_1,\vec{\mu_1},\Sigma_1…>$ Expectation step, assign a weight to each data for each cluster $w_j^i = \frac{\pi_j N(x_i,\mu_j,\Sigma_j)}{\Sigma_{j’=1}^{k}\pi_{j’} N(x_i,\mu_{j’},\Sigma_{j’})}$ Maximize the log-likelihood of the parameters Define $n_j = \Sigma_{i=1}^{n} w_j^i$ as the effective number of points assigned to cluster j $\pi_j = \frac{n_j}{n}$ (the fraction of the data that is associated with cluster j) $\vec{\mu_j} = \Sigma_{i=1}^{n} \frac{w_{j}^{i}\vec{x_i}}{n_j}$ $\Sigma_j = \frac{1}{n_j}\Sigma_{i=1}^{n}w_j^{i} (\vec{x_i}-\vec{\mu}_j)(\vec{x_i}-\vec{\mu}_j)^T$ The above are found by taking vector and matrix derivatives Topic 8 PCA Create a smaller dimensional representation of our data We want to find the optimal map $\mathbb{R}^d \rightarrow \mathbb{R}^k$ that best maintains reconstruction accuracy Equivalently, minimize aggregate residual error Define $\pi_K$ as $\mathbb{R}^d\rightarrow \mathbb{R}^d$ as the k-dimensional orthogonal linear projector We want to calculate the following $min_{\Pi_{k}} \frac{1}{n}\Sigma |\vec{x}-\Pi^{k}(\vec{x_i})|^2$ Projector The projection of any x in $\mathbb{R}^d$ in the orthonormal span $q_1,q_2…q_n$ is given by $\Sigma_{i=1}^{k} (\vec{q_i}\cdot\vec{x})\vec{q_i} = (\Sigma_{i=1}^{k}\vec{q_i}\vec{q_i}^T)\vec{x}$ $\vec{q_i}\vec{q_i}^{T}$ is a dxd matrix or rank 1. The summation of all the matrixs has a rank of k $\vec{q}\vec{q}^T$ is $\Pi_k$ PCA (What does it have to do with eigenvalues?) We now want to find $min_{|q|=1} \frac{1}{n}\Sigma_{i=1}^{n} |\vec{x}-\vec{q}\vec{q}^T\vec{x_i}|^2$ Through some algebra, you can rewrite the problem as $maximize_{|q|=1} \vec{q^T}\frac{1}{n}XX^T\vec{q}$ Since M is a symmetric positive semi definite matrix, by spectral decomposition $M = \Sigma_{i} v_{i}v_{i}^T$ where all the eigenvalues are real and positive Hence $q^{T}Mq = \Sigma_{i}\lambda_{i}(q\cdot v_i)^2$ Define $\alpha_i = (q\cdot v)^2$. The optimization then is to maximize $\Sigma_{i}\lambda_i\alpha_i$ such that $\Sigma_{i} \alpha_i = 1$ This can be maximized by letting $\alpha_1 = 1$ and the rest being equal to zero We can think of this solution as one that maximizes the spread (ie. variance) with respect to the direction of $q_1$ with respect to the origin (ie forcing a mean of 0) We can break down any data into its eigenvectors. We can take advantage of the fact that only the first few (low frequency) eigenvectors contribute meaningfully. $\vec{x} = \Sigma_{i} (\vec{x}\cdot \vec{q_i})q_{i}$ We keep the first k values. (like the Pareto principle) Topic 9 Graphical Models A probabilistic model where a graph represents the conditional dependence structure among the variables Inference Questions From the graph, can you construct probabilities or chances of certain events causing other events? Learning Questions What is the most likely graphical model structure and connection weights that model the data? Directed Models Edge direction typically denotes potential causality Undirected Models Edges denote potential causality Ex: Markov Random Fields Bayesian Networks A type of directed model Assume we have some joint probability distribution P(C,S,R,G) = P(C)P(R|C)P(S|R,C)P(G|S,R,C) via the chain rule If we are given a Graphical model, we can remove some of these dependencies Hence, if we know these final probabilities, we can answer any inference question about the variables How do we find these probabilities out? We know that $P(X_1,…X_n) = \Pi_{i=1}^{d} P(X_i|parent(X_{i}))$ Suppose that our data is binary data (ie. each node either happens or does not happen) Each node can be assigned a probability by counting each instance of the event We can draw arrow between each node to denote independence We can determine independent between two nodes (ie. $P(A,B) \approx P(A)P(B)$) (shorthand $A\perp B$) We also need to check if $P(A,B|C) \approx P(A|C)P(B|C)$ (shorthand $A\perp B|C$) where you can have any number of conditions As long as no cycles occur, we are good Time Series Model A time series model: a family of distributions over a sequence of random variables that is indexed by a totally ordered indexing set (often referred to as time) totally ordered means that that there exists a natural order of the set (ie you can say definitively that if a and b are part of a set wheather $a\leq b$) $… X_{t-2}, X_{t-1}, X_{t}, X_{t+1}, X_{t+2}… $ Markov Chain The conditional distribution of the next state $X_{t+1}$ given all the previous states $X_{i}$ only depends on the current state $X_{t}$ ie $P(X_{t+1}|X_{t},X_{t-1}…) = P(X_{t+1}|X_{t})$ Suppose that we have d distinct states. We need to construct a transition matrix that transition the probabilities of the current state to the probabilities of the next state Call the initial distribution $P(X_{1}=i) = \pi_{i}$ The conditional distribution is $P(X_{t+1}=j|X_{t}=i) = A_{ij}$ The elements in each row of this transition matrix must sum to 1 (row stochastic) Once have the initial condition and the transition matrix, we can calculate probabilities Ex: What is the probability of seeing the sequence 2,2,1,1,2,1,1? This is given by $\pi_{2}*A_{2,2}*A_{2,1}…$ PageRank How does ranking webpages relate to time series? Suppose that you start on some webpage (node). The edges denote links between websites PageRank posits that popularity is proportional to the probability that a random walk ends on page i $P(X_{1}=i) = \pi_{i}$ $P(X_{2}=i) = \Sigma_{j} P(X_{1}=j,X_{2}=i) = \Sigma_{j} P(X_{1}=j)\cdot P(X_{2}=i|X_{1}=j) = \Sigma_{j} \pi_{j} A_{j,i}$ ie. the ith entry of $\pi^{T}A$ = $(\pi^{T}A)_{i}$ $P(X_{t}=i) = (\pi^{T}A^{t-1})_{i}$ We want to find out what the probability approaches as t increases This is identical to the statement: does $lim_{t\rightarrow\infty} A^{t}$ approach a matrix M where all the rows are identical This makes sense because if you multiply by A 1 more time, the matrix M should not change For such an A, we can see that $lim_{t\rightarrow\infty} A^{t} = lim_{t\rightarrow\infty} (A^{t-1})A \rightarrow MA = M$ Hence, we need to find the eigenvectors of $A^{T}$ Some problems are that you don’t know if the matrix A does converge in the limit

Date Created: January 17, 2023 | | Last Modified: May 13, 2026 ||

ML Refresher Notes

Here is a compendium of my notes that I took to review for Machine Learning. This was meant to be a quick refresher on concepts and to put everything in one place. For a less cursory version, see the background section under Resources at the course website. Probability and Statistics Basic Concepts Axioms of Probability Definitions Baye’s Rule Expectation Common Probability Distributions Bernoulli Distribution Binomial Distribution Poisson Distribution Categorical Distribution Multinomial Distribution Guassian(Normal) Distribution Multivariate Gaussian Distribution Laplace distribution Dirac Delta Distribution Mixtures of Distributions Common Functions Logistic sigmoid $\sigma(x)$ Softplus Function $\Zeta(x)$ Information Theory Self-Information Shannon Entropy Kullback-Leibler (KL) divergence Cross-entropy Exponential Distribution Chi-squared Distribution Basics Goodness of Fit (GOF) Independence T-test One Sample Two Sample Paired Unpaired Linear Algebra Types of Objects in Linear Algebra Matrix Operations Central Problem of Linear Algebra Gaussian Elimination Reduced Row Echelon Form (RREF) Reading of Solutions of Ax=b from RREF LU Decomposition LDU Decomposition Identity and Inverses Vector Spaces Subspaces Orthogonal Complements Linear Transformations Diagonalization Change of Basis Linear Dependence and Span Basis and Dimension Norms Orthogonal Bases Gram-Schmidt and Orthogonal Complements Eigendecomposition Singular Value Decomposition (SVD) Moore-Penrose Pseudoinverse QR Decomposition Gram-Schmidt Trace Determinant Kernel, Range, Nullity,Rank Mapping definitions Kernel Rank and Nullity Least Squares Probability and Statistics Basic Concepts Axioms of Probability Probability Measure $P: \mathbb{F} \rightarrow \mathbb{{R}}$ such that $P(A) \geq 0$ for all $A \in \mathbb{F}$ $P(\Omega) = 1$ If $A_{1},A_{2}…$ are disjoint events, then $P(\cup A_{i}) = \Sigma_{i} P(A_{i})$ Definitions Sample space $\Omega$: The set of all possible outcomes Event space $\mathbb{F}:A$ is a subset of $\Omega$ a random variable quantity that has an uncertain value. Examples include coin flips (discrete), daily temperature (continuous) An ensemble sampled from a random variable forms a probability distribution The probability density/mass function (pdf,pmf) denote the probability distribution of continuous and discrete functions respectively. Integral/sum must be 1. Joint probability p(x,y) denotes the probability of pair (x,y) occurring. Can be a mix of discrete and continuous. X and y can also be vectors. Marginalization refers to the “summing out” of a variable from a probability distribution ($p(x) = \int p(x,y) dy$ for instance). This removes dependence of distribution on summed variables. This can be extended to discrete ($\int \rightarrow \Sigma$) or to multiple variables ($p(x,y) = \Sigma_w \int p(w,x,y,z) dz$ ). Conditional Probability is the probability of x given a fixed yvalue. Written as p(x|y = y). Like taking a cross section of a CAD model p(x|y = y*) is a relative probability, since the sum does not equal to 1. Need to normalize so that sum is 1. $p(x,y) = p(x|y)p(y) \rightarrow p(x,y) = p(y|x)p(x)$ from symmetry Chain Rule $p(x_{1},x_{2},…x_{k} = \Pi_{i=1}^{k} p(x_{i}|x_{1},…,x_{i-1}) )$ Note: when $i=1$, you get $p(x_{1})= p(x|x)$, which is trivial, but important to point out In words: the joint probabilities is the product of probability of the first, time probability of the second given the first, etc. $p(x|y)=p(x)$ and $p(y|x)=p(y)$ are the conditions for x and y to be independent. Substitute into conditional probability definition yields $p(x,y)= p(x)p(y)$ Conditional Independence: $p(x_{1}|x_{2},x_{3}) = p(x_{1}|x_{2})$ and $p(x_{1}|x_{2},x_{3}) = p(x_{1}|x_{2})$ are the conditions. Note the symmetry in $x_{1}$ and $x_{3}$ In words: $x_{1}$ is conditionally independent of $x_{3}$ given $x_{2}$. Or, if we know $x_{2}$, then $x_{1}$ gives no information about $x_{3}$. Variance of a population $\sigma^{2} = \frac{\Sigma(x-\mu)^{2}}{n} = \frac{\Sigma x^{2}}{n}-\mu^{2}$ Variance of sample: $s^{2} = \frac{\Sigma(x-\bar x)^{2}}{n-1} = \frac{\Sigma x^{2}-\frac{(\Sigma x)^{2}}{n}}{n-1}$ Standard Deviation = $\sqrt(s^{2})$ Standard Error = $\frac{s}{\sqrt{n}}$ Null Hypothesis: There is no significant difference between two populations Type I error: Probability of rejecting a true null hypothesis Equals significance level $\alpha$ Type II error: Probability of failing to reject a false null hypothesis Baye’s Rule $p(x|y) = \frac{p(y|x) p(x)}{p(y)}$ posterior: $p(x|y)$ or what we know about x given y prior: $p(x)$ or what we know about x before looking at y likelihood: $p(y|x)$ or the information gained about x of y Evidence: $p(y)$ knowledge of y Typical example: a diagnostic test for Covid is 95% accurate. This means that if the test correctly identifies the true state of the patient 95% of the time (ie. if she has Covid, the test returns true, is she doesn’t, it returns false). Here X is the random variable of the true Covid state of the patient and Y is the random variable of the test result. Suppose the infection rate of Covid is 1 in 1000. What is $p(Covid|Pos)$? Not 95%. $p(Covid|Pos) = \frac{p(Pos|Covid)p(Covid)}{p(Pos|Covid)p(Covid)+p(Pos|NoCovid)p(NoCovid)}$ Plugging in values yields about 2% A more conceptual way of thinking about Baye’s rule is that it tells you how to eliminate impossible outcomes given some information about the world. Expectation $E[f(x)] =\int f(x) \cdot p(x)dx$ ($\int \rightarrow \Sigma$ for discrete) p(x) is the weight assigned to each value of f(x) Function Expectation $x$ mean $\mu_x$ $x^k$ k-th moment about 0 $(x-\mu_x)^k$ k-th central moment $(x-\mu_x)^2$ variance (spread) $(x-\mu_x)^3$ skew (lean) $(x-\mu_x)^4$ kurtosis (peakedness) $(x-\mu_x)(y-\mu_y)$ covariance of x and y Expectation Manipulation Rules. Let a be a constant $E[a] = a$ $E[a\cdot x] = aE[x]$ $E[x + y] = E[x]+E[y]$ $E[x \cdot y] = E[x]\cdot E[y]$ if x,y are independent Variance Manipulation Rules $Var[a\cdot x+b] = a^{2}\cdot Var[x]$ $Var[x+y] = Var[x]+Var[y]$ if x,y are independent Common Probability Distributions Bernoulli Distribution Simple success or failure experiment let $\lambda$ denote the probability of success. x=0 denotes failure and x=1 denotes success $Bern_{x}(\lambda) = \lambda^{x}(1-\lambda)^{1-x}$ $p(x=0) = 1-\lambda$ $p(x=1) = \lambda$ $E[x] = \lambda$ $Var[x] = \lambda(1-\lambda)$ Binomial Distribution Given a sequence of $N$ independent Bernoulli experiments, the Binomial distribution enumerates the probability for $m$ successes in the sequence $Bin_{m}(N,\lambda) = {N \choose x} \lambda^{m}(1-\lambda)^{N-m}$ $E[x] = N\lambda$ $Var[x] = N\lambda(1-\lambda)$ For a fixed $N\lambda$, the Binomial converges to Poisson as $N \rightarrow \infty$ Poisson Distribution Consider independent events that happen at an average rate $\lambda$ over time (ie. $\lambda$ has “units” of number of events per time) Poisson distribution is discrete, and give the probability of a given number of events $k$ occurring in a fixed time interval $p(k,\lambda) = \frac{\lambda^{k}e^{-\lambda}}{k!} = Pois_{k}(\lambda)$ $E[x] = \lambda$ $Var[x] = \lambda$ Categorical Distribution Suppose you run a single experiment with $K$ possible outcomes indexed from $K = {1,2,…k}$. Let $\lambda =[\lambda_{1},\lambda_{2},..\lambda_{k}]$ with $\Sigma_{k=1}^{K} \lambda_{k} = 1$ $p(x=k) = \lambda_{k}$ $Cat_{x}(\lambda) = p(x)$ $E[x=k] = \lambda$ $Var[x=k] = \lambda(1-\lambda)$ Multinomial Distribution Multinomial distribution give probability of a number of successes given a particular combination Generalization of Categorical Distribution to N trials and Binomial Distribution to K outcomes. Let $m = [m_1 m_2 … m_K]$ denote the observed counts in each category (total number of categories is $K$) over a sequence of $N$ trials Let $\lambda = [\lambda_1 \lambda_2 … \lambda_K]$ denote the probability of each category (subject to normalization of $\lambda_1+ \lambda_2 + … + \lambda_K = 1$) $p(m) = {N \choose m_1 m_2 … m_K}\lambda_1^{m_1}\lambda_1^{m_1}…\lambda_1^{m_K}$ $E[m_{k}] = N\lambda$ $Var[m_k] = N\lambda(1-\lambda)$ Guassian(Normal) Distribution Useful because A Gaussian is fully specified by only two moments and central limit theorem (CLT) holds CLT states that the mean of independently draw random variables is normally distributed (given enough samples), irrespective of original distribution $E[m_{k}] = \mu$ $Var[m_k] = \sigma^{2}$ $p(x) = \frac{1}{2\pi \sigma}e^{\frac{-(x-\mu)^{2}}{2\sigma^{2}}}$ Standard normal distribution ($\mu=0$, $\sigma = 1$) Multivariate Gaussian Distribution Let $d$ be the dimension of the space Let $\mu$ be a d-dimensonal vector (Mean vector) Let $\Sigma$ be a DxD symmetric and positive semi-definite matrix (covariance matrix) $p(x) = \frac{1}{(2\pi)^{\frac{D}{2}}|\Sigma|^{\frac{1}{2}}} exp(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu))$ $E[m_{k}] = \mu$ $Var[m_k] = \Sigma$ Alternatively, $p(x) = \frac{|\beta|^{\frac{1}{2}}}{(2\pi)^{\frac{D}{2}}} exp(-\frac{1}{2}(x-\mu)^{T}\beta(x-\mu))$ where $\beta = \Sigma^{-1}$ is a precision matrix Laplace distribution $p(x) = \frac{1}{2\gamma}exp(-\frac{-|x-\mu|}{\gamma})$ Places sharp peak at arbitrary point $\mu$ with spread of $\gamma$ Dirac Delta Distribution $p(x) = \delta(x-\mu)$ Clusters all mass of the distribution function into a single point Infinity high, infinity narrow peak Exists to be integrated Can be extended to an arbitrary number of points (empirical distribution): $p(x) = \frac{1}{m}\Sigma_{i-1}^{m} \delta(x-x_{i})$ Mixtures of Distributions Define a cluster of distributions. Define a probability distribution P(c) that selects which distribution is drawn from. $P(x) = \Sigma_{i} P(c=i) P(x |c=i)$ c is a latent variable, or a random variable that can’t be observed directly Common types of mixture models are the empirical distribution and Gaussian mixture models, where the cluster consists of Gaussians. Common Functions Logistic sigmoid $\sigma(x)$ $\sigma(x) = \frac{1}{1+exp(-x)}$ Defined on $\mathbb{R}$ and confined between 0 and 1 Saturates as $x \rightarrow \pm \infty$ (ie insensitive to changes in x) Maximum slope at x=0 Useful to produce parameter of a Bernoulli distribution Properties: $\sigma(x) = \frac{exp(x)}{exp(x)+exp(0)}$ $\frac{d}{dx}\sigma(x) = \sigma(x)(1-\sigma(x))$ $1-\sigma(x) = \sigma(-x)$ Softplus Function $\Zeta(x)$ $\Zeta(x) = log(1+exp(x))$ Defined on $\mathbb{R}$ and confined between 0 and $\infty$ Produces $\sigma$ or $\beta$ of normal distribution A smoother version of $f(x) = max(0,x)$ Properties: $log(\sigma(x)) = -\Zeta(-x)$ $\frac{d}{dx} \Zeta(x) = \sigma(x)$ $\forall x \in (0,1), \sigma^{-1}(x) = log(\frac{x}{1-x})$ $\forall x>0, \Zeta^{-1}(x) = log(exp(x)-1)$ $\Zeta(x) = \int_{-\infty}^{x} \sigma(y) dy$ $\Zeta(x)-\Zeta(-x) = x$ Information Theory Self-Information General Idea: knowing something unlikely has happened provides more information that knowing something likely has happened. Information theory quantifies this statement Properties of Information Likely events have low information (extreme: guaranteed events contain no information) Less likely events should have more information Independent events should have additive information (ie. tossing a coin twice contains twice as much information as tossing a coin once) The above motivate the definition of self-information. Let x be an event. Then $I(x) = -log(P(x))$ is the self-information of the event. The log typically is base 2 or base e, but in theory could have any base you want; a chance in base is effectively a change in scale/units. NOTE: by definition, $lim_{x\rightarrow 0 } x log(x) = 0$ Shannon Entropy Extends I(x) to determine amount of uncertainty in a probability distribution: $H(x) = H(P) = E[I(x)]$, or the expectation value of I(x) In words, $H(x)$ gives the expect/average amount of information that you can gain from drawing from the probability distribution. Entropy is maximized for a uniform probability distribution (has a very direct connection to entropy in the thermodynamic sense) When x is continuous, Shannon entropy is known as differential entropy Kullback-Leibler (KL) divergence a metric of quantifying the difference in information between two distributions $P(x)$ and $Q(x)$ $D_{KL}(P||Q) = E[log(P(x))-log(Q(x))]$ In words (for the discrete case), $D_{KL}$ is the extra information needed to send a message using the code P instead of the code Q. KL divergence is non-negative Only 0 iff P and Q are the same distribution NOT symmetric in P and Q (not a true measure of distance) Cross-entropy $H(P,Q) = H(P)+D_{KL}(P||Q) = -E[log(Q(x))]$ Minimizing cross-entropy w.r.t. Q is the same as minimizing KL divergence Exponential Distribution Let $p(x,\lambda)$ be a piecewise function such that $p(x<0>,\lambda)=0$ and $p(x>0,\lambda) = \lambda exp(-\lambda x)$ where $\lambda$ is the decay parameter Serves to put a sharp peak at x=0 Chi-squared Distribution Basics Chi-squared is a continuous distribution of the sum of the squares of k standard normally distributed random variables. k is called the number of “degrees of freedom” $q = \Sigma_{i-1}^{k} x_{i}^{2}$ ($\chi^{2}$ parameter) $p(x) = \frac{x^{\frac{k}{2}-1}e^{-\frac{x}{2}}}{2^{\frac{k}{2}}\Gamma(\frac{k}{2})}$ $E[m_{k}] = k$ $Var[m_k] = 2k$ Goodness of Fit (GOF) $\chi^{2} = \Sigma \frac{(O-E)^{2}}{E} $ degree of freedom (dof) = number of categories -1 Look up table for what p-value is. Determines if you accept or reject the null hypothesis by whether p-value is greater than (don’t reject null) or less than (reject null) significance level $\alpha$ Independence Chech whether categorical data is independent You are given observed data table O expected data table E is calculated cellwise For a given row and column, the cell value is $E = \frac{(\text{row total})(\text{column total})}{(\text{grand total})}$ Calculate $\chi^{2}$ as normal dof = (#rows-1)(#cols-1) T-test One Sample Tests whether the mean of a normally distributed population is different from a specific value Let $\mu$ be the measured mean and $\mu_{o}$ be the specific mean value to be tested $t = \frac{\bar{x}-\mu_{o}}{\frac{s}{\sqrt{n}}}$ dof = n-1 Using a lookup table find the corresponding p-value, with the following caveats depending on what your null hypothesis is: $\mu>\mu_{0} \rightarrow$ read table as given $\mu<\mu_{0} \rightarrow$ if t-statistic is negative, read table as though t-value was positive If t-statistic has the incorrect sign, read the $1-p$ value instead $\mu>\mu_{0} \rightarrow$ read table as given, but double p-value to take into account both tails If p-value is less than $\alpha$, reject the null hypothesis Two Sample Test whether the means of two populations are significantly different from one another ...

Date Created: January 7, 2023 | | Last Modified: May 13, 2026 ||