Compilation of notes for the Machine Learning class, Spring 2023.
- Administrative Stuff
- Topic 1
- Topic 2
- Topic 3
- Topic 4
- Topic 5
- Topic 7 (We skipped Learning Theory)
- Topic 8
- Topic 9
Administrative Stuff
- All homework is submitted via Gradescope.
- Written parts must be PDFs and be typeset.
- Plots and data analysis from the programming part must be included in the written part
- Programming parts are mostly turned in as a separate assignment on Gradescope
- 40% HW, 30% Midterm, 30% Final
Topic 1
- Inputs encoded as vectors ($x \in X$):
- $\vec{x} = <x_{1},x_{2},…x_{n}>$
- Can think of each component as a measurement that you make
- Output is another vector ($y \in Y$):
- $\vec{y} = <y_{1},y_{2},…y_{m}>$
- Goal is to figure out the function that maps x to y
- You only have a limited number of samples from X
- You need to make predictions from this limited sample size
Workflow of Classification Problems (Supervised Learning)
- $(\vec{x_{1}},y_{1}),(\vec{x_{2}},y_{2})… \in X \times Y$
- Example: $X = \mathbb{R}^{d}$ and $Y = \{0,1\}$ for binary classification from an image with d pixels
- Assumption: there exists a “relatively simple” function $f^{*}: X \rightarrow Y$ such that $f^{*}(\vec{x_{i}}) = y_{i}$ for most i
- $*$ in ML implies some optimum value (in this case, an optimal function)
- “Most” includes the possibility that there is some fundamental noise in your data
- Say someone wrote a number that kind of looks like 7 and kind of looks like 1. For the same image, the true $y_{i}$ could be either 7 or 1 (depending on the person’s intentions). Hence, forcing $f^{*}$ to be correct for all inputs misses this uncertainty in the true output
- Learning task: given $n$ examples from the data, form $\hat{f} \approx f^{*}$
- $\hat{f}$ denotes a function that tries its best to be optimal
- Goal: $\hat{f}$ gives mostly correct prediction on unseen examples
Statistical Approach
- $(\vec{x_{1}},y_{1}),(\vec{x_{2}},y_{2})…(\vec{x_{n}},y_{n})$ samples are drawn from the underlying distribution
- For simplicity, we assume each sample is drawn independently from the same underlying distribution (called the i.i.d. assumption)
- The learning algorithm draws $\hat{f}$ from a pool of models $\mathbb{F}$ that maximizes the label agreement with the training data
- How do we select $f \in \mathbb{F}$?
- Maximum likelihood (best fits the data)
- Maximum a posteriori (best fits the data but incorporates prior assumptions)
- Optimization of ’loss’ criterion (best discriminates the labels)
Classifier
- We are given a joint input/output space ($X \times Y$)
- The data is distributed like $D(X \times Y)$
- You have a region where your inputs can reside ($X \times Y$), and D tells you the density of points on that space
- You can visualize this space by either:
- Taking slices at a fixed variable value ($P[Y|X=x]$), the conditional distribution
- Adding up all the slices along a variable ($P[Y]$), the marginal distribution
- A classifier is a measurable function of the type $f: X \rightarrow Y$
- Ex1 (Constant): $f_{1}(\vec{x}) = y$ for some fixed $y \in Y$.
- Ex2 (Threshold): $f_{2}(x) = \begin{cases} 0 & \text{if } x \geq 5 \\ 1 & \text{otherwise} \end{cases}$
- Ex3 (Majority class): $f_{3}(x) = arg\ max_{y\in Y} P[Y=y]$
- This asks: What argument in Y yields the maximum probability?
- Implicitly, this probability is the marginalized distribution on X
- More explicitly, you can write probability as:
- $P[Y=y] = P_{(x,y)\sim D}[Y=y]$
- In words, you draw an (x,y) pair from D, and you only pay attention to the y part
- None of these are particularly good/generally applicable classifiers
- Ex4 (Sample Average Accuracy):
- $\frac{1}{n}\Sigma_{i=1}^{n} \mathbb{I}[f(x_{i}) = y_{i}]$
- $\mathbb{I}$ is the indicator function that maps a boolean expression (like $f(x_{i}) = y_{i}$) to either 0 (false) or 1 (true)
- This is from a sample
- Ex5 (Population Average):
- $\mathbb{E}_{(x,y)\in D}[\mathbb{I}[f(x)=y]]= P[f(x)=y]$
- This is on the entire population. Essentially a probability that states how often this classifier f gets the output correct
- called the accuracy function acc(f)
- Let’s look at the following classifier:
- $f(\vec{x})= arg\ max_{y\in Y}\ P[y=Y|X=\vec{x}]$
- In words, find y that maximizes the probability given a particular x (called Bayes classifier)
- This is the best classifier. But most of the time, you can’t actually compute $P[y=Y|X=\vec{x}]$ for all inputs. You try your best to estimate these probabilities
- Proof:
- Assume binary classification for output space (for simplicity)
- Let $f(\vec{x})= arg\ max_{y\in Y}\ P[y=Y|X=\vec{x}]$. Let $g: X \rightarrow \{0,1\}$ be any classifier. Then $P_{(\vec{x},y)}[g(\vec{x})=y] \leq P_{(\vec{x},y)}[f(\vec{x})=y]$
- Let’s look at a single example (Fix $\vec{x}\in X$). Then for any classifier h
- $P[h(\vec{x})=y|X=\vec{x}]=P[h(\vec{x})=1,Y=1|X=\vec{x}]+P[h(\vec{x})=0,Y=0|X=\vec{x}]$
- Since $h(\vec{x})$ is deterministic given $\vec{x}$, we can use $\mathbb{I}$ to rewrite the above
- $\mathbb{I}[h(\vec{x})=1]P[Y=1|X=\vec{x}]+\mathbb{I}[h(\vec{x})=0]P[Y=0|X=\vec{x}] =\mathbb{I}[h(\vec{x})=1]\eta(\vec{x})+\mathbb{I}[h(\vec{x})=0](1- \eta(\vec{x}))$
- Where $\eta(\vec{x}) := P[Y=1|X=\vec{x}]$
- Hence we need to show that $P[f(\vec{x})=y|X=\vec{x}]-P[g(\vec{x})=y|X=\vec{x}] \geq 0$
- This equals $\eta(\vec{x})[\mathbb{I}[f(\vec{x})=1]-\mathbb{I}[g(\vec{x})=1]]+(1-\eta(\vec{x}))[\mathbb{I}[f(\vec{x})=0]-\mathbb{I}[g(\vec{x})=0]] = (2\eta(\vec{x})-1)[\mathbb{I}[f(\vec{x})=1]-\mathbb{I}[g(\vec{x})=1]]$
- You use $\mathbb{I}[f(\vec{x})=0] = 1 -\mathbb{I}[f(\vec{x})=1]$
- If both classifiers guess the same output, there is no degradation in the performance in either classifier
- What happens if the outputs are different?
- In either case, $\mathbb{I}[f(\vec{x})=1]-\mathbb{I}[g(\vec{x})=1]=\pm 1$, depending on which classifier returns 1
- Say $f(\vec{x})$ returns 1:
- Since f is the Bayes classifier, this means $P[Y=1|X=\vec{x}] =\eta(\vec{x}) \geq 0.5$, so $(2\eta(\vec{x})-1)[\mathbb{I}[f(\vec{x})=1]-\mathbb{I}[g(\vec{x})=1]] \geq 0$
- A similar argument holds for $f(\vec{x})$ returning 0 (ie. $\eta(\vec{x}) \leq 0.5$ and the bracket equals $-1$, so the product is again $\geq 0$)
- Since we know that this works for a fixed example, we can integrate across all $x\in X$ to prove the original theorem $P_{(\vec{x},y)}[g(\vec{x})=y] \leq P_{(\vec{x},y)}[f(\vec{x})=y]$
- Want to show $\mathbb{E}[\mathbb{I}[f(\vec{x})=y]] - \mathbb{E}[\mathbb{I}[g(\vec{x})=y]] \geq 0$
- $\mathbb{E}[\alpha(\vec{x},y)] = \int_{(\vec{x},y)\in X\times Y} \alpha(\vec{x},y)p(\vec{x},y)d(\vec{x},y)$
- So define $\alpha(\vec{x},y)=\mathbb{I}[f(\vec{x})=y]$ and $\beta(\vec{x},y)=\mathbb{I}[g(\vec{x})=y]$ and rewrite the expectation values using linearity:
- $\mathbb{E}[\alpha] - \mathbb{E}[\beta] = \int (\alpha(\vec{x},y)-\beta(\vec{x},y)) p(\vec{x},y)d(\vec{x},y)=\int_{x}[\int_{y} (\alpha-\beta) p(y|x)dy]p(x)dx$
- The inner integral is the difference we calculated earlier (the one that was always greater than or equal to zero). Hence the double integral is also greater than or equal to 0
- Since we can’t normally achieve this classifier, how can we estimate it?
- $f(\vec{x})= argmax_{y\in Y} P[Y=y|X=\vec{x}]= argmax_{y\in Y}\frac{P[X=\vec{x}|Y=y]\cdot P[Y=y]}{P[X=\vec{x}]}$
- The denominator is independent of y. This means that
- $argmax_{y\in Y}\frac{P[X=\vec{x}|Y=y]\cdot P[Y=y]}{P[X=\vec{x}]} = argmax_{y\in Y}P[X=\vec{x}|Y=y]\cdot P[Y=y]$
- The value of the function on the LHS is not the same as the RHS, but the y that maximizes the LHS and RHS is the same
- Instead of figuring out an infinite number of estimates (one for each $\vec{x}$ in $\mathbb{R}^{d}$ space) you instead make one estimate for each $y\in Y$, which is a much smaller space
- Instead of an infinite number of Bernoulli distributions, you have a finite number of multidimensional distributions
- $P[Y=y]$ is called the Class Prior or the prior
- $P[X=\vec{x}|Y=y]$ is called the Class Conditional or the Conditional
- $P[Y=y|X=\vec{x}]$ is called the Class Posterior or the Posterior
- The Bayes classifier is not always suitable in asymmetric situations
- Ex: cancer detection. The Bayes classifier would always predict that nobody has cancer. It would have a high accuracy, but is absolutely useless
Maximum Likelihood Estimation
- We are given some data $\vec{x_{1}}… \vec{x_{n}} \in X$, independently drawn
- Say we have a model class $P = \{p_{\theta}|\theta\in\Theta\}$ where $\Theta$ is the parameter space
- For Gaussians, $<\mu,\sigma>\in \Theta$ and $p_{\theta} = \frac{1}{\sqrt{2\pi\sigma^{2}}}exp(\frac{-1}{2}\frac{(x-\mu)^{2}}{\sigma^{2}})$
- Find the model that best fits the data
- Define the likelihood function $\mathbb{L}(\theta|X)= P(X|\theta)=P(x_{1},…,x_{n}|\theta)=\Pi_{i=1}^{n}P(\vec{x_{i}}|\theta) = \Pi_{i=1}^{n}p_{\theta}(\vec{x_{i}})$
- The likelihood function defines how probable (or how likely) is the data given the model $p_{\theta}$
- we want to find $\theta$ that maximizes the likelihood function: $argmax_{\theta\in\Theta}\mathbb{L} = argmax_{\theta\in\Theta}\Pi_{i=1}^{n}p_{\theta}(\vec{x_{i}})$
- Say for instance, that we have some data and we want to fit a Gaussian to it:
- Find $argmax_{\mu,\sigma^{2}}\Pi_{i=1}^{n}p_{\mu,\sigma^{2}}(\vec{x_{i}})$
- A trick to avoid using the product rule too much:
- $argmax_{\theta}\mathbb{L}(\theta|X)= argmax_{\theta} log(\mathbb{L}(\theta|X))$
- This turns the product into sums and eliminates the exp of the Gaussian. After doing this manipulation, we get:
- find $argmax_{\mu,\sigma^{2}} \Sigma_{i=1}^{n}[\frac{-1}{2}log(2\pi\sigma^{2})-\frac{(x_{i}-\mu)^{2}}{2\sigma^{2}}]$
- Maximize w.r.t. $\mu$ and solve for the stationary points to yield $\mu_{ML} = \frac{1}{n}\Sigma_{i=1}^{n}x_{i}$
- This is an unbiased estimator that converges to the true $\mu$ as the number of samples goes to infinity
- Can also take the derivative w.r.t. $\sigma^{2}$: $\sigma_{ML}^{2}=\frac{1}{n}\Sigma_{i=1}^{n}(x_{i}-\mu)^{2}$
- We can find the sample variance by plugging in $\mu_{ML}$ for the true $\mu$. This estimator is biased, which is where the $\frac{1}{n-1}$ correction comes from
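A minimal numpy sketch of the closed-form estimates above, on synthetic data (the distribution parameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # synthetic 1D data

mu_ml = x.mean()                                  # (1/n) sum x_i
var_ml = ((x - mu_ml) ** 2).mean()                # (1/n) sum (x_i - mu_ML)^2, biased
var_unbiased = ((x - mu_ml) ** 2).sum() / (len(x) - 1)   # the 1/(n-1) correction

print(mu_ml, var_ml, var_unbiased)   # -> roughly 2.0, 2.25, 2.25
```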
Example
- Say that we want to estimate biological gender (male or female) based on weight and height
- The distribution is spread like $D[X\times Y]$
- The Bayes classifier is thus:
- $f(\vec{x}) = argmax_{y\in\{male,female\}} P[X=\vec{x}|Y=y]\cdot P[Y=y]$
- $\hat{P}[Y=male]=$ fraction of training data labelled as male
- $\hat{P}[Y=female]=$ fraction of training data labelled as female
- $\hat{P}[X|Y=male]=p_{\theta(male)}(X)$
- Use MLE with only male data
- $\hat{P}[X|Y=female]=p_{\theta(female)}(X)$
- Use MLE with only female data
- Say that the conditional follows a 2D Gaussian distribution
- Another way to put this:
- We have a bunch of labelled points in $\mathbb{R^{2}}$
- Look only at the male labels. Fit a 2D Gaussian to this data (Find some mean center and the covariance matrix)
- Do the same with the female labels
- Once you have gotten your distributions, say that you get a new data point. The class (male or female) whose prior-weighted density is higher is your Bayes classifier’s output
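A sketch of this recipe in numpy/scipy; the means and covariances below are made up and stand in for real height/weight data:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
# Hypothetical (height cm, weight kg) training samples for each label.
male = rng.multivariate_normal([178, 80], [[50, 20], [20, 120]], size=500)
female = rng.multivariate_normal([165, 62], [[40, 15], [15, 90]], size=500)

# MLE per class: sample mean and covariance of the labelled subset.
prior = {"male": 0.5, "female": 0.5}   # fractions of the training data
cond = {
    "male": multivariate_normal(male.mean(axis=0), np.cov(male, rowvar=False)),
    "female": multivariate_normal(female.mean(axis=0), np.cov(female, rowvar=False)),
}

def bayes_predict(x):
    # argmax_y P[X=x | Y=y] * P[Y=y]
    return max(prior, key=lambda y: cond[y].pdf(x) * prior[y])

print(bayes_predict([172, 70]))
```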
Convergence
- How fast does our sample estimate converge to the population?
- Define an error function $E = \int(P(x)-\hat{P}(x))^{2}dx \leq n^{\frac{-2}{2+d}} \approx n^{\frac{-1}{d}}$
- If the dimensionality of the problem gets larger (d goes up), then the convergence becomes much slower (you need many many more samples in a higher dimensional space to get the same threshold as a lower dimensional space)
- Say that you have a tolerance of $\epsilon$ on the error. Then you need $n \geq (\frac{1}{\epsilon})^{d}$ samples
Naive Bayes Classifier
- $f(\vec{x}) = argmax_{y\in Y} P[X=\vec{x}|Y=y]\cdot P[Y=y] = argmax_{y\in Y}(\Pi_{j=1}^{d}P[X^{j}=x^{j}|Y=y])\cdot P[Y=y]$
- In words: the features of the input (the $x^{j}$) are assumed independent of each other conditioned on the output
- You break a d-dimensional problem into d 1-dimensional problems, where each 1-d problem has error on the order of $\frac{1}{\sqrt{n}}$
How do you quantify the quality of a classifier?
- If you know the population distribution, then just run the classifier with that data
- If you have a sample distribution, then you have a biased quality estimate, since you chose your classifying function from the training data
- Split the training data into training and testing data
- The testing data provides an unbiased means of testing your model
Approaches to Classification
Generative Approach
- You assume a model of the underlying distribution. When you get a new point, you find which model is the best
- Advantages
- The model gives you an interpretation of how data gets generated from a population
- Since you have an interpretation of the underlying distribution, you can generate simulated data
- Disadvantages
- You need to pick a model
- This is overkill if you just need a classification result, which costs time and money
- You also are more prone to errors since the sole focus is not classification
Discriminative Approach
- you don’t assume an underlying model. Given a new point, you directly figure out the classification
- Advantages
- Typically you get better classification accuracy
- Disadvantages
- You gain no understanding of the population
Topic 2
k-Nearest Neighbors
Importance of Closeness for k-Nearest Neighbors
- If you have improper units (say that you measure human height in light-years and human weight in micrograms), this can stretch and compress the axes and exaggerate the importance of a particular feature
- Choosing the correct closeness metric allows you to scale the axes to avoid this problem
Distances
- If $X = \mathbb{R}^{d}$, then an example of a distance metric is the $L^{p}$ norm $\|\vec{x}-\vec{x}'\|_{p} = (\Sigma_{i=1}^{d} |x_{i}-x'_{i}|^{p})^{\frac{1}{p}}$:
- $p =2$ is Euclidean distance
- $p =1$ Manhattan distance or cityblock distance
- $p=\infty$ max distance
- $p=0$ count ’non-zero’ distance (ignore $\frac{1}{p}$ power in norm definition)
- Edit distance (genomics):
- Domain specific: the number of mutations needed to convert one sequence to another
- Kendall tau distance (to compare rankings)
- What is the minimum number of adjacent pairwise swaps needed to convert one ranking to another?
Similarities
- take the distance and construct the function $\frac{1}{1+||\vec{x_{1}}-\vec{x_{2}}||_{p}}$
- Cosine similarity: Find the cosine between the two vectors
Issues with the nearest neighbor
- Sensitive to noise in the data, so labelling is unstable
- Solution: look at the k nearest labels and take a majority vote to make this stable
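A minimal k-NN classifier sketch along these lines (Euclidean distance, majority vote; the toy data is made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.0])))   # -> 1
```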
k-NN Optimality
- Stone’s Theorem:
- Theorem 1: for a fixed k, as $n \rightarrow \infty$, k-NN classification error converges to no more than twice Bayes classifier error
- Theorem 2: if $k\rightarrow\infty$, $n\rightarrow \infty$, but $\frac{k}{n}\rightarrow0$ (the rate of growth goes to zero ie. n grows faster than k), then the k-NN classifier converges to Bayes classifier
- Why $\frac{k}{n}$?
- Say that n is fixed and k increases. Eventually, you are averaging over the entire dataset. Hence, if you increase n as well at a rate faster than k, then you are still sampling a small local region of the dataset
Proof
- Let k=1
- Notation:
- P[e] = NN error rate = $1-P_{(x,y),D_{n}}[f_{n}(x)=y]$
- ie. error = 1-accuracy
- $D_{n} = <X_{n},Y_{n}>$ = labeled data (size n)
- The sample data that you have drawn from the underlying distribution
- $x_{n}$ = nearest neighbor of $x_{t}$ in $D_{n}$
- $y_{n}$ = label of nearest neighbor
- $x_{t}$ and $y_{t}$ are a fixed test point and its label
- What is the error at a fixed test point $x_{t}$?
- $lim_{n\rightarrow\infty} P_{y_{t},D_{n}}[e|x_{t}]$
- There is some randomness in $y_{t}$ and $D_{n}$
- Recall that $P[A] = \Sigma_{b\in B} P[A,B=b] = \Sigma_{b\in B} P[A|B=b]P[B=b]$
- If you condition on C, you get:
- $P[A|C] = \Sigma_{b\in B} P[A,B=b|C] = \Sigma_{b\in B} P[A|B=b,C]P[B=b|C]$
- Using the appropriate substitutions, can write:
- $lim_{n\rightarrow\infty} P_{y_{t},D_{n}}[e|x_{t}] = lim_{n\rightarrow\infty}\int P_{y_{t}}[e|x_{t},X_{n}]P[X_{n}|x_{t}]dX_{n} = lim_{n\rightarrow\infty}\int P_{y_{t}}[e|x_{t},x_{n}]P[x_{n}|x_{t}]dx_{n}$
- In words, the second equality states that once you fix your dataset, only the nearest neighbor $x_{n}$ matters
- You then write out what nearest neighbor means:
- $lim_{n\rightarrow\infty}P_{y_{t},D_{n}}[e|x_{t}] = lim_{n\rightarrow\infty}\int[1-\Sigma_{y\in Y} P(y_{t}=y,y_{n}=y|x_{t},x_{n})]P[x_{n}|x_{t}]dx_{n} = lim_{n\rightarrow\infty}\int[1-\Sigma_{y\in Y} P(y_{t}=y|x_{t})P(y_{n}=y|x_{n})]P[x_{n}|x_{t}]dx_{n}$
- The first equality is the definition of the NN error (you are correct only when the label of the nearest neighbor matches the test label)
- The second equality uses independence (ie. the true labels of the test point and the neighbor point are conditionally independent)
- Take the limit as $n\rightarrow\infty$: $x_{n}\rightarrow x_{t}$, so $P[y_{n}=y|x_{n}]\rightarrow P[y_{t}=y|x_{t}]$. Integrating $P[x_{n}|x_{t}]dx_{n}$ gives 1, hence
- $lim_{n\rightarrow\infty}P_{y_{t},D_{n}}[e|x_{t}] = 1-\Sigma_{y\in Y}P^{2}(y_{t}=y|x_{t})$
- If the Bayes classifier returns $y^{*}$ at point $x_{t}$, then
- $1-\Sigma_{y} P^{2}(y_{t}=y|x_{t}) \leq 1- P^{2}(y_{t}=y^{*}|x_{t}) \leq 2(1-P(y_{t}=y^{*}|x_{t})) = 2P^{*}[e|x_{t}]$
Practical Considerations of k-nearest neighbor
Finding the k-th nearest neighbors takes time
- $O(nd)$ where n is the number of samples and d is the dimensionality of each sample
- Simplify to $\mathbb{R}$. Then you can sort your data into a binary search tree
- Look up in $O(\log n)$
- Preprocessing is $O(n \log n)$
- Instead of thinking in terms of midpoints, think in terms of regions
- After you divide up $\R$, you can put your data sample into one of the bins
- Problem: the hard threshold can misclassify the nearest neighbor. Instead, it gives you a near neighbor (in practice, this distinction doesn’t really matter if you have enough data)
- Solution: Soften the boundary, where if your data fall in the overlap region, then you traverse both branches of the tree
- To extend to higher dimensions, take the projection of your data onto one of your dimensions, then do the median. Recursively repeat this process, select which coordinate you project onto each time
- You can use the same coordinate, although if you overuse a particular coordinate, then your cells (bins) can be elongated along one direction
- You can mitigate this problem by selecting the coordinate with the larger variance
- Problem/Caveat: What if a different direction is better to subdivide the data?
- Perform PCA, which calculates the direction of maximum variance, and then project onto this direction
- Cons: need to do an eigenvalue decomposition, which adds a large overhead
- This data structure is called a k-d tree
- Used in databases (boolean selections select regions of the space to reduce the search problem)
- This problem can also be solved by locality-sensitive hashing
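A toy k-d tree sketch along the lines above: median split cycling through coordinates, plus the backtracking search that checks the far side whenever the splitting plane is closer than the best distance so far (the "soft boundary" idea):

```python
import numpy as np

def build(points, depth=0):
    """Median-split k-d tree; cycles through the coordinates."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Descend toward the query, then backtrack across the split whenever
    the splitting plane is closer than the best distance found so far."""
    if node is None:
        return best
    d = np.linalg.norm(node["point"] - query)
    if best is None or d < best[1]:
        best = (node["point"], d)
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = nearest(node[near], query, best)
    if abs(diff) < best[1]:            # the other side may hold the true NN
        best = nearest(node[far], query, best)
    return best

rng = np.random.default_rng(5)
data = rng.normal(size=(200, 2))
q = np.array([0.1, -0.2])
print(nearest(build(data), q)[0])
print(data[np.linalg.norm(data - q, axis=1).argmin()])   # same point
```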
Metric of Closeness is Sometimes Unclear
- Say that you have a particular task and you gather data with some sort of features
- Some features are not relevant for the task at hand (ie. these features are noisy). However, for other tasks, these noisy features might be of great importance
- You could also have highly correlated features that can distort the Euclidean distance
- Solution: we could re-weight the contribution of each feature:
- $\rho(\vec{x_{1}},\vec{x_{2}};\vec{w}) = [(\vec{x_{1}}-\vec{x_{2}})^{T}W(\vec{x_{1}}-\vec{x_{2}})]^{\frac{1}{2}}$ where $W$ is a diagonal matrix
- How do we learn what $W$ is?
- Create two sets, one similar and one dissimilar
- $S = \{(\vec{x_{i}},\vec{x_{j}})|y_{i}=y_{j}\}$
- $D = \{(\vec{x_{i}},\vec{x_{j}})|y_{i}\neq y_{j}\}$
- Define a cost function to minimize: $\Psi(\vec{w}) = \lambda\Sigma_{(\vec{x_{i}},\vec{x_{j}})\in S} \rho(\vec{x_{i}},\vec{x_{j}};\vec{w})-(1-\lambda)\Sigma_{(\vec{x_{i}},\vec{x_{j}})\in D} \rho(\vec{x_{i}},\vec{x_{j}};\vec{w})$
- $\lambda$ controls which force dominates
- In words, we want to find the weights that minimize the distance between similar points and maximize the distance between dissimilar points
- We can simplify this by squaring $\rho$ in our cost function to avoid annoying square root functions
Derivative w.r.t. Matrix
- What is $\frac{d}{dW}(\lambda\Sigma_{i,j}(x_{i}-x_{j})^{T}W(x_{i}-x_{j}))$ where W is a matrix
- We can think of the function as a map from $\mathbb{R}^{d\times d}\rightarrow \mathbb{R}$
- The derivative must have the same dimension as the domain (ie $\mathbb{R}^{d\times d}$)
- Hence, we can intuit that the derivative is $\lambda\Sigma_{i,j}(x_{i}-x_{j})(x_{i}-x_{j})^{T}$ (this is correct)
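A quick finite-difference check of that matrix derivative, with $\lambda$ set to 1 for simplicity and random made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
pairs = [(i, j) for i in range(6) for j in range(i + 1, 6)]

def f(W):
    # sum over pairs of (x_i - x_j)^T W (x_i - x_j)
    return sum((X[i] - X[j]) @ W @ (X[i] - X[j]) for i, j in pairs)

# Claimed derivative: sum over pairs of (x_i - x_j)(x_i - x_j)^T
analytic = sum(np.outer(X[i] - X[j], X[i] - X[j]) for i, j in pairs)

# Perturb a single entry of W and compare with the matching gradient entry.
W, eps = np.eye(3), 1e-6
E = np.zeros((3, 3)); E[0, 1] = eps
print((f(W + E) - f(W)) / eps, analytic[0, 1])   # agree (f is linear in W)
```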
Not Saving All the Training Data
- Instead of saving all the training data, you save the classification of the region where the data lies
- You also try to make the classification regions as large as possible. Namely, you decide your selection cuts to maximally reduce label uncertainty (ie. you choose each cut to maximize the purity of each region, up to a certain point, to avoid overfitting)
- You can quantify label uncertainty as follows
- Classification Error:
- $u(C) = 1- max_{y}p_{y}$
- $p_{y}$ is the fraction of training data labelled as y in cell C
- Entropy:
- $u(C) = \Sigma_{(y\in Y)} p_{y} \log \frac{1}{p_{y}}$
- Units of bits
- This is linearly additive for independent events
- Gini index:
- a measure of how “mixed” a population is
- $u(C) = 1-\Sigma_{(y\in Y)}p_{y}^{2}$
- ($\Sigma_{(y\in Y)}p_{y}^{2}$ by itself measures homogeneity)
- Whatever metric you use, you want to
- $argmax_{F,T}[u(C)-(p_{L}u(C_{L})+p_{R}u(C_{R}))]$
- Namely, choose a feature F and threshold T that maximize the reduction in label uncertainty (ie. make the labels within each region as pure as possible; see the sketch at the end of this section)
- You keep your features and thresholds parallel to axes to maintain interpretability
- This is a greedy algorithm
- Finding the optimal decision tree is NP-hard
- Uncertainty estimates are unstable
- Tree complexity is dependent on data geometry
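A greedy best-split sketch using the entropy criterion above, ie. the argmax over feature F and threshold T (the toy data is made up):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -(p * np.log2(p)).sum()

def best_split(X, y, u=entropy):
    """Pick (F, T) maximizing u(C) - (p_L u(C_L) + p_R u(C_R))."""
    best = (None, None, -np.inf)
    for F in range(X.shape[1]):
        for T in np.unique(X[:, F]):
            left = X[:, F] <= T
            if left.all() or not left.any():
                continue
            p_L = left.mean()
            gain = u(y) - (p_L * u(y[left]) + (1 - p_L) * u(y[~left]))
            if gain > best[2]:
                best = (F, T, gain)
    return best

X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # -> (0, 2.0, 1.0): feature 0 at T = 2 gives pure cells
```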
Topic 3
Overfitting the Training Data
- As you increase the complexity of a model, you can drive the training error down to zero in the limit
- This does not mean that this generalizes to the population
- You can take a cut of your training data (call this your validation data) to tune how complicated you make your model to avoid this overfitting
Decision Boundaries
Linear Classifier
- A linear classifier is defined as a classifier whose decision boundary is a line (a hyperplane in higher dimensions)
- For simplicity, say that we have a binary classification of -1,1.
- Recall definition of linearity: $g(a+b) = g(a)+g(b)$ and $g(ca) = cg(a)$
- you can drop the $g(ca) = cg(a)$ assumption and examine affine decision boundaries (as an abuse of notation, people call these affine decision boundaries linear boundaries)
- Let g be a decision boundary such that:
- In 1D, we get $g(x) = w_{1}x+w_{0} = 0$; the boundary is the set of points x satisfying this condition
- In general $g(\vec{x}) = \vec{w}\cdot\vec{x}+w_{0}=0$
- Define a linear classifier such that $f(x) = 1$ if $g(x) \geq 0$ and $f(x) = -1$ if $g(x) < 0$, or $f(\vec{x}) = sign(g(\vec{x}))$
- Let $g(x) = (\vec{w},w_{0})\cdot(\vec{x},1)$, where the coefficient $w_{0}$ of the extra 1 is called the bias. This is called “lifting” the data because all of your data gets assigned a value of 1 in a new dimension. This extra degree of freedom allows an affine classifier to truly become a linear classifier
Finding The Weights
- We want to minimize the training error:
- $argmin_{\vec{w}} \frac{1}{n}\Sigma_{i=1}^{n}\bold{1}[sign(\vec{w}\cdot\vec{x_i})\neq y_i]$
- The above equals $argmin_{\vec{w}} \Sigma_{x_{i},y_{i}=1}\bold{1}[\vec{x_i}\cdot \vec{w} \le 0] + \Sigma_{x_{i},y_{i}=-1}\bold{1}[\vec{x_i}\cdot \vec{w} \ge 0]$
- This is NP-hard to solve (NP-hard for the approximation as well)
- This is not an absolute statement: not every instance of your data is hard to solve
- For data where it is NP-hard, you probably don’t want to use a linear classifier in the first place
- For now, we assume that we have linearly separable data (ie. the problem is not NP-hard)
- Say that our data has a linear decision boundary that perfectly separates the training data
- We can define the margin $\gamma$ as the shortest distance between the boundary and the closest data point
Perceptron
- Given: labelled training data
- Initialize $\vec{w}^{(0)} = 0$
- for t = 1,2,3…
- If $\exists (\vec{x},y) \in S$ such that $sign(\vec{w}^{(t-1)}\cdot \vec{x}) \neq y$
- Update $\vec{w}^{(t)} = \vec{w}^{(t-1)}+y\vec{x}$
- Remember that $y=\pm 1$
- terminate when there is no such training sample that you misclassify
- This algorithm does not terminate if the data is not linearly separable
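A direct sketch of the algorithm above on synthetic, linearly separable data (the cluster means are made up):

```python
import numpy as np

def perceptron(X, y, max_iter=10_000):
    """X: n x d, already lifted with a constant-1 column; y in {-1, +1}.
    Terminates only if the data are linearly separable."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mistakes = np.sign(X @ w) != y        # sign(0)=0 also counts as a mistake
        if not mistakes.any():
            return w
        i = np.flatnonzero(mistakes)[0]
        w += y[i] * X[i]                      # the update w <- w + y x
    return w

rng = np.random.default_rng(6)
pts = np.vstack([rng.normal(size=(25, 2)) + 3, rng.normal(size=(25, 2)) - 3])
X = np.hstack([pts, np.ones((50, 1))])        # lifting
y = np.array([1] * 25 + [-1] * 25)
w = perceptron(X, y)
print((np.sign(X @ w) == y).all())            # -> True
```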
Perceptron Mistake Bound
- Assume that there is a unit-length $\vec{w}^{*}$ that perfectly separates the data
- Let $R = max_{\vec{x}\in S} |\vec{x}|$ be the radius and $\gamma$ be the margin
- The number of iterations T is bounded by $(\frac{R}{\gamma})^2$
Perceptron Mistake Bound Proof
- How far is $\vec{w}$ from $\vec{w}^{*}$?
- We want $\vec{w}$ to converge to $\vec{w}^{*}$
- We can use the dot product as a measure of this. We want to increase the dot product by reducing the angle, not by increasing the lengths
- Let’s say that the algorithm made a mistake at time t. Then:
- $\vec{w}^{(t)}\cdot \vec{w}^{*} = (\vec{w}^{(t-1)}+y\vec{x})\cdot \vec{w}^{*} \geq \vec{w}^{(t-1)}\cdot \vec{w}^{*}+\gamma$
- $|\vec{w}^{(t)}|^2=|\vec{w}^{(t-1)}+y\vec{x}|^2 = |\vec{w}^{(t-1)}|^2+2y(\vec{w}^{(t-1)}\cdot \vec{x})+|\vec{x}|^2 \leq |\vec{w}^{(t-1)}|^2+R^{2}$ (the cross term is $\leq 0$ precisely because we made a mistake)
- The above inequalities hold after every iteration. After iteration T:
- $T\gamma \leq \vec{w}^{(T)}\cdot \vec{w}^{*} \leq |\vec{w}^{(T)}||\vec{w}^{*}| \leq R\sqrt{T}$
- Rearrange to get that $T\leq (\frac{R}{\gamma})^{2}$
- This means that the perceptron is an online algorithm: it only looks at a small subset of the data (1 at a time).
Nonlinear Regression
- Say that we have some data that is not conducive to linear boundaries
Quadratic Boundaries
- In 2D, a general quadratic boundary can be defined as $g(\vec{x}) = w_1x_1^2+w_2x_2^2+w_3x_1x_2+w_4x_1+w_5x_2+w_0 = \Sigma_{p+q\leq 2} w^{p,q}x_1^p x_2^q$
- Suppose that we only look at boundaries of the form $g(\vec{x}) = w_1x_1^2+w_2x_2^2+w_0$
- Make the change of coordinates $\chi_1 = x_1^{2}$ and $\chi_2 = x_2^{2}$ to make this boundary a linear boundary
- In function notation: $\phi(x_1,x_2) \rightarrow (x_1^{2},x_2^{2})$
- This is called a feature transformation
- For the general case we have $\phi(x_1,x_2) \rightarrow (x_1^2,x_2^2,x_1x_2,x_1,x_2,1)$
- For d dimensions, we have that
- $g(\vec{x}) = \Sigma_{i,j=1}^{d} \Sigma_{p+q\leq 2} w_{i,j}^{p,q} x_{i}^{p}x_{j}^{q}$
- Transforming the data into higher dimensions takes effort. We don’t want to do this unless we know that the problem will become linearly separable
Proof that All Data is Linearly Separable in some Space
- No label information is given to us ($Y$ is unknown)
- What we are given is n distinct points S = $\vec{x_1},\vec{x_2},… \vec{x_n}$
- There exists a feature transformation (kernel) such that any labelling of S is linearly separable in the transformed space
- Consider the mapping into $\mathbb{R}^n$
- $\phi(x_i) \rightarrow <0,0,… ,1,0,0,….0>$ or in words, you map the ith data point to the ith standard basis vector of $\mathbb{R}^n$
- Then, the decision boundary induced by the linear weighting $\vec{w^{*}} = <y_1,y_2… y_n>$ perfectly separates the data because
- $\forall \vec{x}_i \rightarrow sign(\phi(\vec{x}_i)\cdot \vec{w}^{*}) = sign(y_i) = y_i$
- This doesn’t work in practice since we don’t know a priori how large our data will be
Kernel Trick (to Deal with Computation)
- Explicitly working with a generic kernel space takes time $\Omega(n)$
- Suppose that taking the dot product $\phi(\vec{x_1})\cdot \phi(\vec{x_2})$ is somehow easy to do
- For example, the generic quadratic boundary
- make the transformation like the standard transformation, but scale the cross terms up by $\sqrt{2}$
- This goes as $O(d^2)$
- The dot product calculation is $O(d)$: $\phi(\vec{x_i})\cdot\phi(\vec{x_j}) = (1+\vec{x_i}\cdot\vec{x_j})^2$
- This also extends to higher-order polynomial transformations, $\phi(\vec{x_i})\cdot \phi(\vec{x_j}) = (1+\vec{x_i}\cdot\vec{x_j})^p$, which is also linear in d
- If you have a machine learning algorithm that only works with dot products, then you can run it quickly
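A quick numeric check of the quadratic-kernel identity above: the explicit $O(d^2)$ feature map (with $\sqrt{2}$-scaled cross terms) and the $O(d)$ expression $(1+\vec{x}\cdot\vec{x}')^2$ give the same dot product.

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit quadratic feature map; cross terms scaled by sqrt(2)."""
    d = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * xi for xi in x]
    feats += [xi * xi for xi in x]
    feats += [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.array(feats)

rng = np.random.default_rng(3)
a, b = rng.normal(size=4), rng.normal(size=4)
print(phi(a) @ phi(b), (1 + a @ b) ** 2)   # identical up to float error
```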
Topic 4
SVM
- This is a linear classifier that is a stable solution that gives the largest margin $\gamma$
- This is kernelizable, which provides an implicit means of non-linear classification
- Central idea: We have 2 hyperplanes that are parallel to each other that both correctly classify the data. We want to maximize the distance between them
Setup
- Define two decision boundaries $\vec{w}\cdot\vec{x}-b=1$ and $\vec{w}\cdot\vec{x}-b=-1$
- The Training data is correctly classified if
- $y_{i}(\vec{w}\cdot\vec{x_i}-b) \geq 1$
- This rolls up the above two decision boundaries into one
- The distance between the two hyperplanes is $\frac{2}{|\vec{w}|}$
- This is a constrained optimization problem
- Can’t just take derivatives since the minimum subject to the constraint might have a non-zero gradient
- We have N constraints (1 for each i)
- The standard machinery for optimization finds minimums. So instead of maximizing $\frac{2}{|\vec{w}|}$, we minimize $\frac{1}{2}|\vec{w}|^2$ with the same constraints
- We also want to be able to ignore outliers that destroy the linearity of the data. (called slack)
- We can introduce slack $\zeta_i$ to the constraints as n additional unknowns, with a constraint that $\zeta_i \geq 0$ (slack less than zero is weird since it makes the constraint more stringent, so we don’t do that)
- $y_{i}(\vec{w}\cdot\vec{x_i}-b) \geq 1-\zeta_i$
- We try to minimize the slack (since setting all $\zeta_i = \infty$ is a trivial, but bad, solution in the normal setup)
- Minimize $\frac{1}{2}|\vec{w}|^2+C\Sigma_{i=1}^{n} \zeta_i$
Constrained Optimization
- We want to
- minimize $\frac{1}{2}|\vec{w}|^2+C\Sigma_{i=1}^{n} \zeta_i$
- with constraints $y_{i}(\vec{w}\cdot\vec{x_i}-b) \geq 1-\zeta_i$ and $\zeta_i \geq 0$
- Say that we want to minimize $f(\vec{x})$, where $\vec{x}$ is from the unknown parameter space $\mathbb{R}^d$, subject to constraints $g_i(\vec{x}) \leq 0$ for $1\leq i \leq n$
- We also assume the problem is feasible (there is at least 1 value that satisfies all the constraints)
Approaches to Optimization
Projection
- Steps
- start with a feasible solution $x_0$
- find $x_1$ that has a slightly lower objective value
- if $x_1$ violates the constraints, project back to the constraints
- We can visualize $f(\vec{x_i})$ as some weird surface
- The constraints cut out regions of $\mathbb{R}^d$ whose intersection is non-empty. We want to find the lowest point on f while staying in this intersection of the constraints
Cost Functions
- We create a new function that has the same value in the constraint region, but has a necessarily higher value elsewhere
- Let $L(\vec{x},\lambda) = f(\vec{x})+\Sigma_{i=1}^{n}\lambda_{i} g_{i}(\vec{x})$
- Since our constraints are $g_i \leq 0$, in the feasible region we have that $L(\vec{x},\lambda) \leq f(\vec{x})$
- so $max_{\lambda_i\geq 0} L(\vec{x},\lambda) = f(\vec{x})$ there (attained at $\lambda = 0$)
- If x is infeasible, then $L(\vec{x},\lambda) = \infty$
- If x is feasible, then $L(\vec{x},\lambda) = f(\vec{x})$
- The optimal value is then $p^{*} = min_{\vec{x}}max_{\lambda_i \geq 0 } L(\vec{x},\vec{\lambda})$
- x is unconstrained, and the constraints on $\lambda$ are much less stringent
Dual Problem
- We want to find $p^{*} = min_{\vec{x}}max_{\lambda_i \geq 0 } L(\vec{x},\vec{\lambda})$ (the primal problem)
- Let’s look at a different problem
- Let $x^*$ be a minimum feasible solution over f
- $f(x^{*}) = p^{*}$
- For all $\lambda_i \geq 0$
- $min_{\vec{x}} L(\vec{x},\vec{\lambda}) \leq L(x^{*},\vec{\lambda}) \leq f(x^{*}) = p^{*}$
- Hence, $d^{*} = max_{\lambda_{i}\geq 0}min_{\vec{x}}L(\vec{x},\vec{\lambda}) \leq p^{*}$
- The upshot: we can easily find the optimum of $min_{\vec{x}}L(\vec{x},\vec{\lambda})$ if L is differentiable
- The downside: $d^{*}$ is not the value you wanted to find. You wanted to find $p^{*}$
- This is called the minimax inequality
- When is $p^{*} = d^{*}$?
Convexity
- A function $f: \mathbb{R}^{d}\rightarrow\mathbb{R}$ is called convex iff for any two points $x,x’$ and $\beta\in[0,1]$
- $f(\beta\vec{x}+(1-\beta)\vec{x’}) \leq \beta f(\vec{x})+(1-\beta)f(\vec{x’})$
- In words, we have two points x and x’. we look at f(x) and f(x’), which we denote as the endpoints.
- Draw the shortest line between the two endpoints
- What convexity states is that any y between x and x’ must satisfy that f(y) is below the line
- Happy functions are convex (Direct quote)
- Not a dichotomy (affine functions are both convex and concave)
- A set $S \subseteq \mathbb{R}^{d}$ is called convex iff for any points $x, x’ \in S$ and any $\beta \in [0,1]$
- $\beta\vec{x}+(1-\beta)\vec{x’} \in S$
- A constrained optimization is called a convex optimization problem if
- the objective function $f(\vec{x})$ is a convex function
- the feasible set induced by the constraints g is a convex set
- These are nice problems since if a local optima exists, it is a global optima
- If your optimization problem is convex, and if
- there exists a feasible point s.t. $\forall i, g_{i}(\vec{x}) <0$ or
- $g_i(\vec{x}) \leq 0$ whenever $g_{i}$ is affine
- NOTE: These are sufficient to state that $d^{*} = p^{*}$
SVM as Convex Optimization
- we are minimizing a quadratic function with linear constraints. Hence SVM is convex
- Our Lagrangian is $L(\vec{w},b,\vec{\alpha}) = \frac{1}{2}||\vec{w}||^2+\Sigma_{i=1}^{n}\alpha_{i}(1-y_i(\vec{w}\cdot\vec{x_i}-b))$
- Primal: $p^{*} = min_{\vec{w},b}max_{\alpha_i\geq 0} L$
- Dual: $d^{*} = max_{\alpha_i\geq 0}min_{\vec{w},b} L$
- Taking partials $\frac{\partial}{\partial\vec{w}}L = \vec{w}-\Sigma_{i=1}^{n}\alpha_{i}y_{i}\vec{x_i}$
- At the stationary point: $\vec{w}=\Sigma_{i=1}^{n}\alpha_{i}y_{i}\vec{x_i}$
- Like the perceptron
- Taking partials $\frac{\partial}{\partial b}L = \Sigma_{i=1}^{n}\alpha_{i}y_{i} = 0$
- $min_{\vec{w},b} L(\vec{w},b,\vec{\alpha}) = \Sigma_{i=1}^{n}\alpha_i-\frac{1}{2}\Sigma_{i,j} \alpha_i \alpha_j y_i y_j(x_i\cdot x_j)$
- $\frac{1}{2}w^Tw = \frac{1}{2}(\Sigma_i \alpha_iy_ix_i)^{T}(\Sigma_j\alpha_jy_jx_j)$
- Hence, we have converted the primal form of the problem to the dual form of the problem:
- Maximize $ \Sigma_{i=1}^{n}\alpha_i-\frac{1}{2}\Sigma_{i,j} \alpha_i \alpha_j y_i y_j(x_i\cdot x_j)$
- Subject to the constraint $\Sigma_{i=1}^{n} \alpha_i y_i = 0$
- $\alpha_i \geq 0$
- Note that the function only accesses the data via dot products. This is a kernelized version of the problem, which makes non-linear boundary problems tractable
- Why support vector machine?
- Only a few points define the boundary. You could throw away all other data points except these select points; said points are called support vectors.
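A didactic sketch of the hard-margin dual above using a generic solver (scipy's minimize rather than a specialized QP; only workable for tiny, separable, synthetic data like this made-up example):

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    """max_a sum(a) - 1/2 sum_{i,j} a_i a_j y_i y_j (x_i . x_j)
    s.t. a_i >= 0 and sum_i a_i y_i = 0."""
    n = len(y)
    Yx = y[:, None] * X
    G = Yx @ Yx.T                                 # G_ij = y_i y_j (x_i . x_j)
    res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),   # negated, so minimize
                   np.zeros(n),
                   bounds=[(0, None)] * n,
                   constraints={"type": "eq", "fun": lambda a: a @ y})
    a = res.x
    w = (a * y) @ X                               # w = sum_i a_i y_i x_i
    sv = a > 1e-6                                 # the support vectors
    b = np.mean(X[sv] @ w - y[sv])                # from y_i (w . x_i - b) = 1
    return w, b, sv

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(size=(20, 2)) + 3, rng.normal(size=(20, 2)) - 3])
y = np.array([1.0] * 20 + [-1.0] * 20)
w, b, sv = svm_dual(X, y)
print((np.sign(X @ w - b) == y).all(), int(sv.sum()), "support vectors")
```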
Topic 5
- So far, we have been only dealing with classification. Now we deal with other outputs
- We could output a real number (energy of particle in a detector)
- We could also output more esoteric things like a graph (pose estimation)
- Sentence structure estimation (grammar tree)
- We will only focus on real-number outputs
Parametric Regression
- We assume a form on the underlying distribution and find the best candidate that minimizes the error/loss
Types of Loss
- Absolute error
- Prefers to make a single large mistake
- Quadratic error:
- makes many small mistakes
- Penalizes higher error more severely
- is differentiable
Linear Regression
- $\hat{f} = \vec{w}\cdot \vec{x}+ w_0$
- We can incorporate $w_0$ to w via lifting
- Want to find $min_{w,w_0}\mathbb{E}_{\vec{x},y}[L(\hat{f(x)},y)]$
- Equivalently, $argmin_w \frac{1}{n}\Sigma_{i=1}^{n}(\vec{w}\cdot\vec{x_i}-y_i)^2$ (ordinary least squares, ie. MSE)
- Can take square root to find root mean square error, which gives the correct units. argmin doesn’t care which square error you minimize
- Let’s represent all of our data as a data matrix X, which is an n×d matrix (n rows for n data points and d columns for d dimensions)
- Represent w as d×1 and let Y be an n×1 matrix with the associated y values
- Using the above, we can also reframe our problem in matrix notation:
- $argmin_{w}|X\vec{w}-\vec{y}|^2$
- taking the partial w.r.t. $\vec{w}$ gives $2X^T(X\vec{w}-\vec{y})$
- The transpose is such that the dimensionality of the gradient is the same as that of $\vec{w}$
- Solving for the stationary points to get that $X^TX\vec{w} = X^T\vec{y}$
- $\vec{w} = (X^TX)^{\dag}X^T\vec{y}$
- This system will always have a solution (not necessarily a unique one)
- Poorly behaved (due to overfitting) when we have limited data
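A numpy sketch of solving the normal equations on synthetic data (the true weights here are invented for the check):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Solve X^T X w = X^T y; pinv also covers the rank-deficient case.
w_hat = np.linalg.pinv(X.T @ X) @ X.T @ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # equivalent, better conditioned
print(w_hat, w_lstsq)                             # both close to w_true
```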
Proof of existence
- let $\vec{y} = \vec{y_{col}}+\vec{y_{null}}$ via orthogonal decomposition of y (into the column space of X and the null space of $X^T$)
- Acting $X^T$ on y kills off the null-space part. Hence $X^T\vec{y} = X^T\vec{y_{col}}$
- We can write $\vec{y_{col}}$ as a linear combination of the columns of X. Let the weights be equal to w.
- recall that Xw takes a linear combination of the columns. This means that the LHS and RHS are equal. Hence, because we have found a $\vec{w}$ that solves the system, it is consistent
- The upshot of this is that systems of the form $A^TAx = A^Tb$ always have at least 1 solution
Column Picture
- View columns of X as primary data (ie. look at the distribution along each feature (d points in an n-dimensional space) instead of looking at each data point (n points in a d-dimensional space))
- Need to minimize $|\vec{y}-\Sigma_{i=1}^{d}w_ix^{*}_i|^2$ where $x^{*}_i$ is the ith column of X
- $\hat{y} = X\vec{w_o} = \Sigma_{i=1}^{d}w_{o,i}x^{*}_i$
- $\hat{y}$ is the orthogonal projection onto the span of the column space of X
Statistical View
- We have some input x drawn from our underlying distribution D
- The true y is computed via some fixed $w\cdot x$
- Some Gaussian noise of the form N(0,$\sigma^2$) corrupts the y to the value that you observe $y = w\cdot x +\epsilon$
- Solve with MLE
- $\log \mathbb{L}(w|S) = \Sigma_{i=1}^{n} \log(\frac{1}{\sqrt{2\pi \sigma^2}}exp(\frac{-(w\cdot x_i-y_i)^2}{2\sigma^2}))$
- Argmax w.r.t. w to get something that is proportional to the mean squared loss
Ridge Regression
- We have many highly correlated features. We need to find a stable relation between input and response
- Minimize $|X\vec{w}-y|^2+\lambda|\vec{w}|^{2}$
- The first part is standard OLS
- The second part is the regularization part (ie. this tries to keep all the w’s small and around the same values)
- $\lambda$ is the hyperparameter that dictates the trade off between the reconstruction error and trying to regularize
- If you lean too much toward regularization, you get very small w’s that give you garbage
- Minimizing w.r.t. w yields
- $2X^T(Xw-y)+2\lambda w = 0$
- Rearranging to get that $w = (X^TX+\lambda\mathbb{I})^{-1}X^Ty$
- $(X^TX+\lambda\mathbb{I})$ is positive definite, so inverse always exists
- Equivalent to the optimization problem:
- minimize $|Xw-y|^2$
- subject to $|w|^2\leq B$
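A sketch of the closed form $w = (X^TX+\lambda\mathbb{I})^{-1}X^Ty$ above, on made-up data with a deliberately correlated feature pair:

```python
import numpy as np

def ridge(X, y, lam):
    """w = (X^T X + lam I)^{-1} X^T y; positive definite for lam > 0,
    so the system is always solvable."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 5))
X[:, 4] = X[:, 3] + 0.01 * rng.normal(size=50)    # two highly correlated features
y = X @ np.array([1.0, 0, 0, 2.0, 2.0]) + 0.1 * rng.normal(size=50)
print(ridge(X, y, lam=1.0))   # the correlated pair gets comparable, stable weights
```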
Lasso Regression
- We have few input features that dictate/control the outcome
- minimize $|Xw-y|^2$
- subject to $|w|_1\leq B$, where we are using the L1 norm
- The L1 norm encourages sparse solutions (corners of L1 ball are “sticky”)
Logistic Regression
- fit a sigmoid function to your data
- $f(x) = \frac{1}{1+e^{-w\cdot x}}$
- can get a binary prediction via sign(2f(x)-1)
Why a Sigmoid?
- Denote the probability $P = P(Y=1|X=x)$. This gives us the probability of labelling a point as 1 given a particular x
- Let’s instead look at the ‘odds’ of getting y=1 for a given x
- odds(P) = $\frac{P}{1-P}$
- For an event with P=0.9, odds =9 but for an event P = 0.1, odds = 0.11
- This is very asymmetric, which is awkward for our problem of logistic regression
- take the log of the odds function and call this logit
- $logit(P) = log(\frac{P}{1-P})$
- Notice that logit(P) = -logit(1-P), which is symmetric.
- Let logit(P) = $w\cdot x$ as our modelling assumption
- Rearranging the above to get that $P = \frac{1}{1+e^{-w\cdot x}}$, which is the sigmoid function
Back to Regression
- The likelihood L is give as
- $L = \Pi_{i=1}^{n} P((x_i,y_i)|w) \propto \Pi_{i=1}^{n} P(y_i|x_i, w)$
- We dropped the $P(x_i|w)$ factor because the distribution of x is independent of w
- $\Pi_{i=1}^{n} P(y_i|x_i, w) = \Pi_{i=1}^{n} P(y_i=1|x_i,w)^{y_i}(1-P(y_i=1|x_i,w))^{1-y_i}$
- This is just a Bernoulli MLE
- log-likelihood is as follows:
- $log(L) = \Sigma_{i=1}^{n} y_{i} log(\frac{P_{x_i}}{1-P_{x_i}})+\Sigma_{i=1}^{n} log(1-P_{x_i})$
- $log(L) = \Sigma_{i=1}^{n}y_{i}(w\cdot x_i)+\Sigma_{i=1}^{n} -log(1+exp(w\cdot x_i))$
- No closed form solution. Need something like gradient ascent to solve this
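A gradient-ascent sketch; the gradient of the log-likelihood above works out to $\Sigma_i (y_i - \sigma(w\cdot x_i))x_i$ (the data and step size below are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_fit(X, y, lr=0.1, steps=2000):
    """Gradient ascent on the log-likelihood; y in {0, 1}.
    grad = sum_i (y_i - sigmoid(w . x_i)) x_i, averaged for a stable step."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w += lr * X.T @ (y - sigmoid(X @ w)) / len(y)
    return w

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(size=(30, 2)) + 1.5, rng.normal(size=(30, 2)) - 1.5])
y = np.array([1] * 30 + [0] * 30)
w = logistic_fit(X, y)
print(((sigmoid(X @ w) > 0.5) == y).mean())   # training accuracy, near 1.0
```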
Optimal $L_{2}$ Regressor
- For a regressor, we have that $f^{*}(x) = \mathbb{E}[Y|X=x]$
- We are averaging over all possible values of y for the given x
- We need to define how we measure our error prior to defining the optimal classifier/regressor
- We want to prove that for any regressor g(x), we have that $\mathbb{E}[|f^{*}(x)-y|^2]\leq \mathbb{E}[|g(x)-y|^2] $
- Recall that $f^{*}(x) = \mathbb{E}[Y|X=x]$
- Consider the $L_2$ error of g(x): $\mathbb{E}|g(x)-y|^2 = \mathbb{E}|g(x)-f^{*}(x)+f^{*}(x)-y|^2 = \mathbb{E}|g(x)-f^{*}(x)|^2+\mathbb{E}|f^{*}(x)-y|^2+2\mathbb{E}[(g(x)-f^{*}(x))(f^{*}(x)-y)]$
- The cross term can be expanded by conditionalizing on x:
- $2\mathbb{E}[(g(x)-f^{*}(x))(f^{*}(x)-y)] = 2\mathbb{E}_{x}[\mathbb{E}_{y|x}[(g(x)-f^{*}(x))(f^{*}(x)-y)| X=x]]$
- We can pull functions of x out of the inner expectation value b/c x is a constant for this expectation value. Substituting in the definition of $f^{*}$ gives $\mathbb{E}_{y|x}[f^{*}(x)-y|X=x]=0$, so the cross term is 0 for all x
- Hence $E|g-y|^2 = \int_{x} |g-f^{*}|^2 \mu (dx)+\mathbb{E}|f^{*}-y|^2$
- The above is minimized when g(x) is $f^{*}(x)$, which then means $f^{*}$ is optimal
- Notice that there is some innate variance in our data that cannot be optimized away
- The integral term is called the squared bias of the $L_2$ error
- For $L_1$ error, we use the median instead
Non-linear Regression
- If we take too thin a slice, then we might not have any data to perform regression on
- We can increase the size of our slice up to some bandwidth in order to prevent reaching the high bias region
- Let $\hat{y} = \Sigma_{i=1}^{n} w_i(x)y_{i}$
- We take into account all points in our sum. We can then fudge the weights w to select for different points. The function that chooses these weights is called the kernel function K
- Gaussian kernel: $K(x,x’) = exp(\frac{-|x-x’|^2}{h})$
- Box kernel: $\mathbb{1}[|x-x’| \leq h]$
- Triangle kernel: $[1-(\frac{1}{h})|x-x’|]_{+}$
- We then define $w_i(x) = \frac{K(x,x_i)}{\Sigma_{j=1}^{n} K(x,x_j)}$
Consistency Theorem
- As $n \rightarrow \infty$, $h\rightarrow 0$, $hn\rightarrow \infty$, then $\mathbb{E}|\hat{f}-f^{*}|^2\rightarrow 0$
- $\hat{f}(x) = \Sigma_{i=1}^{n} \frac{K(x,x_i)}{\Sigma_{j=1}^{n} K(x,x_j)}y_{i}$
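A 1-D sketch of this kernel-weighted regressor with the Gaussian kernel above; the sine data and bandwidth are made up:

```python
import numpy as np

def kernel_regress(x, X_train, y_train, h=0.1):
    """f_hat(x) = sum_i w_i(x) y_i, w_i(x) = K(x, x_i) / sum_j K(x, x_j)."""
    K = np.exp(-((x - X_train) ** 2) / h)     # Gaussian kernel
    return (K * y_train).sum() / K.sum()

rng = np.random.default_rng(10)
X_train = np.linspace(0, 2 * np.pi, 100)
y_train = np.sin(X_train) + 0.1 * rng.normal(size=100)
print(kernel_regress(np.pi / 2, X_train, y_train))   # close to sin(pi/2) = 1
```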
Topic 7 (We skipped Learning Theory)
- Unsupervised learning only has the data and no labels
- The goal is to discover the underlying structure of the data
- Can also be used to reduce dimensionality of data for faster processing
- Increase signal to noise ratio
k-means
- Basic idea: Finding the group is the same as finding the center of the group that it belongs to. Say that we have k possible groups with centers $\vec{c_1},\vec{c_2},…$
- We want to minimize $\Sigma_{i=1}^{n} min_{j} |\vec{x_i}-\vec{c_j}|^2$ ie. find the closest center for each point
- This is NP-hard, even for the simple non-trivial case d=2 and k=2
Algorithm (At Least One Way to Do it)
- Lloyd’s Technique/Algorithm
- Initialize the centers randomly
- Assign each data point to the closest center
- Update the centers to the mean of their associated data points
- Repeat until bored or until no changes happen
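A direct sketch of Lloyd's technique (random-sample initialization; assumes no cluster goes empty; the three blob centers are made up):

```python
import numpy as np

def lloyds(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(iters):
        # Assignment step: each point goes to its closest center.
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):        # "no changes happen"
            break
        centers = new
    return centers, labels

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(size=(50, 2)) + c for c in ([0, 0], [6, 0], [0, 6])])
centers, labels = lloyds(X, k=3)
print(np.round(centers, 1))
```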
Choosing K
- Try them all (k ranges from 1 to n), and calculate the cost function
- the cost function $\Sigma_{i=1}^{n} min_{j = 1,2..k} |x_i-c_j|^2$ monotonically decreases to zero
- For some datasets, you could see large drops in the cost, which would be indicative of an optimal k value
- You could also avoid choosing K (hierarchical clustering):
- Top Down: Partition data into 2 groups, recurse on each part, repeat until you cannot partition any more
- Bottom Up: Each point starts as its own cluster. Merge the closest pairs of clusters, and recurse until you get a single cluster
- The definition of closest cluster is called the linkage (single uses the closest pair of points, complete uses the farthest pair, average uses the mean pairwise distance). Comparing cluster centers of mass, merging so as to minimize the increase in within-cluster variance, is called Ward’s method
Clustering via Probabilistic Mixture Modeling
- Given $X\in \mathbb{R^d}$ and a number of intended clusters k
- Assume a joint probability distribution (X,C) over the joint space $\mathbb{R^d} \times \{1,…,k\}$
- The distribution of C is given by a probability vector in $\mathbb{R}^k$ (ie. entries sum to 1) that denotes the weighting of which cluster you are coming from
- $P(C=i) = \pi_i$
- X|(C=i) is some multivariate distribution (take Gaussian for simplicity; it could be anything)
- For Gaussians, the parameter space is $\theta = (\pi_1,\vec{\mu_1},\Sigma_1,…\pi_k,\vec{\mu_k},\Sigma_k)$
- The values of $c_i$ are hidden from us
- $P(\vec{x}|\vec{\theta}) = \Sigma_{j=1}^{k} \pi_{j} \frac{1}{\sqrt{(2\pi)^d det(\Sigma_j)}}exp(\frac{-1}{2}(\vec{x}-\vec{\mu_j})^T\Sigma_{j}^{-1}(\vec{x}-\vec{\mu_j}))$
MLE Estimation
- $\theta_{MLE} = argmax_{\theta} \Sigma_{i=1}^{n} ln [\Sigma_{j=1}^{k} \pi_{j}\frac{1}{\sqrt{(2\pi)^d det(\Sigma_j)}}exp(\frac{-1}{2}(\vec{x_i}-\vec{\mu_j})^T\Sigma_{j}^{-1}(\vec{x_i}-\vec{\mu_j}))]$
- Can’t put the ln inside the inner $\Sigma$, so there is no closed form; the direct MLE answer also does not give a desirable outcome
Expectation Maximization (EM) Algorithm
- Idea:
- Initialize the parameters arbitrarily
- Given the current setting of parameters, find the best (soft) assignment of data samples to the clusters (Expectation Step)
- This means that you assign the data samples to your clusters with some probabilities for each cluster
- Update all the parameters with respect to the current (soft) assignment that maximizes the likelihood
- Keep doing this until you get bored, or until you get a degenerate solution. If you have a degenerate solution, you restart the algorithm with some different random seed of initial parameters
- In math terms:
- Initialize $\theta = <\pi_1,\vec{\mu_1},\Sigma_1…>$
- Expectation step, assign a weight to each data for each cluster
- $w_j^i = \frac{\pi_j N(x_i,\mu_j,\Sigma_j)}{\Sigma_{j’=1}^{k}\pi_{j’} N(x_i,\mu_{j’},\Sigma_{j’})}$
- Maximize the log-likelihood of the parameters
- Define $n_j = \Sigma_{i=1}^{n} w_j^i$ as the effective number of points assigned to cluster j
- $\pi_j = \frac{n_j}{n}$ (the fraction of the data that is associated with cluster j)
- $\vec{\mu_j} = \Sigma_{i=1}^{n} \frac{w_{j}^{i}\vec{x_i}}{n_j}$
- $\Sigma_j = \frac{1}{n_j}\Sigma_{i=1}^{n}w_j^{i} (\vec{x_i}-\vec{\mu}_j)(\vec{x_i}-\vec{\mu}_j)^T$
- The above are found by taking vector and matrix derivatives
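A compact EM sketch for a Gaussian mixture following these updates (a small ridge is added to each covariance to dodge degenerate solutions; the two blobs are synthetic):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1 / k)
    mu = X[rng.choice(n, size=k, replace=False)]
    cov = np.stack([np.eye(d)] * k)
    for _ in range(iters):
        # E-step: w[i, j] proportional to pi_j N(x_i; mu_j, cov_j).
        w = np.stack([pi[j] * multivariate_normal(mu[j], cov[j]).pdf(X)
                      for j in range(k)], axis=1)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: the updates for pi, mu, cov given above.
        nj = w.sum(axis=0)                        # effective counts n_j
        pi = nj / n
        mu = (w.T @ X) / nj[:, None]
        cov = np.stack([((X - mu[j]).T * w[:, j]) @ (X - mu[j]) / nj[j]
                        + 1e-6 * np.eye(d) for j in range(k)])
    return pi, mu, cov

rng = np.random.default_rng(12)
X = np.vstack([rng.normal(size=(100, 2)) + 4, rng.normal(size=(100, 2)) - 4])
pi, mu, cov = em_gmm(X, k=2)
print(pi.round(2), mu.round(1))
```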
Topic 8
PCA
- Create a smaller dimensional representation of our data
- We want to find the optimal map $\mathbb{R}^d \rightarrow \mathbb{R}^k$ that best maintains reconstruction accuracy
- Equivalently, minimize aggregate residual error
- Define $\Pi_{k}: \mathbb{R}^d\rightarrow \mathbb{R}^d$ as the k-dimensional orthogonal linear projector
- We want to calculate the following: $min_{\Pi_{k}} \frac{1}{n}\Sigma_{i=1}^{n} |\vec{x_i}-\Pi_{k}(\vec{x_i})|^2$
Projector
- The projection of any x in $\mathbb{R}^d$ onto the orthonormal span of $q_1,q_2…q_k$ is given by
- $\Sigma_{i=1}^{k} (\vec{q_i}\cdot\vec{x})\vec{q_i} = (\Sigma_{i=1}^{k}\vec{q_i}\vec{q_i}^T)\vec{x}$
- $\vec{q_i}\vec{q_i}^{T}$ is a d×d matrix of rank 1. The sum of the k matrices has rank k
- $\vec{q}\vec{q}^T$ is $\Pi_k$
PCA (What does it have to do with eigenvalues?)
- We now want to find $min_{|q|=1} \frac{1}{n}\Sigma_{i=1}^{n} |\vec{x_i}-\vec{q}\vec{q}^T\vec{x_i}|^2$ (the k=1 case)
- Through some algebra, you can rewrite the problem as
- $maximize_{|q|=1}\ \vec{q}^{T}\frac{1}{n}XX^T\vec{q}$ (call the middle matrix $M = \frac{1}{n}XX^T$; here X holds the data points as columns)
- Since M is a symmetric positive semi-definite matrix, by spectral decomposition $M = \Sigma_{i} \lambda_{i}v_{i}v_{i}^T$ where all the eigenvalues are real and non-negative
- Hence $q^{T}Mq = \Sigma_{i}\lambda_{i}(q\cdot v_i)^2$
- Define $\alpha_i = (q\cdot v_{i})^2$. The optimization then is to maximize $\Sigma_{i}\lambda_i\alpha_i$ such that $\Sigma_{i} \alpha_i = 1$
- This can be maximized by letting $\alpha_1 = 1$ (for the largest eigenvalue $\lambda_1$) and the rest be equal to zero
- We can think of this solution as one that maximizes the spread (ie. variance) with respect to the direction of $q_1$ with respect to the origin (ie forcing a mean of 0)
- We can break down any data into its eigenvectors. We can take advantage of the fact that only the first few (low frequency) eigenvectors contribute meaningfully.
- $\vec{x} = \Sigma_{i} (\vec{x}\cdot \vec{q_i})q_{i}$
- We keep the first k values. (like the Pareto principle)
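A sketch of PCA via the eigendecomposition above; here X has rows as data points (the transpose of the d×n convention used in the derivation), so the matrix becomes $\frac{1}{n}X^TX$:

```python
import numpy as np

def pca(X, k):
    """Top-k principal directions; X is n x d with rows as samples."""
    X = X - X.mean(axis=0)                  # force a mean of 0
    M = (X.T @ X) / len(X)
    eigvals, eigvecs = np.linalg.eigh(M)    # ascending eigenvalues
    Q = eigvecs[:, ::-1][:, :k]             # top-k eigenvectors q_1..q_k
    return X @ Q, Q                         # k-dim codes and the directions

rng = np.random.default_rng(13)
X = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.3])   # elongated cloud
codes, Q = pca(X, k=1)
print(Q[:, 0])   # close to (+-1, 0): the direction of maximum variance
```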
Topic 9
Graphical Models
- A probabilistic model where a graph represents the conditional dependence structure among the variables
Inference Questions
- From the graph, can you construct probabilities or chances of certain events causing other events?
Learning Questions
- What is the most likely graphical model structure and connection weights that model the data?
Directed Models
- Edge direction typically denotes potential causality
Undirected Models
- Edges denote mutual dependence, with no causal direction
- Ex: Markov Random Fields
Bayesian Networks
- A type of directed model
- Assume we have some joint probability distribution
- P(C,S,R,G) = P(C)P(R|C)P(S|R,C)P(G|S,R,C) via the chain rule
- If we are given a Graphical model, we can remove some of these dependencies
- Hence, if we know these final probabilities, we can answer any inference question about the variables
- How do we find these probabilities out?
- We know that $P(X_1,…X_n) = \Pi_{i=1}^{n} P(X_i|parents(X_{i}))$
- Suppose that our data is binary data (ie. each node either happens or does not happen)
- Each node can be assigned a probability by counting each instance of the event
- We can draw arrows between nodes to denote dependence
- We can determine independence between two nodes (ie. $P(A,B) \approx P(A)P(B)$) (shorthand $A\perp B$)
- We also need to check if $P(A,B|C) \approx P(A|C)P(B|C)$ (shorthand $A\perp B|C$) where you can have any number of conditions
- As long as no cycles occur, we are good
Time Series Model
- A time series model: a family of distributions over a sequence of random variables that is indexed by a totally ordered indexing set (often referred to as time)
- totally ordered means that there exists a natural order on the set (ie. for any a and b in the set, you can say definitively whether $a\leq b$)
- $… X_{t-2}, X_{t-1}, X_{t}, X_{t+1}, X_{t+2}… $
Markov Chain
- The conditional distribution of the next state $X_{t+1}$ given all the previous states $X_{i}$ only depends on the current state $X_{t}$
- ie $P(X_{t+1}|X_{t},X_{t-1}…) = P(X_{t+1}|X_{t})$
- Suppose that we have d distinct states. We need to construct a transition matrix that transition the probabilities of the current state to the probabilities of the next state
- Call the initial distribution $P(X_{1}=i) = \pi_{i}$
- The conditional distribution is $P(X_{t+1}=j|X_{t}=i) = A_{ij}$
- The elements in each row of this transition matrix must sum to 1 (row stochastic)
- Once have the initial condition and the transition matrix, we can calculate probabilities
- Ex: What is the probability of seeing the sequence 2,2,1,1,2,1,1? This is given by $\pi_{2}\cdot A_{2,2}\cdot A_{2,1}\cdots$
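A sketch of this computation (the initial distribution and transition matrix are invented; states are 0-indexed in the code, so the sequence 2,2,1,1,2,1,1 becomes 1,1,0,0,1,0,0):

```python
import numpy as np

pi = np.array([0.6, 0.4])          # hypothetical initial distribution
A = np.array([[0.7, 0.3],          # hypothetical row-stochastic transitions
              [0.4, 0.6]])

def sequence_prob(states, pi, A):
    """P(X_1=s_1, ..., X_T=s_T) = pi[s_1] * prod_t A[s_t, s_{t+1}]."""
    p = pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= A[s, s_next]
    return p

print(sequence_prob([1, 1, 0, 0, 1, 0, 0], pi, A))
```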
PageRank
- How does ranking webpages relate to time series?
- Suppose that you start on some webpage (node). The edges denote links between websites
- PageRank posits that popularity is proportional to the probability that a random walk ends on page i
- $P(X_{1}=i) = \pi_{i}$
- $P(X_{2}=i) = \Sigma_{j} P(X_{1}=j,X_{2}=i) = \Sigma_{j} P(X_{1}=j)\cdot P(X_{2}=i|X_{1}=j) = \Sigma_{j} \pi_{j} A_{j,i}$
- ie. the ith entry of $\pi^{T}A$ = $(\pi^{T}A)_{i}$
- $P(X_{t}=i) = (\pi^{T}A^{t-1})_{i}$
- We want to find out what the probability approaches as t increases
- This is identical to the statement: does $lim_{t\rightarrow\infty} A^{t}$ approach a matrix M where all the rows are identical
- This makes sense because if you multiply by A 1 more time, the matrix M should not change
- For such an A, we can see that $lim_{t\rightarrow\infty} A^{t} = lim_{t\rightarrow\infty} (A^{t-1})A \rightarrow MA = M$
- Hence, we need to find the eigenvector of $A^{T}$ with eigenvalue 1 (the stationary distribution)
- One problem is that you don’t know whether $A^{t}$ actually converges in the limit
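A power-iteration sketch of this limiting distribution (assumes the limit exists, eg. after the usual damping fix that makes A irreducible; the matrix here is made up):

```python
import numpy as np

def stationary(A, iters=10_000, tol=1e-12):
    """Iterate pi^T <- pi^T A until it stops changing."""
    pi = np.full(A.shape[0], 1 / A.shape[0])
    for _ in range(iters):
        nxt = pi @ A
        if np.abs(nxt - pi).max() < tol:
            break
        pi = nxt
    return pi

A = np.array([[0.1, 0.9],
              [0.5, 0.5]])
pi = stationary(A)
print(pi, pi @ A)   # a left eigenvector of A with eigenvalue 1
```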