Linear Regression

  • You have a set of data $(x_{i},y_{i},\sigma_{i})$, where x is the independent variable, y is the dependent variable, and $\sigma$ is the uncertainty in the dependent variable
  • You want to fit a line to the data ($\hat{y} = mx+b$)
    • You also want the uncertainties of the parameters
  • Why would you want to do this?
    • To know the parameters of the model
    • To extrapolate/interpolate values of the dataset that you don’t have
  • How you do this depends on your viewpoint
    • a frequentist approach would try to maximize the likelihood (maximum-likelihood estimation, motivated by information theory)
      • Suppose that you have a probability density for the data $y_{i}$, $i = 1, \dots, N$
        • Treat this density as a function of the parameters. What parameters maximize this density?
    • A Bayesian approach would compute posterior estimates (justified by Cox’s theorems)

Likelihood Function

  • The model we are trying to fit for Linear Regression is $y=mx+b+\pi(0,\sigma)$
    • We assume that x has no uncertainty
    • that the noise function $\pi$ is Gaussian with mean 0 (why? Central Limit Theorem, and it’s easy to work with)
      • Assume that your variances are correct (big assumption)
    • Each data point is independent from each other (iid)
  • $\mathcal{N}(y;\mu,\sigma) = \frac{1}{\sqrt{2\pi \sigma^{2}}} \exp\left(-\frac{(y-\mu)^{2}}{2\sigma^{2}}\right)$ (1D Gaussian)
  • With this, the log-likelihood function can be written (up to an additive constant) as $\ln\mathbb{L} = -\frac{1}{2}(\vec{Y}-X\vec{a})^{T}C^{-1}(\vec{Y}-X\vec{a})$
    • Y is the data vector
    • X is the design matrix (first column is the independent variables, 2nd column is all ones)
    • a is the parameter vector
    • $C^{-1}$ is a diagonal matrix, whose diagonals are $\frac{1}{\sigma_{i}^{2}}$
  • the most likely values of m and b are those which maximize this likelihood (ie. where the derivative with respect to $\vec{a}$ is 0)
    • Dig through this pamphlet to do that calculation (a numerical sketch follows below)
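
A minimal numerical sketch of this weighted least-squares solution, using numpy; the data, errors, and true parameter values below are made up for illustration:

```python
import numpy as np

# Synthetic data: x with no uncertainty, y with known Gaussian errors sigma.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
sigma = 0.5 * np.ones_like(x)
y = 2.0 * x + 1.0 + rng.normal(0, sigma)

# Design matrix: first column is x, second column is all ones.
X = np.column_stack([x, np.ones_like(x)])
Cinv = np.diag(1.0 / sigma**2)          # inverse covariance (diagonal, known variances)

# Maximizing the likelihood is equivalent to solving the weighted normal equations:
#   (X^T C^-1 X) a = X^T C^-1 y
A = X.T @ Cinv @ X
a_hat = np.linalg.solve(A, X.T @ Cinv @ y)   # a_hat = [m, b]

# Parameter covariance (and hence uncertainties) from the inverse of A.
cov_a = np.linalg.inv(A)
m, b = a_hat
m_err, b_err = np.sqrt(np.diag(cov_a))
print(f"m = {m:.3f} +/- {m_err:.3f}, b = {b:.3f} +/- {b_err:.3f}")
```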

Frequentist Week

Some Definitions

  • aleatoric uncertainty (noise) and epistemic uncertainty (systematic error or uncertainty in the form of model assumptions)
  • The lower bound on your parameter covariance matrix (ie. your total uncertainty) is the inverse of the Fisher information matrix (the Cramér–Rao bound)
    • This lower bound is reached only if all of the assumptions are correct (which is normally pretty naive)

Frequentist Rules

  • You can never have a measure in the parameter space, only the data space
    • “parameters are set by God. The data are created by us” - D.W.H.
    • This means that statements like “the data are more consistent with $H_{0}>69$ than with $H_{0}<69$” are OK, but “it is more probable that $H_{0}>69$ than $H_{0}<69$” is not OK
  • p-value of p<0.05: if the null model were true, we’d get data this “deviant” less than 5 percent of the time
    • The null model is the model in which the effect you are trying to demonstrate is absent. So if data this extreme are very unlikely under the null model, you reject the null in favor of your hypothesis

What is $\chi^2$?

  • You have some data (a set of unordered vectors) and a model that tries to describe this data
  • We can define $\chi^{2} = \sum_{i=1}^{N} \left(\frac{y_{i}-f(x_{i},\theta)}{\sigma_{i}}\right)^{2}$
  • You can also relate the chi-squared to the log-likelihood via $\ln p(y|\theta) = -\frac{1}{2}\chi^{2}+\text{const}$
    • This assumes zero-mean Gaussian noise with known variances. It also assumes that the noise on each point is independent
  • What value do you expect for $\chi^{2}$?
    • If your model describes your data well, then you expect each data point to be offset from the model by about one standard deviation, so $\left(\frac{y_{i}-f(x_{i},\theta)}{\sigma_{i}}\right)^{2} \approx 1$ on average, and naively you expect $\chi^{2} \approx N$
    • In reality, $\chi^{2} \approx N - p = N_{dof}$, where p is the number of fitted parameters and $N_{dof}$ is the number of degrees of freedom.
      • The handwaving explanation for this is that you aren’t using the true parameters, you are using the best-fit parameters. This causes you to overfit a bit, which pulls the $\chi^{2}$ down, hence the $-p$
  • In the limit of large $N_{dof}$, the $\chi^{2}$ distribution approaches a Gaussian with mean $N_{dof}$ and variance $2N_{dof}$
    • A consequence of this is that for sufficiently large N, all models get rejected (“all models are wrong; some are useful” )
  • If you have multiple models, then you can talk about $\Delta \chi^{2}$
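
A short sketch of computing $\chi^{2}$ and the reduced $\chi^{2}$ for a straight-line model; the data and “best-fit” values below are illustrative, not from a real fit:

```python
import numpy as np

def chi_squared(y, yerr, y_model):
    """Sum of squared, error-normalized residuals."""
    return np.sum(((y - y_model) / yerr) ** 2)

# Illustrative data and straight-line model y = m*x + b.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
yerr = 0.2 * np.ones_like(y)
m, b = 2.0, 1.0                      # best-fit parameters from some fit
n_params = 2

chi2 = chi_squared(y, yerr, m * x + b)
n_dof = len(y) - n_params            # N minus number of fitted parameters
print(chi2, n_dof, chi2 / n_dof)     # reduced chi^2 should be ~1 for a good fit
```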

Nuisance Parameters

  • Some parameters in a model you care about. Others you need to include, but don’t want to
    • For example, instrumentation might introduce these extra parameters
  • Suppose that our log-likelihood is $L = \ln p(\{y\}|\theta,\alpha)$, where $\alpha$ is/are your nuisance parameter(s)
    • You can define a function $L_{p}(\theta) = \max_{\alpha} L(\theta, \alpha)$ (ie. you find the $\alpha$ that maximizes the likelihood at each $\theta$)
      • You didn’t have to specify a prior on $\alpha$
      • $L_{p}$ is called the profile log likelihood
    • This profiling removes the nuisance parameters from the likelihood, so you can just maximize $L_{p}$ instead of L
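
A sketch of profiling numerically, assuming a toy problem where $\theta$ is the mean of the data and the nuisance $\alpha$ is the unknown noise scale (both the model and the numbers are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy problem: y_i ~ N(theta, alpha^2), with theta the parameter of interest
# and alpha (the unknown noise scale) a nuisance parameter.
rng = np.random.default_rng(1)
y = rng.normal(3.0, 0.7, size=50)

def log_like(theta, alpha):
    return -len(y) * np.log(alpha) - 0.5 * np.sum((y - theta) ** 2) / alpha**2

def profile_log_like(theta):
    """L_p(theta) = max_alpha L(theta, alpha): optimize away the nuisance."""
    res = minimize_scalar(lambda a: -log_like(theta, a),
                          bounds=(1e-3, 10.0), method="bounded")
    return -res.fun

thetas = np.linspace(2.5, 3.5, 41)
Lp = np.array([profile_log_like(t) for t in thetas])
print("profile-likelihood estimate of theta:", thetas[np.argmax(Lp)])
```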

Uncertainty Estimation

Robustness

  • A “robust” method is insensitive to your noise model
  • The previous uncertainty estimates were not robust (ie. they heavily depended on the noise being Gaussian)

Fair Sampling

  • Assume that $x_{i}$ is drawn from some distribution $p(x)$. Then your sampling of x is fair if $\frac{1}{k}\sum_{i=1}^{k} f(x_{i}) \approx \int f(x)\, p(x)\, dx$
  • $p(x)$ must satisfy $\int p(x) dx = 1$ and $p(x) \geq 0$

Bootstrapping

  • Assume that your data $y_{i}$ are fair samples drawn from $p(z|\theta)$
  • Assumes that your data is “similar”: ie. has similar errors
  • Create an estimator $\hat{\theta}(\{y_{i}\})$, $i=1,\dots,N$, from the full dataset
    • Draw K sets of N values from $y_{i}$ with replacement
    • Estimate $\hat{\theta}$ for each of the K samples
    • The variance of $\theta$ is then estimated as $\frac{1}{K}\sum_{j=1}^{K} (\hat{\theta}_{j}-\hat{\theta})^{2}$, where $\hat{\theta}$ is the estimator applied to all the data
    • The covariance matrix is $C_{\theta} = \frac{1}{K}\sum_{j=1}^{K} (\hat{\theta}_{j}-\hat{\theta})(\hat{\theta}_{j}-\hat{\theta})^{T}$
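
A minimal bootstrap sketch, assuming the estimator is just the sample mean (the data and K are illustrative):

```python
import numpy as np

def bootstrap_variance(y, estimator, K=1000, rng=None):
    """Bootstrap estimate of the variance of estimator(y)."""
    rng = rng or np.random.default_rng()
    N = len(y)
    theta_full = estimator(y)                       # estimate from all the data
    theta_boot = np.array([
        estimator(y[rng.integers(0, N, size=N)])    # N draws with replacement
        for _ in range(K)
    ])
    return np.mean((theta_boot - theta_full) ** 2)

y = np.random.default_rng(2).normal(5.0, 2.0, size=100)
print(bootstrap_variance(y, np.mean), 2.0**2 / 100)   # compare to sigma^2 / N
```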

Jackknifing

  • Assume that your data $y_{i}$ are fair samples drawn from $p(z|\theta)$
  • Assumes that your data is “similar”: ie. has similar errors
  • Create an estimator $\hat{\theta}(\{y_{i}\})$, $i=1,\dots,N$, that can also be evaluated on a smaller subset of y
    • Take all of the data from y, but drop one datapoint and call this subsample $Y_{j}$ (so you have a subsample of size N-1). Do this for all possible choices of j
    • Define $\hat{\theta}_{j}$ for all of the subsamples
    • $\sigma_{\theta}^{2} = \frac{N-1}{N} \sum_{j=1}^{N} (\hat{\theta}_{j}-\langle\hat{\theta}\rangle)^{2}$
      • $\langle\hat{\theta}\rangle$ is the mean across all of the jackknife estimators
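
The corresponding jackknife sketch, again with the sample mean as the (illustrative) estimator:

```python
import numpy as np

def jackknife_variance(y, estimator):
    """Jackknife estimate of the variance of estimator(y)."""
    N = len(y)
    # Leave one point out at a time and re-run the estimator.
    theta_j = np.array([estimator(np.delete(y, j)) for j in range(N)])
    theta_mean = theta_j.mean()                     # <theta> in the notes
    return (N - 1) / N * np.sum((theta_j - theta_mean) ** 2)

y = np.random.default_rng(3).normal(5.0, 2.0, size=100)
print(jackknife_variance(y, np.mean), 2.0**2 / 100)   # compare to sigma^2 / N
```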

Bayesian Statistics

  • Bayes is a theory of prediction, frequentism is a theory of measurement
  • Based on Cox theorems
  • In Bayes, our likelihood function is a probability distribution $p(y|\theta, \alpha, I)$, where $\theta$ are your parameters of interest, $\alpha$ are your nuisance parameters, and I are your assumptions
  • All of Bayes depends on having a proper set of priors which represents your beliefs about $\theta$, $\alpha$, and I before you see y
  • Bayes Rule: $p(\theta,\alpha| y, I) = \frac{p(\theta, \alpha|I)\,p(y|\theta, \alpha, I)}{Z}$, where Z is the normalization (obtained by marginalizing over the parameters), equal to $p(y|I)$

Marginalization

  • Suppose that we have some prior $p(\theta,\alpha | I)$ which is separable ($p(\theta,\alpha | I) = p(\theta|I)\, p(\alpha|I)$)
    • This can arise if your beliefs about different sets of parameters are independent of each other (ex. the assumptions about the Higgs mass are separate from the assumptions about the calorimeter sensitivity)
    • You can integrate out the nuisance parameters $\alpha$. This is called marginalization
      • The marginal likelihood is given by $p(y|\theta, I) = \int p(y|\theta, \alpha, I) p(\alpha | I) d\alpha$
      • The marginal posterior is $p(\theta|y,I) = \int p(\theta,\alpha|y,I) d\alpha$
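
A sketch of marginalizing numerically, assuming a toy one-point Gaussian likelihood and a Gaussian prior on the nuisance $\alpha$ (names and numbers are illustrative; the analytic check only works because everything here is Gaussian):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Toy setup: y ~ N(theta + alpha, sigma), with prior alpha ~ N(0, s_alpha).
sigma, s_alpha = 1.0, 0.5
y_obs = 3.2

def likelihood(theta, alpha):
    return norm.pdf(y_obs, loc=theta + alpha, scale=sigma)

def marginal_likelihood(theta):
    """p(y|theta, I) = integral of p(y|theta, alpha, I) p(alpha|I) d alpha."""
    integrand = lambda a: likelihood(theta, a) * norm.pdf(a, 0.0, s_alpha)
    val, _ = quad(integrand, -10 * s_alpha, 10 * s_alpha)
    return val

print(marginal_likelihood(3.0))
# Analytic check: marginalizing a Gaussian over a Gaussian prior gives
# N(y; theta, sqrt(sigma^2 + s_alpha^2)).
print(norm.pdf(y_obs, 3.0, np.sqrt(sigma**2 + s_alpha**2)))
```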

MCMC

  • Markov Chain Monte Carlo
  • Universal way of calculating posterior distributions
    • Kind of a crappy way of doing so, since there are no nice convergence theorems
  • Suppose that we have some posterior distribution $p(\Omega| y,I)$
  • A fair set of samples from this posterior can be used to approximate it (its moments, marginals, etc.)

Metropolis Hastings

  • We define some probability distribution $q(\Omega'| \Omega)$ called the proposal distribution (ie. a distribution which generates a new sample given an old sample)
    • We assume that $q(\Omega'| \Omega) = q(\Omega| \Omega')$ (a symmetric proposal, which makes the acceptance rule below satisfy detailed balance)
  • After drawing a proposal from q, you calculate the change $\Delta$ in the log-posterior between the old and the new sample
    • Roll a uniform random number r between 0 and 1. Accept the new sample if $\ln(r) < \Delta$ (a sampler sketch appears at the end of this section)
    • Otherwise, you keep the previous sample
    • In the limit of infinite samples, you will get a fair sampling of the posterior
    • Nearby samples are not independent
      • This is modulated by the “correlation time”, which roughly speaking, is the number of samples you have to take before you get a new independent sample
      • This correlation time contributes to the number of truly independent samples: $N_{eff} = \frac{K}{T}$, where K is the total samples drawn and T is the correlation time
  • How do you choose q?
    • You need to choose a q which equally trades off between exploration and exploitation
      • You could try to measure the autocorrelation time, but that’s hard to compute efficiently
      • You could also try to aim for a target acceptance rate (~0.4)
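
A minimal Metropolis sampler sketch with a symmetric Gaussian proposal; the target (a 1D standard Gaussian) and the step size are illustrative choices:

```python
import numpy as np

def metropolis(log_post, x0, n_steps, step_size, rng=None):
    """Metropolis sampler with a symmetric Gaussian proposal q(x'|x)."""
    rng = rng or np.random.default_rng()
    chain = [x0]
    x, lp = x0, log_post(x0)
    for _ in range(n_steps):
        x_new = x + step_size * rng.normal(size=np.shape(x))   # propose
        lp_new = log_post(x_new)
        # Accept if ln(r) < change in log-posterior; otherwise keep the old sample.
        if np.log(rng.uniform()) < lp_new - lp:
            x, lp = x_new, lp_new
        chain.append(x)
    return np.array(chain)

# Toy posterior: a 1D standard Gaussian.
log_post = lambda x: -0.5 * x**2
chain = metropolis(log_post, x0=0.0, n_steps=10000, step_size=1.0,
                   rng=np.random.default_rng(4))
print(chain.mean(), chain.std())   # should be ~0 and ~1
```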

Model Selection

  • Suppose that you have some set of models that you want to test
  • You can select which model you want via AIC and BIC (ie. penalized likelihood criteria) in a frequentist approach
  • With a Bayesian approach, you can find your full marginalized likelihood (integrate over your model parameters)
    • This approach is heavily dependent on your prior (since $p(y|I) = \int p(y|\theta, I)\, p(\theta|I)\, d\theta$)

K-fold Cross Validation

  • You leave k datapoints out of your training data. You have your model predict the held out data points. The model which yields the best prediction is empirically the best
    • Appropriate for choosing “nuisance models”
    • Model independent
  • In pseudocode:
    • You hold out k data points
    • Fit your model on the remaining subsample of the data
    • You predict the outputs of the withheld data points
      • For frequentists, you compute the probability of the withheld data given the model fit to the subsample
      • For Bayesians, you calculate the fully marginalized log-likelihood of the withheld data
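
A sketch of this procedure for the straight-line model from earlier, scoring folds by the Gaussian log-likelihood of the withheld points (the fit/predict helpers and the data are illustrative assumptions):

```python
import numpy as np

def kfold_score(x, y, yerr, fit, predict, k=5, rng=None):
    """Held-out log-likelihood (up to constants), summed over k folds."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(x))
    score = 0.0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        params = fit(x[train], y[train], yerr[train])     # fit on the rest
        resid = y[fold] - predict(x[fold], params)        # predict held-out points
        score += -0.5 * np.sum((resid / yerr[fold]) ** 2)
    return score

# Illustrative fit/predict pair: weighted straight-line fit as in the notes.
def fit_line(x, y, yerr):
    X = np.column_stack([x, np.ones_like(x)])
    Cinv = np.diag(1.0 / yerr**2)
    return np.linalg.solve(X.T @ Cinv @ X, X.T @ Cinv @ y)

predict_line = lambda x, a: a[0] * x + a[1]

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 50)
yerr = 0.5 * np.ones_like(x)
y = 2 * x + 1 + rng.normal(0, yerr)
print(kfold_score(x, y, yerr, fit_line, predict_line, k=5, rng=rng))
```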

Interpolation

  • AKA overparameterization (p>n)
  • Want to fit to points in between your data
  • There are a couple of types of interpolation
    • linear
    • bilinear
    • cubic
  • For polynomial interpolation, you have roughly the same number of parameters as number of points
  • When interpolating, “sharp edges” in your dataset have a tendency to get smoothed out by the interpolation
  • Interpolations are also “too faithful”, ie. they stick to the data too much
  • Lots of data is “band-limited”, which means that there is a fundamental distance scale below which you can’t get good resolution (think of a camera)
    • Noise, in contrast, can in principle have very high frequency components
  • 3 primary classes of interpolation methods
    • exact interpolation that passes through the data (think all the flavors of splines)
      • Some wizardry allows these algorithms to run in polynomial time
    • fit a parameterized model to some data with noise
      • cubic time
    • kernel methods (ie. you define some metric, and take a weighted average of the closest neighbors based on the kernel)

Over Parameterized Models

  • Dealing with p parameters and n data points where p >n
  • To prevent the values of these parameters from getting too large, we add some regularization
    • Idea is that under some norm, we want to find $\arg\min_{a} ||a||$ such that $y = Xa$
  • Common regularization schemes are
    • The L2 norm: $\Sigma_{i=1}^{p} a_{i}^{2}$
      • Called ridge regression
    • The L1 norm: $\Sigma_{i=1}^{p} |a_{i}|$
      • called lasso regression
      • Has a tendency to zero out irrelevant parameters
  • We have the following model for overparameterization:
    • $Q = ||y-Xa||^{2} + \lambda ||a||$
    • We minimize Q and take the limit $\lambda\rightarrow 0^{+}$, which yields the minimum-norm solution among all a that satisfy $y = Xa$
  • Can think of the problem in a Bayesian sense
    • $Q = \chi^{2}+\lambda||a||_{2}^{2} = (y-Xa)^{T}C^{-1}(y-Xa)+a^{T}\Lambda a$, where $\Lambda$ is the regularization matrix (e.g. $\lambda$ times the identity)
    • This can be solved by minimizing with respect to a
    • The predicted values are then $\hat{y}' = X'\Lambda^{-1}X^{T}(X\Lambda^{-1}X^{T} +C)^{-1} \vec{y}$, where the prime denotes the new (extrapolated) x, y points
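
A sketch of that prediction formula, assuming $\Lambda = \lambda I$ and an illustrative cosine feature basis (the basis, $\lambda$, and the data are all made-up choices):

```python
import numpy as np

# Min-norm / ridge-style prediction in the p > n regime, following
#   y_hat' = X' L^-1 X^T (X L^-1 X^T + C)^-1 y,  with L = lambda * I.
rng = np.random.default_rng(6)
n, p = 15, 200
t = np.sort(rng.uniform(0, 1, n))
sigma = 0.05
y = np.sin(2 * np.pi * t) + rng.normal(0, sigma, n)

def design(t, p):
    k = np.arange(1, p + 1)
    return np.cos(np.outer(t, k) * np.pi)    # n x p feature matrix (illustrative)

X = design(t, p)
C = sigma**2 * np.eye(n)                     # data covariance
lam = 1e-3
Linv = (1.0 / lam) * np.eye(p)               # Lambda^-1

t_new = np.linspace(0, 1, 200)
X_new = design(t_new, p)
y_pred = X_new @ Linv @ X.T @ np.linalg.solve(X @ Linv @ X.T + C, y)
print(y_pred[:5])
```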

Gaussian Processes (GPs)

Kernels

  • We can view the overparameterized solution as $\hat{y}' = K'(K+C)^{-1}\vec{y}$, where $K = X\Lambda^{-1}X^{T}$ is called the kernel
  • kernel matrices have the form $K_{ij} = k(t_{i}, t_{j})$. Typically, most problems have the form $K_{ij} = k(|t_{i}-t_{j}|)$
  • At $\delta t=0$, K describes the variance of the GP
    • There is also some $\tau$ which describes the correlation time
  • Some common forms for the kernel
    • $K_{ij} = V \exp\left(\frac{-|\delta t|^{2}}{\tau^{2}}\right)$ (squared exponential)
    • $K_{ij} = V \exp\left(\frac{-|\delta t|}{\tau}\right)$ (exponential)
  • Kernel functions must be positive semi-definite
    • Hard to do in general. Easy way to generate a kernel is to have some strictly positive function in frequency space, then Fourier transform it to real space
  • How do you choose a kernel function?
    • Compute the likelihood function of the data
      • This is done by calculating $\ln p(\{y_{i}\}|\mu(t),K)$. Written out: $L = -\frac{1}{2}\ln \det\left(2\pi(K+C)\right)-\frac{1}{2}(\vec{y}-\vec{\mu})^{T}(K+C)^{-1}(\vec{y}-\vec{\mu})$
      • This allows you to compare different kernel functions, and use the Neyman-Pearson lemma to select the MAP
      • This assumes that the distribution is Gaussian, which is not always a valid assumption
    • Use cross validation
      • Leave one out to get: $\hat{y_{i}} = K_{i}(K_{/i}+C_{/i})^{-1}y_{/i}$ and $\hat{\sigma_{i}^{2}} = k(0)-K_{i}(K_{/i}+C_{/i})^{-1}K_{i}^{T}$
        • You can then compare this against the actual data and choose the kernel which maximizes that likelihood
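
A minimal GP sketch using the squared-exponential kernel above, computing both the predictive mean $K'(K+C)^{-1}\vec{y}$ and the marginal log-likelihood (kernel hyperparameters and data are illustrative; the mean function is taken to be zero):

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.sort(rng.uniform(0, 10, 25))
sigma = 0.1
y = np.sin(t) + rng.normal(0, sigma, len(t))

def sqexp_kernel(t1, t2, V=1.0, tau=1.0):
    """Squared-exponential kernel K_ij = V * exp(-|dt|^2 / tau^2)."""
    dt = t1[:, None] - t2[None, :]
    return V * np.exp(-dt**2 / tau**2)

K = sqexp_kernel(t, t)
C = sigma**2 * np.eye(len(t))
t_new = np.linspace(0, 10, 200)
K_new = sqexp_kernel(t_new, t)

# Predictive mean: y_hat' = K' (K + C)^-1 y  (zero prior mean).
alpha = np.linalg.solve(K + C, y)
y_pred = K_new @ alpha

# Marginal log-likelihood, useful for comparing kernels:
#   L = -1/2 ln det(2 pi (K+C)) - 1/2 y^T (K+C)^-1 y
sign, logdet = np.linalg.slogdet(2 * np.pi * (K + C))
logL = -0.5 * logdet - 0.5 * y @ alpha
print(logL)
```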

PCA

  • Imagine that we have our design matrix. We want to decompose $X= USV^{T}$, where S is your (diagonal) singular value matrix and U and $V^{T}$ hold your left and right singular vectors
  • This decomposition creates a spectrum where each singular vector is weighted by its associated singular value. You can thus get a good approximation of your data by keeping only the top q singular vectors
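
A sketch of the truncated-SVD approximation; the data matrix here is random, purely to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 20))          # rows are data points, columns are features
X = X - X.mean(axis=0)                  # center each feature before PCA

U, S, Vt = np.linalg.svd(X, full_matrices=False)

q = 5
X_approx = U[:, :q] @ np.diag(S[:q]) @ Vt[:q, :]   # keep only the top q components
print(np.linalg.norm(X - X_approx) / np.linalg.norm(X))
```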

When is PCA Good?

  • What properties are needed for a good PCA decomposition?
    • When the data is “well aligned”
      • The variation in the data space is the variation you care about
      • Silly example: trying to differentiate between smiley-face images and sad-face images at different positions
        • You want to place all of your faces at the center of your image, so that you don’t waste a PCA eigenvector on locating the average locations of all of the faces
    • When the data has high signal to noise
    • When there are few outliers
  • Assuming that the above holds, then PCA will help
    • de-noise your data (assuming that you choose a sufficient number q of eigenvectors)
      • You can use leave one out cross validation to help determine your best q
    • Interpolate over missing/bad data
    • Detect anomalies/outliers

Misc

Writing a Data Analysis Paper

Trivial Things

  • Given the paper, people within your field should be able to reproduce your work
  • The writing should be unambiguous
  • The notation should not reuse the same symbol for different quantities
  • Be consistent with naming things
  • Try to mix language and symbols
    • for instance, write “wavevector $\vec{k}$”. Not just “wavevector”. Not just $\vec{k}$

Non-trivial Things

  • You should have a list of explicit assumptions and choices
  • Visualize your results in the space of your data
  • Frame your discussion section by the assumptions

Causal Separation

  • You have some data, and you want to know what portions of the data are explained by what mechanism
  • Example: We have n cells, observed at t time slices, and you use m microscopes
    • Suppose that over time, the size of the cells appears to grow. Is this due to the cells actually growing, or is it due to the different calibrations of the microscopes?
    • We have a set of observations $y_{ijk}\pm \sigma_{ijk}$ for cell i at time j with microscope k
      • Some microscopes report smaller sizes than others. How do you decouple the effect of the microscope from the actual increase in cell size?
      • You could calibrate all of your images with an object of known size (external calibration)
      • You could also interleave which microscope you observe the cells with at different times (ie. prevent any covariance buildup between the observables)
        • This is hard to do. A sufficient (but worse) substitute is random assignment
      • This problem can then be encoded as a design matrix

Baby Machine Learning

  • Two main groups of methods: unsupervised and supervised
    • Unsupervised: Find patterns in your data
      • PCA
      • Kernel density estimation
      • K-means
    • Supervised: Presupposes that your data has some labels associated with it
      • linear regression
      • Gaussian Processes
      • K-nearest neighbors
      • SVM (support vector machines)
  • All of the above can be kernelized (ie. do a transformation to your data to linearize it in some basis)
  • You split your data into 3 partitions:
    • the training set: what you learn your model parameters on
    • the validation set: what you learn your optimal hyperparameters on
    • the test set: what you evaluate your final model against

K-Nearest Neighbors

  • Choose the k “closest” feature vectors in the training set to your test feature vector, and then assign the “average” of those k labels
    • With regards to “closest”, you need to define a metric distance in feature space that obeys
      • The triangle inequality
      • and must be non-negative
    • With regards to “average”, you can do a lot of different polling strategies (depends on your problem)
  • You incur biases at the “edges” of the label boundaries
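
A minimal KNN sketch with a Euclidean metric and majority-vote polling (the training data are illustrative; other metrics and polling strategies are possible):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training points."""
    d = np.linalg.norm(X_train - x_test, axis=1)   # Euclidean distances
    nearest = np.argsort(d)[:k]                    # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny illustrative 2D dataset with two classes.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], float)
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))   # -> 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))   # -> 1
```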

K-Means

  • We have a bunch of features and no labels
  • We define k points in our feature space (our centroids)
    • We ask for each feature vector: which of the centroids is the feature vector closest to?
    • After doing the assignment for all of the data, we move each centroid to the average of the feature vectors assigned to it
    • We iterate on this process until convergence
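
A minimal k-means sketch of the loop above (data and k are illustrative; a production implementation would also handle empty clusters and multiple restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Plain k-means: assign points to the nearest centroid, move centroids to
    the mean of their assigned points, and repeat until convergence."""
    rng = rng or np.random.default_rng()
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # init from the data
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for each point.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2, rng=rng)
print(centroids)
```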