Linear Regression

  • You have a set of data $(x_{i},y_{i},\sigma_{i})$, where x is the independent variable, y is the dependent variable, and $\sigma_{i}$ is the uncertainty in the dependent variable
  • You want to fit a line to the data ($\hat{y} = mx+b$)
    • You also want the uncertainties of the parameters
  • Why would you want to do this?
    • To know the parameters of the model
    • To extrapolate/interpolate values of the dataset that you don’t have
  • How you do this depends on your viewpoint
    • a frequentist approach would find the maximum likelihood estimate (information theory)
      • Suppose that you have a probability density for the data $y_{i}$, $i = 1, \dots, N$
        • Treat this density as a function of the parameters. Which parameters maximize this density?
    • A Bayesian approach would compute posterior estimates (Cox theorems)

Likelihood Function

  • The model we are trying to fit for Linear Regression is $y = mx + b + \epsilon$, where $\epsilon$ is a noise term drawn from $\mathcal{N}(0,\sigma)$
    • We assume that x has no uncertainty
    • that the noise $\epsilon$ is Gaussian with mean 0 (why? Central Limit Theorem, and it’s easy to work with)
      • Assume that your variances are correct (big assumption)
    • Each data point is independent from each other (iid)
  • $\mathcal{N}(y \mid \mu,\sigma^{2}) = \frac{1}{\sqrt{2\pi \sigma^{2}}} \exp\left(-\frac{(y-\mu)^{2}}{2\sigma^{2}}\right)$ (1D Gaussian)
  • With this, the log-likelihood can be written (up to a constant) as $\ln \mathbb{L} = -\frac{1}{2}(\vec{Y}-X\vec{a})^{T}C^{-1}(\vec{Y}-X\vec{a})$
    • Y is the data vector
    • X is the design matrix (first column is the independent variables, 2nd column is all ones)
    • a is the parameter vector
    • $C^{-1}$ is the inverse data covariance matrix: diagonal, with entries $\frac{1}{\sigma_{i}^{2}}$
  • The most likely values of m and b are those which maximize this likelihood (ie. where the derivative w.r.t. $\vec{a}$ is 0); see the sketch after this list
    • Dig through this pamphlet to do that calculation
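Setting that derivative to zero gives the weighted least-squares solution $\hat{a} = (X^{T}C^{-1}X)^{-1}X^{T}C^{-1}\vec{Y}$, with parameter covariance $(X^{T}C^{-1}X)^{-1}$. A minimal numpy sketch (the data values here are made up; the per-point uncertainties are assumed known):

```python
import numpy as np

# Hypothetical data: x, y, and per-point uncertainties sigma (assumed known)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
sigma = np.array([0.3, 0.3, 0.4, 0.3, 0.5])

# Design matrix: first column is x, second column is all ones, so a = (m, b)
X = np.column_stack([x, np.ones_like(x)])
Cinv = np.diag(1.0 / sigma**2)              # inverse data covariance matrix

# Normal equations: (X^T C^-1 X) a = X^T C^-1 Y
A = X.T @ Cinv @ X
a_hat = np.linalg.solve(A, X.T @ Cinv @ y)  # best-fit (m, b)
param_cov = np.linalg.inv(A)                # covariance of the parameters
m, b = a_hat
m_err, b_err = np.sqrt(np.diag(param_cov))
print(f"m = {m:.3f} ± {m_err:.3f}, b = {b:.3f} ± {b_err:.3f}")
```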

Frequentist Week

Some Definitions

  • aleatoric uncertainty (noise) and epistemic uncertainty (systematic error or uncertainty in the form of model assumptions)
  • The lower bound on your parameter covariance matrix (ie. your total uncertainty) is the inverse of the Fisher matrix (the Cramér–Rao bound)
    • This lower bound is reached only if all of the assumptions are correct (which is normally pretty naive)

Frequentist Rules

  • You can never have a measure in the parameter space, only the data space
    • “parameters are set by God. The data are created by us” - D.W.H.
    • This means that statements like “the data are more consistent with $H_{0}>69$ than with $H_{0}<69$” are OK, but “it is more probable that $H_{0}>69$ than $H_{0}<69$” is not OK
  • p-value of p<0.05: if the null model were true, we’d get data this “deviant” less than 5 percent of the time
    • The null model is the model in which the effect you are trying to demonstrate is absent. A small p-value is evidence against the null, though strictly it is not the probability that your hypothesis is true

What is $\chi^{2}$?

  • You have some data (a set of unordered vectors) and a model that tries to describe this data
  • We can define $\chi^{2} = \Sigma_{i=1}^{N} \left(\frac{y_{i}-f(x_{i},\theta)}{\sigma_{i}}\right)^{2}$
  • You can also relate the chi-squared to the log-likelihood via $\ln p(y|\theta) = -\frac{1}{2}\chi^{2}+\text{const}$
    • This assumes zero-mean Gaussian noise with known variances, and that the noise on each data point is independent
  • What value do you expect for $\chi^{2}$?
    • If your model describes your data well, then you expect each data point to be offset from the model by about one standard deviation, so $\left(\frac{y_{i}-f(x_{i},\theta)}{\sigma_{i}}\right)^{2} \approx 1$ on average; naively, you expect $\chi^{2} \approx N$ (see the numerical sketch after this list)
    • In reality, $\chi^{2} \approx N - k = N_{dof}$, where k is the number of fitted parameters and $N_{dof}$ is the number of degrees of freedom
      • The handwaving explanation for this is that you aren’t using the true parameters, you are using the best-fit parameters. This causes you to overfit a bit, which pulls the $\chi^{2}$ down, hence the $-k$
  • In the limit of large $N_{dof}$, the $\chi^{2}$ approaches a Gaussian with mean $N_{dof}$ and variance of $2N_{dof}$
    • A consequence of this is that the relative width $\sqrt{2N_{dof}}/N_{dof}$ shrinks as N grows, so for sufficiently large N even tiny model imperfections are statistically significant and all models get rejected (“all models are wrong; some are useful”)
  • If you have multiple models, then you can talk about $\Delta \chi^{2}$
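A tiny numerical sketch of the expectation $\chi^{2} \approx N_{dof}$, using synthetic data generated from the true model (numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
sigma = 0.5 * np.ones_like(x)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)   # synthetic data from the "true" line

model_y = 2.0 * x + 1.0                      # stand-in for the best-fit prediction
chi2 = np.sum(((y - model_y) / sigma) ** 2)
n_dof = len(y) - 2                           # N minus the 2 fitted parameters (m, b)
print(chi2, n_dof, chi2 / n_dof)             # reduced chi^2 should be near 1
# For large N_dof, chi^2 ~ Normal(N_dof, 2*N_dof), so a chi^2 many multiples of
# sqrt(2*N_dof) away from N_dof signals a poor fit (or misestimated error bars).
```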

Nuisance Parameters

  • Some parameters in a model you care about. Others you need to include, but don’t want to
    • For example, instrumentation might introduce these extra parameters
  • Suppose that our log-likelihood is $L(\theta, \alpha) = \ln p([y]|\theta,\alpha)$, where $\alpha$ is/are your nuisance parameter(s)
    • You can define a function $L_{p}(\theta) = \max_{\alpha} L(\theta, \alpha)$ (ie. for each $\theta$ you find the $\alpha$ that maximizes the likelihood)
      • You didn’t have to specify a prior on $\alpha$
      • $L_{p}$ is called the profile log likelihood
    • This profiling removes the nuisance parameters from the likelihood, so you can just maximize $L_{p}$ instead of $L$ (see the sketch below)
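A minimal sketch of profiling, assuming a made-up one-dimensional nuisance parameter (an offset $\alpha$) and a slope $\theta$ that we care about; the data and the model form are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative log-likelihood: slope theta (interesting), offset alpha (nuisance)
def log_likelihood(theta, alpha, x, y, sigma):
    model = theta * x + alpha
    return -0.5 * np.sum(((y - model) / sigma) ** 2)

def profile_log_likelihood(theta, x, y, sigma):
    # L_p(theta) = max over alpha of L(theta, alpha)
    res = minimize_scalar(lambda a: -log_likelihood(theta, a, x, y, sigma))
    return -res.fun

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 30)
sigma = 0.3 * np.ones_like(x)
y = 1.5 * x + 0.7 + rng.normal(0.0, sigma)

thetas = np.linspace(1.0, 2.0, 41)
Lp = [profile_log_likelihood(t, x, y, sigma) for t in thetas]
print(thetas[np.argmax(Lp)])   # the theta that maximizes the profile log likelihood
```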

Uncertainty Estimation

Robustness

  • A “robust” method is insensitive to your noise model
  • The previous uncertainty estimates were not robust (ie. they heavily depended on the noise being Gaussian)

Fair Sampling

  • Assume that $x_{i}$ is drawn from some distribution $p(x)$. Then your sampling of x is fair if $\frac{1}{K}\Sigma_{i=1}^{K} f(x_{i}) \approx \int f(x) p(x) dx$
  • $p(x)$ must satisfy $\int p(x) dx = 1$ and $p(x) \geq 0$

Bootstrapping

  • Assume that your data $y_{i}$ are fair samples drawn from $p(z|\theta)$
  • Assumes that your data is “similar”: ie. has similar errors
  • Create an estimator $\hat{\theta}(\{y_{i}\}_{i=1}^{N})$ from the full dataset
    • Draw K sets of N values from $y_{i}$ with replacement
    • Estimate $\hat{\theta}$ for each of the K samples
    • The variance of $\theta$ then becomes $\frac{1}{K}\Sigma_{j=1}^{K} (\hat{\theta}_{j}-\hat{\theta})^{2}$, where $\hat{\theta}$ is the estimator applied to all the data
    • The covariance matrix is $C_{\theta} = \frac{1}{K}\Sigma_{j=1}^{K} (\hat{\theta}_{j}-\hat{\theta})(\hat{\theta}_{j}-\hat{\theta})^{T}$
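A minimal numpy sketch of the bootstrap variance for a scalar estimator (here the sample mean, on made-up data):

```python
import numpy as np

def bootstrap_variance(y, estimator, K=1000, rng=None):
    """Bootstrap estimate of the variance of `estimator` applied to data y."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(y)
    theta_hat = estimator(y)                       # estimator on the full dataset
    theta_boot = np.array([estimator(rng.choice(y, size=N, replace=True))
                           for _ in range(K)])     # K resamples with replacement
    return np.mean((theta_boot - theta_hat) ** 2)

# Example: variance of the sample mean of 100 points with true scatter 2.0
y = np.random.default_rng(2).normal(5.0, 2.0, size=100)
print(bootstrap_variance(y, np.mean))              # roughly 2**2 / 100 = 0.04
```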

Jackknifing

  • Assume that your data $y_{i}$ are fair samples drawn from $p(z|\theta)$
  • Assumes that your data is “similar”: ie. has similar errors
  • Create an estimator $\hat{\theta}(\{y_{i}\}_{i=1}^{N})$, and assume that the estimator is also well defined on smaller subsets of y
    • Take all of the data from y, but drop one datapoint and call this subsample $Y_{j}$ (so you have a subsample of size N-1). Do this for all possible choices of j
    • Define $\hat{\theta}_{j}$ for all of the subsamples
    • $\sigma_{\theta}^{2} = \frac{N-1}{N} \Sigma_{j=1}^{N} (\hat{\theta}_{j}-\langle\hat{\theta}\rangle_{j})^{2}$
      • $\langle\hat{\theta}\rangle_{j}$ is the mean across all of the jackknife estimators
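And the corresponding jackknife sketch (same made-up setup; leave-one-out estimates of the sample mean):

```python
import numpy as np

def jackknife_variance(y, estimator):
    """Jackknife estimate of the variance of `estimator` applied to data y."""
    N = len(y)
    # Leave-one-out estimates: drop point j, apply the estimator to the rest
    theta_j = np.array([estimator(np.delete(y, j)) for j in range(N)])
    theta_mean = theta_j.mean()                    # <theta>_j in the notes
    return (N - 1) / N * np.sum((theta_j - theta_mean) ** 2)

y = np.random.default_rng(3).normal(5.0, 2.0, size=100)
print(jackknife_variance(y, np.mean))              # again roughly 2**2 / 100 = 0.04
```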

Bayesian Statistics

  • Bayes is a theory of prediction, frequentism is a theory of measurement
  • Based on Cox theorems
  • In Bayes, our likelihood function is a probability distribution $p(y|\theta, \alpha, I)$, where $\theta$ are your parameters of interest, $\alpha$ are your nuisance parameters, and I are your assumptions
  • All of Bayes depends on having a proper set of priors which represents your belief about $\theta$, $\alpha$ and I before you see y
  • Bayes Rule: $p(\theta,\alpha| y, I) = \frac{p(\theta, \alpha|I)p(y|\theta, \alpha, I)}{Z}$ where Z is the normalization (or marginalization), equal to $p(y|I)$

Marginalization

  • Suppose that we have some prior $p(\theta,\alpha | I)$ which is separable ($p(\theta,\alpha | I) = p(\theta|I)\, p(\alpha|I)$)
    • This can arise if your beliefs about different sets of parameters are distinct from each other (ex. the assumptions about the Higgs mass are separate from the assumptions about the calorimeter sensitivity)
    • You can integrate out the nuisance parameters $\alpha$. This is called marginalization
      • The marginal likelihood is given by $p(y|\theta, I) = \int p(y|\theta, \alpha, I) p(\alpha | I) d\alpha$
      • The marginal posterior is $p(\theta|y,I) = \int p(\theta,\alpha|y,I) d\alpha$
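A minimal sketch of marginalizing a one-dimensional nuisance parameter by direct quadrature; the likelihood, the Gaussian prior on $\alpha$, and all numbers below are stand-ins for illustration:

```python
import numpy as np
from scipy.integrate import quad

def likelihood(y, theta, alpha):
    # Illustrative likelihood: data scattered about theta + alpha with unit variance
    return np.exp(-0.5 * np.sum((y - (theta + alpha)) ** 2))

def prior_alpha(alpha):
    # Assumed Gaussian prior on the nuisance parameter, mean 0, width 0.5
    return np.exp(-0.5 * (alpha / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))

def marginal_likelihood(y, theta):
    # p(y|theta, I) = integral of p(y|theta, alpha, I) p(alpha|I) d(alpha)
    val, _ = quad(lambda a: likelihood(y, theta, a) * prior_alpha(a), -5, 5)
    return val

y = np.array([1.1, 0.9, 1.3])
print(marginal_likelihood(y, theta=1.0))
```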

MCMC

  • Markov Chain Monte Carlo
  • Universal way of calculating posterior distributions
    • Kind of a crappy way of doing so, since there are no nice convergence theorems
  • Suppose that we have some posterior distribution $p(\Omega| y,I)$
  • A large enough set of fair samples from this posterior can be used to approximate it (its means, variances, and marginal distributions)

Metropolis Hastings

  • We define some probability distribution $q(\Omega'| \Omega)$ called the proposal distribution (ie. a distribution which generates a new sample given an old sample)
    • We assume that $q(\Omega'| \Omega) = q(\Omega| \Omega')$ (ie. the proposal is symmetric; this is what makes the simple acceptance rule below satisfy detailed balance)
  • Suppose that after drawing from q, you calculate the change in the log-posterior between the old and the new sample, $\Delta = \ln p(\Omega'|y,I) - \ln p(\Omega|y,I)$
    • Roll a uniform random number r between 0 and 1. Accept the new sample if $\ln(r) < \Delta$ (see the sketch at the end of this section)
    • Otherwise, you keep the previous sample
    • In the limit of infinite samples, you will get a fair sampling of the posterior
    • Nearby samples are not independent
      • This is modulated by the “correlation time”, which roughly speaking, is the number of samples you have to take before you get a new independent sample
      • This correlation time contributes to the number of truly independent samples: $N_{eff} = \frac{K}{T}$, where K is the total samples drawn and T is the correlation time
  • How do you choose q?
    • You need to choose a q which equally trades off between exploration and exploitation
      • You could try to measure the autocorrelation time, but that’s hard to compute efficiently
      • You could also try to aim for a target acceptance rate (~0.4)
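A minimal sketch of the Metropolis sampler described above, using a symmetric Gaussian proposal whose step size is the tuning knob; the target here is a toy 1D standard-normal “posterior”:

```python
import numpy as np

def metropolis(log_post, omega0, n_steps=10000, step=0.5, rng=None):
    """Metropolis sampler with a symmetric Gaussian proposal (a sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    omega = np.atleast_1d(np.asarray(omega0, dtype=float))
    lp = log_post(omega)
    chain, n_accept = [], 0
    for _ in range(n_steps):
        proposal = omega + step * rng.normal(size=omega.shape)  # draw from q(.|omega)
        lp_new = log_post(proposal)
        if np.log(rng.uniform()) < lp_new - lp:                 # accept if ln(r) < Delta
            omega, lp = proposal, lp_new
            n_accept += 1
        chain.append(omega.copy())                              # else keep the old sample
    return np.array(chain), n_accept / n_steps

# Toy example: sample a 1D standard normal "posterior"
chain, acc_rate = metropolis(lambda w: -0.5 * np.sum(w**2), omega0=[0.0])
print(acc_rate, chain.mean(), chain.std())   # tune `step` so acc_rate is near ~0.4
```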

Model Selection

  • Suppose that you have some set of models that you want to test
  • You can select which model you want via AIC and BIC (penalized-likelihood criteria) in a frequentist approach (a small sketch follows this list)
  • With a Bayesian approach, you can find your full marginalized likelihood (integrate over your model parameters)
    • This approach is heavily dependent on your prior (since $p(y|I) = \int p(y|\theta, I) p(\theta|I) d\theta $)
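A small sketch of the frequentist criteria; the maximized log-likelihoods below are made-up numbers for two hypothetical models (a 2-parameter line vs. a 3-parameter parabola) fit to the same N = 50 points:

```python
import numpy as np

def aic(log_like_max, k):
    """Akaike information criterion; lower is better."""
    return 2 * k - 2 * log_like_max

def bic(log_like_max, k, n):
    """Bayesian information criterion; lower is better."""
    return k * np.log(n) - 2 * log_like_max

print(aic(-31.0, 2), aic(-29.5, 3))          # compare the two models by AIC
print(bic(-31.0, 2, 50), bic(-29.5, 3, 50))  # BIC penalizes the extra parameter more
```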

K-fold Cross Validation

  • You hold out a fold of k datapoints from your training data, fit on the rest, and have your model predict the held-out data points; repeating this over folds, the model which yields the best predictions is empirically the best
    • Appropriate for choosing “nuisance models”
    • Model independent
  • In pseudocode:
    • You hold out k data points
    • Fit your model on the remaining data
    • You predict the outputs of the withheld data points
      • For frequentists, you compute the probability of the withheld data given the model fit to the subsample
      • For Bayesians, you calculate the fully marginalized likelihood of the withheld data
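A minimal sketch of this loop for the straight-line model from earlier, scoring each held-out fold by its Gaussian log-likelihood; the fold count, seed, and data are illustrative:

```python
import numpy as np

def kfold_score(x, y, sigma, fit, heldout_log_like, k=5, rng=None):
    """K-fold CV: fit on the other folds, score the log-likelihood of the held-out fold."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(y))
    total = 0.0
    for held_out in np.array_split(idx, k):
        train = np.setdiff1d(idx, held_out)
        params = fit(x[train], y[train], sigma[train])          # fit on the rest
        total += heldout_log_like(params, x[held_out], y[held_out], sigma[held_out])
    return total                                                # higher is better

def fit_line(x, y, sigma):
    X = np.column_stack([x, np.ones_like(x)])
    Cinv = np.diag(1.0 / sigma**2)
    return np.linalg.solve(X.T @ Cinv @ X, X.T @ Cinv @ y)      # (m, b)

def line_log_like(params, x, y, sigma):
    m, b = params
    return -0.5 * np.sum(((y - (m * x + b)) / sigma) ** 2)

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 60)
sigma = 0.5 * np.ones_like(x)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)
print(kfold_score(x, y, sigma, fit_line, line_log_like, k=5))
```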