You have a set of data $(x_{i},y_{i},\sigma_{i})$, where x is the independent variable, y is the dependent variable, and $\sigma$ is the uncertainty in the dependent variable
You want to fit a line to the data ($\hat{y} = mx+b$)
You also want the uncertainties of the parameters
Why would you want to do this?
To know the parameters of the model
To extrapolate/interpolate values of the dataset that you don’t have
How you do this depends on your viewpoint
a frequentist approach would try to find the maximum-likelihood estimate (information theory)
Suppose that you have a probability density for the data $y_{i}$, with $i$ ranging from 1 to N
Treat this density as a function of the parameters. What parameters maximize this density?
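As a concrete sketch (not from the notes): for the straight-line model with Gaussian errors, maximizing the likelihood is equivalent to weighted least squares, which has a closed-form solution. The data values below are invented for illustration.

```python
import numpy as np

# Toy data (invented for illustration): x, y, and per-point uncertainties sigma
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
sigma = np.array([0.3, 0.3, 0.4, 0.3, 0.5, 0.4])

# Design matrix A with columns (x, 1), so the model is y_hat = A @ [m, b]
A = np.vstack([x, np.ones_like(x)]).T

# Maximizing the Gaussian likelihood = minimizing chi^2 = weighted least squares.
# Solve (A^T C^-1 A) theta = A^T C^-1 y with C = diag(sigma^2).
Cinv = np.diag(1.0 / sigma**2)
cov_theta = np.linalg.inv(A.T @ Cinv @ A)   # parameter covariance matrix
theta = cov_theta @ (A.T @ Cinv @ y)        # best-fit [m, b]

m, b = theta
m_err, b_err = np.sqrt(np.diag(cov_theta))  # 1-sigma parameter uncertainties
print(f"m = {m:.3f} +/- {m_err:.3f}, b = {b:.3f} +/- {b_err:.3f}")
```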
A Bayesian approach would be posterior estimates (Cox theorems)
You can never have a measure in the parameter space, only the data space
“parameters are set by God. The data are created by us” - D.W.H.
This means that statements like “the data are more consistent with $H_{0}>69$ than with $H_{0}<69$” are OK, but “it is more probable that $H_{0}>69$ than $H_{0}<69$” are not OK
p-value of p<0.05: if the null model were true, we’d get data this “deviant” less than 5% of the time
The null model is the model where the thing that you are trying to prove is false. The idea is that if the data are very unlikely under the null model, that is taken as evidence in favor of your hypothesis
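A small numerical illustration of that logic (the numbers are invented, and a $\chi^{2}$ statistic stands in as the measure of “deviance”):

```python
from scipy import stats

# Invented example: the null model leaves 10 degrees of freedom and the
# observed chi^2 statistic for the data is 21.3.
n_dof = 10
chi2_obs = 21.3

# p-value: probability of data at least this "deviant" if the null model were true
p_value = stats.chi2.sf(chi2_obs, df=n_dof)
print(f"p = {p_value:.3f}")  # p < 0.05 is the conventional rejection threshold
```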
You have some data (a set of unordered vectors) and a model that tries to describe this data
We can define $\chi^{2} = \sum_{i=1}^{N} \left(\frac{y_{i}-f(x_{i},\theta)}{\sigma_{i}}\right)^{2}$
You can also relate the chi-squared to the log-likelihood via $\ln p(y|\theta) = -\frac{1}{2}\chi^{2}+\mathrm{const}$
This assumes zero-mean Gaussian noise with known variances. It also assumes that the noise draws at each point are independent (i.i.d. once scaled by $\sigma_{i}$)
What value do you expect for $\chi^{2}$?
If your model describes your data well, then you expect each data point to be offset from the model by roughly its standard deviation, so $\left(\frac{y_{i}-f(x_{i},\theta)}{\sigma_{i}}\right)^{2}\approx 1$ on average; naively, you expect that $\chi^{2} \approx N$
In reality, $\chi^{2} \approx N - N_{par} = N_{dof}$, where $N_{par}$ is the number of fitted parameters and $N_{dof}$ is the number of degrees of freedom
The handwaving explanation for this is that you aren’t using the true parameters, you are using the best-fit parameters. This causes you to overfit a bit, which lowers the $\chi^{2}$ a little, hence the $-N_{par}$
In the limit of large $N_{dof}$, the $\chi^{2}$ approaches a Gaussian with mean $N_{dof}$ and variance of $2N_{dof}$
A consequence of this is that for sufficiently large N, all models get rejected (“all models are wrong; some are useful” )
If you have multiple models, then you can talk about $\Delta \chi^{2}$
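A minimal sketch tying the $\chi^{2}$ pieces above together; the data are invented, and the two models compared (a line vs. a constant) are just for illustration.

```python
import numpy as np

def chi2(y, y_model, sigma):
    """chi^2 = sum_i ((y_i - f(x_i, theta)) / sigma_i)^2"""
    return np.sum(((y - y_model) / sigma) ** 2)

# Toy data (invented for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
sigma = np.array([0.3, 0.3, 0.4, 0.3, 0.5, 0.4])

# Two candidate models: a line (2 parameters) and a constant (1 parameter),
# each fit by weighted least squares (polyfit weights are 1/sigma for Gaussian errors)
m, b = np.polyfit(x, y, deg=1, w=1.0 / sigma)
c = np.average(y, weights=1.0 / sigma**2)

chi2_line = chi2(y, m * x + b, sigma)
chi2_const = chi2(y, np.full_like(y, c), sigma)

N = len(y)
print("line:  chi2 =", chi2_line, " N_dof =", N - 2)   # expect chi2 ~ N_dof for a good model
print("const: chi2 =", chi2_const, " N_dof =", N - 1)
print("Delta chi2 =", chi2_const - chi2_line)           # improvement bought by the extra parameter
```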
Assume that your data $y_{i}$ are fair samples drawn from $p(z|\theta)$
Assumes that your data are “similar”, i.e. have similar errors
Create an estimator $\hat{\theta}(\{y_{i}\})$ that uses the data for $i=1$ to $N$
Draw K sets of N values from $y_{i}$ with replacement
Estimate $\hat{\theta}$ for each of the K samples
The variance of $\theta$ then becomes $\frac{1}{K}\Sigma_{j=1}^{K} (\hat{\theta}_{j}-\hat{\theta})^{2}$, where $\hat{\theta}$ is the estimator when using all the data
The covariance matrix is $C_{\theta} = \frac{1}{K}\Sigma_{j=1}^{K} (\hat{\theta}_{j}-\hat{\theta})(\hat{\theta}_{j}-\hat{\theta})^{T}$
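A bootstrap sketch under the assumptions above; the data and the choice of estimator (a plain mean) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data; the estimator here is just the mean of y (substitute your own theta_hat)
y = rng.normal(loc=5.0, scale=1.0, size=50)
theta_hat = np.mean(y)            # estimator on the full data set

K = 1000                          # number of bootstrap resamplings
theta_boot = np.empty(K)
for j in range(K):
    resample = rng.choice(y, size=len(y), replace=True)  # N draws with replacement
    theta_boot[j] = np.mean(resample)

# Variance of the estimator: (1/K) * sum_j (theta_hat_j - theta_hat)^2
var_theta = np.mean((theta_boot - theta_hat) ** 2)
print("bootstrap sigma(theta) =", np.sqrt(var_theta))
```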
Assume that your data $y_{i}$ are fair samples drawn from $p(z|\theta)$
Assumes that your data are “similar”, i.e. have similar errors
Create an estimator $\hat{\theta}(\{y_{i}\})$ that uses the data for $i=1$ to $N$, and assume that the estimator can also be computed on a smaller subset of y
Take all of the data from y, but drop one datapoint and call this subsample $Y_{j}$ (so you have a subsample of size N-1). Do this for all possible choices of j
Define $\hat{\theta}_{j}$ for all of the subsamples
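A leave-one-out (jackknife) sketch; the data and estimator are invented, and the $(N-1)/N$ prefactor in the variance is the standard jackknife convention rather than something stated above.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=5.0, scale=1.0, size=50)    # invented data
N = len(y)

theta_hat = np.mean(y)                          # estimator on all the data

# Drop one point at a time: subsample Y_j has size N-1
theta_jack = np.array([np.mean(np.delete(y, j)) for j in range(N)])

# Standard jackknife variance (the (N-1)/N prefactor is the usual convention)
var_theta = (N - 1) / N * np.sum((theta_jack - np.mean(theta_jack)) ** 2)
print("jackknife sigma(theta) =", np.sqrt(var_theta))
```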
Bayes is a theory of prediction, frequentism is a theory of measurement
Based on Cox theorems
In Bayes, our likelihood function is a probability distribution $p(y|\theta, \alpha, I)$, where $\theta$ are your parameters of interest, $\alpha$ are your nuisance parameters, and I are your assumptions
All of Bayes depends on having a proper set of priors which represent your beliefs about $\theta$ and $\alpha$, given your assumptions I, before you see y
Bayes Rule: $p(\theta,\alpha| y, I) = \frac{p(\theta, \alpha|I)p(y|\theta, \alpha, I)}{Z}$ where Z is the normalization (or marginalization) constant, equal to $p(y|I)$
Suppose that we have some prior $p(\theta,\alpha | I)$ which is separable ($p(\theta|I) p(\alpha|I) = p(\theta,\alpha | I)$)
This can arise if your beliefs about the different sets of parameters are independent of each other (ex. the assumptions about the Higgs mass are separate from the assumptions about the calorimeter sensitivity)
You can integrate out the nuisance parameters $\alpha$. This is called marginalization
The marginal likelihood is given by $p(y|\theta, I) = \int p(y|\theta, \alpha, I) p(\alpha | I) d\alpha$
The marginal posterior is $p(\theta|y,I) = \int p(\theta,\alpha|y,I) d\alpha$
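A grid-based sketch of marginalizing out a nuisance parameter; the toy model (a common offset $\alpha$ added to a signal $\theta$), the priors, and the grid ranges are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup (invented): y_i ~ Normal(theta + alpha, sigma); theta is of interest,
# alpha is a nuisance offset with its own (separable) Gaussian prior.
sigma = 1.0
y = rng.normal(loc=2.0 + 0.5, scale=sigma, size=20)

theta = np.linspace(0.0, 5.0, 401)[:, None]     # parameter of interest (grid)
alpha = np.linspace(-2.0, 3.0, 401)[None, :]    # nuisance parameter (grid)

# log-likelihood ln p(y | theta, alpha, I), up to a constant
loglike = -0.5 * np.sum((y[:, None, None] - (theta + alpha)) ** 2, axis=0) / sigma**2

# separable prior: p(theta, alpha | I) = p(theta | I) p(alpha | I)
logprior = -0.5 * (theta - 2.5) ** 2 / 2.0**2 - 0.5 * alpha**2 / 1.0**2

# posterior, normalized on the grid (the grid sum plays the role of dividing by Z = p(y|I))
logpost = loglike + logprior
post = np.exp(logpost - logpost.max())
post /= post.sum()

# marginal posterior p(theta | y, I): integrate (sum) over the nuisance parameter alpha
post_theta = post.sum(axis=1)
print("posterior mean of theta:", np.sum(theta.ravel() * post_theta))
```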
We define some probability distribution $q(\Omega'| \Omega)$ called the proposal distribution (i.e. a distribution which generates a new sample given an old sample)
We assume that $q(\Omega'| \Omega) = q(\Omega| \Omega')$ (a symmetric proposal; with this choice the simple acceptance rule below satisfies detailed balance)
Suppose that after drawing from q, you calculate the change in the log-likelihood between the old and the new sample, $\ln(\delta) = \ln p(y|\Omega') - \ln p(y|\Omega)$
Roll a uniform random number r between 0 and 1. Accept the new sample if $ln(r) < ln(\delta)$
Otherwise, you keep the previous sample
In the limit of infinite samples, you will get a fair sampling of the posterior
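A bare-bones Metropolis sketch of the recipe above; the target log-likelihood (a 1-D Gaussian) and the proposal width are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_like(omega):
    # stand-in target: log-likelihood of a unit Gaussian centered at 1.0 (invented)
    return -0.5 * (omega - 1.0) ** 2

n_steps = 20000
step = 0.8                          # width of the symmetric Gaussian proposal q
omega = 0.0                         # starting point
chain = np.empty(n_steps)
n_accept = 0

for k in range(n_steps):
    omega_new = omega + step * rng.normal()            # draw from q(omega' | omega)
    delta_ln = log_like(omega_new) - log_like(omega)   # change in log-likelihood, ln(delta)
    if np.log(rng.uniform()) < delta_ln:               # accept if ln(r) < ln(delta)
        omega = omega_new
        n_accept += 1
    chain[k] = omega                                    # otherwise keep the previous sample

print("acceptance rate:", n_accept / n_steps)
print("sample mean ~", chain.mean(), " sample std ~", chain.std())
```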
Nearby samples are not independent
This is modulated by the “correlation time”, which, roughly speaking, is the number of samples you have to take before you get a new independent sample
This correlation time contributes to the number of truly independent samples: $N_{eff} = \frac{K}{T}$, where K is the total samples drawn and T is the correlation time
How do you choose q?
You need to choose a q which trades off between exploration and exploitation
You could try to measure the autocorrelation time, but that’s hard to compute efficiently
You could also try to aim for a target acceptance rate (~0.4)
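One rough way to estimate the correlation time T and $N_{eff} = K/T$ from a chain (a naive estimator of the integrated autocorrelation time, not a prescription from the notes; the AR(1) toy chain below just stands in for MCMC output).

```python
import numpy as np

def correlation_time(chain, max_lag=1000):
    """Naive integrated autocorrelation time: T ~ 1 + 2 * sum of autocorrelations."""
    x = chain - chain.mean()
    var = np.dot(x, x) / len(x)
    tau = 1.0
    for lag in range(1, min(max_lag, len(x) // 2)):
        rho = np.dot(x[:-lag], x[lag:]) / (len(x) * var)
        if rho <= 0:                      # crude cutoff once correlations die away
            break
        tau += 2.0 * rho
    return tau

# Correlated toy chain: an AR(1) series standing in for MCMC samples
rng = np.random.default_rng(4)
chain = np.empty(50_000)
chain[0] = 0.0
for k in range(1, len(chain)):
    chain[k] = 0.9 * chain[k - 1] + rng.normal()

T = correlation_time(chain)
print("correlation time T ~", T)          # for AR(1) with phi=0.9, roughly (1+phi)/(1-phi) ~ 19
print("N_eff = K / T ~", len(chain) / T)  # number of effectively independent samples
```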
You leave k datapoints out of your training data. You have your model predict the held out data points. The model which yields the best prediction is empirically the best
Appropriate for choosing “nuisance models”
Model independent
In pseudocode:
You hold out k data points
Fit your model on this subsample of the data (everything except the k held-out points)
You predict the outputs of the withheld data points
For frequentists, you predict the probability of getting the withheld data given the subsampled model
For Bayesians, you calculate the fully marginalized log likelihood
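A leave-k-out sketch in the frequentist flavor (score each model by the log probability of the withheld points); the data and the polynomial models being compared are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented data: a noisy line; compare polynomial models of different degree
x = np.linspace(0, 10, 40)
sigma = 1.0
y = 2.0 * x + 1.0 + rng.normal(scale=sigma, size=x.size)

def cv_score(degree, k=5):
    """Mean held-out log-likelihood, leaving k points out at a time."""
    idx = rng.permutation(x.size)
    scores = []
    for start in range(0, x.size, k):
        test = idx[start:start + k]                          # the k withheld points
        train = np.setdiff1d(idx, test)                      # the remaining subsample
        coeffs = np.polyfit(x[train], y[train], degree)      # fit on the subsample
        resid = y[test] - np.polyval(coeffs, x[test])        # predict the withheld points
        scores.append(np.sum(-0.5 * (resid / sigma) ** 2))   # Gaussian log-likelihood (up to const)
    return np.mean(scores)

for degree in (1, 2, 5):
    print(f"degree {degree}: mean held-out log-likelihood = {cv_score(degree):.2f}")
```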