Assuming similar errors for each operations, you find that the round-off error scales as $\sqrt{N}$
Making finer time steps yields a better approximation, but the round-off error degrades this. There is some optimal step size which yields the most accurate result
Solving the aoove for $x(r)$ allows one to transform the uniform distribution to the desired nonuniform one
The above doesn’t always work. A notable example is gaussians. You get a non-invertible function. You do some strange transformations (Box-Muller method) to allow you to circumvent this
You try and write down the cumulative distribution of your desired non-analytic distribution:
$F(x) = \int x dx’ f(x’)$ which maps to the range $[0,1]$
You can then generate points from f(x) by drawing from a uniform distribution and then using the formula $x = F^{-1}(r)$ where the inverse is calculated numerically via interpolating a tabulation of F(x)
The weights for the Fourier basis are $W(x,x_{i}) = \frac{1}{2\pi} \int_{-pi}^{\pi} d\omega exp[i\omega(x-x_{i})] = \frac{\sin[\pi(x-x_{i})]}{\pi(x-x_{i})}$
The central difference approximation of the derivative of f at x
That’s it. There are other schemes that you can use (like forward difference and backwards difference), but those are more niche
To extend to multivariable, you just fix the rest of the variables and only vary one at a time
There is ostensibly an optimal choice for dx, which relates to the machine precision and the third derivative, but for all practical purposes, this is irrelavent since if you know the third derivative, why would you be approximating anything?
There is something called jax, which lets you programatically calculate the derivative of a function using the chain rule
The idea is that you calculate the analytical derivative of the function in parallel with the function itself. This process can be recursive, hence allowing the iteration
Central problem: Solve Ax=b where A is a matrix and b is a vector
Can use LU decomposition: A=LU where L is a lower triangular matrix and U is an upper triangular matrix
L is a lower triangular matrix whose diagonal elements are 1 (this follows since there are $N^{2}+N$ values to set in L and U, but only $N^{2}$ contraints in A)
This factorization is done by using Gaussian elimination with pivoting
Once you have the LU factorization, you can easily find the determinant of A, since det(L) = 1 and det(U) = $\Pi_{i} U_{ii}$, and det(A) = det(L)det(U)
Numerically, L and U are easier to work with than A (more 0s and 1s)
If A is very close to being singular (causing inverses to be unstable), or Ax=b is not unique or doesn’t exist, you can use the SVD to get an approximate solution
You can write any matrix as $A = U W V^{T}$
U is an MxN matrix whose columns are orthonormal
W is a NxN diagonal matrix
$V^{T}$ is the transpose of an orthonormal matrix
You can think of the SVD as telling you the range of A (ie. the subspace that A can map into) and the null space of A (ie. the space for which Ax = 0)
The columns of U for the orthonormal basis of the range of A
The columns of V correspond to the basis which spans the null space
W tells you the rank of the matrix (the number of non-zero $w_{j}$)
From the SVD, you can get the condition number of a matrix (ie. $C = \frac{max(W)}{min(W)}$)
A singular matrix would have C be infinity. if 1/C approaches machine precision, then the matrix is singular enough that doing numerics on it directly is unstable
You can easily get the inverse of A from the SVD: $A^{-1} = V W^{-1} U^{T}$
$W^{-1}$ is trivial to find (take the inverse of all the diagonal elements)
Technically, these are right eigenvectors. The left eigenvectors are defined as $xA = \lambda x$. These left ones have the same eigenvalues as the right
Analytically, you just solve $det(A-\lambda I)=0$, but this is an awful way of doing this numerically. You actually use QR factorization. We don’t talk about this in class though
The idea is that you find the QR decomposition, where Q is an orthogonal matrix (ie. $Q^{T} = Q^{-1}$) and R is an upper triangular matrix
analytically, you use Gram-Schmidt, but this is numerically unstable. More numerically stable methods, like Givens rotations and Householder reflections, find ways to stuff zeros in the intermediate matrices to prevent round-off errors
This means that B is a similar matrix to A, and hence has the same eigenvalues
You apply this process iteratively, and since all the intermediate matrices are similar, they all have the same eigenvalues
Note that this process doesn’t preserve the eigenvectors of A. You can find those by solving the characteristic equation ($Ax= \lambdax$) for the particular eigenvalue (potentially solving for multiple lambda at once with the block power method)
For a given real symmetric, positive definite matrix (this kind of thing shows up a lot in physics, like in the moment of inertia tensor), you can find the three principal axes of the system via eigendecomposition
PCA is also useful for reducing the dimensionality of a dataset, which allows a lower dimensional approximation of a high dimensional dataset (for compression or interprability reasons)
Suppose that you have some dataset $\vec{q_{i}}$ with mean $\vec{q_{0}}$
You can define the residuals as $\vec{r-{i}} = \vec{q_{i}}- \vec{q_{0}}$ and the Covariance matrix as $C = \vec{r_{j}}\vec{r_{i}}^{T}$
The eigenvalues and vectors tells you the coordinate system which diagonalizes this matrix
Since eigenvectors with larger eigenvalues have large variance, they are more important to describing the shape of the data. Hence, you can throw away the smaller eigenvalues and associated eigenvectors, and still keep the majority of the information
For many methods, you need to bracket the root (ie. define a region on the domain of x which has the root in it).
Typically, you just start with an initial guess for a bracket, then grow the bracket symmetrically by some factor until you have the root. This doesn’t always work though
The brain-dead way of finding a root is bisection. This is just binary search, with your stopping condition being if the new bracket’s width is on the order of machine precision
There is a less-sure fire, but faster method called Brent’s method, which fits a parabola to the left, right and midpoint of your bracket, then steps towards the nearest root of the parabola
If the root falls outside the bracket, or the update step is very tiny, then Brent defaults to bisection
The exact details do some massaging of terms to make things numerically stable when solving the quadratic
You you have access to the derivative of the function, you can use Newton-Raphson’s method
You iteratively find $x_{i+1} = x_{i}-\frac{f(x)}{f’(x)}$
If you are nowhere near a root, then Newton’s method can not converge, so it’s usually best to use for really precise determinations of a root, due to it’s quadratic convergence
Assume that you can evaluate your function f(x), and you have defined your bracket from [a,c], and you choose some point b in the bracket
Basic idea:
Try fitting a quadratic of the form $\tilde{f}(x) = \alpha x^2 \beta x +\gamma$ to the points a,b,c (ie. solve for the parameters via solving a matrix equation).
Define your new bracket according to the minimum of this quadratic.
This isn’t perfectly stable, since there is a change that the minimum falls outside of your current bracket
Idea: We use the Hessian matrix ($A_{ij} = \frac{\partial^{2} f}{\partial x_{i}\partial x_{j}}$) to approximate the function locally as a multi-dimensional quadratic, then step to the minimum
Problem: You normally don’t know the Hessian of a function
Idea: If we know the eigenvectors of the Hessian, then you can minimize along the direction of each eigenvector (see here)
$H(f) = \int_{-\infty}^{\infty} dt h(t) exp(2\pi i ft)$
$h(t) = \int_{-\infty}^{\infty} dt H(f) exp(-2\pi i ft)$
Discretization:
Suppose that your set of samples at equally spaced locations $t_{k} = k\Delta$, where k runs from 0 to N-1
Let $f_{n} = \frac{n}{N\Delta}$, where n ranges from $-\frac{N}{2}$ to $\frac{N}{2}$
$H(f_{n}) \approx \Delta \Sigma_{k=0}^{N-1} h(t_{k}) exp(\frac{2\pi i kn}{N})$
$h_{k} = \frac{1}{N} \Sigma_{n=0}^{N-1} H_{n} exp(\frac{-2\pi i kn}{N})$
This is slow to calculate (scales as $N^{2}$). FFT can give $N log N$ scaling instead
Basic Idea: Assumes that N is a power of 2 ( can pad out dataset to make it a power of 2). Split the series into even and odd terms. Recursively apply this splitting of even and odd terms. That’s it (lot’s of bookkeeping, but that’s it)
an Nth order differential equation can be reduced to N 1st order differential equations by letting $\frac{dx}{dt} = y$ (and similar constructs can be used for higher derivatives)
To calculate first order derivatives, you should use Runga-Kutta methods
This is good enough for most applications
If you care about energy conservation, you can use things like Leapfrog
If you have multiple time scales in a system, then you need to make surer that your step-size is small enough to capture the relevant phenomena
Consider PDEs of the form $\frac{\partial \vec{q}}{\partial t} = -\frac{\partial \vec{F}(q)}{\partial x}$, where F is the conserved flux. For simplicity, let’s look at the example $\frac{\partial \vec{q}}{\partial t} = -v\frac{\partial q}{\partial x}$
The naive way of solving this numerically is discretizing space and time into equal spaces:
$x_{j} = x_{0}+j\delta x$
$t_{j} = t_{0}+n\delta t$
This gives an update scheme like: $\frac{q_{j}^{n+1}-q_{j}^{n}}{\delta t} = -v \frac{q_{j+1}^{n}-q_{j-1}^{n}}{2\delta x}$
If we are given q defined for all j at n=0, then we can propagate outwards to solve for when n=1, n=2, etc.
This is not stable. We can see this by Taylor expanding around the difference equation solution (which is $q_{j}^{n} = \epsilon^{n}exp(ikj\delta x)$)
Solving for $\epsilon$, we see that the magnitude of $\epsilon$ is greater than 1 for all k, which implies instablity
The Lax method simply replaces the estimate of $q_{j}$ at step n with the average of the neighboring points
This is stable, as long as $\frac{\delta x}{\delta t} < v$, which is calculated by the same sort of stability analysis as before (called von Neumann Analysis)
This is called the Courant condition. Physically, this means that the time step of the simulation must be smaller than the propagation speed of the signal across a single cell
This has a problem, in that higher order wavenumbers get dampened out
You can see this by rewriting the lax equation to tease out a term which looks like $\frac{\delta x^{2}}{2\delta t}\frac{\partial^{2}q}{\partial x^2}$ (like a viscosity term)
Implicit methods also exist, where we utilize future mesh points in the difference equations in such a way that they can’t be uncoupled
For instance, the diffusion equation ($\frac{\partial q}{\partial t} = \frac{\partial }{\partial x}(D \frac{\partial q}{\partial x})$) can be discretized like $q_{j}^{n+1} = q_{j}^{n}+\alpha(q_{j+1}^{n}-2q_{j}^{n}+q_{j-1}^{n})$
These are just a linear system of equations with N unknowns (the $q_{j}^{n+1}$)
Crank-Nicolson methods are a combination of the two (ie. you take a half step explicitly, and a half step implicitly). The overall method is still implicit, but you get better stability properties and conservation of small-scale structure
If you have constant coefficients, you can get away with Fourier methods
Look at Poisson’s equation: $\nabla ^{2} q = \rho$
The Fourier transform of $\nabla^{2} q = -k^{2} q$. Hence, we can write q as $q = -FT(\frac{\rho}{k^{2}})$
After discretizing, we can apply a discrete fourier Transform to all the involved variables
$q_{j} = \frac{1}{J}\Sigma_{n=0}^{J-1} \tilde{q}_{n}exp(\frac{-2\pi i j n}{J})$ and similar expansion for $\rho$
Plugging in and rearranging, we can match coefficients to find that $\tilde{q_{n}} = \frac{\delta^{2}\tilde{\rho_{n}}}{2\cos(\frac{2\pi n}{J})-1}$. The original variables can be calculated via inverse Fourier Transforms
Another category of methods are relaxation methods, where, if you are dealing with elliptic PDEs (ie. $Lq = p$ where L is an elliptic operator), you can instead solve the diffusion equation $\frac{\partial q}{\partial t} = Lq-p$.
Static solutions to the above satisfy the original eqaution, so if you start with a random q, then iteratively apply the finite difference method of the diffusion equation until q doesn’t change anymore (you are “relaxing” to the stable solution)
Suppose that you have some data $\vec{d}$ and a set of model parameters $\vec{theta}$. MCMC allows you to generate a distribution of $\theta$ (called the posterior) which is peaked at the “best” choice of parameters, whose spread is associated with uncertainties about the correct parameters
This makes no assumption on wheather $\theta$ is properly modelling the data
From Baye’s Theorem: $P(\vec{\theta}| \vec{d}) = \frac{P(\vec{d}|\vec{\theta})P(\vec{\theta})}{P(\vec{d})}$
In words: The probability distribution of the real $\theta$ given the data (posterior) equals the probability distribution of the data given the model parameters (likelihood) multiplied by the distribution of the model parameters (prior) divided by the distribution of the data (marginalization)
in fewer words: posterior = likelihood*prior/marginalization
Intuitively, you can think of your current best guess for distribution of the model parameters (say, based on past data you have collected)
The likelihood represents our current understanding of how the model parameters describe the data (ie. our model)
The marginalization is the distribution of data sets over all possible choices of $\theta$
You don’t normally need to calculate this, but if you had to, it’s just an integral over all $\theta$
A Markov Chain is a sequence of points $\vec{\theta_{n}}$ such that the distribution of point n+1 only depends on point n
We want this sequence to have the property that it is distributed according to the posterior
For any desired distribution $\pi(\vec{\theta})$, we can get a Markov chain that follows $\pi$ if $\pi(\vec{\theta_{1}})p(\theta_{1}|\theta_{2}) = \pi(\vec{\theta_{2}})p(\theta_{2}|\theta_{1})$
Proof: Ask the question: Given a set of points $\theta_{1}$, what is the distribution of the next set of points $\theta_{2}?$
For each step, we generate some $\theta_{i}$, at which we evaluate the posterior $\pi(\theta_{i})$
We generate a candidate $\theta_{i+1}$ from some proposal distribution $q(\theta_{i+1}|\theta_{i})$ (Gaussian is a common choice)
We calculate the acceptance probability $\alpha(\theta_{i+1}|\theta_{i}) = min(1, \frac{\pi(\theta_{i+1})q(\theta_{i}|\theta_{i+1})}{\pi(\theta_{i})q(\theta_{i+1}|\theta_{i})})$. With randomly accept the candidate based on this probability
Your starting point in MCMC biases what your final distribution looks like
You also need a finite number of steps before the equilibrium is reached
So you need to figure out how many steps you need, and how may of these steps are independent. Typical tools for this are the autocorrelation function and the autocorrelation time
You choice of proposal distribution also affects the burn-in time/convergence of the system
a proposal distro which generates too large of steps might not converge
a distro which generates small steps might take forever to converge