Motivation: Probability vs. Statistics

What do we want to do? We want to determine some unknown quantity.

Statistics estimates parameters from data and tries to show that the results are close to the true value of the unknown, whereas a probability problem revolves around calculating the actual values. The difference between Bayesian and classical statistics: the Bayesian approach considers the unknown quantity to be a random variable, whereas the classical statistician thinks of it as some constant value.

Important definitions:

Statistical model: In mathematical terms, a statistical model is usually thought of as a pair \((S,\mathcal{P})\), where \(S\) is the set of possible observations, i.e. the sample space, and \(\mathcal{P}\) is a set of probability distributions on \(S\).

The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution induced by the process that generates the observed data. We choose \(\mathcal{P}\) to represent a set (of distributions) which contains a distribution that adequately approximates the true distribution. Note that we do not require that \(\mathcal{P}\) contains the true distribution, and in practice that is rarely the case. Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence will not reflect all of reality", whence the saying "all models are wrong".

The set \(\mathcal{P}\) is almost always parameterized: \(\mathcal{P}=\{P_{\theta}:\theta \in \Theta \}\). The set \(\Theta\) defines the parameters of the model.

The statistical model is nonparametric if the parameter set \(\Theta\) is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if \(d\) is the dimension of \(\Theta\) and \(n\) is the number of samples, both semiparametric and nonparametric models have \(d \rightarrow \infty\) as \(n\rightarrow \infty\). If \(d/n\rightarrow 0\) as \(n\rightarrow \infty\), then the model is semiparametric; otherwise, the model is nonparametric.

Dimension of the model: 1

A priori information: Mathematical modeling problems are often classified into black-box or white-box models, according to how much a priori information on the system is available. A black-box model is a system for which no a priori information is available. A white-box model (also called glass box or clear box) is a system where all necessary information is available. Practically all systems are somewhere between the black-box and white-box models, so this concept is useful only as an intuitive guide for deciding which approach to take.

In black-box models one tries to estimate both the functional form of relations between variables and the numerical parameters in those functions. Using a priori information we could end up, for example, with a set of functions that probably could describe the system adequately. If there is no a priori information, we would try to use functions as general as possible to cover all different models. An often used approach for black-box models is neural networks, which usually do not make assumptions about incoming data.

Model inference and variable inference: A simple example is \(y_i = x_i \theta + W\), where \(W\) is noise. Learning \(\theta\) from known \(x_i\) and \(y_i\) is model inference, and learning \(x_i\) from \(y_i\) when \(\theta\) is known is variable inference. For example, consider a noisy channel: sometimes we want to estimate the system (the attenuation \(\theta\)), and sometimes we want to recover the sound \(x_i\) given \(y_i\).
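The two tasks can be sketched on simulated data. This is only an illustration: the attenuation value, the noise level, the Gaussian noise, and the least-squares rule used for \(\theta\) are all assumptions, not something these notes prescribe.

```python
import random

random.seed(0)

theta_true = 0.7      # attenuation (assumed value for the illustration)
noise_std = 0.05      # standard deviation of the noise W (assumed)

# Simulate y_i = x_i * theta + W for known inputs x_i.
xs = [random.uniform(1.0, 2.0) for _ in range(200)]
ys = [x * theta_true + random.gauss(0.0, noise_std) for x in xs]

# Model inference: x_i and y_i are known, theta is unknown.
# Least-squares estimate for a line through the origin.
theta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Variable inference: theta is known, recover each x_i from y_i.
x_hat = [y / theta_true for y in ys]

max_err = max(abs(a - b) for a, b in zip(xs, x_hat))
print(f"theta_hat = {theta_hat:.3f}")
print(f"largest error in recovered x_i = {max_err:.3f}")
```

Dividing by \(\theta\) is the simplest possible variable-inference rule; it ignores the noise, which is why the recovered \(x_i\) are only approximately right.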

Estimate: We use this term to refer to the numerical value \(\hat{\theta}\) that we choose to report on the basis of the actual observation \(x\). The value of \(\hat{\theta}\) is determined by applying some function \(g\) to the observation \(x\), resulting in \(\hat{\theta} = g(x)\).

Estimator: The random variable \(\hat{\Theta} = g(X)\) is called an estimator, and its realized value equals \(g(x)\) whenever the random variable \(X\) takes the value \(x\).
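The distinction between the two definitions can be made concrete with a small simulation. The choice of the sample mean as \(g\), the Gaussian data, and the sample sizes are assumptions made for the sketch.

```python
import random
import statistics

def g(sample):
    """Estimator rule: the sample mean, as an example choice of g."""
    return sum(sample) / len(sample)

random.seed(1)
theta = 3.0  # true mean (unknown in practice; fixed here to simulate data)

def draw_sample(n=25):
    """Draw n observations from a Gaussian with mean theta and unit variance."""
    return [random.gauss(theta, 1.0) for _ in range(n)]

# One observed data set x gives one number: the estimate g(x).
x = draw_sample()
estimate = g(x)

# Repeating the experiment shows that g(X) is itself random: the estimator.
estimates = [g(draw_sample()) for _ in range(2000)]
print(f"single estimate:   {estimate:.3f}")
print(f"mean of estimator: {statistics.mean(estimates):.3f}")
print(f"std of estimator:  {statistics.stdev(estimates):.3f}")
```

The single number printed first is an estimate; the spread of the 2000 values is a property of the estimator.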

Empirical distribution: This contains the various measurements and data points. Each data point is modeled as a random variable.

True distribution: When we have some idea of the true distribution (e.g. it can be approximated by a linear or polynomial regression), this is called the parametric setting, whereas if we have no idea of the true distribution except that it is some function of \(X\), say \(g(X)\), this is called the non-parametric setting.

Model distribution: After finishing the estimation process, we get some value of the unknown quantity. In the case of a linear model, we get the slope and the intercept of the line that models \(y_i = x_i \theta_1 + \theta_2\).
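As a minimal sketch of obtaining the slope and intercept, the following uses the ordinary least-squares closed form. The function name `fit_line` and the data values are made up for the illustration, and least squares is one possible estimation rule, not necessarily the one intended here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = x * theta1 + theta2; returns (theta1, theta2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx                  # slope
    theta2 = mean_y - theta1 * mean_x   # intercept
    return theta1, theta2

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]          # made-up noisy measurements
theta1, theta2 = fit_line(xs, ys)
print(f"slope = {theta1:.2f}, intercept = {theta2:.2f}")
```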

Point estimate: Point estimation is the attempt to provide the single “best” prediction of some quantity of interest. In general the quantity of interest can be a single parameter or a vector of parameters in some parametric model, such as the weights in our linear regression example. To distinguish estimates of parameters from their true value, our convention will be to denote a point estimate of a parameter \(\theta\) by \(\hat{\theta}\).

Let \(\{x^{(1)}, \ldots, x^{(m)}\}\) be a set of \(m\) independent and identically distributed data points. A point estimator or statistic is any function of the data: $$ \hat{\theta}_m = g(x^{(1)}, \ldots, x^{(m)}). $$

Function Estimation: Sometimes we are interested in performing function estimation (or function approximation). Here, we are trying to predict a variable \(y\) given an input vector \(x\). We assume that there is a function \(f(x)\) that describes the approximate relationship between \(y\) and \(x\). For example, we may assume that \(y=f(x) +\epsilon\), where \(\epsilon\) stands for the part of \(y\) that is not predictable from \(x\). In function estimation, we are interested in approximating \(f\) with a model or estimate \(\hat{f}\). Function estimation is really just the same as estimating a parameter \(\theta\); the function estimator \(\hat{f}\) is simply a point estimator in function space. The linear regression example and the polynomial regression example both illustrate scenarios that may be interpreted as either estimating a parameter \(w\) or estimating a function \(\hat{f}\) mapping from \(x\) to \(y\).

Hypothesis testing: An unknown parameter takes one of a finite number of values, and one wants to find the best hypothesis based on the data, e.g. a binary hypothesis problem or an m-ary hypothesis problem.
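A small sketch of both cases, using a coin whose bias is one of a few candidate values. The candidate values, the toss counts, and the decision rule (pick the hypothesis with the largest likelihood, which treats all hypotheses as equally likely a priori) are assumptions for the illustration.

```python
import math

def log_likelihood(theta, heads, tosses):
    """Log-probability of one particular toss sequence containing `heads` heads."""
    return heads * math.log(theta) + (tosses - heads) * math.log(1.0 - theta)

def best_hypothesis(candidates, heads, tosses):
    """Return the candidate bias with the largest likelihood."""
    return max(candidates, key=lambda th: log_likelihood(th, heads, tosses))

binary = [0.5, 0.8]                 # binary hypothesis problem
m_ary = [0.2, 0.4, 0.6, 0.8]        # m-ary hypothesis problem with m = 4

choice_binary = best_hypothesis(binary, heads=14, tosses=20)
choice_m_ary = best_hypothesis(m_ary, heads=9, tosses=20)
print(choice_binary, choice_m_ary)
```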

Non-parametric: If we have no idea of the true distribution except that it is some function of \(X\), say \(g(X)\), this is called the non-parametric setting.

Different ways to estimate the unknown parameter:

The problems are divided as:

  • MLE (point estimate, mostly used)
  • Bayesian inference (gives probability distribution, has intractability issues)
  • MAP (point estimate that takes advantage of the Bayesian idea of having a prior)
  • The Conditional Expectation estimator (need to check)
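To see how the point estimates in this list differ, here is a sketch for a coin with unknown bias under a Beta prior. The Beta(2, 2) prior and the toss counts are assumptions; the formulas used are the standard mode and mean of a Beta distribution.

```python
def coin_estimates(heads, tosses, a=2.0, b=2.0):
    """Point estimates of a coin's bias under a Beta(a, b) prior (a, b > 1)."""
    mle = heads / tosses
    # With a Beta prior the posterior is Beta(a + heads, b + tosses - heads).
    post_a = a + heads
    post_b = b + tosses - heads
    map_est = (post_a - 1.0) / (post_a + post_b - 2.0)   # posterior mode
    cond_exp = post_a / (post_a + post_b)                # posterior mean
    return mle, map_est, cond_exp

mle, map_est, cond_exp = coin_estimates(heads=7, tosses=10)
print(f"MLE = {mle:.3f}, MAP = {map_est:.3f}, conditional expectation = {cond_exp:.3f}")
```

The MLE ignores the prior, while the MAP and conditional expectation estimates are both pulled toward the prior mean of 0.5. Full Bayesian inference would report the whole Beta posterior rather than any single number.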

Note that in the Bayesian approach it is important to take into account the discrete and continuous cases of the random variable. You can check the Bertsekas book if you have time; otherwise Goodfellow's book has a short and good explanation of all the topics above.


The best explanation is in the deep learning course offered at Oxford, check here. Also check A

  • The parameter(s) \(\theta\) is fixed and unknown
  • Data is generated through the likelihood function \(p(X ;\theta)\) (if discrete) or \(f(X ; \theta)\) (if continuous).
  • Now we will be dealing with multiple candidate models, one for each value of \(\theta\)
  • We will use \(E_\theta[h(X)]\) to denote the expectation of the random variable \(h(X)\) as a function of the parameter \(\theta\).
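The idea of one candidate model per value of \(\theta\) can be sketched by scoring each candidate with its log-likelihood and keeping the best one. The exponential model, the data values, and the grid are assumptions chosen for the illustration.

```python
import math

def log_likelihood(lam, data):
    """Log-likelihood of i.i.d. exponential data with rate lam."""
    return len(data) * math.log(lam) - lam * sum(data)

data = [0.8, 1.9, 0.4, 2.7, 1.2]   # made-up observations

# One candidate model per value of the parameter; pick the most likely one.
grid = [k / 1000 for k in range(1, 5001)]
lam_grid = max(grid, key=lambda lam: log_likelihood(lam, data))

# Closed form for comparison: the rate that sets the derivative to zero.
lam_closed = len(data) / sum(data)
print(f"grid search: {lam_grid:.3f}, closed form: {lam_closed:.3f}")
```

A grid search is used only to make the "many candidate models" picture explicit; in practice the maximizer is usually found analytically or with a numerical optimizer.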

Example: Linear Regression as Maximum Likelihood
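A short sketch of why this works, under assumptions not stated in these notes: take the model \(y_i = x_i \theta + W_i\) with \(m\) data points and noise terms \(W_i\) that are independent and Gaussian with zero mean and known variance \(\sigma^2\). The log-likelihood is then

$$ \log f(y;\theta) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - x_i\theta)^2 $$

The first term does not depend on \(\theta\), so maximizing the likelihood is the same as minimizing the sum of squared errors:

$$ \hat{\theta}_{ML} = \underset{\theta}{\operatorname{argmax}} \log f(y;\theta) = \underset{\theta}{\operatorname{argmin}} \sum_{i=1}^{m}(y_i - x_i\theta)^2 = \frac{\sum_{i=1}^{m} x_i y_i}{\sum_{i=1}^{m} x_i^2} $$

The last equality follows from setting the derivative with respect to \(\theta\) to zero.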

Bayesian inference

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. (It is not a point estimate; instead it gives the full posterior distribution, which is often intractable because of the normalizing denominator.)
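The updating process can be sketched for a finite set of hypotheses, where the denominator is just a sum and everything stays tractable. The three candidate coin biases, the uniform prior, and the toss sequence are assumptions for the illustration.

```python
def update(prior, likelihoods):
    """One application of Bayes' theorem over a finite set of hypotheses."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    evidence = sum(unnormalized)          # the denominator in Bayes' theorem
    return [u / evidence for u in unnormalized]

thetas = [0.2, 0.5, 0.8]                  # hypotheses about a coin's bias
posterior = [1 / 3, 1 / 3, 1 / 3]         # start from a uniform prior

for toss in "HHTHHHTH":                   # made-up evidence
    likelihoods = [th if toss == "H" else 1 - th for th in thetas]
    posterior = update(posterior, likelihoods)
    print(toss, [round(p, 3) for p in posterior])
```

With a continuous, high-dimensional parameter the sum in `update` becomes an integral over all parameter values, which is where the intractability comes from.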

[Image: Important definitions]

[Image: Important formulas]
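For reference, the two standard forms of Bayes' rule are sketched here in the notation used for the MAP estimator in these notes. This is an assumed reconstruction and not necessarily what the original figure showed. When both \(\Theta\) and \(X\) are discrete:

$$ p_{\Theta|X}(\theta|x) = \frac{p_{\Theta}(\theta)\, p_{X|\Theta}(x|\theta)}{\sum_{\theta'} p_{\Theta}(\theta')\, p_{X|\Theta}(x|\theta')} $$

When both are continuous:

$$ f_{\Theta|X}(\theta|x) = \frac{f_{\Theta}(\theta)\, f_{X|\Theta}(x|\theta)}{\int f_{\Theta}(\theta')\, f_{X|\Theta}(x|\theta')\, d\theta'} $$

The mixed cases follow the same pattern, with a PMF or a PDF for each variable and a sum or an integral in the denominator depending on whether \(\Theta\) is discrete or continuous.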

Maximum a Posteriori Probability (MAP) estimator:

One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation. Rather than simply returning to the maximum likelihood estimate, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate. One rational way to do this is to choose the maximum a posteriori (MAP) point estimate. The MAP estimate chooses the point of maximal posterior probability (or maximal probability density in the more common case of continuous \(\theta\)).

Here, having observed \(x\), we choose an estimate \(\hat{\theta}\) that maximizes the posterior distribution over all \(\theta\). When \(\Theta\) is discrete or continuous, we define \(\hat{\theta}\), respectively, as follows: $$ \hat{\theta} = \underset{\theta}{\operatorname{argmax}} p_{\Theta|X}(\theta|x) $$ $$ \hat{\theta} = \underset{\theta}{\operatorname{argmax}} f_{\Theta|X}(\theta|x) $$

If \(\Theta\) is continuous, the actual evaluation of the MAP estimate \(\hat{\theta}\) can sometimes be carried out analytically; for example, if there are no constraints on \(\theta\), by setting to zero the derivative of \(f_{\Theta|X}(\theta|x)\), or of \(\log f_{\Theta|X}(\theta|x)\), and solving for \(\theta\).
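A worked example of the derivative approach, under an assumed model that is not part of these notes: let the prior be \(\Theta \sim N(0,1)\) and the observation be \(X = \Theta + W\) with \(W \sim N(0,1)\) independent of \(\Theta\). Up to a constant \(c\) that does not depend on \(\theta\), the log-posterior is

$$ \log f_{\Theta|X}(\theta|x) = c - \frac{\theta^2}{2} - \frac{(x-\theta)^2}{2} $$

Setting the derivative with respect to \(\theta\) to zero gives

$$ -\theta + (x - \theta) = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{x}{2} $$

The second derivative is \(-2 < 0\), so this stationary point is indeed the maximum. The estimate sits halfway between the prior mean \(0\) and the observation \(x\) because the prior and the noise have equal variance.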


The MAP rule maximizes the overall probability of a correct decision over all decision rules \(g\): $$ P(g(X) = \Theta) \leq P(g_{MAP}(X)=\Theta) $$ Note that this argument is mostly relevant when \(\Theta\) is discrete. If \(\Theta\), when conditioned on \(X = x\), is a continuous random variable, the probability of a correct decision is 0 under any rule.
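The inequality can be checked by brute force on a small discrete problem: enumerate every decision rule \(g\) and compare its probability of a correct decision with that of the MAP rule. The prior and likelihood tables below are made-up numbers for the illustration.

```python
from itertools import product

prior = {0: 0.5, 1: 0.3, 2: 0.2}                 # p_Theta (made-up numbers)
likelihood = {                                   # p_{X|Theta}(x | theta)
    0: {"a": 0.7, "b": 0.2, "c": 0.1},
    1: {"a": 0.2, "b": 0.6, "c": 0.2},
    2: {"a": 0.1, "b": 0.3, "c": 0.6},
}
xs = ["a", "b", "c"]
thetas = list(prior)

def prob_correct(rule):
    """P(g(X) = Theta) for a decision rule given as a dict x -> theta."""
    return sum(prior[rule[x]] * likelihood[rule[x]][x] for x in xs)

# MAP rule: for each x pick the theta maximizing prior * likelihood,
# which is proportional to the posterior.
g_map = {x: max(thetas, key=lambda th: prior[th] * likelihood[th][x]) for x in xs}

# Enumerate every possible rule g (3^3 = 27 of them) and compare.
all_rules = [dict(zip(xs, choice)) for choice in product(thetas, repeat=len(xs))]
best = max(prob_correct(g) for g in all_rules)

print(f"MAP rule: {g_map}, P(correct) = {prob_correct(g_map):.2f}")
print(f"best over all rules: {best:.2f}")
```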

The Conditional Expectation estimator:

Here, we choose the estimate \(\hat{\theta} = E[\Theta | X = x ]\). (In the continuous case the expectation is used because the probability of any individual value of \(\theta\) is zero in a continuous probability space.) Our aim is to get the (\(\theta\), probability) plot, i.e. the posterior probability over the various values of \(\theta\). If the posterior distribution of \(\Theta\) is symmetric around its (conditional) mean and unimodal (i.e., has a single maximum), the maximum occurs at the mean. Then the MAP estimator coincides with the conditional expectation estimator. This is the case, for example, if the posterior distribution is guaranteed to be normal.
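The coincidence of the two estimators for a symmetric, unimodal posterior, and their disagreement otherwise, can be seen on two small posteriors. Discrete posteriors are used only to keep the sketch short, and the probability values are made up.

```python
def map_and_mean(posterior):
    """posterior: dict theta -> probability. Returns (MAP estimate, conditional mean)."""
    theta_map = max(posterior, key=posterior.get)
    theta_mean = sum(th * p for th, p in posterior.items())
    return theta_map, theta_mean

symmetric = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.2, 5: 0.1}    # symmetric, unimodal
skewed = {1: 0.5, 2: 0.25, 3: 0.15, 4: 0.07, 5: 0.03}   # unimodal, not symmetric

sym_map, sym_mean = map_and_mean(symmetric)
skw_map, skw_mean = map_and_mean(skewed)
print(f"symmetric: MAP = {sym_map}, mean = {sym_mean:.2f}")
print(f"skewed:    MAP = {skw_map}, mean = {skw_mean:.2f}")
```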

Some cool problems/derivations:

  • Inference of common mean of normal random variables.
  • Beta priors on the Bias of a coin.
  • Multi parameter problems using sensor network.
  • Bayesian least mean squares (LMS) estimation: Select an estimator (a function of the data) that minimizes the mean squared error between the parameter and its estimate.
  • Bayesian linear least mean squares estimation: Select an estimator that is a linear function of the data and minimizes the mean squared error between the parameter and its estimate.
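As a rough sketch of the last item, the linear LMS estimator has the standard form \(\hat{\theta} = E[\Theta] + \frac{\operatorname{Cov}(\Theta, X)}{\operatorname{Var}(X)}(x - E[X])\). The model below (uniform \(\Theta\), additive Gaussian noise) is an assumption, and the moments are estimated from simulated samples rather than computed analytically.

```python
import random

random.seed(2)

# Assumed model for the illustration: Theta ~ Uniform(0, 1), X = Theta + W.
n = 5000
thetas = [random.uniform(0.0, 1.0) for _ in range(n)]
xs = [th + random.gauss(0.0, 0.5) for th in thetas]

def mean(values):
    return sum(values) / len(values)

m_theta, m_x = mean(thetas), mean(xs)
cov = mean([(t - m_theta) * (x - m_x) for t, x in zip(thetas, xs)])
var_x = mean([(x - m_x) ** 2 for x in xs])

def linear_lms(x):
    """Linear LMS estimator built from the sample moments above."""
    return m_theta + cov / var_x * (x - m_x)

def mse(estimator):
    """Mean squared error of an estimator over the simulated pairs."""
    return mean([(estimator(x) - t) ** 2 for t, x in zip(thetas, xs)])

mse_linear = mse(linear_lms)
mse_raw = mse(lambda x: x)            # report the observation itself
mse_prior = mse(lambda x: m_theta)    # ignore the data, report the mean of Theta
print(f"linear LMS: {mse_linear:.4f}, raw x: {mse_raw:.4f}, prior mean: {mse_prior:.4f}")
```

Both comparison estimators are themselves linear in the data, so on these samples the linear LMS estimator cannot do worse than either of them.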