
ELEC 5650 - Estimation Theory

"We have decided to call the entire field of control and communication theory, whether in the machine or in the animal, by the name Cybernetics, which we form from the Greek ... for steersman."

 -- Norbert Wiener

These are the lecture notes for "ELEC 5650: Networked Sensing, Estimation and Control" in the 2024-25 Spring semester, delivered by Prof. Ling Shi at HKUST. In this session, we explore fundamental concepts and techniques in estimation theory, including maximum a posteriori (MAP) estimation, minimum mean squared error (MMSE) estimation, maximum likelihood (ML) estimation, weighted least squares estimation, and linear minimum mean square error (LMMSE) estimation.

  1. Mathematic Tools
  2. Estimation Theory <--
  3. Kalman Filter
  4. Linear Quadratic Regulator

MAP (Maximum A Posteriori) Estimation

Let $x$ be the parameter to be estimated and $y$ the observation. The MAP estimate maximizes the posterior:

$$
\hat{x} = \arg\max_x
\begin{cases}
f(x \mid y), & x \text{ is continuous} \\
p(x \mid y), & x \text{ is discrete}
\end{cases}
$$
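As a quick illustration (my own toy numbers, not from the lecture), the discrete case reduces to taking the argmax over a posterior table:

```python
import numpy as np

# Hypothetical discrete example: x takes values 0, 1, 2 and the
# posterior p(x | y) has already been computed for the observed y.
x_values = np.array([0, 1, 2])
posterior = np.array([0.2, 0.5, 0.3])   # p(x | y), sums to 1

# MAP estimate: the value of x that maximizes the posterior
x_map = x_values[np.argmax(posterior)]
print("MAP estimate:", x_map)           # -> 1
```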

MMSE (Minimum Mean Squared Error) Estimation

$$
\hat{x} = \arg\min_{\hat{x}} E[e^T e \mid y], \qquad e = x - \hat{x}
$$

$$
\hat{x}_{\mathrm{MMSE}} = E[x \mid y] = \int x f(x \mid y)\, dx \quad \text{or} \quad \sum_x x\, p(x \mid y)
$$

Proof:

$$
\begin{aligned}
E[e^T e \mid y] &= E[(x-\hat{x})^T(x-\hat{x}) \mid y] = E[x^T x \mid y] - 2\hat{x}^T E[x \mid y] + \hat{x}^T \hat{x} \\
\frac{\partial}{\partial \hat{x}}\left( E[x^T x \mid y] - 2\hat{x}^T E[x \mid y] + \hat{x}^T \hat{x} \right) &= -2 E[x \mid y] + 2\hat{x} = 0 \\
\hat{x}_{\mathrm{MMSE}} &= E[x \mid y]
\end{aligned}
$$
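As a sanity check (my own sketch, not part of the notes), for a jointly Gaussian pair the conditional mean is available in closed form, and a Monte Carlo comparison shows it attains a lower empirical MSE than a competing estimator, as the proof predicts. The model $y = x + v$ and the noise levels below are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sigma_x, sigma_n = 1.0, 0.5

# Jointly Gaussian toy model: y = x + noise
x = rng.normal(0.0, sigma_x, n)
y = x + rng.normal(0.0, sigma_n, n)

# Conditional mean E[x | y] for this Gaussian model (closed form)
gain = sigma_x**2 / (sigma_x**2 + sigma_n**2)
x_hat_mmse = gain * y

# A competing estimator that just uses the raw observation
x_hat_naive = y

print("MSE of E[x|y]:", np.mean((x - x_hat_mmse) ** 2))   # ~ 0.20
print("MSE of y     :", np.mean((x - x_hat_naive) ** 2))  # ~ 0.25
```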

ML (Maximum Likelihood) Estimation

ML estimation is non-Bayesian: $p(y \mid x)$ is a conditional probability (with $x$ treated as random), whereas $p(y; x)$ is a probability parameterized by a deterministic unknown $x$; in general $p(y \mid x) \neq p(y; x)$.

Assume we have $n$ measurements $X = (X_1, \dots, X_n)$; we use $p(X;\theta)$ to describe the joint probability of $X$.

$$
\hat{\theta}_n = \arg\max_\theta
\begin{cases}
f(X;\theta), & X \text{ is continuous} \\
p(X;\theta), & X \text{ is discrete}
\end{cases}
$$

If the measurements are independent, the joint probability factorizes:

$$
p(X;\theta) = \prod_{i=1}^{n} p(X_i;\theta), \qquad \log p(X;\theta) = \sum_{i=1}^{n} \log p(X_i;\theta)
$$
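A minimal sketch (my addition) of ML estimation for i.i.d. Gaussian measurements, where the log-likelihood is the sum of per-sample terms; the "true" mean and standard deviation below are assumptions chosen for the demo, and the Gaussian ML solution happens to be available in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed ground truth for the demo: X_i ~ N(mu, sigma^2), i.i.d.
mu_true, sigma_true = 2.0, 1.5
X = rng.normal(mu_true, sigma_true, size=500)

# Log-likelihood of the sample as a function of theta = (mu, sigma)
def log_likelihood(mu, sigma, X):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (X - mu) ** 2 / (2 * sigma**2))

# For the Gaussian case the maximizer is known in closed form:
mu_ml = X.mean()                               # sample mean
sigma_ml = np.sqrt(np.mean((X - mu_ml) ** 2))  # ML (biased) std estimate

print("ML estimates:", mu_ml, sigma_ml)
print("log-likelihood at the ML estimate:", log_likelihood(mu_ml, sigma_ml, X))
```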

MAP & ML

$$
\begin{aligned}
\hat{\theta}_{\mathrm{MAP}} &= \arg\max_\theta p(\theta \mid x) = \arg\max_\theta \frac{p(\theta)\, p(x \mid \theta)}{p(x)} = \arg\max_\theta p(\theta)\, p(x \mid \theta) \\
\hat{\theta}_{\mathrm{ML}} &= \arg\max_\theta p(x;\theta)
\end{aligned}
$$
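To make the contrast concrete, here is a toy sketch (my own, under assumed values) for estimating a Gaussian mean $\theta$ from i.i.d. samples with known noise variance and a Gaussian prior $\theta \sim N(\mu_0, \sigma_0^2)$: the ML estimate is the sample mean, while the MAP estimate shrinks it toward the prior mean.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy model: x_i ~ N(theta, sigma^2) with known sigma,
# and a Gaussian prior theta ~ N(mu0, sigma0^2).
sigma, mu0, sigma0 = 1.0, 0.0, 0.5
theta_true = 1.0
x = rng.normal(theta_true, sigma, size=10)

# ML: maximize p(x; theta) -> sample mean
theta_ml = x.mean()

# MAP: maximize p(theta) p(x | theta); for Gaussians the posterior
# mode equals the posterior mean.
n = len(x)
precision = 1 / sigma0**2 + n / sigma**2
theta_map = (mu0 / sigma0**2 + n * x.mean() / sigma**2) / precision

print("ML :", theta_ml)    # the sample mean
print("MAP:", theta_map)   # pulled toward the prior mean mu0
```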

Weighted Least Squares Estimation

$$
\begin{aligned}
E(x) &= \|Ax - b\|_{\Sigma}^2 = x^T A^T \Sigma^{-1} A x - 2 b^T \Sigma^{-1} A x + b^T \Sigma^{-1} b \\
\frac{\partial E}{\partial x} &= 2 A^T \Sigma^{-1} A x - 2 A^T \Sigma^{-1} b = 0 \\
\hat{x} &= \left(A^T \Sigma^{-1} A\right)^{-1} A^T \Sigma^{-1} b
\end{aligned}
$$
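A minimal numeric sketch (my addition) of the closed-form WLS solution $\hat{x} = (A^T\Sigma^{-1}A)^{-1}A^T\Sigma^{-1}b$; the matrix $A$, the true $x$, and the per-measurement noise variances are arbitrary illustrative values. Solving the normal equations with `np.linalg.solve` avoids explicitly inverting $A^T\Sigma^{-1}A$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative problem: b = A x_true + noise, with per-measurement noise
# variances collected in Sigma (assumed known).
A = rng.normal(size=(6, 2))
x_true = np.array([1.0, -2.0])
noise_vars = np.array([0.1, 0.1, 0.5, 0.5, 1.0, 1.0])
Sigma = np.diag(noise_vars)
b = A @ x_true + rng.normal(0.0, np.sqrt(noise_vars))

# Weighted least squares: solve (A^T Sigma^-1 A) x = A^T Sigma^-1 b
W = np.linalg.inv(Sigma)
x_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
print("WLS estimate:", x_hat)   # close to x_true
```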

LMMSE (Linear Minimum Mean Square Error) Estimation

LMMSE estimation seeks a linear (affine) estimator

$$
\hat{x} = Ky + b
$$

that minimizes the mean squared error:

$$
\begin{aligned}
\mathrm{MSE} &= E[(x-\hat{x})^T(x-\hat{x})] = E[x^T x] - 2E[x^T \hat{x}] + E[\hat{x}^T \hat{x}] \\
&= E[x^T x] - 2E[x^T(Ky+b)] + E[(Ky+b)^T(Ky+b)] \\
\frac{\partial\,\mathrm{MSE}}{\partial b} &= -2E[x] + 2b + 2KE[y] = 0 \;\Rightarrow\; b = \mu_x - K\mu_y \\
\mathrm{MSE} &= E[x^T x] - 2E\!\left[x^T(Ky + \mu_x - K\mu_y)\right] + E\!\left[(Ky + \mu_x - K\mu_y)^T(Ky + \mu_x - K\mu_y)\right] \\
&= E[x^T x] - 2E[x^T K y] - 2\mu_x^T \mu_x + 2\mu_x^T K \mu_y + E[y^T K^T K y] + \mu_x^T \mu_x - \mu_y^T K^T K \mu_y \\
\frac{\partial\,\mathrm{MSE}}{\partial K} &= -2\Sigma_{xy} + 2K\Sigma_{yy} = 0 \;\Rightarrow\; K = \Sigma_{xy}\Sigma_{yy}^{-1} \\
\hat{x} &= Ky + b = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y) \\
\Sigma_{ee} &= E[(x-\hat{x})(x-\hat{x})^T] = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}
\end{aligned}
$$
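A hedged numeric sketch (my addition) of these formulas: the means and covariances below are estimated from samples of an assumed linear-Gaussian toy model $y = Hx + v$, and the estimator $\hat{x} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$ is then applied to every observation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Toy linear model (assumption for the demo): y = H x + v
H = np.array([[1.0, 0.5],
              [0.2, 1.0]])
x = rng.normal(size=(n, 2)) @ np.diag([1.0, 2.0]) + np.array([1.0, -1.0])
v = rng.normal(scale=0.3, size=(n, 2))
y = x @ H.T + v

# Sample means and covariances
mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
Sigma_xy = np.cov(x.T, y.T)[:2, 2:]   # cross-covariance of x and y
Sigma_yy = np.cov(y.T)

# LMMSE estimator: x_hat = mu_x + Sigma_xy Sigma_yy^-1 (y - mu_y)
K = Sigma_xy @ np.linalg.inv(Sigma_yy)
x_hat = mu_x + (y - mu_y) @ K.T

print("empirical MSE:", np.mean(np.sum((x - x_hat) ** 2, axis=1)))
```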

Orthogonality Principle

$$
\begin{aligned}
\langle x - Ky - b,\, y \rangle &= E[(x - Ky - b)\, y^T] = E[x y^T] - K E[y y^T] - b\, E[y^T] \\
&= \Sigma_{xy} + \mu_x \mu_y^T - K\left(\Sigma_{yy} + \mu_y \mu_y^T\right) - (\mu_x - K\mu_y)\mu_y^T \\
&= \Sigma_{xy} - \left(\Sigma_{xy}\Sigma_{yy}^{-1}\right)\Sigma_{yy} = 0 \\
&\Rightarrow\; x - (Ky + b) \perp y
\end{aligned}
$$

This shows that the error $e = x - \hat{x}$ is orthogonal to (uncorrelated with) the observation $y$.
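This can be checked numerically; a small sketch (my own, using an assumed scalar toy model $y = x + v$) verifies that the empirical cross-covariance between the error and the observation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Scalar toy model (assumption for the check): y = x + noise
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 0.5, n)

# Scalar LMMSE estimator x_hat = mu_x + K (y - mu_y)
S = np.cov(x, y)                 # [[Sxx, Sxy], [Syx, Syy]]
K = S[0, 1] / S[1, 1]
e = x - (x.mean() + K * (y - y.mean()))

# Orthogonality principle: the error is uncorrelated with the observation
print("cov(e, y) ~", np.cov(e, y)[0, 1])   # close to 0
```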

Innovation Process

Inverting $\Sigma_{yy}$ is expensive in general, but it becomes easy when $\Sigma_{yy}$ is diagonal. Using the Gram-Schmidt (G.S.) process, we can obtain orthogonal vectors $e_1, \dots, e_k$ and a lower-triangular transform matrix $F$ from $y_1, \dots, y_k$. The key idea of the orthogonal projection is to decompose the observation $y_k$ into a part that can be predicted from the past observations $y_1, \dots, y_{k-1}$ and a new part that is uncorrelated with them (the innovation).

$$
e = Fy
$$

Then the covariance can be calculated by

$$
\Sigma_{ee} = F\Sigma_{yy}F^T, \qquad \Sigma_{ex} = F\Sigma_{yx}
$$

$$
K_e = \Sigma_{ex}^T\,\Sigma_{ee}^{-1} = \Sigma_{xy}F^T\,(F^T)^{-1}\Sigma_{yy}^{-1}F^{-1} = \Sigma_{xy}\Sigma_{yy}^{-1}F^{-1} = K F^{-1}
$$

Although $K_e$ is not equal to $K$, it serves as the Kalman gain in the transformed (projected) space defined by the matrix $F$.
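A sketch (my addition) of the whitening idea: for an assumed $\Sigma_{yy}$, a lower-triangular $F$ obtained here from a Cholesky factorization, which matches the Gram-Schmidt construction up to scaling, makes $\Sigma_{ee} = F\Sigma_{yy}F^T$ the identity, so "inverting" it becomes trivial.

```python
import numpy as np

# Example covariance of the stacked measurements y_1..y_k (assumed values)
Sigma_yy = np.array([[2.0, 0.8, 0.3],
                     [0.8, 1.5, 0.4],
                     [0.3, 0.4, 1.0]])

# Cholesky factor: Sigma_yy = L L^T, with L lower triangular
L = np.linalg.cholesky(Sigma_yy)

# Lower-triangular whitening transform, playing the role of F (e = F y)
F = np.linalg.inv(L)

Sigma_ee = F @ Sigma_yy @ F.T
print(np.round(Sigma_ee, 10))   # identity: innovations are uncorrelated
```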

For a newly arriving measurement $y_{k+1}$, we can find $e_{k+1}$ by the G.S. process:

$$
e_{k+1} = y_{k+1} - \hat{y}_{k+1|k} = y_{k+1} - \mathrm{proj}(y_{k+1};\, \mathcal{E}_k) = y_{k+1} - \sum_{i=1}^{k} \frac{\langle y_{k+1}, e_i \rangle}{\langle e_i, e_i \rangle}\, e_i
$$

where $\mathcal{E}_k$ denotes the innovations $e_1, \dots, e_k$ obtained so far.

It satisfies

$$
\langle e_{k+1}, y_i \rangle = E[e_{k+1} y_i^T] = 0, \qquad i = 1, \dots, k
$$

To estimate $x$ at time $k+1$:

$$
\begin{aligned}
\hat{x}_{k+1} &= \mathrm{proj}(x_{k+1};\, \mathcal{E}_{k+1}) = \sum_{i=1}^{k+1} \frac{\langle x_{k+1}, e_i \rangle}{\langle e_i, e_i \rangle}\, e_i \\
&= \sum_{i=1}^{k} \frac{\langle x_{k+1}, e_i \rangle}{\langle e_i, e_i \rangle}\, e_i + \frac{\langle x_{k+1}, e_{k+1} \rangle}{\langle e_{k+1}, e_{k+1} \rangle}\, e_{k+1} \\
&= \sum_{i=1}^{k} \frac{\langle x_k, e_i \rangle}{\langle e_i, e_i \rangle}\, e_i + \frac{\langle x_{k+1}, e_{k+1} \rangle}{\langle e_{k+1}, e_{k+1} \rangle}\, e_{k+1} \qquad (\text{for a static state, } x_{k+1} = x_k) \\
&= \hat{x}_k + \frac{\langle x_{k+1}, e_{k+1} \rangle}{\langle e_{k+1}, e_{k+1} \rangle}\, e_{k+1}
\end{aligned}
$$
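A small sketch (my own, assuming a static scalar state $x_{k+1} = x_k$ with a zero-mean Gaussian prior and i.i.d. scalar measurement noise) of this recursion: each new innovation $e_{k+1}$ adds one projection term to the previous estimate, with $\langle x, e_{k+1} \rangle = p_k$ and $\langle e_{k+1}, e_{k+1} \rangle = p_k + r$ for this model.

```python
import numpy as np

rng = np.random.default_rng(6)

# Assumed toy setup: static scalar state x with prior variance p0,
# measured repeatedly as y_k = x + v_k, v_k ~ N(0, r).
p0, r = 4.0, 0.5
x_true = rng.normal(0.0, np.sqrt(p0))
ys = x_true + rng.normal(0.0, np.sqrt(r), size=20)

x_hat, p = 0.0, p0          # prior mean and prior (error) variance
for y in ys:
    e = y - x_hat           # innovation: the new information in y
    gain = p / (p + r)      # <x, e> / <e, e> for this model
    x_hat = x_hat + gain * e
    p = p - gain * p        # updated error variance

print("true x:", x_true, " recursive estimate:", x_hat)
```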