Sufficient Statistic

Recall that a Statistic is a “measurement” of the sample. A statistic is sufficient if we can “recover” the sample distribution without knowing the true parameter. Formally, a statistic $T(X)$ is sufficient for a Statistical Model $\{P_\theta : \theta \in \Theta\}$ if the conditional distribution of the sample $X$ given $T(X)$ does not depend on $\theta$. In symbolic form, we have

$$P_\theta\big(X \in A \mid T(X) = t\big) \text{ does not depend on } \theta.$$

In the language of causal inference, the statistic $T(X)$ “blocks” the causal chain

$$\theta \longrightarrow T(X) \longrightarrow X.$$

Info

  • The intuition is that, if I know $T(X)$ but not $\theta$, I can simulate $X$ as well as1 someone who knows $\theta$.
  • A sufficient statistic is a “lossless” compression of the data that has all the information about the parameter.
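A minimal sketch of this simulation idea (a hypothetical Bernoulli example, not from the notes): for iid Bernoulli($p$) data, $T(X) = \sum_i X_i$ is sufficient, and given $T = t$ the sample is uniform over all 0/1 vectors with $t$ ones — so we can draw from $X \mid T$ with no knowledge of $p$.

```python
import random
from collections import Counter

# iid Bernoulli(p) sample; T(x) = sum(x) is sufficient.
# Conditional on T = t, every 0/1 vector with t ones is equally
# likely -- the conditional law does not involve p at all.
def conditional_draw(n, t, rng):
    """Simulate X | T = t without knowing p."""
    x = [1] * t + [0] * (n - t)
    rng.shuffle(x)
    return tuple(x)

rng = random.Random(0)
n, t = 4, 2
draws = Counter(conditional_draw(n, t, rng) for _ in range(60000))
# All C(4, 2) = 6 arrangements appear with frequency close to 1/6.
print(len(draws))  # → 6
```

Feeding these conditional draws through any downstream computation reproduces the sampling distribution exactly, which is the sense in which $T$ loses nothing.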

Fisher-Neyman Factorization Theorem

Thm

Suppose $X \sim P_\theta$, where the distribution $P_\theta$ has density $p_\theta$ for all $\theta \in \Theta$. Then, a statistic $T$ is sufficient iff

$$p_\theta(x) = g_\theta\big(T(x)\big)\, h(x),$$

for some functions $g_\theta$ and $h$.

  • ❗️ The result also holds for PMFs.

The proof is straightforward:

  • ❗️ $h$ can be a constant. Then $p_\theta(x) = g_\theta(T(x))$, and thus the density depends on $x$ only through $T(x)$, which is just a reformulation of sufficiency. See !todo homework 1.1.

Examples

  • The full sample $T(X) = X$ is always a sufficient statistic.

  • The Order Statistics $\big(X_{(1)}, \dots, X_{(n)}\big)$ is sufficient for any iid model.

  • For iid Gaussian r.v.s $X_1, \dots, X_n \sim \mathcal{N}(\theta, 1)$ with unit variance,

    $$p_\theta(x) = (2\pi)^{-n/2} \exp\Big(-\tfrac{1}{2}\sum_{i=1}^n (x_i - \theta)^2\Big) = \underbrace{\exp\Big(\theta \sum_{i=1}^n x_i - \tfrac{n\theta^2}{2}\Big)}_{g_\theta(T(x))} \cdot \underbrace{(2\pi)^{-n/2} e^{-\sum_i x_i^2 / 2}}_{h(x)}.$$

    Thus, $T(x) = \sum_{i=1}^n x_i$ is sufficient, with $g_\theta(t) = e^{\theta t - n\theta^2/2}$ and $h(x) = (2\pi)^{-n/2} e^{-\|x\|^2/2}$.

  • ❗️ For iid Gaussian r.v.s with a known variance, the Order Statistics is considered “bigger” than $\sum_i X_i$, and the latter is more “compressed”.
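A quick numeric check of the factorization in the unit-variance Gaussian example (a sketch; the particular samples and $\theta$ values below are arbitrary): since $p_\theta(x) = g_\theta(\sum_i x_i)\,h(x)$, the likelihood ratio between any two values of $\theta$ must depend on $x$ only through $\sum_i x_i$.

```python
import math

def log_density(x, theta):
    # joint log-density of an iid N(theta, 1) sample
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (xi - theta) ** 2 for xi in x)

# Two different samples with the same value of T(x) = sum(x) ...
x1 = [0.5, 1.5, -1.0]
x2 = [1.0, 1.0, -1.0]

# ... have identical log-likelihood ratios across any pair of thetas,
# because all theta-dependence enters through g_theta(T(x)).
gaps = []
for theta_a, theta_b in [(0.0, 1.0), (-2.0, 0.7)]:
    r1 = log_density(x1, theta_a) - log_density(x1, theta_b)
    r2 = log_density(x2, theta_a) - log_density(x2, theta_b)
    gaps.append(abs(r1 - r2))
print(max(gaps) < 1e-9)  # → True
```

Algebraically, the ratio is $(\theta_a - \theta_b)\sum_i x_i + n(\theta_b^2 - \theta_a^2)/2$, so equal sums force equal ratios.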

Gaussian Linear Model

For a fixed overdetermined ($n > p$) design matrix $X \in \mathbb{R}^{n \times p}$ of full column rank, the Gaussian linear model is $Y = X\beta + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$ with $\sigma^2$ known. Let $\hat\beta = (X^\top X)^{-1} X^\top Y$. Then, $\hat\beta$ is sufficient for $\beta$.

One way to prove this is to express $Y$ in terms of $\hat\beta$ and show that the remaining part does not depend on $\beta$. We have

$$Y = X\hat\beta + (Y - X\hat\beta) = HY + (I_n - H)Y, \qquad H := X(X^\top X)^{-1} X^\top.$$

Note that $H$ is a projector and $(I_n - H)X = 0$. Thus,

$$(I_n - H)Y = (I_n - H)(X\beta + \varepsilon) = (I_n - H)\varepsilon.$$

Specifically,

$$(I_n - H)\varepsilon \sim \mathcal{N}\big(0, \sigma^2 (I_n - H)\big),$$

which does not involve $\beta$.

Intuitively, the data is high-dimensional ($n$) while the useful information is low-dimensional ($p$). The statistic maps the data, via $X\hat\beta = HY$, to the column space of $X$, which is also of dimension $p$. The remaining part $(I_n - H)Y$ is pure noise, orthogonal to the column space of $X$.
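The two algebraic facts the proof rests on — $H$ is a projector and $(I_n - H)X = 0$, so the residual part of $Y$ is $\beta$-free — can be checked numerically (a sketch with arbitrary dimensions, seed, and $\beta$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.standard_normal((n, p))          # overdetermined design, n > p
H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix: projector onto col(X)

assert np.allclose(H @ H, H)             # idempotent: H is a projector
assert np.allclose((np.eye(n) - H) @ X, 0)  # (I - H) annihilates col(X)

# Hence (I - H) Y = (I - H) eps: the residual carries no trace of beta.
beta = np.array([1.0, -2.0, 0.5])
eps = rng.standard_normal(n)
Y = X @ beta + eps
assert np.allclose((np.eye(n) - H) @ Y, (np.eye(n) - H) @ eps)
```

Changing `beta` leaves `(I - H) @ Y` unchanged, which is exactly the statement that the part of $Y$ not captured by $\hat\beta$ is pure noise.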

See Gaussian Linear Model for more details.

Rmk

The sufficient statistic is not unique.

Rao-Blackwell Theorem

Thm

Suppose the action space $\mathcal{A}$ is convex, the loss function $L(\theta, a)$ is convex in $a$, and $T$ is a sufficient statistic for $\theta$. Then, for any statistical procedure $\delta(X)$, consider

$$\delta'(X) := \mathbb{E}\big[\delta(X) \mid T(X)\big].$$

Sufficiency of $T$ guarantees that $\delta'$ is a legitimate statistic: the conditional expectation can be computed without knowing $\theta$. We have

$$R(\theta, \delta') \le R(\theta, \delta) \quad \text{for all } \theta \in \Theta.$$
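A classic Monte Carlo illustration (a hypothetical Poisson setup, not from the notes): to estimate $e^{-\theta} = P(X_1 = 0)$ from iid Poisson($\theta$) data, start with the crude unbiased $\delta(X) = 1\{X_1 = 0\}$ and condition on the sufficient statistic $T = \sum_i X_i$; given $T = t$, $X_1 \sim \mathrm{Binomial}(t, 1/n)$, so $\delta'(T) = ((n-1)/n)^T$ in closed form.

```python
import math
import random
import statistics

rng = random.Random(1)

def poisson(lam, rng):
    # Knuth's multiplication method; fine for small lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

n, theta, reps = 10, 2.0, 20000
crude, rb = [], []
for _ in range(reps):
    x = [poisson(theta, rng) for _ in range(n)]
    crude.append(1.0 if x[0] == 0 else 0.0)   # delta(X) = 1{X_1 = 0}
    t = sum(x)                                # sufficient statistic T
    rb.append(((n - 1) / n) ** t)             # E[delta | T = t], closed form

# Both are unbiased for exp(-theta); conditioning on T shrinks the
# variance dramatically, as Rao-Blackwell guarantees.
print(statistics.variance(crude) > statistics.variance(rb))  # → True
```

The Rao-Blackwellized version inherits the bias of $\delta$ exactly (tower property) and can only lower the risk.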

Applications

Gaussian Median

Rao-Blackwell theorem gives a better estimator for the median of a Gaussian distribution with known variance than the sample median. Since $\bar{X}$ is sufficient, we can do a symmetric sampling such that the new sample $X'$ satisfies $X' \mid \bar{X} \overset{d}{=} X \mid \bar{X}$. Then, we use the sample median of the new sample, averaged over the resampling, as the estimator.
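In fact, by the symmetry of the Gaussian, $\mathbb{E}[\mathrm{med}(X) \mid \bar{X}] = \bar{X}$, so the Rao-Blackwellized sample median is just the sample mean. A Monte Carlo sketch (hypothetical $n = 9$, $\mu = 1$, unit variance) of the variance gap:

```python
import random
import statistics

rng = random.Random(0)
n, mu, reps = 9, 1.0, 20000   # hypothetical sample size and true median

medians, means = [], []
for _ in range(reps):
    x = [rng.gauss(mu, 1.0) for _ in range(n)]
    medians.append(sorted(x)[n // 2])  # sample median
    means.append(sum(x) / n)           # = E[median(X) | X-bar] by symmetry

# Both estimate mu (the Gaussian mean and median coincide); conditioning
# on the sufficient statistic X-bar strictly reduces the variance.
print(statistics.variance(medians) > statistics.variance(means))  # → True
```

Asymptotically the sample median has variance $\pi/(2n)$ versus $1/n$ for the mean, a factor of about $1.57$.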

Order Should Not Matter

Rao-Blackwell theorem is useful in proving Admissibility. Suppose the loss function $L(\theta, a)$ is strictly convex in $a$. If $\delta(X_1, \dots, X_n)$ is not order-invariant in its arguments, then $\delta$ is not admissible.

To see this, we use the Order Statistics $X_{(\cdot)} = (X_{(1)}, \dots, X_{(n)})$ as the sufficient statistic. Since $L$ is convex, by Rao-Blackwell theorem, we consider

$$\delta'(X) := \mathbb{E}\big[\delta(X) \mid X_{(\cdot)}\big] = \frac{1}{n!} \sum_{\pi \in S_n} \delta\big(X_{\pi(1)}, \dots, X_{\pi(n)}\big),$$

where $S_n$ is the permutation group of $n$ elements and the last equality is because, given the order statistics of an iid sample, all $n!$ orderings are equally likely, so the summation is over all permutations with equal weight. Clearly, $\delta'$ is order-invariant. Further, by the strict convexity of $L$, we have

$$R(\theta, \delta') = \mathbb{E}\, L\Big(\theta, \frac{1}{n!} \sum_{\pi} \delta(X_\pi)\Big) \le \frac{1}{n!} \sum_{\pi} \mathbb{E}\, L\big(\theta, \delta(X_\pi)\big) = \mathbb{E}\, L\big(\theta, \delta(X)\big) = R(\theta, \delta),$$

where the penultimate equality is because $\delta(X_\pi) \overset{d}{=} \delta(X)$ due to iidness, and the strict inequality is by Jensen’s inequality; the equality holds iff $\delta(X_{\pi_1}) = \delta(X_{\pi_2})$ a.s. for any two permutations $\pi_1, \pi_2$, which implies $\delta$ is order-invariant.
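A sketch of this symmetrization (hypothetical weights): averaging an order-dependent linear estimator over all permutations collapses it to the sample mean, with the same bias but strictly smaller MSE.

```python
import itertools
import random
import statistics

rng = random.Random(2)
n, mu, reps = 3, 0.5, 20000

def delta(x):
    # an order-dependent estimator: weights the coordinates unequally
    return 0.6 * x[0] + 0.3 * x[1] + 0.1 * x[2]

def mse(values):
    return statistics.mean((v - mu) ** 2 for v in values)

raw, sym = [], []
for _ in range(reps):
    x = [rng.gauss(mu, 1.0) for _ in range(n)]
    raw.append(delta(x))
    # delta' = average of delta over all n! permutations; since each
    # coordinate gets average weight 1/3, this is the sample mean
    sym.append(statistics.mean(delta(p) for p in itertools.permutations(x)))

# Same expectation (both unbiased for mu), smaller MSE after symmetrizing:
# variance drops from 0.6^2 + 0.3^2 + 0.1^2 = 0.46 to 1/3.
print(mse(raw) > mse(sym))  # → True
```

The permutation average is exactly the conditional expectation given the order statistics, so this is Rao-Blackwell in miniature.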

If the loss function is Mean Squared Error, we have an alternative proof that has a better interpretation. Recall the MSE decomposition:

$$\mathrm{MSE}(\delta) = \mathrm{Bias}(\delta)^2 + \mathrm{Var}(\delta).$$

Note that

$$\mathbb{E}\big[\delta'(X)\big] = \frac{1}{n!} \sum_{\pi} \mathbb{E}\big[\delta(X_\pi)\big] = \mathbb{E}\big[\delta(X)\big],$$

so $\delta$ and $\delta'$ have the same bias.

So we only need to compare the variance:

$$\mathrm{Var}(\delta') = \frac{1}{(n!)^2} \sum_{\pi_1, \pi_2} \mathrm{Cov}\big(\delta(X_{\pi_1}), \delta(X_{\pi_2})\big) \le \frac{1}{(n!)^2} \sum_{\pi_1, \pi_2} \sqrt{\mathrm{Var}\big(\delta(X_{\pi_1})\big)\,\mathrm{Var}\big(\delta(X_{\pi_2})\big)} = \mathrm{Var}(\delta),$$

where the inequality is due to Cauchy-Schwarz Inequality, and the equality holds if and only if

$$\delta(X_{\pi_1}) - \mathbb{E}\big[\delta(X_{\pi_1})\big] = \delta(X_{\pi_2}) - \mathbb{E}\big[\delta(X_{\pi_2})\big] \ \text{a.s.} \iff \delta(X_{\pi_1}) = \delta(X_{\pi_2}) \ \text{a.s.},$$

where the first equivalence is because $\delta(X_{\pi_1}) \overset{d}{=} \delta(X_{\pi_2})$, so the two means coincide.

Footnotes

  1. Equal in distribution.