M-Estimator
For an Estimation task, one can have different objectives (or metrics). For example, MLE maximizes the likelihood of the data, Least Squares minimizes the squared error, and Ridge Regression minimizes the squared error plus a penalty term.
M-estimators generalize the above ideas by using a general objective function $m_\theta(x)$, indexed by the parameter $\theta \in \Theta$.
Then, we can define the empirical error
$$M_n(\theta) = \frac{1}{n}\sum_{i=1}^n m_\theta(X_i),$$
a.k.a. the criterion function. And an M-Estimator minimizes the empirical error:
$$\hat\theta_n = \arg\min_{\theta \in \Theta} M_n(\theta).$$
- 💡 We can see that M-estimators and Empirical Risk Minimization have the same formulation, one in the context of Estimation and the other in the context of Prediction/Supervised Learning.
Examples
We only need to specify the function $m_\theta$ to get different M-estimators (a small sketch follows the list below).
- Maximum Likelihood Estimation: $m_\theta(x) = -\log p_\theta(x)$.
- Ordinary Least Squares: $m_\theta(x, y) = (y - x^\top \theta)^2$.
- Ridge Regression: $m_\theta(x, y) = (y - x^\top \theta)^2 + \lambda \lVert\theta\rVert_2^2$.
- Median regression: $m_\theta(x, y) = |y - x^\top \theta|$.
- Quantile regression: $m_\theta(x, y) = \rho_\tau(y - x^\top \theta)$, where $\rho_\tau(u) = (\tau - \mathbb{1}\{u < 0\})\, u$.
    - This loss $\rho_\tau$ is also known as the pinball loss.
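As a quick illustration (a minimal sketch, not from the original note), the snippet below plugs three of the losses above into a single generic minimization routine; the helper name `m_estimate` and the choice of `scipy.optimize.minimize_scalar` are illustrative, and the regression losses are used in their intercept-only (location) form.

```python
# Sketch: one routine, different losses m_theta -> different M-estimators.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=2_000)   # location model with theta_0 = 2

def m_estimate(loss, data):
    """Minimize the empirical error M_n(theta) = (1/n) sum_i loss(x_i, theta)."""
    return minimize_scalar(lambda theta: np.mean(loss(data, theta))).x

square   = lambda xs, t: (xs - t) ** 2                         # squared loss  -> sample mean
absolute = lambda xs, t: np.abs(xs - t)                        # absolute loss -> sample median
pinball  = lambda xs, t, tau=0.9: (xs - t) * (tau - (xs < t))  # pinball loss  -> 0.9-quantile

print("squared loss :", m_estimate(square, x))
print("absolute loss:", m_estimate(absolute, x))
print("pinball loss :", m_estimate(pinball, x))
```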
Properties
In this section, we denote $M(\theta) = \mathbb{E}[m_\theta(X)]$, $M_n(\theta) = \frac{1}{n}\sum_{i=1}^n m_\theta(X_i)$, and $\theta_0 = \arg\min_{\theta \in \Theta} M(\theta)$.
Consistency
By LLN, we know that for every fixed $\theta$,
$$M_n(\theta) \xrightarrow{p} M(\theta).$$
We wonder under what conditions the following is also true:
$$\hat\theta_n \xrightarrow{p} \theta_0.$$
In the graphs of $M_n$ and $M$ (see e.g. ^fig-m-graph), the question asks what conditions allow us to transfer convergence on the y-axis (of the criterion values) to convergence on the x-axis (of the minimizers).
It turns out that the above consistency holds under two reasonable conditions:
- Uniform convergence: $\sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \xrightarrow{p} 0$.
    - Some sufficient conditions for uniform convergence:
        - Finite VC dimension.
        - Finite Rademacher Complexity or Gaussian Complexity.
        - Compact $\Theta$, $\theta \mapsto m_\theta(x)$ continuous for every $x$, and $\mathbb{E}\big[\sup_{\theta \in \Theta} |m_\theta(X)|\big] < \infty$.
- Separation: For any $\varepsilon > 0$,
  $$\inf_{\theta:\, d(\theta, \theta_0) \ge \varepsilon} M(\theta) > M(\theta_0).$$
    - 💡 In words, this condition says that only parameters close to $\theta_0$ may yield a value of $M(\theta)$ close to the minimum $M(\theta_0)$.

Proof of Consistency
By the definition of $\hat\theta_n$, we have the critical inequality:
$$M_n(\hat\theta_n) \le M_n(\theta_0).$$
Thus, we have
$$M(\hat\theta_n) - M(\theta_0) = \big[M(\hat\theta_n) - M_n(\hat\theta_n)\big] + \big[M_n(\hat\theta_n) - M_n(\theta_0)\big] + \big[M_n(\theta_0) - M(\theta_0)\big] \le 2 \sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)|.$$
By the uniform convergence condition,
$$M(\hat\theta_n) - M(\theta_0) \xrightarrow{p} 0.$$
Then, since no parameter far from $\theta_0$ can yield a value of $M$ close to $M(\theta_0)$ by the separation condition, we know $\hat\theta_n \xrightarrow{p} \theta_0$. More specifically, for any $\varepsilon > 0$, we have
$$\mathbb{P}\big(d(\hat\theta_n, \theta_0) \ge \varepsilon\big) \le \mathbb{P}\Big(M(\hat\theta_n) - M(\theta_0) \ge \inf_{\theta:\, d(\theta, \theta_0) \ge \varepsilon} M(\theta) - M(\theta_0)\Big).$$
Then, since $M(\hat\theta_n) - M(\theta_0) \xrightarrow{p} 0$ and the separation gap $\inf_{\theta:\, d(\theta, \theta_0) \ge \varepsilon} M(\theta) - M(\theta_0)$ is strictly positive, the right-hand side tends to $0$. By the arbitrariness of $\varepsilon$, we get the result.
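The proof can be illustrated numerically; the following sketch (my own, under ad hoc choices: squared loss, data from $\mathcal{N}(\theta_0, 1)$, and a compact grid for $\Theta$ so that $M(\theta) = 1 + (\theta - \theta_0)^2$ has a closed form) shows the y-axis quantity $\sup_\theta |M_n(\theta) - M(\theta)|$ and the x-axis quantity $|\hat\theta_n - \theta_0|$ shrinking together as $n$ grows.

```python
# Sketch: uniform convergence of M_n to M and consistency of the minimizer.
import numpy as np

rng = np.random.default_rng(1)
theta_0 = 2.0
grid = np.linspace(-5.0, 5.0, 10_001)        # compact parameter set Theta
M = 1.0 + (grid - theta_0) ** 2              # population error E[(X - theta)^2], X ~ N(theta_0, 1)

for n in [100, 1_000, 10_000, 100_000]:
    x = rng.normal(theta_0, 1.0, size=n)
    mean_x, mean_x2 = x.mean(), (x ** 2).mean()
    M_n = mean_x2 - 2.0 * grid * mean_x + grid ** 2   # empirical error (1/n) sum_i (x_i - theta)^2
    theta_hat = grid[np.argmin(M_n)]                  # M-estimator restricted to the grid
    print(f"n={n:>6}  sup|M_n - M| = {np.abs(M_n - M).max():.4f}  "
          f"|theta_hat - theta_0| = {abs(theta_hat - theta_0):.4f}")
```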
Consistency of Approximate Estimator
As we can see in the proof, the crucial step is the critical inequality $M_n(\hat\theta_n) \le M_n(\theta_0)$. Therefore, we can replace $\hat\theta_n$ by any approximate estimator $\tilde\theta_n$ (an approximate minimizer of $M_n$), and it is consistent as long as $M_n(\tilde\theta_n) \le M_n(\theta_0) + o_P(1)$ holds, which is implied by
$$M_n(\tilde\theta_n) \le \inf_{\theta \in \Theta} M_n(\theta) + o_P(1).$$
Asymptotic Normality
Under some regularity conditions:
- $\theta \mapsto m_\theta(x)$ is differentiable at $\theta_0$, $P$-a.s. (for almost every $x$), with derivative $\dot m_{\theta_0}(x)$;
- There exists an $L^2(P)$ function $\dot m$ such that $|m_{\theta_1}(x) - m_{\theta_2}(x)| \le \dot m(x)\, \lVert\theta_1 - \theta_2\rVert$ for all $\theta_1, \theta_2$ in a neighborhood of $\theta_0$;
- $M(\theta)$ is twice differentiable with a non-singular second derivative at $\theta_0$, denoted as $V_{\theta_0}$;
and assuming $\hat\theta_n \xrightarrow{p} \theta_0$ (e.g., via the consistency result above), we have
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}\Big(0,\; V_{\theta_0}^{-1}\, \mathbb{E}\big[\dot m_{\theta_0}(X)\, \dot m_{\theta_0}(X)^\top\big]\, V_{\theta_0}^{-1}\Big).$$
- ❗️ Note that we do not require $m_\theta(x)$ to be twice differentiable. The twice differentiability of $M$ is a weaker requirement, since the expectation can smooth out kinks in $m_\theta$.
Proof Sketch
In the proof sketch, we argue informally, ignoring remainder terms, and assume that the expectation and derivatives commute. By Taylor expansion,
$$M_n(\theta) \approx M_n(\theta_0) + \dot M_n(\theta_0)^\top (\theta - \theta_0) + \tfrac{1}{2} (\theta - \theta_0)^\top \ddot M_n(\theta_0)\, (\theta - \theta_0).$$
Denote the RHS as $\tilde M_n(\theta)$. Then, we have
$$\nabla_\theta \tilde M_n(\theta) = \dot M_n(\theta_0) + \ddot M_n(\theta_0)\, (\theta - \theta_0).$$
By setting
$$\nabla_\theta \tilde M_n(\hat\theta_n) = 0,$$
we get
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \approx -\ddot M_n(\theta_0)^{-1} \cdot \sqrt{n}\, \dot M_n(\theta_0).$$
By the LLN, $\ddot M_n(\theta_0) \xrightarrow{p} \ddot M(\theta_0) = V_{\theta_0}$; by the CLT (using $\mathbb{E}[\dot m_{\theta_0}(X)] = \dot M(\theta_0) = 0$), $\sqrt{n}\, \dot M_n(\theta_0) \xrightarrow{d} \mathcal{N}\big(0, \mathbb{E}[\dot m_{\theta_0}(X)\, \dot m_{\theta_0}(X)^\top]\big)$. Thus, the asymptotic variance is approximately
$$V_{\theta_0}^{-1}\, \mathbb{E}\big[\dot m_{\theta_0}(X)\, \dot m_{\theta_0}(X)^\top\big]\, V_{\theta_0}^{-1}.$$
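The sandwich formula can be checked by a rough Monte Carlo experiment; the sketch below (my own, not part of the original derivation) uses the absolute loss on $\mathcal{N}(0, 1)$ data, where $m$ is not twice differentiable but $M$ is, with $V_{\theta_0} = 2 f(0)$ and $\mathbb{E}[\dot m_{\theta_0}(X)^2] = 1$, so the predicted asymptotic variance is $1/(4 f(0)^2) = \pi/2 \approx 1.571$.

```python
# Sketch: Monte Carlo check of the sandwich variance for the absolute-loss (median) estimator.
import numpy as np

rng = np.random.default_rng(2)
n, reps, theta_0 = 2_000, 5_000, 0.0
samples = rng.normal(theta_0, 1.0, size=(reps, n))
theta_hat = np.median(samples, axis=1)     # minimizer of (1/n) sum_i |x_i - theta|

empirical = n * np.var(theta_hat)          # variance of sqrt(n) (theta_hat - theta_0)
predicted = np.pi / 2                      # V^{-1} E[m_dot^2] V^{-1} = 1 / (4 f(0)^2)
print(f"empirical: {empirical:.3f}   sandwich prediction: {predicted:.3f}")
```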
Asymptotic Normality of Quantile Regression
Treating quantile regression as an M-estimator, we verify the above conditions and establish its asymptotic normality; for simplicity, we consider the intercept-only case, i.e., estimating the $\tau$-th quantile of a scalar $X$. Recall the pinball loss function:
$$m_\theta(x) = \rho_\tau(x - \theta) = (x - \theta)\big(\tau - \mathbb{1}\{x < \theta\}\big).$$
For the first condition, $\frac{\partial}{\partial \theta} m_\theta(x)$ exists if $x \ne \theta$. For any continuous distribution, $\mathbb{P}(X = \theta_0) = 0$, so the first condition holds, and
$$\dot m_{\theta_0}(x) = \mathbb{1}\{x < \theta_0\} - \tau.$$
The second condition also holds: the pinball loss is Lipschitz in $\theta$,
$$|m_{\theta_1}(x) - m_{\theta_2}(x)| \le \max(\tau, 1 - \tau)\, |\theta_1 - \theta_2|,$$
so we can take $\dot m(x) \equiv 1$.
For the third condition, we have
$$M(\theta) = \mathbb{E}[m_\theta(X)] = \tau\, \mathbb{E}\big[(X - \theta)\mathbb{1}\{X > \theta\}\big] + (1 - \tau)\, \mathbb{E}\big[(\theta - X)\mathbb{1}\{X < \theta\}\big].$$
By integration by parts,
$$M(\theta) = \tau \int_\theta^\infty \big(1 - F(x)\big)\, \mathrm{d}x + (1 - \tau) \int_{-\infty}^\theta F(x)\, \mathrm{d}x,$$
where $F$ is the CDF of $X$. Thus,
$$M'(\theta) = F(\theta) - \tau, \qquad M''(\theta) = f(\theta),$$
where $f$ is the density of $X$.
We need $f(\theta_0) > 0$, so that the second derivative $V_{\theta_0} = f(\theta_0)$ is non-singular.
Now we calculate the asymptotic variance. Note that our convergence point $\theta_0$ is the $\tau$-th quantile, i.e., $F(\theta_0) = \tau$. Plugging it in gives
$$\mathbb{E}\big[\dot m_{\theta_0}(X)^2\big] = \mathbb{E}\big[(\mathbb{1}\{X < \theta_0\} - \tau)^2\big] = F(\theta_0)(1 - 2\tau) + \tau^2 = \tau(1 - \tau).$$
Finally, we get
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}\left(0,\; \frac{\tau(1 - \tau)}{f(\theta_0)^2}\right).$$
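As a sanity check (a sketch of my own, not from the original), take $X \sim \mathrm{Exp}(1)$: at the $\tau$-th quantile, $f(\theta_0) = 1 - \tau$, so the asymptotic variance should be $\tau(1 - \tau)/f(\theta_0)^2 = \tau/(1 - \tau)$. The simulation below compares this with the empirical variance of the sample quantile, used as a stand-in for the exact pinball-loss minimizer.

```python
# Sketch: Monte Carlo check of the asymptotic variance of the tau-th sample quantile.
import numpy as np

rng = np.random.default_rng(3)
tau, n, reps = 0.9, 2_000, 5_000
theta_0 = -np.log(1.0 - tau)                       # tau-th quantile of Exp(1)

samples = rng.exponential(scale=1.0, size=(reps, n))
theta_hat = np.quantile(samples, tau, axis=1)      # ~ minimizer of the empirical pinball loss
empirical = n * np.var(theta_hat)                  # variance of sqrt(n) (theta_hat - theta_0)
predicted = tau / (1.0 - tau)                      # tau (1 - tau) / f(theta_0)^2 with f(theta_0) = 1 - tau
print(f"empirical: {empirical:.2f}   predicted: {predicted:.2f}")
```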
Z-Estimator
M-estimators further give rise to Z-Estimators. In many ways, Z-estimators generalize both M-estimators and Moment Estimators, because they solve for the zero point of a system of estimating equations $\frac{1}{n}\sum_{i=1}^n \psi_\theta(X_i) = 0$. When the function $m_\theta$ is differentiable, taking $\psi_\theta = \dot m_\theta$ means the zero point of the gradient of the criterion function (typically) coincides with the M-estimator. Moment estimators are recovered by taking $\psi_\theta$ to be the moment equations.
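For concreteness, here is a minimal sketch (illustrative, not a prescribed implementation): taking $\psi_\theta(x) = \theta - x$, which is proportional to the gradient of the squared loss, the Z-estimator solving $\frac{1}{n}\sum_i \psi_\theta(X_i) = 0$ is exactly the sample mean; `scipy.optimize.brentq` is just one convenient root finder for the scalar case.

```python
# Sketch: a Z-estimator as the root of an estimating equation Psi_n(theta) = 0.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(4)
x = rng.normal(2.0, 1.0, size=1_000)

psi_bar = lambda theta: np.mean(theta - x)          # Psi_n(theta) = (1/n) sum_i psi_theta(x_i)
theta_hat = brentq(psi_bar, a=x.min(), b=x.max())   # zero point on a bracketing interval
print(theta_hat, x.mean())                          # coincide up to solver tolerance
```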