M-Estimator

For an estimation task, one can have different objectives (or metrics). For example, MLE maximizes the likelihood of the data, Least Squares minimizes the sum of squared errors, and Ridge Regression minimizes the squared error plus a penalty term.

M-estimators generalize the above ideas by using a general objective function $m_\theta(x)$, indexed by the parameter $\theta \in \Theta$ and evaluated at a data point $x$.

Then, given i.i.d. data $X_1, \dots, X_n$, we can define the empirical error $M_n(\theta) = \frac{1}{n}\sum_{i=1}^n m_\theta(X_i)$, a.k.a. the criterion function. An M-estimator minimizes the empirical error:
$$\hat\theta_n = \arg\min_{\theta \in \Theta} M_n(\theta).$$
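To make the definition concrete, here is a minimal numerical sketch (my own illustration, not from these notes), assuming the absolute-value loss $m_\theta(x) = |x - \theta|$, whose M-estimator is the sample median:

```python
# Minimal sketch: an M-estimator as the numerical minimizer of the
# empirical criterion M_n(theta) = (1/n) * sum_i m_theta(X_i).
# Assumed example loss: m_theta(x) = |x - theta|, whose minimizer is the sample median.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)  # i.i.d. sample X_1, ..., X_n

def M_n(theta):
    """Empirical criterion M_n(theta) for the absolute-value loss."""
    return np.mean(np.abs(x - theta))

theta_hat = minimize_scalar(M_n).x  # the M-estimator: argmin of M_n
print(theta_hat, np.median(x))      # the two should be (nearly) equal
```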

Examples

We only need to specify the function $m_\theta$ to get different M-estimators.
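For instance (a sketch reconstructing the examples mentioned above; the regression notation $(x, y)$ and the model density $p_\theta$ are my assumptions):
$$
\begin{aligned}
\text{MLE:} \quad & m_\theta(x) = -\log p_\theta(x), \\
\text{Least Squares:} \quad & m_\theta(x, y) = (y - x^\top \theta)^2, \\
\text{Ridge Regression:} \quad & m_\theta(x, y) = (y - x^\top \theta)^2 + \lambda \|\theta\|_2^2.
\end{aligned}
$$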

Properties

In this section, we denote the population criterion by $M(\theta) = \mathbb{E}[m_\theta(X)]$, and its minimizer (the true parameter) by $\theta_0 = \arg\min_{\theta \in \Theta} M(\theta)$; the M-estimator is $\hat\theta_n = \arg\min_{\theta \in \Theta} M_n(\theta)$.

Consistency

By the LLN, we know that for each fixed $\theta$,
$$M_n(\theta) \xrightarrow{p} M(\theta).$$

We wonder under what conditions the following is also true:
$$\hat\theta_n \xrightarrow{p} \theta_0.$$

In the graphs of $M_n$ and $M$ (see e.g. ^fig-m-graph), the question asks what conditions allow us to transfer convergence on the y-axis (of the criterion values) to convergence on the x-axis (of the minimizers).

It turns out that the above consistency holds under two reasonable conditions:

  • Uniform convergence: $\sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \xrightarrow{p} 0$.

  • Separation: For any $\varepsilon > 0$, $\inf_{\theta:\, d(\theta, \theta_0) \ge \varepsilon} M(\theta) > M(\theta_0)$.

    • 💡 In words, this condition says that only parameters close to $\theta_0$ may yield a value of $M(\theta)$ close to the minimum $M(\theta_0)$.
Example of M function whose point of maximum is not well separated.
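For a concrete instance of this failure mode (my own illustrative example, stated for a minimum since these notes use the minimization convention): take
$$M(\theta) = \frac{\theta^2}{1 + \theta^4}, \qquad \theta \in \mathbb{R}.$$
Here $\theta_0 = 0$ is the unique minimizer with $M(\theta_0) = 0$, but $M(\theta) \to 0$ as $|\theta| \to \infty$, so $\inf_{|\theta - \theta_0| \ge \varepsilon} M(\theta) = 0 = M(\theta_0)$ and the separation condition fails: a near-minimizer of $M_n$ can drift off to infinity.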

Proof of Consistency

By the definition of $\hat\theta_n$ as the minimizer of $M_n$, we have the critical inequality:
$$M_n(\hat\theta_n) \le M_n(\theta_0).$$

Thus, we have
$$M(\hat\theta_n) - M(\theta_0) = \big(M(\hat\theta_n) - M_n(\hat\theta_n)\big) + \big(M_n(\hat\theta_n) - M_n(\theta_0)\big) + \big(M_n(\theta_0) - M(\theta_0)\big) \le 2 \sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)|,$$
where the middle term is non-positive by the critical inequality.

By the uniform convergence condition,
$$M(\hat\theta_n) - M(\theta_0) \xrightarrow{p} 0.$$

Then, since no other parameter can yield a value of $M$ close to $M(\theta_0)$ by the separation condition, we know $\hat\theta_n \xrightarrow{p} \theta_0$. More specifically, for any $\varepsilon > 0$, define $\delta_\varepsilon := \inf_{d(\theta, \theta_0) \ge \varepsilon} M(\theta) - M(\theta_0) > 0$; then we have
$$\{d(\hat\theta_n, \theta_0) \ge \varepsilon\} \subseteq \{M(\hat\theta_n) - M(\theta_0) \ge \delta_\varepsilon\}.$$

Then, since $\mathbb{P}\big(M(\hat\theta_n) - M(\theta_0) \ge \delta_\varepsilon\big) \to 0$, we know that $\mathbb{P}\big(d(\hat\theta_n, \theta_0) \ge \varepsilon\big) \to 0$. By the arbitrariness of $\varepsilon$, we get the result.

Consistency of Approximate Estimator

As we can see in the proof, the crucial step is the critical inequality $M_n(\hat\theta_n) \le M_n(\theta_0)$. Therefore, we can replace $\hat\theta_n$ by any approximate estimator $\tilde\theta_n$ (an approximate minimizer of $M_n$), and it is consistent as long as $M_n(\tilde\theta_n) \le M_n(\theta_0) + o_P(1)$ holds, or equivalently (under the uniform convergence condition),
$$M_n(\tilde\theta_n) \le \inf_{\theta \in \Theta} M_n(\theta) + o_P(1).$$
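One way to spell out why this suffices (following the same steps as the proof above): the chain of inequalities becomes
$$M(\tilde\theta_n) - M(\theta_0) \le 2 \sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| + \big(M_n(\tilde\theta_n) - M_n(\theta_0)\big) \le 2 \sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| + o_P(1) \xrightarrow{p} 0,$$
and the separation argument then applies verbatim to $\tilde\theta_n$.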

Asymptotic Normality

Under some regularity conditions:

  • $\theta \mapsto m_\theta(x)$ is differentiable at $\theta_0$ for $P$-almost every $x$ (i.e. for almost every realization of the data), with derivative $\dot m_{\theta_0}(x)$;
  • There exists an $L_2(P)$ function $\dot m$, such that $|m_{\theta_1}(x) - m_{\theta_2}(x)| \le \dot m(x)\,\|\theta_1 - \theta_2\|$ for all $\theta_1, \theta_2$ in a neighborhood of $\theta_0$;
  • $M(\theta) = \mathbb{E}[m_\theta(X)]$ is twice differentiable with a non-singular second derivative (Hessian) at $\theta_0$, denoted as $V_{\theta_0}$;

we have
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}\Big(0,\; V_{\theta_0}^{-1}\,\mathbb{E}\big[\dot m_{\theta_0}(X)\,\dot m_{\theta_0}(X)^\top\big]\,V_{\theta_0}^{-1}\Big).$$

  • ❗️ Note that we do not require that $m_\theta$ is twice differentiable. The twice differentiability of $M$ is a weaker requirement, since taking the expectation can smooth out non-smoothness in $m_\theta$; see the example below.
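As an illustration of this point (a standard example, proportional to the $\tau = \tfrac{1}{2}$ pinball loss used below): the absolute loss $m_\theta(x) = |x - \theta|$ is not differentiable at $\theta = x$, let alone twice differentiable, yet if $X$ has a density $f$ and a finite mean, then
$$M(\theta) = \mathbb{E}|X - \theta|, \qquad M'(\theta) = 2F(\theta) - 1, \qquad M''(\theta) = 2 f(\theta),$$
so $M$ is smooth wherever $f$ is continuous.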

Proof Sketch

In the proof sketch, we neglect the difference between the empirical Hessian and its expectation $V_{\theta_0}$, and assume that expectation and derivatives commute (so that $\mathbb{E}[\dot m_{\theta_0}(X)] = \dot M(\theta_0) = 0$). By Taylor expansion around $\theta_0$,
$$M_n(\theta) \approx M_n(\theta_0) + \dot M_n(\theta_0)^\top (\theta - \theta_0) + \tfrac{1}{2}(\theta - \theta_0)^\top V_{\theta_0} (\theta - \theta_0).$$

Denote the RHS as $f(\theta)$. Then, we have
$$\nabla f(\theta) = \dot M_n(\theta_0) + V_{\theta_0}(\theta - \theta_0).$$

By setting
$$\nabla f(\hat\theta_n) = 0,$$

we get
$$\hat\theta_n - \theta_0 \approx -V_{\theta_0}^{-1}\,\dot M_n(\theta_0) = -V_{\theta_0}^{-1}\,\frac{1}{n}\sum_{i=1}^n \dot m_{\theta_0}(X_i).$$

Thus, since $\mathbb{E}[\dot m_{\theta_0}(X)] = 0$, the CLT applies to the sum, and the asymptotic variance is approximately
$$\operatorname{Var}(\hat\theta_n) \approx \frac{1}{n}\, V_{\theta_0}^{-1}\,\mathbb{E}\big[\dot m_{\theta_0}(X)\,\dot m_{\theta_0}(X)^\top\big]\,V_{\theta_0}^{-1},$$
which matches the sandwich form in the statement above.
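As a sanity check on the sandwich formula (a standard fact, not derived in these notes): for MLE with $m_\theta(x) = -\log p_\theta(x)$, under the usual regularity conditions the information equality gives $V_{\theta_0} = \mathbb{E}\big[\dot m_{\theta_0}\dot m_{\theta_0}^\top\big] = I(\theta_0)$, the Fisher information, so the sandwich collapses to the familiar
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}\big(0, I(\theta_0)^{-1}\big).$$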

Asymptotic Normality of Quantile Regression

Treating quantile regression as an M-estimator (here in the intercept-only case: estimating the $\tau$-th quantile of a scalar random variable $X$ with CDF $F$), we verify the above conditions and establish its asymptotic normality. Recall the pinball loss function:
$$m_\theta(x) = \rho_\tau(x - \theta), \qquad \rho_\tau(u) = \begin{cases} \tau\, u, & u \ge 0, \\ (\tau - 1)\, u, & u < 0. \end{cases}$$

For the first condition, the derivative $\frac{\partial}{\partial \theta} m_\theta(x)$ exists if $x \neq \theta$. For any continuous distribution, $\mathbb{P}(X = \theta) = 0$, so the first condition holds, and
$$\dot m_\theta(x) = \mathbf{1}\{x < \theta\} - \tau.$$

The second condition also holds: $\rho_\tau$ is Lipschitz with constant $\max(\tau, 1 - \tau) \le 1$, so
$$|m_{\theta_1}(x) - m_{\theta_2}(x)| \le \max(\tau, 1 - \tau)\,|\theta_1 - \theta_2| \le |\theta_1 - \theta_2|,$$
and the constant function $\dot m(x) \equiv 1$ is in $L_2(P)$.

For the third condition, we have
$$M(\theta) = \mathbb{E}[\rho_\tau(X - \theta)] = \tau \int_\theta^\infty (x - \theta)\, dF(x) + (1 - \tau) \int_{-\infty}^\theta (\theta - x)\, dF(x).$$

By integration by parts (assuming $X$ has a finite mean),
$$M(\theta) = \tau \int_\theta^\infty \big(1 - F(x)\big)\, dx + (1 - \tau) \int_{-\infty}^\theta F(x)\, dx.$$

Thus,
$$M'(\theta) = -\tau\big(1 - F(\theta)\big) + (1 - \tau) F(\theta) = F(\theta) - \tau, \qquad M''(\theta) = f(\theta),$$
where $f = F'$ is the density of $X$; hence $V_{\theta_0} = f(\theta_0)$.

We need $f(\theta_0) > 0$ for $V_{\theta_0}$ to be non-singular, i.e. the density must be positive at the $\tau$-th quantile.

Now we calculate the asymptotic variance. Note that our convergence point is the $\tau$-th quantile $\theta_0 = F^{-1}(\tau)$, i.e. $F(\theta_0) = \tau$. Plugging it in gives
$$\mathbb{E}\big[\dot m_{\theta_0}(X)^2\big] = \mathbb{E}\big[(\mathbf{1}\{X < \theta_0\} - \tau)^2\big] = \tau(1 - \tau).$$

Finally, we get
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}\!\left(0,\; \frac{\tau(1 - \tau)}{f(\theta_0)^2}\right).$$
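A small simulation sketch of this result (my own check, assuming standard normal data with $\tau = 0.5$, so $\theta_0 = 0$, $f(\theta_0) = 1/\sqrt{2\pi}$, and a theoretical variance of $\tau(1-\tau)/f(\theta_0)^2 = \pi/2 \approx 1.57$). Here `np.quantile` stands in for the exact pinball-loss minimizer; it is an approximate minimizer, which is enough by the approximate-estimator result above.

```python
# Simulation check: the variance of sqrt(n) * (theta_hat - theta_0) for the
# sample tau-quantile should approach tau * (1 - tau) / f(theta_0)^2.
import numpy as np

tau, n, reps = 0.5, 2000, 5000
rng = np.random.default_rng(0)
f_theta0 = 1.0 / np.sqrt(2.0 * np.pi)           # N(0, 1) density at theta_0 = 0

samples = rng.normal(size=(reps, n))
theta_hat = np.quantile(samples, tau, axis=1)   # empirical tau-quantile per replication
scaled_err = np.sqrt(n) * (theta_hat - 0.0)

print(scaled_err.var())                         # empirical variance of the scaled error
print(tau * (1 - tau) / f_theta0**2)            # theoretical value, pi/2 ~ 1.5708
```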

Z-Estimator

M-estimators further give rise to Z-estimators. In many ways, Z-estimators generalize both M-estimators and Moment Estimators, because they are defined as zeros of a system of estimating equations. When the function $\theta \mapsto m_\theta$ is differentiable, the M-estimator can be found as a zero of the gradient of the criterion $M_n$. Z-estimators also generalize moment estimators, which solve the zero points of the moment equations.
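To make the connection explicit (a sketch; the notation $\psi_\theta$ is mine): a Z-estimator $\hat\theta_n$ solves a system of estimating equations
$$\Psi_n(\theta) := \frac{1}{n}\sum_{i=1}^n \psi_\theta(X_i) = 0.$$
Taking $\psi_\theta = \dot m_\theta$ recovers the first-order condition of the M-estimator, while taking $\psi_\theta(x) = \big(x - \mathbb{E}_\theta[X], \dots, x^p - \mathbb{E}_\theta[X^p]\big)$ recovers the moment equations of the moment estimator.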