Total Variation Distance
The total variation distance between two probability measures $P$ and $Q$ on a sigma-algebra $\mathcal{F}$ of subsets of the sample space $\Omega$ is defined via
$$\mathrm{TV}(P, Q) = \sup_{A \in \mathcal{F}} |P(A) - Q(A)|.$$
Informally, this is the largest possible difference between the probabilities that the two probability distributions can assign to the same event.
One direct implication from the definition is that if two distributions have disjoint support, i.e., $\mu\big(\operatorname{supp}(P) \cap \operatorname{supp}(Q)\big) = 0$, where $\mu$ is the base measure on the sigma-field, then their TV distance is 1 (take $A = \operatorname{supp}(P)$, so that $P(A) = 1$ and $Q(A) = 0$). For example, the TV distance between a discrete and a continuous distribution is 1, because on the common sample space, the support of the discrete distribution is countable and hence has probability 0 under the continuous distribution.
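As a quick worked instance (an added example): let $P = \delta_0$ be the point mass at $0$ and $Q = \mathrm{Unif}[0, 1]$. Taking $A = \{0\}$ gives
$$P(A) - Q(A) = 1 - 0 = 1,$$
so $\mathrm{TV}(P, Q) = 1$.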
Rmk
TV distance does not tensorize: the distance between product measures cannot be expressed as a simple function of the distance between the marginals. Specifically, suppose we have $n$ iid samples from $P$ and from $Q$, so that the joint laws are $P^{\otimes n}$ and $Q^{\otimes n}$. We do not have a relationship of the form
$$\mathrm{TV}\big(P^{\otimes n}, Q^{\otimes n}\big) = f\big(\mathrm{TV}(P, Q),\, n\big).$$
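As a concrete illustration (an added example): take $P = \mathrm{Ber}(0)$, $Q = \mathrm{Ber}(1/2)$ and $P' = \mathrm{Ber}(1/4)$, $Q' = \mathrm{Ber}(3/4)$. Both pairs have $\mathrm{TV} = 1/2$, but a direct computation over the four outcomes of two flips gives
$$\mathrm{TV}\big(P^{\otimes 2}, Q^{\otimes 2}\big) = \frac{3}{4}, \qquad \mathrm{TV}\big(P'^{\otimes 2}, Q'^{\otimes 2}\big) = \frac{1}{2},$$
so $\mathrm{TV}(P^{\otimes n}, Q^{\otimes n})$ is not determined by $\mathrm{TV}(P, Q)$ and $n$ alone.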
Therefore, in practice, it’s usually more convenient to calculate other distances that tensorize, such as the KL Divergence, Wasserstein Distance, and Hellinger Distance.
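For reference (standard identities added here; the Hellinger convention assumed is $H^2(P, Q) = \int (\sqrt{p} - \sqrt{q})^2\,d\mu$):
$$\mathrm{KL}\big(P^{\otimes n} \,\|\, Q^{\otimes n}\big) = n\,\mathrm{KL}(P \,\|\, Q), \qquad 1 - \frac{H^2\big(P^{\otimes n}, Q^{\otimes n}\big)}{2} = \left(1 - \frac{H^2(P, Q)}{2}\right)^{n}.$$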
L1 Norm
Thm
The TV distance equals half the L1 distance between the densities: if $P$ and $Q$ have densities (PDF/PMF) $p$ and $q$ with respect to a common base measure $\mu$, then
$$\mathrm{TV}(P, Q) = \frac{1}{2}\,\|p - q\|_1 = \frac{1}{2}\int |p - q|\,d\mu.$$
First Proof
Let $A^* = \{x : p(x) \geq q(x)\}$. Note that
$$\mathrm{TV}(P, Q) \geq P(A^*) - Q(A^*) = \int_{A^*} (p - q)\,d\mu = \frac{1}{2}\int |p - q|\,d\mu,$$
where the last equality holds because $\int (p - q)\,d\mu = 0$ implies $\int_{A^*} (p - q)\,d\mu = \int_{(A^*)^c} (q - p)\,d\mu$, and these two nonnegative terms sum to $\int |p - q|\,d\mu$.
On the other side, note first that
$$\int_A (p - q)\,d\mu + \int_{A^c} (p - q)\,d\mu = \int (p - q)\,d\mu = 0,$$
and hence
$$\int_A (p - q)\,d\mu = -\int_{A^c} (p - q)\,d\mu \quad \text{for every } A \in \mathcal{F}.$$
Now for any $A \in \mathcal{F}$, we have
$$|P(A) - Q(A)| = \frac{1}{2}\left|\int_A (p - q)\,d\mu\right| + \frac{1}{2}\left|\int_{A^c} (p - q)\,d\mu\right| \leq \frac{1}{2}\int |p - q|\,d\mu.$$
Taking the supremum over $A \in \mathcal{F}$ gives
$$\mathrm{TV}(P, Q) \leq \frac{1}{2}\,\|p - q\|_1,$$
which is the other needed inequality.
Second Proof
Again, let $A^* = \{x : p(x) \geq q(x)\}$. We know that $P(A^*) - Q(A^*) = \int_{A^*} (p - q)\,d\mu = \frac{1}{2}\|p - q\|_1$ from the first proof. Therefore, we only need to show
$$\mathrm{TV}(P, Q) = \sup_{A \in \mathcal{F}} |P(A) - Q(A)| = P(A^*) - Q(A^*) = \frac{1}{2}\|p - q\|_1,$$
where the last two equalities are known.
For any $A \in \mathcal{F}$, we suppose $P(A) \geq Q(A)$ WLOG (otherwise replace $A$ by $A^c$, which has the same value of $|P(\cdot) - Q(\cdot)|$). Then,
$$P(A) - Q(A) = \int_{A \cap A^*} (p - q)\,d\mu + \int_{A \cap (A^*)^c} (p - q)\,d\mu \leq \int_{A \cap A^*} (p - q)\,d\mu \leq \int_{A^*} (p - q)\,d\mu,$$
where the first inequality drops the set $A \cap (A^*)^c$, on which $p - q < 0$, and the second adds the set $A^* \setminus A$, on which $p - q \geq 0$; the two inequalities are both equalities only when $A$ agrees with $A^*$ up to $\mu$-null sets and points where $p = q$. Then,
$$|P(A) - Q(A)| = P(A) - Q(A) \leq \int_{A^*} (p - q)\,d\mu = P(A^*) - Q(A^*),$$
which further implies
$$\sup_{A \in \mathcal{F}} |P(A) - Q(A)| = P(A^*) - Q(A^*).$$
Thus, $A^*$ and $(A^*)^c$ are the sets that achieve the supremum.
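As a worked example of the optimal set (added for illustration): for $P = \mathcal{N}(0, 1)$ and $Q = \mathcal{N}(\mu, 1)$ with $\mu > 0$, comparing the densities shows $p(x) \geq q(x)$ iff $x \leq \mu/2$, so $A^* = (-\infty, \mu/2]$ and
$$\mathrm{TV}(P, Q) = P(A^*) - Q(A^*) = \Phi\Big(\frac{\mu}{2}\Big) - \Phi\Big(-\frac{\mu}{2}\Big) = 2\,\Phi\Big(\frac{\mu}{2}\Big) - 1,$$
where $\Phi$ is the standard normal CDF.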
Optimal Transport Interpretation
TV can be interpreted, from an optimal transport perspective, as the minimal amount of probability mass that must be moved to transform one distribution into the other.
Formally, suppose $\mathrm{TV}(P, Q) = \delta$. Then, there exists a joint distribution of $(X, Y)$ such that the marginal distributions of $X$ and $Y$ are $P$ and $Q$, respectively, and $\Pr(X \neq Y) = \delta$.
This means that we can transform $P$ into $Q$ by moving $\delta$ of the mass, while the remaining $1 - \delta$ of the mass stays in place.
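Equivalently, in the standard coupling formulation (which the proof below establishes in both directions):
$$\mathrm{TV}(P, Q) = \min_{\pi \in \Pi(P, Q)} \Pr_{(X, Y) \sim \pi}(X \neq Y),$$
where $\Pi(P, Q)$ denotes the set of couplings, i.e., joint distributions of $(X, Y)$ with marginals $P$ and $Q$.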
Proof
Suppose $P, Q$ have PDF/PMF $p, q$. Then note that
$$\min\big(p(x), q(x)\big) = \frac{p(x) + q(x) - |p(x) - q(x)|}{2},$$
which implies
$$\int \min(p, q)\,d\mu = 1 - \frac{1}{2}\int |p - q|\,d\mu = 1 - \mathrm{TV}(P, Q) = 1 - \delta.$$
Now note that due to the marginal constraint, any joint density $\pi$ of $(X, Y)$ with marginals $p$ and $q$ satisfies $\pi(x, x) \leq \min\big(p(x), q(x)\big)$, and hence
$$\Pr(X = Y) = \int \pi(x, x)\,d\mu(x) \leq \int \min(p, q)\,d\mu = 1 - \delta,$$
which implies
$$\Pr(X \neq Y) \geq \delta \quad \text{for every coupling of } P \text{ and } Q.$$
OTOH, we can define
$$\pi(x, x) = \min\big(p(x), q(x)\big),$$
so that $\Pr(X = Y) = 1 - \delta$ and thus $\Pr(X \neq Y) = \delta$. Then we can specify the other values of $\pi$ to make it meet the marginal constraints.
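For completeness, here is one explicit way to fill in the off-diagonal values (an added construction, not spelled out above; assume $\delta > 0$): for $x \neq y$, set
$$\pi(x, y) = \frac{\big(p(x) - q(x)\big)^{+}\,\big(q(y) - p(y)\big)^{+}}{\delta}.$$
Both marginal constraints can then be checked directly, using $\int (p - q)^{+}\,d\mu = \int (q - p)^{+}\,d\mu = \delta$ and $\min(p, q) + (p - q)^{+} = p$.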
!todo See also Hw 3.4.
Sample Gain
It’s intuitive that more iid samples help distinguish two distributions. Formally, we claim
$$\mathrm{TV}\big(P^{\otimes (n+1)}, Q^{\otimes (n+1)}\big) \geq \mathrm{TV}\big(P^{\otimes n}, Q^{\otimes n}\big) \quad \text{for all } n \geq 1.$$
First Proof
The first proof is for continuous distributions. Let $p_n$ and $q_n$ denote the joint densities of $P^{\otimes n}$ and $Q^{\otimes n}$, and WLOG, we compare $n$ samples against $n + 1$ samples. Note that $\int p(y)\,dy = \int q(y)\,dy = 1$. Thus, writing $x \in \mathbb{R}^n$ for the first $n$ samples and $y$ for the $(n+1)$-st,
$$2\,\mathrm{TV}\big(P^{\otimes(n+1)}, Q^{\otimes(n+1)}\big) = \iint \big|p_n(x)\,p(y) - q_n(x)\,q(y)\big|\,dy\,dx \geq \int \left|\int \big(p_n(x)\,p(y) - q_n(x)\,q(y)\big)\,dy\right| dx = \int \big|p_n(x) - q_n(x)\big|\,dx = 2\,\mathrm{TV}\big(P^{\otimes n}, Q^{\otimes n}\big).$$
Second Proof
The second proof applies to general distributions. By the set relationship $\{X_{1:n} \in A\} = \{X_{1:n+1} \in A \times \Omega\}$, for every $A \in \mathcal{F}^{\otimes n}$ we have $P^{\otimes(n+1)}(A \times \Omega) = P^{\otimes n}(A)$, and hence
$$\mathrm{TV}\big(P^{\otimes(n+1)}, Q^{\otimes(n+1)}\big) \geq \sup_{A \in \mathcal{F}^{\otimes n}} \big|P^{\otimes(n+1)}(A \times \Omega) - Q^{\otimes(n+1)}(A \times \Omega)\big| = \mathrm{TV}\big(P^{\otimes n}, Q^{\otimes n}\big),$$
since the supremum defining the left-hand side ranges over a larger collection of events.
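As a sanity check (an added computation, reusing the Bernoulli example from the tensorization remark): for $P = \mathrm{Ber}(0)$ and $Q = \mathrm{Ber}(1/2)$, the optimal event is $A = \{(0, \dots, 0)\}$, giving
$$\mathrm{TV}\big(P^{\otimes n}, Q^{\otimes n}\big) = 1 - 2^{-n},$$
which is indeed increasing in $n$ and tends to 1: with enough samples, the two distributions can be distinguished almost perfectly.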