Total Variation Distance

The total variation distance between two probability measures $P$ and $Q$ on a sigma-algebra $\mathcal{F}$ of subsets of the sample space $\Omega$ is defined via

$$\mathrm{TV}(P, Q) = \sup_{A \in \mathcal{F}} |P(A) - Q(A)|.$$

Informally, this is the largest possible difference between the probabilities that the two probability distributions can assign to the same event.

One direct implication from the definition is that if two distributions have disjoint support, i.e., $\mu(\operatorname{supp}(P) \cap \operatorname{supp}(Q)) = 0$, where $\mu$ is a common dominating measure on the sigma-field, then their TV distance is 1: taking $A = \operatorname{supp}(P)$ gives $P(A) = 1$ and $Q(A) = 0$. For example, the TV distance between a discrete and a continuous distribution is 1, because on the common sample space, the support of the discrete distribution has measure 0 under the continuous distribution.
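As a quick numerical sanity check, here is a minimal sketch (with hypothetical PMFs) showing that when supports are disjoint, the event $A = \operatorname{supp}(P)$ witnesses a TV distance of 1:

```python
# Hypothetical PMFs with disjoint supports.
P = {1: 0.5, 2: 0.5}
Q = {3: 0.3, 4: 0.7}

# Taking A = supp(P) gives P(A) = 1 and Q(A) = 0, so TV(P, Q) = 1.
A = set(P)
gap = sum(P[x] for x in A) - sum(Q.get(x, 0) for x in A)
print(gap)  # 1.0
```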

Rmk

TV distance does not tensorize: there is no formula expressing the TV distance between product measures in terms of the TV distance between the marginals. In other words, a property in one dimension does not determine the property in multiple dimensions. Specifically, suppose we have $n$ iid samples from $P$ (resp. $Q$), with joint distribution $P^{\otimes n}$ (resp. $Q^{\otimes n}$). We do not have a relationship of the form

$$\mathrm{TV}(P^{\otimes n}, Q^{\otimes n}) = f\big(\mathrm{TV}(P, Q), n\big).$$

Therefore, in practice, it’s usually more convenient to calculate other distances that tensorize, such as the KL Divergence, Wasserstein Distance, and Hellinger Distance.
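To see concretely that no such formula $f$ can exist, the sketch below (hypothetical Bernoulli parameters; TV computed via the half-$L_1$ identity proved in the next section) exhibits two pairs of distributions with the same one-dimensional TV distance but different two-sample TV distances:

```python
from itertools import product

def tv_discrete(p, q):
    """TV distance between finite PMFs (dicts) via the half-L1 formula."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0) - q.get(x, 0)) for x in keys)

def product_pmf(p, n):
    """PMF of n iid draws from p."""
    out = {}
    for xs in product(p, repeat=n):
        prob = 1.0
        for x in xs:
            prob *= p[x]
        out[xs] = prob
    return out

def bern(t):
    return {0: 1 - t, 1: t}

# Two pairs with the SAME one-dimensional TV distance (0.4) ...
print(tv_discrete(bern(0.5), bern(0.9)), tv_discrete(bern(0.3), bern(0.7)))
# ... but DIFFERENT two-sample TV distances (~0.56 vs ~0.4), so
# TV(P^n, Q^n) cannot be a function of TV(P, Q) and n alone.
print(tv_discrete(product_pmf(bern(0.5), 2), product_pmf(bern(0.9), 2)))
print(tv_discrete(product_pmf(bern(0.3), 2), product_pmf(bern(0.7), 2)))
```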

L1 Norm

Thm

Suppose $P$ and $Q$ have densities (PDF/PMF) $p$ and $q$ with respect to a common dominating measure $\mu$. Then the TV distance equals half the $L_1$ distance between the densities:

$$\mathrm{TV}(P, Q) = \frac{1}{2} \int |p - q| \, d\mu = \frac{1}{2} \|p - q\|_1.$$

First Proof

Let $A^* = \{x : p(x) \ge q(x)\}$. Note that

$$\mathrm{TV}(P, Q) \ge P(A^*) - Q(A^*) = \int_{A^*} (p - q) \, d\mu.$$

On the other side, note first that

$$\int (p - q) \, d\mu = 1 - 1 = 0,$$

and hence

$$\int_{A^*} (p - q) \, d\mu = \int_{(A^*)^c} (q - p) \, d\mu = \frac{1}{2} \int |p - q| \, d\mu = \frac{1}{2} \|p - q\|_1.$$

Combining the two displays gives $\mathrm{TV}(P, Q) \ge \frac{1}{2} \|p - q\|_1$. Now for any $A \in \mathcal{F}$, we have

$$|P(A) - Q(A)| = \left| \int_A (p - q) \, d\mu \right| \le \max\left\{ \int_{A^*} (p - q) \, d\mu, \ \int_{(A^*)^c} (q - p) \, d\mu \right\} = \frac{1}{2} \|p - q\|_1.$$

Taking the supremum over $A \in \mathcal{F}$ gives

$$\mathrm{TV}(P, Q) \le \frac{1}{2} \|p - q\|_1,$$

which is the other needed inequality.
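A brute-force check of the identity on a small finite space (hypothetical PMFs), comparing the supremum over all $2^{|\Omega|}$ events with the half-$L_1$ formula:

```python
from itertools import chain, combinations

# Hypothetical PMFs on a four-point space.
omega = [0, 1, 2, 3]
p = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
q = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

# Supremum form: enumerate every event A and take the largest |P(A) - Q(A)|.
events = chain.from_iterable(combinations(omega, r) for r in range(len(omega) + 1))
sup_form = max(abs(sum(p[x] - q[x] for x in A)) for A in events)

# Half-L1 form.
l1_form = 0.5 * sum(abs(p[x] - q[x]) for x in omega)
print(sup_form, l1_form)  # both ~0.4
```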

Second Proof

Again, let $A^* = \{x : p(x) \ge q(x)\}$. We know that $\mathrm{TV}(P, Q) = \sup_{A \in \mathcal{F}} \big( P(A) - Q(A) \big)$, since replacing $A$ by $A^c$ flips the sign of $P(A) - Q(A)$. Therefore, we only need to show

$$\sup_{A \in \mathcal{F}} \big( P(A) - Q(A) \big) = P(A^*) - Q(A^*) = \int_{A^*} (p - q) \, d\mu = \frac{1}{2} \|p - q\|_1,$$

where the last two equalities are known.

For any $A \in \mathcal{F}$, we have

$$P(A) - Q(A) = \int_{A \cap A^*} (p - q) \, d\mu + \int_{A \cap (A^*)^c} (p - q) \, d\mu \le \int_{A \cap A^*} (p - q) \, d\mu \le \int_{A^*} (p - q) \, d\mu,$$

where the first inequality uses $p - q < 0$ on $(A^*)^c$ and the second uses $p - q \ge 0$ on $A^*$; both inequalities cannot hold with equality unless $A$ agrees with $A^*$ up to a $\mu$-null set of $\{p \neq q\}$. Then,

$$P(A) - Q(A) \le \int_{A^*} (p - q) \, d\mu = P(A^*) - Q(A^*),$$

which further implies

$$\sup_{A \in \mathcal{F}} \big( P(A) - Q(A) \big) = P(A^*) - Q(A^*).$$

Thus, $A^* = \{p \ge q\}$ and $\{p > q\}$ are the sets that achieve the supremum.
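The sketch below (hypothetical PMFs, brute force over all events) confirms that the maximizing event is exactly $A^* = \{x : p(x) \ge q(x)\}$:

```python
from itertools import chain, combinations

# Hypothetical PMFs; the argmax event should be A* = {x : p(x) >= q(x)}.
omega = [0, 1, 2, 3]
p = {0: 0.2, 1: 0.1, 2: 0.4, 3: 0.3}
q = {0: 0.3, 1: 0.4, 2: 0.1, 3: 0.2}

events = chain.from_iterable(combinations(omega, r) for r in range(len(omega) + 1))
best = max(events, key=lambda A: sum(p[x] - q[x] for x in A))
a_star = tuple(x for x in omega if p[x] >= q[x])
print(best, a_star)  # both (2, 3)
```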

Optimal Transport Interpretation

TV can be interpreted, from an optimal transport perspective, as the minimal amount of probability mass that must be moved to transform one distribution into the other.

Formally, suppose $\mathrm{TV}(P, Q) = \delta$. Then, there exists a joint distribution of $(X, Y)$ such that the marginal distributions are $P$ and $Q$, and $\Pr(X \neq Y) = \delta$; moreover, no such coupling can achieve $\Pr(X \neq Y) < \delta$.

This means that we can transform $P$ into $Q$ by moving only $\delta$ of the probability mass, from the region where $p > q$ to the region where $q > p$, while the remaining $1 - \delta$ mass remains unchanged.

Proof

Suppose $P, Q$ have PDF/PMF $p, q$; for concreteness we write the discrete case. Then note that

$$\min(p, q) = p - (p - q)_+,$$

which implies, using $\sum_x (p(x) - q(x)) = 0$ so that $\sum_x (p(x) - q(x))_+ = \frac{1}{2} \sum_x |p(x) - q(x)|$,

$$\sum_x \min(p(x), q(x)) = 1 - \sum_x (p(x) - q(x))_+ = 1 - \frac{1}{2} \sum_x |p(x) - q(x)| = 1 - \delta.$$

Now note that due to the marginal constraint, any coupling $(X, Y)$ of $P$ and $Q$ satisfies

$$\Pr(X = Y = x) \le \min\big( \Pr(X = x), \Pr(Y = x) \big) = \min(p(x), q(x)),$$

which implies

$$\Pr(X = Y) \le \sum_x \min(p(x), q(x)) = 1 - \delta, \quad \text{i.e.,} \quad \Pr(X \neq Y) \ge \delta.$$

OTOH, we can define

$$\Pr(X = x, Y = x) = \min(p(x), q(x)),$$

so that $\Pr(X = Y) = 1 - \delta$ is attained. Then we can specify the other values of the joint PMF to make it meet the marginal constraint: for instance, $\Pr(X = x, Y = y) = (p(x) - q(x))_+ (q(y) - p(y))_+ / \delta$ for $x \neq y$ works, since $(p - q)_+$ and $(q - p)_+$ each have total mass $\delta$.
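A minimal sketch of this construction for finite PMFs (hypothetical values; assumes $P \neq Q$ so that $\delta > 0$): put mass $\min(p(x), q(x))$ on the diagonal and spread the leftover mass off the diagonal, then verify the marginals and $\Pr(X \neq Y)$.

```python
# Hypothetical PMFs on a common three-point space.
p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.3, 2: 0.5}
delta = 0.5 * sum(abs(p[x] - q[x]) for x in p)  # TV(P, Q) = 0.3

# Optimal coupling: diagonal mass min(p, q), off-diagonal mass
# (p - q)_+ (q - p)_+ / delta.
joint = {}
for x in p:
    for y in q:
        if x == y:
            joint[(x, y)] = min(p[x], q[x])
        else:
            joint[(x, y)] = max(p[x] - q[x], 0) * max(q[y] - p[y], 0) / delta

# Check: marginals are P and Q, and Pr(X != Y) equals the TV distance.
marg_x = {x: sum(joint[(x, y)] for y in q) for x in p}
marg_y = {y: sum(joint[(x, y)] for x in p) for y in q}
pr_neq = sum(v for (x, y), v in joint.items() if x != y)
print(marg_x, marg_y, pr_neq)
```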

!todo See also Hw 3.4.

Sample Gain

It’s intuitive that more iid samples help distinguish two distributions. Formally, we claim

$$\mathrm{TV}(P^{\otimes n}, Q^{\otimes n}) \ge \mathrm{TV}(P, Q).$$

First Proof

The first proof is for continuous distributions. Let $p_n(x_1, \dots, x_n) = \prod_{i=1}^n p(x_i)$ and $q_n(x_1, \dots, x_n) = \prod_{i=1}^n q(x_i)$ denote the joint densities. Note that $\int (p_n - q_n) \, dx_2 \cdots dx_n = p(x_1) - q(x_1)$, since the first marginals of $P^{\otimes n}$ and $Q^{\otimes n}$ are $P$ and $Q$. Thus, by the triangle inequality for integrals,

$$\mathrm{TV}(P^{\otimes n}, Q^{\otimes n}) = \frac{1}{2} \int |p_n - q_n| \, dx_1 \cdots dx_n \ge \frac{1}{2} \int \left| \int (p_n - q_n) \, dx_2 \cdots dx_n \right| dx_1 = \frac{1}{2} \int |p - q| \, dx_1 = \mathrm{TV}(P, Q).$$

Second Proof

The second proof applies to general distributions. By the set relationship $\{A \times \Omega^{n-1} : A \in \mathcal{F}\} \subseteq \mathcal{F}^{\otimes n}$, we have

$$\mathrm{TV}(P^{\otimes n}, Q^{\otimes n}) = \sup_{B \in \mathcal{F}^{\otimes n}} |P^{\otimes n}(B) - Q^{\otimes n}(B)| \ge \sup_{A \in \mathcal{F}} \left| P^{\otimes n}(A \times \Omega^{n-1}) - Q^{\otimes n}(A \times \Omega^{n-1}) \right| = \sup_{A \in \mathcal{F}} |P(A) - Q(A)| = \mathrm{TV}(P, Q).$$
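The same marginalization argument, applied one coordinate at a time, shows the product TV distance is nondecreasing in $n$. A minimal sketch (hypothetical Bernoulli parameters; TV via the half-$L_1$ formula) shows the gain is typically strict:

```python
from itertools import product

def tv_discrete(p, q):
    """TV distance between finite PMFs via the half-L1 formula."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0) - q.get(x, 0)) for x in keys)

def product_pmf(p, n):
    """PMF of n iid draws from p."""
    out = {}
    for xs in product(p, repeat=n):
        prob = 1.0
        for x in xs:
            prob *= p[x]
        out[xs] = prob
    return out

# Hypothetical marginals: Bernoulli(0.5) vs Bernoulli(0.9).
p = {0: 0.5, 1: 0.5}
q = {0: 0.1, 1: 0.9}
for n in (1, 2, 3):
    print(n, tv_discrete(product_pmf(p, n), product_pmf(q, n)))  # ~0.4, ~0.56, ~0.604
```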