Total Variation Distance
The total variation distance between two probability measures $P$ and $Q$ on a sigma-algebra $\mathcal{F}$ of subsets of the sample space $\Omega$ is defined via
$$\mathrm{TV}(P, Q) = \sup_{A \in \mathcal{F}} |P(A) - Q(A)|.$$
Informally, this is the largest possible difference between the probabilities that the two probability distributions can assign to the same event.
One direct implication from the definition is that if two distributions have disjoint support, i.e., $\mu\big(\operatorname{supp}(P) \cap \operatorname{supp}(Q)\big) = 0$, where $\mu$ is the base measure on the sigma-field, then their TV distance is 1 (take $A = \operatorname{supp}(P)$, so that $P(A) = 1$ and $Q(A) = 0$). For example, the TV distance between a discrete and a continuous distribution is 1, because on the common sample space, the support of the discrete distribution is countable and hence has probability 0 under the continuous distribution.
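As a quick worked instance (an added example): let $P = \delta_0$ be the point mass at $0$ and $Q = \mathrm{Unif}[0, 1]$. Taking $A = \{0\}$ gives
$$P(A) - Q(A) = 1 - 0 = 1,$$
so $\mathrm{TV}(P, Q) = 1$.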
Rmk
TV distance does not tensorize: the distance between product measures cannot be expressed as a simple function of the distance between the marginals. Specifically, suppose we have $n$ iid samples from $P$ and from $Q$, so that the joint laws are $P^{\otimes n}$ and $Q^{\otimes n}$. We do not have a relationship of the form
$$\mathrm{TV}\big(P^{\otimes n}, Q^{\otimes n}\big) = f\big(\mathrm{TV}(P, Q),\, n\big).$$
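As a concrete illustration (an added example): take $P = \mathrm{Ber}(0)$, $Q = \mathrm{Ber}(1/2)$ and $P' = \mathrm{Ber}(1/4)$, $Q' = \mathrm{Ber}(3/4)$. Both pairs have $\mathrm{TV} = 1/2$, but a direct computation over the four outcomes of two flips gives
$$\mathrm{TV}\big(P^{\otimes 2}, Q^{\otimes 2}\big) = \frac{3}{4}, \qquad \mathrm{TV}\big(P'^{\otimes 2}, Q'^{\otimes 2}\big) = \frac{1}{2},$$
so $\mathrm{TV}(P^{\otimes n}, Q^{\otimes n})$ is not determined by $\mathrm{TV}(P, Q)$ and $n$ alone.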
Therefore, in practice, it’s usually more convenient to calculate other distances that tensorize, such as the KL Divergence, Wasserstein Distance, and Hellinger Distance.
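For reference (standard identities added here; the Hellinger convention assumed is $H^2(P, Q) = \int (\sqrt{p} - \sqrt{q})^2\,d\mu$):
$$\mathrm{KL}\big(P^{\otimes n} \,\|\, Q^{\otimes n}\big) = n\,\mathrm{KL}(P \,\|\, Q), \qquad 1 - \frac{H^2\big(P^{\otimes n}, Q^{\otimes n}\big)}{2} = \left(1 - \frac{H^2(P, Q)}{2}\right)^{n}.$$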
L1 Norm
Thm
The TV distance equals half the L1 distance between the densities: if $P$ and $Q$ have densities (PDF/PMF) $p$ and $q$ with respect to a common base measure $\mu$, then
$$\mathrm{TV}(P, Q) = \frac{1}{2}\,\|p - q\|_1 = \frac{1}{2}\int |p - q|\,d\mu.$$
First Proof
Let $A^* = \{x : p(x) \geq q(x)\}$. Note that
$$\mathrm{TV}(P, Q) \geq P(A^*) - Q(A^*) = \int_{A^*} (p - q)\,d\mu = \frac{1}{2}\int |p - q|\,d\mu,$$
where the last equality holds because $\int (p - q)\,d\mu = 0$ implies $\int_{A^*} (p - q)\,d\mu = \int_{(A^*)^c} (q - p)\,d\mu$, and these two nonnegative terms sum to $\int |p - q|\,d\mu$.
On the other side, note first that
$$\int_A (p - q)\,d\mu + \int_{A^c} (p - q)\,d\mu = \int (p - q)\,d\mu = 0,$$
and hence
$$\int_A (p - q)\,d\mu = -\int_{A^c} (p - q)\,d\mu \quad \text{for every } A \in \mathcal{F}.$$
Now for any $A \in \mathcal{F}$, we have
$$|P(A) - Q(A)| = \frac{1}{2}\left|\int_A (p - q)\,d\mu\right| + \frac{1}{2}\left|\int_{A^c} (p - q)\,d\mu\right| \leq \frac{1}{2}\int |p - q|\,d\mu.$$
Taking the supremum over $A \in \mathcal{F}$ gives
$$\mathrm{TV}(P, Q) \leq \frac{1}{2}\,\|p - q\|_1,$$
which is the other needed inequality.
Second Proof
Again, let $A^* = \{x : p(x) \geq q(x)\}$. We know that $P(A^*) - Q(A^*) = \int_{A^*} (p - q)\,d\mu = \frac{1}{2}\|p - q\|_1$ from the first proof. Therefore, we only need to show
$$\mathrm{TV}(P, Q) = \sup_{A \in \mathcal{F}} |P(A) - Q(A)| = P(A^*) - Q(A^*) = \frac{1}{2}\|p - q\|_1,$$
where the last two equalities are known.
For any $A \in \mathcal{F}$, we suppose $P(A) \geq Q(A)$ WLOG (otherwise replace $A$ by $A^c$, which has the same value of $|P(\cdot) - Q(\cdot)|$). Then,
$$P(A) - Q(A) = \int_{A \cap A^*} (p - q)\,d\mu + \int_{A \cap (A^*)^c} (p - q)\,d\mu \leq \int_{A \cap A^*} (p - q)\,d\mu \leq \int_{A^*} (p - q)\,d\mu,$$
where the first inequality drops the set $A \cap (A^*)^c$, on which $p - q < 0$, and the second adds the set $A^* \setminus A$, on which $p - q \geq 0$; the two inequalities are both equalities only when $A$ agrees with $A^*$ up to $\mu$-null sets and points where $p = q$. Then,
$$|P(A) - Q(A)| = P(A) - Q(A) \leq \int_{A^*} (p - q)\,d\mu = P(A^*) - Q(A^*),$$
which further implies
$$\sup_{A \in \mathcal{F}} |P(A) - Q(A)| = P(A^*) - Q(A^*).$$
Thus, $A^*$ and $(A^*)^c$ are the sets that achieve the supremum.
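As a worked example of the optimal set (added for illustration): for $P = \mathcal{N}(0, 1)$ and $Q = \mathcal{N}(\mu, 1)$ with $\mu > 0$, comparing the densities shows $p(x) \geq q(x)$ iff $x \leq \mu/2$, so $A^* = (-\infty, \mu/2]$ and
$$\mathrm{TV}(P, Q) = P(A^*) - Q(A^*) = \Phi\Big(\frac{\mu}{2}\Big) - \Phi\Big(-\frac{\mu}{2}\Big) = 2\,\Phi\Big(\frac{\mu}{2}\Big) - 1,$$
where $\Phi$ is the standard normal CDF.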
Optimal Transport Interpretation
TV can be interpreted, from an optimal transport perspective, as the minimal amount of probability mass that must be moved to transform one distribution into the other.
Formally, suppose $\mathrm{TV}(P, Q) = \delta$. Then, there exists a joint distribution of $(X, Y)$ such that the marginal distributions of $X$ and $Y$ are $P$ and $Q$, respectively, and $\Pr(X \neq Y) = \delta$.
This means that we can transform $P$ into $Q$ by moving $\delta$ of the mass, while the remaining $1 - \delta$ of the mass stays in place.
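Equivalently, in the standard coupling formulation (which the proof below establishes in both directions):
$$\mathrm{TV}(P, Q) = \min_{\pi \in \Pi(P, Q)} \Pr_{(X, Y) \sim \pi}(X \neq Y),$$
where $\Pi(P, Q)$ denotes the set of couplings, i.e., joint distributions of $(X, Y)$ with marginals $P$ and $Q$.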
Proof
Suppose $P, Q$ have PDF/PMF $p, q$. Then note that
$$\min\big(p(x), q(x)\big) = \frac{p(x) + q(x) - |p(x) - q(x)|}{2},$$
which implies
$$\int \min(p, q)\,d\mu = 1 - \frac{1}{2}\int |p - q|\,d\mu = 1 - \mathrm{TV}(P, Q) = 1 - \delta.$$
Now note that due to the marginal constraint, any joint density $\pi$ of $(X, Y)$ with marginals $p$ and $q$ satisfies $\pi(x, x) \leq \min\big(p(x), q(x)\big)$, and hence
$$\Pr(X = Y) = \int \pi(x, x)\,d\mu(x) \leq \int \min(p, q)\,d\mu = 1 - \delta,$$
which implies
$$\Pr(X \neq Y) \geq \delta \quad \text{for every coupling of } P \text{ and } Q.$$
OTOH, we can define
$$\pi(x, x) = \min\big(p(x), q(x)\big),$$
so that $\Pr(X = Y) = 1 - \delta$ and thus $\Pr(X \neq Y) = \delta$. Then we can specify the other values of $\pi$ to make it meet the marginal constraints.
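For completeness, here is one explicit way to fill in the off-diagonal values (an added construction, not spelled out above; assume $\delta > 0$): for $x \neq y$, set
$$\pi(x, y) = \frac{\big(p(x) - q(x)\big)^{+}\,\big(q(y) - p(y)\big)^{+}}{\delta}.$$
Both marginal constraints can then be checked directly, using $\int (p - q)^{+}\,d\mu = \int (q - p)^{+}\,d\mu = \delta$ and $\min(p, q) + (p - q)^{+} = p$.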
!todo See also Hw 3.4.
Sample Gain
It’s intuitive that more iid samples help distinguish two distributions. Formally, we claim
$$\mathrm{TV}\big(P^{\otimes (n+1)}, Q^{\otimes (n+1)}\big) \geq \mathrm{TV}\big(P^{\otimes n}, Q^{\otimes n}\big) \quad \text{for all } n \geq 1.$$
First Proof
The first proof is for continuous distributions. Let $p_n$ and $q_n$ denote the joint densities of $P^{\otimes n}$ and $Q^{\otimes n}$, and WLOG, we compare $n$ samples against $n + 1$ samples. Note that $\int p(y)\,dy = \int q(y)\,dy = 1$. Thus, writing $x \in \mathbb{R}^n$ for the first $n$ samples and $y$ for the $(n+1)$-st,
$$2\,\mathrm{TV}\big(P^{\otimes(n+1)}, Q^{\otimes(n+1)}\big) = \iint \big|p_n(x)\,p(y) - q_n(x)\,q(y)\big|\,dy\,dx \geq \int \left|\int \big(p_n(x)\,p(y) - q_n(x)\,q(y)\big)\,dy\right| dx = \int \big|p_n(x) - q_n(x)\big|\,dx = 2\,\mathrm{TV}\big(P^{\otimes n}, Q^{\otimes n}\big).$$
Second Proof
The second proof applies to general distributions. By the set relationship $\{X_{1:n} \in A\} = \{X_{1:n+1} \in A \times \Omega\}$, for every $A \in \mathcal{F}^{\otimes n}$ we have $P^{\otimes(n+1)}(A \times \Omega) = P^{\otimes n}(A)$, and hence
$$\mathrm{TV}\big(P^{\otimes(n+1)}, Q^{\otimes(n+1)}\big) \geq \sup_{A \in \mathcal{F}^{\otimes n}} \big|P^{\otimes(n+1)}(A \times \Omega) - Q^{\otimes(n+1)}(A \times \Omega)\big| = \mathrm{TV}\big(P^{\otimes n}, Q^{\otimes n}\big),$$
since the supremum defining the left-hand side ranges over a larger collection of events.
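As a sanity check (an added computation, reusing the Bernoulli example from the tensorization remark): for $P = \mathrm{Ber}(0)$ and $Q = \mathrm{Ber}(1/2)$, the optimal event is $A = \{(0, \dots, 0)\}$, giving
$$\mathrm{TV}\big(P^{\otimes n}, Q^{\otimes n}\big) = 1 - 2^{-n},$$
which is indeed increasing in $n$ and tends to 1: with enough samples, the two distributions can be distinguished almost perfectly.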