Markov Chain

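We say a Stochastic Process has the $k$-th order Markov property if

X_{t} ∣ (X_{0}, \dots, X_{t - 1}) = X_{t} ∣ (X_{t - 1}, \dots, X_{t - k}) .

That is, the conditional distribution of $X_{t}$ given the entire past depends only on the $k$ most recent states. We call a process with this property a Markov chain; the rest of this note takes $k = 1$.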

Random Walk is a simple example of a Markov chain.

Concepts

Transition Kernel

For a finite state space $\mathcal{S} = \{1, \dots, S\}$, the transition matrix is specified by $P_{ij} = \Pr (s_{t} = j ∣ s_{t - 1} = i)$. This matrix is also called the transition kernel or the Markov matrix; it is row-stochastic, i.e., its entries are non-negative and each row sums to 1.

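For a general state space, the transition kernel is a family of measurable functions $P (s, \cdot)$, one for each state $s$, where $P (s, s^{'})$ is the probability (or, for continuous spaces, the density) of $s_{t} = s^{'}$ given $s_{t - 1} = s$.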

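We also often use the transition kernel $P$ as an operator such that $X_{t + 1} = P X_{t}$, where, with a slight abuse of notation, $X_{t}$ denotes the distribution of the state at time $t$. However, because the matrix $P$ defined above is row-stochastic, the distribution must be a row vector, and we should write $X_{t + 1}^{T} = X_{t}^{T} P$ instead.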
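The row-vector convention can be checked numerically. The following is a minimal sketch in plain Python; the two-state matrix and the helper name `step` are made up for illustration and do not come from the original note.

```python
def step(dist, P):
    """One step of mu_{t+1}^T = mu_t^T P for a row-stochastic matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain; each row of P sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

mu0 = [1.0, 0.0]     # start in state 0 with certainty
mu1 = step(mu0, P)   # distribution after one step
mu2 = step(mu1, P)   # distribution after two steps
```

Each call multiplies the current distribution, treated as a row vector, by $P$ from the right, so every intermediate result still sums to 1.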

The above definitions are for homogeneous Markov chains, where the transition kernel is independent of time: $Pr (s_{t + 1} = s^{'} ∣ s_{t} = s) = Pr (s_{1} = s^{'} ∣ s_{0} = s)$ for all $t$ .

Stationary Distribution

A probability distribution $μ$ is called a stationary (steady/equilibrium) distribution if $μ = P μ$ , or equivalently $μ^{T} = μ^{T} P$ for discrete cases. If on a Markov chain $X_{t} \sim μ$ , we say the chain is in steady-state.

Thm

Finite-space time-homogeneous Markov chains always have a stationary distribution.

This theorem can be proved by various methods:

This is a corollary of a general theorem on the stationary distribution’s existence.
Brouwer’s fixed-point theorem.
Linear Programming approach.

LP approach

A finite-space time-homogeneous Markov chain has a stationary distribution if and only if the following linear program has an unbounded objective function:
$\max_{μ} \; 1^{T} μ \quad \text{s.t.} \quad P^{T} μ = μ, \; μ \geq 0.$
This is because if $μ$ is a stationary distribution, then $c μ$ for any $c > 0$ is a feasible solution to the LP with objective value $c$, so the objective function is unbounded; conversely, if the LP has an unbounded objective function, it must have a non-zero feasible solution, which can be normalized to be a stationary distribution.

On the other hand, since the LP is always feasible ($μ = 0$ satisfies the constraints), it has an unbounded objective function if and only if its dual is infeasible. The dual is
$\min_{λ} \; 0^{T} λ \quad \text{s.t.} \quad P λ - λ \geq 1.$
If the dual is feasible, there exists $λ$ such that $P λ - λ \geq 1$. WLOG, suppose $λ_{1} = \max_{s} λ_{s}$. Then the first entry of $P λ - λ$ satisfies
$\sum_{s} P_{1 s} λ_{s} - λ_{1} \leq \sum_{s} P_{1 s} λ_{1} - λ_{1} = 0 < 1,$
contradicting the constraint. Hence the dual is infeasible, the LP is unbounded, and a stationary distribution exists.

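For tabular cases, the fixed-point relation $X_{\infty}^{T} = X_{\infty}^{T} P$ also shows that 1 is an eigenvalue of $P$; because $P$ is row-stochastic, no eigenvalue has a larger modulus. A stationary distribution is a corresponding left eigenvector, normalized so that its entries sum to 1.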
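For a small tabular chain, a stationary distribution can be approximated by repeatedly applying $μ^{T} \leftarrow μ^{T} P$. The sketch below uses plain Python and assumes the chain is irreducible and aperiodic, so that the iteration converges; the matrix, the function name, and the tolerance are illustrative choices, not part of the original note.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Approximate mu with mu^T = mu^T P by power iteration.

    Assumes P is row-stochastic and the chain is irreducible and aperiodic.
    """
    n = len(P)
    mu = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, mu)) < tol:
            return new
        mu = new
    raise RuntimeError("power iteration did not converge")

# Hypothetical two-state chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu = stationary_distribution(P)
```

For this matrix, the balance condition $0.1 μ_{0} = 0.5 μ_{1}$ gives $μ = (5/6, 1/6)$, which the iteration reproduces up to the tolerance.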

Transient and Recurrent States

For two states $x$ and $y$, we write $x \to y$ if $y$ can be reached from $x$ with non-zero probability in some positive number of steps; formally, $\sum_{t > 0} (P^{t})_{x y} = \sum_{t > 0} \Pr (X_{t} = y ∣ X_{0} = x) > 0$.

A state $x$ is called transient if there exists a state $y$ such that $x \to y$ but $y \not\to x$. Otherwise, it is called recurrent.

We say two states $x$ and $y$ communicate if $x \to y$ and $y \to x$, and we write $x \leftrightarrow y$. Communication is an equivalence relation on the set of recurrent states:

Reflexivity: $x \leftrightarrow x$
- Let $z$ be any state with $\Pr (X_{1} = z ∣ X_{0} = x) > 0$ (possibly $z = x$), so $x \to z$. Since $x$ is recurrent, $z \to x$. Hence $x \to z \to x$, i.e., $x \to x$.
Symmetry: $x \leftrightarrow y ⟹ y \leftrightarrow x$
Transitivity: $x \leftrightarrow y \land y \leftrightarrow z ⟹ x \leftrightarrow z$

Therefore, the state space can be partitioned into the set of transient states and the set of communicating classes of recurrent states.
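For a finite chain, this partition can be computed directly from the definitions above. The sketch below is plain Python: it builds the relation $x \to y$ with a transitive closure over the positive entries of $P$, then applies the definition of transient states. The function names and the four-state example are hypothetical.

```python
def reachable(P):
    """reach[x][y] is True when x -> y, i.e. (P^t)[x][y] > 0 for some t >= 1."""
    n = len(P)
    reach = [[P[x][y] > 0 for y in range(n)] for x in range(n)]
    for k in range(n):              # Warshall-style transitive closure
        for x in range(n):
            if reach[x][k]:
                for y in range(n):
                    if reach[k][y]:
                        reach[x][y] = True
    return reach

def classify(P):
    """Return (transient states, list of recurrent communicating classes)."""
    n = len(P)
    reach = reachable(P)
    # x is transient if some y is reachable from x but x is not reachable from y.
    transient = [x for x in range(n)
                 if any(reach[x][y] and not reach[y][x] for y in range(n))]
    recurrent = [x for x in range(n) if x not in transient]
    classes = []
    for x in recurrent:
        if not any(x in c for c in classes):
            classes.append([y for y in recurrent if reach[x][y] and reach[y][x]])
    return transient, classes

# Hypothetical chain: state 0 leaks into {1, 2} and into the absorbing state 3.
P = [[0.5, 0.25, 0.0, 0.25],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
transient, classes = classify(P)   # ([0], [[1, 2], [3]])
```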

See Recurrence Time for the properties of transient and recurrent states.

Times

Given a state $x$ , we define the following times:

First passage/hitting/return time: $T_{x} = \min \{t > 0 : X_{t} = x\}$, given $X_{0} = x$.
- We set $T_{x} = \infty$ if the minimum doesn’t exist.
Mean recurrence time: $μ_{x} = E [T_{x}]$.
Visit time: $N_{x} (t) = \sum_{k = 1}^{t} \mathbf{1} \{X_{k} = x\}$.
Absorption time: $a_{x} = \min \{t > 0 : X_{t} \text{ is recurrent}\}$, given $X_{0} = x$.
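To make the first two definitions concrete, the following sketch evaluates $T_{x}$ and $N_{x} (t)$ on one observed trajectory. The path and the function names are hypothetical, and a return that does not occur within the finite path is reported as infinity, which only says that no return was observed.

```python
def first_return_time(path, x):
    """T_x: smallest t > 0 with path[t] == x; the path is assumed to start at x."""
    for t in range(1, len(path)):
        if path[t] == x:
            return t
    return float("inf")  # no return observed within the recorded path

def visit_count(path, x, t):
    """N_x(t): number of indices k in 1..t with path[k] == x."""
    return sum(1 for k in range(1, t + 1) if path[k] == x)

path = [0, 1, 2, 0, 0, 1]          # hypothetical realization X_0, ..., X_5
T0 = first_return_time(path, 0)    # 3
N0 = visit_count(path, 0, 5)       # 2
```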

Irreducibility

We say a Markov chain is irreducible if we can eventually reach any state starting from any other state. That is, the Markov chain has no transient states and has only one recurrent class.

Thm

Transclude of uniqueness-of-stationary-distribution-for-irreducible-markov-chains#^743562

Periodicity

We define the

$n$-step transition probability (or density for continuous cases) from $x$ to $y$ as $p_{x y}^{(n)} = \Pr (X_{n} = y ∣ X_{0} = x)$. Given a transition matrix $P$, we have $p_{x y}^{(n)} = (P^{n})_{x y}$;
accessible times of state $x$ as $I_{x} = \{n \geq 1 : p_{xx}^{(n)} > 0\}$, which is a non-empty set for recurrent states and is closed under addition: $m, n \in I_{x} ⟹ m + n \in I_{x}$;
period of state $x$ as $d_{x} = \gcd (I_{x})$.
All states in the same recurrent class have the same period.
A recurrent class is called periodic if its period is greater than 1; otherwise, it is called aperiodic.

See Periodicity for the properties of periodic and aperiodic states/ Markov chains.

We say a kernel is aperiodic if the sequence doesn’t loop between sets of states in a pre-defined pattern.
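The period of a state in a finite chain can be estimated by taking the gcd of the return times found up to a cutoff. The sketch below tracks only which states are reachable in exactly $n$ steps, so it needs no floating-point matrix powers. The function name and the default cutoff of $3n$ steps are assumptions of this sketch.

```python
from math import gcd

def period(P, x, horizon=None):
    """gcd of all n <= horizon with (P^n)[x][x] > 0; returns 0 if none is found."""
    n = len(P)
    if horizon is None:
        horizon = 3 * n  # cutoff assumed by this sketch
    # support[i][j]: a one-step transition i -> j has positive probability
    support = [[P[i][j] > 0 for j in range(n)] for i in range(n)]
    # current[j]: state j is reachable from x in exactly `steps` steps
    current = [j == x for j in range(n)]
    d = 0
    for steps in range(1, horizon + 1):
        current = [any(current[i] and support[i][j] for i in range(n))
                   for j in range(n)]
        if current[x]:
            d = gcd(d, steps)
    return d

flip = [[0.0, 1.0],
        [1.0, 0.0]]      # deterministic alternation between two states
lazy = [[0.5, 0.5],
        [0.5, 0.5]]      # self-loops make every return time possible
d_flip = period(flip, 0)   # 2
d_lazy = period(lazy, 0)   # 1
```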

Ergodicity

We say

a state is ergodic if it’s recurrent and aperiodic;
a Markov chain is ergodic if it’s irreducible and all its states are ergodic,
- equivalently, the chain is irreducible and aperiodic.
- For infinite (e.g., continuous) state spaces, an ergodic chain must also be positive recurrent, i.e., the mean recurrence time of every state is finite. This is automatically satisfied for finite state spaces.

We have the Mixing Property:

Transclude of mixing-property#^c7829b

Alt Def 1. Sometimes ergodicity refers to the mixing property itself, i.e., we say a kernel is ergodic if there exists a stationary distribution $μ$ of states such that $\lim_{n \to \infty} P^{n} X_{0} \sim μ$ for any initial distribution of $X_{0}$. For tabular cases, this reads $X_{\infty} = μ$.

In this case, irreducibility + aperiodicity ⇒ ergodicity.

Alt Def 2. In closer alignment with ergodic theory, an ergodic Markov chain is simply an irreducible one. In this sense, the chain can go from any state to any other state in a finite number of steps, and every state is visited infinitely often.

Without assuming irreducibility or aperiodicity, we still have an Ergodic Theorem for chains with a single recurrent class, which relates the time average $N_{x} (t) / t$ to the spatial average $π_{x}$.

Transclude of ergodic-theorem#^thm

MRP and MDP

POMDP

Model Learning

The most widely used technique to learn a Markov kernel is the Monte Carlo Method, which uses Maximum Likelihood Estimation.

M_{ML} = \arg\max_{M} p (s_{1}, \dots, s_{t} ∣ M) = \arg\max_{M} \sum_{u = 1}^{t - 1} \sum_{i, j = 1}^{S} 1 (s_{u} = i, s_{u + 1} = j) \ln M_{ij}

Since each row of $M$ has to be a probability distribution, we can show that

M_{ML} (i, j) = \frac{\sum _{u = 1}^{t - 1} 1 ( s _{u} = i , s _{u + 1} = j )}{\sum _{u = 1}^{t - 1} 1 ( s _{u} = i )} .
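One way to obtain this closed form is a Lagrange-multiplier argument applied to each row. The derivation below is a sketch; the count $N_{ij}$ and the multipliers $\nu_{i}$ are shorthand introduced here and do not appear in the original note.

```latex
\begin{align*}
N_{ij} &= \sum_{u=1}^{t-1} \mathbf{1}(s_u = i,\ s_{u+1} = j)
  && \text{(transition counts, shorthand)} \\
\mathcal{L}(M, \nu) &= \sum_{i,j} N_{ij} \ln M_{ij}
  + \sum_{i} \nu_i \Bigl(1 - \sum_{j} M_{ij}\Bigr)
  && \text{(one constraint per row)} \\
\frac{\partial \mathcal{L}}{\partial M_{ij}} &= \frac{N_{ij}}{M_{ij}} - \nu_i = 0
  \;\Longrightarrow\; M_{ij} = \frac{N_{ij}}{\nu_i} \\
\sum_{j} M_{ij} = 1 &\;\Longrightarrow\;
  \nu_i = \sum_{j} N_{ij} = \sum_{u=1}^{t-1} \mathbf{1}(s_u = i)
\end{align*}
```

Substituting $\nu_{i}$ back into $M_{ij} = N_{ij} / \nu_{i}$ recovers the ratio of counts.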

Empirically, count how many times we observe a transition from $i \to j$ and divide by the total number of transitions from $i$ .
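The counting rule can be sketched in a few lines of plain Python. The function name and the example sequence are hypothetical. A state that never appears as the origin of a transition has no data to estimate its row, so this sketch leaves that row at zero.

```python
from collections import Counter

def estimate_transition_matrix(states, num_states):
    """Maximum-likelihood estimate: M[i][j] = count(i -> j) / count(i -> anything)."""
    pair_counts = Counter(zip(states[:-1], states[1:]))  # observed i -> j transitions
    out_counts = Counter(states[:-1])                    # transitions leaving i
    M = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        if out_counts[i] == 0:
            continue  # no transitions from i observed; row left as zeros
        for j in range(num_states):
            M[i][j] = pair_counts[(i, j)] / out_counts[i]
    return M

seq = [0, 0, 1, 0, 1, 1, 0]        # hypothetical observed state sequence
M = estimate_transition_matrix(seq, 2)
```

Here state 0 is left three times (once to 0, twice to 1) and state 1 is left three times (twice to 0, once to 1), so the estimate is `[[1/3, 2/3], [2/3, 1/3]]`.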

Simple Ergodic Model-Based Applications

We present two Markov chain model applications assuming ergodicity and the full knowledge of the kernel (so we can compute the steady distribution).

Ranking

Problem setup

We construct a Markov chain where each object is a state.
We encourage transitions from objects that lose to objects that win.
Transitions only occur between objects that play each other.
- If object A beats object B, there should be a high probability of transitioning from B→A and small probability from A→B.
The strength of the transition can be linked to the score of the game.
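- For each game between A and B, the unnormalized transition weights are updated as $K_{ij} \mathrel{+}= \mathbf{1} \{j \text{ wins}\} + \frac{\text{pts}_{j}}{\text{pts}_{A} + \text{pts}_{B}}$ for $i, j \in \{A, B\}$, so four entries are updated after each game. Normalizing each row of $K$ then gives the transition matrix.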
Predicting the “state” (i.e., object) far in the future, we can interpret a more probable state as a better object.
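The setup above can be sketched end to end in plain Python. Several details are assumptions of this sketch rather than statements from the note: the score fraction is taken relative to the total points of the game, every game has a positive point total, every object plays at least once, and the resulting chain is ergodic so that power iteration converges. The game results and the function name `rank` are made up.

```python
def rank(games, teams, iters=1000):
    """Rank teams by the stationary distribution of a win-biased random walk.

    games: iterable of (team_a, pts_a, team_b, pts_b) with pts_a + pts_b > 0.
    """
    idx = {name: k for k, name in enumerate(teams)}
    n = len(teams)
    K = [[0.0] * n for _ in range(n)]          # unnormalized transition weights
    for a, pts_a, b, pts_b in games:
        total = pts_a + pts_b                  # assumed positive
        score = {a: float(pts_a > pts_b) + pts_a / total,
                 b: float(pts_b > pts_a) + pts_b / total}
        for i in (a, b):                       # four entries per game
            for j in (a, b):
                K[idx[i]][idx[j]] += score[j]  # weight flows toward the stronger side
    P = [[v / sum(row) for v in row] for row in K]   # row-normalize
    mu = [1.0 / n] * n
    for _ in range(iters):                     # power iteration: mu^T <- mu^T P
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return sorted(zip(teams, mu), key=lambda kv: -kv[1])

# Hypothetical results: A beats B and C, B beats C.
games = [("A", 3, "B", 1), ("B", 2, "C", 1), ("A", 2, "C", 1)]
ranking = rank(games, ["A", "B", "C"])
```

The stationary mass concentrates on objects that win often and by large margins, so reading the list from the top gives the ranking.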

Semi-Supervised Classification

For classification in Semi-Supervised Learning, we can use a random walk method:

A “random walker” starts from an unlabeled point $x_{i}$ and moves around from point to point
A transition between nearby points has a higher probability
- Distance defined by Euclidean distance, kernels, etc.
A transition to a labeled point terminates the walk
We label $x_{i}$ using the label of the terminal point

We call the points with pre-defined labels absorbing states because the probability of moving from such a state to any other state is zero. The model is therefore not ergodic, and the limiting distribution depends on the initial state. We arrange $P$ so that all the absorbing states appear at the bottom right:

P = \begin{pmatrix} A & B \\ 0 & I \end{pmatrix}

Then,

P^{\infty} = \begin{pmatrix} A^{\infty} & \sum_{t = 0}^{\infty} A^{t} B \\ 0 & I \end{pmatrix} = \begin{pmatrix} 0 & (I - A)^{- 1} B \\ 0 & I \end{pmatrix} .

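The second equality holds because $λ_{\max} (A) < 1$, so $A^{t} \to 0$ and $\sum_{t = 0}^{\infty} A^{t} = (I - A)^{- 1}$. Then, for any unlabeled data point $x_{i}$, its classification weight vector is $[(I - A)^{- 1} B]_{i, :}$, whose entries are the probabilities of being absorbed at each labeled point.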
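The weight matrix $(I - A)^{- 1} B$ can be approximated without a matrix inverse by summing the series $\sum_{t} A^{t} B$ through the fixed-point iteration $F \leftarrow B + A F$. The sketch below is plain Python and assumes the series converges, i.e., every unlabeled point can eventually reach a labeled one. The blocks `A` and `B` and the function name are hypothetical.

```python
def absorption_weights(A, B, tol=1e-12, max_iter=100_000):
    """Approximate (I - A)^{-1} B by iterating F <- B + A F.

    A: transitions among unlabeled points; B: unlabeled -> labeled points.
    """
    n, m = len(A), len(B[0])
    F = [row[:] for row in B]                  # partial sum containing the t = 0 term
    for _ in range(max_iter):
        new = [[B[i][c] + sum(A[i][k] * F[k][c] for k in range(n))
                for c in range(m)] for i in range(n)]
        diff = max(abs(new[i][c] - F[i][c]) for i in range(n) for c in range(m))
        F = new
        if diff < tol:
            return F
    raise RuntimeError("series did not converge; is every walk absorbed?")

# Two unlabeled points and two labeled points; rows of [A B] sum to 1.
A = [[0.0, 0.5],
     [0.5, 0.0]]
B = [[0.5, 0.0],
     [0.0, 0.5]]
F = absorption_weights(A, B)
labels = [max(range(len(row)), key=row.__getitem__) for row in F]
```

For these blocks, $(I - A)^{- 1} B = \begin{pmatrix} 2/3 & 1/3 \\ 1/3 & 2/3 \end{pmatrix}$, so each unlabeled point takes the label of its nearer labeled neighbor.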
