Introduction to Markov Chain Monte Carlo (MCMC)

 

Markov Chain — Introduction

🔹 Intuitive Idea

A Markov chain is a random process where:

The future depends only on the present, not on the past.

This is called the Markov property or memoryless property.


🔹 Everyday Examples

  • Weather:

    • Tomorrow’s weather depends on today’s weather, not last week’s.

  • Board games:

    • Your next position depends only on your current square.

  • Web surfing:

    • Next page depends only on the current page.


🔹 Formal Definition

A sequence of random variables X_0, X_1, X_2, \dots is a Markov chain if:

P(X_{t+1} = j \mid X_t = i, X_{t-1}, \dots, X_0) = P(X_{t+1} = j \mid X_t = i)

🔹 Transition Probabilities

We define:

P_{ij} = P(X_{t+1} = j \mid X_t = i)

These form the transition matrix:

P = \begin{bmatrix} P_{11} & P_{12} & \dots \\ P_{21} & P_{22} & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}

Properties:

  • Each entry ≥ 0

  • Each row sums to 1


🔹 Example: Simple Weather Model

States:

S = \{\text{Sunny}, \text{Rainy}\}

Transition matrix:

P = \begin{bmatrix} 0.8 & 0.2 \\ 0.4 & 0.6 \end{bmatrix}

Interpretation (simulated in the sketch below):

  • Sunny → Sunny: 0.8

  • Sunny → Rainy: 0.2

  • Rainy → Sunny: 0.4

  • Rainy → Rainy: 0.6
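
As a quick illustration, here is a minimal simulation sketch, assuming NumPy is available; state 0 stands for Sunny and state 1 for Rainy:

```python
import numpy as np

# Simulate the two-state weather chain: at each step, the next
# state is drawn using only the current state's row of P.
P = np.array([[0.8, 0.2],    # row 0: transitions out of Sunny
              [0.4, 0.6]])   # row 1: transitions out of Rainy

rng = np.random.default_rng(0)
state, states = 0, []        # start on a Sunny day
for _ in range(100_000):
    # Markov property: the draw depends only on the current state
    state = rng.choice(2, p=P[state])
    states.append(state)

# Long-run fraction of Sunny vs. Rainy days
print(np.bincount(states) / len(states))   # approx [0.667, 0.333]
```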


🔹 Stationary Distribution

A probability vector \pi is stationary if:

\pi P = \pi

Meaning:

  • The long-run fraction of time the chain spends in each state (see the sketch below)
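
As a concrete check, a minimal sketch that finds \pi for the weather chain above by power iteration (repeatedly applying \pi \leftarrow \pi P); the exact answer is \pi = (2/3, 1/3):

```python
import numpy as np

# Power iteration: starting from any initial distribution,
# repeatedly multiplying by P converges to the stationary pi.
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

pi = np.array([1.0, 0.0])   # start from "definitely Sunny"
for _ in range(100):
    pi = pi @ P             # pi <- pi P

print(pi)                   # approx [0.6667, 0.3333]
print(pi @ P)               # unchanged, confirming pi P = pi
```

Note that this matches the long-run frequencies from the simulation sketch earlier.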


🔹 Why Markov Chains Matter in ML

Used in:

  • Hidden Markov Models

  • Reinforcement learning

  • PageRank algorithm

  • MCMC sampling

  • Speech recognition

  • NLP sequence models

Monte Carlo Methods — Introduction

🔹 Intuitive Idea

Monte Carlo methods use random sampling to solve problems that are hard to compute analytically.

Replace complicated math with repeated random experiments.


🔹 Origin of the Name

Named after the Monte Carlo casino in Monaco, because the method's results rely on randomness.


🔹 Core Principle

If you want:

\mathbb{E}[f(X)]

Instead of solving integrals, approximate by:

\frac{1}{N}\sum_{i=1}^{N} f(x_i)

where the x_i are random samples drawn from the distribution of X.


🔹 Example: Estimating π

  1. Draw random points inside a square

  2. Count how many fall inside a circle

  3. The fraction inside, multiplied by 4, approximates π (see the sketch below)

\pi \approx 4 \times \frac{\text{points in circle}}{\text{total points}}
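
A minimal sketch, assuming NumPy; it samples the unit square and tests membership in the quarter circle x^2 + y^2 ≤ 1, which occupies the same π/4 fraction as the inscribed circle in a full square:

```python
import numpy as np

# Monte Carlo estimate of pi: the quarter circle covers pi/4
# of the unit square, so 4 * (fraction inside) -> pi.
rng = np.random.default_rng(0)
N = 1_000_000
x, y = rng.random(N), rng.random(N)   # uniform points in the unit square
inside = x**2 + y**2 <= 1.0           # inside the quarter circle?
print(4 * inside.mean())              # approx 3.1416
```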

🔹 Monte Carlo Integration

To compute:

\int f(x)\, p(x)\, dx

Use:

\frac{1}{N}\sum_{i=1}^{N} f(x_i)

with x_i \sim p(x).
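
A minimal sketch under illustrative assumptions: take f(x) = e^x and let p(x) be the standard normal density, so the exact answer is \mathbb{E}[e^X] = e^{1/2} \approx 1.6487:

```python
import numpy as np

# Monte Carlo integration of  integral of f(x) p(x) dx  with
# f(x) = exp(x) and p = N(0, 1); the exact value is exp(0.5).
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # x_i ~ p(x)
print(np.mean(np.exp(x)))          # approx 1.6487
```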


🔹 Why Monte Carlo Works

Based on Law of Large Numbers:

\text{Sample average} \to \text{true expectation}

as N \to \infty.
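
A minimal numeric illustration, assuming NumPy: estimating \mathbb{E}[X] = 0.5 for X \sim \text{Uniform}(0, 1), the error of the sample average typically shrinks as N grows:

```python
import numpy as np

# Law of Large Numbers in action: the sample mean of
# Uniform(0, 1) draws approaches the true mean 0.5.
rng = np.random.default_rng(0)
for N in (100, 10_000, 1_000_000):
    estimate = rng.random(N).mean()
    print(N, estimate, abs(estimate - 0.5))   # error shrinks with N
```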


Markov Chain + Monte Carlo = MCMC

🔹 The Big Idea

When sampling directly from a distribution is hard:

  1. Construct a Markov chain

  2. Design it so its stationary distribution = target distribution

  3. Run the chain long enough

  4. Use the samples

This is Markov Chain Monte Carlo (MCMC).

Why Do We Need MCMC?

The Core Problem in Bayesian Inference

In Bayesian learning we want the posterior distribution:

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}

The difficulty:

  • p(D) involves an integral over all parameters

  • Often impossible to compute analytically

  • Can be very high dimensional

Example:

p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta

For realistic ML models, this integral is intractable.


What Do We Actually Need?

Usually we don’t need the exact posterior formula.

Instead we want:

  • Posterior mean

  • Posterior variance

  • Predictions

  • Credible intervals

All of these can be written as:

\mathbb{E}[f(\theta)] = \int f(\theta)\, p(\theta \mid D)\, d\theta


💡 Monte Carlo Idea (First Insight)

If we had samples:

\theta_1, \theta_2, \dots, \theta_N \sim p(\theta \mid D)

then:

\mathbb{E}[f(\theta)] \approx \frac{1}{N}\sum_{i=1}^{N} f(\theta_i)

So:

Bayesian inference becomes easy if we can sample from the posterior.


The Problem

We cannot sample directly from p(\theta \mid D):

  • Distribution may be complicated

  • High dimensional

  • Only known up to a constant

We only know:

p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)
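
This is all MCMC needs: the acceptance rule described in the next sections compares two parameter values only through a ratio of posterior densities, so the unknown constant p(D) cancels:

\frac{p(\theta' \mid D)}{p(\theta \mid D)} = \frac{p(D \mid \theta')\, p(\theta') / p(D)}{p(D \mid \theta)\, p(\theta) / p(D)} = \frac{p(D \mid \theta')\, p(\theta')}{p(D \mid \theta)\, p(\theta)}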


The Brilliant Idea of MCMC

Instead of sampling directly:

Build a Markov chain whose long-run behavior equals the target posterior.

Then:

  1. Start anywhere

  2. Move randomly using special rules

  3. After many steps, positions follow the posterior distribution

  4. Treat those positions as samples


Intuition: Random Walk in a Mountain Landscape

Imagine:

  • Landscape height = posterior probability

  • High mountains = high probability regions

  • Valleys = low probability

We want to:

Spend more time on high mountains.

MCMC does:

  • Walk randomly

  • Prefer uphill moves

  • Sometimes accept downhill moves

  • Eventually explore the entire landscape

  • Time spent at each location ∝ probability


Why Markov Chain?

Because:

  • Next step depends only on current location

  • Not on full history

This keeps the process simple and mathematically tractable.


Why Monte Carlo?

Because we approximate integrals using:

\text{sample averages}

So:

  • Markov chain → generates samples

  • Monte Carlo → uses samples to compute expectations

Together:

\boxed{\text{MCMC = Sampling via Markov chain + Estimation via Monte Carlo}}


How MCMC Works Conceptually

Step 1: Start at some parameter value

\theta_0


Step 2: Propose a new value

\theta' = \theta_t + \text{random noise}


Step 3: Decide whether to move

If the new point has higher probability, accept the move.

If it has lower probability, accept it only with some probability (in the Metropolis rule, the ratio of the new density to the old one).

This ensures:

  • Exploration

  • No getting stuck forever


Step 4: Repeat many times

\theta_0 \to \theta_1 \to \theta_2 \to \dots

After a while:

  • Chain “forgets” starting point

  • Samples follow the posterior (see the runnable sketch below)
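
Putting Steps 1–4 together, here is a minimal random-walk Metropolis sketch, assuming a toy one-dimensional target known only up to a constant (an unnormalized standard normal stands in for the posterior):

```python
import numpy as np

# Random-walk Metropolis: Steps 1-4 above, for an unnormalized
# 1-D target. Only ratios of the target are ever used, so the
# normalizing constant is never needed.
def unnorm_posterior(theta):
    return np.exp(-0.5 * theta**2)   # target, known up to a constant

rng = np.random.default_rng(0)
theta = 5.0                          # Step 1: start anywhere (deliberately far off)
samples = []
for _ in range(50_000):              # Step 4: repeat many times
    proposal = theta + rng.normal(scale=1.0)   # Step 2: theta' = theta + noise
    # Step 3: always accept uphill moves; accept downhill moves
    # with probability equal to the density ratio
    if rng.random() < unnorm_posterior(proposal) / unnorm_posterior(theta):
        theta = proposal
    samples.append(theta)

kept = np.array(samples[5_000:])     # discard burn-in
print(kept.mean(), kept.std())       # approx 0.0 and 1.0 for N(0, 1)
```

A histogram of `kept` should closely match the N(0, 1) density, which is exactly the "histogram of samples ≈ posterior density" claim in the next subsection.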


What Happens Over Time

Initially:

  • Samples biased toward starting point

After burn-in:

  • Samples represent posterior

Long run:

  • Histogram of samples ≈ posterior density


Why MCMC Is Powerful

Without MCMC:

Exact Bayesian inference is tractable only for simple models (for example, conjugate prior–likelihood pairs).

With MCMC:

We can handle:

  • Neural networks

  • Hierarchical Bayesian models

  • Complex graphical models

  • High-dimensional parameter spaces


🌍 Real Machine Learning Uses

  • Bayesian neural networks

  • Topic modeling

  • Gaussian process inference

  • Probabilistic programming (Stan, PyMC)

  • Computational biology

  • Physics simulations


Simple Analogy for Students

🔎 Searching for Fish in a Lake

  • Fish density = probability

  • You move randomly

  • Stay longer where fish are abundant

  • Record your positions over time

Your recorded positions approximate the fish distribution.

That’s MCMC.


One-Line Intuition 

MCMC generates samples from a complex distribution by simulating a Markov chain whose stationary distribution equals the target distribution.


Summary

Goal:

\text{Sample } \theta \sim p(\theta \mid D)

Method:

  1. Construct a transition rule T(\theta \to \theta')

  2. Ensure stationary distribution = posterior

  3. Run chain long enough

  4. Use samples for Monte Carlo estimation
