Introduction to Markov Chain Monte Carlo (MCMC)

 

Markov Chain — Introduction

🔹 Intuitive Idea

A Markov chain is a random process where:

The future depends only on the present, not on the past.

This is called the Markov property or memoryless property.


🔹 Everyday Examples

  • Weather:

    • Tomorrow’s weather depends on today’s weather, not last week’s.

  • Board games:

    • Your next position depends only on your current square.

  • Web surfing:

    • Next page depends only on the current page.


🔹 Formal Definition

A sequence of random variables X_0, X_1, X_2, \dots is a Markov chain if:

P(X_{t+1} = j \mid X_t = i, X_{t-1}, \dots, X_0) = P(X_{t+1} = j \mid X_t = i)

🔹 Transition Probabilities

We define:

P_{ij} = P(X_{t+1} = j \mid X_t = i)

These form the transition matrix:

P = \begin{bmatrix} P_{11} & P_{12} & \dots \\ P_{21} & P_{22} & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}

Properties:

  • Each entry ≥ 0

  • Each row sums to 1


🔹 Example: Simple Weather Model

States:

S = \{\text{Sunny}, \text{Rainy}\}

Transition matrix:

P = \begin{bmatrix} 0.8 & 0.2 \\ 0.4 & 0.6 \end{bmatrix}

Interpretation (simulated in the sketch below):

  • Sunny → Sunny: 0.8

  • Sunny → Rainy: 0.2

  • Rainy → Sunny: 0.4

  • Rainy → Rainy: 0.6
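
As a quick illustration, here is a minimal simulation sketch, assuming NumPy is available; state 0 stands for Sunny and state 1 for Rainy:

```python
import numpy as np

# Simulate the two-state weather chain: at each step, the next
# state is drawn using only the current state's row of P.
P = np.array([[0.8, 0.2],    # row 0: transitions out of Sunny
              [0.4, 0.6]])   # row 1: transitions out of Rainy

rng = np.random.default_rng(0)
state, states = 0, []        # start on a Sunny day
for _ in range(100_000):
    # Markov property: the draw depends only on the current state
    state = rng.choice(2, p=P[state])
    states.append(state)

# Long-run fraction of Sunny vs. Rainy days
print(np.bincount(states) / len(states))   # approx [0.667, 0.333]
```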


🔹 Stationary Distribution

A probability vector \pi is stationary if:

\pi P = \pi

Meaning:

  • The long-run fraction of time the chain spends in each state (see the sketch below)
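
As a concrete check, a minimal sketch that finds \pi for the weather chain above by power iteration (repeatedly applying \pi \leftarrow \pi P); the exact answer is \pi = (2/3, 1/3):

```python
import numpy as np

# Power iteration: starting from any initial distribution,
# repeatedly multiplying by P converges to the stationary pi.
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

pi = np.array([1.0, 0.0])   # start from "definitely Sunny"
for _ in range(100):
    pi = pi @ P             # pi <- pi P

print(pi)                   # approx [0.6667, 0.3333]
print(pi @ P)               # unchanged, confirming pi P = pi
```

Note that this matches the long-run frequencies from the simulation sketch earlier.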


🔹 Why Markov Chains Matter in ML

Used in:

  • Hidden Markov Models

  • Reinforcement learning

  • PageRank algorithm

  • MCMC sampling

  • Speech recognition

  • NLP sequence models

Monte Carlo Methods — Introduction

🔹 Intuitive Idea

Monte Carlo methods use random sampling to solve problems that are hard to compute analytically.

Replace complicated math with repeated random experiments.


🔹 Origin of the Name

Named after the Monte Carlo casino in Monaco, because the method's results rely on randomness.


🔹 Core Principle

If you want:

\mathbb{E}[f(X)]

Instead of solving integrals, approximate by:

\frac{1}{N}\sum_{i=1}^{N} f(x_i)

where the x_i are random samples drawn from the distribution of X.


🔹 Example: Estimating π

  1. Draw random points inside a square

  2. Count how many fall inside a circle

  3. The fraction inside, multiplied by 4, approximates π (see the sketch below)

\pi \approx 4 \times \frac{\text{points in circle}}{\text{total points}}
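
A minimal sketch, assuming NumPy; it samples the unit square and tests membership in the quarter circle x^2 + y^2 ≤ 1, which occupies the same π/4 fraction as the inscribed circle in a full square:

```python
import numpy as np

# Monte Carlo estimate of pi: the quarter circle covers pi/4
# of the unit square, so 4 * (fraction inside) -> pi.
rng = np.random.default_rng(0)
N = 1_000_000
x, y = rng.random(N), rng.random(N)   # uniform points in the unit square
inside = x**2 + y**2 <= 1.0           # inside the quarter circle?
print(4 * inside.mean())              # approx 3.1416
```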

🔹 Monte Carlo Integration

To compute:

\int f(x)\, p(x)\, dx

Use:

\frac{1}{N}\sum_{i=1}^{N} f(x_i)

with x_i \sim p(x).
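
A minimal sketch under illustrative assumptions: take f(x) = e^x and let p(x) be the standard normal density, so the exact answer is \mathbb{E}[e^X] = e^{1/2} \approx 1.6487:

```python
import numpy as np

# Monte Carlo integration of  integral of f(x) p(x) dx  with
# f(x) = exp(x) and p = N(0, 1); the exact value is exp(0.5).
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # x_i ~ p(x)
print(np.mean(np.exp(x)))          # approx 1.6487
```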


🔹 Why Monte Carlo Works

Based on Law of Large Numbers:

\text{Sample average} \to \text{true expectation}

as N \to \infty.
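
A minimal numeric illustration, assuming NumPy: estimating \mathbb{E}[X] = 0.5 for X \sim \text{Uniform}(0, 1), the error of the sample average typically shrinks as N grows:

```python
import numpy as np

# Law of Large Numbers in action: the sample mean of
# Uniform(0, 1) draws approaches the true mean 0.5.
rng = np.random.default_rng(0)
for N in (100, 10_000, 1_000_000):
    estimate = rng.random(N).mean()
    print(N, estimate, abs(estimate - 0.5))   # error shrinks with N
```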


Markov Chain + Monte Carlo = MCMC

🔹 The Big Idea

When sampling directly from a distribution is hard:

  1. Construct a Markov chain

  2. Design it so its stationary distribution = target distribution

  3. Run the chain long enough

  4. Use the samples

This is Markov Chain Monte Carlo (MCMC).

Why Do We Need MCMC?

The Core Problem in Bayesian Inference

In Bayesian learning we want the posterior distribution:

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}

The difficulty:

  • p(D) involves an integral over all parameters

  • Often impossible to compute analytically

  • Can be very high dimensional

Example:

p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta

For realistic ML models, this integral is intractable.


What Do We Actually Need?

Usually we don’t need the exact posterior formula.

Instead we want:

  • Posterior mean

  • Posterior variance

  • Predictions

  • Credible intervals

All of these can be written as:

\mathbb{E}[f(\theta)] = \int f(\theta)\, p(\theta \mid D)\, d\theta


💡 Monte Carlo Idea (First Insight)

If we had samples:

\theta_1, \theta_2, \dots, \theta_N \sim p(\theta \mid D)

then:

\mathbb{E}[f(\theta)] \approx \frac{1}{N}\sum_{i=1}^{N} f(\theta_i)

So:

Bayesian inference becomes easy if we can sample from the posterior.


The Problem

We cannot sample directly from p(\theta \mid D):

  • Distribution may be complicated

  • High dimensional

  • Only known up to a constant

We only know:

p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)
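
This is all MCMC needs: the acceptance rule described in the next sections compares two parameter values only through a ratio of posterior densities, so the unknown constant p(D) cancels:

\frac{p(\theta' \mid D)}{p(\theta \mid D)} = \frac{p(D \mid \theta')\, p(\theta') / p(D)}{p(D \mid \theta)\, p(\theta) / p(D)} = \frac{p(D \mid \theta')\, p(\theta')}{p(D \mid \theta)\, p(\theta)}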


The Brilliant Idea of MCMC

Instead of sampling directly:

Build a Markov chain whose long-run behavior equals the target posterior.

Then:

  1. Start anywhere

  2. Move randomly using special rules

  3. After many steps, positions follow the posterior distribution

  4. Treat those positions as samples


Intuition: Random Walk in a Mountain Landscape

Imagine:

  • Landscape height = posterior probability

  • High mountains = high probability regions

  • Valleys = low probability

We want to:

Spend more time on high mountains.

MCMC does:

  • Walk randomly

  • Prefer uphill moves

  • Sometimes accept downhill moves

  • Eventually explore the entire landscape

  • Time spent at each location ∝ probability


Why Markov Chain?

Because:

  • Next step depends only on current location

  • Not on full history

This keeps the process simple and mathematically tractable.


Why Monte Carlo?

Because we approximate integrals using:

\text{sample averages}

So:

  • Markov chain → generates samples

  • Monte Carlo → uses samples to compute expectations

Together:

\boxed{\text{MCMC = Sampling via Markov chain + Estimation via Monte Carlo}}


How MCMC Works Conceptually

Step 1: Start at some parameter value

\theta_0


Step 2: Propose a new value

\theta' = \theta_t + \text{random noise}


Step 3: Decide whether to move

If the new point has higher probability, accept the move.

If it has lower probability, accept it only with some probability (in the Metropolis rule, the ratio of the new density to the old one).

This ensures:

  • Exploration

  • No getting stuck forever


Step 4: Repeat many times

\theta_0 \to \theta_1 \to \theta_2 \to \dots

After a while:

  • Chain “forgets” starting point

  • Samples follow the posterior (see the runnable sketch below)
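
Putting Steps 1–4 together, here is a minimal random-walk Metropolis sketch, assuming a toy one-dimensional target known only up to a constant (an unnormalized standard normal stands in for the posterior):

```python
import numpy as np

# Random-walk Metropolis: Steps 1-4 above, for an unnormalized
# 1-D target. Only ratios of the target are ever used, so the
# normalizing constant is never needed.
def unnorm_posterior(theta):
    return np.exp(-0.5 * theta**2)   # target, known up to a constant

rng = np.random.default_rng(0)
theta = 5.0                          # Step 1: start anywhere (deliberately far off)
samples = []
for _ in range(50_000):              # Step 4: repeat many times
    proposal = theta + rng.normal(scale=1.0)   # Step 2: theta' = theta + noise
    # Step 3: always accept uphill moves; accept downhill moves
    # with probability equal to the density ratio
    if rng.random() < unnorm_posterior(proposal) / unnorm_posterior(theta):
        theta = proposal
    samples.append(theta)

kept = np.array(samples[5_000:])     # discard burn-in
print(kept.mean(), kept.std())       # approx 0.0 and 1.0 for N(0, 1)
```

A histogram of `kept` should closely match the N(0, 1) density, which is exactly the "histogram of samples ≈ posterior density" claim in the next subsection.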


What Happens Over Time

Initially:

  • Samples biased toward starting point

After burn-in:

  • Samples represent posterior

Long run:

  • Histogram of samples ≈ posterior density


Why MCMC Is Powerful

Without MCMC:

Exact Bayesian inference is tractable only for simple models (for example, conjugate prior–likelihood pairs).

With MCMC:

We can handle:

  • Neural networks

  • Hierarchical Bayesian models

  • Complex graphical models

  • High-dimensional parameter spaces


🌍 Real Machine Learning Uses

  • Bayesian neural networks

  • Topic modeling

  • Gaussian process inference

  • Probabilistic programming (Stan, PyMC)

  • Computational biology

  • Physics simulations


Simple Analogy for Students

🔎 Searching for Fish in a Lake

  • Fish density = probability

  • You move randomly

  • Stay longer where fish are abundant

  • Record your positions over time

Your recorded positions approximate the fish distribution.

That’s MCMC.


One-Line Intuition 

MCMC generates samples from a complex distribution by simulating a Markov chain whose stationary distribution equals the target distribution.


Summary

Goal:

\text{Sample } \theta \sim p(\theta \mid D)

Method:

  1. Construct a transition rule T(\theta \to \theta')

  2. Ensure stationary distribution = posterior

  3. Run chain long enough

  4. Use samples for Monte Carlo estimation
