Markov Decision Process (MDP) in Reinforcement Learning

A Markov Decision Process (MDP) is the standard mathematical framework used to model decision-making problems where outcomes depend on both randomness and the agent’s actions.

It describes how an agent interacts with an environment to learn the best behavior.


Core Idea (Simple Intuition)

An agent repeatedly:

  1. Observes the current state

  2. Takes an action

  3. Receives a reward

  4. Moves to a new state

The goal is to choose actions that maximize total future reward.
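
A minimal Python sketch of this loop, assuming a Gym-style environment object with reset() and step() methods (the interface and the random placeholder policy are illustrative, not part of the definition):

```python
import random

def run_episode(env, actions, max_steps=100):
    """Run one episode of the observe-act-reward-transition loop."""
    state = env.reset()                          # 1. observe the current state
    total_reward = 0.0
    for _ in range(max_steps):
        action = random.choice(actions)          # 2. take an action (placeholder policy)
        state, reward, done = env.step(action)   # 3. receive a reward, 4. move to a new state
        total_reward += reward
        if done:
            break
    return total_reward                          # the quantity the agent tries to maximize
```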


The Markov Property

MDPs assume the Markov property:

The future depends only on the current state, not on past history.

P(S_{t+1} \mid S_t, A_t, S_{t-1}, \dots) = P(S_{t+1} \mid S_t, A_t)

This makes learning and optimization tractable.


Components of an MDP

An MDP is defined by the tuple:

(S, A, P, R, \gamma)

✅ 1. States (S)

All possible situations of the environment
Example: robot position, game board configuration


✅ 2. Actions (A)

Choices available to the agent
Example: move left/right, accelerate, pick object


✅ 3. Transition Probability (P)

P(s' \mid s, a)

Probability of moving from state s to s' after taking action a.

It captures the uncertainty in the environment's dynamics.


✅ 4. Reward Function (R)

R(s, a, s')

The immediate feedback received after taking an action.

  • Positive → good action

  • Negative → penalty


✅ 5. Discount Factor (γ, between 0 and 1)

Controls importance of future rewards.

  • γ ≈ 0 → focus on immediate reward

  • γ ≈ 1 → long-term planning
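
To make the tuple concrete, here is a minimal sketch of a tiny two-state MDP written out as plain Python dictionaries (the states, actions, and numbers are invented purely for illustration):

```python
# A tiny hand-made MDP (S, A, P, R, gamma) with two states and two actions.
S = ["s0", "s1"]
A = ["stay", "go"]

# P[(s, a)] maps each possible next state s' to its probability P(s'|s, a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "go"):   {"s0": 0.7, "s1": 0.3},
}

# R[(s, a, s')] is the immediate reward for that transition: +1 for landing in s1.
R = {(s, a, s2): (1.0 if s2 == "s1" else 0.0) for s in S for a in A for s2 in S}

gamma = 0.9  # discount factor: how strongly future rewards count
```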


Objective of Reinforcement Learning

Find a policy π(a|s) that maximizes expected return:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots

Policy = rule for choosing actions in each state.
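
As a quick sanity check of the return formula, this small sketch computes G_t for a hand-written reward sequence (the numbers are arbitrary):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):   # k = 0, 1, 2, ...
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81*2 = 2.62
```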

Example: Robot Navigation

  • Environment: a grid world

  • States: the robot's positions on the grid

  • Actions: Up, Down, Left, Right

  • Rewards: +10 for reaching the goal, −1 for each step, −10 for hitting an obstacle

  • Transitions: movement sometimes fails, so outcomes are random

This forms a complete MDP model.
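
A sketch of how those pieces could look in code for a small 3×3 grid (grid size, slip probability, and cell labels are illustrative choices, not fixed by the example):

```python
# 3x3 grid world: states are (row, col) cells; the goal is (2, 2),
# an obstacle sits at (1, 1), and movement "slips" (fails) 20% of the time.
GOAL, OBSTACLE, SLIP = (2, 2), (1, 1), 0.2
ACTIONS = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

def transition(state, action):
    """Return {next_state: probability}, i.e. P(s'|s, a)."""
    dr, dc = ACTIONS[action]
    r, c = state
    intended = (max(0, min(2, r + dr)), max(0, min(2, c + dc)))  # stay on the grid
    if intended == state:
        return {state: 1.0}
    return {intended: 1.0 - SLIP, state: SLIP}   # with probability SLIP the move fails

def reward(state, action, next_state):
    if next_state == GOAL:
        return +10.0   # reaching the goal
    if next_state == OBSTACLE:
        return -10.0   # hitting the obstacle
    return -1.0        # cost of each step
```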


Value Functions in MDP

🔹 State Value Function

V^\pi(s) = \text{expected total reward from state } s

Measures how good a state is under policy π.


🔹 Action Value Function (Q-function)

Q^\pi(s,a) = \text{expected reward after taking action } a \text{ in state } s

Used in algorithms like Q-learning.
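
One way to see how the two relate: given the model (P, R, γ) and a state-value estimate V, the action value comes from a one-step look-ahead. A sketch, reusing the dictionary-style P and R from the two-state example above:

```python
def q_from_v(s, a, P, R, V, gamma):
    """One-step look-ahead: Q(s,a) = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2])
               for s2, p in P[(s, a)].items())
```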


Bellman Equation (Key Concept)

The value of a state equals:

V(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')

where a is the action chosen in state s.

Meaning:

Current value = immediate reward + discounted future value.

This recursive relationship enables dynamic programming methods.
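
A minimal sketch of value iteration, one of those dynamic-programming methods; it applies the max-over-actions (optimality) form of the Bellman backup and assumes the dictionary-style S, A, P, R, gamma used in the earlier snippets:

```python
def value_iteration(S, A, P, R, gamma, tol=1e-6):
    """Repeat V(s) <- max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s')) until convergence."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            backup = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in A
            )
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V
```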


Why MDPs Are Important in RL

They provide:

  • Formal model of agent-environment interaction

  • Basis for algorithms:

    • Value Iteration

    • Policy Iteration

    • Q-Learning

    • SARSA

    • Deep RL methods

Almost all reinforcement learning problems are modeled as MDPs.
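
For instance, tabular Q-learning (listed above) needs only sampled transitions (s, a, r, s') from the MDP, not the transition probabilities themselves. A sketch of its single update step, with Q stored as a dictionary keyed by (state, action):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```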


Real Applications

  • Robotics navigation

  • Game playing (chess, Go, Atari)

  • Self-driving cars

  • Recommendation systems

  • Resource allocation and scheduling
