Markov Decision Process (MDP) in Reinforcement Learning

A Markov Decision Process (MDP) is the standard mathematical framework used to model decision-making problems where outcomes depend on both randomness and the agent’s actions.

It describes how an agent interacts with an environment to learn the best behavior.


Core Idea (Simple Intuition)

An agent repeatedly:

  1. Observes the current state

  2. Takes an action

  3. Receives a reward

  4. Moves to a new state

The goal is to choose actions that maximize total future reward.
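
A minimal Python sketch of this loop, assuming a Gym-style environment object with reset() and step() methods (the interface and the random placeholder policy are illustrative, not part of the definition):

```python
import random

def run_episode(env, actions, max_steps=100):
    """Run one episode of the observe-act-reward-transition loop."""
    state = env.reset()                          # 1. observe the current state
    total_reward = 0.0
    for _ in range(max_steps):
        action = random.choice(actions)          # 2. take an action (placeholder policy)
        state, reward, done = env.step(action)   # 3. receive a reward, 4. move to a new state
        total_reward += reward
        if done:
            break
    return total_reward                          # the quantity the agent tries to maximize
```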


The Markov Property

MDPs assume the Markov property:

The future depends only on the current state, not on past history.

P(S_{t+1} \mid S_t, A_t, S_{t-1}, \dots) = P(S_{t+1} \mid S_t, A_t)

This makes learning and optimization tractable.


Components of an MDP

An MDP is defined by the tuple:

(S, A, P, R, \gamma)

✅ 1. States (S)

All possible situations of the environment
Example: robot position, game board configuration


✅ 2. Actions (A)

Choices available to the agent
Example: move left/right, accelerate, pick object


✅ 3. Transition Probability (P)

P(s' \mid s, a)

Probability of moving from state s to s' after taking action a.

It captures the uncertainty in the environment's dynamics.


✅ 4. Reward Function (R)

R(s, a, s')

The immediate feedback received after taking an action.

  • Positive → good action

  • Negative → penalty


✅ 5. Discount Factor (γ, between 0 and 1)

Controls importance of future rewards.

  • γ ≈ 0 → focus on immediate reward

  • γ ≈ 1 → long-term planning
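
To make the tuple concrete, here is a minimal sketch of a tiny two-state MDP written out as plain Python dictionaries (the states, actions, and numbers are invented purely for illustration):

```python
# A tiny hand-made MDP (S, A, P, R, gamma) with two states and two actions.
S = ["s0", "s1"]
A = ["stay", "go"]

# P[(s, a)] maps each possible next state s' to its probability P(s'|s, a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "go"):   {"s0": 0.7, "s1": 0.3},
}

# R[(s, a, s')] is the immediate reward for that transition: +1 for landing in s1.
R = {(s, a, s2): (1.0 if s2 == "s1" else 0.0) for s in S for a in A for s2 in S}

gamma = 0.9  # discount factor: how strongly future rewards count
```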


Objective of Reinforcement Learning

Find a policy π(a|s) that maximizes expected return:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots

Policy = rule for choosing actions in each state.
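
As a quick sanity check of the return formula, this small sketch computes G_t for a hand-written reward sequence (the numbers are arbitrary):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):   # k = 0, 1, 2, ...
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81*2 = 2.62
```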

Example: Robot Navigation

  • Environment: a grid world

  • States: the robot's positions on the grid

  • Actions: Up, Down, Left, Right

  • Rewards: +10 for reaching the goal, −1 for each step, −10 for hitting an obstacle

  • Transitions: movement sometimes fails, so outcomes are random

This forms a complete MDP model.
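
A sketch of how those pieces could look in code for a small 3×3 grid (grid size, slip probability, and cell labels are illustrative choices, not fixed by the example):

```python
# 3x3 grid world: states are (row, col) cells; the goal is (2, 2),
# an obstacle sits at (1, 1), and movement "slips" (fails) 20% of the time.
GOAL, OBSTACLE, SLIP = (2, 2), (1, 1), 0.2
ACTIONS = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

def transition(state, action):
    """Return {next_state: probability}, i.e. P(s'|s, a)."""
    dr, dc = ACTIONS[action]
    r, c = state
    intended = (max(0, min(2, r + dr)), max(0, min(2, c + dc)))  # stay on the grid
    if intended == state:
        return {state: 1.0}
    return {intended: 1.0 - SLIP, state: SLIP}   # with probability SLIP the move fails

def reward(state, action, next_state):
    if next_state == GOAL:
        return +10.0   # reaching the goal
    if next_state == OBSTACLE:
        return -10.0   # hitting the obstacle
    return -1.0        # cost of each step
```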


Value Functions in MDP

🔹 State Value Function

V^\pi(s) = \text{expected total reward from state } s

Measures how good a state is under policy π.


🔹 Action Value Function (Q-function)

Q^\pi(s,a) = \text{expected reward after taking action } a \text{ in state } s

Used in algorithms like Q-learning.
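
One way to see how the two relate: given the model (P, R, γ) and a state-value estimate V, the action value comes from a one-step look-ahead. A sketch, reusing the dictionary-style P and R from the two-state example above:

```python
def q_from_v(s, a, P, R, V, gamma):
    """One-step look-ahead: Q(s,a) = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2])
               for s2, p in P[(s, a)].items())
```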


Bellman Equation (Key Concept)

The value of a state equals:

V(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')

where a is the action chosen in state s.

Meaning:

Current value = immediate reward + discounted future value.

This recursive relationship enables dynamic programming methods.
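
A minimal sketch of value iteration, one of those dynamic-programming methods; it applies the max-over-actions (optimality) form of the Bellman backup and assumes the dictionary-style S, A, P, R, gamma used in the earlier snippets:

```python
def value_iteration(S, A, P, R, gamma, tol=1e-6):
    """Repeat V(s) <- max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s')) until convergence."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            backup = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in A
            )
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V
```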


Why MDPs Are Important in RL

They provide:

  • Formal model of agent-environment interaction

  • Basis for algorithms:

    • Value Iteration

    • Policy Iteration

    • Q-Learning

    • SARSA

    • Deep RL methods

Almost all reinforcement learning problems are modeled as MDPs.
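
For instance, tabular Q-learning (listed above) needs only sampled transitions (s, a, r, s') from the MDP, not the transition probabilities themselves. A sketch of its single update step, with Q stored as a dictionary keyed by (state, action):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```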


Real Applications

  • Robotics navigation

  • Game playing (chess, Go, Atari)

  • Self-driving cars

  • Recommendation systems

  • Resource allocation and scheduling
