KL Divergence (Kullback–Leibler Divergence)

 


What is KL Divergence?

KL divergence measures how different one probability distribution is from another.

D_{KL}(P \parallel Q) = \sum_i P(i)\log\frac{P(i)}{Q(i)}
  • P = true distribution (data / reality)

  • Q = approximate or predicted distribution (model)

👉 It answers:

“How much information is lost when we use Q instead of P?”


Intuition (Simple Explanation)

  • If P = Q → KL divergence = 0 (perfect match)

  • If Q is very different from P → KL divergence is large

  • It is not symmetric:

D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)

So it's not a true distance; it's a directed divergence, as the sketch below makes concrete.
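A minimal NumPy sketch of this asymmetry (the two distributions here are arbitrary, chosen only for illustration):

import numpy as np

def kl(p, q):
    # KL divergence in nats (natural logarithm); assumes strictly positive entries
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl(P, Q))  # ≈ 0.511
print(kl(Q, P))  # ≈ 0.368, a different value: the divergence is directed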


Coin Toss Example

True distribution

P(H)=0.8,\quad P(T)=0.2

Model prediction

Q(H)=0.6,\quad Q(T)=0.4

Then, using natural logarithms:

D_{KL}(P \parallel Q) = 0.8\log\frac{0.8}{0.6} + 0.2\log\frac{0.2}{0.4} = 0.8\log(1.33) + 0.2\log(0.5) \approx 0.23 - 0.14 = 0.09

✔ Small divergence → reasonable prediction
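A quick NumPy check of the hand calculation above (natural logs, so the result is in nats):

import numpy as np

P = np.array([0.8, 0.2])  # true coin: P(H), P(T)
Q = np.array([0.6, 0.4])  # model coin: Q(H), Q(T)
print(np.sum(P * np.log(P / Q)))  # ≈ 0.0915, matching the ≈ 0.09 above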


Classification Example

True label = Cat

P = [1, 0, 0]

Model A:

Q = [0.8, 0.1, 0.1]

Model B:

Q = [0.4, 0.3, 0.3]

Model A gives the smaller KL divergence → better model. (Because P is one-hot, the sum collapses to a single term, D_{KL} = \log\frac{1}{Q(\text{Cat})}: \log\frac{1}{0.8} for Model A versus \log\frac{1}{0.4} for Model B.)


Relation to Entropy & Cross-Entropy

\text{Cross-Entropy}(P,Q) = H(P) + D_{KL}(P \parallel Q)

Thus:

D_{KL}(P \parallel Q) = \text{Cross-Entropy} - \text{Entropy}

👉 KL divergence = extra loss due to wrong prediction
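A short numerical check of this identity, reusing the coin-toss distributions (all quantities in nats):

import numpy as np

P = np.array([0.8, 0.2])
Q = np.array([0.6, 0.4])

entropy = -np.sum(P * np.log(P))        # H(P)
cross_entropy = -np.sum(P * np.log(Q))  # H(P, Q)
kl = np.sum(P * np.log(P / Q))          # D_KL(P || Q)

print(cross_entropy)  # ≈ 0.592
print(entropy + kl)   # ≈ 0.592, confirming Cross-Entropy = Entropy + KL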


Uses of KL Divergence

✅ Machine Learning

  • Training classification models

  • Loss functions (cross-entropy minimization)

  • Comparing probability outputs


✅ Bayesian Statistics

  • Measuring the difference between prior and posterior (see the sketch after this list)

  • Model selection

  • Variational inference (approximate posterior)
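As a rough numeric sketch of the prior-versus-posterior use case (the Beta(2, 2) prior and Beta(8, 4) posterior here are invented purely for illustration, and SciPy is assumed to be available), KL can be approximated on a grid:

import numpy as np
from scipy.stats import beta

x = np.linspace(0.001, 0.999, 2000)  # grid over a coin's bias parameter
prior = beta.pdf(x, 2, 2)            # illustrative Beta(2, 2) prior
posterior = beta.pdf(x, 8, 4)        # illustrative Beta(8, 4) posterior after seeing data

# Approximate D_KL(posterior || prior) with a Riemann sum over the grid
dx = x[1] - x[0]
kl = np.sum(posterior * np.log(posterior / prior)) * dx
print("KL(posterior || prior):", kl)  # how far the data moved our beliefs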


✅ Information Theory

  • Compression efficiency

  • Coding cost difference (illustrated below)
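A sketch of the coding interpretation, ignoring the rounding to whole-bit code lengths (the distributions are the same illustrative coin pair as before): if we design a code for Q but the symbols actually follow P, the average message length exceeds the optimum by exactly D_{KL}(P \parallel Q) in bits.

import numpy as np

P = np.array([0.8, 0.2])  # true symbol frequencies
Q = np.array([0.6, 0.4])  # frequencies the code was designed for

optimal_bits = -np.sum(P * np.log2(P))  # entropy of P: best possible average length
actual_bits = -np.sum(P * np.log2(Q))   # average length when using Q's code
print(actual_bits - optimal_bits)       # ≈ 0.132 bits/symbol = D_KL(P || Q) in bits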


✅ Deep Learning Applications

  • Variational Autoencoders (VAE), as sketched after this list

  • Knowledge distillation

  • Reinforcement learning policy updates
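For example, a VAE's regularizer is the KL divergence between the encoder's diagonal Gaussian N(\mu, \sigma^2) and the standard normal prior, which has the well-known closed form

\frac{1}{2}\sum_j \left(\mu_j^2 + \sigma_j^2 - 1 - \log\sigma_j^2\right)

A minimal sketch (the \mu and \sigma values are invented for illustration):

import numpy as np

def gaussian_kl(mu, sigma):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions
    mu, sigma = np.asarray(mu), np.asarray(sigma)
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

mu = [0.5, -0.2, 0.0]    # illustrative 3-dimensional latent mean
sigma = [1.0, 0.8, 1.2]  # illustrative latent standard deviations
print("VAE KL term:", gaussian_kl(mu, sigma))  # ≈ 0.226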


Python Code Example

✔ Discrete KL divergence

import numpy as np

def kl_divergence(P, Q):
    P = np.array(P)
    Q = np.array(Q)
    # log base 2, so the result is in bits; 1e-12 guards against log(0) and division by zero
    return np.sum(P * np.log2((P + 1e-12) / (Q + 1e-12)))

# Example distributions
P = [0.8, 0.2]
Q = [0.6, 0.4]
print("KL divergence:", kl_divergence(P, Q))

Output:

KL divergence: 0.13202999942331567

This value is in bits (log base 2); the hand calculation in the coin toss example used natural logs, which is why it gave ≈ 0.09 nats for the same distributions.

✔ KL divergence for classification example

P = [1, 0, 0]         # true label
Q1 = [0.8, 0.1, 0.1]  # good model
Q2 = [0.4, 0.3, 0.3]  # worse model

print("Model A KL:", kl_divergence(P, Q1))
print("Model B KL:", kl_divergence(P, Q2))

Output:

Model A KL: 0.32192809488700175
Model B KL: 1.3219280948851984

Important Properties

  • D_{KL}(P \parallel Q) \ge 0

  • Equals 0 only when P = Q

  • Not symmetric

  • Not a metric (fails symmetry and the triangle inequality)

  • Measures information loss


Summary

KL divergence measures the difference between two probability distributions P and Q. It is defined as

D_{KL}(P \parallel Q) = \sum_i P(i)\log\frac{P(i)}{Q(i)}

It represents the information lost when Q approximates P.
KL divergence is always non-negative and equals zero only when the distributions are identical.

Applications:

  • Machine learning loss functions

  • Bayesian inference and variational methods

  • Comparing probability models

  • Information theory and coding

Example: If the true distribution is [0.8, 0.2] and the model predicts [0.6, 0.4], the KL divergence quantifies the prediction error.
