KL Divergence (Kullback–Leibler Divergence)

 


What is KL Divergence?

KL divergence measures how different one probability distribution is from another.

D_{KL}(P \parallel Q) = \sum_i P(i)\log\frac{P(i)}{Q(i)}
  • P = true distribution (data / reality)

  • Q = approximate or predicted distribution (model)

👉 It answers:

“How much information is lost when we use Q instead of P?”


Intuition (Simple Explanation)

  • If P = Q → KL divergence = 0 (perfect match)

  • If Q is very different from P → KL divergence is large

  • It is not symmetric:

D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)

So it's not a true distance; it's a directed divergence, as the sketch below makes concrete.
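A minimal NumPy sketch of this asymmetry (the two distributions here are arbitrary, chosen only for illustration):

import numpy as np

def kl(p, q):
    # KL divergence in nats (natural logarithm); assumes strictly positive entries
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl(P, Q))  # ≈ 0.511
print(kl(Q, P))  # ≈ 0.368, a different value: the divergence is directed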


Coin Toss Example

True distribution

P(H)=0.8,\quad P(T)=0.2

Model prediction

Q(H)=0.6,\quad Q(T)=0.4

Then, using natural logarithms:

D_{KL}(P \parallel Q) = 0.8\log\frac{0.8}{0.6} + 0.2\log\frac{0.2}{0.4} = 0.8\log(1.33) + 0.2\log(0.5) \approx 0.23 - 0.14 = 0.09

✔ Small divergence → reasonable prediction
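A quick NumPy check of the hand calculation above (natural logs, so the result is in nats):

import numpy as np

P = np.array([0.8, 0.2])  # true coin: P(H), P(T)
Q = np.array([0.6, 0.4])  # model coin: Q(H), Q(T)
print(np.sum(P * np.log(P / Q)))  # ≈ 0.0915, matching the ≈ 0.09 above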


Classification Example

True label = Cat

P = [1, 0, 0]

Model A:

Q = [0.8, 0.1, 0.1]

Model B:

Q = [0.4, 0.3, 0.3]

Model A gives the smaller KL divergence → better model. (Because P is one-hot, the sum collapses to a single term, D_{KL} = \log\frac{1}{Q(\text{Cat})}: \log\frac{1}{0.8} for Model A versus \log\frac{1}{0.4} for Model B.)


Relation to Entropy & Cross-Entropy

\text{Cross-Entropy}(P,Q) = H(P) + D_{KL}(P \parallel Q)

Thus:

D_{KL}(P \parallel Q) = \text{Cross-Entropy} - \text{Entropy}

👉 KL divergence = extra loss due to wrong prediction
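A short numerical check of this identity, reusing the coin-toss distributions (all quantities in nats):

import numpy as np

P = np.array([0.8, 0.2])
Q = np.array([0.6, 0.4])

entropy = -np.sum(P * np.log(P))        # H(P)
cross_entropy = -np.sum(P * np.log(Q))  # H(P, Q)
kl = np.sum(P * np.log(P / Q))          # D_KL(P || Q)

print(cross_entropy)  # ≈ 0.592
print(entropy + kl)   # ≈ 0.592, confirming Cross-Entropy = Entropy + KL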


Uses of KL Divergence

✅ Machine Learning

  • Training classification models

  • Loss functions (cross-entropy minimization)

  • Comparing probability outputs


✅ Bayesian Statistics

  • Measuring the difference between prior and posterior (see the sketch after this list)

  • Model selection

  • Variational inference (approximate posterior)
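As a rough numeric sketch of the prior-versus-posterior use case (the Beta(2, 2) prior and Beta(8, 4) posterior here are invented purely for illustration, and SciPy is assumed to be available), KL can be approximated on a grid:

import numpy as np
from scipy.stats import beta

x = np.linspace(0.001, 0.999, 2000)  # grid over a coin's bias parameter
prior = beta.pdf(x, 2, 2)            # illustrative Beta(2, 2) prior
posterior = beta.pdf(x, 8, 4)        # illustrative Beta(8, 4) posterior after seeing data

# Approximate D_KL(posterior || prior) with a Riemann sum over the grid
dx = x[1] - x[0]
kl = np.sum(posterior * np.log(posterior / prior)) * dx
print("KL(posterior || prior):", kl)  # how far the data moved our beliefs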


✅ Information Theory

  • Compression efficiency

  • Coding cost difference (illustrated below)
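A sketch of the coding interpretation, ignoring the rounding to whole-bit code lengths (the distributions are the same illustrative coin pair as before): if we design a code for Q but the symbols actually follow P, the average message length exceeds the optimum by exactly D_{KL}(P \parallel Q) in bits.

import numpy as np

P = np.array([0.8, 0.2])  # true symbol frequencies
Q = np.array([0.6, 0.4])  # frequencies the code was designed for

optimal_bits = -np.sum(P * np.log2(P))  # entropy of P: best possible average length
actual_bits = -np.sum(P * np.log2(Q))   # average length when using Q's code
print(actual_bits - optimal_bits)       # ≈ 0.132 bits/symbol = D_KL(P || Q) in bits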


✅ Deep Learning Applications

  • Variational Autoencoders (VAE), as sketched after this list

  • Knowledge distillation

  • Reinforcement learning policy updates
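For example, a VAE's regularizer is the KL divergence between the encoder's diagonal Gaussian N(\mu, \sigma^2) and the standard normal prior, which has the well-known closed form

\frac{1}{2}\sum_j \left(\mu_j^2 + \sigma_j^2 - 1 - \log\sigma_j^2\right)

A minimal sketch (the \mu and \sigma values are invented for illustration):

import numpy as np

def gaussian_kl(mu, sigma):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions
    mu, sigma = np.asarray(mu), np.asarray(sigma)
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

mu = [0.5, -0.2, 0.0]    # illustrative 3-dimensional latent mean
sigma = [1.0, 0.8, 1.2]  # illustrative latent standard deviations
print("VAE KL term:", gaussian_kl(mu, sigma))  # ≈ 0.226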


Python Code Example

✔ Discrete KL divergence

import numpy as np

def kl_divergence(P, Q):
    P = np.array(P)
    Q = np.array(Q)
    # log base 2, so the result is in bits; 1e-12 guards against log(0) and division by zero
    return np.sum(P * np.log2((P + 1e-12) / (Q + 1e-12)))

# Example distributions
P = [0.8, 0.2]
Q = [0.6, 0.4]
print("KL divergence:", kl_divergence(P, Q))

Output:

KL divergence: 0.13202999942331567

This value is in bits (log base 2); the hand calculation in the coin toss example used natural logs, which is why it gave ≈ 0.09 nats for the same distributions.

✔ KL divergence for classification example

P = [1, 0, 0]         # true label
Q1 = [0.8, 0.1, 0.1]  # good model
Q2 = [0.4, 0.3, 0.3]  # worse model

print("Model A KL:", kl_divergence(P, Q1))
print("Model B KL:", kl_divergence(P, Q2))

Output:

Model A KL: 0.32192809488700175
Model B KL: 1.3219280948851984

Important Properties

  • D_{KL}(P \parallel Q) \ge 0

  • Equals 0 only when P = Q

  • Not symmetric

  • Not a metric (fails symmetry and the triangle inequality)

  • Measures information loss


Summary

KL divergence measures the difference between two probability distributions P and Q. It is defined as

D_{KL}(P \parallel Q) = \sum_i P(i)\log\frac{P(i)}{Q(i)}

It represents the information lost when Q approximates P.
KL divergence is always non-negative and equals zero only when the distributions are identical.

Applications:

  • Machine learning loss functions

  • Bayesian inference and variational methods

  • Comparing probability models

  • Information theory and coding

Example: If the true distribution is [0.8, 0.2] and the model predicts [0.6, 0.4], the KL divergence quantifies the prediction error.
