Gradient Descent Variants in Machine Learning

Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent


1. Background: Empirical Risk Minimization

Most ML problems reduce to minimizing the empirical risk:

\min_{\theta} J(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; x_i, y_i)

where:

  • \theta: model parameters

  • \ell: loss function

  • n: number of training samples

The difference between GD variants lies in how much data is used per update.
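The objective above can be sketched in code. This is a minimal illustration, assuming a linear model with squared-error loss (neither is specified in the text, both are chosen just to make the example concrete):

```python
import numpy as np

def empirical_risk(theta, X, y):
    """J(theta) = (1/n) * sum of per-sample losses.
    Here the loss is squared error of a linear model (an assumed choice)."""
    preds = X @ theta
    return np.mean((preds - y) ** 2)

# Tiny example: n = 3 samples, 2 parameters
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])
print(empirical_risk(theta, X, y))  # 0.0, since this theta fits the data exactly
```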


2. Stochastic Gradient Descent (SGD)

Definition

Stochastic Gradient Descent updates parameters using the gradient computed from a single randomly chosen data point.


Update Rule

\theta_{k+1} = \theta_k - \eta \nabla \ell(\theta_k; x_i, y_i)

where (x_i, y_i) is a single training example chosen uniformly at random.


Algorithm (SGD)

Initialize θ
Repeat for each epoch:
    Shuffle training data
    For each data point (x_i, y_i):
        Compute gradient ∇ℓ(θ; x_i, y_i)
        θ ← θ − η ∇ℓ(θ; x_i, y_i)
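The algorithm above can be sketched as follows. This is a minimal implementation, assuming a linear least-squares problem (the gradient of ½(x·θ − y)² with respect to θ is (x·θ − y)·x); the data and learning rate are illustrative, not from the text:

```python
import numpy as np

def sgd(X, y, eta=0.1, epochs=50, seed=0):
    """SGD for linear least squares: one randomly ordered sample per update."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):       # shuffle each epoch
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient at one sample
            theta -= eta * grad
    return theta

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(sgd(X, y))  # approaches [1, 2], the exact solution
```

Note that with a fixed η the iterates hover near the minimum rather than landing on it exactly, which is the oscillatory behavior described below.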

Characteristics of SGD

Aspect             Description
Data per update    1 sample
Speed              Very fast (per update)
Noise              High
Convergence        Oscillatory
Memory usage       Very low

📌 Gradient is a noisy estimate of the true gradient.


Advantages

✔ Scales to very large datasets
✔ Fast updates
✔ Can escape saddle points
✔ Suitable for online learning


Disadvantages

❌ Noisy convergence
❌ Sensitive to learning rate
❌ May not settle exactly at minimum


3. Mini-Batch Gradient Descent (MBGD)

Definition

Mini-Batch Gradient Descent updates parameters using the gradient computed from a small subset (batch) of the training data.


Update Rule

\theta_{k+1} = \theta_k - \eta \frac{1}{|B|}\sum_{i \in B} \nabla \ell(\theta_k; x_i, y_i)

where B is a mini-batch of training examples.


Algorithm (Mini-Batch GD)

Initialize θ
Repeat for each epoch:
    Shuffle training data
    Divide data into mini-batches
    For each batch B:
        Compute average gradient ∇J_B(θ) over B
        θ ← θ − η ∇J_B(θ)
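The algorithm above can be sketched as follows, again assuming linear least squares; the batch size and learning rate here are illustrative choices:

```python
import numpy as np

def minibatch_gd(X, y, eta=0.1, batch_size=2, epochs=100, seed=0):
    """Mini-batch GD for linear least squares: average gradient per batch."""
    rng = np.random.default_rng(seed)
    n = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        perm = rng.permutation(n)             # shuffle each epoch
        for start in range(0, n, batch_size):
            B = perm[start:start + batch_size]
            # (1/|B|) * sum of per-sample gradients over the batch
            grad = (X[B] @ theta - y[B]) @ X[B] / len(B)
            theta -= eta * grad
    return theta

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(minibatch_gd(X, y))  # approaches [1, 2]
```

Averaging over the batch reduces gradient variance relative to single-sample SGD, which is why the path to the minimum is smoother.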

Characteristics of Mini-Batch GD

Aspect             Description
Data per update    16–512 samples (typical)
Speed              Fast
Noise              Moderate
Convergence        Stable
Hardware use       Efficient (GPU)

📌 Most deep learning frameworks use this variant.


4. Comparison: Batch GD vs SGD vs Mini-Batch GD

Feature              Batch GD       SGD          Mini-Batch GD
Samples per update   All data       1            Small batch
Update frequency     Low            Very high    High
Stability            Very stable    Noisy        Balanced
Computation          Expensive      Cheap        Efficient
Used in practice     Rare           Some cases   Most common

5. Geometric Interpretation

  • SGD → Zig-zag path around minimum

  • Mini-Batch GD → Smoother path

  • Batch GD → Smoothest but slowest


6. Convergence Behavior

Convex Loss Functions

  • SGD converges in expectation

  • Mini-batch GD converges faster and more smoothly

Non-Convex Loss Functions

  • Noise helps escape saddle points

  • Mini-batch GD provides good generalization


7. Learning Rate in SGD & Mini-Batch GD

  • Fixed η → iterates keep oscillating around the minimum

  • Decreasing η → convergence (in expectation)

Typical schedule:

\eta_t = \frac{\eta_0}{1 + kt}
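This inverse-time decay schedule is easy to verify numerically; the values of η₀ and k below are illustrative:

```python
def lr_schedule(eta0, k, t):
    """Inverse-time decay: eta_t = eta0 / (1 + k*t)."""
    return eta0 / (1 + k * t)

# With eta0 = 0.5 and k = 0.1, the rate halves by step t = 10:
print(lr_schedule(0.5, 0.1, 0))   # 0.5
print(lr_schedule(0.5, 0.1, 10))  # 0.25
```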

8. Why Mini-Batch GD is Preferred in ML

✔ Efficient GPU computation
✔ Less noisy than SGD
✔ Faster than batch GD
✔ Good generalization

📌 Default choice in deep learning


9. Exam-Ready Definitions

SGD:

An optimization algorithm that updates parameters using the gradient computed from a single randomly selected training example.

Mini-Batch GD:

An optimization algorithm that updates parameters using the average gradient computed from a small subset of training examples.


10. One-Line Intuition

  • SGD: Fast but noisy learning

  • Mini-Batch GD: Best trade-off between speed and stability
