Gradient Descent Variants in Machine Learning

Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent


1. Background: Empirical Risk Minimization

Most ML problems reduce to minimizing the empirical risk:

\min_{\theta} J(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; x_i, y_i)

where:

  • \theta: model parameters

  • \ell: loss function

  • n: number of training samples

The difference between GD variants lies in how much data is used per update.
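The objective above can be sketched in code. This is a minimal illustration, assuming a linear model with squared-error loss (neither is specified in the text, both are chosen just to make the example concrete):

```python
import numpy as np

def empirical_risk(theta, X, y):
    """J(theta) = (1/n) * sum of per-sample losses.
    Here the loss is squared error of a linear model (an assumed choice)."""
    preds = X @ theta
    return np.mean((preds - y) ** 2)

# Tiny example: n = 3 samples, 2 parameters
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])
print(empirical_risk(theta, X, y))  # 0.0, since this theta fits the data exactly
```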


2. Stochastic Gradient Descent (SGD)

Definition

Stochastic Gradient Descent updates parameters using the gradient computed from a single randomly chosen data point.


Update Rule

\theta_{k+1} = \theta_k - \eta \nabla \ell(\theta_k; x_i, y_i)

where (x_i, y_i) is a single training example chosen uniformly at random.


Algorithm (SGD)

Initialize θ
Repeat for each epoch:
    Shuffle training data
    For each data point (x_i, y_i):
        Compute gradient ∇ℓ(θ; x_i, y_i)
        θ ← θ − η ∇ℓ(θ; x_i, y_i)
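The algorithm above can be sketched as follows. This is a minimal implementation, assuming a linear least-squares problem (the gradient of ½(x·θ − y)² with respect to θ is (x·θ − y)·x); the data and learning rate are illustrative, not from the text:

```python
import numpy as np

def sgd(X, y, eta=0.1, epochs=50, seed=0):
    """SGD for linear least squares: one randomly ordered sample per update."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):       # shuffle each epoch
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient at one sample
            theta -= eta * grad
    return theta

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(sgd(X, y))  # approaches [1, 2], the exact solution
```

Note that with a fixed η the iterates hover near the minimum rather than landing on it exactly, which is the oscillatory behavior described below.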

Characteristics of SGD

Aspect             Description
Data per update    1 sample
Speed              Very fast (per update)
Noise              High
Convergence        Oscillatory
Memory usage       Very low

📌 Gradient is a noisy estimate of the true gradient.


Advantages

✔ Scales to very large datasets
✔ Fast updates
✔ Can escape saddle points
✔ Suitable for online learning


Disadvantages

❌ Noisy convergence
❌ Sensitive to learning rate
❌ May not settle exactly at minimum


3. Mini-Batch Gradient Descent (MBGD)

Definition

Mini-Batch Gradient Descent updates parameters using the gradient computed from a small subset (batch) of the training data.


Update Rule

\theta_{k+1} = \theta_k - \eta \frac{1}{|B|}\sum_{i \in B} \nabla \ell(\theta_k; x_i, y_i)

where B is a mini-batch of training examples.


Algorithm (Mini-Batch GD)

Initialize θ
Repeat for each epoch:
    Shuffle training data
    Divide data into mini-batches
    For each batch B:
        Compute average gradient ∇J_B(θ) over B
        θ ← θ − η ∇J_B(θ)
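The algorithm above can be sketched as follows, again assuming linear least squares; the batch size and learning rate here are illustrative choices:

```python
import numpy as np

def minibatch_gd(X, y, eta=0.1, batch_size=2, epochs=100, seed=0):
    """Mini-batch GD for linear least squares: average gradient per batch."""
    rng = np.random.default_rng(seed)
    n = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        perm = rng.permutation(n)             # shuffle each epoch
        for start in range(0, n, batch_size):
            B = perm[start:start + batch_size]
            # (1/|B|) * sum of per-sample gradients over the batch
            grad = (X[B] @ theta - y[B]) @ X[B] / len(B)
            theta -= eta * grad
    return theta

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(minibatch_gd(X, y))  # approaches [1, 2]
```

Averaging over the batch reduces gradient variance relative to single-sample SGD, which is why the path to the minimum is smoother.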

Characteristics of Mini-Batch GD

Aspect             Description
Data per update    16–512 samples (typical)
Speed              Fast
Noise              Moderate
Convergence        Stable
Hardware use       Efficient (GPU)

📌 Most deep learning frameworks use this variant.


4. Comparison: Batch GD vs SGD vs Mini-Batch GD

Feature              Batch GD       SGD          Mini-Batch GD
Samples per update   All data       1            Small batch
Update frequency     Low            Very high    High
Stability            Very stable    Noisy        Balanced
Computation          Expensive      Cheap        Efficient
Used in practice     Rare           Some cases   Most common

5. Geometric Interpretation

  • SGD → Zig-zag path around minimum

  • Mini-Batch GD → Smoother path

  • Batch GD → Smoothest but slowest


6. Convergence Behavior

Convex Loss Functions

  • SGD converges in expectation

  • Mini-batch GD converges faster and more smoothly

Non-Convex Loss Functions

  • Noise helps escape saddle points

  • Mini-batch GD provides good generalization


7. Learning Rate in SGD & Mini-Batch GD

  • Fixed η → iterates keep oscillating around the minimum

  • Decreasing η → convergence (in expectation)

Typical schedule:

\eta_t = \frac{\eta_0}{1 + kt}
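This inverse-time decay schedule is easy to verify numerically; the values of η₀ and k below are illustrative:

```python
def lr_schedule(eta0, k, t):
    """Inverse-time decay: eta_t = eta0 / (1 + k*t)."""
    return eta0 / (1 + k * t)

# With eta0 = 0.5 and k = 0.1, the rate halves by step t = 10:
print(lr_schedule(0.5, 0.1, 0))   # 0.5
print(lr_schedule(0.5, 0.1, 10))  # 0.25
```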

8. Why Mini-Batch GD is Preferred in ML

✔ Efficient GPU computation
✔ Less noisy than SGD
✔ Faster than batch GD
✔ Good generalization

📌 Default choice in deep learning


9. Exam-Ready Definitions

SGD:

An optimization algorithm that updates parameters using the gradient computed from a single randomly selected training example.

Mini-Batch GD:

An optimization algorithm that updates parameters using the average gradient computed from a small subset of training examples.


10. One-Line Intuition

  • SGD: Fast but noisy learning

  • Mini-Batch GD: Best trade-off between speed and stability
