Gradient Descent Variants in Machine Learning
Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent
1. Background: Empirical Risk Minimization
Most ML problems solve an empirical risk minimization of the form:

$$\min_{\theta} \; J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i; \theta),\, y_i\big)$$

where:

- $\theta$: model parameters
- $L$: loss function
- $n$: number of training samples
The difference between GD variants lies in how much data is used per update.
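As a concrete instance of the objective above, the empirical risk under squared-error loss can be computed directly. A minimal sketch (the toy dataset below is an illustrative assumption, not from the original post):

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Average squared-error loss J(theta) over all n training samples."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)

# illustrative toy data: 3 samples, 2 features
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, -1.0, 1.0])
print(empirical_risk(np.zeros(2), X, y))  # → 1.0
```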
2. Stochastic Gradient Descent (SGD)
Definition
Stochastic Gradient Descent updates parameters using the gradient computed from a single randomly chosen data point.
Update Rule

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L\big(f(x_i; \theta_t),\, y_i\big)$$

where $i$ is drawn uniformly at random from $\{1, \dots, n\}$ at each step and $\eta$ is the learning rate.
Algorithm (SGD)
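The algorithm can be sketched as a minimal NumPy implementation, using linear least squares as the model (the toy dataset and hyperparameters below are illustrative assumptions):

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=50, seed=0):
    """Plain SGD for linear least squares: one random sample per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):  # visit samples in random order each epoch
            # gradient of 0.5 * (x_i . theta - y_i)^2 w.r.t. theta
            grad = (X[i] @ theta - y[i]) * X[i]
            theta -= eta * grad
    return theta

# illustrative toy data: y ≈ 2*x0 - 1*x1
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=200)
theta = sgd(X, y)  # recovers approximately [2.0, -1.0]
```

Note that each update touches exactly one sample, so a full epoch performs `n` parameter updates.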
Characteristics of SGD
| Aspect | Description |
|---|---|
| Data per update | 1 sample |
| Speed | Very fast |
| Noise | High |
| Convergence | Oscillatory |
| Memory usage | Very low |
📌 The single-sample gradient is an unbiased but noisy estimate of the true (full-batch) gradient.
Advantages
✔ Scales to very large datasets
✔ Fast updates
✔ Can escape saddle points
✔ Suitable for online learning
Disadvantages
❌ Noisy convergence
❌ Sensitive to learning rate
❌ May not settle exactly at minimum
3. Mini-Batch Gradient Descent (MBGD)
Definition
Mini-Batch Gradient Descent updates parameters using the gradient computed from a small subset (batch) of the training data.
Update Rule

$$\theta_{t+1} = \theta_t - \frac{\eta}{|B_t|} \sum_{i \in B_t} \nabla_\theta L\big(f(x_i; \theta_t),\, y_i\big)$$

where $B_t \subset \{1, \dots, n\}$ is a mini-batch sampled at step $t$.
Algorithm (Mini-Batch GD)
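The algorithm can be sketched in NumPy, again using linear least squares (the dataset, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

def minibatch_gd(X, y, eta=0.1, batch_size=32, epochs=50, seed=0):
    """Mini-batch GD for linear least squares: average gradient per batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)            # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            residual = X[batch] @ theta - y[batch]
            grad = X[batch].T @ residual / len(batch)  # averaged over batch
            theta -= eta * grad
    return theta

# illustrative toy data: y ≈ 2*x0 - 1*x1
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=200)
theta = minibatch_gd(X, y)  # recovers approximately [2.0, -1.0]
```

The batched matrix products (`X[batch] @ theta`) are what make this variant map so well onto GPU hardware.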
Characteristics of Mini-Batch GD
| Aspect | Description |
|---|---|
| Data per update | 16–512 samples |
| Speed | Fast |
| Noise | Moderate |
| Convergence | Stable |
| Hardware use | Efficient (GPU) |
📌 Most deep learning frameworks use this variant.
4. Comparison: Batch GD vs SGD vs Mini-Batch GD
| Feature | Batch GD | SGD | Mini-Batch GD |
|---|---|---|---|
| Samples per update | All data | 1 | Small batch |
| Update frequency | Low | Very high | High |
| Stability | Very stable | Noisy | Balanced |
| Computation | Expensive | Cheap | Efficient |
| Used in practice | Rare | Some cases | Most common |
5. Geometric Interpretation
- SGD → Zig-zag path around minimum
- Mini-Batch GD → Smoother path
- Batch GD → Smoothest but slowest
6. Convergence Behavior
Convex Loss Functions
- SGD converges in expectation
- Mini-batch GD converges faster and more smoothly

Non-Convex Loss Functions

- Noise helps escape saddle points
- Mini-batch GD provides good generalization
7. Learning Rate in SGD & Mini-Batch GD
- Fixed η → oscillation around the minimum
- Decreasing η → convergence

A typical schedule is inverse-time decay:

$$\eta_t = \frac{\eta_0}{1 + k t}$$

where $\eta_0$ is the initial learning rate and $k$ is a decay constant.
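One common choice is inverse-time decay, $\eta_t = \eta_0 / (1 + k t)$. A minimal sketch (the constants $\eta_0 = 0.5$ and $k = 0.1$ here are illustrative):

```python
def inverse_time_decay(eta0, k, t):
    """Inverse-time decay: eta_t = eta0 / (1 + k*t), decreasing in t."""
    return eta0 / (1.0 + k * t)

# learning rate shrinks monotonically across steps
etas = [inverse_time_decay(0.5, 0.1, t) for t in range(5)]
```

Because the step size shrinks over time, the oscillation around the minimum is progressively damped.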
8. Why Mini-Batch GD is Preferred in ML
✔ Efficient GPU computation
✔ Less noisy than SGD
✔ Faster than batch GD
✔ Good generalization
📌 Default choice in deep learning
9. Exam-Ready Definitions
SGD:
An optimization algorithm that updates parameters using the gradient computed from a single randomly selected training example.
Mini-Batch GD:
An optimization algorithm that updates parameters using the average gradient computed from a small subset of training examples.
10. One-Line Intuition
- SGD: Fast but noisy learning
- Mini-Batch GD: Best trade-off between speed and stability