Convergence Behaviour of Batch Gradient Descent (BGD) vs Stochastic Gradient Descent (SGD)
Batch Gradient Descent (BGD)
- Uses the entire dataset to compute the gradient at each iteration.
- The loss decreases smoothly and monotonically (for a suitably small learning rate).
- The convergence path is stable and deterministic.
- For convex loss functions, BGD converges to the global minimum.
- Each iteration is computationally expensive, so overall convergence can be slow.
- Less suitable for very large datasets.

📌 Behavior: Smooth, steady convergence without oscillations.
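The smooth, monotone behavior can be seen on a toy convex problem. The sketch below runs BGD on a least-squares loss; the data, learning rate, and iteration count are illustrative choices, not prescriptions:

```python
import numpy as np

# Toy convex problem: least-squares loss L(w) = (1/2n) * ||Xw - y||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w  # noiseless targets, so the minimum loss is exactly 0

w = np.zeros(2)
lr = 0.1  # small enough for stable, monotone descent on this problem
losses = []
for _ in range(200):
    grad = X.T @ (X @ w - y) / len(y)  # exact gradient over the full batch
    w -= lr * grad
    losses.append(0.5 * np.mean((X @ w - y) ** 2))
```

Because the gradient is exact and the step size is suitable, every iteration lowers the loss: `losses` is a strictly non-increasing sequence, with no zig-zag.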
Stochastic Gradient Descent (SGD)
- Uses one randomly selected data point per update.
- The loss fluctuates rather than decreasing smoothly.
- The convergence path is noisy and stochastic.
- Converges to the minimum in expectation, not at every individual step.
- Makes fast initial progress but may oscillate near the minimum.
- The gradient noise can help escape saddle points and shallow local minima.

📌 Behavior: Fast but zig-zag convergence around the minimum.
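The same least-squares problem illustrates the contrast: each update now uses one randomly drawn sample, so the gradient estimate is noisy. The hyperparameters and noise level below are illustrative assumptions:

```python
import numpy as np

# Same convex least-squares problem, but with noisy targets and
# single-sample (stochastic) gradient updates.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.1 * rng.normal(size=100)  # noisy targets

w = np.zeros(2)
lr = 0.05
losses = []
for _ in range(2000):
    i = rng.integers(len(y))             # pick one sample at random
    grad = (X[i] @ w - y[i]) * X[i]      # noisy single-sample gradient
    w -= lr * grad
    losses.append(0.5 * np.mean((X @ w - y) ** 2))
```

Unlike BGD, `losses` is not monotone: some individual steps increase the full-batch loss, and with a constant learning rate the iterate hovers in a small noise ball around the minimum rather than settling exactly on it.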
Comparison Summary
| Aspect | Batch GD | Stochastic GD |
|---|---|---|
| Gradient estimate | Exact | Noisy |
| Convergence path | Smooth | Zig-zag |
| Stability | High | Low |
| Speed per update | Slow | Fast |
| Final convergence | Exact (convex case) | Approximate |
| Suitability | Small datasets | Large datasets |
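The "speed per update" row is about cost: one BGD step touches all n samples, while one SGD step touches a single sample. A hedged sketch of this trade-off, with an illustrative budget measured in single-sample gradient evaluations (all numbers here are assumptions for demonstration):

```python
import numpy as np

# Compare BGD and SGD given the same budget of single-sample
# gradient evaluations on one convex least-squares problem.
rng = np.random.default_rng(1)
n, d = 100, 2
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=n)

def full_loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

budget = 500  # total single-sample gradient evaluations allowed

# BGD: each update costs n evaluations -> only budget // n updates.
w_bgd = np.zeros(d)
for _ in range(budget // n):  # 5 full-batch updates
    w_bgd -= 0.1 * X.T @ (X @ w_bgd - y) / n

# SGD: each update costs 1 evaluation -> budget updates.
w_sgd = np.zeros(d)
for _ in range(budget):  # 500 noisy single-sample updates
    i = rng.integers(n)
    w_sgd -= 0.05 * (X[i] @ w_sgd - y[i]) * X[i]
```

Under an equal evaluation budget, SGD's many cheap updates typically drive the loss far lower than BGD's few expensive ones, which is why SGD suits large datasets despite its noisier path.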
Conclusion
Batch Gradient Descent converges smoothly and deterministically to the minimum, while Stochastic Gradient Descent converges faster but with oscillations due to noisy gradient estimates.