Convergence Behaviour of Batch Gradient Descent (BGD) vs Stochastic Gradient Descent (SGD)

Batch Gradient Descent (BGD)

  • Uses the entire dataset to compute the gradient at each iteration.

  • The loss function decreases smoothly and monotonically (for suitable learning rates).

  • Convergence path is stable and deterministic.

  • For convex loss functions, BGD guarantees convergence to the global minimum.

  • Each iteration is computationally expensive, so wall-clock convergence can be slow.

  • Less suitable for very large datasets.

📌 Behavior: Smooth, steady convergence without oscillations.
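The behavior above can be seen in a minimal sketch of batch gradient descent for least-squares linear regression. The dataset, learning rate, and iteration count here are illustrative assumptions, not from the post:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.05, n_iters=100):
    """Minimize mean squared error using the FULL dataset per update."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Exact gradient of the MSE, computed over every data point
        grad = 2.0 / len(y) * X.T @ (X @ w - y)
        w -= lr * grad
    return w

# Toy data following y = 3x, so the fitted weight should approach 3
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 6.0, 9.0, 12.0])
w = batch_gradient_descent(X, y)  # w[0] converges smoothly toward 3
```

Because the gradient is exact, running this twice with the same data gives the identical sequence of weights, which is the deterministic convergence described above.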


Stochastic Gradient Descent (SGD)

  • Uses one randomly selected data point per update.

  • The loss function shows fluctuations instead of smooth decrease.

  • Convergence path is noisy and stochastic.

  • With a suitably decaying learning rate, it converges to the minimum in expectation; individual updates need not reduce the loss.

  • Faster initial progress but may oscillate near the minimum.

  • Noise can help escape saddle points and shallow local minima.

📌 Behavior: Fast but zig-zag convergence around the minimum.
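For contrast, here is an SGD sketch on the same kind of toy regression problem (again, the data, learning rate, seed, and iteration count are illustrative assumptions). The only structural change from the batch version is that each update uses one randomly chosen sample:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

def sgd(X, y, lr=0.01, n_iters=500):
    """Minimize squared error using ONE random data point per update."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        i = rng.integers(len(y))          # one randomly selected sample
        grad = 2.0 * X[i] * (X[i] @ w - y[i])  # noisy single-point gradient
        w -= lr * grad
    return w

# Toy data following y = 3x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 6.0, 9.0, 12.0])
w = sgd(X, y)  # w[0] zig-zags toward 3 rather than descending smoothly
```

On this noiseless toy data every per-sample gradient vanishes at the optimum, so SGD still settles near 3; with noisy real data, the weight would keep fluctuating around the minimum unless the learning rate is decayed.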


Comparison Summary

Aspect               Batch GD               Stochastic GD
Gradient estimate    Exact                  Noisy
Convergence path     Smooth                 Zig-zag
Stability            High                   Low
Speed per update     Slow                   Fast
Final convergence    Exact (convex case)    Approximate
Suitability          Small datasets         Large datasets

Conclusion

Batch Gradient Descent converges smoothly and deterministically to the minimum, while Stochastic Gradient Descent converges faster but with oscillations due to noisy gradient estimates.
