Role of Gradient Descent in Machine Learning Algorithms

 


1. Why Optimization is Central to Machine Learning

Most machine learning algorithms can be written as an optimization problem:

$$\min_{\theta} \; J(\theta)$$

Where:

  • $\theta$ = model parameters (weights, biases)

  • $J(\theta)$ = loss (cost) function measuring prediction error

📌 Learning = minimizing the loss function

Gradient Descent is the workhorse algorithm used to perform this minimization.


2. General Gradient Descent Framework

Loss Function

Given data $\{(x_i, y_i)\}_{i=1}^n$, define

$$J(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(f_\theta(x_i), y_i)$$

Gradient Descent Update

$$\theta^{(k+1)} = \theta^{(k)} - \eta \nabla J(\theta^{(k)})$$

Where:

  • $\eta$ is the learning rate (step size)

  • $\nabla J(\theta)$ gives the direction of steepest increase

  • GD moves opposite to the gradient, toward steepest decrease
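A minimal sketch of this update rule in Python (NumPy); the quadratic objective and hyperparameters here are illustrative choices, not from the text:

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, n_iters=100):
    """Apply theta <- theta - eta * grad(theta) for n_iters steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - eta * grad(theta)
    return theta

# Toy objective J(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3);
# the minimizer is theta = 3.
theta_star = gradient_descent(lambda t: 2 * (t - 3.0), theta0=[0.0])
```

Because the update only needs the gradient function, the same routine works for any differentiable loss.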


3. How Gradient Descent Fits into ML Algorithms

Step-by-Step View

  1. Initialize model parameters randomly

  2. Compute predictions using the model

  3. Compute loss (error)

  4. Compute gradient of the loss w.r.t. the parameters

  5. Update parameters using GD

  6. Repeat until convergence

This loop is the training phase of ML algorithms.
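The six steps can be sketched as a training loop; the one-parameter linear model and synthetic data below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=50)
y = 2.0 * X                              # synthetic data, true slope = 2

theta = rng.normal()                     # 1. initialize parameter randomly
eta = 0.1
for epoch in range(500):                 # 6. repeat until convergence
    y_hat = theta * X                    # 2. compute predictions
    loss = np.mean((y_hat - y) ** 2)     # 3. compute loss (MSE)
    grad = 2 * np.mean((y_hat - y) * X)  # 4. gradient of loss w.r.t. theta
    theta = theta - eta * grad           # 5. update parameter
    if abs(grad) < 1e-10:
        break
```

After training, `theta` has been pulled to the true slope of 2.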


4. Examples in Popular Machine Learning Algorithms


4.1 Linear Regression

Model

$$\hat{y} = w^T x + b$$

Loss (Mean Squared Error)

$$J(w,b) = \frac{1}{n}\sum_{i=1}^n (w^T x_i + b - y_i)^2$$

Gradients

$$\frac{\partial J}{\partial w} = \frac{2}{n} \sum_{i=1}^n (w^T x_i + b - y_i)\,x_i$$

$$\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^n (w^T x_i + b - y_i)$$

✔ Convex problem → GD finds global minimum
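These gradients plug directly into the update rule. A hedged sketch with synthetic, noiseless data (the true parameters and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
w_true, b_true = np.array([1.5, -2.0]), 0.5
y = X @ w_true + b_true                  # noiseless targets for clarity

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(2000):
    r = X @ w + b - y                    # residuals (w^T x_i + b - y_i)
    grad_w = (2.0 / n) * (X.T @ r)       # dJ/dw
    grad_b = (2.0 / n) * r.sum()         # dJ/db
    w -= eta * grad_w
    b -= eta * grad_b
```

Since MSE is convex in (w, b), the loop recovers the true parameters regardless of initialization.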


4.2 Logistic Regression

Model

$$\hat{y} = \sigma(w^T x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Loss (Log Loss)

$$J(w,b) = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right]$$

✔ Convex loss
✔ GD guarantees global optimum
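A sketch of batch gradient descent on the log loss (the synthetic, linearly separable data and hyperparameters are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)  # linearly separable labels

w, b, eta = np.zeros(2), 0.0, 0.5
for _ in range(1000):
    p = sigmoid(X @ w + b)               # predicted probabilities
    grad_w = X.T @ (p - y) / n           # gradient of log loss w.r.t. w
    grad_b = np.mean(p - y)              # gradient of log loss w.r.t. b
    w -= eta * grad_w
    b -= eta * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y > 0.5))
```

A pleasant property of the log loss: its gradient has the same form as the MSE gradient for linear regression, just with the sigmoid applied to the linear score.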


4.3 Support Vector Machines (Soft Margin)

Objective:

$$\min_{w}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \max(0,\, 1 - y_i\, w^T x_i)$$

✔ Convex but non-smooth
✔ Solved using subgradient descent
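A minimal subgradient descent sketch for this objective (synthetic well-separated data and a fixed small step size, both illustrative; practical solvers such as Pegasos use a decaying step):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
y = np.repeat([1.0, -1.0], n // 2)                    # labels in {-1, +1}
X = rng.normal(scale=0.5, size=(n, 2)) + y[:, None]   # well-separated clusters

w = np.zeros(2)
C, eta = 1.0, 0.001
for _ in range(5000):
    margins = y * (X @ w)
    viol = margins < 1                                # points violating the margin
    # Subgradient of 0.5*||w||^2 + C * sum of hinge losses:
    # the hinge term contributes -y_i x_i for each margin violator
    subgrad = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    w -= eta * subgrad

train_acc = np.mean(np.sign(X @ w) == y)
```

The hinge loss has a kink at margin 1, so a subgradient (here, treating the kink like the violating side) replaces the gradient; the method still converges on this convex objective.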


4.4 Neural Networks (Deep Learning)

Objective

$$\min_{\theta} \; \frac{1}{n}\sum_{i=1}^n \ell(f_\theta(x_i), y_i)$$

  • Highly non-convex

  • Multiple local minima and saddle points

✔ Gradient Descent + Backpropagation
✔ Typically SGD or Mini-batch GD
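As a toy illustration, a one-hidden-layer network trained with mini-batch SGD and hand-coded backpropagation (the architecture, target function, and hyperparameters are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(256, 1))
y = X ** 2                                      # target function to learn

# One hidden layer of 16 tanh units
W1 = rng.normal(scale=0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)
eta, batch = 0.1, 32

for epoch in range(1000):
    idx = rng.permutation(len(X))               # shuffle for mini-batch SGD
    for s in range(0, len(X), batch):
        xb, yb = X[idx[s:s + batch]], y[idx[s:s + batch]]
        # Forward pass
        h = np.tanh(xb @ W1 + b1)
        out = h @ W2 + b2
        # Backward pass (backpropagation of the MSE gradient)
        d_out = 2 * (out - yb) / len(xb)
        dW2 = h.T @ d_out; db2 = d_out.sum(axis=0)
        d_h = (d_out @ W2.T) * (1 - h ** 2)     # tanh'(z) = 1 - tanh(z)^2
        dW1 = xb.T @ d_h; db1 = d_h.sum(axis=0)
        # SGD update
        W2 -= eta * dW2; b2 -= eta * db2
        W1 -= eta * dW1; b1 -= eta * db1

mse = np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2)
```

Even though this loss surface is non-convex, SGD reliably finds parameters that fit the smooth target well, which is the practical observation behind deep learning.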


5. Gradient Descent Variants Used in ML

Algorithm        Usage
Batch GD         Small datasets
SGD              Large-scale ML
Mini-batch GD    Deep learning
Momentum         Faster convergence
Adam             Adaptive learning rates
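For instance, momentum can be compared with plain GD on an ill-conditioned quadratic (a toy example; setting beta = 0 recovers plain GD):

```python
import numpy as np

# Ill-conditioned quadratic J(t) = 0.5 * (100*t0^2 + t1^2)
grad = lambda t: np.array([100.0 * t[0], t[1]])

def run(beta, eta=0.01, n_iters=300):
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(n_iters):
        v = beta * v + grad(theta)   # velocity accumulates past gradients
        theta = theta - eta * v
    return theta

plain = run(beta=0.0)
momentum = run(beta=0.9)
```

With the same step size, momentum makes much faster progress along the flat direction, which is exactly why it is listed under "faster convergence" above.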

6. Why Gradient Descent Works in ML

Key Reasons

  1. Loss functions are differentiable (or at least subdifferentiable, as with the hinge loss)

  2. Gradients give direction of improvement

  3. Convex problems → global optimum

  4. Non-convex problems → good approximate solutions


7. Role of Convexity

Case               Behavior of GD
Convex loss        Guaranteed convergence
Strongly convex    Linear convergence
Non-convex         Finds critical points

📌 This is where convex optimization theory connects to ML.
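Linear convergence is easy to observe numerically: on the strongly convex toy objective $J(\theta) = \theta^2$, the error shrinks by the same constant factor every step (example values chosen for illustration):

```python
# GD on the strongly convex J(t) = t^2 (gradient 2t), step size 0.1
t, eta = 1.0, 0.1
errors = []
for _ in range(5):
    t = t - eta * 2 * t
    errors.append(abs(t))

# Each step multiplies the error by the same factor (1 - 2*eta) = 0.8,
# which is what "linear convergence" means
ratios = [errors[i + 1] / errors[i] for i in range(len(errors) - 1)]
```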


8. Learning Rate in ML

  • Controls step size

  • Too large → divergence

  • Too small → slow learning

Often:

  • Fixed initially

  • Decayed over epochs

  • Adaptive (Adam, RMSProp)
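The three regimes above can be demonstrated on the toy objective $J(\theta) = \theta^2$ (the step sizes are illustrative):

```python
def run_gd(eta, n_iters=50):
    """Minimize J(t) = t^2 (gradient 2t) starting from t = 1."""
    t = 1.0
    for _ in range(n_iters):
        t = t - eta * 2 * t
    return t

too_small = run_gd(0.01)   # slow: still noticeably far from the minimum
good = run_gd(0.3)         # converges quickly
too_large = run_gd(1.1)    # |1 - 2*eta| > 1: iterates blow up
```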


9. Stopping Criteria in ML Training

  • Gradient norm small

  • Loss change small

  • Maximum epochs reached

  • Validation error stops improving
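The first three criteria can be combined in one loop (a sketch; validation-based early stopping would additionally track held-out error):

```python
import numpy as np

def gd_with_stopping(loss, grad, theta0, eta=0.1,
                     max_epochs=10_000, grad_tol=1e-6, loss_tol=1e-12):
    """GD that stops on a small gradient norm, a small loss change,
    or the epoch cap, whichever comes first."""
    theta = np.asarray(theta0, dtype=float)
    prev_loss = loss(theta)
    for epoch in range(max_epochs):
        g = grad(theta)
        if np.linalg.norm(g) < grad_tol:          # gradient norm small
            return theta, epoch, "grad_tol"
        theta = theta - eta * g
        cur_loss = loss(theta)
        if abs(prev_loss - cur_loss) < loss_tol:  # loss change small
            return theta, epoch, "loss_tol"
        prev_loss = cur_loss
    return theta, max_epochs, "max_epochs"        # epoch cap reached

theta, epochs, reason = gd_with_stopping(
    loss=lambda t: float(np.sum(t ** 2)),
    grad=lambda t: 2 * t,
    theta0=[1.0, -1.0],
)
```

On this convex example, one of the tolerance criteria fires long before the epoch cap.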


10. Visualization (Conceptual)

  • Loss surface → landscape

  • GD → downhill movement

  • Parameters updated iteratively


11. Summary

Gradient Descent is used in machine learning to iteratively update model parameters by minimizing a loss function through movement in the direction opposite to the gradient.


12. One-Line Intuition for Students

Machine learning models learn by repeatedly correcting their mistakes using gradients.
