Role of Gradient Descent in Machine Learning Algorithms

 


1. Why Optimization is Central to Machine Learning

Most machine learning algorithms can be written as an optimization problem:

$$\min_{\theta} \; J(\theta)$$

Where:

  • $\theta$ = model parameters (weights, biases)

  • $J(\theta)$ = loss (cost) function measuring prediction error

📌 Learning = minimizing the loss function

Gradient Descent is the workhorse algorithm used to perform this minimization.


2. General Gradient Descent Framework

Loss Function

Given data $\{(x_i, y_i)\}_{i=1}^n$, define

$$J(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(f_\theta(x_i), y_i)$$

Gradient Descent Update

$$\theta^{(k+1)} = \theta^{(k)} - \eta \nabla J(\theta^{(k)})$$

Where:

  • $\eta$ is the learning rate (step size)

  • $\nabla J(\theta)$ gives the direction of steepest increase

  • GD moves opposite to the gradient, toward steepest decrease
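A minimal sketch of this update rule in Python (NumPy); the quadratic objective and hyperparameters here are illustrative choices, not from the text:

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, n_iters=100):
    """Apply theta <- theta - eta * grad(theta) for n_iters steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - eta * grad(theta)
    return theta

# Toy objective J(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3);
# the minimizer is theta = 3.
theta_star = gradient_descent(lambda t: 2 * (t - 3.0), theta0=[0.0])
```

Because the update only needs the gradient function, the same routine works for any differentiable loss.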


3. How Gradient Descent Fits into ML Algorithms

Step-by-Step View

  1. Initialize model parameters randomly

  2. Compute predictions using the model

  3. Compute loss (error)

  4. Compute gradient of the loss w.r.t. the parameters

  5. Update parameters using GD

  6. Repeat until convergence

This loop is the training phase of ML algorithms.
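The six steps can be sketched as a training loop; the one-parameter linear model and synthetic data below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=50)
y = 2.0 * X                              # synthetic data, true slope = 2

theta = rng.normal()                     # 1. initialize parameter randomly
eta = 0.1
for epoch in range(500):                 # 6. repeat until convergence
    y_hat = theta * X                    # 2. compute predictions
    loss = np.mean((y_hat - y) ** 2)     # 3. compute loss (MSE)
    grad = 2 * np.mean((y_hat - y) * X)  # 4. gradient of loss w.r.t. theta
    theta = theta - eta * grad           # 5. update parameter
    if abs(grad) < 1e-10:
        break
```

After training, `theta` has been pulled to the true slope of 2.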


4. Examples in Popular Machine Learning Algorithms


4.1 Linear Regression

Model

$$\hat{y} = w^T x + b$$

Loss (Mean Squared Error)

$$J(w,b) = \frac{1}{n}\sum_{i=1}^n (w^T x_i + b - y_i)^2$$

Gradients

$$\frac{\partial J}{\partial w} = \frac{2}{n} \sum_{i=1}^n (w^T x_i + b - y_i)\,x_i$$

$$\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^n (w^T x_i + b - y_i)$$

✔ Convex problem → GD finds global minimum
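These gradients plug directly into the update rule. A hedged sketch with synthetic, noiseless data (the true parameters and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
w_true, b_true = np.array([1.5, -2.0]), 0.5
y = X @ w_true + b_true                  # noiseless targets for clarity

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(2000):
    r = X @ w + b - y                    # residuals (w^T x_i + b - y_i)
    grad_w = (2.0 / n) * (X.T @ r)       # dJ/dw
    grad_b = (2.0 / n) * r.sum()         # dJ/db
    w -= eta * grad_w
    b -= eta * grad_b
```

Since MSE is convex in (w, b), the loop recovers the true parameters regardless of initialization.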


4.2 Logistic Regression

Model

$$\hat{y} = \sigma(w^T x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Loss (Log Loss)

$$J(w,b) = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right]$$

✔ Convex loss
✔ GD guarantees global optimum
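A sketch of batch gradient descent on the log loss (the synthetic, linearly separable data and hyperparameters are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)  # linearly separable labels

w, b, eta = np.zeros(2), 0.0, 0.5
for _ in range(1000):
    p = sigmoid(X @ w + b)               # predicted probabilities
    grad_w = X.T @ (p - y) / n           # gradient of log loss w.r.t. w
    grad_b = np.mean(p - y)              # gradient of log loss w.r.t. b
    w -= eta * grad_w
    b -= eta * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y > 0.5))
```

A pleasant property of the log loss: its gradient has the same form as the MSE gradient for linear regression, just with the sigmoid applied to the linear score.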


4.3 Support Vector Machines (Soft Margin)

Objective:

$$\min_{w}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \max(0,\, 1 - y_i\, w^T x_i)$$

✔ Convex but non-smooth
✔ Solved using subgradient descent
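A minimal subgradient descent sketch for this objective (synthetic well-separated data and a fixed small step size, both illustrative; practical solvers such as Pegasos use a decaying step):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
y = np.repeat([1.0, -1.0], n // 2)                    # labels in {-1, +1}
X = rng.normal(scale=0.5, size=(n, 2)) + y[:, None]   # well-separated clusters

w = np.zeros(2)
C, eta = 1.0, 0.001
for _ in range(5000):
    margins = y * (X @ w)
    viol = margins < 1                                # points violating the margin
    # Subgradient of 0.5*||w||^2 + C * sum of hinge losses:
    # the hinge term contributes -y_i x_i for each margin violator
    subgrad = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    w -= eta * subgrad

train_acc = np.mean(np.sign(X @ w) == y)
```

The hinge loss has a kink at margin 1, so a subgradient (here, treating the kink like the violating side) replaces the gradient; the method still converges on this convex objective.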


4.4 Neural Networks (Deep Learning)

Objective

$$\min_{\theta} \; \frac{1}{n}\sum_{i=1}^n \ell(f_\theta(x_i), y_i)$$

  • Highly non-convex

  • Multiple local minima and saddle points

✔ Gradient Descent + Backpropagation
✔ Typically SGD or Mini-batch GD
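As a toy illustration, a one-hidden-layer network trained with mini-batch SGD and hand-coded backpropagation (the architecture, target function, and hyperparameters are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(256, 1))
y = X ** 2                                      # target function to learn

# One hidden layer of 16 tanh units
W1 = rng.normal(scale=0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)
eta, batch = 0.1, 32

for epoch in range(1000):
    idx = rng.permutation(len(X))               # shuffle for mini-batch SGD
    for s in range(0, len(X), batch):
        xb, yb = X[idx[s:s + batch]], y[idx[s:s + batch]]
        # Forward pass
        h = np.tanh(xb @ W1 + b1)
        out = h @ W2 + b2
        # Backward pass (backpropagation of the MSE gradient)
        d_out = 2 * (out - yb) / len(xb)
        dW2 = h.T @ d_out; db2 = d_out.sum(axis=0)
        d_h = (d_out @ W2.T) * (1 - h ** 2)     # tanh'(z) = 1 - tanh(z)^2
        dW1 = xb.T @ d_h; db1 = d_h.sum(axis=0)
        # SGD update
        W2 -= eta * dW2; b2 -= eta * db2
        W1 -= eta * dW1; b1 -= eta * db1

mse = np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2)
```

Even though this loss surface is non-convex, SGD reliably finds parameters that fit the smooth target well, which is the practical observation behind deep learning.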


5. Gradient Descent Variants Used in ML

Algorithm        Usage
Batch GD         Small datasets
SGD              Large-scale ML
Mini-batch GD    Deep learning
Momentum         Faster convergence
Adam             Adaptive learning rates
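For instance, momentum can be compared with plain GD on an ill-conditioned quadratic (a toy example; setting beta = 0 recovers plain GD):

```python
import numpy as np

# Ill-conditioned quadratic J(t) = 0.5 * (100*t0^2 + t1^2)
grad = lambda t: np.array([100.0 * t[0], t[1]])

def run(beta, eta=0.01, n_iters=300):
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(n_iters):
        v = beta * v + grad(theta)   # velocity accumulates past gradients
        theta = theta - eta * v
    return theta

plain = run(beta=0.0)
momentum = run(beta=0.9)
```

With the same step size, momentum makes much faster progress along the flat direction, which is exactly why it is listed under "faster convergence" above.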

6. Why Gradient Descent Works in ML

Key Reasons

  1. Loss functions are differentiable (or at least subdifferentiable, as with the hinge loss)

  2. Gradients give direction of improvement

  3. Convex problems → global optimum

  4. Non-convex problems → good approximate solutions


7. Role of Convexity

Case               Behavior of GD
Convex loss        Guaranteed convergence
Strongly convex    Linear convergence
Non-convex         Finds critical points

📌 This is where convex optimization theory connects to ML.
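Linear convergence is easy to observe numerically: on the strongly convex toy objective $J(\theta) = \theta^2$, the error shrinks by the same constant factor every step (example values chosen for illustration):

```python
# GD on the strongly convex J(t) = t^2 (gradient 2t), step size 0.1
t, eta = 1.0, 0.1
errors = []
for _ in range(5):
    t = t - eta * 2 * t
    errors.append(abs(t))

# Each step multiplies the error by the same factor (1 - 2*eta) = 0.8,
# which is what "linear convergence" means
ratios = [errors[i + 1] / errors[i] for i in range(len(errors) - 1)]
```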


8. Learning Rate in ML

  • Controls step size

  • Too large → divergence

  • Too small → slow learning

Often:

  • Fixed initially

  • Decayed over epochs

  • Adaptive (Adam, RMSProp)
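The three regimes above can be demonstrated on the toy objective $J(\theta) = \theta^2$ (the step sizes are illustrative):

```python
def run_gd(eta, n_iters=50):
    """Minimize J(t) = t^2 (gradient 2t) starting from t = 1."""
    t = 1.0
    for _ in range(n_iters):
        t = t - eta * 2 * t
    return t

too_small = run_gd(0.01)   # slow: still noticeably far from the minimum
good = run_gd(0.3)         # converges quickly
too_large = run_gd(1.1)    # |1 - 2*eta| > 1: iterates blow up
```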


9. Stopping Criteria in ML Training

  • Gradient norm small

  • Loss change small

  • Maximum epochs reached

  • Validation error stops improving
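The first three criteria can be combined in one loop (a sketch; validation-based early stopping would additionally track held-out error):

```python
import numpy as np

def gd_with_stopping(loss, grad, theta0, eta=0.1,
                     max_epochs=10_000, grad_tol=1e-6, loss_tol=1e-12):
    """GD that stops on a small gradient norm, a small loss change,
    or the epoch cap, whichever comes first."""
    theta = np.asarray(theta0, dtype=float)
    prev_loss = loss(theta)
    for epoch in range(max_epochs):
        g = grad(theta)
        if np.linalg.norm(g) < grad_tol:          # gradient norm small
            return theta, epoch, "grad_tol"
        theta = theta - eta * g
        cur_loss = loss(theta)
        if abs(prev_loss - cur_loss) < loss_tol:  # loss change small
            return theta, epoch, "loss_tol"
        prev_loss = cur_loss
    return theta, max_epochs, "max_epochs"        # epoch cap reached

theta, epochs, reason = gd_with_stopping(
    loss=lambda t: float(np.sum(t ** 2)),
    grad=lambda t: 2 * t,
    theta0=[1.0, -1.0],
)
```

On this convex example, one of the tolerance criteria fires long before the epoch cap.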


10. Visualization (Conceptual)

  • Loss surface → landscape

  • GD → downhill movement

  • Parameters updated iteratively


11. Summary

Gradient Descent is used in machine learning to iteratively update model parameters by minimizing a loss function through movement in the direction opposite to the gradient.


12. One-Line Intuition for Students

Machine learning models learn by repeatedly correcting their mistakes using gradients.
