Gradient-Based Methods for Optimization (with Gradient Descent Algorithm)
1. Optimization Problem Setup
In optimization, we aim to find

$$\min_{x \in \mathbb{R}^n} f(x)$$

where:

- $f : \mathbb{R}^n \to \mathbb{R}$ is a real-valued objective function
- $x = (x_1, \dots, x_n)$ is a vector of variables

In machine learning, $f$ is typically a loss or cost function.
2. What is a Gradient-Based Method?
A gradient-based method uses first-order derivative information (the gradient) to guide the search for an optimum.
Key Idea

- The gradient $\nabla f(x)$ points in the direction of steepest increase
- The negative gradient $-\nabla f(x)$ points in the direction of steepest decrease

📌 Hence, to minimize a function, we move in the negative gradient direction.
3. Geometric Interpretation of the Gradient
- The gradient $\nabla f(x)$ is perpendicular to the level curves of $f$
- Moving along $-\nabla f(x)$ decreases the function value most rapidly
- Optimization proceeds by repeatedly stepping downhill
4. Gradient Descent Algorithm
Basic Update Rule

$$x_{k+1} = x_k - \alpha \nabla f(x_k)$$

where:

- $x_k$: current point
- $\nabla f(x_k)$: gradient at $x_k$
- $\alpha > 0$: step size (learning rate)
5. Algorithmic Steps (Pseudo-Code)
Gradient Descent Algorithm

1. Initialize $x_0$ and set $k = 0$
2. Choose a learning rate $\alpha > 0$
3. Repeat until convergence:
   - Compute $\nabla f(x_k)$
   - Update $x_{k+1} = x_k - \alpha \nabla f(x_k)$
   - Set $k \leftarrow k + 1$
4. Stop when $\|\nabla f(x_k)\| \le \epsilon$
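The steps above can be sketched in Python. The function name and the quadratic test problem below are illustrative choices, not part of the original text:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, eps=1e-6, max_iter=1000):
    """Minimize f by stepping against its gradient until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:      # stopping criterion: small gradient norm
            break
        x = x - alpha * g                 # step in the negative gradient direction
    return x

# Example problem: f(x) = ||x - c||^2 with c = (3, -1); its gradient is 2*(x - c).
c = np.array([3.0, -1.0])
minimum = gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0])
```

Since the example is a well-conditioned quadratic, the iterates contract toward $c$ geometrically and the gradient-norm test triggers well before `max_iter`.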
6. Example (One-Dimensional)
Minimize $f(x) = x^2$.

- Gradient: $f'(x) = 2x$
- Update rule: $x_{k+1} = x_k - 2\alpha x_k = (1 - 2\alpha)\,x_k$

For $0 < \alpha < 1$, the sequence converges to the global minimum $x^* = 0$.
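A minimal sketch of this one-dimensional example, taking the quadratic $f(x) = x^2$ as the concrete instance:

```python
# Gradient descent on f(x) = x^2: each step multiplies the iterate by
# (1 - 2*alpha), so the sequence converges to 0 whenever 0 < alpha < 1.
def minimize_square(x0, alpha, steps=60):
    x = x0
    for _ in range(steps):
        x = x - alpha * (2 * x)     # f'(x) = 2x
    return x

x_final = minimize_square(x0=5.0, alpha=0.25)   # contraction factor 1 - 0.5 = 0.5
```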
7. Choice of Learning Rate (Step Size)
(a) Fixed Learning Rate

- Simple to implement
- May overshoot the minimum if $\alpha$ is too large
- Converges slowly if $\alpha$ is too small

(b) Diminishing Learning Rate

- Step sizes $\alpha_k$ shrink over the iterations (e.g. $\alpha_k = \alpha_0 / k$)
- Guarantees convergence for convex functions under standard conditions (e.g. $\sum_k \alpha_k = \infty$, $\sum_k \alpha_k^2 < \infty$)

(c) Line Search

- Chooses $\alpha_k$ at each iteration, e.g. by (approximately) minimizing $f(x_k - \alpha \nabla f(x_k))$ over $\alpha$
- More accurate steps
- Computationally more expensive per iteration
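The three fixed-step regimes can be seen on a simple quadratic; the specific step sizes below are illustrative:

```python
# On f(x) = x^2, each step scales x by (1 - 2*alpha), so the fixed step size
# determines whether the iteration crawls, converges quickly, or diverges.
def final_distance(alpha, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - alpha * (2 * x)
    return abs(x)

too_small = final_distance(0.1)    # slow: factor 0.8 per step
well_tuned = final_distance(0.45)  # fast: factor 0.1 per step
too_large = final_distance(1.1)    # overshoots and diverges: factor -1.2 per step
```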
8. Convergence Properties
Convex Functions
- Gradient descent converges to the global minimum
- The convergence rate depends on the condition number of the Hessian

Non-Convex Functions

- May converge to:
  - a local minimum
  - a saddle point
- No global optimality guarantee
9. Stopping Criteria
- Gradient norm small: $\|\nabla f(x_k)\| \le \epsilon$
- Small change in function value: $|f(x_{k+1}) - f(x_k)| \le \delta$
- Maximum number of iterations reached
10. Variants of Gradient Descent
(a) Batch Gradient Descent

- Uses the entire dataset to compute each gradient
- Accurate but slow on large datasets

(b) Stochastic Gradient Descent (SGD)

- Uses one data point per update
- Faster, but the updates are noisy
- The noise can help escape saddle points

(c) Mini-Batch Gradient Descent

- Uses small batches of data per update
- Most commonly used in practice
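The variants differ only in how much data each gradient uses. A mini-batch version for linear least squares might look like the sketch below; the function name, synthetic data, and hyperparameters are all illustrative assumptions:

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.05, batch_size=8, epochs=200, seed=0):
    """Mini-batch gradient descent for linear least squares (minimal sketch)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = order[start:start + batch_size]
            # MSE gradient on the batch: (2/|b|) * X_b^T (X_b w - y_b)
            g = (2.0 / len(b)) * X[b].T @ (X[b] @ w - y[b])
            w -= alpha * g
    return w

# Synthetic noiseless data with known weights (2, -3), for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0])
w_hat = minibatch_sgd(X, y)
```

Setting `batch_size=n` recovers batch gradient descent, and `batch_size=1` recovers SGD.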
11. Accelerated Gradient Methods
(a) Momentum Method

- Accumulates a velocity term: $v_{k+1} = \beta v_k - \alpha \nabla f(x_k)$, then $x_{k+1} = x_k + v_{k+1}$
- Speeds up convergence
- Reduces oscillations in narrow valleys

(b) Nesterov Accelerated Gradient

- Computes the gradient at a look-ahead point $x_k + \beta v_k$
- Faster theoretical convergence rate ($O(1/k^2)$ for smooth convex functions)
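A minimal sketch of the momentum (heavy-ball) update; $\beta = 0.9$ is a typical but illustrative choice:

```python
# Heavy-ball momentum: accumulate a velocity, then move along it.
def momentum_descent(grad, x0, alpha=0.1, beta=0.9, steps=300):
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - alpha * grad(x)   # velocity update
        x = x + v                        # position update
    return x

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.
x_min = momentum_descent(lambda x: 2 * x, x0=5.0)
```

The Nesterov variant differs only in evaluating `grad` at the look-ahead point `x + beta * v` instead of at `x`.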
12. Comparison with Second-Order Methods
| Method | Derivatives Used | Cost | Speed |
|---|---|---|---|
| Gradient Descent | First | Low | Moderate |
| Newton’s Method | First + Second | High | Fast |
| Quasi-Newton | Approx. Hessian | Medium | Fast |
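The difference in the table shows up already in one dimension: on a quadratic, a single Newton step (which uses the second derivative) lands on the minimum exactly, while gradient descent only approaches it geometrically. The test function below is an illustrative choice:

```python
# Minimize f(x) = x^2 - 4x, with f'(x) = 2x - 4, f''(x) = 2, and minimum at x = 2.
def f_prime(x):
    return 2 * x - 4

x0 = 10.0

# Newton's method: x - f'(x)/f''(x) solves a quadratic exactly in one step.
x_newton = x0 - f_prime(x0) / 2.0

# Gradient descent needs many first-order steps to get comparably close.
x_gd = x0
for _ in range(100):
    x_gd -= 0.1 * f_prime(x_gd)
```

The trade-off is cost: each Newton step needs the Hessian (and, in higher dimensions, a linear solve), which is what the "High cost" column refers to.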
13. Advantages of Gradient-Based Methods
- Easy to implement
- Scales to high-dimensional problems
- The core optimization method in machine learning
14. Limitations
- Sensitive to the choice of learning rate
- Slow on ill-conditioned problems
- Can stall near saddle points
15. Role in Machine Learning
- Training linear regression
- Logistic regression
- Neural networks (backpropagation)
- Support vector machines (primal form)