Gradient-Based Methods for Optimization (with Gradient Descent Algorithm)
1. Optimization Problem Setup
In optimization, we aim to find

$$\min_{x \in \mathbb{R}^n} f(x)$$

where:

- $f : \mathbb{R}^n \to \mathbb{R}$ is a real-valued objective function
- $x = (x_1, \dots, x_n)$ is a vector of variables

In machine learning, $f$ is typically a loss or cost function.
2. What is a Gradient-Based Method?
A gradient-based method uses first-order derivative information (the gradient) to guide the search for an optimum.
Key Idea

- The gradient $\nabla f(x)$ points in the direction of steepest increase
- The negative gradient $-\nabla f(x)$ points in the direction of steepest decrease

📌 Hence, to minimize a function, we move in the negative gradient direction.
3. Geometric Interpretation of the Gradient
- The gradient $\nabla f(x)$ is perpendicular to the level curves of $f$
- Moving along $-\nabla f(x)$ decreases the function value most rapidly
- Optimization proceeds by repeatedly stepping downhill
4. Gradient Descent Algorithm
Basic Update Rule

$$x_{k+1} = x_k - \alpha \nabla f(x_k)$$

where:

- $x_k$: current point
- $\nabla f(x_k)$: gradient at $x_k$
- $\alpha > 0$: step size (learning rate)
5. Algorithmic Steps (Pseudo-Code)
Gradient Descent Algorithm

1. Initialize $x_0$ and set $k = 0$
2. Choose a learning rate $\alpha > 0$
3. Repeat until convergence:
   - Compute $\nabla f(x_k)$
   - Update $x_{k+1} = x_k - \alpha \nabla f(x_k)$
   - Set $k \leftarrow k + 1$
4. Stop when $\|\nabla f(x_k)\| \le \epsilon$
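The steps above can be sketched in Python. The function name and the quadratic test problem below are illustrative choices, not part of the original text:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, eps=1e-6, max_iter=1000):
    """Minimize f by stepping against its gradient until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:      # stopping criterion: small gradient norm
            break
        x = x - alpha * g                 # step in the negative gradient direction
    return x

# Example problem: f(x) = ||x - c||^2 with c = (3, -1); its gradient is 2*(x - c).
c = np.array([3.0, -1.0])
minimum = gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0])
```

Since the example is a well-conditioned quadratic, the iterates contract toward $c$ geometrically and the gradient-norm test triggers well before `max_iter`.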
6. Example (One-Dimensional)
Minimize $f(x) = x^2$.

- Gradient: $f'(x) = 2x$
- Update rule: $x_{k+1} = x_k - 2\alpha x_k = (1 - 2\alpha)\,x_k$

For $0 < \alpha < 1$, the sequence converges to the global minimum $x^* = 0$.
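A minimal sketch of this one-dimensional example, taking the quadratic $f(x) = x^2$ as the concrete instance:

```python
# Gradient descent on f(x) = x^2: each step multiplies the iterate by
# (1 - 2*alpha), so the sequence converges to 0 whenever 0 < alpha < 1.
def minimize_square(x0, alpha, steps=60):
    x = x0
    for _ in range(steps):
        x = x - alpha * (2 * x)     # f'(x) = 2x
    return x

x_final = minimize_square(x0=5.0, alpha=0.25)   # contraction factor 1 - 0.5 = 0.5
```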
7. Choice of Learning Rate (Step Size)
(a) Fixed Learning Rate

- Simple to implement
- May overshoot the minimum if $\alpha$ is too large
- Converges slowly if $\alpha$ is too small

(b) Diminishing Learning Rate

- Step sizes $\alpha_k$ shrink over the iterations (e.g. $\alpha_k = \alpha_0 / k$)
- Guarantees convergence for convex functions under standard conditions (e.g. $\sum_k \alpha_k = \infty$, $\sum_k \alpha_k^2 < \infty$)

(c) Line Search

- Chooses $\alpha_k$ at each iteration, e.g. by (approximately) minimizing $f(x_k - \alpha \nabla f(x_k))$ over $\alpha$
- More accurate steps
- Computationally more expensive per iteration
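The three fixed-step regimes can be seen on a simple quadratic; the specific step sizes below are illustrative:

```python
# On f(x) = x^2, each step scales x by (1 - 2*alpha), so the fixed step size
# determines whether the iteration crawls, converges quickly, or diverges.
def final_distance(alpha, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - alpha * (2 * x)
    return abs(x)

too_small = final_distance(0.1)    # slow: factor 0.8 per step
well_tuned = final_distance(0.45)  # fast: factor 0.1 per step
too_large = final_distance(1.1)    # overshoots and diverges: factor -1.2 per step
```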
8. Convergence Properties
Convex Functions
- Gradient descent converges to the global minimum
- The convergence rate depends on the condition number of the Hessian

Non-Convex Functions

- May converge to:
  - a local minimum
  - a saddle point
- No global optimality guarantee
9. Stopping Criteria
- Gradient norm small: $\|\nabla f(x_k)\| \le \epsilon$
- Small change in function value: $|f(x_{k+1}) - f(x_k)| \le \delta$
- Maximum number of iterations reached
10. Variants of Gradient Descent
(a) Batch Gradient Descent

- Uses the entire dataset to compute each gradient
- Accurate but slow on large datasets

(b) Stochastic Gradient Descent (SGD)

- Uses one data point per update
- Faster, but the updates are noisy
- The noise can help escape saddle points

(c) Mini-Batch Gradient Descent

- Uses small batches of data per update
- Most commonly used in practice
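The variants differ only in how much data each gradient uses. A mini-batch version for linear least squares might look like the sketch below; the function name, synthetic data, and hyperparameters are all illustrative assumptions:

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.05, batch_size=8, epochs=200, seed=0):
    """Mini-batch gradient descent for linear least squares (minimal sketch)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = order[start:start + batch_size]
            # MSE gradient on the batch: (2/|b|) * X_b^T (X_b w - y_b)
            g = (2.0 / len(b)) * X[b].T @ (X[b] @ w - y[b])
            w -= alpha * g
    return w

# Synthetic noiseless data with known weights (2, -3), for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0])
w_hat = minibatch_sgd(X, y)
```

Setting `batch_size=n` recovers batch gradient descent, and `batch_size=1` recovers SGD.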
11. Accelerated Gradient Methods
(a) Momentum Method

- Accumulates a velocity term: $v_{k+1} = \beta v_k - \alpha \nabla f(x_k)$, then $x_{k+1} = x_k + v_{k+1}$
- Speeds up convergence
- Reduces oscillations in narrow valleys

(b) Nesterov Accelerated Gradient

- Computes the gradient at a look-ahead point $x_k + \beta v_k$
- Faster theoretical convergence rate ($O(1/k^2)$ for smooth convex functions)
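A minimal sketch of the momentum (heavy-ball) update; $\beta = 0.9$ is a typical but illustrative choice:

```python
# Heavy-ball momentum: accumulate a velocity, then move along it.
def momentum_descent(grad, x0, alpha=0.1, beta=0.9, steps=300):
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - alpha * grad(x)   # velocity update
        x = x + v                        # position update
    return x

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.
x_min = momentum_descent(lambda x: 2 * x, x0=5.0)
```

The Nesterov variant differs only in evaluating `grad` at the look-ahead point `x + beta * v` instead of at `x`.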
12. Comparison with Second-Order Methods
| Method | Derivatives Used | Cost | Speed |
|---|---|---|---|
| Gradient Descent | First | Low | Moderate |
| Newton’s Method | First + Second | High | Fast |
| Quasi-Newton | Approx. Hessian | Medium | Fast |
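The difference in the table shows up already in one dimension: on a quadratic, a single Newton step (which uses the second derivative) lands on the minimum exactly, while gradient descent only approaches it geometrically. The test function below is an illustrative choice:

```python
# Minimize f(x) = x^2 - 4x, with f'(x) = 2x - 4, f''(x) = 2, and minimum at x = 2.
def f_prime(x):
    return 2 * x - 4

x0 = 10.0

# Newton's method: x - f'(x)/f''(x) solves a quadratic exactly in one step.
x_newton = x0 - f_prime(x0) / 2.0

# Gradient descent needs many first-order steps to get comparably close.
x_gd = x0
for _ in range(100):
    x_gd -= 0.1 * f_prime(x_gd)
```

The trade-off is cost: each Newton step needs the Hessian (and, in higher dimensions, a linear solve), which is what the "High cost" column refers to.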
13. Advantages of Gradient-Based Methods
- Easy to implement
- Scales to high-dimensional problems
- The core optimization method in machine learning
14. Limitations
- Sensitive to the choice of learning rate
- Slow on ill-conditioned problems
- Can stall near saddle points
15. Role in Machine Learning
- Training linear regression
- Logistic regression
- Neural networks (backpropagation)
- Support vector machines (primal form)