Learning Rate (Step Size) in Optimization

 

1. What is Learning Rate?

The learning rate (also called step size) is a positive scalar that determines how far we move along the negative gradient direction at each iteration of an optimization algorithm.

In gradient descent:

x_{k+1} = x_k - \eta \nabla f(x_k)

where η > 0 is the learning rate.
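The update rule above can be written directly in code. The objective f(x) = x² and its gradient below are illustrative choices, not fixed by the text:

```python
def gradient_descent_step(x, grad, eta):
    """One gradient descent update: x_{k+1} = x_k - eta * grad(x_k)."""
    return x - eta * grad(x)

# Illustrative objective f(x) = x^2, whose gradient is 2x
grad = lambda x: 2 * x

x = 4.0
x = gradient_descent_step(x, grad, eta=0.25)
print(x)  # 4.0 - 0.25 * 8.0 = 2.0
```

Note that η only scales the step; the direction comes entirely from the gradient.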

2. Intuitive Meaning

  • The negative gradient gives the direction of steepest descent

  • The learning rate decides how far to move in that direction

📌 Analogy:
Walking downhill:

  • Direction = slope

  • Step length = learning rate


3. Effect of Learning Rate Size

(a) Small Learning Rate

  • Very slow convergence

  • Many iterations needed

  • Stable but inefficient

📉 Example: η=0.001


(b) Large Learning Rate

  • Faster movement

  • May overshoot the minimum

  • Can cause divergence or oscillation

📈 Example: η=1.2


(c) Optimal Learning Rate

  • Fast convergence

  • Stable descent

  • Reaches minimum efficiently

📌 Choosing the right η is crucial.


4. Visual Interpretation (1D)

Learning Rate    Behavior
Too small        Tiny steps toward the minimum
Too large        Jumps back and forth across it
Appropriate      Smooth convergence

5. Mathematical Insight (Quadratic Case)

For:

f(x) = x^2

Update rule:

x_{k+1} = (1 - 2\eta) x_k

Convergence condition:

0 < \eta < 1

  • η = 0.5 → fastest convergence (the minimum is reached in a single step)

  • η ≥ 1 → no convergence (oscillation at η = 1, divergence beyond)
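These claims are easy to verify numerically, since each iterate is just the previous one scaled by (1 − 2η); the starting point and step count below are arbitrary:

```python
def iterate(eta, x0=4.0, steps=10):
    """Gradient descent on f(x) = x^2: each step is x <- (1 - 2*eta) * x."""
    x = x0
    for _ in range(steps):
        x = (1 - 2 * eta) * x
    return x

print(iterate(0.1))  # shrinks by factor 0.8 per step: ~0.43 after 10 steps
print(iterate(0.5))  # 0.0 -- the minimum is reached in a single step
print(iterate(1.2))  # grows by factor |1 - 2.4| = 1.4 per step: diverging
```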


6. Learning Rate and Convergence

Convex Functions

  • Proper η guarantees convergence to the global minimum

Non-Convex Functions

  • Affects:

    • Speed

    • Stability

    • Escape from saddle points


7. Types of Learning Rates

(a) Fixed Learning Rate

\eta_k = \eta
  • Simple

  • Sensitive to choice


(b) Decaying Learning Rate

\eta_k = \frac{\eta_0}{1 + \alpha k}
  • Large steps initially

  • Small steps near minimum

  • Ensures convergence
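A minimal sketch of this schedule; η₀ = 0.5 and α = 0.1 are illustrative values:

```python
def decayed_lr(eta0, alpha, k):
    """Decaying schedule: eta_k = eta0 / (1 + alpha * k)."""
    return eta0 / (1 + alpha * k)

# Illustrative values: eta0 = 0.5, alpha = 0.1
for k in (0, 10, 100):
    print(k, decayed_lr(0.5, 0.1, k))
# 0   -> 0.5    (large steps initially)
# 10  -> 0.25
# 100 -> ~0.045 (small steps near the minimum)
```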


(c) Adaptive Learning Rates

Automatically adjust learning rate:

  • AdaGrad

  • RMSProp

  • Adam

Widely used in deep learning.
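As one example, here is a simplified sketch of the standard Adam update (with its usual default hyperparameters β₁ = 0.9, β₂ = 0.999, ε = 1e-8); production implementations live in libraries such as PyTorch or TensorFlow:

```python
import numpy as np

def adam_step(x, g, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: the effective step size adapts per coordinate."""
    m = b1 * m + (1 - b1) * g        # running mean of gradients
    v = b2 * v + (1 - b2) * g**2     # running mean of squared gradients
    m_hat = m / (1 - b1**t)          # bias correction for early steps
    v_hat = v / (1 - b2**t)
    x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

x = np.array([4.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 201):
    g = 2 * x                        # gradient of the toy objective f(x) = x^2
    x, m, v = adam_step(x, g, m, v, t)
print(x)  # steadily approaches 0; each step has magnitude ~eta
```

The point of the per-coordinate rescaling is that the nominal η no longer has to match the gradient scale of the problem.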


8. Learning Rate in Machine Learning

Algorithm              Role of the learning rate
Linear Regression      Controls convergence speed
Logistic Regression    Stability
Neural Networks        Training success
SGD                    Noise control

📌 Poor learning rate → poor model training.


9. Common Problems Due to Wrong Learning Rate

Problem           Cause
Slow training     Learning rate too small
Divergence        Learning rate too large
Oscillations      Rate too large for the local curvature
No convergence    Bad tuning

10. Practical Guidelines

  • Start with moderate value (e.g., 0.01 or 0.1)

  • Monitor loss curve

  • Reduce η if loss oscillates

  • Use decay or adaptive methods
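One concrete version of the "reduce η if loss oscillates" advice is a plateau rule: halve the rate when the loss has stopped improving. The patience window and halving factor below are illustrative choices:

```python
def maybe_reduce_lr(eta, loss_history, patience=3, factor=0.5):
    """Halve eta if the loss has not improved in the last `patience` steps."""
    if len(loss_history) > patience:
        recent = loss_history[-patience:]
        best_before = min(loss_history[:-patience])
        if min(recent) >= best_before:   # no recent improvement
            return eta * factor
    return eta

# Loss plateaus at 0.3, so the rate is halved
print(maybe_reduce_lr(0.1, [1.0, 0.5, 0.3, 0.3, 0.3, 0.3]))  # 0.05
```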


Python Code and Visualization

import numpy as np
import matplotlib.pyplot as plt

# Objective function and gradient
def f(x):
    return x**2

def grad_f(x):
    return 2*x

# Learning rates to compare
learning_rates = [0.05, 0.2, 0.9]
labels = ["Small η = 0.05", "Moderate η = 0.2", "Large η = 0.9"]

x0 = 4.0          # Starting point
iterations = 15  # Number of GD steps

# Function curve
x_curve = np.linspace(-5, 5, 400)
y_curve = f(x_curve)

plt.figure()
plt.plot(x_curve, y_curve)

# Gradient Descent for each learning rate
for eta, label in zip(learning_rates, labels):
    x = x0
    xs = [x]
    ys = [f(x)]

    for _ in range(iterations):
        x = x - eta * grad_f(x)
        xs.append(x)
        ys.append(f(x))

    plt.plot(xs, ys, marker='o', label=label)

plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("Effect of Learning Rate on Gradient Descent")
plt.legend()
plt.show()


