Linear Regression: Using Gradient Descent


1. What is Linear Regression?

Linear Regression is a supervised learning algorithm used to model the relationship between an input variable and an output variable by fitting a straight line.

Goal

To find a line that best predicts the output for given inputs.


2. Simple Linear Regression Model

For one input variable x, the model is:

\hat{y} = wx + b

Where:

  • w → slope (weight)

  • b → intercept (bias)

  • \hat{y} → predicted output
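As code, the model is a one-line function. This is a minimal sketch (not from the original post); the parameter values w = 6 and b = 45 are just illustrative.

```python
def predict(x, w, b):
    """Predicted output for input x: y_hat = w*x + b."""
    return w * x + b

# Illustrative parameters: slope 6, intercept 45
print(predict(3, 6, 45))  # 6*3 + 45 = 63
```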


3. A Simple Example

Problem Statement

Predict a student’s exam score based on hours studied.

Hours studied (x)                Score (y)
1                                50
2                                55
3                                65
4                                70
5                                75

4. Visual Idea

We want to fit a straight line through these points such that the overall error is minimized.

The line might look like:

\hat{y} = 6x + 45

5. How Do We Measure Error?

We use Mean Squared Error (MSE):

\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2


This penalizes large errors more heavily.
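As a quick sketch (assuming NumPy arrays of predictions and targets), MSE is one line, and the example below shows how squaring weights one large error more heavily than two small ones:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean Squared Error: average of squared prediction errors."""
    return np.mean((y_pred - y_true) ** 2)

# Two errors of size 1 give MSE 1, but one error of size 2 gives MSE 2,
# even though the total absolute error is the same in both cases.
print(mse(np.array([1.0, 1.0]), np.array([0.0, 0.0])))  # 1.0
print(mse(np.array([0.0, 2.0]), np.array([0.0, 0.0])))  # 2.0
```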


6. Learning the Best Line (Concept)

  • Start with random w and b

  • Predict scores using \hat{y} = wx + b

  • Compute errors

  • Adjust w and b to reduce errors

  • Repeat until the best line is found

This adjustment is done using Gradient Descent.


7. Interpretation of Parameters

  • Slope w
    → Increase in score per additional hour studied
    (e.g., w = 6 means +6 marks per hour)

  • Intercept b
    → Expected score with zero hours studied


8. Why Linear Regression Works Here

  • Relationship is approximately linear

  • Errors are minimized globally (convex problem)

  • Model is simple and interpretable


9. Definition

Linear Regression is a supervised learning algorithm that models the relationship between dependent and independent variables by fitting a linear equation that minimizes the mean squared error.


10. Key Takeaways 

✔ Simple and interpretable
✔ Works well for linear relationships
✔ Foundation for many ML algorithms
✔ Solved using optimization


11. Closed-Form Solution 

For advanced students:

w = (X^T X)^{-1} X^T y

But in practice, gradient descent is preferred for large datasets.
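For the hours-studied data above, the closed form can be checked directly with NumPy. This is a sketch using the normal equations (`np.linalg.lstsq` would work equally well); the design matrix gets a column of ones so the intercept b is learned too.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([50, 55, 65, 70, 75], dtype=float)

# Design matrix X with a column of ones for the intercept b
X = np.column_stack([x, np.ones_like(x)])

# Normal equations: (X^T X) theta = X^T y
w, b = np.linalg.solve(X.T @ X, X.T @ y)
print(w, b)  # w ≈ 6.5, b ≈ 43.5
```

So the exact least-squares line for this data is \hat{y} = 6.5x + 43.5.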


12. One-Line Intuition

Linear regression finds the straight line that best fits the data by minimizing prediction errors.

How Gradient Descent Finds the Best-Fit Line in Linear Regression


1. Problem Setup (Recalled)

We have data points:

(x_i, y_i), \quad i = 1, 2, \dots, n

Model (Straight Line)

\hat{y} = wx + b

Our goal is to find w (slope) and b (intercept) such that the line fits the data as well as possible.


2. What Does “Best Fit” Mean?

“Best fit” means minimizing the total prediction error.

We measure error using Mean Squared Error (MSE):

J(w,b) = \frac{1}{n}\sum_{i=1}^n (wx_i + b - y_i)^2

📌 This is a convex function in w and b,
so it has one global minimum → one best-fit line.


3. Geometry of the Problem (Very Important)

  • Each choice of (w, b) gives a different line

  • Each line produces a certain MSE value

  • If we plot J(w,b), we get a bowl-shaped surface

  • The lowest point of the bowl corresponds to the best line

👉 Gradient descent moves us down this bowl.


4. Role of Gradient Descent

Key Idea

Gradient descent updates w and b in the direction that reduces the error fastest.

Mathematically:

\text{New value} = \text{Old value} - \eta \times \text{Gradient}

where \eta is the learning rate (step size).


5. Compute Gradients (Direction of Steepest Increase)

Gradient with respect to w

\frac{\partial J}{\partial w} = \frac{2}{n}\sum_{i=1}^n (wx_i + b - y_i)x_i

Gradient with respect to b

\frac{\partial J}{\partial b} = \frac{2}{n}\sum_{i=1}^n (wx_i + b - y_i)

📌 These gradients tell us:

  • If slope is too steep or too flat

  • If line is too high or too low


6. Gradient Descent Update Rules

w \leftarrow w - \eta \frac{2}{n}\sum (wx_i + b - y_i)x_i

b \leftarrow b - \eta \frac{2}{n}\sum (wx_i + b - y_i)

Each update slightly rotates or shifts the line.


7. Simple Numerical Example (Concrete)

Data:

(1, 2), (2, 3)

Step 1: Initialize

w = 0, \quad b = 0

Predictions:

\hat{y} = 0

Errors:

(0-2), (0-3)


Step 2: Compute Gradients

\frac{\partial J}{\partial w} = \frac{2}{2}[(-2)(1)+(-3)(2)] = -8

\frac{\partial J}{\partial b} = \frac{2}{2}(-2-3) = -5


Step 3: Update Parameters (\eta = 0.1)

w = 0 - 0.1(-8) = 0.8

b = 0 - 0.1(-5) = 0.5

📌 New line:

\hat{y} = 0.8x + 0.5

This line is closer to the data than before.
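The hand calculation above can be verified in a few lines of NumPy (a sketch, using the same two data points and the same learning rate):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.0, 3.0])
w, b, eta = 0.0, 0.0, 0.1

y_pred = w * x + b                             # predictions: [0, 0]
dw = (2 / len(x)) * np.sum((y_pred - y) * x)   # (2/2)[(-2)(1) + (-3)(2)] = -8
db = (2 / len(x)) * np.sum(y_pred - y)         # (2/2)(-2 - 3) = -5

w -= eta * dw
b -= eta * db
print(w, b)  # 0.8 0.5
```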


8. Repeating the Process

Each iteration:

  • Line moves closer to data

  • Error reduces

  • Gradients become smaller

  • Eventually updates stop changing

At convergence:

\nabla J(w,b) = 0

This gives the best-fit line.


9. Why Gradient Descent Always Finds the Best Line

✔ MSE loss is convex
✔ Only one minimum exists
✔ Gradient descent cannot get stuck elsewhere

So the final line is globally optimal.


10. Intuitive Explanation 

Gradient descent keeps adjusting the slope and intercept until the line cannot be improved any further.


11. Visual Interpretation (Analogy)

  • Imagine placing a ball on a bowl

  • The ball rolls down

  • It stops at the lowest point

  • That point corresponds to the best-fit line


12. Conclusion

In linear regression, gradient descent works by iteratively updating the slope and intercept in the direction opposite to the gradient of the mean squared error until the global minimum is reached, yielding the best-fit line.


13. Python Code

import numpy as np
import matplotlib.pyplot as plt

# Example data
x = np.array([1, 2, 3, 4, 5])
y = np.array([50, 55, 65, 70, 75])

# Initialize parameters
w = 0.0      # slope
b = 0.0      # intercept
eta = 0.01   # learning rate
iterations = 25

# Store history for plotting
w_history = []
b_history = []

# Gradient Descent
for _ in range(iterations):
    y_pred = w * x + b

    # Gradients of the MSE with respect to w and b
    dw = (-2 / len(x)) * np.sum(x * (y - y_pred))
    db = (-2 / len(x)) * np.sum(y - y_pred)

    # Update parameters
    w = w - eta * dw
    b = b - eta * db

    w_history.append(w)
    b_history.append(b)

# Plot data points
plt.figure()
plt.scatter(x, y)

# Plot intermediate lines
x_line = np.linspace(0, 6, 100)
for i in range(0, iterations, 5):
    y_line = w_history[i] * x_line + b_history[i]
    plt.plot(x_line, y_line,'b')

# Plot final best-fit line
y_final = w * x_line + b
plt.plot(x_line, y_final,'r')

plt.xlabel("Hours Studied")
plt.ylabel("Score")
plt.title("Gradient Descent: Evolution of Best-Fit Line")
plt.show()

print("Final slope (w):", w)
print("Final intercept (b):", b)


The best-fit line is drawn in red.

Final slope (w): 16.336112134810687
Final intercept (b): 7.910100074060203

Note that 25 iterations are not enough for full convergence on this data: with more iterations, w and b continue to move toward the exact least-squares solution w = 6.5, b = 43.5.

 
