
Multiple Linear Regression
- Aryan
- Jan 2
- 5 min read
Multiple Linear Regression is an extension of simple linear regression. Unlike simple linear regression, which involves one input feature (independent variable) and one output (dependent variable), multiple linear regression involves more than one input feature and still predicts a single output.
We use it when there are multiple predictors (input variables) influencing the target.
Example Dataset
Suppose we have data for 1000 students :
CGPA | IQ | Placement |
8 | 80 | 8 |
9 | 90 | 9 |
5 | 120 | 15 |
... | ... | ... |
In this case, we have two input features: CGPA and IQ, and one target variable: Placement.
Since we have multiple input columns, we use Multiple Linear Regression.
The model takes CGPA and IQ as inputs and predicts Placement.
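Before the math, here is a minimal sketch of what this looks like in code, assuming the three rows above as a toy dataset (scikit-learn is used only for illustration; any least-squares fit would do):

    # Minimal sketch: fit a multiple linear regression on the toy rows above.
    # The three (CGPA, IQ, Placement) rows are illustrative, not a real dataset.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[8, 80],
                  [9, 90],
                  [5, 120]], dtype=float)   # columns: CGPA, IQ
    y = np.array([8.0, 9.0, 15.0])          # target: Placement

    model = LinearRegression().fit(X, y)
    print(model.intercept_)                 # the fitted intercept
    print(model.coef_)                      # one weight per feature (CGPA, IQ)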
Mathematical Equation
The equation of the regression line becomes :
y = β₀ + β₁x₁ + β₂x₂
where :
x₁ -> CGPA
x₂ -> IQ
β₀ -> Intercept (bias term)
β₁ , β₂ -> slopes (weights)
If we have n input features, the equation generalizes to :
y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ
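As a quick worked example, with hypothetical coefficients β₀ = 1, β₁ = 0.5, β₂ = 0.05 (made up purely for illustration), a student with CGPA 8 and IQ 80 would be predicted as follows:

    # Hypothetical coefficients, made up purely for illustration.
    beta0, beta1, beta2 = 1.0, 0.5, 0.05
    cgpa, iq = 8, 80
    y_hat = beta0 + beta1 * cgpa + beta2 * iq   # 1 + 4 + 4
    print(y_hat)                                # 9.0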
Geometric Interpretation
In simple linear regression, the model fits a line in 2D space.
In multiple linear regression with 2 features, it fits a plane in 3D space.
In higher dimensions (n > 2), the model fits a hyperplane in n-dimensional space.
Objective of the Model
The goal of multiple linear regression is to find the optimal values of:
β₀ , β₁ , β₂ , … , βₙ
These are the coefficients (weights) of the model.
The algorithm tries to fit a hyperplane that minimizes the error between the predicted and actual output. In other words, it tries to find a surface that is as close as possible to all data points in high-dimensional space.
Mathematical Formulation of Multiple Linear Regression
Let’s consider data from three students of a college. Our goal is to apply Multiple Linear Regression to predict placement based on CGPA and IQ.
CGPA | IQ | Placement |
8 | 80 | ŷ₁ = ? |
7 | 70 | ŷ₂ = ? |
5 | 120 | ŷ₃ = ? |
Assume we know the coefficient values β₀ , β₁ , β₂ .
We want to predict:
ŷ₁ = ? ŷ₂ = ? ŷ₃ = ?
To calculate the predicted outputs (ŷᵢ):
ŷ₁ = β₀ + β₁·8 + β₂·80
ŷ₂ = β₀ + β₁·7 + β₂·70
ŷ₃ = β₀ + β₁·5 + β₂·120
Or more generally, using variable notation:
ŷ₁ = β₀ + β₁x₁₁ + β₂x₁₂
ŷ₂ = β₀ + β₁x₂₁ + β₂x₂₂
ŷ₃ = β₀ + β₁x₃₁ + β₂x₃₂
Now, suppose we have m input features and n students (i.e., n training samples).
Each prediction becomes:
ŷᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + β₃xᵢ₃ + ⋯ + βₘxᵢₘ
So for all students:
ŷ₁ = β₀ + β₁x₁₁ + β₂x₁₂ + ⋯ + βₘx₁ₘ
ŷ₂ = β₀ + β₁x₂₁ + β₂x₂₂ + ⋯ + βₘx₂ₘ
ŷ₃ = β₀ + β₁x₃₁ + β₂x₃₂ + ⋯ + βₘx₃ₘ
⋮
ŷₙ = β₀ + β₁xₙ₁ + β₂xₙ₂ + ⋯ + βₘxₙₘ
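A short sketch of these per-student equations, assuming three students and hypothetical β values (the same made-up coefficients as before):

    import numpy as np

    # Assumed data for three students (CGPA, IQ) and hypothetical coefficients.
    X = np.array([[8, 80],
                  [7, 70],
                  [5, 120]], dtype=float)
    beta0 = 1.0                       # intercept beta_0 (hypothetical)
    beta = np.array([0.5, 0.05])      # beta_1, beta_2 (hypothetical)

    # y_hat_i = beta_0 + sum_j beta_j * x_ij, computed one student at a time
    for i, row in enumerate(X, start=1):
        y_hat_i = beta0 + np.dot(beta, row)
        print(f"y_hat_{i} = {y_hat_i}")       # 9.0, 8.0, 9.5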
Matrix Form (Compact Representation)
Let’s express all predictions in matrix form:
Prediction vector Ŷ :
Ŷ = [ŷ₁, ŷ₂, ⋯, ŷₙ]ᵀ   (an n×1 column vector)
Matrix Representation :
Ŷ = Xβ
Where:
X is the n×(m+1) design matrix. Its first column is all 1s (for the intercept) and the remaining columns hold the feature values:
X = [ [1, x₁₁, x₁₂, ⋯, x₁ₘ],
      [1, x₂₁, x₂₂, ⋯, x₂ₘ],
       ⋮
      [1, xₙ₁, xₙ₂, ⋯, xₙₘ] ]
β = [β₀, β₁, β₂, ⋯, βₘ]ᵀ   (an (m+1)×1 column vector of coefficients)
So Ŷ = Xβ, and multiplying the i-th row of X by β reproduces exactly ŷᵢ = β₀ + β₁xᵢ₁ + ⋯ + βₘxᵢₘ. The n equations written above collapse into a single matrix product.
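The same toy numbers in vectorized form: prepending a column of 1s to the feature matrix lets a single matrix product produce every prediction at once (β values are still the hypothetical ones from above):

    import numpy as np

    # Same assumed data and hypothetical coefficients as the loop above.
    X = np.array([[8, 80],
                  [7, 70],
                  [5, 120]], dtype=float)
    beta = np.array([1.0, 0.5, 0.05])        # [beta_0, beta_1, beta_2]

    ones = np.ones((X.shape[0], 1))
    X_design = np.hstack([ones, X])          # n x (m+1) design matrix
    y_hat = X_design @ beta                  # Y_hat = X beta, all students at once
    print(y_hat)                             # [9.  8.  9.5]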
Loss Function in Multiple Linear Regression
In regression, our aim is to find a best-fit line or hyperplane that minimizes the distance between the actual values and predicted values.
We want to minimize the sum of squared distances (errors):
d₁² + d₂² + d₃² + ⋯ + dₙ²
Objective
In Multiple Linear Regression, we minimize the Sum of Squared Errors (SSE) — the total squared distance between actual and predicted values.
This is known as the Loss Function:
E = Σᵢ (yᵢ - ŷᵢ)² = d₁² + d₂² + ⋯ + dₙ²
Vector & Matrix Representation
Let e = y - ŷ be the n×1 vector of residuals (errors), where y holds the actual values and ŷ = Xβ the predictions. In vector form the loss is:
E = eᵀe = Σᵢ eᵢ²
Expanded Matrix Form
Let's now expand E = eᵀe using matrix algebra.
e = y - ŷ ⇒ E = (y - ŷ)ᵀ(y - ŷ)
Now expand using distributive property:
E = yᵀy - yᵀŷ - ŷᵀy + ŷᵀŷ
Since yᵀŷ and ŷᵀy are scalars and equal:
E = yᵀy - 2yᵀŷ + ŷᵀŷ
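A quick numeric sanity check of this expansion, using small assumed vectors (the exact numbers do not matter):

    import numpy as np

    # Assumed actual and predicted values, purely for illustration.
    y     = np.array([8.0, 9.0, 15.0])
    y_hat = np.array([9.0, 8.0, 9.5])

    e = y - y_hat
    direct   = e @ e                                     # (y - y_hat)^T (y - y_hat)
    expanded = y @ y - 2 * (y @ y_hat) + y_hat @ y_hat   # y^T y - 2 y^T y_hat + y_hat^T y_hat
    print(direct, expanded)                              # both 32.25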
Proving yᵀŷ = ŷᵀy
Now we will prove that the two expressions below are equal:
yᵀŷ = ŷᵀy
Let’s assume:
y = A
ŷ = B
So we want to prove:
AᵀB = (AᵀB)ᵀ OR C = Cᵀ
This means we want to prove AᵀB is a symmetric matrix, i.e., its transpose is equal to itself.
Symmetric Matrix Example:
A = [ [1, 2],
      [2, 3] ] , and Aᵀ = A.
Now let’s prove that yᵀŷ is symmetric :
We know :
ŷ = Xβ ⇒ yᵀŷ = yᵀXβ
Let’s break down the dimensions:
X is an n×(m+1) matrix (with intercept column)
β is a (m+1)×1 column vector
yᵀ is a 1×n row vector
so yᵀXβ results in a scalar.
And since the transpose of a scalar is the same scalar, we get :
yᵀŷ = ŷᵀy proven
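Numerically, with the same assumed vectors, the two products are indeed one and the same scalar:

    import numpy as np

    y     = np.array([8.0, 9.0, 15.0])   # assumed actual values
    y_hat = np.array([9.0, 8.0, 9.5])    # assumed predictions

    print(y @ y_hat, y_hat @ y)          # 286.5 286.5 (a scalar equals its own transpose)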
Loss Function in Terms of β
We had derived the error function as:
E = (y - ŷ)ᵀ(y - ŷ)
Now using the identity ŷ = Xβ , we substitute :
E = yᵀy - 2yᵀXβ + βᵀXᵀXβ
Interpretation
E(β) is a function of β
As we change the values of β , the error E also changes
Input X and output y are fixed, only β is variable
Goal : We want to find the values of β that minimize the error function E(β).
Relation of Loss Function with Coefficients
To minimize the loss function, we differentiate it with respect to β and set it to zero:
∂E/∂β = 0
We want to find the value of β where the loss is minimized (i.e., where the gradient is zero).
Differentiating the Loss Function
From earlier:
E(β) = yᵀy - 2yᵀXβ + βᵀXᵀXβ
Now differentiating w.r.t. β :
∂E/∂β = -2yᵀX + 2βᵀXᵀX
Matrix Differentiation Rule Used :
∂(βᵀAβ)/∂β = βᵀ(A + Aᵀ) = 2βᵀA   (when A is symmetric)
∂(yᵀXβ)/∂β = yᵀX
in our case:
A = XᵀX , which is symmetric because :
(XᵀX)ᵀ = Xᵀ(Xᵀ)ᵀ = XᵀX
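A quick check of this symmetry on a toy design matrix (values assumed from the earlier example):

    import numpy as np

    # Toy design matrix with an intercept column (values assumed from the example).
    X = np.array([[1, 8, 80],
                  [1, 7, 70],
                  [1, 5, 120]], dtype=float)

    A = X.T @ X
    print(np.allclose(A, A.T))           # True: X^T X equals its own transpose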
Solving for β
Set the derivative to zero:
0 = -2yᵀX + 2βᵀXᵀX
Divide both sides by 2:
βᵀXᵀX = yᵀX
Post-multiply both sides by (XᵀX)⁻¹ (assuming XᵀX is invertible):
βᵀ = yᵀX(XᵀX)⁻¹
Now transpose both sides:
(βᵀ)ᵀ = [yᵀX(XᵀX)⁻¹]ᵀ
Apply transpose rule:
β = ((XᵀX)⁻¹)ᵀ Xᵀy
Since (XᵀX)⁻¹ is symmetric, its transpose equals itself (proved below), so:
β = (XᵀX)⁻¹ Xᵀy
Shape Analysis of Matrices in OLS
Shape of β: (m+1)×1
Understanding Shape of Each Term:
Xᵀ is of shape (m+1)×n
X is of shape n×(m+1)
so, XᵀX is of shape (m+1)×(m+1)
Therefore (XᵀX)⁻¹ is also of shape (m+1)×(m+1)
Now:
Xᵀy is of shape (m+1)×1
so, (XᵀX)⁻¹Xᵀy is of shape (m+1)×1
Thus,
β = (XᵀX)⁻¹ Xᵀy
remains consistent in shape, confirming it's valid.
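The same shape bookkeeping can be checked directly; the sketch below assumes n = 5 samples and m = 2 features with random values, so m + 1 = 3:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 5, 2                                                 # arbitrary sizes
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])   # n x (m+1) design matrix
    y = rng.normal(size=(n, 1))                                 # n x 1 target vector

    print(X.T.shape)                                  # (3, 5)  -> (m+1) x n
    print((X.T @ X).shape)                            # (3, 3)  -> (m+1) x (m+1)
    print((np.linalg.inv(X.T @ X) @ X.T @ y).shape)   # (3, 1)  -> (m+1) x 1, the shape of beta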
Proof: (XᵀX)⁻¹ is Symmetric
Let:
A = XᵀX (a square matrix)
we need to prove :
(A⁻¹)ᵀ = A⁻¹
This would imply :
[(XᵀX)⁻¹]ᵀ = (XᵀX)⁻¹
Proof:
given:
AA⁻¹ = I ⇒ (AA⁻¹)ᵀ = Iᵀ = I
(A⁻¹)ᵀAᵀ = I
since A = Aᵀ (because XᵀX is symmetric), this becomes:
(A⁻¹)ᵀA = I
Now post-multiply both sides by A⁻¹:
(A⁻¹)ᵀAA⁻¹ = IA⁻¹ ⇒ (A⁻¹)ᵀ = A⁻¹
Hence proved: (XᵀX)⁻¹ is symmetric.
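The corresponding numeric check, again on the assumed toy design matrix:

    import numpy as np

    # Same toy design matrix as before (assumed values).
    X = np.array([[1, 8, 80],
                  [1, 7, 70],
                  [1, 5, 120]], dtype=float)

    A_inv = np.linalg.inv(X.T @ X)
    print(np.allclose(A_inv, A_inv.T))   # True: (X^T X)^-1 is symmetric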
Final Normal Equation (OLS Formula)
β = (XᵀX)⁻¹ Xᵀy
This is the Ordinary Least Squares (OLS) solution for multiple linear regression.
It minimizes the sum of squared errors between actual and predicted values.
β contains the intercept and coefficients.
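To close the loop, here is a minimal sketch that computes β with the normal equation on the assumed toy data and cross-checks it against scikit-learn. Note that for larger or ill-conditioned problems, np.linalg.lstsq (or np.linalg.solve) is numerically preferable to forming the explicit inverse:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Assumed toy data: (CGPA, IQ) -> Placement.
    X = np.array([[8, 80],
                  [9, 90],
                  [5, 120]], dtype=float)
    y = np.array([8.0, 9.0, 15.0])

    # Design matrix with an intercept column of 1s.
    X_design = np.hstack([np.ones((X.shape[0], 1)), X])

    # Normal equation: beta = (X^T X)^-1 X^T y
    beta = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y
    print(beta)                              # [beta_0, beta_1, beta_2]

    # Cross-check against scikit-learn's least-squares fit.
    model = LinearRegression().fit(X, y)
    print(model.intercept_, model.coef_)     # should match beta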