
Simple Linear Regression
- Aryan

- Dec 28, 2024
Updated: Jul 12
Linear Regression
Linear regression is a supervised machine-learning algorithm that models the relationship between one or more independent variables and a continuous dependent variable. It's one of the first tools every data scientist masters because the core idea, fitting a straight line that captures the trend, is both intuitive and mathematically elegant.
Simple Linear Regression
Simple linear regression is the most basic form of linear regression. It involves one input feature (also known as an independent variable) and one output feature (or dependent variable).
Example: College Student Data
Let's consider a dataset of college students where we aim to predict each student's package (the output column) from their CGPA (the input column).
The fundamental equation for simple linear regression is:
y = β₀ + β₁x
where:
y is the output (dependent variable)
x is the input (independent variable)
β₀ is the intercept (the value of y when x is 0)
β₁ is the slope (the change in y for a one-unit change in x)
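For intuition, suppose (purely hypothetically) that β₀ = -1.0 and β₁ = 0.55. A student with a CGPA of 8.0 would then be predicted to get a package of y = -1.0 + 0.55 × 8.0 = 3.4, in whatever unit the package is measured (e.g. lakhs per annum).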
Let's consider a more extensive hypothetical dataset of college students:
The first step in analyzing such data is to visualize it by plotting the data points.

As observed from the plot, the data exhibits a generally linear trend, but it's not perfectly linear. Real-world data is inherently complex and often influenced by various unobserved factors, leading to what is known as stochastic error or random variability. These errors can be analyzed and modeled mathematically.
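As a quick illustration of this first step, here is a minimal Python/matplotlib sketch using a small made-up CGPA/Package sample (these numbers are mine, for illustration only, not the dataset discussed above):

import matplotlib.pyplot as plt

# Hypothetical CGPA (input) and Package in LPA (output) values, for illustration only
cgpa    = [6.8, 7.1, 7.5, 7.9, 8.2, 8.6, 9.0, 9.3]
package = [2.9, 3.1, 3.4, 3.6, 4.0, 4.3, 4.8, 5.0]

# Scatter plot to eyeball whether the relationship looks roughly linear
plt.scatter(cgpa, package)
plt.xlabel("CGPA")
plt.ylabel("Package (LPA)")
plt.title("CGPA vs Package")
plt.show()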
The Equation of a Line and Best Fit Line
The general equation for a straight line is:
y = mx + b
where:
y is the dependent variable (response)
x is the independent variable (predictor)
m represents the slope of the line
b represents the y-intercept
In linear regression, the best-fit line (also known as the regression line) is the straight line that best captures the linear relationship between the independent variable (x) and the dependent variable (y) in the dataset.
The primary objective of linear regression is to find the optimal parameters (β₀ and β₁ in our previous equation, or b and m in the general line equation) for this line. These parameters are chosen to minimize the difference (or error) between the predicted y values (from the line) and the actual observed y values in the dataset.
How to Find m and b (or β₀ and β₁)?
There are generally two main approaches to determine the optimal values for the slope (m or β₁) and intercept (b or β₀) in linear regression: Closed-Form Solutions and Iterative (Non-Closed-Form) Solutions.
Closed-Form Solution: This approach involves directly calculating the values of m and b using a predefined mathematical formula. The most common closed-form method for simple linear regression is the Ordinary Least Squares (OLS) method.
Iterative (Non-Closed-Form) Solution: This approach uses approximation techniques, starting with initial guesses for m and b and iteratively refining them to minimize the error. The most prominent iterative technique is Gradient Descent.
When to Use Which?
Ordinary Least Squares (OLS): OLS is straightforward and computationally efficient for datasets with a relatively small number of features (low-dimensional data). It provides an exact solution.
Gradient Descent: When dealing with high-dimensional data (a large number of input features), finding a closed-form solution can become computationally very complex or even infeasible. In such scenarios, gradient descent becomes the preferred method, as it can efficiently find approximate solutions by iteratively moving towards the minimum error.
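As a rough sketch of the iterative approach, the following NumPy snippet runs plain batch gradient descent on the mean squared error for simple linear regression (the data, learning rate, and iteration count are all hypothetical choices for illustration):

import numpy as np

# Hypothetical CGPA (input) and Package (output) values, for illustration only
x = np.array([6.8, 7.1, 7.5, 7.9, 8.2, 8.6, 9.0, 9.3])
y = np.array([2.9, 3.1, 3.4, 3.6, 4.0, 4.3, 4.8, 5.0])

m, b = 0.0, 0.0      # initial guesses for slope and intercept
lr = 0.01            # learning rate (assumed; needs tuning in practice)
n = len(x)

# Repeatedly nudge m and b in the direction that reduces the mean squared error.
# Many iterations are used because x is unscaled; standardizing x would speed this up.
for _ in range(100_000):
    y_pred = m * x + b
    dm = (-2 / n) * np.sum(x * (y - y_pred))   # partial derivative of MSE w.r.t. m
    db = (-2 / n) * np.sum(y - y_pred)         # partial derivative of MSE w.r.t. b
    m -= lr * dm
    b -= lr * db

print("slope m ≈", round(m, 4), "intercept b ≈", round(b, 4))

On this toy data the result should land very close to what the closed-form OLS formulas give.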
Implementation in Scikit-learn:
The LinearRegression class in Scikit-learn, a popular Python machine learning library, primarily uses the Ordinary Least Squares (OLS) method to determine the regression coefficients.
The SGDRegressor (Stochastic Gradient Descent Regressor) in Scikit-learn, on the other hand, utilizes Gradient Descent to find the values of m and b. This is particularly useful for large datasets where OLS might be too slow.
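As a minimal sketch (hypothetical data; the hyperparameters shown are arbitrary choices, not recommendations), both estimators can be fit like this:

import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical CGPA (input) and Package (output) data; X must be 2-D for scikit-learn
X = np.array([[6.8], [7.1], [7.5], [7.9], [8.2], [8.6], [9.0], [9.3]])
y = np.array([2.9, 3.1, 3.4, 3.6, 4.0, 4.3, 4.8, 5.0])

# OLS: exact closed-form fit
ols = LinearRegression().fit(X, y)
print("OLS slope:", ols.coef_[0], "intercept:", ols.intercept_)

# Gradient-descent-based fit; scaling the feature helps SGD converge
sgd = make_pipeline(StandardScaler(), SGDRegressor(max_iter=10000, tol=1e-6, random_state=0))
sgd.fit(X, y)
print("SGD coef (on the scaled feature):", sgd.named_steps["sgdregressor"].coef_[0])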
Formulas to Find the Values of m and b (OLS Method)
For the Ordinary Least Squares (OLS) method, the formulas to calculate the slope (m) and intercept (b) are derived to minimize the sum of squared errors.
The formula for the slope (m) is:
m = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
And the formula for the intercept (b) is:
b = ȳ - mx̄
Where:
n is the total number of data points.
xᵢ represents the individual CGPA values (input).
yᵢ represents the individual Package values (output).
x̄ is the mean (average) of the CGPA values.
ȳ is the mean (average) of the Package values.
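Translating these formulas directly into NumPy gives a tiny closed-form implementation (the data below is hypothetical and only illustrates the computation):

import numpy as np

# Hypothetical CGPA / Package values used only to demonstrate the formulas
x = np.array([6.8, 7.1, 7.5, 7.9, 8.2, 8.6, 9.0, 9.3])
y = np.array([2.9, 3.1, 3.4, 3.6, 4.0, 4.3, 4.8, 5.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: m = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: b = ȳ - m·x̄
b = y_bar - m * x_bar

print(f"m = {m:.4f}, b = {b:.4f}")

For real work you would of course rely on a tested implementation such as scikit-learn's LinearRegression, but the formulas themselves are this simple.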
Error (Loss Function)

In linear regression, we define the error as the difference between the actual observed value and the value predicted by our regression line. For each data point i, this error (also called the residual) is denoted dᵢ.
dᵢ = (yᵢ - ŷᵢ)
Where:
yᵢ is the actual observed value for the i-th data point.
ŷᵢ is the predicted value for the i-th data point from the regression line.
The total error, or the loss function (often denoted as E), is a measure of how well our line fits the data. Ideally, we want this error to be as small as possible.
Initially, one might think of summing the individual errors:
E = d₁ + d₂ + d₃ + … + dₙ
However, simply summing these errors can be problematic because positive and negative errors (points above and below the line) would cancel each other out, potentially leading to a misleadingly small total error even if the line fits poorly.
To address this, and to emphasize larger errors, we square each individual error and then sum them. This is known as the Sum of Squared Errors (SSE) or Residual Sum of Squares (RSS):
E = d₁² + d₂² + d₃² + … + dₙ²
Or, more compactly:
E = Σ dᵢ² = Σ (yᵢ - ŷᵢ)², where the sum runs over all n data points.
This Sum of Squared Errors (SSE) is our loss function.
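A tiny sketch with made-up residuals makes the cancellation problem concrete and shows how squaring fixes it:

import numpy as np

# Hypothetical residuals dᵢ = yᵢ - ŷᵢ for a poorly fitting line
d = np.array([2.0, -2.0, 3.0, -3.0])

print("Plain sum of errors:", d.sum())         # 0.0 — looks perfect, but isn't
print("Sum of squared errors:", np.sum(d**2))  # 26.0 — reveals the poor fit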
Why Squared Errors?
Avoids Cancellation: Squaring ensures that all errors contribute positively to the total sum, preventing positive and negative deviations from canceling out.
Penalizes Large Errors More: Squaring disproportionately penalizes larger errors. For instance, an error of 2 contributes 4 to the sum, while an error of 4 contributes 16. This pushes the model to work harder at reducing large errors; the flip side is that squared error is more sensitive to outliers than absolute error would be.
Differentiability: The squared error function is continuous and differentiable, which is crucial for using calculus-based optimization methods (like those used to derive the OLS formulas or in gradient descent) to find the minimum error. If we used absolute values (∣dᵢ∣), the function would not be differentiable at zero, making it harder to find the minimum mathematically.
The core objective of simple linear regression (and more broadly, Ordinary Least Squares) is to find the specific values of m and b that minimize this Sum of Squared Errors (E).
Deriving the Formulas for m and b (Minimizing the Loss Function)
Our goal in simple linear regression is to find the values of m (slope) and b (y-intercept) that minimize the Sum of Squared Errors (SSE), our loss function, E.
The predicted value (ŷᵢ) for any given xᵢ is given by:
ŷᵢ = mxᵢ + b
Therefore, our loss function, E, can be expressed as:
E = Σ (yᵢ - (mxᵢ + b))², with the sum taken over all n data points.
To find the minimum value of E with respect to m and b, we use multivariable calculus. We take the partial derivative of E with respect to b and set it to zero, and similarly for m. This is because, at the minimum point of a function, its slope (derivative) is zero.
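Before working through the algebra by hand, here is a small SymPy sketch (my own illustrative dataset and symbols, not from the derivation itself) that performs exactly this step, setting ∂E/∂m = 0 and ∂E/∂b = 0 and solving:

import sympy as sp

m, b = sp.symbols('m b')

# Tiny hypothetical dataset
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

# Sum of squared errors E(m, b) = Σ (yᵢ - (m·xᵢ + b))²
E = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(xs, ys))

# Set both partial derivatives to zero and solve the resulting system
solution = sp.solve([sp.diff(E, m), sp.diff(E, b)], [m, b])
print(solution)

For this toy data the solver returns m = 7/5 and b = 1/2, exactly what the closed-form formulas give.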
Derivation of b:
Setting the partial derivative of E with respect to b equal to zero:
∂E/∂b = -2 Σ (yᵢ - mxᵢ - b) = 0
⟹ Σ yᵢ - m Σ xᵢ - nb = 0
Dividing through by n gives ȳ - mx̄ - b = 0, so:
b = ȳ - mx̄
Derivation of m:
Setting the partial derivative of E with respect to m equal to zero:
∂E/∂m = -2 Σ xᵢ(yᵢ - mxᵢ - b) = 0
Substituting b = ȳ - mx̄ and rearranging:
Σ xᵢ(yᵢ - ȳ) - m Σ xᵢ(xᵢ - x̄) = 0
Using the identities Σ xᵢ(yᵢ - ȳ) = Σ (xᵢ - x̄)(yᵢ - ȳ) and Σ xᵢ(xᵢ - x̄) = Σ (xᵢ - x̄)², this simplifies to:
m = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
These are exactly the OLS formulas stated earlier.
