
Assumptions of Linear Regression
- Aryan

- Jan 25
- 14 min read
Linear regression is based on several core assumptions that ensure the validity and reliability of its estimates and inferences. These key assumptions are:
Linearity
Normality of Residuals
Homoscedasticity
No Autocorrelation
No or Little Multicollinearity
Linearity
The Assumption:
There exists a linear relationship between the independent variable(s) and the dependent variable. The model assumes that a change in the independent variable leads to a consistent additive effect on the dependent variable.
Note: This does not require the relationship to be proportional (i.e., passing through the origin).
What happens when this assumption is violated?
Biased parameter estimates: If the true relationship is not linear, the estimated regression coefficients may become biased, leading to incorrect conclusions about the effects of the independent variables.
Reduced predictive accuracy: A model that incorrectly assumes linearity may fail to capture the true pattern in the data, resulting in poor predictive performance.
Invalid hypothesis tests and confidence intervals: Violating the linearity assumption undermines the validity of statistical tests and confidence intervals, potentially leading to false conclusions about the significance and strength of predictors.
How to check this assumption
Scatter plots: Plot the dependent variable against each independent variable. If the relationship appears linear, the assumption is likely satisfied. Nonlinear patterns or trends may indicate a violation.
Residual plots: Plot residuals (differences between observed and predicted values) against predicted values or independent variables. If linearity holds, residuals should be randomly scattered around zero with no visible pattern. Trends, curves, or heteroscedasticity suggest violation.
Polynomial terms: Introduce polynomial terms (e.g., quadratic or cubic) and compare model fit with the original linear model. A significant improvement in fit may indicate nonlinearity; a short sketch of these checks follows.
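As a quick illustration, here is a minimal sketch of the residual-plot and polynomial-term checks, using synthetic data and the Python statsmodels library (the variable names x and y are placeholders for your own data):

```python
# Sketch: checking linearity with a residual plot and an added quadratic term
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 0.5 * x**2 + rng.normal(0, 2, 200)   # true relationship is quadratic

# Fit a plain linear model and inspect residuals vs. fitted values
X_lin = sm.add_constant(x)
lin_fit = sm.OLS(y, X_lin).fit()
plt.scatter(lin_fit.fittedvalues, lin_fit.resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()   # a curved band here suggests the linearity assumption is violated

# Compare fit after adding a quadratic term
X_quad = sm.add_constant(np.column_stack([x, x**2]))
quad_fit = sm.OLS(y, X_quad).fit()
print("R² linear:   ", round(lin_fit.rsquared, 3))
print("R² quadratic:", round(quad_fit.rsquared, 3))  # a large jump hints at nonlinearity
```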
What to do when the assumption is violated
Transformations: Apply transformations (e.g., logarithmic, square root, or inverse) to the dependent or independent variables to linearize the relationship.
Polynomial regression: Include polynomial terms (e.g., squared or cubed predictors) to model nonlinear relationships.
Piecewise regression: Divide the independent variable’s range into segments and fit separate linear models to each.
Non-parametric or semi-parametric methods: Use models that do not assume linearity, such as Generalized Additive Models (GAMs), splines, or kernel regression.
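For example, the transformation remedy above can be sketched as follows: a multiplicative relationship (synthetic here, y = 3·x^1.5 with noise) becomes linear once both variables are logged.

```python
# Sketch: linearizing a multiplicative relationship with a log transform
# (synthetic data; assumes y > 0 so the log is defined)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = 3 * x ** 1.5 * np.exp(rng.normal(0, 0.1, 200))

# log(y) = log(3) + 1.5 * log(x) + error, which is linear in log(x)
fit = sm.OLS(np.log(y), sm.add_constant(np.log(x))).fit()
print(fit.params)   # intercept close to log(3) ~ 1.10, slope close to 1.5
```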
Normality of Residuals
The Assumption:
The error terms (residuals) are assumed to follow a normal distribution with a mean of zero and constant variance.
What happens when this assumption is violated?
Inaccurate hypothesis tests: The t-tests and F-tests used to assess the significance of regression coefficients and the overall model rely on normality. Non-normal residuals can lead to incorrect inferences about predictor significance.
Invalid confidence intervals: Confidence intervals for regression coefficients assume normality. Violations may distort interpretations of effect size and estimate precision.
Model performance: While normality is not essential for predictive accuracy, substantial non-normality may indicate poor model fit, presence of outliers, or omitted variables. However, in large samples, models may still perform well even if residuals deviate from normality.
How to check this assumption
Histogram of residuals: Plot residuals to visually assess their distribution. A bell-shaped curve suggests normality.
Q-Q plot: Compare residual quantiles to a standard normal distribution. Systematic deviation from a straight line signals non-normality.
Statistical tests: Use formal tests such as the Omnibus test, Jarque-Bera test, or Shapiro-Wilk test to evaluate normality (a sketch of these checks appears below).
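A minimal sketch of the visual and formal checks, fitted on synthetic data (in practice, substitute the residuals of your own model):

```python
# Sketch: visual and formal normality checks on OLS residuals
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 1 + 2 * x + rng.normal(size=300)
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

# Histogram and Q-Q plot
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(resid, bins=30)
axes[0].set_title("Histogram of residuals")
stats.probplot(resid, dist="norm", plot=axes[1])  # points near the line => roughly normal
plt.show()

# Formal tests: D'Agostino-Pearson ("Omnibus"), Jarque-Bera, Shapiro-Wilk
print("Omnibus:     ", stats.normaltest(resid))
print("Jarque-Bera: ", stats.jarque_bera(resid))
print("Shapiro-Wilk:", stats.shapiro(resid))
```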
What to do when the assumption fails
Model selection techniques: Use cross-validation, AIC, or BIC to identify models better suited for non-normal data.
Robust regression: Apply methods like M-estimation, Least Median of Squares (LMS), or Least Trimmed Squares (LTS) that are less sensitive to outliers and deviations from normality. (Transformations may also help.)
Non-parametric/semi-parametric methods: Use approaches like GAMs, splines, or kernel regression that do not assume normality.
Bootstrapping: Generate confidence intervals and perform hypothesis testing through resampling, bypassing the need for normality (a sketch follows the note below).
Note: In large samples, the Central Limit Theorem reduces the importance of normality for inference.
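The bootstrapping option can be sketched as follows: a pairs (case-resampling) bootstrap confidence interval for a slope, computed on synthetic data with deliberately heavy-tailed errors, avoids relying on normal-theory standard errors.

```python
# Sketch: pairs bootstrap confidence interval for a regression slope
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 5, n)
y = 1 + 2 * x + rng.standard_t(df=3, size=n)   # heavy-tailed (non-normal) errors

boot_slopes = []
for _ in range(2000):
    idx = rng.integers(0, n, n)                # resample rows with replacement
    fit = sm.OLS(y[idx], sm.add_constant(x[idx])).fit()
    boot_slopes.append(fit.params[1])

lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"95% bootstrap CI for the slope: ({lo:.3f}, {hi:.3f})")
```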
Omnibus Test
A statistical test used to evaluate whether residuals are normally distributed, based on skewness and excess kurtosis.
Steps to Conduct the Omnibus Test:
Hypotheses:
• Null (H₀): Residuals are normally distributed.
• Alternative (H₁): Residuals are not normally distributed.
Fit the model: Run a linear regression and obtain predicted values.
Calculate residuals:
Residual = Observed value − Predicted value
Skewness: Measures the asymmetry of the distribution.
For normality, skewness ≈ 0.
Excess Kurtosis: Measures the "tailedness" of the distribution (i.e., Kurtosis − 3).
For normality, excess kurtosis ≈ 0.
Test Statistic:
K² = n [ S²/6 + (K − 3)²/24 ], where S is the skewness and K is the kurtosis of the residuals.
(Note: this simplified form is identical to the Jarque-Bera statistic; the D'Agostino-Pearson K² reported as "Omnibus" in most regression software first applies normalizing transformations to the skewness and kurtosis, but it is interpreted the same way.)
p-value:
Compare K² to a chi-square distribution with 2 degrees of freedom.
• If p > α: Fail to reject H₀ (residuals are normal).
• If p ≤ α: Reject H₀ (residuals are non-normal).
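A sketch of these steps on synthetic residuals, computing the simplified statistic by hand and comparing it with scipy's built-in tests (jarque_bera matches the formula above; normaltest is the D'Agostino-Pearson K²):

```python
# Sketch: hand-rolled omnibus/Jarque-Bera style statistic vs. scipy's tests
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
resid = rng.normal(size=500)             # stand-in for model residuals

n = len(resid)
S = stats.skew(resid)                    # skewness
K = stats.kurtosis(resid, fisher=False)  # raw kurtosis (normal => 3)

K2 = n * (S**2 / 6 + (K - 3)**2 / 24)    # the statistic shown above
p = stats.chi2.sf(K2, df=2)              # chi-square with 2 degrees of freedom
print(f"hand-rolled statistic = {K2:.3f}, p = {p:.3f}")

print(stats.jarque_bera(resid))          # same formula as above
print(stats.normaltest(resid))           # D'Agostino-Pearson K² ("Omnibus")
```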
Homoscedasticity
The Assumption:
The variance of the error terms (residuals) must remain constant across all levels of the independent variables. If this spread changes systematically, the efficiency of parameter estimates is compromised.
What happens when this assumption is violated?
Inefficient estimates: Coefficients remain unbiased but are no longer the Best Linear Unbiased Estimates (BLUE). The usual standard-error formulas become biased (too large or too small), making hypothesis tests unreliable and reducing their power.
Inaccurate hypothesis tests: t-tests and F-tests for regression coefficients and overall model significance assume constant variance. Heteroscedastic residuals may yield misleading statistical inferences.
Invalid confidence intervals: Confidence intervals for coefficients rely on the assumption of homoscedasticity. Violations can distort effect size interpretations and the precision of estimates.
How to check this assumption
• Residual plots: Plot residuals against predicted values or independent variables.
Homoscedasticity: Points are randomly scattered around zero with no visible pattern.
Heteroscedasticity: Presence of systematic patterns (e.g., funnel shape, curvature).
• Breusch-Pagan test: A formal statistical test for heteroscedasticity.
Null hypothesis (H₀): Error variance is constant (homoscedasticity).
Reject H₀ if p-value < significance level (e.g., 0.05).
What to do when the assumption is violated
Transformations: Apply transformations (e.g., logarithmic, square root, inverse) to the dependent or independent variables to stabilize variance.
Weighted Least Squares (WLS): Assign weights to observations inversely proportional to the estimated variance (often based on residuals), improving estimation accuracy.
Robust standard errors: Use heteroscedasticity-consistent standard errors (e.g., Huber-White) for valid hypothesis tests and confidence intervals.
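The last two remedies can be sketched as follows, on synthetic data whose error spread grows with x: a refit with heteroscedasticity-consistent (HC3) standard errors, and a weighted least squares fit with weights set to the inverse of the assumed error variance.

```python
# Sketch: robust (HC3) standard errors and weighted least squares
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 300)
y = 1 + 2 * x + rng.normal(0, x, 300)       # error sd grows with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
robust = sm.OLS(y, X).fit(cov_type="HC3")   # same coefficients, corrected SEs
wls = sm.WLS(y, X, weights=1 / x**2).fit()  # weights = 1/variance; here sd = x, so var = x**2

print("OLS SEs:   ", ols.bse)
print("Robust SEs:", robust.bse)
print("WLS coefs: ", wls.params)
```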
Breusch-Pagan Test
A statistical test (also known as the Cook-Weisberg test) used to detect heteroscedasticity, assuming that error variance is a linear function of the independent variables.
Steps to Perform the Test:
Estimate the OLS model: Fit a linear regression and obtain residuals.
Square the residuals: Calculate e²ᵢ .
Regress e²ᵢ on predictors: Perform OLS regression of squared residuals on the original independent variables. Record the R² from this regression.
Compute test statistic:
LM = n × R²
where n = number of observations.
Determine the p-value:
The LM statistic follows a χ²-distribution with k degrees of freedom (where k = number of predictors).
Conclusion:
p ≤ α → Reject H₀ → Evidence of heteroscedasticity.
p > α → Fail to reject H₀ → Homoscedasticity is plausible.
Note: The test assumes a linear relationship between predictors and error variance. For nonlinear relationships, alternative tests like the White test are recommended.
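A sketch of these steps on synthetic heteroscedastic data, both by hand (LM = n × R² from the auxiliary regression) and with statsmodels' het_breuschpagan, which returns the LM statistic, its p-value, and an F-test variant:

```python
# Sketch: Breusch-Pagan test, by hand and via statsmodels
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, 300)
y = 1 + 2 * x + rng.normal(0, x, 300)       # heteroscedastic errors
X = sm.add_constant(x)

resid = sm.OLS(y, X).fit().resid

# Auxiliary regression of squared residuals on the predictors
aux = sm.OLS(resid**2, X).fit()
LM = len(y) * aux.rsquared
p = stats.chi2.sf(LM, df=1)                 # k = 1 predictor (excluding the constant)
print(f"LM = {LM:.2f}, p = {p:.4f}")

# Built-in version: (LM, LM p-value, F statistic, F p-value)
print(het_breuschpagan(resid, X))
```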
No Autocorrelation
The Assumption:
There should be no correlation or systematic pattern in the residuals. In other words, the error terms must be independent of one another.
What happens when this assumption is violated?
Inefficient estimates: While regression coefficients remain unbiased, they are no longer the Best Linear Unbiased Estimates (BLUE). Standard errors become unreliable, reducing the power of hypothesis tests.
Inaccurate hypothesis tests: t-tests and F-tests for evaluating coefficients and overall model significance assume uncorrelated residuals. Autocorrelation can lead to misleading conclusions about predictor significance.
Invalid confidence intervals: Confidence intervals assume residuals are independent. Autocorrelation can distort both the size and interpretation of these intervals.
How to check this assumption
• Durbin-Watson test:
Assesses first-order autocorrelation in residuals.
Statistic interpretation:
0 to <2: Positive autocorrelation (closer to 0 = stronger correlation)
≈ 2: No autocorrelation
>2 to 4: Negative autocorrelation (closer to 4 = stronger correlation)
Limitation: Only detects first-order autocorrelation. It is not sensitive to higher-order correlations.
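A minimal sketch of the Durbin-Watson check on synthetic data with AR(1) errors (ρ = 0.8), where the statistic falls well below 2:

```python
# Sketch: Durbin-Watson statistic on residuals with positive autocorrelation
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
n = 300
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):                       # AR(1) errors: e_t = 0.8 * e_{t-1} + noise
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 1 + 2 * x + e

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
print(durbin_watson(resid))   # well below 2 here, flagging positive autocorrelation
```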
What to do when the assumption fails
Lagged variables: Include lagged values of the dependent or independent variables to account for time-based dependencies.
Differencing: Transform variables by taking the difference between consecutive observations (e.g., Yₜ − Yₜ₋₁ ) to remove trends and correlation.
Generalized Least Squares (GLS): Use GLS to explicitly model and correct for the autocorrelation in the error terms.
Time series models: For time-dependent data, switch to specialized models like AR, MA, ARIMA, or STL that are built to handle autocorrelation.
Robust standard errors: Use Newey-West or HAC (Heteroscedasticity and Autocorrelation Consistent) standard errors for valid inference in the presence of autocorrelation.
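As an example of the last remedy, the sketch below refits the autocorrelated model from the previous sketch and requests Newey-West (HAC) standard errors; maxlags=5 is an arbitrary illustrative choice, in practice tuned to the sample size and the persistence of the errors.

```python
# Sketch: Newey-West (HAC) standard errors; coefficients are unchanged,
# only the covariance estimate is corrected
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 300
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 1 + 2 * x + e
X = sm.add_constant(x)

plain = sm.OLS(y, X).fit()
hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print("naive SEs:", plain.bse)
print("HAC SEs:  ", hac.bse)     # typically larger under positive autocorrelation
```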
Multicollinearity
Multicollinearity is a statistical phenomenon that occurs when two or more independent variables in a multiple regression model are highly correlated. In other words, these variables exhibit a strong linear relationship, making it difficult to isolate the individual effects of each variable on the dependent variable.
Whether multicollinearity is a problem depends on whether the model is being used for inference or for prediction, so it helps to distinguish the two goals.
Inference:
(i) Inference focuses on understanding the relationships between the variables in a model. It aims to draw conclusions about the underlying population or process that generated the data.
(ii) Inference often involves hypothesis testing, confidence intervals, and determining the significance of predictor variables.
(iii) The primary goal is to provide insights about the structure of the data and the relationships between variables.
(iv) Interpretability is a key concern when performing inference, as the objective is to understand the underlying mechanisms driving the data.
(v) Examples of inferential techniques include linear regression, logistic regression, and ANOVA.
Prediction:
(i) Prediction focuses on using a model to make accurate forecasts or estimates for new, unseen data.
(ii) It aims to generalize the model to new instances, based on the patterns observed in the training data.
(iii) Prediction often involves minimizing an error metric, such as mean squared error or cross-entropy loss, to assess the accuracy of the model.
(iv) The primary goal is to create an accurate and reliable model for predicting outcomes, rather than understanding the relationships between variables.
(v) Interpretability may be less important in predictive modelling, as the main objective is to create accurate forecasts rather than understanding the underlying structure of the data.
(vi) Examples of predictive techniques include decision trees, support vector machines, neural networks, and ensemble methods like random forests and gradient boosting machines.
In summary, inference focuses on understanding the relationships between variables and interpreting the underlying structure of the data, while prediction focuses on creating accurate forecasts for new, unseen data based on the patterns observed in the training data.
When is Multicollinearity Bad?
Effects of Multicollinearity on Inference
1. Unreliable Coefficients:
When predictors are highly correlated, the regression model cannot clearly distinguish the effect of one variable from another.
It becomes hard to determine how much each predictor individually contributes to the dependent variable.
2. Misleading Statistical Significance:
The standard errors of the coefficients increase due to multicollinearity, making it harder to determine if a variable is statistically significant.
Important predictors may appear insignificant because their effects are "shared" with other correlated variables.
3. Difficult Interpretation:
Coefficients lose their interpretability. Interpreting the change in the dependent variable for a one-unit increase in a predictor while holding others constant becomes invalid when predictors are highly correlated.
4. Inference Fails in Decision-Making:
In contexts like policymaking or scientific research, where understanding the independent effects of variables is crucial, multicollinearity undermines confidence in the model.
Effects of Multicollinearity on Prediction
1. Minimal Impact on Predictive Accuracy:
Multicollinearity has limited impact on the model’s ability to predict outcomes. The model can combine information from correlated variables to produce accurate predictions for the dependent variable.
As long as the relationship between predictors and the target variable remains stable, predictive accuracy is not significantly affected.
2. Model Sensitivity:
A model with multicollinearity becomes sensitive to small changes in the data:
Slight variations in correlations among predictors can lead to large fluctuations in predictions.
3. Risk of Overfitting:
Multicollinearity increases the complexity of the model, leading to overfitting.
Overfitting occurs when the model performs well on training data but fails to generalize to new data.
4. Redundancy in Predictors:
Highly correlated predictors provide overlapping information.
While this redundancy doesn’t directly harm predictions, it complicates feature selection as it becomes unclear which predictors are most informative.
Key Takeaways: Inference vs. Prediction under Multicollinearity
For Inference: Multicollinearity is a major issue because it undermines the reliability and interpretability of the model's coefficients. It becomes difficult to understand how individual predictors affect the dependent variable.
For Prediction: Multicollinearity is less critical. As long as the relationship between the predictors and the target variable is stable, predictions will remain accurate. However, the model may become sensitive to data changes and prone to overfitting.
What exactly happens in Multicollinearity (Mathematically)?
When multicollinearity is present in a model, it can lead to several issues, including:
(i) Difficulty in identifying the most important predictors: Due to the high correlation between independent variables, it becomes challenging to determine which variable has the most significant impact on the dependent variable.
(ii) Inflated standard errors: Multicollinearity can lead to larger standard errors for the regression coefficients, which decreases the statistical power and can make it challenging to determine the true relationship between the independent and dependent variables.
(iii) Unstable and unreliable estimates: The regression coefficients become sensitive to small changes in the data, making it difficult to interpret the results accurately.
Perfect Multicollinearity
Perfect multicollinearity occurs when one independent variable in a multiple regression model is an exact linear combination of one or more other independent variables. In other words, there is an exact linear relationship between the independent variables, making it impossible to uniquely estimate the individual effects of each variable on the dependent variable.
1. Singular X'X Matrix
Regression coefficients are estimated using the equation:
β = (X'X)⁻¹X'y
Here, X is the matrix of predictors, y is the response variable, and X′X is the Gram matrix.
If there is perfect multicollinearity, X′X becomes singular (non-invertible) because the columns of X are linearly dependent. This prevents (X'X)⁻¹ from being computed, and the regression coefficients (β) cannot be uniquely estimated.
2. Indeterminate Coefficients
When perfect multicollinearity exists, there are infinitely many combinations of coefficients that provide the same fit to the data. The regression algorithm cannot determine a unique solution.
3. Impacts on Interpretation
Perfect multicollinearity means that the effect of one predictor cannot be separated from the effect of others. This makes it impossible to interpret the individual β coefficients as the unique contribution of each predictor.
4. Detection of Perfect Multicollinearity
Rank deficiency: The matrix X will not have full rank.
Variance Inflation Factor (VIF): A very high VIF (approaching infinity) indicates perfect multicollinearity.
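A small numerical sketch of perfect multicollinearity: with one column an exact multiple of another, X loses rank, the condition number blows up, and least squares can only return one of infinitely many equivalent coefficient vectors.

```python
# Sketch: perfect multicollinearity makes X'X (numerically) singular
import numpy as np

rng = np.random.default_rng(9)
n = 100
x1 = rng.normal(size=n)
x2 = 2 * x1                                   # exact linear combination of x1
X = np.column_stack([np.ones(n), x1, x2])

print("rank of X:", np.linalg.matrix_rank(X))   # 2 instead of 3 => rank deficient
print("condition number:", np.linalg.cond(X))   # astronomically large (or inf)

# lstsq still returns *a* solution, but it is only one of infinitely many
# coefficient vectors that give the same fitted values
y = 1 + 3 * x1 + rng.normal(size=n)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("one of many possible coefficient vectors:", beta)
```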
Types of Multicollinearity
Structural multicollinearity: Structural multicollinearity arises due to the way in which the variables are defined or the model is constructed. It occurs when one independent variable is created as a linear combination of other independent variables or when the model includes interaction terms or higher-order terms (such as polynomial terms) without proper scaling or centering.
Data-driven multicollinearity: Data-driven multicollinearity occurs when the independent variables in the dataset are highly correlated due to the specific data being analysed. In this case, the high correlation between the variables is not a result of the way the variables are defined or the model is constructed but rather due to the observed data patterns.
How to detect Multicollinearity
Correlation is a measure of the linear relationship between two variables, and it is commonly used to identify multicollinearity in multiple linear regression models. To detect multicollinearity using correlation, calculate the correlation matrix of the predictor variables: a square matrix showing the pairwise correlations between each pair of predictors. The diagonal elements are always equal to 1, since they represent the correlation of a variable with itself; the off-diagonal elements represent the correlations between different pairs of variables.
Look for off-diagonal elements with high absolute values (e.g., greater than 0.8 or 0.9, depending on the specific application and the level of concern about multicollinearity). High values indicate that the corresponding predictors are strongly correlated and may be causing multicollinearity issues in the regression model.
Note that correlation alone does not give a complete picture of the severity of the issue or its impact on the model. Other diagnostic measures, such as the Variance Inflation Factor (VIF) and the condition number, can also be used to assess the presence and severity of multicollinearity.
Variance Inflation Factor (VIF) is a metric used to quantify the severity of multicollinearity in a multiple linear regression model. It measures the extent to which the variance of an estimated regression coefficient is inflated due to multicollinearity.
For each predictor variable in the model, VIF is calculated by performing a separate regression that uses that predictor as the response variable and the remaining predictors as the independent variables. The VIF is then the reciprocal of the proportion of variance left unexplained by the other predictors, i.e., 1 / (1 − R²), where R² is the coefficient of determination of that auxiliary regression. The VIF calculation can be summarized in the following steps:
(i) For each predictor variable Xᵢ in the regression model, perform a linear regression using Xᵢ as the response variable and the remaining predictor variables as the independent variables.
(ii) Calculate the R² value for each of these linear regressions.
(iii) Compute the VIF for each predictor variable Xᵢ as VIFᵢ = 1 / (1 - R²ᵢ).
A VIF value close to 1 indicates that there is very little multicollinearity for the predictor variable, whereas a high VIF value (e.g., greater than 5 or 10, depending on the context) suggests that multicollinearity may be a problem for the predictor variable, and its estimated coefficient might be less reliable. Keep in mind that VIF only provides an indication of the presence and severity of multicollinearity and does not directly address the issue. Depending on the VIF values and the goals of the analysis, you might consider using techniques like variable selection, regularization, or dimensionality reduction methods to address multicollinearity.
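A minimal sketch of the VIF calculation with statsmodels, on synthetic predictors where two columns are nearly collinear (the column names x1, x2, x3 are illustrative):

```python
# Sketch: computing VIF for each predictor with statsmodels
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(10)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)       # strongly correlated with x1
x3 = rng.normal(size=n)                       # independent predictor
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

exog = sm.add_constant(X)                     # include the intercept
vifs = {col: variance_inflation_factor(exog.values, i)
        for i, col in enumerate(exog.columns)}
print(vifs)   # x1 and x2 show large VIFs, x3 stays near 1; ignore the constant's entry
```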
Condition Number: In the context of multicollinearity, the condition number is a diagnostic measure used to assess the stability and potential numerical issues of a multiple linear regression model. It indicates the severity of multicollinearity by examining how sensitive the regression is to small changes in the input data.
The condition number is computed from the eigenvalues of the matrix XᵀX, where X is the design matrix of the regression model (each row representing an observation and each column a predictor variable). It is usually reported as the square root of the ratio of the largest to the smallest eigenvalue of XᵀX, which equals the ratio of the largest to the smallest singular value of X. A high condition number suggests that XᵀX is ill-conditioned, which can lead to numerical instability when solving the normal equations for the regression coefficients.
In the presence of multicollinearity, the design matrix X has highly correlated columns, which causes the eigenvalues of XᵀX to differ greatly in magnitude (one or more very large eigenvalues and one or more very small ones). As a result, the condition number becomes large, indicating that the regression model may be sensitive to small changes in the input data, leading to unstable coefficient estimates.
Typically, a condition number larger than 30 (or sometimes even larger than 10 or 20) is considered a warning sign of potential multicollinearity issues, although the threshold depends on the specific application and the level of concern. A high condition number alone is not definitive proof of multicollinearity; it is an indication that multicollinearity might be a problem, and further investigation (e.g., using VIF, the correlation matrix, or tolerance values) may be required to confirm its presence and severity.
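A companion sketch of the condition-number diagnostic on the same kind of correlated design; note that np.linalg.cond(X) reports the singular-value ratio, i.e., the square root of the eigenvalue ratio of XᵀX:

```python
# Sketch: condition number from the eigenvalues of X'X
import numpy as np

rng = np.random.default_rng(11)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)       # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])

eigvals = np.linalg.eigvalsh(X.T @ X)         # eigenvalues of the Gram matrix
print("eigenvalue ratio:", eigvals.max() / eigvals.min())
print("condition number of X:", np.linalg.cond(X))   # square root of the ratio above
```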
How to remove multicollinearity
Collect more data: In some cases, multicollinearity might be a result of a limited sample size. Collecting more data, if possible, can help reduce multicollinearity and improve the stability of the model.
Remove one of the highly correlated variables: If two or more independent variables are highly correlated, consider removing one of them from the model. This step can help eliminate redundancy in the model and reduce multicollinearity. Choose the variable to remove based on domain knowledge, variable importance, or the one with the highest VIF.
Combine correlated variables: If correlated independent variables represent similar information, consider combining them into a single variable. This combination can be done by averaging, summing, or using other mathematical operations, depending on the context and the nature of the variables.
Use partial least squares regression (PLS): PLS is a technique that combines features of both principal component analysis and multiple regression. It identifies linear combinations of the predictor variables (called latent variables) that have the highest covariance with the response variable, reducing multicollinearity while retaining most of the predictive power.
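A sketch of the PLS option with scikit-learn on synthetic collinear predictors; n_components=2 is an illustrative choice that would normally be selected by cross-validation.

```python
# Sketch: partial least squares compresses correlated predictors into
# a small number of latent components
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(12)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)       # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2 * x1 + 0.5 * x3 + rng.normal(size=n)

pls = PLSRegression(n_components=2).fit(X, y) # 2 latent variables instead of 3 predictors
print("R² on training data:", pls.score(X, y))
print("coefficients on the original predictors:", pls.coef_.ravel())
```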


