Welcome! This repository is dedicated to exploring linear regression, both simple and multiple, a fundamental tool for modelling the relationship between a quantitative response and one or more predictors.
My example data comes from the scikit-learn library's "7.13. Diabetes dataset". This dataset is documented in the paper "Least Angle Regression" by Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani (2004).
I'll be using two Python libraries for the analysis: scikit-learn for fitting the regression, and statsmodels for a detailed statistical analysis of the results.
I hope this exploration helps you understand the basics of linear regression and how to implement it in Python.
To replicate the analysis, you'll need the libraries listed in the requirements.txt file. It's recommended to run the code in a virtual environment.

Set up a virtual environment and install the necessary libraries with:

```
pip install -r requirements.txt
```

or, if you prefer pip-tools, install it, create a virtual environment, and synchronise it with:

```
pip-sync
```
Make sure you're using Python 3.9 and the library versions specified in the requirements.txt file.
Run the Jupyter notebooks available in this repository.
Linear regression is an approach for predicting a quantitative response $Y$ from a predictor variable $X$.
Some important points to consider in this approach are:

- it assumes that there is a relationship between $X$ and $Y$;
- it assumes that this relationship is linear.
Mathematically, we can write this linear relationship as

$$Y \approx \beta_0 + \beta_1 X,$$

where

- $\beta_0$ is the intercept term (the expected value of $Y$ when $X = 0$);
- $\beta_1$ is the slope term (the average increase in $Y$ associated with a one-unit increase in $X$).
Together, $\beta_0$ and $\beta_1$ are known as the model coefficients or parameters.
In practice, $\beta_0$ and $\beta_1$ are unknown, so we must use data to estimate coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ such that the resulting line is as close as possible to the data points.
There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion, known as ordinary least squares (OLS) (James et al., 2013).
Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i = y_i - \hat{y}_i$ is the $i$th residual, and the residual sum of squares (RSS) is

$$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2,$$

or equivalently

$$\mathrm{RSS} = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2.$$
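As a quick numeric illustration of the RSS, here is a minimal sketch with toy data and candidate coefficients (all values are made up for the example):

```python
import numpy as np

# Toy data and candidate coefficients (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
beta0_hat, beta1_hat = 1.0, 2.0

# Residuals e_i = y_i - (beta0_hat + beta1_hat * x_i)
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat

# RSS is the sum of the squared residuals
rss = np.sum(residuals ** 2)
print(rss)
```

Different candidate coefficients give different RSS values; least squares picks the pair with the smallest one.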
The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS. Using some calculus, one can show that the minimizers are

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$

where $\bar{x}$ and $\bar{y}$ are the sample means of $X$ and $Y$.
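The closed-form OLS estimates can be checked in a few lines of NumPy. This is a sketch on simulated data, where the true coefficients are known, so we can see that the estimates land close to them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from a known linear model: y = 2 + 3x + noise
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=100)

# Closed-form least squares estimates
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)  # should be close to the true values 2 and 3
```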
This is the same method used by the `LinearRegression` class in the scikit-learn library to fit a linear model to the data.
Here's how it relates to scikit-learn:
- Creating the model: when you create a linear regression model in scikit-learn using `LinearRegression()`, you're setting up a model that will find the best-fit line through your data using the least squares method, i.e., the line that minimizes the residual sum of squares (RSS).
- Fitting the model: when you call the `.fit(X, y)` method on a `LinearRegression` object, scikit-learn uses your input features (`X`) and target variable (`y`) to compute the optimal parameters ($\beta_0$ and $\beta_1$) that minimize the RSS, just as described above.
- Model coefficients: after fitting, the estimated coefficients can be accessed through the `.coef_` and `.intercept_` attributes of the `LinearRegression` object. These correspond to $\hat{\beta}_1$ (the slope) and $\hat{\beta}_0$ (the intercept), respectively.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.
Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press.