When building regression models, a common question is how to evaluate them. Although there are many statistics for quantifying a regression model's performance, the most straightforward are R-squared and adjusted R-squared.
People tend to reach for R-squared, but the catch is that R-squared alone is not a good measure for evaluating regression models. This is where the hero comes in 🙂: adjusted R-squared.
Even in data science interviews, a frequently asked question is:
Could you please explain the key difference between R-Squared and Adjusted R-Squared?
Do you know the answer?
You may be crystal clear about R-squared but have forgotten adjusted R-squared, right? Don't worry, these concepts are a bit confusing. All we need is a refresher on them, if not regularly then at least before you start looking for a new data scientist job.
This article is an ideal place for this.
We hope you are aware of the R-squared method. If you are not, stay tuned till the end of this article. You will learn all the topics along with the key differences between R-squared and adjusted R-squared.
Let's start with the key concepts behind regression evaluation. These concepts are not about the regression algorithm itself; they are the basic building blocks that help in understanding the difference between R-squared and adjusted R-squared at a much deeper level.
Why wait, let’s start!
Basic regression concepts
Regression models have their own evaluation methods, and these are completely different from the evaluation methods used for classification algorithms.
Every regression evaluation method aims to summarize the residuals. What changes from one evaluation method to another is how the residuals are used in the formula.
To understand R-squared and adjusted R-squared we need to know a few basic concepts. In other words, we need answers to the questions below.
- What is Error/Residuals?
- What is the residual sum of squares?
- What is the total sum of squares?
Let’s start the discussion with residuals.
What is Error/Residuals?
Suppose we have two line equations, and someone asks:
Are the two equations different or not?
How do we answer that?
The simplest way is to plot both lines and see whether the corresponding data points on the two lines deviate from each other.
Now let’s come to the regression model. When we build regression models, we will have two line equations.
- Line plotted using the actual data
- Line plotted using the forecasted data
To add more context to the discussion, let's say we are forecasting sales for a product using the historical sales data. In our case, one line is the actual sales graph and the other is the forecasted sales graph.
The difference between an individual actual sales value and its forecasted value is called the residual, or error.
In the above graph, the line is the actual sales graph and the blue dots are the forecasted sales. The difference between the actual sales and the forecasted sales is the residual at the individual level, shown in the image as dotted lines.
The sum of all the residuals is called the total error.
What is Residual Sum of Squares?
To calculate the total error, we simply sum all the residuals. If we square each individual residual and then sum them, the result is called the residual sum of squares.
This value helps us understand how close the forecasted sales line is to the actual sales line. In the regression world, it tells us how well the fitted regression model matches the training dataset.
If you are not aware of the difference between classification algorithms and regression algorithms, it's worth spending time to understand that first.
Why can't we simply use the total error instead of the residual sum of squares?
Using the Residual sum of squares has two main advantages.
- Handles overestimation and underestimation.
- Helps in penalizing the high residuals.
If the above two advantages do not make sense yet, let us break them down.
Handles the overestimation and underestimation
Suppose the actual sale value is 30 and the forecasted value is 18. As the residual formula says,
The difference between the actual and forecasted value is the residual.
The residual value is 30 – 18 = 12.
Suppose the actual sale value is 8 and the forecasted value is 20. In this case the residual value is 8 – 20 = -12.
The first example is an underestimation and the second example is an overestimation. If we sum up these two residuals, the result will be 0.
Does it mean our actual and forecasted values are the same?
No. The two errors simply cancel each other out. To overcome this, we use the sum of the squared residuals rather than the plain sum.
If we apply the squared sum to these two examples, the result is completely different.
The first squared residual is 144 and the second squared residual is also 144, so the residual sum of squares is 288.
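The cancellation is easy to check in a few lines of Python. This is only a small sketch of the two examples above; the variable names are our own choice for illustration.

```python
# The two (actual, forecast) pairs from the examples above.
actual = [30, 8]
forecast = [18, 20]

# Residual = actual - forecast, computed pair by pair.
residuals = [a - f for a, f in zip(actual, forecast)]

# The plain sum lets the positive and negative residuals cancel out.
total_error = sum(residuals)

# Squaring first removes the sign, so nothing cancels.
squared_sum = sum(r ** 2 for r in residuals)

print(residuals)    # [12, -12]
print(total_error)  # 0
print(squared_sum)  # 288
```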
Helps in penalizing the high residuals
Now let’s understand the penalization part.
In the bagging vs boosting ensemble methods article, we explained how the weak learners penalize misclassified samples with a higher weight than correctly classified samples.
A smart way to get a similar effect here is to square the error term.
Let’s consider the below two actual and the forecasted sales.
Data point 01:
- Actual value: 45
- Forecasted value: 45.6
- Error: -0.6
- Squared Error: 0.36
Data point 02:
- Actual value: 45
- Forecasted value: 35
- Error: 10
- Squared Error: 100
Look at the above results. When the absolute error is below 1, squaring it makes it even smaller. When the error is of considerable size, squaring magnifies it.
This is an ideal way to see where our regression model is failing, and it helps in optimizing the model for those magnified errors.
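The two data points above can be reproduced with a tiny helper. This is just a sketch; the function name `error_and_square` is made up for this illustration, and the results are rounded to hide floating point noise.

```python
def error_and_square(actual, forecast):
    """Return the error and the squared error, rounded to 2 decimals."""
    error = actual - forecast
    return round(error, 2), round(error ** 2, 2)

# Data point 01: a small error shrinks further when squared.
small = error_and_square(45, 45.6)

# Data point 02: a large error is magnified when squared.
large = error_and_square(45, 35)

print(small)  # (-0.6, 0.36)
print(large)  # (10, 100)
```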
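Mathematically, the residual sum of squares is
$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $y_i$ is the actual value, $\hat{y}_i$ is the forecasted value, and $n$ is the number of observations.
In the upcoming section of this article we will use a residual sum of squares function to calculate the RSS value with dummy data.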
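As a minimal sketch, the residual sum of squares can be written as a small Python function. The function name and the three sample values are made up for illustration and are not the article's sales data.

```python
def residual_sum_of_squares(actual, forecast):
    """Sum of squared differences between actual and forecasted values."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return sum((a - f) ** 2 for a, f in zip(actual, forecast))

# Made-up values, only to show the call.
actual = [10, 20, 30]
forecast = [12, 18, 33]

rss = residual_sum_of_squares(actual, forecast)
print(rss)  # (-2)^2 + 2^2 + (-3)^2 = 17
```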
What is the Total Sum of Squares?
Now let's have a look at the total sum of squares. In the earlier discussion we explained the residual sum of squares, which says how closely the prediction line (the model) lines up with the actual sales data points.
In other words, the residual sum of squares explains how far the forecasted sales values deviate from the actual sales values. It is a statistic that compares the data with external values, the forecasts.
How about a statistic on the data points themselves, in our case the actual sales data points? We can check how far the sales deviate from the average sales. This concept is known as the total sum of squares.
In the residual sum of squares we subtract the forecasted sales value from the actual sales value, whereas in the total sum of squares we subtract the average (mean) sales value from the actual sales value.
The formula for the total sum of squares is
$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
where $y_i$ is an actual sales value and $\bar{y}$ is the mean of the actual sales values.
If we pause for a second and think about this: in the residual sum of squares, each actual sales value has its own forecasted value to subtract. In the total sum of squares, the value we subtract is the same for every data point, because there is only one mean of the actual sales values.
So, to calculate the total sum of squares, we take each actual sales value, subtract the average sales value from it, square the result, and sum all those squared values. This gives us the total sum of squares.
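The steps just described can be sketched in Python as follows. The function name and the sample values are assumptions made for this illustration only.

```python
def total_sum_of_squares(actual):
    """Sum of squared differences between each actual value and the mean."""
    mean_value = sum(actual) / len(actual)
    return sum((a - mean_value) ** 2 for a in actual)

# Made-up values with a mean of 20.
actual = [10, 20, 30]

tss = total_sum_of_squares(actual)
print(tss)  # (-10)^2 + 0^2 + 10^2 = 200.0
```

Note that the forecasted values are not needed at all here, which is exactly the difference from the residual sum of squares.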
I hope the above explanation is clear. If it is not, let's look at the sales data below, for which we will calculate the residual sum of squares and the total sum of squares.
Residual Sum of Squares and Total Sum of Squares Example
Now let's understand how we calculate the residual sum of squares and the total sum of squares for this data.
In the above dataset, we have the actual sales and the forecasted sales values. Using these we calculated the residuals, which are just the differences between the actual sales and the forecasted sales. Then we square each residual.
At the end we sum all the squared residuals, which gives us the residual sum of squares.
In the same way, let's compute the total sum of squares.
In the above dataset we have the actual sales data points. Using the actual sales values we computed the mean of sales, which is just the average of all the sales. Then for each sales value we take the difference from the mean sales value and square the result.
The sum of all these values is the total sum of squares.
By now we are ready to understand R-squared. We will use both the calculated residual sum of squares and the calculated total sum of squares to compute the R-squared value.
This will become much clearer in the R-squared formula section.
The calculated R-squared explains how well the regression model fits the actual data points. Some literature reports R-squared on a scale from 0 to 1, and some reports it as a percentage from 0 to 100. Whatever the scale, a value near the maximum says the regression model fits very close to the actual values.
R-squared is treated as a measure of how much of the variance in the target is explained by the model. For an ideal regression model, the R-squared value should be close to 1.
Now let's look at the R-squared formula and see how it is calculated for any given actual and forecasted values.
Below is the formula for calculating the R-squared value.
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
We can simplify the above formula further.
$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$
- RSS: Residual Sum of Squares
- TSS: Total Sum of Squares
The above is the simplified version for calculating the R-squared value. It uses both the residual sum of squares and the total sum of squares.
The formula is easy to remember.
All we are doing is dividing RSS by TSS and then subtracting the result from 1. For the ideal model the RSS value will be zero, so the R^2 value will be 1. In other words, for a regression model to be called good, its R-squared value should be near one.
Calculating R-Squared In Python
We are going to use the data below for all the calculations in this article.
Let's see how we can calculate the R-squared value using Python.
We created functions for calculating the residual sum of squares and the total sum of squares, and we then use those functions to calculate the R-squared value.
To cross-check the implementation, we ran it on the sales data shown before and got the same results: the residual sum of squares is 189 and the total sum of squares is 1704.4.
For this data, we get an R-squared of 0.89.
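The code for this step is not reproduced here, so the following is only a minimal sketch of how such functions might look. The function names are our own assumptions. The last call plugs in the residual sum of squares (189) and total sum of squares (1704.4) reported above; the three-value lists are made up.

```python
def r_squared_from_sums(rss, tss):
    """R-squared from a residual sum of squares and a total sum of squares."""
    if tss == 0:
        raise ValueError("total sum of squares is zero; R-squared is undefined")
    return 1 - rss / tss

def r_squared(actual, forecast):
    """R-squared computed directly from actual and forecasted values."""
    mean_value = sum(actual) / len(actual)
    rss = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    tss = sum((a - mean_value) ** 2 for a in actual)
    return r_squared_from_sums(rss, tss)

# Made-up values: RSS = 17, TSS = 200, so R-squared = 1 - 17/200.
toy_r2 = round(r_squared([10, 20, 30], [12, 18, 33]), 3)
print(toy_r2)  # 0.915

# The sums reported for the sales data in this article.
sales_r2 = round(r_squared_from_sums(189, 1704.4), 2)
print(sales_r2)  # 0.89
```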
Limitation of R-Squared
If you look closely at the R-squared formula, it has no notion of the number of features used: there is no term that accounts for the number of features in the regression model. As a result, the R-squared value computed on the training data stays the same or goes up whenever we include more features in the regression model.
Compare this with classification evaluation metrics: for classification models we can't completely depend on the confusion matrix alone, and the same applies here. For regression, though, there is a key reason why we should not consider just the R-squared.
In the above graph we show how the sales growth is impacted by the advertisement spend. In this case we are considering only the advertisement spend as a feature for forecasting the sales growth.
However, if we include more features, such as price_reduction, sales_season, etc., the regression model's R-squared value will be the same as before (with only advertisement spend) or higher, even when we are not sure that the newly added features actually help in forecasting the sales.
If the above explanation is not clear, don't worry. While explaining the key difference between R-squared and adjusted R-squared, we are going to learn this with a sales growth case study.
In the above image we show how the R-squared value behaves as we increase the number of features. Even though we are not sure whether the extra features improve the model accuracy, the R-squared value still increases as features are added.
The above result was created manually, just to show how the R-squared value changes as features are added. We haven't built any fancy machine learning model yet.
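To see this behaviour on a fitted model rather than a hand-made chart, here is a small sketch that assumes NumPy is available. The data is synthetic, `ols_r_squared` is a helper written for this illustration, and the extra columns are pure random noise that has nothing to do with the target.

```python
import numpy as np

def ols_r_squared(X, y):
    """R-squared of an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    return 1 - rss / tss

rng = np.random.default_rng(0)
n = 30

# Synthetic data: sales growth depends only on advertisement spend.
ad_spend = rng.uniform(10, 100, size=n)
sales_growth = 0.5 * ad_spend + rng.normal(0, 5, size=n)

# Five columns of random noise, unrelated to the target.
noise = rng.normal(size=(n, 5))

# Fit with 0, 1, ..., 5 noise columns added next to ad_spend.
scores = []
for k in range(6):
    X = np.column_stack([ad_spend, noise[:, :k]])
    scores.append(ols_r_squared(X, sales_growth))

for k, score in enumerate(scores):
    print(f"ad_spend + {k} noise features: R-squared = {score:.4f}")
```

Each printed value should be at least as large as the one before it, even though the added columns carry no information about sales growth.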
This limitation can be overcome with the Adjusted R-Squared value.
The key thing to note here is that when you have multiple features in the regression model, it's better to use the adjusted R-squared value than just the R-squared value.
Adjusted R-Squared Explanation
By now we are aware of the limitation of R-squared; the adjusted R-squared lets us overcome it.
The adjusted R-squared tells us whether adding a new feature improves the performance of the model or not.
Adjusted R-Squared Formula
$$R^2_{adj} = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1}$$
Here $N$ is the number of observations and $p$ is the number of features.
Consider the sales data, where we have 3 features (email campaign spend, google adwords spend, and season) and 10 observations.
For this sales data, p is 3 if we use all 3 features for building a regression model, and N is 10 as we have 10 observations.
In the next section, let’s use this formula to calculate the adjusted R-squared value.
Calculating Adjusted R-Squared in Python
Here we reuse the functions we created previously and pass the calculated R-squared value to the adjusted R-squared function to calculate the adjusted R-squared value.
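The function itself is not shown here, so the following is a hedged sketch of what an adjusted R-squared helper could look like. The function and parameter names are assumptions made for this illustration. The example call uses an R-squared of 0.3 with 10 observations and 2 features.

```python
def adjusted_r_squared(r2, n_observations, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (N - 1) / (N - p - 1)."""
    denominator = n_observations - n_features - 1
    if denominator <= 0:
        raise ValueError("need more observations than features plus one")
    return 1 - (1 - r2) * (n_observations - 1) / denominator

# R-squared of 0.3 with N = 10 observations and p = 2 features.
adjusted = round(adjusted_r_squared(0.3, 10, 2), 2)
print(adjusted)  # 0.1
```

The guard on the denominator is a design choice of this sketch: with N - p - 1 equal to zero or less, the formula is not meaningful.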
Difference Between R-Squared and Adjusted R-Squared methods
We have seen how R-squared and adjusted R-squared are calculated individually. But we still need to see where R-squared fails and how adjusted R-squared captures it. To understand that, let's take the sales data.
Advertisement VS Sales Growth Case Study
To address the limitation of R-squared we are considering the data below. It is the same sales data, with the dummy_forecast_value removed. We will use different combinations of features to build the regression models and see how R-squared and adjusted R-squared behave.
You can download the dataset below from our GitHub account.
We have 3 features.
- email campaign spend
- google adwords spend
- season
The target is the sales value. We are going to build 3 models with the feature combinations below.
Model 01:
- Features: email campaign spend, google adwords spend
- Target: sales
Model 02:
- Features: google adwords spend, season
- Target: sales
Model 03:
- Features: email campaign spend, google adwords spend, season
- Target: sales
Calculate R-Squared and Adjusted R-Squared In Python
We are going to implement 3 functions: model1, model2, and model3. For each model we will compute both the R-squared and the adjusted R-squared value.
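The original implementation and dataset are not reproduced here, so the sketch below only illustrates the idea under clear assumptions: the 10 rows are synthetic, the variable names are hypothetical, and the models are fitted with NumPy least squares rather than any particular machine learning library. The numbers it prints are therefore not the article's results.

```python
import numpy as np

def fit_and_score(X, y):
    """Return (r2, adjusted_r2) for a least-squares fit with an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    r2 = 1 - rss / tss
    adjusted = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adjusted

# Synthetic stand-in for the sales data (10 rows); NOT the article's dataset.
rng = np.random.default_rng(42)
email = rng.uniform(100, 500, size=10)
adwords = rng.uniform(100, 500, size=10)
season = rng.integers(0, 4, size=10).astype(float)
sales = 0.02 * email + 0.01 * adwords + rng.normal(0, 3, size=10)

# The three feature combinations described above.
feature_sets = {
    "model1": [email, adwords],
    "model2": [adwords, season],
    "model3": [email, adwords, season],
}

results = {
    name: fit_and_score(np.column_stack(columns), sales)
    for name, columns in feature_sets.items()
}

for name, (r2, adjusted) in results.items():
    print(f"{name}: r2={r2:.3f}, adjusted_r2={adjusted:.3f}")
```

Whatever the data, model3 contains every feature of model1 and model2, so its R-squared cannot be lower than either of theirs. The adjusted R-squared carries no such guarantee.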
We have placed the results of the 3 models in tabular form for better understanding.
For model 01 we have an R-squared value of 0.3 and an adjusted R-squared value of 0.1, which means the model is not good enough for forecasting sales values.
As a next step we took a second feature set to build the regression model. Even for model 02 the results are not promising. In fact, they are worse than the model 01 results.
In the last iteration, we took all the features of model 01 and added the new feature from model 02.
We know that model 02's feature set does not perform well, so we would not expect its new feature to help much. But the R-squared of model 03 is higher than the R-squared of model 01.
This is the limitation of R-squared. The adjusted R-squared of model 03, on the other hand, is much lower than the adjusted R-squared of model 01, which is more reasonable. The other thing to note is that the R-squared value here ranges between 0 and 1, whereas the adjusted R-squared can be negative.
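A quick calculation shows how the adjusted R-squared drops below zero. This is a small sketch; the R-squared of 0.05 is a made-up value for a weak model, combined with 10 observations and 3 features.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p features."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A weak model: R-squared is small but still positive.
weak_r2 = 0.05
weak_adjusted = round(adjusted_r_squared(weak_r2, n=10, p=3), 3)

print(weak_r2)        # 0.05
print(weak_adjusted)  # -0.425
```

The penalty factor (n - 1) / (n - p - 1) is 9 / 6 = 1.5 here, which is enough to push a weak R-squared into negative territory.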
Story in short:
Prefer the adjusted R-squared as the evaluation metric whenever the model has more than one feature. For a model with a single feature, the two values are not identical, but with enough observations they are close, because the correction factor (N - 1)/(N - 2) is then near 1.
Which method should we use?
By now you know the answer to this question. If you don't, please read the article again. Just kidding. We should always consider the adjusted R-squared as the evaluation metric for regression problems.
In this article we learned how to calculate the residual sum of squares and the total sum of squares. We used these calculations to compute the R-squared and adjusted R-squared values. Below are the key points to keep in mind.
- Prefer the adjusted R-squared value over the R-squared value as the evaluation metric for regression problems.
- The R-squared value ranges from 0 to 1, whereas the adjusted R-squared value can be negative too.
You can get the complete code for this article in the dataaspirant GitHub account. Feel free to fork it.