When building regression models, a common question is how to evaluate them. Although there are various statistics to quantify a regression model's performance, the most straightforward are R-Squared and Adjusted R-Squared.
People tend to reach for the R-Squared method, but the catch is that R-Squared alone is not a good measure for evaluating regression models. This is where the hero comes in 🙂 the Adjusted R-Squared method.
Even in data science interviews, a frequently asked question is:
Could you please explain the key difference between R-Squared and Adjusted R-Squared?
Do you know the answer to that?
You are crystal clear about R-Squared but forgot about Adjusted R-Squared, right? Don't worry, these concepts are a bit confusing. All we need is a refresh of these concepts from time to time, or at least before you start looking for a new data scientist job.
This article is an ideal place for that.
We hope you are aware of the R-Squared method. Still, if you are not, stay tuned till the end of this article. You will learn all the topics along with the key differences between R-Squared and Adjusted R-Squared.
Before we dive further, let's see the table of contents for this article.
Let's start with understanding the key concepts in regression. These concepts are not about any specific regression algorithm; they are the basic building blocks that help in understanding the key difference between R-Squared and Adjusted R-Squared at a much deeper level.
Why wait, let's start!
Basic regression concepts
Unlike machine learning classification algorithms, regression models have their own evaluation methods, which are completely different from the classification evaluation methods.
The aim of any regression evaluation method is to show how the residuals are distributed. The way the residuals are used in the formula changes from one evaluation method to another.
To understand R-Squared and Adjusted R-Squared we need to know the basic concepts below. In other words, we need answers to the following questions.
 What are errors/residuals?
 What is the residual sum of squares?
 What is the total sum of squares?
Let's start the discussion with residuals.
What Are Errors/Residuals?
Suppose we have two line equations. If someone asks,
Are the two equations different or not?
How do we answer that?
The simplest way is to plot the two equations and see how far the data points on the two lines deviate from each other.
Now let's come to the regression model. When we build a regression model, we end up with two lines:
 Line plotted using the actual data
 Line plotted using the forecasted data
To add more context to the discussion, let's say we are forecasting sales for a product using the historical sales data. In our case one line is the actual sales graph and the other is the forecasted sales graph.
The difference between an individual actual sales value and the corresponding forecasted sales value is called the residual or error.
In the above graph, the line is the actual sales graph and the blue dots are the forecasted sales. The difference between the actual and forecasted sales is the residual at the individual level, represented by the dotted lines in the image.
The sum of all the residuals is called the total error.
What is Residual Sum of Squares?
To calculate the total error we just sum all the residuals. If we square each individual residual and then perform the summation, the result is called the residual sum of squares.
This value helps us understand how close the forecasted sales line is to the actual sales line. In regression terms, it tells us how accurately the fitted regression model describes the training dataset.
If you are not aware of the difference between classification algorithms and regression algorithms, it's worth spending time to understand that first.
Why can't we simply use the total error instead of the residual sum of squares?
Using the residual sum of squares has two main advantages.
 It handles overestimation and underestimation.
 It helps in penalizing high residuals.
If the above two advantages do not make sense yet, let us simplify them.
Handles overestimation and underestimation
Suppose the actual sales value is 30 and the forecasted value is 18. As the residual formula says,
the difference between the actual and the forecasted value is the residual.
The residual value is 30 - 18 = 12.
Now suppose the actual sales value is 8 and the forecasted value is 20. In this case the residual value is 8 - 20 = -12.
The first example is underestimation and the second example is overestimation. If we sum up these two residuals, the result will be 0.
Does it mean our actual and forecasted values are the same?
No, right?
To overcome this, we use the sum of the squared residuals rather than just the summation.
If we apply the squared sum to these two examples, the results are completely different.
The first squared residual value is 144 and the second squared residual value is also 144, so the residual sum of squares is 288.
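The two examples above can be sketched in a few lines of Python:

```python
# Residuals from the two examples: 30 - 18 = 12 and 8 - 20 = -12
residuals = [30 - 18, 8 - 20]

total_error = sum(residuals)                  # the +12 and -12 cancel out
squared_sum = sum(r ** 2 for r in residuals)  # 144 + 144

print(total_error)  # 0
print(squared_sum)  # 288
```

The zero total error hides the fact that both forecasts were off by 12; the squared sum of 288 does not.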
Helps in penalizing high residuals
Now let's understand the penalization part.
In the bagging vs boosting ensemble methods article we explained how boosting penalizes misclassified samples with a higher weight than correctly classified samples.
The smart way to perform this penalization is to apply a square on the error term.
Let's consider the below two actual and forecasted sales values.

Data point 01:
 Actual value: 45
 Forecasted value: 45.6
 Error: 0.6
 Squared Error: 0.36

Data point 02:
 Actual value: 45
 Forecasted value: 35
 Error: 10
 Squared Error: 100
If you look at the above results: when the error is minimal (less than 1), squaring it makes it even smaller, whereas when the error is of considerable size, squaring magnifies it.
This is an ideal way to see where our regression model is failing, and it helps in optimizing the errors for those magnified values.
In mathematical terms, the formula for the residual sum of squares is:

RSS = Σ (y_i - ŷ_i)^2

Where:
 y_i: the actual value of the i-th data point
 ŷ_i: the forecasted value of the i-th data point
In the upcoming sections of this article we will be using the residual sum of squares function to calculate the RSS value with dummy data.
What is the Total Sum of Squares?
Now let's have a look at the total sum of squares. In the earlier discussion we explained the residual sum of squares; this value says how close the prediction line (or model) is to the actual sales data points.
In other words, the residual sum of squares explains how much the forecasted sales values deviate from the actual sales values. This is more like a statistic on external values.
How about a statistic on the internal data points, in our case the actual sales data points? We can check how the sales deviate from the average sales. This concept is known as the total sum of squares.
In the residual sum of squares we subtract the forecasted sales value from the actual sales value, whereas in the total sum of squares we subtract the average (mean) sales value from the actual sales value.
The formula for the total sum of squares is:

TSS = Σ (y_i - ȳ)^2

Where:
 y_i: the actual value of the i-th data point
 ȳ: the mean of all the actual values
If we pause for a second and think about this: unlike the residual sum of squares case, we don't need a different value to subtract for each actual sales value, as the mean of the actual values is the same for all the sales data points.
So, to calculate the total sum of squares, all we need to do is take each actual sales value, subtract the average sales value from it, square the result, and sum over all those values. This gives us the total sum of squares.
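As a quick sketch with made-up sales numbers (not the article's dataset):

```python
# Made-up actual sales; the mean is 20, so the deviations are -10, 0, +10
actual_sales = [10, 20, 30]

mean_sales = sum(actual_sales) / len(actual_sales)
tss = sum((a - mean_sales) ** 2 for a in actual_sales)

print(tss)  # 100 + 0 + 100 = 200.0
```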
I hope the above explanation is clear. If it is not, we can have a look at the below sales data, for which we will calculate the residual sum of squares and the total sum of squares.
Residual Sum of Squares and Total Sum of Squares Example
Now let's understand how we calculate the residual sum of squares and the total sum of squares for this data.
In the above dataset, we have the actual sales and forecasted sales values. Using these we calculate the residuals, which are just the differences between the actual and forecasted sales. Then we square each residual.
At the end we simply sum all the squared residuals, which gives us the residual sum of squares.
In the same way, let's compute the total sum of squares.
In the above dataset we have the actual sales data points. Using the actual sales values we compute the mean of sales, which is just the average of all the sales. Then for each sales value we take the difference from the mean sales value and square the result.
The sum of all these values is the total sum of squares.
R-Squared Explanation
By now we are ready to understand R-Squared. We will use both the calculated residual sum of squares and total sum of squares values to compute the R-Squared value.
This will become much clearer in the R-Squared formula section.
The calculated R-Squared explains how well the regression model fits the actual data points. Some literature says the R-Squared value ranges from 0 to 1; other literature expresses it as a percentage from 0 to 100. Whatever the range, the maximum value means the regression model fits very close to the actual values.
R-Squared is treated as a measure of how much of the variance is explained by the model. For an ideal regression model, the R-Squared value should be close to 1.
Now let's look at the R-Squared formula and see how it calculates the value for any given actual and forecasted values.
R-Squared Formula
Below is the formula for calculating the R-Squared value:

R^2 = 1 - Σ (y_i - ŷ_i)^2 / Σ (y_i - ȳ)^2

We can simplify the above formula further:

R^2 = 1 - RSS / TSS
Where:
 RSS: Residual Sum of Square
 TSS: Total Sum of Square
The above is the simplified version for calculating the R-Squared value. It uses both the residual sum of squares and the total sum of squares.
The formula is easy to remember.
All we are doing is taking the ratio of RSS to TSS and subtracting it from 1. For an ideal model the RSS value will be zero, so the R^2 value will be 1. In other words, for a regression model to be considered good, it should have an R-Squared value near 1.
Calculating R-Squared in Python
We are going to use the below data for all the calculations in this article.
Let's see how we can calculate the R-Squared value using Python.
We created functions for calculating the residual sum of squares and the total sum of squares. Then we use those functions to calculate the R-Squared value.
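Since the original code listing is not reproduced here, a minimal sketch of those functions might look like this (the sales numbers below are dummy values for illustration, not the article's dataset):

```python
def residual_sum_of_squares(actual, forecast):
    """RSS: sum of squared differences between actual and forecasted values."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast))

def total_sum_of_squares(actual):
    """TSS: sum of squared deviations of the actual values from their mean."""
    mean_value = sum(actual) / len(actual)
    return sum((a - mean_value) ** 2 for a in actual)

def r_squared(actual, forecast):
    """R^2 = 1 - RSS / TSS."""
    return 1 - residual_sum_of_squares(actual, forecast) / total_sum_of_squares(actual)

# Dummy sales numbers for illustration
actual_sales   = [10, 20, 30, 40]
forecast_sales = [12, 18, 33, 41]

print(residual_sum_of_squares(actual_sales, forecast_sales))  # 18
print(total_sum_of_squares(actual_sales))                     # 500.0
print(round(r_squared(actual_sales, forecast_sales), 3))      # 0.964
```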
For cross-checking the implementation, we check the results on the sales data we showed before, and we get the same results: the residual sum of squares is 189 and the total sum of squares is 1704.4.
For this data, we get an R-Squared value of 0.89.
Limitation of RSquared
If you clearly observe the RSquared formula, itâ€™s lagging with the concepts of number of features used. As there is no component for changing the number of features used in the regression model. The Rsquared value will be the same or higher if we include more number of features in the regression model.Â
If you compare this with classification evaluation metrics, for all classificaiton models we can’t completely depend on confusion matrixs right, the same apply’s here too but we have key reason why should not consider the just the rsquared for Â regression models.
In the above graph we show how sales growth is impacted by advertisement spend. In this case we are considering only the advertisement spend as a feature for forecasting sales growth.
However, if we include multiple features, such as price_reduction, sales_season, etc., then the regression model's R-Squared value will be the same as before (with only advertisement spend) or higher. It is not clear whether the newly added features are actually helping to forecast sales.
If the above explanation is not clear, don't worry: while explaining the key difference between R-Squared and Adjusted R-Squared, we are going to revisit this with a sales growth case study.
In the above image we show how the R-Squared value behaves as we increase the number of features. Even though we are not sure whether the extra features improve the model's accuracy, the R-Squared value still increases as features are added.
The above result is a manually created one, just to show how the R-Squared value changes as features increase. We haven't built any fancy machine learning model yet.
This limitation can be overcome with the Adjusted R-Squared value.
The key thing to note here is: when you have multiple features in a regression model, it's always better to use the Adjusted R-Squared value than just the R-Squared value.
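To see this behaviour in action, here is a small sketch on a synthetic dataset (a hypothetical advertisement-to-sales relationship, fit with plain NumPy least squares rather than any particular ML library): adding a purely random feature never lowers R-Squared.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30

advertisement = rng.uniform(0, 100, n)
sales = 3.0 * advertisement + rng.normal(0, 10, n)  # sales mostly driven by ad spend
noise = rng.uniform(0, 100, n)                      # a useless, purely random feature

def fit_r_squared(features, target):
    """Ordinary least squares with intercept; returns R^2 = 1 - RSS / TSS."""
    X = np.column_stack([np.ones(n)] + features)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ beta) ** 2)
    tss = np.sum((target - target.mean()) ** 2)
    return 1 - rss / tss

r2_one = fit_r_squared([advertisement], sales)
r2_two = fit_r_squared([advertisement, noise], sales)

# R^2 never decreases when a feature is added, even a useless one
assert r2_two >= r2_one
print(round(r2_one, 4), round(r2_two, 4))
```

The least-squares fit with the extra column can always reproduce the smaller model (by giving the noise feature a zero coefficient), so its RSS can only stay the same or shrink.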
Adjusted RSquared Explanation
By now we are aware about the limitations of RSquared, using the adjusted Rsquared we can overcome this.
The adjusted RSquared method will say whether adding the new feature will improve the performance of the model are not.
Adjusted R-Squared Formula

Adjusted R^2 = 1 - [(1 - R^2)(n - 1) / (n - p - 1)]

Where:
 R^2: the R-Squared value of the model
 n: the number of observations
 p: the number of features
If we consider the sales data, we have 3 features (email campaign spend, Google AdWords spend, and season) and 10 observations.
For this sales data, p is 3 if we use these 3 features for building a regression model, and n is 10 since we have 10 observations.
In the next section, let's use this formula to calculate the Adjusted R-Squared value.
Calculating Adjusted R-Squared in Python
Here we just reuse the previous functions we created and pass the calculated R-Squared value to the adjusted R-Squared function to compute the Adjusted R-Squared value.
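Since the code listing is not reproduced here, a minimal sketch using the values reported earlier for the sales data (R^2 = 0.89, n = 10, p = 3) might look like this:

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 = 0.89  # R-Squared reported earlier for the sales data
n = 10     # number of observations
p = 3      # number of features

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # 0.835
```

Note the adjusted value (0.835) is a bit lower than the raw R-Squared (0.89): the formula pays a small penalty for each feature used.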
Difference Between R-Squared and Adjusted R-Squared Methods
We have seen how R-Squared and Adjusted R-Squared are calculated individually. Now let's see where R-Squared fails and how Adjusted R-Squared captures it. To understand that, let's take the sales data.
Advertisement vs Sales Growth Case Study
To address the limitations of R-Squared we are considering the below data, which is the same sales data with the dummy_forecast_value column removed. We will use different combinations of features to build regression models and observe how R-Squared and Adjusted R-Squared behave.
You can download the below dataset from our GitHub account.
We have 3 features:
 email campaign spend
 google adwords spend
 season
The target is the sales value. We are going to build 3 models with the below feature combinations.

Model 01:
 Features: email campaign spend, google adwords spend
 Target: sales

Model 02:
 Features: google adwords spend, season
 Target: sales

Model 03:
 Features: email campaign, google adwords, season
 Target: sales
Calculating R-Squared and Adjusted R-Squared in Python
We are going to implement 3 functions: model1, model2, and model3. For each model we will compute both the R-Squared and the Adjusted R-Squared value.
We have placed the 3 models' results in tabular form for better understanding.
For model 01 we get an R-Squared value of 0.3 and an Adjusted R-Squared value of 0.1, which means the model is not good enough for forecasting sales values.
As a next step we took a second feature set to build the regression model. Even for model 02 the results are not promising; in fact, they are worse than the model 01 results.
In the last iteration, we took all the features of model 01 and added the new feature from model 02.
We know that model 02 is not performing well, so we should expect low R-Squared and Adjusted R-Squared values. But the model 03 R-Squared is higher than the model 01 R-Squared value.
This is a limitation of R-Squared. If we look at the Adjusted R-Squared value instead, it is much lower than the model 01 Adjusted R-Squared value, which is more reasonable. The other thing to note: the R-Squared value ranges between 0 and 1, whereas the Adjusted R-Squared value can drop below 0, i.e., it can be negative.
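As a numeric illustration of this penalty (model 01's values of 0.3 and 0.1 match those reported above; the model 03 R-Squared of 0.35 is an assumed value for illustration, since the exact results table is not reproduced here):

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 10  # observations in the sales data

# Model 01: 2 features, R^2 = 0.3 (as reported above)
print(round(adjusted_r_squared(0.3, n, p=2), 3))   # 0.1
# Model 03: 3 features with an illustrative slightly higher R^2 of 0.35:
# the raw R^2 went up, but the adjusted value drops below model 01's 0.1
print(round(adjusted_r_squared(0.35, n, p=3), 3))  # 0.025
```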
The story in short:
Always consider Adjusted R-Squared as the evaluation metric unless you build a model with a single feature; in that case R-Squared and Adjusted R-Squared will be nearly the same.
Which Method Should We Use?
By now you know the answer to this question. If you don't, please read the article again. Just kidding. We should always prefer the Adjusted R-Squared method as the evaluation metric for regression problems.
Additional Internal Resources
Below we have listed the must-read related articles; if you have time, please go through them.
Conclusion
In this article we learned about the residual sum of squares and total sum of squares calculations. We used these calculations to compute the R-Squared and Adjusted R-Squared values. Below are the key points to keep in mind.
 Always prefer the Adjusted R-Squared value over the R-Squared value as the evaluation metric for regression problems.
 The R-Squared value ranges from 0 to 1, whereas the Adjusted R-Squared value can be negative.
You can get the complete code for this article in the dataaspirant GitHub account. Feel free to fork.