Data Science: Correlation vs Regression in Statistics
In this article, we will understand the key differences between correlation and regression, and their significance. Correlation and regression are two different types of analysis performed on multivariate distributions of data. They are mathematical concepts that help in understanding, respectively, the extent of the relationship between two variables and the nature of that relationship.
Correlation, as the name suggests, is a word formed by combining ‘co’ and ‘relation’. It refers to the analysis of the relationship between two variables in a given dataset. It helps in measuring the linear relationship between two variables.
Two variables are said to be correlated when a change in the value of one variable is accompanied by a corresponding change in the value of the other variable. The two values may move in the same direction or in opposite directions. Either way, this indicates a relationship between the two variables.
Correlation is a statistical measure that deals with the strength of the relation between the two variables in question.
Correlation can be a positive or negative value. The correlation coefficient ranges from -1 to +1.
Two variables are considered to be positively correlated when an increase in the value of one variable is accompanied by an increase in the value of the other, and a decrease in one is accompanied by a decrease in the other.
Let us understand this better with the help of an example: Suppose you start saving your money in a bank, and the bank pays interest on the amount you save. The more money you keep in the bank, the more interest you earn. This way, the money stored in a bank and the interest obtained on it are positively correlated.
Let us take another example: While investing in stocks, it is usually said that the higher the risk of a stock, the higher its rate of return.
This shows a direct relationship between the two variables, since both of them increase or decrease together.
Two variables are considered to be negatively correlated when the value of one variable increases as the value of the other variable decreases.
Let us understand this with an example: Suppose a person is looking to lose weight. The basic idea behind weight loss is reducing calorie intake. When fewer calories are consumed and a significant number of calories are burnt, the rate of weight loss is quicker. This means that when the amount of junk food eaten decreases, weight loss increases.
Let us take another example: Suppose the price of a popular non-essential product goes up. When this happens, fewer people purchase it and the demand reduces. This means that when the price of the product increases, the demand for the product reduces.
An inverse relationship is observed between the two variables, since one value increases while the other decreases.
When two variables are uncorrelated, there is no (linear) relationship between them. This is also known as zero correlation: a change in one variable is not accompanied by any systematic change in the other variable.
Let us understand this with the help of an example: An increase in the height of our friend or neighbour doesn’t affect our own height, since our height is independent of theirs.
Correlation is used when there is a requirement to see if the two variables that are being worked upon are related to each other, and if they are, what the extent of this relationship is, and whether the values are positively or negatively correlated.
Pearson’s correlation coefficient is a popular measure of the linear correlation between two variables.
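To make the idea concrete, here is a minimal sketch of Pearson’s correlation coefficient in plain Python. The function name `pearson_r` and both small datasets are made up for illustration (a flat 5% interest rate and an invented price/sales series); they are assumptions, not data from any real source.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (covariance numerator).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variance numerators).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: bank balance vs. interest earned at an assumed flat 5% rate.
balance = [1000, 2000, 3000, 4000, 5000]
interest = [0.05 * b for b in balance]

# Hypothetical data: price of a non-essential product vs. units sold.
price = [10, 12, 14, 16, 18]
units_sold = [200, 180, 150, 130, 90]

r_positive = pearson_r(balance, interest)   # close to +1
r_negative = pearson_r(price, units_sold)   # close to -1
print(round(r_positive, 3), round(r_negative, 3))
```

A value near +1 corresponds to the positive correlation described earlier, a value near -1 to the negative correlation, and a value near 0 to zero correlation.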
Regression is the type of analysis that helps in the prediction of a dependent value when the value of the independent variable is given. For example, consider a dataset that contains two variables (or columns, if visualized as a table), with a few rows of values given for both variables. One or more values of one of the variables (or columns) would be missing and need to be found. One of the variables depends on the other, so an equation can be formed that represents the relationship between the two variables. Regression uses this equation to predict the missing value.
Note: The idea behind any regression technique is to ensure that the difference between the predicted and the actual value is minimal, thereby reducing the error that occurs during the prediction of the dependent variable with the help of the independent variable.
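One common way to quantify "the difference between the predicted and the actual value" is the mean squared error (MSE). The sketch below assumes MSE as the error measure (the article does not name a specific one), and the numbers are invented for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual values and two candidate sets of predictions.
actual = [3.0, 5.0, 7.0, 9.0]
good_fit = [3.1, 4.9, 7.2, 8.8]   # predictions close to the actual values
poor_fit = [5.0, 5.0, 5.0, 5.0]   # a constant guess that ignores the input

mse_good = mean_squared_error(actual, good_fit)
mse_poor = mean_squared_error(actual, poor_fit)
print(mse_good < mse_poor)
```

A regression technique would prefer the model behind `good_fit`, since its error is smaller.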
There are different types of regression and some of them have been listed below:
Linear regression is one of the basic kinds of regression. In its simplest form it involves two variables, where one variable is known as the ‘dependent’ variable and the other is known as the ‘independent’ variable. Given a dataset, a linear equation is formed with the help of these two variables, and this equation is used to fit a straight line to the given data. This straight line is then used to predict the value of the dependent variable for a given value of the independent variable. The predicted values are usually continuous.
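As an illustration, a straight line can be fitted to two variables with the ordinary least squares formulas for slope and intercept. This is a minimal sketch; the helper name `fit_line` and the hours/score data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied (independent) vs. test score (dependent).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, score)
# Predict the (continuous) score for a value that is not in the dataset.
predicted = intercept + slope * 6
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

The fitted line is then used exactly as described above: plug in a value of the independent variable and read off the predicted dependent value.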
Logistic regression is used when the dependent variable is categorical rather than continuous. There are different types of logistic regression:
Binary logistic regression is a regression technique wherein only two categories of outcome are possible, i.e. 0 or 1, yes or no, true or false and so on.
Multinomial logistic regression helps predict an outcome that belongs to one of more than two classes or categories. In other words, this algorithm is used to predict a nominal dependent variable.
Ordinal logistic regression deals with dependent variables whose categories have a natural order or rank, and predicts them with the help of independent variables.
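The binary case can be sketched with one feature, the sigmoid function, and plain gradient descent on the log-loss. This is only an illustrative sketch: the function names, learning rate, number of epochs, and the study-hours data are all assumptions chosen for the example.

```python
from math import exp

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors for the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: hours studied vs. whether the exam was passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 1.0 + b)    # estimated probability of passing after 1 hour
p_high = sigmoid(w * 4.0 + b)   # estimated probability of passing after 4 hours
print(p_low < 0.5 < p_high)
```

The model outputs a probability, and the binary prediction is obtained by comparing it with a threshold such as 0.5.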
Ridge regression is also known as L2 regularization. It is a regression technique that finds the coefficients of a linear regression model with the help of an estimator known as the ridge estimator. It is used as an alternative to the popular ordinary least squares method, since it trades a small amount of bias for lower variance, which often gives more stable coefficients when the predictors are highly correlated. It shrinks coefficients towards zero but doesn’t eliminate them, so it does not produce sparse models.
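For a single feature, the L2 penalty has a simple closed form, which makes the shrinking effect easy to see. The sketch below assumes the objective "sum of squared errors plus `lam` times the squared slope", with the intercept left unpenalized; `fit_ridge`, `lam`, and the data are hypothetical names and values.

```python
def fit_ridge(xs, ys, lam):
    """One-feature ridge fit: minimize sum((y - a - b*x)^2) + lam * b^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # The penalty lam is added to the denominator, shrinking the slope.
    slope = sxy / (sxx + lam)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied vs. test score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

ols_slope, _ = fit_ridge(hours, score, lam=0.0)       # no penalty: ordinary least squares
ridge_slope, _ = fit_ridge(hours, score, lam=10.0)    # moderate penalty
heavy_slope, _ = fit_ridge(hours, score, lam=1000.0)  # heavy penalty
print(round(ols_slope, 3), round(ridge_slope, 3), round(heavy_slope, 3))
```

Even with a very large penalty the slope gets small but never becomes exactly zero, which is why ridge regression does not produce sparse models.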
LASSO is an acronym that stands for ‘Least Absolute Shrinkage and Selection Operator’. It is a type of linear regression that uses the concept of ‘shrinkage’. Shrinkage is a process by which values are shrunk towards a central point (this could be the mean, median, etc.). It helps in creating simple, easy-to-understand, sparse models, i.e. models that have fewer parameters to deal with.
Lasso regression is well suited for models that have high levels of collinearity, or for cases where certain parts of model selection (such as variable selection or parameter elimination) should be automated.
Lasso regression performs L1 regularization. L1 regularization is a technique that adds a penalty proportional to the sum of the absolute values of the coefficients in the equation. This results in simple, easy-to-use, sparse models that contain fewer coefficients, since some of these coefficients can be shrunk to exactly 0 and hence eliminated from the model altogether. This way, the model becomes simple.
It is said that Lasso regression is easier to work with and understand in comparison to ridge regression.
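The way an L1 penalty can set a coefficient to exactly 0 can be sketched for a single feature, where the solution is given by soft-thresholding. The sketch assumes the objective "half the sum of squared errors plus `lam` times the absolute slope", with an unpenalized intercept; `soft_threshold`, `fit_lasso`, `lam`, and the data are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value towards 0 by lam, returning exactly 0 if |value| <= lam."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def fit_lasso(xs, ys, lam):
    """One-feature lasso fit: minimize 0.5 * sum((y - a - b*x)^2) + lam * |b|."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = soft_threshold(sxy, lam) / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied vs. test score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

ols_slope, _ = fit_lasso(hours, score, lam=0.0)      # no penalty
lasso_slope, _ = fit_lasso(hours, score, lam=11.0)   # slope is shrunk
zero_slope, _ = fit_lasso(hours, score, lam=50.0)    # slope is eliminated
print(round(ols_slope, 3), round(lasso_slope, 3), zero_slope)
```

With a large enough penalty the coefficient is removed from the model altogether, which is the sparsity property described above.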
There are significant differences between both these statistical concepts.
Difference between Correlation and Regression
Let us summarize the difference between correlation and regression with the help of a table:
In this article, we understood the significant differences between two statistical techniques, namely correlation and regression, with the help of examples. Correlation measures the strength and direction of the relationship between two variables, whereas regression deals with the prediction of values and curve fitting.