What is linear regression?

The classic linear regression simply explained - derivation and application examples

Linear regression is one of the most versatile statistical methods. It is useful for forecasting (e.g. predicting visitor numbers) and for examining relationships (e.g. the influence of advertising expenditure on sales volume). In this article we take a closer look at simple linear regression: how can the relationship between two variables be described and modeled? "Simple" is meant in two ways here. On the one hand, simple linear regression is a regression analysis in which only one predictor is taken into account. On the other hand, simplicity in the sense of a clear, comprehensible explanation is the leitmotif of this article. So don't be afraid of complicated formulas!

If you need assistance in performing or interpreting a regression, our statisticians will be happy to help. Contact us for a free consultation & a non-binding offer.


Linear regression explained simply with a practical example

Imagine the following situation: the company Kuschelwuschel has spent many years developing a new hair restorer. This product is now to be tested clinically in a study with 10 healthy volunteers. The data obtained will then be used to quantify growth under the shampoo and to predict hair growth.

First, we use a scatter plot to represent the relationship graphically. A model is then used to calculate estimates and tests for the effects. The linear regression is carried out and interpreted using IBM SPSS as an example.

In this blog we describe simple linear regression - simply explained. We'll have our hands full with that. You can find more information on other models in our glossary.


This article answers these questions:

  • Why do you need a linear regression?
  • What is the method of least squares (OLS)?
  • What is a scatter diagram?
  • How well does my regression line describe the data?
  • How do I know if a predictor is significant?
  • What do you have to consider with linear regression? Which requirements have to be met?
  • What is meant by the term “multiple linear regression”?

A hairy affair and a linear regression

Our scalp hair grows 13 cm a year.

If we could leave this statement as it is, the research question of the Kuschelwuschel company would be largely answered. However, such a deterministic statement with an exact relationship does not correspond to reality: the hair on a person's head grows at different rates within a fixed period of time. Growth depends, for example, on the season, gender, hair care, weather conditions, age, genetic disposition, and more. In addition, there is often a large element of chance in the form of unpredictable measurement errors or disturbances.

The first step is to get to know and describe the relationships between the variables.

Let us consider the measured hair lengths of 10 students who were treated with the innovative shampoo in the Kuschelwuschel study:

| ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|----|---|---|---|---|---|---|---|---|---|----|
| X: hair length in cm at the beginning | 15 | 3 | 7.5 | 12.5 | 56 | 34.5 | 12.8 | 56.8 | 25.4 | 29 |
| Y: hair length in cm after 10 weeks (70 days) of treatment | 16.3 | 4 | 14.5 | 15 | 58 | 40 | 22 | 65 | 19 | 30 |

To simplify the notation, the measured values ​​at the start of the study are designated as X, the hair lengths after 70 days as Y.

In general, X is called the explanatory variable, independent variable, or predictor.

Y, on the other hand, is called the dependent variable, target variable, or response.

The pairs of measured values (X, Y) can be illustrated in a so-called scatter plot. The influencing variable is plotted on the x-axis, the target variable on the y-axis.

Hair lengths from 10 subjects

A positive linear relationship can be seen in the diagram. Accordingly, the Pearson correlation coefficient is r = 0.973.
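The correlation can also be reproduced outside SPSS. The following Python sketch uses the (partly rounded) values from the table above, so small deviations from the published output are possible:

```python
import numpy as np

# Hair lengths from the table (X: at the beginning, Y: after 70 days)
x = np.array([15, 3, 7.5, 12.5, 56, 34.5, 12.8, 56.8, 25.4, 29])
y = np.array([16.3, 4, 14.5, 15, 58, 40, 22, 65, 19, 30])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation coefficient r
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")  # close to the r = 0.973 reported above
```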

By what criteria do you fit an optimal straight line through the point cloud?

A straight line through the point cloud is given by the formula Y = a + b \cdot X. Here a denotes the y-axis intercept and b the slope of the regression line. In regression, a is also called the intercept and b the regression coefficient or slope.

1. The regression line should pass through the point given by the mean of the X values (\bar{X}) and the mean of the Y values (\bar{Y}). First, the arithmetic means of the X and Y values are calculated:

\bar{X} = \frac{1}{n} \cdot \sum_{i=1}^{n} x_i = \frac{1}{10} \cdot \left(15 + 3 + 7.5 + 12.5 + 56 + 34.5 + 12.8 + 56.8 + 25.4 + 29\right) = 25.23

\bar{Y} = \frac{1}{n} \cdot \sum_{i=1}^{n} y_i = \frac{1}{10} \cdot \left(16.3 + 4 + 14.5 + 15 + 58 + 40 + 22 + 65 + 19 + 30\right) = 28.38

2. Now the straight line is laid through the point \left(\bar{X}, \bar{Y}\right) in such a way that the overall deviation of the observed Y values from the Y values of the regression line is minimized. The differences between the observed y values and the y values predicted by the regression line are called residuals. Residuals can be positive or negative, depending on whether the data points lie above or below the regression line. Residuals that are very large in absolute value indicate an "unfavorable" straight line. So that extreme residuals receive a high weight, the individual residuals are squared. The optimal regression line is the line for which the sum of the squared residuals is as small as possible.

This procedure is called the method of least squares (OLS: ordinary least squares). The residual squares can also be illustrated in the scatter plot:

Residual squares with optimal regression line

Every other straight line has a larger sum of squared residuals.

Residual squares for an arbitrary straight line, here y = 34: the sum of the blue areas is significantly larger than for the optimal regression line in the diagram above.
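This claim can be checked numerically. A small Python sketch (an illustrative recalculation with the table values, not part of the original SPSS analysis) compares the sum of squared residuals of the least-squares line with that of the arbitrary line y = 34:

```python
import numpy as np

x = np.array([15, 3, 7.5, 12.5, 56, 34.5, 12.8, 56.8, 25.4, 29])
y = np.array([16.3, 4, 14.5, 15, 58, 40, 22, 65, 19, 30])

# Least-squares fit of degree 1: returns slope b and intercept a
b, a = np.polyfit(x, y, deg=1)

ssr_fit = np.sum((y - (a + b * x)) ** 2)   # squared residuals, optimal line
ssr_flat = np.sum((y - 34) ** 2)           # squared residuals, line y = 34

print(ssr_fit < ssr_flat)  # True: the fitted line has the smaller sum
```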

With this requirement, the regression line is uniquely determined, and the intercept a and the regression coefficient b can be calculated.

The formulas for the calculation can be found in the statistics textbooks. In the following we will describe the calculation with statistical software.
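For readers without SPSS, the textbook formulas can also be evaluated directly, for example in Python. This sketch uses the (partly rounded) table values, so the results agree with the SPSS output below only up to rounding:

```python
import numpy as np

x = np.array([15, 3, 7.5, 12.5, 56, 34.5, 12.8, 56.8, 25.4, 29])
y = np.array([16.3, 4, 14.5, 15, 58, 40, 22, 65, 19, 30])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by sum of squared X deviations
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: the line passes through the point of means
a = y_bar - b * x_bar

print(f"a = {a:.3f}, b = {b:.3f}")
```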

In concrete terms: linear regression explained simply with SPSS

First, enter the 10 measured values into the SPSS data editor. An introduction to the program can be found in our glossary article on SPSS. Then select the menu item Analyze - Regression - Linear. The measured values at the end of the study are entered as the dependent variable; the hair lengths at the beginning of the study are selected as the independent variable.

Call of the linear regression in SPSS version 25

Simple linear regression menu in SPSS with definition of the dependent and independent variables

This is how you interpret a linear regression with SPSS

If you confirm the information, you will receive the following calculations in the output window:

Linear regression output window SPSS

Four tables appear in the output window:

  • Variables Entered/Removed: This summarizes which dependent and independent variables are included in the model.
  • Model summary: This table gives measures of model quality. R is the Pearson correlation coefficient and R-squared its squared value. R² is also known as the coefficient of determination. It indicates how much of the variability in the data is explained by the model. If R² takes the value 1, all points lie exactly on the straight line; the closer the value is to 1, the "closer" the data are to the line. R² increases automatically when further variables are added, even without any real gain in information. For this reason, SPSS also reports an adjusted R², which corrects for the number of influencing variables. In our example with only one predictor, this makes no difference.
  • ANOVA: In the analysis of variance, an F-test first checks whether the model as a whole is significant. This makes it possible to decide whether the prediction of the target variable is improved by the independent variables in the model. To do this, the total variance of the data is divided into two parts: the first part is the contribution explained by the model; the second part is an unexplained, random component. The more of the variability in the data the regression can explain, the better the model.

In the example, the model is significant with a p-value <0.001.

  • Coefficients: The estimates for the regression line can be found in this table. The constant denotes the intercept (a). This results in the following regression line in the example:
    Y = 2.648 + 1.020 \cdot X

Interpret regression coefficients meaningfully

The intercept indicates the constant amount by which the hair grows within the 70 days, regardless of the initial length. The regression coefficient b = 1.020 shows how the final hair length depends on the initial length. The factor is positive, since people with longer initial hair show greater growth in the observation period. If the initial hair length increases by one unit (cm), the predicted final length increases by 1.020 cm. This effect is easy to see when the formula is used for prediction: Susi's hair is 50 cm long. In 70 days it will have an estimated length of 2.648 + 1.020 \cdot 50 \, \text{cm} = 53.648 cm. The hair of Susi's friend is 10 cm longer; after 70 days it is estimated to be 10 \cdot 1.020 \, \text{cm} = 10.2 cm longer than Susi's, i.e. 63.848 cm long.
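The prediction can be written as a one-line function; the coefficients below are taken from the regression equation above:

```python
def predict_length(x0: float) -> float:
    """Predicted hair length (cm) after 70 days for initial length x0 (cm)."""
    return 2.648 + 1.020 * x0

print(f"{predict_length(50):.3f} cm")  # Susi: 53.648 cm
print(f"{predict_length(60):.3f} cm")  # her friend: 63.848 cm
```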

The standard error is a measure of the dispersion of the estimated regression parameters. The column of standardized coefficients is not relevant in a regression model with a single influencing variable. The last two columns of the table contain the results of the statistical test: a t-test checks the null hypothesis that the parameter equals zero, i.e. that it has no effect. Significant p-values show that the variable has a demonstrable effect on the outcome. In contrast to the ANOVA, each coefficient is tested individually here.

Splitting hairs: Requirements for linear regression simply explained

First of all, the relationship between the target variable and the influencing variable must be linear. If necessary, transformations can be used to achieve this. The Pearson correlation coefficient is a measure of the strength of the linear relationship between two variables, and the relationship can be examined visually in the scatter plot.

In addition, the following three conditions must be met with regard to the residuals:

1. The residuals are independent of each other

This requirement is usually met when a genuine random sample is available in which all observations are independent of one another. In case of doubt, the Durbin-Watson test can also be used: a significant p-value indicates autocorrelation of the residuals. However, p-values > 0.05 do not prove the independence of the residuals.
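The Durbin-Watson statistic itself is easy to compute by hand: it is the sum of squared differences of successive residuals divided by the sum of squared residuals. A minimal Python sketch with the example data (illustrative only; SPSS and dedicated packages also provide significance bounds, which this sketch does not):

```python
import numpy as np

x = np.array([15, 3, 7.5, 12.5, 56, 34.5, 12.8, 56.8, 25.4, 29])
y = np.array([16.3, 4, 14.5, 15, 58, 40, 22, 65, 19, 30])

b, a = np.polyfit(x, y, deg=1)
resid = y - (a + b * x)

# Durbin-Watson statistic: always between 0 and 4; values near 2 suggest
# no autocorrelation, values near 0 or 4 suggest positive or negative
# autocorrelation of the residuals
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(f"DW = {dw:.2f}")
```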

2. The residuals are approximately normally distributed

This is best checked graphically in a histogram of the residuals. The histogram should be symmetrical about a center and, with larger sample sizes, should approach a normal distribution; with small sample sizes, a perfect match cannot be expected. Normality can also be checked with a test. Due to the nature of statistical testing, however, the Kolmogorov-Smirnov test can only demonstrate a deviation from the normal distribution, not confirm normality.

Residual histogram

3. The spread of the residuals is constant over the entire value range of Y (homoscedasticity)

If you plot the residuals against the predicted values in a scatter plot, no pattern should be visible. If homoscedasticity is present, the points are evenly distributed over the entire range of predicted values.

Residual plot: predicted value vs. residual, standardized
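These residual checks can also be prepared outside SPSS. A minimal sketch with the example data that computes the (standardized) residuals needed for the histogram and the residual plot, and verifies the OLS property that the residuals sum to zero:

```python
import numpy as np

x = np.array([15, 3, 7.5, 12.5, 56, 34.5, 12.8, 56.8, 25.4, 29])
y = np.array([16.3, 4, 14.5, 15, 58, 40, 22, 65, 19, 30])

b, a = np.polyfit(x, y, deg=1)
fitted = a + b * x
resid = y - fitted

# OLS residuals always sum to (numerically) zero
print(f"sum of residuals = {resid.sum():.2e}")

# Standardized residuals, as used in the SPSS residual plot;
# ddof=2 because two parameters (a and b) were estimated
resid_std = resid / resid.std(ddof=2)

# For the graphical checks, plot a histogram of resid_std and a
# scatter plot of fitted against resid_std, e.g. with matplotlib.
```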

Hair growth in n-dimensional space: The multiple linear regression

The same ideas can be used to describe a target variable in terms of many influencing variables. In this case one speaks of a multiple linear regression. The associated regression model has the form:

Y = a + b_1 \cdot X_1 + b_2 \cdot X_2 + \ldots + b_n \cdot X_n.

Other influencing variables could be, for example, gender or age. All of the above assumptions of the simple model apply analogously. Such problems can be handled efficiently with matrix algebra and a computer; the calculations then take place in n-dimensional space. For graphical representation, however, one is limited to the two-dimensional plane, so the graphical relationships must be examined in pairs.
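In matrix form, the coefficients are obtained from a design matrix that contains a column of ones for the intercept. A minimal Python sketch with synthetic data (the predictors and true coefficients are made up purely for illustration, so the least-squares solution can be checked against known values):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 60, n)      # e.g. initial hair length (hypothetical values)
x2 = rng.uniform(18, 70, n)     # hypothetical second predictor, e.g. age
y = 2.0 + 1.5 * x1 - 0.1 * x2   # exact linear relationship, no noise

# Design matrix: column of ones (intercept), then the predictors
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution of X @ coef = y
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # recovers approximately [2.0, 1.5, -0.1]
```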

Summary: linear regression explained in simple terms

Regression relates a target variable to one or more independent variables. In linear regression, a linear relationship between the target variable and the influencing variables is assumed. With the help of statistical software, the estimates for the intercept and the regression coefficients can be determined from the available data. The regression coefficients can then be tested with a t-test. The coefficient of determination R² provides a quality criterion for how well the model describes the data. An analysis of variance (ANOVA) can be used to test whether the regression model improves the prediction of the target variable. As prerequisites for the calculations, the residuals must be independent of one another, normally distributed, and homoscedastic.

When modeling several influencing variables, one speaks of a multiple linear regression. The model is transferred to the n-dimensional space.

In this article we have given you an overview of the subject of regression and explained linear regression in a simple way. If you have any questions about specific aspects of regression, please do not hesitate to contact us. If you have any questions or problems relating to evaluation, interpretation and all other statistical matters, our Novustat experts will be happy to assist you.

Further sources:

Linear regression with SPSS version 25

Review article: Multiple linear regression
