# Can I forecast with logit regression

## Method advice

### Quick start

Logistic regression (SAV, 19 KB)

### 1. Introduction

Binary logistic regression analysis is used to test whether there is a relationship between a dependent binary variable and one or more independent variables.

In contrast to simple and multiple regression analysis, the dependent variable is binary. That is, it has only two values; e.g. a variable *Heart attack* that takes on the following values: 1 for "yes, has already had a heart attack" and 0 for "no, has not had a heart attack". Such variables are also called "dichotomous". The independent variables, on the other hand, are interval-scaled or coded as dummy variables.

For ordinally scaled dependent variables and for nominal dependent variables with more than two values (e.g. the variable *Hair color* with the values: brown, blonde, black or red) there are extensions of logistic regression analysis: ordinal logistic regression and multinomial logistic regression. These will not, however, be discussed in more detail here.

Binary logistic regression analysis examines the relationship between the probability that the dependent variable takes the value 1 and the independent variables. This means that it is not the value of the dependent variable that is predicted, but the probability that the dependent variable will take the value 1. Furthermore, the requirements are less restrictive than in the linear regression analysis.

It should be noted at this point that every postulated causal relationship must be theoretically justified.

The question of the logistic regression analysis is often shortened as follows:

"Do the independent variables have an influence on the probability that the dependent variable takes on the value 1? How strong is their influence?"

### 1.1. Examples of possible questions

- Do diet, exercise and a person's perception of stress have an influence on the likelihood of bone marrow decline (binary variable with the values "bone marrow decline detectable" and "no bone marrow decline apparent")?
- How strong is the connection between participation in further training and the self-confidence, the managerial authority of the employer and the income of the potential participant?
- What affects the likelihood of Christmas decorations being bought? The number of snowy days, the outside temperature or the number of days until December 24th?
- Can gender, age, education, and occupation predict the likelihood that a particular TV show will be watched?

### 1.2. Requirements

✓ The dependent variable is binary (0/1 coded)

✓ The independent variables are metric or, in the case of categorical variables, coded as dummy variables

✓ Each group formed by the categorical predictors has n ≥ 25

✓ The independent variables are not highly correlated with one another

### 2.1. Example of a study

A bank is interested in facts related to the likelihood of someone buying shares. It therefore commissions a market research institute to interview 700 people. It is assumed that the decision to buy shares is influenced by annual income (in thousands of CHF), risk tolerance (scale from 0 to 25) and interest in the current market situation (scale from 0 to 45).

The data set to be analyzed therefore contains a respondent number (*ID*), a variable for share purchase (*Share purchase*: 0 = no, 1 = yes), the annual income (*Income*), the willingness to take risks (*Willingness to take risks*) and the interest in the current market situation (*Interest*).

- Figure 1: Example data

The dataset can be downloaded from Quick Start.

### 2.2. The maximum likelihood estimate

### The logistic regression model

Logistic regression analysis is based on maximum likelihood estimation (MLE) and thus differs from the method of least squares, which is used in linear regression analyses. As in linear regression analysis, an attempt is made to find a function curve that fits the data as well as possible. In contrast to linear regression analysis, however, this function is not a straight line but a logistic function (see Figure 2). It is s-shaped, symmetrical and runs asymptotically towards y = 0 and y = 1. This means that the logistic function only takes values between 0 and 1.

The values of the logistic function are interpreted as the probability that the dependent variable y takes the value 1 (given the independent variables x_{k}), because a logistic regression model does not predict the values of the dependent variable y, but the probability of occurrence of y = 1. A value close to 0 means that y = 1 is very unlikely to occur, while a value close to 1 means that y = 1 is very likely to occur.

- Figure 2: Logistic function

The logistic regression function is as follows:

P(y = 1) = 1 / (1 + e^{-z})

With

- P(y = 1): probability that y = 1
- e: base of the natural logarithm, Euler's number
- z: logit (linear regression model of the independent variables)
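The logistic function above can be sketched in a few lines of Python (the z values below are illustrative, not part of the example):

```python
import math

def logistic(z):
    """Logistic function: maps any real logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The function is s-shaped, symmetric around z = 0 (where P = 0.5),
# and runs asymptotically towards 0 and 1.
print(logistic(0))    # 0.5
print(logistic(4))    # close to 1
print(logistic(-4))   # close to 0
```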

z, the so-called "logit", represents a linear regression model:

z = β_{0} + β_{1}x_{1} + β_{2}x_{2} + ... + β_{k}x_{k} + ε

With

- x_{1}, ..., x_{k}: independent variables
- β_{0}, ..., β_{k}: regression coefficients
- ε: error term

If the logit is now inserted into the logistic function, the result is:

P(y = 1) = 1 / (1 + e^{-(β_{0} + β_{1}x_{1} + ... + β_{k}x_{k})})

Maximum likelihood estimation

The regression coefficients are estimated by the maximum likelihood estimation (MLE) algorithm. MLE determines the regression parameters in such a way that it predicts the highest possible probabilities for the observed y-values when y = 1 and the lowest possible probabilities when y = 0. MLE maximizes a "likelihood function" that states how likely it is that the value of a dependent variable can be predicted by the independent variables. The value of the likelihood function can be used to estimate the model quality and model significance, as will be seen below.
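As an illustration of the idea behind MLE, the following is a minimal sketch that maximizes the log likelihood by plain gradient ascent on a made-up toy data set (SPSS itself uses a more sophisticated iterative algorithm; the data, learning rate and step count here are assumptions for illustration only):

```python
import math

def predict(beta, x):
    """P(y = 1 | x) under the logistic model with coefficients beta."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, data):
    """The log likelihood that MLE maximizes."""
    return sum(y * math.log(predict(beta, x)) + (1 - y) * math.log(1 - predict(beta, x))
               for x, y in data)

def fit(data, n_predictors, lr=0.5, steps=2000):
    """Gradient ascent on the log likelihood (toy substitute for SPSS's algorithm)."""
    beta = [0.0] * (n_predictors + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in data:
            err = y - predict(beta, x)   # gradient of the LL w.r.t. the logit
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        beta = [b + lr * g / len(data) for b, g in zip(beta, grad)]
    return beta

# Toy data: y = 1 becomes more likely as x grows (with some overlap).
data = [([0.0], 0), ([1.0], 0), ([2.0], 0), ([2.0], 1),
        ([3.0], 0), ([3.0], 1), ([4.0], 1), ([5.0], 1)]
beta = fit(data, n_predictors=1)
print(beta[1] > 0)   # positive coefficient: x raises P(y = 1)
```

The fitted coefficients yield higher predicted probabilities for the observed y = 1 cases and lower ones for the y = 0 cases than the null model, exactly the criterion described above.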

Interpretation of the regression coefficients

The regression coefficients are no longer interpreted in the same way in the logistic regression as they were in the linear regression. A look at the logistic regression function shows that the relationship is not linear, but more complex. What still applies is the "sign interpretation": If the sign of a regression coefficient is positive, an increase in the relevant independent variable causes an increase in the probability that y = 1. If the sign is negative, this means a decrease in the probability.

The relationship between an independent variable and the dependent variable can be interpreted more precisely by means of so-called "odds" (betting odds). To calculate the odds, the probability that the event occurs is related to the probability that it does not occur. Odds are calculated as follows:

odds = P(y = 1) / (1 − P(y = 1))

So-called "odds ratios" are used to interpret a regression coefficient. These are the ratio of two odds. SPSS refers to the odds ratio of a variable as "Exp(B)" because it can also be calculated as e^{β} (β stands for the regression coefficient, e for Euler's number). SPSS outputs an odds ratio for each independent variable in the model.

The following relationship is derived from this and is useful for interpreting the regression coefficients:

odds ratio = odds_{after} / odds_{before} = e^{β} = Exp(B)

The odds ratio of an independent variable gives the **change in the relative probability** of y = 1 if this independent variable increases by one unit, all other variables in the model held constant. That is, the odds ratio of an independent variable is the **factor** by which the odds change if this variable increases by one unit. If an odds ratio (Exp(B)) is 1, the odds are multiplied by 1 and thus do not change (odds_{after} = odds_{before}). An odds ratio > 1 means an increase in the odds (odds_{after} > odds_{before}), while an odds ratio < 1 means a decrease in the odds (odds_{after} < odds_{before}).

The relationship odds ratio = Exp(B) = e^{β} shows how odds ratios and regression coefficients are related: an odds ratio is 1 if the regression coefficient is 0 (B = 0), > 1 if the regression coefficient is positive (B > 0), and < 1 if the regression coefficient is negative (B < 0). Figure 3 gives an overview:
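These relationships can be checked numerically; the coefficient B = 0.35 and the starting probability 0.30 below are hypothetical values chosen purely for illustration:

```python
import math

def odds(p):
    """Odds: probability of the event relative to its non-occurrence."""
    return p / (1 - p)

B = 0.35                    # hypothetical positive regression coefficient
odds_ratio = math.exp(B)    # Exp(B), as reported by SPSS

# Increasing the variable by one unit multiplies the odds by Exp(B):
p_before = 0.30
odds_before = odds(p_before)
odds_after = odds_before * odds_ratio

print(odds_ratio > 1)            # B > 0, so Exp(B) > 1
print(odds_after > odds_before)  # the odds increase accordingly
```

With B = 0 the odds ratio would be e^{0} = 1 and the odds would not change, matching Figure 3.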

- Figure 3: Interpretation aid for regression coefficients and odds ratios (Exp (B))

### 3. Logistic regression analysis with SPSS

### 3.1. Formulation of the regression model

When formulating the regression model, it must be decided which variables are included as dependent and independent variables in the model. Theoretical considerations play a central role here. The model should be kept as simple as possible, so it is a good idea not to include too many independent variables. In the present example, *Share purchase* is the dependent variable whose probability of occurrence is predicted by *Income*, *Interest* and *Willingness to take risks*:

### 3.2. Methods of variable inclusion

Before performing the analysis, it must be decided in which order the independent variables should be included in the model. This can have an impact on the model that is reported at the end of the analysis. If all independent variables are completely uncorrelated, the order in which they are introduced into the model does not matter. In the social sciences and in market research, however, the variables are rarely completely uncorrelated. Thus the method of variable inclusion is relevant.

Methods of variable inclusion in SPSS

SPSS offers different approaches to include the independent variables in the model (see click sequence in Figure 4).

- **Enter:** A procedure for variable selection in which all variables of a block are entered in a single step.
- **Forward selection (conditional):** A stepwise variable selection method with an entry test based on the significance of the score statistic and a removal test based on the probability of a likelihood-ratio statistic computed from conditional parameter estimates.
- **Forward selection (likelihood ratio):** A stepwise variable selection method with an entry test based on the significance of the score statistic and a removal test based on the probability of a likelihood-ratio statistic computed from the maximum partial likelihood estimates.
- **Forward selection (Wald):** A stepwise variable selection method with an entry test based on the significance of the score statistic and a removal test based on the probability of the Wald statistic.
- **Backward elimination (likelihood ratio):** Backward selection is the reverse of forward selection. Step by step, independent variables are removed from the model, starting with the one that has the weakest relationship to the dependent variable; the removal test is based on the probability of a likelihood-ratio statistic computed from the maximum partial likelihood estimates. At the same time, the likelihood-ratio statistic is used to check whether the model would improve by re-adding a variable. (The backward Wald and backward conditional methods are not recommended.)
- **Backward elimination (conditional):** Backward stepwise selection in which the removal test is based on the probability of a likelihood-ratio statistic computed from conditional parameter estimates.
- **Backward elimination (Wald):** Backward stepwise selection in which the removal test is based on the probability of the Wald statistic.

In the present example, the regression model is based on well-founded theoretical considerations, which is why the "inclusion" method is chosen.

Hierarchical regression analysis

In addition, variables can also be introduced in **blocks** (hence also "blockwise regression"). This is done on top of the methods described so far and can be combined with them as required. Several groups of variables (blocks) are specified, which SPSS includes one after the other in the model. Within each block, the method already selected is used (e.g. "Enter" or "Forward LR"). In a study on the influence of environmental awareness on behavior, for example, the first block could be used to create a model that includes socio-demographic variables. In the second block, environmental awareness is introduced. In SPSS, not all variables are entered in the "Covariates" box at the same time, only those of the first block. Then "Next" is selected and the variables of the second block are inserted, and so on.

This procedure has two advantages: Firstly, the sequence in which the variables are included can be specified precisely in this way. Second, SPSS outputs a regression model for each step, i.e. after each block. As a result, the change in regression coefficients due to the inclusion of a specific additional variable can be observed and tests can be carried out to determine whether the model improves through the addition of the additional block.

In the context of the present example, all variables are inserted in the first block in order to keep the example as simple as possible.

### 3.3. SPSS commands

**SPSS menu:** Analyze > Regression > Binary Logistic

- Figure 4: Click sequence in SPSS

**Hints**

- Under **Method**, it is decided how the independent variables will be included in the model.
- Under Options you can find the **Classification plots** and the **Confidence intervals for Exp(B)**.
- The default setting of the **classification cutoff** is 0.5.

**SPSS syntax**

LOGISTIC REGRESSION VARIABLES Share purchase

/METHOD=ENTER Income Interest Willingness to take risks

/CLASSPLOT

/PRINT=ITER(1) CI(95)

/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

### 3.4. Significance of the regression model

To check whether the regression model is significant overall, a chi-square test is carried out (referred to in SPSS as the "omnibus test of the model coefficients"). This checks whether the model as a whole makes an explanatory contribution beyond predicting the modal value of *y*. For this purpose, the test uses the logarithm of the value of the likelihood function that was maximized in the course of the model estimation (see The maximum likelihood estimate). This logarithmized value is referred to as "log likelihood", or "LL" for short. To estimate the model quality, this value is multiplied by −2 (−2LL). The value −2LL describes an error term. As part of the significance test, the −2LL values of two models are compared: that of the postulated regression model and that of the so-called "base model". The base model is a model that only takes the constant into account. SPSS outputs the −2LL value of the postulated model in the "Model summary" table (see Figure 7), while the −2LL value of the base model is output when the iteration history is requested under "Options". The test statistic based on this comparison follows a chi-square distribution:

χ² = (−2LL_{base model}) − (−2LL_{postulated model})

With

- df: number of independent variables in the model

This means that the significance of the test statistic chi-square can be checked by comparing the test statistic with the critical value on a chi-square distribution defined by the corresponding number of degrees of freedom.

- Figure 5: SPSS output - verification of the model

The line "Model" in Figure 5 shows that the model as a whole is significant (Chi-Square (3) = 125.36, *p* <.001). For this reason the analysis can be continued. If the model as a whole were not significant, the analysis would not continue.

This table (Figure 5) contains other lines in addition to the "Model" line. These are only of interest if a method other than "Enter" is used or if a hierarchical regression analysis is carried out.

### 3.5. Significance of the regression coefficients

It is now checked whether the regression coefficients (betas) are also significant. A Wald test is carried out for each regression coefficient. The test statistic of the Wald test is calculated as follows:

z = β_{j} / σ_{β_{j}}

With

- β_{j}: regression coefficient of the variable x_{j} (see column "Regression coefficient B" in Figure 6)
- σ_{β_{j}}: standard error of β_{j} (see column "Standard error" in Figure 6)

The results of the Wald tests can be found in the "Wald" and "Sig." columns in Figure 6. SPSS reports the square of the Wald test statistic in the "Wald" column.

- Figure 6: SPSS output - regression coefficients

Figure 6 shows that the Wald tests for the regression coefficients of *Income* (Wald(1) = 14.651, *p* < .001), *Interest* (Wald(1) = 23.036, *p* < .001), *Willingness to take risks* (Wald(1) = 15.541, *p* < .001) and the constant (Wald(1) = 35.731, *p* < .001) turn out significant. The significant coefficients of the independent variables mean that their regression coefficients are not 0 and that these variables therefore have a significant influence on *Share purchase*.

Since the influence of the variables is interpreted via the odds ratios (Exp (B)), their significance is also checked: If the confidence interval of Exp (B) does not include the value 1, a significant influence is assumed. This applies to all of the independent variables examined (see Figure 6, column "95% confidence interval for EXP (B)").

This results in the following regression function:

For *Willingness to take risks* and *Interest*, the value of Exp(B) is > 1 (and the sign of B is correspondingly positive). Therefore, if the willingness to take risks increases by one unit, the relative probability that a person has already bought shares increases by 41.6% (1.416 − 1 = .416). If the interest increases by one unit, the relative probability that a person has already bought shares increases by 8.9% (1.089 − 1 = .089). For *Income*, Exp(B) is < 1 (and the sign of B is correspondingly negative). This means: if the income increases by one unit (1,000 Swiss francs), the relative probability that a person has already bought shares drops by 2.1% (.979 − 1 = −.021).
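The percentage interpretations above follow from (Exp(B) − 1) × 100, applied to the Exp(B) values from Figure 6:

```python
def percent_change(exp_b):
    """Percentage change in the odds when the predictor rises by one unit."""
    return (exp_b - 1) * 100

# Exp(B) values from Figure 6 of the example
print(round(percent_change(1.416), 1))  # Willingness to take risks: 41.6
print(round(percent_change(1.089), 1))  # Interest: 8.9
print(round(percent_change(0.979), 1))  # Income: -2.1
```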

### 3.6. Model quality

To assess the model quality, analogues of the R^{2} of linear regression are used. There is a large number of such "pseudo-R^{2}" measures; two of them are implemented in SPSS: the Cox and Snell R^{2} and the Nagelkerke R^{2}. The Cox and Snell R^{2} is calculated as follows:

R^{2}_{CS} = 1 − e^{(2/n)(LL_{0} − LL_{model})}

With

- n: sample size
- e: Euler's number
- LL_{0}, LL_{model}: log likelihood of the base model and of the postulated model

The Nagelkerke R^{2} is calculated as follows:

R^{2}_{N} = R^{2}_{CS} / (1 − e^{(2/n)LL_{0}})

The Nagelkerke R^{2} standardizes the Cox and Snell R^{2} so that it can only take values between 0 and 1. The higher the R^{2} value, the better the fit between model and data (hence "goodness of fit").
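Both pseudo-R² values can be reproduced approximately from the model chi-square reported in Figure 5 (125.36) and n = 700, assuming (as the classification results suggest) that 183 of the 700 respondents bought shares and that the base model predicts this observed proportion:

```python
import math

n = 700
chi_square = 125.36   # (-2LL of base model) - (-2LL of postulated model), Figure 5
n_events = 183        # respondents with Share purchase = 1 (assumed from Figure 8)

# -2LL of the base model (constant only): predicts the observed proportion
p = n_events / n
minus_2ll_base = -2 * (n_events * math.log(p) + (n - n_events) * math.log(1 - p))

r2_cox_snell = 1 - math.exp(-chi_square / n)
r2_max = 1 - math.exp(-minus_2ll_base / n)   # upper bound of Cox & Snell R^2
r2_nagelkerke = r2_cox_snell / r2_max

print(round(r2_cox_snell, 2))    # about 0.16
print(round(r2_nagelkerke, 2))   # 0.24, matching Figure 7
```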

- Figure 7: SPSS output - model quality

The Nagelkerke R^{2} for the present example is .24, as can be seen in Figure 7.

Predicted probabilities and observed values

The logistic regression function calculates the probabilities that the dependent variable will take the value 1. These probabilities vary between 0 and 1. This information can be used to take a closer look at the result if the "Classification diagrams" have been selected under "Options" (or alternatively - not explained in more detail here - if the predicted probabilities have been saved as variables and can then be analyzed separately).

SPSS uses a probability of 50.0% (0.500, see footnote in Figure 8) as the cut-off value to determine whether y = 0 or y = 1 is predicted. From a predicted probability of 0.500 upwards, *Share purchase* = 1 is predicted; with a lower probability, *Share purchase* = 0 is predicted. If the proportions of cases with y = 0 and y = 1 are roughly equal, a cut-off value of 0.500 is suitable; otherwise the cut-off can be set to the proportion of cases with y = 1, which can be read from the classification table. The cut-off value can be set in SPSS under "Options" under "Classification cutoff". The result with the default setting of 0.500 can be seen in Figure 8:

- Figure 8: SPSS output - classification table

In total, 76.1% of the people were classified by the model according to their actual answer. Of those who have never bought shares, 485 out of a total of 517 (485 + 32) were correctly predicted. This corresponds to 93.8% correct forecasts. Of those people who bought shares, only 48 out of a total of 183 (135 + 48) share purchases were correctly predicted. This corresponds to 26.2% correct forecasts.
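The percentages above can be reproduced from the four cell counts of the classification table (Figure 8):

```python
# Cell counts from Figure 8 (cut-off 0.500)
tn = 485   # observed 0, predicted 0 (correct non-buyers)
fp = 32    # observed 0, predicted 1
fn = 135   # observed 1, predicted 0
tp = 48    # observed 1, predicted 1 (correct buyers)

total = tn + fp + fn + tp            # 700 respondents
overall = (tn + tp) / total          # share of correct classifications
correct_no = tn / (tn + fp)          # correct among non-buyers
correct_yes = tp / (tp + fn)         # correct among buyers

print(round(overall * 100, 1))       # 76.1
print(round(correct_no * 100, 1))    # 93.8
print(round(correct_yes * 100, 1))   # 26.2
```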

The "diagram of the observed groups and predicted probabilities" represents a kind of histogram and also illustrates the relationship between predicted probabilities, correspondingly classified predictions for y and the observed values (Figure 9).

- Figure 9: SPSS Output - Plot of observed groups and predicted probabilities

The letters in the diagram represent observations (whether each letter stands for one or more observations can be found in a footnote of the diagram). They reproduce the observed values of the variable *Share purchase* ("Y" stands for 1 and "N" for 0). The line below the diagram (the x-axis) shows the predicted probabilities, and immediately below it the classification based on them ("N" if the probability is < .500, and "Y" if the probability is ≥ .500). In the ideal case, the prediction of y is not only correct but also as clear-cut as possible; that is, few people have medium probabilities.

All "Y" in the left half and all "N" in the right half of the diagram thus correspond to false predictions. If these were counted and multiplied by a factor of 2.5 (see note below in Figure 9), 32 N and 135 Y should be found in the "wrong" half. This diagram also shows that *Share purchase *= 1 is rather poorly predicted.

### 3.7. Calculation of the effect size

Effect sizes are calculated to assess the practical relevance of a result. In the example, the R-square is .24, but the question arises whether this is large enough to be considered meaningful.

There are different ways to measure effect size. Among the best known are Cohen's effect size (*d*) and Pearson's correlation coefficient (*r*). The correlation coefficient is well suited because the effect size always lies between 0 (no effect) and 1 (maximum effect). However, if the groups differ considerably in size, it is recommended to choose Cohen's *d*, since *r* can be distorted by the size differences.

The R-square that is output in regression analyses can be converted into an effect size *f*^{2} according to Cohen (1992). The range of values for this effect size is between 0 and infinity.

f^{2} = R^{2} / (1 − R^{2})

With

- f^{2}: Cohen's effect size
- R^{2}: R-square

For the above example this results in the following effect size:

f^{2} = .24 / (1 − .24) = .32
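The conversion can be verified in one line:

```python
r_square = 0.24                       # Nagelkerke R^2 from Figure 7
f_square = r_square / (1 - r_square)  # Cohen's (1992) effect size f^2
print(round(f_square, 2))             # 0.32
```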

In order to assess how big this effect is, one can orientate oneself on the classification of Cohen (1988):

- *f*^{2} = .02 corresponds to a **weak** effect
- *f*^{2} = .15 corresponds to a **medium** effect
- *f*^{2} = .35 corresponds to a **strong** effect

So the effect size of .32 corresponds to a medium effect.

### 3.8. A typical statement

A logistic regression analysis shows that both the model as a whole (chi-square(3) = 125.36, *p* < .001, *n* = 700) and the individual coefficients of the variables are significant. If interest in the market situation and willingness to take risks increase by one unit, the relative probability of buying shares increases by 8.9% and 41.6%, respectively. If income increases by 1,000 francs, the relative probability of buying shares decreases by 2.1%. Cohen's f^{2} is .32, which corresponds to a medium effect according to Cohen (1992).
