Can I forecast with logistic regression?


Quick start

What is logistic regression analysis used for?
Binary logistic regression analysis tests whether there is a relationship between one or more independent variables and a binary dependent variable.

SPSS menu
Analyze> Regression> Binary Logistic

SPSS syntax
LOGISTIC REGRESSION VARIABLES dependent variable
/ METHOD = ENTER independent variables
/ CLASSPLOT
/ PRINT = ITER (1) CI (95)

Sample SPSS dataset

Logistic regression (SAV, 19 KB)

1. Introduction


Binary logistic regression analysis is used to check whether there is a relationship between a binary dependent variable and one or more independent variables.

In contrast to simple and multiple linear regression analysis, the dependent variable is binary. That is, it has only two values; e.g. a variable Heart attack that takes on the following values: 1 for "yes, has already had a heart attack" and 0 for "no, has not had a heart attack". Such variables are also called "dichotomous". The independent variables, on the other hand, are interval-scaled or coded as dummy variables.

For ordinally scaled dependent variables and for nominal dependent variables with more than two values (e.g. the variable Hair color with the categories brown, blonde, black and red) there are extensions of logistic regression analysis: ordinal logistic regression and multinomial logistic regression. These are not discussed in more detail here.

Binary logistic regression analysis examines the relationship between the probability that the dependent variable takes the value 1 and the independent variables. This means that it is not the value of the dependent variable that is predicted, but the probability that the dependent variable will take the value 1. Furthermore, the requirements are less restrictive than in the linear regression analysis.

It should be noted at this point that every postulated causal relationship must be theoretically justified.

The question of the logistic regression analysis is often shortened as follows:
"Do the independent variables have an influence on the probability that the dependent variable takes on the value 1? How strong is their influence?"

1.1. Examples of possible questions

  • Do diet, exercise and a person's perception of stress influence the likelihood of bone marrow decline (binary variable with the values "bone marrow decline detectable" and "no bone marrow decline apparent")?
  • How strong is the relationship between participation in further training and the self-confidence, the managerial authority at the employer, and the income of the potential participant?
  • What affects the likelihood that Christmas decorations are bought: the number of snowy days, the outside temperature or the number of days until December 24th?
  • Can gender, age, education and occupation predict the likelihood that a particular TV show is watched?

1.2. Requirements

  • The dependent variable is binary (coded 0/1)
  • The independent variables are metric or, in the case of categorical variables, coded as dummy variables
  • Each group formed by the categorical predictors has n ≥ 25
  • The independent variables are not highly correlated with one another

2.1. Example of a study


A bank is interested in factors related to the likelihood of someone buying shares. It therefore commissions a market research institute to interview 700 people. It is assumed that the decision to buy shares is influenced by annual income (in thousands of CHF), risk tolerance (scale from 0 to 25) and interest in the current market situation (scale from 0 to 45).

The data set to be analyzed therefore contains a respondent number (ID), a variable for share purchase (Share purchase: 0 = no, 1 = yes), the annual income (Income), the willingness to take risks (Willingness to take risks) and the interest in the current market situation (Interest).

Figure 1: Example data


The dataset can be downloaded from Quick Start.

2.2. The maximum likelihood estimate

The logistic regression model


Logistic regression analysis is based on maximum likelihood estimation (also called MLE, for "maximum likelihood estimation"), which differs from the method of least squares used in linear regression analysis. Similar to a linear regression analysis, an attempt is made to find a function curve that fits the data as well as possible. In contrast to linear regression analysis, however, this function is not a straight line but a logistic function (see Figure 2). It is s-shaped, symmetrical and runs asymptotically towards y = 0 and y = 1. This means that the logistic function only takes values between 0 and 1.

The values of the logistic function are interpreted as the probability that the dependent variable y assumes the value 1 (given the independent variables x_k), because a logistic regression model does not predict the values of the dependent variable y, but the probability of occurrence of y = 1. A value close to 0 means that the event (y = 1) is very unlikely to occur, while a value close to 1 means that it is very likely to occur.

Figure 2: Logistic function


The logistic regression function is as follows:

$$P(y = 1) = \frac{1}{1 + e^{-z}}$$

With

$P(y = 1)$ = probability that y = 1
$e$ = base of the natural logarithm, Euler's number
$z$ = logit (linear regression model of the independent variables)


z, the so-called "logit", represents a linear regression model:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

With

$x_1, \dots, x_k$ = independent variables
$\beta_0, \dots, \beta_k$ = regression coefficients
$\varepsilon$ = error term


If the logit is now inserted into the logistic function, the result is:

$$P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k)}}$$
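As an illustration only (not part of the SPSS workflow described in this text), the following short Python sketch evaluates the logistic function for a few logit values and shows that the result always lies between 0 and 1:

import math

def logistic(z):
    # logistic function: maps any logit z to a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4, -1, 0, 1, 4):
    print(z, round(logistic(z), 3))
# output: 0.018, 0.269, 0.5, 0.731, 0.982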

Maximum likelihood estimation


The regression coefficients are estimated by the maximum likelihood estimation (MLE) algorithm. MLE chooses the regression parameters so that the predicted probabilities are as high as possible for observations with y = 1 and as low as possible for observations with y = 0. To do this, MLE maximizes a "likelihood function" that states how likely it is that the observed values of the dependent variable can be predicted from the independent variables. The value of the likelihood function can also be used to assess model quality and model significance, as will be seen below.
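To make this principle concrete, here is a hedged Python sketch with made-up data that maximizes the log likelihood of a logistic model directly; SPSS does the same thing internally with a more refined iterative algorithm:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # two made-up predictors
true_z = 0.5 + X @ np.array([1.0, -0.5])
y = (rng.random(200) < 1 / (1 + np.exp(-true_z))).astype(float)

def neg_log_likelihood(beta):
    z = beta[0] + X @ beta[1:]                     # logit
    p = 1 / (1 + np.exp(-z))                       # predicted P(y = 1)
    p = np.clip(p, 1e-12, 1 - 1e-12)               # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=np.zeros(3))
print(result.x)     # estimated coefficients (constant, beta1, beta2)
print(-result.fun)  # maximized log likelihood (LL)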


Interpretation of the regression coefficients


The regression coefficients are no longer interpreted in the same way in the logistic regression as they were in the linear regression. A look at the logistic regression function shows that the relationship is not linear, but more complex. What still applies is the "sign interpretation": If the sign of a regression coefficient is positive, an increase in the relevant independent variable causes an increase in the probability that y = 1. If the sign is negative, this means a decrease in the probability.

The relationship between an independent variable and the dependent variable can be interpreted more precisely by means of so-called "odds" (betting odds). To calculate the odds, the probability that the event occurs is set in relation to the probability that it does not occur:

$$\text{odds} = \frac{P(y = 1)}{1 - P(y = 1)}$$
So-called "odds ratios" are used to interpret a regression coefficient. These are the ratio of two odds. SPSS refers to the odds ratio of a variable as "Exp(B)" because it can also be calculated as e^β (β stands for the regression coefficient, e for Euler's number). SPSS outputs odds ratios of the following type (the odds after a one-unit increase in the independent variable divided by the odds before):

$$\text{odds ratio} = \frac{\text{odds}_{\text{after}}}{\text{odds}_{\text{before}}}$$

The following relationship is derived from this and is useful for interpreting the regression coefficients:

$$\text{Exp(B)} = e^{\beta} = \frac{\text{odds}_{\text{after}}}{\text{odds}_{\text{before}}}$$
The odds ratio of an independent variable indicates the change in the relative probability that y = 1 when this independent variable increases by one unit, with all other variables in the model held constant. That is, the odds ratio of an independent variable is the factor by which the odds change when this variable increases by one unit. If an odds ratio (Exp(B)) equals 1, the odds are multiplied by 1 and thus do not change (odds_after = odds_before). An odds ratio > 1 means an increase in the odds (odds_after > odds_before), while an odds ratio < 1 means a decrease in the odds (odds_after < odds_before).

The relationship odds ratio = Exp(B) = e^β shows how odds ratios and regression coefficients are connected: an odds ratio is 1 if the regression coefficient is 0 (B = 0), > 1 if the regression coefficient is positive (B > 0), and < 1 if the regression coefficient is negative (B < 0). Figure 3 gives an overview:
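The following small Python sketch (with probabilities chosen purely for illustration) shows how odds are formed from probabilities, how an odds ratio is the factor between two odds, and that the regression coefficient B is simply the natural logarithm of that factor (odds ratio = e^B = Exp(B)):

import math

p_before = 0.20                          # P(y = 1) before a one-unit increase
p_after = 0.30                           # P(y = 1) after the increase
odds_before = p_before / (1 - p_before)  # 0.25
odds_after = p_after / (1 - p_after)     # about 0.43
odds_ratio = odds_after / odds_before    # about 1.71: odds multiply by this factor
b = math.log(odds_ratio)                 # the corresponding coefficient B
print(odds_ratio, math.exp(b))           # exp(B) reproduces the odds ratio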

Figure 3: Interpretation aid for regression coefficients and odds ratios (Exp (B))

3. Logistic regression analysis with SPSS

3.1. Formulation of the regression model


When formulating the regression model, it must be decided which variables are included in the model as dependent and independent variables. Theoretical considerations play a central role here. The model should be kept as simple as possible, so it is advisable not to include too many independent variables. In the present example, Share purchase is the dependent variable, whose probability of occurrence is predicted from Income, Interest and Willingness to take risks:

$$P(\text{Share purchase} = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{Income} + \beta_2 \cdot \text{Interest} + \beta_3 \cdot \text{Willingness to take risks})}}$$

3.2. Methods of variable inclusion


Before performing the analysis, it must be decided in which order the independent variables should be included in the model. This can have an impact on the model that is reported at the end of the analysis. If all independent variables are completely uncorrelated, the order in which they are introduced into the model does not matter. In the social sciences and in market research, however, the variables are rarely completely uncorrelated. Thus the method of variable inclusion is relevant.


Methods of variable inclusion in SPSS


SPSS offers different approaches to include the independent variables in the model (see click sequence in Figure 4).

  • Inclusion (ENTER): a procedure for variable selection in which all variables of a block are entered in a single step.
  • Forward selection (Conditional): a stepwise variable selection method with an entry test based on the significance of the score statistic and a removal test based on the probability of a likelihood ratio statistic computed from conditional parameter estimates.
  • Forward selection (Likelihood Ratio): a stepwise variable selection method with an entry test based on the significance of the score statistic and a removal test based on the probability of a likelihood ratio statistic computed from maximum partial likelihood estimates.
  • Forward selection (Wald): a stepwise variable selection method with an entry test based on the significance of the score statistic and a removal test based on the probability of the Wald statistic.
  • Backward LR: backward selection is the reverse of forward selection. Step by step, independent variables are removed from the model, starting with the one that has the weakest relationship to the dependent variable. At the same time, the likelihood ratio statistic is used to check whether the model would improve by adding a variable back in. (The Backward Wald and Backward Conditional methods are not recommended.)
  • Backward elimination (Conditional): backward stepwise selection. The removal test is based on the probability of the likelihood ratio statistic computed from conditional parameter estimates.
  • Backward elimination (Likelihood Ratio): backward stepwise selection. The removal test is based on the probability of the likelihood ratio statistic computed from maximum partial likelihood estimates.
  • Backward elimination (Wald): backward stepwise selection. The removal test is based on the probability of the Wald statistic.

In the present example, the regression model is based on well-founded theoretical considerations, which is why the "inclusion" method is chosen.


Hierarchical regression analysis


In addition, variables can also be introduced in blocks (hence also "blockwise regression"). This operates on top of the methods described so far and can be combined with them as required. Several groups of variables (blocks) are specified, which SPSS includes in the model one after the other. Within each block, the method already selected is used (e.g. "Inclusion" or "Forward LR"). In a study on the influence of environmental awareness on behaviour, for example, the first block could be used to create a model containing socio-demographic variables. In the second block, environmental awareness is introduced. In SPSS, not all variables are entered in the "Covariates" box at the same time, only those of the first block. Then "Next" is selected and the variables of the second block are inserted, and so on.

This procedure has two advantages: Firstly, the sequence in which the variables are included can be specified precisely in this way. Second, SPSS outputs a regression model for each step, i.e. after each block. As a result, the change in regression coefficients due to the inclusion of a specific additional variable can be observed and tests can be carried out to determine whether the model improves through the addition of the additional block.

In the context of the present example, all variables are inserted in the first block in order to keep the example as simple as possible.

3.3. SPSS commands


SPSS menu: Analyze> Regression> Binary Logistic

Figure 4: Click sequence in SPSS

 


Hints

  • Under Method, it is decided how the independent variables are included in the model.
  • Under Options, the classification plots and the confidence intervals for Exp(B) can be requested.
  • The default setting of the classification cut-off is 0.5.


SPSS syntax

LOGISTIC REGRESSION VARIABLES Share purchase
/ METHOD = ENTER Income interest risk taking
/ CLASSPLOT
/ PRINT = ITER (1) CI (95)
/CRITERIA=PIN(0.05) POUT (0.10) ITERATE (20) CUT (0.5).
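For readers who want to reproduce the analysis outside SPSS, a roughly equivalent model can be fitted in Python with statsmodels. This is only a sketch: the file name and the lowercase column names are assumptions about how the example dataset might be imported, not part of the original example.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("logistic_regression.sav")   # hypothetical file name
df.columns = ["id", "share_purchase", "income", "risk_tolerance", "interest"]  # assumed column order
# assumes share_purchase is coded numerically as 0/1

model = smf.logit("share_purchase ~ income + interest + risk_tolerance", data=df).fit()
print(model.summary())        # coefficients, z-tests, log likelihood
print(np.exp(model.params))   # odds ratios, comparable to Exp(B) in SPSS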

3.4. Significance of the regression model


To check whether the regression model is significant overall, a chi-square test is carried out (referred to in SPSS as the "omnibus test of the model coefficients"). This checks whether the model as a whole makes an explanatory contribution beyond simply predicting the modal value of y. For this purpose, the test uses the logarithm of the value of the likelihood function that was maximized in the course of the model estimation (see section 2.2, "The maximum likelihood estimate"). This logarithmized value is referred to as the "log likelihood", or "LL" for short. To assess model quality, this value is multiplied by -2 (-2LL). The value -2LL describes an error term. As part of the significance test, the -2LL values of two models are compared: that of the postulated regression model and that of the so-called "base model". The base model is a model that only contains the constant. SPSS outputs the -2LL value of the postulated model in the "Model summary" table (see Figure 7), while the -2LL value of the base model is output when the iteration history is requested under "Options". The test statistic based on this comparison follows a chi-square distribution:

$$\chi^2 = (-2LL_{\text{base model}}) - (-2LL_{\text{model}})$$

With

$df$ = number of independent variables in the model

This means that the significance of the chi-square test statistic can be checked by comparing it with the critical value of a chi-square distribution with the corresponding number of degrees of freedom.
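As a hedged sketch, the same comparison can be reproduced in Python; the two -2LL values below are placeholders chosen so that their difference roughly matches the chi-square reported in Figure 5, since the actual values come from the SPSS iteration history and model summary:

from scipy.stats import chi2

neg2ll_base = 804.3    # placeholder: -2LL of the base model (constant only)
neg2ll_model = 678.9   # placeholder: -2LL of the postulated model
k = 3                  # number of independent variables = degrees of freedom

chi_square = neg2ll_base - neg2ll_model   # about 125.4
p_value = chi2.sf(chi_square, df=k)       # upper-tail p-value
print(chi_square, p_value)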

Figure 5: SPSS output - verification of the model


The line "Model" in Figure 5 shows that the model as a whole is significant (Chi-Square (3) = 125.36, p <.001). For this reason the analysis can be continued. If the model as a whole were not significant, the analysis would not continue.

This table (Figure 5) contains other rows in addition to the "Model" row. These are only relevant if a method other than "Inclusion" is selected under "Method" or if a hierarchical regression analysis is carried out.

3.5. Significance of the regression coefficients


It is now checked whether the regression coefficients (betas) are also significant. A Wald test is carried out for each of the regression coefficients. The test statistic of the Wald test is calculated as follows:

$$z = \frac{\beta_j}{SE(\beta_j)}$$

With

$\beta_j$ = regression coefficient of the variable x_j (see column "Regression coefficient B" in Figure 6)
$SE(\beta_j)$ = standard error of β_j (see column "Standard error" in Figure 6)

The results of the Wald tests can be found in the "Wald" and "Sig." columns in Figure 6. SPSS reports the square of the Wald test statistic (z²) in the "Wald" column.
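A minimal Python sketch of this computation (B and its standard error are placeholder values, not the ones in Figure 6):

from scipy.stats import chi2

b = 0.50    # placeholder regression coefficient B
se = 0.10   # placeholder standard error of B

z = b / se
wald = z ** 2                    # the squared statistic SPSS reports in the "Wald" column
p_value = chi2.sf(wald, df=1)    # significance of the coefficient
print(wald, p_value)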

Figure 6: SPSS output - regression coefficients


Figure 6 shows that the z-tests for the regression coefficients of Income (Wald(1) = 14.651, p < .001), Interest (Wald(1) = 23.036, p < .001), Willingness to take risks (Wald(1) = 15.541, p < .001) and the constant (Wald(1) = 35.731, p < .001) are significant. The significant coefficients of the independent variables mean that their regression coefficients are not 0 and that these variables therefore have a significant influence on Share purchase.

Since the influence of the variables is interpreted via the odds ratios (Exp (B)), their significance is also checked: If the confidence interval of Exp (B) does not include the value 1, a significant influence is assumed. This applies to all of the independent variables examined (see Figure 6, column "95% confidence interval for EXP (B)").

This results in the following regression function (with the coefficients B taken from Figure 6):

$$P(\text{Share purchase} = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{Income} + \beta_2 \cdot \text{Interest} + \beta_3 \cdot \text{Willingness to take risks})}}$$
For Willingness to take risks and Interest, the value of Exp(B) is > 1 (and the sign of B is correspondingly positive). Therefore, if the willingness to take risks increases by one unit, the relative probability that a person has already bought shares increases by 41.6% (1.416 - 1 = .416). If the interest increases by one unit, the relative probability that a person has already bought shares increases by 8.9% (1.089 - 1 = .089). For Income, Exp(B) is < 1 (and the sign of B is correspondingly negative). This means: if the income increases by one unit (1,000 Swiss francs), the relative probability that a person has already bought shares drops by 2.1% (.979 - 1 = -.021).

3.6. Model quality


To assess the model quality, analogues of the R² from linear regression are used. There are a large number of such pseudo-R² measures; two of them are implemented in SPSS: the Cox and Snell R² and the Nagelkerke R². The Cox and Snell R² is calculated as follows:

$$R^2_{\text{Cox & Snell}} = 1 - e^{\frac{2}{n}(LL_0 - LL_1)}$$

With

$n$ = sample size
$e$ = base of the natural logarithm, Euler's number
$LL_1$, $LL_0$ = log likelihood of the postulated model and of the base model


The Nagelkerke R² is calculated as follows:

$$R^2_{\text{Nagelkerke}} = \frac{R^2_{\text{Cox & Snell}}}{1 - e^{\frac{2}{n} \cdot LL_0}}$$
The Nagelkerke R² standardizes the Cox and Snell R² so that it can only take values between 0 and 1. The higher the R² value, the better the fit between model and data (hence "goodness of fit").

Figure 7: SPSS output - model quality


The Nagelkerke R² for the present example is .24, as can be seen in Figure 7.
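A small Python sketch of how the two pseudo-R² values follow from the log likelihoods; the LL values below are placeholders chosen to be roughly consistent with this example (SPSS reports the results directly in the model summary):

import math

n = 700          # sample size
ll_0 = -402.1    # placeholder: log likelihood of the base model
ll_1 = -339.5    # placeholder: log likelihood of the postulated model

r2_cox_snell = 1 - math.exp((2 / n) * (ll_0 - ll_1))
r2_max = 1 - math.exp((2 / n) * ll_0)        # maximum attainable Cox and Snell R2
r2_nagelkerke = r2_cox_snell / r2_max
print(round(r2_cox_snell, 3), round(r2_nagelkerke, 3))   # about 0.16 and 0.24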


Predicted probabilities and observed values


The logistic regression function calculates the probabilities that the dependent variable will take the value 1. These probabilities vary between 0 and 1. This information can be used to take a closer look at the result if the "Classification diagrams" have been selected under "Options" (or alternatively - not explained in more detail here - if the predicted probabilities have been saved as variables and can then be analyzed separately).

SPSS uses a probability of 0.500 (see footnote in Figure 8) as the cut-off value to decide whether y = 0 or y = 1 is predicted. From a predicted probability of 0.500 upwards, Share purchase = 1 is predicted; below that, Share purchase = 0 is predicted. If the proportions of cases with y = 0 and y = 1 are about equal, a cut-off value of 0.500 can be used; otherwise the cut-off should correspond to the proportion of cases with y = 1, which can be read from the classification table. The cut-off value can be set in SPSS under "Options" under "Classification cutoff". The result with the default setting of 0.500 can be seen in Figure 8:
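The following self-contained Python sketch (with made-up probabilities and outcomes) shows how such a classification table is derived from predicted probabilities using a cut-off of 0.500:

import numpy as np

p_hat = np.array([0.10, 0.35, 0.55, 0.80, 0.45, 0.62, 0.20, 0.90])  # predicted P(y = 1)
observed = np.array([0, 0, 1, 1, 1, 0, 0, 1])                       # observed y

predicted = (p_hat >= 0.5).astype(int)   # cut-off value 0.500

table = np.zeros((2, 2), dtype=int)      # rows = observed, columns = predicted
for obs, pred in zip(observed, predicted):
    table[obs, pred] += 1
print(table)
print("overall correct:", (predicted == observed).mean())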

Figure 8: SPSS output - classification table


In total, 76.1% of the people were classified by the model according to their actual answer. Of those who have never bought shares, 485 out of a total of 517 (485 + 32) were correctly predicted. This corresponds to 93.8% correct forecasts. Of those people who bought shares, only 48 out of a total of 183 (135 + 48) share purchases were correctly predicted. This corresponds to 26.2% correct forecasts.

The "diagram of the observed groups and predicted probabilities" represents a kind of histogram and also illustrates the relationship between predicted probabilities, correspondingly classified predictions for y and the observed values ​​(Figure 9).

Figure 9: SPSS Output - Plot of observed groups and predicted probabilities


The letters in the diagram represent observations (whether each letter stands for one or more observations can be found in a footnote in the diagram). They reproduce the observed values of the variable Share purchase ("Y" stands for 1 and "N" for 0). The line below the diagram (the x-axis) shows the predicted probabilities, and immediately below it is the classification based on them ("N" if the probability is < .500 and "Y" if the probability is ≥ .500). In the ideal case, the prediction of y is not only correct but also as clear-cut as possible, i.e. few people have medium probabilities.

All "Y" in the left half and all "N" in the right half of the diagram thus correspond to false predictions. If these were counted and multiplied by a factor of 2.5 (see note below in Figure 9), 32 N and 135 Y should be found in the "wrong" half. This diagram also shows that Share purchase = 1 is rather poorly predicted.

3.7. Calculation of the effect size


Effect sizes are calculated to assess the practical importance of a result. In the example, the R² is 0.24, but the question arises as to whether this is large enough to be considered meaningful.

There are different ways to measure effect size. Among the best known are Cohen's effect size d and Pearson's correlation coefficient r. The correlation coefficient is well suited because the effect size always lies between 0 (no effect) and 1 (maximum effect). However, if the groups differ considerably in size, it is recommended to choose Cohen's d, since r can be distorted by the size differences.

The R² that is output in regression analyses can be converted into an effect size f² according to Cohen (1992). In this case, the range of values for the effect size is between 0 and infinity:

$$f^2 = \frac{R^2}{1 - R^2}$$

With

$f^2$ = Cohen's effect size
$R^2$ = R-square of the model


For the above example this results in the following effect size:

$$f^2 = \frac{.24}{1 - .24} = .32$$
In order to assess how big this effect is, one can refer to the classification by Cohen (1988):

f² = .02 corresponds to a weak effect
f² = .15 corresponds to a medium effect
f² = .35 corresponds to a strong effect


So the effect size of .32 corresponds to a medium effect.
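As a small check, the conversion and Cohen's classification can be reproduced in Python:

r_squared = 0.24                          # Nagelkerke R-square from Figure 7
f_squared = r_squared / (1 - r_squared)   # Cohen's f2, about 0.32

if f_squared >= 0.35:
    label = "strong effect"
elif f_squared >= 0.15:
    label = "medium effect"
elif f_squared >= 0.02:
    label = "weak effect"
else:
    label = "negligible effect"
print(round(f_squared, 2), label)         # 0.32 medium effect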

3.8. A typical statement


A logistic regression analysis shows that both the model as a whole (chi-square(3) = 125.36, p < .001, n = 700) and the individual coefficients of the variables are significant. If interest in the market situation and willingness to take risks increase by one unit, the relative probability of buying shares increases by 8.9% and 41.6%, respectively. If income increases by 1,000 francs, the relative probability of buying shares decreases by 2.1%. Cohen's f² is .32, which corresponds to a medium effect according to Cohen (1992).