One-way ANCOVA: Finding outliers

Outliers are another possible source of distortion in statistical analyses, and most methods are only slightly robust, or not robust at all, when there are outliers in the data set. A single outlier can be the difference between a significant and a non-significant result. You can easily check this yourself by simply quadrupling one value in the example data set: the effect is immediately reflected in the significance and effect sizes of the ANCOVA.

We are going to check for outliers here using two different methods: leverage values and Cook's distances.

Outliers should not simply be excluded from further analysis across the board. Whenever cases are excluded, the advantages and disadvantages should be weighed against each other. While outliers can skew inferential statistics, the extent depends heavily on the method and on the severity of the outlier. Every exclusion of a case from the sample comes with a loss of power (due to the reduced sample size), and, worse, we may be excluding cases that could provide important insights. Each exclusion is an intervention in the data set and should therefore not be made across the board, but with the research question in mind.

Leverage values

The leverage value is a measure of how far a case's value on the independent variables lies from the other values. A high leverage means that there are no other cases near this case; it could be an outlier. The leverage value can range between 0 and 1, where a value of 0 means that the case has no influence on the prediction and 1 means that the prediction is completely determined by this one case.
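Formally, the leverage of case \(i\) is the \(i\)-th diagonal element of the hat matrix built from the design matrix \(\mathbf{X}\) of the model (group dummies, covariate, and intercept); this is the quantity Hoaglin & Welsch (1978) analyze:

\[ h_{ii} = \mathbf{x}_i^\top \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{x}_i, \qquad 0 \le h_{ii} \le 1 . \]

Since the \(h_{ii}\) sum to the number of model parameters, the cut-offs below are essentially multiples of the average leverage.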

There are various formulas and cut-offs for deciding when a leverage value is large enough to be classified as an outlier. Many of them depend on the number of groups k, the number of covariates c, and the number of cases n. The value p is calculated from k and c as p = k − 1 + c. With one covariate and three groups, our p would therefore be 3.

  • Huber (1981) recommends a general cut-off value of .2, regardless of other parameters
  • Igo (2010) recommends the formula \(\frac{2 \cdot p}{n}\) for reasonably large data sets with n − p > 50
  • Velleman & Welsch (1981), however, recommend \(\frac{3 \cdot p}{n}\) for p > 6 and n − p > 12
  • Hoaglin & Welsch (1978) recommend \(2 \cdot \frac{p + 1}{n}\) as a rule of thumb for "large" leverage values

Now we are spoiled for choice. With our sample data set we have p = 3 and n = 145 cases. According to Igo (2010), a leverage value of .0413 or greater would be an outlier; according to Hoaglin & Welsch (1978), one of .0552 or greater; and according to Huber (1981), regardless of all other parameters, one of .2 or greater.
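To make the arithmetic transparent, here is a minimal Python sketch (outside the SPSS workflow, purely illustrative) that reproduces these cut-offs; the values k = 3, c = 1 and n = 145 come from the example data set:

    # Leverage cut-offs for the example data set (k = 3 groups, c = 1 covariate, n = 145).
    k, c, n = 3, 1, 145
    p = k - 1 + c  # p = 3, as in the text

    cutoffs = {
        "Huber (1981), fixed": 0.2,
        "Igo (2010), 2p/n": 2 * p / n,
        "Velleman & Welsch (1981), 3p/n": 3 * p / n,  # rule meant for p > 6, shown for comparison
        "Hoaglin & Welsch (1978), 2(p+1)/n": 2 * (p + 1) / n,
    }

    for name, value in cutoffs.items():
        print(f"{name}: {value:.4f}")
    # Igo: 0.0414 and Hoaglin & Welsch: 0.0552, matching the text up to rounding.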

Leverage values are checked by sorting the LEV_1 column in the SPSS Data View in descending order. To do this, we go to the Data View, right-click on the column header of LEV_1, and sort it in descending order, as shown in the video below.

After that, the largest values of LEV_1 appear at the top:

The first value (.05816) counts as an outlier according to both Igo (2010) and Hoaglin & Welsch (1978). Here we could consider whether or not to exclude this observation from further analysis. According to Igo (2010), the first 13 cases would in fact all be outliers.
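If the data set with the saved diagnostics is exported from SPSS (e.g., as a CSV file), the same sorting and flagging can be sketched in Python with pandas. The column name LEV_1 matches the variable SPSS saves; the file name ancova_data.csv is a made-up placeholder:

    import pandas as pd

    # Hypothetical CSV export of the example data set including the saved column LEV_1.
    df = pd.read_csv("ancova_data.csv")

    # Sort descending, as done in the SPSS Data View above.
    df_sorted = df.sort_values("LEV_1", ascending=False)
    print(df_sorted["LEV_1"].head(13))  # the 13 cases flagged by Igo's cut-off in the text

    cutoff_igo = 2 * 3 / 145  # 2p/n with p = 3, n = 145
    print((df["LEV_1"] > cutoff_igo).sum(), "cases exceed the Igo (2010) cut-off")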

Cook's distances

Cook's distance is likewise a measure of the influence a single case has on the model as a whole. It measures how much the regression line would change if we excluded the case. In general, values greater than 1 are considered outliers and should be investigated more closely.

The check is carried out in a similar way as before: we sort the variable COO_1 in descending order:

After sorting, our data set would look like this:

The highest value here is .06 and thus far below the cut-off criterion of 1.
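SPSS saves LEV_1 and COO_1 when they are requested in the ANCOVA dialog. For readers who want to reproduce the diagnostics outside SPSS, here is a hedged statsmodels sketch; dv, group and covariate are placeholder names for the variables of the example data set:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("ancova_data.csv")  # hypothetical export of the example data

    # One-way ANCOVA as a linear model: group as a factor plus one covariate.
    model = smf.ols("dv ~ C(group) + covariate", data=df).fit()
    influence = model.get_influence()

    # Hat-matrix diagonal (leverage) and Cook's distances; SPSS's saved LEV_1
    # may differ slightly depending on whether it centers the leverage values.
    df["leverage"] = influence.hat_matrix_diag
    df["cooks_d"] = influence.cooks_distance[0]

    print(df["cooks_d"].sort_values(ascending=False).head())
    print("Max Cook's distance:", df["cooks_d"].max())  # .06 in the text, well below 1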

What to do if...

If we have outliers in our data set, we can consider whether to exclude them from further data analysis. It is advisable to exclude the values and then run the analysis again; this often improves statistics such as the p-value or the explained variance, which we will discuss later.

Since there are several ways to classify outliers, the methods should be used in combination. A value is most likely to be an outlier when multiple procedures identify it as such.
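Continuing the hypothetical df from the sketches above, such a combination could look like this; the cut-offs are the ones discussed earlier:

    # Flag a case only when several criteria agree.
    cutoff_hoaglin = 2 * (3 + 1) / 145     # Hoaglin & Welsch (1978): 2(p+1)/n

    df["flag_leverage"] = df["leverage"] > cutoff_hoaglin
    df["flag_cook"] = df["cooks_d"] > 1.0  # conventional Cook's distance cut-off

    df["outlier_votes"] = df[["flag_leverage", "flag_cook"]].sum(axis=1)
    print(df[df["outlier_votes"] >= 2])    # cases flagged by both procedures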

We would generally recommend being careful about excluding cases. Any exclusion is an interference with the data and should be carefully considered. If several cases emerge as outliers, it should also be checked whether there is a systematic pattern behind them. "Outliers" often cluster on another variable; for example, in a visual experiment, the outliers may be exactly those people who could not see the stimulus material. If further variables have been recorded, it can be worthwhile to get to the bottom of a possible cause.

As always when excluding cases from further data analysis: document and report everything! If we exclude cases, this must be stated and justified in the paper.

Bibliography

  1. Huber, P. J. (1981). Robust Statistics. New York: John Wiley.
  2. Igo, R. P. (2010). Influential Data Points. In N. J. Salkind (Ed.), Encyclopedia of Research Design (Vol. 2, pp. 600-602). Los Angeles: Sage.
  3. Velleman, P. F., & Welsch, R. E. (1981). Efficient Computing of Regression Diagnostics. The American Statistician, 35(4), 234. doi:10.2307/2683296
  4. Hoaglin, D. C., & Welsch, R. E. (1978). The Hat Matrix in Regression and ANOVA. The American Statistician, 32(1), 17-22. doi:10.1080/00031305.1978.10479237