Peter R. Killeen Discusses a Third Alternative for Inferential Statistics
Emerging Research Front Commentary, October 2010
Article: An alternative to null-hypothesis significance tests
Authors: Killeen, PR
Peter R. Killeen talks with ScienceWatch.com and answers a few questions about this month's Emerging Research Front paper in the field of Psychiatry/Psychology.
Why do you think your paper is highly cited?
It provides a third alternative for inferential statistics—neither Frequentist nor standard Bayesian—but predictive: Inferences are to future research outcomes, not to parameters. That a brave editor recommended that it be used in his journal instead of Null-Hypothesis Statistical Tests (NHST) was crucial. The inevitable controversy following that decision has further raised its profile.
Does it describe a new discovery, methodology, or synthesis of knowledge?
It is based on a standard Bayesian construct, the posterior predictive distribution. My contribution was to recognize that this distribution, often canceled out of likelihood ratios as a nuisance variable, was an invaluable inferential tool. It changed the game from talk about parameters (means, variances, etc.) to talk about replicability.
Would you summarize the significance of your paper in layman's terms?
Science, as it is said, is about testing hypotheses; hypotheses that we can never claim to prove, but only to challenge and possibly discredit. Science, as it is done, is about demonstration followed by replication; or by failures to replicate. Historically science has been undergirded—and poorly so—by the Statistics of Hypothesis.
Figure 1: The curve at left is the sampling distribution for a statistic, such as a mean or effect size (d), under the null hypothesis. The traditional p value is the area to the right of the obtained statistic, d1, shown in black. Shift this curve to its most likely position (the observed statistic) and double its variance (to account for the sampling error in the original plus that in the replicate) to create the distribution expected for replications. The probability of finding an effect of the same sign (prep) is given by the shaded area. This illustration was computed with flat priors on parameter values, as appropriate for the evaluation of evidence. Computation is straightforward with more informative priors, such as those reflecting the state of knowledge in the field. Although prep gives the probability of any supportive data in replication, the criterion can be changed, to significant effects in replication, or to the probability of someone finding significant effects in the opposite direction! (From Killeen, 2006; reprinted with permission*).
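The geometry in the caption reduces, under the flat priors it assumes, to a one-line formula: shifting the null sampling distribution to the observed statistic and doubling its variance makes prep the normal probability Φ(z/√2), where z is the z-score corresponding to the obtained one-tailed p value. A minimal sketch (the function name p_rep is illustrative, not from the paper):

```python
from math import sqrt
from statistics import NormalDist

def p_rep(p_value: float) -> float:
    """Probability that a replication yields an effect of the same sign.

    Assumes flat priors, as in Figure 1: the predictive distribution for
    a replicate is the null sampling distribution recentered on the
    observed statistic with its variance doubled, so the mass above zero
    is Phi(z / sqrt(2)) with z = Phi^{-1}(1 - p) for a one-tailed p.
    """
    z = NormalDist().inv_cdf(1.0 - p_value)
    return NormalDist().cdf(z / sqrt(2.0))

# A conventionally "significant" one-tailed result, p = .05:
print(round(p_rep(0.05), 3))  # -> 0.878
```

So a result just reaching the traditional .05 criterion carries roughly an 88% chance that a same-powered replication will find an effect in the same direction.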
My paper opens the door to science-appropriate statistics—to a Statistics of Replicability, without asserting claims to higher knowledge (of Truth about Parameters). It restores the meaning of significance, as in your query, to layman's terms—importance—by freeing it from the chimera of effect-size over root n that yields traditional p-values. It opens a further door to a decision-theoretic approach to the evaluation of research, providing, en passant, a rational explication of the traditional alpha = 0.05 criterion for significance (Killeen, 2006).
How did you become involved in this research, and how would you describe the particular challenges, setbacks, and successes that you've encountered along the way?
I became convinced that the appropriate way to teach undergraduate statistics was with randomization methods—permutation and bootstrap tests. In fact, the students grasped the ideas better, and became more proficient in the analyses, than when relying on closed forms alone. But that could not solve the problem that NHST only speaks to the probability of data given (null) hypotheses; it does not permit us to speak about hypotheses given data.
But that is just what "rejecting the null" is: it is an assertion about the non-viability of a hypothesis given data. I saw that—implicitly or explicitly—most undergraduate stats teaches this lie—that we can use NHST to reject hypotheses. It was then that I rolled up my sleeves.
In my rush to make this digestible to a general audience, I dropped my Bayesian derivation for a flawed "fiducial" derivation. I was appropriately called on it by commentators, and quickly backwatered to the Bayesian derivation.
A number of colleagues showed that my statistic was severely biased whenever we had knowledge of, or posited knowledge of, the true values of the parameters. Nolo contendere; it was never designed for those cases—knowledge that theoreticians, but not scientists, are privy to, or contrive.
Yet we are so conditioned to think of the probability "given that the null is true", it is easy even for experts to regress to this parametric mode of thinking. If all that is available are the results of an initial study, my analysis gives the unbiased probability that it can be replicated; other probabilities, such as the probability that another experimenter will get results that discredit the original, are easily calculated.
In terms of successes, several independent researchers have begun to generalize the uses of the statistics in very interesting ways.
Where do you see your research leading in the future?
Predictive inference plays well with Bayesian inference in general, and will come to be seen as one more Bayesian tool, of particular use when the alternative hypothesis is composite (i.e., ill-defined: "Not the Null"). The large infrastructure of inferential statistics—ANOVAs, regression analyses, and SEM—may be recast in the mode of predictive inference. But that task is beyond my skills. Either this baton is caught, or my contribution becomes a mere hiccup in the history of scientific inference.
Do you foresee any social or political implications for your research?
They have certainly had implications in the social politics of methodologists! In the long run, predictive inference may be a contribution to revision of the scientific folklore of how we acquire knowledge, and what it means to do that. Replication statistics are inherently more sensible, and easy to communicate, than the reductio ad absurdum of NHST.
For instance, few scientists can give a correct definition of confidence intervals; but it is easy to understand and remember that the standard error of the mean, often drawn as error bars around data points, gives the range within which a mean from an equal-powered replication will fall approximately half the time: They are de facto 50% replicability intervals.
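That "approximately half the time" can be checked directly. If the original and replicate means each carry standard error SEM, their difference is normal with standard deviation SEM·√2, so the chance the replicate falls inside the original ±SEM bars is 2Φ(1/√2) − 1, a shade over one half:

```python
from math import sqrt
from statistics import NormalDist

# Original and replicate means each vary with standard error SEM, so
# their difference is normal with standard deviation SEM * sqrt(2).
# Probability the replicate mean lands within +/- 1 SEM of the original:
coverage = 2 * NormalDist().cdf(1 / sqrt(2)) - 1
print(round(coverage, 3))  # about 0.52
```

Hence ordinary SEM error bars behave, to a close approximation, as 50% replicability intervals.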
Such improved communication may open the door for a greater social appreciation of the scientific enterprise by the non-scientific community—the one that pays our bills.
Peter Killeen, Ph.D.
Department of Psychology (Behavioral Neuroscience)
Arizona State University
Tempe, AZ, USA
KEYWORDS: EFFECT SIZE; CONFIDENCE INTERVALS; P-VALUES; META-ANALYSIS; PSYCHOLOGY.
“Permission from the Society is waived for authors who wish to reproduce a single table or figure, provided the author’s permission is obtained and full credit is given to the Psychonomic Society and the author through a complete citation.”
*Killeen, P. R. (2006), "Beyond statistical inference: a decision theory for science," Psychonomic Bulletin & Review, 13, 549-562.