## Peter R. Killeen Discusses a Third Alternative for Inferential Statistics

#### Emerging Research Front Commentary, October 2010

Article: An alternative to null-hypothesis significance tests
Author: Killeen, P. R.
Address: Arizona State Univ, Dept Psychol, Tempe, AZ 85287, USA.

Peter R. Killeen talks with ScienceWatch.com and answers a few questions about this month's Emerging Research Front paper in the field of Psychiatry/Psychology.

**Why do you think your paper is highly
cited?**

It provides a third alternative for inferential statistics—neither Frequentist nor standard Bayesian—but predictive: Inferences are to future research outcomes, not to parameters. That a brave editor recommended that it be used in his journal instead of Null-Hypothesis Statistical Tests (NHST) was crucial. The inevitable controversy pursuant to that decision has further raised its profile.

**Does it describe a new discovery, methodology, or
synthesis of knowledge?**

It is based on a standard Bayesian construct, the posterior predictive distribution. My contribution was to recognize that this distribution, often canceled out of likelihood ratios as a nuisance variable, was an invaluable inferential tool. It changed the game from talk about parameters (means, variances, etc.) to talk about replicability.

**Would you summarize the significance of your paper
in layman's terms?**

Science, as it is said, is about testing hypotheses; hypotheses that we can never claim to prove, but only to challenge and possibly discredit. Science, as it is done, is about demonstration followed by replication; or by failures to replicate. Historically science has been undergirded—and poorly so—by the Statistics of Hypothesis.

**Figure 1:** The curve at left is the
sampling distribution for a statistic, such as a mean or effect size
(*d*), under the null hypothesis. The traditional *p* value
is the area to the right of the obtained statistic, *d*1, shown in
black. Shift this curve to its most likely position (the observed
statistic) and double its variance (to account for the sampling error in
the original plus that in the replicate) to create the distribution
expected for replications. The probability of finding an effect of the same sign
(*p*_{rep}) is given by the shaded area. This
illustration was computed with flat priors on parameter values, as
appropriate for the evaluation of evidence. Computation is
straightforward with more informative priors, such as those reflecting
the state of knowledge in the field. Although *p*_{rep}
gives the probability of any supportive data in replication, the
criterion can be changed, to significant effects in replication, or to
the probability of someone finding significant effects in the opposite
direction! (From Killeen, 2006; reprinted with permission*).
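The construction in Figure 1 translates directly into a short computation. The sketch below assumes a normal sampling distribution and flat priors, as in the figure; the function names and numbers are illustrative, not from the paper:

```python
from statistics import NormalDist

def p_rep(d_obs: float, se: float) -> float:
    """Probability that a same-powered replication yields an effect
    of the same sign as the observed effect d_obs.

    Per Figure 1: centre the sampling distribution on the observed
    statistic, double its variance (sampling error of the original
    plus that of the replicate), and take the area on the same side
    of zero as d_obs.
    """
    rep = NormalDist(mu=d_obs, sigma=se * 2 ** 0.5)
    return 1.0 - rep.cdf(0.0) if d_obs >= 0 else rep.cdf(0.0)

def p_rep_from_p(p: float) -> float:
    """Equivalent shortcut from a one-tailed p value:
    p_rep = Phi(z_p / sqrt(2)), where z_p = Phi^{-1}(1 - p)."""
    z = NormalDist().inv_cdf(1.0 - p)
    return NormalDist().cdf(z / 2 ** 0.5)
```

For example, an observed effect of 0.5 with a standard error of 0.25 gives p_rep of about .92, and a one-tailed p of .05 corresponds to p_rep of about .88.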

My paper opens the door to science-appropriate statistics—to a
Statistics of Replicability, without asserting claims to higher knowledge
(of Truth about Parameters). It restores the meaning of
*significance*, as in your query, to layman's
terms—importance—by freeing it from the chimera of effect-size
over root n that yields traditional *p*-values. It opens a further
door to a decision-theoretic approach to evaluation of research, providing,
*en passant,* a rational explication of the traditional alpha = 0.05
criterion for significance (Killeen, 2006).

**How did you become involved in this research, and
how would you describe the particular challenges, setbacks, and
successes that you've encountered along the way?**

I became convinced that the appropriate way to teach undergraduate statistics was with randomization methods—permutation and bootstrap tests. The students grasped the ideas more readily, and became more proficient in the analyses, than when relying on closed forms alone. But that could not solve the problem that NHST only speaks to the probability of data given (null) hypotheses; it does not permit us to speak about hypotheses given data.
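For readers unfamiliar with randomization methods, a two-sample permutation test of the kind taught in such a course can be sketched as follows (a minimal illustration, not code from the course):

```python
import random

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sample permutation test for a difference in means.

    Repeatedly shuffles the pooled data into two relabeled groups
    and counts how often the shuffled mean difference is at least
    as extreme as the observed one (two-sided p value).
    """
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = (sum(pooled[:len(a)]) / len(a)
                - sum(pooled[len(a):]) / len(b))
        if abs(diff) >= abs(observed):
            hits += 1
    return hits / n_perm
```

Note that, exactly as the passage says, the resulting p value is still a probability of data given the null hypothesis of exchangeable groups.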

But that is just what "rejecting the null" is: it is an assertion about the non-viability of a hypothesis given data. I saw that—implicitly or explicitly—most undergraduate stats teaches this lie—that we can use NHST to reject hypotheses. It was then that I rolled up my sleeves.

In my rush to make this digestible to a general audience, I dropped my Bayesian derivation for a flawed "fiducial" derivation. I was appropriately called on it by commentators, and quickly retreated to the Bayesian derivation.

A number of colleagues showed that my statistic was severely biased
whenever we had knowledge of, or posited knowledge of, the true values of
the parameters. *Nolo contendere*; it was never designed for those
cases—knowledge that theoreticians, but not scientists, are privy to,
or contrive.

Yet we are so conditioned to think of the probability "given that the null is true" that it is easy even for experts to regress to this parametric mode of thinking. If all that is available are the results of an initial study, my analysis gives the unbiased probability that it can be replicated; other probabilities, such as the probability that another experimenter will get results that discredit the original, are easily calculated.

In terms of successes, several independent researchers have begun to generalize the uses of the statistics in very interesting ways.

**Where do you see your research leading in the
future?**

Predictive inference plays well with Bayesian inference in general, and will come to be seen as one more Bayesian tool, of particular use when the alternative hypothesis is composite (i.e., ill-defined: "Not the Null"). The large infrastructure of inferential statistics—ANOVAs, regression analyses, and SEM—may be recast in the mode of predictive inference. But that task is beyond my skills. Either this baton is caught, or my contribution becomes a mere hiccup in the history of scientific inference.

**Do you foresee any social or political implications for your
research?**

They have certainly had implications in the social politics of
methodologists! In the long run, predictive inference may be a contribution
to revision of the scientific folklore of how we acquire knowledge, and
what it means to do that. Replication statistics are inherently more
sensible, and easy to communicate, than the *reductio ad absurdum*
of NHST.

For instance, few scientists can give a correct definition of confidence
intervals; but it is easy to understand and remember that the
standard error of the mean, often drawn as error bars around data points,
gives the range within which a mean from an equal-powered replication will
fall approximately half the time: they are *de facto* 50%
replicability intervals.
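That "approximately half the time" is easy to verify by simulation. A minimal sketch, assuming normally distributed data (the function name is illustrative; the theoretical coverage for known sigma is P(|Z| <= 1/sqrt(2)), about .52):

```python
import random
import statistics

def sem_coverage(n=25, trials=20_000, seed=1):
    """Fraction of simulated studies whose equal-powered replication
    mean lands inside the original mean +/- one standard error."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        original = [rng.gauss(0.0, 1.0) for _ in range(n)]
        replicate = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = statistics.fmean(original)
        sem = statistics.stdev(original) / n ** 0.5
        if abs(statistics.fmean(replicate) - m) <= sem:
            hits += 1
    return hits / trials
```

Running this yields coverage near one half, in line with reading the SEM bars as rough 50% replicability intervals.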

Such improved communication may open the door for a greater social appreciation of the scientific enterprise by the non-scientific community—the one that pays our bills.

**Peter Killeen, Ph.D.**

**Professor Emeritus**

**Department of Psychology (Behavioral Neuroscience)**

**Arizona State University**

**Tempe, AZ, USA**

KEYWORDS: EFFECT SIZE; CONFIDENCE INTERVALS; P-VALUES; META-ANALYSIS; PSYCHOLOGY.

“Permission from the Society is waived for authors who wish to reproduce a single table or figure, provided the author’s permission is obtained and full credit is given to the Psychonomic Society and the author through a complete citation.”

* Killeen, P. R. (2006). "Beyond statistical inference: a decision theory for science." *Psychonomic Bulletin & Review*, 13, 549-562.