Significance tests or confidence intervals: Which are preferable for the comparison of classifiers?

Daniel Berrar, Jose A. Lozano

Research output: Contribution to journal › Article

10 Citations (Scopus)

Abstract

Null hypothesis significance tests and their p-values currently dominate the statistical evaluation of classifiers in machine learning. Here, we discuss fundamental problems of this research practice. We focus on the problem of comparing multiple fully specified classifiers on a small-sample test set. On the basis of the method by Quesenberry and Hurst, we derive confidence intervals for the effect size, i.e. the difference in true classification performance. These confidence intervals disentangle the effect size from its uncertainty and thereby provide information beyond the p-value. This additional information can drastically change the way in which classification results are currently interpreted, published and acted upon. We illustrate how our reasoning can change, depending on whether we focus on p-values or confidence intervals. We argue that the conclusions from comparative classification studies should be based primarily on effect size estimation with confidence intervals, and not on significance tests and p-values.
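The abstract refers to confidence intervals derived from the method of Quesenberry and Hurst. As a rough illustration, the Python sketch below builds Quesenberry-Hurst simultaneous confidence intervals for multinomial proportions and differences them conservatively to bound the accuracy gap between two fully specified classifiers on one test set. The function names, the four-cell paired layout, and the differencing step are illustrative assumptions, not the authors' exact derivation from the paper.

from math import sqrt

from scipy.stats import chi2


def qh_intervals(counts, alpha=0.05):
    """Quesenberry-Hurst (1964) simultaneous CIs for multinomial proportions.

    counts: cell counts n_1, ..., n_k summing to N.
    Returns one (lower, upper) pair per cell.
    """
    n_total = sum(counts)
    a = chi2.ppf(1.0 - alpha, len(counts) - 1)  # chi-square quantile, k-1 df
    bounds = []
    for n_i in counts:
        half = sqrt(a * (a + 4.0 * n_i * (n_total - n_i) / n_total))
        lower = (a + 2.0 * n_i - half) / (2.0 * (n_total + a))
        upper = (a + 2.0 * n_i + half) / (2.0 * (n_total + a))
        bounds.append((lower, upper))
    return bounds


def accuracy_difference_ci(n11, n10, n01, n00, alpha=0.05):
    """Conservative CI for the difference in true accuracies of two
    classifiers evaluated on the same test set (an assumed construction).

    The four counts partition the test set: both classifiers correct (n11),
    only the first correct (n10), only the second correct (n01), both wrong
    (n00). The accuracy difference equals p10 - p01, so differencing the
    simultaneous bounds for those two cells gives a conservative interval
    for the effect size.
    """
    _, (l10, u10), (l01, u01), _ = qh_intervals([n11, n10, n01, n00], alpha)
    return (l10 - u01, u10 - l01)


# Made-up example: 100 test cases; the first classifier wins on 12 cases
# and loses on 5, an observed accuracy gap of 0.07.
print(accuracy_difference_ci(n11=70, n10=12, n01=5, n00=13))

On this made-up example the interval spans roughly -0.09 to 0.22 around the observed gap of 0.07: the data are compatible with anything from a small disadvantage to a sizeable advantage, which is exactly the kind of information about effect size and uncertainty that a bare p-value conceals.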

Original language: English
Pages (from-to): 189-206
Number of pages: 18
Journal: Journal of Experimental and Theoretical Artificial Intelligence
Volume: 25
Issue number: 2
DOI: 10.1080/0952813X.2012.680252
Publication status: Published - 2013 Jun 1
Externally published: Yes

Fingerprint

  • Classifiers
  • Learning systems

Keywords

  • classification
  • confidence interval
  • null hypothesis significance testing
  • p-value
  • reasoning

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Theoretical Computer Science

Cite this

Significance tests or confidence intervals: Which are preferable for the comparison of classifiers? / Berrar, Daniel; Lozano, Jose A.

In: Journal of Experimental and Theoretical Artificial Intelligence, Vol. 25, No. 2, 01.06.2013, p. 189-206.

Research output: Contribution to journal › Article

@article{0221d97095d64e40b6cd8877cbc13207,
title = "Significance tests or confidence intervals: Which are preferable for the comparison of classifiers?",
abstract = "Null hypothesis significance tests and their p-values currently dominate the statistical evaluation of classifiers in machine learning. Here, we discuss fundamental problems of this research practice. We focus on the problem of comparing multiple fully specified classifiers on a small-sample test set. On the basis of the method by Quesenberry and Hurst, we derive confidence intervals for the effect size, i.e. the difference in true classification performance. These confidence intervals disentangle the effect size from its uncertainty and thereby provide information beyond the p-value. This additional information can drastically change the way in which classification results are currently interpreted, published and acted upon. We illustrate how our reasoning can change, depending on whether we focus on p-values or confidence intervals. We argue that the conclusions from comparative classification studies should be based primarily on effect size estimation with confidence intervals, and not on significance tests and p-values.",
keywords = "classification, confidence interval, null hypothesis significance testing, p-value, reasoning",
author = "Berrar, Daniel and Lozano, {Jose A.}",
year = "2013",
month = "6",
day = "1",
doi = "10.1080/0952813X.2012.680252",
language = "English",
volume = "25",
pages = "189--206",
journal = "Journal of Experimental and Theoretical Artificial Intelligence",
issn = "0952-813X",
publisher = "Taylor and Francis Ltd.",
number = "2",

}

TY - JOUR

T1 - Significance tests or confidence intervals

T2 - Which are preferable for the comparison of classifiers?

AU - Berrar, Daniel

AU - Lozano, Jose A.

PY - 2013/6/1

Y1 - 2013/6/1

N2 - Null hypothesis significance tests and their p-values currently dominate the statistical evaluation of classifiers in machine learning. Here, we discuss fundamental problems of this research practice. We focus on the problem of comparing multiple fully specified classifiers on a small-sample test set. On the basis of the method by Quesenberry and Hurst, we derive confidence intervals for the effect size, i.e. the difference in true classification performance. These confidence intervals disentangle the effect size from its uncertainty and thereby provide information beyond the p-value. This additional information can drastically change the way in which classification results are currently interpreted, published and acted upon. We illustrate how our reasoning can change, depending on whether we focus on p-values or confidence intervals. We argue that the conclusions from comparative classification studies should be based primarily on effect size estimation with confidence intervals, and not on significance tests and p-values.

AB - Null hypothesis significance tests and their p-values currently dominate the statistical evaluation of classifiers in machine learning. Here, we discuss fundamental problems of this research practice. We focus on the problem of comparing multiple fully specified classifiers on a small-sample test set. On the basis of the method by Quesenberry and Hurst, we derive confidence intervals for the effect size, i.e. the difference in true classification performance. These confidence intervals disentangle the effect size from its uncertainty and thereby provide information beyond the p-value. This additional information can drastically change the way in which classification results are currently interpreted, published and acted upon. We illustrate how our reasoning can change, depending on whether we focus on p-values or confidence intervals. We argue that the conclusions from comparative classification studies should be based primarily on effect size estimation with confidence intervals, and not on significance tests and p-values.

KW - classification

KW - confidence interval

KW - null hypothesis significance testing

KW - p-value

KW - reasoning

UR - http://www.scopus.com/inward/record.url?scp=84877652134&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84877652134&partnerID=8YFLogxK

U2 - 10.1080/0952813X.2012.680252

DO - 10.1080/0952813X.2012.680252

M3 - Article

AN - SCOPUS:84877652134

VL - 25

SP - 189

EP - 206

JO - Journal of Experimental and Theoretical Artificial Intelligence

JF - Journal of Experimental and Theoretical Artificial Intelligence

SN - 0952-813X

IS - 2

ER -