Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers

Daniel Berrar

Research output: Research › peer-review › Article

Abstract

Null hypothesis significance testing is routinely used for comparing the performance of machine learning algorithms. Here, we provide a detailed account of the major underrated problems that this common practice entails. For example, omnibus tests, such as the widely used Friedman test, are not appropriate for the comparison of multiple classifiers over diverse data sets. In contrast to the view that significance tests are essential to a sound and objective interpretation of classification results, our study suggests that no such tests are needed. Instead, greater emphasis should be placed on the magnitude of the performance difference and the investigator’s informed judgment. As an effective tool for this purpose, we propose confidence curves, which depict nested confidence intervals at all levels for the performance difference. These curves enable us to assess the compatibility of an infinite number of null hypotheses with the experimental results. We benchmarked several classifiers on multiple data sets and analyzed the results with both significance tests and confidence curves. Our conclusion is that confidence curves effectively summarize the key information needed for a meaningful interpretation of classification results while avoiding the intrinsic pitfalls of significance tests.
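
For readers unfamiliar with the construction, the following minimal sketch illustrates the idea in Python. It is not the paper's exact method: the paired accuracy scores are invented placeholders, and a simple paired t-statistic is used to trace the two-sided p-value for every hypothesized performance difference; cutting the resulting curve horizontally at level alpha spans the (1 - alpha) confidence interval, so the curve depicts all nested intervals at once.

# Minimal illustrative sketch (hypothetical data, t-based approximation);
# not the construction used in the paper itself.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Invented paired accuracy scores of two classifiers on the same ten folds.
a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81])
b = np.array([0.78, 0.77, 0.80, 0.79, 0.81, 0.76, 0.79, 0.83, 0.78, 0.80])

d = a - b                                  # paired performance differences
n = len(d)
mean = d.mean()
se = d.std(ddof=1) / np.sqrt(n)            # standard error of the mean difference

# Two-sided p-value of H0: delta = d0 for a grid of hypothesized differences d0.
# Plotting p against d0 gives the confidence curve: it peaks (p = 1) at the
# point estimate, and each horizontal cut at alpha marks the (1 - alpha) CI.
grid = np.linspace(mean - 4 * se, mean + 4 * se, 400)
pvals = 2 * stats.t.sf(np.abs((mean - grid) / se), df=n - 1)

plt.plot(grid, pvals)
plt.axvline(0.0, linestyle="--")           # the conventional point null: no difference
plt.xlabel("hypothesized performance difference")
plt.ylabel("two-sided p-value")
plt.title("Confidence curve for the mean performance difference")
plt.show()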

Language: English
Pages: 1-39
Number of pages: 39
Journal: Machine Learning
DOI: 10.1007/s10994-016-5612-6
State: Accepted/In press - 30 Dec 2016

Keywords

  • Confidence curve
  • Multiple comparisons
  • p value
  • Performance evaluation
  • Significance test

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence

Cite this

Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers. / Berrar, Daniel.

In: Machine Learning, 30.12.2016, p. 1-39.

@article{8822507a36eb4d2d888b465da065bdbb,
  title     = "Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers",
  author    = "Daniel Berrar",
  year      = "2016",
  month     = "12",
  doi       = "10.1007/s10994-016-5612-6",
  pages     = "1--39",
  journal   = "Machine Learning",
  issn      = "0885-6125",
  publisher = "Springer Netherlands",
  keywords  = "Confidence curve, Multiple comparisons, p value, Performance evaluation, Significance test",
}
