Comparing competing algorithms: Bayesian versus frequentist hypothesis testing
An ECML/PKDD 2016 Tutorial

9:00-12:40, 19th September 2016, Riva del Garda

Hypothesis testing in machine learning - for instance, to establish whether the performance of two algorithms differs significantly - is usually performed using null hypothesis significance tests (NHST). Yet the NHST methodology has well-known drawbacks. For instance, the claimed statistical significance does not necessarily imply practical significance. Moreover, NHST cannot verify the null hypothesis and thus cannot recognize equivalent classifiers. Most importantly, NHST does not answer the researcher's question: what is the probability of the null and of the alternative hypothesis, given the observed data? Bayesian hypothesis tests overcome these problems: they compute the posterior probability of the null and the alternative hypothesis. This makes it possible to detect equivalent classifiers and to claim statistical significances that have practical impact. We will review the Bayesian counterparts of the tests most commonly adopted in machine learning, such as the correlated t-test and the signed-rank test. We will also show software implementing such tests for the most common platforms (R, Python, etc.).

The statistical comparison of competing algorithms is a fundamental task in machine learning. It is usually carried out by means of a null hypothesis significance test (NHST). Yet NHST has many well-known drawbacks. For instance, NHST can either reject the null hypothesis or fail to reject it; it cannot verify the null hypothesis. When failing to reject, the test is not stating that the hypothesis is true, so NHST is unable to conclude that two classifiers are equivalent. Moreover, the claimed statistical significance does not necessarily imply practical significance: NHST rejects the null hypothesis when the p-value is smaller than the test size $\alpha$, yet the p-value depends both on the effect size (the actual difference between the two classifiers) and on the sample size (the number of collected observations). Null hypotheses can virtually always be rejected by collecting a sufficiently large number of data points (for instance, by comparing two classifiers on a large collection of data sets). There are further drawbacks, such as the dependence on the sampling intention and the lack of a sound way to decide the size $\alpha$ of the test. We will discuss how such issues can be overcome by adopting Bayesian analysis. The Bayesian approach is generally regarded as the most principled approach for learning from data and for reasoning under uncertainty; yet, despite its numerous advantages, it is still rarely adopted in machine learning for model comparison.
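The dependence of the p-value on the sample size can be seen with a few lines of code. The following minimal sketch uses a normal approximation (a z-test rather than the exact t distribution) and illustrative numbers: a fixed, tiny effect (1% mean accuracy difference, standard deviation 10%) becomes "highly significant" simply by increasing the number of observations.

```python
import math

def z_test_p_value(mean_diff, sd_diff, n):
    """Two-sided p-value for H0: mean difference = 0 (normal approximation)."""
    z = mean_diff / (sd_diff / math.sqrt(n))
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # standard normal CDF
    return 2.0 * (1.0 - phi)

# Same tiny effect, different sample sizes:
p_small = z_test_p_value(0.01, 0.10, 30)      # ~0.58: "not significant"
p_large = z_test_p_value(0.01, 0.10, 30000)   # essentially 0: "highly significant"
```

The effect size is identical in both calls; only the amount of collected data changes the verdict.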

Tutorial Objectives: You will learn

  • the rich information provided by Bayesian analysis and how it differs from traditional (frequentist) statistical analysis;
  • the use of Bayesian tests for assessing/comparing algorithms in machine learning, and the use of the region of practical equivalence (rope) to claim that the results of the compared models are practically, and not just statistically, different;
  • Bayesian decision theory for making optimal decisions;
  • the concepts and hands-on use of modern techniques (Dirichlet processes, Markov chain Monte Carlo) that make Bayesian analysis feasible for realistic applications, and how to use free software (R, Python, Julia and Stan) for Bayesian analysis.
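The simplest nonparametric flavour of these ideas can be sketched in a few lines. In the Bayesian sign test, each observed difference is counted as falling left of, inside, or right of the rope, and the posterior over the three region probabilities is a Dirichlet distribution; Dirichlet draws can be obtained from Gamma variates with the standard library alone. The rope width, prior pseudo-counts and data below are illustrative assumptions, and we report the fraction of posterior samples in which each region has the largest probability.

```python
import random

def bayesian_sign_test(diffs, rope=0.01, prior=1.0, samples=50000, seed=0):
    """Posterior P(left), P(rope), P(right) for paired differences,
    using a Dirichlet posterior over the three regions (Bayesian sign test)."""
    rng = random.Random(seed)
    counts = [prior, prior, prior]  # symmetric Dirichlet prior pseudo-counts
    for d in diffs:
        if d < -rope:
            counts[0] += 1          # first classifier practically worse
        elif d > rope:
            counts[2] += 1          # first classifier practically better
        else:
            counts[1] += 1          # practically equivalent (inside the rope)
    wins = [0, 0, 0]
    for _ in range(samples):
        g = [rng.gammavariate(c, 1.0) for c in counts]  # one Dirichlet draw
        wins[max(range(3), key=lambda i: g[i])] += 1    # region with highest prob.
    return [w / samples for w in wins]

# illustrative accuracy differences (classifier A - classifier B) on 10 data sets
diffs = [0.02, 0.03, -0.01, 0.04, 0.02, 0.05, 0.00, 0.03, 0.02, 0.01]
p_left, p_rope, p_right = bayesian_sign_test(diffs)
```

Unlike an NHST p-value, the three posterior probabilities directly answer the researcher's question, including the probability that the two classifiers are practically equivalent.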
We will present Bayesian algorithms for the comparison of classifiers on single and on multiple data sets, as replacements for the traditional signed-rank test, sign test, t-test, etc. To this end, we will discuss parametric and non-parametric approaches to Bayesian hypothesis testing, and how to present the results of a Bayesian analysis. We will conclude by showing how to use existing software for the Bayesian comparison of classifiers.
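As a parametric example, the Bayesian correlated t-test (the replacement for the frequentist t-test when observations come from cross-validation) has a Student posterior for the mean difference, with a scale inflated by the correlation across folds. The sketch below samples that posterior by Monte Carlo using only the standard library; the data, rope width, and rho = 0.1 (corresponding to 10-fold cross-validation, i.e., n_test/(n_train + n_test)) are illustrative assumptions.

```python
import math
import random

def correlated_t_test(diffs, rope=0.01, rho=0.1, samples=50000, seed=0):
    """Posterior P(left), P(rope), P(right) for the mean accuracy difference of
    two classifiers compared via cross-validation (Bayesian correlated t-test).
    The posterior of the mean is Student-t with a correlation-corrected scale."""
    rng = random.Random(seed)
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    scale = math.sqrt((1.0 / n + rho / (1.0 - rho)) * var)  # correlation correction
    df = n - 1
    counts = [0, 0, 0]
    for _ in range(samples):
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(df / 2.0, 2.0)        # chi-square with df dof
        mu = mean + scale * z / math.sqrt(chi2 / df)  # one Student-t posterior draw
        if mu < -rope:
            counts[0] += 1
        elif mu > rope:
            counts[2] += 1
        else:
            counts[1] += 1
    return [c / samples for c in counts]

# illustrative cross-validation accuracy differences (A - B) on one data set
diffs = [0.02, 0.01, 0.03, 0.00, 0.02, 0.04, 0.01, 0.02, 0.03, 0.02]
p_left, p_rope, p_right = correlated_t_test(diffs)
```

Without the rho/(1 - rho) term this reduces to an ordinary Bayesian t-test; the correction accounts for the fact that cross-validation folds share training data and are therefore correlated.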


Time  | Duration | Content | Details
09:00 | 15 min | Introduction | Motivations and Goals
09:15 | 60 min | Null hypothesis significance tests in machine learning | NHST testing (methods and drawbacks)
10:15 | 25 min | Introduction to Bayesian tests | Bayesian model comparison versus Bayesian estimation
10:40 | 20 min | Break | Is the coffee in Riva del Garda better than the coffee in Porto?
11:00 | 35 min | Bayesian hypothesis testing for comparing classifiers | Single and hierarchical Bayesian models
11:35 | 55 min | Non-parametric Bayesian tests and presentation of the results of Bayesian analysis | Dirichlet process and how to perform nonparametric Bayesian tests
12:30 | 10 min | Summarizing! | Summary and conclusions


  • PDF: J. Demsar. Statistical comparisons of classifiers over multiple data sets.
  • PDF: A. Benavoli, G. Corani, J. Demsar, and M. Zaffalon. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis.
  • PDF: G. Corani, A. Benavoli, J. Demsar, F. Mangili, and M. Zaffalon. Statistical comparison of classifiers through Bayesian hierarchical modelling.
  • Github repository of the Tutorial (Python, Julia notebooks);
  • Slides of the Tutorial.

Giorgio Corani

Senior Researcher and Lecturer at the Dalle Molle Institute for Artificial Intelligence (IDSIA), Switzerland. Research interests: Bayesian machine learning, probabilistic graphical models, applied statistics. Co-author of about 60 papers in conferences and journals, including IJCAI, ECAI, ICML, JMLR, ECML, NIPS. Program co-chair of the International Conference on Probabilistic Graphical Models (PGM 2016). Speaker at previous tutorials on robust Bayesian networks at AAAI 2010 and IJCAI 2013.


Janez Demsar

He is Associate Professor at the Faculty of Computer and Information Science, University of Ljubljana (Slovenia), from which he also received his PhD (2002). Recipient of several prizes: teacher of the year (2008-2015) and the award for current research work (Slovenian Information Society, 2014). His research interests include machine learning, statistics and computer science education. His paper on the statistical comparison of classifiers (JMLR, 2006) has more than 4000 citations.


Alessio Benavoli

He received his M.S. (2004) and Ph.D. (2008) in Computer and Control Engineering from the University of Firenze, Italy. From April 2007 to May 2008 he worked as a system analyst for the international company SELEX-Sistemi Integrati. He is currently a senior researcher at the Dalle Molle Institute for Artificial Intelligence (IDSIA) in Lugano, Switzerland. His research interests are in the areas of Bayesian nonparametrics, data analytics, imprecise probabilities, decision making under uncertainty, filtering and control. He has co-authored about 70 peer-reviewed publications in top conferences and journals, including IJCAI, ICML, JMLR, ECML, UAI.