An ECML/PKDD 2016 Tutorial

9:00-12:40, 19 September 2016

Hypothesis testing in machine learning - for instance, to establish whether the performance of two algorithms is significantly different - is usually performed using null hypothesis significance tests (nhst). Yet the nhst methodology has well-known drawbacks. For instance, the claimed statistical significances do not necessarily imply practical significance. Moreover, nhst cannot verify the null hypothesis and thus cannot recognize equivalent classifiers. Most importantly, nhst does not answer the researcher's question: what is the probability of the null and of the alternative hypothesis, given the observed data? Bayesian hypothesis tests overcome such problems: they compute the posterior probability of the null and the alternative hypothesis. This makes it possible to detect equivalent classifiers and to claim statistical significances that have a practical impact. We will review the Bayesian counterparts of the tests most commonly adopted in machine learning, such as the correlated t-test and the signed-rank test. We will also show software implementing such tests for the most common platforms (R, Python, etc.).

The *statistical comparison of competing algorithms* is a fundamental task in machine learning. It is usually carried out by means of a null hypothesis significance test (nhst).
Yet, *nhst has many well-known drawbacks*.
For instance, nhst can either reject the null hypothesis or fail to reject it. It cannot verify the
null hypothesis: when failing to reject it, the test is not stating that the hypothesis is true.
Nhst is thus unable to conclude that two classifiers are equivalent.
Moreover, the claimed statistical significance does not necessarily imply practical significance.
Nhst rejects the null hypothesis when the p-value is smaller than the test size $\alpha$.
Yet the p-value depends both on the effect size (the actual difference between the two classifiers) and on the sample size (the number of collected observations).
Null hypotheses can thus virtually always be rejected by collecting a sufficiently large number of data points (for instance, by comparing two classifiers on a large collection of data sets), as the short simulation below illustrates.
There are many other drawbacks, such as the dependence on the sampling intention and the lack of a sound way to choose the size $\alpha$ of the test.
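
A minimal simulation of the sample-size effect (all numbers are made up for illustration): for a fixed, practically negligible mean difference of 0.001 in accuracy, a t-test on the paired differences eventually rejects the null hypothesis once enough observations are collected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
effect, noise = 0.001, 0.02  # assumed: a negligible 0.1% accuracy difference

for n in (10, 100, 1000, 100000):
    diffs = rng.normal(effect, noise, size=n)   # simulated paired differences
    t_stat, p_value = stats.ttest_1samp(diffs, 0.0)
    print(f"n = {n:6d}   p-value = {p_value:.4f}")
```
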
We will discuss how such issues can be overcome by adopting **Bayesian analysis**.
The Bayesian approach is generally regarded as *the most principled approach
for learning from data and for reasoning under uncertainty*; yet it is still rarely adopted in machine learning for model comparison, despite its numerous advantages.

**Tutorial Objectives:** You will learn

- the rich information provided by Bayesian analysis and how it differs from traditional (frequentist) statistical analysis;
- the use of Bayesian tests for assessing and comparing algorithms in machine learning, and the use of the region of practical equivalence (rope) to claim that the results of the compared models are practically, not just statistically, different (see the sketch after this list);
- Bayesian decision theory for optimal decision making;
- the concepts and hands-on use of modern algorithms (Dirichlet process, Markov chain Monte Carlo) that make Bayesian analysis feasible for realistic applications, and how to use the free software R, Python, Julia and STAN for Bayesian analysis.
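
As a taste of the rope-based analysis, here is a minimal sketch in Python of the Bayesian correlated t-test of Corani and Benavoli, assuming a Student-t posterior on the mean difference with the Nadeau-Bengio correlation correction $\rho = 1/k$ for k-fold cross-validation; the rope half-width of 0.01 and the simulated scores are illustrative assumptions, not tutorial data.

```python
import numpy as np
from scipy import stats

def bayesian_correlated_ttest(diffs, rho, rope=0.01):
    """Posterior probabilities that classifier A is practically worse than,
    equivalent to, or better than classifier B, given per-fold differences."""
    diffs = np.asarray(diffs, dtype=float)
    n = len(diffs)
    mean, var = diffs.mean(), diffs.var(ddof=1)
    # Student-t posterior of the mean difference; the correction term accounts
    # for the correlation induced by overlapping training sets across folds.
    post = stats.t(df=n - 1, loc=mean,
                   scale=np.sqrt((1.0 / n + rho / (1.0 - rho)) * var))
    p_worse = post.cdf(-rope)
    p_rope = post.cdf(rope) - p_worse
    p_better = 1.0 - post.cdf(rope)
    return p_worse, p_rope, p_better

# Illustrative differences from 10 runs of 10-fold cross-validation (rho = 1/10):
rng = np.random.default_rng(42)
diffs = rng.normal(0.01, 0.02, size=100)
print(bayesian_correlated_ttest(diffs, rho=0.1))
```

Unlike a p-value, the three returned numbers answer the researcher's question directly: they are the posterior probabilities of the left, rope and right regions.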

Time | Duration | Content | Details |
---|---|---|---|
09:00 | 15 min | Introduction | Motivations and Goals |
09:15 | 60 min | Null hypothesis significance tests in machine learning | NHST testing (methods and drawbacks) |
10:15 | 25 min | Introduction to Bayesian tests | Bayesian model comparison versus Bayesian estimation |
10:40 | 20 min | Break | Is the coffee in Riva del Garda better than the coffee in Porto? |
11:00 | 35 min | Bayesian hypothesis testing for comparing classifiers | Single and hierarchical Bayesian models |
11:35 | 55 min | Non-parametric Bayesian tests and presentation of the results of Bayesian analysis | The Dirichlet process and how to perform non-parametric Bayesian tests |
12:30 | 10 min | Summarizing! | Summary and conclusions |
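
As a preview of the non-parametric session (11:35), here is a minimal sketch of the Bayesian sign test, assuming a Dirichlet-process prior of strength $s$ whose prior pseudo-observation falls inside the rope, so that the posterior over the probabilities of the three regions (worse, practically equivalent, better) is a Dirichlet distribution we can sample directly; again, the data are made up for illustration.

```python
import numpy as np

def bayesian_sign_test(diffs, rope=0.01, s=1.0, samples=50_000, seed=0):
    diffs = np.asarray(diffs, dtype=float)
    counts = np.array([
        np.sum(diffs < -rope),                           # A practically worse
        s + np.sum((-rope <= diffs) & (diffs <= rope)),  # practically equivalent
        np.sum(diffs > rope),                            # A practically better
    ], dtype=float)
    # For simplicity this sketch assumes every region contains an observation
    # (a zero Dirichlet parameter would need special handling).
    theta = np.random.default_rng(seed).dirichlet(counts, size=samples)
    # Fraction of posterior samples in which each region is the most probable:
    return np.bincount(theta.argmax(axis=1), minlength=3) / samples

diffs = [0.02, 0.01, -0.02, 0.03, 0.0, 0.015, 0.02, -0.012, 0.025, 0.005]
p_worse, p_rope, p_better = bayesian_sign_test(diffs)
print(f"P(worse) = {p_worse:.3f}, P(rope) = {p_rope:.3f}, P(better) = {p_better:.3f}")
```
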

- PDF: J. Demsar. Statistical comparisons of classifiers over multiple data sets.
- PDF: A. Benavoli, G. Corani, J. Demsar, and M. Zaffalon. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis.
- PDF: G. Corani, A. Benavoli, J. Demsar, F. Mangili, and M. Zaffalon. Statistical comparison of classifiers through Bayesian hierarchical modelling.
- GitHub repository of the Tutorial (Python and Julia notebooks).
- Slides of the Tutorial.