A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation

J. Urbano, M. Marrero and D. Martín
International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 925-928, 2013.


Previous research has suggested the permutation test as the theoretically optimal statistical significance test for IR evaluation, and advocated for the discontinuation of the Wilcoxon and sign tests. We present a large-scale study comprising nearly 60 million system comparisons showing that in practice the bootstrap, t-test and Wilcoxon test outperform the permutation test under different optimality criteria. We also show that actual error rates seem to be lower than the theoretically expected 5%, further confirming that we may actually be underestimating significance.
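
For readers unfamiliar with two of the tests compared above, the following is a minimal sketch of a paired permutation (randomization) test and a paired bootstrap test applied to per-topic effectiveness scores of two systems. It is an illustration under stated assumptions, not the procedure used in the study: the function names, the number of trials, the two-sided formulation, the shift-to-null bootstrap variant and the example scores are all hypothetical.

```python
import random
import statistics


def paired_permutation_test(a, b, trials=10000, seed=0):
    """Two-sided paired randomization test on the mean per-topic difference.

    a, b: per-topic scores of two systems over the same topics (assumed input).
    Returns a Monte Carlo estimate of the p-value.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the system labels are exchangeable within
        # each topic, so every per-topic difference may flip its sign.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(permuted)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


def paired_bootstrap_test(a, b, trials=10000, seed=0):
    """Two-sided paired bootstrap test using a shift-to-null formulation.

    Topics are resampled with replacement; each resampled mean difference is
    centered on the observed mean so that it reflects the null hypothesis.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.fmean(diffs)
    observed = abs(mean_diff)
    hits = 0
    for _ in range(trials):
        sample = rng.choices(diffs, k=len(diffs))
        if abs(statistics.fmean(sample) - mean_diff) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    # Hypothetical per-topic scores for two systems on ten topics.
    system_a = [0.42, 0.31, 0.55, 0.28, 0.61, 0.47, 0.39, 0.52, 0.33, 0.58]
    system_b = [0.40, 0.25, 0.50, 0.30, 0.54, 0.41, 0.37, 0.45, 0.29, 0.51]
    print("permutation p-value:", paired_permutation_test(system_a, system_b))
    print("bootstrap p-value:  ", paired_bootstrap_test(system_a, system_b))
```

Both functions return a p-value that would be compared against a significance level such as the 5% mentioned above. The fixed seed only makes the sketch reproducible; a real evaluation would typically use more trials and a vetted statistics library.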