Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

J. Urbano and M. Schedl
WWW International Workshop on Advances in Music Information Research, pp. 917-923, 2012.

An extended version of this paper is available as "Minimal Test Collections for Low-Cost Evaluation of Audio Music Similarity and Retrieval Systems".


Reliable evaluation of Information Retrieval systems requires large amounts of relevance judgments. Making these annotations is quite complex and tedious for many Music Information Retrieval tasks, so performing such evaluations requires too much effort. A low-cost alternative is the application of Minimal Test Collection algorithms, which offer quite reliable results while significantly reducing the annotation effort. The idea is to incrementally select which documents to judge so that we can compute estimates of the effectiveness differences between systems with a certain degree of confidence. In this paper we show a first approach towards their application to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis with the MIREX 2011 data shows that the judging effort can be reduced to about 35% to obtain results with 95% confidence.
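
The incremental idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it assumes binary relevance, precision at a cutoff k as the effectiveness measure, a fixed prior probability p that an unjudged document is relevant, and a normal approximation for the confidence. The function name `confidence_and_next` and all parameters are hypothetical.

```python
import math

def confidence_and_next(run_a, run_b, judged, k=5, p=0.5):
    """Estimate P(P@k of A > P@k of B) and suggest the next document to judge.

    Illustrative assumptions (not from the paper): binary relevance,
    precision@k, independent unjudged documents with prior relevance p.
    """
    # Weight of each document in the difference P@k(A) - P@k(B).
    weights = {}
    for doc in run_a[:k]:
        weights[doc] = weights.get(doc, 0.0) + 1.0 / k
    for doc in run_b[:k]:
        weights[doc] = weights.get(doc, 0.0) - 1.0 / k

    # Expected value and variance of the difference given current judgments.
    mean, var = 0.0, 0.0
    for doc, w in weights.items():
        if doc in judged:
            mean += w * judged[doc]
        else:
            mean += w * p
            var += w * w * p * (1 - p)

    if var == 0.0:
        # Nothing left that could change the difference.
        conf = 1.0 if mean > 0 else (0.0 if mean < 0 else 0.5)
    else:
        # Normal approximation: P(difference > 0).
        conf = 0.5 * (1 + math.erf(mean / math.sqrt(2 * var)))

    # Judge next the unjudged document that affects the difference the most.
    unjudged = [d for d in weights if d not in judged and weights[d] != 0.0]
    next_doc = max(unjudged, key=lambda d: abs(weights[d]), default=None)
    return conf, next_doc
```

In use, one would call the function repeatedly, add the judgment for the suggested document to `judged`, and stop once the confidence is at least 0.95 (or at most 0.05, meaning B is likely better). Documents retrieved by both systems at the same depth get zero weight and never need judging, which is one source of the savings.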