Preventing Hubness in Music Information Retrieval
The so-called “hubness” phenomenon is a general problem of machine learning in high-dimensional data spaces. Hubs are data points that appear unusually often in the nearest neighbor lists of many other data points. This effect is particularly problematic for similarity search algorithms, because the same “similar” objects are returned over and over again. It also has adverse effects on the many machine learning algorithms that make use of distance information. The effect has been shown to be a natural consequence of high dimensionality and as such is yet another aspect of the curse of dimensionality.
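To make the notion of a hub concrete, the following minimal sketch counts how often each point appears in the k-nearest-neighbor lists of all other points (its k-occurrence) and summarizes the resulting distribution by its skewness, a measure of hubness commonly used in the literature. The function names `k_occurrence` and `skewness`, the choice of Euclidean distance, and the random Gaussian test data are illustrative assumptions and are not taken from the HUB Toolbox.

```python
import math
import random


def k_occurrence(points, k):
    """Count how often each point appears among the k nearest
    neighbors of the other points (Euclidean distance, assumed here)."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Distances from point i to every other point, paired with the index.
        dists = [(math.dist(points[i], points[j]), j) for j in range(n) if j != i]
        dists.sort()  # ties are broken by the lower index
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Third standardized moment; large positive values indicate that
    a few points (hubs) have a very high k-occurrence."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical demo data: i.i.d. Gaussian points in low and high dimension.
    for dim in (3, 50):
        data = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
        occ = k_occurrence(data, k=5)
        print(dim, "dimensions: skewness of 5-occurrence =", round(skewness(occ), 3))
```

A point whose k-occurrence is far above k acts as a hub, while a point with a k-occurrence of zero is never returned as a neighbor of any other point.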
The hub problem has gained particular attention in the field of Music Information Retrieval (MIR), which is the interdisciplinary science of extracting information from music. In MIR, the hub problem has been primarily studied in the context of music recommendation based on modeling of audio similarity. Songs which act as hubs are reported as being similar to very many other songs and hence keep a significant proportion of the audio collection from being recommended at all.
The main goal of this project was to conduct an in-depth study of the hubness problem in the context of MIR. We developed three approaches that substantially reduce the negative effects of hubness. Two methods re-scale the problematic high-dimensional distance spaces, either locally or globally, resulting in a transformed distance space that does not show the problematic hub effects. The third method uses hubness analysis to choose a distance function other than the ubiquitous Euclidean norm. In all these new distance spaces, songs that previously acted as hubs no longer crowd the recommendation lists, and the full audio collections are accessible again. These methods have also been evaluated on standard machine learning data sets and in the context of image and text retrieval, collaborative filtering, speaker verification and speech recognition. In all these application scenarios hubness is greatly reduced and performance measures such as accuracy, precision and recall are improved.
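The two re-scaling ideas can be sketched on a small distance matrix. The sketch below assumes the commonly cited formulations: local scaling maps a distance d(x,y) to 1 - exp(-d(x,y)^2 / (s_x * s_y)), where s_x is the distance from x to its k-th nearest neighbor, and the empirical form of mutual proximity counts the objects that are farther away from both x and y than x and y are from each other. The function names, the normalization by n, and the toy matrix are assumptions made for illustration; this is not the code of the HUB Toolbox, and the publications listed on this page remain the authoritative description.

```python
import math


def local_scaling(D, k):
    """Locally re-scale a symmetric distance matrix D (list of lists).
    Assumes all points are distinct, so the scale factors are non-zero."""
    n = len(D)
    # sigma[i]: distance from point i to its k-th nearest neighbor (self excluded).
    sigma = [sorted(D[i][j] for j in range(n) if j != i)[k - 1] for i in range(n)]
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                S[i][j] = 1.0 - math.exp(-D[i][j] ** 2 / (sigma[i] * sigma[j]))
    return S


def mutual_proximity(D):
    """Globally re-scale D with an empirical mutual proximity estimate.
    Returns a distance (1 - proximity). Cubic in n, for illustration only."""
    n = len(D)
    M = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            d = D[x][y]
            # Objects that are farther from x AND farther from y than d.
            both = sum(1 for j in range(n) if D[x][j] > d and D[y][j] > d)
            # Dividing by n is an assumption; other normalizations are possible.
            M[x][y] = M[y][x] = 1.0 - both / n
    return M


if __name__ == "__main__":
    # Hypothetical toy data: three points on a line at positions 0, 1 and 3.
    D = [[0.0, 1.0, 3.0],
         [1.0, 0.0, 2.0],
         [3.0, 2.0, 0.0]]
    print(local_scaling(D, k=1))
    print(mutual_proximity(D))
```

Both functions return a new matrix of the same shape, so a nearest-neighbor search or a hubness analysis can be run on the re-scaled distances in place of the original ones.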
- Flexer A., Schnitzer D.: Can Shared Nearest Neighbors Reduce Hubness in High-Dimensional Spaces?, in Proceedings of the 1st International Workshop on High Dimensional Data Mining (HDM), in conjunction with the IEEE International Conference on Data Mining (IEEE ICDM 2013), Dallas, Texas, USA, 2013.
- Flexer A., Schnitzer D.: Using mutual proximity for novelty detection in audio music similarity, in Proceedings of the 6th International Workshop on Machine Learning and Music, Prague, Czech Republic, 2013.
- Flexer A., Schnitzer D., Schlüter J.: A MIREX meta-analysis of hubness in audio music similarity, Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR'12), Porto, Portugal, October 8th-12th, 2012. Also available as TR-2012-06. This paper received the Best Paper Award.
- Knees P., Schnitzer D., Flexer A.: Improving Neighborhood-Based Collaborative Filtering by Reducing Hubness, Proceedings of ACM International Conference on Multimedia Retrieval (ICMR), 2014. Supplemental data and code are available.
- Schedl M., Flexer A.: Putting the User in the Center of Music Information Retrieval, Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR'12), Porto, Portugal, October 8th-12th, 2012.
- Schedl M., Schnitzer D.: Location-Aware Music Artist Recommendation, Proceedings of the 20th International Conference on MultiMedia Modeling (MMM 2014), Dublin, Ireland, January 2014.
- Schedl M., Schnitzer D.: Hybrid Retrieval Approaches to Geospatial Music Recommendation, Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013), Dublin, Ireland, 2013.
- Schedl M., Flexer A., Urbano J.: The Neglected User in Music Information Retrieval Research, Journal of Intelligent Information Systems, Volume 41, Issue 3, pp. 523-539, December 2013.
- Schnitzer D., Flexer A.: Choosing the Metric in High-Dimensional Spaces Based on Hub Analysis, Proceedings of the 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2014.
- Schnitzer D., Flexer A., Tomasev N.: A Case for Hubness Removal in High-Dimensional Multimedia Retrieval, Proceedings of the 36th European Conference on Information Retrieval (ECIR), 2014.
- Schnitzer D., Flexer A., Schedl M., Widmer G.: Local and Global Scaling Reduce Hubs in Space, Journal of Machine Learning Research, Volume 13(Oct):2871-2902, 2012.
- Schnitzer D., Flexer A., Schlüter J.: The Relation of Hubs to the Doddington Zoo in Speaker Verification, in Proceedings of the 21st European Signal Processing Conference (EUSIPCO'2013), September 9-13, Marrakech, Morocco, 2013.
Matlab implementations of the algorithms developed during the course of this project are published as the HUB Toolbox. The full toolbox is available on GitHub.
- Download: hub_toolbox (November 3, 2015).
- The toolbox includes Matlab code for Mutual Proximity (with different distance distribution assumptions), Local Scaling and Shared Nearest Neighbors. Evaluation functions include hubness analysis functions, the Goodman-Kruskal index, KNN evaluation functions and a function to estimate the intrinsic dimensionality. Please refer to the README, the documentation of the functions themselves and our publications listed above for a more detailed description.
- The full hubness evaluation framework as used in our publication "Local and Global Scaling Reduce Hubs in Space".
- Arthur Flexer
- Jan Schlüter
- Dominik Schnitzer
- Gerhard Widmer