Towards a universal representation for audio information retrieval and analysis

Sand Jensen, B. , Troelsgaard, R., Larsen, J. and Hansen, L. K. (2013) Towards a universal representation for audio information retrieval and analysis. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 26-31 May 2013, pp. 3168-3172. ISBN 9781479903566 (doi:10.1109/ICASSP.2013.6638242)



A fundamental and general representation of audio and music that integrates multi-modal data sources is important both for applications and for basic research. In this paper we address this challenge by proposing a multi-modal version of the Latent Dirichlet Allocation model, which provides a joint latent representation. We evaluate this representation on the Million Song Dataset by integrating three fundamentally different modalities: tags, lyrics, and audio features. We show that the resulting representation aligns with common 'cognitive' variables such as tags, and provide some evidence for the common assumption that genres form an acceptable categorization when evaluating latent representations of music. We furthermore quantify the model by its predictive performance on genre and style, providing benchmark results for the Million Song Dataset.
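The abstract describes a multi-modal LDA in which the modalities of a document (here: tags, lyrics, audio features) share a single document-topic distribution while each modality keeps its own topic-word distribution. The sketch below is not the authors' implementation; it is a minimal collapsed Gibbs sampler illustrating that structure, with assumed function names, hyperparameters, and toy data.

```python
import numpy as np

def multimodal_lda_gibbs(docs, vocab_sizes, n_topics=4, n_iter=100,
                         alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for a multi-modal LDA variant (illustrative).

    docs: list of documents; each document is a list with one token-id
    list per modality. All modalities of a document share one topic
    proportion vector; each modality has its own topic-word distribution.
    Returns (theta, phis): document-topic proportions and per-modality
    topic-word distributions.
    """
    rng = np.random.default_rng(seed)
    D, M = len(docs), len(vocab_sizes)
    ndk = np.zeros((D, n_topics))                         # doc-topic counts, pooled over modalities
    nkw = [np.zeros((n_topics, V)) for V in vocab_sizes]  # topic-word counts per modality
    nk = [np.zeros(n_topics) for _ in range(M)]           # topic totals per modality
    z = []                                                # topic assignment per token
    for d, doc in enumerate(docs):                        # random initialization
        zd = []
        for m, tokens in enumerate(doc):
            zm = rng.integers(n_topics, size=len(tokens))
            for t, k in zip(tokens, zm):
                ndk[d, k] += 1; nkw[m][k, t] += 1; nk[m][k] += 1
            zd.append(zm)
        z.append(zd)
    for _ in range(n_iter):                               # collapsed Gibbs sweeps
        for d, doc in enumerate(docs):
            for m, tokens in enumerate(doc):
                V = vocab_sizes[m]
                for i, t in enumerate(tokens):
                    k = z[d][m][i]                        # remove current assignment
                    ndk[d, k] -= 1; nkw[m][k, t] -= 1; nk[m][k] -= 1
                    # shared doc-topic term times modality-specific word term
                    p = (ndk[d] + alpha) * (nkw[m][:, t] + beta) / (nk[m] + V * beta)
                    k = rng.choice(n_topics, p=p / p.sum())
                    z[d][m][i] = k
                    ndk[d, k] += 1; nkw[m][k, t] += 1; nk[m][k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phis = [(nkw[m] + beta) / (nkw[m] + beta).sum(axis=1, keepdims=True)
            for m in range(M)]
    return theta, phis

# Toy usage: two documents, two modalities (e.g. tag ids and lyric-word ids).
docs = [
    [[0, 1, 0], [2, 2]],
    [[3, 3, 2], [0, 1]],
]
theta, phis = multimodal_lda_gibbs(docs, vocab_sizes=[4, 3], n_topics=2, n_iter=50)
```

Because the document-topic counts `ndk` pool tokens from every modality, the inferred `theta` is the joint latent representation the abstract refers to; predictions of genre or style would then be made from `theta`.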

Item Type: Conference Proceedings
Glasgow Author(s) Enlighten ID: Jensen, Dr Bjorn
Authors: Sand Jensen, B., Troelsgaard, R., Larsen, J., and Hansen, L. K.
College/School: College of Science and Engineering > School of Computing Science
