A Probabilistic Framework for the Hierarchic Organisation & Classification of Document Collections

Vinokourov, A. and Girolami, M. (2002) A Probabilistic Framework for the Hierarchic Organisation & Classification of Document Collections. Journal of Intelligent Information Systems, 18(2&3), pp. 153-172. (doi: 10.1023/A:1013677411002)

Abstract

This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. It demonstrates that the probabilistic corpus model that emerges from the automatic (unsupervised) hierarchical organisation of a document collection can be further exploited to create a kernel that boosts the performance of state-of-the-art support vector machine document classifiers. The classifier's performance is enhanced further when the kernel is derived from an appropriate hierarchic mixture model used to partition the document corpus, rather than from a flat, non-hierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. The approach can be regarded as an effective combination of documents with no topic or class labels (unlabeled data), labeled documents, and prior domain knowledge (in the form of the known hierarchic structure) to provide enhanced document classification performance.

Item Type: Articles
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Girolami, Prof Mark
Authors: Vinokourov, A., and Girolami, M.
College/School: College of Science and Engineering > School of Computing Science
Journal Name: Journal of Intelligent Information Systems
ISSN: 0925-9902
