Enhancing Sensitivity Classification with Semantic Features using Word Embeddings

McDonald, G., Macdonald, C. and Ounis, I. (2017) Enhancing Sensitivity Classification with Semantic Features using Word Embeddings. In: 39th European Conference on Information Retrieval, Aberdeen, Scotland, 8-13 April 2017, pp. 450-463. (doi: 10.1007/978-3-319-56608-5_35)

135030.pdf - Accepted Version



Government documents must be reviewed to identify any sensitive information they may contain before they can be released to the public. However, traditional paper-based sensitivity review processes are not practical for reviewing born-digital documents. There is therefore a timely need for automatic sensitivity classification techniques to assist the digital sensitivity review process. Automatic sensitivity classification is a difficult task, however, because sensitivity is typically a product of the relations between combinations of terms, such as who said what about whom. Vector representations of terms, such as word embeddings, have been shown to be effective at encoding latent term features that preserve semantic relations between terms, which can also be beneficial to sensitivity classification. In this work, we present a thorough evaluation of the effectiveness of semantic word embedding features, along with term and grammatical features, for sensitivity classification. On a test collection of government documents containing real sensitivities, we show that extending text classification with semantic features and additional term n-grams results in significant improvements in classification effectiveness, correctly classifying 9.99% more sensitive documents than the text classification baseline.
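The idea of extending term features with semantic features can be illustrated with a minimal sketch. This is not the authors' implementation: the toy embedding table, the fixed n-gram vocabulary, the mean-pooling of word vectors, and the function names `term_ngrams`, `mean_embedding` and `featurize` are all assumptions made for illustration. It only shows one plausible way to concatenate term n-gram counts with an averaged word-embedding vector into a single document feature vector.

```python
from collections import Counter

# Hypothetical toy embeddings (assumption); a real system would load
# pretrained word vectors instead of hand-written 2-dimensional ones.
TOY_EMBEDDINGS = {
    "minister": [0.9, 0.1],
    "informant": [0.8, 0.3],
    "meeting": [0.1, 0.7],
}


def term_ngrams(tokens, n_max=2):
    """Count all term n-grams of length 1..n_max in a token list."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of in-vocabulary tokens (zeros if none match)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def featurize(text, vocabulary, embeddings, dim):
    """Concatenate n-gram counts over a fixed vocabulary with the mean embedding."""
    tokens = text.lower().split()
    counts = term_ngrams(tokens)
    term_part = [float(counts[gram]) for gram in vocabulary]
    return term_part + mean_embedding(tokens, embeddings, dim)


# Illustrative vocabulary of unigrams and one bigram (assumption).
vocabulary = ["minister", "informant", "minister met"]
features = featurize("Minister met informant", vocabulary, TOY_EMBEDDINGS, dim=2)
```

The resulting vector has one entry per vocabulary n-gram followed by the embedding dimensions, and could be passed to any standard text classifier. Mean-pooling is only one simple choice for deriving document-level semantic features from word vectors.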

Item Type: Conference Proceedings
Additional Information: Published in Lecture Notes in Computer Science, v. 10193, pp. 450-463
Glasgow Author(s) Enlighten ID: Macdonald, Professor Craig and McDonald, Dr Graham and Ounis, Professor Iadh
Authors: McDonald, G., Macdonald, C., and Ounis, I.
College/School: College of Science and Engineering > School of Computing Science
Published Online: 08 April 2017
Copyright Holders: Copyright © 2017 Springer International Publishing AG
First Published: First published in Lecture Notes in Computer Science 10193:450-463
Publisher Policy: Reproduced in accordance with the copyright policy of the publisher
