Using Part-of-Speech N-grams for Sensitive-text Classification

Macdonald, G., Macdonald, C. and Ounis, I. (2015) Using Part-of-Speech N-grams for Sensitive-text Classification. In: ICTIR 2015: 5th ACM SIGIR International Conference on the Theory of Information Retrieval, Northampton, MA, USA, 27-30 Sep 2015, pp. 381-384. ISBN 9781450338332 (doi:10.1145/2808194.2809496)

Text
108080.pdf - Accepted Version

285kB

Abstract

Freedom of Information legislation in many western democracies, including the United Kingdom (UK) and the United States of America (USA), states that citizens typically have the right to access government documents. However, certain sensitive information is exempt from release into the public domain. For example, in the UK, FOIA Exemption 27 (International Relations) excludes the release of information that might damage the interests of the UK abroad. Therefore, the process of reviewing government documents for sensitivity is essential to determine if a document must be redacted before it is archived, or closed until the information is no longer sensitive. With the increased volume of digital government documents in recent years, there is a need for new tools to assist the digital sensitivity review process. Therefore, in this paper we propose an automatic approach for identifying sensitive text in documents by measuring the amount of sensitivity in sequences of text. Using government documents reviewed by trained sensitivity reviewers, we focus on an aspect of FOIA Exemption 27 which can have a major impact on international relations, namely information supplied in confidence. We show that our approach leads to markedly increased recall of sensitive text, while achieving a very high level of precision, when compared to a baseline that has been shown to be effective at identifying sensitive text in other domains.

Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Macdonald, Dr Craig and Ounis, Professor Iadh
Authors: Macdonald, G., Macdonald, C., and Ounis, I.
College/School: College of Science and Engineering > School of Computing Science
ISBN: 9781450338332
Copyright Holders: Copyright © 2015 ACM
Publisher Policy: Reproduced in accordance with the copyright policy of the publisher