The Role of Latent Semantic Categories and Clustering in Enhancing the Efficiency of Human Sensitivity Review

Narvala, H., McDonald, G. and Ounis, I. (2022) The Role of Latent Semantic Categories and Clustering in Enhancing the Efficiency of Human Sensitivity Review. In: Seventh ACM SIGIR Conference on Human Information Interaction and Retrieval (ACM CHIIR 2022), Regensburg, Germany, 14-18 Mar 2022, pp. 56-66. ISBN 9781450391863 (doi: 10.1145/3498366.3505824)

Abstract

Government documents must be manually sensitivity reviewed to identify and protect any sensitive information (e.g., personal information) in the documents before the documents can be opened to the public. However, due to the large volume of born-digital documents that need to be reviewed, there is a growing need for technologies to assist human reviewers and improve the efficiency of the review process. For example, in sensitivity review, a reviewer needs to be able to quickly find documents that belong to specific latent semantic categories (e.g., documents about criminality that contain the personal details of victims). However, manually identifying such document categories is a challenging task when reviewing digital documents, due to the size of, and lack of structure in, the collections. We hypothesise that reviewing documents that are clustered by their latent semantic categories will increase the efficiency of the human reviewers, since the reviewers will be able to review related documents in sequence. In this work, we conduct a user study to evaluate the effectiveness of different clustering techniques, document metadata and automatic sensitivity classification, for grouping and prioritising documents for review, to increase the efficiency of the review process. Our study shows that reviewing documents in semantic clusters can significantly improve the efficiency (i.e., speed) of the sensitivity reviewers (+15.65%, T-Test, p<0.05) while maintaining the reviewers’ accuracy. Moreover, we propose a novel strategy for prioritising document clusters for review to maximise the number of documents that are opened to the public within a fixed reviewing time budget. Our proposed prioritisation strategy results in a significant increase in the number of documents that are opened to the public (+37.99%, T-Test, p<0.05) compared to prioritising documents without clusters.

Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Ounis, Professor Iadh and McDonald, Dr Graham and Narvala, Hitarth
Authors: Narvala, H., McDonald, G., and Ounis, I.
College/School: College of Science and Engineering > School of Computing Science
ISBN: 9781450391863
Copyright Holders: © 2022 Copyright held by the owner/author(s).
First Published: First published in CHIIR '22: ACM SIGIR Conference on Human Information Interaction and Retrieval
Publisher Policy: Reproduced in accordance with the publisher copyright policy
