Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review

Yang, E., MacAvaney, S., Lewis, D. and Frieder, O. (2022) Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review. In: 44th European Conference on Information Retrieval (ECIR 2022), Stavanger, Norway, 10-14 Apr 2022, pp. 502-517. ISBN 9783030997366 (doi: 10.1007/978-3-030-99736-6_34)

Abstract

Technology-assisted review (TAR) refers to iterative active learning workflows for document review in high recall retrieval (HRR) tasks. TAR research and most commercial TAR software have applied linear models such as logistic regression to lexical features. Transformer-based models with supervised tuning are known to improve effectiveness on many text classification tasks, suggesting their use in TAR. We indeed find that the pre-trained BERT model reduces review cost by 10% to 15% in TAR workflows simulated on the RCV1-v2 newswire collection. In contrast, we find that linear models outperform BERT for simulated legal discovery topics on the Jeb Bush e-mail collection. This suggests that the match between transformer pre-training corpora and the task domain is of greater significance than generally appreciated. Additionally, we show that just-right language model fine-tuning on the task collection before starting active learning is critical. Too little or too much fine-tuning hinders performance, making it worse than that of linear models, even for a favorable corpus such as RCV1-v2.

Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: MacAvaney, Dr Sean
Authors: Yang, E., MacAvaney, S., Lewis, D., and Frieder, O.
College/School: College of Science and Engineering > School of Computing Science
ISSN: 0302-9743
ISBN: 9783030997366
Published Online: 05 April 2022
Copyright Holders: Copyright © 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
First Published: First published in Advances in Information Retrieval. ECIR 2022. Lecture Notes in Computer Science, vol 13185
Publisher Policy: Reproduced in accordance with the publisher copyright policy
