Templated text synthesis for expert-guided multi-label extraction from radiology reports

Schrempf, P., Watson, H., Park, E., Pajak, M., MacKinnon, H., Muir, K. W., Harris-Birtill, D. and O'Neil, A. Q. (2021) Templated text synthesis for expert-guided multi-label extraction from radiology reports. Machine Learning and Knowledge Extraction, 3(2), pp. 299-317. (doi: 10.3390/make3020015)

236594.pdf - Published Version
Available under License Creative Commons Attribution.

Abstract

Training medical image analysis models traditionally requires large amounts of expertly annotated imaging data, which is time-consuming and expensive to obtain. One solution is to automatically extract scan-level labels from radiology reports. Previously, we showed that, by extending BERT with a per-label attention mechanism, we can train a single model to perform automatic extraction of many labels in parallel. However, if we rely on pure data-driven learning, the model sometimes fails to learn critical features or learns the correct answer via simplistic heuristics (e.g., that “likely” indicates positivity), and thus fails to generalise to rarer cases which have not been learned or where the heuristics break down (e.g., “likely represents prominent VR space or lacunar infarct”, which indicates uncertainty over two differential diagnoses). In this work, we propose template creation for data synthesis, which enables us to inject expert knowledge about unseen entities from medical ontologies, and to teach the model rules on how to label difficult cases, by producing relevant training examples. Using this technique alongside domain-specific pre-training for our underlying BERT architecture, i.e., PubMedBERT, we improve F1 micro from 0.903 to 0.939 and F1 macro from 0.512 to 0.737 on an independent test set for 33 labels in head CT reports for stroke patients. Our methodology offers a practical way to combine domain knowledge with machine learning for text classification tasks.
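
The template idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the template wording, the label names (`infarct`, `perivascular_space`), the term lists, and the `synthesise` helper are all hypothetical, chosen only to show how expert-written templates combined with ontology terms could yield labelled training sentences, including a differential-diagnosis case like the one quoted above.

```python
from itertools import permutations, product

# Hypothetical expert-written templates paired with the status they teach.
TEMPLATES = [
    ("There is evidence of {entity}.", "positive"),
    ("No evidence of {entity}.", "negative"),
    # Two differential diagnoses: both are uncertain.
    ("Likely represents {entity} or {other}.", "uncertain"),
]

# Hypothetical label -> surface terms mapping, standing in for an ontology.
ONTOLOGY = {
    "perivascular_space": ["prominent VR space"],
    "infarct": ["lacunar infarct", "ischaemic infarct"],
}


def synthesise(templates, ontology):
    """Fill each template with ontology terms; return (text, labels) pairs."""
    examples = []
    for template, status in templates:
        if "{other}" in template:
            # Pair terms drawn from two different labels (ordered pairs).
            for (label_a, terms_a), (label_b, terms_b) in permutations(
                ontology.items(), 2
            ):
                for a, b in product(terms_a, terms_b):
                    text = template.format(entity=a, other=b)
                    examples.append((text, {label_a: status, label_b: status}))
        else:
            for label, terms in ontology.items():
                for term in terms:
                    text = template.format(entity=term)
                    examples.append((text, {label: status}))
    return examples


examples = synthesise(TEMPLATES, ONTOLOGY)
for text, labels in examples:
    print(text, labels)
```

Each synthetic sentence carries its labels by construction, so examples of this kind could be added to the training set for rare entities or for rules that data-driven learning tends to get wrong.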

Item Type: Articles
Additional Information: This work is part of the Industrial Centre for AI Research in digital Diagnostics (iCAIRD), which is funded by Innovate UK on behalf of UK Research and Innovation (UKRI) project number 104690. The Data Lab has also provided support and funding.
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Muir, Professor Keith
Creator Roles:
Muir, K.: Methodology, Resources, Data curation, Writing – review and editing, Funding acquisition
Authors: Schrempf, P., Watson, H., Park, E., Pajak, M., MacKinnon, H., Muir, K. W., Harris-Birtill, D., and O'Neil, A. Q.
College/School: College of Medical, Veterinary and Life Sciences > Institute of Neuroscience and Psychology
Journal Name: Machine Learning and Knowledge Extraction
Publisher: MDPI
ISSN: 2504-4990
ISSN (Online): 2504-4990
Copyright Holders: Copyright © 2021 The Authors
First Published: First published in Machine Learning and Knowledge Extraction 3(2):299-317
Publisher Policy: Reproduced under a Creative Commons License
