Extending TextAE for annotation of non-contiguous entities

Lever, J., Altman, R. and Kim, J.-D. (2020) Extending TextAE for annotation of non-contiguous entities. Genomics and Informatics, 18(2), e15. (doi: 10.5808/gi.2020.18.2.e15) (PMID:32634869) (PMCID:PMC7362949)

242648.pdf - Published Version
Available under License Creative Commons Attribution.



Named entity recognition tools identify mentions of biomedical entities in free text and are essential components of high-quality information retrieval and extraction systems. Without good entity recognition, methods will mislabel searched text and will either miss important information or identify spurious text that frustrates users. Most tools do not capture non-contiguous entities, which are separate spans of text that together refer to a single entity, e.g., the entity “type 1 diabetes” in the phrase “type 1 and type 2 diabetes.” This type of entity is common in biomedical texts, especially in lists, where multiple biomedical entities are named in shortened form to avoid repeating words. Most text annotation systems that enable users to view and edit entity annotations do not support non-contiguous entities. Therefore, experts cannot even visualize non-contiguous entities, let alone annotate them to build valuable datasets for machine learning methods. To address this problem, and as part of the BLAH6 hackathon, we extended the TextAE platform to allow visualization and annotation of non-contiguous entities. This enables users to add new subspans to existing entities by selecting additional text. We integrate this new functionality with TextAE’s existing editing functionality to allow easy changes to entity annotations and editing of relation annotations involving non-contiguous entities, with importing and exporting to the PubAnnotation format. Finally, we roughly quantify the problem across the entire accessible biomedical literature to highlight that a substantial number of non-contiguous entities appear in lists and would be missed by most text mining systems.
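To make the idea of a non-contiguous entity concrete, the following minimal sketch represents an entity as a list of character-offset subspans over the abstract's own example phrase. The `Entity` class, its method names, and the labels are hypothetical and chosen for illustration only; they are not TextAE's internal data model and do not reproduce the PubAnnotation format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entity:
    """Hypothetical entity annotation made of one or more (begin, end) subspans."""
    label: str
    spans: List[Tuple[int, int]] = field(default_factory=list)

    def add_subspan(self, begin: int, end: int) -> None:
        # Mirrors the idea of selecting additional text to extend an entity.
        if begin >= end:
            raise ValueError("a subspan must have begin < end")
        self.spans.append((begin, end))
        self.spans.sort()  # keep subspans in reading order

    def is_contiguous(self) -> bool:
        # A single subspan means an ordinary, contiguous entity.
        return len(self.spans) == 1

    def surface_text(self, text: str) -> str:
        # Join the covered pieces of text in reading order.
        return " ".join(text[b:e] for b, e in self.spans)


text = "type 1 and type 2 diabetes"

# "type 1" (offsets 0-6) plus "diabetes" (offsets 18-26): non-contiguous.
t1d = Entity(label="Disease")
t1d.add_subspan(18, 26)
t1d.add_subspan(0, 6)

# "type 2 diabetes" (offsets 11-26): an ordinary contiguous entity.
t2d = Entity(label="Disease")
t2d.add_subspan(11, 26)

print(t1d.surface_text(text), t1d.is_contiguous())
print(t2d.surface_text(text), t2d.is_contiguous())
```

A tool that only supports single spans could record "type 2 diabetes" here, but would have to either drop "type 1 diabetes" or mark the whole phrase, which is the gap the abstract describes.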

Item Type: Articles
Glasgow Author(s) Enlighten ID: Lever, Dr Jake
Authors: Lever, J., Altman, R., and Kim, J.-D.
College/School: College of Science and Engineering > School of Computing Science
Journal Name: Genomics and Informatics
Publisher: Korea Genome Organization
ISSN (Online): 2234-0742
Published Online: 15 June 2020
Copyright Holders: Copyright © 2020 Korea Genome Organization
First Published: First published in Genomics and Informatics 18(2): e15
Publisher Policy: Reproduced under a Creative Commons License
