A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation

Piao, S., Dallachy, F., Baron, A., Demmen, J., Wattam, S., Durkin, P., McCracken, J., Rayson, P. and Alexander, M. (2017) A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation. Computer Speech and Language, 46, 113-135. (doi: 10.1016/j.csl.2017.04.010)

141814.pdf - Published Version
Available under License Creative Commons Attribution.



Automatic extraction and analysis of meaning-related information from natural language data has been an important issue in a number of research areas, such as natural language processing (NLP), text mining, corpus linguistics, and data science. An important aspect of such information extraction and analysis is the semantic annotation of language data using a semantic tagger. In practice, various semantic annotation tools have been designed to carry out different levels of semantic annotation, such as topics of documents, semantic role labeling, named entities or events. Currently, the majority of existing semantic annotation tools identify and tag partial core semantic information in language data, but they tend to be applicable only for modern language corpora. While such semantic analyzers have proven useful for various purposes, a semantic annotation tool that is capable of annotating deep semantic senses of all lexical units, or all-words tagging, is still desirable for a deep, comprehensive semantic analysis of language data. With large-scale digitization efforts underway, delivering historical corpora with texts dating from the last 400 years, a particularly challenging aspect is the need to adapt the annotation in the face of significant word meaning change over time. In this paper, we report on the development of a new semantic tagger (the Historical Thesaurus Semantic Tagger), and discuss challenging issues we faced in this work. This new semantic tagger is built on existing NLP tools and incorporates a large-scale historical English thesaurus linked to the Oxford English Dictionary. Employing contextual disambiguation algorithms, this tool is capable of annotating lexical units with a historically-valid highly fine-grained semantic categorization scheme that contains about 225,000 semantic concepts and 4,033 thematic semantic categories.
In terms of novelty, it is adapted for processing historical English data, with rich information about historical usage of words and a spelling variant normalizer for historical forms of English. Furthermore, it is able to make use of knowledge about the publication date of a text to adapt its output. In our evaluation, the system achieved encouraging accuracies ranging from 77.12% to 91.08% on individual test texts. Applying time-sensitive methods improved results by as much as 3.54% and by 1.72% on average.
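
The idea of using a text's publication date to adapt the tagger's output can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sense` record, the `senses_for_date` function, the fallback behaviour, and all sense labels and years below are hypothetical and chosen only for illustration. The sketch assumes each candidate sense of a word carries a first and last attested year, and keeps only the senses that were in use when the text was published.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sense:
    """Hypothetical candidate sense with an attestation period."""
    category: str             # illustrative category label, not a real code
    first_year: int           # first attested year (made-up value)
    last_year: Optional[int]  # last attested year; None means still current


def senses_for_date(candidates: List[Sense], year: int) -> List[Sense]:
    """Keep the senses whose attestation period covers the given year.

    If no candidate covers the year, fall back to the full candidate
    list so that the word still receives some annotation (an assumption
    of this sketch, not a documented behaviour of the tagger).
    """
    valid = [
        s for s in candidates
        if s.first_year <= year and (s.last_year is None or year <= s.last_year)
    ]
    return valid or list(candidates)


# Made-up candidate senses for a single word.
candidates = [
    Sense("sense_A", 1400, 1700),
    Sense("sense_B", 1650, None),
    Sense("sense_C", 1900, None),
]

early = [s.category for s in senses_for_date(candidates, 1680)]
modern = [s.category for s in senses_for_date(candidates, 1950)]
too_early = [s.category for s in senses_for_date(candidates, 1300)]
print(early, modern, too_early)
```

In a full tagger, a filter of this kind would only narrow the candidate set; choosing among the remaining senses would still fall to the contextual disambiguation step described in the abstract.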

Item Type: Articles
Keywords: Language technology, semantics, historical thesaurus, linguistics, information retrieval, corpus linguistics, corpora.
Glasgow Author(s) Enlighten ID: Alexander, Professor Marc and Durkin, Dr Philip and Dallachy, Dr Fraser
Authors: Piao, S., Dallachy, F., Baron, A., Demmen, J., Wattam, S., Durkin, P., McCracken, J., Rayson, P., and Alexander, M.
Subjects: P Language and Literature > P Philology. Linguistics
P Language and Literature > PE English
Q Science > QA Mathematics > QA76 Computer software
College/School: College of Arts & Humanities > School of Critical Studies > English Language and Linguistics
Journal Name: Computer Speech and Language
Journal Abbr.: CSL
ISSN (Online): 1095-8363
Published Online: 17 May 2017
Copyright Holders: Copyright © 2017 The Authors
First Published: First published in Computer Speech and Language 46: 113-135
Publisher Policy: Reproduced under a Creative Commons License

Project Code: 648941
Project Name: Semantic Annotation and Mark Up for Enhancing Lexical Searches (SAMUELS)
Principal Investigator: Marc Alexander
Funder's Name: Arts & Humanities Research Council (AHRC)
Funder Ref: AH/L010062/1
Lead Dept: CRIT - ENGLISH LANGUAGE & LINGUISTICS