Enlighten Publications

A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation

Piao, S., Dallachy, F., Baron, A., Demmen, J., Wattam, S., Durkin, P., McCracken, J., Rayson, P. and Alexander, M. (2017) A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation. Computer Speech and Language, 46, 113-135. (doi: 10.1016/j.csl.2017.04.010)

Text
141814.pdf - Published Version
Available under License Creative Commons Attribution.
2MB

Abstract

Automatic extraction and analysis of meaning-related information from natural language data has been an important issue in a number of research areas, such as natural language processing (NLP), text mining, corpus linguistics, and data science. An important aspect of such information extraction and analysis is the semantic annotation of language data using a semantic tagger. In practice, various semantic annotation tools have been designed to carry out different levels of semantic annotation, such as topics of documents, semantic role labeling, named entities or events. Currently, the majority of existing semantic annotation tools identify and tag partial core semantic information in language data, but they tend to be applicable only for modern language corpora. While such semantic analyzers have proven useful for various purposes, a semantic annotation tool that is capable of annotating deep semantic senses of all lexical units, or all-words tagging, is still desirable for a deep, comprehensive semantic analysis of language data. With large-scale digitization efforts underway, delivering historical corpora with texts dating from the last 400 years, a particularly challenging aspect is the need to adapt the annotation in the face of significant word meaning change over time. In this paper, we report on the development of a new semantic tagger (the Historical Thesaurus Semantic Tagger), and discuss challenging issues we faced in this work. This new semantic tagger is built on existing NLP tools and incorporates a large-scale historical English thesaurus linked to the Oxford English Dictionary. Employing contextual disambiguation algorithms, this tool is capable of annotating lexical units with a historically-valid highly fine-grained semantic categorization scheme that contains about 225,000 semantic concepts and 4,033 thematic semantic categories.
In terms of novelty, it is adapted for processing historical English data, with rich information about historical usage of words and a spelling variant normalizer for historical forms of English. Furthermore, it is able to make use of knowledge about the publication date of a text to adapt its output. In our evaluation, the system achieved encouraging accuracies ranging from 77.12% to 91.08% on individual test texts. Applying time-sensitive methods improved results by as much as 3.54% and by 1.72% on average.
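
The idea of using a text's publication date to adapt the tagger's output can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sense` record, the `senses_in_use` function, the `tolerance` parameter, and the toy sense labels and dates are all hypothetical, invented here only to show how candidate senses might be narrowed to those attested around a given year before any contextual disambiguation takes place.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sense:
    """One candidate sense of a word (hypothetical structure for illustration)."""
    category: str                  # invented label, not a real thesaurus code
    first_attested: int            # first year the sense is recorded
    last_attested: Optional[int]   # None means the sense is still in use


def senses_in_use(senses: List[Sense], year: int, tolerance: int = 0) -> List[Sense]:
    """Keep only the senses whose attestation range covers `year`.

    `tolerance` widens each range by that many years on both sides.
    If no sense survives the filter, all candidates are returned so that
    a later disambiguation step still has something to choose from.
    """
    live = [
        s for s in senses
        if s.first_attested - tolerance <= year
        and (s.last_attested is None or year <= s.last_attested + tolerance)
    ]
    return live if live else list(senses)


# Toy candidates with made-up dates (assumptions, not data from the paper).
CANDIDATES = [
    Sense("older_sense", 1400, 1650),
    Sense("newer_sense", 1800, None),
]

if __name__ == "__main__":
    for y in (1600, 1700, 1900):
        print(y, [s.category for s in senses_in_use(CANDIDATES, y)])
```

Falling back to the full candidate list when the date filter removes everything is a design choice of this sketch; it keeps an uncertain or wrongly dated text from ending up with no tag at all.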

Item Type:	Articles
Keywords:	Language technology, semantics, historical thesaurus, linguistics, information retrieval, corpus linguistics, corpora.
Status:	Published
Refereed:	Yes
Glasgow Author(s) Enlighten ID:	Alexander, Professor Marc and Durkin, Dr Philip and Dallachy, Dr Fraser
Authors:	Piao, S., Dallachy, F., Baron, A., Demmen, J., Wattam, S., Durkin, P., McCracken, J., Rayson, P., and Alexander, M.
Subjects:	P Language and Literature > P Philology. Linguistics; P Language and Literature > PE English; Q Science > QA Mathematics > QA76 Computer software
College/School:	College of Arts & Humanities > School of Critical Studies > English Language and Linguistics
Journal Name:	Computer Speech and Language
Journal Abbr.:	CSL
Publisher:	Elsevier
ISSN:	0885-2308
ISSN (Online):	1095-8363
Published Online:	17 May 2017
Copyright Holders:	Copyright © 2017 The Authors
First Published:	First published in Computer Speech and Language 46: 113-135
Publisher Policy:	Reproduced under a Creative Commons License
Related URLs:	Project Website

Funder and Project Information

Project Code	Award No	Project Name	Principal Investigator	Funder's Name	Funder Ref	Lead Dept
64894	1	Semantic Annotation and Mark Up for Enhancing Lexical Searches (SAMUELS)	Marc Alexander	Arts & Humanities Research Council (AHRC)	AH/L010062/1	CRIT - ENGLISH LANGUAGE & LINGUISTICS

Deposit and Record Details

ID Code:	141814
Depositing User:	Professor Marc Alexander
Datestamp:	01 Jun 2017 13:32
Last Modified:	16 Dec 2021 03:19
Date of acceptance:	30 April 2017
Date of first online publication:	17 May 2017
Date Deposited:	1 June 2017
Data Availability Statement:	No