Experiences with parallelisation of an existing NLP pipeline: tagging Hansard

Wattam, S., Rayson, P., Alexander, M. and Anderson, J. (2015) Experiences with parallelisation of an existing NLP pipeline: tagging Hansard. Language Resources and Evaluation,

Abstract

This poster describes experiences processing the two-billion-word Hansard corpus using a fairly standard NLP pipeline on a high-performance cluster. We report how we were able to parallelise a “traditional” single-threaded, batch-oriented application and apply it to a platform that differs greatly from the one for which it was originally designed. We start by discussing the tagging toolchain, its specific requirements and properties, and its performance characteristics. This is contrasted with a description of the cluster on which it was to run, and specific limitations are discussed, such as the overhead of using SAN-based storage. We then discuss the nature of the Hansard corpus and describe the properties of this corpus that prove particularly challenging on this system architecture. The solution for tagging the corpus is then described, along with performance comparisons against a naïve run on commodity hardware. We discuss the gains and benefits of using high-performance machinery rather than relatively cheap commodity hardware. Our poster provides a valuable scenario for large-scale NLP pipelines and lessons learnt from the experience.

Item Type: Articles
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Alexander, Professor Marc and Anderson, Mrs Jean
Authors: Wattam, S., Rayson, P., Alexander, M., and Anderson, J.
Subjects: P Language and Literature > PE English
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4050 Electronic information resources
College/School: College of Arts & Humanities > School of Critical Studies > English Language and Linguistics
Journal Name: Language Resources and Evaluation
Publisher: Springer Netherlands
ISSN: 1574-020X
ISSN (Online): 1574-0218

Project Code / Award No: 648941
Project Name: Semantic Annotation and Mark Up for Enhancing Lexical Searches (SAMUELS)
Principal Investigator: Marc Alexander
Funder's Name: Arts and Humanities Research Council (AHRC)
Funder Ref: AH/L010062/1
Lead Dept: CRIT - ENGLISH LANGUAGE

Project Code / Award No: 567981
Project Name: Parliamentary discourse
Principal Investigator: Jean Anderson
Funder's Name: Joint Information Systems Committee (JISC)
Funder Ref: Strand A
Lead Dept: CRIT - ENGLISH LANGUAGE