Enright, J. and Kondrak, G. (2007) A Fast Method for Parallel Document Identification. In: Human Language Technologies 2007, Rochester, NY, USA, 22-27 Apr 2007, pp. 29-32.
Text: 203108.pdf - Published Version. Available under License Creative Commons Attribution Non-commercial Share Alike. 61kB
Publisher's URL: http://www.aclweb.org/anthology/N07-2008
Abstract
We present a fast method to identify homogeneous parallel documents. The method is based on collecting counts of identical low-frequency words between possibly parallel documents. The candidate with the most shared low-frequency words is selected as the parallel document. The method achieved 99.96% accuracy when tested on the EUROPARL corpus of parliamentary proceedings, failing only in anomalous cases of truncated or otherwise distorted documents. While other work has shown similar performance on this type of dataset, our approach presented here is faster and does not require training. Apart from proposing an efficient method for parallel document identification in a restricted domain, this paper furnishes evidence that parliamentary proceedings may be inappropriate for testing parallel document identification systems in general.
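
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the tokenization or the frequency threshold, so the whitespace tokenization, the lowercasing, the cutoff `max_count=1` (words occurring once in a document), and the function names `low_frequency_words` and `pick_parallel` are all assumptions made for illustration.

```python
from collections import Counter


def low_frequency_words(text, max_count=1):
    """Return the set of words that occur at most max_count times in text.

    Assumption: lowercased whitespace tokens stand in for "words";
    the abstract does not say how the paper tokenizes.
    """
    counts = Counter(text.lower().split())
    return {word for word, count in counts.items() if count <= max_count}


def pick_parallel(source, candidates, max_count=1):
    """Score each candidate by the number of identical low-frequency
    words it shares with source; return (best_index, scores).

    Ties are resolved in favour of the earliest candidate.
    """
    source_words = low_frequency_words(source, max_count)
    scores = [
        len(source_words & low_frequency_words(candidate, max_count))
        for candidate in candidates
    ]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return best_index, scores


# Hypothetical toy data: names and numbers tend to survive translation
# unchanged, which is what makes identical rare words a useful signal.
source = "Mr Smith visited Rochester in 2007 and the committee met the president"
candidates = [
    "Herr Smith besuchte Rochester im Jahr 2007 und der Ausschuss traf den Präsidenten",
    "Die Sitzung wurde um 10 Uhr eröffnet und der Präsident sprach",
]
best_index, scores = pick_parallel(source, candidates)
print(best_index, scores)
```

In this toy case the first candidate wins because it shares "smith", "rochester", and "2007" with the source, while the second shares no low-frequency words. No training step is involved, which matches the abstract's claim that the approach does not require training.
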
Item Type: | Conference Proceedings |
---|---|
Status: | Published |
Refereed: | Yes |
Glasgow Author(s) Enlighten ID: | Enright, Dr Jessica |
Authors: | Enright, J., and Kondrak, G. |
College/School: | College of Science and Engineering > School of Computing Science |
Copyright Holders: | Copyright © 2007 Association for Computational Linguistics |
First Published: | First published in Proceedings of the Human Language Technologies 2007: 29-32 |
Publisher Policy: | Reproduced under a Creative Commons License |