A Fast Method for Parallel Document Identification

Enright, J. and Kondrak, G. (2007) A Fast Method for Parallel Document Identification. In: Human Language Technologies 2007, Rochester, NY, USA, 22-27 Apr 2007, pp. 29-32.

203108.pdf - Published Version
Available under License Creative Commons Attribution Non-commercial Share Alike.

61kB

Publisher's URL: http://www.aclweb.org/anthology/N07-2008

Abstract

We present a fast method to identify homogeneous parallel documents. The method is based on collecting counts of identical low-frequency words between possibly parallel documents. The candidate with the most shared low-frequency words is selected as the parallel document. The method achieved 99.96% accuracy when tested on the EUROPARL corpus of parliamentary proceedings, failing only in anomalous cases of truncated or otherwise distorted documents. While other work has shown similar performance on this type of dataset, the approach presented here is faster and does not require training. Apart from proposing an efficient method for parallel document identification in a restricted domain, this paper furnishes evidence that parliamentary proceedings may be inappropriate for testing parallel document identification systems in general.
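
The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the regular-expression tokenizer, the lowercasing, and the `max_count=1` threshold (treating a word as low-frequency when it occurs once in its document) are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter
import re


def low_frequency_words(text, max_count=1):
    """Return the set of words occurring at most max_count times in text.

    Assumption: tokens are lowercased runs of word characters, and
    "low-frequency" means at most max_count occurrences in this document.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count <= max_count}


def find_parallel(source_text, candidates, max_count=1):
    """Pick the candidate sharing the most low-frequency words with source_text.

    candidates maps a (hypothetical) document id to its text.
    Returns (best_id, shared_count), or (None, 0) if candidates is empty.
    """
    source_words = low_frequency_words(source_text, max_count)
    best_id, best_score = None, 0
    for doc_id, text in candidates.items():
        # Count identical low-frequency words present in both documents.
        shared = len(source_words & low_frequency_words(text, max_count))
        if best_id is None or shared > best_score:
            best_id, best_score = doc_id, shared
    return best_id, best_score


if __name__ == "__main__":
    # Invented toy texts, not taken from EUROPARL.
    english = "Mr Barroso addressed the Strasbourg plenary on 14 March about Kyoto targets"
    french_candidates = {
        "fr_1": "M. Barroso s'est adressé à la plénière de Strasbourg le 14 mars au sujet de Kyoto",
        "fr_2": "La commission des budgets a adopté le rapport de Mme Haug sur le budget",
    }
    print(find_parallel(english, french_candidates))
```

In this toy example the first candidate wins because proper names and numbers such as "Barroso", "Strasbourg", "14", and "Kyoto" appear unchanged in both languages. No training step is involved: the sketch only counts and intersects word sets.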

Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Enright, Dr Jessica
Authors: Enright, J., and Kondrak, G.
College/School: College of Science and Engineering > School of Computing Science
Copyright Holders: Copyright © 2007 Association for Computational Linguistics
First Published: First published in Proceedings of the Human Language Technologies 2007: 29-32
Publisher Policy: Reproduced under a Creative Commons License
