Automatic Identification of Document Translations in Large Multilingual Document Collections

Bruno Pouliquen; Camelia Ignat; Ralf Steinberger

arxiv: cs/0609060 · v1 · submitted 2006-09-12 · cs.CL · cs.IR



keywords document, documents, translations, large, system, detect, language, multilingual


Texts and their translations are a rich linguistic resource that can be used to train and test statistics-based Machine Translation systems and many other applications. In this paper, we present a working system that can identify translations and other very similar documents among a large number of candidates, by representing the document contents with a vector of thesaurus terms from a multilingual thesaurus, and by then measuring the semantic similarity between the vectors. Tests on different text types have shown that the system can detect translations with over 96% precision in a large search space of 820 documents or more. The system was tuned to ignore language-specific similarities and to give similar documents in a second language the same similarity score as equivalent documents in the same language. The application can also be used to detect cross-lingual document plagiarism.
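
The matching step described in the abstract can be illustrated with a minimal sketch. It assumes that each document has already been mapped to a sparse vector of weighted, language-independent thesaurus descriptors, and it uses cosine similarity as the vector comparison; the abstract does not name the similarity measure, so cosine is an assumption here, and the descriptor identifiers, weights, and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts.

    Keys are thesaurus descriptor IDs, values are descriptor weights.
    Cosine is an assumed measure, not one confirmed by the abstract.
    """
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def best_candidate(query, candidates):
    """Return (doc_id, score) for the candidate most similar to the query."""
    scored = ((doc_id, cosine_similarity(query, vec))
              for doc_id, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical descriptor IDs and weights, for illustration only.
english_doc = {"D0431": 0.9, "D1187": 0.4, "D2750": 0.2}
candidates = {
    "fr_001": {"D0431": 0.8, "D1187": 0.5, "D2750": 0.1},  # likely translation
    "fr_002": {"D0999": 0.7, "D1187": 0.2},                # weak topical overlap
    "fr_003": {"D3020": 0.6},                              # unrelated
}

best_id, best_score = best_candidate(english_doc, candidates)
print(best_id, round(best_score, 3))
```

Because the descriptors come from a multilingual thesaurus, the same identifiers are shared across languages, so the comparison itself needs no language-specific handling. A real system would also need a threshold on the score to decide whether the best candidate is a translation at all.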
