arxiv: cs/0609060 · v1 · submitted 2006-09-12 · 💻 cs.CL · cs.IR

Automatic Identification of Document Translations in Large Multilingual Document Collections

keywords document · documents · translations · large · system · detect · language · multilingual

Texts and their translations are a rich linguistic resource that can be used to train and test statistics-based Machine Translation systems and many other applications. In this paper, we present a working system that can identify translations and other very similar documents among a large number of candidates, by representing the document contents with a vector of thesaurus terms from a multilingual thesaurus, and by then measuring the semantic similarity between the vectors. Tests on different text types have shown that the system can detect translations with over 96% precision in a large search space of 820 documents or more. The system was tuned to ignore language-specific similarities and to give similar documents in a second language the same similarity score as equivalent documents in the same language. The application can also be used to detect cross-lingual document plagiarism.
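
The vector-comparison step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the similarity measure or the weighting scheme, so cosine similarity, the descriptor identifiers (`D1` to `D4`), the weights, and the function names below are all assumptions made for illustration, not the authors' implementation. The point it shows is that descriptors from a multilingual thesaurus are language-independent, so a document and its translation can be compared in the same vector space.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as dicts
    mapping a thesaurus descriptor ID to its weight (assumed measure)."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def best_match(query, candidates):
    """Return (doc_id, score) for the candidate closest to the query."""
    scored = ((doc_id, cosine_similarity(query, vec))
              for doc_id, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical descriptor vectors: the IDs are language-independent,
# so an English document and French candidates share one space.
english_doc = {"D1": 0.8, "D2": 0.5, "D3": 0.1}
french_candidates = {
    "fr_a": {"D1": 0.7, "D2": 0.6, "D3": 0.2},  # likely translation
    "fr_b": {"D4": 0.9, "D3": 0.3},             # unrelated document
}

match_id, match_score = best_match(english_doc, french_candidates)
print(match_id, round(match_score, 3))
```

In a real collection, a threshold on the top score (or on the gap between the best and second-best candidates) would be needed to decide whether the best match is actually a translation; any such threshold would have to be tuned on held-out data.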
