Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
Pith reviewed 2026-05-09 19:26 UTC · model grok-4.3
The pith
Machine translation demonstrably preserves pairwise cosine similarities between paragraph embeddings in ten of 28 languages and demonstrably distorts them in four, based on a stability test using the Political Manifesto Corpus; for the remaining languages the evidence does not resolve the question.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors develop a per-language non-inferiority test that checks four hypotheses about translation effects on embedding similarities. Using over 2,800 manifestos in 28 languages translated to English, they measure stability of pairwise cosine similarities across embedding models and calibrate the invariance threshold with inter-model disagreement on original text. This identifies ten languages where translation demonstrably preserves semantic structure and four where it demonstrably degrades it.
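As a rough illustration of the stability quantity described above, the following sketch computes pairwise cosine similarities under two embedding models and their inter-model disagreement on the same texts. The pipeline details, embedding models, and function names here are our assumptions, not the paper's; random vectors stand in for real paragraph embeddings.

```python
# Hedged sketch: random stand-ins for paragraph embeddings from two models.
import numpy as np

def pairwise_cosine(X):
    """Pairwise cosine-similarity matrix for row-vector embeddings X."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def upper_triangle(S):
    """Flatten the strict upper triangle, so each paragraph pair counts once."""
    i, j = np.triu_indices(S.shape[0], k=1)
    return S[i, j]

rng = np.random.default_rng(0)
emb_model_a = rng.normal(size=(50, 8))                       # paragraphs, model A
emb_model_b = emb_model_a + 0.1 * rng.normal(size=(50, 8))   # correlated model B

sims_a = upper_triangle(pairwise_cosine(emb_model_a))
sims_b = upper_triangle(pairwise_cosine(emb_model_b))

# Inter-model disagreement on the same (untranslated) texts: mean absolute
# difference in pairwise similarities. This is the quantity the paper reuses
# as its calibrated invariance threshold.
disagreement = np.mean(np.abs(sims_a - sims_b))
```

The same disagreement statistic, computed on translated text against the original, is what the threshold is compared to.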
What carries the argument
The non-inferiority test for invariance, which treats inter-model disagreement on untranslated text as the threshold for acceptable translation-induced change in pairwise similarities.
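One plausible shape for such a non-inferiority test is a one-sided one-sample t-test of translation-induced shifts against a margin (in the spirit of TOST-style equivalence testing); the paper's exact statistic is not reproduced here, so this construction and its names are ours.

```python
# Hedged non-inferiority sketch (our construction, not the authors' exact test).
# H0: the mean |shift| in pairwise similarities under translation is at least
# `margin`; rejecting H0 supports invariance for that language.
import numpy as np
from scipy import stats

def noninferiority_pvalue(shifts, margin):
    """One-sided p-value for the alternative: mean |shift| < margin."""
    t, p_two_sided = stats.ttest_1samp(np.abs(shifts), margin)
    return p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2

rng = np.random.default_rng(1)
small_shifts = rng.normal(0.0, 0.01, size=200)  # translation barely moves similarities
p = noninferiority_pvalue(small_shifts, margin=0.05)
verdict = "invariant" if p < 0.05 else "indeterminate"
```

Here the margin would be set from the inter-model disagreement measured on original-language text, as the paper describes.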
If this is right
- Translated texts can be used for similarity-based tasks in the ten invariant languages without detectable loss of semantic structure.
- The four languages with distortion require caution or original-language processing for reliable embedding comparisons.
- The framework can be applied to other corpora and pipelines to test invariance without needing direct semantic shift measurements.
- Downstream tasks like clustering or retrieval across languages become more trustworthy when limited to invariant languages.
Where Pith is reading between the lines
- If the test generalizes, researchers could screen new translation services or embedding models by running the same inter-model stability check.
- This approach might extend to measuring invariance under other transformations like summarization or paraphrasing.
- Knowing language-specific invariance could guide choices in building multilingual datasets or models.
- Future work might correlate these findings with linguistic features of the languages to predict invariance without full testing.
Load-bearing premise
That the amount of disagreement between different embedding models on original-language texts provides the correct standard for deciding what level of change under translation is still acceptable.
What would settle it
Running the same analysis but with human-annotated similarity judgments on a subset of paragraphs instead of model-based ones, and checking whether the language verdicts match the automated test.
read the original abstract
We investigate the extent to which cosine similarity between paragraph embeddings is invariant under machine translation, using the Manifesto Corpus of over 2,800 political party platforms in 28 languages translated to English via the EU eTranslation service. Rather than measuring translation-induced semantic shift directly we measure the stability of pairwise similarity relationships across embedding models, and use inter-model disagreement on original-language text as a calibrated invariance threshold. This yields a per-language non-inferiority test for four hypotheses about how translation interacts with embedding choice, with verdicts that distinguish languages where translation demonstrably preserves semantic structure from those where it demonstrably degrades it and from those where the available evidence does not resolve the question. The framework is corpus- and pipeline-agnostic and extends naturally to downstream tasks. Applied to our data, it identifies ten languages with translation invariance and four with detectable distortion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates whether cosine similarity between paragraph embeddings remains invariant under machine translation. Using the Manifesto Corpus (>2,800 texts in 28 languages translated to English via EU eTranslation), it measures stability of pairwise similarities across embedding models and sets an invariance threshold from inter-model disagreement on original-language texts. This supports a per-language non-inferiority test distinguishing languages with preserved semantic structure (10 languages), detectable distortion (4 languages), and inconclusive cases. The framework is described as corpus- and pipeline-agnostic with potential extension to downstream tasks.
Significance. If the threshold choice is justified, the work supplies a practical statistical framework for quantifying MT effects on embedding-based similarity relations, which is relevant for multilingual NLP and computational social science applications involving political texts. The large-scale, real-world corpus and explicit non-inferiority formulation add empirical value; the agnostic design allows reuse beyond the current setting.
major comments (1)
- [§4] §4 (non-inferiority test definition): The invariance threshold is set directly to inter-model disagreement on untranslated paragraphs. No external anchor (human similarity judgments, synthetic distortion controls, or downstream task stability) is provided to show that exceeding this threshold corresponds to semantically meaningful change rather than model-specific encoding differences. This assumption is load-bearing for the reported classification of 10 invariant vs. 4 distorted languages.
minor comments (2)
- [§3.1] §3.1: Paragraph segmentation rules from the manifesto texts are not specified in sufficient detail to support exact reproduction of the pairwise similarity matrices.
- [Results tables] Results tables: The per-language verdicts lack accompanying effect sizes, exact threshold values, or confidence intervals, making it difficult to assess the margin by which each language passes or fails the test.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for identifying a key methodological point. We address the major comment below, providing additional justification for our calibration approach while acknowledging the value of external validation. We plan a partial revision to expand the discussion in §4.
read point-by-point responses
-
Referee: [§4] §4 (non-inferiority test definition): The invariance threshold is set directly to inter-model disagreement on untranslated paragraphs. No external anchor (human similarity judgments, synthetic distortion controls, or downstream task stability) is provided to show that exceeding this threshold corresponds to semantically meaningful change rather than model-specific encoding differences. This assumption is load-bearing for the reported classification of 10 invariant vs. 4 distorted languages.
Authors: We appreciate the referee drawing attention to the calibration of the invariance threshold. Our decision to set the threshold using inter-model disagreement on the original-language paragraphs is intentional: it quantifies the baseline variability in pairwise cosine similarities that arises solely from differences in embedding model architectures and training data, without any translation. Any additional deviation observed after machine translation can therefore be interpreted as exceeding the level of change already attributable to model choice. This internal calibration supports the non-inferiority test by providing a language- and corpus-specific reference point that does not require external human annotations, which would be difficult to obtain consistently across 28 languages and would compromise the framework's agnostic design. We agree that linking the threshold to downstream task performance or human judgments would offer stronger semantic grounding and will revise §4 to include an expanded justification of the current approach together with explicit suggestions for such external anchors in future work. The classification results themselves remain unchanged, as they follow directly from the per-language statistical tests against this calibrated threshold.
Revision: partial
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper computes an invariance threshold from inter-model disagreement on original-language paragraphs and applies a non-inferiority test to measure whether translation-induced shifts in pairwise cosine similarities exceed that threshold. This construction is independent: the threshold is fixed from untranslated data before examining translations, so the per-language verdicts (invariant vs. distorted) are not equivalent to the inputs by definition or by fitting. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the described framework. The method is presented as corpus-agnostic and externally extensible, confirming the logic remains self-contained against the chosen proxy.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption Cosine similarity between paragraph embeddings captures meaningful semantic relationships between texts
- domain assumption Inter-model disagreement on original-language texts supplies a valid and calibrated threshold for acceptable change under translation
- domain assumption The EU eTranslation outputs are representative machine translations suitable for testing semantic-structure preservation