pith. sign in

arxiv: 2605.29992 · v1 · pith:LAKPMO2Dnew · submitted 2026-05-28 · 💻 cs.CL

Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation

Pith reviewed 2026-06-29 07:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords Turkish sentence embeddingstokenizer adaptationoffline distillationmultilingual modelssemantic textual similaritymodel compressioncross-lingual transferembedding adaptation
0
0 comments X

The pith

A 200M-parameter Turkish sentence embedding model built via tokenizer surgery and offline distillation outperforms its 300M-parameter teacher on Turkish semantic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates an efficient adaptation of a multilingual embedding model to Turkish without full pretraining. It prunes the tokenizer to favor Turkish tokens via frequency analysis on a 40-language corpus, initializes the new embedding table by averaging teacher embeddings for each token, and distills knowledge offline by matching cosine similarities to precomputed teacher vectors. This yields a 200M-parameter model with an 8192-token context that scores higher on Turkish semantic textual similarity than the teacher and ranks seventh on a 26-task Turkish benchmark. The process completes in roughly four hours on one GPU at low cost and releases all components openly.

Core claim

By constructing a Turkish-optimized multilingual tokenizer with a 131,072 vocabulary through pruning and frequency-based token addition, cloning the teacher while preserving backbone weights and using mean-composition for the new embedding table, then performing offline embedding distillation with a cosine similarity objective over a balanced 40-language Wikipedia corpus, the resulting 200M-parameter student model achieves Pearson/Spearman correlations of 77.55%/77.45% on STSbTR (exceeding the teacher's 73.84%/72.92%) and a mean score of 63.9% on TR-MTEB (ranking 7th out of 26 models).

What carries the argument

Three-stage pipeline of cross-lingual tokenizer surgery via pruning and frequency analysis, mean-composition embedding table initialization, and offline cosine-similarity distillation from precomputed teacher vectors.

If this is right

  • The adapted model supports an 8192-token context window, exceeding the 512-token limit of prior BERT-based Turkish encoders.
  • Training finishes in roughly four hours on a single GPU at a total cost of $5-$20 by avoiding online teacher inference.
  • The approach yields a competitive cost-quality trade-off with 33% fewer parameters than the teacher.
  • All model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling are released for reproducibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tokenizer surgery and offline distillation steps could be applied to adapt the teacher to other languages using the same 40-language corpus.
  • Releasing the precomputed teacher embeddings on Wikipedia enables independent experiments or extensions by researchers without access to the original teacher model.
  • The method's avoidance of joint teacher-student training during distillation may allow scaling the approach to larger context windows or additional languages at modest cost.
  • Testing the pipeline on non-embedding tasks such as classification or retrieval could reveal whether the transferred knowledge generalizes beyond sentence similarity.

Load-bearing premise

Mean-composition initialization of the new embedding table combined with offline cosine-similarity distillation on the 40-language Wikipedia corpus will transfer the teacher's semantic knowledge to the Turkish-optimized tokenizer without substantial degradation on Turkish-specific downstream tasks.

What would settle it

If the student model records a Pearson correlation below the teacher's 73.84% on STSbTR or ranks outside the top 10 on TR-MTEB, the effectiveness of the tokenizer surgery and offline distillation pipeline would be refuted.

Figures

Figures reproduced from arXiv: 2605.29992 by Banu Diri, M. Ali Bayram, Sava\c{s} Y{\i}ld{\i}r{\i}m.

Figure 1
Figure 1. Figure 1: End-to-end pipeline for building embeddingmagibu-200m. The teacher model is used only during embedding precomputation, enabling efficient offline distillation. 3.1 Tokenizer Construction To optimize text representation for Turkish while preserving the multilingual capabilities of the teacher, we construct a custom hybrid tokenizer with a vocabulary size of 2 17 = 131,072 tokens. This vocabulary size repres… view at source ↗
Figure 2
Figure 2. Figure 2: Training curves from the offline distillation run. Left: cosine distillation loss decreasing [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
read the original abstract

Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072 vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus, (2) clone a teacher embedding model while preserving transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping, and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and trains in roughly four hours on a single GPU by avoiding online teacher inference during training, at a total cost of $5-$20. Empirically, Pearson/Spearman correlations of 77.55%/77.45% are obtained on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), a mean score of 63.9% is achieved (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces embeddingmagibu-200m, a 200M-parameter Turkish sentence embedding model with 768-dim L2-normalized vectors and 8192-token context. It adapts a 300M multilingual teacher via a three-stage pipeline: (1) pruning/augmenting the tokenizer to a 131k Turkish-optimized vocabulary using frequency analysis on a 40-language corpus, (2) cloning the transformer backbone while initializing the new embedding table via mean-composition mapping, and (3) offline cosine-similarity distillation on precomputed teacher embeddings from balanced Wikipedia data. The student is claimed to achieve Pearson/Spearman correlations of 77.55%/77.45% on STSbTR (surpassing the teacher's 73.84%/72.92%) and a mean score of 63.9% on TR-MTEB (7th of 26 models), at low cost (~4 hours on one GPU) with all artifacts released.

Significance. If the empirical claims hold, the work offers a practical, low-cost method for language-specific adaptation of embedding models without full pretraining or online teacher inference during distillation. The emphasis on releasing model weights, tokenizer files, precomputed datasets, and open-source tooling strengthens reproducibility and enables downstream use for Turkish semantic tasks.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (results): The reported STSbTR correlations (77.55%/77.45%) and TR-MTEB mean (63.9%) are presented without error bars, number of evaluation runs, statistical significance tests, or details on the evaluation protocol (e.g., splits or pooling), which is load-bearing for the central claim of outperformance over the 300M teacher.
  2. [§3.1-3.2] §3.1-3.2: The mean-composition initialization for the augmented 131k vocabulary (new Turkish tokens added via frequency analysis) is described as the key transfer step, but no ablation or analysis quantifies how well averaging approximates embeddings for high-frequency Turkish-specific tokens; this directly affects whether the reported gains can be attributed to the pipeline rather than initialization artifacts.
  3. [§3.3] §3.3: Offline distillation uses only cosine similarity on precomputed teacher vectors from the 40-language corpus with no online gradients or auxiliary losses; without evidence that this corrects potential misalignment from the mean-composition step on Turkish tokens, the transfer-without-degradation assumption remains untested and central to the superiority claim.
minor comments (2)
  1. [§3] The context window extension to 8192 tokens is highlighted but no ablation shows its impact on the reported Turkish tasks (most of which likely use shorter inputs).
  2. [§4] Table or figure captions for TR-MTEB results should explicitly list the 26 tasks and the ranking methodology to allow direct comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (results): The reported STSbTR correlations (77.55%/77.45%) and TR-MTEB mean (63.9%) are presented without error bars, number of evaluation runs, statistical significance tests, or details on the evaluation protocol (e.g., splits or pooling), which is load-bearing for the central claim of outperformance over the 300M teacher.

    Authors: We agree that the lack of error bars, run counts, and protocol details weakens the presentation of the central empirical claims. In the revised manuscript we will expand the evaluation section to specify the exact STSbTR splits, pooling method, and any other protocol details. We will also report the number of runs performed and add error bars (or explicitly note single-run results with justification) along with a basic significance comparison where feasible. These changes directly address the robustness concern. revision: yes

  2. Referee: [§3.1-3.2] §3.1-3.2: The mean-composition initialization for the augmented 131k vocabulary (new Turkish tokens added via frequency analysis) is described as the key transfer step, but no ablation or analysis quantifies how well averaging approximates embeddings for high-frequency Turkish-specific tokens; this directly affects whether the reported gains can be attributed to the pipeline rather than initialization artifacts.

    Authors: The mean-composition mapping is a pragmatic transfer mechanism that reuses existing multilingual embeddings. While the original submission did not contain a dedicated ablation isolating its contribution, the end-to-end results (student outperforming the teacher on Turkish tasks) provide indirect support for the full pipeline. To respond to the request for quantification, we will add a short analysis or note in §3.1-3.2 examining embedding similarity for a sample of high-frequency Turkish tokens after initialization. revision: partial

  3. Referee: [§3.3] §3.3: Offline distillation uses only cosine similarity on precomputed teacher vectors from the 40-language corpus with no online gradients or auxiliary losses; without evidence that this corrects potential misalignment from the mean-composition step on Turkish tokens, the transfer-without-degradation assumption remains untested and central to the superiority claim.

    Authors: The offline cosine objective was selected precisely to enable low-cost adaptation without repeated teacher inference. The fact that the 200M student exceeds the 300M teacher on STSbTR supplies empirical evidence that any initialization misalignment is not performance-limiting for Turkish. In the revision we will expand §3.3 with an explicit discussion of this assumption, referencing the observed gains and the balanced multilingual training data as mitigating factors. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical adaptation pipeline is self-contained

full rationale

The paper presents a three-stage engineering pipeline (tokenizer pruning/augmentation, mean-composition embedding initialization, offline cosine distillation) whose outputs are measured on external benchmarks (STSbTR, TR-MTEB). No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described method. Performance numbers are direct empirical measurements, not quantities forced by construction from the same inputs. The derivation is therefore independent of its own results.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the effectiveness of mean-composition initialization and offline distillation rather than on new mathematical derivations or external benchmarks.

free parameters (2)
  • new vocabulary size = 131072
    131072 tokens chosen after pruning and frequency analysis on the 40-language corpus.
  • context window length = 8192
    8192 tokens selected to exceed prior 512-token BERT limit.
axioms (2)
  • domain assumption Mean composition of original token embeddings yields a usable initialization for the new vocabulary's embedding table.
    Invoked in stage 2 of the pipeline.
  • domain assumption Cosine-similarity matching to precomputed teacher embeddings on a balanced 40-language Wikipedia corpus transfers semantic capability to the student.
    Core training objective in stage 3.

pith-pipeline@v0.9.1-grok · 5863 in / 1568 out tokens · 41021 ms · 2026-06-29T07:32:00.288166+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

    cs.CL 2026-06 unverdicted novelty 6.0

    Morpheus is a morphology-aware neural tokenizer and embedder for Turkish that achieves lossless reversible segmentation, higher morphological alignment, lower bits-per-character, and competitive root-centric embedding...

Reference graph

Works this paper leans on

15 extracted references · 13 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    Morphscore: Evaluating morpho- logical awareness of tokenizers across languages

    Catherine Arnett, Marisa Hudspeth, and Brendan O’Connor. Morphscore: Evaluating morpho- logical awareness of tokenizers across languages. https://arxiv.org/abs/2507.06378, 2025

  2. [2]

    Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümü¸ s, Sercan Karaka¸ s, Banu Diri, Sava¸ s Yıldırım, and Demircan Çelik

    M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümü¸ s, Sercan Karaka¸ s, Banu Diri, Sava¸ s Yıldırım, and Demircan Çelik. Tokens with meaning: A hybrid tokenization approach for nlp. https://arxiv.org/abs/2508.14292, 2025

  3. [3]

    Baysan and Tunga Güngör

    Selman M. Baysan and Tunga Güngör. Tr-mteb: A comprehensive benchmark and embed- ding model suite for turkish sentence representations. https://aclanthology.org/2025. findings-emnlp.471/, 2025

  4. [4]

    M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.https://arxiv.org/abs/2402.03216, 2024

  5. [5]

    Turkembed: Turkish embedding model on nli & sts tasks.https://arxiv.org/abs/2511.08376, 2025

    Özay Ezerceli, Gizem Gümü¸ sçekiçci, Tu˘gba Erkoç, and Berke Özenç. Turkembed: Turkish embedding model on nli & sts tasks.https://arxiv.org/abs/2511.08376, 2025

  6. [6]

    Wechsel: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models

    Benjamin Minixhofer, Fabian Paischer, and Navid Rekabsaz. Wechsel: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. https: //aclanthology.org/2022.naacl-main.293/, 2022

  7. [7]

    MTEB: Massive Text Embedding Benchmark

    Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. Mteb: Massive text embedding benchmark.https://arxiv.org/abs/2210.07316, 2022

  8. [8]

    Yamshchikov, and Mark Fishel

    Taido Purason, Pavel Chizhov, Ivan P. Yamshchikov, and Mark Fishel. Teaching old tokenizers new words: Efficient tokenizer adaptation for pre-trained models. https://arxiv.org/abs/ 2512.03989, 2025

  9. [9]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks.https://arxiv.org/abs/1908.10084, 2019

  10. [10]

    Making monolingual sentence embeddings multilingual using knowledge distillation.https://arxiv.org/abs/2004.09813, 2020

    Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation.https://arxiv.org/abs/2004.09813, 2020

  11. [11]

    How does a language-specific tokenizer affect llms?https://arxiv.org/abs/2502.12560, 2025

    Jean Seo, Jaeyoon Kim, SungJoo Byun, and Hyopil Shin. How does a language-specific tokenizer affect llms?https://arxiv.org/abs/2502.12560, 2025

  12. [12]

    Ebrar Kızılo˘glu, Onur Güngör, and Susan Üsküdarlı

    Melik¸ sah Türker, A. Ebrar Kızılo˘glu, Onur Güngör, and Susan Üsküdarlı. Tabibert: A large- scale modernbert foundation model and unified benchmarking framework for turkish. https: //arxiv.org/abs/2512.23065, 2025

  13. [13]

    EmbeddingGemma: Powerful and Lightweight Text Representations

    Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, ...

  14. [14]

    Multilingual E5 Text Embeddings: A Technical Report

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report.https://arxiv.org/abs/2402.05672, 2024

  15. [15]

    mgte: Generalized long-context text representation and reranking models for multilingual text retrieval

    Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zhang. mgte: Generalized long-context text representation and reranking models for multilingual text retrieval. https://arxiv.org/abs/2407.19669, 2024. A Implementation Code This appendix provides the P...