
arxiv: 2606.03027 · v1 · pith:KFFHRESAnew · submitted 2026-06-02 · 💻 cs.CL

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

Pith reviewed 2026-06-28 10:46 UTC · model grok-4.3

classification 💻 cs.CL
keywords: text embeddings · Southeast Asian languages · reproducible NLP · open source models · SEA-BED benchmark · multilingual embeddings · training objectives · base encoder initialization

The pith

SEA-Embedding delivers an open, reproducible pipeline that trains competitive text embeddings for Southeast Asian languages using only public data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds SEA-Embedding as a complete, shareable pipeline for text embeddings aimed at Southeast Asian languages. It trains the model exclusively on publicly released datasets and systematically varies data composition, training objective, and base encoder initialization to measure effects on robustness. The resulting model reaches state-of-the-art scores on the SEA-BED benchmark while making every training choice transparent and repeatable. This approach matters because prior high-performing embedding models for the region have relied on undisclosed data, blocking both verification and further progress. The work therefore supplies both a practical model and a controlled testbed for studying what actually drives good embeddings in these languages.

Core claim

We present SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages trained only on publicly available data, and use it to study three core factors of robust embedding design: data composition, training objective, and base encoder initialization. SEA-Embedding achieves state-of-the-art results on SEA-BED while enabling systematic and reproducible analysis of robust text embeddings for the region.

What carries the argument

The SEA-Embedding training pipeline, which isolates and measures the separate contributions of data composition, training objective, and base encoder initialization to embedding robustness.

If this is right

  • Researchers can now replicate the exact training choices that produced the reported SEA-BED gains.
  • Varying data mix, loss function, and starting encoder independently shows which factor most improves robustness for the target languages.
  • Future models for the region can start from the released weights and data rather than closed resources.
  • The same controlled comparison framework can be applied to test new public datasets as they become available.
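
One way to read "varying each factor independently" is a one-factor-at-a-time design: fix a baseline run, then change a single factor per run so that any score difference can be attributed to that factor. The sketch below only enumerates such runs. The `RunConfig` fields mirror the three factors named above, but every value (`mix_a`, `objective_b`, and so on) is a hypothetical placeholder, and the paper may organize its ablations differently.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RunConfig:
    """One training run; all field values used below are placeholders."""
    data_mix: str
    objective: str
    base_encoder: str


def one_factor_at_a_time(
    baseline: RunConfig, options: Dict[str, List[str]]
) -> List[Tuple[str, RunConfig]]:
    """Return the baseline plus runs that each change exactly one factor."""
    runs = [("baseline", baseline)]
    for factor, values in options.items():
        for value in values:
            if value == getattr(baseline, factor):
                continue  # the baseline already covers this setting
            runs.append((f"{factor}={value}", replace(baseline, **{factor: value})))
    return runs


# Hypothetical settings, chosen only to show the shape of the grid.
baseline = RunConfig(data_mix="mix_a", objective="objective_a", base_encoder="encoder_a")
options = {
    "data_mix": ["mix_a", "mix_b"],
    "objective": ["objective_a", "objective_b"],
    "base_encoder": ["encoder_a", "encoder_b", "encoder_c"],
}
runs = one_factor_at_a_time(baseline, options)
for name, cfg in runs:
    print(name, cfg)
```

With these placeholder options the grid has five runs rather than the twelve of a full factorial design, which keeps the comparison cheap but cannot reveal interactions between factors.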

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same public-data approach could be tested on other low-resource language families to check whether closed-data advantages are language-specific.
  • If the three factors interact differently across language families, the pipeline offers a ready template for repeating the ablation study elsewhere.
  • Releasing the full training code and data lists lowers the barrier for groups without access to proprietary corpora to contribute embedding improvements.

Load-bearing premise

Publicly available data alone is enough to produce embeddings that match or exceed the performance of models trained on closed datasets for Southeast Asian languages.

What would settle it

Re-running the SEA-Embedding pipeline on the same public data and finding that it falls short of the reported state-of-the-art scores on SEA-BED.
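
As a rough illustration of how such a check could be scored, the sketch below compares re-run scores against reported ones and lists the tasks that fall short by more than a tolerance. The function name `shortfalls`, the task names, the tolerance, and all numbers are hypothetical placeholders; none are taken from the paper or from SEA-BED.

```python
from typing import Dict, List


def shortfalls(
    reported: Dict[str, float], rerun: Dict[str, float], tolerance: float
) -> List[str]:
    """List tasks where the re-run score trails the reported one by more than tolerance."""
    missing = set(reported) - set(rerun)
    if missing:
        raise ValueError(f"re-run lacks scores for: {sorted(missing)}")
    return sorted(
        task for task in reported if reported[task] - rerun[task] > tolerance
    )


# Placeholder numbers, not results from the paper.
reported = {"task_a": 70.0, "task_b": 60.0}
rerun = {"task_a": 69.8, "task_b": 57.0}
print(shortfalls(reported, rerun, tolerance=0.5))
```

The tolerance stands in for ordinary run-to-run variation (seeds, hardware, library versions); a reasonable value would have to come from repeated runs, not from this sketch.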

Figures

Figures reproduced from arXiv: 2606.03027 by Jian Gang Ngui, Peerat Limkonchotiwat, Raymond Ng, Sarana Nutanong.

Figure 1. The training pipeline for SEA-Embedding. This structure serves as a conceptual framework for systematically examining three critical components of robust text embeddings: RQ1: Data Composition, RQ2: Objective Design, and RQ3: Base Model.
Abstract

Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed training data, and they remain insufficiently robust for Southeast Asian languages. We present SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages trained only on publicly available data, and use it to study three core factors of robust embedding design: data composition, training objective, and base encoder initialization. SEA-Embedding achieves state-of-the-art results on SEA-BED while enabling systematic and reproducible analysis of robust text embeddings for the region.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript presents SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages trained exclusively on publicly available data. It claims state-of-the-art results on the SEA-BED benchmark and uses the open pipeline to systematically analyze the impact of data composition, training objective, and base encoder initialization on embedding robustness for the region.

Significance. If the reported results hold, the work is significant for addressing the reproducibility crisis in embedding models by releasing a fully open pipeline and public resources. This enables verifiable and extensible research on Southeast Asian languages, a historically under-served area. The explicit focus on systematic study of design factors, supported by open code and data, is a clear strength that directly mitigates the non-reproducibility problem highlighted in the abstract.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and for recommending minor revision. We appreciate the recognition of the manuscript's contributions to reproducibility and systematic analysis for Southeast Asian language embeddings.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claim rests on an empirical training pipeline using only publicly available data, followed by evaluation on the SEA-BED benchmark and ablation studies over data composition, objective, and initialization. No equations, derivations, or load-bearing steps are described that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The work is structured as an open reproduction effort, with results presented as outcomes of the described training rather than presupposed by the method itself. This is the standard non-circular case for an empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone does not provide enough detail to identify free parameters, axioms, or invented entities.


