Pith Number

pith:GEFYM37A

pith:2025:GEFYM37A2FREIHXV3RWZ2XONEL
not attested · not anchored · not stored · refs resolved

EmbeddingGemma: Powerful and Lightweight Text Representations

Aashi Jain, Abheesht Sharma, Adam Roberts, Adham Elarabawy, AJ Co, Alice Lisak, Andreas Doumanoglou, Armand Joulin, Babak Samari, Ben Hora, Biao Zhang, Brian Potetz, Cormac Brick, Dahun Kim, Daniel Cer, Daniel Salz, Divyashree Sreepathihalli, Enrique Alfonseca, Fedor Moiseev, Feiyang Chen, Feng Han, Francesco Visin, Frank Palma Gomez, Gaël Liu, Glenn Cameron, Gus Martins, Gustavo Hernández Ábrego, Henrique Schechter Vera, Hesen Zhang, Hui Hui, Ian Ballantyne, Iftekhar Naim, Jay Han, Jiageng Zhang, Jingxiao Zheng, Jinhyuk Lee, Joe Zou, Juyeong Ji, Jyotinder Singh, Kaifeng Chen, Karan Gill, Kat Black, Kathleen Kenealy, Ke Chen, Koert Chen, Lucas Gonzalez, Madhuri Shanbhogue, Mark Sherwood, Michael Boratko, Michelle Casbon, Min Choi, Mojtaba Seyedhosseini, Olivier Lacombe, Omar Sanseviero, Paul Suganthan, Qin Yin, Raphael Hoffmann, Ravin Kumar, Renjie Wu, Ryan Mullins, Sahil Dua, Sai Meher Karthik Duddu, Sandeep Mariserla, Sara Smoot, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sindhu Raghuram Panyam, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Thomas Mesnard, Tom Duerig, Trevor Walker, Tris Warkentin, Vikram Rao, Waleed Khawaja, Weiyi Wang, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Yunhsuan Sung, Zach Gleicher, Zhe Dong, Zhe Li, Zhongli Ding

A 300-million-parameter model reaches state-of-the-art text embedding results on MTEB among models with fewer than 500M parameters

arxiv:2509.20354 v3 · 2025-09-24 · cs.CL · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{GEFYM37A2FREIHXV3RWZ2XONEL}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open · sign in to claim
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

EmbeddingGemma (300M) achieves state-of-the-art results on MTEB across multilingual, English, and code domains, outperforming prior top models with fewer than 500M parameters and providing performance comparable to models double its size.

C2 · weakest assumption

That the described training recipe (encoder-decoder initialization, geometric embedding distillation, spread-out regularizer, and checkpoint merging from varied mixtures) is the primary driver of the reported gains rather than data selection, base model scale, or evaluation specifics.

C3 · one-line summary

A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.

References

27 extracted · 27 resolved · 11 Pith anchors

[1] A. Asai, J. Kasai, J. H. Clark, K. Lee, E. Choi, and H. Hajishirzi. XOR QA: Cross-lingual open-retrieval question answering. In Proceedings of the 2021 Conference of the North American Chapter of the A 2021
[2] Small Language Models are the Future of Agentic AI · arXiv:2506.02153
[3] Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, and Even Oldridge
[4] MMTEB: Massive multilingual text embedding benchmark · doi:10.48550/arxiv.2502.13595
[5] SimCSE: Simple Contrastive Learning of Sentence Embeddings · arXiv:2104.08821

Cited by

33 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:52.589643Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

310b866fe0d162441ef5dc6d9d5dcd22f07efd7a8d302ebce5c10a124a95fc21

Aliases

arxiv: 2509.20354 · arxiv_version: 2509.20354v3 · doi: 10.48550/arxiv.2509.20354 · pith_short_12: GEFYM37A2FRE · pith_short_16: GEFYM37A2FREIHXV · pith_short_8: GEFYM37A
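In the aliases listed above, each pith_short_N value is the first N characters of the full Pith Number. The sketch below reproduces that observation; the helper name short_aliases is hypothetical, and prefix truncation is an assumption drawn only from this record, not a documented derivation rule.

```python
def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Return prefix-based short aliases for a Pith Number.

    Assumption: short aliases are plain prefixes of the full identifier,
    as the values on this page suggest.
    """
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


aliases = short_aliases("GEFYM37A2FREIHXV3RWZ2XONEL")
for name, value in aliases.items():
    print(name, value)
```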
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/GEFYM37A2FREIHXV3RWZ2XONEL \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 310b866fe0d162441ef5dc6d9d5dcd22f07efd7a8d302ebce5c10a124a95fc21
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "285bdf1032753db782a592a27f228760328e02fd9ae661a3caf0cd9ccbaaa24e",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2025-09-24T17:56:51Z",
    "title_canon_sha256": "8aa9784a6c14d3d4ef634410538db259293033755b109710214f8d3ca6132d0e"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2509.20354",
    "kind": "arxiv",
    "version": 3
  }
}
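The verification command above pipes the canonical record through a Python one-liner. The same step is spelled out below as a minimal offline sketch: it serializes the record with sorted keys and compact separators, then takes the SHA-256 of the UTF-8 bytes. The function name canonical_sha256 is hypothetical, and the record is copied by hand from this page, so the comparison is printed rather than assumed to succeed.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the way the page's one-liner does:
    sorted keys, compact separators, no ASCII escaping, UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Canonical record as shown on this page (copied by hand).
record = {
    "metadata": {
        "abstract_canon_sha256": "285bdf1032753db782a592a27f228760328e02fd9ae661a3caf0cd9ccbaaa24e",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2025-09-24T17:56:51Z",
        "title_canon_sha256": "8aa9784a6c14d3d4ef634410538db259293033755b109710214f8d3ca6132d0e",
    },
    "schema_version": "1.0",
    "source": {"id": "2509.20354", "kind": "arxiv", "version": 3},
}

# Hash the page says to expect.
EXPECTED = "310b866fe0d162441ef5dc6d9d5dcd22f07efd7a8d302ebce5c10a124a95fc21"

digest = canonical_sha256(record)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because the keys are sorted before hashing, the digest does not depend on the order in which the fields appear in the served JSON, which is what lets a mirror recompute the same value.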