pith. machine review for the scientific record.

arxiv: 2605.08809 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

Yan Sun, Guoxia Wang, Jinle Zeng, JiaBin Yang, Shuai Li, Li Shen, Dacheng Tao, Dianhai Yu, HaiFeng Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:28 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI
keywords: LLM pretraining · embedding regularization · similarity loss · contrastive learning · training convergence · zero-shot performance · Mixture-of-Experts · next-token prediction

The pith

SimReg applies embedding similarity regularization to next-token pretraining so that tokens sharing ground-truth labels form tighter clusters and separate from others.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SimReg, a contrastive regularization term added to standard LLM pretraining. It pulls token embeddings that share the same ground-truth label closer together inside each sequence while pushing embeddings with different labels farther apart. This addresses the high intra-class variance and inter-class similarity that context-dependent embeddings create under pure next-token prediction. The resulting larger classification margins make representation learning more efficient. Experiments on both dense and Mixture-of-Experts models show more than 30 percent faster convergence during pretraining and more than 1 percent higher average zero-shot performance on downstream benchmarks.

Core claim

The central claim is that an embedding similarity regularization loss, which uses contrastive terms to pull together tokens sharing a ground-truth label within each pretraining sequence and to push apart tokens with different labels, enlarges multi-classification margins. This enables more efficient classification while the model continues to optimize next-token prediction. The effect appears as faster training convergence and stronger zero-shot downstream results across dense and Mixture-of-Experts architectures.

What carries the argument

SimReg, an embedding similarity regularization loss that uses contrastive principles to cluster same-label token representations and separate different-label ones inside each pretraining sequence.
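The paper's exact formulation is not reproduced in this review, but a minimal sketch of such an in-sequence contrastive regularizer, assuming cosine similarity, an InfoNCE-style form with temperature `tau`, and a purely additive weight `lam`, would look roughly like this (function names, defaults, and the handling of edge cases are illustrative, not the paper's):

```python
# Minimal sketch of a SimReg-style regularizer, not the paper's exact loss.
# Assumptions: labels are per-position ground-truth label ids, similarity is
# cosine, and the term takes a supervised-contrastive (InfoNCE-style) form.
import torch
import torch.nn.functional as F

def simreg_loss(hidden: torch.Tensor, labels: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    """hidden: (seq_len, dim) token embeddings of one sequence.
    labels: (seq_len,) ground-truth label id for each position."""
    z = F.normalize(hidden, dim=-1)                  # work in cosine-similarity space
    sim = (z @ z.t()) / tau                          # (seq_len, seq_len) scaled similarities
    seq_len = labels.size(0)
    self_mask = torch.eye(seq_len, dtype=torch.bool, device=hidden.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask   # same-label pairs, excluding self
    sim = sim.masked_fill(self_mask, float("-inf"))                # drop self-pairs from the softmax
    log_prob = sim - torch.logsumexp(sim, dim=-1, keepdim=True)    # log-softmax over non-self pairs
    n_pos = pos_mask.sum(dim=-1)                                   # positives available per anchor
    if not (n_pos > 0).any():                                      # no repeated label in this sequence
        return hidden.new_zeros(())
    per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=-1) / n_pos.clamp(min=1)
    return per_anchor[n_pos > 0].mean()

def total_loss(logits, hidden, labels, lam: float = 10.0, tau: float = 0.01):
    """Next-token cross-entropy stays unchanged; the regularizer is purely additive
    and is computed independently inside each sequence of the batch. Positions
    labeled -100 (if any) are ignored by cross_entropy; a fuller version would
    also exclude them from the regularizer."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    reg = torch.stack([simreg_loss(h, y, tau) for h, y in zip(hidden, labels)]).mean()
    return ce + lam * reg
```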

If this is right

  • Pretraining converges more than 30 percent faster while the primary next-token objective remains unchanged.
  • Average zero-shot performance on standard benchmarks rises by more than 1 percent.
  • The same regularization produces gains on both dense transformers and Mixture-of-Experts models.
  • Ablation results supply concrete guidance on choosing the regularization weight and temperature; see the sketch below.
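A hedged illustration of that guidance, based on the appendix excerpt in the reference list below: the extracted text suggests τ = 0.01 and a regularization weight that grows with the embedding dimension d, read here (from a garbled span) as roughly 10 × √(d/1024). The formula and the function name are assumptions, not verbatim statements of the paper.

```python
import math

# Hedged sketch of the appendix's hyperparameter heuristic as read from a
# garbled extraction: tau = 0.01 and lambda_reg ~ 10 * sqrt(d / 1024).
# Treat the formula as an assumption, not a verified statement of the paper.
def suggested_simreg_hparams(embedding_dim: int) -> tuple[float, float]:
    tau = 0.01
    lambda_reg = 10.0 * math.sqrt(embedding_dim / 1024)
    return tau, lambda_reg

# Example: LLaMA2-1.3B uses embedding dim 2048 (Table 4 below), giving lambda_reg ~ 14.1.
print(suggested_simreg_hparams(2048))  # (0.01, 14.142...)
```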

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Label assignment could rely on lightweight heuristics such as entity detection or part-of-speech tags rather than expensive supervision.
  • The margin-enlargement effect may transfer to other unsupervised objectives that lack explicit classification heads.
  • Longer contexts would require careful definition of label consistency across sequence boundaries to preserve the regularization benefit.

Load-bearing premise

Ground-truth labels can be meaningfully assigned to tokens inside each pretraining sequence so the similarity regularization can be applied without interfering with next-token prediction.
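The review does not say how those labels are obtained. One plausible reading, stated here as an assumption rather than as the paper's method, is that each position's label is simply its next-token target, in which case the labels come for free from the raw sequence itself:

```python
import torch

def next_token_labels(input_ids: torch.Tensor) -> torch.Tensor:
    """Assumed labeling, not confirmed by the text above: the ground-truth label
    of position t is the next token input_ids[t + 1]. The final position has no
    target and is marked with -100 (the conventional ignore index) so it can be
    excluded from both the cross-entropy and the similarity regularizer."""
    labels = torch.full_like(input_ids, -100)
    labels[..., :-1] = input_ids[..., 1:]
    return labels
```

Under this reading the premise holds for any next-token corpus; the open question raised by the referee below is whether the paper's actual labeling procedure matches it.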

What would settle it

If pretraining runs on sequences where no reliable ground-truth labels can be assigned show no measurable change in convergence speed or downstream zero-shot scores, the contribution of the regularization would be falsified.

Figures

Figures reproduced from arXiv: 2605.08809 by Dacheng Tao, Dianhai Yu, Guoxia Wang, HaiFeng Wang, JiaBin Yang, Jinle Zeng, Li Shen, Shuai Li, Yan Sun.

Figure 1: (left) Workflow of the SIMREG loss. (Right) We compare the cosine similarity of token embeddings in a sample on the LLaMA-7B model trained via "CrossEntropy only" and "CrossEntropy+SIMREG". Using "CrossEntropy only" fails to enforce sufficient separability among token features, whose cosine values of all token pairs exceed 0.5. With the introduction of SIMREG, feature separability is generally enhanced (a…
Figure 2: (a) We analyze the token ID distribution over 1B training samples from the C4 dataset and …
Figure 3: Cross-entropy loss acceleration (upper) and contrastive similarity improvements (lower) in …
Figure 4: (a) Grid search over hyperparameters τ and λ. The blue blocks indicate the values where the final training loss under the corresponding combination (τ, λ) is lower than the baseline, with darker colors representing lower losses. (b) We further conduct a fine-grained search over different λ values at the generally optimal τ = 0.01, using an approximate 2× scaling ratio. (c) We explore the trends on different λ …
Figure 5: Loss changes of adopting our SIMREG loss at different layers on the 1B model. In this part, we empirically investigate at which positions in the model embedding supervision yields the best results. We divide the network according to its natural layer-wise structure and apply supervision at different depths. As shown in …
Figure 6: The training curve of the SIMREG loss …
Figure 7: The averaged cosine similarity values are 0.488 (CrossEntropy only - left) and …
Figure 8: The averaged cosine similarity values are 0.445 (CrossEntropy only - left) and …
Original abstract

Pretraining large language models (LLMs) with next-token prediction has led to remarkable advances, yet the context-dependent nature of token embeddings in such models results in high intra-class variance and inter-class similarity, thus hindering the efficiency of representation learning. While similarity-based regularization has demonstrated benefit in supervised fine-tuning and classification tasks, its application and efficacy in large-scale LLM pretraining remains underexplored. In this work, we propose the SimReg, an embedding similarity regularization loss that explicitly encourages token representations with the same ground-truth label within each sequence to be more similar, while enforcing separation from different-label tokens via a contrastive loss. Our analysis reveals that this mechanism introduces gains by enlarging multi-classification margins, thereby enabling more efficient classification. Extensive experiments across dense and Mixture-of-Experts (MoE) architectures demonstrate that SimReg consistently accelerates training convergence by over 30% and improves average zero-shot downstream performance by over 1% across standard benchmarks. Further ablation studies and analyses offer practical insights into hyperparameter tuning and loss effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes SimReg, an additive embedding similarity regularization loss for LLM pretraining. It applies a contrastive objective within each sequence that pulls token embeddings sharing the same ground-truth label closer together while pushing apart embeddings with different labels. The authors report that this enlarges multi-classification margins, accelerates convergence by over 30%, and yields more than 1% average improvement in zero-shot downstream performance across dense and Mixture-of-Experts models on standard benchmarks.

Significance. If the labeling step can be shown to be zero-cost and non-interfering with the next-token objective, the approach could offer a practical way to improve representation quality during pretraining. The claimed gains in training speed and downstream accuracy would be substantial for large-scale models, but only if the regularization is reproducible without hidden supervision.

major comments (1)
  1. The central claim depends on assigning ground-truth labels to tokens inside raw pretraining sequences so that the contrastive term can be computed. Standard next-token corpora supply no such labels. The abstract and method description must specify the exact labeling procedure, demonstrate that it introduces no external supervision or dataset-specific artifacts, and confirm that the core language-modeling loss remains unaltered. Without these details the reported 30% convergence acceleration and >1% zero-shot gains cannot be evaluated or reproduced.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive report. We address the single major comment below and will revise the manuscript to improve clarity and reproducibility as suggested.

Point-by-point responses
  1. Referee: The central claim depends on assigning ground-truth labels to tokens inside raw pretraining sequences so that the contrastive term can be computed. Standard next-token corpora supply no such labels. The abstract and method description must specify the exact labeling procedure, demonstrate that it introduces no external supervision or dataset-specific artifacts, and confirm that the core language-modeling loss remains unaltered. Without these details the reported 30% convergence acceleration and >1% zero-shot gains cannot be evaluated or reproduced.

    Authors: We agree that the labeling procedure requires explicit description for reproducibility. In the revised manuscript we will expand both the abstract and the Method section to state the exact procedure used to assign ground-truth labels to tokens within each raw pretraining sequence. The procedure operates solely on information already present in the input sequences, introduces no external supervision or dataset-specific artifacts, and leaves the next-token prediction loss completely unchanged; SimReg is implemented strictly as an additive auxiliary term. These additions will make the 30% convergence acceleration and >1% zero-shot gains fully evaluable and reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity; SimReg is an independent additive loss.

full rationale

The paper introduces SimReg as a contrastive regularization term added to next-token prediction. It explicitly conditions on ground-truth labels within sequences to pull same-label embeddings together and push others apart. No equations, derivations, or claims reduce this term to a fitted parameter, self-referential definition, or output of the main objective. No self-citations are invoked as load-bearing uniqueness theorems, and no known empirical pattern is merely renamed. The reported convergence and zero-shot gains are presented as empirical outcomes of the auxiliary loss rather than tautological consequences of its construction. The label-assignment premise is an external modeling choice whose validity is separate from circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, background axioms, or new postulated entities are described beyond the standard contrastive loss formulation.

pith-pipeline@v0.9.0 · 5503 in / 1115 out tokens · 88654 ms · 2026-05-12T03:28:24.680810+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 5 internal anchors

  1. [1]

    Next token prediction towards multimodal intelligence: A comprehensive survey

    Liang Chen, Zekun Wang, Shuhuai Ren, Lei Li, Haozhe Zhao, Yunshui Li, Zefan Cai, Hongcheng Guo, Lei Zhang, Yizhe Xiong, et al. arXiv preprint arXiv:2412.18619.

  2. [2]

    Joint selection for large-scale pre-training data via policy gradient-based mask learning

    Ziqing Fan, Yuqiao Xian, Yan Sun, and Li Shen. arXiv preprint arXiv:2512.24265.

  3. [3]

    Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, and Bryan Catanzaro

    URL https://zenodo.org/records/12608602. Pengzhi Gao, Ruiqing Zhang, Zhongjun He, Hua Wu, and Haifeng Wang. An empirical study of consistency regularization for end-to-end speech-to-text translation. arXiv preprint arXiv:2308.14482.

  4. [4]

    SimCSE: Simple contrastive learning of sentence embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. arXiv preprint arXiv:2104.08821.

  5. [5]

    Enhancing sequential recommendation via LLM-based semantic embedding learning

    Jun Hu, Wenwen Xia, Xiaolu Zhang, Chilin Fu, Weichang Wu, Zhaoxin Huan, Ang Li, Zuoli Tang, and Jun Zhou. In Companion Proceedings of the ACM Web Conference 2024, pages 103–111. URL https://openreview.net/forum?id=cu7IUiOhujH.

  6. [6]

    Mixtral of experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. arXiv preprint arXiv:2401.04088. URL https://api.semanticscholar.org/CorpusID:236134216.

  7. [7]

    Fast and efficient 2-bit LLM inference on GPU: 2/4/16-bit in a weight matrix with asynchronous dequantization

    Jinhao Li, Jiaming Xu, Shiyao Li, Shan Huang, Jun Liu, Yaoxiu Lian, and Guohao Dai. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, pages 1–9, 2024a. Yingcong Li, Yixiao Huang, Muhammed E Ildiz, Ankit Singh R…

  8. [8]

    Text and code embeddings by contrastive pre-training

    Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. arXiv preprint arXiv:2201.10005.

  9. [9]

    Representation learning with contrastive predictive coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. arXiv preprint arXiv:1807.03748.

  10. [10]

    GLU variants improve Transformer

    Noam Shazeer. arXiv preprint arXiv:2002.05202.

  11. [11]

    A simple contrastive learning framework for interactive argument pair identification via argument-context extraction

    Lida Shi, Fausto Giunchiglia, Rui Song, Daqian Shi, Tongtong Liu, Xiaolei Diao, and Hao Xu. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10027–10039.

  12. [12]

    MaskPro: Linear-space probabilistic learning for strict (N:M)-sparsity on large language models

    Yan Sun, Qixin Zhang, Zhiyuan Yu, Xikun Zhang, Li Shen, and Dacheng Tao. arXiv preprint arXiv:2506.12876.

  13. [13]

    LLMs are also effective embedding models: An in-depth overview

    Chongyang Tao, Tao Shen, Shen Gao, Junshuo Zhang, Zhen Li, Kai Hua, Wenpeng Hu, Zhengwei Tao, and Shuai Ma. arXiv preprint arXiv:2412.12591.

  14. [14]

    Llama 2: Open foundation and fine-tuned chat models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. arXiv preprint arXiv:2307.09288.

  15. [15]

    AdaGC: Improving training stability for large language model pretraining

    Guoxia Wang, Shuai Li, Congliang Chen, Jinle Zeng, Jiabin Yang, Tao Sun, Yanjun Ma, Dianhai Yu, and Li Shen. arXiv preprint arXiv:2502.11034.

  16. [16]

    Do generated data always help contrastive learning?

    Yifei Wang, Jizhe Zhang, and Yisen Wang. arXiv preprint arXiv:2403.12448.

  17. [17]

    Is contrastive learning necessary? A study of data augmentation vs contrastive learning in sequential recommendation

    Peilin Zhou, You-Liang Huang, Yueqi Xie, Jingqi Gao, Shoujin Wang, Jae Boum Kim, and Sunghun Kim. In Proceedings of the ACM Web Conference 2024, pages 3854–3863.

  18. [18]

    A Appendix: Experiments. A.1 Experimental Setups. Here we present the detailed experimental setups in this paper to ensure reproducibility. Model Hyperparameters. We mainly select LLaMA2 [Touvron et al., 2023] and Mixtral [Jiang et al., 2024] as the dense and MoE backbones for pretraining, including the core modules of the mainstream models in the cur…

  19. [19]

    Table 4: Model Hyperparameters.

    | Model | Experts | Layers | Attention heads | Embedding dim | FFN hidden size |
    | LLaMA2-350M | 1 | 24 | 16 | 1024 | 2371 |
    | LLaMA2-1.3B | 1 | 24 | 32 | 2048 | 5461 |
    | LLaMA2-3B | 1 | 26 | 32 | 3072 | 8640 |
    | LLaMA2-7B | 1 | 32 | 32 | 4096 | 11008 |
    | Mixtral-8×1B | 8 | 24 | 32 | 2048 | 5632 |

    Training Hyperparameters. We follow the experimental setups reported in several recent classical LLM pretraining stud…

  20. [20]

    Table 5: Training Hyperparameters.

    | Model | batch size | seq len | learning rate | λw | β1 | β2 | clip-λ | clip-β |
    | LLaMA-350M | 512 | 2048 | 4e-4→4e-5 | 0.1 | 0.9 | 0.95 | 1.04 | 0.99 |
    | LLaMA-1.3B | 2048 | 2048 | 3e-4→3e-5 | 0.1 | 0.9 | 0.95 | 1.04 | 0.99 |
    | LLaMA-3B | 2048 | 2048 | 3e-4→3e-5 | 0.1 | 0.9 | 0.95 | 1.04 | 0.99 |
    | LLaMA-7B | 2048 | 2048 | 3e-4→3e-5 | 0.1 | 0.9 | 0.95 | 1.04 | 0.99 |
    | Mixtral-8×1B | 512 | 2048 | 3e-4→3e-5 | 0.1 | 0.9 | 0.95 | 1.04 | 0.99 |

    Spe…

  21. [21]

    λreg = 0 (baseline): 15.06, 10.72, 9.70, 8.99
    λreg = 5: 14.36, 10.46, 9.50, 8.92
    λreg = 10: 14.25, 10.41, 9.46, 8.84
    λreg = 20: 14.29, 10.42, 9.44, 8.78
    λreg = 50: 14.33, 10.49, 9.49, 8.81

    It can be observed that the trend largely aligns with our hypothesis. Therefore, we propose the following estimation method for the optimal hyperparameters: τ = 0.01, λreg ≈ 10 × √(d/1024), wh…