arxiv: 2605.01205 · v1 · submitted 2026-05-02 · 💻 cs.CL

Recognition: unknown

SRA: Span Representation Alignment for Large Language Model Distillation

Quoc Phong Dao , Hoang Son Nguyen , Pham Khanh Chi , Tung Nguyen , Linh Ngo Van , Nguyen Thi Ngoc Diep , Trung Le

Authors on Pith no claims yet

Pith reviewed 2026-05-09 15:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords knowledge distillationlarge language modelscross-tokenizer distillationspan representationcenter of massrepresentation alignmentmodel compression

0 comments

The pith

SRA shifts LLM distillation alignment from tokens to attention-weighted span centers of mass for better cross-tokenizer transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SRA as a framework for knowledge distillation between large language models and smaller students that use mismatched tokenizers. It claims that token-level alignment is brittle, so the key is to aggregate tokens first into spans and align the spans instead. Each span is treated as a cluster of particles whose state is captured by its center of mass, an attention-weighted average of the tokens inside it. A geometric regularizer keeps the representation space intact and aligned span logits carry the distilled knowledge. Experiments across different model architectures show consistent gains over prior token-based methods.

Core claim

SRA reframes cross-tokenizer knowledge distillation by moving the alignment target from individual tokens to robust spans, each represented by its attention-weighted center of mass under a multi-particle dynamical systems model, and demonstrates that this produces representations that are more stable across tokenizers and yield stronger distillation performance than token-level baselines.

What carries the argument

The span center of mass, defined as the attention-weighted average of token representations within a span and treated as the state of a particle cluster in a multi-particle dynamical system.

If this is right

Distillation performance becomes less dependent on the exact token boundaries chosen by each model's tokenizer.
Attention weighting focuses alignment on the most salient spans, preserving semantic content that would be diluted at the token level.
The geometric regularizer maintains structural consistency in the shared representation space during transfer.
Adding aligned span logit distillation supplies an extra channel for knowledge transfer beyond representation matching alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same span-center approach could be tested on other cross-model tasks such as retrieval or translation where tokenizers also differ.
If the particle-cluster framing is useful, it might suggest treating attention heads themselves as dynamical systems whose equilibria can be aligned directly.
The method may scale to distillation involving multimodal models where spans could be defined over image patches or audio segments as well.

Load-bearing premise

Modeling spans as particle clusters and using their attention-weighted centers of mass produces representations that remain robust to tokenizer mismatch and carry more useful information for distillation than token-level aggregation.

What would settle it

Re-running the reported cross-architecture distillation experiments but replacing the attention-weighted span center of mass with either token-level alignment or non-attention-weighted span averages, and checking whether the performance gap over CTKD baselines disappears.

Figures

Figures reproduced from arXiv: 2605.01205 by Hoang Son Nguyen, Linh Ngo Van, Nguyen Thi Ngoc Diep, Pham Khanh Chi, Quoc Phong Dao, Trung Le, Tung Nguyen.

**Figure 1.** Figure 1: An illustration of the tokenizer mismatch view at source ↗

**Figure 2.** Figure 2: An illustration of the proposed SRA framework. Teacher–student spans are first matched using longest view at source ↗

**Figure 3.** Figure 3: Win rates (%) for distilling Qwen 2.5-7B→GPT2 1.5B, evaluated by GPT-4o-mini view at source ↗

**Figure 4.** Figure 4: Prompt for GPT-4 evaluation view at source ↗

read the original abstract

Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that the method of aggregating tokens into more robust representations before distillation is of equal importance. In this paper, we introduce \textbf{SRA} (\textbf{S}pan \textbf{R}epresentation \textbf{A}lignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM) - an attention-weighted average that captures rich semantic information. We leverage the concept of span centers of mass with attention-derived weighting to prioritize the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically-grounded approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SRA offers a practical shift to span-level alignment with attention-weighted centers of mass for cross-tokenizer distillation, but the abstract gives no numbers or implementation details to judge whether the gains are real or tuned.

read the letter

Hi there, The punchline on this paper is that it proposes shifting from token-level to span-level alignment in cross-tokenizer knowledge distillation, using a center-of-mass representation derived from attention weights, framed loosely as multi-particle dynamics. This targets a practical issue where different tokenizers make direct token matching unreliable. What the work does reasonably well is highlight that how you aggregate information matters just as much as how you align it. By defining spans and computing their centers of mass with attention-based weights, they aim for representations that are more tokenizer-agnostic and semantically richer. Adding a geometric regularizer to maintain structure in the representation space and performing distillation on the aligned span logits are concrete additions. The abstract suggests this leads to better performance in challenging cross-architecture setups compared to existing CTKD baselines. The physical analogy provides a nice way to think about the clustering of tokens into spans, even if it doesn't lead to new theorems or predictions. It engages with prior work on span representations and attention aggregation by adapting them to the distillation setting. That said, the soft spots are noticeable. The claims of consistent and significant outperformance lack any supporting numbers, error bars, or ablation studies in the abstract. Details on how spans are chosen, the exact form of the geometric regularizer, and implementation specifics are missing, which raises questions about whether the method generalizes or if results depend on careful tuning on the evaluation data. The multi-particle framing seems more like inspirational packaging than a load-bearing mathematical contribution that could be tested independently. This paper would appeal to people in the model compression and efficient inference community, especially those dealing with distillation across models with incompatible tokenizers. A reader interested in practical improvements to LLM deployment might get some ideas from the span aggregation strategy. Overall, I would recommend sending it for peer review. The idea is straightforward and addresses a genuine bottleneck, so referees can evaluate whether the experimental evidence holds up and if the method offers clear advantages over simpler aggregation baselines.

Referee Report

2 major / 2 minor

Summary. The paper introduces SRA, a framework for cross-tokenizer knowledge distillation (CTKD) that reframes alignment through multi-particle dynamical systems. It shifts from token-level to span-level representations, where each span is modeled as a cluster of particles whose state is captured by an attention-weighted center of mass (CoM). The method adds a geometric regularizer to maintain structural properties of the representation space and aligned span-logit distillation for improved transfer. The central empirical claim is that SRA consistently and significantly outperforms state-of-the-art CTKD baselines in cross-architecture distillation experiments.

Significance. If the reported gains prove robust, SRA could offer a practical advance for distilling knowledge between LLMs with mismatched tokenizers and architectures by using higher-level, semantically richer alignment units. The physical-systems framing provides intuitive motivation for the CoM construction and regularizer, and the combination of components addresses a known brittleness in token-level CTKD. Reproducibility would be strengthened by the explicit empirical validation against baselines.

major comments (2)

[§4 (Experiments)] §4 (Experiments): The abstract asserts consistent and significant outperformance over CTKD baselines, yet no quantitative metrics, error bars, dataset specifications, model pairs, or ablation results on span selection, CoM weighting, or the geometric regularizer are supplied. These details are load-bearing for evaluating whether the gains exceed what could be achieved by standard aggregation functions or post-hoc tuning.
[§3.2 (Center of Mass formulation)] §3.2 (Center of Mass formulation): The CoM is defined as an attention-weighted average of tokens within a span, but the precise normalization of attention weights, handling of cross-tokenizer span boundaries, and any free parameters in the weighting scheme are not specified. Without this, it is unclear whether the method is truly tokenizer-agnostic or reduces to a fitted aggregation that could be replicated without the multi-particle framing.

minor comments (2)

[Abstract] Abstract: The phrase 'challenging cross-architecture distillation experiments' should name the specific teacher-student architecture pairs and datasets to allow immediate assessment of the claim's scope.
[Notation] Notation: Ensure consistent use of symbols for spans, CoM, and the geometric regularizer across sections; a table summarizing all hyperparameters would aid clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. Their comments highlight important areas for clarification and additional empirical support. We address each major comment point by point below, indicating the revisions we will incorporate in the updated version.

read point-by-point responses

Referee: [§4 (Experiments)] §4 (Experiments): The abstract asserts consistent and significant outperformance over CTKD baselines, yet no quantitative metrics, error bars, dataset specifications, model pairs, or ablation results on span selection, CoM weighting, or the geometric regularizer are supplied. These details are load-bearing for evaluating whether the gains exceed what could be achieved by standard aggregation functions or post-hoc tuning.

Authors: We appreciate the referee's emphasis on empirical rigor. While Section 4 reports performance numbers on cross-architecture pairs (e.g., Llama-2 to Mistral and similar), we acknowledge that error bars, explicit dataset/model tables, and component ablations were not sufficiently detailed. In the revision we will add: (i) mean and standard deviation over three random seeds for all main results, (ii) a summary table listing exact datasets, model sizes, and tokenizer vocabularies, and (iii) ablation tables isolating span selection heuristics, attention-based CoM weighting, and the geometric regularizer. These additions will directly address whether the observed gains exceed those obtainable from simpler aggregation baselines or post-hoc tuning. revision: yes
Referee: [§3.2 (Center of Mass formulation)] §3.2 (Center of Mass formulation): The CoM is defined as an attention-weighted average of tokens within a span, but the precise normalization of attention weights, handling of cross-tokenizer span boundaries, and any free parameters in the weighting scheme are not specified. Without this, it is unclear whether the method is truly tokenizer-agnostic or reduces to a fitted aggregation that could be replicated without the multi-particle framing.

Authors: We agree that the current description in §3.2 lacks sufficient mathematical detail. The attention weights are normalized with a softmax taken exclusively over the tokens belonging to each span (ensuring they sum to one). Span boundaries are aligned across tokenizers by first recovering word-level segments from the original text via a deterministic detokenization step, then projecting those segments onto each model's subword sequence; this mapping uses no learned parameters. The weighting itself is taken directly from the teacher's attention heads with no additional hyperparameters. We will revise §3.2 to include the explicit normalized CoM equation, the word-level alignment procedure, and pseudocode, thereby clarifying that the construction is tokenizer-agnostic and motivated by the multi-particle analogy rather than being an arbitrary fitted aggregator. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent experimental validation

full rationale

The paper defines SRA via an explicit modeling choice (attention-weighted span CoM under a multi-particle analogy) and reports empirical gains on cross-architecture distillation benchmarks. No equations, uniqueness theorems, or self-citations are shown that reduce the reported performance to a fitted parameter or to the input data by construction. The physical framing functions as interpretive motivation for the aggregation unit; success is measured by downstream distillation metrics rather than by any internal identity or self-referential prediction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that attention-weighted span centers of mass capture semantic information more robustly than token-level or other aggregation methods, and that the multi-particle dynamical systems framing supplies useful inductive bias. No explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5551 in / 1217 out tokens · 29519 ms · 2026-05-09T15:18:31.469907+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

83 extracted references · 16 canonical work pages · 3 internal anchors

[1]

2015 , eprint=

Distilling the Knowledge in a Neural Network , author=. 2015 , eprint=

2015
[2]

2020 , eprint=

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , author=. 2020 , eprint=

2020
[3]

Tinybert: Distilling bert for natural language understanding.arXiv preprint arXiv:1909.10351,

Xiaoqi Jiao and Yichun Yin and Lifeng Shang and Xin Jiang and Xiao Chen and Linlin Li and Fang Wang and Qun Liu , title =. CoRR , volume =. 2019 , url =. 1909.10351 , timestamp =

work page arXiv 2019
[4]

2023 , eprint=

MTEB: Massive Text Embedding Benchmark , author=. 2023 , eprint=

2023
[5]

2025 , eprint=

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models , author=. 2025 , eprint=

2025
[6]

2024 , eprint=

C-Pack: Packed Resources For General Chinese Embeddings , author=. 2024 , eprint=

2024
[7]

2025 , eprint=

Jasper and Stella: distillation of SOTA embedding models , author=. 2025 , eprint=

2025
[8]

2019 , eprint=

Patient Knowledge Distillation for BERT Model Compression , author=. 2019 , eprint=

2019
[9]

2024 , eprint=

MiniLLM: Knowledge Distillation of Large Language Models , author=. 2024 , eprint=

2024
[10]

2025 , eprint=

Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs , author=. 2025 , eprint=

2025
[11]

2024 , eprint=

Knowledge Fusion of Large Language Models , author=. 2024 , eprint=

2024
[12]

2024 , eprint=

Dual-Space Knowledge Distillation for Large Language Models , author=. 2024 , eprint=

2024
[13]

2020 , eprint=

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers , author=. 2020 , eprint=

2020
[14]

2025 , eprint=

A Survey of Large Language Models , author=. 2025 , eprint=

2025
[15]

2020 , eprint=

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices , author=. 2020 , eprint=

2020
[16]

2023 , eprint=

DistilCSE: Effective Knowledge Distillation For Contrastive Sentence Embeddings , author=. 2023 , eprint=

2023
[17]

2016 , eprint=

Sequence-Level Knowledge Distillation , author=. 2016 , eprint=

2016
[18]

2023 , eprint=

Specializing Smaller Language Models towards Multi-Step Reasoning , author=. 2023 , eprint=

2023
[19]

2023 , eprint=

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes , author=. 2023 , eprint=

2023
[20]

2019 , eprint=

Similarity of Neural Network Representations Revisited , author=. 2019 , eprint=

2019
[21]

2024 , eprint=

KDMCSE: Knowledge Distillation Multimodal Sentence Embeddings with Adaptive Angular margin Contrastive Learning , author=. 2024 , eprint=

2024
[22]

2025 , eprint=

CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers , author=. 2025 , eprint=

2025
[23]

and Szedmak, Sandor and Shawe-Taylor, John , journal=

Hardoon, David R. and Szedmak, Sandor and Shawe-Taylor, John , journal=. Canonical Correlation Analysis: An Overview with Application to Learning Methods , year=
[24]

2023 , eprint=

Feature Structure Distillation with Centered Kernel Alignment in BERT Transferring , author=. 2023 , eprint=

2023
[25]

2025 , eprint=

Rho-1: Not All Tokens Are What You Need , author=. 2025 , eprint=

2025
[26]

Context-aware Event Forecasting via Graph Disentanglement , booktitle =

Li, Junyan and Zhang, Li Lyna and Xu, Jiahang and Wang, Yujing and Yan, Shaoguang and Xia, Yunqing and Yang, Yuqing and Cao, Ting and Sun, Hao and Deng, Weiwei and Zhang, Qi and Yang, Mao , year=. Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference , url=. doi:10.1145/3580305.3599284 , booktitle=

work page doi:10.1145/3580305.3599284
[27]

2021 , eprint=

Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth , author=. 2021 , eprint=

2021
[28]

33rd British Machine Vision Conference 2022,

Aninda Saha and Alina N Bialkowski and Sara Khalifa , title =. 33rd British Machine Vision Conference 2022,. 2022 , url =

2022
[29]

The Thirteenth International Conference on Learning Representations , year=

Improving Language Model Distillation through Hidden State Matching , author=. The Thirteenth International Conference on Learning Representations , year=
[30]

BIGPATENT : A Large-Scale Dataset for Abstractive and Coherent Summarization

Sharma, Eva and Li, Chen and Wang, Lu. BIGPATENT : A Large-Scale Dataset for Abstractive and Coherent Summarization. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1212

work page doi:10.18653/v1/p19-1212 2019
[31]

Tushar Khot and Ashish Sabharwal and Peter Clark , Booktitle =
[32]

Proceedings of the AAAI Conference on Artificial Intelligence , author=

Natural Language Inference in Context - Investigating Contextual Reasoning over Long Texts , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2021 , month=. doi:10.1609/aaai.v35i15.17580 , number=

work page doi:10.1609/aaai.v35i15.17580 2021
[33]

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Adversarial NLI: A New Benchmark for Natural Language Understanding , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

2020
[34]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Multi-level optimal transport for universal cross-tokenizer knowledge distillation on language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[35]

2024 , url=

Parishad BehnamGhader and Vaibhav Adlakha and Marius Mosbach and Dzmitry Bahdanau and Nicolas Chapados and Siva Reddy , booktitle=. 2024 , url=

2024
[36]

S em E val-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment

Marelli, Marco and Bentivogli, Luisa and Baroni, Marco and Bernardi, Raffaella and Menini, Stefano and Zamparelli, Roberto. S em E val-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment. Proceedings of the 8th International Workshop on Semantic Evaluation ( S em E v...

work page doi:10.3115/v1/s14-2001 2014
[37]

2024 , eprint=

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation , author=. 2024 , eprint=

2024
[38]

URLhttps://doi.org/10.3389/neuro.06.004.2008

Kriegeskorte, Nikolaus and Mur, Marieke and Bandettini, Peter A. , TITLE=. Frontiers in Systems Neuroscience , VOLUME=. 2008 , URL=. doi:10.3389/neuro.06.004.2008 , ISSN=

work page doi:10.3389/neuro.06.004.2008 2008
[39]

2023 , url=

Less is More: Task-aware Layer-wise Distillation for Language Model Compression , author=. 2023 , url=

2023
[40]

arXiv preprint arXiv:2508.12519 , year=

An Introduction to Sliced Optimal Transport , author=. arXiv preprint arXiv:2508.12519 , year=

work page arXiv
[41]

Improving Vietnamese-English Cross-Lingual Retrieval for Legal and General Domains , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers) , pages=

2025
[42]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Mitigating Non-Representative Prototypes and Representation Bias in Few-Shot Continual Relation Extraction , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[43]

Enhancing Discriminative Representation in Similar Relation Clusters for Few-Shot Continual Relation Extraction , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025
[44]

Mutual-pairing Data Augmentation for Fewshot Continual Relation Extraction , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025
[45]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

HiCOT: Improving Neural Topic Models via Optimal Transport and Contrastive Learning , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

2025
[46]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Topic Modeling for Short Texts via Optimal Transport-Based Clustering , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

2025
[47]

Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

Understanding and improving transformer from a multi-particle dynamic system point of view , author=. arXiv preprint arXiv:1906.02762 , year=

work page Pith review arXiv 1906
[48]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Beyond Logits: Aligning Feature Dynamics for Effective Knowledge Distillation , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[49]

2009 , publisher=

Systemic functional grammar: A first step into the theory , author=. 2009 , publisher=

2009
[50]

2015 , publisher=

Lexical-functional syntax , author=. 2015 , publisher=

2015
[51]

2019 , eprint=

What Does BERT Look At? An Analysis of BERT's Attention , author=. 2019 , eprint=

2019
[52]

Lifting the Curse of Capacity Gap in Distilling Language Models

Zhang, Chen and Yang, Yang and Liu, Jiahao and Wang, Jingang and Xian, Yunsen and Wang, Benyou and Song, Dawei. Lifting the Curse of Capacity Gap in Distilling Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

2023
[53]

AD - KD : Attribution-Driven Knowledge Distillation for Language Model Compression

Wu, Siyue and Chen, Hongzhan and Quan, Xiaojun and Wang, Qifan and Wang, Rui. AD - KD : Attribution-Driven Knowledge Distillation for Language Model Compression. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

2023
[54]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Multi-granularity structural knowledge distillation for language model compression , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[55]

2025 , eprint=

Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping , author=. 2025 , eprint=

2025
[56]

2023 , eprint =

Adapter-based Selective Knowledge Distillation for Federated Multi-domain Meeting Summarization , author =. 2023 , eprint =. doi:10.48550/arXiv.2308.03275 , url =

work page doi:10.48550/arxiv.2308.03275 2023
[57]

2024 , eprint =

VkD: Improving Knowledge Distillation using Orthogonal Projections , author =. 2024 , eprint =. doi:10.48550/arXiv.2403.06213 , url =

work page doi:10.48550/arxiv.2403.06213 2024
[58]

1970 , publisher=

An introduction to celestial mechanics , author=. 1970 , publisher=

1970
[59]

A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models

Udagawa, Takuma and Trivedi, Aashka and Merler, Michele and Bhattacharjee, Bishwaranjan. A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2023

2023
[60]

2024 , eprint =

DeepSeek-V3 Technical Report , author =. 2024 , eprint =

2024
[61]

2024 , eprint =

GPT-4 Technical Report , author =. 2024 , eprint =

2024
[62]

Advances in Neural Information Processing Systems (NeurIPS) , volume =

Neural Ordinary Differential Equations , author =. Advances in Neural Information Processing Systems (NeurIPS) , volume =. 2018 , url =

2018
[63]

2025 , eprint =

Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching , author =. 2025 , eprint =. doi:10.48550/arXiv.2503.20083 , url =

work page doi:10.48550/arxiv.2503.20083 2025
[64]

Super- N atural I nstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah and Kordi, Yeganeh and Mirzaei, Amirreza and Arunkumar, Anjana and Ashok, Arjun and Dhanasekaran, Arut Selvan and Naik, Atharva and Stap, David and Pathak, Eshaan and Karamanolakis, Giannis and Lai, Haizhi Gary and Purohit, Ishan and Mondal, Ishani and Anderson, Jacob and Kuznia, Kirby and Dosh...

work page doi:10.18653/v1/2022.emnlp-main.340 2022
[65]

and Stoica, Ion and Xing, Eric P

Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna:
[66]

D ialog S um: A Real-Life Scenario Dialogue Summarization Dataset

Chen, Yulong and Liu, Yang and Chen, Liang and Zhang, Yue. D ialog S um: A Real-Life Scenario Dialogue Summarization Dataset. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021

2021
[67]

Smith, Daniel Khashabi, and Hannaneh Hajishirzi

Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A. and Khashabi, Daniel and Hajishirzi, Hannaneh. S elf- I nstruct: Aligning Language Models with S elf- G enerated Instructions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl...

work page doi:10.18653/v1/2023.acl-long.754 2023
[68]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=. 2019 , url =

2019
[69]

2024 , url =

Zhang, Peiyuan and Zeng, Guangtao and Wang, Tianduo and Lu, Wei , journal=. 2024 , url =

2024
[70]

Qwen Technical Report

Qwen Technical Report , author =. arXiv preprint arXiv:2309.16609 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[71]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang and Stephen Roller and Naman Goyal and Mikel Artetxe and Moya Chen and Shuohui Chen and Christopher Dewan and Mona Diab and Xian Li and Xi Victoria Lin and Todor Mihaylov and Myle Ott and Sam Shleifer and Kurt Shuster and Daniel Simig and Punit Singh Koura and Anjali Sridhar and Tianlu Wang and Luke Zettlemoyer , year=. 2205.01068 , archivePrefix=

work page internal anchor Pith review arXiv
[72]

Identifying and Mitigating Vulnerabilities in

Jiang, Fengqing and Xu, Zhangchen and Niu, Luyao and Wang, Boxin and Jia, Jinyuan and Li, Bo and Poovendran, Radha , journal =. Identifying and Mitigating Vulnerabilities in. 2023 , month = nov, doi =

2023
[73]

Qwen2.5 Technical Report

Qwen2.5 technical report , author=. arXiv preprint arXiv:2412.15115 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[74]

Text summarization branches out , pages=

Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=
[75]

2024 , eprint=

Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models , author=. 2024 , eprint=

2024
[76]

2024 , eprint=

DistiLLM: Towards Streamlined Distillation for Large Language Models , author=. 2024 , eprint=

2024
[77]

2023 , eprint=

f-Divergence Minimization for Sequence-Level Knowledge Distillation , author=. 2023 , eprint=

2023
[78]

Findings of the Association for Computational Linguistics: EACL , pages=

DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation , author=. Findings of the Association for Computational Linguistics: EACL , pages=
[79]

Fortieth

Truong Nguyen and Phi Van Dat and Ngan Nguyen and Linh Ngo Van and Trung Le and Thanh Hong Nguyen , title =. Fortieth
[80]

Fortieth

Hoang Tran Vuong and Tue Le and Quyen Tran and Linh Ngo Van and Trung Le , title =. Fortieth

Showing first 80 references.