pith. machine review for the scientific record. sign in

arxiv: 2605.01205 · v1 · submitted 2026-05-02 · 💻 cs.CL

Recognition: unknown

SRA: Span Representation Alignment for Large Language Model Distillation

Authors on Pith no claims yet

Pith reviewed 2026-05-09 15:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords knowledge distillationlarge language modelscross-tokenizer distillationspan representationcenter of massrepresentation alignmentmodel compression
0
0 comments X

The pith

SRA shifts LLM distillation alignment from tokens to attention-weighted span centers of mass for better cross-tokenizer transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SRA as a framework for knowledge distillation between large language models and smaller students that use mismatched tokenizers. It claims that token-level alignment is brittle, so the key is to aggregate tokens first into spans and align the spans instead. Each span is treated as a cluster of particles whose state is captured by its center of mass, an attention-weighted average of the tokens inside it. A geometric regularizer keeps the representation space intact and aligned span logits carry the distilled knowledge. Experiments across different model architectures show consistent gains over prior token-based methods.

Core claim

SRA reframes cross-tokenizer knowledge distillation by moving the alignment target from individual tokens to robust spans, each represented by its attention-weighted center of mass under a multi-particle dynamical systems model, and demonstrates that this produces representations that are more stable across tokenizers and yield stronger distillation performance than token-level baselines.

What carries the argument

The span center of mass, defined as the attention-weighted average of token representations within a span and treated as the state of a particle cluster in a multi-particle dynamical system.

If this is right

  • Distillation performance becomes less dependent on the exact token boundaries chosen by each model's tokenizer.
  • Attention weighting focuses alignment on the most salient spans, preserving semantic content that would be diluted at the token level.
  • The geometric regularizer maintains structural consistency in the shared representation space during transfer.
  • Adding aligned span logit distillation supplies an extra channel for knowledge transfer beyond representation matching alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same span-center approach could be tested on other cross-model tasks such as retrieval or translation where tokenizers also differ.
  • If the particle-cluster framing is useful, it might suggest treating attention heads themselves as dynamical systems whose equilibria can be aligned directly.
  • The method may scale to distillation involving multimodal models where spans could be defined over image patches or audio segments as well.

Load-bearing premise

Modeling spans as particle clusters and using their attention-weighted centers of mass produces representations that remain robust to tokenizer mismatch and carry more useful information for distillation than token-level aggregation.

What would settle it

Re-running the reported cross-architecture distillation experiments but replacing the attention-weighted span center of mass with either token-level alignment or non-attention-weighted span averages, and checking whether the performance gap over CTKD baselines disappears.

Figures

Figures reproduced from arXiv: 2605.01205 by Hoang Son Nguyen, Linh Ngo Van, Nguyen Thi Ngoc Diep, Pham Khanh Chi, Quoc Phong Dao, Trung Le, Tung Nguyen.

Figure 1
Figure 1. Figure 1: An illustration of the tokenizer mismatch view at source ↗
Figure 2
Figure 2. Figure 2: An illustration of the proposed SRA framework. Teacher–student spans are first matched using longest view at source ↗
Figure 3
Figure 3. Figure 3: Win rates (%) for distilling Qwen 2.5-7B→GPT2 1.5B, evaluated by GPT-4o-mini view at source ↗
Figure 4
Figure 4. Figure 4: Prompt for GPT-4 evaluation view at source ↗
read the original abstract

Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that the method of aggregating tokens into more robust representations before distillation is of equal importance. In this paper, we introduce \textbf{SRA} (\textbf{S}pan \textbf{R}epresentation \textbf{A}lignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM) - an attention-weighted average that captures rich semantic information. We leverage the concept of span centers of mass with attention-derived weighting to prioritize the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically-grounded approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SRA, a framework for cross-tokenizer knowledge distillation (CTKD) that reframes alignment through multi-particle dynamical systems. It shifts from token-level to span-level representations, where each span is modeled as a cluster of particles whose state is captured by an attention-weighted center of mass (CoM). The method adds a geometric regularizer to maintain structural properties of the representation space and aligned span-logit distillation for improved transfer. The central empirical claim is that SRA consistently and significantly outperforms state-of-the-art CTKD baselines in cross-architecture distillation experiments.

Significance. If the reported gains prove robust, SRA could offer a practical advance for distilling knowledge between LLMs with mismatched tokenizers and architectures by using higher-level, semantically richer alignment units. The physical-systems framing provides intuitive motivation for the CoM construction and regularizer, and the combination of components addresses a known brittleness in token-level CTKD. Reproducibility would be strengthened by the explicit empirical validation against baselines.

major comments (2)
  1. [§4 (Experiments)] §4 (Experiments): The abstract asserts consistent and significant outperformance over CTKD baselines, yet no quantitative metrics, error bars, dataset specifications, model pairs, or ablation results on span selection, CoM weighting, or the geometric regularizer are supplied. These details are load-bearing for evaluating whether the gains exceed what could be achieved by standard aggregation functions or post-hoc tuning.
  2. [§3.2 (Center of Mass formulation)] §3.2 (Center of Mass formulation): The CoM is defined as an attention-weighted average of tokens within a span, but the precise normalization of attention weights, handling of cross-tokenizer span boundaries, and any free parameters in the weighting scheme are not specified. Without this, it is unclear whether the method is truly tokenizer-agnostic or reduces to a fitted aggregation that could be replicated without the multi-particle framing.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'challenging cross-architecture distillation experiments' should name the specific teacher-student architecture pairs and datasets to allow immediate assessment of the claim's scope.
  2. [Notation] Notation: Ensure consistent use of symbols for spans, CoM, and the geometric regularizer across sections; a table summarizing all hyperparameters would aid clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. Their comments highlight important areas for clarification and additional empirical support. We address each major comment point by point below, indicating the revisions we will incorporate in the updated version.

read point-by-point responses
  1. Referee: [§4 (Experiments)] §4 (Experiments): The abstract asserts consistent and significant outperformance over CTKD baselines, yet no quantitative metrics, error bars, dataset specifications, model pairs, or ablation results on span selection, CoM weighting, or the geometric regularizer are supplied. These details are load-bearing for evaluating whether the gains exceed what could be achieved by standard aggregation functions or post-hoc tuning.

    Authors: We appreciate the referee's emphasis on empirical rigor. While Section 4 reports performance numbers on cross-architecture pairs (e.g., Llama-2 to Mistral and similar), we acknowledge that error bars, explicit dataset/model tables, and component ablations were not sufficiently detailed. In the revision we will add: (i) mean and standard deviation over three random seeds for all main results, (ii) a summary table listing exact datasets, model sizes, and tokenizer vocabularies, and (iii) ablation tables isolating span selection heuristics, attention-based CoM weighting, and the geometric regularizer. These additions will directly address whether the observed gains exceed those obtainable from simpler aggregation baselines or post-hoc tuning. revision: yes

  2. Referee: [§3.2 (Center of Mass formulation)] §3.2 (Center of Mass formulation): The CoM is defined as an attention-weighted average of tokens within a span, but the precise normalization of attention weights, handling of cross-tokenizer span boundaries, and any free parameters in the weighting scheme are not specified. Without this, it is unclear whether the method is truly tokenizer-agnostic or reduces to a fitted aggregation that could be replicated without the multi-particle framing.

    Authors: We agree that the current description in §3.2 lacks sufficient mathematical detail. The attention weights are normalized with a softmax taken exclusively over the tokens belonging to each span (ensuring they sum to one). Span boundaries are aligned across tokenizers by first recovering word-level segments from the original text via a deterministic detokenization step, then projecting those segments onto each model's subword sequence; this mapping uses no learned parameters. The weighting itself is taken directly from the teacher's attention heads with no additional hyperparameters. We will revise §3.2 to include the explicit normalized CoM equation, the word-level alignment procedure, and pseudocode, thereby clarifying that the construction is tokenizer-agnostic and motivated by the multi-particle analogy rather than being an arbitrary fitted aggregator. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent experimental validation

full rationale

The paper defines SRA via an explicit modeling choice (attention-weighted span CoM under a multi-particle analogy) and reports empirical gains on cross-architecture distillation benchmarks. No equations, uniqueness theorems, or self-citations are shown that reduce the reported performance to a fitted parameter or to the input data by construction. The physical framing functions as interpretive motivation for the aggregation unit; success is measured by downstream distillation metrics rather than by any internal identity or self-referential prediction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that attention-weighted span centers of mass capture semantic information more robustly than token-level or other aggregation methods, and that the multi-particle dynamical systems framing supplies useful inductive bias. No explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5551 in / 1217 out tokens · 29519 ms · 2026-05-09T15:18:31.469907+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

83 extracted references · 16 canonical work pages · 3 internal anchors

  1. [1]

    2015 , eprint=

    Distilling the Knowledge in a Neural Network , author=. 2015 , eprint=

  2. [2]

    2020 , eprint=

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , author=. 2020 , eprint=

  3. [3]

    Tinybert: Distilling bert for natural language understanding.arXiv preprint arXiv:1909.10351,

    Xiaoqi Jiao and Yichun Yin and Lifeng Shang and Xin Jiang and Xiao Chen and Linlin Li and Fang Wang and Qun Liu , title =. CoRR , volume =. 2019 , url =. 1909.10351 , timestamp =

  4. [4]

    2023 , eprint=

    MTEB: Massive Text Embedding Benchmark , author=. 2023 , eprint=

  5. [5]

    2025 , eprint=

    NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models , author=. 2025 , eprint=

  6. [6]

    2024 , eprint=

    C-Pack: Packed Resources For General Chinese Embeddings , author=. 2024 , eprint=

  7. [7]

    2025 , eprint=

    Jasper and Stella: distillation of SOTA embedding models , author=. 2025 , eprint=

  8. [8]

    2019 , eprint=

    Patient Knowledge Distillation for BERT Model Compression , author=. 2019 , eprint=

  9. [9]

    2024 , eprint=

    MiniLLM: Knowledge Distillation of Large Language Models , author=. 2024 , eprint=

  10. [10]

    2025 , eprint=

    Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs , author=. 2025 , eprint=

  11. [11]

    2024 , eprint=

    Knowledge Fusion of Large Language Models , author=. 2024 , eprint=

  12. [12]

    2024 , eprint=

    Dual-Space Knowledge Distillation for Large Language Models , author=. 2024 , eprint=

  13. [13]

    2020 , eprint=

    MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers , author=. 2020 , eprint=

  14. [14]

    2025 , eprint=

    A Survey of Large Language Models , author=. 2025 , eprint=

  15. [15]

    2020 , eprint=

    MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices , author=. 2020 , eprint=

  16. [16]

    2023 , eprint=

    DistilCSE: Effective Knowledge Distillation For Contrastive Sentence Embeddings , author=. 2023 , eprint=

  17. [17]

    2016 , eprint=

    Sequence-Level Knowledge Distillation , author=. 2016 , eprint=

  18. [18]

    2023 , eprint=

    Specializing Smaller Language Models towards Multi-Step Reasoning , author=. 2023 , eprint=

  19. [19]

    2023 , eprint=

    Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes , author=. 2023 , eprint=

  20. [20]

    2019 , eprint=

    Similarity of Neural Network Representations Revisited , author=. 2019 , eprint=

  21. [21]

    2024 , eprint=

    KDMCSE: Knowledge Distillation Multimodal Sentence Embeddings with Adaptive Angular margin Contrastive Learning , author=. 2024 , eprint=

  22. [22]

    2025 , eprint=

    CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers , author=. 2025 , eprint=

  23. [23]

    and Szedmak, Sandor and Shawe-Taylor, John , journal=

    Hardoon, David R. and Szedmak, Sandor and Shawe-Taylor, John , journal=. Canonical Correlation Analysis: An Overview with Application to Learning Methods , year=

  24. [24]

    2023 , eprint=

    Feature Structure Distillation with Centered Kernel Alignment in BERT Transferring , author=. 2023 , eprint=

  25. [25]

    2025 , eprint=

    Rho-1: Not All Tokens Are What You Need , author=. 2025 , eprint=

  26. [26]

    Context-aware Event Forecasting via Graph Disentanglement , booktitle =

    Li, Junyan and Zhang, Li Lyna and Xu, Jiahang and Wang, Yujing and Yan, Shaoguang and Xia, Yunqing and Yang, Yuqing and Cao, Ting and Sun, Hao and Deng, Weiwei and Zhang, Qi and Yang, Mao , year=. Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference , url=. doi:10.1145/3580305.3599284 , booktitle=

  27. [27]

    2021 , eprint=

    Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth , author=. 2021 , eprint=

  28. [28]

    33rd British Machine Vision Conference 2022,

    Aninda Saha and Alina N Bialkowski and Sara Khalifa , title =. 33rd British Machine Vision Conference 2022,. 2022 , url =

  29. [29]

    The Thirteenth International Conference on Learning Representations , year=

    Improving Language Model Distillation through Hidden State Matching , author=. The Thirteenth International Conference on Learning Representations , year=

  30. [30]

    BIGPATENT : A Large-Scale Dataset for Abstractive and Coherent Summarization

    Sharma, Eva and Li, Chen and Wang, Lu. BIGPATENT : A Large-Scale Dataset for Abstractive and Coherent Summarization. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1212

  31. [31]

    Tushar Khot and Ashish Sabharwal and Peter Clark , Booktitle =

  32. [32]

    Proceedings of the AAAI Conference on Artificial Intelligence , author=

    Natural Language Inference in Context - Investigating Contextual Reasoning over Long Texts , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2021 , month=. doi:10.1609/aaai.v35i15.17580 , number=

  33. [33]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

    Adversarial NLI: A New Benchmark for Natural Language Understanding , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

  34. [34]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Multi-level optimal transport for universal cross-tokenizer knowledge distillation on language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  35. [35]

    2024 , url=

    Parishad BehnamGhader and Vaibhav Adlakha and Marius Mosbach and Dzmitry Bahdanau and Nicolas Chapados and Siva Reddy , booktitle=. 2024 , url=

  36. [36]

    S em E val-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment

    Marelli, Marco and Bentivogli, Luisa and Baroni, Marco and Bernardi, Raffaella and Menini, Stefano and Zamparelli, Roberto. S em E val-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment. Proceedings of the 8th International Workshop on Semantic Evaluation ( S em E v...

  37. [37]

    2024 , eprint=

    BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation , author=. 2024 , eprint=

  38. [38]

    URLhttps://doi.org/10.3389/neuro.06.004.2008

    Kriegeskorte, Nikolaus and Mur, Marieke and Bandettini, Peter A. , TITLE=. Frontiers in Systems Neuroscience , VOLUME=. 2008 , URL=. doi:10.3389/neuro.06.004.2008 , ISSN=

  39. [39]

    2023 , url=

    Less is More: Task-aware Layer-wise Distillation for Language Model Compression , author=. 2023 , url=

  40. [40]

    arXiv preprint arXiv:2508.12519 , year=

    An Introduction to Sliced Optimal Transport , author=. arXiv preprint arXiv:2508.12519 , year=

  41. [41]

    Improving Vietnamese-English Cross-Lingual Retrieval for Legal and General Domains , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers) , pages=

  42. [42]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Mitigating Non-Representative Prototypes and Representation Bias in Few-Shot Continual Relation Extraction , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  43. [43]

    Enhancing Discriminative Representation in Similar Relation Clusters for Few-Shot Continual Relation Extraction , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  44. [44]

    Mutual-pairing Data Augmentation for Fewshot Continual Relation Extraction , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  45. [45]

    Findings of the Association for Computational Linguistics: ACL 2025 , pages=

    HiCOT: Improving Neural Topic Models via Optimal Transport and Contrastive Learning , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

  46. [46]

    Findings of the Association for Computational Linguistics: ACL 2025 , pages=

    Topic Modeling for Short Texts via Optimal Transport-Based Clustering , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

  47. [47]

    Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

    Understanding and improving transformer from a multi-particle dynamic system point of view , author=. arXiv preprint arXiv:1906.02762 , year=

  48. [48]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Beyond Logits: Aligning Feature Dynamics for Effective Knowledge Distillation , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  49. [49]

    2009 , publisher=

    Systemic functional grammar: A first step into the theory , author=. 2009 , publisher=

  50. [50]

    2015 , publisher=

    Lexical-functional syntax , author=. 2015 , publisher=

  51. [51]

    2019 , eprint=

    What Does BERT Look At? An Analysis of BERT's Attention , author=. 2019 , eprint=

  52. [52]

    Lifting the Curse of Capacity Gap in Distilling Language Models

    Zhang, Chen and Yang, Yang and Liu, Jiahao and Wang, Jingang and Xian, Yunsen and Wang, Benyou and Song, Dawei. Lifting the Curse of Capacity Gap in Distilling Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

  53. [53]

    AD - KD : Attribution-Driven Knowledge Distillation for Language Model Compression

    Wu, Siyue and Chen, Hongzhan and Quan, Xiaojun and Wang, Qifan and Wang, Rui. AD - KD : Attribution-Driven Knowledge Distillation for Language Model Compression. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

  54. [54]

    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Multi-granularity structural knowledge distillation for language model compression , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  55. [55]

    2025 , eprint=

    Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping , author=. 2025 , eprint=

  56. [56]

    2023 , eprint =

    Adapter-based Selective Knowledge Distillation for Federated Multi-domain Meeting Summarization , author =. 2023 , eprint =. doi:10.48550/arXiv.2308.03275 , url =

  57. [57]

    2024 , eprint =

    VkD: Improving Knowledge Distillation using Orthogonal Projections , author =. 2024 , eprint =. doi:10.48550/arXiv.2403.06213 , url =

  58. [58]

    1970 , publisher=

    An introduction to celestial mechanics , author=. 1970 , publisher=

  59. [59]

    A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models

    Udagawa, Takuma and Trivedi, Aashka and Merler, Michele and Bhattacharjee, Bishwaranjan. A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2023

  60. [60]

    2024 , eprint =

    DeepSeek-V3 Technical Report , author =. 2024 , eprint =

  61. [61]

    2024 , eprint =

    GPT-4 Technical Report , author =. 2024 , eprint =

  62. [62]

    Advances in Neural Information Processing Systems (NeurIPS) , volume =

    Neural Ordinary Differential Equations , author =. Advances in Neural Information Processing Systems (NeurIPS) , volume =. 2018 , url =

  63. [63]

    2025 , eprint =

    Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching , author =. 2025 , eprint =. doi:10.48550/arXiv.2503.20083 , url =

  64. [64]

    Super- N atural I nstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

    Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah and Kordi, Yeganeh and Mirzaei, Amirreza and Arunkumar, Anjana and Ashok, Arjun and Dhanasekaran, Arut Selvan and Naik, Atharva and Stap, David and Pathak, Eshaan and Karamanolakis, Giannis and Lai, Haizhi Gary and Purohit, Ishan and Mondal, Ishani and Anderson, Jacob and Kuznia, Kirby and Dosh...

  65. [65]

    and Stoica, Ion and Xing, Eric P

    Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna:

  66. [66]

    D ialog S um: A Real-Life Scenario Dialogue Summarization Dataset

    Chen, Yulong and Liu, Yang and Chen, Liang and Zhang, Yue. D ialog S um: A Real-Life Scenario Dialogue Summarization Dataset. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021

  67. [67]

    Smith, Daniel Khashabi, and Hannaneh Hajishirzi

    Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A. and Khashabi, Daniel and Hajishirzi, Hannaneh. S elf- I nstruct: Aligning Language Models with S elf- G enerated Instructions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl...

  68. [68]

    OpenAI blog , volume=

    Language models are unsupervised multitask learners , author=. OpenAI blog , volume=. 2019 , url =

  69. [69]

    2024 , url =

    Zhang, Peiyuan and Zeng, Guangtao and Wang, Tianduo and Lu, Wei , journal=. 2024 , url =

  70. [70]

    Qwen Technical Report

    Qwen Technical Report , author =. arXiv preprint arXiv:2309.16609 , year =

  71. [71]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang and Stephen Roller and Naman Goyal and Mikel Artetxe and Moya Chen and Shuohui Chen and Christopher Dewan and Mona Diab and Xian Li and Xi Victoria Lin and Todor Mihaylov and Myle Ott and Sam Shleifer and Kurt Shuster and Daniel Simig and Punit Singh Koura and Anjali Sridhar and Tianlu Wang and Luke Zettlemoyer , year=. 2205.01068 , archivePrefix=

  72. [72]

    Identifying and Mitigating Vulnerabilities in

    Jiang, Fengqing and Xu, Zhangchen and Niu, Luyao and Wang, Boxin and Jia, Jinyuan and Li, Bo and Poovendran, Radha , journal =. Identifying and Mitigating Vulnerabilities in. 2023 , month = nov, doi =

  73. [73]

    Qwen2.5 Technical Report

    Qwen2.5 technical report , author=. arXiv preprint arXiv:2412.15115 , year=

  74. [74]

    Text summarization branches out , pages=

    Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=

  75. [75]

    2024 , eprint=

    Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models , author=. 2024 , eprint=

  76. [76]

    2024 , eprint=

    DistiLLM: Towards Streamlined Distillation for Large Language Models , author=. 2024 , eprint=

  77. [77]

    2023 , eprint=

    f-Divergence Minimization for Sequence-Level Knowledge Distillation , author=. 2023 , eprint=

  78. [78]

    Findings of the Association for Computational Linguistics: EACL , pages=

    DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation , author=. Findings of the Association for Computational Linguistics: EACL , pages=

  79. [79]

    Fortieth

    Truong Nguyen and Phi Van Dat and Ngan Nguyen and Linh Ngo Van and Trung Le and Thanh Hong Nguyen , title =. Fortieth

  80. [80]

    Fortieth

    Hoang Tran Vuong and Tue Le and Quyen Tran and Linh Ngo Van and Trung Le , title =. Fortieth

Showing first 80 references.