pith. sign in

arxiv: 2505.13893 · v2 · pith:TH475T5Tnew · submitted 2025-05-20 · 💻 cs.CL

InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

Pith reviewed 2026-05-25 08:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords model fusionlogit distillationGromov-Wasserstein distancegraph distillationlarge language modelsreasoning benchmarksmodel merging
0
0 comments X

The pith

Fusing LLMs by distilling co-activation graphs from top-k logits captures cross-token dependencies that independent logit averaging misses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard logit fusion methods average predictions across vocabulary dimensions without regard to how those dimensions interact during a model's reasoning process. InfiGFusion instead forms a global co-activation graph by taking outer products of the top-k logits at each output position and aggregates them across the sequence. It then aligns these graphs between source models with a sorting-based approximation to the Gromov-Wasserstein distance that drops the cost from O(n^4) to O(n log n). The resulting fused model inherits complementary strengths and records the largest reported gains on multi-step reasoning and causal judgment benchmarks. The approach maintains the same inference cost as the parent models.

Core claim

The paper presents Graph-on-Logits Distillation, which builds a co-activation graph whose nodes are vocabulary channels and whose edges measure joint activation strength, then transfers this structure between heterogeneous models via an efficient Gromov-Wasserstein alignment that preserves the relative geometry of the graphs.

What carries the argument

Graph-on-Logits Distillation loss, constructed by aggregating outer products of top-k logits into a global co-activation graph and aligned with a sorting-based O(n log n) approximation to Gromov-Wasserstein distance.

If this is right

  • The fused model records +35.6 on Multistep Arithmetic and +37.06 on Causal Judgement relative to supervised fine-tuning.
  • GLD improves both quality and stability of fusion across multiple settings and model pairs.
  • The closed-form approximation carries provable guarantees while remaining fast enough for practical use.
  • The method delivers gains on eleven benchmarks covering reasoning, coding, and mathematics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the co-activation graphs remain stable across prompt lengths, the same fused weights could serve multiple tasks without per-task re-alignment.
  • The graph view might extend naturally to fusing models that differ in tokenizer or vocabulary size by first mapping their logit spaces.
  • Replacing the fixed top-k selection with an entropy-dependent threshold could reduce noise when models are uncertain about the next token.

Load-bearing premise

That the global co-activation graph built from outer products of top-k logits encodes the semantic dependencies needed to align models with different generation behaviors.

What would settle it

Run the identical fusion procedure twice on the same pair of models, once with the full GLD loss and once after replacing every edge weight in the co-activation graph with a constant; if the two fused models achieve statistically indistinguishable scores on Multistep Arithmetic and Causal Judgement, the graph structure contributes nothing beyond ordinary logit averaging.

Figures

Figures reproduced from arXiv: 2505.13893 by Fei Wu, Hongxia Yang, Qi Zhou, Yanggan Gu, Yiming Zhang, Yuanyi Wang, Zhaoyi Yan.

Figure 1
Figure 1. Figure 1: Token-level vs. Structure-aware Fusion. Given pivot and source logits of shape [L, 3] (sequence length L, vocab size 3), token-level methods (left) align dimensions independently, ignoring token interactions. GLD (right) aggregates outer products into [3, 3] co-activation graphs, capturing semantic dependencies via structure-aware graph alignment. the output logits using token-level objectives such as KL d… view at source ↗
Figure 2
Figure 2. Figure 2: InfiGFusion framework. Given instruction-response pairs, source and pivot models produce logits, sparsified into feature-level graphs capturing semantic dependencies. We align graphs via an efficient Gromov-Wasserstein approximation (GLD), reducing complexity from O(n 4 ) to O(n log n). The overall objective combines structure-aware distillation (GLD) with token-level distillation (ULD) and supervised sign… view at source ↗
Figure 3
Figure 3. Figure 3: Top-k analysis. InfiGFusion sparsifies logits by retaining top-k token dimen￾sions before graph construction, selecting the most salient indices per sequence position. This inductive bias suppresses noisy activations and emphasizes meaningful token depen￾dencies, serving as the foundation for graph-based seman￾tic alignment. We evaluate Top-k ∈ {5, 10, 15, 20, 25, 30} and report the results in [PITH_FULL_… view at source ↗
Figure 4
Figure 4. Figure 4: Case study. Case 1: Frank T. Shooting Incident. While both models predict “No,” InfiGFusion performs a deeper step-by-step causality analysis. It explicitly identi￾fies the causal chain’s disruption—distinguishing between “intent,” “misfire,” and “accidental re￾sult”—showcasing robust multi-step causality dis￾ambiguation capabilities. Case 2: Wallace’s Dual Cause of Death. Unlike Phi4’s surface-level judgm… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of WD and GW distributions during fusion. Left: before distillation; Middle: [PITH_FULL_IMAGE:figures/full_fig_p028_5.png] view at source ↗
read the original abstract

Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose \textbf{InfiGFusion}, the first structure-aware fusion framework with a novel \textit{Graph-on-Logits Distillation} (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces InfiGFusion, a structure-aware LLM fusion method that constructs global co-activation graphs from outer products of top-k logits across sequence positions and aligns models via a Graph-on-Logits Distillation (GLD) loss based on Gromov-Wasserstein distance. It proposes a sorting-based closed-form approximation claimed to reduce GW computation from O(n^4) to O(n log n) with provable guarantees, and reports consistent improvements over SOTA models and fusion baselines on 11 benchmarks in reasoning, coding, and mathematics, including large gains (+35.6 on Multistep Arithmetic, +37.06 on Causal Judgement) relative to SFT.

Significance. If the empirical gains and approximation guarantees hold under scrutiny, the work would offer a scalable way to incorporate cross-dimension logit dependencies into fusion, addressing a gap in prior logit-based methods that treat dimensions independently. The O(n log n) reduction and explicit graph construction are potentially useful strengths for practical deployment of fused models on complex tasks.

major comments (3)
  1. [§4.2, Eq. (8)] §4.2 and Eq. (8): the claim that the sorting-based approximation preserves the essential structure of the Gromov-Wasserstein distance for co-activation graphs is not accompanied by a quantitative bound on the approximation error in terms of the fusion objective; without this, it is unclear whether the reported gains on reasoning benchmarks can be attributed to the GLD loss rather than the approximation artifact.
  2. [Table 2] Table 2, Multistep Arithmetic and Causal Judgement rows: the +35.6 and +37.06 absolute improvements over SFT are presented without standard deviations across runs or statistical significance tests; given that these are the largest reported deltas and central to the claim of superior multi-step reasoning, the lack of variance reporting undermines the strength of the outperformance conclusion.
  3. [§3.1] §3.1: the construction of the global co-activation graph via aggregation of outer products of top-k logits assumes that these pairwise activations encode the semantic dependencies needed for alignment, but no ablation is shown that isolates the contribution of the graph structure versus simply using the top-k logits without the GW term.
minor comments (3)
  1. [§3.1] The notation for the co-activation matrix G in §3.1 is introduced without an explicit definition of how sequence-position aggregation is normalized, making it difficult to reproduce the graph construction.
  2. [Figure 3] Figure 3 caption does not specify the value of k used for the top-k logits or whether results are sensitive to this hyperparameter.
  3. [Related Work] The related-work section omits recent logit-fusion papers that also operate on vocabulary distributions (e.g., those using optimal transport directly on logits), which would help situate the novelty of the graph-based extension.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the approximation analysis, statistical reporting, and ablations.

read point-by-point responses
  1. Referee: [§4.2, Eq. (8)] §4.2 and Eq. (8): the claim that the sorting-based approximation preserves the essential structure of the Gromov-Wasserstein distance for co-activation graphs is not accompanied by a quantitative bound on the approximation error in terms of the fusion objective; without this, it is unclear whether the reported gains on reasoning benchmarks can be attributed to the GLD loss rather than the approximation artifact.

    Authors: We acknowledge that the provable guarantees provided for the sorting-based approximation are stated in terms of the GW distance itself rather than a direct quantitative bound on its effect within the fusion objective. In the revision we will add a discussion of error propagation from the approximation into the GLD loss and its potential impact on downstream performance. revision: yes

  2. Referee: [Table 2] Table 2, Multistep Arithmetic and Causal Judgement rows: the +35.6 and +37.06 absolute improvements over SFT are presented without standard deviations across runs or statistical significance tests; given that these are the largest reported deltas and central to the claim of superior multi-step reasoning, the lack of variance reporting undermines the strength of the outperformance conclusion.

    Authors: We agree that variance reporting and significance testing are needed for these key results. The revised manuscript will include standard deviations computed over multiple runs and paired statistical significance tests for the Multistep Arithmetic and Causal Judgement entries. revision: yes

  3. Referee: [§3.1] §3.1: the construction of the global co-activation graph via aggregation of outer products of top-k logits assumes that these pairwise activations encode the semantic dependencies needed for alignment, but no ablation is shown that isolates the contribution of the graph structure versus simply using the top-k logits without the GW term.

    Authors: We will add an ablation study that directly compares the full GLD loss against a variant that aggregates top-k logits without the Gromov-Wasserstein term. This will isolate the contribution of the graph structure to the observed gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent components

full rationale

The abstract and method description define a novel GLD loss via top-k logit outer products forming co-activation graphs, followed by a sorting-based O(n log n) GW approximation with stated guarantees. These steps are constructive definitions of new quantities rather than reductions of outputs to fitted inputs or self-citations. No load-bearing self-citation chains, uniqueness theorems, or renamings of known results appear; performance claims rest on external benchmarks. The derivation chain remains self-contained against the described inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only review; full parameter list, proof details, and experimental setup unavailable. The central modeling choice (graph construction from logits) and the approximation step rest on unstated assumptions about what the graph represents and how faithful the fast GW surrogate remains.

free parameters (1)
  • top-k
    Number of retained logits per position; choice directly affects the constructed co-activation graph and is not derived from first principles.
axioms (1)
  • domain assumption Cross-dimension interactions in logits reflect semantic dependencies that are essential for aligning models with diverse generation behaviors
    Invoked in abstract paragraph 2 as the motivation for moving beyond independent-vocabulary treatment.

pith-pipeline@v0.9.0 · 5808 in / 1380 out tokens · 43132 ms · 2026-05-25T08:39:18.653601+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Discovering Physical Directions in Weight Space: Composing Neural PDE Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    Fine-tuning neural PDE operators to regime endpoints reveals a physical direction in weight space that CCM uses to compose accurate merged models for new or extrapolated regimes from metadata or short prefixes.

  2. FeatCal: Feature Calibration for Post-Merging Models

    cs.LG 2026-05 conditional novelty 7.0

    FeatCal reduces feature drift in merged models via layer-wise closed-form calibration on a small dataset, outperforming prior post-merging methods on CLIP and GLUE benchmarks with high sample efficiency.

  3. E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

    cs.CL 2026-05 unverdicted novelty 6.0

    E-PMQ improves 4-bit quantization accuracy on merged models by 8-42 points across CLIP and GLUE tasks through expert-guided calibration and merged-weight anchoring.

  4. Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 4 Pith papers · 11 internal anchors

  1. [1]

    Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539– 68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539– 68551, 2023

  2. [2]

    Auto-gpt for online decision making: Benchmarks and additional opinions.arXiv preprint arXiv:2306.02224, 2023

    Hui Yang, Sifu Yue, and Yunzhong He. Auto-gpt for online decision making: Benchmarks and additional opinions.arXiv preprint arXiv:2306.02224, 2023

  3. [3]

    Mixture-of-Agents Enhances Large Language Model Capabilities

    Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities.arXiv preprint arXiv:2406.04692, 2024

  4. [4]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024

  5. [5]

    Llm-blender: Ensembling large language models with pairwise ranking and generative fusion

    Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, 2023

  6. [6]

    Llm-ensemble: Optimal large language model ensemble method for e-commerce product attribute value extraction

    Chenhao Fang, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Kaushiki Nag, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Llm-ensemble: Optimal large language model ensemble method for e-commerce product attribute value extraction. InProceedings of the 47th Interna- tional ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2910–2914, 2024

  7. [7]

    Sparse upcycling: Training mixture-of-experts from dense checkpoints

    Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. InThe Eleventh International Conference on Learning Representations

  8. [8]

    Glam: Efficient scaling of language models with mixture-of-experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. InInternational conference on machine learning, pages 5547–5569. PMLR, 2022

  9. [9]

    Towards understanding the mixture-of-experts layer in deep learning.Advances in neural information processing systems, 35:23049–23062, 2022

    Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning.Advances in neural information processing systems, 35:23049–23062, 2022

  10. [10]

    Ties-merging: Resolving interference when merging models.Advances in Neural Information Processing Systems, 36:7093–7115, 2023

    Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models.Advances in Neural Information Processing Systems, 36:7093–7115, 2023

  11. [11]

    Merging models with fisher-weighted averaging

    Michael S Matena and Colin A Raffel. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022

  12. [12]

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

    Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. InInternational conference on machine learning, pages 23965–23998. P...

  13. [13]

    Knowl- edge fusion of large language models

    Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowl- edge fusion of large language models. InThe Twelfth International Conference on Learning Representations

  14. [14]

    Profuser: Progressive fusion of large language models.arXiv preprint arXiv:2408.04998, 2024

    Tianyuan Shi, Fanqi Wan, Canbin Huang, Xiaojun Quan, Chenliang Li, Ming Yan, and Ji Zhang. Profuser: Progressive fusion of large language models.arXiv preprint arXiv:2408.04998, 2024. 11

  15. [15]

    Infifusion: A unified framework for enhanced cross-model reasoning via llm fusion.arXiv preprint arXiv:2501.02795, 2025

    Zhaoyi Yan, Yiming Zhang, Baoyi He, Yuhao Fu, Qi Zhou, Zhijie Sang, Chunlin Ji, Shengyu Zhang, Fei Wu, and Hongxia Yang. Infifusion: A unified framework for enhanced cross-model reasoning via llm fusion.arXiv preprint arXiv:2501.02795, 2025

  16. [16]

    A setwise approach for effective and highly efficient zero-shot ranking with large language models

    Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, and Guido Zuccon. A setwise approach for effective and highly efficient zero-shot ranking with large language models. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 38–47, 2024

  17. [17]

    Discovering preference optimization algorithms with and for large language models.Advances in Neural Information Processing Systems, 37:86528–86573, 2024

    Chris Lu, Samuel Holt, Claudio Fanconi, Alex Chan, Jakob Foerster, Mihaela van der Schaar, and Robert Lange. Discovering preference optimization algorithms with and for large language models.Advances in Neural Information Processing Systems, 37:86528–86573, 2024

  18. [18]

    Scalable gromov-wasserstein learning for graph partitioning and matching.Advances in neural information processing systems, 32, 2019

    Hongteng Xu, Dixin Luo, and Lawrence Carin. Scalable gromov-wasserstein learning for graph partitioning and matching.Advances in neural information processing systems, 32, 2019

  19. [19]

    Gromov-wasserstein averaging of kernel and distance matrices

    Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-wasserstein averaging of kernel and distance matrices. InInternational conference on machine learning, pages 2664–2672. PMLR, 2016

  20. [20]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

  21. [21]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

  22. [22]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023

  23. [23]

    Agentbench: Evaluating llms as agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. InICLR, 2024

  24. [24]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations

  25. [25]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  26. [26]

    Tinybert: Distilling bert for natural language understanding

    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, 2020

  27. [27]

    Gromov–wasserstein distances and the metric approach to object matching

    Facundo Mémoli. Gromov–wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11:417–487, 2011

  28. [28]

    Fused gromov-wasserstein distance for structured objects.Algorithms, 13(9):212, 2020

    Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard, and Nicolas Courty. Fused gromov-wasserstein distance for structured objects.Algorithms, 13(9):212, 2020

  29. [29]

    Learning graphons via struc- tured gromov-wasserstein barycenters

    Hongteng Xu, Dixin Luo, Lawrence Carin, and Hongyuan Zha. Learning graphons via struc- tured gromov-wasserstein barycenters. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10505–10513, 2021

  30. [30]

    Interdependency matters: Graph alignment for multivariate time series anomaly detection.arXiv preprint arXiv:2410.08877, 2024

    Yuanyi Wang, Haifeng Sun, Chengsen Wang, Mengde Zhu, Jingyu Wang, Wei Tang, Qi Qi, Zirui Zhuang, and Jianxin Liao. Interdependency matters: Graph alignment for multivariate time series anomaly detection.arXiv preprint arXiv:2410.08877, 2024

  31. [31]

    Gradient flow of energy: A general and efficient approach for entity alignment decoding.arXiv preprint arXiv:2401.12798, 2024

    Yuanyi Wang, Haifeng Sun, Jingyu Wang, Qi Qi, Shaoling Sun, and Jianxin Liao. Gradient flow of energy: A general and efficient approach for entity alignment decoding.arXiv preprint arXiv:2401.12798, 2024. 12

  32. [32]

    Towards semantic consistency: Dirichlet energy driven robust multi-modal entity alignment

    Yuanyi Wang, Haifeng Sun, Jiabo Wang, Jingyu Wang, Wei Tang, Qi Qi, Shaoling Sun, and Jianxin Liao. Towards semantic consistency: Dirichlet energy driven robust multi-modal entity alignment. In2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 3559–3572. IEEE, 2024

  33. [33]

    Gromov-wasserstein factorization models for graph clustering

    Hongtengl Xu. Gromov-wasserstein factorization models for graph clustering. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 6478–6485, 2020

  34. [34]

    Towards cross- tokenizer distillation: the universal logit distillation loss for llms.Transactions on Machine Learning Research

    Nicolas Boizard, Kevin El Haddad, CELINE HUDELOT, and Pierre Colombo. Towards cross- tokenizer distillation: the universal logit distillation loss for llms.Transactions on Machine Learning Research

  35. [35]

    Model outputs documentation

    HuggingFace. Model outputs documentation. https://huggingface.co/docs/ transformers/en/main_classes/output, 2025

  36. [36]

    Superfiltering: Weak-to-strong data filtering for fast instruction-tuning

    Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, and Tianyi Zhou. Superfiltering: Weak-to-strong data filtering for fast instruction-tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14255–14273, 2024

  37. [37]

    Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13:9, 2024

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13:9, 2024

  38. [38]

    Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding

    Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding. 2025

  39. [39]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024

  40. [40]

    Mistral-small-24b-instruct-2501

    Mistral AI. Mistral-small-24b-instruct-2501. https://huggingface.co/mistralai/ Mistral-Small-24B-Instruct-2501, 2025

  41. [41]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

  42. [42]

    Online knowledge distillation via collaborative learning

    Qiushan Guo, Xinjiang Wang, Yichao Wu, Zhipeng Yu, Ding Liang, Xiaolin Hu, and Ping Luo. Online knowledge distillation via collaborative learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11020–11029, 2020

  43. [43]

    Fusechat: Knowl- edge fusion of chat models.arXiv preprint arXiv:2408.07990, 2024

    Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, and Xiaojun Quan. Fusechat: Knowl- edge fusion of chat models.arXiv preprint arXiv:2408.07990, 2024

  44. [44]

    Challenging big-bench tasks and whether chain-of-thought can solve them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. InACL (Findings), 2023

  45. [45]

    Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering

    Vikas Yadav, Steven Bethard, and Mihai Surdeanu. Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2578–2589, 2019

  46. [46]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations

  47. [47]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 13

  48. [48]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  49. [49]

    Theoremqa: A theorem-driven question answering dataset

    Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023

  50. [50]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  51. [51]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  52. [52]

    Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs

    Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gard- ner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Paper...

  53. [53]

    Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019

  54. [54]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023

  55. [55]

    Softsort: A continuous relaxation for the argsort operator

    Sebastian Prillo and Julian Eisenschlos. Softsort: A continuous relaxation for the argsort operator. InInternational Conference on Machine Learning, pages 7793–7802. PMLR, 2020

  56. [56]

    Fast differentiable sorting and ranking

    Mathieu Blondel, Olivier Teboul, Quentin Berthet, and Josip Djolonga. Fast differentiable sorting and ranking. InInternational Conference on Machine Learning, pages 950–959. PMLR, 2020

  57. [57]

    Stability and generalization.Journal of machine learning research, 2(Mar):499–526, 2002

    Olivier Bousquet and André Elisseeff. Stability and generalization.Journal of machine learning research, 2(Mar):499–526, 2002

  58. [58]

    Knowledge distillation performs partial variance reduction.Advances in Neural Information Processing Systems, 36:75229–75258, 2023

    Mher Safaryan, Alexandra Peste, and Dan Alistarh. Knowledge distillation performs partial variance reduction.Advances in Neural Information Processing Systems, 36:75229–75258, 2023

  59. [59]

    Opencompass: A universal evaluation platform for foundation models.https://github.com/open-compass/opencompass, 2023

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models.https://github.com/open-compass/opencompass, 2023

  60. [60]

    Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  61. [61]

    Minillm: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. InThe Twelfth International Conference on Learning Representations

  62. [62]

    Dual-space knowledge distillation for large language models

    Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, and Jinan Xu. Dual-space knowledge distillation for large language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18164–18181, 2024

  63. [63]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions.arXiv preprint arXiv:2212.10560, 2022. 14

  64. [64]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna

    Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023

  65. [65]

    Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks.arXiv preprint arXiv:2204.07705, 2022

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks.arXiv preprint arXiv:2204.07705, 2022

  66. [66]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  67. [67]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068, 2022

  68. [68]

    importance

    Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111,...

  69. [69]

    Under the assumption that the true features have a positive minimal gap, the sorting operator is stable, or, in practice, one can use a soft sort with known Lipschitz properties

    Sorting Stability Error.Denote by ϵsort the error incurred by the potential change in the sorted order when the features are perturbed. Under the assumption that the true features have a positive minimal gap, the sorting operator is stable, or, in practice, one can use a soft sort with known Lipschitz properties. Hence, only small errors are induced in th...

  70. [70]

    Answer: \$ANSWER

    If an algorithm has γ-uniform stability, then its generalization error can be controlled in O(γ)order of magnitude. 2.γ is approximately of the order of L/n, where L is a Lipschitz constant of the loss function andnis the number of samples. C.2 Lipschitz constant of GW loss Lemma 1(GW Lipschitz constant).LetL GW(T,S) =λGW 2(T,S). If∥S∥ 2 ≤R, then ∇SLGW 2 ...

  71. [71]

    We follow [62] that evaluate on four benchmarks: SelfInst [63], VicunaEval [64], Super Natural Instructions (S-NI) [65], and the Dolly [61]

    for training. We follow [62] that evaluate on four benchmarks: SelfInst [63], VicunaEval [64], Super Natural Instructions (S-NI) [65], and the Dolly [61]. Models:For distillation, we distill from LLaMA3-8B, Mistral-7B, and Qwen2.5-7B into student models, including GPT2-120M [66], OPT-350M [67], and Bloomz-560M [68]. Training settings:Distillation uses LoR...