InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

Fei Wu; Hongxia Yang; Qi Zhou; Yanggan Gu; Yiming Zhang; Yuanyi Wang; Zhaoyi Yan

arxiv: 2505.13893 · v2 · pith:TH475T5Tnew · submitted 2025-05-20 · 💻 cs.CL

Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, Hongxia Yang

Pith reviewed 2026-05-25 08:39 UTC · model grok-4.3

classification 💻 cs.CL

keywords model fusion · logit distillation · Gromov-Wasserstein distance · graph distillation · large language models · reasoning benchmarks · model merging

The pith

Fusing LLMs by distilling co-activation graphs from top-k logits captures cross-token dependencies that independent logit averaging misses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard logit fusion methods average predictions across vocabulary dimensions without regard to how those dimensions interact during a model's reasoning process. InfiGFusion instead builds a global co-activation graph: it takes the outer product of the top-k logits at each output position and aggregates these products across the sequence. It then aligns the resulting graphs across models with a sorting-based approximation to the Gromov-Wasserstein distance that cuts the cost from O(n^4) to O(n log n). The fused model inherits complementary strengths, and its largest reported gains are on multi-step reasoning and causal judgment benchmarks. The approach maintains the same inference cost as the parent models.

Core claim

The paper presents Graph-on-Logits Distillation, which builds a co-activation graph whose nodes are vocabulary channels and whose edges measure joint activation strength, then transfers this structure between heterogeneous models via an efficient Gromov-Wasserstein alignment that preserves the relative geometry of the graphs.
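The graph construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it follows the adjacency formula quoted later on this page, C(i,j) = Σ_t z_t(i)·z_t(j), and assumes that top-k selection is done on raw logit values and that dropped entries contribute nothing. The function names `top_k_sparsify` and `co_activation_graph` are hypothetical.

```python
from collections import defaultdict


def top_k_sparsify(logits, k):
    """Keep the k largest entries of one logit vector as {index: value}.

    Assumption: selection is by raw logit value; the paper may select on
    probabilities or apply a normalization that this page does not show.
    """
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return {i: logits[i] for i in keep}


def co_activation_graph(seq_logits, k):
    """Sum outer products of the sparsified logits over sequence positions.

    Returns a sparse symmetric adjacency {(i, j): weight}, where i and j
    are vocabulary indices (the graph's nodes).
    """
    graph = defaultdict(float)
    for logits in seq_logits:
        kept = top_k_sparsify(logits, k)
        for i, zi in kept.items():
            for j, zj in kept.items():
                graph[(i, j)] += zi * zj
    return dict(graph)


# Toy input: 2 positions, vocabulary of 4 (values are illustrative only).
seq = [
    [2.0, 1.0, 0.1, -1.0],
    [0.5, 3.0, 2.0, 0.0],
]
G = co_activation_graph(seq, k=2)
```

Only vocabulary indices that enter some position's top-k become nodes, so the graph is much smaller than the full vocabulary-by-vocabulary matrix when k is small.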

What carries the argument

Graph-on-Logits Distillation loss, constructed by aggregating outer products of top-k logits into a global co-activation graph and aligned with a sorting-based O(n log n) approximation to Gromov-Wasserstein distance.
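The paper's closed-form approximation is not reproduced on this page, so the following is only a stand-in that shows why sorting can replace an explicit node matching. It assumes each graph is summarized by a one-dimensional node feature (here the row sum of the adjacency, an assumption of this sketch), sorts those features, and compares them at matched quantiles, as in a one-dimensional Wasserstein distance. The names `node_strengths` and `sorted_surrogate` are hypothetical.

```python
def node_strengths(adj, n):
    """Row sums of a sparse adjacency {(i, j): w} over n nodes."""
    strengths = [0.0] * n
    for (i, _j), w in adj.items():
        strengths[i] += w
    return strengths


def sorted_surrogate(adj_a, n_a, adj_b, n_b, grid=64):
    """Mean squared gap between sorted node strengths at matched quantiles.

    Assumption: this is NOT the paper's formula. It only illustrates that
    sorting (O(n log n)) yields a relabeling-invariant comparison without
    searching over node correspondences.
    """
    a = sorted(node_strengths(adj_a, n_a))
    b = sorted(node_strengths(adj_b, n_b))
    total = 0.0
    for q in range(grid):
        u = (q + 0.5) / grid  # quantile level strictly inside (0, 1)
        total += (a[int(u * n_a)] - b[int(u * n_b)]) ** 2
    return total / grid


graph = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 2.0, (2, 1): 2.0}
# Same graph with node labels 0 and 2 swapped.
relabeled = {(2, 1): 1.0, (1, 2): 1.0, (1, 0): 2.0, (0, 1): 2.0}
other = {(0, 1): 5.0, (1, 0): 5.0}

same = sorted_surrogate(graph, 3, relabeled, 3)
diff = sorted_surrogate(graph, 3, other, 3)
```

The relabeled graph scores zero against the original, which mirrors the invariance to node permutation that makes Gromov-Wasserstein attractive for comparing models whose vocabulary channels are not aligned one to one.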

If this is right

The fused model records +35.6 on Multistep Arithmetic and +37.06 on Causal Judgement relative to supervised fine-tuning.
GLD improves both quality and stability of fusion across multiple settings and model pairs.
The closed-form approximation carries provable guarantees while remaining fast enough for practical use.
The method delivers gains on eleven benchmarks covering reasoning, coding, and mathematics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the co-activation graphs remain stable across prompt lengths, the same fused weights could serve multiple tasks without per-task re-alignment.
The graph view might extend naturally to fusing models that differ in tokenizer or vocabulary size by first mapping their logit spaces.
Replacing the fixed top-k selection with an entropy-dependent threshold could reduce noise when models are uncertain about the next token.
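The last extension above can be made concrete with a small sketch. Nothing here comes from the paper: the mapping from entropy to k, its direction (fewer retained logits when the distribution is flatter), and the names `adaptive_k`, `k_min`, and `k_max` are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by its maximum, log(len(probs))."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def adaptive_k(logits, k_min=2, k_max=10):
    """Pick k per position from the predictive entropy.

    Assumption: a flat (high-entropy) distribution keeps fewer logits,
    on the view that its top entries are mostly noise. The opposite
    direction is equally easy to implement and may work better.
    """
    h = normalized_entropy(softmax(logits))
    k = k_max - round(h * (k_max - k_min))
    return min(max(k_min, k), k_max, len(logits))


k_uncertain = adaptive_k([0.0] * 20)          # uniform distribution
k_confident = adaptive_k([50.0] + [0.0] * 19)  # one dominant token
```

Whether such a rule helps is an empirical question; it would need the same top-k sensitivity analysis that Figure 3 reports for fixed k.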

Load-bearing premise

That the global co-activation graph built from outer products of top-k logits encodes the semantic dependencies needed to align models with different generation behaviors.

What would settle it

Run the identical fusion procedure twice on the same pair of models, once with the full GLD loss and once after replacing every edge weight in the co-activation graph with a constant; if the two fused models achieve statistically indistinguishable scores on Multistep Arithmetic and Causal Judgement, the graph structure contributes nothing beyond ordinary logit averaging.
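The control arm of this test needs a graph that keeps the same edges but discards their relative strengths. A minimal sketch follows, assuming the graph is stored as a sparse dictionary and that the constant defaults to the mean edge weight so the overall scale of the loss stays comparable; both choices, and the name `constant_edge_control`, are assumptions of this example.

```python
def constant_edge_control(adj, value=None):
    """Return a copy of adj with every edge weight set to one constant.

    adj is a sparse adjacency {(i, j): weight}. If value is None, the
    mean of the original weights is used (an assumption of this sketch),
    which preserves the total edge mass.
    """
    if not adj:
        return {}
    const = sum(adj.values()) / len(adj) if value is None else value
    return {edge: const for edge in adj}


graph = {(0, 0): 4.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 10.0}
control = constant_edge_control(graph)
```

Training once with `graph` and once with `control`, under otherwise identical settings and seeds, would isolate what the edge weights contribute.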

Figures

Figures reproduced from arXiv: 2505.13893 by Fei Wu, Hongxia Yang, Qi Zhou, Yanggan Gu, Yiming Zhang, Yuanyi Wang, Zhaoyi Yan.

**Figure 1.** Token-level vs. Structure-aware Fusion. Given pivot and source logits of shape [L, 3] (sequence length L, vocab size 3), token-level methods (left) align dimensions independently, ignoring token interactions. GLD (right) aggregates outer products into [3, 3] co-activation graphs, capturing semantic dependencies via structure-aware graph alignment.

**Figure 2.** InfiGFusion framework. Given instruction-response pairs, source and pivot models produce logits, sparsified into feature-level graphs capturing semantic dependencies. We align graphs via an efficient Gromov-Wasserstein approximation (GLD), reducing complexity from O(n^4) to O(n log n). The overall objective combines structure-aware distillation (GLD) with token-level distillation (ULD) and supervised sign…

**Figure 3.** Top-k analysis. InfiGFusion sparsifies logits by retaining top-k token dimensions before graph construction, selecting the most salient indices per sequence position. This inductive bias suppresses noisy activations and emphasizes meaningful token dependencies, serving as the foundation for graph-based semantic alignment. We evaluate Top-k ∈ {5, 10, 15, 20, 25, 30} and report the results in …

**Figure 4.** Case study. Case 1: Frank T. Shooting Incident. While both models predict “No,” InfiGFusion performs a deeper step-by-step causality analysis. It explicitly identifies the causal chain’s disruption (distinguishing between “intent,” “misfire,” and “accidental result”), showcasing robust multi-step causality disambiguation capabilities. Case 2: Wallace’s Dual Cause of Death. Unlike Phi4’s surface-level judgm…

**Figure 5.** Comparison of WD and GW distributions during fusion. Left: before distillation; Middle: …

Original abstract

Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose InfiGFusion, the first structure-aware fusion framework with a novel Graph-on-Logits Distillation (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The abstract outlines a graph-based logit fusion method with a sorting-based GW approximation, but without the full paper the big reported gains stay unverified.

The core idea is to build a co-activation graph from outer products of top-k logits across positions, then distill with a Gromov-Wasserstein loss that they approximate in O(n log n) time via sorting. That is the new piece: treating vocabulary dimensions as nodes with joint activation edges instead of fusing them independently. The claim of provable approximation guarantees is the part that would matter most if it holds up in the proofs and code.

The reported lifts on multistep arithmetic and causal judgment are large enough to notice, but they sit on top of an SFT baseline whose details are not visible here. What the work does cleanly is name the limitation in prior logit fusion (independent dimensions) and give a concrete graph construction to address it. The efficiency angle is practical for scaling to real vocab sizes.

The soft spots are the usual ones when only the abstract is in front of us: no derivation of the closed-form approximation, no table of exact baselines or run counts, and no check on whether the top-k choice drives the gains or just stabilizes training. The +35 point jumps look sensitive to how the fusion is set up, and without seeing variance or ablation on the graph construction it is hard to tell how much is the new loss versus other factors.

This is for groups already running logit-level merges on open models and looking for structure-aware variants. A reader who wants to try the method would need the full equations and the released code to reproduce the numbers. I would send it to review so the approximation and the experimental controls can be checked directly; the abstract alone does not give enough to decide if the central claim is solid.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces InfiGFusion, a structure-aware LLM fusion method that constructs global co-activation graphs from outer products of top-k logits across sequence positions and aligns models via a Graph-on-Logits Distillation (GLD) loss based on Gromov-Wasserstein distance. It proposes a sorting-based closed-form approximation claimed to reduce GW computation from O(n^4) to O(n log n) with provable guarantees, and reports consistent improvements over SOTA models and fusion baselines on 11 benchmarks in reasoning, coding, and mathematics, including large gains (+35.6 on Multistep Arithmetic, +37.06 on Causal Judgement) relative to SFT.

Significance. If the empirical gains and approximation guarantees hold under scrutiny, the work would offer a scalable way to incorporate cross-dimension logit dependencies into fusion, addressing a gap in prior logit-based methods that treat dimensions independently. The O(n log n) reduction and explicit graph construction are potentially useful strengths for practical deployment of fused models on complex tasks.

major comments (3)

[§4.2, Eq. (8)] §4.2 and Eq. (8): the claim that the sorting-based approximation preserves the essential structure of the Gromov-Wasserstein distance for co-activation graphs is not accompanied by a quantitative bound on the approximation error in terms of the fusion objective; without this, it is unclear whether the reported gains on reasoning benchmarks can be attributed to the GLD loss rather than the approximation artifact.
[Table 2] Table 2, Multistep Arithmetic and Causal Judgement rows: the +35.6 and +37.06 absolute improvements over SFT are presented without standard deviations across runs or statistical significance tests; given that these are the largest reported deltas and central to the claim of superior multi-step reasoning, the lack of variance reporting undermines the strength of the outperformance conclusion.
[§3.1] §3.1: the construction of the global co-activation graph via aggregation of outer products of top-k logits assumes that these pairwise activations encode the semantic dependencies needed for alignment, but no ablation is shown that isolates the contribution of the graph structure versus simply using the top-k logits without the GW term.
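The variance reporting requested in the Table 2 comment could take the following shape. This is a sketch under stated assumptions: the per-seed scores are made-up placeholders, not numbers from the paper, and the exact paired sign-flip permutation test is one reasonable choice among several rather than anything the authors have committed to. The function name `paired_sign_flip_test` is hypothetical.

```python
from itertools import product
from statistics import mean, stdev


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided paired permutation test on per-seed differences.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so all 2**n sign assignments are enumerated and the
    fraction with a mean at least as extreme as observed is returned.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return mean(diffs), stdev(diffs), hits / total


# PLACEHOLDER scores for five seeds; illustrative values only.
fused_runs = [71.0, 70.2, 72.1, 69.8, 71.5]
sft_runs = [65.1, 66.0, 64.7, 65.9, 65.3]

mean_gain, gain_std, p_value = paired_sign_flip_test(fused_runs, sft_runs)
```

With five paired runs the smallest achievable two-sided p-value is 2/32 = 0.0625, so at least six seeds are needed before this test can cross the conventional 0.05 threshold.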

minor comments (3)

[§3.1] The notation for the co-activation matrix G in §3.1 is introduced without an explicit definition of how sequence-position aggregation is normalized, making it difficult to reproduce the graph construction.
[Figure 3] Figure 3 caption does not specify the value of k used for the top-k logits or whether results are sensitive to this hyperparameter.
[Related Work] The related-work section omits recent logit-fusion papers that also operate on vocabulary distributions (e.g., those using optimal transport directly on logits), which would help situate the novelty of the graph-based extension.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the approximation analysis, statistical reporting, and ablations.

Referee: [§4.2, Eq. (8)] §4.2 and Eq. (8): the claim that the sorting-based approximation preserves the essential structure of the Gromov-Wasserstein distance for co-activation graphs is not accompanied by a quantitative bound on the approximation error in terms of the fusion objective; without this, it is unclear whether the reported gains on reasoning benchmarks can be attributed to the GLD loss rather than the approximation artifact.

Authors: We acknowledge that the provable guarantees provided for the sorting-based approximation are stated in terms of the GW distance itself rather than a direct quantitative bound on its effect within the fusion objective. In the revision we will add a discussion of error propagation from the approximation into the GLD loss and its potential impact on downstream performance.
Revision: yes

Referee: [Table 2] Table 2, Multistep Arithmetic and Causal Judgement rows: the +35.6 and +37.06 absolute improvements over SFT are presented without standard deviations across runs or statistical significance tests; given that these are the largest reported deltas and central to the claim of superior multi-step reasoning, the lack of variance reporting undermines the strength of the outperformance conclusion.

Authors: We agree that variance reporting and significance testing are needed for these key results. The revised manuscript will include standard deviations computed over multiple runs and paired statistical significance tests for the Multistep Arithmetic and Causal Judgement entries.
Revision: yes

Referee: [§3.1] §3.1: the construction of the global co-activation graph via aggregation of outer products of top-k logits assumes that these pairwise activations encode the semantic dependencies needed for alignment, but no ablation is shown that isolates the contribution of the graph structure versus simply using the top-k logits without the GW term.

Authors: We will add an ablation study that directly compares the full GLD loss against a variant that aggregates top-k logits without the Gromov-Wasserstein term. This will isolate the contribution of the graph structure to the observed gains.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent components

The abstract and method description define a novel GLD loss via top-k logit outer products forming co-activation graphs, followed by a sorting-based O(n log n) GW approximation with stated guarantees. These steps are constructive definitions of new quantities rather than reductions of outputs to fitted inputs or self-citations. No load-bearing self-citation chains, uniqueness theorems, or renamings of known results appear; performance claims rest on external benchmarks. The derivation chain remains self-contained against the described inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review; full parameter list, proof details, and experimental setup unavailable. The central modeling choice (graph construction from logits) and the approximation step rest on unstated assumptions about what the graph represents and how faithful the fast GW surrogate remains.

free parameters (1)

top-k
Number of retained logits per position; choice directly affects the constructed co-activation graph and is not derived from first principles.
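To see how directly k shapes the graph, the sketch below counts the nodes and edges that become active as k grows over the grid used in Figure 3. It assumes random Gaussian logits as a stand-in for real model outputs, plus a toy vocabulary size and sequence length; none of these values come from the paper, and `active_counts` is a hypothetical helper.

```python
import random


def active_counts(seq_logits, k):
    """Count distinct nodes and directed edges touched by top-k selection."""
    nodes, edges = set(), set()
    for logits in seq_logits:
        keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
        nodes.update(keep)
        edges.update((i, j) for i in keep for j in keep)
    return len(nodes), len(edges)


# Assumption: synthetic logits, vocabulary of 200, sequence length 16.
rng = random.Random(0)
vocab, length = 200, 16
seq = [[rng.gauss(0.0, 1.0) for _ in range(vocab)] for _ in range(length)]

# Same grid as Figure 3: k in {5, 10, 15, 20, 25, 30}.
sweep = {k: active_counts(seq, k) for k in (5, 10, 15, 20, 25, 30)}
```

Each position adds at most k nodes and k² edges, so graph size grows at most quadratically in k; the counts also never shrink as k increases, because each position's top-k set is contained in its top-(k+5) set.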

axioms (1)

domain assumption: Cross-dimension interactions in logits reflect semantic dependencies that are essential for aligning models with diverse generation behaviors.
Invoked in the abstract (third sentence) as the motivation for moving beyond independent-vocabulary treatment.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: we design a sorting-based closed-form approximation that reduces the original O(n^4) cost of Gromov-Wasserstein distance to O(n log n), with provable approximation guarantees... $|GW(C,D) - \widetilde{GW}(C,D)| \le n^{-1}n^{-2} + m^{-1}m^{-2}$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_add · unclear
Paper passage: Dynamic Graph Construction... adjacency matrix $\widehat{C}(i,j) = \sum_t z_t(i) \cdot z_t(j)$

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: Proposition 2 (Lipschitz Constants Comparison)... $L_{GW} = O(R^3/D) < L_{WD} = O(\sqrt{D}) < L_{KL} = O(e^{RD})$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Discovering Physical Directions in Weight Space: Composing Neural PDE Experts
cs.LG · 2026-05 · unverdicted · novelty 7.0
Fine-tuning neural PDE operators to regime endpoints reveals a physical direction in weight space that CCM uses to compose accurate merged models for new or extrapolated regimes from metadata or short prefixes.

FeatCal: Feature Calibration for Post-Merging Models
cs.LG · 2026-05 · conditional · novelty 7.0
FeatCal reduces feature drift in merged models via layer-wise closed-form calibration on a small dataset, outperforming prior post-merging methods on CLIP and GLUE benchmarks with high sample efficiency.

E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring
cs.CL · 2026-05 · unverdicted · novelty 6.0
E-PMQ improves 4-bit quantization accuracy on merged models by 8-42 points across CLIP and GLUE tasks through expert-guided calibration and merged-weight anchoring.

Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
cs.LG · 2026-05 · unverdicted · novelty 6.0
Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 4 Pith papers · 11 internal anchors

[1]

Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539– 68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539– 68551, 2023

work page 2023
[2]

Auto-gpt for online decision making: Benchmarks and additional opinions.arXiv preprint arXiv:2306.02224, 2023

Hui Yang, Sifu Yue, and Yunzhong He. Auto-gpt for online decision making: Benchmarks and additional opinions.arXiv preprint arXiv:2306.02224, 2023

work page arXiv 2023
[3]

Mixture-of-Agents Enhances Large Language Model Capabilities

Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities.arXiv preprint arXiv:2406.04692, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Llm-blender: Ensembling large language models with pairwise ranking and generative fusion

Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, 2023

work page 2023
[6]

Llm-ensemble: Optimal large language model ensemble method for e-commerce product attribute value extraction

Chenhao Fang, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Kaushiki Nag, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Llm-ensemble: Optimal large language model ensemble method for e-commerce product attribute value extraction. InProceedings of the 47th Interna- tional ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2910–2914, 2024

work page 2024
[7]

Sparse upcycling: Training mixture-of-experts from dense checkpoints

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. InThe Eleventh International Conference on Learning Representations

work page
[8]

Glam: Efficient scaling of language models with mixture-of-experts

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. InInternational conference on machine learning, pages 5547–5569. PMLR, 2022

work page 2022
[9]

Towards understanding the mixture-of-experts layer in deep learning.Advances in neural information processing systems, 35:23049–23062, 2022

Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning.Advances in neural information processing systems, 35:23049–23062, 2022

work page 2022
[10]

Ties-merging: Resolving interference when merging models.Advances in Neural Information Processing Systems, 36:7093–7115, 2023

Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models.Advances in Neural Information Processing Systems, 36:7093–7115, 2023

work page 2023
[11]

Merging models with fisher-weighted averaging

Michael S Matena and Colin A Raffel. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022

work page 2022
[12]

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. InInternational conference on machine learning, pages 23965–23998. P...

work page 2022
[13]

Knowl- edge fusion of large language models

Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowl- edge fusion of large language models. InThe Twelfth International Conference on Learning Representations

work page
[14]

Profuser: Progressive fusion of large language models.arXiv preprint arXiv:2408.04998, 2024

Tianyuan Shi, Fanqi Wan, Canbin Huang, Xiaojun Quan, Chenliang Li, Ming Yan, and Ji Zhang. Profuser: Progressive fusion of large language models.arXiv preprint arXiv:2408.04998, 2024. 11

work page arXiv 2024
[15]

Infifusion: A unified framework for enhanced cross-model reasoning via llm fusion.arXiv preprint arXiv:2501.02795, 2025

Zhaoyi Yan, Yiming Zhang, Baoyi He, Yuhao Fu, Qi Zhou, Zhijie Sang, Chunlin Ji, Shengyu Zhang, Fei Wu, and Hongxia Yang. Infifusion: A unified framework for enhanced cross-model reasoning via llm fusion.arXiv preprint arXiv:2501.02795, 2025

work page arXiv 2025
[16]

A setwise approach for effective and highly efficient zero-shot ranking with large language models

Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, and Guido Zuccon. A setwise approach for effective and highly efficient zero-shot ranking with large language models. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 38–47, 2024

work page 2024
[17]

Discovering preference optimization algorithms with and for large language models.Advances in Neural Information Processing Systems, 37:86528–86573, 2024

Chris Lu, Samuel Holt, Claudio Fanconi, Alex Chan, Jakob Foerster, Mihaela van der Schaar, and Robert Lange. Discovering preference optimization algorithms with and for large language models.Advances in Neural Information Processing Systems, 37:86528–86573, 2024

work page 2024
[18]

Scalable gromov-wasserstein learning for graph partitioning and matching.Advances in neural information processing systems, 32, 2019

Hongteng Xu, Dixin Luo, and Lawrence Carin. Scalable gromov-wasserstein learning for graph partitioning and matching.Advances in neural information processing systems, 32, 2019

work page 2019
[19]

Gromov-wasserstein averaging of kernel and distance matrices

Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-wasserstein averaging of kernel and distance matrices. InInternational conference on machine learning, pages 2664–2672. PMLR, 2016

work page 2016
[20]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[21]

Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

work page 2023
[22]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023

work page 2023
[23]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. InICLR, 2024

work page 2024
[24]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations

work page
[25]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[26]

Tinybert: Distilling bert for natural language understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, 2020

work page 2020
[27]

Gromov–wasserstein distances and the metric approach to object matching

Facundo Mémoli. Gromov–wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11:417–487, 2011

work page 2011
[28]

Fused gromov-wasserstein distance for structured objects.Algorithms, 13(9):212, 2020

Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard, and Nicolas Courty. Fused gromov-wasserstein distance for structured objects.Algorithms, 13(9):212, 2020

work page 2020
[29]

Learning graphons via struc- tured gromov-wasserstein barycenters

Hongteng Xu, Dixin Luo, Lawrence Carin, and Hongyuan Zha. Learning graphons via struc- tured gromov-wasserstein barycenters. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10505–10513, 2021

work page 2021
[30]

Interdependency matters: Graph alignment for multivariate time series anomaly detection.arXiv preprint arXiv:2410.08877, 2024

Yuanyi Wang, Haifeng Sun, Chengsen Wang, Mengde Zhu, Jingyu Wang, Wei Tang, Qi Qi, Zirui Zhuang, and Jianxin Liao. Interdependency matters: Graph alignment for multivariate time series anomaly detection.arXiv preprint arXiv:2410.08877, 2024

work page arXiv 2024
[31]

Gradient flow of energy: A general and efficient approach for entity alignment decoding.arXiv preprint arXiv:2401.12798, 2024

Yuanyi Wang, Haifeng Sun, Jingyu Wang, Qi Qi, Shaoling Sun, and Jianxin Liao. Gradient flow of energy: A general and efficient approach for entity alignment decoding.arXiv preprint arXiv:2401.12798, 2024. 12

work page arXiv 2024
[32]

Towards semantic consistency: Dirichlet energy driven robust multi-modal entity alignment

Yuanyi Wang, Haifeng Sun, Jiabo Wang, Jingyu Wang, Wei Tang, Qi Qi, Shaoling Sun, and Jianxin Liao. Towards semantic consistency: Dirichlet energy driven robust multi-modal entity alignment. In2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 3559–3572. IEEE, 2024

work page 2024
[33]

Gromov-wasserstein factorization models for graph clustering

Hongtengl Xu. Gromov-wasserstein factorization models for graph clustering. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 6478–6485, 2020

work page 2020
[34]

Towards cross- tokenizer distillation: the universal logit distillation loss for llms.Transactions on Machine Learning Research

Nicolas Boizard, Kevin El Haddad, CELINE HUDELOT, and Pierre Colombo. Towards cross- tokenizer distillation: the universal logit distillation loss for llms.Transactions on Machine Learning Research

[35] HuggingFace. Model outputs documentation. https://huggingface.co/docs/transformers/en/main_classes/output, 2025.

[36] Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, and Tianyi Zhou. Superfiltering: Weak-to-strong data filtering for fast instruction-tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14255–14273, 2024.

[37] Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13:9, 2024.

[38] Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding. 2025.

[39] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

[40] Mistral AI. Mistral-small-24b-instruct-2501. https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501, 2025.

[41] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.

[42] Qiushan Guo, Xinjiang Wang, Yichao Wu, Zhipeng Yu, Ding Liang, Xiaolin Hu, and Ping Luo. Online knowledge distillation via collaborative learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11020–11029, 2020.

[43] Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, and Xiaojun Quan. Fusechat: Knowledge fusion of chat models. arXiv preprint arXiv:2408.07990, 2024.

[44] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In ACL (Findings), 2023.

[45] Vikas Yadav, Steven Bethard, and Mihai Surdeanu. Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2578–2589, 2019.

[46] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations.

[47] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[48] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

[49] Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023.

[50] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

[51] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[52] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Paper...

[53] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019.

[54] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.

[55] Sebastian Prillo and Julian Eisenschlos. Softsort: A continuous relaxation for the argsort operator. In International Conference on Machine Learning, pages 7793–7802. PMLR, 2020.

[56] Mathieu Blondel, Olivier Teboul, Quentin Berthet, and Josip Djolonga. Fast differentiable sorting and ranking. In International Conference on Machine Learning, pages 950–959. PMLR, 2020.

[57] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.

[58] Mher Safaryan, Alexandra Peste, and Dan Alistarh. Knowledge distillation performs partial variance reduction. Advances in Neural Information Processing Systems, 36:75229–75258, 2023.

[59] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.

[60] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[61] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations.

[62] Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, and Jinan Xu. Dual-space knowledge distillation for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18164–18181, 2024.

[63] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.

[64] Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.

[65] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705, 2022.

[66] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[67] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

[68] Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111,...

[69] Sorting Stability Error. Denote by ϵ_sort the error incurred by the potential change in the sorted order when the features are perturbed. Under the assumption that the true features have a positive minimal gap, the sorting operator is stable; in practice, one can instead use a soft sort with known Lipschitz properties. Hence, only small errors are induced in th...
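
The minimal-gap condition in the sorting stability remark can be illustrated numerically. The sketch below is not taken from the paper: the feature values, the helper names `min_gap` and `order_preserved`, and the perturbation sizes are all hypothetical. It relies only on the elementary fact that if every entry moves by less than half the smallest gap between sorted values, no two entries can cross, so the sorted order is unchanged.

```python
import numpy as np

def min_gap(x: np.ndarray) -> float:
    """Smallest difference between consecutive sorted values."""
    return float(np.min(np.diff(np.sort(x))))

def order_preserved(x: np.ndarray, noise: np.ndarray) -> bool:
    """True if perturbing x by noise leaves the argsort permutation unchanged."""
    return bool(np.array_equal(np.argsort(x), np.argsort(x + noise)))

# Hypothetical feature values with distinct entries (positive minimal gap).
x = np.array([0.9, 0.1, 0.5, 0.3])
gap = min_gap(x)  # about 0.2

# Perturbation bounded by 0.49 * gap per entry: the order cannot change.
rng = np.random.default_rng(0)
small = rng.uniform(-1.0, 1.0, size=x.shape) * 0.49 * gap
print(order_preserved(x, small))  # True

# Perturbation exceeding gap / 2 on two neighbouring values: the order flips.
large = np.array([0.0, 0.15, 0.0, -0.15])
print(order_preserved(x, large))  # False
```

When the gap is small or unknown, a hard sort gives no such guarantee, which is presumably why the text mentions a soft sort with known Lipschitz properties as the practical alternative.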

[70] If an algorithm has γ-uniform stability, then its generalization error can be controlled to within order O(γ). 2. γ is approximately of the order of L/n, where L is a Lipschitz constant of the loss function and n is the number of samples. C.2 Lipschitz constant of GW loss. Lemma 1 (GW Lipschitz constant). Let L_GW(T,S) = λGW 2(T,S). If ∥S∥_2 ≤ R, then ∇_S L_GW 2 ...

[71] for training. Following [62], we evaluate on four benchmarks: SelfInst [63], VicunaEval [64], Super Natural Instructions (S-NI) [65], and Dolly [61]. Models: For distillation, we distill from LLaMA3-8B, Mistral-7B, and Qwen2.5-7B into student models, including GPT2-120M [66], OPT-350M [67], and Bloomz-560M [68]. Training settings: Distillation uses LoR...

[1] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.

[2] Hui Yang, Sifu Yue, and Yunzhong He. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023.

[3] Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. arXiv preprint arXiv:2406.04692, 2024.

[4] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.

[5] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, 2023.

[6] Chenhao Fang, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Kaushiki Nag, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Llm-ensemble: Optimal large language model ensemble method for e-commerce product attribute value extraction. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2910–2914, 2024.

[7] Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. In The Eleventh International Conference on Learning Representations.

[8] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International conference on machine learning, pages 5547–5569. PMLR, 2022.

[9] Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning. Advances in neural information processing systems, 35:23049–23062, 2022.

[10] Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36:7093–7115, 2023.

[11] Michael S Matena and Colin A Raffel. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022.

[12] Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International conference on machine learning, pages 23965–23998. P...

[13] Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models. In The Twelfth International Conference on Learning Representations.

[14] Tianyuan Shi, Fanqi Wan, Canbin Huang, Xiaojun Quan, Chenliang Li, Ming Yan, and Ji Zhang. Profuser: Progressive fusion of large language models. arXiv preprint arXiv:2408.04998, 2024.

[15] Zhaoyi Yan, Yiming Zhang, Baoyi He, Yuhao Fu, Qi Zhou, Zhijie Sang, Chunlin Ji, Shengyu Zhang, Fei Wu, and Hongxia Yang. Infifusion: A unified framework for enhanced cross-model reasoning via llm fusion. arXiv preprint arXiv:2501.02795, 2025.

[16] Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, and Guido Zuccon. A setwise approach for effective and highly efficient zero-shot ranking with large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 38–47, 2024.

[17] Chris Lu, Samuel Holt, Claudio Fanconi, Alex Chan, Jakob Foerster, Mihaela van der Schaar, and Robert Lange. Discovering preference optimization algorithms with and for large language models. Advances in Neural Information Processing Systems, 37:86528–86573, 2024.

[18] Hongteng Xu, Dixin Luo, and Lawrence Carin. Scalable gromov-wasserstein learning for graph partitioning and matching. Advances in neural information processing systems, 32, 2019.

[19] Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-wasserstein averaging of kernel and distance matrices. In International conference on machine learning, pages 2664–2672. PMLR, 2016.

[20] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.

[21] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023.

[22] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

[23] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. In ICLR, 2024.

[24] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.

[25] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[26] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, 2020.

[27] Facundo Mémoli. Gromov–wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11:417–487, 2011.

[28] Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard, and Nicolas Courty. Fused gromov-wasserstein distance for structured objects. Algorithms, 13(9):212, 2020.

[29] Hongteng Xu, Dixin Luo, Lawrence Carin, and Hongyuan Zha. Learning graphons via structured gromov-wasserstein barycenters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10505–10513, 2021.

[30] [30]

Interdependency matters: Graph alignment for multivariate time series anomaly detection.arXiv preprint arXiv:2410.08877, 2024

Yuanyi Wang, Haifeng Sun, Chengsen Wang, Mengde Zhu, Jingyu Wang, Wei Tang, Qi Qi, Zirui Zhuang, and Jianxin Liao. Interdependency matters: Graph alignment for multivariate time series anomaly detection.arXiv preprint arXiv:2410.08877, 2024

work page arXiv 2024

[31] [31]

Gradient flow of energy: A general and efficient approach for entity alignment decoding.arXiv preprint arXiv:2401.12798, 2024

Yuanyi Wang, Haifeng Sun, Jingyu Wang, Qi Qi, Shaoling Sun, and Jianxin Liao. Gradient flow of energy: A general and efficient approach for entity alignment decoding.arXiv preprint arXiv:2401.12798, 2024. 12

work page arXiv 2024

[32] [32]

Towards semantic consistency: Dirichlet energy driven robust multi-modal entity alignment

Yuanyi Wang, Haifeng Sun, Jiabo Wang, Jingyu Wang, Wei Tang, Qi Qi, Shaoling Sun, and Jianxin Liao. Towards semantic consistency: Dirichlet energy driven robust multi-modal entity alignment. In2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 3559–3572. IEEE, 2024

work page 2024

[33] [33]

Gromov-wasserstein factorization models for graph clustering

Hongtengl Xu. Gromov-wasserstein factorization models for graph clustering. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 6478–6485, 2020

work page 2020

[34] [34]

Towards cross- tokenizer distillation: the universal logit distillation loss for llms.Transactions on Machine Learning Research

Nicolas Boizard, Kevin El Haddad, CELINE HUDELOT, and Pierre Colombo. Towards cross- tokenizer distillation: the universal logit distillation loss for llms.Transactions on Machine Learning Research

work page

[35] [35]

Model outputs documentation

HuggingFace. Model outputs documentation. https://huggingface.co/docs/ transformers/en/main_classes/output, 2025

work page 2025

[36] [36]

Superfiltering: Weak-to-strong data filtering for fast instruction-tuning

Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, and Tianyi Zhou. Superfiltering: Weak-to-strong data filtering for fast instruction-tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14255–14273, 2024

work page 2024

[37] [37]

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13:9, 2024

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13:9, 2024

work page 2024

[38] [38]

Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding

Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding. 2025

work page 2025

[39] [39]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

Mistral-small-24b-instruct-2501

Mistral AI. Mistral-small-24b-instruct-2501. https://huggingface.co/mistralai/ Mistral-Small-24B-Instruct-2501, 2025

work page 2025

[41] [41]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

Online knowledge distillation via collaborative learning

Qiushan Guo, Xinjiang Wang, Yichao Wu, Zhipeng Yu, Ding Liang, Xiaolin Hu, and Ping Luo. Online knowledge distillation via collaborative learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11020–11029, 2020

work page 2020

[43] [43]

Fusechat: Knowl- edge fusion of chat models.arXiv preprint arXiv:2408.07990, 2024

Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, and Xiaojun Quan. Fusechat: Knowl- edge fusion of chat models.arXiv preprint arXiv:2408.07990, 2024

work page arXiv 2024

[44] [44]

Challenging big-bench tasks and whether chain-of-thought can solve them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. InACL (Findings), 2023

work page 2023

[45] [45]

Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering

Vikas Yadav, Steven Bethard, and Mihai Surdeanu. Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2578–2589, 2019

work page 2019

[46] [46]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations

work page

[47] [47]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 13

work page internal anchor Pith review Pith/arXiv arXiv 2021

[48] [48]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

work page

[49] [49]

Theoremqa: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023

work page 2023

[50] [50]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[51] [51]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[52] [52]

Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gard- ner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Paper...

work page 2019

[53] [53]

Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019

work page 2019

[54] [54]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[55] [55]

Softsort: A continuous relaxation for the argsort operator

Sebastian Prillo and Julian Eisenschlos. Softsort: A continuous relaxation for the argsort operator. InInternational Conference on Machine Learning, pages 7793–7802. PMLR, 2020

work page 2020

[56] [56]

Fast differentiable sorting and ranking

Mathieu Blondel, Olivier Teboul, Quentin Berthet, and Josip Djolonga. Fast differentiable sorting and ranking. InInternational Conference on Machine Learning, pages 950–959. PMLR, 2020

work page 2020

[57] [57]

Stability and generalization.Journal of machine learning research, 2(Mar):499–526, 2002

Olivier Bousquet and André Elisseeff. Stability and generalization.Journal of machine learning research, 2(Mar):499–526, 2002

work page 2002

[58] [58]

Knowledge distillation performs partial variance reduction.Advances in Neural Information Processing Systems, 36:75229–75258, 2023

Mher Safaryan, Alexandra Peste, and Dan Alistarh. Knowledge distillation performs partial variance reduction. Advances in Neural Information Processing Systems, 36:75229–75258, 2023

[59]

OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023

[60]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

[61]

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations

[62]

Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, and Jinan Xu. Dual-space knowledge distillation for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18164–18181, 2024

[63]

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022

[64]

Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023

[65]

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705, 2022

[66]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

[67]

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

[68]

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111,...

[69]

Sorting Stability Error. Denote by ϵ_sort the error incurred by the potential change in the sorted order when the features are perturbed. Under the assumption that the true features have a positive minimal gap, the sorting operator is stable; in practice, one can instead use a soft sort with known Lipschitz properties. Hence, only small errors are induced in th...
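
As a rough illustration of the stability remark above, the sketch below implements the SoftSort relaxation of [55] in plain Python and checks, on toy data, that a perturbation smaller than half the minimal gap leaves the hard sorted order unchanged and shifts the soft-sorted values only slightly. The helper names (`soft_sort`, `soft_sorted_values`, `hard_order`), the temperature `tau`, and the feature values are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import math


def soft_sort(scores, tau=0.1):
    """Relaxed descending-sort permutation matrix (SoftSort-style).

    Row i is a softmax over -|sorted_scores[i] - scores[j]| / tau, so it
    concentrates on the element holding rank i as tau shrinks.
    """
    sorted_scores = sorted(scores, reverse=True)
    matrix = []
    for target in sorted_scores:
        logits = [-abs(target - s) / tau for s in scores]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(value - peak) for value in logits]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


def soft_sorted_values(scores, tau=0.1):
    """Apply the relaxed permutation to the scores (approximately sorted)."""
    return [sum(p * s for p, s in zip(row, scores)) for row in soft_sort(scores, tau)]


def hard_order(scores):
    """Indices that sort the scores in descending order."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Toy features (hypothetical): neighbouring sorted values differ by 0.5.
features = [2.0, 0.5, 1.5, 1.0]
# Perturbation of size 0.1, below half the minimal gap (0.25).
perturbed = [f + d for f, d in zip(features, [0.1, -0.1, 0.1, -0.1])]

same_order = hard_order(features) == hard_order(perturbed)
max_shift = max(
    abs(a - b)
    for a, b in zip(soft_sorted_values(features), soft_sorted_values(perturbed))
)
print("hard order preserved:", same_order)
print("largest change in soft-sorted values:", max_shift)
```

With a positive minimal gap the hard permutation is locally constant, while the soft variant changes smoothly with the inputs, which is the property the Lipschitz argument relies on.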

[70]

If an algorithm has γ-uniform stability, then its generalization error can be controlled at order O(γ). 2. γ is approximately of order L/n, where L is a Lipschitz constant of the loss function and n is the number of samples.

C.2 Lipschitz constant of GW loss

Lemma 1 (GW Lipschitz constant). Let L GW(T, S) = λ GW 2(T, S). If ∥S∥_2 ≤ R, then ∇SLGW 2 ...

[71]

for training. We follow [62] and evaluate on four benchmarks: SelfInst [63], VicunaEval [64], Super Natural Instructions (S-NI) [65], and Dolly [61]. Models: For distillation, we distill from LLaMA3-8B, Mistral-7B, and Qwen2.5-7B into student models, including GPT2-120M [66], OPT-350M [67], and Bloomz-560M [68]. Training settings: Distillation uses LoR...
