FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data

Bryan Catanzaro; Haozhen Zhang; Jiaxuan You; Mohammad Shoeybi; Mostofa Patwary; Pengrui Han; Tao Feng; Zijie Lei

arxiv: 2507.10540 · v3 · pith:7WQTLDTNnew · submitted 2025-07-14 · 💻 cs.LG

Tao Feng, Haozhen Zhang, Zijie Lei, Pengrui Han, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jiaxuan You

classification 💻 cs.LG

keywords: fusion, llms, across, capabilities, data, fusionfactory, multi-llm, benchmarks

The rapid advancement of large language models (LLMs) has created a diverse landscape of models, each excelling at different tasks. This diversity drives researchers to employ multiple LLMs in practice, leaving behind valuable multi-LLM log data. This naturally leads to the question of whether such logs can be fully leveraged to fuse LLMs' complementary capabilities. Although prior work has explored various strategies for integrating multiple LLMs, we argue that practical fusion must meet two essential requirements: (1) compatibility with real-world serving scenarios (e.g., local and API-based serving), and (2) flexibility to operate at different stages of the LLM pipeline to meet varied user needs (e.g., fine-tuning and inference stages). To this end, we introduce LLMFusionBench, a large-scale benchmark for LLM fusion that spans 14 tasks across five domains, with responses from 20 open-source LLMs (8B to 671B) totaling 103M tokens. Building on LLMFusionBench, we propose FusionFactory, a systematic framework with three elaborated levels: (1) query-level fusion via tailored LLM routers, (2) thought-level fusion leveraging retrieved abstract reasoning templates, and (3) model-level fusion via distillation from top-ranked responses. Experiments show that FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with the optimal fusion configuration varying across benchmarks, highlighting the promise of multi-LLM log data as a practical foundation for fusing diverse LLM capabilities.
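
To make the idea of query-level fusion concrete, the following is a minimal sketch of a router driven by logged responses. It is an illustration only, not the routers used in FusionFactory: the log records, the model names (`model-a`, `model-b`), the scores, and the helpers `tokens`, `jaccard`, and `route` are all hypothetical, and word-overlap similarity is a stand-in chosen for simplicity.

```python
from collections import defaultdict

# Hypothetical multi-LLM log: (query, model name, quality score in [0, 1]).
# These records are invented for illustration, not taken from LLMFusionBench.
LOGS = [
    ("solve the integral of x squared", "model-a", 0.9),
    ("solve the integral of x squared", "model-b", 0.4),
    ("write a python function to sort a list", "model-a", 0.5),
    ("write a python function to sort a list", "model-b", 0.8),
    ("compute the derivative of sin x", "model-a", 0.8),
    ("compute the derivative of sin x", "model-b", 0.3),
]


def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def route(query, logs, k=1):
    """Return the model with the highest mean logged score
    over the k logged queries most similar to `query`."""
    q = tokens(query)
    # Rank distinct logged queries by similarity (ties broken alphabetically).
    distinct = sorted({rec[0] for rec in logs},
                      key=lambda s: (-jaccard(q, tokens(s)), s))
    neighbours = set(distinct[:k])
    scores = defaultdict(list)
    for logged_query, model, score in logs:
        if logged_query in neighbours:
            scores[model].append(score)
    # Iterate models in sorted order so ties resolve deterministically.
    return max(sorted(scores), key=lambda m: sum(scores[m]) / len(scores[m]))


print(route("solve the integral of sin x", LOGS))
print(route("write a python function to reverse a list", LOGS))
```

With these toy records, the math-style query is sent to `model-a` and the coding query to `model-b`, since each inherits the best-scoring model of its nearest logged query. A practical router would likely replace word overlap with learned embeddings and account for cost as well as quality.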

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization
cs.CR 2026-04 unverdicted novelty 7.0

R²A uses a hybrid ensemble surrogate router and suffix optimization to significantly increase black-box LLM router selection of expensive models across query distributions.

SWE-Router: Routing in Multi-turn Agentic Software Engineering Tasks
cs.SE 2026-06 unverdicted novelty 6.0

SWE-Router introduces trajectory-conditioned value-based routing for LLM agents on SWE tasks, with a Bayes-optimality theorem and empirical cost savings while retaining most strong-model performance.

Can Heterogeneous Language Models Be Fused?
cs.AI 2026-04 unverdicted novelty 6.0

HeteroFusion fuses heterogeneous LLMs via topology-based alignment and conflict-aware denoising, outperforming merging and ensemble baselines in cross-family and multi-source settings.