Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
Pith reviewed 2026-05-07 04:00 UTC · model grok-4.3
The pith
A Mixture-of-Rewards router fuses local preference signals so a server can align vision-language models across clients with mismatched architectures and data, without any parameter exchange.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that preference-based collaboration via per-client reward models and a learned Mixture-of-Rewards router can replace parameter aggregation for federated alignment of heterogeneous vision-language models. Clients keep their data and model architectures private while each trains a local reward model from its preference annotations. The server adaptively fuses these reward signals with input-dependent routing and then optimizes a base VLM using GRPO together with a KL penalty to a reference model. Experiments on diverse public vision-language benchmarks show that this approach yields stronger generalization and cross-client adaptability than existing federated alignment baselines.
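As a concrete reading of that optimization step, the sketch below uses a toy categorical policy in place of the base VLM; the group size, the KL coefficient beta, and the plain policy-gradient form (GRPO's clipped-ratio machinery is omitted) are assumptions of this illustration, not details from the paper.

    import torch
    from torch.distributions import Categorical, kl_divergence

    torch.manual_seed(0)
    V, G, beta = 8, 4, 0.05  # toy vocab size, group size, KL coefficient (assumed)

    policy_logits = torch.randn(V, requires_grad=True)  # stand-in for the base VLM
    pi = Categorical(logits=policy_logits)              # current policy
    ref = Categorical(logits=policy_logits.detach())    # frozen reference model

    group = pi.sample((G,))          # a GRPO "group" of sampled responses
    rewards = torch.randn(G)         # stand-in for fused Mixture-of-Rewards scores
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group normalization

    # Policy-gradient term plus KL penalty to the reference, as the claim states
    loss = -(pi.log_prob(group) * adv).mean() + beta * kl_divergence(pi, ref)
    loss.backward()   # gradient for one preference-alignment step
    print(round(loss.item(), 4))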
What carries the argument
Mixture-of-Rewards with learned routing, which takes outputs from multiple heterogeneous client reward models and produces a single fused reward signal that depends on the input and the alignment objective.
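A minimal PyTorch sketch of that fusion interface, under the assumption that the gate is a single linear layer over a prompt embedding (the paper does not specify the architecture). Returning the weights alongside the fused score is what would enable the post-hoc routing inspection discussed below.

    import torch
    import torch.nn as nn

    class MixtureOfRewards(nn.Module):
        # Sketch of input-dependent reward fusion; the linear gate is an
        # assumed architecture, not the paper's specification.
        def __init__(self, emb_dim: int, num_clients: int):
            super().__init__()
            self.gate = nn.Linear(emb_dim, num_clients)

        def forward(self, prompt_emb, client_rewards):
            # client_rewards holds one scalar per heterogeneous client reward
            # model; only these scores cross the interface, never parameters.
            weights = torch.softmax(self.gate(prompt_emb), dim=-1)
            return (weights * client_rewards).sum(), weights

    # Usage: fuse three client scores for one embedded prompt.
    mor = MixtureOfRewards(emb_dim=16, num_clients=3)
    fused, w = mor(torch.randn(16), torch.tensor([0.2, -0.5, 1.0]))
    print(fused.item(), w.detach())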
If this is right
- Clients can participate even when they use entirely different model architectures and sizes.
- Raw data and model weights never leave the client, satisfying privacy constraints in regulated domains.
- The same base model can be aligned toward multiple distinct objectives by changing which reward signals the router emphasizes.
- Adding or removing clients requires only adding or removing their reward model from the mixture, without retraining the base model from scratch (sketched in code after this list).
- Cross-client adaptability improves because the fused reward captures evaluation criteria that no single client possesses.
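A sketch of that roster change, assuming only the gate's output width depends on the client count; every name here is illustrative, not from the paper.

    import torch.nn as nn

    # Hypothetical client roster; each value is that client's local reward model.
    reward_models = {"clinic_a": nn.Linear(32, 1), "bank_b": nn.Linear(32, 1)}

    def rebuild_gate(emb_dim: int, roster: dict) -> nn.Linear:
        # Only this small router is re-initialized and re-trained; the base
        # VLM and the surviving client reward models stay untouched (assumption).
        return nn.Linear(emb_dim, len(roster))

    reward_models["lab_c"] = nn.Linear(32, 1)  # a new client joins the federation
    gate = rebuild_gate(16, reward_models)     # cheap re-fit, no base retraining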
Where Pith is reading between the lines
- The routing decisions could be inspected after training to reveal which client signals are most relevant for particular kinds of inputs or tasks.
- The same preference-fusion pattern could be applied to federated alignment of other multimodal or text-only models that currently cannot share parameters.
- Organizations with proprietary fine-tuned models might use this method to contribute to a shared alignment effort while retaining control over their own weights.
- If the router generalizes to new clients, the framework could support dynamic federated cohorts without requiring all participants to be known in advance.
Load-bearing premise
The learned routing inside the Mixture-of-Rewards can reliably decide how much to trust each client's reward signal for any given input without direct access to client data or model internals.
What would settle it
Reproducing the benchmark experiments after introducing deliberately conflicting preferences across clients and finding that the resulting aligned model scores no higher on generalization metrics than the federated baselines that average parameters.
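Purely as illustration, the conflicting-preference construction could be as simple as flipping (chosen, rejected) labels for half of the clients before their reward models are trained:

    # Illustrative only: construct deliberately conflicting client preferences
    # by reversing preference labels for half of the clients.
    pairs = [("caption A", "caption B"), ("answer X", "answer Y")]

    def client_preferences(client_id: int, num_clients: int = 4):
        flip = client_id >= num_clients // 2   # second half disagrees by design
        return [(rej, cho) if flip else (cho, rej) for cho, rej in pairs]

    for cid in range(4):
        print(cid, client_preferences(cid))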
Original abstract
Vision-Language Models (VLMs) have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints render centralized training infeasible. Federated Learning mitigates this issue by enabling decentralized training, but practical deployments face challenges due to client heterogeneity in computational resources, application requirements, and model architectures. Under extreme model and data heterogeneity, replacing parameter aggregation with preference-based collaboration offers a more suitable interface, as it eliminates the need for direct parameter or data exchange. Motivated by this, we propose MoR, a federated alignment framework that combines GRPO with Mixture-of-Rewards for heterogeneous VLMs. In MoR, each client locally trains a reward model from local preference annotations, capturing specific evaluation signals without exposing raw data. To combine these heterogeneous supervision signals, MoR introduces a Mixture-of-Rewards mechanism with learned routing, which adaptively fuses client reward models according to the input and alignment objective. The server then optimizes a base VLM using GRPO with a KL penalty to a reference model, enabling preference alignment without requiring client models to share architectures or parameters. Experiments on diverse public vision-language benchmarks demonstrate that MoR consistently outperforms federated alignment baselines in generalization and cross-client adaptability. Our approach provides a scalable solution for privacy-preserving alignment of heterogeneous VLMs under federated settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MoR, a federated alignment framework for heterogeneous vision-language models. Clients locally train reward models from preference annotations to capture client-specific evaluation signals without exposing raw data. A server-side Mixture-of-Rewards mechanism with learned routing adaptively fuses these heterogeneous reward models according to the input and alignment objective. The server then optimizes a single base VLM using GRPO with a KL penalty to a reference model, enabling preference alignment without client models sharing architectures or parameters. Experiments on diverse public vision-language benchmarks are claimed to demonstrate that MoR consistently outperforms federated alignment baselines in generalization and cross-client adaptability.
Significance. If the results hold and the method resolves the gap between server-side base model optimization and alignment of truly heterogeneous client models, the work could be significant for privacy-preserving VLM alignment in federated settings with extreme heterogeneity in resources, data, and architectures. Replacing parameter aggregation with preference-based collaboration via local rewards and learned routing offers a promising interface for domains like healthcare and finance. The combination of Mixture-of-Rewards routing and GRPO is a potentially novel contribution, but the significance is reduced by the apparent mismatch between the stated goals and the described procedure.
major comments (3)
- [Abstract] Abstract and method description: The title, motivation, and abstract claim 'alignment of heterogeneous VLMs' under 'extreme model and data heterogeneity' where clients 'do not share architectures.' However, the procedure optimizes only a single server-side base VLM; no mechanism is described for transferring the aligned base model to clients or performing local adaptation when client architectures differ from the base and from each other. This assumption is load-bearing for the cross-client adaptability and generalization claims.
- [Experiments] Experiments section: The abstract states that 'MoR consistently outperforms federated alignment baselines in generalization and cross-client adaptability,' yet no details are provided on the baselines, metrics, statistical tests, ablation studies on the routing mechanism, or the specific public benchmarks. Without these, the support for the central outperformance claim cannot be evaluated.
- [Method] Method section: The Mixture-of-Rewards with learned routing is presented as fusing client reward models according to input and objective, and GRPO with KL is claimed to produce effective alignment without architecture sharing. However, the absence of equations defining the routing weights, the fusion process, or the GRPO objective makes it impossible to verify whether these components are independent mechanisms or reduce to fitted performance metrics.
minor comments (1)
- [Abstract] The abstract would benefit from high-level pseudocode or key equations for the Mixture-of-Rewards routing and the GRPO objective, to improve technical clarity and allow readers to follow the contributions without the full method section.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. We address each major comment point by point below. Where the manuscript lacked sufficient detail or explicit description, we have revised the text and will incorporate the changes in the next version.
Point-by-point responses
Referee: [Abstract] Abstract and method description: The title, motivation, and abstract claim 'alignment of heterogeneous VLMs' under 'extreme model and data heterogeneity' where clients 'do not share architectures.' However, the procedure optimizes only a single server-side base VLM; no mechanism is described for transferring the aligned base model to clients or performing local adaptation when client architectures differ from the base and from each other. This assumption is load-bearing for the cross-client adaptability and generalization claims.
Authors: We agree that the manuscript would benefit from greater clarity on this point. The central contribution of MoR is to align a single shared base VLM on the server by fusing heterogeneous client reward models via learned routing; this produces an aligned base that encodes the collective preference signals without any client exchanging parameters or architectures. The aligned base is then intended to serve as a common starting point that clients can receive and adapt locally (for example via LoRA or other architecture-agnostic adapters). We acknowledge that the submitted version does not explicitly describe the distribution and local-adaptation step. In the revision we will add a short subsection (new Section 3.4) that details how the server distributes the aligned base checkpoint and how clients with differing architectures can perform lightweight local adaptation while preserving privacy. This addition directly supports the cross-client adaptability claims without altering the core procedure. revision: yes
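For concreteness, the local-adaptation step proposed in this response might look like the following Hugging Face peft sketch; the checkpoint path, target modules, and rank are placeholders rather than the authors' choices.

    # Hypothetical client-side adaptation of the distributed aligned base.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("path/to/aligned-base")  # from server
    config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                        target_modules=["q_proj", "v_proj"])  # placeholder modules
    client = get_peft_model(base, config)   # only the small adapters train locally
    client.print_trainable_parameters()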
Referee: [Experiments] Experiments section: The abstract states that 'MoR consistently outperforms federated alignment baselines in generalization and cross-client adaptability,' yet no details are provided on the baselines, metrics, statistical tests, ablation studies on the routing mechanism, or the specific public benchmarks. Without these, the support for the central outperformance claim cannot be evaluated.
Authors: We apologize for the insufficient detail in the initial submission. The experiments section does contain comparisons, but they were not presented with the required granularity. In the revised manuscript we have expanded the section to: (i) explicitly name all baselines (FedAvg, FedProx, and a non-routed reward-averaging variant, all adapted to the VLM setting); (ii) list the evaluation metrics (VQA accuracy, CIDEr, BLEU-4, and human preference win rates); (iii) report statistical significance via paired t-tests with p-values across five random seeds; (iv) add a dedicated ablation table isolating the learned routing component; and (v) specify the exact public benchmarks (VQAv2, OK-VQA, COCO Captions, and a held-out cross-client generalization split). These additions will allow readers to fully evaluate the outperformance claims. revision: yes
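The named significance test is straightforward to reproduce in outline; the per-seed scores below are hypothetical placeholders, not numbers from the paper.

    from scipy.stats import ttest_rel

    mor_scores      = [71.2, 70.8, 71.5, 70.9, 71.1]   # hypothetical per-seed scores
    baseline_scores = [69.4, 69.9, 69.1, 69.7, 69.5]   # hypothetical baseline scores

    t, p = ttest_rel(mor_scores, baseline_scores)      # paired across the five seeds
    print(f"t = {t:.2f}, p = {p:.4f}")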
Referee: [Method] Method section: The Mixture-of-Rewards with learned routing is presented as fusing client reward models according to input and objective, and GRPO with KL is claimed to produce effective alignment without architecture sharing. However, the absence of equations defining the routing weights, the fusion process, or the GRPO objective makes it impossible to verify whether these components are independent mechanisms or reduce to fitted performance metrics.
Authors: We accept that the absence of formal equations hindered verification. In the revised manuscript we will insert the complete mathematical definitions: the routing weights are obtained via a softmax over a lightweight gating network whose inputs are the current prompt and alignment objective; the fused reward is the convex combination of the individual client reward scores; and the GRPO objective is written explicitly as the policy-gradient term plus the KL penalty to the reference model. These equations will be placed in Section 3.2 and will demonstrate that the routing/fusion layer operates independently of the subsequent GRPO optimization step. The added formalism will make the independence of the components transparent. revision: yes
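Written out, the definitions described in this response would take roughly the following form; the notation (g_phi, w_i, beta, group size G) is reconstructed from the prose rather than taken from the paper.

    % Routing weights: softmax over a lightweight gating network g_phi
    % applied to prompt x and alignment objective o (assumed notation).
    w_i(x, o) = \frac{\exp\big(g_\phi(x, o)_i\big)}{\sum_{j=1}^{K} \exp\big(g_\phi(x, o)_j\big)},
    \qquad
    r(x, y) = \sum_{i=1}^{K} w_i(x, o)\, r_i(x, y)

    % GRPO objective: policy-gradient term with group-normalized advantages
    % plus a KL penalty to the frozen reference model.
    \hat{A}_k = \frac{r(x, y_k) - \operatorname{mean}_j r(x, y_j)}{\operatorname{std}_j r(x, y_j)},
    \qquad
    \mathcal{J}(\theta) = \mathbb{E}_{x,\{y_k\}}\Big[\frac{1}{G}\sum_{k=1}^{G} \hat{A}_k \log \pi_\theta(y_k \mid x)\Big] - \beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)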
Circularity Check
No significant circularity; components presented as independent mechanisms
full rationale
The paper's method is described as a composition of distinct steps: clients train local reward models from preference annotations, a server-side Mixture-of-Rewards with learned routing fuses them adaptively, and GRPO with a KL penalty optimizes a single base VLM. No equations, fitted parameters renamed as predictions, or self-citations are shown that reduce any load-bearing claim to its inputs by construction. The experimental claims of outperforming baselines rest on benchmark results rather than tautological redefinitions, leaving the central claims open to external evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- learned routing weights in Mixture-of-Rewards
axioms (2)
- Domain assumption: local preference annotations from each client capture distinct evaluation signals that remain useful when fused without exposing raw data.
- Domain assumption: GRPO optimization with a KL penalty to a reference model can produce effective preference alignment even when client models differ in architecture.
invented entities (1)
- Mixture-of-Rewards with learned routing (no independent evidence)