pith. machine review for the scientific record.

arxiv: 2604.07753 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.CL · cs.LG

Recognition: 2 Lean theorem links

Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

Xiangyue Liu , Zijian Zhang , Miles Yang , Zhao Zhong , Liefeng Bo , Ping Tan

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:19 UTC · model grok-4.3

classification 💻 cs.CV cs.CL cs.LG
keywords Symbiotic-MoE · Mixture-of-Experts · multimodal models · image generation · visual understanding · task interference · cross-modal synergy · progressive training

The pith

Symbiotic-MoE lets generative training improve rather than degrade understanding in multimodal models through shared experts and staged optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a native multimodal mixture-of-experts architecture can eliminate the usual destructive interference between image generation and visual-language understanding. It does so by splitting experts into task-specific groups while keeping some experts shared so that visual details learned during generation flow into richer textual representations. Standard mixture-of-experts training fails because generative signals overwhelm the routing; the proposed disentanglement plus progressive shielding prevents that collapse and turns the same signals into useful feedback. If the approach holds, multimodal models could acquire both capabilities together without extra parameters, without isolating tasks, and without the forgetting that currently forces separate training runs.

Core claim

Symbiotic-MoE resolves task interference inside a native multimodal Mixture-of-Experts Transformer model with zero added parameters. It first diagnoses routing collapse in ordinary MoE tuning, where generative gradients monopolize expert utilization. Modality-Aware Expert Disentanglement then partitions experts into task-specific groups while retaining shared experts as a multimodal semantic bridge that lets generative tasks supply fine-grained visual semantics to textual representations. A Progressive Training Strategy applies differential learning rates and early-stage gradient shielding to protect pre-trained knowledge and convert early volatility into constructive feedback for the understanding task.
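
Neither the pith nor the abstract excerpt below includes reference code, so the following is only a minimal PyTorch sketch of how a layer matching this description might be organized: one router per task group so generative gradients cannot skew understanding routing, plus a small set of always-active shared experts acting as the bridge. All names (SymbioticMoELayer, the "und"/"gen" task tags, the expert counts) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a minimal reading of the described FFN layer, with
# decoupled routers per task group and shared experts that see every token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
    def forward(self, x):
        return self.net(x)

class SymbioticMoELayer(nn.Module):
    """FFN layer with task-specific expert groups plus shared experts.

    Understanding tokens are routed only among understanding experts,
    generation tokens only among generation experts, and a few shared
    experts process every token (the "semantic bridge").
    """
    def __init__(self, d_model=512, d_ff=2048, n_experts_per_task=4, n_shared=2, top_k=1):
        super().__init__()
        self.top_k = top_k
        self.groups = nn.ModuleDict({
            "und": nn.ModuleList(Expert(d_model, d_ff) for _ in range(n_experts_per_task)),
            "gen": nn.ModuleList(Expert(d_model, d_ff) for _ in range(n_experts_per_task)),
        })
        # One router per task group, so generative gradients cannot skew
        # the understanding router (the disentanglement idea).
        self.routers = nn.ModuleDict({
            "und": nn.Linear(d_model, n_experts_per_task),
            "gen": nn.Linear(d_model, n_experts_per_task),
        })
        self.shared = nn.ModuleList(Expert(d_model, d_ff) for _ in range(n_shared))

    def forward(self, x, task: str):
        # x: (tokens, d_model); task selects which expert group routes them.
        logits = self.routers[task](x)
        weights = F.softmax(logits, dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)      # sparse top-k routing
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.groups[task]):
                mask = topi[:, k] == e
                if mask.any():
                    out[mask] += topw[mask, k, None] * expert(x[mask])
        # Shared experts process every token, regardless of task.
        for expert in self.shared:
            out = out + expert(x) / len(self.shared)
        return out

# Usage: understanding and generation tokens pass through the same layer but
# are routed by different routers; the shared experts see both streams.
layer = SymbioticMoELayer()
y_und = layer(torch.randn(8, 512), task="und")
y_gen = layer(torch.randn(8, 512), task="gen")
```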

What carries the argument

Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups with shared experts acting as a multimodal semantic bridge, together with a Progressive Training Strategy that uses differential learning rates and early gradient shielding.
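
As a rough reading of that strategy, the sketch below shows one way differential learning rates and early-stage gradient shielding could be wired into a standard training loop. The helper names pretrained_parameters, new_parameters, and the shield_steps cutoff are hypothetical; the paper's actual schedule may differ.

```python
# Illustrative sketch only, assuming a model that exposes hypothetical helpers
# pretrained_parameters() and new_parameters().
import torch

def build_optimizer(model, lr_pretrained=1e-5, lr_new=1e-4):
    # Differential learning rates: smaller steps on pre-trained weights,
    # larger steps on the freshly trained generation-side experts.
    return torch.optim.AdamW([
        {"params": model.pretrained_parameters(), "lr": lr_pretrained},
        {"params": model.new_parameters(), "lr": lr_new},
    ])

def shield_pretrained_gradients(model, step, shield_steps=2000):
    # Early-stage gradient shielding: during the first shield_steps iterations,
    # drop gradients reaching pre-trained parameters so that volatile
    # generative signals cannot overwrite prior knowledge.
    if step < shield_steps:
        for p in model.pretrained_parameters():
            if p.grad is not None:
                p.grad.zero_()

# Inside the training loop (sketch):
#   loss.backward()
#   shield_pretrained_gradients(model, step)
#   optimizer.step(); optimizer.zero_grad()
```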

If this is right

  • Generative tasks reach rapid convergence while understanding capabilities are preserved or improved.
  • Cross-modal synergy produces measurable gains on understanding benchmarks such as MMLU and OCRBench.
  • The model retains full original capacity with no parameter overhead or fragmentation.
  • Early training volatility is converted into positive feedback rather than destructive interference.
  • Task isolation is avoided, unlike structural separation methods that lose synergy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shared-expert bridge could be reused as a general pattern for transferring useful signals across any pair of conflicting objectives in large models.
  • Progressive shielding of early gradients may shorten the alignment phase needed when adding new capabilities to already-trained multimodal systems.
  • If routing stabilizes under this partitioning, similar disentanglement could support scaling MoE models to larger numbers of simultaneous tasks without collapse.

Load-bearing premise

That shared experts will absorb fine-grained visual semantics from generation and transfer them to improve understanding without routing collapse or negative interference between task groups.

What would settle it

After full training, a direct comparison showing that MMLU or OCRBench scores remain flat or decline relative to a standard multimodal baseline, or that routing statistics still show generative tasks dominating expert selection despite the disentanglement.
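
Routing statistics of that kind can be computed directly from the router's outputs. The sketch below is an illustrative diagnostic, not the paper's evaluation code: it tallies top-1 expert assignments per task so a reader could see whether generative tokens pile onto a few experts (the collapse signature) or utilization stays spread out.

```python
# Illustrative diagnostic: per-task expert utilization from top-1 routing indices.
import collections
import torch

def expert_utilization(top1_indices: torch.Tensor, task_labels: list, n_experts: int):
    """top1_indices: (tokens,) expert ids; task_labels: per-token 'und'/'gen'."""
    counts = collections.defaultdict(lambda: torch.zeros(n_experts))
    for idx, task in zip(top1_indices.tolist(), task_labels):
        counts[task][idx] += 1
    # Fraction of experts receiving a non-trivial share of tokens: a crude
    # proxy for the capacity-rate curves discussed in the figures.
    usage = sum(counts.values())
    active_frac = (usage > 0.01 * usage.sum()).float().mean().item()
    return {t: c / c.sum() for t, c in counts.items()}, active_frac

# Usage sketch: a collapsed router shows generation mass piled on a few experts
# and a low active fraction; a healthy one stays close to uniform.
idx = torch.randint(0, 8, (64,))
labels = ["gen" if i % 2 else "und" for i in range(64)]
per_task_usage, active = expert_utilization(idx, labels, n_experts=8)
```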

Figures

Figures reproduced from arXiv: 2604.07753 by Liefeng Bo, Miles Yang, Ping Tan, Xiangyue Liu, Zhao Zhong, Zijian Zhang.

Figure 1: Teaser. We visualize the evolution of training paradigms: moving from (a) destructive interference and (b) the "Split-Brain" compromise to (c) symbiotic synergy. As shown in the radar chart, our framework achieves holistic superiority, simultaneously boosting generation and understanding capabilities with zero-parameter overhead and maximal parameter efficiency. …
Figure 2: Comparison of Architectures. (a) Standard MoE suffers from routing collapse due to multi-task gradient conflicts (like cognitive dissonance). (b) MoT avoids conflict via physical isolation but induces a "Split-Brain" dilemma, which structurally hinders cross-modal synergy and knowledge transfer. (c) Our Symbiotic-MoE introduces shared experts as a semantic bridge, enabling co-evolution of generation and …
Figure 3: Overview of Symbiotic-MoE. We re-architect the Transformer FFN layer to resolve task interference. Our framework features: (1) Modality-Aware Expert Disentanglement, where input tokens are routed to specialized Understanding (blue) or Generation (orange) experts via decoupled routers; (2) a Shared Expert Bridge (purple), which processes all tokens to enforce semantic alignment and prevent modal isolation…
Figure 4: Visualization Comparisons. We compare samples generated by Symbiotic-MoE (top), Standard MoE (middle), and MoT (bottom). Standard MoE suffers from severe structural collapse and visual artifacts due to gradient conflicts. While MoT recovers basic object shapes, it often misses fine-grained semantic details. In contrast, our method achieves superior fidelity and precise semantic alignment (e.g., triangular …
Figure 5: Analysis of Training Dynamics. (a) Standard MoE (blue) and MoT (orange) suffer from routing collapse, indicated by the dropping capacity rate. In contrast, our Symbiotic-MoE (green) maintains highest expert utilization (∼0.95). (b) This structural stability translates into superior optimization efficiency, with our method achieving lower convergence generation loss than baselines. …
Figure 6: Evidencing Generative Synergy. (a) We probe the isolated performance of Shared Experts. Compared to the baseline trained without generation (Train w/o Gen), our Symbiotic method significantly boosts the semantic density of shared experts. (b) & (c) The training dynamics show that while the control baseline degrades, Symbiotic-MoE effectively reverses the forgetting trend via generative regularization. …
Figure 1: Macro-level routing dynamics of the pre-trained VLM.
Figure 3: Capacity Rate Dynamics of Text, ViT, and VAE.
Figure 4: Cumulative Token Consumption Dynamics. We track the total number of tokens processed by the routing mechanism across 14,000 iterations. The massive influx of tokens across all three modalities highlights the rigorous scale of our co-training phase, providing a solid empirical foundation for cross-modal synergy. …
Figure 5: Extended Qualitative Comparison. We compare the early-stage text-to-image generation capabilities of Symbiotic-MoE against the Standard MoE and MoT baselines. At this stage of training, Standard MoE struggles to form coherent structures due to gradient interference (e.g., the handbag). While MoT isolates these conflicts, it exhibits slower semantic alignment, leading to occasional attribute mismatches (e.g…
read the original abstract

Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformers architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes Symbiotic-MoE, a native multimodal MoE Transformer framework for jointly training image generation and understanding in LMMs. It identifies routing collapse in standard MoE under generative dominance, introduces Modality-Aware Expert Disentanglement (task-specific expert groups plus shared experts as a semantic bridge) so that generative signals can enrich understanding, and adds a Progressive Training Strategy (differential learning rates and early gradient shielding) to stabilize training and convert interference into synergy. The paper claims zero-parameter overhead and substantial gains on understanding benchmarks such as MMLU and OCRBench.

Significance. If the mechanisms function as described and the claimed gains are reproducible, the work would be significant for multimodal model design: it offers a parameter-efficient alternative to structural isolation methods like MoT while preserving cross-modal synergy, potentially enabling more unified generative-understanding models without catastrophic forgetting.

major comments (3)
  1. [Abstract] Abstract: The central claims of 'remarkable gains on MMLU and OCRBench', 'rapid generative convergence', and successful conversion of generative signals into constructive feedback for understanding are stated without any quantitative results, baseline comparisons, ablation studies, or experimental details (e.g., no numbers, tables, or figures referenced). This absence makes it impossible to assess whether Modality-Aware Expert Disentanglement and Progressive Training actually prevent routing collapse or deliver the reported benefits.
  2. [Abstract] Abstract / Proposed Method: The assertion that shared experts act as a 'multimodal semantic bridge' absorbing fine-grained visual semantics from generative tasks relies on an unverified mechanism; no evidence is provided (such as expert utilization histograms, per-expert contribution ablations, gradient cosine similarities, or routing statistics) to confirm that the design transmits constructive signals rather than merely delaying interference or creating new collapse modes after shielding is removed.
  3. [Abstract] Abstract: The claim of 'zero-parameter overhead' is not reconciled with the introduction of new components (task-specific partitioning, shared experts, differential LRs, and gradient shielding); without implementation details or a parameter count comparison to the baseline MoE, it is unclear whether the overhead is truly zero or merely deferred.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. We have revised the manuscript to improve the abstract's informativeness, add explicit references to supporting analyses, and include a parameter comparison table.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claims of 'remarkable gains on MMLU and OCRBench', 'rapid generative convergence', and successful conversion of generative signals into constructive feedback for understanding are stated without any quantitative results, baseline comparisons, ablation studies, or experimental details (e.g., no numbers, tables, or figures referenced). This absence makes it impossible to assess whether Modality-Aware Expert Disentanglement and Progressive Training actually prevent routing collapse or deliver the reported benefits.

    Authors: We agree that the abstract would be strengthened by direct references to quantitative results and supporting materials. The full manuscript presents these in the Experiments section, including baseline comparisons, ablation studies on routing behavior, and performance tables for MMLU and OCRBench. We have revised the abstract to reference the relevant tables and figures that quantify the gains and convergence behavior. revision: yes

  2. Referee: [Abstract] Abstract / Proposed Method: The assertion that shared experts act as a 'multimodal semantic bridge' absorbing fine-grained visual semantics from generative tasks relies on an unverified mechanism; no evidence is provided (such as expert utilization histograms, per-expert contribution ablations, gradient cosine similarities, or routing statistics) to confirm that the design transmits constructive signals rather than merely delaying interference or creating new collapse modes after shielding is removed.

    Authors: The experimental section of the manuscript includes expert utilization histograms, ablation studies isolating the contribution of shared experts, and routing statistics demonstrating improved balance and cross-modal signal flow. These analyses support that the shared experts enable constructive transfer rather than temporary delay. We have updated the method description to explicitly cite these results and added a concise reference in the abstract. revision: yes

  3. Referee: [Abstract] Abstract: The claim of 'zero-parameter overhead' is not reconciled with the introduction of new components (task-specific partitioning, shared experts, differential LRs, and gradient shielding); without implementation details or a parameter count comparison to the baseline MoE, it is unclear whether the overhead is truly zero or merely deferred.

    Authors: Task-specific partitioning and shared experts are implemented by re-grouping the existing expert pool within the original MoE architecture, introducing no additional parameters. Differential learning rates and gradient shielding are purely training-time strategies. We have added an explicit parameter-count comparison table in the revised manuscript showing identical total parameters to the baseline MoE, and clarified this point in the abstract and method section. revision: yes
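
If the regrouping claim is accurate, parameter parity is directly checkable: the disentangled model and the baseline MoE should report identical totals. A minimal sketch of such a check, assuming both models exist as instantiated PyTorch modules (the variable names are hypothetical):

```python
# Minimal parity check: a re-grouped expert pool should add no parameters.
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    # Total parameters, trainable and frozen alike.
    return sum(p.numel() for p in model.parameters())

# symbiotic_moe and baseline_moe are hypothetical instantiated models.
# assert count_params(symbiotic_moe) == count_params(baseline_moe)
```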

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper proposes new architectural elements (Modality-Aware Expert Disentanglement with task-specific and shared experts) and a Progressive Training Strategy (differential learning rates plus early gradient shielding) to address routing collapse and enable cross-modal synergy in multimodal MoE Transformers. These are presented as independent design choices justified by the architecture description and empirical results on benchmarks, without reducing to self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or steps in the provided text equate outputs to inputs by construction; the central claims about shared experts absorbing generative signals remain design assertions rather than tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The approach rests on standard MoE assumptions plus two new design elements introduced to solve routing collapse; no explicit free parameters are named in the abstract.

axioms (2)
  • domain assumption Standard MoE tuning leads to routing collapse dominated by generative gradients.
    Stated as the identified problem that the new design must solve.
  • ad hoc to paper Shared experts can absorb fine-grained visual semantics from generative tasks to enrich textual representations.
    Core assumption of the modality-aware disentanglement mechanism.
invented entities (2)
  • Modality-Aware Expert Disentanglement no independent evidence
    purpose: Partitions experts into task-specific groups while keeping shared experts as multimodal bridge.
    New component proposed to prevent routing collapse and enable synergy.
  • Progressive Training Strategy no independent evidence
    purpose: Uses differential learning rates and early gradient shielding to protect knowledge and turn generative signals constructive.
    New training procedure to optimize the disentangled experts.

pith-pipeline@v0.9.0 · 5535 in / 1293 out tokens · 39100 ms · 2026-05-10T18:19:49.478109+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 34 canonical work pages · 22 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    arXiv preprint arXiv:2505.12043 (2025)

    Chen, J., Tang, Q., Lu, Q., Fang, S.: Mol for llms: Dual-loss optimization to enhance domain expertise while preserving general capabilities. arXiv preprint arXiv:2505.12043 (2025)

  3. [3]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025)

  4. [4]

    MoFO: Momentum-filtered optimizer for mitigating forgetting in LLM fine-tuning

    Chen, Y., Wang, S., Zhang, Y., Lin, Z., Zhang, H., Sun, W., Ding, T., Sun, R.: Mofo: Momentum-filtered optimizer for mitigating forgetting in llm fine-tuning. arXiv preprint arXiv:2407.20999 (2024)

  5. [5]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

  6. [6]

    In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al.: Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1280–1297 (2024)

  7. [7]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

  8. [8]

    Dreamllm: Synergistic multimodal comprehension and creation

    Dong, R., Han, C., Peng, Y., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., et al.: Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499 (2023)

  9. [9]

    In: International conference on machine learning

    Du, N., Huang, Y., Dai, A.M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A.W., Firat, O., et al.: Glam: Efficient scaling of language models with mixture-of-experts. In: International conference on machine learning. pp. 5547–

  10. [10]

    In: Proceedings of the 32nd ACM international conference on multimedia

    Duan, H., Yang, J., Qiao, Y., Fang, X., Chen, L., Liu, Y., Dong, X., Zang, Y., Zhang, P., Wang, J., et al.: Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In: Proceedings of the 32nd ACM international conference on multimedia. pp. 11198–11201 (2024)

  11. [11]

    Journal of Machine Learning Research 23(120), 1–39 (2022)

    Fedus, W., Zoph, B., Shazeer, N.: Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23(120), 1–39 (2022)

  12. [12]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)

  13. [13]

    Llama-adapter v2: Parameter-efficient visual instruction model

    Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., et al.: Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 (2023)

  14. [14]

    Advances in Neural Information Processing Systems36, 52132–52152 (2023)

    Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023)

  15. [15]

    arXiv preprint arXiv:2502.17298 (2025)

    Gu, H., Li, W., Li, L., Zhu, Q., Lee, M., Sun, S., Xue, W., Guo, Y.: Delta decom- pression for moe-based llms compression. arXiv preprint arXiv:2502.17298 (2025)

  16. [16]

    Measuring Massive Multitask Language Understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020)

  17. [17]

    In: Proceedings of the 2021 conference on empirical methods in natural language processing

    Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. In: Proceedings of the 2021 conference on empirical methods in natural language processing. pp. 7514–7528 (2021)

  18. [18]

    Advances in neural information processing systems30(2017)

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems 30 (2017)

  19. [19]

    Iclr1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022)

  20. [20]

    Advances in Neural Information Processing Systems36, 78723–78747 (2023)

    Huang, K., Sun, K., Xie, E., Li, Z., Liu, X.: T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems36, 78723–78747 (2023)

  21. [21]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6700–6709 (2019)

  22. [22]

    Mixtral of Experts

    Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., Casas, D.d.l., Hanna, E.B., Bressand, F., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)

  23. [23]

    In: European conference on computer vision

    Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., Farhadi, A.: A diagram is worth a dozen images. In: European conference on computer vision. pp. 235–251. Springer (2016)

  24. [24]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern recognition

    Kembhavi, A., Seo, M., Schwenk, D., Choi, J., Farhadi, A., Hajishirzi, H.: Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern recognition. pp. 4999–5007 (2017)

  25. [25]

    Advances in Neural Information Processing Systems36, 21487– 21506 (2023)

    Koh, J.Y., Fried, D., Salakhutdinov, R.R.: Generating images with multimodal language models. Advances in Neural Information Processing Systems36, 21487– 21506 (2023)

  26. [26]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., Chen, Z.: Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020)

  27. [27]

    In: Proceedings of the 2023 conference on empirical methods in natural language processing

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 conference on empirical methods in natural language processing. pp. 292–305 (2023)

  28. [28]

    Transactions on Machine Learning Research (2025),https://openreview.net/forum?id=Nu6N69i8SB

    Liang, W., YU, L., Luo, L., Iyer, S., Dong, N., Zhou, C., Ghosh, G., Lewis, M., tau Yih, W., Zettlemoyer, L., Lin, X.V.: Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. Transactions on Machine Learning Research (2025),https://openreview.net/forum?id=Nu6N69i8SB

  29. [29]

    IEEE Transactions on Multimedia (2026)

    Lin, B., Tang, Z., Ye, Y., Huang, J., Zhang, J., Pang, Y., Jin, P., Ning, M., Luo, J., Yuan, L.: Moe-llava: Mixture of experts for large vision-language models. IEEE Transactions on Multimedia (2026)

  30. [30]

    In: European conference on computer vision

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

  31. [31]

    DeepSeek-V3 Technical Report

    Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  32. [32]

    Advances in neural information processing systems36, 34892–34916 (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

  33. [33]

    Science China Information Sciences67(12), 220102 (2024)

    Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences67(12), 220102 (2024)

  34. [34]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., et al.: Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525 (2024)

  35. [35]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Luo, G., Yang, X., Dou, W., Wang, Z., Liu, J., Dai, J., Qiao, Y., Zhu, X.: Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24960–24971 (2025)

  36. [36]

    Coupling experts and routers in mixture-of-experts via an auxiliary loss. arXiv preprint arXiv:2512.23447 (2025)

    Lv, A., Ma, J., Ma, Y., Qiao, S.: Coupling experts and routers in mixture-of-experts via an auxiliary loss. arXiv preprint arXiv:2512.23447 (2025)

  37. [37]

    In: Findings of the association for computational linguistics: ACL 2022

    Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E.: Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In: Findings of the association for computational linguistics: ACL 2022. pp. 2263–2279 (2022)

  38. [38]

    McCloskey, M., Cohen, N.J.: Catastrophic interference in connectionist networks: The sequential learning problem. In: Psychology of learning and motivation, vol. 24, pp. 109–165. Elsevier (1989)

  39. [39]

    In: European Conference on Computer Vision

    McKinzie, B., Gan, Z., Fauconnier, J.P., Dodge, S., Zhang, B., Dufter, P., Shah, D., Du, X., Peng, F., Belyi, A., et al.: Mm1: methods, analysis and insights from multimodal llm pre-training. In: European Conference on Computer Vision. pp. 304–323. Springer (2024)

  40. [40]

    Neural networks113, 54–71 (2019)

    Parisi, G.I., Kemker, R., Part, J.L., Kanan, C., Wermter, S.: Continual lifelong learning with neural networks: A review. Neural networks113, 54–71 (2019)

  41. [41]

    Advances in Neural Information Processing Systems34, 8583–8595 (2021)

    Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Susano Pinto, A., Keysers, D., Houlsby, N.: Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems 34, 8583–8595 (2021)

  42. [42]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

  43. [43]

    Generative pretraining in multimodality

    Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222 (2023)

  44. [44]

    Advances in Neural Information Processing Systems36, 16083– 16099 (2023)

    Tang, Z., Yang, Z., Zhu, C., Zeng, M., Bansal, M.: Any-to-any generation via composable diffusion. Advances in Neural Information Processing Systems 36, 16083–16099 (2023)

  45. [45]

    Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)

  46. [46]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  47. [47]

    Kimi-VL Technical Report

    Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., et al.: Kimi-vl technical report. arXiv preprint arXiv:2504.07491 (2025)

  48. [48]

    Moiie: Mixture of intra- and inter-modality experts for large vision language models. arXiv preprint arXiv:2508.09779 (2025)

    Wang, D., Wang, S., Li, Z., Wang, Y., Li, Y., Tang, D., Shen, X., Huang, X., Wei, Z.: Moiie: Mixture of intra- and inter-modality experts for large vision language models. arXiv preprint arXiv:2508.09779 (2025)

  49. [49]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

  50. [50]

    arXiv preprint arXiv:2511.20520 (2025)

    Wang, X., Zhang, Z., Zhang, H., Lin, Z., Zhou, Y., Liu, Q., Zhang, S., Li, Y., Liu, S., Zheng, H., et al.: Hbridge: H-shape bridging of heterogeneous experts for unified multimodal understanding and generation. arXiv preprint arXiv:2511.20520 (2025)

  51. [51]

    Emu3: Next-Token Prediction is All You Need

    Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)

  52. [52]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., Duan, N.: Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671 (2023)

  53. [53]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., et al.: Janus: Decoupling visual encoding for unified multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 12966–12977 (2025)

  54. [54]

    In: Forty-first International Conference on Machine Learning (2024)

    Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.S.: Next-gpt: Any-to-any multimodal llm. In: Forty-first International Conference on Machine Learning (2024)

  55. [55]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., Li, H.: Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 (2023)

  56. [56]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025)

  57. [57]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528 (2024)

  58. [58]

    Tag-moe: Task-aware gating for unified generative mixture-of-experts. arXiv preprint arXiv:2601.08881 (2026)

    Xu, Y., Yan, H., Cao, J., Cheng, Y., Hang, T., He, R., Yin, Z., Zhang, S., Zhang, Y., Li, J., et al.: Tag-moe: Task-aware gating for unified generative mixture-of-experts. arXiv preprint arXiv:2601.08881 (2026)

  59. [59]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  60. [60]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

  61. [61]

    Mm1.5: Methods, analysis & insights from multimodal llm fine-tuning

    Zhang, H., Gao, M., Gan, Z., Dufter, P., Wenzel, N., Huang, F., Shah, D., Du, X., Zhang, B., Li, Y., et al.: Mm1.5: Methods, analysis & insights from multimodal llm fine-tuning. arXiv preprint arXiv:2409.20566 (2024)

  62. [62]

    arXiv preprint arXiv:2511.12113 (2025)

    Zhang, L., Xie, Y., Fang, F., Dong, F., Liu, R., Cao, Y.: Metagdpo: Alleviating catastrophic forgetting with metacognitive knowledge through group direct preference optimization. arXiv preprint arXiv:2511.12113 (2025)

  63. [63]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3836–3847 (2023)

  64. [64]

    arXiv e-prints pp

    Zhang, Z., Dong, Q., Zhang, Q., Zhao, J., Zhou, E., Xi, Z., Jin, S., Fan, X., Zhou, Y., Fu, Y., et al.: Reinforcement fine-tuning enables mllms learning novel tasks stably. arXiv e-prints pp. arXiv–2506 (2025)

  65. [65]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039 (2024)

  66. [66]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., Fedus, W.: St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906 (2022)