Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Jianmin Chen; Jiaqi Tang; Mengjie Zhao; Qifeng Chen; Qingfa Xiao; Runtao Liu; Wei Wei; Xiangyu Wu; Youyang Zhai

arxiv: 2606.08063 · v1 · pith:7Z34D7HMnew · submitted 2026-06-06 · 💻 cs.CV · cs.AI · cs.CL


Pith reviewed 2026-06-27 20:04 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords multimodal large language models · visual self-recovery · robustness to corruptions · reinforcement learning · image reconstruction · visual question answering · adversarial robustness


The pith

MLLMs can self-recover corrupted visual content to achieve robust understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether multimodal large language models can restore corrupted images by themselves and shows that doing so improves their reasoning on degraded inputs. It introduces a three-stage process: supervised fine-tuning to reconstruct images, reinforcement learning that rewards both pixel-level similarity and semantic similarity, and joint reasoning that takes both the original corrupted image and the recovered version as input. Experiments report state-of-the-art results on real-world corruption benchmarks and better performance under adversarial corruptions on visual question answering tasks. A sympathetic reader would care because everyday images often contain noise, blur, or other degradations, and the work suggests models can internally fix these issues rather than relying on separate preprocessing or text-only workarounds.
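The inference-time flow described above (recover first, then reason over both views) can be sketched as follows. This is a minimal illustration, not the paper's code: `JointInput`, `answer_with_self_recovery`, `toy_recover`, and `toy_reason` are hypothetical names, and the two callables stand in for the model's generation and understanding passes.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a grayscale image is a list of rows of floats in [0, 1].
Image = List[List[float]]


@dataclass
class JointInput:
    """Everything the reasoning step sees: both views plus the question."""
    corrupted: Image
    recovered: Image
    question: str


def answer_with_self_recovery(
    corrupted: Image,
    question: str,
    recover: Callable[[Image], Image],
    reason: Callable[[JointInput], str],
) -> str:
    """Recover the image, then reason over the corrupted and recovered views together."""
    recovered = recover(corrupted)
    return reason(JointInput(corrupted, recovered, question))


# Toy stand-ins so the sketch runs end to end (hypothetical, not real models).
def toy_recover(img: Image) -> Image:
    # "Recovery" here only clamps out-of-range pixels back into [0, 1].
    return [[min(1.0, max(0.0, p)) for p in row] for row in img]


def toy_reason(x: JointInput) -> str:
    return f"{x.question} | corrupted={x.corrupted} | recovered={x.recovered}"


if __name__ == "__main__":
    print(answer_with_self_recovery([[1.5, -0.2]], "what?", toy_recover, toy_reason))
```

Keeping the corrupted image in the reasoning input, rather than discarding it after recovery, mirrors the joint design the summary attributes to the third stage.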

Core claim

Robust-U1 equips MLLMs with explicit visual self-recovery through supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards of SSIM and CLIP similarity for high visual quality, and multimodal reasoning that jointly considers the corrupted input and the recovered image; experiments confirm that high-quality visual recovery directly enhances reasoning performance.

What carries the argument

Robust-U1 framework with supervised fine-tuning for reconstruction, dual-reward reinforcement learning using SSIM and CLIP, and multimodal reasoning over both corrupted and recovered images.
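A dual reward of this kind can be sketched in a few lines. The sketch below rests on loud assumptions: the convex combination with weight `lam` is hypothetical (the summary does not state how the two terms are balanced), SSIM is computed over a single global window rather than the local patches the paper's Figure 3 describes, and the embeddings are plain vectors passed in by the caller rather than outputs of a real CLIP model.

```python
import math
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 1.0) -> float:
    """Single-window SSIM over two equal-length pixel sequences (a simplification)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) * (a - mx) for a in x) / n
    vy = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Standard stabilizing constants with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx * mx + my * my + c1) * (vx + vy + c2))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def dual_reward(recovered, clean, emb_recovered, emb_clean, lam: float = 0.5) -> float:
    """Assumed form: lam * pixel-level reward + (1 - lam) * semantic reward."""
    r_pix = global_ssim(recovered, clean)
    r_sem = cosine_similarity(emb_recovered, emb_clean)
    return lam * r_pix + (1.0 - lam) * r_sem


if __name__ == "__main__":
    clean = [0.1, 0.5, 0.9, 0.3]
    noisy = [0.2, 0.4, 0.7, 0.5]
    print(dual_reward(clean, clean, [1.0, 0.0], [1.0, 0.0]))
    print(dual_reward(noisy, clean, [1.0, 0.2], [1.0, 0.0]))
```

A perfect recovery scores highest on both terms, so any drop in the combined value can be traced to either the pixel term or the semantic term by inspecting them separately.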

If this is right

State-of-the-art robustness on the real-world corruption benchmark.
Superior performance under adversarial corruptions on general VQA benchmarks.
High-quality visual recovery directly enhances reasoning performance.
Self-recovery functions as a critical mechanism for robust visual understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the recovery step generalizes beyond the tested corruptions, models could handle novel degradations without additional fine-tuning.
Joint use of corrupted and recovered images may let models learn to identify and compensate for specific damage patterns.
The approach could extend to other input modalities such as video frames or audio signals that suffer analogous degradations.

Load-bearing premise

The dual-reward reinforcement learning stage produces image recoveries that genuinely improve downstream reasoning rather than merely optimizing the chosen SSIM and CLIP metrics.

What would settle it

An ablation showing that models given only the original corrupted images achieve the same or higher accuracy on the corruption and VQA benchmarks as models given the recovered images.
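Such an ablation reduces to comparing accuracy across input conditions while holding everything else fixed. A minimal sketch, assuming multiple-choice predictions are already available per condition; the condition names and the toy predictions are hypothetical and carry no relation to the paper's numbers.

```python
from typing import Dict, List


def accuracy(preds: List[str], labels: List[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and aligned")
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


def ablation_table(preds_by_condition: Dict[str, List[str]], labels: List[str]) -> Dict[str, float]:
    """Accuracy for each input condition, evaluated on the same questions."""
    return {name: accuracy(preds, labels) for name, preds in preds_by_condition.items()}


def recovery_gain(table: Dict[str, float], treatment: str, control: str) -> float:
    """Accuracy difference attributable to swapping the control input for the treatment."""
    return table[treatment] - table[control]


if __name__ == "__main__":
    labels = ["A", "B", "C", "D"]
    # Toy predictions, made up purely to exercise the code.
    preds = {
        "corrupted_only": ["A", "C", "C", "A"],
        "corrupted_plus_recovered": ["A", "B", "C", "A"],
    }
    table = ablation_table(preds, labels)
    print(table, recovery_gain(table, "corrupted_plus_recovered", "corrupted_only"))
```

If the gain is zero or negative on the real benchmarks, the recovered image is not what drives the improvement.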

Figures

Figures reproduced from arXiv: 2606.08063 by Jianmin Chen, Jiaqi Tang, Mengjie Zhao, Qifeng Chen, Qingfa Xiao, Runtao Liu, Wei Wei, Xiangyu Wu, Youyang Zhai.

**Figure 1.** Comparison of robustness enhancement paradigms. (A) Implicit Adaptation: Black-box feature alignment within the visual encoder. (B) Text-based Reasoning: White-box textual chain describing corruption impacts. (C) Our Robust-U1 (Self-Recovering): Direct visual self-recovery and multimodal reasoning over both corrupted and recovered images.

**Figure 2.** Overview of the three-stage Robust-U1 framework. Stage I: Supervised Fine-Tuning adapts the unified MLLM to recover clean images from corrupted inputs using a rectified-flow loss. Stage II: Reinforcement Learning with dual rewards further enhances the quality of the recovered images via Flow-GRPO (Liu et al., 2025b). Stage III: Multimodal Reasoning trains the model to answer questions by jointly analyzing …

**Figure 3.** Schematic of the dual-reward mechanism used in the reinforcement learning stage. (A) Pixel-Level Structural Reward: Computes the SSIM index by comparing local patches (luminance, contrast, structure) between the recovered image $I_r$ and the ground-truth clean image $I_o$. (B) Semantic Consistency Reward: Utilizes a frozen TinyCLIP (Wu et al., 2023) model to extract image embeddings. The reward is derived from …

**Figure 4.** Visual comparison of recovered images across different training stages.

**Figure 5.** Visual validation of $R_{pix}$. Compared with ours (green), the variant without $R_{pix}$ may produce more pixel-level artifacts (red).

**Figure 6.** More visual comparison of recovered images across different baselines.

Original abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. While existing robustness enhancement approaches exist, they are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
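The paper's Figure 2 names Flow-GRPO for the reinforcement learning stage. GRPO-style methods typically sample a group of outputs for the same input and normalize each reward against the group's mean and standard deviation. The sketch below shows only that normalization step; it is a generic illustration under that assumption, not the authors' training code, and `group_advantages` and `eps` are hypothetical names.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Toy rewards for four candidate recoveries of the same corrupted image.
    print(group_advantages([0.62, 0.70, 0.81, 0.55]))
```

Candidates that beat the group average receive positive advantages and the rest negative ones, so the update depends on relative rather than absolute reward.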

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper adds a three-stage self-recovery pipeline with dual-reward RL to MLLMs and reports SOTA robustness numbers, but the causal role of the recovered images is not isolated.


The main point is that Robust-U1 trains an MLLM to reconstruct corrupted images first via supervised fine-tuning, then refines the output with reinforcement learning that balances SSIM pixel reward and CLIP semantic reward, and finally runs reasoning on both the original corrupted image and the recovered one. This setup produces state-of-the-art results on a real-world corruption benchmark and holds up under adversarial cases on standard VQA tasks.

What is new is the specific combination of internal visual self-recovery inside an MLLM using that dual-reward RL stage before joint reasoning. The code is released, which is useful for anyone wanting to reproduce or extend the pipeline.

The work does a reasonable job demonstrating that an MLLM can be made to produce its own recovered visuals and that doing so lines up with better downstream performance in their reported experiments. The choice of SSIM plus CLIP as rewards is a straightforward way to target both low-level fidelity and higher-level content.

The soft spot is that the central claim about recovery driving the gains rests on thin evidence for causality. The reasoning stage feeds both the corrupted input and the recovered image together, so lifts could come from the joint prompt format, the supervised stage alone, or even metric overfitting rather than genuine visual restoration. The abstract mentions an analysis confirming the benefit, but no explicit ablations are described that would rule out the alternatives, such as feeding a dummy or SFT-only image instead. Details on reward weighting, controls, and error bars are also missing from the summary, which makes the internal validity harder to judge.

This is for researchers working on robustness in multimodal models, especially those already using RL or recovery techniques in computer vision. A reader focused on practical fixes for corrupted inputs in VQA would get value from the pipeline description and the public code. The concrete method and code release are enough to merit a serious referee who can check the ablations and run the numbers.

I would send it to peer review.

Referee Report

2 major / 1 minor

Summary. The paper proposes Robust-U1, a three-stage framework equipping MLLMs with self-recovery capability for corrupted images: supervised fine-tuning for initial reconstruction, reinforcement learning using dual rewards (pixel-level SSIM and semantic-level CLIP similarity), and multimodal reasoning that jointly processes the corrupted input and recovered image. It claims this yields state-of-the-art robustness on real-world corruption benchmarks and superior performance under adversarial corruptions on VQA tasks, with analysis showing that high-quality visual recovery directly enhances reasoning.

Significance. If validated, the result would establish an interpretable pixel- and semantic-level self-recovery mechanism as a critical component for MLLM robustness, going beyond black-box feature alignment or text-only reasoning; the public source code link supports reproducibility.

major comments (2)

[Abstract / multimodal reasoning stage] The multimodal reasoning stage jointly feeds both corrupted and recovered images, yet no ablation is described (e.g., substituting RL output with original corruption, a constant image, or SFT-only output) to isolate whether reported reasoning gains are causally driven by the RL-recovered content rather than the joint prompt format or prior stages; this directly affects the central claim that recovery enhances reasoning.
[Reinforcement learning stage] No quantitative details are given on the relative weighting between SSIM and CLIP rewards in the RL stage, nor on ablation controls or error bars for the SOTA results; this leaves the internal validity of the experimental outcomes unverifiable from the reported high-level outcomes.

minor comments (1)

The abstract states that source code is available; ensure the repository includes exact hyperparameter values for the dual-reward weighting and the full set of benchmark numbers with error bars.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the work's potential significance. We address each major comment below with plans for targeted revisions to improve clarity and experimental rigor.


Referee: [Abstract / multimodal reasoning stage] The multimodal reasoning stage jointly feeds both corrupted and recovered images, yet no ablation is described (e.g., substituting RL output with original corruption, a constant image, or SFT-only output) to isolate whether reported reasoning gains are causally driven by the RL-recovered content rather than the joint prompt format or prior stages; this directly affects the central claim that recovery enhances reasoning.

Authors: We agree that the absence of these ablations limits the strength of the causal claim. In the revised version we will add a dedicated ablation subsection reporting performance when the RL-recovered image is replaced by (i) the original corrupted input, (ii) a constant image, and (iii) the SFT-stage output only, while keeping the joint-prompt format fixed. These results will directly quantify the incremental benefit attributable to the RL-recovered content. revision: yes

Referee: [Reinforcement learning stage] No quantitative details are given on the relative weighting between SSIM and CLIP rewards in the RL stage, nor on ablation controls or error bars for the SOTA results; this leaves the internal validity of the experimental outcomes unverifiable from the reported high-level outcomes.

Authors: We will insert the exact weighting coefficient (λ) used to combine the SSIM and CLIP terms in the composite reward, together with an ablation table that isolates each reward component. For the main SOTA tables we will also report standard deviations computed over three independent training seeds. These additions will be placed in the experimental section and appendix. revision: yes
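Reporting a standard deviation over three seeds, as promised in the response above, amounts to a small aggregation step. A minimal sketch, assuming one scalar score per seed; the function name and the toy scores are hypothetical and are not results from the paper.

```python
import statistics
from typing import List, Tuple


def summarize_seeds(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) across independent training seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


def format_entry(scores: List[float]) -> str:
    """Render a table cell in 'mean ± std' form."""
    mean, std = summarize_seeds(scores)
    return f"{mean:.4f} ± {std:.4f}"


if __name__ == "__main__":
    # Made-up scores for three seeds, used only to exercise the code.
    print(format_entry([0.70, 0.72, 0.74]))
```

With only three seeds the sample standard deviation is itself a noisy estimate, so it is best read as a rough indication of run-to-run spread.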

Circularity Check

0 steps flagged

No circularity; empirical method with external benchmarks


The paper presents an empirical three-stage framework (SFT for reconstruction, RL with SSIM+CLIP dual rewards, joint multimodal reasoning on corrupted+recovered inputs) whose claims rest on experimental results against standard external benchmarks and metrics. No derivation chain, equations, or self-referential definitions reduce any result to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear; the central claim that recovery enhances reasoning is presented as an experimental observation rather than a forced consequence of fitted parameters or prior author work. The work is self-contained against independent evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Based solely on the abstract, the framework introduces no new physical entities or unstated mathematical axioms; the dual-reward formulation implicitly treats the relative weighting of SSIM and CLIP as a tunable modeling choice.

free parameters (1)

relative weighting of SSIM and CLIP rewards
The abstract describes dual rewards but does not specify how the two signals are balanced during RL; this balance is a free parameter that affects the recovered image quality.
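Why the balance matters can be seen with two hypothetical candidate recoveries whose ranking flips as the weight changes. The convex weighting `lam` and all scores below are illustrative assumptions, not values from the paper.

```python
from typing import Dict, Tuple


def combined(ssim: float, clip: float, lam: float) -> float:
    """Assumed composite reward: lam * SSIM + (1 - lam) * CLIP similarity."""
    return lam * ssim + (1.0 - lam) * clip


def preferred(candidates: Dict[str, Tuple[float, float]], lam: float) -> str:
    """Name of the candidate with the highest composite reward at this weight."""
    return max(candidates, key=lambda name: combined(*candidates[name], lam))


if __name__ == "__main__":
    # (SSIM, CLIP similarity) pairs, made up for illustration.
    candidates = {
        "structurally_close": (0.90, 0.60),
        "semantically_close": (0.70, 0.85),
    }
    for lam in (0.2, 0.8):
        print(lam, preferred(candidates, lam))
```

Because the preferred recovery depends on the weight, reporting its value and a sweep over it would make the reward design easier to audit.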



Reference graph

Works this paper leans on

15 extracted references · 3 linked inside Pith

[1] Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., et al. Are we on the right way for evaluating large vision-language models? In NeurIPS, 2024a. Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al. Internvl: Scaling up vision foundation models and aligning fo...

[2] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., and Fan, H. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683,

[3] Hu, C., Chen, X., Jia, Z., Shi, W., Zhang, F., Guo, J., and Wei, Y. A semantic decoupling-based two-stage rainy-day attack for revealing weather robustness deficiencies in vision-language models. arXiv preprint arXiv:2601.13238,

[4] Liu, J., Jia, Z., Li, J., Li, B., Jin, X., Zeng, W., and Lu, Y. When mllms meet compression distortion: A coding paradigm tailored to mllms. arXiv preprint arXiv:2509.24258, 2025a. Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.054...

[5] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

[6] Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., and Yu, X. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362,
[7]

Summary of Appendix. This appendix is organized into eight sections, ordered from implementation details to broader discussion: • Appendix A – Implementation Details. Per-stage training cost (GPU type, time, memory, trainable parameters) and evaluation protocols on R-Ben...

2025
[8]

(750k pairs) | Generation only
Stage II (RL) | 8 | ∼20 h | 160 | 41 GB | Robust-R1 (Tang et al., 2026a) training split | Generation only
Stage III (Reasoning) | 8 | ∼8 h | 64 | 43 GB | Robust-R1 (Tang et al., 2026a) reasoning data | Understanding+Generation

A.2. Evaluation Protocol on R-Bench. To rigorously assess the performance of our model on R-Bench (Li et al., 2024), we implement dive...

2024
[9]

The aggregate performance is represented by the mean score: $\mathrm{Score} = \frac{1}{N} \sum_{i=1}^{N} s_i$ (12), where $s_i$ denotes the scoring result assigned by GPT-3.5-turbo for the $i$-th sample

as a proxy evaluator to quantify the semantic alignment between model-generated responses and reference answers. The aggregate performance is represented by the mean score: $\mathrm{Score} = \frac{1}{N} \sum_{i=1}^{N} s_i$ (12), where $s_i$ denotes the scoring result assigned by GPT-3.5-turbo for the $i$-th sample. The evaluation framework focuses on three critical dimensions: • Completene...

2025
[10]

restoration → understanding

is utilized to parse the model’s output and identify the intended choice label (e.g., A, B, C, D), which is then compared against the ground-truth label to determine the final accuracy.

B. Extended Quantitative Comparisons. This section reports two extended comparisons that situate Robust-U1 against alternative pipelines: (i) using state-of-the-art extern...

2025
[11]

0.7398 for Robust-U1

As reported in Table 8, all external-restoration variants underperform Robust-U1 by a large margin in overall score, with the best baseline (all-in-one (Tian et al., 2025)) reaching only 0.5511 vs. 0.7398 for Robust-U1. Two factors explain this gap. First, specialized modules (deblurring, denoising, dehazing) require knowing the degradation type and tend ...

2025
[12]

Rec. Mem

0.6765 | 0.6584 | 0.5671 | 0.4466 | 0.3782 | 0.3371 | 0.6938 | 0.5914 | 0.4845 | 0.5371
Robust-U1 (Ours) | 0.7353 | 0.7329 | 0.6768 | 0.7067 | 0.7164 | 0.6934 | 0.8272 | 0.8059 | 0.7640 | 0.7398

B.2. Inference Cost and the Detect-then-Recover Variant. We extend the computation-cost analysis in Section 4.3 to inference. We compare three deployment modes on R-Bench using Qwen2.5-VL-7B-class hardw...

2024
[13]

as the frozen encoder. To verify that our conclusions are not tied to this specific encoder, we replace TinyCLIP with three alternatives of different scales and architectures: CLIP-B/16 (Radford et al., 2021), SigLIP-B/16 (Zhai et al., 2023), and a heavily distilled, weaker CLIP (Radford et al.,

2021
[14]

Table 18. Sensitivity of Robust-U1’s scores on R-Bench to the choice of LLM-based evaluator

and GPT-4o (Hurst et al., 2024).

Table 18. Sensitivity of Robust-U1’s scores on R-Bench to the choice of LLM-based evaluator.

| Evaluator | MCQ low | MCQ mid | MCQ high | VQA low | VQA mid | VQA high | CAP low | CAP mid | CAP high | Overall |
| GPT-3.5-turbo (default) | 0.7353 | 0.7329 | 0.6768 | 0.7067 | 0.7164 | 0.6934 | 0.8272 | 0.8069 | 0.7640 | 0.7398 |
Qwen3-Max (Qwen Team,

2024
[15]

Limitations and Future Work H.1

5.6% 2.1% 10.1% 4.2%
Robust-U1 (Ours) 92.3% 85.7%

H. Limitations and Future Work

H.1. Limitations

While Robust-U1 demonstrates promising results in enhancing the robustness of Multimodal Large Language Models through visual self-recovery, our work has several limitations that warrant discussion and motivate the future directions discussed in Section H.2. Re...

2020
