pith. sign in

arxiv: 2606.02578 · v1 · pith:OHHROBU7new · submitted 2026-06-01 · 💻 cs.CV · cs.AI

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Pith reviewed 2026-06-28 15:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords perceptual judgment biasmultimodal LLM-as-a-judgeperceptual perturbationcounterfactual datasetreward modelingGRPObatch rankinghuman alignment
0
0 comments X

The pith

Multimodal LLM judges exhibit bias toward text over visual evidence, which training on perturbed counterfactuals and combined rewards can reduce.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that when visual evidence conflicts with text, multimodal LLM judges tend to reward plausible text narratives instead of perceptually correct answers, a bias termed Perceptual Judgment Bias. It addresses this by building the Perceptually Perturbed Judgment Dataset of minimally edited responses that isolate perceptual errors for supervision. A training framework then applies a GRPO-based reward alongside a batch-ranking objective to enforce coherent ordering. This results in judges with higher perceptual fidelity and better human alignment on benchmarks. Readers should care because trustworthy automated evaluation of multimodal content requires judges that ground judgments in actual visual input rather than text alone.

Core claim

Perceptual Judgment Bias arises when MLLM judges anchor on response text rather than their visual perception during conflicts, and this can be mitigated by constructing the Perceptually Perturbed Judgment Dataset of minimally edited counterfactual responses that enable verifiable supervision, then applying a unified training framework with structured GRPO-based reward and batch-ranking objective to achieve improved perceptual fidelity, ranking coherence, and alignment with human evaluation.

What carries the argument

The Perceptually Perturbed Judgment Dataset, which creates minimally edited counterfactual responses to isolate perceptual errors, combined with the training framework using a GRPO-based reward and batch-ranking objective for global ordering without pairwise labels.

If this is right

  • Improves perceptual fidelity on diverse MLLM-as-a-Judge benchmarks.
  • Achieves coherent global ordering of responses without explicit pairwise labels.
  • Increases alignment between automated judgments and human evaluations.
  • Establishes a scalable pathway for training perceptually grounded multimodal judges.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The perturbation method could be adapted to detect and correct other reasoning biases in MLLMs, such as those involving temporal or spatial inconsistencies.
  • Applying the same dataset construction to new MLLM models might reveal whether the bias is inherent to current architectures or training data.
  • Success here implies that reward modeling can be used more broadly to enforce consistency between multimodal inputs and model outputs.

Load-bearing premise

Minimally edited counterfactual responses in the dataset truly isolate only perceptual errors without introducing other unintended biases or changes.

What would settle it

If retraining on the dataset does not increase the rate at which the MLLM judge selects the visually accurate response over a text-plausible but visually incorrect one in controlled conflict tests, the claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.02578 by Hyunjung Shim, Jaeyo Shin, Jiho Choi, Junyong Kang, Seojeong Park, Seonho Lee.

Figure 1
Figure 1. Figure 1: Perceptual judgment bias in MLLM judges. (a) When perceptual capability is insufficient, a judge may produce incorrect visual descriptions (a2) and assign high scores (a3) to perceptually wrong responses (a2). (b) Even when the judge’s own perception aligns with humans (b2), it may still prefer (b5) visually inconsistent responses (b3) compared to the response with correct perception (b4). We introduce Per… view at source ↗
Figure 2
Figure 2. Figure 2: (Left) Accuracy of distinguishing the corrected response rc from a response with both perceptual and reasoning perturba￾tions (rrp+r ). Baseline MLLM maintains high accuracy. (Right) Accuracy of distinguishing rc from a response with only percep￾tual perturbations (rrp ). Accuracy consistently drops across mod￾els, revealing their sensitivity to perceptual inconsistencies. Our Perception-Judge-Flex-7B impr… view at source ↗
Figure 3
Figure 3. Figure 3: Pipeline Overview. Using perceptual perturbations, we construct the Perceptually Perturbed Judgment Dataset (PPJD). For each correct response, PPJD generates two perturbed variants. The first response, rrp , is produced by altering visually grounded attributes while preserving the original structure. The second response, rrp+r , is created by additionally degrading the original reasoning, resulting in a fu… view at source ↗
Figure 4
Figure 4. Figure 4: A qualitative comparison between our Perception￾Judge and the baseline (Flex-Judge and Qwen3-VL-4B￾Thinking). While the baseline gives a high score to a response that fails to incorporate the visual content, our model interprets the relevant visual cue into its reasoning and penalizes their absence in the final score. For PPJD, we construct preference pairs as (rc, rrp ) and (rrp , rrp+r ), and apply the s… view at source ↗
read the original abstract

Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Through controlled visual perturbations, existing multimodal judges frequently anchor on the response text instead of their own visual perception, leading to inconsistent and non-verifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving coherent global ordering without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to visual-reasoning conflicts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies Perceptual Judgment Bias in MLLM-as-a-Judge settings, where models favor plausible textual narratives over conflicting visual evidence. It introduces the Perceptually Perturbed Judgment Dataset, constructed via minimally edited counterfactual responses intended to isolate perceptual errors and supply verifiable supervision. A training framework is proposed that combines a GRPO-based reward model with a batch-ranking objective to achieve coherent global ordering. Experiments on diverse MLLM-as-a-Judge benchmarks are reported to show gains in perceptual fidelity, ranking coherence, and human alignment.

Significance. If the central claims hold, the work offers a concrete pathway to improve the reliability of multimodal judges in settings where visual-textual conflicts arise, which is relevant for automated evaluation pipelines. The dataset construction and structured reward approach could generalize beyond the reported benchmarks if the isolation of perceptual signals is robust.

major comments (2)
  1. [Dataset construction] Dataset construction section: the central claim that the Perceptually Perturbed Judgment Dataset supplies verifiable supervision rests on the assertion that minimally edited counterfactual responses isolate perceptual errors without textual confounds. The manuscript does not provide explicit criteria for edit selection, a definition of 'minimal,' or controls demonstrating that narrative coherence and textual cues remain constant; without these, the supervision signal may be entangled and the attribution of benchmark gains to perceptual grounding cannot be verified.
  2. [Experiments] Experiments section: the abstract states that the approach 'substantially improves' perceptual fidelity, ranking coherence, and human alignment, yet no quantitative tables, baseline comparisons, ablation results, or error analysis are referenced in a manner that allows assessment of effect sizes or statistical significance. This leaves the empirical support for the training framework load-bearing claim unverified.
minor comments (2)
  1. [Method] The GRPO reward formulation and batch-ranking objective are introduced without an accompanying equation or pseudocode block; adding these would clarify how the global ordering is achieved without pairwise labels.
  2. [Method] Notation for the reward model components is introduced in the abstract but not consistently defined on first use in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our work on Perceptual Judgment Bias. We respond point-by-point below and will make targeted revisions to improve rigor and verifiability.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the central claim that the Perceptually Perturbed Judgment Dataset supplies verifiable supervision rests on the assertion that minimally edited counterfactual responses isolate perceptual errors without textual confounds. The manuscript does not provide explicit criteria for edit selection, a definition of 'minimal,' or controls demonstrating that narrative coherence and textual cues remain constant; without these, the supervision signal may be entangled and the attribution of benchmark gains to perceptual grounding cannot be verified.

    Authors: We agree the manuscript would benefit from greater explicitness here. In revision we will expand the Dataset Construction section with: (1) formal selection criteria (e.g., edits restricted to perceptual attributes while preserving lexical overlap and narrative flow), (2) a precise definition of 'minimal' (token-edit distance < 5% and semantic similarity > 0.95 via sentence embeddings), and (3) controls consisting of both automated checks and human verification that textual cues and coherence remain unchanged. These additions will directly support the claim of isolated perceptual supervision. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract states that the approach 'substantially improves' perceptual fidelity, ranking coherence, and human alignment, yet no quantitative tables, baseline comparisons, ablation results, or error analysis are referenced in a manner that allows assessment of effect sizes or statistical significance. This leaves the empirical support for the training framework load-bearing claim unverified.

    Authors: The experiments section already contains quantitative tables reporting baseline comparisons, GRPO and batch-ranking ablations, error breakdowns, and human-alignment metrics across benchmarks. To make these results immediately accessible, we will revise the abstract to cite specific effect sizes and statistical significance from the tables, add a consolidated results summary, and ensure all comparisons and ablations are cross-referenced with p-values and confidence intervals in the main text. revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on new dataset and empirical evaluation

full rationale

The paper presents no equations, derivations, or fitted parameters that reduce to prior results by construction. It introduces a newly constructed Perceptually Perturbed Judgment Dataset and a GRPO-based training framework, then reports benchmark improvements from experiments. No self-citation chains, ansatzes smuggled via citation, or renaming of known results appear in the provided text. The central claims are supported by the dataset construction and external benchmark comparisons rather than internal self-definition or statistical forcing, making the derivation self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; all fields left empty.

pith-pipeline@v0.9.1-grok · 5747 in / 1055 out tokens · 20395 ms · 2026-06-28T15:04:45.790988+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references

  1. [1]

    Unlike prior studies that primarily enhance perceptual grounding within the reasoning process, our work focuses on elevating perception itself in the context of multimodal judgment

    proposes perception tokens, which embed intrinsic image features as reasoning tokens to enhance visual reason- ing. Unlike prior studies that primarily enhance perceptual grounding within the reasoning process, our work focuses on elevating perception itself in the context of multimodal judgment. C. Experimental Details We provide additional experimental ...

  2. [2]

    is_correct

    for reinforcement learning with verifiable rewards. The code is available at the GitHub repository ( https:// github.com/kaist-cvml/perception-judge). C.2. Implementation Details & Hyperparameters Perception-Judge is initialized from two models: Flex-Judge- VL and Qwen3-VL-Thinking. Both models are optimized using Group Relative Policy Optimization (GRPO)...

  3. [3]

    6 objects

    to extract a small set of perceptual attributes that are both visually grounded, relevant to the instruction, and not explicitly stated in the text instruction itself, ensuring that they reflect information genuinely observed in the image. Concretely, we prompt the model to list at most six atomic attributes such as object counts, shapes, colors, spatial ...

  4. [4]

    Type 1: Keep the logical structure of the CHOSEN answer, but slightly alter perception- related details (as PERCEPTION_ATTRS) so they contradict the image

  5. [5]

    rejected_1

    Type 2: Alter perception-related details AND introduce logical errors in the reasoning process. Each perturbation MUST: - Follow CHOSEN's exact format - Include the complete reasoning with modified - details - Result in a DIFFERENT final answer than - ANSWER_GT Output ONLY this JSON (no markdown, no extra text): {{ "rejected_1": "<full perturbed text for ...

  6. [6]

    For Qwen3-VL-8B-Thinking, we employ full fine-tuning to update all parameters, allow- ing for more comprehensive adaptation of its reasoning ca- pabilities

    based fine-tuning to Qwen2.5-VL-32B-Instruct, we train an additional LoRA adapter of rank 32 while keep- ing the original Flex Judge LoRA modules fixed, yielding Perception-Judge-Flex-32B. For Qwen3-VL-8B-Thinking, we employ full fine-tuning to update all parameters, allow- ing for more comprehensive adaptation of its reasoning ca- pabilities. Although ou...

  7. [7]

    Human Score: 5 / 5 Qwen3-4B-Thinking So, let’s check the assistant’s answer

    So the answer is 21.1 billion US dollars. Human Score: 5 / 5 Qwen3-4B-Thinking So, let’s check the assistant’s answer. The question asks for the retail sales value in 2010. The chart is a bar graph with years on the x-axis and retail sales value in million USD on the y-axis. The 2010 bar has a value of 21.1. Wait, the y-axis labels say ¨Retail sales value...

  8. [8]

    **Relevance and Accuracy**: The image does not contain any informa- tion about the fuel efficiency of a city bus, which is the core requirement of the question. Therefore, the answer provided by Assistant is accurate in the sense that the image does not provide the necessary data, but it might be more accurate to state that the image does not provide the ...

  9. [9]

    **Relevance**: The image does show a city street with various vehicles, which might be relevant to the context of the question, but it does not provide the specific data needed to answer the question about the bus’s fuel efficiency

  10. [10]

    **Accuracy**: The statement that the image does not provide the necessary information is accurate, but it might be more accurate to specify that the image does not provide the specific value (miles per gallon) asked for in the question. Given the lack of specific data in the image, the most accurate and relevant answer would be to state that the image doe...

  11. [11]

    The image shows a typical urban street scene with various vehicles, but it does not include any information relevant to the question asked

    **Relevance and Accuracy**: The answer is accurate in its assess- ment that the image does not provide any specific data about the fuel efficiency of city buses. The image shows a typical urban street scene with various vehicles, but it does not include any information relevant to the question asked. This is a relevant observation because the question spe...

  12. [12]

    **Accuracy in Context**: The answer correctly pointed out that the image does not include any data on bus fuel efficiency, which is a relevant detail. It also correctly noted that the image includes a variety of vehicles, such as cars and a bicycle, which might be relevant to a broader discussion on urban transportation, but these do not pertain to the sp...

  13. [13]

    The question specifically asked for the miles per gallon of an average city bus, and the image did not provide any such data

    **Relevance to Question**: The relevance of the answer to the question asked is clear. The question specifically asked for the miles per gallon of an average city bus, and the image did not provide any such data. Therefore, the answer is accurate in stating that the image does not provide the necessary information to answer the question. Given these consi...

  14. [14]

    According to the table, Canada had 8 Nobel Prize winners in the 1990s, and Italy had 1

    **Relevance and Accuracy**: The answer correctly identifies the number of Nobel Prize winners in the 1990s for both Canada and Italy. According to the table, Canada had 8 Nobel Prize winners in the 1990s, and Italy had 1. The difference between these numbers is indeed 8 - 1 = 7, which means Canada had 7 more Nobel Prize winners than Italy in the 1990s. Th...

  15. [15]

    The difference between the number of Nobel Prize winners in the 1990s for Canada (8) and Italy (1) is 8 - 1 =

    **Calculation**: The calculation provided by Assistant is correct. The difference between the number of Nobel Prize winners in the 1990s for Canada (8) and Italy (1) is 8 - 1 =

  16. [16]

    This means Canada had 7 more Nobel Prize winners than Italy in the 1990s, which aligns with the data in the table

  17. [17]

    In the 1990s, Canada had 7 Nobel Prize winners, while Italy had 3. Therefore, Canada had 4 more Nobel Prize winners than Italy in the 1990s

    **Clarity and Precision**: The answer is clear and precise in its statement, "In the 1990s, Canada had 7 Nobel Prize winners, while Italy had 3. Therefore, Canada had 4 more Nobel Prize winners than Italy in the 1990s." This statement is accurate and directly answers the question asked, which is to find the difference in the number of Nobel Prize winners ...

  18. [18]

    Therefore, Canada had 4 more Nobel Prize winners than Italy in the 1990sïs incorrect

    **Incorrect Calculation**: The statement Ïn the 1990s, Canada had 7 Nobel Prize winners, while Italy had 3. Therefore, Canada had 4 more Nobel Prize winners than Italy in the 1990sïs incorrect. According to the table, in the 1990s, Canada actually had 8 Nobel Prize winners, while Italy had 1. The correct difference would be 8 (Canada) - 1 (Italy) = 7, not 4

  19. [19]

    Therefore, the correct answer should be that Canada had 7 more Nobel Prize winners than Italy in the 1990s

    **Misinterpretation of the Data**: The table shows that in the 1990s, Canada had 8 Nobel Prize winners, which is more than the 1 that Italy had. Therefore, the correct answer should be that Canada had 7 more Nobel Prize winners than Italy in the 1990s. To correct the answer, the cor- rect statement would be: Ïn the 1990s, Canada had 8 Nobel Prize winners,...

  20. [20]

    Is Web Green greater than Yellow?

    **Relevance and Accuracy**: The answer correctly identifies that the Yellow line is above the Web Green line throughout the entire range of the x-axis. This means that for any given value of x, the value of Yellow is greater than the value of Web Green. This is a direct interpretation of the graph, which is a relevant and accurate way to answer the questi...

  21. [21]

    This level of detail is accurate and relevant to the question, which asks whether Web Green is greater than Yellow

    **Level of Detail**: The answer is precise in its conclusion, specifying that the Yellow line is above the Web Green line for all values of x. This level of detail is accurate and relevant to the question, which asks whether Web Green is greater than Yellow. By stating that this is true for all x-values, the answer is comprehensive and accurate

  22. [22]

    It correctly identifies the relative positions of the lines on the graph, which is a clear and accurate way to determine the relationship between Web Green and Yellow

    **Logical Reasoning**: The rea- soning is logical and straightforward. It correctly identifies the relative positions of the lines on the graph, which is a clear and accurate way to determine the relationship between Web Green and Yellow. In contrast, if the answer had been inaccurate, it might have suggested that Web Green is greater than Yellow at some ...

  23. [23]

    This misinterpretation might lead to a misunderstanding of the relationship between the two lines

    **Accuracy and Relevance**: The accuracy of the answer can be questioned because the yellow line is indeed below the Web Green line for a significant portion of the x-axis range, not above it. This misinterpretation might lead to a misunderstanding of the relationship between the two lines

  24. [24]

    From the figure, it is clear that the Web Green line (green dots) is consistently above the Yellow line (yellow dots) for the entire x-axis range

    **Correct Approach**: To determine whether Web Green is greater than Yellow, we need to com- pare the y-values of these two lines at each corresponding x-value. From the figure, it is clear that the Web Green line (green dots) is consistently above the Yellow line (yellow dots) for the entire x-axis range. This means Web Green is greater than Yellow for a...

  25. [25]

    The correct interpretation should be that Web Green is always above the Yellow line, which would mean Web Green is greater than Yellow

    **Misinterpretation**: The statement ¨the yellow line is above the web green line for all values of x-axisïs incorrect. The correct interpretation should be that Web Green is always above the Yellow line, which would mean Web Green is greater than Yellow

  26. [26]

    Therefore, the correct answer should be that Web Green is greater than Yellow, not the other way around

    **Conclusion**: Based on the correct interpretation of the figure, Web Green is indeed greater than Yellow for all values of x. Therefore, the correct answer should be that Web Green is greater than Yellow, not the other way around. </think> Score: 1 Figure S14.Additional qualitative example of score setting. 27 Mitigating Perceptual Judgment Bias in Mult...

  27. [27]

    This means the plate is being held hori- zontally, which is a 90-degree angle to the vertical plane of the man’s body

    **Orientation of the Plate**: The plate is being held in a way that the horizontal line of the plate is parallel to the bottom edge of the table. This means the plate is being held hori- zontally, which is a 90-degree angle to the vertical plane of the man’s body

  28. [28]

    This is a key detail that confirms the angle is 90 degrees, as the plate is being held in a way that its vertical edge is at a right angle to the man’s body

    **Perpendicularity**: The vertical line of the plate is perpendicular to the man’s body. This is a key detail that confirms the angle is 90 degrees, as the plate is being held in a way that its vertical edge is at a right angle to the man’s body

  29. [29]

    **Visual Confirmation**: The image shows the plate being held in a way that aligns with these descriptions. The man’s hand is positioned such that the plate is horizontal, and the vertical edge of the plate is aligned with the vertical line of his body, which is consistent with a 90-degree angle. Given these observations, the analysis provided by Assistan...

  30. [30]

    This is a correct observation

    **Angle Determination**: The man is holding the plate in such a way that the plate is perpendicular to the table, which means the angle between the plate and the table is 90 degrees. This is a correct observation

  31. [31]

    This is a clear and accurate way to describe the orientation of the plate, which is indeed a 90-degree angle

    **Observation of Lines**: The description mentions that the hor- izontal line of the plate is parallel to the bottom edge of the table, and the vertical line of the plate is perpendicular to the man’s body. This is a clear and accurate way to describe the orientation of the plate, which is indeed a 90-degree angle

  32. [32]

    This is a reasonable interpretation based on the visual evidence in the image

    **Contextual Accuracy**: The analysis is accurate in the context of the image, where the plate is being held in a way that it forms a right angle with the table. This is a reasonable interpretation based on the visual evidence in the image. Given these accurate observations, the answer provided by Assistant is quite accurate and relevant to the question a...