Visual-ERM: Reward Modeling for Visual Equivalence
Pith reviewed 2026-05-15 11:15 UTC · model grok-4.3
The pith
A generative reward model that compares rendered images directly supplies the fine-grained feedback needed for vision-to-code reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-grained visual reward supervision is both necessary and sufficient for vision-to-code reinforcement learning regardless of task. Visual-ERM implements this supervision as a multimodal generative model that evaluates outputs by comparing rendered images in visual space rather than relying on textual rules or embedding similarity, yielding consistent gains and stronger test-time scaling via reflection.
What carries the argument
Visual-ERM, a multimodal generative reward model that supplies fine-grained visual equivalence scores by directly comparing rendered inputs and outputs.
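As a rough illustration, a reward of this shape could be wired up as below; the rendering and judging interfaces, the discrepancy schema, and the severity weighting are assumptions made for the sketch, not the paper's published implementation.

```python
# Illustrative sketch (not the paper's implementation): score a candidate program by
# rendering it and asking a multimodal judge to list visual discrepancies against the
# reference rendering, then map the judged discrepancies to a scalar reward in [0, 1].
from dataclasses import dataclass


@dataclass
class Discrepancy:
    category: str   # e.g. "layout_error", "data_error", "text_error", "style_error"
    severity: int   # 1 (minor) .. 3 (severe)


def visual_equivalence_reward(reference_image, candidate_code, render, judge,
                              max_penalty: float = 6.0) -> float:
    """`render(code) -> image | None` and `judge(ref, gen) -> list[Discrepancy]`
    are assumed interfaces; the linear severity weighting is a placeholder choice."""
    generated = render(candidate_code)
    if generated is None:               # candidate code failed to execute or render
        return 0.0
    discrepancies = judge(reference_image, generated)
    penalty = sum(d.severity for d in discrepancies)
    return max(0.0, 1.0 - penalty / max_penalty)
```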
If this is right
- RL with Visual-ERM raises Qwen3-VL-8B-Instruct by +8.4 on chart-to-code.
- The same reward produces average gains of +2.7 on table parsing and +4.1 on SVG parsing.
- Test-time reflection and revision guided by Visual-ERM further improve final outputs (sketched after this list).
- An 8B Visual-ERM outperforms Qwen3-VL-235B-Instruct on VC-RewardBench at judging fine-grained visual discrepancies.
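The reflection-and-revision bullet above can be pictured as a small loop in which the reward model's critique is fed back into the policy's prompt. A minimal sketch, assuming hypothetical `policy` and `reward_fn` callables rather than the paper's actual interfaces:

```python
# Illustrative reflect-and-revise loop (assumed interfaces, not the paper's code):
# the reward model's critique is appended to the prompt and the policy regenerates
# until the judged score stops improving or the revision budget is exhausted.
def reflect_and_revise(policy, reward_fn, image, prompt: str, budget: int = 3):
    """`policy(prompt, image) -> code` and `reward_fn(image, code) -> (score, critique)`
    stand in for the LVLM and a Visual-ERM-style judge."""
    best_code = policy(prompt, image)
    best_score, critique = reward_fn(image, best_code)
    for _ in range(budget):
        revised_prompt = (
            f"{prompt}\n\nPrevious attempt had these issues:\n{critique}\nRevise the code."
        )
        candidate = policy(revised_prompt, image)
        score, critique = reward_fn(image, candidate)
        if score <= best_score:         # keep only strict improvements
            break
        best_code, best_score = candidate, score
    return best_code, best_score
```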
Where Pith is reading between the lines
- The same visual-equivalence approach could be tested on other structured-image tasks such as diagram-to-code or layout generation where pixel-level fidelity matters.
- Scaling Visual-ERM while keeping the visual comparison core might narrow the remaining gap to closed-source models on visual judgment benchmarks.
- Hybrid rewards that combine Visual-ERM scores with light textual checks could address both visual fidelity and semantic correctness in one training loop.
Load-bearing premise
The reported performance gains are produced by the visual comparison mechanism itself rather than by other details of the training procedure or by reward hacking that the comparisons fail to prevent.
What would settle it
An ablation that runs identical RL training with a non-visual reward or with the visual comparison step removed and still obtains the same numerical gains on chart-to-code, table, and SVG tasks.
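Such an ablation amounts to holding the training recipe fixed and swapping only the reward callable. A hedged sketch of that harness; every name and hyperparameter here is illustrative rather than taken from the paper:

```python
# Illustrative ablation harness (all names assumed): identical RL configuration in
# every arm, only the reward callable varies, so any gap in chart-to-code / table /
# SVG scores is attributable to the reward signal alone.
SHARED_CONFIG = dict(algo="ppo", lr=1e-6, steps=500, rollouts_per_prompt=8, seed=0)


def run_reward_ablation(train_rl, evaluate, reward_arms):
    """`reward_arms` maps arm name -> reward callable (e.g. a rendered-image judge,
    textual rules, or embedding similarity); `train_rl(reward_fn, **cfg) -> policy`
    and `evaluate(policy) -> dict of task scores` are hypothetical interfaces."""
    results = {}
    for name, reward_fn in reward_arms.items():
        policy = train_rl(reward_fn, **SHARED_CONFIG)   # SHARED_CONFIG never changes
        results[name] = evaluate(policy)
    return results
```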
read the original abstract
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Visual-ERM, a multimodal generative reward model that supplies fine-grained, interpretable visual-equivalence feedback for RL on vision-to-code tasks (chart-to-code, table parsing, SVG generation). It reports that integrating Visual-ERM into RL yields a +8.4 improvement on chart-to-code for Qwen3-VL-8B-Instruct and smaller gains on table and SVG parsing, that it strengthens test-time scaling, and that the 8B Visual-ERM outperforms Qwen3-VL-235B-Instruct on the new VC-RewardBench benchmark for judging fine-grained image-to-image discrepancies.
Significance. If the performance deltas can be isolated to the visual reward model itself, the work would provide a concrete mechanism for reducing reward hacking in structured visual reconstruction and demonstrate that task-agnostic fine-grained visual supervision can be both necessary and sufficient for vision-to-code RL.
major comments (2)
- [Abstract] The central claim that fine-grained visual supervision is both necessary and sufficient requires explicit isolation of the reward model: the reported +8.4 chart-to-code gain and the table/SVG gains are presented without stating that the RL algorithm, optimizer, step count, sampling strategy, or other hyperparameters were held fixed when comparing against textual-rule and coarse-embedding baselines.
- [Abstract] The assertion that Visual-ERM avoids reward hacking is not accompanied by any quantitative metric (e.g., adversarial-example success rate or correlation with human visual judgments on the same trajectories) that would demonstrate its superiority over coarse embedding similarity.
minor comments (1)
- [Abstract] The abstract would benefit from a one-sentence description of the Visual-ERM architecture (e.g., whether it is a generative model that outputs a scalar or a structured critique).
Simulated Author's Rebuttal
We thank the referee for their valuable feedback. We address the major comments below and have prepared revisions to the abstract and discussion sections accordingly.
read point-by-point responses
-
Referee: [Abstract] The central claim that fine-grained visual supervision is both necessary and sufficient requires explicit isolation of the reward model: the reported +8.4 chart-to-code gain and the table/SVG gains are presented without stating that the RL algorithm, optimizer, step count, sampling strategy, or other hyperparameters were held fixed when comparing against textual-rule and coarse-embedding baselines.
Authors: We agree that the abstract should make the experimental isolation explicit. All comparisons in our RL experiments held the algorithm (PPO), optimizer, step count, and sampling strategy fixed, varying only the reward model. We will revise the abstract to state this clearly, ensuring the +8.4 gain and other improvements are attributable to Visual-ERM. revision: yes
-
Referee: [Abstract] The assertion that Visual-ERM avoids reward hacking is not accompanied by any quantitative metric (e.g., adversarial-example success rate or correlation with human visual judgments on the same trajectories) that would demonstrate its superiority over coarse embedding similarity.
Authors: While the paper shows Visual-ERM's superiority on VC-RewardBench and improved RL outcomes, we acknowledge the absence of specific quantitative metrics for reward hacking such as adversarial success rates. We will add a discussion on how our benchmark results correlate with human visual judgments and note this as a limitation, with plans for future adversarial evaluations. revision: partial
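One concrete form the requested metric could take is a rank correlation between reward-model scores and human visual judgments on the same trajectories. A minimal sketch, assuming such paired scores are available (none are reported in the abstract); scipy supplies the rank correlation:

```python
# Illustrative reward-hacking check (not from the paper): rank-correlate each reward
# model's scores with human visual-fidelity judgments on the same trajectories.
# A reward that is easy to hack should show a weaker correlation with humans.
from scipy.stats import spearmanr


def human_alignment(reward_scores: list[float], human_scores: list[float]) -> float:
    """Spearman rank correlation between a reward model's scores and human ratings."""
    rho, _pvalue = spearmanr(reward_scores, human_scores)
    return rho


# Hypothetical usage: compare Visual-ERM against a coarse-embedding reward
# on identical rollouts scored by annotators.
# rho_erm  = human_alignment(erm_scores,  human_scores)
# rho_clip = human_alignment(clip_scores, human_scores)
```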
Circularity Check
No circularity: empirical gains presented without self-referential derivations
full rationale
The paper introduces Visual-ERM as a new multimodal reward model and reports empirical improvements when integrated into RL for vision-to-code tasks (+8.4 on chart-to-code, etc.). No equations, derivations, or fitted parameters are described that would reduce the central claim to a self-defined quantity. The necessity/sufficiency argument rests on benchmark comparisons (VC-RewardBench and task-specific gains) rather than any self-citation chain, ansatz smuggling, or renaming of known results. The derivation chain is self-contained as an experimental proposal and evaluation; no load-bearing step collapses to the authors' prior inputs by construction.
Axiom & Free-Parameter Ledger
Appendix prompt excerpts
The paper's appendix reproduces the judging prompts, including Figure 7, the prompt used to distill GPT-5-mini into reward-modeling training data for Visual-ERM. The chart prompt asks the judge (acting as an experienced data-visualization specialist) to compare a Generated Image, rendered from AI-generated Matplotlib code, against the Original Image, rendered from ground-truth code, and to report every visual discrepancy under four categories: layout_error (wrong number, arrangement, or alignment of subplots; missing or extra subplots), data_error (geometry mismatch, value distortion, xlim/ylim or scale mismatch, missing or hallucinated data), text_error (missing, incorrect, misplaced, overlapping, or unreadable titles, axis labels, tick labels, legends, and annotations), and style_error (color palette, marker and line style, grid and background, and other appearance differences). A parallel prompt for tables and documents uses layout_error, text_error, and numeric_error. Each error receives a severity from 1 (minor) to 3 (severe), and the output must be strict JSON in which each *_error_count exactly matches the number of errors listed for that category; the errors list may be empty when the images are visually consistent aside from tiny pixel-level differences. A separate evaluation prompt enforces 1-to-1 matching between predicted and ground-truth errors: a match requires the same error point and the same (or highly consistent) phenomenon, a more general prediction of the same error point counts as match_level "partial", leaving an error unmatched is preferred over an incorrect match, and the output JSON reports matches, unmatched_pred, unmatched_gt, and an optional one-sentence note.
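For concreteness, an illustrative instance of the strict error report that the distillation prompt describes, written as a Python literal with hypothetical values; field names beyond *_error_count, errors, category, severity, and description are not fully specified in the excerpt above.

```python
# Hypothetical example of the structured error report described above: per-category
# counts must equal the number of listed errors, and each error carries a severity 1-3.
example_report = {
    "layout_error_count": 1,
    "data_error_count": 1,
    "text_error_count": 0,
    "style_error_count": 1,
    "errors": [
        {"category": "layout_error", "severity": 2,
         "description": "Generated image has 1 subplot instead of the original 2."},
        {"category": "data_error", "severity": 3,
         "description": "Bar heights invert the original increasing trend."},
        {"category": "style_error", "severity": 1,
         "description": "Series colors differ from the original palette."},
    ],
}
```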