pith. machine review for the scientific record.

arxiv: 2603.13224 · v2 · submitted 2026-03-13 · 💻 cs.CV · cs.AI

Recognition: unknown

Visual-ERM: Reward Modeling for Visual Equivalence

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual reward model · vision-to-code · reinforcement learning · chart-to-code · table parsing · SVG reconstruction · multimodal generative reward · reward hacking

The pith

A generative reward model that compares rendered images directly supplies the fine-grained feedback needed for vision-to-code reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-to-code tasks ask models to turn visual inputs such as charts and tables into executable code while preserving exact visual appearance. Standard rewards based on text rules or coarse embeddings miss small discrepancies and are easy to exploit. Visual-ERM instead renders both the target and the model output, then generates direct visual comparisons to produce interpretable, task-agnostic scores. When these scores guide reinforcement learning, performance rises sharply on chart-to-code and improves steadily on table and SVG reconstruction. The same model also outperforms much larger baselines on a new benchmark for detecting fine visual differences.
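The render-then-compare loop described above can be sketched as follows. This is illustrative only: `render`, `generative_judge`, and the severity weighting are hypothetical stand-ins (Visual-ERM's actual judge is a multimodal generative model producing structured critiques; a trivial pixel-diff stub is used here so the control flow is runnable).

```python
# Sketch of a render-then-compare visual reward. Everything here is
# illustrative: Visual-ERM's real judge is a generative LVLM, not a rule.

def render(code: str) -> list[list[int]]:
    """Hypothetical renderer: turn chart/table/SVG code into a grayscale
    raster. Stub: draws a single bar of height int(code) on an 8x4 grid."""
    h = int(code)
    return [[1 if (7 - r) <= h else 0 for _ in range(4)] for r in range(8)]

def generative_judge(target_img, pred_img) -> list[dict]:
    """Stand-in for the generative judge: emit structured, per-error
    critiques with a severity (1 = minor .. 3 = severe), as Visual-ERM does."""
    errors = []
    diff = sum(a != b for tr, pr in zip(target_img, pred_img)
                      for a, b in zip(tr, pr))
    if diff:
        errors.append({"category": "data_error",
                       "severity": 3 if diff > 8 else 1,
                       "note": f"{diff} mismatched cells"})
    return errors

def visual_reward(target_code: str, pred_code: str) -> float:
    """Reward = 1 minus a severity-weighted penalty over judged errors."""
    try:
        pred_img = render(pred_code)
    except Exception:
        return 0.0                      # non-executable output: zero reward
    target_img = render(target_code)
    errors = generative_judge(target_img, pred_img)
    penalty = sum(e["severity"] for e in errors) / 10.0
    return max(0.0, 1.0 - penalty)
```

The key property the paper attributes to this setup is that the judge sees rendered pixels, so textual tricks that fool rule-based or embedding rewards earn no credit.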

Core claim

Fine-grained visual reward supervision is both necessary and sufficient for vision-to-code reinforcement learning regardless of task. Visual-ERM implements this supervision as a multimodal generative model that evaluates outputs by comparing rendered images in visual space rather than relying on textual rules or embedding similarity, yielding consistent gains and stronger test-time scaling via reflection.

What carries the argument

Visual-ERM, a multimodal generative reward model that supplies fine-grained visual equivalence scores by directly comparing rendered inputs and outputs.

If this is right

  • RL with Visual-ERM raises Qwen3-VL-8B-Instruct by +8.4 on chart-to-code.
  • The same reward produces average gains of +2.7 on table parsing and +4.1 on SVG parsing.
  • Test-time reflection and revision guided by Visual-ERM further improve final outputs.
  • An 8B Visual-ERM outperforms a 235B Qwen3-VL-Instruct on VC-RewardBench for judging fine visual discrepancies.
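The reflection-and-revision claim can be made concrete with a small control loop, assuming a scorer in the Visual-ERM mold that returns a scalar score plus a textual critique. `model`, `scorer`, and the acceptance threshold are hypothetical placeholders, not the paper's implementation:

```python
# Sketch of reward-guided test-time reflection/revision. The `model`
# callable stands in for the policy LVLM; `scorer` stands in for a
# Visual-ERM-style judge returning (score, critique).

def reflect_and_revise(model, scorer, prompt, max_rounds=3, accept=0.95):
    """Iteratively: draft -> score -> feed critique back -> keep the best."""
    best_code, best_score = None, -1.0
    feedback = ""
    for _ in range(max_rounds):
        code = model(prompt, feedback)       # draft, then revisions
        score, critique = scorer(code)       # visual-equivalence judgment
        if score > best_score:
            best_code, best_score = code, score
        if score >= accept:                  # good enough: stop early
            break
        feedback = critique                  # reflection signal for next round
    return best_code, best_score
```

Because the critique is interpretable (a list of concrete visual discrepancies), it can be fed back verbatim as the revision instruction, which is what makes the test-time scaling claim plausible.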

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same visual-equivalence approach could be tested on other structured-image tasks such as diagram-to-code or layout generation where pixel-level fidelity matters.
  • Scaling Visual-ERM while keeping the visual comparison core might narrow the remaining gap to closed-source models on visual judgment benchmarks.
  • Hybrid rewards that combine Visual-ERM scores with light textual checks could address both visual fidelity and semantic correctness in one training loop.

Load-bearing premise

The reported performance gains are produced by the visual comparison mechanism itself rather than by other details of the training procedure or by reward hacking that the comparisons fail to prevent.

What would settle it

An ablation that runs identical RL training with a non-visual reward or with the visual comparison step removed and still obtains the same numerical gains on chart-to-code, table, and SVG tasks.
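Such a reward-isolation ablation could be organized as below; field names and defaults are hypothetical (the paper does not publish this harness). The point is structural: every hyperparameter is pinned in one config, and only the reward signal varies across runs.

```python
# Sketch of the reward-isolation ablation the review calls for: identical
# RL configuration, only the reward function swapped. `RLConfig` and its
# fields are illustrative placeholders, not the paper's actual setup.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RLConfig:
    algorithm: str = "PPO"
    lr: float = 1e-6
    steps: int = 500
    samples_per_prompt: int = 8
    reward_fn: str = "visual_erm"   # the ONLY field the ablation varies

def ablation_grid(base: RLConfig):
    """One run per reward signal; everything else held fixed."""
    rewards = ["visual_erm", "textual_rules", "embedding_similarity", "none"]
    return [replace(base, reward_fn=r) for r in rewards]
```

If the non-visual arms of this grid matched the Visual-ERM arm on chart, table, and SVG metrics, the necessity half of the paper's claim would fail.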

read the original abstract

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Visual-ERM, a multimodal generative reward model that supplies fine-grained, interpretable visual-equivalence feedback for RL on vision-to-code tasks (chart-to-code, table parsing, SVG generation). It reports that integrating Visual-ERM into RL yields +8.4 improvement on chart-to-code for Qwen3-VL-8B-Instruct, smaller gains on table and SVG tasks, strengthens test-time scaling, and that the 8B Visual-ERM outperforms Qwen3-VL-235B on the new VC-RewardBench benchmark for judging fine-grained image-to-image discrepancies.

Significance. If the performance deltas can be isolated to the visual reward model itself, the work would provide a concrete mechanism for reducing reward hacking in structured visual reconstruction and demonstrate that task-agnostic fine-grained visual supervision can be both necessary and sufficient for vision-to-code RL.

major comments (2)
  1. [Abstract] The central claim that fine-grained visual supervision is both necessary and sufficient requires explicit isolation of the reward model: the reported +8.4 chart-to-code gain and the table/SVG gains are presented without stating that the RL algorithm, optimizer, step count, sampling strategy, and other hyperparameters were held fixed when comparing against the textual-rule and coarse-embedding baselines.
  2. [Abstract] The assertion that Visual-ERM avoids reward hacking is not accompanied by any quantitative metric (e.g., adversarial-example success rate, or correlation with human visual judgments on the same trajectories) that would demonstrate superiority over coarse embeddings.
minor comments (1)
  1. [Abstract] The abstract would benefit from a one-sentence description of the Visual-ERM architecture (e.g., whether it is a generative model that outputs a scalar or a structured critique).
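One concrete way to supply the reward-hacking metric the second major comment asks for: score the same trajectories with each reward model and with human raters, then compare rank agreement. A tie-free Spearman rank correlation in plain Python (illustrative; in practice the inputs would be the paper's actual rollouts and human judgments):

```python
# Spearman rank correlation, tie-free sketch: with no ties, Spearman is
# Pearson correlation computed on the rank vectors. A hacked reward would
# rate human-rejected trajectories highly, depressing this correlation.

def spearman(xs, ys):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2                       # mean of ranks 0..n-1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)   # equal for rx and ry (no ties)
    return cov / var
```

Reporting this correlation for Visual-ERM versus the embedding reward on identical trajectories would turn the "vulnerable to reward hacking" claim into a measured quantity.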

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable feedback. We address the major comments below and have prepared revisions to the abstract and discussion sections accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central claim that fine-grained visual supervision is both necessary and sufficient requires explicit isolation of the reward model: the reported +8.4 chart-to-code gain and the table/SVG gains are presented without stating that the RL algorithm, optimizer, step count, sampling strategy, and other hyperparameters were held fixed when comparing against the textual-rule and coarse-embedding baselines.

    Authors: We agree that the abstract should make the experimental isolation explicit. All comparisons in our RL experiments held the algorithm (PPO), optimizer, step count, and sampling strategy fixed, varying only the reward model. We will revise the abstract to state this clearly, ensuring the +8.4 gain and other improvements are attributable to Visual-ERM. revision: yes

  2. Referee: [Abstract] The assertion that Visual-ERM avoids reward hacking is not accompanied by any quantitative metric (e.g., adversarial-example success rate, or correlation with human visual judgments on the same trajectories) that would demonstrate superiority over coarse embeddings.

    Authors: While the paper shows Visual-ERM's superiority on VC-RewardBench and improved RL outcomes, we acknowledge the absence of specific quantitative metrics for reward hacking such as adversarial success rates. We will add a discussion on how our benchmark results correlate with human visual judgments and note this as a limitation, with plans for future adversarial evaluations. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical gains presented without self-referential derivations

full rationale

The paper introduces Visual-ERM as a new multimodal reward model and reports empirical improvements when integrated into RL for vision-to-code tasks (+8.4 on chart-to-code, etc.). No equations, derivations, or fitted parameters are described that would reduce the central claim to a self-defined quantity. The necessity/sufficiency argument rests on benchmark comparisons (VC-RewardBench and task-specific gains) rather than any self-citation chain, ansatz smuggling, or renaming of known results. The derivation chain is self-contained as an experimental proposal and evaluation; no load-bearing step collapses to the authors' prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described beyond the high-level proposal of the reward model itself.

pith-pipeline@v0.9.0 · 5577 in / 1068 out tokens · 67770 ms · 2026-05-15T11:15:48.002891+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 11 internal anchors

  1. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  2. [3]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiao wen Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhen Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shua...

  3. [4]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InICCV, 2021. 1

  4. [5]

    Breaking the sft plateau: Multimodal structured reinforcement learning for chart-to-code generation.arXiv preprint arXiv:2508.13587, 2025

    Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, and Lin Ma. Breaking the sft plateau: Multimodal structured reinforcement learning for chart-to-code generation.arXiv preprint arXiv:2508.13587, 2025. A.2

  5. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 3.3

  6. [7]

    Arm-thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning.arXiv preprint arXiv:2512.05111, 2025

    Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, et al. Arm-thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning.arXiv preprint arXiv:2512.05111, 2025. 2

  7. [8]

    Webcode2m: A real-world dataset for code generation from webpage designs

    Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Bohua Chen, Yi Su, Dongping Chen, Siyuan Wu, Xing Zhou, et al. Webcode2m: A real-world dataset for code generation from webpage designs. In Proceedings of the ACM on Web Conference 2025, pages 1834–1845, 2025. 2

  8. [9]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024. 1

  9. [10]

    Viscodex: Unified multimodal code generation via merging vision and coding models.arXiv preprint arXiv:2508.09945,

    Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, and Furu Wei. Viscodex: Unified multimodal code generation via merging vision and coding models.arXiv preprint arXiv:2508.09945,

  10. [11]

    Solar 10.7b: Scaling large language models with simple yet effective depth up-scaling.ArXiv, abs/2312.15166, 2023

    Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, Changbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim, Mikyoung Cha, Hwalsuk Lee, and Sunghun Kim. Solar 10.7b: Scaling large language models with simple yet effective depth up-scaling.ArXiv, abs/2312.15166, 2023. 2

  11. [12]

    UniSVG: A unified dataset for vector graphic understanding and generation with multimodal large language models

    Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, and Yanbin Hao. UniSVG: A unified dataset for vector graphic understanding and generation with multimodal large language models. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13156–13163,

  12. [13]

    Vl-rewardbench: A challenging benchmark for vision-language generative reward models

    Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, et al. Vl-rewardbench: A challenging benchmark for vision-language generative reward models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24657–24668,

  13. [14]

    One model to critique them all: Rewarding agentic tool-use via efficient reasoning.arXiv preprint arXiv:2510.26167, 2025

    Renhao Li, Jianhong Tu, Yang Su, Hamid Alinejad-Rokny, Derek F Wong, Junyang Lin, and Min Yang. One model to critique them all: Rewarding agentic tool-use via efficient reasoning.arXiv preprint arXiv:2510.26167, 2025. 2

  14. [15]

    Table2latex-rl: High-fidelity latex code generation from table images via reinforced multimodal language models.arXiv preprint arXiv:2509.17589, 2025

    Jun Ling, Yao Qi, Tao Huang, Shibo Zhou, Yanqin Huang, Jiang Yang, Ziqi Song, Ying Zhou, Yang Yang, Heng Tao Shen, et al. Table2latex-rl: High-fidelity latex code generation from table images via reinforced multimodal language models.arXiv preprint arXiv:2509.17589, 2025. 1, 2

  15. [16]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025. 2

  16. [17]

    Spark: Synergistic policy and reward co-evolving framework.arXiv preprint arXiv:2509.22624,

    Ziyu Liu, Yuhang Zang, Shengyuan Ding, Yuhang Cao, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spark: Synergistic policy and reward co-evolving framework.arXiv preprint arXiv:2509.22624,

  17. [18]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022. B.1

  18. [19]

    InfographicVQA

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022. B.1

  19. [20]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209,

  20. [21]

    MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186, 2025

    Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, et al. MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186, 2025. 1, 2

  21. [22]

    Openai o3-mini system card, 2025

    OpenAI. Openai o3-mini system card, 2025. 1

  22. [23]

    Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations

    Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025. 1, 4.1

  23. [24]

    Agentic reward modeling: Integrating human preferences with verifiable correctness signals for reliable reward systems.arXiv preprint arXiv:2502.19328, 2025

    Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, and Juanzi Li. Agentic reward modeling: Integrating human preferences with verifiable correctness signals for reliable reward systems. arXiv preprint arXiv:2502.19328, 2025. 2

  24. [25]

    olmocr: Unlocking trillions of tokens in pdfs with vision language models.arXiv preprint arXiv:2502.18443,

    Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models.arXiv preprint arXiv:2502.18443, 2025. 1, 4.1

  25. [26]

    Starvector: Generating scalable vector graphics code from images.arXiv preprint arXiv:2312.11556, 2023

    Juan A Rodriguez, Shubham Agarwal, Issam H Laradji, Pau Rodriguez, David Vazquez, Christopher Pal, and Marco Pedersoli. Starvector: Generating scalable vector graphics code from images.arXiv preprint arXiv:2312.11556, 2023. A.2

  26. [27]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 4.1

  27. [28]

    Design2Code: Benchmarking multimodal code generation for automated front-end engineering

    Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2Code: Benchmarking multimodal code generation for automated front-end engineering. In ACL, 2025. 1

  28. [29]

    DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104,

  29. [30]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025. 3.1, 3.3

  30. [31]

    JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

    Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, and Fei Yuan. Januscoder: Towards a foundational visual-programmatic interface for code intelligence.arXiv preprint arXiv:2510.23538, 2025. 4.1

  31. [32]

    Chartmaster: Advancing chart-to-code generation with real-world charts and chart similarity reinforcement learning

    Wentao Tan, Qiong Cao, Chao Xue, Yibing Zhan, Changxing Ding, and Xiaodong He. Chartmaster: Advancing chart-to-code generation with real-world charts and chart similarity reinforcement learning. arXiv preprint arXiv:2508.17608, 2025. 1, 2

  32. [33]

    Unified Reward Model for Multimodal Understanding and Generation

    Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation.arXiv preprint arXiv:2503.05236, 2025. 2

  33. [34]

    Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37:113569–113697, 2024

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37:113569–113697, 2024. B.1

  34. [35]

    Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation.arXiv preprint arXiv:2406.09961, 2024

    Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, et al. Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation.arXiv preprint arXiv:2406.09961, 2024. 1, 4.1, 4.2

  35. [36]

    Omnisvg: A unified scalable vector graphics generation model.arXiv preprint arXiv:2504.06263, 2025

    Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Fukun Yin, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, and Yu-Gang Jiang. Omnisvg: A unified scalable vector graphics generation model.arXiv preprint arXiv:2504.06263, 2025. 2

  36. [37]

    Self-Rewarding Language Models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason E. Weston. Self-rewarding language models. ArXiv, abs/2401.10020, 2024. 2

  37. [38]

    InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

    Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, et al. InternLM-XComposer2.5-Reward: A simple yet effective multi-modal reward model. arXiv preprint arXiv:2501.12368, 2025. 2

  38. [39]

    MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

    Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, et al. MonkeyOCR v1.5 technical report: Unlocking robust document parsing for complex patterns. arXiv preprint arXiv:2511.10390, 2025. 2

  39. [40]

    Enhancing chart-to-code generation in multimodal large language models via iterative dual preference learning.arXiv preprint arXiv:2504.02906, 2025

    Zhihan Zhang, Yixin Cao, and Lizi Liao. Enhancing chart-to-code generation in multimodal large language models via iterative dual preference learning. arXiv preprint arXiv:2504.02906, 2025. 2

  40. [41]

    Vincicoder: Unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025

    Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, and Lin Ma. Vincicoder: Unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025. 1, 2, 4.1, 4.2, A.1

  41. [42]

    ChartCoder: Advancing multimodal large language model for chart-to-code generation. arXiv preprint arXiv:2501.06598,

    Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, and Maosong Sun. ChartCoder: Advancing multimodal large language model for chart-to-code generation. arXiv preprint arXiv:2501.06598,

  42. [43]

    Image-based table recognition: data, model, and evaluation

    Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. arXiv preprint arXiv:1911.10683, 2019. 2

  43. [44]

    Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023

    Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023. 2
