pith. machine review for the scientific record.

arxiv: 2603.13224 · v2 · submitted 2026-03-13 · 💻 cs.CV · cs.AI

Recognition: unknown

Visual-ERM: Reward Modeling for Visual Equivalence

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual reward model · vision-to-code · reinforcement learning · chart-to-code · table parsing · SVG reconstruction · multimodal generative reward · reward hacking

The pith

A generative reward model that compares rendered images directly supplies the fine-grained feedback needed for vision-to-code reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-to-code tasks ask models to turn visual inputs such as charts and tables into executable code while preserving exact visual appearance. Standard rewards based on text rules or coarse embeddings miss small discrepancies and are easy to exploit. Visual-ERM instead renders both the target and the model output, then generates direct visual comparisons to produce interpretable, task-agnostic scores. When these scores guide reinforcement learning, performance rises sharply on chart-to-code and improves steadily on table and SVG reconstruction. The same model also outperforms much larger baselines on a new benchmark for detecting fine visual differences.
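The render-then-compare loop described above can be sketched as follows. This is illustrative only: `render`, `generative_judge`, and the severity weighting are hypothetical stand-ins (Visual-ERM's actual judge is a multimodal generative model producing structured critiques; a trivial pixel-diff stub is used here so the control flow is runnable).

```python
# Sketch of a render-then-compare visual reward. Everything here is
# illustrative: Visual-ERM's real judge is a generative LVLM, not a rule.

def render(code: str) -> list[list[int]]:
    """Hypothetical renderer: turn chart/table/SVG code into a grayscale
    raster. Stub: draws a single bar of height int(code) on an 8x4 grid."""
    h = int(code)
    return [[1 if (7 - r) <= h else 0 for _ in range(4)] for r in range(8)]

def generative_judge(target_img, pred_img) -> list[dict]:
    """Stand-in for the generative judge: emit structured, per-error
    critiques with a severity (1 = minor .. 3 = severe), as Visual-ERM does."""
    errors = []
    diff = sum(a != b for tr, pr in zip(target_img, pred_img)
                      for a, b in zip(tr, pr))
    if diff:
        errors.append({"category": "data_error",
                       "severity": 3 if diff > 8 else 1,
                       "note": f"{diff} mismatched cells"})
    return errors

def visual_reward(target_code: str, pred_code: str) -> float:
    """Reward = 1 minus a severity-weighted penalty over judged errors."""
    try:
        pred_img = render(pred_code)
    except Exception:
        return 0.0                      # non-executable output: zero reward
    target_img = render(target_code)
    errors = generative_judge(target_img, pred_img)
    penalty = sum(e["severity"] for e in errors) / 10.0
    return max(0.0, 1.0 - penalty)
```

The key property the paper attributes to this setup is that the judge sees rendered pixels, so textual tricks that fool rule-based or embedding rewards earn no credit.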

Core claim

Fine-grained visual reward supervision is both necessary and sufficient for vision-to-code reinforcement learning regardless of task. Visual-ERM implements this supervision as a multimodal generative model that evaluates outputs by comparing rendered images in visual space rather than relying on textual rules or embedding similarity, yielding consistent gains and stronger test-time scaling via reflection.

What carries the argument

Visual-ERM, a multimodal generative reward model that supplies fine-grained visual equivalence scores by directly comparing rendered inputs and outputs.

If this is right

  • RL with Visual-ERM raises Qwen3-VL-8B-Instruct by +8.4 on chart-to-code.
  • The same reward produces average gains of +2.7 on table parsing and +4.1 on SVG parsing.
  • Test-time reflection and revision guided by Visual-ERM further improve final outputs.
  • An 8B Visual-ERM outperforms a 235B Qwen3-VL-Instruct on VC-RewardBench for judging fine visual discrepancies.
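The reflection-and-revision claim can be made concrete with a small control loop, assuming a scorer in the Visual-ERM mold that returns a scalar score plus a textual critique. `model`, `scorer`, and the acceptance threshold are hypothetical placeholders, not the paper's implementation:

```python
# Sketch of reward-guided test-time reflection/revision. The `model`
# callable stands in for the policy LVLM; `scorer` stands in for a
# Visual-ERM-style judge returning (score, critique).

def reflect_and_revise(model, scorer, prompt, max_rounds=3, accept=0.95):
    """Iteratively: draft -> score -> feed critique back -> keep the best."""
    best_code, best_score = None, -1.0
    feedback = ""
    for _ in range(max_rounds):
        code = model(prompt, feedback)       # draft, then revisions
        score, critique = scorer(code)       # visual-equivalence judgment
        if score > best_score:
            best_code, best_score = code, score
        if score >= accept:                  # good enough: stop early
            break
        feedback = critique                  # reflection signal for next round
    return best_code, best_score
```

Because the critique is interpretable (a list of concrete visual discrepancies), it can be fed back verbatim as the revision instruction, which is what makes the test-time scaling claim plausible.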

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same visual-equivalence approach could be tested on other structured-image tasks such as diagram-to-code or layout generation where pixel-level fidelity matters.
  • Scaling Visual-ERM while keeping the visual comparison core might narrow the remaining gap to closed-source models on visual judgment benchmarks.
  • Hybrid rewards that combine Visual-ERM scores with light textual checks could address both visual fidelity and semantic correctness in one training loop.

Load-bearing premise

The reported performance gains are produced by the visual comparison mechanism itself rather than by other details of the training procedure or by reward hacking that the comparisons fail to prevent.

What would settle it

An ablation that runs identical RL training with a non-visual reward or with the visual comparison step removed and still obtains the same numerical gains on chart-to-code, table, and SVG tasks.
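Such a reward-isolation ablation could be organized as below; field names and defaults are hypothetical (the paper does not publish this harness). The point is structural: every hyperparameter is pinned in one config, and only the reward signal varies across runs.

```python
# Sketch of the reward-isolation ablation the review calls for: identical
# RL configuration, only the reward function swapped. `RLConfig` and its
# fields are illustrative placeholders, not the paper's actual setup.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RLConfig:
    algorithm: str = "PPO"
    lr: float = 1e-6
    steps: int = 500
    samples_per_prompt: int = 8
    reward_fn: str = "visual_erm"   # the ONLY field the ablation varies

def ablation_grid(base: RLConfig):
    """One run per reward signal; everything else held fixed."""
    rewards = ["visual_erm", "textual_rules", "embedding_similarity", "none"]
    return [replace(base, reward_fn=r) for r in rewards]
```

If the non-visual arms of this grid matched the Visual-ERM arm on chart, table, and SVG metrics, the necessity half of the paper's claim would fail.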

read the original abstract

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Visual-ERM, a multimodal generative reward model that supplies fine-grained, interpretable visual-equivalence feedback for RL on vision-to-code tasks (chart-to-code, table parsing, SVG generation). It reports that integrating Visual-ERM into RL yields +8.4 improvement on chart-to-code for Qwen3-VL-8B-Instruct, smaller gains on table and SVG tasks, strengthens test-time scaling, and that the 8B Visual-ERM outperforms Qwen3-VL-235B on the new VC-RewardBench benchmark for judging fine-grained image-to-image discrepancies.

Significance. If the performance deltas can be isolated to the visual reward model itself, the work would provide a concrete mechanism for reducing reward hacking in structured visual reconstruction and demonstrate that task-agnostic fine-grained visual supervision can be both necessary and sufficient for vision-to-code RL.

major comments (2)
  1. [Abstract] The central claim that fine-grained visual supervision is both necessary and sufficient requires explicit isolation of the reward model: the reported +8.4 chart-to-code gain and the table/SVG gains are presented without stating that the RL algorithm, optimizer, step count, sampling strategy, and other hyperparameters were held fixed when comparing against the textual-rule and coarse-embedding baselines.
  2. [Abstract] The assertion that Visual-ERM avoids reward hacking is not accompanied by any quantitative metric (e.g., adversarial-example success rate, or correlation with human visual judgments on the same trajectories) that would demonstrate superiority over coarse embeddings.
minor comments (1)
  1. [Abstract] The abstract would benefit from a one-sentence description of the Visual-ERM architecture (e.g., whether it is a generative model that outputs a scalar or a structured critique).
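One concrete way to supply the reward-hacking metric the second major comment asks for: score the same trajectories with each reward model and with human raters, then compare rank agreement. A tie-free Spearman rank correlation in plain Python (illustrative; in practice the inputs would be the paper's actual rollouts and human judgments):

```python
# Spearman rank correlation, tie-free sketch: with no ties, Spearman is
# Pearson correlation computed on the rank vectors. A hacked reward would
# rate human-rejected trajectories highly, depressing this correlation.

def spearman(xs, ys):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2                       # mean of ranks 0..n-1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)   # equal for rx and ry (no ties)
    return cov / var
```

Reporting this correlation for Visual-ERM versus the embedding reward on identical trajectories would turn the "vulnerable to reward hacking" claim into a measured quantity.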

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable feedback. We address the major comments below and have prepared revisions to the abstract and discussion sections accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central claim that fine-grained visual supervision is both necessary and sufficient requires explicit isolation of the reward model: the reported +8.4 chart-to-code gain and the table/SVG gains are presented without stating that the RL algorithm, optimizer, step count, sampling strategy, and other hyperparameters were held fixed when comparing against the textual-rule and coarse-embedding baselines.

    Authors: We agree that the abstract should make the experimental isolation explicit. All comparisons in our RL experiments held the algorithm (PPO), optimizer, step count, and sampling strategy fixed, varying only the reward model. We will revise the abstract to state this clearly, ensuring the +8.4 gain and other improvements are attributable to Visual-ERM. revision: yes

  2. Referee: [Abstract] The assertion that Visual-ERM avoids reward hacking is not accompanied by any quantitative metric (e.g., adversarial-example success rate, or correlation with human visual judgments on the same trajectories) that would demonstrate superiority over coarse embeddings.

    Authors: While the paper shows Visual-ERM's superiority on VC-RewardBench and improved RL outcomes, we acknowledge the absence of specific quantitative metrics for reward hacking such as adversarial success rates. We will add a discussion on how our benchmark results correlate with human visual judgments and note this as a limitation, with plans for future adversarial evaluations. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical gains presented without self-referential derivations

full rationale

The paper introduces Visual-ERM as a new multimodal reward model and reports empirical improvements when integrated into RL for vision-to-code tasks (+8.4 on chart-to-code, etc.). No equations, derivations, or fitted parameters are described that would reduce the central claim to a self-defined quantity. The necessity/sufficiency argument rests on benchmark comparisons (VC-RewardBench and task-specific gains) rather than any self-citation chain, ansatz smuggling, or renaming of known results. The derivation chain is self-contained as an experimental proposal and evaluation; no load-bearing step collapses to the authors' prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described beyond the high-level proposal of the reward model itself.

pith-pipeline@v0.9.0 · 5577 in / 1068 out tokens · 67770 ms · 2026-05-15T11:15:48.002891+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 11 internal anchors

  1. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  2. [3]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiao wen Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhen Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shua...

  3. [4]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InICCV, 2021. 1

  4. [5]

    Breaking the sft plateau: Multimodal structured reinforcement learning for chart-to-code generation.arXiv preprint arXiv:2508.13587, 2025

    Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, and Lin Ma. Breaking the sft plateau: Multimodal structured reinforcement learning for chart-to-code generation.arXiv preprint arXiv:2508.13587, 2025. A.2

  5. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 3.3

  6. [7]

    Arm-thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning.arXiv preprint arXiv:2512.05111, 2025

    Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, et al. Arm-thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning.arXiv preprint arXiv:2512.05111, 2025. 2

  7. [8]

    Webcode2m: A real-world dataset for code generation from webpage designs

    Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Bohua Chen, Yi Su, Dongping Chen, Siyuan Wu, Xing Zhou, et al. Webcode2m: A real-world dataset for code generation from webpage designs. In Proceedings of the ACM on Web Conference 2025, pages 1834–1845, 2025. 2

  8. [9]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024. 1

  9. [10]

    Viscodex: Unified multimodal code generation via merging vision and coding models.arXiv preprint arXiv:2508.09945,

    Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, and Furu Wei. Viscodex: Unified multimodal code generation via merging vision and coding models.arXiv preprint arXiv:2508.09945,

  10. [11]

    Solar 10.7b: Scaling large language models with simple yet effective depth up-scaling.ArXiv, abs/2312.15166, 2023

    Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, Changbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim, Mikyoung Cha, Hwalsuk Lee, and Sunghun Kim. Solar 10.7b: Scaling large language models with simple yet effective depth up-scaling.ArXiv, abs/2312.15166, 2023. 2

  11. [12]

    UniSVG: A unified dataset for vector graphic understanding and generation with multimodal large language models

    Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, and Yanbin Hao. UniSVG: A unified dataset for vector graphic understanding and generation with multimodal large language models. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13156–13163,

  12. [13]

    Vl-rewardbench: A challenging benchmark for vision-language generative reward models

    Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, et al. Vl-rewardbench: A challenging benchmark for vision-language generative reward models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24657–24668,

  13. [14]

    One model to critique them all: Rewarding agentic tool-use via efficient reasoning.arXiv preprint arXiv:2510.26167, 2025

    Renhao Li, Jianhong Tu, Yang Su, Hamid Alinejad-Rokny, Derek F Wong, Junyang Lin, and Min Yang. One model to critique them all: Rewarding agentic tool-use via efficient reasoning.arXiv preprint arXiv:2510.26167, 2025. 2

  14. [15]

    Table2latex-rl: High-fidelity latex code generation from table images via reinforced multimodal language models.arXiv preprint arXiv:2509.17589, 2025

    Jun Ling, Yao Qi, Tao Huang, Shibo Zhou, Yanqin Huang, Jiang Yang, Ziqi Song, Ying Zhou, Yang Yang, Heng Tao Shen, et al. Table2latex-rl: High-fidelity latex code generation from table images via reinforced multimodal language models.arXiv preprint arXiv:2509.17589, 2025. 1, 2

  15. [16]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025. 2

  16. [17]

    Spark: Synergistic policy and reward co-evolving framework.arXiv preprint arXiv:2509.22624,

    Ziyu Liu, Yuhang Zang, Shengyuan Ding, Yuhang Cao, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spark: Synergistic policy and reward co-evolving framework.arXiv preprint arXiv:2509.22624,

  17. [18]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022. B.1

  18. [19]

    InfographicVQA

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022. B.1

  19. [20]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209,

  20. [21]

    MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186, 2025

    Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, et al. MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186, 2025. 1, 2

  21. [22]

    Openai o3-mini system card, 2025

    OpenAI. Openai o3-mini system card, 2025. 1

  22. [23]

    Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations

    Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025. 1, 4.1

  23. [24]

    Agentic reward modeling: Integrating human preferences with verifiable correctness signals for reliable reward systems.arXiv preprint arXiv:2502.19328, 2025

    Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, and Juanzi Li. Agentic reward modeling: Integrating human preferences with verifiable correctness signals for reliable reward systems. arXiv preprint arXiv:2502.19328, 2025. 2

  24. [25]

    olmocr: Unlocking trillions of tokens in pdfs with vision language models.arXiv preprint arXiv:2502.18443,

    Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models.arXiv preprint arXiv:2502.18443, 2025. 1, 4.1

  25. [26]

    Starvector: Generating scalable vector graphics code from images.arXiv preprint arXiv:2312.11556, 2023

    Juan A Rodriguez, Shubham Agarwal, Issam H Laradji, Pau Rodriguez, David Vazquez, Christopher Pal, and Marco Pedersoli. Starvector: Generating scalable vector graphics code from images.arXiv preprint arXiv:2312.11556, 2023. A.2

  26. [27]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 4.1

  27. [28]

    Design2Code: Benchmarking multimodal code generation for automated front-end engineering

    Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2Code: Benchmarking multimodal code generation for automated front-end engineering. In ACL, 2025. 1

  28. [29]

    DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104,

  29. [30]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025. 3.1, 3.3

  30. [31]

    JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

    Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, and Fei Yuan. Januscoder: Towards a foundational visual-programmatic interface for code intelligence.arXiv preprint arXiv:2510.23538, 2025. 4.1

  31. [32]

    Chartmaster: Advancing chart-to-code generation with real-world charts and chart similarity reinforcement learning

    Wentao Tan, Qiong Cao, Chao Xue, Yibing Zhan, Changxing Ding, and Xiaodong He. Chartmaster: Advancing chart-to-code generation with real-world charts and chart similarity reinforcement learning. arXiv preprint arXiv:2508.17608, 2025. 1, 2

  32. [33]

    Unified Reward Model for Multimodal Understanding and Generation

    Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation.arXiv preprint arXiv:2503.05236, 2025. 2

  33. [34]

    Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37:113569–113697, 2024

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37:113569–113697, 2024. B.1

  34. [35]

    Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation.arXiv preprint arXiv:2406.09961, 2024

    Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, et al. Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation.arXiv preprint arXiv:2406.09961, 2024. 1, 4.1, 4.2

  35. [36]

    Omnisvg: A unified scalable vector graphics generation model.arXiv preprint arXiv:2504.06263, 2025

    Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Fukun Yin, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, and Yu-Gang Jiang. Omnisvg: A unified scalable vector graphics generation model.arXiv preprint arXiv:2504.06263, 2025. 2

  36. [37]

    Self-Rewarding Language Models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason E. Weston. Self-rewarding language models. ArXiv, abs/2401.10020, 2024. 2

  37. [38]

    InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

    Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, et al. InternLM-XComposer2.5-Reward: A simple yet effective multi-modal reward model. arXiv preprint arXiv:2501.12368, 2025. 2

  38. [39]

    MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

    Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, et al. MonkeyOCR v1.5 technical report: Unlocking robust document parsing for complex patterns. arXiv preprint arXiv:2511.10390, 2025. 2

  39. [40]

    Enhancing chart-to-code generation in multimodal large language models via iterative dual preference learning.arXiv preprint arXiv:2504.02906, 2025

    Zhihan Zhang, Yixin Cao, and Lizi Liao. Enhancing chart-to-code generation in multimodal large language models via iterative dual preference learning. arXiv preprint arXiv:2504.02906, 2025. 2

  40. [41]

    Vincicoder: Unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025

    Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, and Lin Ma. Vincicoder: Unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025. 1, 2, 4.1, 4.2, A.1

  41. [42]

    ChartCoder: Advancing multimodal large language model for chart-to-code generation. arXiv preprint arXiv:2501.06598,

    Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, and Maosong Sun. ChartCoder: Advancing multimodal large language model for chart-to-code generation. arXiv preprint arXiv:2501.06598,

  42. [43]

    Image-based table recognition: data, model, and evaluation

    Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. arXiv preprint arXiv:1911.10683, 2019. 2

  43. [44]

    Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023

    Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023. 2
