pith. machine review for the scientific record.

arxiv: 2605.07415 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.CL

Recognition: no theorem link

ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse Referring Clues and Multi-Target Referring

Qingfu Zhu, Tianhao Niu, Wanxiang Che, Ziyu Han

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:44 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords chart referring expression grounding · multi-target referring · code-driven synthesis · pixel-accurate masks · instance segmentation · multimodal models · benchmark · chart understanding

The pith

A new benchmark and code-driven synthesis pipeline improve referring expression grounding on charts with multiple targets and diverse clues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a more complete benchmark for chart referring expression grounding that handles multiple target instances, localization forms beyond bounding boxes, a range of referring clues, and multiple chart types. Existing multimodal models exhibit large performance gaps on this benchmark. The authors also present a code-driven synthesis pipeline that generates pixel-accurate instance masks by leveraging the alignment between plotting code and rendered chart elements; they then train an instance segmentation model on these masks and integrate it into a multimodal grounding system.

Core claim

The authors introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. They further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. Training an instance segmentation model with the synthesized masks and integrating it into a general-purpose multimodal grounding framework produces a system that consistently outperforms baselines on the benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.

What carries the argument

The code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities.
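The alignment the pipeline exploits can be illustrated directly in Matplotlib: every plotting call returns Artist objects, and re-rendering the figure with a single Artist toggled recovers that element's exact pixel footprint. The helper below is an illustrative sketch of this idea, not the authors' code.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless raster backend
import matplotlib.pyplot as plt

def artist_mask(fig, target):
    """Boolean pixel mask of one rendered artist, via render-and-diff."""
    canvas = fig.canvas
    # Hide every axes child except the target, remembering prior visibility.
    others = [a for ax in fig.axes for a in ax.get_children() if a is not target]
    saved = [(a, a.get_visible()) for a in others]
    for a, _ in saved:
        a.set_visible(False)
    canvas.draw()
    with_target = np.asarray(canvas.buffer_rgba()).copy()
    # Render once more with the target also hidden; the diff is its footprint.
    target.set_visible(False)
    canvas.draw()
    without_target = np.asarray(canvas.buffer_rgba()).copy()
    # Restore original visibility.
    target.set_visible(True)
    for a, vis in saved:
        a.set_visible(vis)
    return (with_target != without_target).any(axis=-1)

fig, ax = plt.subplots(figsize=(3, 2), dpi=100)
bars = ax.bar(["a", "b", "c"], [3, 1, 2])
mask = artist_mask(fig, bars[0])  # pixel-accurate mask of the first bar
print(mask.shape, int(mask.sum()))
```

Because the mask comes from the renderer itself, it is pixel-accurate by construction, which is the property the paper's synthesis pipeline relies on when scaling this up across element types and granularities.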

If this is right

  • Localization of fine chart elements can shift from bounding boxes to pixel-accurate masks.
  • Multi-instance target references become tractable in chart grounding tasks.
  • Performance improves across a wider variety of chart types and referring clue types.
  • The trained system transfers to grounding tasks on real charts drawn from ChartQA.
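Multi-instance references also change how grounding must be scored: a prediction is a set of masks, not one box. A common scheme (a sketch under assumed conventions; the paper's exact protocol may differ) matches predicted instance masks one-to-one to ground-truth masks by IoU and reports an F1:

```python
import numpy as np

def iou(p, g):
    """IoU between two boolean masks."""
    union = np.logical_or(p, g).sum()
    return 0.0 if union == 0 else np.logical_and(p, g).sum() / union

def match_f1(preds, gts, thr=0.5):
    """Greedy one-to-one matching of predicted to ground-truth masks by IoU,
    counting a match as a true positive when IoU >= thr."""
    unmatched = list(range(len(gts)))
    tp = 0
    for p in preds:
        best, best_iou = None, thr
        for j in unmatched:
            v = iou(p, gts[j])
            if v >= best_iou:
                best, best_iou = j, v
        if best is not None:
            unmatched.remove(best)
            tp += 1
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

m1 = np.zeros((4, 4), dtype=bool); m1[:2, :2] = True
m2 = np.zeros((4, 4), dtype=bool); m2[2:, 2:] = True
print(match_f1([m1, m2], [m1, m2]))  # 1.0: both instances matched exactly
```

Greedy matching is the simplest choice; Hungarian matching is the usual refinement when predictions overlap several targets.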

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The synthesis approach could extend to other structured visualization types such as diagrams or infographics.
  • Improved chart grounding may benefit downstream applications like automated chart question answering.
  • The benchmark could act as a targeted test for spatial reasoning in vision-language models focused on data visualizations.

Load-bearing premise

The code-driven synthesis pipeline produces masks that faithfully match real rendered charts, and the benchmark's distribution of clues and chart types reflects practical use cases.

What would settle it

A pixel-level comparison showing systematic misalignment between synthesized masks and manually annotated instances on real rendered charts, or a trained model showing no performance gain on a held-out real-chart test set with new clue distributions.

Figures

Figures reproduced from arXiv: 2605.07415 by Qingfu Zhu, Tianhao Niu, Wanxiang Che, Ziyu Han.

Figure 1
Figure 1. Comparison between ChartREG++ (c) and prior benchmarks. Prior work (a), such as RefChartQA [18] and ChartLens [15], evaluates attribution-aware chart question answering, while (b) ChartRef [17] evaluates the ability to link natural language to chart image elements. In these benchmarks, referred targets are mostly identified from textual/location cues in the expression or simple ranking cues in the data, a… view at source ↗
Figure 2
Figure 2. Distributions of dataset complexity and taxonomy. Top: (left) target image complexity measured by the number of lines in the corresponding plotting code; (middle) complexity of referring expressions measured by sentence length; (right) distribution of the number of referred target instances per query (shown only for multi-target samples). Bottom: (left) distribution of referring cue types; (right) distri… view at source ↗
Figure 3
Figure 3. Proposed pipeline for multi-granularity instance masks with fine-grained chart-element labels. We start from large-scale Matplotlib plotting code collected from the web or synthesized at scale, and trace each plotting API call to the rendered Artist objects together with their associated metadata. Using the Artist hierarchy, we construct a multi-granularity Artist-to-visual mapping that links code-level prim… view at source ↗
Figure 4
Figure 4. Qualitative cases between our method and existing methods. …bounding box so that the box covers the target point. This requires an extra step of imagining/predicting which point pair will form a covering box, which can fail even when the selected points are close to the target. In contrast, our method directly provides candidate point instances (as masks) on the polyline, therefore the MLLM can select the ta… view at source ↗
Figure 5
Figure 5. Breakdown analysis results. We conduct more fine-grained quantitative analysis with different subsets of our benchmark using our model in Sec. 5.2; results are shown in the supplementary material. Effect of chart complexity: we measure chart complexity by the plotting-code length. view at source ↗
Figure 6
Figure 6. ChartLens modification example. view at source ↗
Figure 7
Figure 7. Data referring clue example. view at source ↗
Figure 8
Figure 8. Visual referring clue example (panel headings: Subplot titles and positions; Legend Entry and positions; Non-data axis tick values and positions; Text annotations directly on chart; Axis labels). view at source ↗
Figure 9
Figure 9. Visual referring clue example. view at source ↗
Figure 10
Figure 10. Referring target element example (panels: PolarLinePoints, Fill, Errorbar, Fill_between_density, Treemap, BoxPlot_Boxpatch). view at source ↗
Figure 11
Figure 11. Referring target element example. view at source ↗
Figure 12
Figure 12. Referring target element example. view at source ↗
read the original abstract

Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing chart referring expression grounding-related benchmarks remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume a single or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces ChartREG++, a new benchmark for referring expression grounding on charts that supports multiple localization forms, multiple target instances, diverse referring clues, and a wide range of chart types. It describes a code-driven synthesis pipeline to create pixel-accurate instance masks by leveraging plotting programs, trains an instance segmentation model on these masks, and integrates it into a multimodal large model framework for grounding. The resulting system is claimed to outperform baselines on the proposed benchmark and to generalize well to a real-chart grounding benchmark derived from ChartQA.

Significance. If the synthetic data pipeline is shown to produce faithful representations of real charts and the generalization results are robust, this work could significantly advance the field of chart understanding in vision-language models by providing a more comprehensive benchmark and an improved grounding method. The approach of using code for precise mask generation is a promising direction for data synthesis in structured visual domains.

major comments (3)
  1. The abstract claims consistent outperformance and good generalization but provides no specific metrics, baseline comparisons, error analysis, or quantitative results, which makes it difficult to evaluate the strength and reliability of these claims.
  2. The central generalization claim to the ChartQA-derived benchmark depends on the unvalidated assumption that the synthetic masks from the plotting-code pipeline accurately match real rendered charts; no quantitative fidelity metrics (e.g., IoU with human annotations) are mentioned, which is load-bearing for the practical utility of the results.
  3. There is insufficient detail on the process of converting ChartQA questions into multi-target referring expressions and on the distribution of clues and chart types in this test set, raising questions about whether it reflects real-world use cases and thus whether the generalization is meaningful.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and supporting our claims. We address each major comment below and have made revisions to the manuscript where appropriate to strengthen the presentation.

read point-by-point responses
  1. Referee: The abstract claims consistent outperformance and good generalization but provides no specific metrics, baseline comparisons, error analysis, or quantitative results, which makes it difficult to evaluate the strength and reliability of these claims.

    Authors: We agree that the abstract would benefit from more concrete details to allow readers to better assess our claims. In the revised manuscript, we have updated the abstract to include key quantitative results, such as the mIoU scores of our model versus baselines on the ChartREG++ benchmark and the generalization performance on the ChartQA-derived set. We have also expanded the error analysis section in the main paper to provide supporting evidence for the outperformance and generalization observations. revision: yes

  2. Referee: The central generalization claim to the ChartQA-derived benchmark depends on the unvalidated assumption that the synthetic masks from the plotting-code pipeline accurately match real rendered charts; no quantitative fidelity metrics (e.g., IoU with human annotations) are mentioned, which is load-bearing for the practical utility of the results.

    Authors: This is a fair and important observation regarding the strength of our generalization results. Our code-driven pipeline generates pixel-accurate masks by construction for the synthetic charts through direct use of plotting primitives. For the real-chart generalization, we have added a new discussion subsection that includes qualitative comparisons of synthetic versus real chart visuals to support the similarity assumption. However, we do not provide quantitative fidelity metrics such as IoU against human-annotated masks on real charts, as this would require a separate annotation effort beyond the scope of the current work. We have accordingly moderated the language around the generalization claims to reflect this limitation. revision: partial

  3. Referee: There is insufficient detail on the process of converting ChartQA questions into multi-target referring expressions and on the distribution of clues and chart types in this test set, raising questions about whether it reflects real-world use cases and thus whether the generalization is meaningful.

    Authors: We appreciate this suggestion for greater transparency. In the revised manuscript, we have substantially expanded the relevant section (now including a dedicated subsection and accompanying table) to describe the conversion process: original ChartQA questions were adapted by identifying multi-element references and reformulating them as referring expressions with varied clues. We also report the distribution statistics for chart types (e.g., proportions of bar, line, pie, and scatter charts) and referring clue categories (textual, data-rank, positional, etc.) in the test set. These additions demonstrate alignment with diverse real-world chart scenarios. revision: yes
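The mIoU numbers the rebuttal proposes to report follow the standard mask-IoU definition; a minimal sketch (illustrative, not the authors' exact protocol, including the convention that two empty masks score 1.0):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks; defined as 1.0 when both are empty."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter / union)

def mean_iou(pairs) -> float:
    """Mean IoU over (prediction, ground-truth) mask pairs."""
    return float(np.mean([mask_iou(p, g) for p, g in pairs]))

a = np.zeros((4, 4), dtype=bool); a[:2, :2] = True  # 4 px
b = np.zeros((4, 4), dtype=bool); b[:2, :] = True   # 8 px, overlap 4 px
print(mask_iou(a, b))  # 0.5
```

The same routine, run between synthesized masks and human-annotated masks on real charts, is exactly the fidelity check the referee's second comment asks for.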

Circularity Check

0 steps flagged

No significant circularity; derivation relies on new benchmark construction and external generalization test.

full rationale

The paper introduces a novel benchmark and code-driven synthesis pipeline that generates instance masks from plotting programs, then trains and evaluates an instance segmentation model on this data. Performance is reported on the synthetic benchmark and a separately constructed ChartQA-derived real-chart set. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described chain. The central claims rest on empirical outperformance against baselines under the same evaluation protocol and on an external distribution, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are specified in the abstract; the work relies on standard computer vision and multimodal techniques without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5526 in / 1173 out tokens · 39811 ms · 2026-05-11T01:44:53.476341+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 7 internal anchors

  1. [1]

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  2. [2]

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K.,...

  3. [3]

    Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation (2022), https://arxiv.org/abs/2112.01527

  4. [4]

    Clark, C., Zhang, J., Ma, Z., Park, J.S., Salehi, M., Tripathi, R., Lee, S., Ren, Z., Kim, C.D., Yang, Y., Shao, V., Yang, Y., Huang, W., Gao, Z., Anderson, T., Zhang, J., Jain, J., Stoica, G., Han, W., Farhadi, A., Krishna, R.: Molmo2: Open weights and data for vision-language models with video understanding and grounding (2026), https://arxiv.org/abs/2601.10611

  5. [5]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., Marris, L., Petulla, S., Gaffney, C., Aharoni, A., Lintz, N., Pais, T.C., Jacobsson, H., Szpektor, I., Jiang, N.J., Haridasan, K., Omran, A., Saunshi, N., Bahri, D., Mishra, G., Chu, E., Boyd, T., Hekman, B., Parisi, A., Zhang, ...

  6. [6]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI, Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., Lu, C., Zhao, C., Deng, C., Xu, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Li, E., Zhou, F., Lin, F., Dai, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, H., Li, H., Liang, H., Wei, H., Zhang, H., Luo, H., Ji, H., Ding, H., Tang, H., Cao, H., G...

  7. [7]

    Kantharaj, S., Leong, R.T., Lin, X., Masry, A., Thakkar, M., Hoque, E., Joty, S.: Chart-to-text: A large-scale benchmark for chart summarization. In: Muresan, S., Nakov, P., Villavicencio, A. (eds.) Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 4005–4023. Association for Computational Ling...

  8. [8]

    Li, J., Dong, X., Zang, Y., Cao, Y., Wang, J., Lin, D.: Visual self-refine: A pixel-guided paradigm for accurate chart parsing (2026), https://arxiv.org/abs/2602.16455

  9. [9]

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26286–26296 (2024). https://doi.org/10.1109/CVPR52733.2024.02484

  10. [10]

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Thirty-seventh Conference on Neural Information Processing Systems (2023), https://openreview.net/forum?id=w0H2xGHlkw

  11. [11]

    Masry, A., Long, D.X., Tan, J.Q., Joty, S., Hoque, E.: ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In: Muresan, S., Nakov, P., Villavicencio, A. (eds.) Findings of the Association for Computational Linguistics: ACL 2022. pp. 2263–2279. Association for Computational Linguistics, Dublin, Ireland (May 2022). https://doi.org/10.18653/v1/2022.findings-acl.272

  12. [12]

    Ni, M., Yang, Z., Li, L., Lin, C.C., Lin, K., Zuo, W., Wang, L.: Point-RFT: Improving multimodal reasoning with visually grounded reinforcement finetuning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=wdyOwMISSR

  13. [13]

    OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.L., Brockman, ...

  14. [14]

    Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., Zhang, L.: Grounded sam: Assembling open-world models for diverse visual tasks (2024), https://arxiv.org/abs/2401.14159

  15. [15]

    Suri, M., Mathur, P., Lipka, N., Dernoncourt, F., Rossi, R.A., Manocha, D.: ChartLens: Fine-grained visual attribution in charts. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 22447–22462. Association for Computational Lingu...

  16. [16]

    Tang, L., Kim, G., Zhao, X., Lake, T., Ding, W., Yin, F., Singhal, P., Wadhwa, M., Liu, Z.L., Sprague, Z.R., Namuduri, R., Hu, B., Rodriguez, J.D., Peng, P., Durrett, G.: Chartmuseum: Testing visual reasoning capabilities of large vision-language models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Tr...

  17. [17]

    Tjandrasuwita, M., Liang, P.P., Solar-Lezama, A.: Chartref: Benchmarking fine-grained visual element localization in charts (2025), https://openreview.net/forum?id=Pi1Y2huHLg

  18. [18]

    Vogel, A., Moured, O., Chen, Y., Zhang, J., Stiefelhagen, R.: RefChartQA: Grounding visual answer on chart images through instruction tuning (2025), https://arxiv.org/abs/2503.23131

  19. [19]

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., Wang, Z., Chen, Z., Zhang, H., Yang, G., Wang, H., Wei, Q., Yin, J., Li, W., Cui, E., Chen, G., Ding, Z., Tian, C., Wu, Z., Xie, J., Li, Z., Yang, B., Duan, Y., Wang, X., Hou, Z., Hao, H., Zhang, T., Li, S., Zhao, X., Duan, H., Deng, N., Fu, B., He, Y., Wang, Y., He,...

  20. [20]

    Wang, Z., Xia, M., He, L., Chen, H., Liu, Y., Zhu, R., Liang, K., Wu, X., Liu, H., Malladi, S., Chevalier, A., Arora, S., Chen, D.: Charxiv: Charting gaps in realistic chart understanding in multimodal LLMs. In: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2024), https://openreview.net/forum?id=cy8mq7QYae

  21. [21]

    Xu, Z., Du, S., Qi, Y., SiwenLu, Xu, C., Yuan, C., Guo, J.: Chartpoint: Guiding MLLMs with grounding reflection for chart reasoning (2025), https://arxiv.org/abs/2512.00305

  22. [22]

    Yang, C., Shi, C., Liu, Y., Shui, B., Wang, J., Jing, M., Xu, L., Zhu, X., Li, S., Zhang, Y., Liu, G., Nie, X., Cai, D., Yang, Y.: Chartmimic: Evaluating LMM's cross-modal reasoning capability via chart-to-code generation. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=sGpCzsfd1K

  23. [23]

    Yang, J., Zhang, H., Li, F., Zou, X., Li, C., Gao, J.: Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v (2023), https://arxiv.org/abs/2310.11441

  24. [24]

    Yang, Y., Zhang, Z., Hou, Y., Li, Z., Liu, G., Payani, A., Ting, Y.S., Zheng, L.: Effective training data synthesis for improving mllm chart understanding (2025), https://arxiv.org/abs/2508.06492

  25. [25]

    Yuan, H., Li, X., Zhang, T., Sun, Y., Huang, Z., Xu, S., Ji, S., Tong, Y., Qi, L., Feng, J., Yang, M.H.: Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos (2025), https://arxiv.org/abs/2501.04001

  26. [26]

    Zhao, X., Luo, X., Shi, Q., Chen, C., Wang, S., Liu, Z., Sun, M.: ChartCoder: Advancing multimodal large language model for chart-to-code generation. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 7333–7348. Association for Co...

  27. [27]

    Zhou, Y., Chen, Y., Lin, H., Wu, Y., Yang, S., Qi, Z., Ma, C., Zhu, L., Shan, Y.: Dogr: Towards versatile visual document grounding and referring (2025), https://arxiv.org/abs/2411.17125

  28. [28]

    Zhu, J., Zhou, Y., Wang, Z., Yao, J., Gu, Y., Yuan, Y., Liu, S.: Infodet: A dataset for infographic element detection. In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=Wj0Sc9WBHZ
