Assessing Y-Axis Influence: Bias in Multimodal Language Models on Chart-to-Table Translation
Pith reviewed 2026-05-08 03:20 UTC · model grok-4.3
The pith
Variations in y-axis tick digit length, tick count, value range, and value format introduce significant biases when multimodal language models translate charts to tables.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Y-axis information creates measurable biases during chart-to-table translation by multimodal language models, with accuracy varying systematically according to the digit length of major tick values, the number of major ticks, the value range, and the tick value format, as shown by systematic testing in the FairChart2Table framework.
What carries the argument
FairChart2Table, a framework that generates controlled variations of chart images to isolate y-axis properties and quantify their impact on model translation accuracy.
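The controlled-variation idea can be sketched in a few lines. The sketch below uses illustrative names (`make_yaxis_variants` and `format_tick` are not FairChart2Table's actual API): it rescales one data series so that only the digit length of the top major tick, the number of major ticks, and the tick label format change, while the shape of the underlying data stays fixed.

```python
from itertools import product

def make_yaxis_variants(values, digit_lengths=(2, 3, 4), tick_counts=(4, 8)):
    """Generate controlled y-axis specs for one underlying data series.

    Each variant rescales the data so the top major tick has the requested
    digit length, then lays out the requested number of major ticks.
    Names and parameters are illustrative, not FairChart2Table's API.
    """
    base_max = max(values)
    variants = []
    for digits, n_ticks in product(digit_lengths, tick_counts):
        scale = 10 ** (digits - 1) / base_max      # top value gets `digits` digits
        scaled = [v * scale for v in values]
        step = max(scaled) / n_ticks
        variants.append({
            "digits": digits,
            "n_ticks": n_ticks,
            "values": scaled,
            "yticks": [round(i * step) for i in range(n_ticks + 1)],
        })
    return variants

def format_tick(v, style="plain"):
    """Render one tick value in plain, abbreviated, or scientific format."""
    if style == "abbrev":
        return f"{v / 1000:g}K" if v >= 1000 else f"{v:g}"
    if style == "scientific":
        return f"{v:.1e}"
    return f"{v:g}"
```

Rendering each variant spec to an image (e.g., with a plotting library) and re-running the same model on every variant is what isolates the y-axis feature from the data itself.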
Load-bearing premise
That the performance differences across y-axis changes arise mainly from imbalances in public chart datasets rather than from model architectures or other training and evaluation factors.
What would settle it
Retrain one of the tested models on a version of the data where y-axis digit lengths, tick counts, ranges, and formats are balanced, then check whether the accuracy gaps for those features disappear.
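Before any retraining, the imbalance itself has to be quantified per feature. A minimal sketch, assuming each chart's metadata reduces to its top tick value (`digit_length` and `imbalance_ratio` are hypothetical helpers, not from the paper):

```python
from collections import Counter

def digit_length(v):
    """Digit length of a tick value's integer part (values below 1 count as 1)."""
    return len(str(int(abs(v)))) if abs(v) >= 1 else 1

def imbalance_ratio(max_ticks, feature=digit_length):
    """Ratio of the largest to the smallest bucket over a y-axis feature.

    `max_ticks` stands in for per-chart metadata (here: the top tick value
    of each chart); 1.0 means the dataset is perfectly balanced on `feature`.
    """
    counts = Counter(feature(v) for v in max_ticks)
    return max(counts.values()) / min(counts.values())
```

A balanced retraining set would drive this ratio toward 1.0 for every feature under test before checking whether the accuracy gaps persist.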
original abstract
Chart-to-table translation converts chart images into structured tabular data. Accurate translation is crucial for Multimodal Language Model (MLM) to answer complex queries. We observe imbalances in the number of images across different aspects of the y-axis information in public chart datasets. Such imbalances can introduce unintended biases, causing uneven MLM performance. Previous works have not systematically examined these biases. To address this gap, we propose a new framework, FairChart2Table, for analyzing y-axis-related bias on five state-of-the-art models. Key Findings: (1) There are significant y-axis biases related to the digit length of the major tick values, the number of major ticks, the range of values, and the tick value format (e.g., abbreviation or scientific format). (2) The number of legends/entities in chart images impacts MLM performance. (3) Prompting MLM with y-axis information can significantly enhance the performance for some MLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that public chart datasets exhibit imbalances in y-axis information (digit length of major ticks, number of major ticks, value ranges, and tick formats), which introduce biases in multimodal language models (MLMs) performing chart-to-table translation. It introduces the FairChart2Table framework to systematically evaluate these biases across five state-of-the-art MLMs, additionally reporting that the number of legends/entities affects performance and that y-axis-aware prompting improves results for some models.
Significance. If the central empirical claims hold after addressing controls, the work is significant for surfacing actionable biases in MLMs for visual data extraction, a growing application area. The FairChart2Table framework provides a reusable structure for bias auditing in multimodal tasks, and the prompting result offers a practical mitigation path. These elements strengthen the contribution beyond pure observation.
major comments (2)
- [Abstract] The key findings on y-axis biases and prompting effects are stated without methodology details, sample sizes, statistical tests, error bars, or dataset specifics, preventing verification that the data supports the claims as stated.
- [FairChart2Table framework] Experimental design: the evaluation tests five MLMs but does not hold model architecture, pre-training corpus, or prompt formatting fixed while varying only y-axis properties, nor does it report per-model training-data statistics for the tested chart styles. Performance differences therefore cannot be cleanly attributed to the claimed y-axis feature imbalances rather than to architecture-by-feature interactions or evaluation artifacts.
minor comments (1)
- [Abstract] Consider adding one sentence on the scale of the chart corpus and the exact metrics used (e.g., exact-match accuracy or F1) to give readers immediate context for the reported biases.
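The metric question can be made concrete. Below is a hedged sketch of cell-level F1 between a predicted and a gold table; the paper's actual metric is not specified in the abstract and may differ.

```python
def table_cell_f1(pred, gold):
    """Cell-level F1 between predicted and gold tables (lists of rows).

    One common way to score chart-to-table output; cells match only when
    the value appears at the same (row, column) position.
    """
    pred_cells = {(i, j, str(c)) for i, row in enumerate(pred) for j, c in enumerate(row)}
    gold_cells = {(i, j, str(c)) for i, row in enumerate(gold) for j, c in enumerate(row)}
    if not pred_cells or not gold_cells:
        return 0.0
    tp = len(pred_cells & gold_cells)          # exactly matching cells
    if tp == 0:
        return 0.0
    precision = tp / len(pred_cells)
    recall = tp / len(gold_cells)
    return 2 * precision * recall / (precision + recall)
```

Reporting one such metric alongside corpus scale would let readers calibrate the size of the reported bias gaps.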
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive comments. We address each major point below, indicating planned revisions to strengthen the manuscript while clarifying the scope of our observational study on existing MLMs.
point-by-point responses
- Referee: [Abstract] The key findings on y-axis biases and prompting effects are stated without methodology details, sample sizes, statistical tests, error bars, or dataset specifics, preventing verification that the data supports the claims as stated.
  Authors: We agree that the abstract, as a high-level summary, omits granular details to meet length limits. The full manuscript (Sections 3 and 4) specifies the five MLMs evaluated, the public chart datasets analyzed for imbalances, the FairChart2Table evaluation protocol, and quantitative results. To improve verifiability, we will revise the abstract to briefly note the evaluation scale, the standard metrics used with reported variance, and the statistical significance of observed biases. revision: yes
- Referee: [FairChart2Table framework] Experimental design: the evaluation tests five MLMs but does not hold model architecture, pre-training corpus, or prompt formatting fixed while varying only y-axis properties, nor does it report per-model training-data statistics for the tested chart styles; performance differences therefore cannot be cleanly attributed to the claimed y-axis feature imbalances rather than to architecture-by-feature interactions or evaluation artifacts.
  Authors: Our framework is designed to audit biases in deployed state-of-the-art MLMs rather than to isolate causal effects through controlled ablations on identical architectures or corpora, which would require training new models from scratch. We report per-model performance breakdowns and will expand the discussion to explicitly address potential architecture-by-feature interactions and prompt sensitivity as limitations. Publicly available information on pre-training data for chart styles is limited, but we will incorporate any disclosed details and add a dedicated limitations subsection on this point. revision: partial
- Detailed per-model training-data statistics specific to chart styles are not publicly disclosed by the developers of the five evaluated MLMs, limiting our ability to fully quantify exposure.
Circularity Check
No circularity: purely empirical observational study with no derivations or fitted predictions
full rationale
The paper conducts direct experiments on five MLMs using the FairChart2Table framework to measure performance differences across controlled y-axis variations in chart images. All key findings (biases related to digit length, tick count, range, format, and legend count; benefits of y-axis prompting) rest on observed accuracy metrics from model testing rather than any equations, parameter fitting, predictions derived from inputs, or self-citation chains. No derivation chain exists that reduces to its own inputs by construction, satisfying the criteria for a self-contained empirical analysis.
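"Observed accuracy metrics" leaves open how a gap between two y-axis conditions would be certified as significant. One generic possibility, not the paper's reported procedure, is a paired bootstrap over charts:

```python
import random

def bootstrap_gap_ci(cond_a, cond_b, n_boot=2000, seed=0):
    """95% bootstrap confidence interval for the accuracy gap between two
    y-axis conditions.

    `cond_a`/`cond_b` are per-chart 0/1 correctness lists, paired by index
    (the same chart rendered under two y-axis treatments). This is a generic
    paired check, not the paper's statistical procedure.
    """
    rng = random.Random(seed)
    n = len(cond_a)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]           # resample charts
        gaps.append(sum(cond_a[i] - cond_b[i] for i in idx) / n)
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]
```

A confidence interval that excludes zero for a given y-axis feature would support calling the corresponding gap a genuine bias rather than noise.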
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: performance differences in model outputs across y-axis variations indicate bias caused by dataset imbalances
Reference graph
Works this paper leans on
[1] Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, et al. 2024. Pixtral 12B. arXiv preprint arXiv:2410.07073.
[2] Mubashara Akhtar, Oana Cocarascu, and Elena Simperl. 2023. Reading and reasoning over chart images for evidence-based automated fact-checking. In Findings of the Association for Computational Linguistics: EACL 2023, pages 399–414, Dubrovnik, Croatia. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-eacl.30
[3] Ritwick Chaudhry, Sumit Shekhar, Utkarsh Gupta, Pranav Maneriker, Prann Bansal, and Ajay Joshi. 2020. LEAF-QA: Locate, encode & attend for figure question answering. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 3501–3510. https://doi.org/10.1109/WACV45572.2020.9093269
[4] Ashim Gupta, Vivek Gupta, Shuo Zhang, Yujie He, Ning Zhang, and Shalin Shah. 2024. Enhancing question answering on charts through effective pre-training tasks. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 185–192, Miami, Florida, US. Associatio... https://doi.org/10.18653/v1/2024.blackboxnlp-1.11
[5] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276.
[6] Mohammed Saidul Islam, Raian Rahman, Ahmed Masry, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, and Enamul Hoque. 2024. Are large vision language models up to the challenge of chart comprehension and reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3334–3368, Mi... https://doi.org/10.18653/v1/2024.findings-emnlp.191
[7] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. 2018. DVQA: Understanding data visualizations via question answering. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5648–5656. https://doi.org/10.1109/CVPR.2018.00592
[8] Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. 2017. FigureQA: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300. https://doi.org/10.48550/arXiv.1710.07300
[9] Shankar Kantharaj, Rixie Tiffany Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, and Shafiq Joty. 2022. Chart-to-text: A large-scale benchmark for chart summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4005–402... https://doi.org/10.18653/v1/2022.acl-long.277
[10] Wonjoong Kim, Sangwu Park, Yeonjun In, Seokwon Han, and Chanyoung Park. 2025. SIMPLOT: Enhancing chart question answering by distilling essentials. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 573–593.
[11] Fangyu Liu, Julian Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. 2023a. DePlot: One-shot visual language reasoning by plot-to-table translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10381–10399.
[12] Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Eisenschlos. 2023b. MatCha: Enhancing visual language pretraining with math reasoning and chart derendering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...
[13] Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. 2024. MMC: Advancing multimodal chart understanding with large-scale instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Lingui... https://doi.org/10.18653/v1/2024.naacl-long.70
[14] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279.
[15] Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. 2023. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14662–14684.
[16] Ahmed Masry, Mehrad Shahmohammadi, Md Rizwan Parvez, Enamul Hoque, and Shafiq Joty. 2024. ChartInstruct: Instruction tuning for chart comprehension and reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10387–10409, Bangkok, Thailand. Association for Computational... https://doi.org/10.18653/v1/2024.findings-acl.619
[17] Ahmed Masry, Megh Thakkar, Aayush Bajaj, Aaryaman Kartha, Enamul Hoque, and Shafiq Joty. 2025. ChartGemma: Visual instruction-tuning for chart reasoning in the wild. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 625–643, Abu Dhabi, UAE. Associa... https://aclanthology.org/2025.coling-industry.54/
[18] Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, and Ping Luo. 2024. ChartAssistant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7775–7803.
[19] Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and Pratyush Kumar. 2020. PlotQA: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536.
[20] Srija Mukhopadhyay, Adnan Qidwai, Aparna Garimella, Pritika Ramu, Vivek Gupta, and Dan Roth. 2024. Unraveling the truth: Do VLMs really understand charts? A deep dive into consistency and robustness. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16696–16717, Miami, Fl... https://doi.org/10.18653/v1/2024.findings-emnlp.973
[21] Karl Pearson. 1896. VII. Mathematical contributions to the theory of evolution.—III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, (187):253–318.
[22] Shraman Pramanick, Rama Chellappa, and Subhashini Venugopalan. 2024. SPIQA: A dataset for multimodal question answering on scientific papers. Advances in Neural Information Processing Systems, 37:118807–118833.
[23] Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang. 2024. TinyChart: Efficient chart understanding with program-of-thoughts learning and visual token merging. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1882–1898,... https://doi.org/10.18653/v1/2024.emnlp-main.112
[24] Zifeng Zhu, Mengzhao Jia, Zhihan Zhang, Lang Li, and Meng Jiang. 2025. MultiChartQA: Benchmarking vision-language models on multi-chart problems. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 11341–11359.
discussion (0)