pith. machine review for the scientific record.

arxiv: 2604.19405 · v1 · submitted 2026-04-21 · 💻 cs.CL

Recognition: unknown

Lost in Translation: Do LVLM Judges Generalize Across Languages?

Amran Bhuiyan, Enamul Hoque, Jimmy Huang, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, Mizanur Rahman, Mohammed Saidul Islam, Shafiq Joty

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual evaluation · LVLM judges · reward models · vision-language models · cross-lingual robustness · multimodal benchmark · preference evaluation

The pith

LVLM judges exhibit inconsistent performance across 25 languages, with model size and architecture proving poor predictors of robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MM-JudgeBench to test whether large vision-language model judges generalize reliably beyond English. It constructs over 60,000 pairwise preference instances spanning general vision-language tasks and chart-centric reasoning in 25 typologically diverse languages. Evaluation of 22 models reveals large cross-lingual variance in judgment quality. The findings indicate that current reward models cannot be assumed to work uniformly for alignment and evaluation outside English-dominant settings. This matters because automated evaluators underpin much of LVLM development, so language-specific failures limit their global applicability.

Core claim

By releasing MM-JudgeBench with two complementary subsets—one extending VL-RewardBench for general vision-language preferences and one derived from OpenCQA for chart-centric reasoning—the authors demonstrate that LVLM judges display substantial performance variance across 25 languages. Their analysis of 15 open-source and 7 proprietary models shows that neither larger scale nor specific architectures reliably improve multilingual consistency, and that even leading judges produce inconsistent preference decisions depending on the input language.

What carries the argument

MM-JudgeBench, the benchmark of over 60K pairwise preference instances across 25 languages formed by extending existing English vision-language and chart-reasoning datasets.
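
To make the evaluation protocol concrete, here is a minimal sketch of how pairwise judge scoring across languages could be computed; the `judge` function, field names, and data layout are assumptions for illustration, not the authors' released code.

```python
from collections import defaultdict
from statistics import mean, stdev

def judge(image, question, response_a, response_b):
    """Placeholder for an LVLM judge call; returns "A" or "B".
    Any open or proprietary model could sit behind this signature."""
    raise NotImplementedError("plug an LVLM judge in here")

def per_language_accuracy(instances):
    """Score one judge on pairwise preference instances.

    Each instance is assumed to carry an image, a question, two candidate
    responses, a gold label ("A" or "B"), and a language code.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ex in instances:
        pick = judge(ex["image"], ex["question"], ex["response_a"], ex["response_b"])
        correct[ex["language"]] += int(pick == ex["label"])
        total[ex["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

def cross_lingual_spread(per_lang_acc):
    """The paper's headline quantity: how far accuracy moves across languages."""
    scores = list(per_lang_acc.values())
    return mean(scores), stdev(scores), max(scores) - min(scores)
```

Running this loop for each of the 22 judges and comparing the resulting spread against parameter count is, in essence, the experiment behind the size-is-a-poor-predictor finding.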

If this is right

  • Reward models for LVLMs require explicit multilingual testing rather than English-only validation to ensure reliable use.
  • Current state-of-the-art LVLM judges cannot be deployed uniformly for non-English alignment or evaluation tasks without additional adaptation.
  • The released multilingual training split from MM-RewardBench offers a starting point for domain adaptation to reduce cross-lingual variance.
  • Architectural scaling alone will not resolve inconsistent behavior in multilingual settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same generalization gaps likely appear in related multilingual alignment tasks such as safety and instruction following for LVLMs.
  • Future benchmarks could incorporate direct native-speaker creation of preference pairs instead of translation to isolate true cross-lingual effects.
  • The variance across languages points to imbalances in the visual-text pretraining data that reward models inherit.

Load-bearing premise

The benchmark's translated and extended preference instances accurately reflect native speaker judgments without introducing significant translation artifacts or cultural biases.

What would settle it

A direct comparison in which native speakers of several benchmark languages re-annotate the same image-text pairs and show low agreement with the existing preference labels would indicate the observed inconsistencies arise from benchmark construction rather than model behavior.
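
Concretely, such a re-annotation study reduces to a chance-corrected agreement statistic between the benchmark's translated labels and fresh native-speaker labels. A minimal sketch using Cohen's kappa, with the data layout assumed for illustration:

```python
from sklearn.metrics import cohen_kappa_score

def reannotation_agreement(benchmark_labels, native_labels):
    """Agreement between translated preference labels and native re-annotations.

    Both arguments: equal-length lists of "A"/"B" picks over the same
    image-text pairs in one language. Kappa near 0 would point to
    translation artifacts in the benchmark; kappa near 1 would leave the
    observed cross-lingual variance attributable to the judges themselves.
    """
    return cohen_kappa_score(benchmark_labels, native_labels)

# Toy usage (labels are illustrative only):
# reannotation_agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```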

Figures

Figures reproduced from arXiv: 2604.19405 by Amran Bhuiyan, Enamul Hoque, Jimmy Huang, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, Mizanur Rahman, Mohammed Saidul Islam, Shafiq Joty.

Figure 1
Figure 1: Example from the VL-RewardBench subset of MM-JudgeBench illustrating multilingual evaluation of LVLM judges for a given image. The question and candidate responses are translated from English to French. The LVLM judge (Gemini-2.5-Flash-Lite) selects the correct response A for English, while incorrectly selecting B for French, highlighting the need for multilingual evaluation of LVLM judges. reasoning over dive… view at source ↗
Figure 2
Figure 2: An overview of our methodology: (a) Benchmark construction step contains two stages, i.e., Translation model selection, and Translation data generation (from VL-RewardBench and OpenCQA data); and (b) End-to-End evaluation pipeline. The first systematic benchmark proposed for this purpose was RewardBench (Lambert et al., 2025). However, this benchmark was restricted to text-based modality on English-centri… view at source ↗
Figure 3
Figure 3: Position Bias in multilingual M-VL-RewardBench for closed models. Lower values are better. OpenCQA usually has lower position bias and variance in comparison to M-VL-RewardBench. view at source ↗
Figure 4
Figure 4: Translation sensitivity analysis on M-VL-RewardBench with Gemini models as the translator and… view at source ↗
Figure 6
Figure 6: Output Instruction Following Accuracy in M-VL-RewardBench. Gemini-2.5-Flash-Lite showing modest degradation (98%). Among open models, while none of them could reach 100% accuracy, almost all of them with more than 2B parameters achieve above 95% accuracy (with Gemma-3-12B achieving the best 99.84% accuracy). Meanwhile, models below 2B parameters show poorer output format instruction following accuracy (… view at source ↗
Figure 7
Figure 7: Example from MM-JudgeBench (OpenCQA) illustrating multilingual reward evaluation. A chart im… view at source ↗
Figure 8
Figure 8: Performance Comparison based on Script Groups. view at source ↗
Figure 9
Figure 9: Performance Comparison based on Resource Level. view at source ↗
Original abstract

Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MM-JudgeBench, the first large-scale benchmark for multilingual multimodal LVLM judge evaluation, comprising over 60K pairwise preference instances across 25 typologically diverse languages. It integrates a general vision-language preference subset extending VL-RewardBench and a chart-centric visual-text reasoning subset derived from OpenCQA. The authors evaluate 22 LVLMs (15 open-source, 7 proprietary), report substantial cross-lingual performance variance, conclude that model size and architecture are poor predictors of multilingual robustness, and note inconsistent behavior even in state-of-the-art judges. They additionally release a disjoint multilingual training set derived from MM-RewardBench to support domain adaptation.

Significance. If the benchmark labels prove reliable, this work is significant for demonstrating fundamental limitations in the cross-lingual generalization of LVLM-based reward models, which are central to alignment pipelines. The scale (60K+ instances, 25 languages), complementary subsets, and public release of both the evaluation benchmark and training data are clear strengths that enable reproducible research and targeted improvements in multilingual evaluator development.

major comments (2)
  1. [§3] Benchmark Construction: The extension of English-only sources (VL-RewardBench and OpenCQA) to 25 languages is described as adding translated text for the same visual inputs, but no details are provided on whether preference labels were re-annotated by native speakers, back-translated for fidelity, or validated against cultural/lexical artifacts. This is load-bearing for the central claim of model inconsistency across languages, because translation-induced shifts in preference ordering would produce apparent variance that reflects label noise rather than LVLM limitations.
  2. [§4] Experimental Results: The analysis that model size and architecture are poor predictors of multilingual robustness lacks any reported statistical support such as correlation coefficients, regression results, or ablation studies. In addition, the abstract and results summary provide no error bars, confidence intervals, or significance tests for the cross-lingual performance differences, weakening the evidence for the reported 'substantial variance' and 'inconsistent behavior'.
minor comments (2)
  1. [Abstract] The exact total instance count and per-subset or per-language breakdown would improve transparency over the approximate 'over 60K' figure.
  2. [Introduction] Early definition of the term 'LVLM judges' and explicit expansion of acronyms such as VL-RewardBench on first use would aid readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below, indicating the changes we will make in the revised manuscript.

Point-by-point responses
  1. Referee: [§3] Benchmark Construction: The extension of English-only sources (VL-RewardBench and OpenCQA) to 25 languages is described as adding translated text for the same visual inputs, but no details are provided on whether preference labels were re-annotated by native speakers, back-translated for fidelity, or validated against cultural/lexical artifacts. This is load-bearing for the central claim of model inconsistency across languages, because translation-induced shifts in preference ordering would produce apparent variance that reflects label noise rather than LVLM limitations.

    Authors: We appreciate the referee highlighting the need for greater transparency in benchmark construction. The manuscript describes extending the English sources by translating the textual elements while retaining the identical visual inputs and the original English preference labels. We did not re-annotate the labels with native speakers for the 25 languages, as the goal was to evaluate judge behavior on equivalent preference judgments across languages. We acknowledge that this design choice leaves open the possibility of translation-induced artifacts. In the revision we will expand §3 with: (i) the exact translation pipeline and service employed, (ii) results of back-translation fidelity checks performed on a sampled subset (a sketch of one such check follows this exchange), and (iii) an explicit limitations paragraph discussing potential cultural or lexical shifts. These additions will allow readers to evaluate whether the reported cross-lingual variance is attributable to label noise. revision: yes

  2. Referee: [§4] Experimental Results: The analysis that model size and architecture are poor predictors of multilingual robustness lacks any reported statistical support such as correlation coefficients, regression results, or ablation studies. In addition, the abstract and results summary provide no error bars, confidence intervals, or significance tests for the cross-lingual performance differences, weakening the evidence for the reported 'substantial variance' and 'inconsistent behavior'.

    Authors: We agree that formal statistical support would strengthen the claims. The current manuscript demonstrates variance through per-language performance tables and figures, but does not include correlation analyses or error bars. In the revised version we will add to §4: Pearson and Spearman correlations between model parameter count and cross-lingual performance standard deviation; analogous breakdowns by architecture family; standard-error bars on all relevant plots; and paired significance tests (Wilcoxon signed-rank) between languages for representative models (a sketch of these computations follows this exchange). The abstract will be updated to reference these quantitative supports for the variance and inconsistency findings. revision: yes
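
On point 1, one plausible shape for the promised back-translation fidelity check is to embed each English source and its round-tripped counterpart with a language-agnostic encoder such as LaBSE (which appears in the paper's reference list) and flag low-similarity pairs; the encoder choice and the 0.85 cutoff below are assumptions, not the authors' pipeline.

```python
from sentence_transformers import SentenceTransformer

# LaBSE produces language-agnostic sentence embeddings; the 0.85 cutoff is
# an assumed threshold for flagging suspect translations, not a published value.
encoder = SentenceTransformer("sentence-transformers/LaBSE")

def flag_unfaithful(sources, back_translations, threshold=0.85):
    """Return indices of pairs whose round-trip cosine similarity falls below threshold."""
    src = encoder.encode(sources, convert_to_tensor=True, normalize_embeddings=True)
    bt = encoder.encode(back_translations, convert_to_tensor=True, normalize_embeddings=True)
    sims = (src * bt).sum(dim=1)  # cosine similarity, since vectors are normalized
    return [i for i, s in enumerate(sims.tolist()) if s < threshold]
```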
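
On point 2, the proposed statistics are standard and cheap to compute once per-language scores exist; a SciPy sketch, with all inputs standing in for the paper's actual numbers:

```python
from scipy.stats import pearsonr, spearmanr, wilcoxon

def size_vs_robustness(param_counts_b, cross_lang_std):
    """Correlate model scale with cross-lingual spread (the claim: weak at best).

    param_counts_b: parameter counts in billions, one per model.
    cross_lang_std: std of each model's accuracy over the 25 languages.
    """
    return {
        "pearson": pearsonr(param_counts_b, cross_lang_std),
        "spearman": spearmanr(param_counts_b, cross_lang_std),
    }

def paired_language_test(scores_lang_x, scores_lang_y):
    """Wilcoxon signed-rank test over one model's paired per-item scores
    in two languages; a small p-value indicates a systematic gap."""
    return wilcoxon(scores_lang_x, scores_lang_y)
```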

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and model evaluation

Full rationale

The paper introduces MM-JudgeBench by extending VL-RewardBench and OpenCQA to 25 languages via translation and evaluates 22 LVLMs on over 60K preference instances. All central claims (cross-lingual variance, poor predictive power of size/architecture) rest on new data collection and direct performance measurements rather than any derivation, equation, fitted parameter renamed as prediction, or self-citation chain. No mathematical steps, ansatzes, or uniqueness theorems are invoked; the work measures models against benchmarks built from external sources and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard assumptions from prior benchmarks like VL-RewardBench and OpenCQA for preference data construction and model evaluation; no free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5567 in / 1110 out tokens · 43703 ms · 2026-05-10T02:14:06.083287+00:00 · methodology


Reference graph

Works this paper leans on

47 extracted references · 20 canonical work pages · 7 internal anchors

  1. [1] VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models. Proceedings of the Computer Vision and Pattern Recognition Conference.
  2. [2] BERTScore: Evaluating Text Generation with BERT.
  3. [3] Query-OPT: Optimizing Inference of Large Language Models via Multi-Query Instructions in Meeting Summarization. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track.
  4. [4] Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; et al. Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Oct 2020. doi:10.18653/v1/2020.emnlp-demos.6.
  5. [5] Efficient memory management for large language model serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles.
  6. [6] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261.
  7. [7] Pixtral 12B. arXiv preprint arXiv:2410.07073.
  8. [8] Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631.
  9. [9] Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness. arXiv preprint arXiv:2509.13332.
  10. [10] Laskar, Md Tahmid Rahman; Bari, M Saiful; Rahman, Mizanur; Bhuiyan, Md Amran Hossen; Joty, Shafiq; Huang, Jimmy Xiangji. A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/v1/2023.findings-acl.29.
  11. [11] Jahan, Israt; Laskar, Md Tahmid Rahman; Peng, Chun; Huang, Jimmy. Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers. Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, 2023. doi:10.18653/v1/2023.bionlp-1.30.
  12. [12] Reinforcement learning: A survey. Journal of Artificial Intelligence Research.
  13. [13] Laskar, Md Tahmid Rahman; Jahan, Israt; Dolatabadi, Elham; Peng, Chun; Hoque, Enamul; Huang, Jimmy. Improving automatic evaluation of large language models…
  14. [14] Domain adaptation with pre-trained transformers for query-focused abstractive text summarization. Computational Linguistics, 2022.
  15. [15] Laskar, Md Tahmid Rahman; Hoque, Enamul; Huang, Jimmy Xiangji. WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization. Proceedings of the 28th International Conference on Computational Linguistics, 2020. doi:10.18653/v1/2020.coling-main.495.
  16. [16] InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv preprint arXiv:2508.18265.
  17. [17] Filtering and mining parallel data in a joint multilingual space. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
  18. [18] MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. Forty-first International Conference on Machine Learning.
  19. [19] "A Little is Enough": Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation. Findings of the Association for Computational Linguistics: ACL 2023.
  20. [20] Padarha, Shreyansh; Semenova, Elizaveta; Vidgen, Bertie; Mahdi, Adam; Hale, Scott A. Evaluating…
  21. [21] Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786.
  22. [22] Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? Findings of the Association for Computational Linguistics: EMNLP 2024.
  23. [23] CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task. Proceedings of the Seventh Conference on Machine Translation (WMT).
  24. [24] Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei. Language-agnostic BERT Sentence Embedding. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022. doi:10.18653/v1/2022.acl-long.62.
  25. [25] Findings of the WMT25 general machine translation shared task: Time to stop evaluating on easy test sets. Proceedings of the Tenth Conference on Machine Translation.
  26. [26] Alqahtani, Sawsan; Nayeem, Mir Tafseer; Laskar, Md Tahmid Rahman; Mohiuddin, Tasnim; Bari, M Saiful. Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2026.
  27. [27] Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation. 2025.
  28. [28] Prometheus-Vision: Vision-language model as a judge for fine-grained evaluation. Findings of the Association for Computational Linguistics: ACL 2024.
  29. [29] LLaVA-Critic: Learning to evaluate multimodal models. Proceedings of the Computer Vision and Pattern Recognition Conference.
  30. [30] M-RewardBench: Evaluating reward models in multilingual settings. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  31. [31] Multimodal RewardBench: Holistic evaluation of reward models for vision language models. arXiv preprint arXiv:2502.14191, 2025.
  32. [32] A survey on multimodal large language models. National Science Review, 2024.
  33. [33] A survey of multilingual large language models. Patterns, 2025.
  34. [34] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  35. [35] OpenCQA: Open-ended question answering with charts. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
  36. [36] Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
  37. [37] Li, Dawei; Jiang, Bohan; Huang, Liangjie; Beigi, Alimohammad; Zhao, Chengshuai; Tan, Zhen; Bhattacharjee, Amrita; Jiang, Yuxuan; Chen, Canyu; Wu, Tianhao; et al. From Generation to Judgment: Opportunities and Challenges of LLM-as-a-Judge.
  38. [38] Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; Zhuang, Siyuan; Wu, Zhanghao; Zhuang, Yonghao; Lin, Zi; Li, Zhuohan; Li, Dacheng; Xing, Eric; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
  39. [39] RewardBench: Evaluating reward models for language modeling. Findings of the Association for Computational Linguistics: NAACL 2025.
  40. [40] Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems.
  41. [41] Laskar, Md Tahmid Rahman; Islam, Mohammed Saidul; Mahbub, Ridwan; Rahman, Mizanur; Bhuiyan, Amran; Jahan, Israt; Nayeem, Mir Tafseer; Joty, Shafiq; Hoque, Enamul; Huang, Jimmy. Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  42. [42] Shohan, Faisal Tareque; Nayeem, Mir Tafseer; Islam, Samsul; Akash, Abu Ubaida; Joty, Shafiq. XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags. Findings of the Association for Computational Linguistics: ACL 2024. doi:10.18653/v1/2024.findings-acl.771.
  43. [43] Laskar, Md Tahmid Rahman; Alqahtani, Sawsan; Bari, M Saiful; Rahman, Mizanur; Khan, Mohammad Abdullah Matin; Khan, Haidar; Jahan, Israt; Bhuiyan, Amran; Tan, Chee Wei; Parvez, Md Rizwan; Hoque, Enamul; Joty, Shafiq; Huang, Jimmy. A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations.
  44. [44] Mahbub, Ridwan; Islam, Mohammed Saidul; Nayeem, Mir Tafseer; Laskar, Md Tahmid Rahman; Rahman, Mizanur; Joty, Shafiq; Hoque, Enamul. From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  45. [45] Mahbub, Ridwan; Islam, Mohammed Saidul; Laskar, Md Tahmid Rahman; Rahman, Mizanur; Nayeem, Mir Tafseer; Hoque, Enamul. The Perils of Chart Deception: How Misleading Visualizations Affect Vision-Language Models.
  46. [46] Laskar, Md Tahmid Rahman; Islam, Mohammed Saidul; Mahbub, Ridwan; Masry, Ahmed; Rahman, Mizanur; Bhuiyan, Amran; Nayeem, Mir Tafseer; Joty, Shafiq; Hoque, Enamul; Huang, Jimmy. Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
  47. [47] Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024. doi:10.18653/v1/2024.acl-long.845.