pith. machine review for the scientific record.

arxiv: 2604.19405 · v1 · submitted 2026-04-21 · 💻 cs.CL

Recognition: unknown

Lost in Translation: Do LVLM Judges Generalize Across Languages?

Amran Bhuiyan, Enamul Hoque, Jimmy Huang, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, Mizanur Rahman, Mohammed Saidul Islam, Shafiq Joty

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual evaluation · LVLM judges · reward models · vision-language models · cross-lingual robustness · multimodal benchmark · preference evaluation

The pith

LVLM judges exhibit inconsistent performance across 25 languages, with model size and architecture proving poor predictors of robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MM-JudgeBench to test whether large vision-language model judges generalize reliably beyond English. It constructs over 60,000 pairwise preference instances spanning general vision-language tasks and chart-centric reasoning in 25 typologically diverse languages. Evaluation of 22 models reveals large cross-lingual variance in judgment quality. The findings indicate that current reward models cannot be assumed to work uniformly for alignment and evaluation outside English-dominant settings. This matters because automated evaluators underpin much of LVLM development, so language-specific failures limit their global applicability.

Core claim

By releasing MM-JudgeBench with two complementary subsets—one extending VL-RewardBench for general vision-language preferences and one derived from OpenCQA for chart-centric reasoning—the authors demonstrate that LVLM judges display substantial performance variance across 25 languages. Their analysis of 15 open-source and 7 proprietary models shows that neither larger scale nor specific architectures reliably improve multilingual consistency, and that even leading judges produce inconsistent preference decisions depending on the input language.

What carries the argument

MM-JudgeBench, the benchmark of over 60K pairwise preference instances across 25 languages formed by extending existing English vision-language and chart-reasoning datasets.
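
To make the evaluation protocol concrete, here is a minimal sketch of how pairwise judge scoring across languages could be computed; the `judge` function, field names, and data layout are assumptions for illustration, not the authors' released code.

```python
from collections import defaultdict
from statistics import mean, stdev

def judge(image, question, response_a, response_b):
    """Placeholder for an LVLM judge call; returns "A" or "B".
    Any open or proprietary model could sit behind this signature."""
    raise NotImplementedError("plug an LVLM judge in here")

def per_language_accuracy(instances):
    """Score one judge on pairwise preference instances.

    Each instance is assumed to carry an image, a question, two candidate
    responses, a gold label ("A" or "B"), and a language code.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ex in instances:
        pick = judge(ex["image"], ex["question"], ex["response_a"], ex["response_b"])
        correct[ex["language"]] += int(pick == ex["label"])
        total[ex["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

def cross_lingual_spread(per_lang_acc):
    """The paper's headline quantity: how far accuracy moves across languages."""
    scores = list(per_lang_acc.values())
    return mean(scores), stdev(scores), max(scores) - min(scores)
```

Running this loop for each of the 22 judges and comparing the resulting spread against parameter count is, in essence, the experiment behind the size-is-a-poor-predictor finding.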

If this is right

  • Reward models for LVLMs require explicit multilingual testing rather than English-only validation to ensure reliable use.
  • Current state-of-the-art LVLM judges cannot be deployed uniformly for non-English alignment or evaluation tasks without additional adaptation.
  • The released multilingual training split from MM-RewardBench offers a starting point for domain adaptation to reduce cross-lingual variance.
  • Architectural scaling alone will not resolve inconsistent behavior in multilingual settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same generalization gaps likely appear in related multilingual alignment tasks such as safety and instruction following for LVLMs.
  • Future benchmarks could incorporate direct native-speaker creation of preference pairs instead of translation to isolate true cross-lingual effects.
  • The variance across languages points to imbalances in the visual-text pretraining data that reward models inherit.

Load-bearing premise

The benchmark's translated and extended preference instances accurately reflect native speaker judgments without introducing significant translation artifacts or cultural biases.

What would settle it

A direct comparison in which native speakers of several benchmark languages re-annotate the same image-text pairs and show low agreement with the existing preference labels would indicate the observed inconsistencies arise from benchmark construction rather than model behavior.
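
Concretely, such a re-annotation study reduces to a chance-corrected agreement statistic between the benchmark's translated labels and fresh native-speaker labels. A minimal sketch using Cohen's kappa, with the data layout assumed for illustration:

```python
from sklearn.metrics import cohen_kappa_score

def reannotation_agreement(benchmark_labels, native_labels):
    """Agreement between translated preference labels and native re-annotations.

    Both arguments: equal-length lists of "A"/"B" picks over the same
    image-text pairs in one language. Kappa near 0 would point to
    translation artifacts in the benchmark; kappa near 1 would leave the
    observed cross-lingual variance attributable to the judges themselves.
    """
    return cohen_kappa_score(benchmark_labels, native_labels)

# Toy usage (labels are illustrative only):
# reannotation_agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```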

Figures

Figures reproduced from arXiv: 2604.19405 by Amran Bhuiyan, Enamul Hoque, Jimmy Huang, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, Mizanur Rahman, Mohammed Saidul Islam, Shafiq Joty.

Figure 1
Figure 1: Example from the VL-RewardBench subset of MM-JudgeBench illustrating multilingual evaluation of LVLM judges for a given image. The question and candidate responses are translated from English to French. The LVLM judge (Gemini-2.5-Flash-Lite) selects the correct response A for English, while incorrectly selecting B for French, highlighting the need for multilingual evaluation of LVLM judges. reasoning over dive… view at source ↗
Figure 2
Figure 2: An overview of our methodology: (a) Benchmark construction step contains two stages, i.e., Translation model selection, and Translation data generation (from VL-RewardBench and OpenCQA data); and (b) End-to-End evaluation pipeline. The first systematic benchmark proposed for this purpose was RewardBench (Lambert et al., 2025). However, this benchmark was restricted to text-based modality on English-centri… view at source ↗
Figure 3
Figure 3: Position Bias in multilingual M-VL-RewardBench for closed models. Lower values are better. OpenCQA usually has lower position bias and variance in comparison to M-VL-RewardBench. view at source ↗
Figure 4
Figure 4: Translation sensitivity analysis on M-VL-RewardBench with Gemini models as the translator and… view at source ↗
Figure 6
Figure 6: Output Instruction Following Accuracy in M-VL-RewardBench. Gemini-2.5-Flash-Lite showing modest degradation (98%). Among open models, while none of them could reach 100% accuracy, almost all of them with more than 2B parameters achieve above 95% accuracy (with Gemma-3-12B achieving the best 99.84% accuracy). Meanwhile, models below 2B parameters show poorer output format instruction following accuracy (… view at source ↗
Figure 7
Figure 7: Example from MM-JudgeBench (OpenCQA) illustrating multilingual reward evaluation. A chart im… view at source ↗
Figure 8
Figure 8: Performance Comparison based on Script Groups. view at source ↗
Figure 9
Figure 9: Performance Comparison based on Resource Level. view at source ↗
Original abstract

Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MM-JudgeBench, the first large-scale benchmark for multilingual multimodal LVLM judge evaluation, comprising over 60K pairwise preference instances across 25 typologically diverse languages. It integrates a general vision-language preference subset extending VL-RewardBench and a chart-centric visual-text reasoning subset derived from OpenCQA. The authors evaluate 22 LVLMs (15 open-source, 7 proprietary), report substantial cross-lingual performance variance, conclude that model size and architecture are poor predictors of multilingual robustness, and note inconsistent behavior even in state-of-the-art judges. They additionally release a disjoint multilingual training set derived from MM-RewardBench to support domain adaptation.

Significance. If the benchmark labels prove reliable, this work is significant for demonstrating fundamental limitations in the cross-lingual generalization of LVLM-based reward models, which are central to alignment pipelines. The scale (60K+ instances, 25 languages), complementary subsets, and public release of both the evaluation benchmark and training data are clear strengths that enable reproducible research and targeted improvements in multilingual evaluator development.

major comments (2)
  1. [§3] Benchmark Construction: The extension of English-only sources (VL-RewardBench and OpenCQA) to 25 languages is described as adding translated text for the same visual inputs, but no details are provided on whether preference labels were re-annotated by native speakers, back-translated for fidelity, or validated against cultural/lexical artifacts. This is load-bearing for the central claim of model inconsistency across languages, because translation-induced shifts in preference ordering would produce apparent variance that reflects label noise rather than LVLM limitations.
  2. [§4] Experimental Results: The analysis that model size and architecture are poor predictors of multilingual robustness lacks any reported statistical support such as correlation coefficients, regression results, or ablation studies. In addition, the abstract and results summary provide no error bars, confidence intervals, or significance tests for the cross-lingual performance differences, weakening the evidence for the reported 'substantial variance' and 'inconsistent behavior'.
minor comments (2)
  1. [Abstract] The exact total instance count and per-subset or per-language breakdown would improve transparency over the approximate 'over 60K' figure.
  2. [Introduction] Early definition of the term 'LVLM judges' and explicit expansion of acronyms such as VL-RewardBench on first use would aid readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below, indicating the changes we will make in the revised manuscript.

Point-by-point responses
  1. Referee: [§3] Benchmark Construction: The extension of English-only sources (VL-RewardBench and OpenCQA) to 25 languages is described as adding translated text for the same visual inputs, but no details are provided on whether preference labels were re-annotated by native speakers, back-translated for fidelity, or validated against cultural/lexical artifacts. This is load-bearing for the central claim of model inconsistency across languages, because translation-induced shifts in preference ordering would produce apparent variance that reflects label noise rather than LVLM limitations.

    Authors: We appreciate the referee highlighting the need for greater transparency in benchmark construction. The manuscript describes extending the English sources by translating the textual elements while retaining the identical visual inputs and the original English preference labels. We did not re-annotate the labels with native speakers for the 25 languages, as the goal was to evaluate judge behavior on equivalent preference judgments across languages. We acknowledge that this design choice leaves open the possibility of translation-induced artifacts. In the revision we will expand §3 with: (i) the exact translation pipeline and service employed, (ii) results of back-translation fidelity checks performed on a sampled subset (a sketch of one such check follows this exchange), and (iii) an explicit limitations paragraph discussing potential cultural or lexical shifts. These additions will allow readers to evaluate whether the reported cross-lingual variance is attributable to label noise. revision: yes

  2. Referee: [§4] Experimental Results: The analysis that model size and architecture are poor predictors of multilingual robustness lacks any reported statistical support such as correlation coefficients, regression results, or ablation studies. In addition, the abstract and results summary provide no error bars, confidence intervals, or significance tests for the cross-lingual performance differences, weakening the evidence for the reported 'substantial variance' and 'inconsistent behavior'.

    Authors: We agree that formal statistical support would strengthen the claims. The current manuscript demonstrates variance through per-language performance tables and figures, but does not include correlation analyses or error bars. In the revised version we will add to §4: Pearson and Spearman correlations between model parameter count and cross-lingual performance standard deviation; analogous breakdowns by architecture family; standard-error bars on all relevant plots; and paired significance tests (Wilcoxon signed-rank) between languages for representative models (a sketch of these computations follows this exchange). The abstract will be updated to reference these quantitative supports for the variance and inconsistency findings. revision: yes
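
On point 1, one plausible shape for the promised back-translation fidelity check is to embed each English source and its round-tripped counterpart with a language-agnostic encoder such as LaBSE (which appears in the paper's reference list) and flag low-similarity pairs; the encoder choice and the 0.85 cutoff below are assumptions, not the authors' pipeline.

```python
from sentence_transformers import SentenceTransformer

# LaBSE produces language-agnostic sentence embeddings; the 0.85 cutoff is
# an assumed threshold for flagging suspect translations, not a published value.
encoder = SentenceTransformer("sentence-transformers/LaBSE")

def flag_unfaithful(sources, back_translations, threshold=0.85):
    """Return indices of pairs whose round-trip cosine similarity falls below threshold."""
    src = encoder.encode(sources, convert_to_tensor=True, normalize_embeddings=True)
    bt = encoder.encode(back_translations, convert_to_tensor=True, normalize_embeddings=True)
    sims = (src * bt).sum(dim=1)  # cosine similarity, since vectors are normalized
    return [i for i, s in enumerate(sims.tolist()) if s < threshold]
```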
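
On point 2, the proposed statistics are standard and cheap to compute once per-language scores exist; a SciPy sketch, with all inputs standing in for the paper's actual numbers:

```python
from scipy.stats import pearsonr, spearmanr, wilcoxon

def size_vs_robustness(param_counts_b, cross_lang_std):
    """Correlate model scale with cross-lingual spread (the claim: weak at best).

    param_counts_b: parameter counts in billions, one per model.
    cross_lang_std: std of each model's accuracy over the 25 languages.
    """
    return {
        "pearson": pearsonr(param_counts_b, cross_lang_std),
        "spearman": spearmanr(param_counts_b, cross_lang_std),
    }

def paired_language_test(scores_lang_x, scores_lang_y):
    """Wilcoxon signed-rank test over one model's paired per-item scores
    in two languages; a small p-value indicates a systematic gap."""
    return wilcoxon(scores_lang_x, scores_lang_y)
```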

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and model evaluation

Full rationale

The paper introduces MM-JudgeBench by extending VL-RewardBench and OpenCQA to 25 languages via translation and evaluates 22 LVLMs on over 60K preference instances. All central claims (cross-lingual variance, poor predictive power of size/architecture) rest on new data collection and direct performance measurements rather than any derivation, equation, fitted parameter renamed as prediction, or self-citation chain. No mathematical steps, ansatzes, or uniqueness theorems are invoked; the work measures models against benchmarks built from external sources and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard assumptions from prior benchmarks like VL-RewardBench and OpenCQA for preference data construction and model evaluation; no free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5567 in / 1110 out tokens · 43703 ms · 2026-05-10T02:14:06.083287+00:00 · methodology


Reference graph

Works this paper leans on

47 extracted references · 20 canonical work pages · 7 internal anchors

  1. [1] VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models. Proceedings of the Computer Vision and Pattern Recognition Conference.
  2. [2] BERTScore: Evaluating Text Generation with BERT.
  3. [3] Query-OPT: Optimizing Inference of Large Language Models via Multi-Query Instructions in Meeting Summarization. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track.
  4. [4] Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; et al. Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Oct 2020. doi:10.18653/v1/2020.emnlp-demos.6.
  5. [5] Efficient memory management for large language model serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles.
  6. [6] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261.
  7. [7] Pixtral 12B. arXiv preprint arXiv:2410.07073.
  8. [8] Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631.
  9. [9] Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness. arXiv preprint arXiv:2509.13332.
  10. [10] Laskar, Md Tahmid Rahman; Bari, M Saiful; Rahman, Mizanur; Bhuiyan, Md Amran Hossen; Joty, Shafiq; Huang, Jimmy Xiangji. A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/v1/2023.findings-acl.29.
  11. [11] Jahan, Israt; Laskar, Md Tahmid Rahman; Peng, Chun; Huang, Jimmy. Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers. Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, 2023. doi:10.18653/v1/2023.bionlp-1.30.
  12. [12] Reinforcement learning: A survey. Journal of Artificial Intelligence Research.
  13. [13] Laskar, Md Tahmid Rahman; Jahan, Israt; Dolatabadi, Elham; Peng, Chun; Hoque, Enamul; Huang, Jimmy. Improving automatic evaluation of large language models…
  14. [14] Domain adaptation with pre-trained transformers for query-focused abstractive text summarization. Computational Linguistics, 2022.
  15. [15] Laskar, Md Tahmid Rahman; Hoque, Enamul; Huang, Jimmy Xiangji. WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization. Proceedings of the 28th International Conference on Computational Linguistics, 2020. doi:10.18653/v1/2020.coling-main.495.
  16. [16] InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv preprint arXiv:2508.18265.
  17. [17] Filtering and mining parallel data in a joint multilingual space. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
  18. [18] MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. Forty-first International Conference on Machine Learning.
  19. [19] "A Little is Enough": Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation. Findings of the Association for Computational Linguistics: ACL 2023.
  20. [20] Padarha, Shreyansh; Semenova, Elizaveta; Vidgen, Bertie; Mahdi, Adam; Hale, Scott A. Evaluating…
  21. [21] Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786.
  22. [22] Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? Findings of the Association for Computational Linguistics: EMNLP 2024.
  23. [23] CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task. Proceedings of the Seventh Conference on Machine Translation (WMT).
  24. [24] Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei. Language-agnostic BERT Sentence Embedding. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022. doi:10.18653/v1/2022.acl-long.62.
  25. [25] Findings of the WMT25 general machine translation shared task: Time to stop evaluating on easy test sets. Proceedings of the Tenth Conference on Machine Translation.
  26. [26] Alqahtani, Sawsan; Nayeem, Mir Tafseer; Laskar, Md Tahmid Rahman; Mohiuddin, Tasnim; Bari, M Saiful. Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2026.
  27. [27] Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation. 2025.
  28. [28] Prometheus-Vision: Vision-language model as a judge for fine-grained evaluation. Findings of the Association for Computational Linguistics: ACL 2024.
  29. [29] LLaVA-Critic: Learning to evaluate multimodal models. Proceedings of the Computer Vision and Pattern Recognition Conference.
  30. [30] M-RewardBench: Evaluating reward models in multilingual settings. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  31. [31] Multimodal RewardBench: Holistic evaluation of reward models for vision language models. arXiv preprint arXiv:2502.14191, 2025.
  32. [32] A survey on multimodal large language models. National Science Review, 2024.
  33. [33] A survey of multilingual large language models. Patterns, 2025.
  34. [34] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  35. [35] OpenCQA: Open-ended question answering with charts. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
  36. [36] Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
  37. [37] Li, Dawei; Jiang, Bohan; Huang, Liangjie; Beigi, Alimohammad; Zhao, Chengshuai; Tan, Zhen; Bhattacharjee, Amrita; Jiang, Yuxuan; Chen, Canyu; Wu, Tianhao; et al. From Generation to Judgment: Opportunities and Challenges of LLM-as-a-Judge.
  38. [38] Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; Zhuang, Siyuan; Wu, Zhanghao; Zhuang, Yonghao; Lin, Zi; Li, Zhuohan; Li, Dacheng; Xing, Eric; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
  39. [39] RewardBench: Evaluating reward models for language modeling. Findings of the Association for Computational Linguistics: NAACL 2025.
  40. [40] Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems.
  41. [41] Laskar, Md Tahmid Rahman; Islam, Mohammed Saidul; Mahbub, Ridwan; Rahman, Mizanur; Bhuiyan, Amran; Jahan, Israt; Nayeem, Mir Tafseer; Joty, Shafiq; Hoque, Enamul; Huang, Jimmy. Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  42. [42] Shohan, Faisal Tareque; Nayeem, Mir Tafseer; Islam, Samsul; Akash, Abu Ubaida; Joty, Shafiq. XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags. Findings of the Association for Computational Linguistics: ACL 2024. doi:10.18653/v1/2024.findings-acl.771.
  43. [43] Laskar, Md Tahmid Rahman; Alqahtani, Sawsan; Bari, M Saiful; Rahman, Mizanur; Khan, Mohammad Abdullah Matin; Khan, Haidar; Jahan, Israt; Bhuiyan, Amran; Tan, Chee Wei; Parvez, Md Rizwan; Hoque, Enamul; Joty, Shafiq; Huang, Jimmy. A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations.
  44. [44] Mahbub, Ridwan; Islam, Mohammed Saidul; Nayeem, Mir Tafseer; Laskar, Md Tahmid Rahman; Rahman, Mizanur; Joty, Shafiq; Hoque, Enamul. From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  45. [45] Mahbub, Ridwan; Islam, Mohammed Saidul; Laskar, Md Tahmid Rahman; Rahman, Mizanur; Nayeem, Mir Tafseer; Hoque, Enamul. The Perils of Chart Deception: How Misleading Visualizations Affect Vision-Language Models.
  46. [46] Laskar, Md Tahmid Rahman; Islam, Mohammed Saidul; Mahbub, Ridwan; Masry, Ahmed; Rahman, Mizanur; Bhuiyan, Amran; Nayeem, Mir Tafseer; Joty, Shafiq; Hoque, Enamul; Huang, Jimmy. Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
  47. [47] Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024. doi:10.18653/v1/2024.acl-long.845.