A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, Jimmy Huang · 2024 · DOI 10.18653/v1/2024.emnlp-main.764

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

GS-QA: A Benchmark for Geospatial Question Answering

cs.DB · 2026-05-21 · unverdicted · novelty 7.0

GS-QA is a new benchmark of 2,800 QA pairs on 28 templates using OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types including distances and counts.

citing papers explorer

Showing 2 of 2 citing papers.

Lost in Translation: Do LVLM Judges Generalize Across Languages?
cs.CL · 2026-04-21 · unverdicted · none · ref 43
GS-QA: A Benchmark for Geospatial Question Answering
cs.DB · 2026-05-21 · unverdicted · none · ref 31
