Answer matching outperforms multiple choice for language model evaluation. arXiv preprint arXiv:2507.02856
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2025 (2)
Verdicts: UNVERDICTED (2)
Citing papers:
- Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
  The authors extend an existing Arabic QA dataset into the first parallel open-ended benchmark across dialects and Modern Standard Arabic (MSA), then benchmark LLMs, which underperform on dialects and on open-ended questions.
- Entropy After </Think> for reasoning model early exiting
  Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use by 12-22% on MATH500 and AIME2025 with no accuracy loss.