Answer matching outperforms multiple choice for language model evaluation. arXiv preprint arXiv:2507.02856

Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping · 2025 · arXiv 2507.02856

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

cs.CL · 2025-10-28 · unverdicted · novelty 7.0

Authors extend an existing Arabic QA dataset into the first parallel open-ended benchmark across dialects and MSA, then benchmark LLMs showing underperformance on dialects and open-ended questions.

Entropy After </Think> for reasoning model early exiting

cs.LG · 2025-09-30 · unverdicted · novelty 6.0

Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use 12-22% on MATH500 and AIME2025 with no accuracy loss.

citing papers explorer

Showing 2 of 2 citing papers.

Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

cs.CL · 2025-10-28 · unverdicted · none · ref 13

Authors extend an existing Arabic QA dataset into the first parallel open-ended benchmark across dialects and MSA, then benchmark LLMs showing underperformance on dialects and open-ended questions.

Entropy After </Think> for reasoning model early exiting

cs.LG · 2025-09-30 · unverdicted · none · ref 1

Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use 12-22% on MATH500 and AIME2025 with no accuracy loss.
