ParaEval reduces false performance gaps in multiple-choice question answering (MCQA) benchmarks from over 2 points to below 1 point by scoring models on multiple paraphrases per answer option instead of a single surface form.
Sanchit Ahuja, Varun Gumma, and Sunayana Sitaram
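The idea in the summary, scoring each answer option over several paraphrases rather than one surface form, can be illustrated with a minimal sketch. The summary does not say how ParaEval aggregates paraphrase scores, so the log-mean-exp aggregation below is an assumption. All names (`logmeanexp`, `score_options`, `predict`, `TOY_LL`) and the toy log-likelihood values are hypothetical, invented only for illustration.

```python
import math
from typing import Callable, Dict, List


def logmeanexp(values: List[float]) -> float:
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def score_options(
    paraphrases: Dict[str, List[str]],
    log_likelihood: Callable[[str], float],
) -> Dict[str, float]:
    """Score each option label by aggregating over its paraphrases.

    Assumption: aggregation is log-mean-exp; the source does not specify it.
    """
    return {
        label: logmeanexp([log_likelihood(text) for text in forms])
        for label, forms in paraphrases.items()
    }


def predict(
    paraphrases: Dict[str, List[str]],
    log_likelihood: Callable[[str], float],
) -> str:
    """Return the option label with the highest aggregated score."""
    scores = score_options(paraphrases, log_likelihood)
    return max(scores, key=scores.get)


# Hypothetical log-likelihoods standing in for a real model's scores.
TOY_LL = {
    "Paris": -5.0, "the city of Paris": -2.0, "Paris, France": -2.5,
    "Lyon": -4.0, "the city of Lyon": -6.0, "Lyon, France": -6.5,
}

OPTIONS = {
    "A": ["Paris", "the city of Paris", "Paris, France"],
    "B": ["Lyon", "the city of Lyon", "Lyon, France"],
}

# Baseline: keep only the first surface form of each option.
SINGLE_FORM = {label: forms[:1] for label, forms in OPTIONS.items()}

single_choice = predict(SINGLE_FORM, TOY_LL.__getitem__)
multi_choice = predict(OPTIONS, TOY_LL.__getitem__)
```

In this toy setup the single-form baseline and the paraphrase-aggregated score pick different options, which is the kind of surface-form sensitivity that scoring over paraphrases is meant to smooth out.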
4 Pith papers cite this work.
representative citing papers
LiveCLKTBench generates questions from temporally filtered knowledge entities to isolate and measure genuine cross-lingual knowledge transfer in LLMs across five languages.
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
A survey of LLM agent applications in renewable energy forecasting proposing a six-layer taxonomy and listing twelve open challenges.
citing papers explorer
- The Rise and Potential of Large Language Model Based Agents: A Survey