AgentReview is the first LLM-based simulation framework for peer review that quantifies a 37.1% decision variation attributable to reviewer biases.
Dyval 2: Dynamic evaluation of large language models by meta probing agents
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 5representative citing papers
Elmes* automates fine-grained rubric construction for LLM educational evaluation via multi-agent interactions and a self-evolving SceneGen module, producing the Edu-330 benchmark that demonstrates multidimensional differences in model teaching performance.
The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.
A survey of emerging AI agent architectures that organizes single and multi-agent designs around reasoning, planning, tool use, communication, and reflection phases.
citing papers explorer
-
AgentReview: Exploring Peer Review Dynamics with LLM Agents
AgentReview is the first LLM-based simulation framework for peer review that quantifies a 37.1% decision variation attributable to reviewer biases.
-
Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios
Elmes* automates fine-grained rubric construction for LLM educational evaluation via multi-agent interactions and a self-evolving SceneGen module, producing the Edu-330 benchmark that demonstrates multidimensional differences in model teaching performance.
-
Robust Reasoning Benchmark
The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.
-
Benchmark Data Contamination of Large Language Models: A Survey
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.
-
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey
A survey of emerging AI agent architectures that organizes single and multi-agent designs around reasoning, planning, tool use, communication, and reflection phases.