DyVal 2: Dynamic evaluation of large language models by meta probing agents

Dynamic evaluation of large language models by meta probing agents · 2024 · arXiv 2402.14865

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

AgentReview: Exploring Peer Review Dynamics with LLM Agents

cs.CL · 2024-06-18 · unverdicted · novelty 8.0

AgentReview is the first LLM-based simulation framework for peer review that quantifies a 37.1% decision variation attributable to reviewer biases.

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Elmes* automates fine-grained rubric construction for LLM educational evaluation via multi-agent interactions and a self-evolving SceneGen module, producing the Edu-330 benchmark that demonstrates multidimensional differences in model teaching performance.

Robust Reasoning Benchmark

cs.LG · 2026-03-26 · unverdicted · novelty 7.0 · 2 refs

The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.

Benchmark Data Contamination of Large Language Models: A Survey

cs.CL · 2024-06-06 · unverdicted · novelty 3.0

A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.

The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey

cs.AI · 2024-04-17 · unverdicted · novelty 3.0

A survey of emerging AI agent architectures that organizes single and multi-agent designs around reasoning, planning, tool use, communication, and reflection phases.

citing papers explorer

Showing 5 of 5 citing papers.

AgentReview: Exploring Peer Review Dynamics with LLM Agents · cs.CL · 2024-06-18 · unverdicted · none · ref 55
Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios · cs.LG · 2026-06-04 · unverdicted · none · ref 18
Robust Reasoning Benchmark · cs.LG · 2026-03-26 · unverdicted · none · ref 63 · 2 links
Benchmark Data Contamination of Large Language Models: A Survey · cs.CL · 2024-06-06 · unverdicted · none · ref 190
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey · cs.AI · 2024-04-17 · unverdicted · none · ref 41
