Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
Datascibench: An llm agent benchmark for data science
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
polarities
background 2representative citing papers
RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.
AgenticDataBench is a new benchmark covering realistic data science tasks across 15 domains using extracted skills and LLM-generated workflows to evaluate data agents at fine granularity.
Presents a new question-based evaluation framework for LLMs on aggregated social media text and reports that performance declines with input scale, task complexity, and numerical operations beyond 500 instances.
ProfiliTable is a multi-agent system with profiler, generator, and evaluator components that outperforms baselines on 18 tabular task types via dynamic profiling and closed-loop refinement.
Evaluation of 15 LLM configurations across four conditions in a supply chain EDA benchmark finds most lack sufficient repeatability for autonomous deployment, with GPT-5.4 at extra-high reasoning effort scoring highest on mean score (0.8748) and proposed Business utility (0.6952).
DRAFT decouples agent safety judgment into latent extraction and reasoning stages, raising average benchmark accuracy from 63.27% to 91.18%.
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
citing papers explorer
-
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
-
RExBench: Can coding agents autonomously implement AI research extensions?
RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.
-
AgenticDataBench: A Comprehensive Benchmark for Data Agents
AgenticDataBench is a new benchmark covering realistic data science tasks across 15 domains using extracted skills and LLM-generated workflows to evaluate data agents at fine granularity.
-
Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media
Presents a new question-based evaluation framework for LLMs on aggregated social media text and reports that performance declines with input scale, task complexity, and numerical operations beyond 500 instances.
-
ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows
ProfiliTable is a multi-agent system with profiler, generator, and evaluator components that outperforms baselines on 18 tabular task types via dynamic profiling and closed-loop refinement.
-
Business Utility of Large Language Models as Exploratory Data Analysis Agents
Evaluation of 15 LLM configurations across four conditions in a supply chain EDA benchmark finds most lack sufficient repeatability for autonomous deployment, with GPT-5.4 at extra-high reasoning effort scoring highest on mean score (0.8748) and proposed Business utility (0.6952).
-
DRAFT: Task Decoupled Latent Reasoning for Agent Safety
DRAFT decouples agent safety judgment into latent extraction and reasoning stages, raising average benchmark accuracy from 63.27% to 91.18%.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.