DataSciBench: An LLM Agent Benchmark for Data Science

Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue · 2025 · arXiv 2502.13897

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · dataset: 1

citation-polarity summary

background: 2

representative citing papers

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

cs.AI · 2025-06-04 · unverdicted · novelty 7.0

Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.

RExBench: Can coding agents autonomously implement AI research extensions?

cs.CL · 2025-06-27 · unverdicted · novelty 6.0

RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.

AgenticDataBench: A Comprehensive Benchmark for Data Agents

cs.DB · 2026-07-02 · unverdicted · novelty 5.0

AgenticDataBench is a new benchmark covering realistic data science tasks across 15 domains using extracted skills and LLM-generated workflows to evaluate data agents at fine granularity.

Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media

cs.CL · 2026-05-20 · unverdicted · novelty 5.0

Presents a new question-based evaluation framework for LLMs on aggregated social media text and reports that performance declines with input scale, task complexity, and numerical operations beyond 500 instances.

ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows

cs.AI · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

ProfiliTable is a multi-agent system with profiler, generator, and evaluator components that outperforms baselines on 18 tabular task types via dynamic profiling and closed-loop refinement.

Business Utility of Large Language Models as Exploratory Data Analysis Agents

cs.CY · 2026-05-08 · unverdicted · novelty 5.0

Evaluation of 15 LLM configurations across four conditions in a supply chain EDA benchmark finds most lack sufficient repeatability for autonomous deployment, with GPT-5.4 at extra-high reasoning effort scoring highest on mean score (0.8748) and proposed Business utility (0.6952).

DRAFT: Task Decoupled Latent Reasoning for Agent Safety

cs.LG · 2026-02-11 · unverdicted · novelty 5.0

DRAFT decouples agent safety judgment into latent extraction and reasoning stages, raising average benchmark accuracy from 63.27% to 91.18%.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

cs.AI · 2025-04-28 · accept · novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

citing papers explorer

Showing 8 of 8 citing papers.

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games · cs.AI · 2025-06-04 · unverdicted · none · ref 9
RExBench: Can coding agents autonomously implement AI research extensions? · cs.CL · 2025-06-27 · unverdicted · none · ref 54
AgenticDataBench: A Comprehensive Benchmark for Data Agents · cs.DB · 2026-07-02 · unverdicted · none · ref 61
Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media · cs.CL · 2026-05-20 · unverdicted · none · ref 100
ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows · cs.AI · 2026-05-12 · unverdicted · none · ref 42 · 2 links
Business Utility of Large Language Models as Exploratory Data Analysis Agents · cs.CY · 2026-05-08 · unverdicted · none · ref 30
DRAFT: Task Decoupled Latent Reasoning for Agent Safety · cs.LG · 2026-02-11 · unverdicted · none · ref 23
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review · cs.AI · 2025-04-28 · accept · none · ref 123