Ds-1000: A natural and reliable bench- mark for data science code generation

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih, Daniel Fried, Sida Wang, Tao Yu · 2022 · arXiv 2211.11501

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

KernelBench: Can LLMs Write Efficient GPU Kernels?

cs.LG · 2025-02-14 · accept · novelty 7.0

KernelBench shows that even the best current LLMs generate correct and faster-than-baseline GPU kernels in fewer than 20 percent of realistic ML workloads.

Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks

cs.SE · 2026-06-07 · unverdicted · novelty 6.0

Empirical study finds instruction tuning on CodeLLMs improves instruction following at the expense of infilling performance, termed the Instruction-Tuning Tax.

Compass: SLO-aware Query Planner for Compound AI Serving at Scale

cs.DB · 2025-04-23 · unverdicted · novelty 6.0

Compass decomposes multi-query multi-SLO planning for compound AI serving, exploits plan similarities, uses selective profiling, and applies bipartite matching at runtime to deliver 2.4-5.1x higher goodput and 3.8-4.5x lower costs.

Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

quant-ph · 2026-05-26 · unverdicted · novelty 5.0

Adapts QuantumKatas to Qiskit yielding a 350-task benchmark across 26 categories and evaluates 16 LLMs in 39,200 runs, reporting performance gaps and prompting effects.

Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media

cs.CL · 2026-05-20 · unverdicted · novelty 5.0

Presents a new question-based evaluation framework for LLMs on aggregated social media text and reports that performance declines with input scale, task complexity, and numerical operations beyond 500 instances.

Business Utility of Large Language Models as Exploratory Data Analysis Agents

cs.CY · 2026-05-08 · unverdicted · novelty 5.0

Evaluation of 15 LLM configurations across four conditions in a supply chain EDA benchmark finds most lack sufficient repeatability for autonomous deployment, with GPT-5.4 at extra-high reasoning effort scoring highest on mean score (0.8748) and proposed Business utility (0.6952).

AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation

cs.SE · 2025-06-10 · unverdicted · novelty 5.0

AdaDec improves Pass@1 accuracy of LLM code generation by up to 20.9% over greedy decoding by triggering lookahead reranking only at high-uncertainty steps on HumanEval+, MBPP+, and DevEval.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Trading Human Curation for Synthetic Augmentation in RLVR

cs.LG · 2026-06-02 · unverdicted · novelty 4.0

Gated synthetic augmentations can substitute for additional human-authored RLVR tasks at a cost-adjusted trade rate of 1.4x-11.6x while retaining held-out generalization on ten benchmarks spanning code, instruction following, reasoning, and agentic function calling.

An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

cs.SE · 2026-04-09 · unverdicted · novelty 4.0

Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks cs.SE · 2026-06-07 · unverdicted · none · ref 23
Empirical study finds instruction tuning on CodeLLMs improves instruction following at the expense of infilling performance, termed the Instruction-Tuning Tax.
Compass: SLO-aware Query Planner for Compound AI Serving at Scale cs.DB · 2025-04-23 · unverdicted · none · ref 21
Compass decomposes multi-query multi-SLO planning for compound AI serving, exploits plan similarities, uses selective profiling, and applies bipartite matching at runtime to deliver 2.4-5.1x higher goodput and 3.8-4.5x lower costs.
Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation quant-ph · 2026-05-26 · unverdicted · none · ref 3
Adapts QuantumKatas to Qiskit yielding a 350-task benchmark across 26 categories and evaluates 16 LLMs in 39,200 runs, reporting performance gaps and prompting effects.
Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media cs.CL · 2026-05-20 · unverdicted · none · ref 83
Presents a new question-based evaluation framework for LLMs on aggregated social media text and reports that performance declines with input scale, task complexity, and numerical operations beyond 500 instances.
Business Utility of Large Language Models as Exploratory Data Analysis Agents cs.CY · 2026-05-08 · unverdicted · none · ref 18
Evaluation of 15 LLM configurations across four conditions in a supply chain EDA benchmark finds most lack sufficient repeatability for autonomous deployment, with GPT-5.4 at extra-high reasoning effort scoring highest on mean score (0.8748) and proposed Business utility (0.6952).
AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation cs.SE · 2025-06-10 · unverdicted · none · ref 37
AdaDec improves Pass@1 accuracy of LLM code generation by up to 20.9% over greedy decoding by triggering lookahead reranking only at high-uncertainty steps on HumanEval+, MBPP+, and DevEval.
Trading Human Curation for Synthetic Augmentation in RLVR cs.LG · 2026-06-02 · unverdicted · none · ref 11
Gated synthetic augmentations can substitute for additional human-authored RLVR tasks at a cost-adjusted trade rate of 1.4x-11.6x while retaining held-out generalization on ten benchmarks spanning code, instruction following, reasoning, and agentic function calling.
An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models cs.SE · 2026-04-09 · unverdicted · none · ref 20
Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.

Ds-1000: A natural and reliable bench- mark for data science code generation

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer