hub

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari · 2024 · cs.SE · arXiv 2406.15877

21 Pith papers cite this work. Polarity classification is still indexing.

21 Pith papers citing it

open full Pith review browse 21 citing papers arXiv PDF

abstract

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs.To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

cs.SE · 2026-05-13 · conditional · novelty 7.0

10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

cs.SE · 2026-04-27 · unverdicted · novelty 7.0

Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.

Skill Retrieval Augmentation for Agentic AI

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.

Incisor: Ex Ante Cloud Instance Selection for HPC Jobs

cs.DC · 2026-04-27 · unverdicted · novelty 7.0

Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constrained SkyPilot baseline.

Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

cs.SE · 2026-04-23 · conditional · novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.

Neurosymbolic Repo-level Code Localization

cs.SE · 2026-04-17 · unverdicted · novelty 7.0

LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.

Structural Anchors and Reasoning Fragility:Understanding CoT Robustness in LLM4Code

cs.SE · 2026-04-14 · unverdicted · novelty 7.0

CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.

DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

DuET uses dual execution of generated code and pseudocode with majority voting to achieve state-of-the-art test output prediction, boosting Pass@1 by 13.6 percentage points on LiveCodeBench.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.

Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

cs.SE · 2026-04-27 · conditional · novelty 6.0

SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.

Learned or Memorized ? Quantifying Memorization Advantage in Code LLMs

cs.SE · 2026-04-15 · unverdicted · novelty 6.0

A perturbation method shows memorization advantage in code LLMs varies widely by model and task, remaining low on CVEFixes and Defects4J benchmarks.

InCoder-32B-Thinking: Industrial Code World Model for Thinking

cs.AR · 2026-04-03 · unverdicted · novelty 6.0

InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

cs.LG · 2025-12-10 · conditional · novelty 6.0

LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

cs.LG · 2026-05-09 · unverdicted · novelty 5.0

Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.

MiMo-V2-Flash Technical Report

cs.CL · 2026-01-06 · unverdicted · novelty 5.0

MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

cs.CL · 2025-03-03 · unverdicted · novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks

cs.SE · 2026-04-12 · unverdicted · novelty 4.0

Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.

An End-to-End Framework for Building Large Language Models for Software Operations

cs.LG · 2026-04-06 · unverdicted · novelty 4.0 · 2 refs

OpsLLM is a domain-specific LLM for software ops QA and RCA built with human-curated data, SFT, and RL using a domain process reward model, showing accuracy gains of 0.2-5.7% on QA and 2.7-70.3% on RCA over general LLMs.

Qwen2.5-Coder Technical Report

cs.CL · 2024-09-18 · unverdicted · novelty 4.0

Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.

citing papers explorer

Showing 21 of 21 citing papers.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning cs.AI · 2026-05-10 · accept · none · ref 113 · 2 links · internal anchor
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation cs.SE · 2026-05-13 · conditional · none · ref 54 · internal anchor
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation cs.SE · 2026-04-27 · unverdicted · none · ref 49 · internal anchor
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
Skill Retrieval Augmentation for Agentic AI cs.CL · 2026-04-27 · unverdicted · none · ref 62 · internal anchor
Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs cs.DC · 2026-04-27 · unverdicted · none · ref 60 · internal anchor
Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constrained SkyPilot baseline.
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation cs.SE · 2026-04-23 · conditional · none · ref 44 · internal anchor
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
Neurosymbolic Repo-level Code Localization cs.SE · 2026-04-17 · unverdicted · none · ref 40 · internal anchor
LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.
Structural Anchors and Reasoning Fragility:Understanding CoT Robustness in LLM4Code cs.SE · 2026-04-14 · unverdicted · none · ref 44 · internal anchor
CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.
DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode cs.SE · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
DuET uses dual execution of generated code and pseudocode with majority voting to achieve state-of-the-art test output prediction, boosting Pass@1 by 13.6 percentage points on LiveCodeBench.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach cs.LG · 2025-02-07 · unverdicted · none · ref 187 · internal anchor
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning cs.LG · 2026-05-11 · unverdicted · none · ref 50 · internal anchor
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis cs.SE · 2026-04-27 · conditional · none · ref 50 · internal anchor
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.
Learned or Memorized ? Quantifying Memorization Advantage in Code LLMs cs.SE · 2026-04-15 · unverdicted · none · ref 7 · internal anchor
A perturbation method shows memorization advantage in code LLMs varies widely by model and task, remaining low on CVEFixes and Defects4J benchmarks.
InCoder-32B-Thinking: Industrial Code World Model for Thinking cs.AR · 2026-04-03 · unverdicted · none · ref 51 · internal anchor
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
LLaDA2.0: Scaling Up Diffusion Language Models to 100B cs.LG · 2025-12-10 · conditional · none · ref 43 · internal anchor
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning cs.LG · 2026-05-09 · unverdicted · none · ref 40 · internal anchor
Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.
MiMo-V2-Flash Technical Report cs.CL · 2026-01-06 · unverdicted · none · ref 56 · internal anchor
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs cs.CL · 2025-03-03 · unverdicted · none · ref 58 · internal anchor
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks cs.SE · 2026-04-12 · unverdicted · none · ref 27 · internal anchor
Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.
An End-to-End Framework for Building Large Language Models for Software Operations cs.LG · 2026-04-06 · unverdicted · none · ref 60 · 2 links · internal anchor
OpsLLM is a domain-specific LLM for software ops QA and RCA built with human-curated data, SFT, and RL using a domain process reward model, showing accuracy gains of 0.2-5.7% on QA and 2.7-70.3% on RCA over general LLMs.
Qwen2.5-Coder Technical Report cs.CL · 2024-09-18 · unverdicted · none · ref 42 · internal anchor
Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer