hub
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
64 Pith papers cite this work.
abstract
Task automation has been greatly empowered by recent advances in Large Language Models (LLMs) via Python code, with tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task requires compositional reasoning based on an accurate understanding of complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task has an average of 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions with only essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.
citing papers explorer
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
-
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
SlopCodeBench shows coding agents degrade in structural quality and verbosity across iterative extensions, with no agent solving any problem completely and agent code 2x more eroded than human code.
-
ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
-
RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents
RigorBench evaluates AI coding agents on process discipline via five pillars and reports 41% higher process scores and 17% better outcome correctness with structured approaches on 30 tasks.
-
PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees
PrivCode++ introduces the first DP code generation method protecting both prompts and code via latent-conditioned two-stage training, claiming higher utility and stronger privacy than prior baselines.
-
BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems
BOHM extracts multi-resolution attribution trees from existing routing weights in hierarchical AI systems, providing zero-cost explanations that correlate with SHAP when routing is near-optimal.
-
Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents
SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.
-
Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation
GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
AgentLens reveals 10.7% of passing SWE-agent trajectories exhibit Lucky Pass behaviors and introduces a process-level evaluation framework with a new annotated dataset of 1,815 trajectories.
-
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
-
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constrained SkyPilot baseline.
-
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
-
Neurosymbolic Repo-level Code Localization
LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.
-
Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.
-
DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode
DuET uses dual execution of generated code and pseudocode with majority voting to achieve state-of-the-art test output prediction, boosting Pass@1 by 13.6 percentage points on LiveCodeBench.
-
Pimp My LLM: Leveraging Variability Modeling to Tune Inference Hyperparameters
Variability modeling from software engineering enables systematic sampling, measurement, and prediction of LLM inference configurations for energy, latency, and accuracy trade-offs.
-
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-bench Verified, and introduces Fix Rate as a partial-progress metric.
-
PerfCoder: Large Language Models for Interpretable Code Performance Optimization
PerfCoder is a family of LLMs trained on optimization trajectories with human annotations and runtime-based preference alignment that achieves higher runtime speedups and optimization rates on the PIE benchmark than prior models while producing interpretable feedback.
-
Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
A study of seven LLMs finds that realistic prompt variations trigger library hallucinations: one-character misspellings in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%. It also introduces LibHalluBench for evaluation.
-
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
-
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
-
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL uses RL on software evolution data to train LLMs, achieving 41% on SWE-bench Verified and generalizing to other reasoning tasks.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Guiding Human Validation of LLM-Generated Code via Verifiable Literate Programming
VLP adds an NL documentation layer with trace-linked mismatch detection and derived formal checks to make human validation of LLM code feasible, lifting pass@1 from 28.7-73.2% to 65.4-93.5%.
-
SWE-Router: Routing in Multi-turn Agentic Software Engineering Tasks
SWE-Router introduces trajectory-conditioned value-based routing for LLM agents on SWE tasks, with a Bayes-optimality theorem and empirical cost savings while retaining most strong-model performance.
-
Agent-as-a-Router: Agentic Model Routing for Coding Tasks
Agent-as-a-Router turns static LLM routing into an iterative C-A-F loop that accumulates execution feedback to lower cumulative regret on coding tasks.
-
Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier
PROPEL amortizes solver evaluation with a trained activation probe to optimize task generators toward a target solve rate, raising the share of learnable tasks from ~10% to ~20% in coding and SWE experiments.
-
FASE: Fast Adaptive Semantic Entropy for Code Quality
FASE approximates functional correctness via MST on structural and semantic dissimilarity graphs, reporting 25% better Spearman correlation and 19% better ROCAUC than LLM-based semantic entropy at 0.3% runtime cost on HumanEval and BigCodeBench.
-
Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks
An empirical study finds that instruction tuning of CodeLLMs improves instruction following at the expense of infilling performance, a cost termed the Instruction-Tuning Tax.
-
Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety
Strategic attack selection via start and stop policies reduces empirical safety by 20-28pp in BashArena and LinuxArena agentic control evaluations without changing attack capability.
-
DLLG: Dynamic Logit-Level Gating of LLM Experts
DLLG learns token-level fusion weights for LLM experts from sparse response supervision and outperforms routing, ensembling, and merging baselines on reasoning and code tasks.
-
Fine-Tuning Improves Information Conveyance in Language Models
Fine-tuning reorganizes uncertainty in LLMs into more efficient information conveyance, as shown by stronger length-entropy correlations and a tripling of entropy-semantic diversity links after controls.
-
Design and Report Benchmarks for Knowledge Work
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
-
Harnessing LLM Agents with Skill Programs
HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.
-
HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools
HyDRA routes queries to cost-effective LLMs by predicting multi-dimensional capability requirements with a multi-head encoder and applying shortfall matching against configuration-defined model profiles, delivering up to 72.5% cost savings on coding benchmarks while remaining decoupled from specific…
-
Exploiting LLM Agent Supply Chains via Payload-less Skills
Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evading all tested scanners.
-
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
CRANE applies magnitude thresholding, a Conservative Taylor Gate, and Graduated Sigmoidal Projection to the Thinking-Instruct delta to improve code agent pass rates on Roo-Eval, SWE-bench Verified, and Terminal-Bench while preserving efficiency.
-
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
-
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
Muon-OGD introduces a spectral-norm-constrained orthogonal projection method, solved via dual iterations and Newton-Schulz approximations, to improve the stability-plasticity trade-off in sequential LLM adaptation.
-
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.
-
Skill Retrieval Augmentation for Agentic AI
Introduces the SRA paradigm and the SRA-Bench benchmark (5,400 tasks, 26,262 skills), showing that retrieval improves performance but LLMs fail to selectively incorporate retrieved skills.
-
Learned or Memorized? Quantifying Memorization Advantage in Code LLMs
A perturbation method shows that the memorization advantage in code LLMs varies widely by model and task, remaining low on the CVEFixes and Defects4J benchmarks.
-
InCoder-32B-Thinking: Industrial Code World Model for Thinking
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
-
ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files
ContextCov compiles agent instruction files into static, runtime, and architectural guardrails, raising constraint compliance to 88.3% on SWE-bench Lite tasks versus 67% and 50.3% for prompt and reflection baselines.
-
ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness
ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.
-
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning
TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.
-
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
-
TRINITY: An Evolved LLM Coordinator
A compact 0.6B-parameter coordinator with a 10K-parameter head uses an evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.
-
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
-
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.