hub

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari · 2024 · arXiv 2406.15877

21 Pith papers cite this work. Polarity classification is still indexing.

21 Pith papers citing it

open full Pith review browse 21 citing papers arXiv PDF

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

cs.SE · 2026-05-13 · conditional · novelty 7.0

10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

cs.SE · 2026-04-27 · unverdicted · novelty 7.0

Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.

Skill Retrieval Augmentation for Agentic AI

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.

Incisor: Ex Ante Cloud Instance Selection for HPC Jobs

cs.DC · 2026-04-27 · unverdicted · novelty 7.0

Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constrained SkyPilot baseline.

Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

cs.SE · 2026-04-23 · conditional · novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.

Neurosymbolic Repo-level Code Localization

cs.SE · 2026-04-17 · unverdicted · novelty 7.0

LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.

Structural Anchors and Reasoning Fragility:Understanding CoT Robustness in LLM4Code

cs.SE · 2026-04-14 · unverdicted · novelty 7.0

CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.

DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

DuET uses dual execution of generated code and pseudocode with majority voting to achieve state-of-the-art test output prediction, boosting Pass@1 by 13.6 percentage points on LiveCodeBench.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.

Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

cs.SE · 2026-04-27 · conditional · novelty 6.0

SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.

Learned or Memorized ? Quantifying Memorization Advantage in Code LLMs

cs.SE · 2026-04-15 · unverdicted · novelty 6.0

A perturbation method shows memorization advantage in code LLMs varies widely by model and task, remaining low on CVEFixes and Defects4J benchmarks.

InCoder-32B-Thinking: Industrial Code World Model for Thinking

cs.AR · 2026-04-03 · unverdicted · novelty 6.0

InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

cs.LG · 2025-12-10 · conditional · novelty 6.0

LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

cs.LG · 2026-05-09 · unverdicted · novelty 5.0

Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.

MiMo-V2-Flash Technical Report

cs.CL · 2026-01-06 · unverdicted · novelty 5.0

MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

cs.CL · 2025-03-03 · unverdicted · novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks

cs.SE · 2026-04-12 · unverdicted · novelty 4.0

Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.

An End-to-End Framework for Building Large Language Models for Software Operations

cs.LG · 2026-04-06 · unverdicted · novelty 4.0 · 2 refs

OpsLLM is a domain-specific LLM for software ops QA and RCA built with human-curated data, SFT, and RL using a domain process reward model, showing accuracy gains of 0.2-5.7% on QA and 2.7-70.3% on RCA over general LLMs.

Qwen2.5-Coder Technical Report

cs.CL · 2024-09-18 · unverdicted · novelty 4.0

Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10

citing papers explorer

Showing 5 of 5 citing papers after filters.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach cs.LG · 2025-02-07 · unverdicted · none · ref 187 · internal anchor
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning cs.LG · 2026-05-11 · unverdicted · none · ref 50 · internal anchor
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
LLaDA2.0: Scaling Up Diffusion Language Models to 100B cs.LG · 2025-12-10 · conditional · none · ref 43 · internal anchor
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning cs.LG · 2026-05-09 · unverdicted · none · ref 40 · internal anchor
Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.
An End-to-End Framework for Building Large Language Models for Software Operations cs.LG · 2026-04-06 · unverdicted · none · ref 60 · 2 links · internal anchor
OpsLLM is a domain-specific LLM for software ops QA and RCA built with human-curated data, SFT, and RL using a domain process reward model, showing accuracy gains of 0.2-5.7% on QA and 2.7-70.3% on RCA over general LLMs.

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer