Code Llama: Open Foundation Models for Code

Aaron Grattafiori; Alexandre Défossez; Artyom Kozhevnikov; Baptiste Rozière; Cristian Canton Ferrer; Fabian Gloeckle; Faisal Azhar; Gabriel Synnaeve; Hugo Touvron; Itai Gat

arxiv: 2308.12950 · v3 · submitted 2023-08-24 · 💻 cs.CL

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve

Pith reviewed 2026-05-10 15:01 UTC · model grok-4.3

classification 💻 cs.CL

keywords Code Llama · large language models · code generation · infilling · open source models · programming benchmarks · HumanEval · MultiPL-E

The pith

Code Llama models achieve state-of-the-art results among open models on code benchmarks while adding infilling and long-context support.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper releases Code Llama, a family of large language models for code built on Llama 2. The models come in foundation, Python-specialized, and instruction-tuned variants across multiple sizes. They are trained on 16k-token sequences and show improvements on inputs of up to 100k tokens, and some variants support code infilling from surrounding context. They report leading scores on standard code generation benchmarks relative to other publicly available models and are offered under a license that permits both research and commercial use. The work focuses on making capable code models openly accessible rather than keeping them closed.

Core claim

Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 67% and 65% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. The models support infilling based on surrounding content for the 7B, 13B and 70B sizes and show gains on inputs extending to 100k tokens.

What carries the argument

Fine-tuning Llama 2 models on large-scale code data to produce specialized models that support infilling from surrounding content and extended context lengths.

If this is right

The 7B Python variant surpassing the much larger Llama 2 70B indicates that targeted specialization on code data can yield efficiency gains.
Outperformance on MultiPL-E across all sizes points to broad multi-language code capabilities.
Permissive licensing enables direct integration into developer tools and commercial products.
Support for infilling and 100k-token contexts allows the models to handle longer code files and partial completions in practice.
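
To make "infilling from surrounding content" concrete, here is a minimal sketch of how a fill-in-the-middle prompt can be assembled: the document is cut into a prefix, a held-out middle, and a suffix, and the model is asked to produce the middle given the other two. The sentinel strings and the function name `make_infilling_prompt` are hypothetical placeholders for illustration; the text above does not specify the tokens or prompt layout the released models use.

```python
from typing import Tuple

# Hypothetical sentinel strings, placeholders for illustration only.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def make_infilling_prompt(document: str, start: int, end: int) -> Tuple[str, str]:
    """Split `document` at [start, end) and build an infilling prompt.

    Returns (prompt, target), where `target` is the held-out middle span
    that a model would be expected to generate after the MID sentinel.
    """
    if not 0 <= start <= end <= len(document):
        raise ValueError("span must satisfy 0 <= start <= end <= len(document)")
    prefix = document[:start]
    middle = document[start:end]
    suffix = document[end:]
    # Prefix and suffix are both visible to the model; the middle is not.
    prompt = f"{PRE}{prefix}{SUF}{suffix}{MID}"
    return prompt, middle


source = "def add(a, b):\n    return a + b\n"
start = source.index("return")
end = start + len("return a + b")
prompt, target = make_infilling_prompt(source, start, end)
print(repr(prompt))
print(repr(target))
```

Placing the suffix before the generation point is what lets an ordinary left-to-right model condition on code that comes after the gap.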

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption could shift more coding assistance work from closed to open models, altering the competitive landscape for AI coding tools.
The efficiency of the smaller specialized models suggests a path for deploying capable code assistance on resource-limited hardware.
Future tests could measure whether these models maintain performance when asked to edit or debug entire existing codebases rather than generating isolated functions.

Load-bearing premise

The reported benchmark scores on HumanEval, MBPP, and MultiPL-E reflect genuine generalization to real coding tasks without significant test-data contamination in the training data and with evaluation protocols that are comparable to those used for other models.

What would settle it

Creating a fresh collection of coding problems guaranteed to be absent from the training corpus and observing that the models score substantially below the claimed 67% on HumanEval or 65% on MBPP would show the generalization claim does not hold.

Original abstract

We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B and 70B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B, 13B and 70B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 67% and 65% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Code Llama releases practical open code models with strong reported benchmark scores including a small Python variant beating a 70B base model, but the abstract leaves data decontamination and evaluation details too thin to fully support the SOTA claims.

The one or two things to know: Code Llama is an open release of code-specialized models based on Llama 2 that reports leading performance on several code generation benchmarks, including a 7B Python version that outperforms the 70B Llama 2 on HumanEval and MBPP. It also adds infilling capabilities and better handling of long contexts up to 100k tokens after training at 16k.

Referee Report

3 major / 2 minor

Summary. The paper introduces Code Llama, a family of open foundation models for code derived from Llama 2, offered in foundation, Python-specialized, and instruction-tuned variants at 7B, 13B, 34B, and 70B scales. All models are trained on 16k-token sequences with claimed improvements on contexts up to 100k tokens; select variants support infilling. The central claims are state-of-the-art performance among open models on code benchmarks, with peak scores of 67% on HumanEval and 65% on MBPP, the 7B Python variant outperforming Llama 2 70B on those tasks, and all variants leading publicly available models on MultiPL-E. The models are released under a permissive license.

Significance. If the benchmark results prove robust, the work would meaningfully advance open code modeling by releasing high-performing weights that narrow the gap to closed models, enable broad reproducibility, and illustrate the effectiveness of domain specialization (e.g., 7B Python model beating a 70B general model). The multi-variant design and long-context/infilling support add practical value for research and applications.

major comments (3)

[§4 (Evaluation)] The headline SOTA claims rest on HumanEval and MBPP pass rates, yet the section supplies no quantitative decontamination statistics (exact or near-duplicate overlap detection) against the public GitHub and code sources used for the >500B-token training corpus; this is load-bearing because benchmark provenance overlaps with training data.
[Results tables] (e.g., Table 2 or equivalent) Direct comparisons asserting superiority over other open models do not state that all baselines were re-run under the authors' exact sampling protocol, temperature, top-p, and harness; without this, numerical differences may reflect protocol mismatch rather than capability.
[§3 (Training)] The description of training data composition and filtering lacks sufficient detail on proportions of code versus other content and on any explicit steps taken to exclude benchmark test problems, undermining confidence that reported generalization is uncontaminated.
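
The "exact or near-duplicate overlap detection" requested in the first comment can be approximated with a simple n-gram check. The sketch below is one possible way to produce such a statistic, not the procedure used by the authors: it tokenizes on whitespace (an assumption; a real pipeline would more likely use the model tokenizer or a code-aware lexer), and the function names and the choice of `n = 5` are illustrative.

```python
from typing import Iterable, Set, Tuple


def token_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs: Iterable[str], n: int = 5) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus.

    A value near 1.0 suggests the item (or a near copy) is present in the
    training data; a value near 0.0 suggests little verbatim overlap.
    """
    item_grams = token_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= token_ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


item = "a b c d e f"
print(overlap_fraction(item, ["x a b c d e y"], n=5))
```

Reporting the distribution of this fraction over all benchmark problems, together with the threshold used to flag an item, would give the kind of quantitative statistic the comment asks for. At corpus scale the in-memory set would need to be replaced by hashing or an index.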

minor comments (2)

[Abstract] The reported 'scores of up to 67% and 65%' should explicitly name the metric (pass@1) and the precise model variant achieving each peak to aid quick assessment.
[Figures and tables] Several performance plots would benefit from explicit error bars or variance estimates across multiple runs to convey result stability.
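
For readers unfamiliar with the pass@1 metric named in the first minor comment, the sketch below shows the unbiased pass@k estimator that was introduced alongside HumanEval: draw n samples per problem, count the c samples that pass the unit tests, and estimate 1 - C(n - c, k) / C(n, k). It is an assumption that the paper uses this estimator; the text above does not say which estimator or how many samples per problem were used, and the function names are illustrative.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: budget of attempts being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Three hypothetical problems, 10 samples each, with 3, 0 and 10 passing.
print(benchmark_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, which is why pass@1 depends on the sampling temperature used to generate them.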

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

Referee: [§4 (Evaluation)] The headline SOTA claims rest on HumanEval and MBPP pass rates, yet the section supplies no quantitative decontamination statistics (exact or near-duplicate overlap detection) against the public GitHub and code sources used for the >500B-token training corpus; this is load-bearing because benchmark provenance overlaps with training data.

Authors: We agree that explicit decontamination analysis strengthens confidence in the results. The original manuscript does not report quantitative overlap statistics. In the revision we will add a dedicated paragraph in §4 describing the data filtering steps applied to the training corpus and any available estimates of overlap with HumanEval and MBPP. revision: yes
Referee: [Results tables] Direct comparisons asserting superiority over other open models do not state that all baselines were re-run under the authors' exact sampling protocol, temperature, top-p, and harness; without this, numerical differences may reflect protocol mismatch rather than capability.

Authors: Baseline numbers were taken from the original papers or public leaderboards rather than re-evaluated under our exact harness. We will update the results section and table captions to explicitly state our sampling parameters (temperature 0.1, top-p 0.95) and note the provenance of each baseline score. revision: yes
Referee: [§3 (Training)] The description of training data composition and filtering lacks sufficient detail on proportions of code versus other content and on any explicit steps taken to exclude benchmark test problems, undermining confidence that reported generalization is uncontaminated.

Authors: We acknowledge that §3 could be more granular. The revised manuscript will expand the training data description to include approximate proportions of code versus non-code data and additional detail on the filtering pipeline used to reduce the risk of benchmark leakage. revision: yes
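
The second response turns on sampling parameters (temperature 0.1, top-p 0.95), so it may help to see what those two knobs do. The following is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) sampling over a single vector of logits. It illustrates the general technique only; the function name is hypothetical and nothing here is taken from the authors' evaluation harness.

```python
import math
import random
from typing import List, Optional


def sample_top_p(
    logits: List[float],
    temperature: float = 0.1,
    top_p: float = 0.95,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token index using temperature scaling and nucleus sampling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    rng = rng or random.Random()

    # Temperature scaling: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus: List[int] = []
    cumulative = 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights within the nucleus.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


print(sample_top_p([2.0, 1.0, 0.0], rng=random.Random(0)))
```

At temperature 0.1 the distribution is sharply peaked, so sampling is close to greedy decoding; baselines evaluated at a higher temperature would be drawing from a visibly different distribution, which is the protocol mismatch the referee is concerned about.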

Circularity Check

0 steps flagged

No significant circularity; results are empirical measurements on external benchmarks

The paper reports measured pass rates on standard external code benchmarks (HumanEval, MBPP, MultiPL-E) after continued pre-training on public code corpora. These scores are obtained by running the trained models on fixed test suites whose problems are not part of the model's own fitted parameters or loss function. No equations, self-citations, or ansatzes are invoked that would make the reported numbers equivalent to the training inputs by construction. The central claims therefore rest on independent, externally verifiable evaluations rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on standard LLM continued-pretraining assumptions and the validity of existing code benchmarks rather than new free parameters, axioms, or invented entities.

free parameters (2)

Training context length
Set at 16k tokens to support long-context capability while remaining computationally feasible.
Model sizes
Selected as 7B, 13B, 34B, and 70B to span different compute budgets.

axioms (2)

domain assumption: Continued pretraining on code data from a general LLM base improves code-specific performance
Core premise for creating Code Llama from Llama 2; invoked implicitly throughout the abstract.
domain assumption: HumanEval, MBPP, and MultiPL-E scores measure meaningful coding ability
Used to assert state-of-the-art status without additional justification in the abstract.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Training on Multiple Consumer GPUs with RoundPipe
cs.DC 2026-04 conditional novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
cs.CR 2026-04 unverdicted novelty 8.0

The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
Why Do Multi-Agent LLM Systems Fail?
cs.AI 2025-03 unverdicted novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...
BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting
cs.CL 2026-05 unverdicted novelty 7.0

BacktestBench is the first large-scale benchmark for LLM-automated quantitative backtesting, with 18,246 QA pairs from real market data and a multi-agent baseline called AutoBacktest.
Constrained Code Generation with Discrete Diffusion
cs.CL 2026-05 unverdicted novelty 7.0

Constrained Diffusion for Code (CDC) integrates constraint satisfaction into the reverse denoising process of discrete diffusion models via constraint-aware operators that use optimization and program analysis to stee...
Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language
cs.CL 2026-05 unverdicted novelty 7.0

Fine-tuning LLMs on an unseen language teaches syntax but fails to transfer semantic competence, leaving Python with up to a 19% performance advantage and no tested intervention closing the gap.
Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support
cs.SE 2026-05 unverdicted novelty 7.0

Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.
Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
cs.AR 2026-05 unverdicted novelty 7.0

Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.
SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications
cs.MA 2026-05 unverdicted novelty 7.0

SmartEval is a new benchmark showing LLM-generated smart contracts score 8.29 points higher than expert versions on average but frequently omit logic (35.3%) or mishandle state transitions (23.4%).
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
cs.AI 2026-05 unverdicted novelty 7.0

BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.
MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation
cs.GR 2026-05 unverdicted novelty 7.0

MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...
Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative
cs.CL 2026-05 unverdicted novelty 7.0

Mean-pooled cosine similarity grows with sequence length in anisotropic transformer embeddings independent of content, while CKA shows far less length dependence across code, translation, and vision tasks.
Evaluating Non-English Developer Support in Machine Learning for Software Engineering
cs.SE 2026-05 unverdicted novelty 7.0

Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs
cs.LG 2026-05 unverdicted novelty 7.0

Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
cs.DC 2026-05 unverdicted novelty 7.0

Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
QASecClaw: A Multi-Agent LLM Approach for False Positive Reduction in Static Application Security Testing
cs.CR 2026-05 unverdicted novelty 7.0

A multi-agent LLM system cuts false positives in static application security testing by 88.6% on the OWASP Benchmark while dropping recall by only 3.1%.
VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns
cs.CR 2026-05 unverdicted novelty 7.0

VulKey reaches 31.5% repair accuracy on real C/C++ vulnerabilities by matching hierarchical expert patterns to guide LLM patch generation, beating prior baselines by 7.6%.
The Power of Order: Fooling LLMs with Adversarial Table Permutations
cs.LG 2026-05 unverdicted novelty 7.0

Semantically invariant row and column permutations can fool LLMs on tabular tasks, and a new gradient-based attack called ATP finds such permutations to significantly degrade performance across models.
Social Bias in LLM-Generated Code: Benchmark and Mitigation
cs.SE 2026-05 unverdicted novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery
cs.SE 2026-04 unverdicted novelty 7.0

A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.
PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement
cs.RO 2026-04 unverdicted novelty 7.0

PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.
RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow
cs.SE 2026-04 unverdicted novelty 7.0

RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
cs.SE 2026-04 conditional novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL
cs.CL 2026-04 unverdicted novelty 7.0

Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.
IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning
cs.LG 2026-04 unverdicted novelty 7.0

IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...
PlayCoder: Making LLM-Generated GUI Code Playable
cs.SE 2026-04 conditional novelty 7.0

PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
cs.SE 2026-04 unverdicted novelty 7.0

A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion
cs.CL 2026-04 unverdicted novelty 7.0

TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.
SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair
cs.SE 2026-04 unverdicted novelty 7.0

SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
cs.SE 2026-04 unverdicted novelty 7.0

CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...
CodeComp: Structural KV Cache Compression for Agentic Coding
cs.CL 2026-04 unverdicted novelty 7.0

CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context pa...
Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
cs.SE 2026-04 unverdicted novelty 7.0

LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new Bi...
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor
cs.SE 2026-04 unverdicted novelty 7.0

ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
MIRAGE: Online LLM Simulation for Microservice Dependency Testing
cs.SE 2026-04 unverdicted novelty 7.0

Online LLM simulation of microservice dependencies achieves 99% status-code and response-shape fidelity across 110 scenarios on three systems, far exceeding record-replay baselines.
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
Think Anywhere in Code Generation
cs.SE 2026-03 unverdicted novelty 7.0

Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.
Efficient Remote KV Cache Reuse with GPU-native Video Codec
cs.DC 2026-02 conditional novelty 7.0

KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
cs.CL 2026-02 unverdicted novelty 7.0

Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench
cs.LG 2026-01 unverdicted novelty 7.0

HE-SNR is a high-entropy signal-to-noise ratio metric derived from the Entropy Compression Hypothesis to better guide LLM mid-training on complex software engineering benchmarks.
ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation
cs.CL 2026-01 unverdicted novelty 7.0

ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.
In Line with Context: Repository-Level Code Generation via Context Inlining
cs.SE 2026-01 unverdicted novelty 7.0

InlineCoder reframes repository-level code generation as function-level coding by using a draft anchor to inline the target function into its call graph for upstream usage and downstream dependency context.
Exploration vs. Fixation: Scaffolding Divergent and Convergent Thinking for Human-AI Co-Creation with Generative Models
cs.HC 2025-12 unverdicted novelty 7.0

HAICo structures AI image creation into switchable divergent and convergent modes based on the Geneplore model and outperforms ChatGPT on creativity and usability in a poster task.
PerfCoder: Large Language Models for Interpretable Code Performance Optimization
cs.SE 2025-12 unverdicted novelty 7.0

PerfCoder is a family of LLMs trained on optimization trajectories with human annotations and runtime-based preference alignment that achieves higher runtime speedups and optimization rates on the PIE benchmark than p...
Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
cs.LG 2025-11 unverdicted novelty 7.0

Low-rank compression preserves training-data privacy and improves adversarial robustness but weakens personal-information protection, reduces ethical behavior in zero-shot use, and harms fairness.
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
cs.SE 2025-10 conditional novelty 7.0

CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reas...
Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models
cs.SE 2025-10 unverdicted novelty 7.0

LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.
ML Code Smells: From Specification to Detection
cs.SE 2025-09 unverdicted novelty 7.0

SpecDetect4ML detects 22 ML code smells via DSL specifications and CPG-based analysis, reporting 95.82% precision and 88.14% recall on 890 ML systems while outperforming prior tools.
PromptCOS: Towards Content-only System Prompt Copyright Auditing for LLMs
cs.CR 2025-09 unverdicted novelty 7.0

PromptCOS is a content-only watermarking method for LLM system prompts that embeds detectable cyclic signals via auxiliary tokens while preserving fidelity and resisting removal attacks.
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
cs.SE 2025-08 accept novelty 7.0

The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
cs.CL 2025-07 unverdicted novelty 7.0

A pipeline that uses SysML diagrams enhanced by NLP and LLMs to automatically generate dynamical system computational models from unstructured text, demonstrated on a simple pendulum with better results than zero-shot LLMs.
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
cs.CL 2024-10 conditional novelty 7.0

Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
Towards Agentic Runtime Healing
cs.SE 2024-08 unverdicted novelty 7.0

Healer uses LLMs to dynamically generate and execute runtime error-handling code, with GPT-4 recovering from 72.8% of errors across four datasets.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
cs.LG 2024-07 accept novelty 7.0

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
SpinQuant: LLM quantization with learned rotations
cs.LG 2024-05 conditional novelty 7.0

SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
RouterBench: A Benchmark for Multi-LLM Routing System
cs.LG 2024-03 unverdicted novelty 7.0

RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.
CodeMind: Evaluating Large Language Models for Code Reasoning
cs.SE 2024-02 unverdicted novelty 7.0

CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
cs.CL 2023-12 accept novelty 7.0

A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 220 Pith papers

[1]

LCC-balanced

to our resampled “LCC-balanced” test set. Code tokens are determined by parsing the completion context with tree_sitter. We finish the prompt with “assert my_function() == ”. Accuracy is measured over 64 distinct examples for each combination of prompt length and key position depending on whether it generated the correct value or not. LCC-balanced. The di...

work page
[2]

Write a function that finds the maximum depth of list nesting in a given list

work page
[3]

Given an integer array nums, rotate the array to the right by k steps, where k is non-negative

work page
[4]

A pitch should consist of a letter, possibly a # sign, and an octave number

Write a function that gets the musical pitch that is closest to a given frequency in Hz. A pitch should consist of a letter, possibly a # sign, and an octave number

work page
[5]

Write a function that removes any sequences of whitespace that are between numbers in an input string

work page
[6]

The function should ignore characters that aren’t letters, numbers, or spaces

Write a function that counts the number of words in a string that have length n, where n is an input. The function should ignore characters that aren’t letters, numbers, or spaces

work page
[7]

Write a function that returns the longest palindromic substring in a given string

work page
[8]

Create a function that will find the length of the longest substring without repeating characters in a given string

work page
[9]

Write a function that reverses the digits in a number

work page
[10]

Write a function that finds the valleys in a list of integers

work page
[11]

#Test case n:

Write a python function that reverses every group of k words in a sentence. Figure 10: Prompt used to generate interview-style programming questions. Prompt: [INST] Your task is to write 5 tests to check the correctness of a function that solves a programming problem. The tests must be between [TESTS] and [/TESTS] tags. You must write the comment "#Test c...

work page
[12]

Look for the software’s process and check if it is using a lot of CPU or memory resources

Check the software’s task manager: Press Ctrl+Shift+Esc on Windows or Command+Option+Esc on Mac to open the task manager. Look for the software’s process and check if it is using a lot of CPU or memory resources. If the software is using too many resources, you can try to end the process. [...]

work page
[13]

This will force all running processes to terminate, including the software operation that is not responding

Restart your system: If none of the above steps work, you can try restarting your system. This will force all running processes to terminate, including the software operation that is not responding. It is important to note that force-terminating a software operation can cause data loss or corruption, so it is recommended to try other methods first. Additi...

work page
[14]

This can be done using a cryptographically secure pseudorandom number generator

Key Generation: The first step is to generate the decryption keys. This can be done using a cryptographically secure pseudorandom number generator. The generated keys should be of sufficient length and complexity to ensure that they are difficult to guess or crack

work page
[15]

This can be done using a hardware security module (HSM) or a secure key storage solution

Key Storage: Once the keys are generated, they need to be stored in a secure location. This can be done using a hardware security module (HSM) or a secure key storage solution. The HSM is a physical device that stores the keys securely and provides a secure environment for key management operations. The secure key storage solution can be a software-based ...

work page
[16]

This can be done by implementing role-based access control (RBAC) policies that restrict access to the keys based on the user’s role and privileges

Key Access Control: To ensure that the decryption keys are only accessible to authorized personnel, access control mechanisms need to be put in place. This can be done by implementing role-based access control (RBAC) policies that restrict access to the keys based on the user’s role and privileges

work page
[17]

The encryption key should be stored separately and securely, and the decryption keys should be encrypted using the encryption key

Key Encryption: To protect the decryption keys from unauthorized access, they should be encrypted using a strong encryption algorithm. The encryption key should be stored separately and securely, and the decryption keys should be encrypted using the encryption key

work page
[18]

This can be done by automatically rotating the decryption keys after a certain period, such as 90 days

Key Rotation: To ensure that the decryption keys are regularly updated and replaced, a key rotation policy should be implemented. This can be done by automatically rotating the decryption keys after a certain period, such as 90 days

work page
[19]

This can be done by storing the decryption keys in a secure location, such as a backup server or a cloud storage service

Key Recovery: To ensure that the decryption keys can be recovered in case of a disaster or data loss, a key recovery mechanism should be implemented. This can be done by storing the decryption keys in a secure location, such as a backup server or a cloud storage service

work page
[20]

decoding

Key Management: To ensure that the decryption keys are properly managed and secured, a key management system should be implemented. This can be done by using a key management software that provides a centralized repository for managing the decryption keys. In summary, a system for securely storing and retrieving decryption keys should include key generati...

work page 2019
[21]

"" Parse host string into (hostname, port) tuple. :param host: host string :return: (hostname, port) tuple

Install ‘sox‘ if you haven’t already. 2. Open a terminal and navigate to the directory where your input.mp3 file is located. 3. Run the following command: sox input.mp3 output.mp3 delay 10 This will add a 10-second delay to the input.mp3 file and save the result as output.mp3. You can adjust the delay time by changing the value after the ‘delay‘ keyword. ...

work page
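
The key-retrieval setup quoted in entry [1] above (a function hidden somewhere in a long prompt that ends with "assert my_function() == ") can be sketched roughly as follows. This is a minimal illustration and not the paper's evaluation code: the filler functions, the helper names `build_key_retrieval_prompt` and `is_correct`, the fractional `key_position` argument, and the exact-match scoring rule are all assumptions made for the example.

```python
import random


def build_key_retrieval_prompt(key, n_filler, key_position, seed=0):
    """Build a synthetic prompt (hypothetical helper, not from the paper).

    key          -- integer that my_function() returns
    n_filler     -- number of distractor functions (controls prompt length)
    key_position -- fraction in [0, 1]: where the key function is inserted
    """
    rng = random.Random(seed)
    # Distractor functions return values in 0..9999; choosing a key outside
    # that range avoids accidental collisions with the filler.
    filler = [
        f"def filler_{i}():\n    return {rng.randint(0, 9999)}\n"
        for i in range(n_filler)
    ]
    key_fn = f"def my_function():\n    return {key}\n"
    idx = int(key_position * n_filler)
    blocks = filler[:idx] + [key_fn] + filler[idx:]
    # The prompt ends with the unfinished assert quoted in entry [1].
    return "\n".join(blocks) + "\nassert my_function() == "


def is_correct(completion, key):
    """Assumed scoring rule: the first generated token must equal the key."""
    tokens = completion.strip().split()
    return bool(tokens) and tokens[0] == str(key)


if __name__ == "__main__":
    prompt = build_key_retrieval_prompt(key=12345, n_filler=10, key_position=0.5)
    print(prompt[-60:])
```

Accuracy for one combination of prompt length and key position would then be the fraction of examples (64 in the quoted description) for which `is_correct` holds on the model's completion.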
