Code Llama: Open Foundation Models for Code

Fabian Gloeckle, Itai Gat, Jonas Gehring, Sten Sootla, Xiaoqing Ellen Tan · 2023 · cs.CL · arXiv 2308.12950

Canonical reference. 80% of classified citations from Pith papers cite this work as background.

297 Pith papers citing it

Background: 80% of classified citations

abstract

We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B and 70B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B, 13B and 70B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 67% and 65% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.

citation-role summary

background 39 · method 4 · baseline 3 · dataset 2 · other 1

citation-polarity summary

background 39 · use method 4 · baseline 3 · use dataset 2 · unclear 1

authors

Baptiste Rozière, Fabian Gloeckle, Itai Gat, Jonas Gehring, Sten Sootla, Xiaoqing Ellen Tan

representative citing papers

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Mitigating Package Hallucinations in Large Language Models via Model Editing

cs.SE · 2026-07-02 · unverdicted · novelty 7.0

BOUND refines LLMs' package-validity boundary via targeted editing to cut package hallucination rates by 79.9% on edit prompts and 65.4% on unseen prompts in recommendation tasks while generalizing to code generation.

The Illusion of Safety: Multi-Tier Verification of AI vs. Human C++ Code

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Multi-tier verification on VULBENCH-CPP shows AI-generated C++ code triggers confirmed runtime violations roughly twice as often as human code, while static analysis misleadingly indicates parity due to code length.

Autonomous Generation of Metamaterial Databases Based on Multimodal Agents

physics.optics · 2026-06-22 · unverdicted · novelty 7.0

MetaDataGenAgent is a multimodal multi-agent system with a closed-loop plan-execute-reflect mechanism that autonomously extracts, validates, and encodes metamaterial data from unstructured literature into executable databases.

The Alignment Problem in Constrained Code Generation

cs.SE · 2026-06-19 · unverdicted · novelty 7.0

Incomplete constrainers in constrained decoding push LLMs into low-probability program regions, making unconstrained decoding outperform constrained decoding on functional correctness across seven models and three benchmarks.

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

cs.SE · 2026-06-18 · unverdicted · novelty 7.0

Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.

Detecting Functional Memorization in Code Language Models

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

Authors demonstrate functional memorization in code LLMs via counterfactual midtraining comparison on functional equivalence metrics beyond textual overlap.

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Strong coding agents use metaprogramming to solve tasks in unfamiliar esoteric languages while weaker agents do not, with performance gaps larger than in mainstream benchmarks.

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

OpenRTLSet supplies 131k+ Verilog samples with AI-generated descriptions to enable fine-tuning of LLMs for hardware module design.

Causally Evaluating the Learnability of Formal Language Tasks

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Introduces the binning semiring and causal graphical models to show that correlational evaluation of learnability in formal language tasks leads to incorrect conclusions from confounders.

PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees

cs.CR · 2026-06-08 · unverdicted · novelty 7.0

PrivCode++ introduces the first DP code generation method protecting both prompts and code via latent-conditioned two-stage training, claiming higher utility and stronger privacy than prior baselines.

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

cs.AI · 2026-06-07 · unverdicted · novelty 7.0

Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.

SkelDPO: A Skeleton-Guided Direct Preference Optimization Framework for Efficient Code Generation

cs.SE · 2026-06-05 · unverdicted · novelty 7.0

SkelDPO improves code generation efficiency by 2-7% over prior DPO methods via joint preference losses on full code and efficiency-critical skeletons.

Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutable information sources.

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

3DCodeBench is a new benchmark evaluating 12 VLMs on translating multimodal prompts into procedural 3D modeling code, paired with 3DCodeArena for human preference rankings.

What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

cs.SE · 2026-05-29 · unverdicted · novelty 7.0

An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.

Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

cs.SE · 2026-05-27 · unverdicted · novelty 7.0

Hybrid vector-search plus fingerprinting pipeline for LLM code provenance achieves Winnowing-level MRR on short snippets and up to 5.4% better on longer ones at logarithmic query time.

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

cs.CL · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

Introduces BacktestBench benchmark with 18k QA pairs across four backtesting tasks and evaluates 23 LLMs via the AutoBacktest multi-agent system.

Constrained Code Generation with Discrete Diffusion

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

Constrained Diffusion for Code (CDC) integrates constraint satisfaction into the reverse denoising process of discrete diffusion models via constraint-aware operators that use optimization and program analysis to steer generation toward feasible programs.

citing papers explorer

Showing 22 of 22 citing papers after filters.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · none · ref 86 · internal anchor

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

cs.AI · 2026-06-09 · unverdicted · none · ref 3 · internal anchor

Strong coding agents use metaprogramming to solve tasks in unfamiliar esoteric languages while weaker agents do not, with performance gaps larger than in mainstream benchmarks.

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

cs.AI · 2026-06-07 · unverdicted · none · ref 33 · internal anchor

Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

cs.AI · 2026-05-12 · unverdicted · none · ref 39 · 2 links · internal anchor

LGMT is a logic-grounded metamorphic testing framework that detects hidden reasoning defects in LLMs by checking consistency on semantically invariant inputs derived from FOL equivalences.

Text2DSL: LLM-Based Code Generation for Domain-Specific Languages

cs.AI · 2026-06-21 · unverdicted · none · ref 16 · internal anchor

Formalizes Text2DSL, introduces PolkitBench dataset with 4,204 pairs, and shows structured prompt context boosts syntactic validity to 98.6-99.4%, structural validity by up to 35.5 pp, and CodeBLEU by 60-95% on two MoE models.

Efficient Skill Grounding via Code Refactoring with Small Language Models

cs.AI · 2026-06-06 · unverdicted · none · ref 70 · internal anchor

RECENT decouples skill semantics from embodiment-specific bindings via code refactoring to let small language models achieve skill grounding performance matching large language model baselines.

See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

cs.AI · 2026-05-15 · unverdicted · none · ref 27 · internal anchor

OmniManim improves render quality in educational animation code generation by using a Vision Agent with coarse-to-fine bounding-box denoising and interpolation-aware optimization on new datasets.

Verifiable Process Rewards for Agentic Reasoning

cs.AI · 2026-05-11 · unverdicted · none · ref 20 · 2 links · internal anchor

VPR converts symbolic, constraint, or posterior oracles into dense turn-level rewards for RL, improving credit assignment in agentic reasoning and transferring to general benchmarks.

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

cs.AI · 2026-05-09 · unverdicted · none · ref 39 · 2 links · internal anchor

BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.

LeGo-Code: Can Modular Curriculum Learning Advance Complex Code Generation? Insights from Text-to-SQL

cs.AI · 2026-04-20 · unverdicted · none · ref 18 · internal anchor

Modular curriculum learning with tier-specific adapters outperforms standard fine-tuning on complex Text-to-SQL queries in Spider and BIRD benchmarks by avoiding catastrophic forgetting.

MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

cs.AI · 2025-11-17 · unverdicted · none · ref 31 · internal anchor

MM-Telco creates multimodal benchmarks for telecom and demonstrates that fine-tuned LLMs and VLMs achieve significant performance gains on domain-specific tasks.

Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model

cs.AI · 2025-10-20 · unverdicted · none · ref 20 · internal anchor

Saber improves both speed and accuracy of diffusion language models on code generation by dynamically adjusting unmasking steps and reverting low-confidence tokens via backtracking.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · none · ref 197 · internal anchor

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

Cognitive Architectures for Language Agents

cs.AI · 2023-09-05 · accept · none · ref 67 · internal anchor

CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.

AlgoSkill: Learning to Design Algorithms by Scheduling Human-Like Skills

cs.AI · 2026-06-29 · unverdicted · none · ref 21 · internal anchor

AlgoSkill improves LLM algorithm design on programming benchmarks by framing it as verification-guided scheduling over a typed skill library with MCTS, outperforming direct generation and self-refinement.

Making Failure Safe: A Constrained, Verifiable Agent Framework for Open-Web Data Collection

cs.AI · 2026-06-25 · unverdicted · none · ref 20 · internal anchor

A constrained verifiable agent framework for open-web data collection achieves zero execution-stage LLM tokens and lowest wall-clock time on 80 verified tasks by shifting from free-form code to typed JSON configs with taxonomy and static DAG execution.

Characterizing initial human-AI proof formalization workflows

cs.AI · 2026-06-02 · unverdicted · none · ref 205 · internal anchor

A controlled user study and qualitative survey find that AI assistance raises formalization accuracy for math proofs, with users flexibly combining multiple tools while retaining oversight.

Inference Time Context Sparsity: Illusion or Opportunity?

cs.AI · 2026-05-22 · unverdicted · none · ref 36 · internal anchor

Current LLMs remain robust to high levels of inference-time context sparsity across diverse tasks, enabling up to 10x acceleration via sparse kernels.

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

cs.AI · 2026-05-08 · unverdicted · none · ref 10 · internal anchor

Post-training reweights a pretrained model's behavior distribution either within its existing accessible support (elicitation) or by expanding that support (creation), with both SFT and RL acting as free-energy minimization under different signals.

MAFIG: Multi-agent Driven Formal Instruction Generation Framework

cs.AI · 2026-04-13 · unverdicted · none · ref 32 · internal anchor

MAFIG uses a Perception Agent and Emergency Decision Agent plus span-focused local distillation to let lightweight models rapidly generate formal instructions that fix local scheduling failures, achieving over 94% success with sub-second latency on port, warehousing, and deck datasets.

UA-ChatDev: Uncertainty-Aware Multi-Agent Collaboration for Reliable Software Development

cs.AI · 2026-07-02 · unverdicted · none · ref 22 · internal anchor

UA-ChatDev integrates token-level uncertainty estimation and phase-aware verification into multi-agent software development and reports better benchmark scores than prior frameworks.

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

cs.AI · 2026-06-18 · unverdicted · none · ref 7 · internal anchor

Multi-LCB extends LiveCodeBench to 12 languages by translating Python tasks, revealing Python overfitting and performance disparities when evaluating 24 LLMs.
