Code Llama: Open Foundation Models for Code

Fabian Gloeckle, Itai Gat, Jonas Gehring, Sten Sootla, Xiaoqing Ellen Tan · 2023 · cs.CL · arXiv 2308.12950

Canonical reference. 80% of citing Pith papers cite this work as background.

294 Pith papers citing it

abstract

We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B and 70B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B, 13B and 70B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 67% and 65% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.

citation-role summary

background: 39 · method: 4 · baseline: 3 · dataset: 2 · other: 1

citation-polarity summary

background: 39 · use method: 4 · baseline: 3 · use dataset: 2 · unclear: 1

authors

Baptiste Rozière, Fabian Gloeckle, Itai Gat, Jonas Gehring, Sten Sootla, Xiaoqing Ellen Tan

representative citing papers

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Mitigating Package Hallucinations in Large Language Models via Model Editing

cs.SE · 2026-07-02 · unverdicted · novelty 7.0

BOUND refines LLMs' package-validity boundary via targeted editing to cut package hallucination rates by 79.9% on edit prompts and 65.4% on unseen prompts in recommendation tasks while generalizing to code generation.

The Illusion of Safety: Multi-Tier Verification of AI vs. Human C++ Code

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Multi-tier verification on VULBENCH-CPP shows AI-generated C++ code triggers confirmed runtime violations roughly twice as often as human code, while static analysis misleadingly indicates parity due to code length.

The Alignment Problem in Constrained Code Generation

cs.SE · 2026-06-19 · unverdicted · novelty 7.0

Incomplete constrainers in constrained decoding push LLMs into low-probability program regions, making unconstrained decoding outperform constrained decoding on functional correctness across seven models and three benchmarks.

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

cs.SE · 2026-06-18 · unverdicted · novelty 7.0

Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.

Detecting Functional Memorization in Code Language Models

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

Authors demonstrate functional memorization in code LLMs via counterfactual midtraining comparison on functional equivalence metrics beyond textual overlap.

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Strong coding agents use metaprogramming to solve tasks in unfamiliar esoteric languages while weaker agents do not, with performance gaps larger than in mainstream benchmarks.

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

OpenRTLSet supplies 131k+ Verilog samples with AI-generated descriptions to enable fine-tuning of LLMs for hardware module design.

Causally Evaluating the Learnability of Formal Language Tasks

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Introduces the binning semiring and causal graphical models to show that correlational evaluation of learnability in formal language tasks leads to incorrect conclusions from confounders.

PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees

cs.CR · 2026-06-08 · unverdicted · novelty 7.0

PrivCode++ introduces the first DP code generation method protecting both prompts and code via latent-conditioned two-stage training, claiming higher utility and stronger privacy than prior baselines.

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

cs.AI · 2026-06-07 · unverdicted · novelty 7.0

Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.

SkelDPO: A Skeleton-Guided Direct Preference Optimization Framework for Efficient Code Generation

cs.SE · 2026-06-05 · unverdicted · novelty 7.0

SkelDPO improves code generation efficiency by 2-7% over prior DPO methods via joint preference losses on full code and efficiency-critical skeletons.

Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutable information sources.

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

3DCodeBench is a new benchmark evaluating 12 VLMs on translating multimodal prompts into procedural 3D modeling code, paired with 3DCodeArena for human preference rankings.

What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

cs.SE · 2026-05-29 · unverdicted · novelty 7.0

An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.

Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

cs.SE · 2026-05-27 · unverdicted · novelty 7.0

Hybrid vector-search plus fingerprinting pipeline for LLM code provenance achieves Winnowing-level MRR on short snippets and up to 5.4% better on longer ones at logarithmic query time.

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

cs.CL · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

Introduces BacktestBench benchmark with 18k QA pairs across four backtesting tasks and evaluates 23 LLMs via the AutoBacktest multi-agent system.

Constrained Code Generation with Discrete Diffusion

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

Constrained Diffusion for Code (CDC) integrates constraint satisfaction into the reverse denoising process of discrete diffusion models via constraint-aware operators that use optimization and program analysis to steer generation toward feasible programs.

Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

Fine-tuning LLMs on an unseen language teaches syntax but fails to transfer semantic competence, leaving Python with up to a 19% performance advantage and no tested intervention closing the gap.
