super hub Canonical reference

LLaMA: Open and Efficient Foundation Language Models

· 2023 · cs.CL · arXiv 2302.13971

Canonical reference. 82% of citing Pith papers cite this work as background.

985 Pith papers citing it

Background 82% of classified citations

open full Pith review browse 985 citing papers arXiv PDF

abstract

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 203 method 19 baseline 8 other 6 dataset 1 extension 1

citation-polarity summary

background 195 use method 20 unclear 13 baseline 7 extend 1 support 1 use dataset 1

claims ledger

abstract We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

co-cited works

representative citing papers

Privacy Auditing with Zero (0) Training Run

cs.CR · 2026-05-14 · unverdicted · novelty 8.0

Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

cs.LG · 2026-05-12 · accept · novelty 8.0

Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

Backdoor Attacks on Decentralised Post-Training

cs.CR · 2026-03-31 · conditional · novelty 8.0 · 2 refs

An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequent safety training.

Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

cs.SE · 2025-06-16 · conditional · novelty 8.0

First study of 1,899 MCP servers finds eight distinct vulnerabilities (only three traditional), 7.2% with general issues, 5.5% with tool poisoning, and 66% with code smells, urging MCP-specific security practices.

BEAVER: An Enterprise Benchmark for Text-to-SQL

cs.CL · 2024-09-03 · unverdicted · novelty 8.0

BEAVER is the first text-to-SQL benchmark from private enterprise data warehouses, revealing SOTA agentic frameworks achieve only 10.8% accuracy on complex real-world queries.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

cs.IR · 2024-03-06 · unverdicted · novelty 8.0

BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

cs.CL · 2023-05-17 · accept · novelty 8.0

Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

cs.CL · 2023-04-14 · conditional · novelty 8.0

API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

Introduces Latent Adversarial Robustification and Rank-Constrained Subspace Learning to enable robust generalization in multimodal knowledge editing through adversarial subspace alignment.

CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference

cs.CR · 2026-05-22 · unverdicted · novelty 7.0

CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus prior coarse-grained methods.

Brain-LLM Alignment Tracks Training Data, Not Typology

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.

A mathematical theory of balancing relational generalization and memorization

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

Introduces transitive inference with exceptions task and analytically shows kernel ridge regression balances relational generalization and memorization depending on representational geometry, with validation in finetuned language models.

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.

Generative Conversational Recommender System

cs.IR · 2026-05-21 · unverdicted · novelty 7.0

A single autoregressive model for conversational recommendation that uses semantic item IDs, predicts response intent and target first, then generates the response, reporting up to 29% Recall@1 gains.

citing papers explorer

Showing 23 of 23 citing papers after filters.

Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers cs.SE · 2025-06-16 · conditional · none · ref 136 · internal anchor
First study of 1,899 MCP servers finds eight distinct vulnerabilities (only three traditional), 7.2% with general issues, 5.5% with tool poisoning, and 66% with code smells, urging MCP-specific security practices.
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning cs.SE · 2026-05-12 · unverdicted · none · ref 28 · internal anchor
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow cs.SE · 2026-04-24 · unverdicted · none · ref 44 · internal anchor
RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.
TypePro: Boosting LLM-Based Type Inference via Inter-Procedural Slicing cs.SE · 2026-04-03 · unverdicted · none · ref 29 · internal anchor
TypePro reaches 88.9% and 86.6% Top-1 exact match on Python and TypeScript type-inference datasets by feeding LLMs inter-procedural slices plus structurally derived candidate types.
On Interaction Effects in Greybox Fuzzing cs.SE · 2025-10-22 · conditional · none · ref 44 · internal anchor
MuoFuzz improves greybox fuzzing by learning mutator sequence interactions to select effective orders, outperforming AFL++ and MOPT on coverage and unique bugs in FuzzBench and MAGMA.
How Compliant Are GitHub Actions Workflows? A Checklist-Based Study with LLM-Assisted Auditing cs.SE · 2026-05-03 · accept · none · ref 40 · internal anchor
GitHub Actions workflows achieve only 28% overall compliance with best practices, with LLMs enabling an 81% reduction in verification effort via hybrid adjudication but still requiring expert oversight for security judgments.
SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents cs.SE · 2026-04-20 · unverdicted · none · ref 68 · internal anchor
SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.
TreeRanker: Fast and Model-agnostic Ranking System for Code Suggestions in IDEs cs.SE · 2025-08-04 · unverdicted · none · ref 28 · internal anchor
TreeRanker ranks static code completions by organizing candidates in a prefix tree and collecting token scores via a single greedy language-model decoding pass.
Effective LLM Code Refinement via Property-Oriented and Structurally Minimal Feedback cs.SE · 2025-06-23 · unverdicted · none · ref 40 · internal anchor
PGS generates property-oriented, structurally minimal feedback from high-level program properties to refine LLM code, yielding up to 13.4% pass@1 gains and 1.4-1.6x higher bug-fix rates than prior TDD and debugging baselines.
What's DAT? Three Case Studies of Measuring Software Development Productivity at Meta With Diff Authoring Time cs.SE · 2025-03-14 · conditional · none · ref 57 · internal anchor
DAT is a telemetry-based time metric for developer productivity, validated via observational studies and applied in three Meta case studies showing 14%, 33%, and >50% improvements.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 170 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Prompt Optimization for LLM Code Generation via Reinforcement Learning cs.SE · 2026-05-18 · unverdicted · none · ref 32 · internal anchor
A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.
The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code cs.SE · 2026-05-13 · unverdicted · none · ref 94 · internal anchor
LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.
ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering cs.SE · 2026-04-15 · unverdicted · none · ref 51 · internal anchor
ToxiShield delivers a real-time GitHub extension with a BERT toxicity detector at 98% accuracy, a Claude-based coach, and a fine-tuned Llama reframer at 95% style transfer accuracy, validated by a 10-person TAM study.
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering cs.SE · 2026-04-09 · accept · none · ref 138 · internal anchor
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance cs.SE · 2024-12-30 · unverdicted · none · ref 80 · internal anchor
LicenseGPT fine-tuned on 500 expert-annotated licenses raises prediction agreement to 64.30% and cuts per-license analysis time by 94.44% from 108s to 6s in lawyer user studies.
Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models cs.SE · 2023-10-17 · unverdicted · none · ref 60 · internal anchor
bLLMs achieve state-of-the-art results on limited and imbalanced SE sentiment datasets even in zero-shot settings, but fine-tuned sLLMs outperform when ample balanced training data is available.
Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation cs.SE · 2026-05-16 · unverdicted · none · ref 20 · internal anchor
Static checking rewards and moderate AST-based hints improve diffusion RL performance for code generation, with effectiveness varying by task difficulty across HumanEval, MBPP, and LiveCodeBench.
An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code cs.SE · 2026-04-25 · unverdicted · none · ref 18 · internal anchor
Locally deployed LLMs achieve 43-45% accuracy on Python bug detection but frequently produce only partial identifications of problematic code regions.
An empirical study of LoRA-based fine-tuning of large language models for automated test case generation cs.SE · 2026-04-08 · unverdicted · none · ref 45 · internal anchor
LoRA fine-tuning enables open-source LLMs such as Ministral-8B to generate requirement-based test cases at a level comparable to pre-tuned proprietary GPT-4.1 models.
Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition cs.SE · 2026-04-03 · conditional · none · ref 53 · internal anchor
Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.
LLM4Log: A Systematic Review of Large Language Model-based Log Analysis cs.SE · 2026-03-18 · unverdicted · none · ref 172 · 2 links · internal anchor
Systematic review of 145 papers on LLM-based log analysis, providing a unified taxonomy, common design patterns, evaluation practices, and challenges for deployment under drift and limited labels.
Can LLMs Find Bugs in Code? An Evaluation from Beginner Errors to Security Vulnerabilities in Python and C++ cs.SE · 2025-08-22 · unverdicted · none · ref 5 · internal anchor
LLMs perform well on basic syntactic and semantic bugs in small code but struggle with complex security vulnerabilities and large production codebases.

LLaMA: Open and Efficient Foundation Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer