mega hub Mixed citations

Evaluating Large Language Models Trained on Code

Mark Chen et al · 2021 · cs.LG · arXiv 2107.03374

Mixed citation behavior. Most common role is background (65%).

1268 Pith papers citing it

Background 65% of classified citations

open full Pith review browse 1268 citing papers more from Mark Chen et al arXiv PDF

abstract

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 161 dataset 64 method 12 other 3 baseline 2

citation-polarity summary

background 157 use dataset 57 use method 12 unclear 9 support 5 baseline 2

claims ledger

abstract We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of ou

authors

Mark Chen et al

mega hub controls

export citing contexts JSON export graph JSON export full bundle JSON open full Pith review annotated reader queued

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

co-cited works

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

PCB-QA: Evaluating LLMs over the First Printed Circuit Board Design Question-Answer Dataset

cs.AR · 2026-06-10 · unverdicted · novelty 8.0

PCB-QA is the first QA benchmark for LLMs on printed circuit board designs, with Gemini 3 Flash Preview reaching 93% accuracy on a JSON textual representation.

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

cs.AI · 2026-06-08 · unverdicted · novelty 8.0

TheoremBench is a Lean4 benchmark of classical theorems in main and premised forms that evaluates LLM provers on partial progress, coverage, and token efficiency rather than binary success on competition problems.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

cs.CL · 2026-05-13 · unverdicted · novelty 8.0 · 2 refs

Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.

CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research

cs.SE · 2026-05-12 · unverdicted · novelty 8.0

CIDR is a large-scale curated dataset of proprietary industrial source code repositories spanning 138 languages and 373 million lines of code, collected via formal agreements with industry partners.

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

cs.AI · 2026-05-10 · unverdicted · novelty 8.0

PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

cs.AI · 2026-05-04 · conditional · novelty 8.0

PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.

Can Coding Agents Reproduce Findings in Computational Materials Science?

cs.SE · 2026-05-01 · conditional · novelty 8.0

AutoMat benchmark shows current LLM coding agents achieve at most 54.1% success when reproducing computational materials science claims from papers.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

StabilizerBench: A Benchmark for AI-Assisted Quantum Error Correction Circuit Synthesis

quant-ph · 2026-04-23 · conditional · novelty 8.0

StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.

Gradient-Based Program Synthesis with Neurally Interpreted Languages

cs.LG · 2026-04-20 · unverdicted · novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC

cs.AR · 2026-04-16 · unverdicted · novelty 8.0

LLM agents autonomously evolve the ABC logic synthesis tool by iteratively rewriting its source code to achieve better quality-of-results on standard benchmarks while preserving the original interface.

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

physics.chem-ph · 2026-04-03 · conditional · novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

cs.SE · 2025-07-20 · conditional · novelty 8.0

AIDev is a new open dataset of 456k AI-agent pull requests showing agents submit code faster than humans but with lower acceptance rates and simpler changes.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

citing papers explorer

Showing 17 of 17 citing papers after filters.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving cs.DC · 2026-07-01 · unverdicted · none · ref 4 · 2 links · internal anchor
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models cs.DC · 2026-06-03 · unverdicted · none · ref 19 · internal anchor
D^2SD uses two diffusion drafters in a prefix tree structure with confidence scores to select and recover alternative draft sequences, achieving higher acceptance rates in speculative decoding.
Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference cs.DC · 2026-06-01 · unverdicted · none · ref 14 · internal anchor
A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving cs.DC · 2026-05-13 · conditional · none · ref 5 · internal anchor
KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs cs.DC · 2026-05-05 · unverdicted · none · ref 5 · internal anchor
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
Efficient Remote KV Cache Reuse with GPU-native Video Codec cs.DC · 2026-02-10 · conditional · none · ref 22 · internal anchor
KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.
Latent Reasoning Guidance for Parallel Code Translation cs.DC · 2026-06-03 · unverdicted · none · ref 2 · internal anchor
Latent PRM guidance selects among hidden-state trajectories to raise executable validation rate from 32.89% to 42.1% on a 76-task parallel code translation benchmark.
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving cs.DC · 2026-05-03 · unverdicted · none · ref 4 · 3 links · internal anchor
SplitZip introduces a fast lossless KV cache compressor for disaggregated LLM inference that achieves 613 GB/s compression throughput on BF16 tensors and up to 1.32x end-to-end speedup.
Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation cs.DC · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
A Compile-and-Execute system decouples LLM reasoning from browser execution via a one-shot JSON blueprint, reducing inference from O(M x N) to amortized O(1) for repetitive web workflows.
PRAXIS: Integrating Program Analysis with Observability for Root-Cause Analysis cs.DC · 2025-12-26 · unverdicted · none · ref 47 · internal anchor
PRAXIS combines LLM-driven structured traversal of service dependency graphs and hammock-block program dependence graphs to improve root-cause analysis accuracy by up to 6.3x while cutting token consumption by 5.3x on 30 real-world cloud incidents.
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning cs.DC · 2025-11-18 · unverdicted · none · ref 4 · internal anchor
Seer improves synchronous LLM RL rollout throughput by up to 2.04x and reduces long-tail latency by 72-94% via divided rollout, context-aware scheduling, and adaptive grouped speculative decoding based on prompt similarity observations.
Beyond Uniform Experts: Cost-Aware Expert Execution for Efficient Multi-Device MoE Inference cs.DC · 2026-06-29 · unverdicted · none · ref 8 · internal anchor
CAEE reduces MoE inference latency 8-18% on 671B DeepSeek-R1 by cost-aware expert pruning and low-overhead compensation while keeping accuracy drop under 1%.
ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios cs.DC · 2026-03-10 · unverdicted · none · ref 4 · internal anchor
ECHO uses sparse gating and elastic budget pivoting in a super-tree structure to achieve up to 5.35x speedup for LLM inference under high concurrency.
From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications cs.DC · 2026-05-04 · unverdicted · none · ref 6 · internal anchor
A pattern-based workflow engineering approach with AI assistance enables rapid development of sensor-driven applications across edge-to-core infrastructures, shown via reusable templates in environmental monitoring case studies.
(POSTER) From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications cs.DC · 2026-05-04 · unverdicted · none · ref 4 · internal anchor
A pattern-based AI-assisted workflow methodology using Pegasus on FABRIC reduces sensor-driven application development to 1-1.5 days per app by reusing templates and configuring for edge resources.
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities cs.DC · 2026-04-24 · unverdicted · none · ref 21 · internal anchor
A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
SMART: When is it Actually Worth Expanding a Speculative Tree? cs.DC · 2026-04-09 · unreviewed · ref 7 · internal anchor