Evaluating Large Language Models Trained on Code

Mark Chen et al. · 2021 · cs.LG · arXiv:2107.03374

Mixed citation behavior. Most common role is background (65%).

1310 Pith papers citing it

Background 65% of classified citations

abstract

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
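The repeated-sampling result in the abstract (70.2% of problems solved with 100 samples per problem) is the kind of number usually reported as pass@k. Below is a minimal sketch of the unbiased estimator from the paper, 1 - C(n-c, k) / C(n, k), where n samples are drawn for a problem and c of them pass the unit tests. The function name `pass_at_k` and the example values are illustrative assumptions and do not come from this page.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of the n generated samples passed the unit tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), written as a product to avoid huge binomials.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))


# Hypothetical example: 3 of 10 samples pass, evaluated at k = 1.
print(pass_at_k(10, 3, 1))
```

A benchmark-level score would then be the mean of this per-problem estimate over all problems.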

citation-role summary

background: 161 · dataset: 64 · method: 12 · other: 3 · baseline: 2

citation-polarity summary

background: 157 · use dataset: 57 · use method: 12 · unclear: 9 · support: 5 · baseline: 2

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

PCB-QA: Evaluating LLMs over the First Printed Circuit Board Design Question-Answer Dataset

cs.AR · 2026-06-10 · unverdicted · novelty 8.0

PCB-QA is the first QA benchmark for LLMs on printed circuit board designs, with Gemini 3 Flash Preview reaching 93% accuracy on a JSON textual representation.

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

cs.AI · 2026-06-08 · unverdicted · novelty 8.0

TheoremBench is a Lean4 benchmark of classical theorems in main and premised forms that evaluates LLM provers on partial progress, coverage, and token efficiency rather than binary success on competition problems.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

cs.CL · 2026-05-13 · unverdicted · novelty 8.0 · 2 refs

Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.

CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research

cs.SE · 2026-05-12 · unverdicted · novelty 8.0

CIDR is a large-scale curated dataset of proprietary industrial source code repositories spanning 138 languages and 373 million lines of code, collected via formal agreements with industry partners.

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

cs.AI · 2026-05-10 · unverdicted · novelty 8.0

PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

cs.AI · 2026-05-04 · conditional · novelty 8.0

PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.

Can Coding Agents Reproduce Findings in Computational Materials Science?

cs.SE · 2026-05-01 · conditional · novelty 8.0

AutoMat benchmark shows current LLM coding agents achieve at most 54.1% success when reproducing computational materials science claims from papers.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

StabilizerBench: A Benchmark for AI-Assisted Quantum Error Correction Circuit Synthesis

quant-ph · 2026-04-23 · conditional · novelty 8.0

StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.

Gradient-Based Program Synthesis with Neurally Interpreted Languages

cs.LG · 2026-04-20 · unverdicted · novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC

cs.AR · 2026-04-16 · unverdicted · novelty 8.0

LLM agents autonomously evolve the ABC logic synthesis tool by iteratively rewriting its source code to achieve better quality-of-results on standard benchmarks while preserving the original interface.

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

physics.chem-ph · 2026-04-03 · conditional · novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

cs.SE · 2025-07-20 · conditional · novelty 8.0

AIDev is a new open dataset of 456k AI-agent pull requests showing agents submit code faster than humans but with lower acceptance rates and simpler changes.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

citing papers explorer

Showing 50 of 1310 citing papers.

Towards Evaluation of Implicit Software World Models in Coding LLMs cs.SE · 2026-06-25 · unverdicted · none · ref 14 · internal anchor
Introduces evaluation of LLMs' implicit software world models via prediction of execution resources on real software tasks, finding modest and brittle performance across models including frontier ones.
Empirical Software Engineering TerraProbe: A Layered-Oracle Framework for Detecting Deceptive Fixes in LLM-Assisted Terraform cs.LG · 2026-06-25 · unverdicted · none · ref 27 · internal anchor
TerraProbe shows that targeted Checkov removal overstates LLM Terraform repair success, with 71.4% of plan-compared real-world repairs being deceptive fixes that leave vulnerabilities intact.
GeMoE: Gating Entropy is All You Need for Uncertainty-aware Adaptive Routing in MoE-based Large Vision-Language Models cs.CV · 2026-06-24 · unverdicted · none · ref 10 · internal anchor
GeMoE adaptively sets the number of experts per token via gating entropy, retaining 99.5% of static-routing performance while raising average sparsity by 36.5%.
CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues cs.SE · 2026-06-24 · unverdicted · none · ref 7 · 2 links · internal anchor
CodeChat-Eval shows LLMs lose 19.2% to 69.2% functional correctness over multi-turn refinement dialogues, with largest drops on logic-level and additive changes.
CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes cs.CR · 2026-06-24 · unverdicted · none · ref 16 · internal anchor
CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.
The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms cs.LG · 2026-06-24 · unverdicted · none · ref 9 · 2 links · internal anchor
Introduces the Generalization Spectrum evaluation framework to track per-example generalization across transfer distances in competitive programming tasks.
Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR cs.AI · 2026-06-23 · unverdicted · none · ref 7 · 2 links · internal anchor
TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments cs.CV · 2026-06-23 · unverdicted · none · ref 77 · 4 links · internal anchor
PatternGSL introduces a learnable specification language for sewing patterns that lets vision-language models reconstruct explicit, simulation-ready 3D garments from single images, backed by a new 300K paired dataset.
Agentic Generation of AST Transformation Rules for Fixing Breaking Updates cs.SE · 2026-06-23 · conditional · none · ref 35 · internal anchor
BigBag generates reusable AST transformation rules via LLMs that achieve 78.6% fix rate on 157 breaking dependency updates and 33.3% cross-project transfer overall.
GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents cs.AI · 2026-06-22 · unverdicted · none · ref 2 · internal anchor
A matched benchmark shows GUI computer-use agents at 59.1% full pass rate versus 48.2% for original-skill CLI agents, rising to 69.3% with verifier-guided augmentation, indicating modality-specific execution bottlenecks.
DART: Draft-Agreement Routing for Training-Free Adaptive Thinking Budgets in Hybrid Reasoning Models cs.AI · 2026-06-22 · unverdicted · none · ref 23 · internal anchor
DART is a training-free router that accepts direct answers on draft agreement and allocates thinking budgets via draft entropy on disagreement, reporting accuracy gains and token reductions on math and code benchmarks across model scales.
Beyond Simpson's Paradox: A Cascade of Confounders in AI Agent Pull-Request Co-Authorship cs.SE · 2026-06-21 · unverdicted · none · ref 1 · internal anchor
Stratified analysis of AIDev PRs shows co-authorship effects on AI agent merge rates are artefacts of agent composition, repository selection, and PR commit structure rather than causal benefits.
RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents cs.SE · 2026-06-21 · unverdicted · none · ref 4 · 2 links · internal anchor
RigorBench evaluates AI coding agents on process discipline via five pillars and reports 41% higher process scores and 17% better outcome correctness with structured approaches on 30 tasks.
The Alignment Problem in Constrained Code Generation cs.SE · 2026-06-19 · unverdicted · none · ref 8 · internal anchor
Incomplete constrainers in constrained decoding push LLMs into low-probability program regions, making unconstrained decoding outperform constrained decoding on functional correctness across seven models and three benchmarks.
Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering cs.AI · 2026-06-18 · unverdicted · none · ref 10 · 2 links · internal anchor
Introduces the Power Systems Agent Benchmark with 41 task families across eight power engineering areas for executable evaluation of AI agents using deterministic feasibility checks.
When Do Intrinsic Rewards Work for Code Reasoning? A Comprehensive Study cs.AI · 2026-06-18 · unverdicted · none · ref 1 · internal anchor
Empirical evaluation on LiveCodeBench shows certainty-based RLIF yields early gains followed by output shortening and reasoning collapse, providing no advantage for RLVR initialization on code tasks.
N-Version Programming with Coding Agents cs.SE · 2026-06-18 · unverdicted · none · ref 7 · internal anchor
Diverse AI coding agents in N-version programming reduce mean failures from 387.44 to 130.99 in triples on the Launch Interceptor Program, with 11,844 zero-failure units observed across 1M tests.
Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning cs.SE · 2026-06-18 · unverdicted · none · ref 9 · internal anchor
Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.
SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design cs.MA · 2026-06-18 · unverdicted · none · ref 38 · internal anchor
SIGMA introduces skill-incidence graphs to compose agents from reusable skills, yielding higher average performance and robustness than topology-only baselines on reasoning and coding benchmarks.
Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks cs.AI · 2026-06-18 · unverdicted · none · ref 27 · internal anchor
EPB distills NCO models into evolving program portfolios via LLM-driven textual-numerical optimization, matching original performance while exposing stage-dependent heuristic-like behavior.
Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation cs.LG · 2026-06-17 · unverdicted · none · ref 8 · internal anchor
10.3-22.9% of pass@k=0 math examples across GSM8K and MATH are recovered by a deterministic six-chain regime using activation grafting, showing a sampling blind spot in difficulty estimation.
StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns cs.SE · 2026-06-17 · unverdicted · none · ref 9 · internal anchor
StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.
Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates cs.LG · 2026-06-17 · unverdicted · none · ref 19 · internal anchor
MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.
Explaining Attention with Program Synthesis cs.LG · 2026-06-17 · unverdicted · none · ref 5 · 2 links · internal anchor
Language-model-guided program synthesis can approximate transformer attention heads with over 75% IoU fidelity on held-out data and allow replacing 25% of heads with only 16% average perplexity increase.
DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models cs.CL · 2026-06-17 · unverdicted · none · ref 2 · internal anchor
Block-size curriculum learning trains an 8B diffusion model to achieve competitive reasoning performance on math and code benchmarks by transitioning from small to large training block sizes.
Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration cs.SE · 2026-06-17 · unverdicted · none · ref 8 · internal anchor
Exploratory study of vibe-coded projects shows variability is bound at generation time; proposes VbR as an SPL method using LLMs to generate variant-specific code from specifications.
Beyond Prediction: Tail-Aware Scheduling for LLM Inference cs.LG · 2026-06-16 · unverdicted · none · ref 47 · internal anchor
Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.
Signature filtering: a lightweight enhancement for statistical watermark detection in large language models cs.LG · 2026-06-16 · conditional · none · ref 6 · internal anchor
Signature filtering learns unreliable tokens with MILP and removes them at detection time, raising true positive rates from 8-31% to 78-99% across Kgw, Sweet, Unigram, and Exp watermarks on multiple corpora and LLMs while controlling false positives.
Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation cs.AI · 2026-06-16 · unverdicted · none · ref 34 · internal anchor
CEO-Bench evaluates LLMs on CEO-level strategic resource reallocation via multi-role agent simulations, showing high structural validity but sharp divergence on strategic calibration across five frontier models on 13 scenarios.
Configuration Smells in AGENTS.md Files: Common Mistakes in Configuring Coding Agents cs.SE · 2026-06-14 · unverdicted · none · ref 1 · internal anchor
Presents first catalog of six smells in coding-agent config files, with automated detection heuristics, and reports high prevalence (e.g., Lint Leakage in 62%) from analysis of 100 open-source repos.
AgentRivet: an automated system for producing Rivet routines from journal publications hep-ex · 2026-06-11 · unverdicted · none · ref 14 · internal anchor
AgentRivet applies commercial LLMs in an autonomous workflow to extract physics details from ATLAS and CMS papers and generate Rivet routines, achieving few syntax errors but occasional physics implementation issues on two test cases.
SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents cs.CL · 2026-06-11 · unverdicted · none · ref 13 · internal anchor
SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.
Detecting Functional Memorization in Code Language Models cs.LG · 2026-06-11 · unverdicted · none · ref 36 · internal anchor
Authors demonstrate functional memorization in code LLMs via counterfactual midtraining comparison on functional equivalence metrics beyond textual overlap.
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks cs.LG · 2026-06-10 · conditional · none · ref 5 · internal anchor
Claw-SWE-Bench is a 350-instance multilingual benchmark for OpenClaw-style agent harnesses that shows adapter design raises Pass@1 from 19.1% to 73.4% on the same model while releasing data for reproducible comparison.
Agreement in Representation Space for Open-Ended Self-Consistency cs.CL · 2026-06-10 · unverdicted · none · ref 5 · internal anchor
EBA clusters sampled LLM generations in representation space to estimate agreement, outperforming random selection with stable scaling and showing that central positions correlate with higher generation quality.
CORE-Bench: A Comprehensive Benchmark for Code Retrieval in the Era of Agentic Coding cs.IR · 2026-06-10 · accept · none · ref 13 · internal anchor
CORE-Bench is a benchmark for code retrieval in agentic coding settings, built from curated tasks and SWE-bench instances, showing performance drops and gains from fine-tuning.
Representing Time Series as Structured Programs for LLM Reasoning cs.LG · 2026-06-10 · unverdicted · none · ref 36 · internal anchor
T2SP converts time series into structured programs for trends, periods, and events, enabling off-the-shelf LLMs to perform better on editing, captioning, and QA tasks than raw string inputs.
CODEBLOCK: Learning to Supervise Code at the Right Granularity cs.LG · 2026-06-10 · unverdicted · none · ref 5 · internal anchor
CodeBlock partitions code responses into syntactically coherent blocks, scores them with generalized cross-entropy and data-flow signals, and applies sparse supervision to achieve higher pass@1 than full SFT using 1.9% of tokens on six benchmarks.
INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration cs.AI · 2026-06-09 · unverdicted · none · ref 2 · internal anchor
INFRAMIND is an infrastructure-aware multi-agent orchestration framework that uses RL on a hierarchical constrained MDP to jointly optimize topology, model selection, and scheduling under dynamic load.
P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning cs.CV · 2026-06-09 · unverdicted · none · ref 15 · internal anchor
P3D-Bench is a benchmark with three task families that scores MLLMs on generating executable parametric 3D programs, finding failures in precise geometry and part assembly.
Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation cs.LG · 2026-06-09 · unverdicted · none · ref 3 · internal anchor
AR-OPD disentangles privileged supervision via anchored residual guidance to reduce hindsight leakage in on-policy distillation, reporting gains of 2.3 points over full privileged OPD and 7.9 over SFT on reasoning tasks.
Causally Evaluating the Learnability of Formal Language Tasks cs.CL · 2026-06-08 · unverdicted · none · ref 64 · internal anchor
Introduces the binning semiring and causal graphical models to show that correlational evaluation of learnability in formal language tasks leads to incorrect conclusions from confounders.
PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees cs.CR · 2026-06-08 · unverdicted · none · ref 16 · internal anchor
PrivCode++ introduces the first DP code generation method protecting both prompts and code via latent-conditioned two-stage training, claiming higher utility and stronger privacy than prior baselines.
Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs cs.AI · 2026-06-07 · unverdicted · none · ref 1 · internal anchor
Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.
Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling cs.LG · 2026-06-07 · unverdicted · none · ref 21 · internal anchor
EGPS localizes MCMC moves to high-entropy decision points using forward-pass entropy, yielding up to 12.6× wall-clock speedup and best-or-tied accuracy on MATH500, HumanEval, and GPQA for Qwen2.5-Math-7B.
When LLMs Invent Rust Crates: An Empirical Study of Hallucination Patterns and Mitigation cs.SE · 2026-06-07 · unverdicted · none · ref 7 · internal anchor
First empirical study shows crate hallucination in Rust LLMs has consistent rates across models insensitive to parameters and tests prompt-based mitigation.
AsyncLane: Decoupling Refinement from Advancement in Diffusion Language Model Decoding cs.CL · 2026-06-07 · unverdicted · none · ref 3 · internal anchor
AsyncLane decouples refinement from advancement in DLM decoding via lane forking at delimiters plus efficiency optimizations, yielding up to 3x throughput gains on math and code benchmarks without retraining.
Closed-Form Spectral Regularization for Multi-Task Model Merging cs.LG · 2026-06-05 · unverdicted · none · ref 56 · internal anchor
Iterative solvers in layer-wise model merging act as spectral regularizers on an ill-posed interference operator; closed-form SWUDI and adaptive SWUDI-A match or exceed SOTA merging accuracy with 28-72x wall-clock speedup.
WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing cs.LG · 2026-06-05 · unverdicted · none · ref 7 · internal anchor
WhiFlash introduces token-level cross-paradigm routing between autoregressive and diffusion drafting models, with cache optimizations, to raise acceptance lengths and deliver up to 69.6% throughput gains over EAGLE-3.
OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios cs.CL · 2026-06-05 · unverdicted · none · ref 54 · internal anchor
OpenHalDet creates a standardized benchmark and open codebase for comparing hallucination detectors across diverse LLM generation scenarios and access settings.