Deepseek-v4: Towards highly efficient million-token context intelligence
11 Pith papers cite this work.
11 representative citing papers (2026)
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
  A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62% of realistic long-horizon CLI tasks.
- MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs
  MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
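The generator itself is not reproduced in this summary, but the generate-then-verify pattern it relies on is easy to illustrate. A minimal sketch, assuming a hypothetical subset-sum-style instance family: plant a solution so every instance is guaranteed solvable, check answers mechanically, and expose an exact brute-force search as the stand-in "solver tool".

```python
import itertools
import random

def make_instance(n=12, seed=0):
    """Hypothetical stand-in for a verified combinatorial generator:
    plant a solution so the instance is solvable by construction."""
    rng = random.Random(seed)
    items = [rng.randint(1, 99) for _ in range(n)]
    planted = rng.sample(range(n), k=n // 3)
    target = sum(items[i] for i in planted)
    return items, target

def verify(items, target, answer):
    """Automatic verifier: accept any index set whose items hit the target."""
    return sum(items[i] for i in set(answer)) == target

def solve(items, target):
    """Exact brute-force 'solver tool', the kind of access the paper
    contrasts with tool-free LLM attempts."""
    for r in range(len(items) + 1):
        for combo in itertools.combinations(range(len(items)), r):
            if sum(items[i] for i in combo) == target:
                return list(combo)
    return None

items, target = make_instance()
assert verify(items, target, solve(items, target))
```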
- The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes
  On-policy distillation for LLMs is sensitive to teacher choice and loss design; self-distillation fails on instance-specific information but succeeds on shared rules. Stop-gradient TopK, adapted teachers, and SFT stabilization serve as mitigations.
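The paper's exact loss is not given in this summary. The sketch below shows one plausible reading of the stop-gradient TopK mitigation, assuming it means restricting the distillation loss to the teacher's top-k token support, with the teacher side and the index selection detached from the graph.

```python
import torch
import torch.nn.functional as F

def topk_kd_loss(student_logits, teacher_logits, k=32):
    """One plausible reading of a stop-gradient TopK distillation loss:
    compute the loss only over the teacher's top-k tokens, with the
    teacher distribution and the index choice detached."""
    with torch.no_grad():  # stop-gradient: teacher side and index selection
        top_vals, top_idx = teacher_logits.topk(k, dim=-1)
        p_teacher = F.softmax(top_vals, dim=-1)  # renormalized over top-k support
    log_p_student = F.log_softmax(student_logits.gather(-1, top_idx), dim=-1)
    # cross-entropy against the truncated teacher distribution
    # (equals the forward KL up to the constant teacher entropy)
    return -(p_teacher * log_p_student).sum(-1).mean()

# toy usage: batch of 2 sequences, 5 positions, vocab 1000
s = torch.randn(2, 5, 1000, requires_grad=True)
t = torch.randn(2, 5, 1000)
topk_kd_loss(s, t).backward()
```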
- Phases of Muon: When Muon Eclipses SignSGD
  On least-squares problems with power-law covariance, SignSVD (Muon) and SignSGD (an Adam proxy) show three phases of relative performance depending on the data exponent α and the target exponent β.
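The two update rules being compared are simple to state. Below is a minimal NumPy sketch on a synthetic least-squares problem whose feature spectrum and target coefficients follow power laws; the sizes, exponents, and learning rate are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 64, 8, 512
alpha, beta = 1.5, 1.0  # illustrative data/target exponents

eig = np.arange(1, d + 1, dtype=float) ** (-alpha)
X = rng.normal(size=(n, d)) * np.sqrt(eig)  # power-law covariance spectrum
W_star = (np.arange(1, d + 1, dtype=float) ** (-beta))[:, None] * rng.normal(size=(d, m))
Y = X @ W_star

def grad(W):
    return X.T @ (X @ W - Y) / n  # least-squares gradient

W_muon, W_sign, lr = np.zeros((d, m)), np.zeros((d, m)), 0.05
for _ in range(500):
    G = grad(W_muon)
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    W_muon -= lr * (U @ Vt)                # SignSVD / Muon: sign of the singular values
    W_sign -= lr * np.sign(grad(W_sign))   # SignSGD / Adam proxy: entrywise sign

for name, W in [("SignSVD", W_muon), ("SignSGD", W_sign)]:
    print(name, "train MSE:", np.mean((X @ W - Y) ** 2))
```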
- LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
  LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
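Of the four ingredients, the inference-time loop is the easiest to sketch. The toy module below shows block reuse with selective gating and confidence-based early exiting; the block, gate, and threshold are stand-ins, and training-time random deep supervision is omitted.

```python
import torch
import torch.nn as nn

class LoopedRefiner(nn.Module):
    """Control-flow sketch of looped latent refinement: re-apply a shared
    block to the hidden state, exiting early once the output head is
    confident. All components here are illustrative stand-ins."""
    def __init__(self, d_model=256, vocab=1000, max_loops=8, tau=0.9):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)  # selective gating of the update
        self.head = nn.Linear(d_model, vocab)
        self.max_loops, self.tau = max_loops, tau

    def forward(self, h):
        for _ in range(self.max_loops):
            update = self.block(h)
            g = torch.sigmoid(self.gate(h))      # how much refinement to accept
            h = g * update + (1 - g) * h
            conf = self.head(h).softmax(-1).max(-1).values.mean()
            if conf > self.tau:                  # confidence-based early exit
                break
        return self.head(h)

logits = LoopedRefiner()(torch.randn(2, 16, 256))
```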
- Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code
  A new Intent Fidelity Score and refinement loop verify that LLM-generated simulation code matches the intended PDEs, improving performance on a 220-case benchmark where execution alone fails to ensure correctness.
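The scoring model is not reproduced here, but the control flow is a plain verify-and-refine loop. In the sketch below, `generate` and `fidelity` are hypothetical callables standing in for the LLM and the Intent Fidelity Score.

```python
def refine_to_intent(spec, generate, fidelity, threshold=0.9, max_rounds=3):
    """Skeleton of a verify-and-refine loop: keep regenerating simulation
    code until a PDE-grounded fidelity score, not mere successful
    execution, clears a threshold."""
    code = generate(spec, feedback=None)
    for _ in range(max_rounds):
        score, feedback = fidelity(code, spec)  # compare code terms to intended PDE terms
        if score >= threshold:
            break
        code = generate(spec, feedback=feedback)
    return code, score

# toy usage with dummy stand-ins for the LLM and the scorer
code, score = refine_to_intent(
    "1D heat equation, Dirichlet BCs",
    generate=lambda spec, feedback: f"# solver for: {spec}",
    fidelity=lambda code, spec: (1.0, "ok"),
)
```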
- AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
  AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
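The auditor itself is a learned model, but the online control flow it plugs into is simple to sketch; here `audit` is a hypothetical stand-in for the paper's 7B auditor.

```python
def audited_run(steps, audit, halt_at=0.8):
    """Online-auditing sketch: score the trajectory prefix after every
    agent step and halt once a decisive early error looks likely."""
    history = []
    for step in steps:
        history.append(step)
        p_fail = audit(history)  # P(trajectory already doomed | prefix)
        if p_fail >= halt_at:
            return history, "halted early"
    return history, "completed"

# toy usage: flag the run as soon as a step mentions a destructive command
trace, status = audited_run(
    ["plan task", "rm -rf workspace", "run tests"],
    audit=lambda h: 0.95 if "rm -rf" in h[-1] else 0.1,
)
```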
- Rubric-based On-policy Distillation
  Rubric-based on-policy distillation trains student models from teacher responses alone: scoring rubrics generated from response contrasts drive on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
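A structural sketch of one training step, assuming hypothetical LLM-backed callables `make_rubric` and `judge`; the point it illustrates is that only teacher text is consumed, never teacher logits.

```python
def rubric_distill_step(prompt, better_resp, worse_resp, student_sample,
                        make_rubric, judge):
    """Sketch of rubric-based on-policy distillation: derive a scoring
    rubric from contrasting responses, then use it to score an on-policy
    student sample as a reward. `make_rubric` and `judge` are
    hypothetical LLM-backed callables."""
    rubric = make_rubric(prompt, better=better_resp, worse=worse_resp)
    reward = judge(rubric, prompt, student_sample)  # scalar rubric score
    return reward  # feed into any on-policy RL update, e.g. a policy gradient
```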
- DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
  DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
- VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation
  VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.
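As a rough picture of what a single benchmark record carries, a hypothetical layout; the benchmark's actual schema and field names are not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class VeriContestProblem:
    """Hypothetical record layout for one of the 946 problems."""
    statement: str       # competitive-programming problem text
    spec: str            # formal pre/postconditions (Verus)
    reference_code: str  # Rust implementation
    proof: str           # Verus proof annotations
    tests: list[str] = field(default_factory=list)  # executable test cases
```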
- From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs
  LogiHard hardens reasoning benchmarks by transforming 0-order selection into 2-order judgment, causing 31-56% accuracy drops in 12 frontier LLMs and a 47% drop on zero-shot MMLU, revealing a combinatorial reasoning gap rather than knowledge deficits.
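The paper's exact transformation is not spelled out in this summary. The sketch below illustrates one combinatorial-hardening flavor, turning a single pick-one item into truth judgments over option pairs, which multiplies the number of ways a model can be wrong without adding any new knowledge.

```python
from itertools import combinations

def harden(question, options, correct_idx):
    """Illustrative combinatorial hardening: replace one pick-the-answer
    item (0-order selection) with true/false judgments over pairs of
    options (a 2-order flavor). The paper's exact transformation may
    differ; this only shows why accuracy drops combinatorially."""
    items = []
    for i, j in combinations(range(len(options)), 2):
        claim = f"{question} -- exactly one of '{options[i]}' / '{options[j]}' is correct."
        label = (i == correct_idx) != (j == correct_idx)  # XOR: exactly one correct
        items.append((claim, label))
    return items

items = harden("2+2=?", ["3", "4", "5", "22"], correct_idx=1)
# one 4-option MCQ becomes C(4,2)=6 judgment items the model must all get right
```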