{"total":23,"items":[{"citing_arxiv_id":"2605.23657","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-22T14:09:41+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18740","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:57:04+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18109","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-18T09:19:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning that improves success rates on the FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17637","ref_index":44,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games","primary_cat":"cs.AI","submitted_at":"2026-05-17T20:07:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% excellent rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16839","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection","primary_cat":"cs.CL","submitted_at":"2026-05-16T06:47:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CompactAttention accelerates chunked-prefill attention via Block-Union KV Selection, delivering up to 2.72x speedup at 128K context on LLaMA-3.1-8B while matching dense accuracy on RULER.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13329","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Tracing Persona Vectors Through LLM Pretraining","primary_cat":"cs.CL","submitted_at":"2026-05-13T10:44:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12741","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:46:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11182","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes","primary_cat":"cs.AI","submitted_at":"2026-05-11T19:44:59+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Instead of training on off-policy responses, OPD trains the student on trajectories sampled from its own policy and uses one or more teacher models to provide dense token-level supervision along these on-policy rollouts. This makes OPD appealing as a practical mechanism for integrating the capabilities of multiple stronger teachers into the student [3, 4], mitigating catastrophic forgetting, and improving sample efficiency [5, 6]. A closely related topic is on-policy self-distillation (OPSD), where the teacher is not a stronger model but the student model itself conditioned on additional privileged information (PI) [7-9]. The PI may include ground-truth answers, system prompts or user preferences."},{"citing_arxiv_id":"2605.10912","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation","primary_cat":"cs.CL","submitted_at":"2026-05-11T17:49:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"is hybrid: deterministic rule-based checks on produced artifacts, environment-state auditing of side effects, and an LLM/VLM judge invoked only for semantic checks that rule-based signals cannot resolve. Across 19 frontier models, including 6 proprietary (e.g., Claude Opus 4.7 [ 4], GPT 5.5 [ 29]) and 13 open-source ones (e.g., DeepSeek V4 Pro 1.6T [ 10], Qwen 3.5 397B [ 32]), WildClawBench remains far from saturated. Under the OpenClaw harness [ 30], the strongest model, Claude Opus 4.7, reaches 62.2% overall while every other model stays below 60%, and scores span a 43-point range from 19.3% to 62.2%. Within a single model, multimodal workflows trail pure-text ones (e.g., GPT 5.4: 40.2% vs. 58.0%; Claude Opus 4."},{"citing_arxiv_id":"2605.09552","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Phases of Muon: When Muon Eclipses SignSGD","primary_cat":"math.OC","submitted_at":"2026-05-10T14:11:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"heavy-ball momentum prewhitening; we present the no-momentum analysis in the main text and discuss the momentum extension numerically in Sec. I. We focus on theSignSVDlimit and validate predictions onMuon(NS-5) numerically (e.g. Fig. 1). Note that recent work has introducedMuonwithhybrid Newton-Schulzwhich results in the near exactSignSVD whitening of singular values [23].1 SignSGDas a proxy forAdam.As a tractable comparator forAdam, we useSignSGD [9], which actsentrywise(not spectrally) onG t+1: (1.5)W t+1 =W t −η bGt+1,( bGt+1)ij def= sign (Gt+1)ij \u0001 . SignSGDcoincides withAdamat β1 = β2 = 0, so comparing it againstSignSVDcaptures, to leading order, the structural difference betweenAdamandMuon. We writes-SGDfor"},{"citing_arxiv_id":"2605.11011","ref_index":68,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models","primary_cat":"cs.LG","submitted_at":"2026-05-10T11:05:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Wei Du, Wei Ping, Wenfei Zhou, Will Jennings, William Zhang, Wojciech Prazuch, Xiaowei Ren, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi-Fu Wu, Yian Zhang, Yigong Qin, Ying Lin, Yonatan Geifman, Yonggan Fu, Yoshi Subara, Yoshi Suhara, Yubo Gao, Zach Moshe, Zhen Dong, Zhongbo Zhu, Zihan Liu, Zijia Chen, and Zijie Yan. Nvidia nemotron 3: Efficient and open intelligence, 2025. URLhttps://arxiv.org/abs/2512.20856. [68] Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y . Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y . Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, Xinran Xu, Yuzhi Wang, Guokun Lai, Yulun Du, Yuxin Wu, Zhilin Yang, and"},{"citing_arxiv_id":"2605.09360","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code","primary_cat":"cs.LG","submitted_at":"2026-05-10T06:19:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new Intent Fidelity Score and refinement loop verify that LLM-generated simulation code matches the intended PDEs, improving performance on a 220-case benchmark where execution alone fails to ensure correctness.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Kernel variants are canonicalized to normalized PDE operator types by the mapping table, and all benchmark artifacts are released in the supplementary material. 6 Experiments 6.1 Experimental Setup LLMs.We evaluate four LLMs spanning capability tiers and providers: Claude Sonnet 4.6 (An- thropic) [3], GPT-5.4 and GPT-4.1-mini (OpenAI) [25, 24], and DeepSeek V4 Flash (DeepSeek) [7], all at temperature = 0. Sonnet 4.6, GPT-5.4, and DeepSeek V4 Flash form the main sweep; GPT-4.1- mini serves as a weak-model case study. Two additional models appear in appendix experiments only. Claude Haiku 4.5 [2] appears in the weak-model registry stress tests, and Gemini 3.1 Flash Lite [9] appears in both the mixed-model ablation and weak-model stress tests (Appendix P)."},{"citing_arxiv_id":"2605.09269","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification","primary_cat":"cs.CL","submitted_at":"2026-05-10T02:32:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"modeling has evolved from predicting scalar scores [ 2, 27] to LLM-as-a-judge frameworks [ 17, 29, 50], which generate both preference judgments and Chain-of-Thought (CoT) rationales [ 39]. To better capture the multidimensional nature of response quality in open-ended tasks, there is a growing trend toward adopting rubric-based evaluation [ 12, 15, 25, 30, 41], including the most recent DeepSeek-V4 [7], demonstrating that decomposing a complex judgment into a set of criteria effectively improves evaluator reliability and generalization. The transition toward Multimodal Large Language Models (MLLMs) introduces new alignment challenges [3, 34, 44]. Extending RLHF to the visual domain requires multimodal reward models Preprint. arXiv:2605.09269v1 [cs."},{"citing_arxiv_id":"2605.08715","ref_index":6,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-09T05:55:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.","context_count":2,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"evaluating cross-construction generalization beyond our AFTRAJ-2K held-out test split. Baselines.We compareAgentForesight-7B against three baseline categories.(1) Open-source small LLMs: Llama-3.2-3B [13], Gemma-3-4B [10], Qwen2.5-7B-Instruct, Qwen3-8B [56], Qwen3- 32B.(2) Proprietary LLMs: GPT-4.1 [ 36], Gemini-3-Flash [7], Claude-Haiku-4.5 [1], DeepSeek- V4-Flash, DeepSeek-V4-Pro [6].(3) Methodological baselines: four paradigms instantiated on the same Qwen2.5-7B-Instruct to isolate paradigm effects from backbone capability, including uncertainty quantification (Perplexity-7B [8]), tree-search prompting (ToT-7B [58]), self-reflection (Reflexion-7B [48]), and post-hoc failure attribution (AgentDebug-7B [67]). All baselines except"},{"citing_arxiv_id":"2605.08553","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation","primary_cat":"cs.SE","submitted_at":"2026-05-08T23:25:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021. [9] Codeforces. Codeforces. https://codeforces.com/, 2026. [Online; accessed: 28-April- 2026]. [10] Ole-Johan Dahl, Edsger Wybe Dijkstra, and Charles Antony Richard Hoare.Structured pro- gramming. Academic Press Ltd., 1972. [11] DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026. [12] Xun Deng, Sicheng Zhong, Barı¸ s Bayazıt, Andreas Veneris, Fan Long, and Xujie Si. Verifythis- bench: Generating code, specifications, and proofs all at once.arXiv preprint arXiv:2505.19271, 2025. 10 [13] Quinn Dougherty and Ronak Mehta. Proving the coding interview: A benchmark for formally"},{"citing_arxiv_id":"2605.08498","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-08T21:28:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"the headline MATHCONSTRAINTrelease reruns the same machinery with harder profile regimes and admits instances that remain discriminative under frontier-model evaluation. 4 Evaluation Setup.We evaluate twelve models. Ten are frontier or near-frontier systems:GPT-5.5 [ 46], CLAUDE-OPUS-4.7 [ 3],CLAUDE-4.6-SONNET[ 4],GEMINI-3.1-PROandGEMINI-3.1-FLASH- LITE[ 19],GROK-4.20 [ 61],DEEPSEEK-V4-PROandDEEPSEEK-V4-FLASH[ 20],QWEN3.6- PLUS[ 48], andKIMI-K2.6 [ 54]. The remaining two are open-weight baselines:GPT-OSS-120B[ 1] andLLAMA-3.3-70B-INSTRUCT[ 27]. Models are accessed through OpenRouter at temperature 0; model configuration, pricing, and implementation details are in Sections A and E. 5 Generator Evaluation +verification ≥1frontier failed? MATH- CONSTRAINT yes"},{"citing_arxiv_id":"2605.07396","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Rubric-based On-policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-08T07:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[8] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. InInternational Conference on Learning Representations, 2024. [9] Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026. [10] DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026. [11] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023. [12] Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, and Joaquin Quiñonero-"},{"citing_arxiv_id":"2605.07268","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-08T05:33:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LogiHard hardens reasoning benchmarks by transforming 0-order selection into 2-order judgment, causing 31-56% accuracy drops in 12 frontier LLMs and a 47% drop on zero-shot MMLU, revealing a combinatorial reasoning gap rather than knowledge deficits.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"3% regeneration rate). Each question includes 9-dimensional cognitive features, IRT 3PL parameters ( aj, bj, cj), source attribution, and reasoning type labels. 4 Experiments 4.1 Experimental Setup Models.We evaluate frontier LLMs: GLM-5 [ 9], GLM-4.7 (Zhipu); GPT-5.4, o3 (OpenAI); Claude-Opus-4.6 [1] (Anthropic); DeepSeek-R1, DeepSeek-V3.2, DeepSeek-V4-Pro [8] (DeepSeek); Gemini-3.1-pro [6] (Google); Kimi-k2.5 [36] (Moonshot); Qwen3.5-Plus, Qwen3.6-plus [41] (Al- ibaba). Dual-subset evaluation with 3PL parameterization under Hard Mode (Sgold ≥25 ). Maximum 60 items per subset; termination at SE<0.3 . EAP estimation with N(0,1) priors, 61 quadrature points. All model evaluations were conducted via public API endpoints; Decoding temperature 1."},{"citing_arxiv_id":"2605.07039","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents","primary_cat":"cs.LG","submitted_at":"2026-05-07T23:38:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We develop a phase-adaptive recipe that shifts credit assignment from group-relative feedback 2 during exploration to frontier-contribution during refinement, aligning policy learning with evolu- tionary search dynamics. • Empirically, we demonstrate strong performance across a range of real-world research and engineer- ing tasks (§ 4.1), including expert-parallel load balancing [10], sequential recommendation [48], and protein fitness extrapolation [41], outperforming while converging faster than existing methods with and without RL (§ 4). 2 Background 2.1 Evolutionary Search Agents An evolutionary search agent improves a program through repeated proposal, evaluation, and se- lection [16, 11, 17]. Given an initial program p0, an evaluator E:P →R , and a policy πθ, the"},{"citing_arxiv_id":"2605.07021","ref_index":9,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight","primary_cat":"cs.AI","submitted_at":"2026-05-07T23:05:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Behavior Cue Reasoning trains LLMs to emit special tokens before behaviors, enabling monitors to cut up to 50% wasted reasoning tokens and recover safe actions from 80% of unsafe traces, more than doubling success rates with no performance cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06884","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition","primary_cat":"math.OC","submitted_at":"2026-05-07T19:32:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06219","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization","primary_cat":"cs.AI","submitted_at":"2026-05-07T13:17:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Joint Consistency casts test-time aggregation as Ising-type energy minimization with pairwise LLM-judge interactions, subsuming voting methods and outperforming baselines across reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27083","ref_index":18,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Co-Evolving Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-04-29T18:24:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}