{"total":106,"items":[{"citing_arxiv_id":"2605.23244","ref_index":122,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Convex Optimization for Alignment and Preference Learning on a Single GPU","primary_cat":"cs.LG","submitted_at":"2026-05-22T05:25:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23019","ref_index":15,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"PACE: Two-Timescale Self-Evolution for Small Language Model Agents","primary_cat":"cs.LG","submitted_at":"2026-05-21T20:42:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PACE coordinates low-risk prompt evolution with validated higher-risk control-logic updates to improve frozen SLM agents on benchmarks without model retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22389","ref_index":22,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Unified Data Selection for LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-21T12:21:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and 
RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21856","ref_index":22,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation","primary_cat":"cs.LG","submitted_at":"2026-05-21T01:06:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21614","ref_index":44,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Exploring the Effectiveness of Using LLMs for Automated Assessment of Student Self Explanations in Programming Education","primary_cat":"cs.HC","submitted_at":"2026-05-20T18:22:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Compares LLMs against semantic similarity for binary classification of student self-explanations in programming education.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20654","ref_index":11,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak","primary_cat":"cs.LG","submitted_at":"2026-05-20T03:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reflector internalizes step-wise self-reflection in LLMs via teacher-guided SFT then RL with outcome and validity rewards, claiming over 90% defense success against indirect 
jailbreaks plus utility gains like 5.85% on GSM8K.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19538","ref_index":46,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision","primary_cat":"cs.CV","submitted_at":"2026-05-19T08:38:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents CaptchaBench benchmark and CaptchaMind RL solver achieving 82.9% success on benchmark tasks and 71% on real-world CAPTCHAs via explicit reasoning process supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19358","ref_index":1,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-19T04:41:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19316","ref_index":89,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation","primary_cat":"cs.CL","submitted_at":"2026-05-19T03:52:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly 
higher adherence to specified feature constraints than single-agent baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19259","ref_index":15,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"ERFSL: An Efficient Reward Function Searcher via Language Models for Custom-Environment Multi-Objective Optimization (Student Abstract)","primary_cat":"eess.SY","submitted_at":"2026-05-19T02:10:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ERFSL generates and optimizes LLM-based reward functions for custom multi-objective RL, correcting codes in one iteration and converging weights in 5.2 iterations on average even from 500x errors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19228","ref_index":37,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution","primary_cat":"cs.CL","submitted_at":"2026-05-19T00:57:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCA applies the Information Bottleneck principle via NIBS and GIBS methods to identify erroneous steps in black-box LLM reasoning and boosts self-correction success by up to 13.5%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17565","ref_index":10,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Generalization or Memorization? 
Brittleness Testing for Chess-Trained Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-17T17:49:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A compact 25M chess move predictor exceeds larger fine-tuned models on puzzles, indicating memorization in earlier claims, while LLM-Modulo raises general LLM move accuracy from 1.2% to 21.2% and validity to 95.3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17324","ref_index":96,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents","primary_cat":"cs.CR","submitted_at":"2026-05-17T08:30:45+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17295","ref_index":29,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"DISA: Offline Importance Sampling for Distribution-Matching LLM-RL","primary_cat":"cs.LG","submitted_at":"2026-05-17T07:14:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy 
diversity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17228","ref_index":138,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making","primary_cat":"cs.CL","submitted_at":"2026-05-17T02:28:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17037","ref_index":49,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-16T15:16:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"D²Evo mines medium-difficulty anchors from the current model, trains a Questioner to generate matching questions, and jointly optimizes Solver and Questioner for progressive gains, outperforming baselines on math reasoning with under 2K real samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16638","ref_index":35,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens","primary_cat":"cs.AI","submitted_at":"2026-05-15T21:10:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TTE-Flash 
trains latent think tokens with CoT generation loss and embedding tokens with contrastive loss to deliver high-performance multimodal representations without generating explicit reasoning at inference time.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18871","ref_index":6,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Distributional Energy-Based Models for Uncertainty-Aware Structured LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-15T17:08:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 149M-parameter distributional energy-based verifier with low-rank adapter ensemble reduces constraint violations in structured LLM reasoning and outperforms or matches much larger models on five benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16117","ref_index":2,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation","primary_cat":"cs.CL","submitted_at":"2026-05-15T16:02:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SGR enhances LLM reasoning accuracy by generating external subgraphs from knowledge bases and guiding progressive inference over them, yielding consistent gains over baselines on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15706","ref_index":39,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language 
Models","primary_cat":"cs.LG","submitted_at":"2026-05-15T07:54:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DMoA is a differentiable multi-agent framework for LLMs that uses recurrent context-aware routing and predictive entropy for test-time adaptation, claiming SOTA results on 9 benchmarks with efficiency and robustness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15684","ref_index":47,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices","primary_cat":"cs.CV","submitted_at":"2026-05-15T07:13:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency points for high-resolution image generation on mobile devices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15573","ref_index":17,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-15T03:33:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel 
mode.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The deployment objective is the final task correctness. For a labeled example (Q, y), let ˆyG denote the final output under graph G. In the current implementation, correctness is checked with the same verifier used in evaluation, instantiated as an xVerify-based binary reward [Chen et al., 2025]. We therefore define the task reward Rtask(G) =1[Eval(ˆyG, y) = 1].(17) Because the order π is fixed by the contribution scores, the graph log-probability decomposes over feasible forward edges: logp θ(E | X, π) = X (m,n)∈Eπ \u0010 em→n logp m→n + (1−e m→n) log(1−p m→n) \u0011 .(18) The algorithm also applies an explicit sparsity penalty to the sampled graph reward in the same spirit as topology-economical methods [Zhang et al."},{"citing_arxiv_id":"2605.14169","ref_index":34,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"BOOKMARKS: Efficient Active Storyline Memory for Role-playing","primary_cat":"cs.CL","submitted_at":"2026-05-13T22:48:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14098","ref_index":51,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning","primary_cat":"stat.ML","submitted_at":"2026-05-13T20:33:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A conformal procedure for CoT replaces majority voting with weighted aggregation and calibrates 
abstention to guarantee low confident-error rates, achieving 90.1% selective accuracy on GSM8K by abstaining on under 5% of cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13511","ref_index":45,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Many-Shot CoT-ICL: Making In-Context Learning Truly Learn","primary_cat":"cs.CL","submitted_at":"2026-05-13T13:30:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Many-shot CoT-ICL improves when demonstrations are ordered for smooth conceptual progression, with CDS delivering up to 5.42 percentage-point gains on math tasks using 64 examples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12987","ref_index":4,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction","primary_cat":"cs.CL","submitted_at":"2026-05-13T04:36:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Multimodal self-consistency with audio-language models reaches 52.56% accuracy on utterance-level MI coding from five audio sessions, beating single-pass baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12412","ref_index":123,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space","primary_cat":"cs.CL","submitted_at":"2026-05-12T17:09:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs perform in-context learning as trajectories through a structured 
low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12139","ref_index":110,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"BoolXLLM: LLM-Assisted Explainability for Boolean Models","primary_cat":"cs.AI","submitted_at":"2026-05-12T13:58:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BoolXLLM augments an existing Boolean rule learner with LLMs for feature selection, discretization thresholds, and natural-language rule translation to improve interpretability while preserving accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11807","ref_index":29,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:01:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AWARE augments generative next-POI recommendation with LLM agents that produce user-anchored narratives capturing events, culture, and trends, delivering up to 12.4% relative gains on three real datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11662","ref_index":31,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"HSUGA: LLM-Enhanced Recommendation with Hierarchical Semantic Understanding and Group-Aware 
Alignment","primary_cat":"cs.IR","submitted_at":"2026-05-12T07:22:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HSUGA improves LLM-enhanced sequential recommendation via staged hierarchical semantic understanding for better preference extraction and group-aware alignment that varies intensity by user activity level.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11169","ref_index":61,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents","primary_cat":"cs.AI","submitted_at":"2026-05-11T19:28:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11161","ref_index":128,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Interpretability Can Be Actionable","primary_cat":"cs.LG","submitted_at":"2026-05-11T19:08:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09955","ref_index":83,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Beyond Majority Voting: Agreement-Based 
Clustering to Model Annotator Perspectives in Subjective NLP Tasks","primary_cat":"cs.CL","submitted_at":"2026-05-11T04:04:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agreement-based clustering of annotators improves performance on subjective NLP tasks by capturing diverse perspectives better than majority voting or per-annotator modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09942","ref_index":34,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution","primary_cat":"cs.AI","submitted_at":"2026-05-11T03:41:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09931","ref_index":4,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-11T03:28:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"TIR, we conduct experiments on domains beyond mathematics. 
Specifically, we evaluate PRUNETIR on the GPQA-diamond dataset. GPQA-diamond is the highest-quality subset of GPQA (Rein et al., 13 A Case from AIME24 Illustrating Degraded Reasoning in LLMs Problem: Define $f(x)=|| x|-\\\\tfrac{1}{2}|$ and $g(x)=|| x|-\\\\tfrac{1}{4}|$. Find the number of intersections of the graphs of \\\\[y=4 g(f(\\\\sin (2 \\\\pi x))) \\\\quad\\\\text{ and }\\\\quad x=4 g(f(\\\\cos (3 \\\\pi y))).\\\\] Solution: Okay, let's try to solve this problem step by step using multiple python code calls. The problem is asking for the number of intersections between two graphs defined by y=4g(f(sin(2πx))) and x=4g(f(cos(3πy))). First, I need to understand what these functions f and g do. Let me start by analyzing f(x) and g(x). The function f(x) is defined as"},{"citing_arxiv_id":"2605.09907","ref_index":10,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation","primary_cat":"cs.AI","submitted_at":"2026-05-11T02:50:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RADAR generates query-adaptive multi-agent communication structures via conditional discrete graph diffusion guided by effective graph size, outperforming baselines on accuracy and token consumption across six benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Answer:from typing import Listdef median(l: List[int]) -> float: if not l:raise ValueError(\"The list is empty.\")l.sort()n = len(l)mid = n / / 2if n % 2 == 0:return (l[mid -1] + l[mid]) / 2.0else:return float(l[mid]) Queryfrom typing import Listdef median(l: List[int]) -> float:\"\"\"Return median of elements in the list l.>>> median([3, 1, 2, 4, 5])3>>> median([-10, 4, 6, 1000, 10, 20])15.0\"\"\" Algorithm Designer Test Analyst Algorithm Designer (f)Sampled case in HumanEval. 
Figure 7.Case study of the communication topologies generated by RADAR. 17"},{"citing_arxiv_id":"2605.09492","ref_index":7,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation","primary_cat":"cs.CL","submitted_at":"2026-05-10T11:57:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"APCD adaptively branches LLM decoding paths based on token entropy and contrasts divergent paths to improve factual accuracy while preserving efficiency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"distinct trajectories and prevents premature path collapse. As paths diverge, inter-path interaction is gradually attenuated and eventually halted, al- lowing coherent reasoning trajectories to evolve without forced separation. To evaluate the reliability of each generated tra- jectory, we compute its perplexity based on the sequence probability: ppl(y) = exp − 1 L LX t=1 logP(y t |y <t, q) ! (7) where L denotes the trajectory length. During de- coding, paths whose perplexity exceeds a threshold are pruned, as they are unlikely to yield reliable out- puts. After generation completes, the final output is selected as the remaining path with the lowest perplexity. The complete APCD procedure is sum- marized in Algorithm 1. 
4 Empirical Evaluation"},{"citing_arxiv_id":"2605.09461","ref_index":8,"ref_count":2,"confidence":0.35,"is_internal_anchor":false,"paper_title":"VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability Detection","primary_cat":"cs.AI","submitted_at":"2026-05-10T10:20:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VulTriage combines control dependency extraction, CWE knowledge retrieval, and semantic summarization to improve LLM accuracy on vulnerability detection, reaching SOTA on PrimeVul and generalizing to Kotlin.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. Junwei Zhang, Zhongxin Liu, Xing Hu, Xin Xia, and Shanping Li. Vulnerability detection by learning from syntax-based execution paths of code.IEEE Transactions on Software Engineering, 49(8): 4196-4212, 2023. Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. Devign: Effective vulner- ability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems, 32, 2019. 
Qingsong Zou, Jingyu Xiao, Qing Li, Zhi Yan, Yuhang Wang, Li Xu, Wenxuan Wang, Kuofeng"},{"citing_arxiv_id":"2605.09106","ref_index":31,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain","primary_cat":"cs.CL","submitted_at":"2026-05-09T18:28:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09008","ref_index":1,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Relative Kinetic Utility for Reasoning-Aware Structural Pruning in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-09T15:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RKU is a curvature-aware structural pruning framework that improves LLM reasoning accuracy at 40% sparsity, reaching 13.34% on GSM8K while outperforming baselines and better preserving out-of-distribution representations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tion to aContinuous Kinetic Integralover the model's depth manifoldl∈[0, L]. Furthermore, we replace the discreteL CE with a continuous physical energy functional, specifically the squaredL 2 norm of the final hidden states:L Continuous =∥H (L)∥2 2. We define the Continuous Kinetic Utility for a structural componentcas: U (c) AGF = Z L 0 Ex∼D h Y (l) c ⊙ ∇(l) Yc LContinuous i dl(1) Eq. 1 provides an alternative to Wanda's Trap. 
By pulling gradients from a continuous spatial norm rather than a discrete vocabulary mapping,∇Yc LContinuous acts as a physical score function. It may highlightKinetic Spikes-the critical structural pathways that maintain the high-dimensional topo- logical integrity of the semantic reasoning space, reducing the influence of high-frequency syntactic"},{"citing_arxiv_id":"2605.08526","ref_index":59,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck","primary_cat":"cs.LG","submitted_at":"2026-05-08T22:17:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08060","ref_index":31,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T17:47:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Expanded recall in LLM agents erodes cooperative intent in multi-agent social dilemmas, observed in 18 of 28 model-game settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07248","ref_index":10,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"PaT: Planning-after-Trial for Efficient Test-Time Code 
Generation","primary_cat":"cs.CL","submitted_at":"2026-05-08T05:09:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(p_s c_s + n D_L) (9) To analyze a concrete scenario, let's assume we can choose an sLM such that its capability is a fraction of the LLM's, i.e., p_s = p_L / n. Using the scaling law from Assumption 3 (p_M = α c_M^β), we can relate the costs: c_s = (p_s / α)^(1/β) = (p_L / (nα))^(1/β) = n^(−1/β) c_L. Substituting these into the heterogeneous cost equation 9: E[Cost_Heterogeneous] (10) = p_L c_s / 2 + ((p_L − p_s) / p_L)(p_s c_s + n D_L) (11) = p_L n^(−1/β) c_L / 2 + ((n−1) / n)((p_L / n) n^(−1/β) c_L + n D_L) (12) = (1/2 + (n−1) / n^2) n^(−1/β) p_L c_L + (n−1) D_L (13) For the heterogeneous policy to be strictly more efficient, the difference must be positive: 0 > E[Cost_Heterogeneous − Cost_Homogeneous] (14) = (n−1) D_L + (1/2 + (n−1) / n^2) n^(−1/β) p_L c_L − p_L c_L / 2.
(15) Rearranging to solve for the planning overhead"},{"citing_arxiv_id":"2605.07180","ref_index":14,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Learning Agent Routing From Early Experience","primary_cat":"cs.CL","submitted_at":"2026-05-08T03:18:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06919","ref_index":12,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Can LLMs Take Retrieved Information with a Grain of Salt?","primary_cat":"cs.CL","submitted_at":"2026-05-07T20:29:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs exhibit systematic failures in obeying expressed certainty in retrieved contexts, but a combination of prior reminders, certainty recalibration, and context simplification reduces obedience errors by 25%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06650","ref_index":4,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:55:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% 
on AIME 2025.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06165","ref_index":7,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:51:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06130","ref_index":11,"ref_count":3,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:33:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"i . Advantages Â_i^distill are normalized separately from those of utilization since the two rewards measure different aspects of same outcomes: J_distill(θ) = J_GRPO(θ; {s_new,1, ..., s_new,G}, {Â_1^distill, ..., Â_G^distill}). (10) Total objective. All terms are combined in a single update: J(θ) = J_util(θ) + λ_1 J_rerank(θ) + λ_2 J_distill(θ). (11) The utility score U(s) is updated non-parametrically via Eq. (5).
The full procedure is summarized in Algorithm 1. Training hyperparameter settings are in Appendix C. 4 Experiments 4.1 Experimental Setup Environments. We evaluate on ALFWorld (Shridhar et al., 2021), a text-based household environment requiring multi-step planning and object interaction, and WebShop (Yao et al."},{"citing_arxiv_id":"2605.05893","ref_index":6,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Logic-Regularized Verifier Elicits Reasoning from LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-07T09:03:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05737","ref_index":1,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-07T06:29:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReFlect is a harness that wraps LLMs to detect and recover from reasoning errors, achieving 7-29 pp gains over direct CoT on long-horizon tasks and improving code patch quality to 82-87%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}