{"total":12,"items":[{"citing_arxiv_id":"2606.03685","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners","primary_cat":"cs.LG","submitted_at":"2026-06-02T14:09:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Supervised fine-tuning lets LLMs linearly encode action validity and state predicates, with broader state-space coverage during training improving world-model recovery.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02994","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Inducing Reasoning Primitives from Agent Traces","primary_cat":"cs.AI","submitted_at":"2026-06-02T01:11:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reasoning Primitive Induction mines ReAct traces to build a library of typed pseudo-tools that, when composed in a standard ReAct loop, outperform the original agent by 22-44 percentage points on five subtasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22355","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation","primary_cat":"cs.CL","submitted_at":"2026-05-21T11:42:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TransitLM is a large-scale dataset and benchmark for training LLMs to generate structurally valid map-free transit routes from origin-destination 
pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06840","ref_index":36,"ref_count":5,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning","primary_cat":"cs.AI","submitted_at":"2026-05-07T18:45:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM planning in four-in-a-row is myopic: move choices match a shallow model that ignores deep nodes expanded in reasoning traces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14930","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IE as Cache: Information Extraction Enhanced Agentic Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-16T12:18:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"and perform multi-hop reasoning without pre-defined schemas or tables. • Agentic Planning (Calendar Scheduling [32]): A complex constraint satisfaction task where the agent must manage and resolve conflicting schedules derived purely from raw natural language descriptions, simulating personal assistant scenarios. • Query-Focused Summarization (QMSUM [33]): A dataset for summarizing specific spans from extensive multi-turn meeting transcripts based on user queries. This tests the framework's capacity to distill and synthesize key information from high-noise dialogue environments. 
This diverse suite allows us to verify our method's effectiveness across Question Answering, Planning, and Summarization tasks within noise-rich contexts."},{"citing_arxiv_id":"2604.06452","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Interrupt in Language-based Multi-agent Communication","primary_cat":"cs.CL","submitted_at":"2026-04-07T20:47:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, scheduling, and debate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.09629","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"End-to-end PDDL Planning with Hardcoded and Dynamic Agents","primary_cat":"cs.AI","submitted_at":"2025-12-10T13:17:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An end-to-end LLM framework refines natural language into valid PDDL domains and problems via hardcoded and dynamic agents, generates plans with standard engines, and returns readable output.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.12626","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DoubleAgents: Human-Agent Alignment in a Socially Embedded Workflow","primary_cat":"cs.HC","submitted_at":"2025-09-16T03:43:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DoubleAgents shows that a distributed-cognition design with coordination agent, dashboard, and policy module 
increases user comfort and reliance on AI agents for coordination tasks over time.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.15487","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dream 7B: Diffusion Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-08-21T12:09:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and quality-speed tradeoffs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.21046","ref_index":86,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","primary_cat":"cs.AI","submitted_at":"2025-07-28T17:59:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.13682","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ChinaTravel: An Open-Ended Travel Planning Benchmark with Compositional Constraint Validation for Language Agents","primary_cat":"cs.AI","submitted_at":"2024-12-18T10:10:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ChinaTravel is a benchmark with sandbox, 
compositional DSL, and 1154-human dataset for testing language agents on open-ended travel planning constraint satisfaction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.12917","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training Language Models to Self-Correct via Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2024-09-19T17:16:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}