{"total":27,"items":[{"citing_arxiv_id":"2606.31478","ref_index":74,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"One Reflection Is Not Enough: Self-Correcting Autonomous Research via Multi-Hypothesis Failure Attribution","primary_cat":"cs.AI","submitted_at":"2026-06-30T10:54:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SAGE with MHFA improves failure recovery in autonomous research agents, raising metrics-bearing outputs from 42% to 92% on a 12-topic benchmark versus single-reflection baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31073","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MultiUAV-Plat: An LLM-Oriented Platform, Benchmark and Framework for Multi-UAV Collaborative Task Planning","primary_cat":"cs.AI","submitted_at":"2026-06-30T03:02:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MultiUAV-Plat supplies a new RESTful simulation platform and 1500-task benchmark where Agent4Drone reaches 57.9% task pass rate versus 30.6% for ReAct baseline across 75 multi-UAV missions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30247","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Grounding LLM Reasoning under Incomplete Graph Evidence","primary_cat":"cs.CL","submitted_at":"2026-06-29T12:56:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Develops a theoretical perspective showing no hard rule can perfectly reject false unsupported trajectories while retaining true-but-unobserved ones under incomplete graph evidence, and characterizes soft grounding as KL-regularized 
deformation of the LLM prior.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03606","ref_index":36,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks","primary_cat":"cs.CR","submitted_at":"2026-06-02T13:09:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An automatic numeric-remapping attack generator reveals 12-26 point accuracy drops on GSM8K for three LLMs while MAWPS and MultiArith stay near 98%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00674","ref_index":36,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-30T11:06:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Outcome optimization induces reward-induced manifold collapse in LLMs by favoring low-complexity spurious correlations over high-complexity causal reasoning, with process reward models acting as topological filters to block shortcuts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31308","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories","primary_cat":"cs.AI","submitted_at":"2026-05-29T13:40:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TraceGraph constructs shared state graphs from multi-model trajectories to expose productive cores and trap 
regions, then uses them to diagnose navigation differences across benchmarks and to drive a recovery pipeline that improves SWE-bench resolved rates by 3-4 points on fired instances.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28814","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Self-Improving Language Models with Bidirectional Evolutionary Search","primary_cat":"cs.CL","submitted_at":"2026-05-27T17:59:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Bidirectional Evolutionary Search augments autoregressive expansion with evolutionary recombination operators and dense backward subgoal feedback to produce better candidates than standard best-of-N or tree search for language model self-improvement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22195","ref_index":24,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-21T09:00:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RGoT uses RL to adaptively generate task-specific graphs of operations for GoT-style LLM prompting from a human-provided set, with results suggesting feasibility under constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14619","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought 
Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-14T09:37:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SliceGraph maps process isomers in multi-run CoT reasoning, finding that 85.5% of 954 problem-model cells show correct trajectories splitting into multiple process families with 76.6% of run pairs cross-family on average.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12694","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis","primary_cat":"cs.SE","submitted_at":"2026-05-12T19:46:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Agentic interpretation uses lattices to track LLM judgments on decomposed program claims during analysis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11376","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-12T01:04:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"a standardized communication substrate. 
LLM-X differs by introducing a schema-validated protocol with explicit policy controls, enabling systematic study of fairness, latency, and throughput. Planning and Task Decomposition. Reasoning frameworks such as Chain-of-Thought (CoT) [31], Tree-of-Thoughts (ToT) [18, 36], Graph-of-Thoughts (GoT) [2], and Tab-CoT [12] improve granularity of inference by decomposing problems into sub-steps. Other agent-oriented approaches, such as ReAct [35], Reflexion [26], Inner Monologue [10], and ReWOO [34], combine reasoning with environment feedback. While these methods operate mainly at the prompt or inference level, LLM-X abstracts the communication layer: our contribution is not a new reasoning algorithm but a"},{"citing_arxiv_id":"2605.10207","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LASAR: Latent Adaptive Semantic Aligned Reasoning for Generative Recommendation","primary_cat":"cs.IR","submitted_at":"2026-05-11T08:52:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LASAR uses two-stage supervised training plus reinforcement learning to ground semantic IDs, align latent reasoning trajectories to CoT hidden states via KL divergence, and adaptively choose reasoning depth, halving average steps while improving quality on three datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Jie Zhang, Li Chen, Shlomo Berkovsky, Min Zhang, Tommaso Di Noia, Justin Basilico, Luiz Pizzato, and Yang Song, editors, Proceedings of the 17th ACM Conference on Recommender Systems, RecSys 2023, Singapore, Singapore, September 18-22, 2023, pages 1007-1014. ACM, 2023. doi: 10.1145/3604915.3608857. URL https://doi.org/10.1145/3604915.3608857. 
[2] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors, Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Inno-"},{"citing_arxiv_id":"2604.20413","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness","primary_cat":"cs.AI","submitted_at":"2026-04-22T10:33:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SABA improves LLM performance on detective puzzle benchmarks by recursively fusing information into a base state and using queries to resolve missing premises before concluding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19895","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication","primary_cat":"cs.AI","submitted_at":"2026-04-21T18:17:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new structured prompting method (SPEC) helps AI detect insufficient evidence in adjudication tasks and defer decisions appropriately, reaching 89% accuracy on a benchmark varying information completeness from Colorado unemployment insurance cases.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"determinations without adequate support. 4. 
Prompting Techniques and Model Calibration. Prompting is commonly used to improve reasoning. Chain-of-Thought (CoT) decomposes complex problems into intermediate steps [35]. Tree-of-Thoughts (ToT) explores multiple reasoning branches [38], extended by Graph-of-Thought to model reasoning as arbitrary graphs [5]. Self-Consistency uses consensus across solution paths as an uncertainty signal [33]. Reflexion enables iterative self-critique and refinement [31]. [Page header: ICAIL '26, June 08-12, 2026, Singapore, Singapore; Afane, Robitschek, Ouyang, and Ho] Multi-Agent Debate uses multiple model instances to surface errors through deliberation [9]. These techniques improve reasoning quality, but are not specifically designed to rec-"},{"citing_arxiv_id":"2604.19656","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Pause or Fabricate? Training Language Models for Grounded Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-21T16:45:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task success on insufficient math datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Further analyses confirm robustness to noisy user responses and generalization beyond mathematical reasoning. Our contributions are: (1) We identify ungrounded reasoning as a distinct failure mode with quantitative metrics for measurement. (2) We reframe reasoning under uncertainty as sequential decision-making, establishing that inferential boundary awareness is distinct from reasoning capability. (3) We propose GRIL, a multi-turn RL framework with stage-specific rewards for premise detection and grounded solving. 
(4) Through extensive [Figure: STAGE 1: Clarify and Pause; STAGE 2: Grounded Reasoning; K-turn Interaction Loop with the environment]"},{"citing_arxiv_id":"2604.10693","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning","primary_cat":"cs.AI","submitted_at":"2026-04-12T15:35:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"internal bias Z is an unobserved confounder that affects all variables. (c) FACT-E introduces exogenous noise E as an instrumental variable to obtain a more reliable faithfulness evaluation for CoTs. a pool of post-hoc candidates, our goal is to estimate a reliability score R_S for each candidate chain S ∈ S as a quality measure, denoted as LLM(Q, S) → R_S ∈ [0, 1]. A higher R_S indicates that the reasoning process is not only correct in its final outcome but also faithful among its intermediate steps. We model these two aspects below. 2.2 CoT-to-Answer Consistency A correct answer is a prerequisite for a high-quality CoT. Accordingly, we first model the chain's consistency with the correct outcome. 
Definition 1 (CoT-to-Answer Consistency)."},{"citing_arxiv_id":"2604.08299","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SeLaR: Selective Latent Reasoning in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-09T14:32:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"than computing entropy over the full vocabulary as in prior work (Shi et al., 2025), we estimate uncertainty using the top-k most probable tokens, which dominate the model's predictive mass and are most relevant for decision making. Specifically, let $V_k \\subset V$ denote the set of top-k tokens under $p_t$. We first renormalize the distribution over $V_k$: $\\hat{p}_t(v) = \\frac{p_t(v)}{\\sum_{u \\in V_k} p_t(u)}, \\; v \\in V_k$, (4) and define the truncated entropy as: $H_t = -\\sum_{v \\in V_k} \\hat{p}_t(v) \\log \\hat{p}_t(v)$, (5) $\\bar{H}_t = \\mathrm{clamp}\\left(\\frac{H_t}{\\log k}, 0, 1\\right)$. (6) This top-k entropy captures the model's uncertainty among its most plausible candidates while avoiding perturbation from the low-probability tokens. 
Low entropy indicates confident predictions dominated by a small number of candidates, whereas high en-"},{"citing_arxiv_id":"2604.04131","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents","primary_cat":"cs.AI","submitted_at":"2026-04-05T14:27:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.24422","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework","primary_cat":"cs.IR","submitted_at":"2026-03-25T15:33:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OneSearch-V2 improves generative retrieval via latent reasoning and self-distillation, achieving +3.98% item CTR, +2.07% buyer volume, and +2.11% order volume in online A/B tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.00520","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Toward a Safe Internet of Agents","primary_cat":"cs.MA","submitted_at":"2025-11-29T15:31:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper proposes a bottom-up framework for safe agentic AI systems that treats each component as a dual-use interface where added capabilities also expand attack surfaces across single agents, 
multi-agent systems, and interoperable ecosystems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.02547","ref_index":183,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","primary_cat":"cs.AI","submitted_at":"2025-09-02T17:46:26+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"While slower in inference due to extended reasoning trajectories, they achieve higher accuracy and robustness in knowledge-intensive tasks such as mathematics, scientific reasoning, and multi-hop question answering [177]. Representative examples include OpenAI's o1 [31] and o3 series [33], DeepSeek-R1 [32], as well as methods that incorporate dynamic test-time scaling [178,179,180,172] or reinforcement learning [181, 47, 182, 183, 184, 185] for reasoning. Modern slow reasoning exhibits output structures that differ substantially from fast reasoning. These include a clear exploration and planning structure, frequent verification and checking behaviors, and generally longer inference lengths and times. Past work has explored diverse patterns for constructing long-chain reasoning outputs. Some methods (Macro-o1, HuatuoGPT-o1, and AlphaZero) have attempted to synthesize"},{"citing_arxiv_id":"2507.04023","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Do LLMs Overthink Basic Math Reasoning? 
Benchmarking the Accuracy-Efficiency Tradeoff in Language Models","primary_cat":"cs.CL","submitted_at":"2025-07-05T12:31:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Evaluations of 53 LLMs on 14 basic math tasks show reasoning models use ~18x more tokens with sometimes lower accuracy, non-monotonic gains from extended budgets, and sharp performance drops under token constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09567","ref_index":50,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-03-12T17:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and \"thought\" sequences in reasoning improvement [359], while Hong et al. [259] demonstrate the impact of prompting techniques [546]. Further, Liu et al. [473] and Mondorf and Plank [557] stress the importance of deeper analysis beyond surface-level accuracy, and He et al. [248] explore self-evolutionary processes as a means to advance LLM reasoning. Besta et al. [50] propose a modular framework integrating structure, strategy, and training methods as part of a comprehensive system design approach. Most recently, Li et al. [432] provide a systematic survey of System 2 thinking, focusing on the methods used to differentiate them from System 1 thinking. 
Despite numerous technical reviews in this field, there is limited discussion on the differences between"},{"citing_arxiv_id":"2502.21074","ref_index":25,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation","primary_cat":"cs.CL","submitted_at":"2025-02-28T14:07:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.12935","ref_index":63,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions","primary_cat":"cs.AI","submitted_at":"2024-08-23T09:33:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"7 (c)). Safer-instruct [640] proposes a more scalable automatic approach for constructing preference datasets. This method starts with obtaining a reverse model capable of generating instructions based on responses, which is then used to generate instructions for content related to specific topics, such as hate speech (see Fig. 7 (d)). 
Red-instruct [63] explores prompt-based red-teaming methods and releases a Chain of Utterances (CoU) based dataset, HarmfulQA, which consists of conversations between a red LLM and target LLM, both roleplayed by ChatGPT. During the construction of the conversation, the target LLMs are prompted to generate internal thoughts as a prefix in the response, allowing the red LLMs to develop more effective"},{"citing_arxiv_id":"2408.07199","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents","primary_cat":"cs.AI","submitted_at":"2024-08-13T20:52:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"While the algorithm was developed in a bandit setting Hejna et al. (2024); Rafailov et al. (2024) have extended it to multi-turn settings with preferences over trajectories. In our setting, we can directly utilize this objective as: $\\mathcal{L}_{\\text{T-DPO}}(\\pi_\\theta; \\mathcal{D}) = -\\mathbb{E}_{(\\tau_w, \\tau_l) \\sim \\mathcal{D}} \\left[ \\log \\sigma \\left( \\left( \\sum_{t=0}^{|\\tau_w|} \\beta \\log \\frac{\\pi_\\theta(\\mathbf{a}_t^w \\mid \\mathbf{h}_t^w)}{\\pi_{\\text{ref}}(\\mathbf{a}_t^w \\mid \\mathbf{h}_t^w)} \\right) - \\left( \\sum_{t=0}^{|\\tau_l|} \\beta \\log \\frac{\\pi_\\theta(\\mathbf{a}_t^l \\mid \\mathbf{h}_t^l)}{\\pi_{\\text{ref}}(\\mathbf{a}_t^l \\mid \\mathbf{h}_t^l)} \\right) \\right) \\right]$ (6) Figure 3: Success rate of different approaches on the WebShop Yao et al. (2022) task. All models are based on xLAM-v0.1-r Zhang et al. (2024c). RFT and DPO over xLAM-v0.1-r demonstrate improvements in performance from 28.6% to 31.3% and 37.5% respectively. 
However, these methods"},{"citing_arxiv_id":"2407.21787","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Large Language Monkeys: Scaling Inference Compute with Repeated Sampling","primary_cat":"cs.LG","submitted_at":"2024-07-31T17:57:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback, 2022. [8] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682-17690, March 2024."}],"limit":50,"offset":0}