{"total":20,"items":[{"citing_arxiv_id":"2605.16561","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Compile-time Security Analysis and Optimization of Sensitive String Producers","primary_cat":"cs.PL","submitted_at":"2026-05-15T19:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A language-integrated framework for compile-time analysis of sensitive string producers that minimizes lexical distance between secure and insecure idioms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08904","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21375","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation","primary_cat":"cs.CL","submitted_at":"2026-04-23T07:42:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human 
performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"after toggling a setting without verifying the state changed-because completion is left to the model's implicit judgment rather than verified against observable UI evidence [3,14]. Second, agents fall into repetitive loops: cycling through the same failing action without recovery. Moreover, existing anti-looping heuristics operate at a single granularity and cannot escalate across interaction modalities or planning strategies [36,38,44,53,77]. We present VLAA-GUI, a modular GUI agent framework that addresses both challenges through three integrated mechanisms-a Completeness Verifier, a Loop Breaker, and a Search Agent-to regulate the system in knowing when to STOP, RECOVER (from repetitive loops), and SEARCH online for enhanced performance. To address the early stopping issue, we introduce a mandatory Complete-"},{"citing_arxiv_id":"2503.09572","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks","primary_cat":"cs.CL","submitted_at":"2025-03-12T17:40:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.18279","ref_index":105,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Model-Brained GUI Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2024-11-27T12:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating 
frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"capabilities, such as symbolic reasoning or excelling in particular tasks like Go or Chess. In contrast, the emergence of LLMs has transformed AI agents by providing them with a natural language interface, enabling human-like decision-making capabilities, and equipping them to perform a wide variety of tasks and take tangible actions in diverse environments [12], [47], [104], [105]. In LLM agents, if LLMs form the \"brain\" of a GUI agent, then its accompanying components serve as its \"eyes and hands\", enabling the LLM to perceive the environment's status and translate its textual output into actionable steps that generate tangible effects [46]. These components transform LLMs from passive information sources into interactive agents that execute tasks"},{"citing_arxiv_id":"2409.12917","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Training Language Models to Self-Correct via Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2024-09-19T17:16:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"import math def is_not_prime(n): result = False for i in range(2,int(math.sqrt(n)) + 1): if n % result = True return result [DONE] You are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. 
Your code should pass these tests: assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] [BEGIN] import heapq as hq def heap_queue_largest(nums,n): largest_nums = hq.nlargest(n, nums) return largest_nums [DONE] MBPP/HumanEval Self-Correction Instruction # There might be an error in the code above because of lack of understanding of the question."},{"citing_arxiv_id":"2403.07974","ref_index":146,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","primary_cat":"cs.SE","submitted_at":"2024-03-12T17:58:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.07718","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?","primary_cat":"cs.LG","submitted_at":"2024-03-12T14:58:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.02716","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Understanding 
the planning of LLM agents: A survey","primary_cat":"cs.AI","submitted_at":"2024-02-05T04:25:24+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.10774","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads","primary_cat":"cs.LG","submitted_at":"2024-01-19T15:48:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.10935","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","primary_cat":"cs.HC","submitted_at":"2024-01-17T08:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.01614","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GPT-4V(ision) is a Generalist Web 
Agent, if Grounded","primary_cat":"cs.IR","submitted_at":"2024-01-03T08:33:09+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.17421","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)","primary_cat":"cs.CV","submitted_at":"2023-09-29T17:34:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.03409","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Models as Optimizers","primary_cat":"cs.LG","submitted_at":"2023-09-07T00:07:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.02427","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cognitive Architectures for Language 
Agents","primary_cat":"cs.AI","submitted_at":"2023-09-05T17:56:20+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.15334","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gorilla: Large Language Model Connected with Massive APIs","primary_cat":"cs.CL","submitted_at":"2023-05-24T16:48:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"on systematic evaluation and building a pipeline for future use. LLMs for Program Synthesis Harnessing LLMs for program synthesis has historically been a challenging task [ 23, 7, 45, 16, 13, 20]. Researchers have proposed an array of strategies to prompt LLMs to perform better in coding tasks, including in-context learning [ 44, 18, 7], task decomposition [17, 46], and self-debugging [8, 36]. Besides prompting, there have also been efforts to pretrain language models specifically for code generation [28, 22, 27]. However, these strategies focus on prompting large language models or pre-training them for general program synthesis. 
In our research, in contrast, we focus on a much restricted domain: the synthesis"},{"citing_arxiv_id":"2305.18323","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models","primary_cat":"cs.CL","submitted_at":"2023-05-23T00:16:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2304.05128","ref_index":101,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Teaching Large Language Models to Self-Debug","primary_cat":"cs.CL","submitted_at":"2023-04-11T10:43:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2304.03277","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Instruction Tuning with GPT-4","primary_cat":"cs.CL","submitted_at":"2023-04-06T17:58:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art 
data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.11366","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reflexion: Language Agents with Verbal Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2023-03-20T18:08:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Reflexion (ours) ✓ ✓ ✓ ✓ ✓ Related work on programming Approach Test Debugging Self-generated Multiple Self-reflection Test execution execution tests languages AlphaCode [14] ✓ ✗ ✗ ✓ ✗ CodeT [5] ✓ ✗ ✓ ✗ ✗ Self-debugging [7] ✓ ✓ ✗ ✗ ✗ CodeRL [12] ✓ ✓ ✗ ✗ ✗ Reflexion (ours) ✓ ✓ ✓ ✓ ✓ [16] use decider models to reason over several generations. Kim et al. [10] use a retry pattern over a fixed number of steps without an evaluation step. Goodman [9] perform a qualitative evaluation step that proposes optimizations to the previous generation. In this paper, we show that several of these concepts can be enhanced with self-reflection to build a persisting memory of self-reflective experiences which allows an agent to identify its own errors and self-suggest lessons to learn from its"}],"limit":50,"offset":0}