{"total":13,"items":[{"citing_arxiv_id":"2606.10106","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What makes a harness a harness: necessary and sufficient conditions for an agent harness","primary_cat":"cs.SE","submitted_at":"2026-06-08T19:35:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18073","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback","primary_cat":"cs.SE","submitted_at":"2026-05-18T08:55:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A-ProS uses a hybrid multi-model feedback framework with stateful refinement to improve success rates on competitive programming problems, achieving over 2x gains compared to baseline agent loops.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00382","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Social Bias in LLM-Generated Code: Benchmark and Mitigation","primary_cat":"cs.SE","submitted_at":"2026-05-01T04:06:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 
83.97%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10508","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks","primary_cat":"cs.SE","submitted_at":"2026-04-12T07:51:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.05746","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems","primary_cat":"cs.AI","submitted_at":"2025-10-07T10:04:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARM evolves specialized reasoning modules from basic CoT via tree search to serve as reusable components in multi-agent systems that generalize across models and domains without per-task re-optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.21035","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis","primary_cat":"cs.AI","submitted_at":"2025-07-28T17:55:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and 
identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Zhang, M. Tang, M. Li, M. Wang, M. Li, N. Tian, P. Huang, P. Zhang, Q. Wang, Q. Chen, Q. Du, R. Ge, R. Zhang, R. Pan, R. Wang, R. Chen, R. Jin, R. Chen, S. Lu, S. Zhou, S. Chen, S. Ye, S. Wang, S. Yu, S. Zhou, S. Pan, and S. Li. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. [29] Y. Dong, X. Jiang, Z. Jin, and G. Li. Self-collaboration code generation via chatgpt. arXiv preprint arXiv:2304.07590, 2023. [30] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023. [31] R. Edgar, M. Domrachev, and A. E."},{"citing_arxiv_id":"2409.19894","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment","primary_cat":"cs.SE","submitted_at":"2024-09-30T02:53:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TransAgent improves LLM code translation by up to 33.3% via multi-agent fine-grained execution alignment on a new benchmark of recent tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.00557","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Ask: When LLM Agents Meet Unclear Instruction","primary_cat":"cs.CL","submitted_at":"2024-08-31T23:06:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces NoisyToolBench benchmark and Ask-when-Needed framework to 
improve LLM tool-use performance when user instructions are unclear or incomplete.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.13010","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation","primary_cat":"cs.CL","submitted_at":"2023-12-20T13:22:41+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.02170","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration","primary_cat":"cs.CL","submitted_at":"2023-10-03T16:05:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DyLAN automatically selects and dynamically organizes LLM agents for collaboration, outperforming fixed-agent baselines on code generation, reasoning, and decision tasks with up to 25% accuracy gains on some MMLU subjects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.02427","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cognitive Architectures for Language Agents","primary_cat":"cs.AI","submitted_at":"2023-09-05T17:56:20+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoALA is a modular cognitive architecture for language agents that organizes memory 
components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.11432","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Large Language Model based Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2023-08-22T13:30:37+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"alities, he can use \"you are an outgoing person\" or \"you are an introverted person\" to profile the agent. The handcrafting method has been leveraged in a lot of previous work to specify the agent profiles. For example, Generative Agent [22] describes the agent by the information such as name, objectives, and relationships with other agents. MetaGPT [23], ChatDev [18], and Self-collaboration [24] predefine various roles and their corresponding responsibilities in software development, manually assigning distinct profiles to each agent to facilitate collaboration. PTLLM [25] aims to explore and quantify personality traits displayed in texts generated by LLMs. 
This method guides LLMs in generating di-"},{"citing_arxiv_id":"2308.08155","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","primary_cat":"cs.AI","submitted_at":"2023-08-16T05:57:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoGen provides an open-source framework for multi-agent LLM conversations that support customizable interactions across diverse applications.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"shot prompts are obtained from the corresponding repository. As shown in Table 3, the two-agent 24 Task description You are in the middle of a room. Looking quickly around you, you see a bed 1, a desk 2, a desk 1, a safe 1, a drawer 2, a drawer 1, a shelf 3, a shelf 2, and a shelf 1. Your task is to: look at bowl under the desklamp. Assistant(to Executor): THOUGHT [First, I need to find a bowl. A bowl is more likely to appear in desk (1-2), drawer (1-2), shelf (1-3), bed (1). Then I need to find and use a desklamp.] ACTION [go to desk 1] (Assistant finds the desklamp at desk 1, then goes to desk 2.) Executor(to Assistant): On the desk 2, you see a bowl 1, and a cd 3... Assistant(to Executor): THOUGHT [Now I find a bowl (1). I need to use the desklamp to look at the bowl. I'll go to the desklamp now.] ACTION [go to desk 1] Executor(to Assistant): On the desk 1, you see a desklamp 1, a"}],"limit":50,"offset":0}