{"total":11,"items":[{"citing_arxiv_id":"2605.19102","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Prompt Optimization for LLM Code Generation via Reinforcement Learning","primary_cat":"cs.SE","submitted_at":"2026-05-18T20:42:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18747","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Code as Agent Harness","primary_cat":"cs.CL","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"this paradigm by explicitly decomposing reasoning into executable programs, followed by extensions such as POET [67] and MathCoder [68], which improve execution fidelity and domain specialization. Subsequent work investigates the conditions under which program delegation is effective, including the role of execution correctness, task structure, and runtime interaction. For example, Chain of Code (CoC) [8] and CIRS [69] analyze how executable reasoning changes failure modes relative to pure language-based reasoning. Later directions extend this interface beyond isolated task execution. 
Cross-lingual reasoning frameworks [70] demonstrate that program-based reasoning can generalize across linguistic environments through shared executable structure, while method-based reasoning [71] introduces reusable programmatic procedures"},{"citing_arxiv_id":"2605.14892","ref_index":104,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems","primary_cat":"cs.AI","submitted_at":"2026-05-14T14:36:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"rStar-Math [101] ESS Ma Consis SC - - -✓ 2025 TTS [102] ESS Ma Log - - - ✓ ✓ 2025 TOPS [103] ESS Ma Log - - -✓ ✓ 2025 Rewarding Progress [104] VRP Ma Log - - - - ✓ 2025 WizardMath [68] VRP Ma Log - - - -✓ 2025 C. 
Output-Stage Regulation FActScore [70] OV LF Evid -✓-✓- 2023 FacTool [71] OV Mu Evid - ✓ ✓ ✓ - 2023 SelfCheckGPT [73] OV LF Consis - - -✓- 2023 Semantic Entropy [105] OV QA LF Inter - - - ✓ - 2023 RARR [78] OV QA LF Evid TV✓-✓- 2023 ITI [81] RC QA Inter - - - ✓ - 2023 Reflexion [19] RC Ag Co Inter Refl - -✓ ✓ 2023 L2R [106] RC QA Inter - - - - - 2023 Factcheck-GPT [72] OV QA LF Evid -✓-✓- 2024 HaloScope [75] OV QA LF Inter - - - ✓ - 2024 DoLa [79] RC QA LF Inter - - -✓- 2024 CAD [80] RC QA LF Evid - - - ✓ - 2024"},{"citing_arxiv_id":"2605.09931","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-11T03:28:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07462","ref_index":113,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment","primary_cat":"cs.CL","submitted_at":"2026-05-08T09:10:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret 
leaks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12214","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Structural Anchors and Reasoning Fragility:Understanding CoT Robustness in LLM4Code","primary_cat":"cs.SE","submitted_at":"2026-04-14T02:48:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10126","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MR-Coupler: Automated Metamorphic Test Generation via Functional Coupling Analysis","primary_cat":"cs.SE","submitted_at":"2026-04-11T09:42:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MR-Coupler leverages functional coupling analysis and LLMs to generate valid metamorphic test cases for over 90% of tasks while detecting 44% of real bugs, outperforming baselines by 64.90% in validity and 36.56% in false-alarm reduction.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"agnostic approach to generate metamorphic test cases for a given program under test. Although some approaches are proposed to generate domain-specific MRs [71, 74], or synthesize MRs based on human-prepared materials [41, 66, 67] or manual effort [50] (discussed in Section 6), adapting them into comparable automated domain-agnostic baselines is non-trivial. 
Given the proven effectiveness of LLMs in code [32, 35, 66] and test generation [53, 70], we set directly prompting LLMs as a baseline. In this baseline, we allow LLMs to conduct a round of revision to the generated code based on the execution feedback as in our method, which is found to be an effective common post-processing to enhance code generation [70]. The baseline uses a similar prompt template (Listing 3), and follows"},{"citing_arxiv_id":"2603.29069","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication","primary_cat":"cs.LG","submitted_at":"2026-03-30T23:15:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.10931","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Asynchronous Reasoning: Training-Free Interactive Thinking LLMs","primary_cat":"cs.LG","submitted_at":"2025-12-11T18:57:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Using properties of positional embeddings, reasoning LLMs can be made to think, listen, and generate outputs asynchronously without any additional training, cutting time to first token to under 5 seconds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.15815","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MR-Adopt: Automatic Deduction of 
Input Transformation Function for Metamorphic Testing","primary_cat":"cs.SE","submitted_at":"2024-08-28T14:24:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MR-Adopt deduces input transformations from hard-coded MR test cases using LLMs, data-flow refinement, and output-relation selection to enable reuse with new source inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.07927","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications","primary_cat":"cs.AI","submitted_at":"2024-02-05T19:49:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a taxonomy and summary table.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Shafiq Joty, Soujanya Poria, and Lidong Bing. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources, 2023. Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1-35, 2023. Tongxuan Liu, Wenjiang Xu, Weizhe Huang, Xingyu Wang, Jiaxing Wang, Hailong Yang, and Jing Li. Logic-of-thought: Injecting logic into contexts for full reasoning in large language models, 2024. Jieyi Long. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291, 2023. 
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Halli-"}],"limit":50,"offset":0}