{"total":12,"items":[{"citing_arxiv_id":"2605.30539","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Theory-Guided LLM Pedagogical Agent for STEM+C Scaffolding Without Over-Reliance","primary_cat":"cs.MA","submitted_at":"2026-05-28T20:13:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Copa is a theory-guided multimodal LLM agent that supports high school computational modeling through adaptive feedback, shown in a 33-dyad study to increase student confidence and conceptual verbalization without fostering dependence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26321","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Anchor: Mitigating Artifact Drift in Agent Benchmark Generation","primary_cat":"cs.AI","submitted_at":"2026-05-25T20:44:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Anchor generates consistent long-horizon agent tasks from parametric constraint programs, yielding ERP-Bench of 300 ERP tasks where frontier models reach optimal solutions in 17.4% of trials.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22564","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations","primary_cat":"cs.CL","submitted_at":"2026-05-21T14:45:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21984","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Echo: Learning from Experience Data via User-Driven Refinement","primary_cat":"cs.AI","submitted_at":"2026-05-21T04:34:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Echo is a framework that harvests user-driven refinements of agent proposals as training signals to align models with real-world needs, demonstrated by raising code completion acceptance from 25.7% to 35.7% in production.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20086","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"What Do Evolutionary Coding Agents Evolve?","primary_cat":"cs.NE","submitted_at":"2026-05-19T16:41:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03409","ref_index":32,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Robust Agent Compensation (RAC): Teaching AI Agents to Compensate","primary_cat":"cs.AI","submitted_at":"2026-05-05T06:27:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RAC is a log-based recovery paradigm implemented as an architectural extension to agent frameworks, achieving 1.5-8X better latency and token economy than LLM-based recovery on τ-bench and REALM-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17817","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots","primary_cat":"cs.HC","submitted_at":"2026-04-20T05:15:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Autotask: Executing arbitrary voice commands by exploring and learning from mobile gui.arXiv preprint arXiv:2312.16062(2023). [43] Melissa Z Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, et al. 2025. Measuring agents in production.arXiv preprint arXiv:2512.04123(2025). [44] Sanket Pandya. 2024. Android Material Design Guidelines.Design Bootcamp on Medium(20 jun 2024). https://medium.com/design-bootcamp/android- material-design-guidelines-4cd9b3a3b454 Accessed: 2025-07-14. [45] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive"},{"citing_arxiv_id":"2604.13536","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Don't Let AI Agents YOLO Your Files: Shifting Information and Control to Filesystems for Agent Safety and Autonomy","primary_cat":"cs.OS","submitted_at":"2026-04-15T06:32:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"YoloFS is an agent-native filesystem that stages mutations for review, provides snapshots for agent self-correction, and uses progressive permissions to reduce user interruptions while matching baseline task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08956","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift","primary_cat":"cs.CV","submitted_at":"2026-04-10T04:56:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Supervised fine-tuning with 0.1% labeled data outperforms all 60 tested prompt variants for CLIPSeg cloud segmentation on satellite imagery under domain shift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06802","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Riemann-Bench: A Benchmark for Moonshot Mathematics","primary_cat":"cs.AI","submitted_at":"2026-04-08T08:16:37+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05150","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation","primary_cat":"cs.SE","submitted_at":"2026-04-06T20:25:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.09002","ref_index":129,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Security Considerations for Multi-agent Systems","primary_cat":"cs.CR","submitted_at":"2026-03-09T22:46:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}