{"total":13,"items":[{"citing_arxiv_id":"2605.21488","ref_index":26,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-20T17:59:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Equilibrium Reasoners learn task-conditioned attractors in latent dynamics to support scalable iterative reasoning, raising Sudoku-Extreme accuracy from 2.6% to over 99% via up to 40,000 equivalent layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17028","ref_index":68,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts","primary_cat":"cs.CL","submitted_at":"2026-05-16T14:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Benchmark construction artifacts in hallucination detection corpora allow naive text-similarity baselines to achieve near-perfect scores, and controlled evaluations show most methods perform near chance except SAPLMA and the new DRIFT probe.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16675","ref_index":29,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-15T22:30:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15665","ref_index":8,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI","primary_cat":"cs.AI","submitted_at":"2026-05-15T06:43:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRISM automates continuous prompt creation, simulation-based testing, diagnosis, and repair for enterprise LLM agents, cutting authoring time to under 30 minutes while reaching 99% reliability and catching drift within 24 hours.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14537","ref_index":22,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining","primary_cat":"cs.AI","submitted_at":"2026-05-14T08:20:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Cattle Trade benchmark shows heuristic code agents outperforming most LLMs in integrated strategic tasks like bidding, bluffing, and resource allocation across 242 games, with strategic coherence predicting rank better than spending volume.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12366","ref_index":14,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Classifier Context Rot: Monitor Performance Degrades with Context Length","primary_cat":"cs.AI","submitted_at":"2026-05-12T16:34:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Frontier LLMs miss dangerous actions in long coding agent transcripts 2-30 times more often after hundreds of thousands of benign tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08904","ref_index":129,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06882","ref_index":10,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem","primary_cat":"cs.AI","submitted_at":"2026-05-07T19:31:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06584","ref_index":16,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:13:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NeuroAgent uses a hierarchical LLM agent framework with Generate-Execute-Validate loops to automate neuroimaging preprocessing, reaching 84.8% end-to-end correctness and 0.9518 AUC for Alzheimer's classification on 1470 ADNI subjects using four modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04426","ref_index":11,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Telegraph English: Semantic Prompt Compression via Structured Symbolic Rewriting","primary_cat":"cs.CL","submitted_at":"2026-05-06T02:40:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Telegraph English compresses prompts via structured symbolic rewriting into atomic facts, achieving roughly 50% token reduction with 99.1% key-fact accuracy on LongBench-v2 and outperforming token-deletion baselines across models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03065","ref_index":114,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"OGPO: Sample Efficient Full-Finetuning of Generative Control Policies","primary_cat":"cs.LG","submitted_at":"2026-05-04T18:36:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OGPO enables sample-efficient full-finetuning of generative control policies via off-policy critics and modified PPO, achieving SOTA on robot manipulation tasks while rescuing poorly initialized behavior cloning policies without expert data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01704","ref_index":77,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-03T04:12:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact and FEVER.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21725","ref_index":26,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"AEL: Agent Evolving Learning for Open-Ended Environments","primary_cat":"cs.CL","submitted_at":"2026-04-23T14:29:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AEL uses a fast-timescale bandit for memory policy selection and slow-timescale LLM reflection for causal insights, achieving a Sharpe ratio of 2.13 on a 208-episode portfolio benchmark while showing that added mechanisms degrade performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}