{"total":14,"items":[{"citing_arxiv_id":"2605.22967","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learned Relay Representations for Forward-Thinking Discrete Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-21T18:53:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17187","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media","primary_cat":"cs.CL","submitted_at":"2026-05-16T22:52:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15377","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute","primary_cat":"cs.AI","submitted_at":"2026-05-14T20:06:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08221","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NoisyCoconut: Counterfactual Consensus via 
Latent Space Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-06T13:58:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02442","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measuring AI Reasoning: A Guide for Researchers","primary_cat":"cs.AI","submitted_at":"2026-05-04T10:42:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25166","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training Transformers as a Universal Computer","primary_cat":"cs.AI","submitted_at":"2026-04-28T03:15:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A transformer trained on random meaningless MicroPy programs generalizes to execute diverse human-written programs, providing empirical evidence it can act as a universal computer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22951","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Power of Power Law: Asymmetry Enables Compositional 
Reasoning","primary_cat":"cs.AI","submitted_at":"2026-04-24T18:49:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21027","ref_index":135,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering","primary_cat":"cs.AI","submitted_at":"2026-04-22T19:18:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15726","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM Reasoning Is Latent, Not the Chain of Thought","primary_cat":"cs.AI","submitted_at":"2026-04-17T05:59:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"The difficulty is that recent methods often move several explanatory factors at once, making experimental results hard to interpret as causal support for any specific view: Chain-of-thought prompting 
changes both visible traces and compute allocation; Latent reasoning methods often change both hidden-state dynamics and compute budget; Test-time scaling changes compute and usually changes the output path as well [4, 5, 9]. The first task, then, is to separate the objects that recent work often conflates. Section 2 does so by distinguishing surface traces, latent-state dynamics, and generic serial compute, and by turning three loose views into three concrete hypotheses: H2 treats multi-step reasoning as primarily mediated by explicit surface CoT; H0 treats most apparent reasoning gains as better explained by generic"},{"citing_arxiv_id":"2604.02371","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Internalized Reasoning for Long-Context Visual Document Understanding","primary_cat":"cs.CV","submitted_at":"2026-03-31T04:41:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We summarize a few main methodologies in this area. [18, 51] utilize an auxiliary objective over rationales and a train-only reasoning head respectively. [44, 6] append reasoning after answers. An alternative line distills reasoning into hidden-state or latent representations via objectives such as REINFORCE [55], knowledge distillation [10] and VAE inspired objectives [27]. All of these works operate at a smaller scale (GPT-2 to 11B) and typically on simplistic tasks such as CommonsenseQA, SuperGLUE and multi-digit multiplication. 
In the VLM domain, [56] observe that CoT fine-tuning improves not only CoT-mode but also direct-answer VLM performance, suggesting some degree of internalization. Our work differs from prior implicit CoT methods in several key"},{"citing_arxiv_id":"2602.08167","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning","primary_cat":"cs.RO","submitted_at":"2026-02-09T00:10:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"39], legged navigation [39-43], and autonomous driving [44]. Semantic and Visual Reasoning. Chain-of-Thought (CoT) reasoning has enhanced LLM and VLM performance by generating intermediate logical steps before producing a final answer [45, 46]. This computation increases expressivity and search capabilities [47, 48], refining internal representations to better answer complex queries [49, 50] in domains ranging from math and coding to visual question answering [51-54]. Beyond standard prompting, recent efforts explicitly integrate reasoning objectives during pre-training and post-training [55-59], or they improve reasoning and instruction following via supervised finetuning [60-62] or reinforcement learning and self-play [63-66]. 
Recent works [67-70] leverage Vari-"},{"citing_arxiv_id":"2507.12549","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Serial Scaling Hypothesis","primary_cat":"cs.LG","submitted_at":"2025-07-16T18:01:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.18018","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?","primary_cat":"cs.AI","submitted_at":"2025-03-23T10:35:39+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs show accuracy drops of 0.3% to 5.9% on GSM8K math problems when culturally adapted to six countries while keeping math operations identical, with statistical significance confirmed by McNemar tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.06769","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training Large Language Models to Reason in a Continuous Latent Space","primary_cat":"cs.CL","submitted_at":"2024-12-09T18:55:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency 
trade-offs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}