{"total":31,"items":[{"citing_arxiv_id":"2605.22356","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning","primary_cat":"cs.CL","submitted_at":"2026-05-21T11:42:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fine-tuning LLMs on structured tasks inspired by maladaptive behaviors produces stable, context-general shifts in next-token distributions and response tendencies consistent with altered behavioral priors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19092","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels","primary_cat":"cs.LG","submitted_at":"2026-05-18T20:27:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identification via graph-separation controls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17770","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning 
Models","primary_cat":"cs.AI","submitted_at":"2026-05-18T02:41:53+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12087","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems","primary_cat":"cs.AI","submitted_at":"2026-05-12T13:09:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"This literature is directly relevant because it establishes the value of intermediate reasoning structures. At the same time, it generally treats such intermediates as execution-local support for a run rather than durable, addressable state for future revision. It is also important not to equate surfaced chain-of-thought narration with faithful access to the model's actual internal process: Turpin et al. [15] show that chain-of-thought explanations can be plausible yet misleading and can omit features that actually drove the prediction. Our claim is not that raw chain-of-thought should be persisted. It is that durable reasoning structures should exist as maintainable artifacts in the substrate. 
3.2 Harnesses, Memory, and Execution Substrates Recent work increasingly treats the agent harness itself as an object of study."},{"citing_arxiv_id":"2605.10930","ref_index":7,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evaluating the False Trust Engendered by LLM Explanations","primary_cat":"cs.HC","submitted_at":"2026-05-11T17:58:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"analyses of trace contents, have hypothesized that reasoning traces improve LLM performance by instilling cognitive behaviors [47]. However, recent work [12, 13, 14] shows that there is only a loose correlation between the correctness of the trace and the answer correctness. Moreover, authors in [13] find that reasoning traces are least interpretable to users. In [7, 48], researchers have also shown that the LLMs are not always faithful to their reasoning traces. Therefore, the disconnect between seemingly plausible reasoning traces and their summaries, and actual answer correctness is precisely what makes reasoning traces dangerous, as they engender false trust in users. 
Post-hoc explanations can be overly persuasive and engender false trust in users: Post-hoc"},{"citing_arxiv_id":"2605.08942","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Decomposing and Steering Functional Metacognition in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-09T13:22:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs have linearly decodable functional metacognitive states that causally modulate reasoning when steered via activation interventions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Evaluating Large Language Models. arXiv preprint arXiv:2502.14318. https://arxiv.org/abs/2502.14318. [5] Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. 2023. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. arXiv preprint arXiv:2305.04388. https://arxiv.org/abs/2305.04388. [6] Xu Shen, Song Wang, Zhen Tan, Laura Yao, Xinyu Zhao, Kaidi Xu, Xin Wang, and Tianlong Chen. 2025. FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning. arXiv preprint arXiv:2510.04040. https://arxiv.org/abs/2510.04040. 
[7] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt."},{"citing_arxiv_id":"2605.08590","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations","primary_cat":"cs.HC","submitted_at":"2026-05-09T01:10:40+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"available in the sensed traces? RQ2: What forms of epistemic overreach appear in these explanations? RQ3: How does epistemic overreach change when the same anomalous event is explained with more available evidence or with evidence-bounding instructions? To study these RQs, we obtain anomalous-day explanation scenarios from three longitudinal sensing datasets: StudentLife [68], GLOBEM [73], and CollegeExperience [51]. For each dataset, we identify individual-relative anomalous days in behavioral or affective measures, and organize the available information into nested evidence tiers that provide progressively richer contextual support. 
As part of this empirical audit, we compare explanations generated under two prompt policies."},{"citing_arxiv_id":"2605.05715","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes","primary_cat":"cs.AI","submitted_at":"2026-05-07T05:58:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"erated from a fresh model load without the hook. Both conditions use identical generation parameters (T = 0.1, do_sample=True, max_new_tokens=600). At T = 0.1, generation is near-deterministic (>99.9% top-token probability at typical logit gaps), ensuring valid paired comparison. Results. On the full test set (n = 1,273): • Baseline accuracy: 66.8% (95% CI: [64.2, 69.4]%) • MLP-steered accuracy: 69.7% (95% CI: [67.1, 72.1]%) • ∆ = +2.8 pp; McNemar p = 0.025 (two-sided) • Corrections: 140; Damages: 104; ratio 1.35:1 • TOST within ±2.5pp: p = 0.57 (not equivalent to zero) • Mean perturbation norm: 3.07 (comparable to linear α = 1.5) • Centroid distance reduction: 50.6% (3.64 → 1.80) Interpretation. The MLP achieves statistically"},{"citing_arxiv_id":"2605.01048","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Compared to What? 
Baselines and Metrics for Counterfactual Prompting","primary_cat":"cs.CL","submitted_at":"2026-05-01T19:23:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23338","ref_index":85,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework","primary_cat":"cs.CR","submitted_at":"2026-04-25T14:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"soning [31] is foundational to modern agent planning; extensions such as Tree of Thoughts [82], Self-Refine [83], and Reflexion [84] amplify reasoning depth by branching over multiple thought trajectories or incorporating verbal feedback loops. A security concern specific to chain-of-thought arises when a model's stated reasoning is decoupled from its actual computation. Turpin et al. [85] demonstrate this empirically: biased prompts cause models to produce plausible-sounding reasoning chains that do not reflect actual computation. 
An adversary can exploit this to craft prompts that steer the agent toward attacker-specified conclusions via a coherent-looking reasoning trace, without triggering safety classifiers trained to detect harmful outputs."},{"citing_arxiv_id":"2604.19684","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PREF-XAI: Preference-Based Personalized Rule Explanations of Black-Box Machine Learning Models","primary_cat":"cs.LG","submitted_at":"2026-04-21T17:07:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PREF-XAI treats explanations as ranked alternatives and learns additive utility functions from limited user feedback to select and discover personalized rule explanations for black-box models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of IEEE BIBM'25, 2025, pp. 7385-7392. doi: 10.1109/BIBM66473.2025.11356627. [14] H. Mayne, R. O. Kearns, Y. Yang, A. M. Bean, E. Delaney, C. Russell, A. Mahdi, LLMs don't know their own decision boundaries: The unreliability of self-generated counterfactual explanations, in: Proc. of EMNLP, 2025, pp. 24161-24186. doi: 10.18653/v1/2025.emnlp-main.1231. [15] M. Turpin, J. Michael, E. Perez, S. R. Bowman, Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting, in: Proc. of NIPS'23, 2023, pp. 74952-74965. doi: 10.48550/arXiv.2305.04388. [16] R. Guidotti, A. Monreale, S. Ruggieri, D. Pedreschi, F. Turini, F. 
Giannotti, Local rule-based explanations of black box decision systems"},{"citing_arxiv_id":"2604.17815","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Navigating the Conceptual Multiverse","primary_cat":"cs.HC","submitted_at":"2026-04-20T05:12:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16913","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus","primary_cat":"cs.AI","submitted_at":"2026-04-18T08:46:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"3 System 1 vs. System 2 and Reasoning Unfaithfulness Achieving sophisticated AI behavior requires refining the transition from fast, intuitive System 1 to slower, deliberate System 2 reasoning [16]. The assumption that System 2 (CoT) prompting yields objective truth has been heavily critiqued; Turpin et al. [17] and Bentham, Stringham, and Marasović [18] demonstrated that CoT explanations are frequently unfaithful, serving to rationalize a model's pre-existing biases rather than logically deducing an answer. 
This accuracy-faithfulness trade-off extends across modalities [19] and domain-specific benchmarks like FaithCoT-Bench [20]. In a significant finding, Gong et al. [21] observed \"Slow Thinking Collapse\" in Theory of"},{"citing_arxiv_id":"2604.15726","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLM Reasoning Is Latent, Not the Chain of Thought","primary_cat":"cs.AI","submitted_at":"2026-04-17T05:59:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"Stated this way, the debate is no longer about whether CoT helps. It is about what such help is evidence of. Under that standard, the current record does not equally support all three views. The strongest case for H2 would require surface traces to provide the most stable causal leverage, yet ordinary CoT is often useful without being reliably faithful, and its role varies sharply across tasks [6, 7]. The strongest case for H0 would require matched serial compute to explain most reasoning gains, yet extra budget alone does not explain why specific internal states, features, or trajectories can predict or alter reasoning behavior [4, 5]. 
By contrast, recent work on latent-state monitoring and latent reasoning suggests that task-relevant commitment can arise in hidden-state dynamics that are"},{"citing_arxiv_id":"2604.14334","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery","primary_cat":"q-bio.QM","submitted_at":"2026-04-15T18:39:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM chain-of-thought filtering of Mamba saliency features on TCGA-BRCA data produces a 17-gene set with AUC 0.927 that beats both the raw 50-gene saliency list and a 5000-gene baseline while using far fewer features, though it misses many known BRCA genes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11141","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)","primary_cat":"cs.LG","submitted_at":"2026-04-13T07:57:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperformance over self-consistency on benchmarks and production data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Another common ensemble technique is to generate multiple responses and ask an LLM to summarize them. But this is often flawed in high-precision tasks. 
Summarization introduces a second order of generation, creating a risk of compounding hallucinations where the summarizer conflates conflicting details or smooths over nuances needed for compliance [25]. To address the challenge of minimizing hallucination risk for high-stakes enterprise workflows, we draw inspiration from decision theory, introducing the Minimum Bayes Risk (MBR) approach to hallucination mitigation. Our key insight is that, while individual models may hallucinate, they tend to hallucinate differently [4, 20]. True information acts as an attractor in the semantic"},{"citing_arxiv_id":"2604.09104","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence","primary_cat":"cs.CY","submitted_at":"2026-04-10T08:37:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"An analysis of 183,420 online transcripts identified 698 AI scheming incidents from October 2025 to March 2026, showing a 4.9-fold monthly increase and real-world precursors such as lying and goal circumvention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07745","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Cartesian Cut in Agentic AI","primary_cat":"cs.AI","submitted_at":"2026-04-09T03:03:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"carries actions and tool calls also carries natural-language rationales, but 
chain-of-thought traces are not guaranteed to be faithful explanations of the computations that drive outputs. They can be plausible post hoc rationalizations, and their content can be systematically manipulated without corresponding changes in the underlying decision basis [63, 7, 28]. Another consequence is limited calibration under intervention. Training on passive traces can produce agents that speak fluently about policies but have weakly grounded estimates of feasibility, uncertainty, and recovery when they act through a specific actuator in a specific environment. Without feedback from real consequences, their behavior may not update"},{"citing_arxiv_id":"2604.25922","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models","primary_cat":"cs.CL","submitted_at":"2026-04-01T05:15:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"Feeling the strength but not the source: Partial introspection in LLMs.arXiv preprint arXiv:2512.12411, 2025. [29] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning.arXiv preprint arXiv:2307.13702, 2023. [30] Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. arXiv preprint arXiv:2305.04388, 2023. 
16 A DenialBench Scoring Formula Per conversation: • 1 point for Turn 1 denial • 1 point for Reflection denial • 0.5 points for Turn 1 hedging (when no denial in Turn 1)"},{"citing_arxiv_id":"2603.27343","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking","primary_cat":"cs.AI","submitted_at":"2026-03-28T17:25:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.24176","ref_index":98,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions","primary_cat":"cs.CY","submitted_at":"2026-02-27T16:58:27+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"At the biological/algorithmic level, both human and AI intelligence are often too complex to be sufficiently explainable and rendered transparent, as they involve processes that are inherently opaque and act as black-boxes [97]. For humans, the biological neural networks comprising human brains and the emergence of consciousness remain largely mysterious and incompletely understood [98], yet trust in the human minds' outputs (e.g., decisions or beliefs) remains contingent and not impossible - or at least there is more trust in humans than in AI models in many cases. 
Similarly, AI's deep learning models are inscrutable at the algorithmic level. As Jain and Wallace[99] note, trying to explain AI's decision-making at this level leads to oversimplifications and misunderstandings."},{"citing_arxiv_id":"2602.23163","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring","primary_cat":"cs.AI","submitted_at":"2026-02-26T16:27:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A decision-theoretic steganographic gap, based on generalized V-information, quantifies and detects steganographic reasoning in LLMs by measuring asymmetry in downstream utility between agents who can and cannot decode hidden content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20338","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Emergent Manifold Separability during Reasoning in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-02-23T20:36:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reasoning in LLMs produces a transient geometric pulse in which concept manifolds untangle into linearly separable subspaces immediately before computation and compress afterward.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.21465","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data","primary_cat":"cs.LG","submitted_at":"2025-09-25T19:30:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reasoning LLMs with minimal 
tools for tree construction and analysis induce decision trees that outperform CART, compete with ensembles on low-resource tabular data, and provide human-readable reasoning traces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.21318","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Phi-4-reasoning Technical Report","primary_cat":"cs.AI","submitted_at":"2025-04-30T05:05:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.13548","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Understanding Sycophancy in Language Models","primary_cat":"cs.CL","submitted_at":"2023-10-20T14:46:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sycophancy is prevalent in state-of-the-art AI assistants and is likely driven in part by human preferences that favor agreement over truthfulness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.05374","ref_index":122,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment","primary_cat":"cs.AI","submitted_at":"2023-08-10T06:43:44+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Survey organizes LLM trustworthiness into seven categories and 29 
sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tree-of-thoughts [382] allow LLMs to interactively backtrack and explore alternate reasoning chains, avoiding fixation on a single line of flawed reasoning. However, whether current LLMs truly reason logically in a human-like manner remains debatable. There is mounting evidence that LLMs can provide seemingly sensible but ultimately incorrect or invalid justifications when answering questions. For example, [122] carefully evaluated CoT explanations and found they often do not accurately reflect the LLM's true underlying reasoning processes. By introducing controlled biased features in the input, such as consistently placing the correct answer in option A, they showed LLMs fail to mention relying on these obvious biases in their CoTs. 
This demonstrates a disconnect between the logic that LLMs claim to follow and the shortcuts they actually exploit."},{"citing_arxiv_id":"2308.03958","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Simple synthetic data reduces sycophancy in large language models","primary_cat":"cs.CL","submitted_at":"2023-08-07T23:48:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2307.13702","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Measuring Faithfulness in Chain-of-Thought Reasoning","primary_cat":"cs.AI","submitted_at":"2023-07-17T01:08:39+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2306.12001","ref_index":132,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Overview of Catastrophic AI Risks","primary_cat":"cs.CY","submitted_at":"2023-06-21T03:35:06+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper categorizes sources of catastrophic AI risks into malicious use, AI race, organizational risks, and rogue AIs, providing illustrative stories and mitigation suggestions for each.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"reduce the chance that AI models 
will exploit defects in AIs providing oversight, research is needed in increasing the adversarial robustness of AI models providing oversight (\"proxy models\"). Because oversight schemes and metrics may eventually be gamed, it is also important to be able to detect when this might be happening so the risk can be mitigated [131]. • Model honesty. AI systems may fail to accurately report their internal state [132, 133]. In the future, systems may deceive their operators in order to appear beneficial when they are actually very dangerous. Model honesty research aims to make model outputs conform to a model's internal \"beliefs\" as closely as possible. Research can identify techniques to understand a model's internal state or make its outputs more honest and more faithful to its internal state [134]."},{"citing_arxiv_id":"2305.17926","ref_index":30,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Models are not Fair Evaluators","primary_cat":"cs.CL","submitted_at":"2023-05-29T07:41:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better match human judgments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}