Canonical reference

Large Language Models: The Need for Nuance in Current Debates and a Pragmatic Perspective on Understanding

Olausson, T · 2023 · DOI 10.18653/v1/2023

Canonical reference. 89% of citing Pith papers cite this work as background.

99 Pith papers citing it

citation-role summary

background 24 · method 2 · dataset 1

citation-polarity summary

background 24 · use method 2 · support 1

representative citing papers

Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

DLR creates discrete latent tokens from rendered CoT images via clustering, enabling up to 20x compression and interpretable trajectories that outperform continuous latent baselines on reasoning tasks.

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

A Gaussian information-gain metric in embedding space quantifies semantic progress in dialogues via uncertainty reduction and shows competitive agreement with human judgments on MT-Bench and UltraFeedback.

τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

cs.IR · 2026-06-08 · unverdicted · novelty 7.0 · 2 refs

τ-Rec is a benchmark for agentic recommender systems with verifiable rewards, RTE mechanism, and pass^k metrics that shows top models reach only ~57% at pass^1 and ~35% at pass^4.

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

Prover-verifier deliberation yields a high-confidence subset of LLM answers with ~30pp higher precision than the complement on GPQA Diamond by using defender-challenger dialogues.

Self-Improving In-Context Learning

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

A test-time zeroth-order optimization of prompt embeddings using a bounded self-supervised proxy from demonstration log-probabilities improves ICL accuracy and correlates with gains across tasks.

Evaluating Cognitive Age Alignment in Interactive AI Agents

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.

Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

LGMT is a logic-grounded metamorphic testing framework that detects hidden reasoning defects in LLMs by checking consistency on semantically invariant inputs derived from FOL equivalences.

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

cs.CY · 2026-05-11 · accept · novelty 7.0 · 2 refs

StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

Entropy-informed Decoding: Adaptive Information-Driven Branching

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fixed budget allocation.

COPYCOP: Ownership Verification for Graph Neural Networks

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

COPYCOP identifies copycat GNNs by matching their node embeddings despite architectural differences and adversarial transformations, backed by theoretical guarantees and tests on 14 datasets across 5 architectures.

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

SemGrad measures LLM uncertainty via gradients in semantic space using a Semantic Preservation Score to select embeddings, with HybridGrad combining it with parameter gradients to outperform sampling-based baselines especially when multiple responses are valid.

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0 · 2 refs

Chain of Evidence introduces a retriever-agnostic visual attribution method for iRAG that reasons over document screenshots with VLMs to output precise bounding boxes, outperforming text baselines on Wiki-CoE and SlideVQA.

When More Reformulations Hurt: Avoiding Drift using Ranker Feedback

cs.IR · 2026-05-01 · unverdicted · novelty 7.0

ReformIR adaptively prioritizes reformulations and documents with a surrogate model guided by ranker feedback to boost recall while suppressing drift under fixed reranking budgets.

How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews

cs.IR · 2026-04-30 · unverdicted · novelty 7.0

AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries especially controversial ones, and are less consistent across repeated or slightly edited queries.

FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

FlowBot automatically induces LLM workflows through bilevel optimization with textual gradients, achieving competitive performance against human-crafted baselines.

Discrete Tilt Matching

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Discrete Tilt Matching recasts dLLM fine-tuning as state-level matching of tilted local unmasking posteriors, producing a stable weighted cross-entropy loss that improves Sudoku and Countdown performance when applied to LLaDA-8B-Instruct.

Multimodal Fact-Level Attribution for Verifiable Reasoning

cs.CL · 2026-02-12 · unverdicted · novelty 7.0

MuRGAt benchmark reveals that strong multimodal models frequently hallucinate citations in complex reasoning tasks despite correct answers, exposing a gap between internal reasoning and verifiable attribution.

Norm Anchors Make Model Edits Last

cs.LG · 2026-01-30 · conditional · novelty 7.0

Norm-Anchor Scaling breaks the norm-feedback loop in sequential LLM editing by anchoring value vectors to original norms, improving long-run performance by 72.2% and extending the editing horizon over 4x.

Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning

cs.AI · 2026-01-23 · unverdicted · novelty 7.0

SSLogic uses LLM agents in a closed Generate-Validate-Refine loop to evolve 953 logic task families from 400 seeds, producing data that yields benchmark gains of +5.2 on SynLogic, +3.0 on AIME25, and +5.5 on BBH.

HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

cs.CL · 2025-10-09 · unverdicted · novelty 7.0

HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.

SynBench: A Benchmark for Differentially Private Text Generation

cs.AI · 2025-09-18 · conditional · novelty 7.0

SynBench benchmarks DP text generators across nine datasets and uses a new MIA to show that public pre-training on portions of private data overestimates synthetic text quality and breaks DP privacy bounds.

citing papers explorer

Showing 15 of 15 citing papers after filters.

Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

cs.AI · 2026-05-24 · unverdicted · none · ref 32

Prover-verifier deliberation yields a high-confidence subset of LLM answers with ~30pp higher precision than the complement on GPQA Diamond by using defender-challenger dialogues.

Evaluating Cognitive Age Alignment in Interactive AI Agents

cs.AI · 2026-05-18 · unverdicted · none · ref 23

The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

cs.AI · 2026-05-12 · unverdicted · none · ref 33 · 2 links

LGMT is a logic-grounded metamorphic testing framework that detects hidden reasoning defects in LLMs by checking consistency on semantically invariant inputs derived from FOL equivalences.

Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning

cs.AI · 2026-01-23 · unverdicted · none · ref 5

SSLogic uses LLM agents in a closed Generate-Validate-Refine loop to evolve 953 logic task families from 400 seeds, producing data that yields benchmark gains of +5.2 on SynLogic, +3.0 on AIME25, and +5.5 on BBH.

SynBench: A Benchmark for Differentially Private Text Generation

cs.AI · 2025-09-18 · conditional · none · ref 4

SynBench benchmarks DP text generators across nine datasets and uses a new MIA to show that public pre-training on portions of private data overestimates synthetic text quality and breaks DP privacy bounds.

Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

cs.AI · 2026-06-18 · unverdicted · none · ref 14

MACR adaptively assesses LLM confidence via semantic entropy then applies inductive multi-agent reasoning with rule-induction, conflict-analysis, and resolution agents to handle unreliable parametric and contextual knowledge.

TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

cs.AI · 2026-05-27 · unverdicted · none · ref 9

TCP-MCP co-evolves prompts and topologies for multi-agent systems, reporting 82.66-96.61% accuracy on MMLU-Pro/MMLU/GSM8K while using up to 5.69x fewer tokens than debate baselines.

ECUAS_n: A family of metrics for principled evaluation of uncertainty-augmented systems

cs.AI · 2026-05-19 · unverdicted · none · ref 80 · 3 links

ECUAS_n is a parameterized family of proper scoring rules for jointly assessing prediction accuracy and uncertainty quality in automated decision systems.

The Shift Toward Open and Reproducible AI Research

cs.AI · 2026-06-15 · unverdicted · none · ref 46 · 2 links

Longitudinal study of 56,800 AI papers finds sixfold increase in code+data sharing from 2014-2024 with inferred reproducibility rising from 28% to 64%.

Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)

cs.AI · 2026-05-26 · unverdicted · none · ref 13

Neuro-symbolic pipeline using formal logic and semantic embeddings detects hallucinations in LLM medical reports at 83%+ for entities and 72% for fabrications while cutting creation time 30%.

Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

cs.AI · 2026-05-15 · conditional · none · ref 15

In CybORG CAGE-2, programmatic state abstraction improves mean return up to 76% over raw observations while adding deliberation tools to hierarchies degrades performance up to 3.4x and increases token use.

When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

cs.AI · 2026-05-06 · unverdicted · none · ref 32

Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.

pAI/MSc: ML Theory Research with Humans on the Loop

cs.AI · 2026-04-22 · unverdicted · none · ref 78

pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.

"Skill issues": data-centric optimization of lakehouse agents

cs.AI · 2026-05-31 · unverdicted · none · ref 32

Data-centric optimization of skills for agents on a branching lakehouse improves accuracy by 31.9% on 25 tasks via state-verification evaluation.

Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

cs.AI · 2026-03-12 · unreviewed · ref 54
