Canonical reference

Adaptation with self-evaluation to improve selective prediction in LLMs

Jiefeng Chen, Jinsung Yoon, Sayna Ebrahimi, Sercan Arik, Tomas Pfister, Somesh Jha · 2023 · DOI 10.18653/v1/2023

Canonical reference. 89% of citing Pith papers cite this work as background.

98 Pith papers citing it

Background 89% of classified citations

open at publisher browse 98 citing papers

citation-role summary

background 24 method 2 dataset 1

citation-polarity summary

background 24 use method 2 support 1

co-cited works

representative citing papers

Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

DLR creates discrete latent tokens from rendered CoT images via clustering, enabling up to 20x compression and interpretable trajectories that outperform continuous latent baselines on reasoning tasks.

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

A Gaussian information-gain metric in embedding space quantifies semantic progress in dialogues via uncertainty reduction and shows competitive agreement with human judgments on MT-Bench and UltraFeedback.

$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

cs.IR · 2026-06-08 · unverdicted · novelty 7.0 · 2 refs

τ-Rec is a benchmark for agentic recommender systems with verifiable rewards, RTE mechanism, and pass^k metrics that shows top models reach only ~57% at pass^1 and ~35% at pass^4.

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

Prover-verifier deliberation yields a high-confidence subset of LLM answers with ~30pp higher precision than the complement on GPQA Diamond by using defender-challenger dialogues.

Self-Improving In-Context Learning

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

A test-time zeroth-order optimization of prompt embeddings using a bounded self-supervised proxy from demonstration log-probabilities improves ICL accuracy and correlates with gains across tasks.

Evaluating Cognitive Age Alignment in Interactive AI Agents

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.

Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

LGMT applies metamorphic testing derived from first-order logic equivalences to detect reasoning inconsistencies in LLMs that static benchmarks miss.

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

cs.CY · 2026-05-11 · accept · novelty 7.0 · 2 refs

StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

Entropy-informed Decoding: Adaptive Information-Driven Branching

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fixed budget allocation.

COPYCOP: Ownership Verification for Graph Neural Networks

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

COPYCOP identifies copycat GNNs by matching their node embeddings despite architectural differences and adversarial transformations, backed by theoretical guarantees and tests on 14 datasets across 5 architectures.

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

SemGrad measures LLM uncertainty via gradients in semantic space using a Semantic Preservation Score to select embeddings, with HybridGrad combining it with parameter gradients to outperform sampling-based baselines especially when multiple responses are valid.

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0 · 2 refs

Chain of Evidence introduces a retriever-agnostic visual attribution method for iRAG that reasons over document screenshots with VLMs to output precise bounding boxes, outperforming text baselines on Wiki-CoE and SlideVQA.

When More Reformulations Hurt: Avoiding Drift using Ranker Feedback

cs.IR · 2026-05-01 · unverdicted · novelty 7.0

ReformIR adaptively prioritizes reformulations and documents with a surrogate model guided by ranker feedback to boost recall while suppressing drift under fixed reranking budgets.

How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews

cs.IR · 2026-04-30 · unverdicted · novelty 7.0

AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries especially controversial ones, and are less consistent across repeated or slightly edited queries.

FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

FlowBot automatically induces LLM workflows through bilevel optimization with textual gradients, achieving competitive performance against human-crafted baselines.

Discrete Tilt Matching

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Discrete Tilt Matching recasts dLLM fine-tuning as state-level matching of tilted local unmasking posteriors, producing a stable weighted cross-entropy loss that improves Sudoku and Countdown performance when applied to LLaDA-8B-Instruct.

Multimodal Fact-Level Attribution for Verifiable Reasoning

cs.CL · 2026-02-12 · unverdicted · novelty 7.0

MuRGAt benchmark reveals that strong multimodal models frequently hallucinate citations in complex reasoning tasks despite correct answers, exposing a gap between internal reasoning and verifiable attribution.

Norm Anchors Make Model Edits Last

cs.LG · 2026-01-30 · conditional · novelty 7.0

Norm-Anchor Scaling breaks the norm-feedback loop in sequential LLM editing by anchoring value vectors to original norms, improving long-run performance by 72.2% and extending the editing horizon over 4x.

Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning

cs.AI · 2026-01-23 · unverdicted · novelty 7.0

SSLogic uses LLM agents in a closed Generate-Validate-Refine loop to evolve 953 logic task families from 400 seeds, producing data that yields benchmark gains of +5.2 on SynLogic, +3.0 on AIME25, and +5.5 on BBH.

HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

cs.CL · 2025-10-09 · unverdicted · novelty 7.0

HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.

SynBench: A Benchmark for Differentially Private Text Generation

cs.AI · 2025-09-18 · conditional · novelty 7.0

SynBench benchmarks DP text generators across nine datasets and uses a new MIA to show that public pre-training on portions of private data overestimates synthetic text quality and breaks DP privacy bounds.

citing papers explorer

Showing 2 of 2 citing papers after filters.

LLM-Oriented Information Retrieval: A Denoising-First Perspective cs.IR · 2026-05-01 · unverdicted · none · ref 219 · 2 links
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation cs.CL · 2025-04-02 · unverdicted · none · ref 73
A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.

Adaptation with self-evaluation to improve selective prediction in LLMs

citation-role summary

citation-polarity summary

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer