TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

Daniel Weld, Eunsol Choi, Luke Zettlemoyer, Mandar Joshi · 2017 · Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) · DOI 10.18653/v1/p17-1147

Tool reference. 70% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

104 Pith papers citing it

606 external citations · Crossref

citation-role summary

dataset 6 · background 3 · method 1

citation-polarity summary

use dataset 6 · unclear 2 · background 1 · use method 1

claims ledger

dataset + TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, organized into two groups by reasoning depth. Multi-hop benchmarks consist of HotpotQA [28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31]. Single-hop benchmarks consist of Natural Questions (NQ) [32], TriviaQA [33], and PopQA [34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. We report Ex
dataset and TriviaQA. For Natural Questions (NQ), we use the dpr-w100 split from ir_datasets to represent open-domain, real-world user queries [34, 35, 36]. For PubMedQA, we adopt the pqa_labeled configuration to model medical question answering, where accurate technical retrieval is needed [37]. For TriviaQA, we employ the rc (reading comprehension) configuration [38]. Using a fixed random seed, we sample 50 benign queries from each dataset for the utility-oriented evaluation of retrieval and generatio
dataset 7750 248.2226 17.8641 266.0867 0.3658 BnB INT8 138.96 139.09 0.9880 56.3886 0 56.3886 0.0265 NF4 138.96 144.16 0.9124 155.4506 0 155.4506 0.0750 FP4 138.96 138.10 0.9196 145.1767 0 145.1767 0.1306 GPTQ GPTQ-4bit 138.96 140.37 0.9298 136.7867 0 136.7867 0.1422 Benchmarks and scoring. Five benchmarks: MMLU [28], ARC [29] (multiple-choice knowledge), TriviaQA [30], SQuAD [31] (short-horizon QA), and GSM8K [32] (multi-step reasoning). All risks are computed teacher-forced (prompt c and targets y scored in
dataset significantly on knowledge-intensive and adversarial benchmarks, collapsing on TruthfulQA. We attribute this to the absence of a principled density model, making it unable to generalize across different instruction-tuning regimes. TruthfulQA remains the hardest setting for all methods, as its questions target misconceptions deeply encoded in pretraining weights [16]. Yet, PCNET leads across all models also on this dataset, with Mistral-7B achieving the highest AUROC, consistent with the hypothesis
method (by non-expert validators who are experts in other domains; at least 15 min, avg ~37 min, allowing Google) Part 1: answer Q (correct answer & explanations not shown) Part 2: provide feedback on the following dimensions (correct answer & explanations shown to the validator) Include this Q in the DIAMOND set because (1) 2 out of 2 expert validators agree* (2) ≤ 1 out of 3 non-expert validators answers correctly • Post-hoc agreement: Is the answer uncontroversial? • Is your background sufficient to answe
dataset Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alex Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-LM: In search of the next generation of training sets for language models. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=CNWdWn47IE. [40] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for read

authors

Daniel Weld, Eunsol Choi, Luke Zettlemoyer, Mandar Joshi

representative citing papers

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

Autoregressive Boltzmann Generators

cs.LG · 2026-06-25 · unverdicted · novelty 7.0

ArBG replaces flow-based methods with autoregressive models for Boltzmann sampling, showing gains on peptide benchmarks and a 132M-parameter model Robin cutting zero-shot energy error by over 60% on 8-residue systems.

LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.

MemTrain: Self-Supervised Context Memory Training

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

MemTrain introduces two coupled self-supervised proxy tasks on Wikipedia corpora to train general context-memory capabilities in LLMs, reporting gains of up to 17.67 points on long-text and search-based QA benchmarks over direct post-training.

When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.

From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

cs.AI · 2026-05-29 · unverdicted · novelty 7.0

Aggregating preference deltas from several weak-weaker model pairs via LoRA adapters and geometric alignment merging improves strong-model performance on reasoning and search benchmarks beyond any single delta.

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

cs.LG · 2026-05-29 · conditional · novelty 7.0

Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

cs.AI · 2026-05-27 · accept · novelty 7.0

Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

SemGrad measures LLM uncertainty via gradients in semantic space using a Semantic Preservation Score to select embeddings, with HybridGrad combining it with parameter gradients to outperform sampling-based baselines especially when multiple responses are valid.

HIVE: Hidden-Evidence Verification for Hallucination Detection in Diffusion Large Language Models

cs.CL · 2026-04-28 · unverdicted · novelty 7.0

HIVE detects hallucinations in diffusion LLMs by selecting and conditioning on hidden evidence from denoising trajectories, achieving up to 0.9236 AUROC on QA benchmarks.

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

Continuous Semantic Caching for Low-Cost LLM Serving

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Establishes the first rigorous framework for continuous semantic caching of LLM responses using ε-net discretization and kernel ridge regression, with sublinear regret bounds.

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

Internal layer-wise entropy reshaping provides nonconformity scores that improve the validity-efficiency trade-off of conformal prediction for LLMs under cross-domain shift compared to text-level baselines.

A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

cs.IR · 2026-04-15 · unverdicted · novelty 7.0

A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.

Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

cs.AI · 2026-04-05 · unverdicted · novelty 7.0

PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

cs.CL · 2025-10-09 · unverdicted · novelty 7.0

HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.

Sampling from Your Language Model One Byte at a Time

cs.CL · 2025-06-17 · unverdicted · novelty 7.0

An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.

Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving

cs.DC · 2025-05-29 · conditional · novelty 7.0

GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.

citing papers explorer

Showing 50 of 104 citing papers.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model cs.SD · 2026-06-30 · unverdicted · none · ref 259
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
Autoregressive Boltzmann Generators cs.LG · 2026-06-25 · unverdicted · none · ref 41
ArBG replaces flow-based methods with autoregressive models for Boltzmann sampling, showing gains on peptide benchmarks and a 132M-parameter model Robin cutting zero-shot energy error by over 60% on 8-residue systems.
LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents cs.CL · 2026-06-04 · unverdicted · none · ref 49
LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding cs.CL · 2026-06-03 · unverdicted · none · ref 76
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
MemTrain: Self-Supervised Context Memory Training cs.CL · 2026-06-02 · unverdicted · none · ref 5
MemTrain introduces two coupled self-supervised proxy tasks on Wikipedia corpora to train general context-memory capabilities in LLMs, reporting gains of up to 17.67 points on long-text and search-based QA benchmarks over direct post-training.
When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation cs.CL · 2026-06-01 · unverdicted · none · ref 20
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging cs.AI · 2026-05-29 · unverdicted · none · ref 20
Aggregating preference deltas from several weak-weaker model pairs via LoRA adapters and geometric alignment merging improves strong-model performance on reasoning and search benchmarks beyond any single delta.
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them cs.LG · 2026-05-29 · conditional · none · ref 14
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting cs.LG · 2026-05-28 · unverdicted · none · ref 35
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? cs.AI · 2026-05-27 · unverdicted · none · ref 10
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure cs.AI · 2026-05-27 · accept · none · ref 35
Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition cs.LG · 2026-05-14 · unverdicted · none · ref 13
QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.
Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers cs.LG · 2026-05-06 · unverdicted · none · ref 24
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models cs.CL · 2026-05-06 · unverdicted · none · ref 21
SemGrad measures LLM uncertainty via gradients in semantic space using a Semantic Preservation Score to select embeddings, with HybridGrad combining it with parameter gradients to outperform sampling-based baselines especially when multiple responses are valid.
HIVE: Hidden-Evidence Verification for Hallucination Detection in Diffusion Large Language Models cs.CL · 2026-04-28 · unverdicted · none · ref 1
HIVE detects hallucinations in diffusion LLMs by selecting and conditioning on hidden evidence from denoising trajectories, achieving up to 0.9236 AUROC on QA benchmarks.
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models cs.LG · 2026-04-22 · unverdicted · none · ref 40
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences cs.LG · 2026-04-22 · unverdicted · none · ref 78
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
Continuous Semantic Caching for Low-Cost LLM Serving cs.LG · 2026-04-21 · unverdicted · none · ref 12
Establishes the first rigorous framework for continuous semantic caching of LLM responses using ε-net discretization and kernel ridge regression, with sublinear regret bounds.
Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations cs.CL · 2026-04-17 · unverdicted · none · ref 3
Internal layer-wise entropy reshaping provides nonconformity scores that improve the validity-efficiency trade-off of conformal prediction for LLMs under cross-domain shift compared to text-level baselines.
A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation cs.IR · 2026-04-15 · unverdicted · none · ref 19
A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.
Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents cs.AI · 2026-04-05 · unverdicted · none · ref 14
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation cs.CL · 2025-10-09 · unverdicted · none · ref 7
HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.
Sampling from Your Language Model One Byte at a Time cs.CL · 2025-06-17 · unverdicted · none · ref 25
An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.
Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving cs.DC · 2025-05-29 · conditional · none · ref 33
GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 77
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models cs.CL · 2024-04-29 · conditional · none · ref 5
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation cs.CL · 2024-02-05 · unverdicted · none · ref 98
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 105
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
GPQA: A Graduate-Level Google-Proof Q&A Benchmark cs.AI · 2023-11-20 · accept · none · ref 2
GPQA is a new graduate-level benchmark where PhD experts score 65% (74% after corrections), skilled non-experts score 34% with web access, and GPT-4 scores 39%, intended to enable realistic tests of human supervision over superhuman AI.
Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA cs.CL · 2021-10-04 · unverdicted · none · ref 16
Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks cs.CL · 2020-05-22 · accept · none · ref 27
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions cs.CL · 2019-05-24 · accept · none · ref 14
BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT fine-tuned on MultiNLI reaches 80.4% accuracy against 90% human performance.
When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs cs.CL · 2026-06-29 · unverdicted · none · ref 109
Global calibration metrics like ECE are confounded by accuracy; the proposed ACE framework with three accuracy-controlled views shows many prior calibration advantages weaken or reverse.
What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs cs.LG · 2026-06-26 · unverdicted · none · ref 7
Proposes SCSuff metric for evaluating LLM explanation sufficiency via model-generated alternative inputs, showing explanations are typically insufficient and predictable from hidden states.
SHIFT: Gate-Modulated Activation Steering for Knowledge Conflict Mitigation in Retrieval-Augmented Generation cs.CL · 2026-06-26 · unverdicted · none · ref 27
SHIFT reformulates neuron editing as learnable gate modulation on under 0.01% parameters to let LLMs adaptively balance contextual and parametric knowledge during RAG generation.
Quantifying the Agreement Between Data-Influence and Data-Similarity to Understand LLM Behavior cs.LG · 2026-06-22 · unverdicted · none · ref 66
Data-similarity and data-influence produce significantly overlapping rankings of training documents for LLM outputs, with asymmetry allowing a favorable cost-accuracy trade-off.
All Relations Lead to Rome: Automated Knowledge Graph Creation and Question Generation cs.IR · 2026-06-21 · unverdicted · none · ref 12
ARLtR is a framework for jointly constructing knowledge graphs, embeddings, and grounded QA pairs from text, demonstrated on a Roman Empire dataset with over 19,000 entities and 8,400 QA pairs.
Breaking the Likelihood Trap: Variance-Calibrated Modulation for Large Language Model Decoding cs.CL · 2026-06-21 · unverdicted · none · ref 44
VCM is a training-free decoding intervention that applies PMI-driven token elevation and variance-adaptive penalization to reduce repetitive degeneration in LLM open-ended generation.
MiniMax Sparse Attention cs.AI · 2026-06-11 · unverdicted · none · ref 27
MiniMax Sparse Attention is a GQA-based block-sparse attention mechanism that selects top-k blocks independently per group and delivers 28.4x per-token compute reduction at 1M context with on-par performance plus 14.2x prefill and 7.6x decode speedups via co-designed GPU kernel.
Redesign Mixture-of-Experts Routers with Manifold Power Iteration cs.LG · 2026-06-10 · unverdicted · none · ref 13
Manifold Power Iteration aligns MoE router rows with principal singular directions of experts via a power-then-retract process, with theory showing convergence and experiments on 1B-11B models showing gains.
Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation cs.CL · 2026-06-10 · unverdicted · none · ref 21
Soft-prompt tuning with 10 vectors improves format compliance on LLM benchmarks and provides a low-cost proxy for comparing base models.
Boosting Self-Consistency with Ranking cs.CL · 2026-06-03 · unverdicted · none · ref 143
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models cs.CL · 2026-06-02 · unverdicted · none · ref 11
Clustered Self-Assessment groups sampled LLM responses into semantic clusters, presents clusters as multiple-choice options, and uses the LLM's assigned probabilities to those options as direct uncertainty estimates, outperforming entropy baselines with as few as two extra samples.
SimSD: Simple Speculative Decoding in Diffusion Language Models cs.CL · 2026-06-01 · unverdicted · none · ref 27
SimSD adds a masking strategy to enable speculative decoding in diffusion LLMs, delivering up to 7.46x throughput gains on SDAR models while preserving generation quality.
Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time cs.CL · 2026-06-01 · unverdicted · none · ref 56
RCA is a training-free module that boosts input context signal strength in the residual stream of LLMs by orthogonal decoupling of attention routing from value magnitude.
TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection cs.AI · 2026-05-31 · unverdicted · none · ref 18
TriLens detects hallucinations via per-layer entropy trajectories of logit-lens readouts from three internal modules across LLMs and QA benchmarks.
Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents cs.CL · 2026-05-28 · unverdicted · none · ref 10
Web retrieval degrades safety alignment in LLM agents, with relevance activating vulnerabilities including a Safe Source Paradox where oppositional content increases harmful compliance.
Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering cs.AI · 2026-05-26 · unverdicted · none · ref 22
DualGraph combines semantic textual KGs with symbolic KGs for semi-structured QA and introduces the SpecsQA benchmark, outperforming baselines on both open and specification questions.
Automatic Layer Selection for Hallucination Detection cs.AI · 2026-05-25 · unverdicted · none · ref 23
FEPoID automatically selects optimal or near-optimal intermediate layers for hallucination detection across LLM architectures and tasks, outperforming prior criteria and baselines, with an added truncation step that further improves performance.
ECUAS$_n$: A family of metrics for principled evaluation of uncertainty-augmented systems cs.AI · 2026-05-19 · unverdicted · none · ref 45 · 3 links
ECUAS_n is a parameterized family of proper scoring rules for jointly assessing prediction accuracy and uncertainty quality in automated decision systems.
