
TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

Daniel Weld, Eunsol Choi, Luke Zettlemoyer, Mandar Joshi · 2017 · Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) · DOI 10.18653/v1/p17-1147

Tool reference. 70% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

104 Pith papers citing it

606 external citations · Crossref

Method reference 70% of classified citations


citation-role summary

dataset 6 · background 3 · method 1
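The 70% figure quoted at the top of the page can be checked against these role counts. The sketch below assumes that the "method, library, or software dependency" share is the sum of the `dataset` and `method` roles divided by all classified citations; that grouping is an assumption made here for illustration, not something the page states, and `role_share` is a hypothetical helper name.

```python
from typing import Dict, Iterable


def role_share(counts: Dict[str, int], roles: Iterable[str]) -> float:
    """Return the fraction of classified citations that fall in `roles`."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no classified citations, so no meaningful share
    return sum(counts.get(role, 0) for role in roles) / total


# Counts copied from the citation-role summary above.
role_counts = {"dataset": 6, "background": 3, "method": 1}

# Assumption: "tool-like" use means the dataset and method roles together.
tool_like_roles = ("dataset", "method")

share = role_share(role_counts, tool_like_roles)
print(f"{share:.0%} of {sum(role_counts.values())} classified citations")
```

Under that grouping, 7 of the 10 classified citations are tool-like, which agrees with the stated 70%.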

citation-polarity summary

use dataset 6 · unclear 2 · background 1 · use method 1

claims ledger

dataset + TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, organized into two groups by reasoning depth. Multi-hop benchmarks consist of HotpotQA [28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31]. Single-hop benchmarks consist of Natural Questions (NQ) [32], TriviaQA [33], and PopQA [34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. We report Ex
dataset and TriviaQA. For Natural Questions (NQ), we use the dpr-w100 split from ir_datasets to represent open-domain, real-world user queries [34, 35, 36]. For PubMedQA, we adopt the pqa_labeled configuration to model medical question answering, where accurate technical retrieval is needed [37]. For TriviaQA, we employ the rc (reading comprehension) configuration [38]. Using a fixed random seed, we sample 50 benign queries from each dataset for the utility-oriented evaluation of retrieval and generatio
dataset 7750 248.2226 17.8641 266.0867 0.3658 BnB INT8 138.96 139.09 0.9880 56.3886 0 56.3886 0.0265 NF4 138.96 144.16 0.9124 155.4506 0 155.4506 0.0750 FP4 138.96 138.10 0.9196 145.1767 0 145.1767 0.1306 GPTQ GPTQ-4bit 138.96 140.37 0.9298 136.7867 0 136.7867 0.1422 Benchmarks and scoring. Five benchmarks: MMLU [28], ARC [29] (multiple-choice knowledge), TriviaQA [30], SQuAD [31] (short-horizon QA), and GSM8K [32] (multi-step reasoning). All risks are computed teacher-forced (prompt c and targets y scored in
dataset significantly on knowledge-intensive and adversarial benchmarks, collapsing on TruthfulQA. We attribute this to the absence of a principled density model, making it unable to generalize across different instruction-tuning regimes. TruthfulQA remains the hardest setting for all methods, as its questions target misconceptions deeply encoded in pretraining weights [16]. Yet, PCNET leads across all models also on this dataset, with Mistral-7B achieving the highest AUROC, consistent with the hypothesis
method (by non-expert validators who are experts in other domains; at least 15 min, avg ~37 min, allowing Google) Part 1: answer Q (correct answer & explanations not shown) Part 2: provide feedback on the following dimensions (correct answer & explanations shown to the validator) Include this Q in the DIAMOND set because (1) 2 out of 2 expert validators agree* (2) ≤ 1 out of 3 non-expert validators answers correctly • Post-hoc agreement: Is the answer uncontroversial? • Is your background sufficient to answe
dataset Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alex Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-LM: In search of the next generation of training sets for language models. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=CNWdWn47IE. [40] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for read

authors

Daniel Weld, Eunsol Choi, Luke Zettlemoyer, Mandar Joshi

representative citing papers

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

Autoregressive Boltzmann Generators

cs.LG · 2026-06-25 · unverdicted · novelty 7.0

ArBG replaces flow-based methods with autoregressive models for Boltzmann sampling, showing gains on peptide benchmarks and a 132M-parameter model Robin cutting zero-shot energy error by over 60% on 8-residue systems.

LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.

MemTrain: Self-Supervised Context Memory Training

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

MemTrain introduces two coupled self-supervised proxy tasks on Wikipedia corpora to train general context-memory capabilities in LLMs, reporting gains of up to 17.67 points on long-text and search-based QA benchmarks over direct post-training.

When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.

From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

cs.AI · 2026-05-29 · unverdicted · novelty 7.0

Aggregating preference deltas from several weak-weaker model pairs via LoRA adapters and geometric alignment merging improves strong-model performance on reasoning and search benchmarks beyond any single delta.

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

cs.LG · 2026-05-29 · conditional · novelty 7.0

Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

cs.AI · 2026-05-27 · accept · novelty 7.0

Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

SemGrad measures LLM uncertainty via gradients in semantic space using a Semantic Preservation Score to select embeddings, with HybridGrad combining it with parameter gradients to outperform sampling-based baselines especially when multiple responses are valid.

HIVE: Hidden-Evidence Verification for Hallucination Detection in Diffusion Large Language Models

cs.CL · 2026-04-28 · unverdicted · novelty 7.0

HIVE detects hallucinations in diffusion LLMs by selecting and conditioning on hidden evidence from denoising trajectories, achieving up to 0.9236 AUROC on QA benchmarks.

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

Continuous Semantic Caching for Low-Cost LLM Serving

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Establishes the first rigorous framework for continuous semantic caching of LLM responses using ε-net discretization and kernel ridge regression, with sublinear regret bounds.

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

Internal layer-wise entropy reshaping provides nonconformity scores that improve the validity-efficiency trade-off of conformal prediction for LLMs under cross-domain shift compared to text-level baselines.

A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

cs.IR · 2026-04-15 · unverdicted · novelty 7.0

A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.

Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

cs.AI · 2026-04-05 · unverdicted · novelty 7.0

PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

cs.CL · 2025-10-09 · unverdicted · novelty 7.0

HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.

Sampling from Your Language Model One Byte at a Time

cs.CL · 2025-06-17 · unverdicted · novelty 7.0

An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.

Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving

cs.DC · 2025-05-29 · conditional · novelty 7.0

GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.

citing papers explorer

Showing 21 of 21 citing papers after filters.

From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging cs.AI · 2026-05-29 · unverdicted · none · ref 20
Aggregating preference deltas from several weak-weaker model pairs via LoRA adapters and geometric alignment merging improves strong-model performance on reasoning and search benchmarks beyond any single delta.
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? cs.AI · 2026-05-27 · unverdicted · none · ref 10
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure cs.AI · 2026-05-27 · accept · none · ref 35
Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.
Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents cs.AI · 2026-04-05 · unverdicted · none · ref 14
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
GPQA: A Graduate-Level Google-Proof Q&A Benchmark cs.AI · 2023-11-20 · accept · none · ref 2
GPQA is a new graduate-level benchmark where PhD experts score 65% (74% after corrections), skilled non-experts score 34% with web access, and GPT-4 scores 39%, intended to enable realistic tests of human supervision over superhuman AI.
MiniMax Sparse Attention cs.AI · 2026-06-11 · unverdicted · none · ref 27
MiniMax Sparse Attention is a GQA-based block-sparse attention mechanism that selects top-k blocks independently per group and delivers 28.4x per-token compute reduction at 1M context with on-par performance plus 14.2x prefill and 7.6x decode speedups via co-designed GPU kernel.
TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection cs.AI · 2026-05-31 · unverdicted · none · ref 18
TriLens detects hallucinations via per-layer entropy trajectories of logit-lens readouts from three internal modules across LLMs and QA benchmarks.
Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering cs.AI · 2026-05-26 · unverdicted · none · ref 22
DualGraph combines semantic textual KGs with symbolic KGs for semi-structured QA and introduces the SpecsQA benchmark, outperforming baselines on both open and specification questions.
Automatic Layer Selection for Hallucination Detection cs.AI · 2026-05-25 · unverdicted · none · ref 23
FEPoID automatically selects optimal or near-optimal intermediate layers for hallucination detection across LLM architectures and tasks, outperforming prior criteria and baselines, with an added truncation step that further improves performance.
ECUAS$_n$: A family of metrics for principled evaluation of uncertainty-augmented systems cs.AI · 2026-05-19 · unverdicted · none · ref 45 · 3 links
ECUAS_n is a parameterized family of proper scoring rules for jointly assessing prediction accuracy and uncertainty quality in automated decision systems.
XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition cs.AI · 2026-05-14 · unverdicted · none · ref 4
XDomainBench shows LLMs suffer systematic reasoning collapse as domain composition order increases due to direct difficulty and interaction-amplified failures.
The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning cs.AI · 2026-05-11 · unverdicted · none · ref 5
Hard distractors trigger a nonlinear 'First Drop of Ink' performance collapse in long-context LLM reasoning, with most damage from the initial small fraction via disproportionate attention.
CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search cs.AI · 2026-04-19 · unverdicted · none · ref 1
Joint RL training of reasoning agent and document ranker via GRPO with semantic grouping and composite rewards yields consistent gains over fixed-retrieval baselines on seven QA benchmarks.
Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training cs.AI · 2025-09-30 · unverdicted · none · ref 17
Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, and controllable think models relying on compensatory heads instead of specific
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 164
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Q-Delta: Beyond Key-Value Associative State Evolution cs.AI · 2026-06-07 · unverdicted · none · ref 66
Q-Delta extends linear attention by introducing a query-conditioned delta rule that incorporates mixed key-query errors into recurrent state updates for improved stability and performance.
ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems cs.AI · 2026-06-07 · unverdicted · none · ref 27
ConMem distills agent trajectories into structured memory cards organized in a relation-aware graph to enable training-free, relation-coordinated adaptation in LLM-based multi-agent systems.
When AI Says It Feels cs.AI · 2026-06-04 · unverdicted · none · ref 1
LLMs trained via rubric-based self-rewarding RL with GRPO enhanced feeling expression and sycophancy robustness but degraded truthful QA performance.
Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression cs.AI · 2026-04-21 · unverdicted · none · ref 155
LightEdit enables scalable lifelong knowledge editing in LLMs via selective knowledge retrieval and probability suppression during decoding, outperforming prior methods on ZSRE, Counterfact, and RIPE while reducing training costs.
Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work cs.AI · 2026-05-20 · conditional · none · ref 17 · 2 links
QuestBench is a student-constructed benchmark of 256 questions on which current deep research AI systems achieve a mean pass rate of 16.85% and a best-case rate of 57.58%.
KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models cs.AI · 2026-04-19 · unreviewed · ref 24
