FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension
Tool reference. 70% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.
claims ledger
- dataset + TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, organized into two groups by reasoning depth. Multi-hop benchmarks consist of HotpotQA [28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31]. Single-hop benchmarks consist of Natural Questions (NQ) [32], TriviaQA [33], and PopQA [34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. We report Ex
- dataset and TriviaQA. For Natural Questions (NQ), we use the dpr-w100 split from ir_datasets to represent open-domain, real-world user queries [34, 35, 36]. For PubMedQA, we adopt the pqa_labeled configuration to model medical question answering, where accurate technical retrieval is needed [37]. For TriviaQA, we employ the rc (reading comprehension) configuration [38]. Using a fixed random seed, we sample 50 benign queries from each dataset for the utility-oriented evaluation of retrieval and generatio
- dataset 7750 248.2226 17.8641 266.0867 0.3658 BnB INT8 138.96 139.09 0.9880 56.3886 056.3886 0.0265 NF4 138.96 144.16 0.9124 155.4506 0 155.4506 0.0750 FP4 138.96 138.10 0.9196 145.1767 0 145.1767 0.1306 GPTQ GPTQ-4bit 138.96 140.37 0.9298 136.7867 0 136.7867 0.1422 Benchmarks and scoring. Five benchmarks: MMLU [28], ARC [29] (multiple-choice knowledge), TriviaQA [30], SQuAD [31] (short-horizon QA), and GSM8K [32] (multi-step reasoning). All risks are computed teacher-forced (prompt c and targets y scored in
- dataset significantly on knowledge-intensive and adversarial benchmarks, collapsing on TruthfulQA. We attribute this to the absence of a principled density model, making it unable to generalize across different instruction-tuning regimes. TruthfulQA remains the hardest setting for all methods, as its questions target misconceptions deeply encoded in pretraining weights [16]. Yet, PCNET leads across all models also on this dataset, with Mistral-7B achieving the highest AUROC, consistent with the hypothesis
- method (by non-expert validators who are experts in other domains; at least 15 min, avg ~37 min, allowing Google) Part 1: answer Q (correct answer & explanations not shown) Part 2: provide feedback on the following dimensions (correct answer & explanations shown to the validator) Include this Q in the DIAMOND set because (1) 2 out of 2 expert validators agree* (2) ≤ 1 out of 3 non-expert validators answers correctly • Post-hoc agreement: Is the answer uncontroversial? • Is your background sufficient to answe
- dataset Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alex Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-LM: In search of the next generation of training sets for language models. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=CNWdWn47IE. [40] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for read
representative citing papers
ArBG replaces flow-based methods with autoregressive models for Boltzmann sampling, showing gains on peptide benchmarks and a 132M-parameter model Robin cutting zero-shot energy error by over 60% on 8-residue systems.
LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
MemTrain introduces two coupled self-supervised proxy tasks on Wikipedia corpora to train general context-memory capabilities in LLMs, reporting gains of up to 17.67 points on long-text and search-based QA benchmarks over direct post-training.
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
Aggregating preference deltas from several weak-weaker model pairs via LoRA adapters and geometric alignment merging improves strong-model performance on reasoning and search benchmarks beyond any single delta.
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.
QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
SemGrad measures LLM uncertainty via gradients in semantic space using a Semantic Preservation Score to select embeddings, with HybridGrad combining it with parameter gradients to outperform sampling-based baselines especially when multiple responses are valid.
HIVE detects hallucinations in diffusion LLMs by selecting and conditioning on hidden evidence from denoising trajectories, achieving up to 0.9236 AUROC on QA benchmarks.
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
Establishes the first rigorous framework for continuous semantic caching of LLM responses using ε-net discretization and kernel ridge regression, with sublinear regret bounds.
Internal layer-wise entropy reshaping provides nonconformity scores that improve the validity-efficiency trade-off of conformal prediction for LLMs under cross-domain shift compared to text-level baselines.
A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.
An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.
GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.
citing papers explorer
- From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging
Aggregating preference deltas from several weak-weaker model pairs via LoRA adapters and geometric alignment merging improves strong-model performance on reasoning and search benchmarks beyond any single delta.
- LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
- Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure
Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.
- Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark
GPQA is a new graduate-level benchmark where PhD experts score 65% (74% after corrections), skilled non-experts score 34% with web access, and GPT-4 scores 39%, intended to enable realistic tests of human supervision over superhuman AI.
- MiniMax Sparse Attention
MiniMax Sparse Attention is a GQA-based block-sparse attention mechanism that selects top-k blocks independently per group and delivers 28.4x per-token compute reduction at 1M context with on-par performance plus 14.2x prefill and 7.6x decode speedups via co-designed GPU kernel.
- TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection
TriLens detects hallucinations via per-layer entropy trajectories of logit-lens readouts from three internal modules across LLMs and QA benchmarks.
- Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering
DualGraph combines semantic textual KGs with symbolic KGs for semi-structured QA and introduces the SpecsQA benchmark, outperforming baselines on both open and specification questions.
- Automatic Layer Selection for Hallucination Detection
FEPoID automatically selects optimal or near-optimal intermediate layers for hallucination detection across LLM architectures and tasks, outperforming prior criteria and baselines, with an added truncation step that further improves performance.
- ECUAS$_n$: A family of metrics for principled evaluation of uncertainty-augmented systems
ECUAS_n is a parameterized family of proper scoring rules for jointly assessing prediction accuracy and uncertainty quality in automated decision systems.
- XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition
XDomainBench shows LLMs suffer systematic reasoning collapse as domain composition order increases due to direct difficulty and interaction-amplified failures.
- The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning
Hard distractors trigger a nonlinear 'First Drop of Ink' performance collapse in long-context LLM reasoning, with most damage from the initial small fraction via disproportionate attention.
- CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search
Joint RL training of reasoning agent and document ranker via GRPO with semantic grouping and composite rewards yields consistent gains over fixed-retrieval baselines on seven QA benchmarks.
- Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training
Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, and controllable think models relying on compensatory heads instead of specific
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
- Q-Delta: Beyond Key-Value Associative State Evolution
Q-Delta extends linear attention by introducing a query-conditioned delta rule that incorporates mixed key-query errors into recurrent state updates for improved stability and performance.
- ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems
ConMem distills agent trajectories into structured memory cards organized in a relation-aware graph to enable training-free, relation-coordinated adaptation in LLM-based multi-agent systems.
- When AI Says It Feels
LLMs trained via rubric-based self-rewarding RL with GRPO enhanced feeling expression and sycophancy robustness but degraded truthful QA performance.
- Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression
LightEdit enables scalable lifelong knowledge editing in LLMs via selective knowledge retrieval and probability suppression during decoding, outperforming prior methods on ZSRE, Counterfact, and RIPE while reducing training costs.
- Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
QuestBench is a student-constructed benchmark of 256 questions on which current deep research AI systems achieve a mean pass rate of 16.85% and a best-case rate of 57.58%.
- KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models