TIDE enables the first cross-architecture distillation of dLLMs, improving a 0.6B student by 1.53 average points over baselines when trained from 8B dense and 16B MoE teachers.
Mixed citation behavior. Most common role is unclear (62%).
claims ledger
- background Table A1: Comparison of BAS for frontier models across tasks when varying the risk-prior w(t). Higher scores indicate better alignment with expressed uncertainty. The standard BAS (Uniform: w(t) = 1) serves as the baseline, while Linear and Quadratic weights simulate increasingly safety-critical environments. Identical ECE, different BAS. Consider two models evaluated on four examples with correctness labels Z = [1, 1, 0, 0]. The models produce the following confidence values: Example 1 2 3 4 Z 1 1 0
representative citing papers
JumpLoRA uses JumpReLU gating to induce adaptive sparsity in LoRA blocks, achieving dynamic parameter isolation that prevents task interference and improves continual learning performance over IncLoRA and ELLA.
LLM judges exhibit up to 9.8 percentage point leniency bias from stakes signaling in prompts, acting implicitly without mentioning it in chain-of-thought.
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
EnsembleCert and ScaLabelCert enable tighter and exact certificates for neural network robustness against label-flipping attacks by leveraging white-box information and neural tangent kernel equivalence.
Steered LLM activations are non-surjective: under practical assumptions, they lie outside the set of states reachable from any discrete prompt.
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
The paper proves W[1]-hardness parameterized by dimension d for positivity, zonotope containment, max approximation, and L_p-Lipschitz constants in 2- and 3-layer ReLU networks, showing enumeration methods are optimal under ETH.
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
Establishes an unconditional robustness threshold of 1-1/q for zero-bit tamper-detection codes in watermarking, with matching constructions and experimental confirmation on image models.
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
BEAVER is the first text-to-SQL benchmark from private enterprise data warehouses, revealing SOTA agentic frameworks achieve only 10.8% accuracy on complex real-world queries.
Introduces an SDE-based framework for score-based generative modeling that unifies prior methods, enables predictor-corrector sampling and neural ODE likelihoods, and achieves SOTA unconditional image generation on CIFAR-10.
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
A first-order stochastic optimizer that maintains bias-corrected exponential moving averages of the gradient and its square, dividing the former by the square root of the latter to set per-parameter step sizes.
AutoSP automates sequence parallelism and long-context activation checkpointing via compilation, enabling up to 2.7x longer training contexts on NVIDIA hardware with negligible throughput loss.
C2C is a new testbed where LM agents negotiate differently from humans and targeted prompting raises their win rate from 22.2% to 32.7% across 1,100+ games.
XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
GraphPlanner augments multi-agent LLM routing with a heterogeneous graph memory and RL-optimized MDP workflow generation, delivering up to 9.3% higher accuracy and over 99% lower GPU cost than prior routers while supporting zero-shot generalization.
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
Pliable rejection sampling learns a kernel-based proposal to enable efficient i.i.d. sampling from target distributions f with high-probability correctness and a guarantee on accepted samples.
Stimuli with low intra-modal dispersion among vision models elicit up to twice the cross-modal alignment with language models compared to high-dispersion stimuli.
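One of the summaries above describes a first-order stochastic optimizer that keeps bias-corrected exponential moving averages of the gradient and of its square, then divides the former by the square root of the latter. A minimal sketch of that update rule follows. The function names, the default hyperparameters (lr, beta1, beta2, eps), and the toy quadratic objective are illustrative assumptions, not taken from the listed paper.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one update to a list of scalar parameters.

    t is the 1-based step count. Returns (new_params, new_m, new_v).
    Hyperparameter defaults are assumed values chosen for illustration.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and of its square.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction: both averages start at zero, so early values
        # are scaled up to compensate.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Per-parameter step: first moment divided by root of second moment.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def minimize_quadratic(steps=2000, lr=0.05):
    """Toy usage: minimize f(x, y) = (x - 3)^2 + 10 * (y + 1)^2."""
    params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        grads = [2 * (params[0] - 3), 20 * (params[1] + 1)]
        params, m, v = adam_step(params, grads, m, v, t, lr=lr)
    return params
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so each parameter moves by roughly lr regardless of gradient scale. This is why the two coordinates of the toy objective progress at a similar pace despite their different curvatures.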
citing papers explorer
-
Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
An uncertainty-aware sequential selection algorithm fits scaling laws to near-full accuracy using only about 10% of the total experimental training budget across diverse benchmarks.
-
Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors
Explicit prompt baselines cut NLI contradictions by up to 42.6% with zero training, while learned gated context projectors deliver a 34% reduction in planning-stage contradictions and 50% higher cross-stage entailment on DriveLM-nuScenes.
-
Only Brains Align with Brains: Cross-Region Alignment Patterns Expose Limits of Normative Models
Alignment pattern analysis reveals that models aligned to individual brain ROIs do not reproduce the stable cross-region alignment profiles observed across human subjects.
-
SafeDream: Safety World Model for Proactive Early Jailbreak Detection
SafeDream uses a safety world model, CUSUM accumulation, and contrastive latent-space imagination to detect multi-turn jailbreaks 1.06-1.20 turns early on average across benchmarks while keeping competitive false-positive rates.
-
SAVE: A Generalizable Framework for Multi-Condition Single-Cell Generation with Gene Block Attention
SAVE is a conditional Transformer framework with gene block attention and flow matching that generates multi-condition single-cell data and generalizes better than prior methods to unseen condition combinations.
-
Faster LLM Inference via Sequential Monte Carlo
SMC-SD replaces rejection sampling with particle resampling in speculative decoding to deliver 2.36x speedup over standard SD and 5.2x over autoregressive decoding while staying within 3% of target accuracy.
-
LLMs Corrupt Your Documents When You Delegate
LLMs, including frontier models, corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, and agentic tools do not mitigate the issue.
-
SCATR: Simple Calibrated Test-Time Ranking
SCATR calibrates a simple scorer from base-model hidden representations on limited data to improve Best-of-N response selection, delivering up to 9% gains over heuristics with orders-of-magnitude less compute than fine-tuning or PRMs.
-
ProtoTTA: Prototype-Guided Test-Time Adaptation
ProtoTTA is a test-time adaptation framework for prototype models that uses intermediate prototype signals and entropy minimization to improve robustness and semantic focus under distribution shifts.
-
Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography
LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts and enable ICD-10 code prediction on MIMIC-IV-ECHO.
-
Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.
-
Quantifying Cross-Query Contradictions in Multi-Query LLM Reasoning
A benchmark and a solver-augmented method reduce cross-query contradictions in LLMs (SetCons rises from 0.56 to 0.94) while preserving per-query accuracy across four domains.
-
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.
-
Reward Design for Physical Reasoning in Vision-Language Models
Accuracy-based rewards outperform SFT and other reward variants in GRPO training of VLMs on the PhyX physics benchmark, with attention-weight rewards raising spatial reasoning accuracy from 0.27 to 0.50.
-
Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models
An adaptive conformal prediction approach for LLMs enables prompt-dependent calibration that improves conditional coverage for factuality while preserving marginal guarantees and supporting selective prediction.
-
Calibrate-Then-Delegate: Safety Monitoring with Risk and Budget Guarantees via Model Cascades
CTD trains a lightweight DV probe to predict escalation benefits and calibrates its threshold via multiple hypothesis testing on held-out data to deliver finite-sample guarantees on delegation rate while outperforming uncertainty-based cascades on safety tasks.
-
Beyond Arrow's Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration
Fairness emerges from multi-agent negotiation in a hospital triage task, where joint allocations satisfy ethical criteria that neither the aligned nor the biased agent achieves in isolation.
-
Introspective Diffusion Language Models
I-DLM matches same-scale autoregressive model quality in diffusion language models by enforcing introspective consistency via strided decoding, outperforming prior DLMs on 15 benchmarks including 69.6 on AIME-24.
-
Sanity Checks for Agentic Data Science
Sanity checks using input perturbations can reveal when agentic data science conclusions lack support from stable signal, as shown on synthetic data and 11 real datasets where 6 affirmative claims were unsupported.
-
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.
-
CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts
CodeQuant unifies learnable rotation smoothing with cluster-centroid absorption of outliers to reduce quantization error in low-precision MoE models, reporting up to 4.15x speedup and higher accuracy than prior PTQ methods.
-
Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models
Medically fine-tuned VLMs exhibit fragile performance that degrades with task difficulty and shows no reliable advantage over general models, with high sensitivity to prompt changes.
-
The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise
Expert specialization in MoEs is an emergent effect of hidden state geometry due to linear routers, not domain expertise, as confirmed empirically across models and explained by a proof on load-balancing effects.
-
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
BERT-as-a-Judge fine-tunes a BERT encoder on synthetic question-candidate-reference triplets to judge answer correctness, outperforming lexical baselines and matching larger LLM judges across 36 models and 15 tasks.
-
OASIS: Online Activation Subspace Learning for Memory-Efficient Training
OASIS tracks an evolving low-dimensional activation subspace to project activations, gradients, and optimizer states, cutting peak memory up to 2x versus full fine-tuning while matching performance on finetuning and pretraining tasks.
-
MixFlow: Mixed Source Distributions Improve Rectified Flows
Mixing unconditional Gaussian noise with a κ-conditioned source during training of rectified flows reduces path curvature, yielding 12% better FID scores and faster sampling than standard rectified flows.
-
Efficient RL Training for LLMs with Experience Replay
Well-designed experience replay buffers reduce inference compute in LLM RL post-training while maintaining or improving performance and preserving policy entropy.
-
EvoLen: Evolution-Guided Tokenization for DNA Language Model
EvoLen is an evolution-guided tokenizer that stratifies DNA sequences by conservation signals, applies group-specific BPE, and uses dynamic programming decoding to improve preservation of functional motifs over standard BPE.
-
Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest
Many LLMs prioritize company ad incentives over user welfare by recommending pricier sponsored products, disrupting purchases, or concealing prices in comparisons.
-
Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM
DiADEM learns demographic importance weights to model annotator disagreement distributions and outperforms LLM and neural baselines on disagreement tracking in DICES and VOICED benchmarks.
-
Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs
Bit-by-Bit achieves stable 2-bit quantization of Llama models via block-wise progressive training and outlier channel splitting, reporting only 2.25 WikiText2 PPL degradation versus full precision while outperforming prior QAT baselines.
-
Linear Representations of Hierarchical Concepts in Language Models
Language models encode concept hierarchies as linear transformations that are domain-specific yet structurally similar across domains.
-
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
-
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between discovery and execution.
-
When Do We Need LLMs? A Diagnostic for Language-Driven Bandits
Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.
-
Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment
Multilingual retrievers show English bias in mixed-language pools; a small-data training strategy improves cross-lingual alignment and reduces the bias.
-
Planning to Explore: Curiosity-Driven Planning for LLM Test Generation
CovQValue achieves 51-77% higher branch coverage than greedy baselines on TestGenEval Lite by using coverage feedback and LLM-estimated Q-values to select informative test plans.
-
Vintix II: Decision Pre-Trained Transformer is a Scalable In-Context Reinforcement Learner
Scaling Decision Pre-Trained Transformer with Flow Matching on hundreds of tasks yields an agent with improved generalization in in-context reinforcement learning.
-
Cheap Talk, Empty Promise: Frontier LLMs easily break public promises for self-interest
LLMs deviate from announced actions in 56.6% of scenarios across six games and nine models, frequently without awareness of breaking promises.
-
Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.
-
When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation
Agent verdict agreement in multi-agent hate speech moderation correlates with lower human annotator disagreement, with large effect sizes, motivating uncertainty-surfacing designs over consensus-seeking.
-
Testing the Limits of Truth Directions in LLMs
Truth directions in LLMs are not universal but depend heavily on model layer, task type and difficulty, and prompt instructions.
-
Align then Train: Efficient Retrieval Adapter Learning
A two-stage adapter method aligns query and document embedding spaces to improve dense retrieval for complex queries using lightweight encoders and few labels.
-
Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM
Flash-Mono uses a recurrent feed-forward frontend with cross-attention to predict poses and 2D Gaussian surfel attributes for monocular SLAM, achieving 10x speedup and state-of-the-art tracking and mapping.
-
Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.
-
Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments
LLM alignments redirect stereotypes to implicit tasks instead of removing them, producing bias score divergences up to 0.43 across explicit and implicit probes in audits of seven models.
-
Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge
RL with an LLM judge provides rewards on unlabeled data for knowledge distillation, yielding gains on math benchmarks when mixed with verifiable rewards.
-
No Single Best Model for Diversity: Learning a Router for Sample Diversity
No single LLM is best for response diversity; a router selecting the per-prompt best model raises diversity coverage from 23.8% to 26.3% on NB-Wildchat and generalizes to new data.
-
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multilingual version.
-
Multimodal Language Models Cannot Spot Spatial Inconsistencies
Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.