TIDE enables the first cross-architecture distillation of dLLMs, improving a 0.6B student by 1.53 average points over baselines when trained from 8B dense and 16B MoE teachers.
Mixed citation behavior. Most common role is unclear (62%).
claims ledger
- background Table A1: Comparison of BAS for frontier models across tasks when varying the risk-prior w(t). Higher scores indicate better alignment with expressed uncertainty. The standard BAS (Uniform: w(t) = 1) serves as the baseline, while Linear and Quadratic weights simulate increasingly safety-critical environments. Identical ECE, different BAS. Consider two models evaluated on four examples with correctness labels Z = [1, 1, 0, 0]. The models produce the following confidence values: Example 1 2 3 4 Z 1 1 0
representative citing papers
JumpLoRA uses JumpReLU gating to induce adaptive sparsity in LoRA blocks, achieving dynamic parameter isolation that prevents task interference and improves continual learning performance over IncLoRA and ELLA.
LLM judges exhibit a leniency bias of up to 9.8 percentage points when prompts signal the stakes; the bias acts implicitly and is not mentioned in the chain-of-thought.
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
EnsembleCert and ScaLabelCert enable tighter and exact certificates for neural network robustness against label-flipping attacks by leveraging white-box information and neural tangent kernel equivalence.
Steered LLM activations are non-surjective: under practical assumptions, they lie outside the set of states reachable from any discrete prompt.
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
The paper proves W[1]-hardness parameterized by dimension d for positivity, zonotope containment, max approximation, and L_p-Lipschitz constants in 2- and 3-layer ReLU networks, showing enumeration methods are optimal under ETH.
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
Establishes an unconditional robustness threshold of 1-1/q for zero-bit tamper-detection codes in watermarking, with matching constructions and experimental confirmation on image models.
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
BEAVER is the first text-to-SQL benchmark from private enterprise data warehouses, revealing SOTA agentic frameworks achieve only 10.8% accuracy on complex real-world queries.
Introduces an SDE-based framework for score-based generative modeling that unifies prior methods, enables predictor-corrector sampling and neural ODE likelihoods, and achieves SOTA unconditional image generation on CIFAR-10.
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
A first-order stochastic optimizer that maintains bias-corrected exponential moving averages of the gradient and its square, dividing the former by the square root of the latter to set per-parameter step sizes.
AutoSP automates sequence parallelism and long-context activation checkpointing via compilation, enabling up to 2.7x longer training contexts on NVIDIA hardware with negligible throughput loss.
C2C is a new testbed where LM agents negotiate differently from humans and targeted prompting raises their win rate from 22.2% to 32.7% across 1,100+ games.
XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
GraphPlanner augments multi-agent LLM routing with a heterogeneous graph memory and RL-optimized MDP workflow generation, delivering up to 9.3% higher accuracy and over 99% lower GPU cost than prior routers while supporting zero-shot generalization.
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
Pliable rejection sampling learns a kernel-based proposal to enable efficient i.i.d. sampling from target distributions f with high-probability correctness and a guarantee on accepted samples.
Stimuli with low intra-modal dispersion among vision models elicit up to twice the cross-modal alignment with language models compared to high-dispersion stimuli.
citing papers explorer
- Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
A pruning technique called Reasoning-Aware Compression (RAC) jointly reconstructs input and chain-of-thought activations to preserve reasoning performance better than standard methods when compressing models like DeepSeek-R1.
- Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives
Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.
- Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.
- Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering
PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.
- Scalable Option Learning in High-Throughput Environments
SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.
- BugScope: Learn to Find Bugs Like Human
BugScope structures LLM bug detection into three human-mirroring steps and distills guidelines from examples, reaching 0.87 F1 on 33 real bugs while outperforming Claude and Cursor tools and uncovering 184 new issues in production code.
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
- Should We Still Pretrain Encoders with Masked Language Modeling?
Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improves when initialized from pretrained CLM models.
- Preference Learning Unlocks LLMs' Psycho-Counseling Skills
A new expert-principle preference dataset enables an 8B LLM to reach an 87% win rate vs GPT-4o on counseling responses through standard preference optimization.
- Tight Clusters Make Specialized Experts
Introduces Adaptive Clustering router for MoE models that scales features to identify tight expert clusters, yielding faster convergence, robustness to corruption, and performance gains.
- Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation
Proposes three metrics for inter-column logical relationships in synthetic tabular data and reports that current generators often fail to preserve them on an industrial dataset.
- SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation
SyMerge merges models via single-layer adaptation and expert-guided self-labeling to achieve task synergy, reporting SOTA results on vision, dense prediction, and NLP tasks.
- Score-matching-based Structure Learning for Temporal Data on Networks
PICK adds a parent-finding subroutine for leaf nodes to speed up pruning in score-matching causal discovery, extending it from i.i.d. data to static and temporal network data.
- Improving Music Source Separation with Diffusion and Consistency Refinement
Diffusion-based refinement followed by consistency distillation improves music source separation quality and inference speed across U-Net and BS-RoFormer backbones on Slakh2100 and MUSDB18.
- Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies
PGT optimizes latent goal embeddings for frozen policies via trajectory-level preference objectives, reporting 72-81.6% relative gains on 17 Minecraft tasks and 13.4% better OOD performance than fine-tuning.
- Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization
Diversity-regularized DPO fine-tuning of ProteinMPNN improves structural similarity scores by at least 8% over the base model and sequence diversity by up to 20% over standard DPO for peptide inverse folding on OpenFold structures.
- EventFlow: Forecasting Temporal Point Processes with Flow Matching
EventFlow applies flow matching to learn joint distributions over event times for temporal point processes, reporting 20-53% lower forecast error than autoregressive baselines on standard TPP benchmarks with fewer sampling calls.
- Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
Llama3-8b-Instruct recognizes its own outputs via a residual-stream vector associated with self-authorship that can be steered to control authorship claims and perceptions.
- Deep Learning Alternatives of the Kolmogorov Superposition Theorem
ActNet is a new KST-based neural network that outperforms KANs and competes with MLPs in PINN benchmarks for PDE simulation tasks.
- Safe Bayesian Optimization for Complex Control Systems via Additive Gaussian Processes
SafeCtrlBO combines additive GP kernels with boundary-based safe-set expansion to achieve efficient safe optimization of multi-loop controllers on benchmarks and a PMSM hardware platform.
- ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection
ConjNorm reframes OOD detection score design as optimizing norm p in an exponential family density model via a Bregman divergence theorem, with a tractable Monte Carlo estimator, claiming SOTA gains on CIFAR-100 and ImageNet-1K.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models
Sparse autoencoders applied to language model activations yield more interpretable and monosemantic features than alternative approaches, enabling finer causal analysis on the indirect object identification task.
- When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks
1D serialization of layout-defined tasks degrades LLM performance more than native 2D image inputs in a controlled testbed of matrix transpose, Conway's Game of Life, and LU decomposition.
- Understanding Representation Gaps Across Scales in Tropical Tree Species Classification from Drone Imagery
Close-up UAV images yield higher tree species classification accuracy than top-view imagery, with the gap increasing for rare species, and self-supervised cross-scale alignment is proposed to bridge them for canopy-level monitoring.
- LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs
LayerBoost selectively replaces or removes attention in non-critical transformer layers to cut inference latency up to 68% while recovering quality via brief distillation.
- Beyond Distribution Sharpening: The Importance of Task Rewards
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
- Majority Voting for Code Generation
Functional Majority Voting selects code by runtime agreement on tests, boosting LiveCodeBench performance and serving as an aggregation method for label-free test-time RL without exceeding base model limits.
- Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation
A multi-objective LLM unlearning approach standardizes data into unified domain representations and applies bidirectional logit distillation to align objectives and achieve balanced state-of-the-art results across efficacy, utility, boundary preservation, and robustness.
- SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling
SWE-TRACE optimizes long-horizon SWE agents via token-efficient SFT distillation, rubric-augmented process reward models in RL, and heuristic test-time scaling, yielding higher benchmark resolution rates with reduced tokens and latency.
- Learning Adaptive Reasoning Paths for Efficient Visual Reasoning
AVR trains vision-language models to adaptively select among full reasoning, perception-only, or direct-answer formats using a modified policy optimization method, reducing token use by 50-90% with little accuracy loss.
- Context Sensitivity Improves Human-Machine Visual Alignment
Context-sensitive similarity computation from embeddings improves odd-one-out accuracy by up to 15% over context-insensitive baselines for human visual alignment.
- Peer-Predictive Self-Training for Language Model Reasoning
Multiple language models self-train collaboratively by treating their aggregated responses as internal targets scaled by PMI, yielding 2.2-4.3 percentage point accuracy gains on math benchmarks without external supervision.
- Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs
HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions than prior methods.
- Your Model Diversity, Not Method, Determines Reasoning Strategy
The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.
- Toward World Models for Epidemiology
Epidemiology should be treated as a controlled, partially observed dynamical system where world models enable reasoning about latent burden, endogenous surveillance, and counterfactual interventions.
- Biologically-Grounded Multi-Encoder Architectures as Developability Oracles for Antibody Design
CrossAbSense oracles using frozen PLM encoders plus self- or cross-attention decoders improve prediction accuracy by 12-20% on three of five developability assays for therapeutic IgGs, with architecture choices revealing that aggregation depends on single-chain signals while stability requires heavy
- Learning to Query History: Nonstationary Classification via Learned Retrieval
A learned retrieval system lets classifiers draw on historical examples to maintain accuracy under distribution shifts by treating the problem as time series prediction.
- I Can't Believe TTA Is Not Better: When Test-Time Augmentation Hurts Medical Image Classification
Test-time augmentation consistently degrades accuracy in medical image classification on MedMNIST v2 benchmarks due to distribution shifts between augmented test inputs and training data.
- Early Stopping for Large Reasoning Models via Confidence Dynamics
CoDE-Stop early-stops reasoning models by tracking confidence dynamics in intermediate answers, cutting token use 25-50% while preserving or improving accuracy versus full-length generation.
- Individual and Combined Effects of English as a Second Language and Typos on LLM Performance
Combining ESL variants and typos produces larger LLM performance drops than either factor alone, with non-additive effects clearest on closed-ended tasks.
- Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis
A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.
- Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents
Citation URLs from LLMs and research agents are hallucinated 3-13% of the time and non-resolving 5-18% of the time, with a released tool that reduces failures by 6-79x.
- Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts
Domain-adapted utterance-level retrieval raises Cohen's kappa for tutoring dialogue act annotation to 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, beating no-retrieval baselines by large margins across three LLMs.
- Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs
LLMs reach near-ceiling performance on explicit QFT and string theory derivations but degrade when required to reconstruct omitted reasoning steps or resolve implicit tensions under global consistency constraints.
- Embedding-Only Uplink for Onboard Retrieval Under Shift in Remote Sensing
Embedding-only uplink enables flexible onboard retrieval for remote sensing under distribution shifts, with kNN superior for cloud classification and centroids for temporal change detection.
- Distilling Genomic Models for Efficient mRNA Representation Learning via Embedding Matching
Embedding-based distillation shrinks a large genomic model 200-fold into a compact mRNA specialist that reaches state-of-the-art results among similarly sized models.
- Training for Compositional Sensitivity Reduces Dense Retrieval Generalization
Training dense retrievers for compositional sensitivity via structure-targeted negatives reduces zero-shot NanoBEIR nDCG@10 by 8-40% across backbones while only partially improving pooled-space separation of structural variants.
- What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?
An empirical study of JEPA world models identifies architecture, training objective, and planning choices that yield a model outperforming DINO-WM and V-JEPA-2-AC on navigation and manipulation tasks.
- Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning
Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.
- How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Curriculum pretraining with ascending data quality outperforms random order under constant learning rate but loses most benefit under standard decay; moderate decay or final-checkpoint averaging recovers a 1.64% average benchmark gain on 1.5B models trained for 30B tokens.