TIDE enables the first cross-architecture distillation of dLLMs, improving a 0.6B student by 1.53 average points over baselines when trained from 8B dense and 16B MoE teachers.
super hub Mixed citations
Mixed citation behavior. Most common role is unclear (62%).
claims ledger
- background Table A1: Comparison of BAS for frontier models across tasks when varying the risk-prior w(t). Higher scores indicate better alignment with expressed uncertainty. The standard BAS (Uniform: w(t) = 1) serves as the baseline, while Linear and Quadratic weights simulate increasingly safety-critical environments. Identical ECE, different BAS. Consider two models evaluated on four examples with correctness labels Z = [1, 1, 0, 0]. The models produce the following confidence values: Example 1 2 3 4 Z1 1 0
representative citing papers
JumpLoRA uses JumpReLU gating to induce adaptive sparsity in LoRA blocks, achieving dynamic parameter isolation that prevents task interference and improves continual learning performance over IncLoRA and ELLA.
LLM judges exhibit up to 9.8 percentage points of leniency bias when prompts signal the stakes of the judgment; the bias acts implicitly and goes unmentioned in the judges' chain-of-thought.
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
EnsembleCert and ScaLabelCert enable tighter and exact certificates for neural network robustness against label-flipping attacks by leveraging white-box information and neural tangent kernel equivalence.
Steered LLM activations are non-surjective: under practical assumptions, they lie outside the set of states reachable from any discrete prompt.
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
The paper proves W[1]-hardness parameterized by dimension d for positivity, zonotope containment, max approximation, and L_p-Lipschitz constants in 2- and 3-layer ReLU networks, showing enumeration methods are optimal under ETH.
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
Establishes an unconditional robustness threshold of 1-1/q for zero-bit tamper-detection codes in watermarking, with matching constructions and experimental confirmation on image models.
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
BEAVER is the first text-to-SQL benchmark from private enterprise data warehouses, revealing SOTA agentic frameworks achieve only 10.8% accuracy on complex real-world queries.
Introduces an SDE-based framework for score-based generative modeling that unifies prior methods, enables predictor-corrector sampling and neural ODE likelihoods, and achieves SOTA unconditional image generation on CIFAR-10.
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
A first-order stochastic optimizer that maintains bias-corrected exponential moving averages of the gradient and its square, dividing the former by the square root of the latter to set per-parameter step sizes.
AutoSP automates sequence parallelism and long-context activation checkpointing via compilation, enabling up to 2.7x longer training contexts on NVIDIA hardware with negligible throughput loss.
C2C is a new testbed where LM agents negotiate differently from humans and targeted prompting raises their win rate from 22.2% to 32.7% across 1,100+ games.
XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
GraphPlanner augments multi-agent LLM routing with a heterogeneous graph memory and RL-optimized MDP workflow generation, delivering up to 9.3% higher accuracy and over 99% lower GPU cost than prior routers while supporting zero-shot generalization.
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
Pliable rejection sampling learns a kernel-based proposal to enable efficient i.i.d. sampling from target distributions f with high-probability correctness and a guarantee on accepted samples.
Stimuli with low intra-modal dispersion among vision models elicit up to twice the cross-modal alignment with language models compared to high-dispersion stimuli.
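The optimizer entry in the list above (bias-corrected exponential moving averages of the gradient and its square) can be sketched in a few lines. This is a minimal illustration, not code from the cited work: the function name `adam_step`, the flat-list representation of parameters, and the hyperparameter defaults (`lr`, `beta1`, `beta2`, `eps`) are assumptions chosen for the example.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update over flat lists of floats; t is the 1-based step count.

    Hypothetical helper for illustration; defaults are assumed values.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction compensates for the zero initialisation.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Per-parameter step: first moment divided by root of second moment.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Two parameters whose gradients differ by a factor of 100.
params, m, v = adam_step([1.0, -3.0], [2.0, -200.0], [0.0, 0.0], [0.0, 0.0], t=1)
```

On this first step both coordinates move by roughly `lr` even though their gradients differ by two orders of magnitude, which is the per-parameter step-size behavior the summary describes.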
citing papers explorer
-
Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
A procedural engine generates 200k+ synthetic geometry diagrams to fine-tune VLMs for referring image segmentation on abstract diagrams, yielding 49% IoU and 85% Buffered IoU with Florence-2 versus under 1% zero-shot.
-
Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection
Gen-SSD improves chain-of-thought distillation by letting the student model guide the teacher's generation process through real-time selection of learnable reasoning branches, yielding 5.9-point gains over standard KD on math benchmarks.
-
M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency
M2-Verify is a new multidomain benchmark dataset for multimodal scientific claim consistency that reveals state-of-the-art models drop from 85.8% to 61.6% Micro-F1 on complex perturbations and produce hallucinated explanations.
-
Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention
Stochastic Attention applies random permutations to token sequences in sliding-window attention to achieve exponentially growing receptive fields and full coverage in logarithmic layers, outperforming standard SWA in language model pre-training and inference.
-
QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
QuanBench+ is a new multi-framework benchmark showing LLMs reach 43-60% Pass@1 on quantum code tasks across three libraries, rising to 67-83% with error-feedback repair, yet performance remains strongly framework-dependent.
-
Robust Test-time Video-Text Retrieval: Benchmarking and Adapting for Query Shifts
Introduces a benchmark for video-text retrieval under query shifts, together with HAT-VTR, a test-time adaptation method that reduces hubness to improve retrieval robustness.
-
PerfCoder: Large Language Models for Interpretable Code Performance Optimization
PerfCoder is a family of LLMs trained on optimization trajectories with human annotations and runtime-based preference alignment that achieves higher runtime speedups and optimization rates on the PIE benchmark than prior models while producing interpretable feedback.
-
SAQ: Stabilizer-Aware Quantum Error Correction Decoder
A dual-stream transformer decoder with constraint-aware post-processing achieves error thresholds of 10.99% and 18.6% on toric codes, approaching ML bounds while scaling linearly.
-
OXtal: An All-Atom Diffusion Model for Organic Crystal Structure Prediction
OXtal recovers experimental organic crystal structures with conformer RMSD below 0.5 Å and over 80% packing similarity using a lattice-free diffusion model trained on 600K structures.
-
Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models
F2D2 jointly distills sampling and likelihood computation in flow-based models by adding a divergence head to a few-step flow map, achieving accurate log-likelihoods at 2-10 NFEs while preserving sample quality.
-
SAM 3: Segment Anything with Concepts
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
-
MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
MTR-DuplexBench is a multi-round benchmark for full-duplex speech language models that evaluates turn consistency, dialogue quality, instruction following, and safety.
-
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
Think-at-Hard selectively triggers latent iterations only on hard tokens via a neural decider and depth-aware LoRA, yielding 3.8-6.8% gains over baselines on nine reasoning benchmarks while iterating on just 7% of tokens.
-
Score-based Membership Inference on Diffusion Models
Presents SimA, a score-based single-query membership inference attack for diffusion models and LDMs that uses denoiser output norm to reveal training set proximity and outperforms multi-query baselines on eight datasets.
-
AudioMoG: Guiding Audio Generation with Mixture-of-Guidance
AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.
-
ZeroSiam: An Efficient Asymmetry for Test-Time Entropy Optimization without Collapse
ZeroSiam is an asymmetric architecture using a learnable predictor and stop-gradient that prevents collapse in test-time entropy minimization while also regularizing biased signals for improved performance.
-
Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia
Mini-Mafia supplies an analytical model logit(p) = v*(m-d) for mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with 76.6% Brier-score improvement over random.
-
Transformers Can Learn Connectivity in Some Graphs but Not Others
Transformers learn connectivity on low-dimensional grid graphs but fail on high-dimensional grids or graphs with many disconnected components, with larger models showing better generalization on grids.
-
Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
Neural-MedBench reveals sharp performance drops in state-of-the-art VLMs on reasoning-intensive neurology tasks compared to conventional classification benchmarks, with reasoning failures dominating errors.
-
Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.
-
LayerNorm Induces Recency Bias in Transformer Decoders
Stacked causal self-attention combined with LayerNorm induces recency bias in Transformer decoders, reversing the earlier-token bias seen in attention alone.
-
LogitTrace: Detecting Benchmark Contamination via Layerwise Logit Trajectories
LogitTrace detects benchmark contamination by showing that contaminated inputs produce earlier stabilization in layerwise logit trajectories while clean inputs show more gradual accumulation.
-
Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification
MoTIF adds temporal self-attention and automatic VLM-based concept discovery to concept bottleneck models for interpretable video classification, showing gains over prior global CBMs on benchmarks.
-
Explicit and Effectively Symmetric Schemes for Neural SDEs on Lie Groups
Introduces the first explicit near-reversible integrator for neural SDEs on Lie groups by extending EES schemes with Bazavov's commutator-free lift, achieving better stability and up to 10x memory reduction on manifold benchmarks.
-
Revisiting Image Manipulation Localization under Realistic Manipulation Scenarios
RITA models image manipulation localization as ordered sequence prediction to handle multi-step editing processes, and introduces the HSIM benchmark and the HSS metric.
-
Diffusion and Flow-based Copulas: Forgetting and Remembering Dependencies
Diffusion and flow processes forget dependencies to define valid copulas then learn to remember them for density estimation and sampling, outperforming prior copula methods on complex datasets.
-
On the Convergence of Muon and Beyond
Muon-MVR2 attains the optimal anytime convergence rate of Õ(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
-
Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
Visual-TableQA is a new open-domain benchmark of rendered table images and complex QA pairs created via multi-LLM collaborative generation, with fine-tuned models showing robust generalization to external tests.
-
Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
Top-H decoding is a computationally efficient greedy algorithm for an entropy-constrained mass maximization problem that improves the creativity-coherence trade-off over min-p sampling in LLM text generation.
-
MetaLint: Easy-to-Hard Generalization for Code Linting
MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.
-
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling
Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
-
Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation
S^2-Bench is a new one-to-many benchmark for natural language-driven molecule generation with three tasks, and OpenMolIns is an instruction dataset enabling Llama3.1-8B to outperform GPT-4o and Claude-3.5 on it.
-
Tighter Performance Theory of FedExProx
New analysis framework yields tighter linear convergence for FedExProx on non-strongly convex quadratics and PL functions, proving outperformance over GD once communication costs are counted.
-
Power-Softmax: Towards Secure LLM Inference over Encrypted Data
Power-Softmax is a new HE-compatible attention variant that permits training and inference of billion-parameter polynomial LLMs with performance matching standard transformers.
-
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
-
ImProver: Agent-Based Automated Proof Optimization
ImProver is an LLM agent using Chain-of-States, error-correction, and retrieval to rewrite Lean proofs for arbitrary user-defined optimization criteria like shortness and readability.
-
What Causes Performance Degradation in Cross-Subject EEG Classification?
Controlled experiments attribute cross-subject EEG classification degradation to inter-subject variability in multi-class tasks and shortcut learning in single-class tasks.
-
Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form
Presents the first algorithm to identify an ε-optimal policy in robust constrained MDPs via epigraph form and bisection search with Õ(ε^{-4}) robust policy evaluations.
-
LoRA: Low-Rank Adaptation of Large Language Models
Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
-
METASYMBO: Multi-Agent Language-Guided Metamaterial Discovery via Symbolic Latent Evolution
MetaSymbO proposes a three-agent framework with symbolic latent evolution that improves structural validity and language alignment for metamaterial design from free-form text intents.
-
Learning to Forget: Continual Learning with Adaptive Weight Decay
FADE adapts per-parameter weight decay rates online via approximate meta-gradient descent to improve controlled forgetting over fixed decay in online tracking and streaming classification.
-
NORACL: Neurogenesis for Oracle-free Resource-Adaptive Continual Learning
NORACL dynamically grows network capacity via neurogenesis-inspired signals to achieve oracle-level continual learning performance without pre-specifying architecture size.
-
LUCid: Redefining Relevance For Lifelong Personalization
The LUCid benchmark shows that state-of-the-art models suffer near-zero retrieval recall and only about 50% response alignment when they must surface situationally relevant user information from semantically distant interaction histories.
-
A paradox of AI fluency
Fluent AI users adopt an active, iterative collaboration mode that produces more visible failures but better recovery and success on hard tasks, whereas novices experience more invisible failures from passive use.
-
Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective
Strong-model variance is the strongest empirical predictor of blind-spot deception in weak-to-strong alignment, backed by a misfit-based upper bound on population risk.
-
Architecture Determines Observability of Transformers
Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.
-
Reparameterization through Coverings and Topological Weight Priors
Reparameterization through coverings makes the KL term tractable in VAEs whose latent manifolds have non-trivial topology, demonstrated on a Klein bottle latent space.
-
Process Supervision of Confidence Margin for Calibrated LLM Reasoning
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
-
Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
An uncertainty-aware sequential selection algorithm fits scaling laws to near-full accuracy using only about 10% of the total experimental training budget across diverse benchmarks.
-
Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors
Explicit prompt baselines cut NLI contradictions by up to 42.6% with zero training, while learned gated context projectors deliver a 34% reduction in planning-stage contradictions and 50% higher cross-stage entailment on DriveLM-nuScenes.
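The LoRA entry in the list above describes adding a trainable low-rank product BA to a frozen weight matrix. The sketch below illustrates that idea in plain Python under stated assumptions: the helper names (`matvec`, `lora_forward`, `merge`), the dimensions `d`, `k`, `r`, and the random initial values are all hypothetical and chosen only for the example. It is not the reference implementation.

```python
import random

def matvec(M, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """h = W x + B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + dl for b, dl in zip(base, delta)]

def merge(W, A, B):
    """Fold the low-rank update into W so inference uses a single matrix."""
    r = len(A)
    return [[W[i][j] + sum(B[i][s] * A[s][j] for s in range(r))
             for j in range(len(W[0]))] for i in range(len(W))]

d, k, r = 64, 64, 2              # assumed sizes for illustration
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]                               # trainable, zero at start
x = [rng.gauss(0, 1) for _ in range(k)]

full_params = d * k          # parameters touched by full fine-tuning
lora_params = r * (d + k)    # parameters touched by the low-rank update

# Stand-in for a trained B, used only to exercise merge().
B_trained = [[rng.gauss(0, 0.01) for _ in range(r)] for _ in range(d)]
```

With B initialised to zero the adapted layer reproduces the frozen layer exactly, and after merging, a forward pass costs the same as the original layer, which matches the "no inference latency" claim in the summary. Here the trainable count drops from 4096 to 256; the gap widens as `d` and `k` grow relative to `r`.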