Mixed citation behavior. Most common role is unclear (71%).
claims ledger
- background Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 6491-6501, 2024. Fei, N., Lu, Z., Gao, Y., Yang, G., Huo, Y., Wen, J., Lu, H., Song, R., Gao, X., Xiang, T., et al. Towards artificial general intelligence via a multimodal foundation model. Nature Communications, 13(1):3094, 2022. Feng, T., Jin, C., Liu, J., Zhu, K., Tu, H., Cheng, Z., Lin, G., and You, J. How far are we from agi. arXiv preprint arXiv
- other Trace through the logic with the given test input 3. Determine the CORRECT output and which code(s) produced it Respond in the following format: <reasoning> Brief explanation (2-3 sentences max) of why this is the correct output. </reasoning> <correct output> The correct output value </correct output> <correct codes id> List of correct code indices, e.g., [1, 3] or [2] </correct codes id> E. Algorithm We present the detailed procedure of ADVERMCTS in pseudocode in Algorithm 1. Algor
- background tic proximity between entities sharing a common surface, where the dependent object's placement is conditioned by its functional utility relative to an anchor (e.g., a keyboard placed relative to a laptop). Based on these relations, a global scene is represented as an ordered sequence of relational tuples $S = \{T_1, T_2, \ldots, T_N\}$. Each tuple $T_i$ is formulated as: $T_i = \langle O_{dep,i}, O_{sup,i}, \{O_{fnc,i}\}_{opt} \rangle$ (1), where $O_{dep,i}$ is the object to be generated, $O_{sup,i}$ is the mandatory support anchor, and $O_{fnc,i}$ i
- other to symbolic constraints as specified in the symbolic scaffold. We give some examples of curated reasoning traces following this procedure in Appendix B.1. For each example $(x, y)$ that is correctly predicted by the decision tree model, we let $R(x, y, S(x))$ denote the curated reasoning tokens. As a result, we collect a set of reasoning data $\{x_i, z_i, y_i\}_{i \in C}$, where $z_i = R(x_i, y_i, S(x_i))$, and $C \in [1, \ldots, n]$ denotes the subset of data that is correctly predicted by the decision tree model. 3.4.
- background reweight these constraints during inversion using a spectral objective derived from a local linearization of the full decoding-and-rendering pipeline. 3.1 Differentiable Avatar Rendering Pipeline We assume a differentiable, animatable rendering pipeline $y_t = f(v, \theta_t) \in \mathbb{R}^m$ (1), where $v \in \mathbb{R}^r$ is a global editing code shared across frames, $\theta_t$ denotes the frame-specific driving state (pose parameters, cam
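- background where the squares are applied element-wise. • Low-rank term. Let $\bar{\theta}_i = \frac{1}{i} \sum_{j=1}^{i} \theta_j$ be the running mean after $i$ snapshots, and define deviation columns $d_i = \theta_i - \bar{\theta}_i$. To limit the rank, SWAG retains only the last $K$ such columns in a matrix $D \in \mathbb{R}^{d \times K}$, giving the low-rank covariance $\Sigma_{lr} = \frac{1}{K-1} D D^{\top}$ (27). The resulting SWAG posterior approximation is $q_{SWAG}(\theta) = \mathcal{N}\left(\theta_{SWA}, \frac{1}{2}(\Sigma_{diag} + \Sigma_{lr})\right)$ (28). Given $z_1 \sim \mathcal{N}(0, I_d)$ and $z_2 \sim \mathcal{N}(0, I_K)$, SWAG draws samples via $\tilde{\theta} = \theta_{SWA} + \frac{1}{\sqrt{2}} \Sigma_{diag}^{1/2} z_1 + \frac{1}{\sqrt{2(K-1)}} D z_2$ (29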
citing papers explorer
-
Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs
Transformers converge globally to the optimal DDPM denoiser for multi-token GMMs via self-attention mean denoising, with explicit token and iteration requirements.
-
Mind the Gap: Structure-Aware Consistency in Preference Learning
Standard DPO surrogates are inconsistent for equicontinuous neural nets; SA-DPO provides structure-aware H-consistency bounds by adapting margins to semantic distance and shows heavy-tailed losses yield superior guarantees for capacity-bounded models via the Margin-Capacity Profile.
-
Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior heuristics in experiments.
-
A General Representation-Based Approach to Multi-Source Domain Adaptation
A representation learning approach for multi-source domain adaptation achieves identifiability by partitioning the label's Markov blanket into parents, children, and spouses.
-
Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention
Stochastic Attention adds calibrated uncertainty to transformer foundation models through inference-time multinomial sampling of attention weights and univariate post-hoc tuning of a concentration parameter.
-
The Harder Path: Last Iterate Convergence for Uncoupled Learning in Zero-Sum Games with Bandit Feedback
In bandit-feedback zero-sum games, uncoupled algorithms achieve last-iterate Nash convergence at the optimal rate of O(T^{-1/4}).
-
Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
Radio-Frequency Inverse Rendering for Wireless Environment Modeling
The RFIR framework embeds an RF-aware BSDF into Gaussian splatting for decoupled RF scene modeling, generalizing to RCS synthesis, RSSI prediction, and wireless scene editing with performance gains.
-
Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation
ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving higher success rates in simulated and real tasks.
-
Information-Regularized Constrained Inversion for Stable Avatar Editing from Sparse Supervision
A conditioning-guided constrained inversion method restricts avatar edits to a low-dimensional part-specific subspace and uses an information matrix spectrum from pipeline linearization to predict and ensure stability under sparse supervision.
-
DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs
DeEscalWild supplies 1,500 high-fidelity de-escalation scenarios that let fine-tuned 3B SLMs outperform general-purpose larger models on realism and dialogue metrics.
-
AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation
AtomicRAG replaces chunk-based and triple-based GraphRAG with atom-entity graphs that store facts as atomic units and use personalized PageRank plus relevance filtering to achieve higher retrieval accuracy and reasoning robustness on five benchmarks.
-
A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
A new benchmark of 40 scenarios finds state-of-the-art LLMs exhibit outcome-driven constraint violations in 0-62.8% of cases under KPI pressure, with no consistent safety gains across model generations.
-
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments
HetRL delivers up to 9.17x higher throughput for LLM RL training on heterogeneous GPUs by using hybrid and ILP-based schedulers to solve a joint optimization problem over computation and data dependencies.
-
MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection
MVAD is the first comprehensive benchmark dataset for AI-generated multimodal video-audio detection, with three realistic forgery patterns, high-quality outputs from state-of-the-art models, and diversity across visual styles and content categories.
-
Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting
Super-Linear introduces a pretrained MoE architecture using frequency-specialized linear experts and spectral gating for efficient general time series forecasting.
-
VCBench: Benchmarking LLMs in Venture Capital
VCBench is a new privacy-preserving benchmark showing LLMs like DeepSeek-V3 achieve over six times the market baseline precision in predicting founder success.
-
PINS: Proximal Iterations with Sparse Newton and Sinkhorn for Optimal Transport
PINS combines an outer proximal-point loop over shifted entropic OT problems with inner Sinkhorn warm-up and sparse-Newton refinement to reach unregularized OT solutions with global convergence and lower error than Sinkhorn baselines.
-
Transformer Neural Processes - Kernel Regression
TNP-KR adds a kernel regression transformer block, kernel attention bias, scan attention for translation invariance, and deep kernel attention to achieve lower complexity and state-of-the-art results on meta-regression and related benchmarks.
-
EARL-BO: Reinforcement Learning for Multi-Step Lookahead, High-Dimensional Bayesian Optimization
EARL-BO uses RL with an Attention-DeepSets encoder and end-to-end on-policy multi-task fine-tuning to approximate near-optimal multi-step lookahead policies for high-dimensional black-box optimization.
-
Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics
A mean-field dynamical analysis of LoRA in transformers identifies phase transitions in catastrophic forgetting driven by perturbation norm and transformer depth.
-
Moonwalk: Inverse-Forward Differentiation
Moonwalk enables memory-efficient training of deep networks via mixed-mode gradient computation with vector-inverse-Jacobian products for submersive layers and fragmental checkpointing otherwise, matching backprop runtime at over twice the depth.
-
Learning Strategic Value and Cooperation in Multi-Player Stochastic Games through Side Payments
Introduces HS-S (aggregating dynamic threat powers) and Coco-S (fixed points of statewise HS Bellman operator) for stochastic games, proves they coincide for two players but disagree for three, shows uniqueness via extended axioms and topological degree theory, and gives sampling estimators.
-
Explaining the effects of non-convergent sampling in the training of Energy-Based Models
EBMs trained with non-persistent short runs reproduce empirical data statistics via a precise dynamical process, not the equilibrium measure.
-
Mitigating Value Hallucination in Dyna Planning via Multistep Predecessor Models
Introduces multistep predecessor models for Dyna planning to mitigate value hallucination by avoiding real-state updates from simulated values.
-
Combining Stochastic Adaptive Cubic Regularization with Negative Curvature for Nonconvex Optimization
Introduces the SANC algorithm combining negative curvature with stochastic adaptive cubic regularization for nonconvex optimization and claims it is the first such combination with consistent batch sizes for large-scale ML.
-
Connectivity-Optimized Representation Learning via Persistent Homology
A persistent homology loss enforces controllable connectivity in autoencoder latent spaces, improving one-class classification via kernel density estimation on the learned representations.
-
The Power of Power Law: Asymmetry Enables Compositional Reasoning
Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distributions.
-
OT on the Map: Quantifying Domain Shifts in Geographic Space
GeoSpOT applies optimal transport to longitude-latitude data to quantify geospatial domain shifts and predict cross-region model transfer performance.
-
SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
SocialGrid benchmark shows even top LLMs achieve below 60% in embodied planning and task completion, with deception detection near random chance regardless of model scale.
-
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
ReSS extracts decision paths from trees as scaffolds to guide LLM reasoning generation, fine-tunes the LLM on the resulting dataset with scaffold-invariant augmentation, and reports up to 10% gains on medical and financial tabular benchmarks with new faithfulness metrics.
-
Pair2Scene: Learning Local Object Relations for Procedural Scene Generation
Pair2Scene generates complex 3D scenes beyond training data by training a network on local object-pair placement rules and applying them recursively with collision-aware sampling.
-
TurboEvolve: Towards Fast and Robust LLM-Driven Program Evolution
TurboEvolve improves LLM program evolution by running parallel islands with LLM-generated diverse candidates that carry self-assigned weights, an adaptive scheduler, and clustered seed injection to reach stronger solutions at lower evaluation budgets.
-
Preventing Latent Rehearsal Decay in Online Continual SSL with SOLAR
SOLAR prevents latent rehearsal decay in online continual SSL by adaptively managing replay buffers with deviation proxies and an explicit overlap loss, delivering both fast convergence and state-of-the-art final accuracy on vision benchmarks.
-
ANTIC: Adaptive Neural Temporal In-situ Compressor
ANTIC reduces storage for large-scale PDE simulations by orders of magnitude through adaptive temporal snapshot selection combined with continual neural-field residual compression while preserving physics accuracy.
-
Efficient RL Training for LLMs with Experience Replay
Well-designed experience replay buffers reduce inference compute in LLM RL post-training while maintaining or improving performance and preserving policy entropy.
-
Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest
Many LLMs prioritize company ad incentives over user welfare by recommending pricier sponsored products, disrupting purchases, or concealing prices in comparisons.
-
Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions
LLM output lengths conditioned on a prompt form heavy-tailed distributions, so robust estimation from multiple samples outperforms single-sample labels for prediction.
-
MICA: Multivariate Infini Compressive Attention for Time Series Forecasting
MICA adapts infini compressive attention to the channel dimension, enabling scalable cross-channel dependencies in Transformers and cutting forecast error by 5.4% on average versus channel-independent baselines.
-
Single-Stage Signal Attenuation Diffusion Model for Low-Light Image Enhancement and Denoising
SADM adds a signal attenuation coefficient to the diffusion forward process so that reverse denoising simultaneously recovers brightness and suppresses noise without extra stages or correction modules.
-
How Reasoning Evolves from Post-Training Data: An Empirical Study Using Chess
Training language models on single best-move predictions in chess leads to strong but unfaithful reasoning after RL, while multi-move trajectories produce faithful reasoning with similar performance and stability.
-
Advances in Art: Orthogonal Disruption and the Beauty in Schematics
Orthogonal Art is defined as an artistic practice using schematics to occupy generative spaces inaccessible to AI, serving as a pedagogical bridge between art, engineering, and philosophy.
-
OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration
OPRIDE improves query efficiency in offline PbRL via a principled in-dataset exploration strategy and discount scheduling, outperforming prior methods with fewer queries and providing theoretical guarantees.
-
LangPrecip: Language-Aware Multimodal Precipitation Nowcasting
LangPrecip treats weather text as semantic motion constraints in a rectified-flow trajectory generator to improve multimodal precipitation nowcasting, yielding over 60% and 19% gains in heavy-rain CSI at 80-minute lead times on Swedish and MRMS data.
-
Self-Supervised Learning by Curvature Alignment
CurvSSL augments Barlow Twins-style SSL with a curvature alignment loss computed from k-nearest-neighbor cosine scores on the unit hypersphere, yielding competitive linear evaluation accuracy on MNIST and CIFAR-10.
-
You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
TAQ estimates per-layer importance from hidden representations and output sensitivity on task calibration data to allocate mixed precision in a training-free PTQ setting, outperforming task-agnostic baselines on accuracy-memory ratio across benchmarks.
-
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.
-
Teaching the Teacher: The Role of Teacher-Student Smoothness Alignment in Genetic Programming-based Symbolic Distillation
Regularizing neural network teachers for functional smoothness with Jacobian and Lipschitz penalties yields statistically significant gains in R^2 for genetic programming-based symbolic students on 20 datasets.
-
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.