OpenSGA: Efficient 3D Scene Graph Alignment in the Open World
OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and a minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene graph alignment, backed by a new 700k-sample ScanNet-SG dataset.
Advances in Neural Information Processing Systems
40 papers cite this work. Polarity classification is still indexing.
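The lead summary mentions a minimum-cost-flow allocator for matching nodes across scene graphs. As a rough, hypothetical stand-in (not the paper's implementation), a balanced one-to-one allocation over a node-to-node cost matrix can be solved with SciPy's linear-sum-assignment routine; a genuine min-cost-flow formulation generalizes this by allowing unmatched nodes and capacity constraints.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate_nodes(cost: np.ndarray):
    """Minimum-cost one-to-one allocation between two node sets.

    cost[i, j] is an assumed dissimilarity between source node i and target
    node j (e.g. a distance between fused node embeddings). A rectangular
    matrix simply leaves the surplus nodes unmatched.
    """
    rows, cols = linear_sum_assignment(cost)        # Hungarian-style solver
    matches = list(zip(rows.tolist(), cols.tolist()))
    return matches, float(cost[rows, cols].sum())

# Example with a random 4x5 cost matrix:
matches, total_cost = allocate_nodes(np.random.rand(4, 5))
```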
citing papers explorer
-
Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
-
The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently
Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.
-
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% in target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.
-
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
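For reference, this is the standard ALiBi bias the summary refers to, as a minimal NumPy sketch; the paper's positional-LSH construction and the block masks themselves are not reproduced here.

```python
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Standard ALiBi linear attention bias, shape (heads, seq, seq).

    This is the well-known bias added to pre-softmax attention logits; the
    paper's claim is that it equals the expectation of certain positional
    LSH-induced block masks (not shown here).
    """
    # Per-head slopes: geometric sequence 2^(-8/H), 2^(-16/H), ... (power-of-two H).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])      # relative distance |i - j|
    return -slopes[:, None, None] * dist[None, :, :]

# Usage: scores = q @ k.T / sqrt(d_head) + alibi_bias(H, T)[h]
# (a causal mask, if any, is applied on top of this bias).
```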
-
Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations
CaTR applies value-decomposed RL with hierarchical conflict-aware observations to achieve better safety-efficiency trade-offs than planning, optimization, and standard RL baselines in a realistic airport taxiway simulation.
-
Sinkhorn Treatment Effects: A Causal Optimal Transport Measure
The Sinkhorn treatment effect is a new entropic optimal transport measure of divergence between counterfactual distributions that admits first- and second-order pathwise differentiability, debiased estimators, and asymptotically valid tests for distributional treatment effects.
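As background, a minimal sketch of the entropic optimal-transport quantity such a measure builds on, computed between two hypothetical samples of counterfactual outcomes; this is plain Sinkhorn, not the paper's debiased estimator or its tests.

```python
import numpy as np

def entropic_ot_cost(y1: np.ndarray, y0: np.ndarray, eps: float = 0.1,
                     n_iter: int = 200) -> float:
    """Entropic OT cost between empirical outcome samples y1 (treated) and
    y0 (control) via plain Sinkhorn iterations.

    Illustrative only; a log-domain implementation is preferable for small eps.
    """
    y1 = np.asarray(y1, dtype=float)
    y0 = np.asarray(y0, dtype=float)
    a = np.full(len(y1), 1.0 / len(y1))             # uniform source weights
    b = np.full(len(y0), 1.0 / len(y0))             # uniform target weights
    C = (y1[:, None] - y0[None, :]) ** 2            # squared-difference cost
    K = np.exp(-C / eps)                            # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):                         # Sinkhorn fixed-point updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                 # entropic transport plan
    return float(np.sum(P * C))

# e.g. entropic_ot_cost(np.random.randn(300) + 0.5, np.random.randn(400))
```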
-
LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification
LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.
-
EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
-
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
-
PHALAR: Phasors for Learned Musical Audio Representations
PHALAR achieves up to 70% relative accuracy gain in stem retrieval with under half the parameters and 7x faster training by using phasor-based equivariant representations, setting new SOTA on multiple datasets.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
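The two interventions the summary describes reduce to simple vector operations on residual-stream activations; a minimal sketch, with variable names that are illustrative rather than taken from the paper's code.

```python
import numpy as np

def erase_direction(h: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Project the 'refusal direction' d out of activations h (last axis is
    the model dimension); erasing it is the 'disable safety' intervention."""
    d = d / np.linalg.norm(d)
    return h - (h @ d)[..., None] * d

def add_direction(h: np.ndarray, d: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Shift activations along d; adding it is the 'elicit refusal' intervention."""
    d = d / np.linalg.norm(d)
    return h + alpha * d

# h: (batch, seq, d_model) activations hooked from a layer; d: (d_model,) direction,
# e.g. a difference of mean activations on harmful vs. harmless prompts.
```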
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
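A minimal PyTorch sketch of the idea as summarized: one trainable affine "translator" per layer of a frozen model, decoded through the existing unembedding; the training objective is only indicated in comments.

```python
import torch
import torch.nn as nn

class TunedLensProbe(nn.Module):
    """One affine 'translator' per layer of a frozen transformer (sketch).

    The probe maps the layer-l hidden state toward something the existing
    unembedding can decode; probes are trained with the base model frozen.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.translator = nn.Linear(d_model, d_model)

    def forward(self, h_l: torch.Tensor, unembed: nn.Module) -> torch.Tensor:
        # h_l: (batch, seq, d_model) hidden state at layer l; returns logits.
        return unembed(self.translator(h_l))

# Training sketch: minimize the KL divergence between the final-layer next-token
# distribution and the probe's distribution, one probe per layer; the resulting
# per-layer prediction trajectories are what the summary's detector inspects.
```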
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction
Set-aggregated genome embeddings from genomic language models predict microbiome abundance profiles with improved generalization to novel genomes over classical bioinformatics methods.
-
Training-Inference Consistent Segmented Execution for Long-Context LLMs
A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.
-
WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views
WorldComp2D explicitly structures latent space geometry by object identity and spatial proximity via a proximity-dependent encoder and localizer, cutting parameters up to 4x and FLOPs 2.2x versus state-of-the-art lightweight models on facial landmark localization while staying real-time on CPU.
-
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.
-
The two clocks and the innovation window: When and how generative models learn rules
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
-
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
-
Continuity Laws for Sequential Models
S4 models exhibit stable time-continuity unlike sensitive S6 models, with task continuity predicting performance and enabling temporal subsampling for better efficiency.
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
-
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
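For intuition only, a generic symmetric fake-quantization helper showing what "near-4-bit activations and 8-bit gradients" means at the tensor level; AGoQ's adaptive scaling and distributed-training integration are not modeled here.

```python
import numpy as np

def fake_quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor quantize-dequantize to `bits` bits (generic sketch)."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

# e.g. activations at ~4 bits and gradients at 8 bits:
# a_hat = fake_quantize(activations, bits=4)
# g_hat = fake_quantize(gradients, bits=8)
```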
-
Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising
Adapting vision foundation models with LoRA and kurtosis-guided unsupervised test-time adaptation matches or exceeds domain-specific models for seismic denoising across multiple sites and unseen data.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
-
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
-
TD-MPC2: Scalable, Robust World Models for Continuous Control
TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
-
MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification
MaskTab is a masked pretraining method for industrial tabular data that delivers measurable gains in classification AUC and KS (Kolmogorov-Smirnov) metrics while enabling effective distillation to smaller models.
-
RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation
RADAR is a redundancy-aware, query-adaptive framework that uses conditional discrete graph diffusion to generate efficient communication topologies for multi-agent LLM systems, outperforming baselines on six benchmarks with higher accuracy and lower token use.
-
Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning
A generative framework using geometric diffusion for brain networks and tabular diffusion for other organs integrates ICD-coded SDoH proxies to improve disease reasoning on UK Biobank data.
-
Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan
An interpretable deep learning framework with a new tokenizer is used to quantify how grammatical gender information is distributed between lemmas and sentential context during the Latin-to-Occitan transition.
-
When and Why Grouping Attention Heads Accelerates Muon Optimization
Grouping attention heads in Muon creates a trade-off between whitening gains and norm costs that, when tuned, improves training loss over full or per-head Muon on GPT-2.
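For context, a generic cubic Newton-Schulz orthogonalization of a gradient block, the whitening-style step Muon applies (Muon itself uses a tuned quintic iteration); the grouped variant in the comments is a hypothetical sketch of applying it per head group, which is where the summary's trade-off arises.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 10) -> np.ndarray:
    """Cubic Newton-Schulz iteration toward the orthogonal (polar) factor of g.

    Illustrative sketch of a whitening step, not Muon's exact iteration.
    """
    x = g / (np.linalg.norm(g) + 1e-12)             # singular values now < 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x

# Grouped variant (hypothetical sketch): split a fused attention gradient into
# head groups along the output dimension and whiten each group independently.
# groups = np.split(grad, num_groups, axis=0)
# whitened = np.concatenate([newton_schulz_orthogonalize(b) for b in groups], axis=0)
```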
-
Attention-based graph neural networks: a survey
The survey groups attention-based GNNs into three stages—graph recurrent attention networks, graph attention networks, and graph transformers—while reviewing architectures and future directions.
-
InfoGeo: Information-Theoretic Object-Centric Learning for Cross-View Generalizable UAV Geo-Localization
InfoGeo reformulates cross-view geo-localization as an information bottleneck that aligns object-centric structural relations across views while minimizing view-specific noise.
-
Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges
A structured diffusion bridge method achieves near fully-paired modality translation quality using alignment constraints even in unpaired or semi-paired regimes.
-
Enhancing Target-Guided Proactive Dialogue Systems via Conversational Scenario Modeling and Intent-Keyword Bridging
Conversational scenario modeling from user profiles and domain knowledge, combined with intent-keyword bridging, improves proactivity, fluency, and informativeness in target-guided proactive dialogue systems.
-
Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers
Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics, and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.