BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
6 Pith papers cite this work.
fields: cs.CV (6)
years: 2026 (6)
citing papers explorer
- Depth Adaptive Efficient Visual Autoregressive Modeling
  DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
- FILTR: Extracting Topological Features from Pretrained 3D Models
  FILTR predicts persistence diagrams from pretrained 3D encoders on the new DONUT benchmark, showing limited topological signals in encoders but successful approximation via learnable feed-forward.
- MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
  MODIX dynamically rescales positional indices in VLMs using intra-modal covariance-based entropy and inter-modal alignment scores to allocate finer granularity to informative content.
- On The Application of Linear Attention in Multimodal Transformers
  Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.
- Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction
  A staged multimodal fusion model for predicting six continuous emotion intensities from in-the-wild video achieves 0.4722 validation and 0.57 test Pearson correlation in the EMI challenge.
- DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection