Vision as LoRA
10 Pith papers cite this work. Polarity classification is still indexing.
roles: background 4
polarities: background 4
citing papers explorer
- Local Spatiotemporal Convolutional Network for Robust Gait Recognition
  LSTCN is a dual-branch CNN that extracts temporal gait features by pooling spatial data into strips and applying local spatiotemporal convolutions with asymmetric kernels.
- Selective LoRA for Visual Tokens and Attention Heads
  Image-LoRA selectively adapts only visual tokens and chosen attention heads in VLMs, matching standard LoRA performance with a lower parameter count and fewer FLOPs.
- DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning
  DAIN reframes multimodal fusion as dynamic agent collaboration with sparse activation, claiming SOTA results across five benchmarks, including a 2.6% accuracy gain on ADNI.
- Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting
  SAME-Net adds a differentiable soft attention mask embedding module to achieve rectification-free end-to-end scene text spotting with 84.02% H-mean on Total-Text.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
  Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accuracy to 71-72.5% on Gemma-2B and Mistral-7B.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL achieves higher accuracy than explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
  SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- Multi-Branch Non-Homogeneous Image Dehazing via Concentration Partitioning and Image Fusion
  CPIFNet decomposes non-homogeneous dehazing into multiple homogeneous sub-problems via specialized IENet branches trained on different haze concentrations, then uses IFNet to fuse advantageous regions through deep feature merging.
- Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
  A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduced error and fast inference.
- Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation
  RDCNet reports state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof by combining random dilated convolutions with multi-branch and attention modules.