Next-vit: Next generation vision transformer for efficient deployment in realistic industrial scenarios

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios , author= · 2022 · arXiv 2207.05501

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

representative citing papers

Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders

cs.CV · 2026-05-30 · unverdicted · novelty 6.0

C-GSPN scales 2D spatial propagation to foundation vision encoders via a fast CUDA kernel, compressed blocks, and two-stage distillation, matching ViT performance with 15% fewer parameters and 4x block speedup at 2K resolution.

TextTeacher: What Can Language Teach About Images?

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

TextTeacher uses frozen text embeddings from captions as semantic anchors to guide vision model training, improving ImageNet accuracy by up to 2.7 p.p. and transfer performance by 1.0 p.p. on average.

ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs

cs.CV · 2026-05-08 · unverdicted · novelty 5.0

ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.

Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

cs.CV · 2026-04-08 · unverdicted · novelty 5.0

The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

cs.CV · 2023-06-25 · conditional · novelty 5.0

MobileSAM is a 60x smaller distilled version of SAM that matches original performance and runs 5x faster than concurrent FastSAM while supporting CPU inference.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders cs.CV · 2026-05-30 · unverdicted · none · ref 54
C-GSPN scales 2D spatial propagation to foundation vision encoders via a fast CUDA kernel, compressed blocks, and two-stage distillation, matching ViT performance with 15% fewer parameters and 4x block speedup at 2K resolution.
TextTeacher: What Can Language Teach About Images? cs.CV · 2026-05-21 · unverdicted · none · ref 35
TextTeacher uses frozen text embeddings from captions as semantic anchors to guide vision model training, improving ImageNet accuracy by up to 2.7 p.p. and transfer performance by 1.0 p.p. on average.
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs cs.CV · 2026-05-08 · unverdicted · none · ref 52
ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer cs.CV · 2026-04-08 · unverdicted · none · ref 42
The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.

Next-vit: Next generation vision transformer for efficient deployment in realistic industrial scenarios

fields

years

verdicts

representative citing papers

citing papers explorer