INTERN: A New Learning Paradigm Towards General Vision

Chengyu Wang; Conghui He; Dahua Lin; Ding Liang; Fenggang Liu; Fengwei Yu; Gengshi Huang; Guanglu Song; Huan Peng; Jianing Teng

arxiv: 2111.08687 · v2 · pith:RLVKSIY5new · submitted 2021-11-16 · 💻 cs.CV · cs.AI · cs.LG

Jing Shao, Siyu Chen, Yangguang Li, Kun Wang, Zhenfei Yin, Yinan He, Jianing Teng, Qinghong Sun, Mengya Gao, Jihao Liu, Gengshi Huang, Guanglu Song, Yichao Wu, Yuming Huang, Fenggang Liu, Huan Peng, Shuo Qin, Chengyu Wang, Yujie Wang, Conghui He, Ding Liang, Yu Liu, Fengwei Yu, Junjie Yan, Dahua Lin, Xiaogang Wang, Yu Qiao

classification: cs.CV, cs.AI, cs.LG

keywords: data, learning, model, paradigm, vision, general, capability, develop

Enormous waves of technological innovations over the past several years, marked by the advances in AI technologies, are profoundly reshaping the industry and the society. However, down the road, a key challenge awaits us, that is, our capability of meeting rapidly-growing scenario-specific demands is severely limited by the cost of acquiring a commensurate amount of training data. This difficult situation is in essence due to limitations of the mainstream learning paradigm: we need to train a new model for each new scenario, based on a large quantity of well-annotated data and commonly from scratch. In tackling this fundamental problem, we move beyond and develop a new learning paradigm named INTERN. By learning with supervisory signals from multiple sources in multiple stages, the model being trained will develop strong generalizability. We evaluate our model on 26 well-known datasets that cover four categories of tasks in computer vision. In most cases, our models, adapted with only 10% of the training data in the target domain, outperform the counterparts trained with the full set of data, often by a significant margin. This is an important step towards a promising prospect where such a model with general vision capability can dramatically reduce our reliance on data, thus expediting the adoption of AI technologies. Furthermore, revolving around our new paradigm, we also introduce a new data system, a new architecture, and a new benchmark, which, together, form a general vision ecosystem to support its future development in an open and inclusive manner. See project website at https://opengvlab.shlab.org.cn.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Air-Ground Collaboration: A Progressive Cross-Task Benchmark and Socialized Learning Framework
cs.CV · 2026-06 · unverdicted · novelty 7.0

Presents AGPC benchmark and SCP framework for progressive cross-task air-ground collaborative perception, reporting 3.73% coevolutionary gain and 7.86% downstream improvement over uniform fusion.

InternVideo: General Video Foundation Models via Generative and Discriminative Learning
cs.CV · 2022-12 · unverdicted · novelty 5.0

InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets including 91.1% top-1 on Kinetics-400.

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
cs.CV · 2026-06 · unverdicted · novelty 4.0

InternVideo3 introduces Multimodal Contextual Reasoning and M^2LA attention to enable closed-loop evidence accumulation in long-video understanding and agentic tool use, reporting strong benchmark results.

Data-Centric Foundation Models in Computational Healthcare: A Survey
cs.LG · 2024-01 · unverdicted · novelty 3.0

The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.