Learning to generalize without bias for open-vocabulary action recognition
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
TACO proposes Relative Structure Distillation and a lightweight specialization projection to mitigate inconsistency between fine-tuning and evaluation objectives in open-vocabulary video recognition, claiming state-of-the-art results on cross-dataset and base-to-novel benchmarks.
citing papers explorer
- Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding
  GPS framework adds self-guided reasoning modules to lightweight VLMs for fine-grained action understanding, claiming performance near GPT-4o with better factual accuracy on a custom CAP-based dataset.
- TACO: Towards Task-Consistent Open-Vocabulary Adaptation in Video Recognition
  TACO proposes Relative Structure Distillation and a lightweight specialization projection to mitigate inconsistency between fine-tuning and evaluation objectives in open-vocabulary video recognition, claiming state-of-the-art results on cross-dataset and base-to-novel benchmarks.