Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
International Journal of Computer Vision (IJCV)
4 Pith papers cite this work.
fields
cs.CV 4
representative citing papers
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.
VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
citing papers explorer
- Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
- LLaVA-Video: Video Instruction Tuning With Synthetic Data
- Text-Guided Multi-Scale Frequency Representation Adaptation
- From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media