pith. machine review for the scientific record.

arxiv: 2505.16025 · v4 · submitted 2025-05-21 · 💻 cs.CV · cs.MM · eess.IV


Context and Pixel Aware Large Language Model for Video Quality Assessment

Authors on Pith: no claims yet
classification: 💻 cs.CV · cs.MM · eess.IV
keywords: quality · cp-llm · distortions · language · pixel · large · video · assessment
abstract

Video quality assessment (VQA) is a challenging research topic with broad applications. Traditional hand-crafted and discriminative learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent multimodal large language models (MLLMs) either lack sensitivity to small distortions or treat quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context- and Pixel-aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). Experimental results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on VQA benchmarks and superior robustness to pixel distortions.
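The dual-encoder design described in the abstract can be sketched in miniature: one branch pools coarse, downsampled frames for context, another computes full-resolution local differences sensitive to pixel distortions, and a decoder fuses both into a quality score. This is a hedged toy illustration of the idea only, not the paper's implementation; every function name, feature choice, and weight here is a hypothetical stand-in (a real CP-LLM uses learned vision encoders and a language decoder that also emits a textual quality description).

```python
import numpy as np

rng = np.random.default_rng(0)

def context_encoder(frames):
    # High-level branch (hypothetical stand-in for a learned encoder):
    # pools spatially downsampled frames into one context feature vector.
    down = frames[:, ::4, ::4]                          # coarse resolution
    return down.reshape(len(frames), -1).mean(axis=0)   # pooled over time

def pixel_encoder(frames):
    # Low-level branch (hypothetical stand-in): full-resolution local
    # gradients, the kind of statistic that reacts to compression artifacts.
    dx = np.abs(np.diff(frames, axis=2)).mean(axis=(1, 2))
    dy = np.abs(np.diff(frames, axis=1)).mean(axis=(1, 2))
    return np.concatenate([dx, dy])                     # per-frame stats

def decoder(context_feat, pixel_feat, w_ctx, w_pix):
    # Decoder stand-in: fuses both feature streams into a scalar score
    # squashed to (0, 1); an MLLM decoder would reason over both in text.
    score = float(w_ctx @ context_feat + w_pix @ pixel_feat)
    return 1.0 / (1.0 + np.exp(-score))

frames = rng.random((8, 32, 32))        # toy 8-frame grayscale clip
ctx = context_encoder(frames)           # shape (64,)
pix = pixel_encoder(frames)             # shape (16,)
w_ctx = rng.standard_normal(ctx.shape) * 0.1   # untrained toy weights
w_pix = rng.standard_normal(pix.shape) * 0.1
quality = decoder(ctx, pix, w_ctx, w_pix)
```

The point of the sketch is the separation of concerns: blurring or compressing `frames` changes `pix` strongly while leaving `ctx` nearly intact, which is the sensitivity split the paper's two encoders are designed around.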

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    DPC-VQA decouples a frozen MLLM perceptual prior from a lightweight residual calibration branch to adapt video quality assessment to new scenarios with under 2% trainable parameters and 20% of typical MOS labels.