arxiv: 2605.01402 · v2 · submitted 2026-05-02 · 💻 cs.CL · cs.CV · cs.LG

Recognition: unknown

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

Yao Du , Shanshan Li , Xiaomeng Li


Pith reviewed 2026-05-12 04:46 UTC · model grok-4.3

classification 💻 cs.CL · cs.CV · cs.LG

keywords multimodal large language models · imbalanced regression · long-tailed distributions · reinforcement learning · concordance correlation coefficient · group relative policy optimization · distribution alignment


The pith

Multimodal LLMs improve regression on rare values by adding batch-level distribution matching rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard fine-tuning of multimodal language models for numerical prediction fails on long-tailed data because it treats each example in isolation. It identifies the absence of cross-sample comparisons as the core issue and introduces a reinforcement learning setup that scores entire batches by how closely their overall spread matches the true distribution. This matters for applications like medical scoring or demand forecasting where errors on uncommon cases carry high costs. The method requires no model changes and delivers gains especially when data for certain values is scarce.

Core claim

The authors claim that a Group Relative Policy Optimization framework equipped with a Concordance Correlation Coefficient reward supplies the missing relational supervision, aligning model outputs with ground-truth distributions across correlation, scale, and mean; this yields consistent gains over supervised fine-tuning and prior regression methods on long-tailed benchmarks, with the largest lifts in medium- and few-shot regimes.
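For reference, the standard definition of the Concordance Correlation Coefficient (Lin, 1989) makes the three alignment terms explicit. The symbols below are notation chosen for this note, with $x$ the predictions and $y$ the ground-truth values in a batch; how the paper turns this quantity into a reward is not specified in the abstract.

```latex
\rho_c \;=\; \frac{2\,\sigma_{xy}}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}
\;=\; \rho \cdot C_b,
\qquad
C_b \;=\; \frac{2}{v + 1/v + u^2},
\quad
v = \frac{\sigma_x}{\sigma_y},
\quad
u = \frac{\mu_x - \mu_y}{\sqrt{\sigma_x \sigma_y}}
```

Here $\rho$ is the Pearson correlation, $v$ measures scale mismatch, and $u$ measures location shift. $C_b$ equals 1 only when the two variances and the two means agree, so variance collapse or mean shift lowers $\rho_c$ even when the ordering of predictions is perfect.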

What carries the argument

Group Relative Policy Optimization using a batch-level Concordance Correlation Coefficient reward that compares predicted and ground-truth distributions for correlation, scale, and location.
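A minimal sketch of the batch-level quantity, assuming NumPy and population (biased) variance estimates. The function name and the toy values are illustrative and are not taken from the paper.

```python
import numpy as np

def concordance_correlation(pred, target, eps=1e-8):
    """Lin's concordance correlation coefficient between two 1-D arrays."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()          # population variances
    cov = ((pred - mu_p) * (target - mu_t)).mean()   # population covariance
    # eps only guards against division by zero when both inputs are constant
    return 2.0 * cov / (var_p + var_t + (mu_p - mu_t) ** 2 + eps)

# Toy batch with one rare tail value (hypothetical numbers)
target = np.array([20.0, 25.0, 30.0, 35.0, 80.0])
collapsed = np.full_like(target, target.mean())   # every prediction at the batch mean
shifted = target + 10.0                           # right ordering and spread, wrong location

ccc_perfect = concordance_correlation(target, target)
ccc_collapsed = concordance_correlation(collapsed, target)
ccc_shifted = concordance_correlation(shifted, target)
```

In this toy batch the mean-collapsed predictor scores zero because it has no covariance with the targets, and the shifted predictor scores below one despite perfect correlation, which is the behaviour the reward relies on.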

If this is right

Models exhibit better tail accuracy without regressing to the mean on imbalanced numerical targets.
Performance lifts appear most strongly in medium- and few-shot regimes across unified benchmarks.
The approach integrates without any architectural modifications to existing MLLMs.
Distribution alignment in correlation, scale, and mean follows directly from the batch comparison reward.
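Checking the regime-level claims above requires splitting test error by how often each target value appears in training. The sketch below assumes integer-valued, non-negative labels with a bin width of 1 and uses thresholds of 100 and 20 training samples per bin, a convention common in the deep imbalanced regression literature. This page does not state the paper's exact thresholds, so both defaults are assumptions.

```python
import numpy as np

def shot_region_mae(train_labels, test_labels, test_preds, many=100, few=20):
    """MAE on the many-, medium-, and few-shot regions of the label space."""
    train_labels = np.asarray(train_labels).astype(int)
    test_labels = np.asarray(test_labels).astype(int)
    test_preds = np.asarray(test_preds, dtype=float)

    # Number of training samples that share each test sample's label
    size = int(max(train_labels.max(), test_labels.max())) + 1
    counts = np.bincount(train_labels, minlength=size)
    n = counts[test_labels]

    err = np.abs(test_preds - test_labels)
    regions = {
        "many": n > many,
        "medium": (n >= few) & (n <= many),
        "few": n < few,
    }
    return {
        name: float(err[mask].mean()) if mask.any() else float("nan")
        for name, mask in regions.items()
    }

# Hypothetical toy data: label 30 is common, 50 is moderately common, 80 is rare
train = np.array([30] * 150 + [50] * 50 + [80] * 5)
result = shot_region_mae(train, test_labels=[30, 50, 80], test_preds=[31.0, 45.0, 60.0])
```

On this toy data the error grows from the many-shot to the few-shot region, which is the pattern the regime-level comparison is meant to expose.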

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same batch-comparison idea could transfer to other prediction settings where labels are naturally uneven, such as time-series forecasting.
It might reduce reliance on synthetic data generation for rare cases by instead leveraging relational signals within real batches.
Extending the reward to handle multimodal outputs or uncertainty estimates could address related calibration problems.

Load-bearing premise

That the main limitation in current MLLM regression is missing cross-sample relational supervision, and that a batch-wise CCC reward supplies it effectively without introducing new biases.

What would settle it

If the same long-tailed regression benchmarks were rerun with the CCC batch reward replaced by standard point-wise rewards and tail performance turned out no worse, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.01402 by Shanshan Li, Xiaomeng Li, Yao Du.

**Figure 1.** SFT exhibits a pronounced regression-to-the-mean effect, with predictions collapsing toward the many-shot region. Our method produces a substantially more balanced prediction distribution and maintains reliable predictions in tail regions. …tions with different numeric errors can incur identical loss as long as they correspond to the same ground-truth token. Consequently, standard supervised fine-tuning (…

**Figure 2.** Comparison of training paradigms for numerical prediction in MLLMs. Left: SFT treats regression as token-level classification. Middle: Standard GRPO applies point-wise scalar rewards to each generation. Right: CCC-GRPO introduces batch-level, distribution-aware relational supervision. Reinforcement fine-tuning has recently emerged as an effective paradigm for training large reasoning models, where supervi…

**Figure 3.** Overview of the proposed CCC-GRPO framework for deep imbalanced regression in MLLMs. CCC simultaneously captures linear correlation, scale consistency, and mean alignment between two distributions (Lawrence & Lin, 1989). Unlike pure correlation or ranking-based objectives, CCC explicitly penalizes both variance collapse and mean shift, making it sensitive to distributional mismatch beyond relative orderi…

**Figure 4.** Overview of the constructed DIR benchmark for MLLMs. …Absolute Errors (GM). MAE reflects average regression accuracy, while GM penalizes concentrated or frequent errors and provides a complementary measure of error uniformity across sparse and under-represented regions. Baselines. We compare against both classical CNN-based DIR methods and MLLM-based regression approaches. Classical DIR baselines employ co…

**Figure 5.** MAE gain of Ours over SFT on IMDB-Movie-DIR under Qwen2.5-VL-3B.

**Figure 6.** Sorted error distribution curves for CCC-GRPO and SFT on the BoneAge-DIR dataset under Qwen2.5-VL-3B. …ness in sparse regimes. DISCO MAE Reward corresponds to our reproduction of difficulty-aware reweighting (Zhou et al., 2025), adapted to the generative numeric regression setting of MLLMs. It further improves tail performance by adjusting instance importance, yet still relies on point-wise supervision and …

**Figure 7.** Sorted error distribution curves for CCC-GRPO and SFT on the AgeDB-DIR dataset. Panels: many-shot, median-shot, few-shot. Axes: sorted sample rank, absolute error. Curves: SFT, Ours.

**Figure 8.** Sorted error distribution curves for CCC-GRPO and SFT on the IMDB-Movie-DIR dataset. Complementary Error Metrics. Tables 9–12 report detailed results using three complementary metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), and the Geometric Mean of Absolute Errors (GM). MSE amplifies large errors and is therefore sensitive to catastrophic failures, MAE reflects average prediction accuracy, wh…

**Figure 9.** Sorted error distribution curves for CCC-GRPO and SFT on the IMDB-WIKI-DIR dataset. Panels: many-shot, median-shot, few-shot. Axes: sorted sample rank, absolute error. Curves: SFT, Ours.

**Figure 10.** Sorted error distribution curves for CCC-GRPO and SFT on the BoneAge-DIR dataset. …by reducing extreme deviations, CCC-GRPO exhibits more pronounced gains under MSE than under MAE, especially in sparse regions. These trends are consistent with the sorted error curves, confirming that CCC-GRPO improves regression robustness by stabilizing predictions across the target spectrum rather than optimizing mean acc…

**Figure 11.** MAE gain across AgeDB-DIR and IMDB-Movie-DIR datasets under imbalanced training distributions. Panel (a): IMDB-WIKI-DIR. Axes: target value, number of samples (train label distribution, bin = 1), absolute MAE gain (SFT Ours). Regions: many-shot, medium-shot, few-shot. …

**Figure 12.** MAE gain across IMDB-WIKI-DIR and BoneAge-DIR datasets under imbalanced training distributions. Scaling to Larger MLLM Backbones. In addition to the main experiments on Qwen2.5-VL-3B, we further evaluate CCC-GRPO on a larger backbone, Qwen2.5-VL-7B, with detailed results reported in …

**Figure 13.** Imbalanced Training Dataset Overview.

**Figure 14.** Balanced Testing Dataset Overview. B.2. IMDB-Movie-DIR. IMDB-Movie-DIR is constructed from the IMDB movie dataset (Kaggle, 2025), where each sample consists of a single movie poster paired with a continuous IMDb rating score. The task requires predicting the movie rating from visual input only, introducing substantial domain shift and label noise. We preserve the naturally imbalanced training distribution, …
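The captions for Figures 4 and 8 name three error metrics: MSE, MAE, and the geometric mean of absolute errors (GM). A minimal sketch of how they are commonly computed follows, assuming NumPy. The small `eps` inside the logarithm is an assumption added here so that an exact prediction does not send the geometric mean to zero through `log(0)`; the paper's handling of that case is not stated on this page.

```python
import numpy as np

def regression_metrics(pred, target, eps=1e-8):
    """Return MSE, MAE, and the geometric mean of absolute errors (GM)."""
    err = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    mse = float(np.mean(err ** 2))                    # dominated by the largest errors
    mae = float(np.mean(err))                         # average error magnitude
    gm = float(np.exp(np.mean(np.log(err + eps))))    # geometric mean via the log domain
    return {"MSE": mse, "MAE": mae, "GM": gm}

# Hypothetical predictions with absolute errors 1, 2, and 8
metrics = regression_metrics([1.0, 2.0, 12.0], [2.0, 4.0, 4.0])
```

With errors of 1, 2, and 8, the single large miss raises MSE far more than it raises MAE or GM, which is why the three metrics are reported side by side.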

read the original abstract

Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
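The point about token-level supervision can be illustrated with a toy example. The sketch below assumes a hypothetical four-token numeric vocabulary in which each number is a single token; real tokenizers often split numbers into several tokens, so this is a simplification rather than the paper's setup.

```python
import numpy as np

# Hypothetical vocabulary of numeric tokens; "30" is the ground-truth answer
vocab = ["29", "30", "31", "90"]
values = np.array([float(v) for v in vocab])
target_idx = vocab.index("30")

def cross_entropy(probs, idx):
    """Token-level loss: depends only on the probability of the correct token."""
    return float(-np.log(probs[idx]))

def expected_abs_error(probs):
    """Expected numeric error if the answer is sampled from probs."""
    return float(np.sum(probs * np.abs(values - values[target_idx])))

near_miss = np.array([0.05, 0.40, 0.50, 0.05])   # most wrong mass on "31"
far_miss = np.array([0.05, 0.40, 0.05, 0.50])    # most wrong mass on "90"

ce_near, ce_far = cross_entropy(near_miss, target_idx), cross_entropy(far_miss, target_idx)
err_near, err_far = expected_abs_error(near_miss), expected_abs_error(far_miss)
```

Both distributions put the same probability on the correct token, so the cross-entropy is identical, while the expected numeric error differs by roughly an order of magnitude.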

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper offers a batch-level CCC reward inside GRPO to give MLLMs distributional supervision on long-tailed regression, but the abstract supplies no numbers and the sampling concern looks real.


The main takeaway is that the authors treat the absence of cross-sample relational signals as the root cause of regression-to-the-mean in MLLM regression and try to fix it with a Concordance Correlation Coefficient reward computed over the whole batch inside Group Relative Policy Optimization. They keep the model unchanged and call the method plug-and-play. That framing is direct and practical if the gains materialize.

They correctly note that token-level SFT and point-wise rewards push learning toward dense regions of the target distribution, which is a common pain point when targets are long-tailed. The CCC term is meant to pull the batch predictions toward matching the ground-truth distribution on correlation, scale, and location at once. That is a clean way to inject batch-level comparison without new architecture. The combination appears new in this exact setting even though both GRPO and CCC are established tools elsewhere. The abstract claims consistent gains over SFT and prior MLLM regression baselines, with the largest lifts in medium- and few-shot regimes on unified long-tailed benchmarks. If the full paper shows those numbers with proper controls, the idea would be worth testing for anyone fine-tuning MLLMs on skewed numerical targets.

The clearest weakness is the total lack of quantitative evidence in the provided summary. No tables, no specific deltas, no error bars, and no description of batch construction or sampling strategy appear. That makes it impossible to judge whether the reported tail improvements are real or how large they are. The sampling concern also lands: under ordinary random batching, tail examples will be rare, so the CCC signal will be dominated by head samples and may give little useful gradient for the very cases the method is supposed to rescue. If the paper relies on special batching or large batch sizes to make the reward effective, the plug-and-play claim weakens.

This work is aimed at people already training or adapting MLLMs for regression in vision-language or clinical settings where targets are imbalanced. A practitioner who has seen their model collapse to the mean on rare values might find the reward function worth implementing and ablating. It deserves a serious referee because the problem is genuine, the proposed mechanism is simple to reproduce, and the full manuscript presumably contains the missing experiments and controls that would let reviewers check whether the batch reward actually drives the tail gains or whether other RL effects are responsible.

Referee Report

1 major / 1 minor

Summary. The paper claims that MLLMs exhibit regression-to-the-mean on long-tailed numerical regression tasks because token-level SFT and point-wise rewards lack cross-sample relational supervision. It proposes a plug-and-play distribution-aware RL framework based on Group Relative Policy Optimization (GRPO) that uses a batch-level Concordance Correlation Coefficient (CCC) reward to align predicted and ground-truth distributions along correlation, scale, and mean. Experiments on a unified suite of long-tailed regression benchmarks are said to show consistent gains over SFT and prior MLLM regression methods, especially in medium- and few-shot regimes.
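For readers unfamiliar with GRPO, its defining step is to score each sampled response relative to the other responses in the same group instead of using a learned value function. A minimal sketch of that normalization follows, assuming NumPy and scalar rewards. How the paper maps a batch-level CCC value onto per-response rewards is not described on this page, so the reward values below are placeholders.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    rewards = np.asarray(rewards, dtype=float)
    # eps keeps the division finite when every response earns the same reward
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Placeholder rewards for four responses sampled for the same prompt
adv = group_relative_advantages([0.2, 0.5, 0.9, 0.4])

# A group with identical rewards carries no learning signal
flat = group_relative_advantages([0.3, 0.3, 0.3])
```

Responses above the group mean receive positive advantages and those below receive negative ones. A group whose rewards are all equal yields zero advantage for every response.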

Significance. If the results hold, the contribution is significant as a practical, architecture-agnostic method for improving distributional fidelity in MLLM regression on imbalanced data. The work builds on the established CCC metric and uses GRPO as an off-the-shelf RL variant. It deserves credit for the clean framing of missing relational supervision and for requiring no new hyperparameters or model changes.

major comments (1)

[Method section (reward formulation)] The central claim that the batch-level CCC reward supplies effective cross-sample relational supervision to correct tail regression-to-the-mean is load-bearing, yet the manuscript does not address how random batching interacts with long-tailed targets. Tail samples typically constitute <5-10% of a random batch, so the CCC gradient is dominated by head samples; this leaves open whether any reported tail gains arise from the claimed mechanism or from generic RL effects, and whether batch size or sampling must be tuned (contradicting the plug-and-play assertion).
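The scale of the batching concern can be estimated with simple arithmetic. The sketch below assumes each batch slot is filled independently, with a fixed probability of drawing a tail sample. The 5% tail fraction is the lower end of the referee's range, and the batch sizes are illustrative.

```python
def prob_no_tail(tail_fraction, batch_size):
    """Probability that a randomly drawn batch contains no tail sample."""
    return (1.0 - tail_fraction) ** batch_size

def expected_tail_count(tail_fraction, batch_size):
    """Expected number of tail samples in a randomly drawn batch."""
    return tail_fraction * batch_size

for batch_size in (8, 16, 32):
    print(
        batch_size,
        expected_tail_count(0.05, batch_size),
        round(prob_no_tail(0.05, batch_size), 3),
    )
```

Under these assumptions about two thirds of batches of size 8 contain no tail sample at all, and about one in five batches of size 32 still contain none, so batch size plausibly affects how often the reward sees the tail.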

minor comments (1)

[Abstract] The claim of 'consistent improvements' and 'particularly strong gains' is stated without any numerical values, baseline names, or error bars; a one-sentence summary of key metrics would aid quick assessment.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment on the reward formulation below and will incorporate clarifications and additional analysis in the revision.


Referee: The central claim that the batch-level CCC reward supplies effective cross-sample relational supervision to correct tail regression-to-the-mean is load-bearing, yet the manuscript does not address how random batching interacts with long-tailed targets. Tail samples typically constitute <5-10% of a random batch, so the CCC gradient is dominated by head samples; this leaves open whether any reported tail gains arise from the claimed mechanism or from generic RL effects, and whether batch size or sampling must be tuned (contradicting the plug-and-play assertion).

Authors: We acknowledge that the manuscript does not explicitly analyze the interaction between random batching and long-tailed targets, which is a fair observation. The CCC reward is computed over the full batch and penalizes mismatches in correlation, scale, and location; this global signal can still constrain mean-regression bias even when tails are sparse, because deviations in batch-level statistics affect the reward for all samples. Our point-wise reward baselines already isolate generic RL effects, and the reported gains (particularly in few-shot regimes) are larger than those baselines. Nevertheless, we agree the mechanism would be stronger with direct evidence on batch composition. In the revised manuscript we will add a dedicated paragraph in the method section explaining the batch-level nature of CCC and include an ablation on batch size (standard values 8-32) plus a comparison of random vs. stratified batching to quantify tail-sample influence. These additions require no new hyperparameters or model changes, preserving the plug-and-play claim; batch size is a conventional training choice shared with any RL fine-tuning procedure.

revision: partial
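The random-versus-stratified comparison promised in the rebuttal could be set up along the following lines. This is a sketch under stated assumptions, not the authors' protocol: equal-width label bins, an equal number of draws per non-empty bin, and the function name `stratified_batch` are all choices made here for illustration.

```python
import numpy as np

def stratified_batch(labels, batch_size, n_bins=4, rng=None):
    """Draw sample indices so each non-empty label bin contributes equally."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels, dtype=float)

    # Equal-width bins over the observed label range; bin ids run 0..n_bins-1
    edges = np.linspace(labels.min(), labels.max(), n_bins + 1)
    bin_ids = np.digitize(labels, edges[1:-1])

    groups = [np.flatnonzero(bin_ids == b) for b in range(n_bins)]
    groups = [idx for idx in groups if idx.size > 0]
    per_bin = batch_size // len(groups)

    # Sample with replacement only when a bin has fewer members than requested
    picks = [
        rng.choice(idx, size=per_bin, replace=bool(idx.size < per_bin))
        for idx in groups
    ]
    return np.concatenate(picks)

# Hypothetical long-tailed labels: 95 head samples and 5 tail samples
labels = np.concatenate([np.linspace(20, 40, 95), np.linspace(80, 90, 5)])
batch = stratified_batch(labels, batch_size=12, n_bins=4, rng=0)
tail_in_batch = int((labels[batch] >= 80).sum())
```

With these toy labels the tail makes up 5% of the data but a third of each stratified batch. That is the kind of shift the ablation would need to separate from the effect of the reward itself.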

Circularity Check

0 steps flagged

No circularity; derivation relies on standard metrics and empirical validation


The paper identifies a limitation in existing MLLM regression (lack of cross-sample relational supervision in SFT and pointwise rewards) and proposes a plug-and-play RL framework that applies the established Concordance Correlation Coefficient as a batch-level reward inside Group Relative Policy Optimization. CCC directly quantifies the desired alignment in correlation, scale, and mean, but this is an explicit design choice rather than a self-referential definition or fitted input renamed as a prediction. No equations or self-citations reduce the central claim to its own inputs by construction; gains are asserted via benchmark experiments, not tautological derivation. The chain is self-contained against external benchmarks and standard RL techniques.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review prevents full audit; the method rests on the domain assumption that CCC supplies useful relational supervision and that GRPO can be applied plug-and-play without further justification.

axioms (1)

Domain assumption: Batch-level CCC reward supplies the missing cross-sample relational supervision needed for tail performance.
Invoked as the core mechanism to overcome regression-to-the-mean without additional evidence in the abstract.




Reference graph

Works this paper leans on

158 extracted references · 158 canonical work pages · 12 internal anchors

[1] AgeDB: the first manually collected, in-the-wild age database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
[2] Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[3] Variational imbalanced regression: Fair uncertainty quantification via probabilistic smoothing. Advances in Neural Information Processing Systems.
[4] ConR: Contrastive Regularizer for Deep Imbalanced Regression.
[5] Leveraging group classification with descending soft labeling for deep imbalanced regression. Proceedings of the AAAI Conference on Artificial Intelligence.
[6] Deep imbalanced regression via hierarchical classification adjustment. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[7] Dist Loss: Enhancing Regression in Few-Shot Region through Distribution Distance Constraint.
[8] Enhancing Numerical Prediction of MLLMs with Soft Labeling. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[9] Wu, Tianhe and Zou, Jian and Liang, Jie and Zhang, Lei and Ma, Kede.
[10] Visual instruction tuning. Advances in Neural Information Processing Systems.
[11] Unhackable Temporal Rewarding for Scalable Video MLLMs. arXiv preprint arXiv:2502.12081, 2025.
[12] Chatspot: Bootstrapping multimodal llms via precise referring instruction tuning. arXiv preprint arXiv:2307.09474.
[13] Merlin: Empowering multimodal llms with foresight minds. arXiv preprint arXiv:2312.00589.
[14] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv preprint arXiv:2307.15818.
[15] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
[16] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
[17] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025.
[18] A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 2018.
[19] Numeracy for language models: Evaluating and improving their ability to predict numbers. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[20] Q-insight: Understanding image quality via visual reinforcement learning. arXiv preprint arXiv:2503.22679.
[21] A concordance correlation coefficient to evaluate reproducibility. Biometrics, 1989.
[22] DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data. arXiv preprint arXiv:2505.15074.
[23] Understanding R1-Zero-Like Training: A Critical Perspective. arXiv preprint arXiv:2503.20783.
[24] DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO. arXiv preprint arXiv:2506.07464.
[25] DRO-REBEL: Distributionally Robust Relative-Reward Regression for Fast and Efficient LLM Alignment. arXiv preprint arXiv:2509.19104. Work page title: Online Distributionally Robust LLM Alignment via Regression to Relative Reward.
[26] Reinforcement Learning for Large Language Models via Group Preference Reward Shaping. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
[27] Teaching large language models to regress accurate image quality scores using score distribution. Proceedings of the Computer Vision and Pattern Recognition Conference.
[28] The RSNA pediatric bone age machine learning challenge. Radiology, 2019.
[29] Semi-supervised contrastive learning for deep regression with ordinal rankings from spectral seriation. Advances in Neural Information Processing Systems.
[30] Teach clip to develop a number sense for ordinal regression. European Conference on Computer Vision, 2024.
[31] Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and others.
[32] Decoupled Weight Decay Regularization. International Conference on Learning Representations.
[33] Perception-r1: Pioneering perception policy with reinforcement learning. arXiv preprint arXiv:2504.07954.
[34] Large-scale long-tailed recognition in an open world. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[35] VisNumBench: Evaluating Number Sense of Multimodal Large Language Models. arXiv preprint arXiv:2503.14939, 2025.
[36] Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression. arXiv preprint arXiv:2511.11239.
[37] Detect anything via next point prediction. arXiv preprint arXiv:2510.12798.
[38] Cls-rl: Image classification with rule-based reinforcement learning. arXiv preprint arXiv:2503.16188, 2025. Work page title: Think or not think: A study of explicit thinking in rule-based visual reinforcement fine-tuning.
[39] OrderChain: A General Prompting Paradigm to Improve Ordinal Understanding Ability of MLLM. arXiv preprint arXiv:2504.04801.
[40] Cogagent: A visual language model for gui agents. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[41] Adaptive contrast for image regression in computer-aided disease assessment. IEEE Transactions on Medical Imaging, 2021.
[42] Semi-supervised deep regression with uncertainty consistency and variational model ensembling via bayesian neural networks. Proceedings of the AAAI Conference on Artificial Intelligence.
[43] Ordinal regression by extended binary classification. Advances in Neural Information Processing Systems.
[44] Clip-count: Towards text-guided zero-shot object counting. Proceedings of the 31st ACM International Conference on Multimedia.
[45] Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[46] Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. International Conference on Machine Learning, 2022.
[47] Imagebind: One embedding space to bind them all. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[48] Rank-n-contrast: learning continuous representations for regression. Advances in Neural Information Processing Systems.
[49] Probing conceptual understanding of large visual-language models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[50] Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness. arXiv preprint arXiv:2306.16048.
[51] No token left behind: Explainability-aided image classification and generation. European Conference on Computer Vision, 2022.
[52] Deep ordinal regression network for monocular depth estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[53] Ordinal regression with multiple output cnn for age estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[54] Efficient Estimation of Word Representations in Vector Space. 2013.
[55] Deepgum: Learning deep robust regression with a gaussian-uniform mixture model. Proceedings of the European Conference on Computer Vision (ECCV).
[56] Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya. Improving language understanding with unsupervised learning. 2018.
[57] Adaptive variance based label distribution learning for facial age estimation. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII.
[58] Dating color images with ordinal classification. Proceedings of International Conference on Multimedia Retrieval.
[59] Photo aesthetics ranking network with attributes and content adaptation. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I.
[60] Soft labels for ordinal regression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[61]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Bridgenet: A continuity-aware probabilistic network for age estimation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[62]

Proceedings of the IEEE conference on computer vision and pattern recognition workshops , pages=

Fine-grained head pose estimation without keypoints , author=. Proceedings of the IEEE conference on computer vision and pattern recognition workshops , pages=

work page
[63]

Advances in neural information processing systems , volume=

Label distribution learning forests , author=. Advances in neural information processing systems , volume=

work page
[64]

International Journal of Computer Vision , volume=

Deep expectation of real and apparent age from a single image without facial landmarks , author=. International Journal of Computer Vision , volume=. 2018 , publisher=

work page 2018
[65]

7th international conference on automatic face and gesture recognition (FGR06) , pages=

Morph: A longitudinal image database of normal adult age-progression , author=. 7th international conference on automatic face and gesture recognition (FGR06) , pages=. 2006 , organization=

work page 2006
[66]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[67]

International Journal of Computer Vision , volume=

Learning to prompt for vision-language models , author=. International Journal of Computer Vision , volume=. 2022 , publisher=

work page 2022
[68]

IEEE transactions on pattern analysis and machine intelligence , volume=

Facial age estimation by learning from label distributions , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2013 , publisher=

work page 2013
[69]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Using ranking-CNN for age estimation , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[70]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Mean-variance loss for deep age estimation from a face , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[71]

Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XX 16 , pages=

Energy-based models for deep probabilistic regression , author=. Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XX 16 , pages=. 2020 , organization=

work page 2020
[72]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Learning probabilistic ordinal embeddings for uncertainty-aware regression , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[73]

Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXX 16 , pages=

Self-paced deep regression forests with consideration on underrepresented examples , author=. Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXX 16 , pages=. 2020 , organization=

work page 2020
[74]

Advances in neural information processing systems , volume=

Semi-supervised sequence learning , author=. Advances in neural information processing systems , volume=

work page
[75]

Proceedings of naacL-HLT , volume=

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of naacL-HLT , volume=

work page
[76]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Styleclip: Text-driven manipulation of stylegan imagery , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[77]

ActionCLIP: A New Paradigm for Video Action Recognition

Wang, Mengmeng and Xing, Jiazheng and Liu, Yong. ActionCLIP: A New Paradigm for Video Action Recognition. arXiv preprint. 2021

work page 2021
[78]

International Journal of Computer Vision , pages=

Clip-adapter: Better vision-language models with feature adapters , author=. International Journal of Computer Vision , pages=. 2023 , publisher=

work page 2023
[79]

Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

Zhang, Renrui and Fang, Rongyao and Zhang, Wei and Gao, Peng and Li, Kunchang and Dai, Jifeng and Qiao, Yu and Li, Hongsheng. Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling. arXiv preprint. 2021

work page 2021
[80]

ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Wav2clip: Learning robust audio representations from clip , author=. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2022 , organization=

work page 2022

Showing first 80 references.