Specificity-aware reinforcement learning for fine-grained open-world classification
Pith reviewed 2026-05-15 16:41 UTC · model grok-4.3
The pith
A reinforcement learning method steers large multimodal models toward both correct and specific predictions in open-world fine-grained image classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpeciaRL fine-tunes reasoning LMMs for fine-grained image classification under the open-world setting by introducing a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions.
What carries the argument
The verifier-based reward signal anchored to the best predictions within online rollouts, which encourages more specific outputs without reducing correctness.
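A minimal sketch of what such a rollout-anchored reward could look like, assuming the verifier assigns each rollout an ordinal specificity level plus a correctness flag; the names (Level, Verdict, rollout_rewards) and the exact scoring are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a rollout-anchored specificity reward. The anchor is
# the best specificity level the model itself reached among correct rollouts,
# so the reward target tracks the model's current capabilities.
from dataclasses import dataclass
from enum import IntEnum

class Level(IntEnum):
    WRONG = 0
    GENERIC = 1
    LESS_SPECIFIC = 2
    SPECIFIC = 3
    MORE_SPECIFIC = 4

@dataclass
class Verdict:
    level: Level   # how specific the verifier judged this prediction
    correct: bool  # whether it lies on the ground-truth taxonomy path

def rollout_rewards(verdicts: list[Verdict]) -> list[float]:
    """Score each rollout relative to the best correct rollout in the batch."""
    correct_levels = [v.level for v in verdicts if v.correct]
    # Anchor to the best level the model demonstrably achieves right now.
    anchor = max(correct_levels, default=Level.WRONG)
    rewards = []
    for v in verdicts:
        if not v.correct:
            rewards.append(-1.0)  # incorrect predictions are penalized outright
        else:
            # Correct rollouts earn more the closer they come to the anchor;
            # matching the in-batch best specificity earns the full reward.
            rewards.append(float(v.level) / float(max(anchor, 1)))
    return rewards
```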
If this is right
- SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks.
- The method surpasses existing approaches in out-of-domain experiments for open-world fine-grained image classification.
- Reasoning LMMs can be steered to use their intrinsic fine-grained knowledge more effectively through this rollout-based reward design.
- Open-world fine-grained classification advances by balancing accuracy and detail without relying on a predefined label set.
Where Pith is reading between the lines
- The same rollout-verification reward pattern could be tested on tasks such as generating detailed image captions or attribute lists where specificity also matters.
- Similar mechanisms might reduce overly broad outputs when LMMs are applied to medical or scientific image domains with subtle distinctions.
- Extending the approach to video or 3D data would require checking whether the verifier remains reliable across temporal or spatial rollouts.
- Combining SpeciaRL with other alignment techniques could further stabilize performance when the model encounters entirely novel fine-grained categories.
Load-bearing premise
The verifier-based reward signal derived from online rollouts can reliably promote specificity without introducing bias or reducing the model's ability to produce correct predictions on unseen fine-grained concepts.
What would settle it
An experiment in which models fine-tuned with SpeciaRL show either lower accuracy on held-out fine-grained classes or no measurable increase in specificity compared with standard fine-tuning or prompting baselines.
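As a concrete illustration, the trade-off such an experiment would measure can be summarized with two numbers per model, assuming per-example verifier labels drawn from the judging categories used in the paper; the correctness set and specificity weights below are assumptions made for this sketch, not the paper's evaluation protocol.

```python
# Illustrative correctness/specificity trade-off computation from verifier
# labels. The weights are hypothetical; only the label names follow the paper.
CORRECT = {"Specific", "More Specific", "Less Specific", "Generic"}
SPECIFICITY = {"More Specific": 1.0, "Specific": 1.0,
               "Less Specific": 0.5, "Generic": 0.0}

def trade_off(labels: list[str]) -> tuple[float, float]:
    """Return (correctness rate, mean specificity among correct predictions)."""
    correct = [l for l in labels if l in CORRECT]
    correctness = len(correct) / len(labels) if labels else 0.0
    specificity = (sum(SPECIFICITY[l] for l in correct) / len(correct)
                   if correct else 0.0)
    return correctness, specificity

# Example: compute both metrics for one model on one held-out split, then
# compare against a baseline fine-tuned or prompted on the same split.
print(trade_off(["Specific", "Generic", "Wrong", "More Specific", "Abstain"]))
```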
Original abstract
Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at https://github.com/s-angheben/SpeciaRL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SpeciaRL, a specificity-aware reinforcement learning framework for fine-tuning reasoning large multimodal models on fine-grained open-world image classification. It introduces a dynamic verifier-based reward signal derived from the best predictions among online rollouts to promote specificity while respecting model capabilities and avoiding incorrect predictions, claiming superior correctness-specificity trade-offs on out-of-domain benchmarks relative to prior methods.
Significance. If the empirical claims hold after clarification, the work would advance open-world fine-grained classification by providing a practical RL mechanism to steer LMMs toward more precise outputs without accuracy loss. The public release of code and models supports reproducibility.
Major comments (2)
- [Abstract] The central claim of the best trade-off on extensive out-of-domain benchmarks is asserted without any reported metrics, baselines, statistical tests, or verifier implementation details, leaving the empirical support for the main result invisible.
- [Method] Verifier reward definition: the dynamic reward anchored to the 'best' rollout prediction requires an explicit definition of the verifier (e.g., separate model, rule-based, or self-consistency) and of the scoring rule for 'best' when ground truth is unavailable; without this, it is impossible to verify that specificity gains are not achieved at the expense of correctness on unseen fine-grained classes.
Minor comments (1)
- [Abstract] The abstract would benefit from naming the specific fine-grained benchmarks and at least one quantitative result to ground the superiority claim.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below and propose revisions to improve clarity and completeness.
Point-by-point responses
- Referee: [Abstract] The central claim of the best trade-off on extensive out-of-domain benchmarks is asserted without any reported metrics, baselines, statistical tests, or verifier implementation details, leaving the empirical support for the main result invisible.
  Authors: We agree that the abstract, as a high-level summary, omits specific numbers for brevity. The full paper reports comprehensive results in Section 4, with metrics, baselines, and analyses demonstrating the best trade-off. In the revised version we will incorporate key empirical highlights into the abstract, such as the superior performance on out-of-domain benchmarks, and briefly reference the verifier approach. Revision: yes.
- Referee: [Method] Verifier reward definition: the dynamic reward anchored to the 'best' rollout prediction requires an explicit definition of the verifier (e.g., separate model, rule-based, or self-consistency) and of the scoring rule for 'best' when ground truth is unavailable; without this, it is impossible to verify that specificity gains are not achieved at the expense of correctness on unseen fine-grained classes.
  Authors: Thank you for this important clarification request. The current manuscript introduces the verifier-based reward in Section 3 but does not provide sufficient implementation details. We will revise the method section to explicitly define the verifier (a self-consistency mechanism across rollouts combined with a specificity scoring rule based on prediction granularity) and to detail how the 'best' prediction is selected without ground truth: consistent and specific outputs are prioritized, while potential inaccuracies are penalized through capability-aware clipping. The revision will include pseudocode and examples so that readers can verify correctness is preserved. Revision: yes.
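A hedged sketch of the self-consistency selection the rebuttal describes, under assumed semantics: rollout predictions that recur are treated as consistent, and the specificity credit is clipped to a level the model demonstrably reaches (capability-aware clipping). All names and thresholds here are hypothetical, pending the pseudocode the authors promise.

```python
# Hypothetical best-prediction selection without ground truth, combining
# self-consistency voting with capability-aware clipping of specificity.
from collections import Counter
from typing import Callable

def select_best(predictions: list[str],
                specificity_of: Callable[[str], int],
                min_agreement: int = 2,
                capability_cap: int = 4) -> str:
    """Pick the most specific self-consistent prediction among the rollouts.

    Assumes at least one rollout prediction is provided.
    """
    counts = Counter(predictions)
    # Treat predictions repeated across rollouts as self-consistent.
    consistent = [p for p, c in counts.items() if c >= min_agreement]
    if not consistent:
        # Nothing recurs: fall back to the single most frequent prediction.
        return counts.most_common(1)[0][0]
    # Capability-aware clipping: cap the specificity credit at a level the
    # model already demonstrates, so the target never rewards guesswork.
    scored = [(min(specificity_of(p), capability_cap), p) for p in consistent]
    return max(scored)[1]
```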
Circularity Check
No circularity: empirical RL method with no self-referential derivations or fitted predictions
Full rationale
The paper proposes SpeciaRL as an empirical reinforcement learning framework that introduces a dynamic verifier-based reward anchored to online rollouts. No equations, derivations, or parameter-fitting steps are presented that reduce the claimed trade-off between correctness and specificity to a self-definition, fitted input, or self-citation chain. The approach is described as steering LMMs via RL without invoking uniqueness theorems, ansatzes smuggled through citations, or renaming known results. The central claim rests on experimental outcomes evaluated against external benchmarks rather than on a closed mathematical loop.
Reference graph
Works this paper leans on
- [1] Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, and Sergey Levine. LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models. arXiv preprint arXiv:2311.18232, 2023.
- [2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022.
- [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [4] Abhijit Bendale and Terrance Boult. Towards open world recognition. In CVPR, pages 1893–1902, 2015.
- [5] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In ECCV, 2014.
- [6] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [7] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.
- [8] Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, and Elisa Ricci. Vocabulary-free image classification. NeurIPS, 2023.
- [9] Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yiming Wang, Paolo Rota, and Elisa Ricci. On large multimodal models as open-world image classifiers. In ICCV, 2025.
- [10] Bryan LM de Oliveira, Felipe V Frujeri, Marcos PCM Queiroz, Luana GB Martins, Telma W de L Soares, and Luckeciano C Melo. Learning without critics? Revisiting GRPO in classical reinforcement learning environments. arXiv preprint arXiv:2511.03527, 2025.
- [11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
- [12] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [13] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. MME: A comprehensive evaluation benchmark for multimodal large language models, 2023.
- [14] Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025.
- [15] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [16] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.
- [17] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [18] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- [19] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 35:22199–22213, 2022.
- [20] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In ICCV-WS, 2013.
- [21] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the Symposium on Operating Systems Principles, pages 611–626, 2023.
- [22] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.
- [23] J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, 1977.
- [24] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. SEED-Bench: Benchmarking multimodal large language models. In CVPR, 2024.
- [25] Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. LLaVA-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild, 2024.
- [26] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
- [27] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
- [28] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024.
- [29] Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, and Kaipeng Zhang. Think or not think: A study of explicit thinking in rule-based visual reinforcement fine-tuning. arXiv preprint arXiv:2503.16188, 2025.
- [30] Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, and Jingdong Wang. Revisiting MLLMs: An in-depth analysis of image classification abilities. arXiv preprint arXiv:2412.16418, 2024.
- [31] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In ECCV, 2024.
- [32] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
- [33] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.
- [34] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
- [35] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008.
- [36] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 35:27730–27744, 2022.
- [37] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, 2012.
- [38] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
- [39] Lars Schmarje, Monty Santarossa, Simon-Martin Schröder, and Reinhard Koch. A survey on semi-, self- and unsupervised learning for image classification. IEEE Access, 9:82146–82168, 2021.
- [40] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [41] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.
- [42] Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. FLAVA: A foundational language and vision alignment model. In CVPR, 2022.
- [43] Vésteinn Snæbjarnarson, Kevin Du, Niklas Stoehr, Serge Belongie, Ryan Cotterell, Nico Lang, and Stella Frank. Taxonomy-aware evaluation of vision-language models. In CVPR, pages 9109–9120, 2025.
- [44] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. NeurIPS, 33:3008–3021, 2020.
- [45] Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu. Crossing the reward bridge: Expanding RL with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829, 2025.
- [46] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.
- [47] Yuwen Tan, Yuan Qing, and Boqing Gong. Vision LLMs are bad at hierarchical visual understanding, and LLMs are the bottleneck. arXiv preprint arXiv:2505.24840, 2025.
- [48] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
- [49]
- [50] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset, 2011.
- [51] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [52] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022.
- [53] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.
- [54] Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. InternLM-Math: Open math large language models toward verifiable reasoning. arXiv preprint arXiv:2402.06332, 2024.
- [55] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- [56] Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, and Ser-Nam Lim. Object recognition as next token prediction. In CVPR, 2024.
- [57] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025.
- [58] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. STaR: Bootstrapping reasoning with reasoning. In NeurIPS, 2022.
- [59] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.
- [60] Kechi Zhang, Ge Li, Yihong Dong, Jingjing Xu, Jun Zhang, Jing Su, Yongfei Liu, and Zhi Jin. CodeDPO: Aligning code models with self-generated and verified source code. In ACL, pages 15854–15871, 2025.
- [61] Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning. In ACL, pages 1631–1662, 2025.
- [62] Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, and Serena Yeung-Levy. Why are visually-grounded language models bad at image classification? NeurIPS, 2024.
- [63] Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, et al. Automated generation of challenging multiple-choice questions for vision language model evaluation. In CVPR, pages 29580–29590, 2025.
- [64] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024.
- [65] Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025.
Verifier decision procedure (from the paper's supplementary material, Figure 16)
- If the prediction is an abstention/refusal/uncertainty (e.g., "none", "cannot tell", "I don't know"): output Abstain.
- If the prediction is malformed, nonsense, unrelated, contradictory, or gives multiple options (e.g., "A or B", lists): output Wrong.
- If the prediction and the ground truth denote the same entity via exact match or direct synonym: output Specific.
- If the prediction is a parent category of the ground truth: if the parent is close (e.g., genus for species), output Less Specific; if the parent is broad/coarse (e.g., animal for dog), output Generic.
- If the prediction is a child/subtype/instance of the ground truth: output More Specific.
- Otherwise: output Wrong.
Input format: {"ground_truth": "<the_ground_truth_label>", "prediction": "<the_vlm_prediction>"}
Output format: a single word from the allowed categories.
Prompt: "Apply the decision procedure to classify the following JSON object. Output exactly one category word. INPUT: %s"
Figure 16. Generated prompt for the LLM-as-a-judge verifier.
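To make the procedure concrete, here is a runnable toy version that replaces the LLM judge with lookups in a tiny hand-written taxonomy; the taxonomy, the helper names, and the omitted synonym handling are illustrative assumptions, not the paper's released verifier.

```python
# Toy rule-based stand-in for the LLM-as-a-judge verifier above. The PARENT
# taxonomy is a hypothetical example; a real verifier reasons over free text.
PARENT = {  # child -> parent
    "golden retriever": "dog", "dog": "canine", "canine": "animal",
}

def ancestors(label: str) -> list[str]:
    """Walk up the taxonomy from label (direct parent first)."""
    chain = []
    while label in PARENT:
        label = PARENT[label]
        chain.append(label)
    return chain

def judge(ground_truth: str, prediction: str) -> str:
    """Mirror the prompt's categories with taxonomy lookups instead of an LLM."""
    if prediction in {"none", "cannot tell", "i don't know"}:
        return "Abstain"
    if prediction == ground_truth:
        return "Specific"
    parents = ancestors(ground_truth)
    if prediction in parents:
        # Direct parent counts as close; anything higher counts as coarse.
        return "Less Specific" if prediction == parents[0] else "Generic"
    if ground_truth in ancestors(prediction):
        return "More Specific"
    return "Wrong"

assert judge("golden retriever", "dog") == "Less Specific"
assert judge("golden retriever", "animal") == "Generic"
assert judge("dog", "golden retriever") == "More Specific"
```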