Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models

Cheng Zhang; Hongxia Xie; Wen-Huang Cheng; Yuer Liu; Zhiyu Zhou

arxiv: 2605.15755 · v1 · pith:O4FTJYSSnew · submitted 2026-05-15 · 💻 cs.CV

Pith reviewed 2026-05-20 18:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords artwork emotion understanding · multimodal large language models · attribute salience · selective reasoning · formal attributes · affective prediction · EmoArt dataset

The pith

Guiding multimodal models to reason only from emotionally salient attributes improves artwork emotion predictions and explanation quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models often list many visible attributes when explaining emotions in art without isolating which ones actually support the affective judgment. The paper frames artwork emotion understanding as attribute-grounded selective reasoning, where only emotionally operative attributes should enter the interpretation. It extends the EmoArt dataset with human salience annotations on 1,400 artworks to supply instance-level supervision for this distinction. The authors introduce the FAB-G framework, which first predicts attribute salience and then restricts emotional analysis to the retained cues. This produces better results on emotion classification, arousal, and valence prediction while aligning more closely with human salience markings and generating shorter explanations than standard prompting methods.

Core claim

FAB-G uses a supervised multi-agent process to identify which formal attributes are emotionally salient for a given artwork, and then limits the multimodal large language model's emotional analysis to those retained attributes. This yields consistent improvements in emotion, arousal, and valence prediction accuracy, stronger agreement with human-marked salient attributes on Dice and Tversky metrics, and substantially more compact final explanations than standard prompting baselines. Cross-dataset evaluation also suggests that the salience selection transfers to other datasets.

What carries the argument

The formal-attribute bottleneck in the FAB-G framework, which predicts salience for each predefined formal attribute and then constrains the multimodal model's reasoning to only the emotionally operative subset.
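The bottleneck described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword-based agents stand in for the trained salience predictors, and the names `keyword_agent`, `formal_attribute_bottleneck`, `build_constrained_prompt`, the artwork record layout, and the prompt wording are all hypothetical. Only the five attribute names are taken from the paper's description of FAB-G.

```python
from typing import Callable, Dict, List

# The five attribute names come from the paper's description of FAB-G.
ATTRIBUTES = ["Color", "Composition", "Line", "Light", "Brushstroke"]

# Hypothetical agent interface: takes an artwork record, returns a binary salience vote.
SalienceAgent = Callable[[dict], bool]


def keyword_agent(attribute: str, keywords: List[str]) -> SalienceAgent:
    """Toy stand-in for a trained salience predictor (assumption, not the paper's model)."""
    def agent(artwork: dict) -> bool:
        description = artwork["attributes"][attribute].lower()
        return any(keyword in description for keyword in keywords)
    return agent


def formal_attribute_bottleneck(artwork: dict, agents: Dict[str, SalienceAgent]) -> List[str]:
    """Keep only the attributes whose agent votes 'salient'."""
    return [name for name in ATTRIBUTES if agents[name](artwork)]


def build_constrained_prompt(artwork: dict, retained: List[str]) -> str:
    """Expose only the retained attribute descriptions to the downstream model."""
    evidence = [f"- {name}: {artwork['attributes'][name]}" for name in retained]
    if not evidence:
        evidence = ["- (no attribute was marked salient)"]
    header = (
        "Using only the formal attributes listed here, give the emotion, "
        "arousal, and valence of the artwork with a brief justification."
    )
    return "\n".join([header, *evidence])


# Illustrative record; the descriptions are invented for this example.
artwork = {
    "attributes": {
        "Color": "Saturated reds against dark blues",
        "Composition": "Centered figure",
        "Line": "Soft contours",
        "Light": "Harsh contrast from a single source",
        "Brushstroke": "Fine, barely visible strokes",
    }
}

agents = {
    "Color": keyword_agent("Color", ["saturated"]),
    "Composition": keyword_agent("Composition", ["diagonal"]),
    "Line": keyword_agent("Line", ["jagged"]),
    "Light": keyword_agent("Light", ["harsh"]),
    "Brushstroke": keyword_agent("Brushstroke", ["thick"]),
}

retained = formal_attribute_bottleneck(artwork, agents)
prompt = build_constrained_prompt(artwork, retained)
print(retained)
print(prompt)
```

In this toy run only Color and Light pass the bottleneck, so the downstream prompt never mentions the other three attributes. That is the mechanism the paper credits for shorter explanations.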

If this is right

Yields consistent gains in emotion, arousal, and valence prediction accuracy.
Achieves stronger agreement with human-marked salient attributes under Dice and Tversky metrics.
Produces substantially more compact final explanations than prompting-based baselines.
The attribute salience selection transfers beyond the source distribution of EmoArt.
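The Dice and Tversky agreement scores mentioned above can be computed on sets of attribute names. The sketch below uses the standard set-based definitions. The `alpha` and `beta` weights and the convention of returning 1.0 when both sets are empty are assumptions, since this page does not state the paper's settings.

```python
from typing import Set


def dice(pred: Set[str], gold: Set[str]) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not pred and not gold:
        return 1.0  # assumed convention: two empty selections agree perfectly
    return 2 * len(pred & gold) / (len(pred) + len(gold))


def tversky(pred: Set[str], gold: Set[str], alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky index: TP / (TP + alpha * FP + beta * FN).

    alpha weights attributes predicted but not human-marked (false positives),
    beta weights human-marked attributes the model missed (false negatives).
    """
    tp = len(pred & gold)
    fp = len(pred - gold)
    fn = len(gold - pred)
    denom = tp + alpha * fp + beta * fn
    return 1.0 if denom == 0 else tp / denom


# Illustrative sets: the model over-selects one attribute.
predicted = {"Color", "Light", "Line"}
human = {"Color", "Light"}

print(dice(predicted, human))
print(tversky(predicted, human))
print(tversky(predicted, human, alpha=0.7, beta=0.3))
```

With `alpha = beta = 0.5` the Tversky index reduces to Dice. Raising `alpha` penalizes over-selection more heavily, which is the failure mode the paper calls attribute flooding.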

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The selective bottleneck could reduce irrelevant detail in other vision-language tasks that involve subjective judgments.
Instance-level human annotations on attribute salience might serve as a template for grounding reasoning in additional multimodal settings.
Attribute-specific boundary cases identified in cross-dataset tests could be studied to improve salience prediction further.

Load-bearing premise

The predefined formal attributes from EmoArt are sufficient to represent the cues that drive emotional responses, and the annotations by 15 art-trained people on 1,400 artworks supply reliable supervision for which attributes are salient.

What would settle it

If applying the FAB-G framework to a new collection of artworks yields no gains over prompting baselines or fails to match fresh human salience annotations on Dice and Tversky scores, the advantage of attribute-grounded selection would not hold.

Figures

Figures reproduced from arXiv: 2605.15755 by Cheng Zhang, Hongxia Xie, Wen-Huang Cheng, Yuer Liu, Zhiyu Zhou.

**Figure 1.** Overview of the EmoArt annotation structure and the proposed human salience extension. The top panel shows an EmoArt sample with content, …

**Figure 2.** Pipeline for the base EmoArt resource and the supplementary salience extension.

**Figure 3.** Distribution of 28 common emotions in the valence–arousal space

**Figure 4.** Distribution of the major categories and subcategories in EmoArt. The …

**Figure 5.** Overview of FAB-G. Five attribute-specific agents predict attribute salience, their outputs are aggregated into a formal-attribute bottleneck, and a …

**Figure 6.** Qualitative case study of attribute flooding and bottleneck-guided reasoning. Baseline methods activate a broad set of visible attributes, whereas …

read the original abstract

Multimodal large language models (MLLMs) can produce fluent artwork emotion explanations, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the affective judgment. We therefore formulate artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), where predefined formal attributes serve as evidence units and only emotionally operative attributes should enter the final interpretation. To make this problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a 132,664-artwork resource with content, formal-attribute, valence-arousal, and emotion annotations, by adding a 1,400-artwork human salience extension annotated by 15 art-trained annotators. This extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then constrains downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human-marked salient attributes under Dice and Tversky metrics, and produces substantially more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute-specific boundary cases. The dataset and project page are available at https://zhiliangzhang.github.io/EmoArt-130k/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formulates artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR) to mitigate attribute flooding in MLLM outputs. It extends EmoArt with a 1,400-artwork human salience annotation set labeled by 15 art-trained annotators, providing instance-level supervision for distinguishing present versus emotionally operative formal attributes. The proposed FAB-G multi-agent framework first predicts attribute salience and then constrains downstream emotion, arousal, and valence reasoning to the retained cues. Experiments report consistent gains over prompting baselines in prediction metrics, higher Dice/Tversky agreement with human salience labels, and substantially more compact explanations; cross-dataset results are also presented.

Significance. If the results hold, the work supplies a concrete, measurable approach to grounding MLLM explanations in emotionally relevant attributes for affective art analysis, together with a new supervised resource that could support future selective-reasoning research. The emphasis on compactness and cross-dataset transfer is a positive step toward practical interpretability.

major comments (2)

[Dataset extension] Dataset extension section: The central claim that FAB-G achieves stronger Dice/Tversky agreement with human-marked salient attributes rests on the 1,400-artwork salience labels serving as reliable ground truth, yet no inter-annotator agreement statistics (e.g., mean pairwise Dice, Fleiss’ kappa, or per-attribute consistency) are reported. Without these, it is impossible to assess whether the supervised salience predictor is trained on a stable signal or on annotation noise.
[Experiments] Experiments section (cross-dataset evaluation paragraph): The claim of transferability beyond the EmoArt distribution is load-bearing for the broader applicability of AGSR, but the manuscript provides no details on the target datasets, how the predefined EmoArt formal attributes are mapped or adapted, or any domain-shift controls. This omission prevents evaluation of whether observed gains reflect genuine attribute grounding or dataset-specific artifacts.
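The agreement statistics requested in the first major comment can be computed as in the sketch below. It shows mean pairwise Dice over annotators' salient-attribute sets and Fleiss' kappa over binary salient/not-salient votes. It assumes every item is rated by the same number of annotators, which this page does not confirm for the 1,400-artwork extension, and all data values are invented for illustration.

```python
from itertools import combinations
from typing import List, Sequence, Set


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient between two attribute sets (1.0 if both are empty, by assumption)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def mean_pairwise_dice(annotations: List[Set[str]]) -> float:
    """Average Dice over all annotator pairs for a single artwork."""
    pairs = list(combinations(annotations, 2))
    return sum(dice(a, b) for a, b in pairs) / len(pairs)


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa. counts[i][j] = number of raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_cats = len(counts[0])
    # Overall proportion of ratings falling in each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Per-item observed agreement.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)


# Three hypothetical annotators marking salient attributes for one artwork.
votes = [{"Color", "Light"}, {"Color"}, {"Color", "Light"}]
print(mean_pairwise_dice(votes))

# Each row is one (artwork, attribute) item: [salient votes, not-salient votes].
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2]]
print(fleiss_kappa(perfect))
print(fleiss_kappa(split))
```

Treating each (artwork, attribute) pair as one item gives a single kappa for the whole extension. Restricting the rows to one attribute at a time gives the per-attribute consistency the referee also asks for.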

minor comments (2)

[Abstract] Abstract: The acronym AGSR is used before its expansion; expand on first use for clarity.
[Figure 5] Figure 5 (framework diagram): Agent roles and information flow between the salience predictor and the constrained reasoner could be labeled more explicitly to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below and outline the revisions we plan to make.

read point-by-point responses

Referee: [Dataset extension] Dataset extension section: The central claim that FAB-G achieves stronger Dice/Tversky agreement with human-marked salient attributes rests on the 1,400-artwork salience labels serving as reliable ground truth, yet no inter-annotator agreement statistics (e.g., mean pairwise Dice, Fleiss’ kappa, or per-attribute consistency) are reported. Without these, it is impossible to assess whether the supervised salience predictor is trained on a stable signal or on annotation noise.

Authors: We agree that inter-annotator agreement statistics are essential for validating the reliability of the human salience annotations as ground truth. The 15 annotators are all trained in art history or related fields, which we believe contributes to consistency, but we acknowledge that explicit quantitative measures would provide stronger evidence. In the revised manuscript, we will report mean pairwise Dice scores, Fleiss’ kappa, and per-attribute consistency metrics computed on the 1,400-artwork salience labels. This addition will help demonstrate that the supervised signal is stable rather than noisy.

revision: yes

Referee: [Experiments] Experiments section (cross-dataset evaluation paragraph): The claim of transferability beyond the EmoArt distribution is load-bearing for the broader applicability of AGSR, but the manuscript provides no details on the target datasets, how the predefined EmoArt formal attributes are mapped or adapted, or any domain-shift controls. This omission prevents evaluation of whether observed gains reflect genuine attribute grounding or dataset-specific artifacts.

Authors: We appreciate this observation regarding the cross-dataset evaluation. To strengthen the presentation of transferability, we will revise the relevant paragraph to include specific details on the target datasets employed, the procedure used to map or adapt the EmoArt formal attributes to these datasets, and any domain-shift controls or analyses performed. We will also discuss potential limitations and boundary cases to clarify that the gains are attributable to attribute grounding rather than artifacts. These additions will make the evaluation more transparent and reproducible.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; independent human salience supervision supports downstream claims

full rationale

The paper collects a new 1,400-artwork human salience annotation set from 15 art-trained annotators as explicit instance-level supervision for training the attribute salience predictor inside FAB-G. This supervision is distinct from the target emotion/arousal/valence labels in the original EmoArt resource. The salience predictor is trained and evaluated directly against the human salience marks (Dice/Tversky agreement), after which the retained attributes are used to constrain MLLM reasoning for emotion prediction. No equation or derivation reduces the final emotion predictions to a statistical fit performed on the same emotion scores; the salience step is an externally supervised sub-task. Self-citations to the prior EmoArt paper are present but not load-bearing for the central claims, which rest on the new annotations and cross-dataset transfer results rather than any self-referential uniqueness theorem or ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim depends on the domain assumption that formal attributes are the appropriate evidence units and on the new human salience annotations as supervision; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: Predefined formal attributes serve as sufficient evidence units for distinguishing emotionally operative cues from merely present ones.
Invoked when formulating artwork emotion understanding as Attribute-Grounded Selective Reasoning.

invented entities (1)

FAB-G multi-agent framework (no independent evidence)
purpose: Predicts attribute-level salience and constrains downstream emotional analysis to retained cues.
New supervised framework introduced to implement AGSR.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost.lean · Jcost_pos_of_ne_one · unclear
Paper passage: FAB-G ... five attribute-specific agents that make binary salience judgments over Color, Composition, Line, Light, and Brushstroke ... formal-attribute bottleneck

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We extend EmoArt ... 1,400-artwork human salience extension annotated by 15 art-trained annotators

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
