arxiv: 2605.20867 · v1 · pith:SNOTCTLFnew · submitted 2026-05-20 · 💻 cs.MA · cs.CV

ProCrit: Self-Elicited Multi-Perspective Reasoning with Critic-Guided Revision for Multimodal Sarcasm Detection

Pith reviewed 2026-05-21 02:32 UTC · model grok-4.3

classification 💻 cs.MA cs.CV
keywords multimodal sarcasm detection · multi-perspective reasoning · critic-guided revision · agentic rollout · proposal-critic framework · reinforcement learning · vision-language model

The pith

ProCrit enables adaptive multi-perspective reasoning for multimodal sarcasm detection through a proposal-critic agent framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that effective multimodal sarcasm detection requires generating and integrating sample-specific analytical perspectives rather than using fixed sets. ProCrit achieves this with a proposal agent that reasons across multiple perspectives and a critic agent that evaluates and suggests revisions. To train this without existing process data, it creates synthetic reasoning trajectories by having a vision-language model assume different roles in sequence. The system is trained with reinforcement learning that refines both the drafting and the critiquing based on final outcomes. This leads to better performance on sarcasm detection benchmarks by addressing variable incongruities between text and image.

Core claim

ProCrit is a two-agent framework consisting of a proposal agent for self-elicited multi-perspective reasoning and a critic agent for identifying deficiencies and providing revision guidance, trained via mutual refinement with dual-stage reinforcement learning after synthesizing process annotations through dynamic-role agentic rollouts.

What carries the argument

A Proposal-Critic two-agent framework that follows a draft-critique-revise paradigm and is trained on reasoning annotations synthesized from agentic rollouts, so that the perspectives are self-elicited rather than predefined.
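The draft-critique-revise control flow can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Proposal` type, the three callable signatures, the single revision round, and the toy agents are all assumptions introduced here; in the paper both agents are vision-language models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Proposal:
    reasoning: str  # multi-perspective analysis produced by the proposal agent
    answer: str     # "yes" (sarcastic) or "no" (not sarcastic)


# Hypothetical signatures: each callable stands in for a model call.
Draft = Callable[[str], Proposal]
Critique = Callable[[str, Proposal], str]
Revise = Callable[[str, Proposal, str], Proposal]


def draft_critique_revise(sample: str, draft: Draft,
                          critique: Critique, revise: Revise) -> Proposal:
    """Run one draft, one critique, and one feedback-guided revision."""
    first = draft(sample)
    feedback = critique(sample, first)      # natural-language feedback
    return revise(sample, first, feedback)  # revision conditioned on feedback


# Toy stand-ins so the sketch runs end to end (assumed, not from the paper).
def toy_draft(sample: str) -> Proposal:
    return Proposal("Step1: Literal reading of the caption.", "no")


def toy_critique(sample: str, proposal: Proposal) -> str:
    return "The draft ignores the contrast between caption and image."


def toy_revise(sample: str, proposal: Proposal, feedback: str) -> Proposal:
    revised = proposal.reasoning + " Step2: Image-text contrast, per critic feedback."
    return Proposal(revised, "yes")


final = draft_critique_revise(
    "caption: 'Lovely weather' | image: flooded street",
    toy_draft, toy_critique, toy_revise,
)
print(final.answer)
```

Keeping the critic as a separate callable mirrors the paper's point that the evaluation is external to the proposal agent rather than a self-correction step.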

If this is right

  • Self-elicited perspectives adapt to the specific sarcastic mechanisms in each sample without predefined rules.
  • The critic provides targeted natural-language feedback to improve reasoning quality.
  • Mutual-refinement training optimizes both proposal and critic agents together.
  • Detection accuracy improves on three widely used multimodal sarcasm benchmarks.
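To make the mutual-refinement bullet concrete, here is a hedged sketch of how a critic could be rewarded by revision outcomes and how GRPO-style group-relative advantages are computed from those rewards. The Figure 3 caption states that the critic is optimized with revision outcomes as rewards and names GRPO; the binary reward, the function names, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def critic_reward(revised_answer: str, gold: str) -> float:
    """Assumed binary reward: 1.0 if the revision guided by this critique is correct."""
    return 1.0 if revised_answer == gold else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four critiques sampled for the same draft, each followed by a revision.
revised_answers = ["yes", "no", "yes", "no"]  # made-up outcomes
rewards = [critic_reward(a, gold="yes") for a in revised_answers]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Critiques whose revisions end up correct receive positive advantages and the rest negative ones; if every critique in a group leads to the same outcome, all advantages are zero and the group contributes no learning signal.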

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar agentic approaches could be applied to other multimodal reasoning tasks where analytical perspectives vary by instance.
  • The synthesis of process-level annotations might help in domains lacking explicit reasoning supervision.
  • Combining proposal and critic agents may enhance reliability in other AI reasoning systems.

Load-bearing premise

The process-level reasoning annotations synthesized by the dynamic-role agentic rollout accurately capture the cross-perspective dependencies required for sarcasm detection and provide effective supervision for the agents.

What would settle it

Demonstrating that performance on the sarcasm detection benchmarks remains unchanged when using fixed perspectives or removing the critic revision step would falsify the claim that self-elicited multi-perspective reasoning with critic guidance is key to the improvement.
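The comparison described above amounts to an ablation: score the full system and each reduced variant on the same labels and inspect the accuracy gaps. The sketch below shows that bookkeeping only; the variant names and every prediction in it are made-up placeholders, not results from the paper.

```python
from typing import Dict, List


def accuracy(preds: List[str], gold: List[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if not gold or len(preds) != len(gold):
        raise ValueError("preds and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def accuracy_drop(variants: Dict[str, List[str]], gold: List[str],
                  full_key: str = "full") -> Dict[str, float]:
    """Accuracy of the full system minus accuracy of each ablated variant."""
    full_acc = accuracy(variants[full_key], gold)
    return {
        name: full_acc - accuracy(preds, gold)
        for name, preds in variants.items()
        if name != full_key
    }


# Placeholder labels and predictions, for illustration only.
gold = ["yes", "no", "yes", "no"]
variants = {
    "full": ["yes", "no", "yes", "no"],
    "fixed_perspectives": ["yes", "no", "no", "no"],
    "no_critic": ["yes", "yes", "no", "no"],
}
drops = accuracy_drop(variants, gold)
print(drops)
```

Drops near zero for both variants would be the outcome that undermines the claim; a real study would also need a significance test over the paired predictions.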

Figures

Figures reproduced from arXiv: 2605.20867 by Baokui Guo, Bowen Zhang, Jiulong Wu, Min Cao, Siyuan Chai, Yingjia Xu.

Figure 1: Illustration of the reasoning annotation synthesis process.
Figure 2: Illustration of the draft–critique–revise …
Figure 3: Overview of mutual-refinement training. Top: Both agents are initialized via SFT. Bottom left: The critic is optimized while the proposal is frozen, using revision outcomes as rewards. Bottom right: The proposal is optimized in dual stages (draft and revision) while the critic is frozen.
  Adjacent text (Section 2.3.3, Critic agent: Reinforcement Learning): After supervised fine-tuning, the critic agent is further optimized via GRPO …
Figure 6: Examples illustrating that different sarcastic samples involve distinct mechanisms and …
Figure 7: Comparison between existing dataset annotations and the process-level reasoning annota…
Figure 8: Examples of reasoning trajectory synthesized through dynamic-role agentic rollout. Each …
Figure 9: Successful answer correction via critic-guided revision. The draft produces an incorrect …
Figure 10: Reasoning quality improvement via critic feedback. The draft predicts the correct label but …
Original abstract

Multimodal sarcasm detection requires reasoning over cross-modal incongruities between literal expression and intended meaning, yet the specific analytical perspectives needed vary across samples due to the diversity of sarcastic mechanisms. While recent methods make this analytical process explicit, they still rely on fixed, predefined perspectives that operate independently under hand-crafted routing rules. We argue that multimodal sarcasm detection instead calls for self-elicited multi-perspective reasoning, where a model autonomously generates the perspectives needed for each sample and progressively integrates them into a coherent analysis. To realize this goal, we propose ProCrit, a Proposal-Critic two-agent framework with a proposal agent for multi-perspective reasoning and a critic agent for external evaluation and targeted revision guidance. First, to overcome the lack of process-level supervision in existing sarcasm datasets, ProCrit synthesizes process-level reasoning annotations through a dynamic-role agentic rollout: a strong vision-language model sequentially spawns analytical roles within a shared context, and the resulting multi-role trajectories are flattened into sequences that preserve cross-perspective dependencies while enabling efficient autoregressive generation. Second, to improve reasoning reliability, ProCrit adopts a draft-critique-revise paradigm in which an independent critic identifies reasoning deficiencies and provides targeted natural-language feedback for directed revision. Finally, we develop a mutual-refinement training framework that jointly optimizes proposal drafting and feedback-guided revision via dual-stage reinforcement learning, while refining the critic agent according to the actual effectiveness of its feedback. Experiments on three widely used benchmarks demonstrate the effectiveness of ProCrit.
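The flattening step in the abstract, where multi-role trajectories become one sequence that a model can generate autoregressively, can be illustrated roughly as below. The `RoleStep` type, the `StepN: title.` line format, and the `<think>`/`<answer>` tag layout are assumptions chosen for this sketch, not a confirmed specification of the authors' data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoleStep:
    role: str      # title of the analytical perspective spawned for this step
    analysis: str  # what that role concluded, written with earlier steps in view


def flatten_trajectory(steps: List[RoleStep], label: str) -> str:
    """Serialize ordered role outputs into a single training sequence."""
    if label not in ("yes", "no"):
        raise ValueError("label must be 'yes' or 'no'")
    # Steps keep their rollout order, so later steps can refer to earlier ones.
    lines = [f"Step{i}: {s.role}. {s.analysis}"
             for i, s in enumerate(steps, start=1)]
    body = "\n".join(lines)
    return f"<think>\n{body}\n</think>\n<answer>{label}</answer>"


trajectory = [
    RoleStep("Surface-level discrepancy", "The caption praises the weather; the image shows a flood."),
    RoleStep("Pragmatic intent decoding", "Given the mismatch from Step1, the praise reads as mockery."),
]
flat = flatten_trajectory(trajectory, "yes")
print(flat)
```

Because the second step explicitly builds on the first, the flattened text retains the cross-perspective dependency that independent, fixed perspectives would lose.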

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ProCrit, a Proposal-Critic two-agent framework for multimodal sarcasm detection. It synthesizes process-level reasoning annotations via a dynamic-role agentic rollout with a strong vision-language model that spawns analytical roles and flattens trajectories to preserve cross-perspective dependencies. A draft-critique-revise paradigm uses an independent critic for targeted natural-language feedback, optimized jointly via dual-stage reinforcement learning in a mutual-refinement training framework. The central claim is that this self-elicited multi-perspective reasoning with critic-guided revision is effective on three widely used multimodal sarcasm detection benchmarks.

Significance. If the synthesized annotations faithfully encode sarcasm-specific cross-perspective dependencies and the RL loop demonstrably improves reasoning reliability beyond the base VLM, the work would advance multi-agent frameworks for tasks requiring variable analytical perspectives. It addresses a genuine gap in moving beyond fixed, hand-crafted perspectives in sarcasm detection, with potential applicability to other incongruity-based multimodal reasoning problems.

major comments (2)
  1. [Method (dynamic-role agentic rollout)] Method section on dynamic-role agentic rollout: The central claim requires that the VLM-generated trajectories provide accurate and useful supervision for cross-perspective dependencies, yet the manuscript reports no human validation, inter-annotator agreement, or comparison against gold process labels. Without these, benchmark gains could be attributable to the base VLM prior rather than the self-elicited mechanism or critic revision.
  2. [Experiments] Experiments section: The abstract asserts effectiveness on three benchmarks but the provided description supplies no quantitative results, ablation details against variants without critic feedback, or error analysis. This leaves the load-bearing claim that the proposal-critic loop and dual-stage RL produce the observed improvements unsupported by concrete evidence.
minor comments (2)
  1. [Method] The notation for the flattened autoregressive sequences and the dual-stage RL objectives could be formalized with explicit equations to clarify how cross-perspective dependencies are preserved during training.
  2. [Method] Clarify the distinction between the proposal agent and the critic agent roles in the mutual-refinement loop, perhaps with a diagram or pseudocode, to avoid ambiguity in how feedback is incorporated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, clarifying aspects of the method and experiments while indicating planned revisions to strengthen the manuscript.

  1. Referee: [Method (dynamic-role agentic rollout)] Method section on dynamic-role agentic rollout: The central claim requires that the VLM-generated trajectories provide accurate and useful supervision for cross-perspective dependencies, yet the manuscript reports no human validation, inter-annotator agreement, or comparison against gold process labels. Without these, benchmark gains could be attributable to the base VLM prior rather than the self-elicited mechanism or critic revision.

    Authors: We acknowledge that the manuscript does not report human validation, inter-annotator agreement, or direct comparison to gold process labels for the synthesized trajectories. Existing multimodal sarcasm datasets lack such gold process-level annotations, which precludes a direct comparison. The framework instead demonstrates value through the structured dynamic-role rollout that preserves cross-perspective dependencies and the subsequent mutual-refinement RL that optimizes beyond the base VLM prior, as isolated in our ablations. To address the concern, we will add a dedicated discussion subsection on the synthesis process, its design rationale, and limitations, along with qualitative examples of generated trajectories in the revised manuscript.
    revision: partial

  2. Referee: [Experiments] Experiments section: The abstract asserts effectiveness on three benchmarks but the provided description supplies no quantitative results, ablation details against variants without critic feedback, or error analysis. This leaves the load-bearing claim that the proposal-critic loop and dual-stage RL produce the observed improvements unsupported by concrete evidence.

    Authors: The full manuscript reports quantitative results across the three benchmarks. We will revise the experiments section to explicitly present these results in tables, add ablation studies comparing the full model against variants without critic feedback and without dual-stage RL, and include a detailed error analysis. These additions will directly support the contributions of the proposal-critic loop and training framework.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity in ProCrit derivation

The paper describes a Proposal-Critic agentic framework that synthesizes process annotations via dynamic-role VLM rollout, applies draft-critique-revise, and optimizes via dual-stage RL. No equations, fitted parameters, or self-referential definitions appear; the central claims rest on external VLM capabilities and standard RL rather than reducing benchmark gains or reasoning quality to quantities defined by the method itself. The approach is self-contained against external benchmarks without load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that sarcasm mechanisms vary enough to require per-sample perspective generation and that agent-generated trajectories can serve as reliable supervision. No free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: Multimodal sarcasm detection requires analytical perspectives that vary across samples due to diverse sarcastic mechanisms.
    Explicitly stated as the motivation for moving beyond fixed predefined perspectives.
invented entities (1)
  • ProCrit Proposal-Critic framework (no independent evidence)
    purpose: To realize self-elicited multi-perspective reasoning and critic-guided revision
    Newly proposed two-agent architecture whose effectiveness is asserted via benchmark experiments.

pith-pipeline@v0.9.0 · 5822 in / 1393 out tokens · 42200 ms · 2026-05-21T02:32:02.167926+00:00 · methodology

