Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

Binyang Li; Hanqi Yan; Huimin Wang; Kam-Fai Wong; Luyao Ye; Shubo Zhang; Yulan He; Zezhong Wang; Zhengyi Zhao

arxiv: 2606.03604 · v1 · pith:HJVCHPUAnew · submitted 2026-06-02 · 💻 cs.CL


Pith reviewed 2026-06-28 10:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords Intent Projection · pragmatic intent · literal-pragmatic decomposition · meme understanding · large vision-language models · multimodal benchmarks · orthogonal projection · sarcastic posts


The pith

Intent Projection decomposes literal content from pragmatic intent inside a single LVLM to improve meme understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes meme and sarcastic post understanding as a literal-pragmatic decomposition task. Standard LVLMs tend to describe what an image shows rather than the author's intended meaning, because instruction tuning mixes the two signals. Intent Projection counters this by applying an orthogonal projection at the representation level to strip away dominant unimodal directions and keep only the pragmatic residual. It further anchors the decoder with an affect classifier tag and applies a contrastive reward at the objective level to penalize literal restatements. The result is stronger performance across six multimodal benchmarks, with the biggest improvements on posts where literal and intended meanings diverge sharply.
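To make the objective-level idea concrete, here is a minimal sketch of a reward that favors answers close to a reference intent and penalizes answers close to the literal description. This page does not give the paper's actual reward, so everything here is an assumption for illustration: the function names, the bag-of-words cosine similarity, the weight `lam`, and the toy strings.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two strings (toy stand-in)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(c * c for c in ca.values())) * math.sqrt(
        sum(c * c for c in cb.values())
    )
    return dot / norm if norm else 0.0


def contrastive_reward(answer: str, intent_ref: str, literal_desc: str,
                       lam: float = 0.5) -> float:
    """Hypothetical reward: closeness to the intent minus a literal-overlap penalty."""
    return cosine(answer, intent_ref) - lam * cosine(answer, literal_desc)


# Toy example (invented strings, not taken from any benchmark).
literal = "a dog sits in a burning room"
intent = "the author pretends everything is fine while things fall apart"

restated = "a dog sits in a burning room"            # pure literal restatement
pragmatic = "the author pretends everything is fine"  # addresses the intent

reward_restated = contrastive_reward(restated, intent, literal)
reward_pragmatic = contrastive_reward(pragmatic, intent, literal)
```

In this toy setup the literal restatement receives a negative reward and the intent-focused answer a positive one. A real system would presumably use learned embeddings rather than word counts.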

Core claim

Intent Projection separates literal and pragmatic signals at the representation, output, and objective levels within one LVLM backbone: an orthogonal projection module removes dominant unimodal directions from the fused image-text vector to retain only the pragmatic residual, a surface-real affect classifier supplies a discrete polarity tag, and a contrastive reward penalizes answers that merely restate the literal description.

What carries the argument

The orthogonal projection module, which removes dominant unimodal directions from the fused image-text representation while retaining the pragmatic residual.
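A minimal NumPy sketch of how such a projection could be built, assuming the construction quoted in the simulated rebuttal further down this page: P = I - UU^T, with U holding the top principal directions of image-only and text-only features. The function names, the value of `k`, the QR re-orthonormalisation step, and the random toy features are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def top_directions(features: np.ndarray, k: int) -> np.ndarray:
    """Return the top-k principal directions (d x k) of an (n x d) feature matrix."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors, i.e. principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def build_projector(image_feats: np.ndarray, text_feats: np.ndarray, k: int):
    """Build P = I - U U^T, where U spans the top-k directions of each modality."""
    stacked = np.concatenate(
        [top_directions(image_feats, k), top_directions(text_feats, k)], axis=1
    )
    # Image and text directions are not orthogonal to each other in general,
    # so re-orthonormalise (assumes the stacked columns are linearly independent).
    u, _ = np.linalg.qr(stacked)
    return np.eye(u.shape[0]) - u @ u.T, u


# Toy data: random vectors standing in for unimodal and fused features.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(200, 16))
text_feats = rng.normal(size=(200, 16))

P, U = build_projector(image_feats, text_feats, k=3)
v = rng.normal(size=16)  # stand-in for a fused image-text vector
r = P @ v                # the residual that would be passed on
```

By construction `r` has no component along any column of `U`. Whether what remains is actually pragmatic is an empirical question that this sketch cannot answer.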

If this is right

The method outperforms open-source baselines on six multimodal benchmarks.
Gains are largest on high-divergence posts where literal collapse hurts most.
The gap to proprietary models narrows without changing the underlying LVLM architecture.
Structured reasoning chains and affect tags become explicit outputs alongside the final answer.
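The claim about high-divergence posts implies a stratified evaluation of the kind Figure 2 reports (Low/Mid/High). A minimal sketch of such a breakdown follows, assuming each post carries a scalar divergence score and a per-example metric for both systems. The tertile split, the field names, and all numbers are invented for illustration, since this page does not say how the paper defines its buckets.

```python
from statistics import mean


def stratify(records, labels=("Low", "Mid", "High")):
    """Sort records by divergence and cut them into equal-sized buckets."""
    ordered = sorted(records, key=lambda r: r["divergence"])
    n = len(ordered)
    buckets = {}
    for i, label in enumerate(labels):
        start = i * n // len(labels)
        stop = (i + 1) * n // len(labels)
        buckets[label] = ordered[start:stop]
    return buckets


def gain_by_bucket(records):
    """Mean (ours - base) metric difference per divergence bucket."""
    return {
        label: mean(r["ours"] - r["base"] for r in rows)
        for label, rows in stratify(records).items()
    }


# Invented per-post scores, for illustration only.
records = [
    {"divergence": 0.1, "base": 0.50, "ours": 0.51},
    {"divergence": 0.2, "base": 0.48, "ours": 0.50},
    {"divergence": 0.5, "base": 0.40, "ours": 0.44},
    {"divergence": 0.6, "base": 0.38, "ours": 0.43},
    {"divergence": 0.8, "base": 0.30, "ours": 0.39},
    {"divergence": 0.9, "base": 0.28, "ours": 0.38},
]

gains = gain_by_bucket(records)
```

A pattern where the gain grows from Low to High is what the second bullet predicts. A flat gain across buckets would suggest a generic improvement rather than one tied to literal-pragmatic divergence.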

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same projection step could be inserted into other multimodal tasks that require distinguishing surface form from implied meaning, such as visual sarcasm detection or political meme analysis.
If the pragmatic residual proves stable across model scales, smaller open-source LVLMs might close more of the performance gap to larger proprietary systems on intent-heavy content.
Extending the contrastive reward to penalize not only literal restatements but also over-confident guesses could further reduce hallucinated intent.

Load-bearing premise

An orthogonal projection can isolate the pragmatic residual from the fused representation without discarding necessary intent information or creating artifacts the decoder cannot recover.

What would settle it

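Replace the learned orthogonal projection with a projection that removes random directions of the same rank, on the same backbone, and test on a held-out set of high-divergence memes. If performance stays at the level of the full method, the claim that the gains come from removing the specific unimodal directions is falsified; if it falls back to baseline levels, the claim survives the test.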
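The control described above needs a random projector matched in rank to the learned one, so that any difference in downstream scores is attributable to which directions are removed and not to how many. A minimal sketch of that matching, with assumed names throughout: `learned_u` is a hypothetical stand-in for the learned unimodal basis, and the final measurement is a toy proxy, not a benchmark score.

```python
import numpy as np


def projector(u: np.ndarray) -> np.ndarray:
    """P = I - U U^T for a basis U with orthonormal columns."""
    return np.eye(u.shape[0]) - u @ u.T


def random_basis(d: int, k: int, seed: int) -> np.ndarray:
    """k random orthonormal directions in R^d (QR of a Gaussian matrix)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    return q


def energy_along(u: np.ndarray, x: np.ndarray) -> float:
    """Squared norm of the component of x lying in span(U)."""
    return float(np.sum((u.T @ x) ** 2))


d, k = 16, 4
learned_u = np.eye(d)[:, :k]           # hypothetical "literal" directions
random_u = random_basis(d, k, seed=0)  # control with the same rank

p_learned = projector(learned_u)
p_random = projector(random_u)

# Toy fused vector dominated by the hypothetical literal axes.
v = np.concatenate([np.full(k, 5.0), np.ones(d - k)])

left_learned = energy_along(learned_u, p_learned @ v)  # literal energy left
left_random = energy_along(learned_u, p_random @ v)
```

Both projectors discard a `k`-dimensional subspace, but only the learned one removes all of the energy along the literal axes in this toy. The actual test would swap `p_random` into the model and compare benchmark scores on the high-divergence subset.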

Figures

Figures reproduced from arXiv: 2606.03604 by Binyang Li, Hanqi Yan, Huimin Wang, Kam-Fai Wong, Luyao Ye, Shubo Zhang, Yulan He, Zezhong Wang, Zhengyi Zhao.

**Figure 2.** Divergence-stratified ROUGE-L on the Qwen3-VL-8B backbone. Posts are split into Low/Mid/High categories.

**Figure 3.** Scaling behavior (Qwen3-VL 2B/8B/32B). The solid line denotes Ours and the dashed line shows Base. The shaded area shows the performance gain.

**Figure 4.** Reward weight sensitivity. Left: λ1 sweep with PI-G ROUGE-L (left axis) and literal-intent cosine similarity (right axis). Right: λ2 sweep across three metrics.

Body text from the same page of the paper (fragment): … all, demonstrating it targets precisely the regime where literal and pragmatic signals diverge most. Third, removing the affect head entirely (−1.9) hurts more than withholding the tag only at inference (−1.4), yet the inference-only removal still degrades performance noticeably, confirming the decoder learns to actively condition on the tag during generation. Gradient masking on y_lit anchors the contrastive reward by holding the literal se…

**Figure 5.** Training dynamics for the Qwen3-VL family (2B, 8B, 32B). Each row corresponds to one model scale.

**Figure 6.** Training dynamics for the InternVL3 family (2B, 8B, 26B). Layout follows Figure

**Figure 7.** Training dynamics for the LLaVA-OneVision family (0.5B, 7B, 72B). Layout follows Figure

read the original abstract

When asked what a meme or sarcastic post means, Large Vision Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tuning entangles a post's literal content with its pragmatic meaning, letting surface-level details contaminate the final response. We reframe meme understanding as a problem of literal-pragmatic decomposition and propose Intent Projection, a framework that separates the two signals at the representation, output, and objective levels within a single LVLM backbone. At the representation level, an orthogonal projection module removes dominant unimodal directions from the fused image-text representation, retaining only the pragmatic residual, while a surface-real affect classifier anchors the decoder with a discrete tag that names the polarity gap. At the output level, the model externalizes a structured reasoning chain, and at the objective level a contrastive reward explicitly penalizes answers that restate the literal description. Across six multimodal benchmarks, Intent Projection consistently outperforms open-source baselines and narrows the gap to proprietary models, with the largest gains on high-divergence posts where literal collapse is most damaging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Intent Projection gives a workable multi-level decomposition for literal vs pragmatic signals in memes, but the orthogonal projection step needs more proof that it actually isolates the residual without losing key intent.


The paper's core move is to treat meme understanding as literal-pragmatic decomposition inside one LVLM. They add an orthogonal projection on the fused image-text features to strip dominant unimodal directions, leave a pragmatic residual, tag the decoder with a surface affect label, force an explicit reasoning chain at output, and train with a contrastive reward that punishes literal restatements. That combination is new enough on its own terms.

It does a few things cleanly. The problem statement is accurate: current models really do collapse to surface description on sarcastic or high-divergence posts. Breaking the fix into representation, output, and objective levels is a reasonable way to attack entanglement. The reported gains being largest exactly where literal collapse hurts most is at least consistent with the claim.

The soft spot is the projection itself. The abstract says it removes dominant unimodal directions while keeping the pragmatic residual, but there is no derivation showing the operator is guaranteed to be orthogonal to pragmatic components rather than just the directions that happen to be strong in the training data. If pragmatic signal is already mixed into those directions, the residual could be incomplete or noisy, and the downstream gains might come mostly from the affect tag and contrastive term instead. Without ablations that isolate the projection or error analysis on what gets discarded, it is hard to know how much of the improvement is real decomposition versus added supervision.
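One way to probe this concern is to measure how much of a candidate pragmatic direction survives the projection. The sketch below assumes such a direction can be estimated at all, for example as a difference of class means between sarcastic and non-sarcastic fused vectors. That estimate is an assumption of this example and not something the paper is known to do; the helper name `retained_fraction` is likewise hypothetical.

```python
import numpy as np


def retained_fraction(u: np.ndarray, direction: np.ndarray) -> float:
    """Fraction of a direction's squared norm that survives P = I - U U^T.

    Assumes the columns of u are orthonormal, so the removed share is
    simply the squared length of the direction's coordinates in span(U).
    """
    unit = direction / np.linalg.norm(direction)
    removed = float(np.sum((u.T @ unit) ** 2))
    return 1.0 - removed


# Toy basis: the first two coordinate axes play the role of removed directions.
u = np.eye(4)[:, :2]

fully_kept = retained_fraction(u, np.array([0.0, 0.0, 1.0, 0.0]))
fully_lost = retained_fraction(u, np.array([1.0, 0.0, 0.0, 0.0]))
half_lost = retained_fraction(u, np.array([1.0, 0.0, 1.0, 0.0]))
```

A retained fraction near 1 would mean the probe direction is almost untouched by the projection. A value well below 1 would indicate the entanglement described above, where removing the dominant unimodal directions also discards part of the intent signal.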

This is for people working on multimodal pragmatics, social media analysis, or instruction tuning that needs to handle intent beyond literal content. The idea is worth testing even if the current write-up is light on mechanics. A serious editor should send it to referees rather than desk reject; the framing is clear and the empirical pattern on high-divergence cases is worth checking with proper controls.

Referee Report

3 major / 1 minor

Summary. The paper claims that standard instruction tuning in LVLMs entangles literal and pragmatic meaning in memes, and proposes Intent Projection to decompose them at representation (orthogonal projection retaining pragmatic residual), output (structured reasoning), and objective (contrastive reward) levels, plus a surface-real affect classifier. It reports that this approach outperforms open-source baselines on six multimodal benchmarks and narrows the gap to proprietary models, with largest gains on high-divergence posts.

Significance. If the orthogonal projection reliably isolates pragmatic intent without discarding necessary information, this framework could offer a general method for improving pragmatic understanding in multimodal models. The multi-level decomposition is a novel angle, but the abstract lacks the technical details needed to assess whether the gains are due to successful decomposition.

major comments (3)

[Abstract] The orthogonal projection is described as removing 'dominant unimodal directions' to retain the 'pragmatic residual,' but no equation or construction is provided to guarantee that pragmatic components are orthogonal to the removed directions or preserved in the residual. This assumption is central to explaining the gains on high-divergence posts.
[Abstract] No ablation studies, error analysis, or quantitative breakdown is mentioned to show that the performance improvements stem from the projection module rather than the affect classifier or contrastive reward alone.
[Abstract] The outperformance claims are stated without reference to specific results, tables, or metrics, and the abstract provides no derivation or proof for the projection operator's properties.

minor comments (1)

[Abstract] The term 'pragmatic residual' is used without an initial definition or explanation of how it differs from standard fused representations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We agree that the abstract would benefit from additional technical specificity to better convey the decomposition mechanism and supporting evidence. We address each major comment below and will revise the abstract accordingly in the next version.


Referee: [Abstract] The orthogonal projection is described as removing 'dominant unimodal directions' to retain the 'pragmatic residual,' but no equation or construction is provided to guarantee that pragmatic components are orthogonal to the removed directions or preserved in the residual. This assumption is central to explaining the gains on high-divergence posts.

Authors: The full manuscript (Section 3.2) defines the projection operator explicitly as P = I - UU^T, where the columns of U are the top principal components of unimodal (image-only and text-only) feature matrices extracted from the training set; the residual r = P v is then passed to the decoder. The contrastive objective in Section 3.4 is designed to push pragmatic answers toward the residual while penalizing literal collapse, providing empirical support that pragmatic signal is retained. We acknowledge the abstract is high-level and will revise it to include a concise description of this construction and its motivation.

revision: yes

Referee: [Abstract] No ablation studies, error analysis, or quantitative breakdown is mentioned to show that the performance improvements stem from the projection module rather than the affect classifier or contrastive reward alone.

Authors: Section 4.3 of the manuscript reports ablation results isolating the projection module, the affect classifier, and the contrastive reward; the largest drop occurs when the projection is removed, especially on high-divergence subsets. Error analysis in Section 4.4 further breaks down cases where literal collapse persists. We will revise the abstract to note that ablations attribute the gains primarily to the projection step.

revision: yes

Referee: [Abstract] The outperformance claims are stated without reference to specific results, tables, or metrics, and the abstract provides no derivation or proof for the projection operator's properties.

Authors: The abstract summarizes the overall finding; detailed metrics appear in Tables 1-3 and the high-divergence subset analysis in Section 4.2. The projection properties follow directly from the PCA construction and are validated empirically rather than via formal proof. We will revise the abstract to reference the key quantitative improvements (e.g., largest gains on high-divergence posts) and point to the methods section for the operator definition.

revision: partial
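The operator quoted in the first response has the standard properties of an orthogonal projector, and these can be checked directly. The derivation below assumes the columns of U are orthonormal (U^T U = I_k). That holds for the principal components of a single matrix, but if U concatenates separately computed image-only and text-only bases, it would need re-orthonormalising first. The symbol d for an arbitrary probe direction is introduced here for illustration.

```latex
\begin{aligned}
P &= I - UU^{\top}, \qquad U^{\top}U = I_k \\
P^{\top} &= I - (UU^{\top})^{\top} = I - UU^{\top} = P \\
P^{2} &= I - 2\,UU^{\top} + U(U^{\top}U)U^{\top} = I - UU^{\top} = P \\
U^{\top} r &= U^{\top}(I - UU^{\top})\,v = U^{\top}v - U^{\top}v = 0 \\
\lVert P d \rVert^{2} &= \lVert d \rVert^{2} - \lVert U^{\top} d \rVert^{2}
\end{aligned}
```

These identities guarantee only that the residual r has no component along the removed directions. The last line shows that any direction d loses exactly the part of its energy that lies in the span of U, so the identities say nothing about whether pragmatic information lies outside that span, which is the referee's point.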

Circularity Check

0 steps flagged

No circularity: framework components and empirical claims are independently defined


The paper introduces Intent Projection as a new framework with explicitly described modules (orthogonal projection and affect classifier at the representation level, structured reasoning chain at the output level, contrastive reward at the objective level) and evaluates it via performance on six external multimodal benchmarks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The derivation chain consists of architectural choices and training objectives that are stated as independent design decisions, with results reported against open-source and proprietary baselines rather than reducing to the inputs by construction. This is the common case of a self-contained empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only; the framework rests on the assumption that literal and pragmatic signals are linearly separable in fused representations and that contrastive penalties can be applied without destabilizing generation.

axioms (1)

domain assumption: Fused image-text representations contain dominant unimodal literal directions that can be removed via orthogonal projection while preserving pragmatic content.
Invoked at the representation level of the proposed module.

invented entities (1)

pragmatic residual (no independent evidence)
purpose: Retained signal after orthogonal removal of literal directions
New construct introduced to anchor the decoder on intent rather than surface content.

pith-pipeline@v0.9.1-grok · 5758 in / 1264 out tokens · 23890 ms · 2026-06-28T10:04:11.940181+00:00 · methodology


