Contextualized Prompting For Stance Detection On Social Media

Iryna Gurevych; Marcus Maurer; Shakib Yazdani; Simon Kruschinski; Tilman Beck

arxiv: 2606.06022 · v1 · pith:MU7TZ7AVnew · submitted 2026-06-04 · 💻 cs.CL

Tilman Beck, Shakib Yazdani, Simon Kruschinski, Marcus Maurer, Iryna Gurevych

Pith reviewed 2026-06-28 01:48 UTC · model grok-4.3

classification 💻 cs.CL

keywords stance detection · zero-shot prompting · contextual information · large language models · social media · Twitter · target descriptions

The pith

LLM-generated target descriptions improve zero-shot stance detection on Twitter while most user context reduces accuracy due to noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether adding different kinds of context to LLM prompts helps detect the stance of short, noisy social media posts. It compares real user biographies, derived political affiliations, and descriptions of the target generated by the LLM itself across four datasets including a new German Twitter collection. Only the generated target descriptions consistently raise accuracy, while adding other tweets from the same user often lowers it because of extra noise the models cannot filter. A sympathetic reader would care because stance detection is used to track public opinion but struggles with ambiguous language, and this shows how to make zero-shot methods more reliable without extra training.

Core claim

In this work, we systematically investigate the impact of incorporating real-world (e.g., user biographies), derived (e.g., political party), and LLM-generated (e.g., target descriptions) contextual features into zero-shot prompting for stance detection on Twitter. Our evaluation spans four benchmark datasets, including a new high-quality German Twitter stance dataset. Across multiple LLMs, we find that integrating contextual information improves performance, but only under specific conditions. LLM-generated target descriptions consistently enhance accuracy, while other user metadata has mixed or even detrimental effects. Notably, we show that the inclusion of other tweets by the same user, often beneficial in supervised learning, can impair performance due to input noise.

What carries the argument

zero-shot prompting augmented with real-world user metadata, derived attributes, and LLM-generated target descriptions
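To make the three context types concrete, here is a minimal sketch of how such a prompt might be assembled. It is not the authors' template: the wording, the `Context` fields, the `build_prompt` helper, and the three-way label set are all assumptions made for illustration. The exact prompts are in the paper and its repository.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumed label set; the datasets in the paper may use different labels.
LABELS = ("favor", "against", "neutral")


@dataclass
class Context:
    """Hypothetical container for the optional context features."""
    user_bio: Optional[str] = None            # real-world feature
    party: Optional[str] = None               # derived feature
    target_description: Optional[str] = None  # LLM-generated feature
    other_tweets: Sequence[str] = ()          # other tweets by the same user


def build_prompt(tweet: str, target: str, ctx: Optional[Context] = None) -> str:
    """Assemble a zero-shot prompt, adding only the features that are set."""
    ctx = ctx or Context()
    lines = [
        f"Classify the stance of the tweet toward the target '{target}'.",
        f"Answer with one of: {', '.join(LABELS)}.",
    ]
    if ctx.target_description:
        lines.append(f"Target description: {ctx.target_description}")
    if ctx.user_bio:
        lines.append(f"Author biography: {ctx.user_bio}")
    if ctx.party:
        lines.append(f"Author party affiliation: {ctx.party}")
    if ctx.other_tweets:
        lines.append("Other tweets by the author:")
        lines.extend(f"- {t}" for t in ctx.other_tweets)
    lines.append(f"Tweet: {tweet}")
    lines.append("Stance:")
    return "\n".join(lines)


if __name__ == "__main__":
    ctx = Context(target_description="Rules requiring face masks in public.")
    print(build_prompt("Masks work.", "mask mandates", ctx))
```

Calling `build_prompt` without a `Context` yields the plain zero-shot baseline, so the same function covers both the no-context and the contextualized conditions.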

If this is right

LLM-generated target descriptions can be added to prompts to raise accuracy across multiple models and languages.
Additional tweets from the same user should be excluded from zero-shot prompts because they introduce noise that impairs results.
User biographies and derived political party information produce mixed or negative effects and require selective use.
LLMs have difficulty separating task-relevant context from irrelevant details in noisy social media settings.
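These implications depend on each context feature being tested in isolation against a no-context baseline. One possible way to organize such an ablation is sketched below; the feature names and both helper functions are hypothetical and do not reproduce the paper's experimental code.

```python
# Hypothetical feature names, chosen to mirror the context types discussed above.
FEATURES = ("user_bio", "party", "target_description", "other_tweets")


def single_feature_conditions(features=FEATURES):
    """Return a baseline plus one condition per feature.

    Each condition enables exactly one feature, so a change in score
    relative to the baseline can be attributed to that feature alone.
    """
    conditions = {"no_context": frozenset()}
    for name in features:
        conditions[f"+{name}"] = frozenset({name})
    return conditions


def select_context(record: dict, enabled: frozenset) -> dict:
    """Keep only the enabled, non-empty context fields of one example."""
    return {k: v for k, v in record.items() if k in enabled and v}


if __name__ == "__main__":
    record = {
        "user_bio": "Nurse",
        "party": "",
        "target_description": "Rules requiring masks.",
        "other_tweets": ["Stay safe everyone"],
    }
    for name, enabled in single_feature_conditions().items():
        print(name, select_context(record, enabled))
```

Dropping empty fields in `select_context` matters for noisy social media data, where many users have no biography or no derivable party affiliation.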

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prompt design for social media tasks may benefit from prioritizing synthetic context generation over direct use of raw user data.
The noise-filtering limitation observed here could be tested in related tasks such as sentiment analysis or claim verification on short posts.
Releasing the new German dataset enables direct comparison of contextual prompting effects across languages.

Load-bearing premise

The contextual features are accurate and relevant enough for LLMs to use without adding overwhelming irrelevant noise.

What would settle it

If new experiments on additional Twitter datasets show that including other tweets by the same user raises stance detection accuracy instead of lowering it, the central claim would be falsified.
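The check described here reduces to the sign of a score difference between two prompting conditions on the same examples. The sketch below assumes plain accuracy as the metric and uses made-up labels. The function names are illustrative, and the paper may report a different metric.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def context_effect(gold, pred_without, pred_with):
    """Accuracy with the context feature minus accuracy without it.

    A negative value means the added context hurt performance.
    """
    return accuracy(gold, pred_with) - accuracy(gold, pred_without)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the paper.
    gold = ["favor", "against", "neutral", "against"]
    without_tweets = ["favor", "against", "neutral", "favor"]
    with_tweets = ["favor", "neutral", "neutral", "favor"]
    print(context_effect(gold, without_tweets, with_tweets))  # -0.25
```

A single negative difference on one dataset would not settle anything by itself. The claim concerns the direction of the effect across datasets and models.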

Figures

Figures reproduced from arXiv: 2606.06022 by Iryna Gurevych, Marcus Maurer, Shakib Yazdani, Simon Kruschinski, Tilman Beck.

**Figure 2.** Prompt with additional contextual features to ...

Original abstract

Stance detection on social media is challenging due to short, noisy, and context-dependent language. While large language models (LLMs) show zero-shot generalization, they are typically prompted without contextual information, which limits their ability to interpret ambiguous posts. In this work, we systematically investigate the impact of incorporating real-world (e.g., user biographies), derived (e.g., political party), and LLM-generated (e.g., target descriptions) contextual features into zero-shot prompting for stance detection on Twitter. Our evaluation spans four benchmark datasets, including a new high-quality German Twitter stance dataset. Across multiple LLMs, we find that integrating contextual information improves performance, but only under specific conditions. LLM-generated target descriptions consistently enhance accuracy, while other user metadata has mixed or even detrimental effects. Notably, we show that the inclusion of other tweets by the same user, often beneficial in supervised learning, can impair performance due to input noise. Our qualitative analysis reveals that LLMs struggle to distinguish task-specific useful information from irrelevant context. Our findings highlight both the promise and challenges of prompting with context information in noisy real-world settings. We publish code and data at https://github.com/tilmanbeck/stance-context-twitter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows LLM-generated target descriptions reliably help zero-shot stance detection while user history often adds noise, backed by a new German Twitter dataset and code release.

The key point is that context in zero-shot prompting for Twitter stance detection helps only selectively. LLM-generated descriptions of the target improve results across the datasets and models they tested, but real user metadata gives mixed results and other tweets from the same user tend to hurt performance because they introduce noise. They demonstrate this on four datasets, one of which is a new high-quality German collection they created.

What the work does well is the systematic comparison of context types—real-world, derived, and generated—across multiple LLMs, plus the release of code and data. That makes the mixed effects directly checkable. The qualitative analysis on how models fail to separate useful context from irrelevant material is a useful addition for anyone dealing with noisy social media text.

The soft spots are limited. The abstract leaves out exact prompt wording, statistical tests, and error bars, so the size of the gains is hard to judge without the full tables. If the experiments lack proper controls for prompt variation or dataset balance, the directional claims could shrink. Still, the stress-test indicates the ablations are set up to be inspectable, so this looks like a fixable presentation issue rather than a load-bearing problem.

This paper is for researchers working on zero-shot methods for social media analysis or public opinion tasks. A reader who needs concrete guidance on context choices or a German stance resource will find it practical. It deserves a serious referee because the new dataset and the reproducible comparisons give it enough substance to warrant detailed review, even if revisions are needed on the reporting.

Referee Report

1 major / 3 minor

Summary. The paper claims that incorporating contextual information (user biographies, derived political party, LLM-generated target descriptions, and other tweets by the same user) into zero-shot LLM prompts for stance detection on Twitter improves performance only under specific conditions. LLM-generated target descriptions consistently help across models and datasets, while other user metadata has mixed or detrimental effects, and including additional user tweets often impairs results due to input noise. This is supported by systematic ablations on four benchmark datasets (including a new high-quality German Twitter stance dataset), qualitative analysis of LLM context filtering, and public release of code and data.

Significance. If the empirical findings hold, the work offers practical guidance on when and how context should be added to LLM prompts for noisy social media tasks like stance detection. The nuanced 'specific conditions' framing, the new German dataset, and the public code/data release are notable strengths that enhance the contribution's utility and reproducibility for the computational social science and NLP communities.

major comments (1)

[Experiments] Experiments section: the central claims that LLM-generated target descriptions 'consistently enhance accuracy' and that user tweets 'can impair performance due to input noise' rest on directional performance differences, but the manuscript does not report statistical significance tests, error bars, or confidence intervals. Without these, it is difficult to assess whether the observed effects are reliable or could be due to variance across runs or datasets.

minor comments (3)

[Method] Method section: while the code repository is linked, the main text should include at least one example of the exact prompt templates used for each contextual feature (e.g., how the LLM-generated target description is inserted) to improve readability without requiring readers to inspect the code.
[Datasets] Dataset description: the new German Twitter stance dataset is introduced as high-quality, but a brief summary of its annotation process, inter-annotator agreement, or class balance would help readers evaluate its contribution independently of the main results.
[Related Work] Related work: the discussion of prior contextual prompting approaches could cite additional recent works on LLM context integration in social media tasks to better situate the contribution.
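The dataset comment asks for inter-annotator agreement and class balance. As a rough illustration of what those two summaries involve, the sketch below computes Cohen's kappa for two annotators and the label distribution. The labels are toy values, and nothing here reflects the actual annotation statistics of the German dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def class_balance(labels):
    """Relative frequency of each label."""
    n = len(labels)
    return {label: count / n for label, count in Counter(labels).items()}


if __name__ == "__main__":
    # Toy annotations for illustration only.
    ann_a = ["favor", "favor", "against", "against"]
    ann_b = ["favor", "against", "against", "against"]
    print(cohen_kappa(ann_a, ann_b))  # 0.5
    print(class_balance(ann_b))
```

Kappa corrects raw agreement for chance, which matters when one stance label dominates the data.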

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We agree that statistical significance testing is necessary to support the central claims and will incorporate appropriate tests in the revision.

Referee: [Experiments] Experiments section: the central claims that LLM-generated target descriptions 'consistently enhance accuracy' and that user tweets 'can impair performance due to input noise' rest on directional performance differences, but the manuscript does not report statistical significance tests, error bars, or confidence intervals. Without these, it is difficult to assess whether the observed effects are reliable or could be due to variance across runs or datasets.

Authors: We agree with this observation. In the revised manuscript, we will add statistical significance testing (e.g., paired t-tests or Wilcoxon signed-rank tests across the four datasets) for the key comparisons involving LLM-generated target descriptions and the inclusion of additional user tweets. We will also report standard deviations across multiple prompt runs or bootstrap confidence intervals as error bars in the result tables to quantify variability.
Revision: yes
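One of the options mentioned in the response, a bootstrap confidence interval, can be sketched as follows for the accuracy difference between two prompting conditions evaluated on the same examples. This is a generic percentile bootstrap under assumed inputs, not the procedure the authors have committed to. The function name and defaults are illustrative.

```python
import random


def paired_bootstrap_ci(gold, pred_a, pred_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(pred_b) - accuracy(pred_a).

    Examples are resampled with replacement as pairs, so the comparison
    respects that both conditions were scored on the same tweets.
    Returns (observed_difference, lower_bound, upper_bound).
    """
    n = len(gold)
    if n == 0 or not (n == len(pred_a) == len(pred_b)):
        raise ValueError("inputs must be non-empty and equally long")
    # Per-example correctness difference, each value in {-1, 0, 1}.
    diffs = [(pb == g) - (pa == g) for g, pa, pb in zip(gold, pred_a, pred_b)]
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


if __name__ == "__main__":
    # Toy predictions for illustration only.
    gold = ["favor", "against", "neutral", "against", "favor", "neutral"]
    baseline = ["favor", "favor", "neutral", "against", "against", "neutral"]
    with_description = ["favor", "against", "neutral", "against", "against", "neutral"]
    print(paired_bootstrap_ci(gold, baseline, with_description))
```

If the resulting interval excludes zero, the difference is unlikely to be resampling noise on that dataset. Variation across prompt wordings or model runs would need to be measured separately.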

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical evaluation of prompting strategies on four independent benchmark datasets (including a new German one) with ablations across LLMs; all claims rest on direct performance comparisons and qualitative analysis rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are present, so none of the enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical evaluation paper with no mathematical derivations. No free parameters fitted to support a central claim. No invented entities. Relies on standard domain assumptions from NLP and LLM evaluation literature.

axioms (2)

domain assumption: Large language models exhibit zero-shot generalization capabilities on stance detection tasks when given appropriate prompts.
Foundational premise stated in the abstract for the zero-shot setting.
domain assumption: Contextual features can be meaningfully incorporated into prompts to aid interpretation of ambiguous short texts.
Core hypothesis tested in the work.
