arxiv: 2604.11121 · v2 · submitted 2026-04-13 · 💻 cs.CL

Recognition: no theorem link

BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection

Atharva Gupta , Dhruv Kumar , Yash Sinha

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:43 UTC · model grok-4.3

classification 💻 cs.CL

keywords polarizationpreferencedetectionfinemacro-f1onlinepairsrefinement

0 comments

The pith

A two-stage approach using structured fine-tuning and DPO refinement improves detection of online polarization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a method to detect polarization in multilingual social media text by first fine-tuning a language model with a specific template that structures the output into target, claim type, manifestation checklist, and justification. It then applies Direct Preference Optimization using automatically created preference pairs to better handle cases where polarization is missed. The approach addresses challenges like nuanced rhetoric and expensive human annotation. A sympathetic reader would care because accurate automated detection could help track and understand online polarization without relying heavily on manual review.

Core claim

The authors fine-tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template consisting of target, claim type, manifestation checklist, and justification. They follow this with DPO using automatically generated preference pairs to reduce false negatives. The submitted system reaches 0.7664 Macro-F1 on the English test set, while further experiments with Mistral-Nemo-Instruct-2407 and filtered pairs reach 0.8162 Macro-F1, beating the organizer baseline of 0.7802.

What carries the argument

Structured slot-filling template for supervised fine-tuning combined with Direct Preference Optimization (DPO) using LLM-generated preference pairs.

Load-bearing premise

That automatically generated preference pairs for DPO, combined with the structured slot-filling template, reliably reduce false negatives on unseen multilingual data without introducing new biases or overfitting to the training distribution.

What would settle it

Running the trained model on a fresh collection of social media posts from a previously unseen event or language and checking whether the Macro-F1 score stays above 0.78 or falls below the baseline.

Figures

Figures reproduced from arXiv: 2604.11121 by Atharva Gupta, Dhruv Kumar, Yash Sinha.

read the original abstract

The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting polarization in social media text that combines structured supervised fine tuning with Direct Preference Optimization (DPO) refinement. We fine tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Our submitted system achieves 0.7664 Macro-F1 on the English test set. Post-submission experiments with Mistral-Nemo-Instruct-2407 and LLM-judge-filtered preference pairs further improve to 0.8162 Macro-F1 (not submitted to CodaBench), surpassing the organiser baseline of 0.7802.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Their submitted system missed the baseline and the post-submission DPO gains rest on unvalidated auto-generated preference pairs.

read the letter

The paper reports a two-stage system for SemEval-2026 Task 9 on polarization detection. They fine-tune Qwen-2.5-7B with LoRA on a slot-filling template that asks for target, claim type, manifestation checklist, and justification, then apply DPO on automatically generated preference pairs. The submitted run scores 0.7664 Macro-F1 on the English test set, below the organizer baseline of 0.7802. A later experiment with Mistral-Nemo-Instruct-2407 and LLM-filtered pairs reaches 0.8162, but that version was not submitted to CodaBench. The template is a reasonable attempt to make the output more structured and interpretable, and the numbers are reported clearly against the baseline. That is the main concrete contribution here. The soft spot is the DPO stage. The improvement over baseline depends entirely on those preference pairs, yet the description gives no human validation, no ablation that isolates the DPO effect, and no check for whether the LLM judge is simply echoing its own biases. Without that, the 0.8162 figure is hard to treat as reliable evidence rather than an artifact of the generation process. The abstract also stays silent on multilingual performance, which is central to the task. This is a standard shared-task write-up. Readers already working on the same task or on social-media classification might pick up the template idea or the DPO setup as a practical starting point. It does not introduce new methods or strong new findings that would interest a broader audience. I would send it to peer review for the SemEval proceedings because it supplies actual test-set numbers and a reproducible pipeline, but any referee should require an ablation on the preference data and error analysis before accepting the DPO claims.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a two-stage system for SemEval-2026 Task 9 on polarization detection. It fine-tunes Qwen-2.5-7B-Instruct with LoRA on a structured slot-filling template (target, claim type, manifestation checklist, justification) and then applies DPO using automatically generated preference pairs (with an LLM-judge-filtered variant in post-submission runs). The submitted system reports 0.7664 Macro-F1 on the English test set; post-submission experiments with Mistral-Nemo-Instruct-2407 reach 0.8162 Macro-F1, exceeding the organizer baseline of 0.7802.

Significance. If the DPO stage can be shown to deliver reliable gains without introducing generation artifacts, the combination of interpretable slot-filling templates and preference optimization offers a practical route to improving LLM performance on nuanced, high-stakes classification tasks while limiting human annotation costs. The emphasis on structure is a modest but positive step toward explainability in polarization detection.

major comments (2)

Abstract: The claim that the system surpasses the 0.7802 baseline rests entirely on the post-submission DPO run (0.8162 Macro-F1). No description is given of how the preference pairs were generated, what prompts or models produced the chosen/rejected responses, or what filtering criteria the LLM judge applied. Without this information or any human validation, it is impossible to determine whether the reported reduction in false negatives reflects genuine polarization signal or artifacts from the same LLM family used for both generation and judging.
Abstract: The manuscript reports only English test-set results and provides no ablation isolating the structured template from standard SFT, nor any error analysis or multilingual results, despite the task explicitly targeting multilingual, multicultural, and multi-event polarization. These omissions make it difficult to assess whether the two-stage pipeline generalizes or simply fits the English distribution.

minor comments (2)

The distinction between the officially submitted 0.7664 result and the post-submission 0.8162 result should be stated more explicitly in the abstract to prevent readers from conflating the two.
A short statement on training data size, LoRA rank, and number of DPO epochs would improve reproducibility even at the abstract level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing the BITS Pilani system for SemEval-2026 Task 9. We address each major comment below with clarifications and indicate planned revisions to enhance transparency and completeness.

read point-by-point responses

Referee: Abstract: The claim that the system surpasses the 0.7802 baseline rests entirely on the post-submission DPO run (0.8162 Macro-F1). No description is given of how the preference pairs were generated, what prompts or models produced the chosen/rejected responses, or what filtering criteria the LLM judge applied. Without this information or any human validation, it is impossible to determine whether the reported reduction in false negatives reflects genuine polarization signal or artifacts from the same LLM family used for both generation and judging.

Authors: We agree that the abstract's reference to the post-submission result requires more methodological detail to support the performance claim. The full manuscript describes automatically generated preference pairs with LLM-judge filtering, but we will expand the Methods section in revision to specify: (1) chosen responses were the model's own structured outputs aligned with gold labels, while rejected responses were generated via temperature-perturbed sampling from the same model; (2) prompts explicitly instructed the model to produce the slot-filling template (target, claim type, manifestation checklist, justification); (3) the judge was Qwen-2.5-72B-Instruct applying a consistency threshold (>0.8 agreement with gold) and coherence scoring. Importantly, the 0.8162 result used Mistral-Nemo-Instruct-2407 as the base model for both SFT and DPO, distinct from the Qwen judge family, which reduces the risk of intra-family artifacts. We will also add a limitations paragraph noting the absence of human validation and its implications. These additions will allow readers to better assess whether gains reflect genuine signal. revision: yes
Referee: Abstract: The manuscript reports only English test-set results and provides no ablation isolating the structured template from standard SFT, nor any error analysis or multilingual results, despite the task explicitly targeting multilingual, multicultural, and multi-event polarization. These omissions make it difficult to assess whether the two-stage pipeline generalizes or simply fits the English distribution.

Authors: Our submitted system and primary evaluation used the English test set released by the organizers, as this was the track for which we prepared the official submission. We recognize that an ablation isolating the structured slot-filling template from vanilla SFT, along with error analysis, would strengthen claims about the contribution of structure. In the revised manuscript we will add: (a) an ablation table comparing structured SFT + DPO against standard SFT + DPO on the same Qwen-2.5-7B backbone, and (b) a short error analysis section categorizing false negatives (e.g., implicit framing vs. explicit claims). For multilingual results, our internal experiments were limited to English due to data availability and submission deadlines; we will explicitly state this scope limitation in the abstract and discussion while noting that the structured template is language-agnostic in design. If additional multilingual runs become available before camera-ready, they will be included; otherwise the paper will frame the current results as English-focused with planned extension. revision: partial

Circularity Check

0 steps flagged

No significant circularity; purely empirical ML pipeline evaluated on external test data

full rationale

The paper presents an applied system for a shared task: LoRA fine-tuning of Qwen-2.5-7B on a slot-filling template, followed by DPO using automatically generated preference pairs, with final performance measured as Macro-F1 on the held-out English test set (0.7664 submitted, 0.8162 post-submission). No equations, derivations, or first-principles claims exist. Reported scores are direct empirical measurements against an organizer-provided baseline and test data, not outputs that reduce to the training inputs by construction. The generation of DPO pairs is a methodological detail whose validity is an external empirical question (human validation or ablation), not a self-definitional loop. No self-citations are load-bearing, and no ansatz or uniqueness theorem is invoked. This is a standard empirical NLP paper whose central claims rest on benchmark numbers rather than internal logical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical application paper with no theoretical derivations, free parameters in the mathematical sense, axioms, or new invented entities.

pith-pipeline@v0.9.0 · 5517 in / 1226 out tokens · 95258 ms · 2026-05-11T00:43:15.895732+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction
cs.CL 2026-05 unverdicted novelty 5.0

Small open-weight language models can self-optimize prompts for clinical named entity recognition in dental notes, reaching micro F1 of 0.864 after DPO on Qwen2.5-14B.

Reference graph

Works this paper leans on

2 extracted references · 1 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

InProceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Min- neapolis, Minnesota, USA

SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. InProceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Min- neapolis, Minnesota, USA. Association for Compu- tational Linguistics. Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michae...

2019
[2]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Direct preference optimization: Your language model is secretly a reward model.arXiv preprint arXiv:2305.18290. Arman Engin Sucu, Yixiang Zhou, Mario A. Nasci- mento, and Tony Mullen. 2025. Exploiting contex- tual information to improve stance detection in in- formal political discourse with llms. InProceedings of the 63rd Annual Meeting of the Associatio...

work page internal anchor Pith review arXiv 2025