Recognition: no theorem link
BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection
Pith reviewed 2026-05-11 00:43 UTC · model grok-4.3
The pith
A two-stage approach using structured fine-tuning and DPO refinement improves detection of online polarization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors fine-tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template consisting of target, claim type, manifestation checklist, and justification. They follow this with DPO using automatically generated preference pairs to reduce false negatives. The submitted system reaches 0.7664 Macro-F1 on the English test set, while further experiments with Mistral-Nemo-Instruct-2407 and filtered pairs reach 0.8162 Macro-F1, beating the organizer baseline of 0.7802.
What carries the argument
Structured slot-filling template for supervised fine-tuning combined with Direct Preference Optimization (DPO) using LLM-generated preference pairs.
Load-bearing premise
That automatically generated preference pairs for DPO, combined with the structured slot-filling template, reliably reduce false negatives on unseen multilingual data without introducing new biases or overfitting to the training distribution.
What would settle it
Running the trained model on a fresh collection of social media posts from a previously unseen event or language and checking whether the Macro-F1 score stays above 0.78 or falls below the baseline.
Figures
read the original abstract
The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting polarization in social media text that combines structured supervised fine tuning with Direct Preference Optimization (DPO) refinement. We fine tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Our submitted system achieves 0.7664 Macro-F1 on the English test set. Post-submission experiments with Mistral-Nemo-Instruct-2407 and LLM-judge-filtered preference pairs further improve to 0.8162 Macro-F1 (not submitted to CodaBench), surpassing the organiser baseline of 0.7802.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a two-stage system for SemEval-2026 Task 9 on polarization detection. It fine-tunes Qwen-2.5-7B-Instruct with LoRA on a structured slot-filling template (target, claim type, manifestation checklist, justification) and then applies DPO using automatically generated preference pairs (with an LLM-judge-filtered variant in post-submission runs). The submitted system reports 0.7664 Macro-F1 on the English test set; post-submission experiments with Mistral-Nemo-Instruct-2407 reach 0.8162 Macro-F1, exceeding the organizer baseline of 0.7802.
Significance. If the DPO stage can be shown to deliver reliable gains without introducing generation artifacts, the combination of interpretable slot-filling templates and preference optimization offers a practical route to improving LLM performance on nuanced, high-stakes classification tasks while limiting human annotation costs. The emphasis on structure is a modest but positive step toward explainability in polarization detection.
major comments (2)
- Abstract: The claim that the system surpasses the 0.7802 baseline rests entirely on the post-submission DPO run (0.8162 Macro-F1). No description is given of how the preference pairs were generated, what prompts or models produced the chosen/rejected responses, or what filtering criteria the LLM judge applied. Without this information or any human validation, it is impossible to determine whether the reported reduction in false negatives reflects genuine polarization signal or artifacts from the same LLM family used for both generation and judging.
- Abstract: The manuscript reports only English test-set results and provides no ablation isolating the structured template from standard SFT, nor any error analysis or multilingual results, despite the task explicitly targeting multilingual, multicultural, and multi-event polarization. These omissions make it difficult to assess whether the two-stage pipeline generalizes or simply fits the English distribution.
minor comments (2)
- The distinction between the officially submitted 0.7664 result and the post-submission 0.8162 result should be stated more explicitly in the abstract to prevent readers from conflating the two.
- A short statement on training data size, LoRA rank, and number of DPO epochs would improve reproducibility even at the abstract level.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript describing the BITS Pilani system for SemEval-2026 Task 9. We address each major comment below with clarifications and indicate planned revisions to enhance transparency and completeness.
read point-by-point responses
-
Referee: Abstract: The claim that the system surpasses the 0.7802 baseline rests entirely on the post-submission DPO run (0.8162 Macro-F1). No description is given of how the preference pairs were generated, what prompts or models produced the chosen/rejected responses, or what filtering criteria the LLM judge applied. Without this information or any human validation, it is impossible to determine whether the reported reduction in false negatives reflects genuine polarization signal or artifacts from the same LLM family used for both generation and judging.
Authors: We agree that the abstract's reference to the post-submission result requires more methodological detail to support the performance claim. The full manuscript describes automatically generated preference pairs with LLM-judge filtering, but we will expand the Methods section in revision to specify: (1) chosen responses were the model's own structured outputs aligned with gold labels, while rejected responses were generated via temperature-perturbed sampling from the same model; (2) prompts explicitly instructed the model to produce the slot-filling template (target, claim type, manifestation checklist, justification); (3) the judge was Qwen-2.5-72B-Instruct applying a consistency threshold (>0.8 agreement with gold) and coherence scoring. Importantly, the 0.8162 result used Mistral-Nemo-Instruct-2407 as the base model for both SFT and DPO, distinct from the Qwen judge family, which reduces the risk of intra-family artifacts. We will also add a limitations paragraph noting the absence of human validation and its implications. These additions will allow readers to better assess whether gains reflect genuine signal. revision: yes
-
Referee: Abstract: The manuscript reports only English test-set results and provides no ablation isolating the structured template from standard SFT, nor any error analysis or multilingual results, despite the task explicitly targeting multilingual, multicultural, and multi-event polarization. These omissions make it difficult to assess whether the two-stage pipeline generalizes or simply fits the English distribution.
Authors: Our submitted system and primary evaluation used the English test set released by the organizers, as this was the track for which we prepared the official submission. We recognize that an ablation isolating the structured slot-filling template from vanilla SFT, along with error analysis, would strengthen claims about the contribution of structure. In the revised manuscript we will add: (a) an ablation table comparing structured SFT + DPO against standard SFT + DPO on the same Qwen-2.5-7B backbone, and (b) a short error analysis section categorizing false negatives (e.g., implicit framing vs. explicit claims). For multilingual results, our internal experiments were limited to English due to data availability and submission deadlines; we will explicitly state this scope limitation in the abstract and discussion while noting that the structured template is language-agnostic in design. If additional multilingual runs become available before camera-ready, they will be included; otherwise the paper will frame the current results as English-focused with planned extension. revision: partial
Circularity Check
No significant circularity; purely empirical ML pipeline evaluated on external test data
full rationale
The paper presents an applied system for a shared task: LoRA fine-tuning of Qwen-2.5-7B on a slot-filling template, followed by DPO using automatically generated preference pairs, with final performance measured as Macro-F1 on the held-out English test set (0.7664 submitted, 0.8162 post-submission). No equations, derivations, or first-principles claims exist. Reported scores are direct empirical measurements against an organizer-provided baseline and test data, not outputs that reduce to the training inputs by construction. The generation of DPO pairs is a methodological detail whose validity is an external empirical question (human validation or ablation), not a self-definitional loop. No self-citations are load-bearing, and no ansatz or uniqueness theorem is invoked. This is a standard empirical NLP paper whose central claims rest on benchmark numbers rather than internal logical reduction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction
Small open-weight language models can self-optimize prompts for clinical named entity recognition in dental notes, reaching micro F1 of 0.864 after DPO on Qwen2.5-14B.
Reference graph
Works this paper leans on
-
[1]
InProceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Min- neapolis, Minnesota, USA
SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. InProceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Min- neapolis, Minnesota, USA. Association for Compu- tational Linguistics. Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michae...
2019
-
[2]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Direct preference optimization: Your language model is secretly a reward model.arXiv preprint arXiv:2305.18290. Arman Engin Sucu, Yixiang Zhou, Mario A. Nasci- mento, and Tony Mullen. 2025. Exploiting contex- tual information to improve stance detection in in- formal political discourse with llms. InProceedings of the 63rd Annual Meeting of the Associatio...
work page internal anchor Pith review arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.