NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression
Pith reviewed 2026-05-11 01:41 UTC · model grok-4.3
The pith
Fine-tuning XLM-RoBERTa with dual regression heads outperforms few-shot LLMs for multilingual valence-arousal regression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Task-specific fine-tuning of XLM-RoBERTa-base models with dual regression heads for valence and arousal outperforms few-shot prompting of large language models on every language-domain dataset in the development experiments for dimensional aspect sentiment regression.
What carries the argument
XLM-RoBERTa-base fine-tuned with dual sigmoid-scaled regression heads; one model is trained per language-domain pair, and the train and development sets are merged for the final test submission.
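A minimal sketch of the dual-head design described here, in PyTorch. The class and variable names are illustrative, not the authors' code, and the (0, 1) → [1, 9] affine scaling is the natural reading of "sigmoid-scaled", not a detail confirmed by the paper:

```python
import torch
import torch.nn as nn

class DualVAHead(nn.Module):
    """Two regression heads whose sigmoid outputs are affinely
    scaled from (0, 1) into the task's [1, 9] valence-arousal range."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.valence = nn.Linear(hidden_size, 1)
        self.arousal = nn.Linear(hidden_size, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        v = torch.sigmoid(self.valence(pooled))
        a = torch.sigmoid(self.arousal(pooled))
        # Affine map: sigmoid output in (0, 1) -> score in (1, 9)
        return torch.cat([v, a], dim=-1) * 8.0 + 1.0  # (batch, 2)

head = DualVAHead(hidden_size=768)  # XLM-RoBERTa-base hidden size
pooled = torch.zeros(4, 768)        # stand-in for pooled encoder outputs
out = head(pooled)
assert out.shape == (4, 2)
assert bool(((out >= 1.0) & (out <= 9.0)).all())
```

In practice the pooled tensor would come from the XLM-RoBERTa-base encoder (e.g. the representation at the aspect or [CLS] position); the head itself is encoder-agnostic.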
Load-bearing premise
The development-set comparisons between fine-tuned models and few-shot LLMs use equivalent data and evaluation conditions so that any measured gap reflects the training method rather than hidden differences in setup.
What would settle it
Reproducing the development experiments and finding that at least one few-shot LLM achieves lower mean absolute error than the fine-tuned XLM-RoBERTa on the same held-out development data would falsify the reported superiority.
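The falsification test above hinges on mean absolute error; as a concrete reference, MAE over paired predictions (hypothetical numbers, not the paper's results) is simply:

```python
def mae(preds, golds):
    """Mean absolute error over paired score lists."""
    assert len(preds) == len(golds)
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(preds)

# Hypothetical valence predictions vs. gold scores on a dev set
assert mae([5.0, 7.0], [5.5, 6.0]) == 0.75
```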
Original abstract
Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence-arousal (VA) regression. This paper describes a system developed for Track A, Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, with dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language-domain pair (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models under a few-shot prompting setting, demonstrating that task-specific fine-tuning outperforms these LLM-based methods across all evaluation datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the NCL-BU system for SemEval-2026 Task 3, Track A, Subtask 1 (Dimensional Aspect Sentiment Regression). It fine-tunes XLM-RoBERTa-base with dual sigmoid-scaled regression heads to predict continuous valence and arousal scores in [1,9] for given aspects. Separate models are trained for each English/Chinese language-domain pair (restaurant, laptop, finance). The central claim is that this task-specific fine-tuning outperforms several large language models under few-shot prompting on development experiments across all evaluation datasets. Training and development sets are merged only for final test predictions.
Significance. If the outperformance claim holds with verifiable metrics, the result would indicate that supervised fine-tuning of multilingual encoders can exceed few-shot LLM prompting for continuous VA regression in multilingual ABSA. This would be of practical value for shared-task participants and practitioners seeking efficient adaptation strategies over prompting. The per-language-domain modeling and dual-head design are simple and potentially reproducible, but the current manuscript provides no quantitative support for the key empirical finding.
major comments (1)
- Abstract (development experiments paragraph): the claim that fine-tuning 'outperforms these LLM-based methods across all evaluation datasets' is unsupported by any reported metrics (e.g., MSE, Pearson r, or MAE), error bars, or statistical tests. No table or figure presents the actual scores for the XLM-RoBERTa system versus the few-shot baselines, rendering the central empirical result unverifiable.
minor comments (2)
- Abstract: the sigmoid outputs are stated to be 'scaled' to the [1,9] range, but the precise affine transformation or post-processing step is not specified.
- Abstract: the specific LLMs, shot counts, and prompting templates used in the few-shot baselines are not named, preventing direct replication of the comparison.
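On the first minor comment: the abstract does not specify the affine transformation, but the standard choice for mapping a sigmoid output to [1, 9] is sketched below. Both functions are assumptions for illustration, not the paper's confirmed post-processing:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def to_score(logit: float) -> float:
    """One plausible affine post-processing: (0, 1) -> (1, 9)."""
    return 1.0 + 8.0 * sigmoid(logit)

def to_logit(score: float) -> float:
    """Inverse map, useful for encoding gold scores as regression targets."""
    p = (score - 1.0) / 8.0
    return math.log(p / (1.0 - p))

assert abs(to_score(0.0) - 5.0) < 1e-9           # sigmoid(0) = 0.5 -> range midpoint
assert abs(to_score(to_logit(7.25)) - 7.25) < 1e-9  # round-trip
```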
Simulated Author's Rebuttal
We thank the referee for their detailed review and for highlighting the need for verifiable quantitative evidence. We agree that the central empirical claim requires explicit metrics and have revised the manuscript to include them.
Point-by-point responses
Referee: Abstract (development experiments paragraph): the claim that fine-tuning 'outperforms these LLM-based methods across all evaluation datasets' is unsupported by any reported metrics (e.g., MSE, Pearson r, or MAE), error bars, or statistical tests. No table or figure presents the actual scores for the XLM-RoBERTa system versus the few-shot baselines, rendering the central empirical result unverifiable.
Authors: We acknowledge that the development-set comparison results were described only qualitatively in the original submission. In the revised manuscript we have added a new table (Table 2) that reports MSE, Pearson r, and MAE for the XLM-RoBERTa fine-tuned models and for each few-shot LLM baseline on every development dataset. The table also includes standard deviations over three random seeds and a brief note on the statistical significance of the observed differences. These additions make the outperformance claim fully verifiable.
Revision: yes
Circularity Check
No significant circularity; purely empirical system description
Full rationale
The paper describes a fine-tuning pipeline for XLM-RoBERTa with dual regression heads and reports empirical comparisons against few-shot LLMs on development sets. No equations, derivations, parameters fitted to target quantities, or self-citation chains appear in the provided text. The train+dev merge is explicitly limited to final test predictions and does not affect the development experiments that support the main claim. The work is self-contained against external benchmarks (shared-task data and standard LLM baselines) with no reduction of outputs to inputs by construction.
Few-shot prompt examples
1. Text: “the food was absolutely amazing!!” Aspect: “food” Answer: 8.50#8.25
2. Text: “but the staff was so horrible to us.” Aspect: “staff” Answer: 1.33#8.67
3. Text: “food was just average... if they lowered the prices just a bit, it would be a bigger draw.” Aspect: “food” Answer: 5.00#5.00
4. Text: “i love this macbook.” Aspect: “macbook” Answer: 7.10#6.90
5. Text: “horrible product.” Aspect: “product” Answer: 2.60#5.70
6. Text: “it has and does everything it should.” Aspect: “NULL” Answer: 5.67#5.50
Prediction examples (Appendix B)
Table 6 shows selected test predictions. Near-exact cases match gold values closely; failure cases involve sarcasm or implicit negativity, where the model predicts positive values. One near-exact example begins: “I enjoy real flavor, real frui...”
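The prompt examples above encode both scores in a single "valence#arousal" answer string; a small parser for that format (a hypothetical helper, not taken from the paper) might look like:

```python
def parse_va(answer: str) -> tuple[float, float]:
    """Parse the 'valence#arousal' answer format used in the few-shot
    prompt examples, e.g. '8.50#8.25' -> (8.5, 8.25)."""
    v_str, a_str = answer.strip().split("#")
    v, a = float(v_str), float(a_str)
    if not (1.0 <= v <= 9.0 and 1.0 <= a <= 9.0):
        raise ValueError(f"VA scores out of [1, 9] range: {answer!r}")
    return v, a

assert parse_va("8.50#8.25") == (8.5, 8.25)
assert parse_va("5.00#5.00") == (5.0, 5.0)
```

A robust system prompt would pair this with a fallback for malformed LLM outputs (e.g. defaulting to the range midpoint 5.0#5.0), since few-shot models do not always follow the format exactly.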