
arxiv: 2606.00647 · v1 · pith:7N7A5FLQnew · submitted 2026-05-30 · 💻 cs.CL · cs.AI

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

Pith reviewed 2026-06-28 19:06 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords psychological defense mechanisms · class imbalance · QLoRA fine-tuning · shared task · clinical NLP · macro F1 · ensemble blending · minority class augmentation

The pith

Fine-tuning Qwen3-8B with grouped cross-validation, minority augmentation and logit-bias post-processing reaches 0.3917 macro F1 on nine-class psychological defense detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Detecting psychological defense mechanisms in conversational text is a nine-class clinical NLP task in which severe class imbalance defeats BERT-family encoders and zero-shot LLMs on the rare classes. The authors apply QLoRA fine-tuning to Qwen3-8B together with grouped stratified cross-validation that blocks leakage, round-robin lexical augmentation focused on minority classes, and a post-processing stage that tunes logit biases before blending multiple ensembles. Together these components produce a macro F1 of 0.3917 on the shared-task positive-class leaderboard, fourth place among twenty-one teams, and lift the "Unclear" class (Level 8) from near-zero to 0.797 F1. The pipeline matters to readers because it shows concrete ways to make large language models usable for clinical utterance classification where simpler methods collapse on infrequent categories.
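To make concrete why macro F1 punishes a model that ignores rare classes, the following minimal sketch computes the metric from scratch. The function name `macro_f1` and the toy labels are illustrative assumptions; this is not the shared-task scorer, whose "positive-class" variant may average over a different set of classes.

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so every class counts equally."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(labels)

# Hypothetical imbalanced toy set: 8 rows of class 0, one row each of classes 1 and 2.
y_true = [0] * 8 + [1, 2]
majority_only = [0] * 10  # a classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, majority_only)) / len(y_true)
score = macro_f1(y_true, majority_only, labels=[0, 1, 2])
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

On this toy set the majority-only predictor is right 80% of the time, yet two of its three per-class F1 values are zero, which drags the macro average far below the accuracy.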

Core claim

By iteratively fine-tuning Qwen3-8B under imbalance-aware conditions and applying leakage-preventing grouped stratified cross-validation, minority-class round-robin lexical augmentation, and logit-bias-tuned ensemble post-processing, the method reaches a macro F1 of 0.3917 on the PsyDefDetect positive-class leaderboard, a 7.7-point absolute gain over the Ministral-8B baseline.
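This page does not spell out how the round-robin lexical augmentation works, so the following is only a sketch of one plausible reading: cycle through a fixed list of cheap lexical mutations and through the minority-class utterances until each minority class reaches a target size. The synonym table, the three mutation operators, and the names `augment_minority` and `target_per_class` are all hypothetical.

```python
import itertools
import random

# Hypothetical synonym table; a real system would need a far larger lexical resource.
SYNONYMS = {"sad": "unhappy", "angry": "upset", "fine": "okay"}

def swap_synonym(words, rng):
    return [SYNONYMS.get(w, w) for w in words]

def swap_adjacent(words, rng):
    if len(words) < 2:
        return list(words)
    i = rng.randrange(len(words) - 1)
    out = list(words)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def drop_word(words, rng):
    if len(words) < 2:
        return list(words)
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]

MUTATIONS = [swap_synonym, swap_adjacent, drop_word]

def augment_minority(rows, minority_labels, target_per_class, seed=42):
    """rows: list of (text, label). Returns the original rows plus mutated copies."""
    rng = random.Random(seed)
    out = list(rows)
    for label in sorted(minority_labels):
        pool = [text for text, lab in rows if lab == label]
        if not pool:
            continue
        needed = max(target_per_class - len(pool), 0)
        # Round-robin schedule: first operator over every utterance, then the next operator, and so on.
        schedule = itertools.cycle(itertools.product(MUTATIONS, pool))
        for _ in range(needed):
            op, text = next(schedule)
            out.append((" ".join(op(text.split(), rng)), label))
    return out

rows = [("i am fine really", 1), ("i feel sad today", 8), ("so angry at him", 8)]
augmented = augment_minority(rows, minority_labels={8}, target_per_class=5)
print(len(augmented))
```

If copies like these are generated before the cross-validation split, each copy would need to stay in the same group as its source utterance; otherwise a near-duplicate of a validation row could end up in training.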

What carries the argument

QLoRA fine-tuning of Qwen3-8B combined with grouped stratified cross-validation, minority-class round-robin lexical augmentation, and logit bias tuning plus ensemble blending in post-processing.
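Grouped stratified cross-validation can be sketched as a greedy assignment that keeps every group in a single fold while balancing class counts across folds. The grouping key is assumed here to be a dialogue identifier, which this page does not confirm, and the function name `grouped_stratified_folds` is hypothetical. scikit-learn's `StratifiedGroupKFold` offers a library version of the same idea.

```python
from collections import Counter, defaultdict

def grouped_stratified_folds(labels, groups, n_folds=5):
    """Return a fold index per row; all rows of one group share the same fold."""
    per_group = defaultdict(Counter)
    for lab, g in zip(labels, groups):
        per_group[g][lab] += 1
    classes = sorted(set(labels))
    fold_counts = [Counter() for _ in range(n_folds)]
    fold_of_group = {}

    def imbalance(candidate, counts):
        # Sum over classes of (max - min) per-fold count if `counts` joined fold `candidate`.
        total = 0
        for c in classes:
            per_fold = [fold_counts[f][c] + (counts[c] if f == candidate else 0)
                        for f in range(n_folds)]
            total += max(per_fold) - min(per_fold)
        return total

    # Place large groups first; they are the hardest to balance later.
    ordered = sorted(per_group.items(),
                     key=lambda kv: (-sum(kv[1].values()), str(kv[0])))
    for g, counts in ordered:
        best = min(range(n_folds),
                   key=lambda f: (imbalance(f, counts), sum(fold_counts[f].values()), f))
        fold_counts[best].update(counts)
        fold_of_group[g] = best
    return [fold_of_group[g] for g in groups]

# Hypothetical toy data: ten utterances from five dialogues.
labels = [0, 0, 1, 0, 8, 0, 1, 0, 8, 0]
groups = ["d1", "d1", "d1", "d2", "d2", "d3", "d3", "d4", "d4", "d5"]
folds = grouped_stratified_folds(labels, groups, n_folds=2)
print(folds)
```

Because the split is made per group, no dialogue contributes utterances to both the training and validation side of a fold, which is the leakage the grouped scheme is meant to block.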

Load-bearing premise

The post-processing pipeline with logit bias tuning and ensemble blending will reliably close the validation-to-leaderboard gap and improve minority-class recall without overfitting to the particular test distribution.
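The premise can be made concrete with a small sketch of the two post-processing steps: average the logits of several models, then search for per-class additive biases that raise macro F1. The coordinate-ascent search, the bias grid, the equal blend weights, and every name below are assumptions for illustration; the paper's actual search procedure is not described on this page.

```python
def blend(logit_sets, weights):
    """Weighted average of several models' per-row logits."""
    total = sum(weights)
    n_rows, n_cls = len(logit_sets[0]), len(logit_sets[0][0])
    return [[sum(w * ls[i][c] for w, ls in zip(weights, logit_sets)) / total
             for c in range(n_cls)] for i in range(n_rows)]

def predict(logits, bias):
    return [max(range(len(bias)), key=lambda c: row[c] + bias[c]) for row in logits]

def macro_f1(y_true, y_pred, n_cls):
    scores = []
    for c in range(n_cls):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_cls

def tune_bias(logits, y_true, n_cls, grid=(-1.0, -0.5, 0.0, 0.5, 1.0), sweeps=2):
    """Coordinate ascent: change one class bias at a time, keep it only if macro F1 rises."""
    bias = [0.0] * n_cls
    best = macro_f1(y_true, predict(logits, bias), n_cls)
    for _ in range(sweeps):
        for c in range(n_cls):
            for b in grid:
                trial = bias[:c] + [b] + bias[c + 1:]
                score = macro_f1(y_true, predict(logits, trial), n_cls)
                if score > best:
                    best, bias = score, trial
    return bias, best

# Hypothetical out-of-fold logits from two models; class 2 is the under-predicted minority.
model_a = [[2.2, 0.0, 0.0], [1.6, 0.3, 0.0], [1.2, 0.2, 1.0], [0.0, 1.6, 0.0], [1.2, 0.0, 0.7]]
model_b = [[1.8, 0.2, 0.0], [1.4, 0.1, 0.2], [0.8, 0.4, 0.8], [0.4, 1.2, 0.0], [1.0, 0.0, 0.9]]
y_true = [0, 0, 2, 1, 2]

blended = blend([model_a, model_b], weights=[0.5, 0.5])
base_score = macro_f1(y_true, predict(blended, [0.0, 0.0, 0.0]), 3)
bias, tuned_score = tune_bias(blended, y_true, n_cls=3)
print(bias, round(base_score, 3), round(tuned_score, 3))
```

The toy run tunes and scores on the same five rows, so its improvement is optimistic by construction. That is the overfitting risk the premise names: biases fitted to one set of predictions carry over only if the test distribution resembles it.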

What would settle it

Running the identical fine-tuned model and post-processing parameters on a new conversational test set drawn from a different clinical source and finding that minority-class F1 drops below the level achieved by the fine-tuned model without post-processing.

Figures

Figures reproduced from arXiv: 2606.00647 by Ahmed Alfey Sani, Ajwad Abrar, Md Hasibur Rahman Alif, Shefayat E Shams Adib.

Figure 1: Full system pipeline. Phase 1: raw training data is preprocessed and minority classes (Levels 2, 3, 4, 5, 8) are oversampled via round-robin lexical mutation, creating an expanded training set of 2,600+ utterances. Phase 2: two independent grouped stratified 5-fold QLoRA fine-tuning runs of Qwen3-8B, the Anchor (seed = 42) and Seed-A (seed = 20260407), sharing identical architecture and hyperparameters, with…
Figure 2: Approximate row-normalised confusion matrices (OOF). Post-processing successfully shifts residual…
Abstract

Detecting psychological defense mechanisms in conversational text remains a challenging clinical NLP problem. For the PsyDefDetect 2026 shared task (nine-class utterance classification evaluated via macro F1), our team LinguIUTics achieves a macro F1-score of 0.3917 on the official positive-class leaderboard, ranking 4th out of 21 registered teams and improving over the Ministral-8B task baseline (31.48 macro F1) by 7.7 absolute points (24.4 percent relative). BERT-family encoders and zero-shot LLMs proved ineffective on rare classes due to severe class imbalance, leading us to QLoRA fine-tuning of Qwen3-8B. We leverage three key strategies: grouped stratified cross-validation (preventing leakage), minority-class round-robin lexical augmentation, and a post-processing pipeline with logit bias tuning and ensemble blending. Together, these components close much of the validation-to-leaderboard gap and substantially improve minority-class recall, driving the critical "Unclear" class (Level 8) from near-zero performance to an F1 score of 0.797.
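The abstract quotes the system score on a 0 to 1 scale (0.3917) and the baseline on a 0 to 100 scale (31.48). A quick check, assuming the two figures are the same metric on different scales, shows that the stated gains are consistent with both numbers.

```python
system = 0.3917         # reported leaderboard macro F1 (0-1 scale)
baseline = 31.48 / 100  # Ministral-8B baseline, converted from the 0-100 scale

abs_points = (system - baseline) * 100
rel_pct = (system - baseline) / baseline * 100
print(f"{abs_points:.1f} points absolute, {rel_pct:.1f}% relative")
```

This reproduces the 7.7-point absolute and 24.4 percent relative improvement stated in the abstract.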

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes the LinguIUTics system submitted to the PsyDefDetect 2026 shared task on nine-class classification of psychological defense mechanisms from conversational text. The authors report a macro F1 of 0.3917 on the official positive-class leaderboard (4th of 21 teams), obtained via QLoRA fine-tuning of Qwen3-8B together with grouped stratified cross-validation, minority-class round-robin lexical augmentation, and a post-processing pipeline of logit bias tuning plus ensemble blending; this yields a 7.7-point absolute (24.4% relative) gain over the Ministral-8B baseline and raises the rare “Unclear” class F1 to 0.797.

Significance. If the leaderboard result is reproducible, the work supplies a concrete, engineering-oriented demonstration that parameter-efficient fine-tuning combined with imbalance-aware data augmentation and targeted post-processing can substantially lift minority-class recall in a clinical NLP shared task. The reported improvement on the most difficult class offers a practical reference point for other low-resource, imbalanced multi-class problems in psychological text analysis.

major comments (2)
  1. [Abstract] Abstract and the description of the three key strategies: the manuscript credits grouped stratified cross-validation, minority-class round-robin lexical augmentation, and the post-processing pipeline with the reported gains, yet supplies no ablation tables, incremental-addition experiments, or controlled comparisons that isolate each component's contribution to the 7.7-point improvement. Without such evidence the attribution remains narrative.
  2. [Results] Results section: neither error bars from the grouped cross-validation folds nor any statistical significance test comparing the final system against the Ministral-8B baseline or against intermediate ablations are reported, leaving the reliability of the 0.3917 macro F1 and the per-class gains (especially the jump to 0.797 on the “Unclear” class) difficult to assess.
minor comments (2)
  1. The manuscript would benefit from an explicit table listing all hyper-parameters used for QLoRA (rank, alpha, dropout, learning rate, epochs, etc.) to support reproducibility.
  2. A short related-work paragraph situating the lexical augmentation technique against prior minority-class oversampling methods in clinical NLP would help readers place the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

  1. Referee: [Abstract] Abstract and the description of the three key strategies: the manuscript credits grouped stratified cross-validation, minority-class round-robin lexical augmentation, and the post-processing pipeline with the reported gains, yet supplies no ablation tables, incremental-addition experiments, or controlled comparisons that isolate each component's contribution to the 7.7-point improvement. Without such evidence the attribution remains narrative.

    Authors: We agree that the current manuscript attributes performance gains to the three strategies without quantitative isolation of their effects. In the revision we will add a dedicated ablation subsection (and corresponding table) that starts from the base QLoRA fine-tuned Qwen3-8B and incrementally introduces grouped stratified cross-validation, minority-class round-robin lexical augmentation, and the logit-bias-plus-ensemble post-processing pipeline, reporting macro F1 and per-class F1 at each step on the validation folds. revision: yes

  2. Referee: [Results] Results section: neither error bars from the grouped cross-validation folds nor any statistical significance test comparing the final system against the Ministral-8B baseline or against intermediate ablations are reported, leaving the reliability of the 0.3917 macro F1 and the per-class gains (especially the jump to 0.797 on the “Unclear” class) difficult to assess.

    Authors: We concur that fold-level variability and formal significance testing are needed to substantiate the reported scores. In the revised results section we will report mean ± standard deviation across the grouped CV folds for all metrics and will add a statistical comparison (paired t-test on per-fold macro F1) between the final system and the Ministral-8B baseline, including the resulting p-value. revision: yes
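The fold-level reporting promised in the second response could look like the sketch below, which computes mean ± standard deviation and a paired t statistic from per-fold macro F1. The per-fold values are invented placeholders, not results from the paper, and the helper name `paired_t` is hypothetical. The comparison also presumes the baseline is re-run on the same folds, since the quoted baseline score is a single leaderboard number.

```python
import math
import statistics

def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1

# Placeholder per-fold macro F1 values, NOT taken from the paper.
system_folds = [0.40, 0.38, 0.41, 0.37, 0.39]
baseline_folds = [0.32, 0.31, 0.33, 0.30, 0.30]

mean_sys = statistics.mean(system_folds)
std_sys = statistics.stdev(system_folds)
t_stat, dof = paired_t(system_folds, baseline_folds)
print(f"system: {mean_sys:.3f} ± {std_sys:.3f}; t = {t_stat:.2f} with {dof} degrees of freedom")
```

Turning the statistic into a p-value needs the t distribution, for example via `scipy.stats.ttest_rel`. With five folds there are only four degrees of freedom, and folds that share training data are not fully independent, so any such p-value is a rough guide at best.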

Circularity Check

0 steps flagged

No significant circularity


The paper is a standard shared-task system description that reports an empirical macro F1 of 0.3917 obtained via QLoRA fine-tuning, grouped CV, lexical augmentation, and post-processing on the PsyDefDetect test set. No equations, formal derivations, theorems, or parameter-fitting steps that could reduce a claimed prediction to its inputs by construction are present anywhere in the manuscript. The central result is an externally evaluated leaderboard score rather than a quantity computed from fitted parameters inside the paper, and no self-citation chains or ansatzes are invoked to justify any mathematical claim. The work is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the unstated assumption that the shared-task test distribution matches the training distribution closely enough for post-processing to transfer, plus the effectiveness of standard LLM fine-tuning practices. No free parameters, axioms, or invented entities are explicitly introduced in the provided text.

