pith. sign in

arxiv: 2606.27446 · v1 · pith:VJF3YV7Hnew · submitted 2026-06-25 · 💻 cs.CL

Causal Connections: Leveraging Multilingual Fine-Tuning for Financial QA@FinCausal 2026

Pith reviewed 2026-06-29 02:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords financial causality extractionmultilingual fine-tuningextractive question answeringFinCausal shared taskcause-effect relationscross-lingual transferLLM adaptation
0
0 comments X

The pith

Fine-tuning GPT-4.1 Mini on combined English and Spanish data produces top-ranked cause-effect extraction from financial narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three families of models on the FinCausal 2026 task of turning financial text into cause-effect question-answer pairs in English and Spanish. It finds that prompt-based and few-shot methods are competitive but that supervised fine-tuning yields the clearest gains. The strongest result comes from taking GPT-4.1 Mini and training it on the pooled English-plus-Spanish data, which ties for first place on the English subtask and places third on Spanish under the shared task's automatic judge. The work therefore treats multilingual fine-tuning as a practical route to better cross-lingual transfer inside a narrow domain.

Core claim

The central claim is that decoder-only LLMs fine-tuned on the union of English and Spanish training examples outperform both prompting regimes and the other model families (multilingual BERT token tagging and multilingual BART generation) on the shared-task metric, reaching a tied-highest English score of 4.8140 and a third-place Spanish score of 4.7753.

What carries the argument

Supervised fine-tuning of a decoder-only LLM on combined multilingual training data for extractive cause-effect QA.

If this is right

  • Supervised fine-tuning supplies larger performance lifts than prompt refinement or few-shot demonstrations alone.
  • Training on pooled English and Spanish data improves results on both language subtasks through cross-lingual transfer.
  • Decoder-only LLMs with task-specific adaptation surpass encoder-only and encoder-decoder baselines in this setting.
  • Prompting and few-shot methods remain viable but are consistently outperformed once labeled data is available for fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same combined-data fine-tuning recipe could be tried on other narrow-domain multilingual QA tasks where labeled data exist in only a few languages.
  • If the LLM judge correlates well with human judgment, the approach offers a low-cost way to bootstrap high-quality cause-effect annotations across additional languages without new labeled sets.
  • The performance edge may shrink if future test data diverge sharply from the 2026 shared-task distribution.

Load-bearing premise

The shared task's LLM-as-a-judge score accurately reflects the real quality of extracted cause-effect pairs and the given training and test sets are representative of financial narratives.

What would settle it

Re-evaluation of the same model outputs on a fresh set of financial documents using human annotators or a different automatic metric that produces materially lower relative rankings for the fine-tuned GPT system.

Figures

Figures reproduced from arXiv: 2606.27446 by Akash Kumar Gautam, Christian H\"anig, Serhii Hamotskyi.

Figure 1
Figure 1. Figure 1: External LLM-judge score (English subtask) as a function of the number of few￾shot demonstrations in the prompt for the best￾performing decoder-only configuration. As a further enhancement, we fine-tune the model on up to 2,000 samples, experimenting with three configurations: training on English data only, Spanish data only, and a combined bilingual dataset. Fine-tuning reinforces task-specific behavior a… view at source ↗
Figure 2
Figure 2. Figure 2: Prompt used for extractive QA by decoder models used in the FinCausal 2026 shared task. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
read the original abstract

This paper describes team HSA_CORAL's submission to the FinCausal 2026 shared task on extracting cause-effect relations from financial narratives via extractive question answering in English and Spanish. We compare three modeling families: (i) encoder-only token tagging with multilingual BERT, (ii) encoder-decoder generation with multilingual BART, and (iii) decoder-only LLMs (Llama 3.1 and GPT variants) using prompt refinement, few-shot demonstrations, and supervised fine-tuning. Across settings, prompting and few-shot examples yield competitive performance, while supervised fine-tuning provides the largest gains. Our best system, GPT-4.1 Mini fine-tuned on combined English and Spanish training data, achieves a tied highest score on the English subtask (score 4.8140) and ranks third on Spanish (score 4.7753) under the shared task's LLM-as-a-judge metric. Overall, the results highlight the value of task-specific adaptation and multilingual fine-tuning for cross-lingual transfer in financial causality QA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. This paper reports on the HSA_CORAL team's submission to the FinCausal 2026 shared task for extracting cause-effect relations from financial narratives in English and Spanish using extractive question answering. The authors evaluate three families of models: encoder-only token tagging with multilingual BERT, encoder-decoder generation with multilingual BART, and decoder-only LLMs (Llama 3.1 and GPT variants) with prompt refinement, few-shot demonstrations, and supervised fine-tuning. Prompting and few-shot yield competitive results, while fine-tuning gives the largest gains. Their top system is GPT-4.1 Mini fine-tuned on combined English and Spanish data, achieving a tied highest score of 4.8140 on English and third place with 4.7753 on Spanish under the LLM-as-a-judge metric. The work emphasizes the value of task-specific adaptation and multilingual fine-tuning for cross-lingual transfer.

Significance. If the results hold, this work demonstrates the practical benefits of multilingual fine-tuning for cross-lingual performance in a specialized domain like financial causality extraction. It contributes empirical data from a shared task setting, including a direct comparison across model architectures. The explicit reporting of submission scores under the official metric is useful for the community. The systematic exploration of modeling families provides a clear record of what worked for this task.

minor comments (3)
  1. [Abstract] Abstract: The claim that 'supervised fine-tuning provides the largest gains' would be strengthened by a brief quantitative comparison or reference to a results table showing deltas over the prompting and few-shot baselines.
  2. [Results] The manuscript reports concrete scores but omits error bars, number of runs, or any statistical tests; adding these (even if only for the final systems) would improve interpretability of the tied-highest and third-place rankings.
  3. [Experiments] Consider including a summary table of all model families and settings (BERT, BART, Llama, GPT variants) with their respective scores to make the cross-family comparison explicit rather than narrative only.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our submission to the FinCausal 2026 shared task and for recommending minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; purely empirical shared-task report

full rationale

The paper describes standard fine-tuning and prompting experiments on provided training data for a shared task, then reports submission scores under the task's external LLM-as-a-judge metric. No derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations exist. The central claim is a factual report of benchmark performance with no internal reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No theoretical claims, derivations, or new entities; the work is an empirical shared-task report with no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5723 in / 1020 out tokens · 16733 ms · 2026-06-29T02:14:57.815467+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references · 29 canonical work pages · 7 internal anchors

  1. [1]

    Proceedings of the 4th Financial Narrative Processing Workshop@ LREC2022 , pages=

    The financial causality extraction shared task (FinCausal 2022) , author=. Proceedings of the 4th Financial Narrative Processing Workshop@ LREC2022 , pages=

  2. [2]

    Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation , pages=

    The financial document causality detection shared task (FinCausal 2020) , author=. Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation , pages=

  3. [3]

    The financial document causality detection shared task (FinCausal 2025) , author=. Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal) , pages=

  4. [4]

    2023 IEEE International Conference on Big Data, BigData 2023 , pages=

    The Financial Narrative Summarisation Shared Task (FNS 2023) , author=. 2023 IEEE International Conference on Big Data, BigData 2023 , pages=. 2023 , organization=

  5. [5]

    Information , volume=

    Text to causal knowledge graph: A framework to synthesize knowledge from unstructured business texts into causal graphs , author=. Information , volume=. 2023 , publisher=

  6. [6]

    Bioinformatics , volume=

    Sequence tagging for biomedical extractive question answering , author=. Bioinformatics , volume=. 2022 , publisher=

  7. [7]

    Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

    BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension , author=. Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

  8. [8]

    Proceedings of the 4th financial narrative processing workshop@ LREC2022 , pages=

    Lipi at fincausal 2022: Mining causes and effects from financial texts , author=. Proceedings of the 4th financial narrative processing workshop@ LREC2022 , pages=

  9. [9]

    Journal of Management Information and Decision Sciences , volume=

    Earning movement prediction using machine learning-support vector machines (SVM) , author=. Journal of Management Information and Decision Sciences , volume=. 2019 , publisher=

  10. [10]

    Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (Demonstrations) , pages=

    End-to-end open-domain question answering with bertserini , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (Demonstrations) , pages=

  11. [11]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Prototypical fine-tuning: Towards robust performance under varying data sizes , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  12. [12]

    Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

    Moqagpt: Zero-shot multi-modal open-domain question answering with large language model , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

  13. [13]

    Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) , pages=

    Quasi: a synthetic question-answering dataset in Swedish using GPT-3 and zero-shot learning , author=. Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) , pages=

  14. [14]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Few shot generative model adaption via relaxed spatial structural alignment , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  15. [15]

    , author=

    Lora: Low-rank adaptation of large language models. , author=. Iclr , volume=

  16. [16]

    Proceedings of the 3rd Workshop on Machine Reading for Question Answering , pages=

    Semantic answer similarity for evaluating question answering models , author=. Proceedings of the 3rd Workshop on Machine Reading for Question Answering , pages=

  17. [17]

    The Journal of Supercomputing , volume=

    Financial causal sentence recognition based on BERT-CNN text classification , author=. The Journal of Supercomputing , volume=. 2022 , publisher=

  18. [18]

    Proceedings of the 2023 conference on empirical methods in natural language processing , pages=

    Event causality extraction via implicit cause-effect interactions , author=. Proceedings of the 2023 conference on empirical methods in natural language processing , pages=

  19. [19]

    Companion Proceedings of the Web Conference 2022 , pages=

    A generative approach for financial causality extraction , author=. Companion Proceedings of the Web Conference 2022 , pages=

  20. [20]

    arXiv preprint arXiv:2401.11817 , year=

    Hallucination is inevitable: An innate limitation of large language models , author=. arXiv preprint arXiv:2401.11817 , year=

  21. [21]

    Advances in neural information processing systems , volume=

    Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

  22. [22]

    arXiv preprint arXiv:2204.05862 , year=

    Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

  23. [23]

    Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

  24. [24]

    arXiv preprint arXiv:2301.07597 , year=

    How close is chatgpt to human experts? comparison corpus, evaluation, and detection , author=. arXiv preprint arXiv:2301.07597 , year=

  25. [25]

    arXiv preprint arXiv:2204.09600 , year=

    Hierarchical BERT for medical document understanding , author=. arXiv preprint arXiv:2204.09600 , year=

  26. [26]

    Contemporary Accounting Research , volume=

    FinBERT: A large language model for extracting information from financial text , author=. Contemporary Accounting Research , volume=. 2023 , publisher=

  27. [27]

    Proceedings of the Third International Conference on AI-ML Systems , pages=

    Towards reducing hallucination in extracting information from financial reports using large language models , author=. Proceedings of the Third International Conference on AI-ML Systems , pages=

  28. [28]

    Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis , pages=

    Can ChatGPT understand causal language in science claims? , author=. Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis , pages=

  29. [29]

    Proceedings of the 7th Financial Narrative Processing Workshop (FNP 2026) at LREC 2026

    Moreno-Sandoval, Antonio and Porta, Jordi and Torterolo, Yanco and Stanescu, Alexia and Chatzi, Melina and Roseti, Sof \' a. Proceedings of the 7th Financial Narrative Processing Workshop (FNP 2026) at LREC 2026. 2026

  30. [31]

    Kydlı́ček, Hynek and Penedo, Guilherme and von Werra, Leandro , year =

  31. [32]

    Farquhar, Sebastian and Gal, Yarin and Rainforth, Tom , month = may, year =. On. doi:10.48550/arXiv.2101.11665 , abstract =

  32. [33]

    Kossen, Jannik and Farquhar, Sebastian and Gal, Yarin and Rainforth, Tom , month = jun, year =. Active. doi:10.48550/arXiv.2103.05331 , abstract =

  33. [34]

    Berrada, Gabrielle and Kossen, Jannik and Razzak, Muhammed and Smith, Freddie Bickford and Gal, Yarin and Rainforth, Tom , month = aug, year =. Scaling. doi:10.48550/arXiv.2508.09093 , abstract =

  34. [35]

    and Sitawarin, Chawin and Guo, Chuan and Kokhlikyan, Narine and Suh, G

    Morris, John X. and Sitawarin, Chawin and Guo, Chuan and Kokhlikyan, Narine and Suh, G. Edward and Rush, Alexander M. and Chaudhuri, Kamalika and Mahloujifar, Saeed , month = jun, year =. How much do language models memorize? , url =. doi:10.48550/arXiv.2505.24832 , abstract =

  35. [36]

    Gienapp, Lukas and Hagen, Tim and Fröbe, Maik and Hagen, Matthias and Stein, Benno and Potthast, Martin and Scells, Harrisen , month = apr, year =. The. doi:10.1145/3726302.3730093 , abstract =

  36. [37]

    Singh, Shivalika and Nan, Yiyang and Wang, Alex and D'Souza, Daniel and Kapoor, Sayash and Üstün, Ahmet and Koyejo, Sanmi and Deng, Yuntian and Longpre, Shayne and Smith, Noah and Ermis, Beyza and Fadaee, Marzieh and Hooker, Sara , month = apr, year =. The. doi:10.48550/arXiv.2504.20879 , abstract =

  37. [38]

    Towards a

    Zhong, Ming and Liu, Yang and Yin, Da and Mao, Yuning and Jiao, Yizhu and Liu, Pengfei and Zhu, Chenguang and Ji, Heng and Han, Jiawei , month = oct, year =. Towards a. doi:10.48550/arXiv.2210.07197 , abstract =

  38. [39]

    Proceedings of the 2023

    Chan, David and Petryk, Suzanne and Gonzalez, Joseph and Darrell, Trevor and Canny, John , year =. Proceedings of the 2023. doi:10.18653/v1/2023.emnlp-main.841 , language =

  39. [40]

    Proceedings of the 5th

    Lin, Yen-Ting and Chen, Yun-Nung , year =. Proceedings of the 5th. doi:10.18653/v1/2023.nlp4convai-1.5 , language =

  40. [41]

    Aligning with

    Liu, Yinhong and Zhou, Han and Guo, Zhijiang and Shareghi, Ehsan and Vulić, Ivan and Korhonen, Anna and Collier, Nigel , month = jan, year =. Aligning with. doi:10.48550/arXiv.2403.16950 , abstract =

  41. [42]

    Liusie, Adian and Manakul, Potsawee and Gales, Mark J. F. , month = feb, year =. doi:10.48550/arXiv.2307.07889 , abstract =

  42. [43]

    Gu, Jiawei and Jiang, Xuhui and Shi, Zhichao and Tan, Hexiang and Zhai, Xuehao and Xu, Chengjin and Li, Wei and Shen, Yinghan and Ma, Shengjie and Liu, Honghao and Wang, Saizhuo and Zhang, Kun and Wang, Yuanzhuo and Gao, Wen and Ni, Lionel and Guo, Jian , month = mar, year =. A. doi:10.48550/arXiv.2411.15594 , abstract =

  43. [44]

    2025 , pages =

    Computational Linguistics , author =. 2025 , pages =. doi:10.1162/coli_a_00561 , abstract =

  44. [45]

    BERTScore: Evaluating Text Generation with BERT

    Zhang, Tianyi and Kishore, Varsha and Wu, Felix and Weinberger, Kilian Q. and Artzi, Yoav , month = feb, year =. doi:10.48550/arXiv.1904.09675 , abstract =

  45. [46]

    GPTScore: Evaluate as You Desire

    Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei , month = feb, year =. doi:10.48550/arXiv.2302.04166 , abstract =

  46. [47]

    Asking and

    Wang, Alex and Cho, Kyunghyun and Lewis, Mike , month = apr, year =. Asking and. doi:10.48550/arXiv.2004.04228 , abstract =

  47. [48]

    Topical-

    Gopalakrishnan, Karthik and Hedayatnia, Behnam and Chen, Qinlang and Gottardi, Anna and Kwatra, Sanjeev and Venkatesh, Anu and Gabriel, Raefer and Hakkani-Tur, Dilek , month = aug, year =. Topical-. doi:10.48550/arXiv.2308.11995 , abstract =

  48. [49]

    Fabbri, Wojciech Kry \'s ci \'n ski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev

    Transactions of the Association for Computational Linguistics , author =. 2021 , pages =. doi:10.1162/tacl_a_00373 , abstract =

  49. [50]

    Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang , month = may, year =. G-. doi:10.48550/arXiv.2303.16634 , abstract =

  50. [51]

    doi:10.48550/arXiv.2407.11691 , abstract =

    Duan, Haodong and Yang, Junming and Qiao, Yuxuan and Fang, Xinyu and Chen, Lin and Liu, Yuan and Agarwal, Amit and Chen, Zhe and Li, Mo and Ma, Yubo and Sun, Hailong and Zhao, Xiangyu and Cui, Junbo and Dong, Xiaoyi and Zang, Yuhang and Zhang, Pan and Wang, Jiaqi and Lin, Dahua and Chen, Kai , month = sep, year =. doi:10.48550/arXiv.2407.11691 , abstract =

  51. [52]

    Jacob, Marc , month = feb, year =. German. doi:10.7910/DVN/FSCDPI , abstract =

  52. [53]

    Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and Yang, Amy and Fan, Angela and Goyal, Anirudh and Hartshorn, Anthony and Yang, Aobo and Mitra, Archi and Sravankumar, Archie and Korenev, Artem and Hinsvark, A...

  53. [54]

    doi:10.48550/arXiv.2411.15296 , abstract =

    Fu, Chaoyou and Zhang, Yi-Fan and Yin, Shukang and Li, Bo and Fang, Xinyu and Zhao, Sirui and Duan, Haodong and Sun, Xing and Liu, Ziwei and Wang, Liang and Shan, Caifeng and He, Ran , month = dec, year =. doi:10.48550/arXiv.2411.15296 , abstract =

  54. [55]

    Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (

    Hamotskyi, Serhii and Kozaeva, Nata and Hänig, Christian , editor =. Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (. 2024 , pages =

  55. [56]

    Development and evaluation of a

    Kozaeva, Nata and Hamotskyi, Serhii and Hanig, Christian , editor =. Development and evaluation of a. Proceedings of the joint workshop of the 7th financial technology and natural language processing, the 5th knowledge discovery from unstructured data in financial services, and the 4th workshop on economics and natural language processing , publisher =. 2...

  56. [57]

    Proceedings of the

    Krieg-Holz, Ulrike and Schuschnig, Christian and Matthies, Franz and Redling, Benjamin and Hahn, Udo , editor =. Proceedings of the. 2016 , pages =

  57. [58]

    Proceedings of

    Hänig, Christian and Schlösser, Markus and Hamotskyi, Serhii and Zambaku, Gent and Blankenburg, Janek , year =. Proceedings of

  58. [59]
  59. [60]

    doi:10.48550/arXiv.2111.15664 , abstract =

    Kim, Geewook and Hong, Teakgyu and Yim, Moonbin and Nam, Jeongyeon and Park, Jinyoung and Yim, Jinyeong and Hwang, Wonseok and Yun, Sangdoo and Han, Dongyoon and Park, Seunghyun , month = oct, year =. doi:10.48550/arXiv.2111.15664 , abstract =

  60. [61]

    2024 , note =

    Docling technical report , url =. 2024 , note =. doi:10.48550/arXiv.2408.09869 , author =

  61. [62]

    ColPali: Efficient Document Retrieval with Vision Language Models

    Faysse, Manuel and Sibille, Hugues and Wu, Tony and Omrani, Bilel and Viaud, Gautier and Hudelot, Céline and Colombo, Pierre , month = oct, year =. doi:10.48550/arXiv.2407.01449 , abstract =

  62. [63]

    International Journal of Data Science and Analytics , author =

    Anonymization of. International Journal of Data Science and Analytics , author =. 2022 , keywords =. doi:10.1007/s41060-021-00285-x , abstract =

  63. [64]

    Deduplicating

    Lee, Katherine and Ippolito, Daphne and Nystrom, Andrew and Zhang, Chiyuan and Eck, Douglas and Callison-Burch, Chris and Carlini, Nicholas , month = mar, year =. Deduplicating

  64. [65]

    Catalan Speecon database

    Speecon Consortium. Catalan Speecon database. 2011

  65. [66]

    The EMILLE/CIIL Corpus

    Anthony McEnery and others. The EMILLE/CIIL Corpus. 2004

  66. [67]

    The OrienTel Moroccan MCA (Modern Colloquial Arabic) database

    Khalid Choukri and Niklas Paullson. The OrienTel Moroccan MCA (Modern Colloquial Arabic) database. 2004

  67. [68]

    ItalWordNet v.2

    Roventini, Adriana and Marinelli, Rita and Bertagna, Francesca. ItalWordNet v.2

  68. [69]

    2026 , version =

    Moreno-Sandoval, Antonio and Torterolo Orta, Yanco Amor and Stanescu, Maria Alexia and Chatzi, Melina , publisher =. 2026 , version =. doi:10.21950/H7RKHH , url =