Causal Connections: Leveraging Multilingual Fine-Tuning for Financial QA@FinCausal 2026
Pith reviewed 2026-06-29 02:14 UTC · model grok-4.3
The pith
Fine-tuning GPT-4.1 Mini on combined English and Spanish data produces top-ranked cause-effect extraction from financial narratives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that decoder-only LLMs fine-tuned on the union of English and Spanish training examples outperform both prompting regimes and the other model families (multilingual BERT token tagging and multilingual BART generation) on the shared-task metric, reaching a tied-highest English score of 4.8140 and a third-place Spanish score of 4.7753.
What carries the argument
Supervised fine-tuning of a decoder-only LLM on combined multilingual training data for extractive cause-effect QA.
If this is right
- Supervised fine-tuning supplies larger performance lifts than prompt refinement or few-shot demonstrations alone.
- Training on pooled English and Spanish data improves results on both language subtasks through cross-lingual transfer.
- Decoder-only LLMs with task-specific adaptation surpass encoder-only and encoder-decoder baselines in this setting.
- Prompting and few-shot methods remain viable but are consistently outperformed once labeled data is available for fine-tuning.
Where Pith is reading between the lines
- The same combined-data fine-tuning recipe could be tried on other narrow-domain multilingual QA tasks where labeled data exist in only a few languages.
- If the LLM judge correlates well with human judgment, the approach offers a low-cost way to bootstrap high-quality cause-effect annotations across additional languages without new labeled sets.
- The performance edge may shrink if future test data diverge sharply from the 2026 shared-task distribution.
Load-bearing premise
The shared task's LLM-as-a-judge score accurately reflects the real quality of extracted cause-effect pairs and the given training and test sets are representative of financial narratives.
What would settle it
Re-evaluation of the same model outputs on a fresh set of financial documents using human annotators or a different automatic metric that produces materially lower relative rankings for the fine-tuned GPT system.
Figures
read the original abstract
This paper describes team HSA_CORAL's submission to the FinCausal 2026 shared task on extracting cause-effect relations from financial narratives via extractive question answering in English and Spanish. We compare three modeling families: (i) encoder-only token tagging with multilingual BERT, (ii) encoder-decoder generation with multilingual BART, and (iii) decoder-only LLMs (Llama 3.1 and GPT variants) using prompt refinement, few-shot demonstrations, and supervised fine-tuning. Across settings, prompting and few-shot examples yield competitive performance, while supervised fine-tuning provides the largest gains. Our best system, GPT-4.1 Mini fine-tuned on combined English and Spanish training data, achieves a tied highest score on the English subtask (score 4.8140) and ranks third on Spanish (score 4.7753) under the shared task's LLM-as-a-judge metric. Overall, the results highlight the value of task-specific adaptation and multilingual fine-tuning for cross-lingual transfer in financial causality QA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper reports on the HSA_CORAL team's submission to the FinCausal 2026 shared task for extracting cause-effect relations from financial narratives in English and Spanish using extractive question answering. The authors evaluate three families of models: encoder-only token tagging with multilingual BERT, encoder-decoder generation with multilingual BART, and decoder-only LLMs (Llama 3.1 and GPT variants) with prompt refinement, few-shot demonstrations, and supervised fine-tuning. Prompting and few-shot yield competitive results, while fine-tuning gives the largest gains. Their top system is GPT-4.1 Mini fine-tuned on combined English and Spanish data, achieving a tied highest score of 4.8140 on English and third place with 4.7753 on Spanish under the LLM-as-a-judge metric. The work emphasizes the value of task-specific adaptation and multilingual fine-tuning for cross-lingual transfer.
Significance. If the results hold, this work demonstrates the practical benefits of multilingual fine-tuning for cross-lingual performance in a specialized domain like financial causality extraction. It contributes empirical data from a shared task setting, including a direct comparison across model architectures. The explicit reporting of submission scores under the official metric is useful for the community. The systematic exploration of modeling families provides a clear record of what worked for this task.
minor comments (3)
- [Abstract] Abstract: The claim that 'supervised fine-tuning provides the largest gains' would be strengthened by a brief quantitative comparison or reference to a results table showing deltas over the prompting and few-shot baselines.
- [Results] The manuscript reports concrete scores but omits error bars, number of runs, or any statistical tests; adding these (even if only for the final systems) would improve interpretability of the tied-highest and third-place rankings.
- [Experiments] Consider including a summary table of all model families and settings (BERT, BART, Llama, GPT variants) with their respective scores to make the cross-family comparison explicit rather than narrative only.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of our submission to the FinCausal 2026 shared task and for recommending minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity; purely empirical shared-task report
full rationale
The paper describes standard fine-tuning and prompting experiments on provided training data for a shared task, then reports submission scores under the task's external LLM-as-a-judge metric. No derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations exist. The central claim is a factual report of benchmark performance with no internal reduction to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Proceedings of the 4th Financial Narrative Processing Workshop@ LREC2022 , pages=
The financial causality extraction shared task (FinCausal 2022) , author=. Proceedings of the 4th Financial Narrative Processing Workshop@ LREC2022 , pages=
2022
-
[2]
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation , pages=
The financial document causality detection shared task (FinCausal 2020) , author=. Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation , pages=
2020
-
[3]
The financial document causality detection shared task (FinCausal 2025) , author=. Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal) , pages=
2025
-
[4]
2023 IEEE International Conference on Big Data, BigData 2023 , pages=
The Financial Narrative Summarisation Shared Task (FNS 2023) , author=. 2023 IEEE International Conference on Big Data, BigData 2023 , pages=. 2023 , organization=
2023
-
[5]
Information , volume=
Text to causal knowledge graph: A framework to synthesize knowledge from unstructured business texts into causal graphs , author=. Information , volume=. 2023 , publisher=
2023
-
[6]
Bioinformatics , volume=
Sequence tagging for biomedical extractive question answering , author=. Bioinformatics , volume=. 2022 , publisher=
2022
-
[7]
Proceedings of the 58th annual meeting of the association for computational linguistics , pages=
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension , author=. Proceedings of the 58th annual meeting of the association for computational linguistics , pages=
-
[8]
Proceedings of the 4th financial narrative processing workshop@ LREC2022 , pages=
Lipi at fincausal 2022: Mining causes and effects from financial texts , author=. Proceedings of the 4th financial narrative processing workshop@ LREC2022 , pages=
2022
-
[9]
Journal of Management Information and Decision Sciences , volume=
Earning movement prediction using machine learning-support vector machines (SVM) , author=. Journal of Management Information and Decision Sciences , volume=. 2019 , publisher=
2019
-
[10]
Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (Demonstrations) , pages=
End-to-end open-domain question answering with bertserini , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (Demonstrations) , pages=
2019
-
[11]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Prototypical fine-tuning: Towards robust performance under varying data sizes , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[12]
Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=
Moqagpt: Zero-shot multi-modal open-domain question answering with large language model , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=
2023
-
[13]
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) , pages=
Quasi: a synthetic question-answering dataset in Swedish using GPT-3 and zero-shot learning , author=. Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) , pages=
-
[14]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Few shot generative model adaption via relaxed spatial structural alignment , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[15]
, author=
Lora: Low-rank adaptation of large language models. , author=. Iclr , volume=
-
[16]
Proceedings of the 3rd Workshop on Machine Reading for Question Answering , pages=
Semantic answer similarity for evaluating question answering models , author=. Proceedings of the 3rd Workshop on Machine Reading for Question Answering , pages=
-
[17]
The Journal of Supercomputing , volume=
Financial causal sentence recognition based on BERT-CNN text classification , author=. The Journal of Supercomputing , volume=. 2022 , publisher=
2022
-
[18]
Proceedings of the 2023 conference on empirical methods in natural language processing , pages=
Event causality extraction via implicit cause-effect interactions , author=. Proceedings of the 2023 conference on empirical methods in natural language processing , pages=
2023
-
[19]
Companion Proceedings of the Web Conference 2022 , pages=
A generative approach for financial causality extraction , author=. Companion Proceedings of the Web Conference 2022 , pages=
2022
-
[20]
arXiv preprint arXiv:2401.11817 , year=
Hallucination is inevitable: An innate limitation of large language models , author=. arXiv preprint arXiv:2401.11817 , year=
-
[21]
Advances in neural information processing systems , volume=
Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=
-
[22]
arXiv preprint arXiv:2204.05862 , year=
Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=
-
[23]
Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=
2019
-
[24]
arXiv preprint arXiv:2301.07597 , year=
How close is chatgpt to human experts? comparison corpus, evaluation, and detection , author=. arXiv preprint arXiv:2301.07597 , year=
-
[25]
arXiv preprint arXiv:2204.09600 , year=
Hierarchical BERT for medical document understanding , author=. arXiv preprint arXiv:2204.09600 , year=
-
[26]
Contemporary Accounting Research , volume=
FinBERT: A large language model for extracting information from financial text , author=. Contemporary Accounting Research , volume=. 2023 , publisher=
2023
-
[27]
Proceedings of the Third International Conference on AI-ML Systems , pages=
Towards reducing hallucination in extracting information from financial reports using large language models , author=. Proceedings of the Third International Conference on AI-ML Systems , pages=
-
[28]
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis , pages=
Can ChatGPT understand causal language in science claims? , author=. Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis , pages=
-
[29]
Proceedings of the 7th Financial Narrative Processing Workshop (FNP 2026) at LREC 2026
Moreno-Sandoval, Antonio and Porta, Jordi and Torterolo, Yanco and Stanescu, Alexia and Chatzi, Melina and Roseti, Sof \' a. Proceedings of the 7th Financial Narrative Processing Workshop (FNP 2026) at LREC 2026. 2026
2026
-
[31]
Kydlı́ček, Hynek and Penedo, Guilherme and von Werra, Leandro , year =
-
[32]
Farquhar, Sebastian and Gal, Yarin and Rainforth, Tom , month = may, year =. On. doi:10.48550/arXiv.2101.11665 , abstract =
-
[33]
Kossen, Jannik and Farquhar, Sebastian and Gal, Yarin and Rainforth, Tom , month = jun, year =. Active. doi:10.48550/arXiv.2103.05331 , abstract =
-
[34]
Berrada, Gabrielle and Kossen, Jannik and Razzak, Muhammed and Smith, Freddie Bickford and Gal, Yarin and Rainforth, Tom , month = aug, year =. Scaling. doi:10.48550/arXiv.2508.09093 , abstract =
-
[35]
and Sitawarin, Chawin and Guo, Chuan and Kokhlikyan, Narine and Suh, G
Morris, John X. and Sitawarin, Chawin and Guo, Chuan and Kokhlikyan, Narine and Suh, G. Edward and Rush, Alexander M. and Chaudhuri, Kamalika and Mahloujifar, Saeed , month = jun, year =. How much do language models memorize? , url =. doi:10.48550/arXiv.2505.24832 , abstract =
-
[36]
Gienapp, Lukas and Hagen, Tim and Fröbe, Maik and Hagen, Matthias and Stein, Benno and Potthast, Martin and Scells, Harrisen , month = apr, year =. The. doi:10.1145/3726302.3730093 , abstract =
-
[37]
Singh, Shivalika and Nan, Yiyang and Wang, Alex and D'Souza, Daniel and Kapoor, Sayash and Üstün, Ahmet and Koyejo, Sanmi and Deng, Yuntian and Longpre, Shayne and Smith, Noah and Ermis, Beyza and Fadaee, Marzieh and Hooker, Sara , month = apr, year =. The. doi:10.48550/arXiv.2504.20879 , abstract =
-
[38]
Zhong, Ming and Liu, Yang and Yin, Da and Mao, Yuning and Jiao, Yizhu and Liu, Pengfei and Zhu, Chenguang and Ji, Heng and Han, Jiawei , month = oct, year =. Towards a. doi:10.48550/arXiv.2210.07197 , abstract =
-
[39]
Chan, David and Petryk, Suzanne and Gonzalez, Joseph and Darrell, Trevor and Canny, John , year =. Proceedings of the 2023. doi:10.18653/v1/2023.emnlp-main.841 , language =
-
[40]
Lin, Yen-Ting and Chen, Yun-Nung , year =. Proceedings of the 5th. doi:10.18653/v1/2023.nlp4convai-1.5 , language =
-
[41]
Liu, Yinhong and Zhou, Han and Guo, Zhijiang and Shareghi, Ehsan and Vulić, Ivan and Korhonen, Anna and Collier, Nigel , month = jan, year =. Aligning with. doi:10.48550/arXiv.2403.16950 , abstract =
-
[42]
Liusie, Adian and Manakul, Potsawee and Gales, Mark J. F. , month = feb, year =. doi:10.48550/arXiv.2307.07889 , abstract =
-
[43]
Gu, Jiawei and Jiang, Xuhui and Shi, Zhichao and Tan, Hexiang and Zhai, Xuehao and Xu, Chengjin and Li, Wei and Shen, Yinghan and Ma, Shengjie and Liu, Honghao and Wang, Saizhuo and Zhang, Kun and Wang, Yuanzhuo and Gao, Wen and Ni, Lionel and Guo, Jian , month = mar, year =. A. doi:10.48550/arXiv.2411.15594 , abstract =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2411.15594
-
[44]
Computational Linguistics , author =. 2025 , pages =. doi:10.1162/coli_a_00561 , abstract =
-
[45]
BERTScore: Evaluating Text Generation with BERT
Zhang, Tianyi and Kishore, Varsha and Wu, Felix and Weinberger, Kilian Q. and Artzi, Yoav , month = feb, year =. doi:10.48550/arXiv.1904.09675 , abstract =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1904.09675 1904
-
[46]
GPTScore: Evaluate as You Desire
Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei , month = feb, year =. doi:10.48550/arXiv.2302.04166 , abstract =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.04166
-
[47]
Wang, Alex and Cho, Kyunghyun and Lewis, Mike , month = apr, year =. Asking and. doi:10.48550/arXiv.2004.04228 , abstract =
-
[48]
Gopalakrishnan, Karthik and Hedayatnia, Behnam and Chen, Qinlang and Gottardi, Anna and Kwatra, Sanjeev and Venkatesh, Anu and Gabriel, Raefer and Hakkani-Tur, Dilek , month = aug, year =. Topical-. doi:10.48550/arXiv.2308.11995 , abstract =
-
[49]
Fabbri, Wojciech Kry \'s ci \'n ski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev
Transactions of the Association for Computational Linguistics , author =. 2021 , pages =. doi:10.1162/tacl_a_00373 , abstract =
-
[50]
Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang , month = may, year =. G-. doi:10.48550/arXiv.2303.16634 , abstract =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.16634
-
[51]
doi:10.48550/arXiv.2407.11691 , abstract =
Duan, Haodong and Yang, Junming and Qiao, Yuxuan and Fang, Xinyu and Chen, Lin and Liu, Yuan and Agarwal, Amit and Chen, Zhe and Li, Mo and Ma, Yubo and Sun, Hailong and Zhao, Xiangyu and Cui, Junbo and Dong, Xiaoyi and Zang, Yuhang and Zhang, Pan and Wang, Jiaqi and Lin, Dahua and Chen, Kai , month = sep, year =. doi:10.48550/arXiv.2407.11691 , abstract =
-
[52]
Jacob, Marc , month = feb, year =. German. doi:10.7910/DVN/FSCDPI , abstract =
-
[53]
Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and Yang, Amy and Fan, Angela and Goyal, Anirudh and Hartshorn, Anthony and Yang, Aobo and Mitra, Archi and Sravankumar, Archie and Korenev, Artem and Hinsvark, A...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
-
[54]
doi:10.48550/arXiv.2411.15296 , abstract =
Fu, Chaoyou and Zhang, Yi-Fan and Yin, Shukang and Li, Bo and Fang, Xinyu and Zhao, Sirui and Duan, Haodong and Sun, Xing and Liu, Ziwei and Wang, Liang and Shan, Caifeng and He, Ran , month = dec, year =. doi:10.48550/arXiv.2411.15296 , abstract =
-
[55]
Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (
Hamotskyi, Serhii and Kozaeva, Nata and Hänig, Christian , editor =. Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (. 2024 , pages =
2024
-
[56]
Development and evaluation of a
Kozaeva, Nata and Hamotskyi, Serhii and Hanig, Christian , editor =. Development and evaluation of a. Proceedings of the joint workshop of the 7th financial technology and natural language processing, the 5th knowledge discovery from unstructured data in financial services, and the 4th workshop on economics and natural language processing , publisher =. 2...
2024
-
[57]
Proceedings of the
Krieg-Holz, Ulrike and Schuschnig, Christian and Matthies, Franz and Redling, Benjamin and Hahn, Udo , editor =. Proceedings of the. 2016 , pages =
2016
-
[58]
Proceedings of
Hänig, Christian and Schlösser, Markus and Hamotskyi, Serhii and Zambaku, Gent and Blankenburg, Janek , year =. Proceedings of
-
[59]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
2025 , note =. doi:10.48550/arXiv.2501.12948 , abstract =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025
-
[60]
doi:10.48550/arXiv.2111.15664 , abstract =
Kim, Geewook and Hong, Teakgyu and Yim, Moonbin and Nam, Jeongyeon and Park, Jinyoung and Yim, Jinyeong and Hwang, Wonseok and Yun, Sangdoo and Han, Dongyoon and Park, Seunghyun , month = oct, year =. doi:10.48550/arXiv.2111.15664 , abstract =
-
[61]
Docling technical report , url =. 2024 , note =. doi:10.48550/arXiv.2408.09869 , author =
-
[62]
ColPali: Efficient Document Retrieval with Vision Language Models
Faysse, Manuel and Sibille, Hugues and Wu, Tony and Omrani, Bilel and Viaud, Gautier and Hudelot, Céline and Colombo, Pierre , month = oct, year =. doi:10.48550/arXiv.2407.01449 , abstract =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.01449
-
[63]
International Journal of Data Science and Analytics , author =
Anonymization of. International Journal of Data Science and Analytics , author =. 2022 , keywords =. doi:10.1007/s41060-021-00285-x , abstract =
-
[64]
Deduplicating
Lee, Katherine and Ippolito, Daphne and Nystrom, Andrew and Zhang, Chiyuan and Eck, Douglas and Callison-Burch, Chris and Carlini, Nicholas , month = mar, year =. Deduplicating
-
[65]
Catalan Speecon database
Speecon Consortium. Catalan Speecon database. 2011
2011
-
[66]
The EMILLE/CIIL Corpus
Anthony McEnery and others. The EMILLE/CIIL Corpus. 2004
2004
-
[67]
The OrienTel Moroccan MCA (Modern Colloquial Arabic) database
Khalid Choukri and Niklas Paullson. The OrienTel Moroccan MCA (Modern Colloquial Arabic) database. 2004
2004
-
[68]
ItalWordNet v.2
Roventini, Adriana and Marinelli, Rita and Bertagna, Francesca. ItalWordNet v.2
-
[69]
Moreno-Sandoval, Antonio and Torterolo Orta, Yanco Amor and Stanescu, Maria Alexia and Chatzi, Melina , publisher =. 2026 , version =. doi:10.21950/H7RKHH , url =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.