pith. sign in

arxiv: 2606.08051 · v1 · pith:SJQK5XTMnew · submitted 2026-06-06 · 💻 cs.AI · cs.LG

How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions

Pith reviewed 2026-06-27 19:40 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LoRA fine-tuningmerchant information extractionfinancial transactionssmall language modelsstructured data extractionmodel deploymentinference efficiencyQwen models
0
0 comments X

The pith

Qwen 3.5 4B reaches 96.60% F1 on merchant extraction, within 0.35 points of the 8B baseline using half the parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models smaller than the current 8B production system can match its accuracy when fine-tuned with LoRA for pulling merchant details out of noisy, abbreviated bank transaction strings. It evaluates 24 variants across Gemma, Qwen, Aya, and LLaMA families on accuracy, speed, training cost, and real serving behavior. A sympathetic reader would care because 8B models create high memory, latency, and cost burdens in financial transaction processing, so smaller alternatives that stay within a few F1 points could make large-scale deployment practical. The results indicate the 4B Qwen model comes close enough to serve as a drop-in replacement under many conditions, and even the 0.8B variant remains competitive.

Core claim

LoRA fine-tuning lets the Qwen 3.5 4B model reach 96.60% F1 with JSON-only prompting, compared with 96.95% for the LLaMA 3.1-8B baseline while using roughly half the parameters. The 0.8B Qwen model achieves 94.75% F1. Chain-of-thought fine-tuning raises scores by 0.3-1.8 points on most models but is not needed for the top Qwen performer. Think and Nothink training templates produce nearly identical results. When all 14 sub-8B models are deployed as serving endpoints, benchmark F1 transfers with an average change of only 0.8 points.

What carries the argument

LoRA fine-tuning of decoder-only models to emit structured JSON merchant fields directly from abbreviated transaction text.

If this is right

  • Reproducing the 8B fine-tune at LoRA rank 8 yields 96.75% F1, only 0.20 points below the rank-32 baseline.
  • The 0.8B Qwen model matches the accuracy of models 2.5-4 times larger while offering a better latency trade-off.
  • Chain-of-thought fine-tuning improves most models but Qwen 3.5 4B performs best with direct JSON prompting.
  • Qwen 3.5 Think and Nothink templates produce F1 differences below 0.004, showing explicit reasoning supervision is unnecessary.
  • All deployed sub-8B models except Aya 3.35B maintain performance with an average 0.8-point change under serving conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same size-reduction pattern may apply to other fixed-output structured extraction tasks in finance or adjacent domains.
  • Teams facing similar accuracy-versus-cost constraints could begin testing at the 4B scale rather than defaulting to 8B or larger.
  • The negligible benefit from reasoning supervision suggests that for highly constrained output formats, simpler training objectives suffice.

Load-bearing premise

The dataset used to compute the reported F1 scores has the same distribution as live production transaction traffic.

What would settle it

A production deployment run on fresh live traffic that shows an average F1 drop larger than 2 points compared with the benchmark would falsify the transfer claim.

Figures

Figures reproduced from arXiv: 2606.08051 by Benjamin Barrett, Donghao Huang, Tomas Drietomsky, Zhaoxia Wang.

Figure 3
Figure 3. Figure 3: Free-Thinking vs. JSON-Only. FT outperforms JO for most models; [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 2
Figure 2. Figure 2: F1 progression during training (FT prompt). Dashed: LLaMA 8B [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Per-field F1 across models (FT prompt, best checkpoint). The LLaMA 8B baseline is split into rank-32 (r32) and our rank-8 (r8) re-fine-tune. Fields [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy–latency Pareto analysis. Pareto-optimal configurations (solid [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
read the original abstract

Financial transaction processing requires extracting structured merchant information from noisy, abbreviated bank transaction strings at scale. Our current production system, a LoRA-fine-tuned LLaMA 3.1-8B, achieves 96.95% F1 on this task, but deploying 8-billion-parameter models imposes prohibitive memory, latency, and cost constraints. To identify more efficient alternatives, we conduct a deployment-focused study of 24 model variants spanning four model families: Gemma 3 (270M, 1B, 4B), Qwen 3.5 (0.8B, 2B, 4B), Aya (3.35B), and LLaMA 3.1-8B, systematically evaluating accuracy, inference throughput, training cost, and hardware behavior to assess production suitability. Our findings show that: (1) reproducing the LLaMA 3.1-8B fine-tune with a LoRA rank of 8 achieves 96.75% F1, only 0.20 points below the rank-32 baseline; (2) Qwen 3.5 4B with JSON-only prompting reaches 96.60% F1, within 0.35 points of the 8B baseline while using roughly half the parameters; (3) the 0.8B Qwen 3.5 model achieves 94.75% F1, matching models 2.5-4x larger and offering an attractive latency-accuracy trade-off; (4) chain-of-thought fine-tuning generally improves F1 by 0.3-1.8 points across most models, although Qwen 3.5 4B performs best with direct JSON-only prompting; and (5) Qwen 3.5 Think and Nothink training templates produce nearly identical results (F1 differences <0.004), indicating that explicit reasoning supervision is unnecessary for structured extraction tasks. We further deploy all 14 fine-tuned sub-8B models as Databricks Model Serving endpoints and observe that benchmark performance transfers reliably to production, with an average F1 change of only 0.8 points. Aya 3.35B, based on the Cohere2 architecture, is the sole exception, exhibiting a 3-5 point decline under serving conditions. Based on these results, we provide deployment recommendations across accuracy and latency requirements, ...

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript conducts a deployment-oriented empirical study of LoRA fine-tuning across 24 variants of models from 270M to 8B parameters (Gemma 3, Qwen 3.5, Aya, LLaMA 3.1) for structured merchant information extraction from noisy financial transaction strings. It reports F1 scores for different LoRA ranks, prompting strategies (JSON-only vs. chain-of-thought), and model sizes, shows that Qwen 3.5 4B reaches 96.60% F1 (vs. 96.95% for the 8B baseline), notes that explicit reasoning supervision adds little value, and validates transfer to production Databricks serving with an average 0.8-point F1 drop (Aya as outlier).

Significance. If the measurements hold, the work supplies concrete, production-validated guidance on accuracy-latency-cost trade-offs for financial NLP, demonstrating that sub-8B models can deliver near-baseline performance with lower resource demands; the inclusion of serving experiments that quantify distribution shift is a strength.

minor comments (3)
  1. [Methods] Abstract and methods: full training hyperparameters (learning rate, epochs, batch size) and exact train/validation/test split sizes or construction details are not provided; these should be added to support reproducibility of the reported F1 numbers.
  2. [Deployment experiments] Results: while aggregate serving F1 change is stated as 0.8 points, per-model serving F1 scores, throughput numbers, and hardware utilization metrics are not tabulated; a supplementary table would clarify the latency-accuracy recommendations.
  3. [Prompting variants] The statement that Qwen 3.5 Think and Nothink templates produce F1 differences <0.004 is given without the underlying per-model numbers or statistical test; including these values would strengthen the claim that explicit reasoning supervision is unnecessary.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation to accept. The provided summary accurately reflects the manuscript's scope, methodology, and key empirical findings on model-size trade-offs for structured extraction in financial transactions.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper contains no derivation chain, equations, or first-principles claims. All reported results consist of direct empirical F1 measurements obtained by fine-tuning models on training data and evaluating on held-out test data, followed by production serving experiments that measure actual performance drops. No parameters are fitted and then renamed as predictions, no self-citations supply load-bearing uniqueness theorems, and no ansatzes or renamings of known results are presented as novel derivations. The central claims (e.g., Qwen 3.5 4B reaching 96.60% F1) rest entirely on concrete experimental outcomes that remain falsifiable by re-running the same training and evaluation protocol.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The work is an empirical comparison study; claims rest on the standard assumption that LoRA fine-tuning transfers to this extraction task and that the chosen test set is representative.

free parameters (1)
  • LoRA rank = 8
    Set to 8 for the main LLaMA reproduction and other runs; chosen by hand to balance performance and efficiency.
axioms (1)
  • domain assumption Pre-trained LLMs from the tested families can be adapted via LoRA for structured JSON extraction without task-specific architectural changes.
    Invoked when the authors apply the same fine-tuning recipe across Gemma, Qwen, Aya, and LLaMA.

pith-pipeline@v0.9.1-grok · 6005 in / 1459 out tokens · 37666 ms · 2026-06-27T19:40:47.702421+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TIGER: Inverting Transformer Gradients via Embedding-Subspace Distance Optimization

    cs.CR 2026-06 unverdicted novelty 7.0

    TIGER turns the low-rank attention gradient subspace into a differentiable objective for continuous embedding optimization, improving reconstruction quality and robustness over prior discrete token tests especially un...

Reference graph

Works this paper leans on

26 extracted references · 11 linked inside Pith · cited by 1 Pith paper

  1. [1]

    Financial natural language processing: A survey,

    Z. Zhanget al., “Financial natural language processing: A survey,”ACM Computing Surveys, 2023

  2. [2]

    The Llama 3 herd of models,

    A. Dubey, A. Jauhri, A. Pandeyet al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  3. [3]

    Energy and policy consid- erations for deep learning in NLP,

    E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy consid- erations for deep learning in NLP,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3645–3650

  4. [4]

    Gemma 3 technical report,

    Gemma Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivièreet al., “Gemma 3 technical report,”arXiv preprint arXiv:2503.19786, 2025. [Online]. Available: https://arxiv.org/abs/2503.19786

  5. [5]

    Qwen3.5: Towards native multimodal agents,

    Qwen Team, “Qwen3.5: Towards native multimodal agents,” Qwen blog, February 2026. [Online]. Available: https://qwen.ai/blog?id=qwen3.5

  6. [6]

    Tiny Aya: Bridging scale and multilingual depth,

    A. R. Salamanca, D. Abagyan, D. D’souza, A. Khairi, D. Mora, S. Dash, V . Aryabumi, S. Rajaee, M. Mofakhami, A. Sahuet al., “Tiny Aya: Bridging scale and multilingual depth,”arXiv preprint arXiv:2603.11510, 2026. [Online]. Available: https://arxiv.org/abs/2603. 11510

  7. [7]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://arxiv.org/abs/2106.09685

  8. [8]

    QLoRA: Efficient finetuning of quantized language models,

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient finetuning of quantized language models,”Advances in Neural Information Processing Systems, vol. 36, 2023

  9. [9]

    AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning,

    Q. Zhang, M. Chen, A. Bukharin, P. He, Y . Cheng, W. Chen, and T. Zhao, “AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning,” inInternational Conference on Learning Representations,

  10. [10]

    Available: https://arxiv.org/abs/2303.10512

    [Online]. Available: https://arxiv.org/abs/2303.10512

  11. [11]

    LlamaFactory: Unified efficient fine-tuning of 100+ language models,

    Y . Zheng, R. Zhang, J. Zhang, Y . Ye, Z. Luo, Z. Ma, and Y . Ma, “LlamaFactory: Unified efficient fine-tuning of 100+ language models,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2024. [Online]. Available: https://arxiv.org/abs/2403.13372

  12. [12]

    Optimizing Chinese-to-English translation using large language models,

    D. Huang and Z. Wang, “Optimizing Chinese-to-English translation using large language models,” inProc. 2025 IEEE Symposium on Computational Intelligence in Natural Language Processing and Social Media (CI-NLPSoMe Companion). Trondheim, Norway: IEEE, 2025

  13. [13]

    Logical reasoning with LLMs via few-shot prompting and fine- tuning: A case study on turtle soup puzzles,

    ——, “Logical reasoning with LLMs via few-shot prompting and fine- tuning: A case study on turtle soup puzzles,” inProc. 2025 IEEE Sym- posium on Computational Intelligence in Natural Language Processing and Social Media (CI-NLPSoMe Companion). Trondheim, Norway: IEEE, 2025

  14. [14]

    Phi-3 technical report: A highly capable language model locally on your phone,

    M. Abdin, S. A. Jacobs, A. A. Awanet al., “Phi-3 technical report: A highly capable language model locally on your phone,”arXiv preprint arXiv:2404.14219, 2024

  15. [15]

    Mamba: Linear-time sequence modeling with selective state spaces,

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,”arXiv preprint arXiv:2312.00752, 2023

  16. [16]

    Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality,

    T. Dao and A. Gu, “Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality,”arXiv preprint arXiv:2405.21060, 2024

  17. [17]

    Beyond task success: Measur- ing workflow fidelity in LLM-based agentic payment systems,

    D. Huang, J. K. Chua, and Z. Wang, “Beyond task success: Measur- ing workflow fidelity in LLM-based agentic payment systems,”arXiv preprint arXiv:2605.06457, 2026

  18. [18]

    Neural architectures for named entity recognition,

    G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” inProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 260–270

  19. [19]

    End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF,

    X. Ma and E. Hovy, “End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF,” inProceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 1064–1074

  20. [20]

    GPT-NER: Named entity recognition via large language models,

    S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, and G. Wang, “GPT-NER: Named entity recognition via large language models,”arXiv preprint arXiv:2304.10428, 2023

  21. [21]

    Evaluating ChatGPT’s information extraction capabilities: An assess- ment of performance, explainability, calibration, and faithfulness,

    B. Li, G. Fang, Y . Yang, Q. Wang, W. Ye, W. Zhao, and S. Zhang, “Evaluating ChatGPT’s information extraction capabilities: An assess- ment of performance, explainability, calibration, and faithfulness,”arXiv preprint arXiv:2304.11633, 2023

  22. [22]

    When FLUE meets FLANG: Benchmarks and large pretrained language model for financial domain,

    R. S. Shah, K. Chawla, D. Eidnaniet al., “When FLUE meets FLANG: Benchmarks and large pretrained language model for financial domain,” arXiv preprint arXiv:2211.00083, 2022

  23. [23]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,”Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022

  24. [24]

    Multimodal chain-of-thought reasoning in language models,

    Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,”arXiv preprint arXiv:2302.00923, 2023

  25. [25]

    From explicit CoT to implicit CoT: Learning to internalize CoT step by step,

    Y . Deng, Y . Choi, and S. Shieber, “From explicit CoT to implicit CoT: Learning to internalize CoT step by step,”arXiv preprint arXiv:2405.14838, 2024. [Online]. Available: https://arxiv.org/abs/2405. 14838

  26. [26]

    HuggingFace’s Trans- formers: State-of-the-art natural language processing,

    T. Wolf, L. Debut, V . Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowiczet al., “HuggingFace’s Trans- formers: State-of-the-art natural language processing,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45