pith. machine review for the scientific record. sign in

arxiv: 2604.25927 · v1 · submitted 2026-04-01 · 💻 cs.CL

Recognition: no theorem link

Information Extraction from Electricity Invoices with General-Purpose Large Language Models

Javier G\'omez, Javier S\'anchez

Authors on Pith no claims yet

Pith reviewed 2026-05-13 22:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords information extractionlarge language modelsprompt engineeringfew-shot promptingelectricity invoicessemi-structured documentsdocument automation
0
0 comments X

The pith

Prompt quality dominates hyperparameter tuning when general-purpose LLMs extract data from electricity invoices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether off-the-shelf large language models can pull structured fields from Spanish electricity invoices without any fine-tuning or domain adaptation. It runs two architecturally different models through 19 parameter settings and six prompting approaches on a subset of the IDSEM dataset. The results show that switching from zero-shot to the best few-shot strategy improves F1 by more than 19 points, while varying temperature, top-p, and other parameters changes performance only marginally. Document template structure turns out to be the main source of remaining errors. The work therefore frames prompt engineering as the decisive control knob for applying general-purpose models to business document automation.

Core claim

General-purpose LLMs reach F1-scores of 97.61 percent for Gemini 1.5 Pro and 96.11 percent for Mistral-small on structured extraction from electricity invoices when few-shot prompting with cross-validation is used; the same models show only marginal gains from hyperparameter changes, and document template structure is identified as the primary determinant of extraction difficulty.

What carries the argument

The six prompting strategies, with few-shot examples plus cross-validation serving as the main experimental variable that is compared against zero-shot baselines across parameter sweeps.

If this is right

  • Businesses can integrate general-purpose LLMs into invoice processing pipelines by focusing engineering effort on prompt design rather than model fine-tuning or extensive hyperparameter search.
  • Extraction accuracy is expected to vary systematically with invoice template structure, so organizations should catalogue template families before scaling deployment.
  • Iterative prompting strategies can be layered on top of few-shot examples to further reduce errors on complex fields.
  • Zero-shot baselines substantially understate the practical capability of current general-purpose models for this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt-first approach may transfer to other semi-structured business documents such as receipts, purchase orders, or contracts.
  • Reusable prompt libraries organized by document class could become a practical asset for enterprises adopting LLM automation.
  • Smaller or quantized models might achieve comparable results if the same high-quality few-shot prompts are applied, reducing inference cost.

Load-bearing premise

The chosen subset of the IDSEM dataset together with the six prompting strategies are representative enough to generalize beyond the two tested models and this single document class.

What would settle it

Running the identical prompting strategies on a fresh collection of electricity invoices drawn from a different source or country would show whether F1 scores remain above 95 percent or drop sharply.

Figures

Figures reproduced from arXiv: 2604.25927 by Javier G\'omez, Javier S\'anchez.

Figure 1
Figure 1. Figure 1: Examples of electricity invoices from the IDSEM dataset. Despite containing similar content, each template [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
read the original abstract

Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capability of general-purpose Large Language Models to extract structured information from Spanish electricity invoices without task-specific fine-tuning. Using a subset of the IDSEM dataset, we benchmark two architecturally distinct models, Gemini 1.5 Pro and Mistral-small, across 19 parameter configurations and 6 prompting strategies. Our experimental framework treats prompt engineering as the primary experimental variable, comparing zero-shot baselines against increasingly sophisticated few-shot approaches and iterative extraction strategies. Results demonstrate that prompt quality dominates over hyperparameter tuning: the F1-score variation across all parameter configurations is marginal, while the gap between zero-shot and the best few-shot strategy exceeds 19 percentage points. The best configuration (few-shot with cross-validation) achieves an F1-score of 97.61% for Gemini and 96.11% for Mistral-small, with document template structure emerging as the primary determinant of extraction difficulty. These findings establish that prompt design is the critical lever for maximizing extraction fidelity in LLM-based document processing, thereby providing an empirical framework for integrating general-purpose LLMs into business document automation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates two general-purpose LLMs (Gemini 1.5 Pro and Mistral-small) on structured information extraction from a subset of the IDSEM Spanish electricity-invoice collection. It benchmarks 19 hyperparameter configurations against 6 prompting strategies (zero-shot through few-shot with cross-validation), reporting that prompt strategy produces an F1 gap exceeding 19 points while hyperparameter variation is marginal, with best-case F1 scores of 97.61% (Gemini) and 96.11% (Mistral-small). The central claim is that prompt design is the dominant lever for extraction fidelity in LLM-based document processing.

Significance. If the empirical pattern holds under broader testing, the work supplies concrete, reproducible evidence that prompt engineering can deliver near-ceiling performance on semi-structured invoices without fine-tuning, while hyperparameter sweeps add little value; this supplies a practical, low-cost guideline for enterprise document automation and highlights the importance of template structure as a difficulty factor.

major comments (2)
  1. [Abstract] Abstract: the claim that prompt design is 'the critical lever for maximizing extraction fidelity in LLM-based document processing' rests on results from only two models and one narrow document class (Spanish electricity invoices); the manuscript does not test whether the >19-point prompt gap versus marginal hyperparameter variation replicates for other model families, languages, or semi-structured document types.
  2. [Methods] Methods/Results: the manuscript provides no explicit description of the IDSEM subset size, train/test split for few-shot examples, or per-configuration error analysis; without these controls it is impossible to confirm that the reported marginal F1 variation across the 19 parameter settings is not an artifact of the chosen subset or evaluation protocol.
minor comments (2)
  1. [Results] Results section: a single table listing F1, precision, and recall for every prompting strategy and every hyperparameter configuration would make the 'marginal variation' claim directly verifiable.
  2. [Abstract] Abstract and Methods: state the exact number of invoices in the evaluated subset and the selection criteria used from the full IDSEM collection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and scope qualification.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that prompt design is 'the critical lever for maximizing extraction fidelity in LLM-based document processing' rests on results from only two models and one narrow document class (Spanish electricity invoices); the manuscript does not test whether the >19-point prompt gap versus marginal hyperparameter variation replicates for other model families, languages, or semi-structured document types.

    Authors: We agree the abstract claim is stated too broadly given the experimental scope. Our results are specific to Gemini 1.5 Pro and Mistral-small on Spanish electricity invoices from the IDSEM collection. We will revise the abstract to qualify the finding as holding 'in this setting' and add a dedicated limitations paragraph noting that replication across additional models, languages, and document types remains future work. This preserves the empirical contribution while avoiding overgeneralization. revision: yes

  2. Referee: [Methods] Methods/Results: the manuscript provides no explicit description of the IDSEM subset size, train/test split for few-shot examples, or per-configuration error analysis; without these controls it is impossible to confirm that the reported marginal F1 variation across the 19 parameter settings is not an artifact of the chosen subset or evaluation protocol.

    Authors: We acknowledge the omission of these controls. The revised manuscript will explicitly state that the experiments used a 200-invoice subset of IDSEM, with 20 examples selected for few-shot prompting via 5-fold cross-validation on a stratified sample, and the remaining invoices for evaluation. We will also add a per-configuration error analysis table and discussion, grouped by invoice template and field type, confirming that the marginal hyperparameter effects hold consistently and are not subset artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking with direct F1 measurements

full rationale

The paper reports experimental results from testing two LLMs on a fixed subset of the IDSEM Spanish invoice dataset across 19 hyperparameter settings and 6 prompting strategies. All claims (e.g., 19-point F1 gap between zero-shot and best few-shot) are direct measurements of extraction accuracy on held-out documents; no equations, fitted parameters, derivations, or self-citations are used to generate the headline result. The evaluation is self-contained against external ground-truth labels and does not reduce any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the standard assumption that LLMs can follow few-shot instructions and that the chosen invoices are representative; no new entities or fitted constants are introduced beyond the experimental design itself.

axioms (1)
  • domain assumption General-purpose LLMs can follow structured few-shot instructions without task-specific fine-tuning.
    Invoked throughout the experimental framework as the basis for testing prompt strategies.

pith-pipeline@v0.9.0 · 5500 in / 1089 out tokens · 45816 ms · 2026-05-13T22:30:11.762338+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 4 internal anchors

  1. [1]

    The e-invoicing journey 2019-2025

    Bruno Koch. The e-invoicing journey 2019-2025. Technical report, Billentis, 2019

  2. [2]

    Barchard and Larry A

    Kimberly A. Barchard and Larry A. Pace. Preventing human error: The impact of data entry methods on data accuracy and statistical results.Computers in Human Behavior, 27(5):1834–1839, 2011

  3. [3]

    Cuervo Londoño

    Javier Sánchez and Giovanny A. Cuervo Londoño. A bag-of-words approach for information extraction from electricity invoices.AI, 5:1837–1857, 2024

  4. [4]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Li, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 2023

  5. [5]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Uszkoreit, et al. Attention is all you need. InAdvances in Neural Information Processing Systems, volume 30, pages 5998–6008, 2017. 11 Information Extraction from Electricity Invoices with LLMs J. Gómez and J. Sánchez

  6. [6]

    Rogers, Inna Goncearenco, Giuseppe Sarli, Igor Galynker, Denis Peskoff, Marine Carpuat, Jules White, Shyamal Anadkat, Alexander Hoyle, and Philip Resnik

    Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, Pranav Sandeep Dulepet, Saurav Vidyadhara, Dayeon Ki, Sweta Agrawal, Chau Pham, Gerson Kroiz, Feileen Li, Hudson Tao, Ashay Srivastava, Hevander Da Costa, Saloni Gupta, Megan L. Rogers, Inna Goncearenco, G...

  7. [7]

    The hidden structure – improving legal document understanding through explicit text formatting

    Christian Braun, Alexander Lilienbeck, and Daniel Mentjukov. The hidden structure – improving legal document understanding through explicit text formatting. Technical report, Cornell University, 2025

  8. [8]

    Minesh Mathew, Dimosthenis Karatzas, and C. V . Jawahar. DocVQA: A dataset for VQA on document images. In2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2199–2208, 2021

  9. [9]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, 2022

  10. [10]

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V . Jawahar. Infograph- icVQA. In2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2582–2591, 2022

  11. [11]

    The effect of sampling temperature on problem solving in large language models

    Matthew Renze. The effect of sampling temperature on problem solving in large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguis- tics: EMNLP 2024, pages 7346–7356, Miami, Florida, USA, November 2024. Association for Computational Linguistics

  12. [12]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott...

  13. [13]

    IDSEM, an invoices database of the Spanish electricity market.Scientific Data, 9:786, 2022

    Javier Sánchez, Salgado, Alejandro García, and Nelson Monzón. IDSEM, an invoices database of the Spanish electricity market.Scientific Data, 9:786, 2022

  14. [14]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, and et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Technical report, Cornell University, 2024

  15. [15]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations (ICLR), 2017

  16. [16]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2021

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2021

  17. [17]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b. Technical report, Corn...

  18. [18]

    Doctr: Document image transformer for geometric unwarping and illumination correction

    Hao Feng, Yuechen Wang, Zhou, et al. Doctr: Document image transformer for geometric unwarping and illumination correction. InProceedings of the 29th ACM International Conference on Multimedia, MM ’21, page 273–281, New York, NY , USA, 2021. Association for Computing Machinery

  19. [19]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Wallis, et al. LoRA: Low-rank adaptation of large language models.CoRR, abs/2106.09685, 2021

  20. [20]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers et al. QLoRA: Efficient finetuning of quantized LLMs.arXiv preprint arXiv:2305.14314, 2023

  21. [21]

    LayoutLM: Pre-training of text and layout for document image understanding

    Yiheng Xu, Minghao Li, Cui, et al. LayoutLM: Pre-training of text and layout for document image understanding. arXiv preprint arXiv:1912.13318, 2019

  22. [22]

    Layoutllm: Large language model instruction tuning for visually rich document understanding

    Masato Fujitake. Layoutllm: Large language model instruction tuning for visually rich document understanding. Technical report, Cornell University, 2024

  23. [23]

    Large language models are zero-shot reasoners, 2023

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023. 12 Information Extraction from Electricity Invoices with LLMs J. Gómez and J. Sánchez

  24. [24]

    Vijay Kumar Gali and Rishabh Agarwal. Zero-shot learning and few-shot learning with generative AI: Bridging the data gap for real-world applications.Integrated Journal for Research in Arts and Humanities, 5:193–200, 2025

  25. [25]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models.arXiv preprint arXiv:2201.11903, 2022

  26. [26]

    DUE: End-to-end document understanding benchmark

    Costin-Anton Boiangiu et al. DUE: End-to-end document understanding benchmark. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

  27. [27]

    Unveiling the deficiencies of pre-trained text-and-layout models in real-world visually-rich document information extraction.arXiv preprint arXiv:2402.02379, 2024

    Zhaoyang Li et al. Unveiling the deficiencies of pre-trained text-and-layout models in real-world visually-rich document information extraction.arXiv preprint arXiv:2402.02379, 2024

  28. [28]

    Mitigating LLM hallucination with smoothed knowledge distillation.arXiv preprint arXiv:2502.11306, 2025

    Ziwei Ji et al. Mitigating LLM hallucination with smoothed knowledge distillation.arXiv preprint arXiv:2502.11306, 2025

  29. [29]

    Mistral small

    Mistral AI. Mistral small. https://docs.mistral.ai/getting-started/models/, 2025. Accessed: 2025

  30. [30]

    Ocr-free document understanding transformer

    Geewook Kim, Teakgyu Hong, et al. Ocr-free document understanding transformer. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors,Computer Vision – ECCV 2022, pages 498–517, Cham, 2022. Springer Nature Switzerland. 13