
arxiv: 2606.00881 · v1 · pith:77WDUQQGnew · submitted 2026-05-30 · 💻 cs.CL

Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations

Pith reviewed 2026-06-28 18:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords chunking methods · retrieval-augmented generation · RAG systems · effectiveness evaluation · computational cost · LLM performance · semantic chunking

The pith

Chunking in RAG systems introduces measurable trade-offs between effectiveness and computational cost, along with limitations that vary by method and data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents what its authors describe as the first systematic comparison of a wide range of chunking techniques inside retrieval-augmented generation pipelines. It tracks how each technique changes retrieval quality and final answer accuracy while also recording the computing resources needed for indexing and search. The evaluation finds that methods designed for narrow cases rarely maintain their advantages when tested on other data, and that even standard approaches carry previously under-examined drawbacks. A reader would care because chunking is an early step that affects both reliability and expense in any production RAG deployment.

Core claim

To the best of our knowledge, this study is the first to systematically evaluate the effectiveness of a wide range of chunking methods and emphasize the underlying challenges of chunking strategies in RAG systems. While chunking is commonly treated as a simple preprocessing step, we show that it introduces a range of impactful and often overlooked issues.

What carries the argument

Comparative evaluation of fixed-size, semantic, and other chunking methods measured jointly on retrieval-generation quality and computational cost.
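
To make the joint quality-and-cost framing concrete, here is a minimal sketch of how two chunkers might be profiled side by side. Everything in it is an illustrative assumption rather than the paper's code: the function names `fixed_size_chunks`, `sentence_chunks`, and `profile`, the character-based window sizes, and the naive sentence split on ". " are all hypothetical.

```python
import time


def fixed_size_chunks(text, size=200, overlap=50):
    """Split text into character windows of `size`, each sharing `overlap` characters with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def sentence_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most `max_chars` (naive split on '. ')."""
    chunks, current = [], ""
    for sentence in text.split(". "):
        candidate = current + ". " + sentence if current else sentence
        if len(candidate) > max_chars and current:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def profile(chunker, text):
    """Return (number of chunks, elapsed seconds) for one chunker on one text."""
    start = time.perf_counter()
    chunks = chunker(text)
    return len(chunks), time.perf_counter() - start
```

Calling `profile(fixed_size_chunks, document)` and `profile(sentence_chunks, document)` on the same text gives a chunk count and a wall-clock time per method. A real semantic chunker would add embedding calls, which is where the cost gap discussed in the paper would be expected to appear.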

If this is right

  • Chunking methods exhibit distinct performance profiles rather than one method dominating all settings.
  • Many specialized chunking proposals show limited gains when tested outside their original narrow use cases.
  • Computational costs differ substantially across methods, affecting practical scalability.
  • Treating chunking as neutral preprocessing underestimates its effect on overall RAG reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams building RAG applications would benefit from running short benchmarks of several chunking options on their own data.
  • Future systems could incorporate lightweight selection logic that picks a chunking strategy based on detected document characteristics.
  • The observed limitations point toward possible value in hybrid chunking that switches rules within a single document collection.
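
The first suggestion above, a short benchmark on one's own data, could look roughly like the sketch below. It is a toy under stated assumptions: retrieval is plain word overlap rather than an embedding model, a "hit" means the answer string appears in a top-k chunk, and the names `tokens`, `hit_rate`, and `compare` are hypothetical helpers, not part of any library or of the paper.

```python
import re


def tokens(text):
    """Lowercase word set with punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def hit_rate(chunker, document, qa_pairs, k=1):
    """Fraction of questions whose answer string appears in one of the top-k chunks."""
    if not qa_pairs:
        return 0.0
    chunks = chunker(document)
    hits = 0
    for question, answer in qa_pairs:
        query = tokens(question)
        # Rank chunks by how many query words they share (stand-in for a real retriever).
        ranked = sorted(chunks, key=lambda c: len(query & tokens(c)), reverse=True)
        if any(answer.lower() in chunk.lower() for chunk in ranked[:k]):
            hits += 1
    return hits / len(qa_pairs)


def compare(chunkers, document, qa_pairs, k=1):
    """Map each chunker name to its hit rate on the same document and questions."""
    return {name: hit_rate(fn, document, qa_pairs, k) for name, fn in chunkers.items()}
```

Passing a dictionary such as `{"sentence": split_by_sentence, "fixed": split_by_size}` to `compare` returns one score per method, which is enough to see whether the ranking on local data matches the ranking reported elsewhere.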

Load-bearing premise

The chosen set of chunking methods, datasets, and evaluation metrics adequately represents behavior across the broader range of real-world RAG applications and data types.

What would settle it

A follow-up experiment on a new collection of documents and queries that produces consistent reversals in the relative ranking of the same chunking methods on the original quality and cost metrics.
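
One way to quantify such a ranking reversal is a rank correlation between per-method scores on the original and the new collection. The sketch below computes Kendall's tau-a in plain Python; the function name `kendall_tau` and the idea of storing scores in dictionaries keyed by method name are assumptions made for illustration, and the paper does not state that it uses this statistic.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a over methods present in both dicts; tied pairs count as neither."""
    methods = sorted(set(scores_a) & set(scores_b))
    pairs = len(methods) * (len(methods) - 1) // 2
    if pairs == 0:
        return 0.0
    concordant = discordant = 0
    for m1, m2 in combinations(methods, 2):
        # Positive product: both collections order this pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / pairs
```

A value near 1 means the new collection preserves the original ordering of methods, while a value near -1 indicates the consistent reversal described above.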

Figures

Figures reproduced from arXiv: 2606.00881 by Julianna Godziszewska (1), Karol Kunicki (1), Konrad Wojtasik (1), Maciej Piasecki (1), Mateusz Śmigielski (1), Mateusz Zbrocki (1), Michał Bernacki-Janson (1), Michał Rajkowski (1) ((1) Department of Artificial Intelligence, Faculty of Information and Communication Technology, Wrocław University of Science and Technology, Wrocław 50-370, Poland).

Figure 1: Number of chunks per chunking method on SQuAD dataset.
Original abstract

Retrieval-Augmented Generation (RAG) has demonstrated significant capabilities in enhancing the performance of Large Language Models (LLMs). One of the key tasks in RAG systems is the chunking process. Traditionally, fixed-size chunking and semantic chunking have been the standard approaches. However, interest in chunking strategies has been increasing, leading to a growing number of proposed methods that often claim improved performance over these conventional techniques. Many of these approaches are tailored to specific use cases and data types, with limited evidence of their effectiveness across diverse scenarios. As a result, it remains challenging to directly compare different techniques and assess their relative strengths. To the best of our knowledge, this study is the first to systematically evaluate the effectiveness of a wide range of chunking methods and emphasize the underlying challenges of chunking strategies in RAG systems. While chunking is commonly treated as a simple preprocessing step, we show that it introduces a range of impactful and often overlooked issues.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to be the first systematic evaluation of a wide range of chunking methods (including fixed-size, semantic, and others) in Retrieval-Augmented Generation (RAG) systems. It compares their effectiveness against computational costs, highlights limitations, and argues that chunking is not a simple preprocessing step but introduces impactful overlooked issues across diverse scenarios.

Significance. If the empirical results hold and the evaluation is shown to be representative, the work could inform RAG practitioners on chunking trade-offs. The paper's value would lie in its benchmarking scope, but this is contingent on demonstrating that the chosen methods, datasets, and metrics support general conclusions about challenges rather than being convenience samples.

major comments (1)
  1. [Abstract] The central claim that this is 'the first to systematically evaluate' a wide range of chunking methods, and that chunking 'introduces a range of impactful and often overlooked issues', rests on the representativeness of the evaluated chunking methods, datasets, and metrics. Without explicit justification, coverage analysis, or discussion of why the finite set (fixed-size, semantic, etc.) and standard benchmarks generalize to diverse real-world scenarios and data types, the identified limitations cannot support broad conclusions about challenges in RAG chunking.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the feedback on the abstract claims. We agree that stronger justification for the evaluation's scope is needed to support the conclusions and will revise the manuscript to address this.

Point-by-point responses
  1. Referee: [Abstract] The central claim that this is 'the first to systematically evaluate' a wide range of chunking methods, and that chunking 'introduces a range of impactful and often overlooked issues', rests on the representativeness of the evaluated chunking methods, datasets, and metrics. Without explicit justification, coverage analysis, or discussion of why the finite set (fixed-size, semantic, etc.) and standard benchmarks generalize to diverse real-world scenarios and data types, the identified limitations cannot support broad conclusions about challenges in RAG chunking.

    Authors: We agree this point requires addressing. In the revision we will add a new subsection (likely in Section 3 or 4) providing explicit justification for the selected chunking methods, noting that they encompass the dominant categories in the literature (fixed-size as baseline, semantic, and additional variants proposed in recent work). We will include a coverage analysis mapping the methods to key dimensions such as size-based vs. content-aware. For datasets and metrics we will explain the choice of standard RAG benchmarks to enable direct comparison with prior work, while adding an expanded limitations paragraph that explicitly discusses reduced generalizability to non-benchmark data types (e.g., highly specialized domains or multimodal content) and states that observed issues are demonstrated within the evaluated scope rather than claimed as universal. These changes will allow the abstract claims to be retained in tempered form.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking study with no derivations

Full rationale

The paper is a pure empirical benchmarking study that evaluates a range of chunking methods on RAG performance using standard datasets and metrics. It contains no equations, derivations, fitted parameters, predictions, or uniqueness theorems. The central claim of providing the first systematic evaluation rests on the described experimental setup rather than any self-referential reduction or self-citation chain. All load-bearing elements are external measurements and comparisons, making the work self-contained against external benchmarks with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or new theoretical constructs are described in the abstract; the work is an empirical comparison study.


