Query-Adaptive Semantic Chunking for Retrieval-Augmented Generation: A Dynamic Strategy with Contextual Window Expansion

Mudit Rastogi

arxiv: 2605.22834 · v2 · pith:BVDG3OLGnew · submitted 2026-04-29 · 💻 cs.CL · cs.IR

Pith reviewed 2026-05-25 00:13 UTC · model grok-4.3

keywords: Query-Adaptive Semantic Chunking, RAG, document chunking, retrieval augmented generation, semantic chunking, contextual expansion, query integration

The pith

Query-adaptive semantic chunking raises RAG retrieval F1 to 0.85 by tying segmentation to each query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Query-Adaptive Semantic Chunking to fix a core problem in retrieval-augmented generation: fixed chunk sizes ignore both document meaning and the specific user question, forcing a precision-recall trade-off. It introduces three mechanisms to make chunks query-relevant from the start: scoring sentences against the query embedding, expanding windows around high-scoring seeds for coherence, and aggregating scores at chunk level. On 100 technical documents and 200 queries, this yields an F1 of 0.85, a relative improvement of 18-27 percent over fixed chunking and 8-12 percent over semantic and agentic methods. Ablation tests show each step adds value, and human judges confirm better relevance and coherence.

Core claim

Query-Adaptive Semantic Chunking (QASC) dynamically constructs chunks by first identifying seed sentences via cosine similarity to the query embedding, then expanding contextual windows around those seeds, and finally aggregating chunk-level scores, leading to more relevant and coherent segments than static or query-agnostic methods.
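
The three steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's implementation: `embed` is a toy bag-of-words stand-in for a real sentence encoder, and `seed_threshold`, `window`, and the mean aggregator are hypothetical choices, since the abstract does not specify the paper's values.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def qasc_chunks(sentences, query, seed_threshold=0.3, window=1):
    """Return (score, chunk_text) pairs, best first.

    seed_threshold and window are hypothetical hyperparameters.
    """
    q = embed(query)
    scores = [cosine(embed(s), q) for s in sentences]

    # Step 1: seed sentences score at or above the threshold.
    seeds = [i for i, sc in enumerate(scores) if sc >= seed_threshold]

    # Step 2: expand a fixed window around each seed and merge
    # spans that overlap or touch, so chunks stay contiguous.
    spans = []
    for i in seeds:
        lo, hi = max(0, i - window), min(len(sentences) - 1, i + window)
        if spans and lo <= spans[-1][1] + 1:
            spans[-1][1] = max(spans[-1][1], hi)
        else:
            spans.append([lo, hi])

    # Step 3: aggregate sentence scores per chunk (mean is an assumption).
    chunks = []
    for lo, hi in spans:
        agg = sum(scores[lo:hi + 1]) / (hi - lo + 1)
        chunks.append((agg, " ".join(sentences[lo:hi + 1])))
    return sorted(chunks, reverse=True)
```

Calling `qasc_chunks(sentences, query)` on a list of sentence strings returns only the regions near query-relevant seeds; a query that matches nothing returns an empty list, which a real pipeline would need to handle with a fallback.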

What carries the argument

Query-Adaptive Semantic Chunking (QASC), a three-step process that integrates the user query into segmentation via embedding similarity, window expansion, and score aggregation.

If this is right

QASC resolves the precision-recall trade-off inherent in fixed chunk sizes by adapting to query intent.
Each of the three components (seed identification, window expansion, and aggregation) contributes measurably to the performance gain.
QASC produces chunks that human evaluators rate as more relevant and coherent than those from recursive, semantic, or agentic baselines.
The 0.85 F1 is reported over 200 queries spanning four query types on technical documents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar query-adaptive strategies could extend to non-technical domains such as legal or medical corpora where query intent varies sharply.
Future systems might combine QASC with learned chunk-size predictors to further reduce reliance on fixed hyperparameters.
Integrating this at indexing time rather than retrieval time would require pre-computing query-like embeddings, opening a new optimization path.

Load-bearing premise

The 100 technical documents and 200 queries used for evaluation represent real-world RAG workloads and the baseline methods were tuned fairly without favoring the new approach.

What would settle it

A replication on a broader set of documents and queries, or with independently tuned baselines, that fails to show at least an 8 percent F1 lift over semantic and agentic methods would undermine the central performance claim.
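
To make the 8 percent threshold concrete, the sketch below treats the reported lifts as relative improvements, (new - baseline) / baseline, and back-computes the baseline F1 each lift would imply. The function names are illustrative, and the implied baselines are back-of-envelope inferences from the abstract's figures, not numbers the paper reports.

```python
def relative_lift(new_f1, baseline_f1):
    """Relative improvement of new_f1 over baseline_f1, as a fraction."""
    return (new_f1 - baseline_f1) / baseline_f1


def implied_baseline(new_f1, lift):
    """Baseline F1 for which new_f1 would be the given relative lift."""
    return new_f1 / (1.0 + lift)


qasc_f1 = 0.85  # reported in the abstract
for lift in (0.08, 0.12, 0.18, 0.27):
    base = implied_baseline(qasc_f1, lift)
    print(f"lift {lift:.0%} -> implied baseline F1 {base:.3f}")
```

Under that reading, the semantic and agentic baselines would sit at roughly 0.76 to 0.79 F1 and fixed chunking at roughly 0.67 to 0.72, so a replication can check `relative_lift(replicated_qasc_f1, replicated_baseline_f1) >= 0.08` directly.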

Original abstract

Retrieval-Augmented Generation (RAG) systems depend critically on document chunking quality for retrieving relevant context. Fixed chunking segments documents into uniform units irrespective of semantics or user intent, producing a precision-recall trade-off unresolvable by tuning chunk size alone. Semantic and agentic methods partially address these limitations but do not integrate user queries at the chunking stage. We present Query-Adaptive Semantic Chunking (QASC), which dynamically constructs chunks by integrating queries into segmentation through three mechanisms: cosine similarity scoring between sentence and query embeddings to identify seed sentences, contextual window expansion around seeds to preserve coherence, and chunk-level score aggregation to ensure holistic relevance. We evaluate QASC on 100 technical documents across 200 queries spanning four types, comparing against fixed chunking at five granularities, recursive splitting, semantic chunking, and agentic chunking. QASC achieves an F1-score of 0.85, a relative improvement of 18-27% over fixed chunking and 8-12% over semantic and agentic alternatives. Ablation studies confirm each component contributes meaningfully. Human evaluation by three annotators (Cohen kappa = 0.82) corroborates that QASC produces more relevant and coherent chunks than existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QASC folds the query into chunking with seed scoring plus expansion, but the F1 gains rest on baselines whose tuning is not described.

The paper's main move is to score sentences against the query embedding for seed selection, expand a contextual window around those seeds, and aggregate chunk-level scores. This query dependence at segmentation time is the actual addition over fixed, recursive, semantic, and agentic baselines. The ablations show each piece contributes, and the human ratings with Cohen's kappa of 0.82 give some independent check on coherence and relevance. Those elements are straightforward and worth noting.

The reported F1 of 0.85 with 18-27% relative lift on 100 technical documents and 200 queries is the headline result. The evaluation covers four query types, which is a reasonable spread for the domain.

The soft spot is the comparison itself. The abstract lists the baselines but supplies no chunk-size grid for the fixed methods, no similarity threshold or embedding model for the semantic one, no prompt or LLM details for the agentic one, and no public code. Without those, it is impossible to know whether the baselines were run at competitive settings. The test collection is also narrow and small, with no error bars or statistical tests mentioned. Both gaps bear on the load-bearing premise about representative data and fairly tuned baselines, and they make the size of the improvement hard to trust at face value.

The work is aimed at engineers who already run RAG pipelines on technical material and want ideas for better chunking. A reader could extract the three-mechanism recipe and test it on their own data, but would have to re-create fair baselines themselves. It deserves peer review because the underlying problem is real and the proposed integration is simple enough to check, but any referee should ask for code, baseline hyperparameters, and a larger or more diverse test set before the gains can be taken as settled.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Query-Adaptive Semantic Chunking (QASC) to address limitations of fixed, semantic, and agentic chunking in RAG by dynamically incorporating the user query: cosine similarity between query and sentence embeddings identifies seed sentences, contextual windows expand around seeds for coherence, and chunk-level aggregation ensures relevance. On 100 technical documents and 200 queries across four types, QASC reports an F1 of 0.85, with relative gains of 18-27% over fixed chunking at five granularities and 8-12% over semantic and agentic baselines; recursive splitting is also compared, but the abstract attaches no specific margin to it. Ablation studies and human evaluation (Cohen's kappa = 0.82 by three annotators) are presented to validate each mechanism.

Significance. If the empirical gains prove robust to properly tuned baselines, QASC would offer a practical advance in query-aware segmentation for RAG, directly addressing the precision-recall trade-off that fixed chunk sizes cannot resolve. The inclusion of component ablations and inter-annotator agreement provides internal checks that strengthen the contribution. The work is empirical rather than theoretical, so its significance hinges on reproducibility and broader applicability beyond the tested technical-document domain.

major comments (2)

[Abstract] Abstract (evaluation results): The central claim of 18-27% and 8-12% relative F1 improvements requires that the four baseline families (fixed at five granularities, recursive, semantic, agentic) were implemented and tuned to their best performance on the identical 100 documents/200 queries. No chunk-size grid, recursive splitter parameters, semantic similarity threshold, agentic prompt/LLM, or tuning procedure is supplied, and no public repository is referenced. This omission is load-bearing because any under-optimized baseline would artifactually inflate the reported deltas.
[Abstract] Abstract (evaluation results): No statistical significance tests, confidence intervals, or error bars are reported for the F1 scores, and the evaluation scope (100 documents, 200 queries, four query types) is stated without evidence that sampling avoided selection effects or that the documents are representative of real-world RAG workloads. These details are necessary to support the cross-method superiority claim.
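
The kind of disclosure the first major comment asks for can be sketched for the fixed-chunking family. This is an illustration only: the grid values are hypothetical (the abstract says five granularities but does not list them), and `evaluate` is a placeholder for whatever retrieval-F1 scorer the study used.

```python
def fixed_chunks(tokens, size, overlap=0):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not tokens:
        return []
    step = size - overlap
    # Stop early enough that the last window always adds new tokens.
    stop = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, stop, step)]


def tune_baseline(tokens, evaluate, sizes=(128, 256, 512), overlaps=(0, 32)):
    """Score every grid point; return the best (score, size, overlap) and the full log.

    `evaluate` is a hypothetical callable mapping a chunk list to a score.
    The default grid is illustrative, not taken from the paper.
    """
    log = []
    for size in sizes:
        for overlap in overlaps:
            if overlap >= size:
                continue  # skip invalid combinations
            score = evaluate(fixed_chunks(tokens, size, overlap))
            log.append((score, size, overlap))
    return max(log), log
```

Publishing the full `log` rather than only the winning setting lets a reader see whether the baseline was compared at its best grid point. Selecting settings on a held-out split instead of the evaluation queries would also avoid an optimistic bias in the tuned scores.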

minor comments (1)

[Abstract] The abstract names 'four query types' but does not enumerate them; adding this enumeration would improve clarity of the experimental design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation methodology. We address each major comment below and will revise the manuscript accordingly to improve reproducibility and statistical rigor.

Referee: [Abstract] Abstract (evaluation results): The central claim of 18-27% and 8-12% relative F1 improvements requires that the four baseline families (fixed at five granularities, recursive, semantic, agentic) were implemented and tuned to their best performance on the identical 100 documents/200 queries. No chunk-size grid, recursive splitter parameters, semantic similarity threshold, agentic prompt/LLM, or tuning procedure is supplied, and no public repository is referenced. This omission is load-bearing because any under-optimized baseline would artifactually inflate the reported deltas.

Authors: We agree that the lack of explicit baseline implementation details weakens the ability to verify the reported gains. The full manuscript describes the baseline approaches at a high level in the experiments section but does not enumerate the exact parameter grids, thresholds, or prompts used. In the revised manuscript we will add a detailed experimental setup subsection listing the chunk-size grid searched for fixed chunking, recursive splitter parameters, semantic similarity thresholds, agentic LLM and prompts, and the tuning procedure applied on the same 100-document/200-query set. We will also commit to releasing the full codebase (including baseline implementations) upon acceptance to enable independent reproduction.

revision: yes

Referee: [Abstract] Abstract (evaluation results): No statistical significance tests, confidence intervals, or error bars are reported for the F1 scores, and the evaluation scope (100 documents, 200 queries, four query types) is stated without evidence that sampling avoided selection effects or that the documents are representative of real-world RAG workloads. These details are necessary to support the cross-method superiority claim.

Authors: We acknowledge that statistical tests, intervals, and sampling transparency are needed to support the superiority claims. The current version reports only point F1 estimates. In the revision we will add paired statistical tests (e.g., Wilcoxon signed-rank) across the 200 queries, report 95% confidence intervals for all F1 scores, and include error bars on result figures. We will also expand the dataset description to detail the random sampling procedure from the technical corpus, the manual construction of the four query types, and any steps taken to mitigate selection bias, while noting the domain limitation to technical documents.

revision: yes
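
One way to produce the promised intervals is a paired percentile bootstrap over per-query F1 differences, sketched below with the standard library only. This swaps in a bootstrap for the Wilcoxon signed-rank test named in the rebuttal, and the function name, resample count, and seed are illustrative assumptions; per-query scores would come from the actual evaluation.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b) over paired queries.

    Returns (mean_difference, (lower, upper)).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    # Pairing by query removes between-query variance from the comparison.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), (lower, upper)
```

If the resulting interval for QASC minus a baseline excludes zero, the per-query advantage is unlikely to be resampling noise; an interval that straddles zero would mean the point estimates alone cannot support the superiority claim.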

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential reductions

The paper describes a new chunking algorithm (QASC) and reports its F1 performance on an external test set of 100 documents and 200 queries against named baseline families. No equations, fitted parameters, or derivation steps are present that could reduce the reported scores to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. The evaluation is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the standard assumption that off-the-shelf sentence embeddings preserve enough semantic signal for cosine-based relevance; no new constants, entities, or fitted parameters are introduced in the abstract description.

axioms (1)

domain assumption: Cosine similarity between sentence and query embeddings reliably identifies semantically relevant seed sentences.
Core of the first mechanism; invoked without further justification in the abstract.

pith-pipeline@v0.9.0 · 5750 in / 1163 out tokens · 41733 ms · 2026-05-25T00:13:54.289703+00:00 · methodology
