Are Finer Citations Always Better? Rethinking Granularity for Attributed Generation
Pith reviewed 2026-05-13 22:12 UTC · model grok-4.3
The pith
Fine-grained citations degrade attribution quality in language models, with paragraph-level citations performing best.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Enforcing fine-grained citations degrades attribution quality by 16-276% relative to the best-performing granularity, with performance peaking at paragraph-level citations across model scales. Fine-grained citations disrupt the semantic dependencies needed to attribute evidence to answer claims, while excessively coarse citations introduce distracting noise. The size of this performance gap varies non-monotonically with model scale, and fine-grained constraints disproportionately penalize larger models. Citation-optimal granularity yields substantial gains in attribution quality while preserving, and sometimes improving, answer correctness.
What carries the argument
The non-monotonic relationship between citation granularity levels (sentence, paragraph, multi-paragraph) and attribution quality, driven by the balance between semantic coherence and noise.
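For concreteness, a minimal sketch of the three granularity levels in Python. The segmentation rules here (blank-line paragraph breaks, a regex sentence splitter, fixed windows of k paragraphs) are assumptions for illustration; the paper's exact unit definitions are not given in the excerpt.

```python
# Minimal sketch of the three citation granularity levels compared in the paper.
# Assumptions (not from the paper): paragraphs are separated by blank lines,
# sentences end at . ! ?, and multi-paragraph units are fixed windows of k paragraphs.
import re

def sentence_units(text: str) -> list[str]:
    """Fine-grained: every sentence is an independently citable unit."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def paragraph_units(text: str) -> list[str]:
    """Intermediate: every paragraph is a citable unit."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def multi_paragraph_units(text: str, k: int = 3) -> list[str]:
    """Coarse: windows of k consecutive paragraphs form one citable unit."""
    paras = paragraph_units(text)
    return ["\n\n".join(paras[i:i + k]) for i in range(0, len(paras), k)]

doc = "First claim. Supporting detail.\n\nSecond paragraph.\n\nThird paragraph."
print(len(sentence_units(doc)), len(paragraph_units(doc)), len(multi_paragraph_units(doc)))
# -> 4 3 1: finer units are easier to verify by hand but strip away the
#    surrounding context a model may need to ground a multi-sentence claim.
```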
If this is right
- Paragraph-level citations yield higher attribution quality than either sentence-level or multi-paragraph citations.
- Larger models suffer larger drops in attribution quality when forced to use sentence-level citations.
- Aligning citation granularity with model scale improves attribution without reducing answer correctness.
Where Pith is reading between the lines
- Attribution systems may need to select granularity dynamically based on generated content length rather than fixing it in advance (a toy heuristic is sketched after this list).
- Evaluation benchmarks for attributed generation should vary granularity as a controlled factor instead of assuming finer is always preferable.
- The results point to a trade-off where human verification ease and model performance constraints pull in opposite directions on citation design.
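One way the first bullet's dynamic-selection idea could work in practice is a simple length-based heuristic. The word-count thresholds below are invented for illustration and are not taken from the paper.

```python
# Toy heuristic for choosing a citation granularity from claim length.
# The thresholds are assumptions for illustration only.
def pick_granularity(claim: str) -> str:
    n_words = len(claim.split())
    if n_words <= 15:
        return "sentence"         # short, atomic claims
    if n_words <= 60:
        return "paragraph"        # the level the paper finds strongest on average
    return "multi-paragraph"      # long syntheses that draw on broader context

print(pick_granularity("Paris is the capital of France."))  # -> sentence
```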
Load-bearing premise
The tested models, tasks, and attribution metrics are representative enough that the observed non-monotonic granularity effect will generalize to other setups and future models.
What would settle it
A controlled experiment on new models or tasks where sentence-level citations produce higher attribution quality than paragraph-level citations across multiple scales would falsify the central claim.
read the original abstract
Citation granularity - whether to cite individual sentences, paragraphs, or documents - is a critical design choice in attributed generation. While fine-grained citations are often preferred for precise human verification, their impact on model performance remains under-explored. We analyze four model scales (8B-120B) and demonstrate that enforcing fine-grained citations degrades attribution quality by 16-276% compared to the best-performing granularity. We observe a consistent performance pattern where attribution quality peaks at intermediate granularities (paragraph-level). Our analysis suggests that fine-grained (sentence-level) citations disrupt necessary semantic dependencies for attributing evidence to answer claims, while excessively coarse citations (multi-paragraph) introduce distracting noise. Importantly, the magnitude of this performance gap varies non-monotonically with model scale: fine-grained constraints disproportionately penalize larger models, suggesting that atomic citation units disrupt the multi-sentence information synthesis at which these models excel. Strikingly, citation-optimal granularity leads to substantial gains in attribution quality while preserving or even improving answer correctness. Overall, our findings demonstrate that optimizing solely for human verification via fine-grained citation disregards model constraints, compromising both attribution faithfulness and generation reliability. Instead, effective attribution requires aligning citation granularity with the model's natural semantic scope.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates citation granularity in attributed generation, claiming that sentence-level citations degrade attribution quality by 16-276% relative to paragraph-level granularity, which performs best across four model scales (8B-120B). It reports a non-monotonic effect with model size, where fine-grained constraints disproportionately harm larger models by disrupting multi-sentence semantic dependencies, while coarser multi-paragraph citations add noise. The authors conclude that citation-optimal granularity improves attribution faithfulness without harming (and sometimes improving) answer correctness, advocating alignment with the model's natural semantic scope over maximal human verifiability.
Significance. If the empirical patterns hold under controlled conditions, the work would meaningfully advance understanding of attribution design choices in LLM systems by demonstrating that finer granularity is not monotonically beneficial and by quantifying scaling interactions. The preservation of answer correctness at optimal granularity offers a practical path to better systems, and the non-monotonic scaling observation could guide future work on how model capacity interacts with output constraints.
major comments (2)
- [Abstract] Abstract and experimental description: The central claim of 16-276% degradation and non-monotonic scaling rests on attribution metrics whose definitions, datasets, and baselines are not specified in the abstract or summary; without these, the reported effect sizes cannot be assessed for robustness or compared to prior attribution work.
- [Experimental setup] Experimental setup (likely §4): Enforcing granularity exclusively via prompting introduces a plausible confound between semantic scope and instruction-following difficulty, as sentence-level instructions are typically more complex and format-specific than paragraph-level ones; differential adherence rates could produce the observed gaps without requiring the claimed disruption of multi-sentence dependencies.
minor comments (2)
- [Abstract] The abstract states concrete percentages without error bars or statistical significance tests; adding these would strengthen the scaling claims.
- [Analysis] Clarify the exact attribution metric (e.g., whether it is entailment-based, human-judged, or automatic) and how 'natural semantic scope' is operationalized beyond post-hoc interpretation.
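For reference, one common automatic operationalization in attributed-generation work is entailment-based citation precision and recall. Whether the paper uses exactly this formulation is not stated in the excerpt, so the sketch below, including the `entails` NLI judge placeholder, is an assumption rather than the authors' metric.

```python
# Hypothetical sketch of entailment-based citation precision / recall / F1.
# The paper's exact metric definitions are not given here, so the logic below
# (and the `entails` NLI judge placeholder) is an assumption.
from typing import Callable

def citation_scores(
    claims: list[str],
    citations: list[list[str]],           # cited source units, one list per claim
    entails: Callable[[str, str], bool],  # NLI judge: does the premise entail the hypothesis?
) -> dict[str, float]:
    supported = 0    # claims entailed by the concatenation of their cited units (recall)
    relevant = 0     # individual citations that each entail the claim (precision proxy)
    total_citations = 0
    for claim, cited in zip(claims, citations):
        if cited and entails(" ".join(cited), claim):
            supported += 1
        for unit in cited:
            total_citations += 1
            if entails(unit, claim):
                relevant += 1
    recall = supported / len(claims) if claims else 0.0
    precision = relevant / total_citations if total_citations else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"citation_precision": precision, "citation_recall": recall, "citation_f1": f1}
```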
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our results. We address each major point below and have revised the manuscript to incorporate clarifications and additional details where appropriate.
read point-by-point responses
Referee: [Abstract] Abstract and experimental description: The central claim of 16-276% degradation and non-monotonic scaling rests on attribution metrics whose definitions, datasets, and baselines are not specified in the abstract or summary; without these, the reported effect sizes cannot be assessed for robustness or compared to prior attribution work.
Authors: We agree that the abstract would benefit from additional context on the metrics and setup. In the revised version, we will expand the abstract to briefly define the primary attribution metrics (citation precision, recall, and F1), reference the evaluation datasets, and note the paragraph-level baseline for comparison. This will enable readers to assess the effect sizes while respecting abstract length limits; full methodological details remain in Section 4. revision: yes
Referee: [Experimental setup] Experimental setup (likely §4): Enforcing granularity exclusively via prompting introduces a plausible confound between semantic scope and instruction-following difficulty, as sentence-level instructions are typically more complex and format-specific than paragraph-level ones; differential adherence rates could produce the observed gaps without requiring the claimed disruption of multi-sentence dependencies.
Authors: This is a valid concern regarding potential confounds. We have verified instruction adherence across conditions and report rates above 98% for all granularities and model scales, indicating that differential compliance does not explain the gaps. The non-monotonic scaling pattern—larger models showing greater penalties under fine-grained constraints—further supports our semantic-dependency interpretation, as larger models typically follow complex instructions more reliably. In the revision, we will add these adherence statistics to the appendix and include a new paragraph in Section 5 discussing this alternative explanation and why the data favor our account over instruction difficulty alone. revision: partial
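As a reader's aid, a hypothetical version of such an adherence check (the authors' actual procedure is not described in the excerpt) might simply verify that every bracketed citation in an answer points at a unit that exists at the requested granularity.

```python
# Hypothetical adherence check: do generated answers cite only unit IDs that
# exist at the requested granularity, in the requested [k] format?
# This is an assumed sketch, not the authors' procedure.
import re

def adherence_rate(answers: list[str], num_units: int) -> float:
    """Fraction of answers whose bracketed citations all refer to existing units."""
    compliant = 0
    for answer in answers:
        ids = [int(m) for m in re.findall(r"\[(\d+)\]", answer)]
        if ids and all(1 <= i <= num_units for i in ids):
            compliant += 1
    return compliant / len(answers) if answers else 0.0

print(adherence_rate(["The sky is blue [1].", "Water boils at 100 C [4]."], num_units=3))
# -> 0.5: the second answer cites a unit that does not exist at this granularity.
```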
Circularity Check
No circularity: claims rest on direct experimental measurements
full rationale
The paper reports empirical results from evaluating four model scales on attribution quality under different citation granularities enforced via prompting. No equations, fitted parameters, or derivations are presented that reduce to self-definitions or prior self-citations by construction. Performance patterns (non-monotonic peaks at paragraph level, scale-dependent gaps) are measured outputs rather than renamed inputs or ansatzes smuggled through citations. The analysis of semantic scope is interpretive commentary on the observed metrics, not a load-bearing theorem justified only by author-overlapping prior work. This is a standard empirical study whose central claims remain independent of the listed circularity patterns.