pith. machine review for the scientific record.

arxiv: 2109.10862 · v2 · pith:UKSLWPQTnew · submitted 2021-09-22 · 💻 cs.CL · cs.AI · cs.LG

Recursively Summarizing Books with Human Feedback

Pith reviewed 2026-05-17 15:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords book summarization · human feedback · recursive decomposition · abstractive summarization · GPT-3 · BookSum · NarrativeQA · long-form generation

The pith

Recursive decomposition lets models summarize entire books after humans give feedback only on short sections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a method to train models on book-length summarization by breaking the task into smaller pieces that humans can evaluate directly. Models first learn to summarize short passages from human demonstrations and comparisons, then apply the same process recursively to the resulting summaries until a full-book summary emerges. Humans never need to read the complete novel themselves yet can still steer the output toward sensible, high-quality results. This yields summaries that match human-written ones in roughly five percent of cases and sets new state-of-the-art numbers on the BookSum benchmark. The same summaries also improve zero-shot question answering on NarrativeQA.

Core claim

The authors combine learning from human feedback with recursive task decomposition on GPT-3. They collect demonstrations and comparisons on short book sections, fine-tune the model via behavioral cloning and reward modeling, and at inference time produce a full-book summary by first condensing small sections then recursively condensing those summaries. Human labelers supervise and evaluate the process quickly without having read the full texts. The resulting model produces sensible book summaries that match human quality on about five percent of books and reaches state-of-the-art performance on BookSum; the summaries further enable state-of-the-art zero-shot results on NarrativeQA.
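
The inference procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `chunk`, `summarize_recursively`, and the toy `first_words` summarizer are hypothetical names, the 200-word chunk size is an arbitrary assumption, and the sketch ignores any conditioning on earlier context that the real system may use. A fine-tuned model would take the place of `first_words`.

```python
from typing import Callable, List


def chunk(text: str, max_words: int) -> List[str]:
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_recursively(
    text: str,
    summarize: Callable[[str], str],
    max_words: int = 200,  # assumed chunk size, chosen only for illustration
) -> str:
    """Summarize each chunk, join the summaries, and repeat until one chunk is left."""
    pieces = chunk(text, max_words)
    while len(pieces) > 1:
        joined = " ".join(summarize(p) for p in pieces)
        shorter = chunk(joined, max_words)
        # Guard against a summarizer that does not compress, which would loop forever.
        if len(shorter) >= len(pieces):
            raise ValueError("summarizer must shrink its input")
        pieces = shorter
    return summarize(pieces[0]) if pieces else ""


def first_words(passage: str, keep: int = 20) -> str:
    """Toy stand-in for a learned summarizer: keep the first `keep` words."""
    return " ".join(passage.split()[:keep])


if __name__ == "__main__":
    book = " ".join(f"w{i}" for i in range(5000))
    print(summarize_recursively(book, first_words))
```

If each pass shrinks the text by a roughly fixed ratio, the number of passes grows only logarithmically with book length, so every individual call stays short enough for a labeler to check.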

What carries the argument

Recursive task decomposition, in which models trained on smaller subtasks assist human evaluation of the larger task.

If this is right

  • The approach scales human supervision to tasks whose full scope exceeds direct human reading time.
  • Zero-shot question answering that uses the generated summaries outperforms prior methods on NarrativeQA.
  • The released datasets of model samples support further research on long-form summarization.
  • The same recursive structure can be applied to other generation tasks that benefit from hierarchical decomposition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar hierarchical feedback loops could help align models on other long-horizon tasks such as full-story generation or multi-chapter reasoning.
  • The method implies that intermediate summaries act as compressed state representations that preserve enough signal for downstream human judgment.
  • Testing the same pipeline on non-fiction or technical documents would show whether narrative structure is necessary for the fidelity to hold.

Load-bearing premise

Summaries of summaries retain enough information and fidelity for the final output to remain faithful to the original book when humans never see the full text.

What would settle it

Human readers who have read the full books rate the model's final summaries as less accurate or less coherent than human-written summaries on a majority of test books.

Original abstract

A major challenge for scaling machine learning is training models to perform tasks that are very difficult or time-consuming for humans to evaluate. We present progress on this problem on the task of abstractive summarization of entire fiction novels. Our method combines learning from human feedback with recursive task decomposition: we use models trained on smaller parts of the task to assist humans in giving feedback on the broader task. We collect a large volume of demonstrations and comparisons from human labelers, and fine-tune GPT-3 using behavioral cloning and reward modeling to do summarization recursively. At inference time, the model first summarizes small sections of the book and then recursively summarizes these summaries to produce a summary of the entire book. Our human labelers are able to supervise and evaluate the models quickly, despite not having read the entire books themselves. Our resulting model generates sensible summaries of entire books, even matching the quality of human-written summaries in a few cases ($\sim5\%$ of books). We achieve state-of-the-art results on the recent BookSum dataset for book-length summarization. A zero-shot question-answering model using these summaries achieves state-of-the-art results on the challenging NarrativeQA benchmark for answering questions about books and movie scripts. We release datasets of samples from our model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a method for abstractive summarization of entire fiction novels by combining human feedback with recursive task decomposition on GPT-3. Smaller sections are summarized first, then these summaries are recursively summarized to produce a book-level output. Human labelers provide demonstrations and pairwise comparisons on summaries without reading the full books. The resulting model produces sensible summaries that match human quality in ~5% of cases, achieves SOTA on BookSum, and yields SOTA zero-shot QA results on NarrativeQA when used as input.

Significance. If the recursive process preserves key information, the work demonstrates a practical approach to scalable oversight for long-context tasks that are difficult for humans to evaluate directly. It shows that human feedback on decomposed subtasks can train models to handle book-length summarization effectively, with supporting evidence from human evaluations and downstream QA performance. The released datasets of model samples provide a useful resource for further research in this area.

major comments (2)
  1. The human evaluation protocol (described in the experiments) relies on labelers judging summaries-of-summaries without access to prior levels or the original text; this leaves open the possibility of cumulative omissions that are invisible to feedback, which directly affects the claim that the final summaries remain faithful to the book.
  2. Table 1 and the BookSum results section report SOTA numbers but provide limited detail on inter-annotator agreement and exact prompt formats for the comparisons used to train the reward model; without these, it is difficult to assess the reliability of the human feedback signal that underpins the recursive training.
minor comments (2)
  1. The description of the recursive inference procedure could include a diagram or pseudocode to clarify how summaries are composed across levels.
  2. A few citations to prior work on hierarchical summarization or long-document QA appear to be missing from the related work section.
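
The agreement statistics requested in major comment 2 could be computed along the following lines. This is a sketch under stated assumptions: each comparison is taken to be a choice between two summaries labelled "A" and "B", exactly two labelers judge the same items, and the function names are hypothetical. The paper's actual agreement measure is not specified on this page.

```python
from typing import Sequence


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of comparisons on which two labelers picked the same summary."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Agreement corrected for chance, for two labelers over the same items."""
    observed = agreement_rate(a, b)
    n = len(a)
    labels = set(a) | set(b)
    # Chance agreement: probability both labelers pick the same label independently.
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both labelers always chose the same single label
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    labeler_1 = ["A", "A", "B", "B"]
    labeler_2 = ["A", "B", "B", "B"]
    print(agreement_rate(labeler_1, labeler_2), cohens_kappa(labeler_1, labeler_2))
```

Reporting the chance-corrected figure alongside the raw rate matters because raw agreement can look high simply when one option is preferred most of the time.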

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation for minor revision. The comments highlight important aspects of our evaluation protocol and reporting that we will address to strengthen the manuscript.

Point-by-point responses
  1. Referee: The human evaluation protocol (described in the experiments) relies on labelers judging summaries-of-summaries without access to prior levels or the original text; this leaves open the possibility of cumulative omissions that are invisible to feedback, which directly affects the claim that the final summaries remain faithful to the book.

    Authors: We agree this is a substantive limitation of the current evaluation setup. Labelers provide feedback on decomposed subtasks without the full book or prior summaries, so undetected cumulative omissions remain possible. While the strong NarrativeQA results offer indirect evidence that key information is preserved, they do not fully rule out the issue. In the revision we will add an explicit discussion of this limitation, including its implications for claims of faithfulness, and note that future work could include spot-checks against original text excerpts.
    Revision: yes

  2. Referee: Table 1 and the BookSum results section report SOTA numbers but provide limited detail on inter-annotator agreement and exact prompt formats for the comparisons used to train the reward model; without these, it is difficult to assess the reliability of the human feedback signal that underpins the recursive training.

    Authors: We appreciate the request for greater transparency. The manuscript describes the overall human feedback collection process but does not report inter-annotator agreement statistics or reproduce the exact comparison prompts. We will expand the experiments section to include available inter-annotator agreement numbers and move the precise prompt templates into an appendix so readers can better evaluate the reliability of the reward model training signal.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in recursive summarization pipeline

Full rationale

The paper describes an empirical training procedure that collects independent human demonstrations and comparisons on summaries of summaries, then applies behavioral cloning and reward modeling to fine-tune GPT-3 for recursive summarization at inference time. Results are reported on external benchmarks (BookSum, NarrativeQA) and human evaluations of final outputs. No equation or claim reduces to its own inputs by construction, no load-bearing self-citation chain exists, and human feedback collection is described as separate from the quantities being predicted. The derivation is therefore self-contained as an applied RLHF-style method.
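
For readers unfamiliar with reward modeling from comparisons, the objective usually takes a pairwise logistic form. The sketch below assumes that form; this page does not state the exact loss the paper uses, and `comparison_loss` is a hypothetical name. The two arguments are the scalar scores a reward model assigns to the summary the labeler preferred and to the one they rejected.

```python
import math


def comparison_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-probability that the preferred summary wins under a logistic model."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)), written in a form that avoids overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    print(comparison_loss(2.0, 0.0))  # small loss: model agrees with the labeler
    print(comparison_loss(0.0, 2.0))  # larger loss: model disagrees with the labeler
```

The loss depends only on the difference between the two scores, so the human comparisons enter as external targets rather than as quantities derived from the model's own outputs.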

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that recursive summarization preserves semantic content across levels and that human preference judgments on short summaries generalize to long-form quality. No new physical constants or particles are introduced.

free parameters (1)
  • reward model temperature and KL coefficient
    Hyper-parameters in the RL fine-tuning stage that control how much the policy deviates from the supervised baseline.
axioms (1)
  • domain assumption: Human labelers can reliably judge summary quality from short excerpts without reading the source book.
    Invoked when collecting comparisons for the reward model and when claiming the final summaries are sensible.
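
The role of the KL coefficient listed under free parameters can be illustrated with the penalized reward commonly used in this style of fine-tuning. This is an assumed formulation rather than one quoted from the paper, and the function and parameter names are hypothetical: the reward-model score is reduced in proportion to how much more likely the sampled summary is under the tuned policy than under the supervised baseline.

```python
def kl_penalized_reward(
    reward: float,         # score from the reward model
    logp_policy: float,    # log-probability of the sample under the tuned policy
    logp_baseline: float,  # log-probability of the same sample under the supervised baseline
    kl_coef: float,        # the free parameter: a larger value keeps the policy closer to the baseline
) -> float:
    """Reward-model score minus kl_coef times the sampled log-ratio to the baseline."""
    return reward - kl_coef * (logp_policy - logp_baseline)


if __name__ == "__main__":
    print(kl_penalized_reward(1.0, -10.0, -12.0, 0.5))
```

With `kl_coef` set to zero the policy is optimized against the reward model alone; raising it trades reward for staying near the supervised baseline.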

pith-pipeline@v0.9.0 · 5538 in / 1179 out tokens · 100475 ms · 2026-05-17T15:22:19.704135+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Generative Agents: Interactive Simulacra of Human Behavior

    cs.HC 2023-04 accept novelty 8.0

    Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

  2. Code as Policies: Language Model Programs for Embodied Control

    cs.RO 2022-09 accept novelty 8.0

    Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

  3. Teaching Models to Express Their Uncertainty in Words

    cs.CL 2022-05 unverdicted novelty 8.0

    GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.

  4. Finetuned Language Models Are Zero-Shot Learners

    cs.CL 2021-09 accept novelty 8.0

    Instruction tuning a 137B language model on over 60 NLP tasks described by instructions substantially boosts zero-shot performance on unseen tasks, outperforming larger GPT-3 models.

  5. Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

    cs.LG 2026-03 unverdicted novelty 7.0

    Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.

  6. HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing

    cs.LG 2026-01 unverdicted novelty 7.0

    HER trains LLMs on reverse-engineered reasoning data and human preference rewards to improve cognitive persona simulation, reporting 30-point gains on CoSER and 15% on Minimax benchmarks over Qwen3-32B.

  7. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

    cs.CV 2022-04 unverdicted novelty 7.0

    Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.

  8. Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

    cs.CL 2026-04 unverdicted novelty 6.0

    A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.

  9. Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books

    cs.CL 2026-04 unverdicted novelty 6.0

    QA-guided reasoning via a separate model producing structured traces improves faithfulness, informativeness, and grounding in character description generation from books over long-context LLM baselines.

  10. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

    cs.CL 2024-01 unverdicted novelty 6.0

    RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.

  11. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    cs.CL 2023-10 conditional novelty 6.0

    AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than base...

  12. Directly Fine-Tuning Diffusion Models on Differentiable Rewards

    cs.CV 2023-09 conditional novelty 6.0

    DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.

  13. Reinforced Self-Training (ReST) for Language Modeling

    cs.CL 2023-08 unverdicted novelty 6.0

    ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

  14. Aligning Text-to-Image Models using Human Feedback

    cs.LG 2023-02 unverdicted novelty 6.0

    A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.

  15. Measuring Progress on Scalable Oversight for Large Language Models

    cs.HC 2022-11 unverdicted novelty 6.0

    Humans chatting with an unreliable LLM assistant outperform both the model alone and unaided humans on MMLU and time-limited QuALITY tasks.

  16. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  17. DTCRS: Dynamic Tree Construction for Recursive Summarization

    cs.CL 2026-04 unverdicted novelty 5.0

    DTCRS dynamically builds summary trees only for suitable question types by using sub-question embeddings as cluster centers, cutting construction time while improving QA on three tasks.

  18. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

    cs.LG 2023-04 unverdicted novelty 5.0

    RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.
