Recursively Summarizing Books with Human Feedback
Pith reviewed 2026-05-17 15:22 UTC · model grok-4.3
The pith
Recursive decomposition lets models summarize entire books after humans give feedback only on short sections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors combine learning from human feedback with recursive task decomposition on GPT-3. They collect demonstrations and comparisons on short book sections, fine-tune the model via behavioral cloning and reward modeling, and at inference time produce a full-book summary by first condensing small sections then recursively condensing those summaries. Human labelers supervise and evaluate the process quickly without having read the full texts. The resulting model produces sensible book summaries that match human quality on about five percent of books and reaches state-of-the-art performance on BookSum; the summaries further enable state-of-the-art zero-shot results on NarrativeQA.
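The inference procedure described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: `chunk`, `summarize_recursively`, and `toy_summarizer` are hypothetical names, chunking by character count is an assumption, and the toy summarizer stands in for the fine-tuned model. The sketch also ignores any context that might be carried over from earlier sections.

```python
from typing import Callable, List


def chunk(text: str, size: int) -> List[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def summarize_recursively(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 2000,
) -> str:
    """Summarize each chunk, join the summaries, and repeat until one chunk remains."""
    if len(text) <= chunk_size:
        # Base case: the text fits in a single call to the summarizer.
        return summarize(text)
    summaries = [summarize(piece) for piece in chunk(text, chunk_size)]
    joined = " ".join(summaries)
    if len(joined) >= len(text):
        # Guard against a summarizer that never shortens its input.
        raise ValueError("summarizer did not shorten the text")
    return summarize_recursively(joined, summarize, chunk_size)


def toy_summarizer(passage: str) -> str:
    """Hypothetical stand-in for the model: keep only the first ten words."""
    return " ".join(passage.split()[:10])


if __name__ == "__main__":
    book = "word " * 5000  # placeholder for a long text
    print(summarize_recursively(book, toy_summarizer))
```

Each level of the recursion only ever sees text short enough for one summarizer call, which is the property that lets labelers judge individual steps without reading the whole book.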
What carries the argument
Recursive task decomposition, in which models trained on smaller subtasks assist human evaluation of the larger task.
If this is right
- The approach scales human supervision to tasks whose full scope exceeds direct human reading time.
- Zero-shot question answering that uses the generated summaries outperforms prior methods on NarrativeQA.
- The released datasets of model samples support further research on long-form summarization.
- The same recursive structure can be applied to other generation tasks that benefit from hierarchical decomposition.
Where Pith is reading between the lines
- Similar hierarchical feedback loops could help align models on other long-horizon tasks such as full-story generation or multi-chapter reasoning.
- The method implies that intermediate summaries act as compressed state representations that preserve enough signal for downstream human judgment.
- Testing the same pipeline on non-fiction or technical documents would show whether narrative structure is necessary for the fidelity to hold.
Load-bearing premise
Summaries of summaries retain enough information and fidelity for the final output to remain faithful to the original book when humans never see the full text.
What would settle it
Human readers who have read the full books rate the model's final summaries as less accurate or less coherent than human-written summaries on a majority of test books.
Original abstract
A major challenge for scaling machine learning is training models to perform tasks that are very difficult or time-consuming for humans to evaluate. We present progress on this problem on the task of abstractive summarization of entire fiction novels. Our method combines learning from human feedback with recursive task decomposition: we use models trained on smaller parts of the task to assist humans in giving feedback on the broader task. We collect a large volume of demonstrations and comparisons from human labelers, and fine-tune GPT-3 using behavioral cloning and reward modeling to do summarization recursively. At inference time, the model first summarizes small sections of the book and then recursively summarizes these summaries to produce a summary of the entire book. Our human labelers are able to supervise and evaluate the models quickly, despite not having read the entire books themselves. Our resulting model generates sensible summaries of entire books, even matching the quality of human-written summaries in a few cases ($\sim5\%$ of books). We achieve state-of-the-art results on the recent BookSum dataset for book-length summarization. A zero-shot question-answering model using these summaries achieves state-of-the-art results on the challenging NarrativeQA benchmark for answering questions about books and movie scripts. We release datasets of samples from our model.
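The abstract mentions reward modeling from human comparisons without giving the loss. A common choice in work on learning from comparisons is a pairwise logistic loss on the difference of two scalar rewards. The sketch below assumes that form; the function name `pairwise_loss` is hypothetical, and nothing on this page confirms the exact objective the authors used.

```python
import math


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_preferred - r_rejected)) for one comparison.

    Assumption: the reward model outputs one scalar per summary and the
    labeler marked one of the two summaries as better.
    """
    diff = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    # The loss is small when the preferred summary scores higher.
    print(pairwise_loss(2.0, 0.0))
    # The loss is large when the ranking is reversed.
    print(pairwise_loss(0.0, 2.0))
```

Minimizing this loss pushes the reward of the preferred summary above that of the rejected one, so the trained reward model can score new summaries that no labeler has seen.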
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a method for abstractive summarization of entire fiction novels by combining human feedback with recursive task decomposition on GPT-3. Smaller sections are summarized first, then these summaries are recursively summarized to produce a book-level output. Human labelers provide demonstrations and pairwise comparisons on summaries without reading the full books. The resulting model produces sensible summaries that match human quality in ~5% of cases and achieves SOTA on BookSum. Its summaries, when used as input to a zero-shot question-answering model, yield SOTA results on NarrativeQA.
Significance. If the recursive process preserves key information, the work demonstrates a practical approach to scalable oversight for long-context tasks that are difficult for humans to evaluate directly. It shows that human feedback on decomposed subtasks can train models to handle book-length summarization effectively, with supporting evidence from human evaluations and downstream QA performance. The released datasets of model samples provide a useful resource for further research in this area.
major comments (2)
- The human evaluation protocol (described in the experiments) relies on labelers judging summaries-of-summaries without access to prior levels or the original text; this leaves open the possibility of cumulative omissions that are invisible to feedback, which directly affects the claim that the final summaries remain faithful to the book.
- Table 1 and the BookSum results section report SOTA numbers but provide limited detail on inter-annotator agreement and exact prompt formats for the comparisons used to train the reward model; without these, it is difficult to assess the reliability of the human feedback signal that underpins the recursive training.
minor comments (2)
- The description of the recursive inference procedure could include a diagram or pseudocode to clarify how summaries are composed across levels.
- A few citations to prior work on hierarchical summarization or long-document QA appear to be missing from the related work section.
Simulated Author's Rebuttal
We thank the referee for the constructive review and recommendation for minor revision. The comments highlight important aspects of our evaluation protocol and reporting that we will address to strengthen the manuscript.
- Referee: The human evaluation protocol (described in the experiments) relies on labelers judging summaries-of-summaries without access to prior levels or the original text; this leaves open the possibility of cumulative omissions that are invisible to feedback, which directly affects the claim that the final summaries remain faithful to the book.
  Authors: We agree this is a substantive limitation of the current evaluation setup. Labelers provide feedback on decomposed subtasks without the full book or prior summaries, so undetected cumulative omissions remain possible. While the strong NarrativeQA results offer indirect evidence that key information is preserved, they do not fully rule out the issue. In the revision we will add an explicit discussion of this limitation, including its implications for claims of faithfulness, and note that future work could include spot-checks against original text excerpts.
  Revision: yes
- Referee: Table 1 and the BookSum results section report SOTA numbers but provide limited detail on inter-annotator agreement and exact prompt formats for the comparisons used to train the reward model; without these, it is difficult to assess the reliability of the human feedback signal that underpins the recursive training.
  Authors: We appreciate the request for greater transparency. The manuscript describes the overall human feedback collection process but does not report inter-annotator agreement statistics or reproduce the exact comparison prompts. We will expand the experiments section to include available inter-annotator agreement numbers and move the precise prompt templates into an appendix so readers can better evaluate the reliability of the reward model training signal.
  Revision: yes
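To make the inter-annotator agreement request concrete, the sketch below shows two standard statistics for a pair of labelers who each pick the better of two summaries: the raw agreement rate and Cohen's kappa. The function names and the example labels are illustrative assumptions. The paper's actual agreement figures are not reproduced on this page.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of comparisons on which two labelers chose the same option."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = agreement_rate(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both labelers pick the same option
    # if each chose independently according to their own label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers always gave the same single label
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical choices between summary "A" and summary "B".
    labeler_1 = ["A", "A", "B", "B"]
    labeler_2 = ["A", "B", "B", "B"]
    print(agreement_rate(labeler_1, labeler_2))
    print(cohens_kappa(labeler_1, labeler_2))
```

Reporting kappa alongside the raw rate matters when one option is chosen much more often than the other, since raw agreement can then look high by chance alone.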
Circularity Check
No significant circularity in recursive summarization pipeline
The paper describes an empirical training procedure that collects independent human demonstrations and comparisons on summaries of summaries, then applies behavioral cloning and reward modeling to fine-tune GPT-3 for recursive summarization at inference time. Results are reported on external benchmarks (BookSum, NarrativeQA) and human evaluations of final outputs. No equation or claim reduces to its own inputs by construction, no load-bearing self-citation chain exists, and human feedback collection is described as separate from the quantities being predicted. The derivation is therefore self-contained as an applied RLHF-style method.
Axiom & Free-Parameter Ledger
free parameters (1)
- reward model temperature and KL coefficient
axioms (1)
- domain assumption: Human labelers can reliably judge summary quality from short excerpts without reading the source book.
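The KL coefficient listed among the free parameters above usually enters RL fine-tuning from human feedback through a penalized reward of the following form. This is the form commonly used in the literature, stated here as an assumption rather than as the paper's exact objective. The symbols are illustrative: $r_\theta$ is the learned reward model, $\pi^{\mathrm{RL}}$ the policy being trained, $\pi^{\mathrm{SFT}}$ the behavior-cloned policy, and $\beta$ the KL coefficient.

```latex
R(x, y) = r_\theta(x, y) - \beta \, \log \frac{\pi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}
```

A larger $\beta$ keeps the trained policy closer to the behavior-cloned one, which limits how far the policy can drift toward outputs that exploit errors in the reward model.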
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing · alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "We achieve state-of-the-art results on the recent BookSum dataset for book-length summarization."
- IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "A zero-shot question-answering model using these summaries achieves state-of-the-art results on the challenging NarrativeQA benchmark."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
- Generative Agents: Interactive Simulacra of Human Behavior
  Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
- Code as Policies: Language Model Programs for Embodied Control
  Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
- Teaching Models to Express Their Uncertainty in Words
  GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.
- Finetuned Language Models Are Zero-Shot Learners
  Instruction tuning a 137B language model on over 60 NLP tasks described by instructions substantially boosts zero-shot performance on unseen tasks, outperforming larger GPT-3 models.
- Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
  Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
- HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
  HER trains LLMs on reverse-engineered reasoning data and human preference rewards to improve cognitive persona simulation, reporting 30-point gains on CoSER and 15% on Minimax benchmarks over Qwen3-32B.
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
  Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.
- Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
  A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
- Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books
  QA-guided reasoning via a separate model producing structured traces improves faithfulness, informativeness, and grounding in character description generation from books over long-context LLM baselines.
- RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
  RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
  AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than base...
- Directly Fine-Tuning Diffusion Models on Differentiable Rewards
  DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.
- Reinforced Self-Training (ReST) for Language Modeling
  ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
- Aligning Text-to-Image Models using Human Feedback
  A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.
- Measuring Progress on Scalable Oversight for Large Language Models
  Humans chatting with an unreliable LLM assistant outperform both the model alone and unaided humans on MMLU and time-limited QuALITY tasks.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- DTCRS: Dynamic Tree Construction for Recursive Summarization
  DTCRS dynamically builds summary trees only for suitable question types by using sub-question embeddings as cluster centers, cutting construction time while improving QA on three tasks.
- RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
  RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.
Reference graph
Works this paper leans on
- [1] decompose_if_needed, which returns either a Respond() indicating the subtasks can be synthesized and answered by the model directly, or a Decompose(subtask) if the model requires help to solve the task. This subtask can be decomposed even further if necessary
- [2] answer_directly, which returns an actual answer to the task, synthesizing the answers to subtasks In general, both decompose_if_needed and answer_directly could be learned and implemented by an ML model. In the fixed decomposition case, decompose_if_needed is implemented programmatically instead. Note also that Decompose only returns a single subtask, ra...
- [3] Coverage: All information in the summary should be important, and there should be no other more important information omitted from the summary. So gratuitously including small details is generally penalized, and omitting important details is also penalized
- [4] Accuracy: All information in the summary should faithfully reflect the original passage
- [5] Coherence: Ignoring the passage, the summary should not be confusing, ambiguous, or logically incoherent. We also have a fourth criteria which is primarily applicable at higher height. Labelers were to use their own judgment on how important it was
- [6] Abstraction: When possible, writing should describe larger arcs and themes rather than just listing a series of events that happened. In addition, we also have the following guidelines • The summary should flow from the end of the previous context • When using pronouns, resolutions should be clear for a naive reader • Present tense should be preferred • Re...
- [7] Lack of hyperparameter tuning: We did not tune the 175B models much due to compute costs
- [8] Poor input distribution and noisy comparisons for higher level tasks: The quality of the input summaries given to the model (and thus to human evaluators when evaluating this model) degrades as one moves up the tree. The quality of input summaries is important for labeling accuracy: we found that inter-labeler agreement went down when labelers judged the...
- [9] Poor node sampling during RL: Our episode sampling strategy described in Section 2.3.3 may have been suboptimal. Rather than the vast majority of tasks being height 0 tasks, only about one third are. This is in contrast with evaluation time, where height 0 are both most numerous and potentially most important. Empirically, we found that the best full tree...
- [10] First, our technique is quite general, and answers questions fully abstractively, rather than via token extraction. For example, we observed the model inferring that a country of interest was England, despite it having no explicit mention in the summary besides the mention of London
- [11] Second, when answering 30 questions per passage, we require only one forward pass over the full book rather than 30, with the remaining passes being over a much smaller text. (On the other hand, we cannot answer questions that are not answered by the summary.)
- [12] Lastly, and most importantly, we retain the benefits of decomposition. Our model’s answers can often be easily traced back to the source in the book, and by leveraging the tree structure, we can often tell where mistakes led to wrong answers. Our model’s summaries can help a human perform question answering quickly – see Appendix H – whereas the approach o...