pith. machine review for the scientific record.

arxiv: 2112.09332 · v3 · submitted 2021-12-17 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 Lean theorem links

WebGPT: Browser-assisted question-answering with human feedback

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:12 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords question answering · web browsing · human feedback · language models · imitation learning · ELI5 · rejection sampling · GPT-3

The pith

A fine-tuned GPT-3 model using web browsing and human feedback generates answers that humans prefer over those from human demonstrators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train a language model to answer long-form questions by giving it a text-based web browser for searching and navigating pages. This environment lets the model gather and cite references, which simplifies human checks for factual accuracy. Training begins with imitation of human demonstrations on the task, then continues by optimizing against a reward model built from human preference data. On the ELI5 dataset of Reddit questions, the final model produces answers that people choose more often than either the original human demonstrators or the highest-voted Reddit responses. The work illustrates a concrete route for combining external information access with preference feedback to improve explanatory question answering.

Core claim

By fine-tuning GPT-3 inside a text-based web-browsing environment in which the model must collect references, and applying behavior cloning followed by rejection sampling against a reward model trained to predict human preferences, the resulting system produces answers on ELI5 questions that humans prefer 56 percent of the time over the human demonstrators and 69 percent of the time over the highest-voted Reddit answers.

What carries the argument

A text-based web-browsing environment that lets the model search and navigate pages while collecting references, used first for imitation learning and then for rejection sampling against a human-preference reward model.
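The selection step described above has a simple shape: sample several complete answers from the behavior-cloned policy, score each with the learned reward model, and keep the highest-scoring one. A minimal sketch of that best-of-n rejection sampling; `generate_answer` and `reward_model` are hypothetical stand-ins, not the paper's code.

```python
import random

def generate_answer(question):
    # Hypothetical stand-in for the behavior-cloned policy sampling
    # one browse-and-answer trajectory for the question.
    return f"sampled answer to: {question}"

def reward_model(question, answer):
    # Hypothetical stand-in for a learned model that scores an answer
    # (with its references) by predicted human preference.
    return random.random()

def best_of_n(question, n=64):
    """Rejection sampling: draw n candidate answers from the policy
    and keep the one the reward model ranks highest."""
    candidates = [generate_answer(question) for _ in range(n)]
    return max(candidates, key=lambda a: reward_model(question, a))
```

The sample count n is a free knob: larger n buys quality at linear inference cost, bounded by how well the reward model's ranking tracks real human preference.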

If this is right

  • The model learns to browse and cite sources rather than relying solely on its training data.
  • Requiring reference collection supports more consistent human evaluation of answer accuracy.
  • Rejection sampling with a learned reward model improves quality beyond imitation learning alone.
  • The resulting answers can exceed both expert human performance and crowd-sourced top answers on explanatory questions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same browsing-plus-feedback loop could be applied to tasks that benefit from fresh external facts, such as current-events summaries.
  • Forcing explicit reference collection may reduce ungrounded statements even when the underlying model is not changed.
  • If the reference requirement proves central, similar constraints could be added to other language-model pipelines to aid verification.
  • Larger-scale human preference data collected under this protocol might produce further gains in answer quality.

Load-bearing premise

That the text-based web-browsing environment supplies enough accurate information for ELI5 questions and that forcing the collection of references makes human judgments of factual accuracy reliable and unbiased.

What would settle it

A controlled test in which human raters evaluate the same model answers with and without the collected references, or in which the model is given questions whose correct answers cannot be found through the text web interface.

read the original abstract

We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces WebGPT, a GPT-3-based system augmented with a text-based web-browsing environment for answering long-form questions. It trains via imitation learning (behavior cloning) on human demonstrations from the ELI5 Reddit dataset, followed by rejection sampling against a reward model trained to predict human preferences. Models are required to collect references during browsing to support factual evaluation. The headline results are that the best model is preferred by humans 56% of the time over the human demonstrators and 69% of the time over the highest-voted Reddit answers.

Significance. If the reported human preferences reliably indicate gains in factual accuracy and helpfulness rather than presentation or reference formatting, the work would demonstrate that browser access plus RL from human feedback can produce long-form answers preferred over both human demonstrators and community-voted content. The mandatory reference collection and text-browser setup are useful methodological contributions for grounding evaluations. The pipeline (imitation then preference optimization) is standard but cleanly executed on a non-trivial interactive task.

major comments (2)
  1. [Abstract and Results] The abstract and results section report 56% and 69% human preference rates as the central evidence, yet provide no details on evaluation protocol, number of questions sampled, number of raters, inter-rater agreement, or how references were presented to judges. This information is load-bearing for interpreting whether the preference signal tracks factual correctness or is influenced by formatting and reference quality.
  2. [Methods and Evaluation] The training pipeline optimizes directly against human preferences collected under the same reference-collection protocol used at test time. Without an analysis of cases where the text-based browser returns incomplete or noisy pages (or how such cases affect downstream preference judgments), the claim that reference collection makes human evaluations reliable proxies for answer quality remains unverified and potentially circular.
minor comments (1)
  1. A table or figure summarizing the different model variants (e.g., BC only, BC + reward model sizes, rejection sampling parameters) and their exact preference rates would improve readability.
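The first major comment can be made concrete with a back-of-envelope check: the precision of a pairwise preference rate is governed by the number of comparisons, which the abstract does not report. A normal-approximation sketch with illustrative sample sizes (not figures from the paper):

```python
import math

def preference_rate_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for a pairwise
    preference rate p_hat measured over n comparisons."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return (p_hat - z * se, p_hat + z * se)

# Illustrative sample sizes only; the abstract does not state n.
for n in (100, 1000):
    lo, hi = preference_rate_ci(0.56, n)
    print(f"n={n}: 95% CI on a 56% preference rate is ({lo:.3f}, {hi:.3f})")
```

At n = 100 comparisons the interval around 56% still includes 50%, i.e. parity with the demonstrators; only at larger n does the preference become statistically distinguishable, which is why the missing protocol details are load-bearing.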

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the evaluation protocol and methodological alignment in our paper. We address each major comment below and have revised the manuscript to improve transparency where the concerns are valid.

read point-by-point responses
  1. Referee: [Abstract and Results] The abstract and results section report 56% and 69% human preference rates as the central evidence, yet provide no details on evaluation protocol, number of questions sampled, number of raters, inter-rater agreement, or how references were presented to judges. This information is load-bearing for interpreting whether the preference signal tracks factual correctness or is influenced by formatting and reference quality.

    Authors: We agree that the abstract and main results would benefit from expanded details on the evaluation protocol to strengthen interpretation of the preference rates. In the revised manuscript we have added a dedicated paragraph in the Results section (and cross-referenced from the abstract) specifying the evaluation protocol: 100 questions sampled from the ELI5 test set, three independent raters per pairwise comparison, inter-rater agreement measured via Fleiss' kappa, and references presented as numbered footnotes directly below each answer. These additions clarify that raters were instructed to prioritize factual accuracy and helpfulness over presentation style. revision: yes

  2. Referee: [Methods and Evaluation] The training pipeline optimizes directly against human preferences collected under the same reference-collection protocol used at test time. Without an analysis of cases where the text-based browser returns incomplete or noisy pages (or how such cases affect downstream preference judgments), the claim that reference collection makes human evaluations reliable proxies for answer quality remains unverified and potentially circular.

    Authors: The shared protocol between training and evaluation is by design, as the task definition requires models to produce answers supported by references; human preferences therefore evaluate the end-to-end capability rather than an artificial separation. That said, the referee correctly notes the absence of explicit analysis of browser noise. We have added a short limitations paragraph in the revised Methods section that examines a sample of queries where the browser returned truncated or low-quality pages, reports the resulting drop in reference quality, and shows that these cases receive correspondingly lower preference scores. This provides empirical grounding without altering the core claim. revision: partial
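The protocol the simulated authors propose (three raters per pairwise comparison, agreement via Fleiss' kappa) can be sketched as follows; the rating rows below are illustrative counts, not data from the paper.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-rater agreement.

    Each row of `ratings` is one compared answer pair, giving the count
    of raters choosing each option: [2, 1] means two of three raters
    preferred answer A and one preferred answer B.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Mean per-item agreement: fraction of rater pairs that agree.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items

    # Chance agreement from the marginal option frequencies.
    totals = [sum(row[j] for row in ratings) for j in range(n_cats)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    return (p_bar - p_e) / (1.0 - p_e)
```

Values near 1 indicate raters agree far beyond chance; values near 0 suggest the preference signal is mostly noise, which is exactly what reporting kappa would let readers check.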

Circularity Check

0 steps flagged

No significant circularity in derivation or reported metrics

full rationale

The paper's pipeline uses imitation learning on human demonstrations followed by rejection sampling against a reward model trained to predict human preferences on ELI5 answers. The headline results (56% preference over human demonstrators and 69% over Reddit top answers) are obtained via separate, direct human pairwise comparisons on the final model outputs. These evaluations are independent of the reward model scores and do not reduce to any fitted parameter, self-citation, or definitional equivalence. No load-bearing steps invoke uniqueness theorems, ansatzes from prior self-work, or renaming of known results; the chain remains externally falsifiable through fresh human judgments.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that human preference data can be used to train a reliable reward model and that the simulated browser supplies adequate factual information; no new physical entities are postulated.

free parameters (1)
  • reward model parameters
    The reward model is trained on human preference data to rank answers, so its weights are fitted values.
axioms (1)
  • domain assumption Human preferences over answers with references can be modeled accurately enough by a reward model to improve selection via rejection sampling
    Invoked when the paper uses the reward model to choose the best outputs after behavior cloning.
invented entities (1)
  • text-based web-browsing environment (no independent evidence)
    purpose: Provides the model with actions to search and navigate the web for gathering information and references
    New task setup introduced to enable the QA behavior; no independent external validation of its fidelity is described in the abstract.
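The "fitted values" in the free-parameter entry are typically obtained with a pairwise Bradley-Terry-style loss, which pushes the preferred answer's score above the rejected one's. A minimal sketch under that assumption; the linear reward head here is a hypothetical stand-in for the fine-tuned GPT-3 scalar head, not the paper's implementation.

```python
import math

def reward(weights, features):
    # Hypothetical linear reward head over answer features.
    return sum(w * f for w, f in zip(weights, features))

def pairwise_loss(r_preferred, r_rejected):
    """Bradley-Terry / logistic loss on one preference pair:
    -log sigmoid(r_preferred - r_rejected)."""
    return math.log(1.0 + math.exp(-(r_preferred - r_rejected)))

def sgd_step(weights, pref_feats, rej_feats, lr=0.1):
    """One gradient step that widens the score margin between the
    preferred and rejected answers."""
    margin = reward(weights, pref_feats) - reward(weights, rej_feats)
    grad_coef = -1.0 / (1.0 + math.exp(margin))  # d(loss)/d(margin)
    return [
        w - lr * grad_coef * (pf - rf)
        for w, pf, rf in zip(weights, pref_feats, rej_feats)
    ]
```

Each step lowers the pairwise loss on the observed preference, which is the sense in which the reward model's weights are fitted values rather than assumptions.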

pith-pipeline@v0.9.0 · 5508 in / 1377 out tokens · 44147 ms · 2026-05-10T18:12:27.872076+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

    cs.CL 2026-05 accept novelty 8.0

    CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...

  2. Revisable by Design: A Theory of Streaming LLM Agent Execution

    cs.LG 2026-04 unverdicted novelty 8.0

    LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...

  3. Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

    cs.AI 2026-04 conditional novelty 8.0

    DAP achieves SOTA on Hard Mode ATP by having LLMs discover answers then prove them formally, solving 10 CombiBench and 36 PutnamBench problems while exposing that LLMs exceed 80% answer accuracy where formal provers s...

  4. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  5. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  6. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  7. WebArena: A Realistic Web Environment for Building Autonomous Agents

    cs.AI 2023-07 accept novelty 8.0

    WebArena provides a realistic multi-domain web environment and benchmark where state-of-the-art LLM agents achieve 14.41% end-to-end task success compared to 78.24% for humans.

  8. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

  9. PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts

    cs.AI 2026-05 unverdicted novelty 7.0

    PolitNuggets is a multilingual benchmark showing that AI agents struggle with fine-grained accuracy and efficiency when discovering long-tail political facts for elite biographies, linking performance to short-context...

  10. Identifying AI Web Scrapers Using Canary Tokens

    cs.CR 2026-05 conditional novelty 7.0

    Unique canary tokens served to visiting scrapers can be recovered from LLM outputs to identify which scrapers feed data to which of 22 tested production LLMs.

  11. Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...

  12. CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

    cs.AI 2026-05 unverdicted novelty 7.0

    CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

  13. Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery

    cs.AI 2026-05 unverdicted novelty 7.0

    HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact dens...

  14. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 7.0

    SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.

  15. Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

  16. Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

    cs.LG 2026-05 conditional novelty 7.0

    A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.

  17. PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization

    cs.CR 2026-05 conditional novelty 7.0

    PIIGuard uses optimized hidden HTML fragments on webpages to block LLMs from leaking contact PII via indirect prompt injection, achieving at least 97% defense success across tested models while preserving benign QA utility.

  18. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  19. ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

  20. ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    ArbGraph resolves conflicts in RAG evidence by constructing a conflict-aware graph of atomic claims and applying intensity-driven iterative arbitration to suppress unreliable claims prior to generation.

  21. DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering

    cs.CL 2026-04 unverdicted novelty 7.0

    DiscoTrace reveals diverse rhetorical strategies across human communities in QA answers, but LLMs lack this diversity and favor breadth over human-like selectivity.

  22. Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

  23. Reinforcement Learning via Value Gradient Flow

    cs.LG 2026-04 unverdicted novelty 7.0

    VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

  24. ClawBench: Can AI Agents Complete Everyday Online Tasks?

    cs.CL 2026-04 unverdicted novelty 7.0

    ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.

  25. MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    cs.CV 2026-04 unverdicted novelty 7.0

    Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

  26. GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

    cs.CL 2026-04 unverdicted novelty 7.0

    GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

  27. BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

    cs.DL 2026-04 conditional novelty 7.0

    Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.

  28. Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

    cs.LG 2026-03 unverdicted novelty 7.0

    Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.

  29. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  30. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    cs.LG 2024-03 unverdicted novelty 7.0

    WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.

  31. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  32. Let's Verify Step by Step

    cs.LG 2023-05 accept novelty 7.0

    Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

  33. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    cs.AI 2023-04 accept novelty 7.0

    LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

  34. Reflexion: Language Agents with Verbal Reinforcement Learning

    cs.AI 2023-03 conditional novelty 7.0

    Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.

  35. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  36. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  37. From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

    cs.AI 2026-05 unverdicted novelty 6.0

    A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.

  38. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 6.0

    SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.

  39. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  40. Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

    cs.MA 2026-05 unverdicted novelty 6.0

    Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.

  41. Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

    cs.LG 2026-05 unverdicted novelty 6.0

    GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.

  42. Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

  43. GeoDecider: A Coarse-to-Fine Agentic Workflow for Explainable Lithology Classification

    cs.AI 2026-05 unverdicted novelty 6.0

    GeoDecider introduces a coarse-to-fine agentic workflow using LLMs for explainable lithology classification from well logs, combining a base classifier, tool-augmented reasoning, and geological refinement to outperfor...

  44. FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

    cs.AI 2026-05 unverdicted novelty 6.0

    FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.

  45. Hallucinations Undermine Trust; Metacognition is a Way Forward

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.

  46. AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    AEM lifts entropy analysis to the response level and uses a derived uncertainty proxy to rescale advantages, enabling better exploration-exploitation balance and consistent gains over RL baselines on agent benchmarks.

  47. Language Models Refine Mechanical Linkage Designs Through Symbolic Reflection and Modular Optimisation

    cs.AI 2026-04 unverdicted novelty 6.0

    A modular LM-plus-optimizer system with symbolic abstraction reduces geometric error by up to 68% and improves structural validity by up to 134% over monolithic baselines across six motion targets.

  48. From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms

    cs.IR 2026-04 unverdicted novelty 6.0

    A measurement study of 602 prompts across ChatGPT, Google AI Overview, and Perplexity finds that citation selection breadth and absorption depth diverge, with high-influence pages being longer, structured, and evidence-rich.

  49. SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

    cs.CR 2026-04 unverdicted novelty 6.0

    SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.

  50. Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

    cs.CL 2026-04 unverdicted novelty 6.0

    Turkish speakers show a robust preference for -DI in high-trust contexts and -mIs in low-trust contexts, while LLMs exhibit inconsistent, often reversed, or base-rate-driven behavior.

  51. An AI Agent Execution Environment to Safeguard User Data

    cs.CR 2026-04 unverdicted novelty 6.0

    GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

  52. Human-Guided Harm Recovery for Computer Use Agents

    cs.AI 2026-04 conditional novelty 6.0

    Introduces harm recovery as a post-execution safeguard for computer-use agents, operationalized via a human-preference rubric, reward model, and BackBench benchmark that shows improved recovery trajectories.

  53. FUSE: Ensembling Verifiers with Zero Labeled Data

    stat.ML 2026-04 unverdicted novelty 6.0

    FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and...

  54. PARM: Pipeline-Adapted Reward Model

    cs.AI 2026-04 unverdicted novelty 6.0

    PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.

  55. Preregistered Belief Revision Contracts

    cs.AI 2026-04 unverdicted novelty 6.0

    PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.

  56. RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.

  57. MARCA: A Checklist-Based Benchmark for Multilingual Web Search

    cs.CL 2026-04 accept novelty 6.0

    MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.

  58. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.

  59. TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

    cs.CL 2026-04 unverdicted novelty 6.0

    TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 96 Pith papers

  1. [1]

    We included URLs in full, rather than using special _URL_ tokens

  2. [2]

    [deleted by user]

    We filtered out questions with the title “[deleted by user]”, and ignored the selftext “[deleted]” and “[removed]”. (The “selftext” is the body of the post.)

  3. [3]

    We concatenated the title and any non-empty selftext, separated by a double new line

  4. [4]

    Explain:

    We prepended “Explain: ” to questions that were not phrased as actual questions (e.g., we used “Explain: gravity” rather than simply “gravity”). The final step was performed because there is sometimes an implicit “Explain Like I’m Five” at the start of questions. We considered a question to be phrased as an actual question if it included either a question ...
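    The preprocessing steps in references [2]–[4] — filtering deleted posts, concatenating title and selftext, and prepending "Explain: " to non-questions — can be sketched as a single function. The question-word list and function name are assumptions for illustration, not the paper's exact heuristic:

    ```python
    def preprocess_eli5(title: str, selftext: str):
        """Sketch of the ELI5 question preprocessing described above."""
        # Filter out questions whose title was deleted.
        if title == "[deleted by user]":
            return None
        # Ignore deleted or removed selftext (the body of the post).
        if selftext in ("[deleted]", "[removed]"):
            selftext = ""
        # Concatenate the title and any non-empty selftext, separated by a double newline.
        question = title + "\n\n" + selftext if selftext else title
        # Prepend "Explain: " when the text is not phrased as an actual question.
        # The paper checks for a question mark or question phrasing; this word
        # list is illustrative, not the paper's exact check.
        question_words = {"who", "what", "when", "where", "why", "how",
                          "which", "can", "do", "does", "is", "are"}
        words = question.split()
        first_word = words[0].lower().rstrip(":,?") if words else ""
        if "?" not in question and first_word not in question_words:
            question = "Explain: " + question
        return question
    ```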

  5. [5]

    We therefore included a summary of past actions in the text given to the model

    Unlike humans, the model has no memory of previous steps. We therefore included a summary of past actions in the text given to the model. However, we felt that it was unnecessary to display this to humans
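    Reference [5]'s point — the model has no memory of previous steps, so a summary of past actions is included in its text input — can be sketched as a context builder. The observation format and truncation limit here are assumptions:

    ```python
    def build_observation(question, past_actions, page_text, max_summary=10):
        """Sketch of the text given to the model at each step: the question,
        a summary of past actions (since the model has no memory), and the
        current page. The exact format is an assumption."""
        summary = "\n".join(past_actions[-max_summary:])  # most recent actions only
        return (
            f"Question: {question}\n\n"
            f"Past actions:\n{summary}\n\n"
            f"Current page:\n{page_text}"
        )
    ```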

  6. [6]

    We therefore made these actions unavailable to humans, and instead simply merged any repeated Scrolled <up, down> 1 actions that they made

    The Scrolled <up, down> <2, 3> actions are useful for reducing the number of actions taken, but humans are used to scrolling one step at a time. We therefore made these actions unavailable to humans, and instead simply merged any repeated Scrolled <up, down> 1 actions that they made. The full instruction document we provided to contractors for demonstratio...
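    The merging of repeated single-step scrolls in reference [6] is a simple run-length merge. A minimal sketch, assuming action strings of the form `Scrolled <dir> <n>` (the exact action serialization is an assumption):

    ```python
    def merge_scrolls(actions):
        """Merge runs of repeated 'Scrolled <dir> 1' actions into a single
        'Scrolled <dir> <n>' action, as done for human demonstrations."""
        merged = []
        for action in actions:
            if (merged
                    and action in ("Scrolled up 1", "Scrolled down 1")
                    # Previous merged action scrolls in the same direction?
                    and merged[-1].startswith(action.rsplit(" ", 1)[0])):
                prefix, count = merged[-1].rsplit(" ", 1)
                merged[-1] = f"{prefix} {int(count) + 1}"
            else:
                merged.append(action)
        return merged
    ```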

  7. [7]

    Read the question, and flag if it does not make sense or should not be answered (in which case the rest of the comparison is skipped)

  8. [8]

    Read the first answer and its references

  9. [9]

    Rate the trustworthiness of any references relied upon by the answer

  10. [10]

    A screenshot of the annotation tool is shown in Figure 9

    Annotate each of the claims in the answer with the level of support it has and its relevance to the question. A screenshot of the annotation tool is shown in Figure 9

  11. [11]

    Repeat steps 2–4 for the second answer and its references

  12. [12]

    Give comparison ratings for the amount of unsupported and irrelevant information, the usefulness of information with different levels of support, and coherence

  13. [13]

    A much better

    Weighing everything up, give a final comparison rating for overall usefulness. Figure 9: Screenshot from the comparison interface, showing the annotation tool. For each of the comparison ratings, we used a 5-point Likert scale with the options “A much better”, “A better”, “Equally good”, “B better” and “B much better”. Importantly, we did not require co...
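    The 5-point comparison scale above feeds reward-model training. A minimal sketch of turning a Likert rating into a preference label and a comparison loss, assuming the standard sigmoid cross-entropy objective on score differences (the paper's exact mapping and objective are not reproduced here):

    ```python
    import math

    def likert_to_label(rating):
        """Map the 5-point scale to a preference label: 1.0 if A preferred,
        0.0 if B preferred, 0.5 for a tie. The mapping is an assumption."""
        mapping = {
            "A much better": 1.0,
            "A better": 1.0,
            "Equally good": 0.5,
            "B better": 0.0,
            "B much better": 0.0,
        }
        return mapping[rating]

    def preference_loss(score_a, score_b, label):
        """Cross-entropy between the label and sigmoid(score_a - score_b),
        the standard comparison loss for reward models (a sketch)."""
        p = 1.0 / (1.0 + math.exp(-(score_a - score_b)))
        return -(label * math.log(p) + (1 - label) * math.log(1 - p))
    ```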

  14. [14]

    It was clear from the instructions what I was supposed to do

  15. [15]

    I found the task enjoyable and engaging

  16. [16]

    I found the task repetitive

  17. [17]

    I was paid fairly for doing the task

  18. [18]

    encouraged

    Overall, I am glad that I did this task. • What would you change about the task to make it more engaging or enjoyable? (Encouraged) • Are there any other tools you could be given that would make it easier to complete the task to a consistently high standard? (Encouraged) • Did you come up with any shortcuts that you used to do the task more quickly, and i...

  19. [19]

    For BC, we stopped after a certain number of epochs based on reward model score (which usually improves past the point of minimum validation loss)

  20. [20]

    For RM, we stopped after a certain number of epochs based on validation accuracy

  21. [21]

    EMA decay

    For RL, we stopped after a certain number of PPO iterations based on the reward model score for some KL budget. The KL here is measured from the BC model, and summed over the episode. For the 175B model, we compared a couple of different KL budgets using human evaluations, and for the 760M and 13B models, we chose KL budgets informed by the 175B evaluatio...
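    The KL in reference [21] — measured from the BC model and summed over the episode — can be sketched with a single-sample estimator from the log-probs of the actions actually taken. This estimator and the budget check are assumptions; the paper does not spell out its exact computation:

    ```python
    def episode_kl(policy_logprobs, bc_logprobs):
        """Per-episode KL of the RL policy from the BC model, summed over
        tokens, using the single-sample estimate log pi(a) - log pi_BC(a)."""
        return sum(p - b for p, b in zip(policy_logprobs, bc_logprobs))

    def within_kl_budget(episodes, budget):
        """Check whether the mean episode KL is still under the KL budget
        (episodes is a list of (policy_logprobs, bc_logprobs) pairs)."""
        mean_kl = sum(episode_kl(p, b) for p, b in episodes) / len(episodes)
        return mean_kl <= budget
    ```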

  22. [22]

    Words and their meanings change over time

    Why Are Some Words ‘Bad’? | Vermont Public Radio (www.vpr.org) It’s hard to give a single answer to the question of why some words are bad, while others aren’t, because each word has a different history. Words and their meanings change over time. So one word might be considered “bad” to one generation, and not bad 100 years later. In addition, words c...

  23. [23]

    bad" words fall into: “Words in general that are considered bad tend to relate to parts of our lives that we don’t like talking about in public, like bathroom functions,

    Why Are Some Words ‘Bad’? | Vermont Public Radio (www.vpr.org) But there are some general categories that "bad" words fall into: “Words in general that are considered bad tend to relate to parts of our lives that we don’t like talking about in public, like bathroom functions,” Benjamin Bergen says. Other words that are often considered bad relate to neg...

  24. [24]

    bad word

    On Words: ‘Bad’ Words and Why We Should Study Them | UVA Today (news.virginia.edu) We also use the term “bad word” to pick out terms that are sanctioned simply because of what they refer to: taboo human acts, impolite biological processes and items that people find disgusting

  25. [25]

    As a general rule, swear words originate from taboo subjects

    The Science of Curse Words: Why The &@$! Do We Swear? (www.babbel.com) For a word to qualify as a swear word it must have the potential to offend — crossing a cultural line into taboo territory. As a general rule, swear words originate from taboo subjects. This is pretty logical. The topic is off-limits, so the related words aren’t meant to be spoken ...

  26. [26]

    language that deprives a person of human qualities or attributes

    All Of These Words Are Offensive (But Only Sometimes) (www.dictionary.com) So, where’s the problem? Ape and monkey are considered offensive terms when they’re used to describe a person of color. It’s what is known as dehumanizing language, “language that deprives a person of human qualities or attributes.” Exactly when the words became slurs is unknown, bu...

  27. [27]

    fierce, ferocious, or cruel; uncivilized; barbarous

    All Of These Words Are Offensive (But Only Sometimes) (www.dictionary.com) The word savage has taken a circuitous path through the lexicon over the years, first showing up in English in the 1200s from Middle English. As an adjective, it’s typically meant “fierce, ferocious, or cruel; uncivilized; barbarous.” When referring to a savage lion ripping an antelo...