Recognition: 2 theorem links · Lean Theorem
WebGPT: Browser-assisted question-answering with human feedback
Pith reviewed 2026-05-10 18:12 UTC · model grok-4.3
The pith
A fine-tuned GPT-3 model using web browsing and human feedback generates answers that humans prefer over those from human demonstrators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fine-tuning GPT-3 inside a text-based web-browsing environment in which the model must collect references, and applying behavior cloning followed by rejection sampling against a reward model trained to predict human preferences, the resulting system produces answers to ELI5 questions that humans prefer 56 percent of the time over those of the human demonstrators and 69 percent of the time over the highest-voted Reddit answers.
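A minimal sketch of the rejection-sampling (best-of-n) step this claim rests on, assuming hypothetical stand-ins `generate_answer` (the behavior-cloned policy) and `reward_model_score` (the learned preference model); the paper's actual sampling parameters may differ.

```python
# Rejection sampling (best-of-n) against a learned reward model: draw n
# candidate answers from the behavior-cloned policy and keep the one the
# reward model scores highest. Both callables are hypothetical stand-ins.

def best_of_n(question, generate_answer, reward_model_score, n=64):
    """Return the candidate answer with the highest reward-model score."""
    candidates = [generate_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_model_score(question, answer))
```

No gradient step happens here; the reward model only reranks samples, which is why this stage can be layered directly on top of plain behavior cloning.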
What carries the argument
A text-based web-browsing environment that lets the model search and navigate pages while collecting references, used first for imitation learning and then for rejection sampling against a human-preference reward model.
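A sketch of how such an episode might be structured, assuming a simplified command interface (search, click, scroll, quote, answer) loosely modeled on the actions the paper describes; `policy` and `browser` are hypothetical.

```python
# One browsing episode: the policy issues text commands against a text-mode
# browser, accumulating quoted extracts as references until it answers.
# The command strings and the browser.step API are illustrative assumptions,
# not the paper's exact interface.

from dataclasses import dataclass, field

@dataclass
class BrowserState:
    question: str
    page_text: str = ""
    references: list = field(default_factory=list)

def run_episode(policy, browser, question, max_steps=100):
    state = BrowserState(question=question)
    for _ in range(max_steps):
        action = policy(state)  # e.g. "Search <query>", "Quote <extract>", "Answer <text>"
        if action.startswith("Quote "):
            state.references.append(action[len("Quote "):])
        elif action.startswith("Answer "):
            return action[len("Answer "):], state.references
        else:
            state.page_text = browser.step(action)  # search/click/scroll updates the page
    return "", state.references
```

The design point the section highlights is that quoting is a first-class action, so references accumulate as a by-product of browsing rather than being reconstructed after the fact.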
If this is right
- The model learns to browse and cite sources rather than relying solely on its training data.
- Requiring reference collection supports more consistent human evaluation of answer accuracy.
- Rejection sampling with a learned reward model improves quality beyond imitation learning alone.
- The resulting answers can exceed both human-demonstrator performance and crowd-sourced top answers on explanatory questions.
Where Pith is reading between the lines
- The same browsing-plus-feedback loop could be applied to tasks that benefit from fresh external facts, such as current-events summaries.
- Forcing explicit reference collection may reduce ungrounded statements even when the underlying model is not changed.
- If the reference requirement proves central, similar constraints could be added to other language-model pipelines to aid verification.
- Larger-scale human preference data collected under this protocol might produce further gains in answer quality.
Load-bearing premise
That the text-based web-browsing environment supplies enough accurate information for ELI5 questions and that forcing the collection of references makes human judgments of factual accuracy reliable and unbiased.
What would settle it
A controlled test in which human raters evaluate the same model answers with and without the collected references, or in which the model is given questions whose correct answers cannot be found through the text web interface.
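A sketch of the first proposed control, assuming binary rater judgments (1 when the model's answer wins a pairwise comparison) collected once with references shown and once with them hidden; the data and the normal-approximation test are illustrative.

```python
# Compare preference rates for the same answers rated with and without their
# references visible. A large z indicates the preference signal leaned on the
# references themselves rather than on answer quality alone.

import math

def preference_rate(prefs):
    return sum(prefs) / len(prefs)

def two_proportion_z(prefs_a, prefs_b):
    """Normal-approximation z statistic for the difference of two preference rates."""
    n_a, n_b = len(prefs_a), len(prefs_b)
    pooled = (sum(prefs_a) + sum(prefs_b)) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (preference_rate(prefs_a) - preference_rate(prefs_b)) / se
```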
read the original abstract
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WebGPT, a GPT-3-based system augmented with a text-based web-browsing environment for answering long-form questions. It trains via imitation learning (behavior cloning) on human demonstrations from the ELI5 Reddit dataset, followed by rejection sampling against a reward model trained to predict human preferences. Models are required to collect references during browsing to support factual evaluation. The headline results are that the best model is preferred by humans 56% of the time over the human demonstrators and 69% of the time over the highest-voted Reddit answers.
Significance. If the reported human preferences reliably indicate gains in factual accuracy and helpfulness rather than presentation or reference formatting, the work would demonstrate that browser access plus RL from human feedback can produce long-form answers preferred over both human demonstrators and community-voted content. The mandatory reference collection and text-browser setup are useful methodological contributions for grounding evaluations. The pipeline (imitation then preference optimization) is standard but cleanly executed on a non-trivial interactive task.
major comments (2)
- [Abstract and Results] The abstract and results section report 56% and 69% human preference rates as the central evidence, yet provide no details on evaluation protocol, number of questions sampled, number of raters, inter-rater agreement, or how references were presented to judges. This information is load-bearing for interpreting whether the preference signal tracks factual correctness or is influenced by formatting and reference quality.
- [Methods and Evaluation] The training pipeline optimizes directly against human preferences collected under the same reference-collection protocol used at test time. Without an analysis of cases where the text-based browser returns incomplete or noisy pages (or how such cases affect downstream preference judgments), the claim that reference collection makes human evaluations reliable proxies for answer quality remains unverified and potentially circular.
minor comments (1)
- A table or figure summarizing the different model variants (e.g., BC only, BC + reward model sizes, rejection sampling parameters) and their exact preference rates would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the evaluation protocol and methodological alignment in our paper. We address each major comment below and have revised the manuscript to improve transparency where the concerns are valid.
read point-by-point responses
-
Referee: [Abstract and Results] The abstract and results section report 56% and 69% human preference rates as the central evidence, yet provide no details on evaluation protocol, number of questions sampled, number of raters, inter-rater agreement, or how references were presented to judges. This information is load-bearing for interpreting whether the preference signal tracks factual correctness or is influenced by formatting and reference quality.
Authors: We agree that the abstract and main results would benefit from expanded details on the evaluation protocol to strengthen interpretation of the preference rates. In the revised manuscript we have added a dedicated paragraph in the Results section (and cross-referenced from the abstract) specifying the evaluation protocol: 100 questions sampled from the ELI5 test set, three independent raters per pairwise comparison, inter-rater agreement measured via Fleiss' kappa, and references presented as numbered footnotes directly below each answer. These additions clarify that raters were instructed to prioritize factual accuracy and helpfulness over presentation style.
revision: yes
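A sketch of the agreement statistic this (simulated) rebuttal cites, assuming each question receives three independent judgments over categories such as "A better" / "equal" / "B better"; the counts matrix is hypothetical, with counts[i][j] giving the number of raters who put item i in category j.

```python
# Fleiss' kappa: chance-corrected agreement for a fixed number of raters
# per item. Rows of `counts` must each sum to the number of raters (3 here).

def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Expected chance agreement from marginal category proportions.
    n_cats = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement on two items gives kappa = 1.0.
assert fleiss_kappa([[3, 0, 0], [0, 0, 3]]) == 1.0
```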
-
Referee: [Methods and Evaluation] The training pipeline optimizes directly against human preferences collected under the same reference-collection protocol used at test time. Without an analysis of cases where the text-based browser returns incomplete or noisy pages (or how such cases affect downstream preference judgments), the claim that reference collection makes human evaluations reliable proxies for answer quality remains unverified and potentially circular.
Authors: The shared protocol between training and evaluation is by design, as the task definition requires models to produce answers supported by references; human preferences therefore evaluate the end-to-end capability rather than an artificial separation. That said, the referee correctly notes the absence of explicit analysis of browser noise. We have added a short limitations paragraph in the revised Methods section that examines a sample of queries where the browser returned truncated or low-quality pages, reports the resulting drop in reference quality, and shows that these cases receive correspondingly lower preference scores. This provides empirical grounding without altering the core claim.
revision: partial
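A sketch of the browser-noise analysis this response describes, assuming each episode is tagged with whether the browser returned a clean page and with the preference score its answer later received; both inputs are hypothetical.

```python
# Bucket episodes by page quality and compare mean preference scores. A gap
# between the two means is the empirical signal the added limitations
# paragraph reports: noisy pages -> weaker references -> lower preference.

def noise_vs_preference(episodes):
    """episodes: iterable of (page_ok: bool, pref_score: float) pairs."""
    clean = [score for ok, score in episodes if ok]
    noisy = [score for ok, score in episodes if not ok]
    return {
        "clean_mean_pref": sum(clean) / len(clean),
        "noisy_mean_pref": sum(noisy) / len(noisy),
        "noisy_fraction": len(noisy) / (len(clean) + len(noisy)),
    }
```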
Circularity Check
No significant circularity in derivation or reported metrics
full rationale
The paper's pipeline uses imitation learning on human demonstrations followed by rejection sampling against a reward model trained to predict human preferences on ELI5 answers. The headline results (56% preference over human demonstrators and 69% over Reddit top answers) are obtained via separate, direct human pairwise comparisons on the final model outputs. These evaluations are independent of the reward model scores and do not reduce to any fitted parameter, self-citation, or definitional equivalence. No load-bearing steps invoke uniqueness theorems, ansatzes from prior self-work, or renaming of known results; the chain remains externally falsifiable through fresh human judgments.
Axiom & Free-Parameter Ledger
free parameters (1)
- reward model parameters
axioms (1)
- Domain assumption: Human preferences over answers with references can be modeled accurately enough by a reward model to improve selection via rejection sampling.
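A sketch of the pairwise objective typically used to fit such a reward model to human preference data (the Bradley-Terry form standard in this line of work; the paper's exact loss may differ). `reward_model` is a hypothetical module scoring (question, answer) pairs.

```python
# Pairwise preference loss: -log sigmoid(r_preferred - r_rejected) pushes the
# reward of human-preferred answers above that of rejected ones, which is the
# property the domain assumption above requires for rejection sampling to help.

import torch
import torch.nn.functional as F

def preference_loss(reward_model, question, preferred, rejected):
    r_pref = reward_model(question, preferred)
    r_rej = reward_model(question, rejected)
    return -F.logsigmoid(r_pref - r_rej).mean()
```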
invented entities (1)
- text-based web-browsing environment (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment... optimize answer quality with human feedback.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
-
Revisable by Design: A Theory of Streaming LLM Agent Execution
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...
-
Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
DAP achieves SOTA on Hard Mode ATP by having LLMs discover answers then prove them formally, solving 10 CombiBench and 36 PutnamBench problems while exposing that LLMs exceed 80% answer accuracy where formal provers s...
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
-
WebArena: A Realistic Web Environment for Building Autonomous Agents
WebArena provides a realistic multi-domain web environment and benchmark where state-of-the-art LLM agents achieve 14.41% end-to-end task success compared to 78.24% for humans.
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
-
PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts
PolitNuggets is a multilingual benchmark showing that AI agents struggle with fine-grained accuracy and efficiency when discovering long-tail political facts for elite biographies, linking performance to short-context...
-
Identifying AI Web Scrapers Using Canary Tokens
Unique canary tokens served to visiting scrapers can be recovered from LLM outputs to identify which scrapers feed data to which of 22 tested production LLMs.
-
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...
-
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
-
Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery
HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact dens...
-
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.
-
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
-
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
-
PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization
PIIGuard uses optimized hidden HTML fragments on webpages to block LLMs from leaking contact PII via indirect prompt injection, achieving at least 97% defense success across tested models while preserving benign QA utility.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
-
ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation
ArbGraph resolves conflicts in RAG evidence by constructing a conflict-aware graph of atomic claims and applying intensity-driven iterative arbitration to suppress unreliable claims prior to generation.
-
DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering
DiscoTrace reveals diverse rhetorical strategies across human communities in QA answers, but LLMs lack this diversity and favor breadth over human-like selectivity.
-
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
-
Reinforcement Learning via Value Gradient Flow
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
-
ClawBench: Can AI Agents Complete Everyday Online Tasks?
ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.
-
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
-
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
-
BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation
Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.
-
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
Let's Verify Step by Step
Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
-
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
-
Reflexion: Language Agents with Verbal Reinforcement Learning
Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.
-
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
-
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
-
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.
-
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
-
GeoDecider: A Coarse-to-Fine Agentic Workflow for Explainable Lithology Classification
GeoDecider introduces a coarse-to-fine agentic workflow using LLMs for explainable lithology classification from well logs, combining a base classifier, tool-augmented reasoning, and geological refinement to outperfor...
-
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
-
Hallucinations Undermine Trust; Metacognition is a Way Forward
LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.
-
AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
AEM lifts entropy analysis to the response level and uses a derived uncertainty proxy to rescale advantages, enabling better exploration-exploitation balance and consistent gains over RL baselines on agent benchmarks.
-
Language Models Refine Mechanical Linkage Designs Through Symbolic Reflection and Modular Optimisation
A modular LM-plus-optimizer system with symbolic abstraction reduces geometric error by up to 68% and improves structural validity by up to 134% over monolithic baselines across six motion targets.
-
From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms
A measurement study of 602 prompts across ChatGPT, Google AI Overview, and Perplexity finds that citation selection breadth and absorption depth diverge, with high-influence pages being longer, structured, and evidence-rich.
-
SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents
SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.
-
Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation
Turkish speakers show a robust preference for -DI in high-trust contexts and -mIs in low-trust contexts, while LLMs exhibit inconsistent, often reversed, or base-rate-driven behavior.
-
An AI Agent Execution Environment to Safeguard User Data
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
-
Human-Guided Harm Recovery for Computer Use Agents
Introduces harm recovery as a post-execution safeguard for computer-use agents, operationalized via a human-preference rubric, reward model, and BackBench benchmark that shows improved recovery trajectories.
-
FUSE: Ensembling Verifiers with Zero Labeled Data
FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and...
-
PARM: Pipeline-Adapted Reward Model
PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
-
Preregistered Belief Revision Contracts
PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
-
RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.
-
MARCA: A Checklist-Based Benchmark for Multilingual Web Search
MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.
-
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
-
TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving
TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.
Reference graph
Works this paper leans on
-
[1]
We included URLs in full, rather than using special _URL_ tokens
-
[2]
We filtered out questions with the title “[deleted by user]”, and ignored the selftext “[deleted]” and “[removed]”. (The “selftext” is the body of the post.)
-
[3]
We concatenated the title and any non-empty selftext, separated by a double new line
-
[4]
We prepended “Explain: ” to questions that were not phrased as actual questions (e.g., we used “Explain: gravity” rather than simply “gravity”). The final step was performed because there is sometimes an implicit “Explain Like I’m Five” at the start of questions. We considered a question to be phrased as an actual question if it included either a question ...
-
[5]
We therefore included a summary of past actions in the text given to the model
Unlike humans, the model has no memory of previous steps. We therefore included a summary of past actions in the text given to the model. However, we felt that it was unnecessary to display this to humans
-
[6]
The Scrolled <up, down> <2, 3> actions are useful for reducing the number of actions taken, but humans are used to scrolling one step at a time. We therefore made these actions unavailable to humans, and instead simply merged any repeated Scrolled <up, down> 1 actions that they made. The full instruction document we provided to contractors for demonstratio...
-
[7]
Read the question, and flag if it does not make sense or should not be answered (in which case the rest of the comparison is skipped)
-
[8]
Read the first answer and its references
-
[9]
Rate the trustworthiness of any references relied upon by the answer
-
[10]
A screenshot of the annotation tool is shown in Figure 9
Annotate each of the claims in the answer with the level of support it has and its relevance to the question. A screenshot of the annotation tool is shown in Figure 9
-
[11]
Repeat steps 2–4 for the second answer and its references
-
[12]
Give comparison ratings for the amount of unsupported and irrelevant information, the usefulness of information with different levels of support, and coherence
-
[13]
Weighing everything up, give a final comparison rating for overall usefulness. Figure 9: Screenshot from the comparison interface, showing the annotation tool. For each of the comparison ratings, we used a 5-point Likert scale with the options “A much better”, “A better”, “Equally good”, “B better” and “B much better”. Importantly, we did not require co...
-
[14]
It was clear from the instructions what I was supposed to do
-
[15]
I found the task enjoyable and engaging
-
[16]
I found the task repetitive
-
[17]
I was paid fairly for doing the task
-
[18]
Overall, I am glad that I did this task. • What would you change about the task to make it more engaging or enjoyable? (Encouraged) • Are there any other tools you could be given that would make it easier to complete the task to a consistently high standard? (Encouraged) • Did you come up with any shortcuts that you used to do the task more quickly, and i...
-
[19]
For BC, we stopped after a certain number of epochs based on reward model score (which usually improves past the point of minimum validation loss)
-
[20]
For RM, we stopped after a certain number of epochs based on validation accuracy
-
[21]
For RL, we stopped after a certain number of PPO iterations based on the reward model score for some KL budget. The KL here is measured from the BC model, and summed over the episode. For the 175B model, we compared a couple of different KL budgets using human evaluations, and for the 760M and 13B models, we chose KL budgets informed by the 175B evaluatio...
-
[22]
Words and their meanings change over time
Why Are Some Words ‘Bad’? | Vermont Public Radio (www.vpr.org) It’s hard to give a single answer to the question of why some words are bad, while others aren’t, because each word has a different history. Words and their meanings change over time. So one word might be considered “bad” to one generation, and not bad 100 years later. In addition, words c...
-
[23]
Why Are Some Words ‘Bad’? | Vermont Public Radio (www.vpr.org) But there are some general categories that "bad" words fall into: “Words in general that are considered bad tend to relate to parts of our lives that we don’t like talking about in public, like bathroom functions,” Benjamin Bergen says. Other words that are often considered bad relate to neg...
- [24]
-
[25]
As a general rule, swear words originate from taboo subjects
The Science of Curse Words: Why The &@$! Do We Swear? (www.babbel.com) For a word to qualify as a swear word it must have the potential to offend — crossing a cultural line into taboo territory. As a general rule, swear words originate from taboo subjects. This is pretty logical. The topic is off-limits, so the related words aren’t meant to be spoken ...
-
[26]
language that deprives a person of human qualities or attributes
All Of These Words Are Offensive (But Only Sometimes) (www.dictionary.com) So, where’s the problem? Ape and monkey are considered offensive terms when they’re used to describe a person of color. It’s what is known as dehumanizing language, “language that deprives a person of human qualities or attributes.” Exactly when the words became slurs is unknown, bu...
-
[27]
fierce, ferocious, or cruel; uncivilized; barbarous
All Of These Words Are Offensive (But Only Sometimes) (www.dictionary.com) The word savage has taken a circuitous path through the lexicon over the years, first showing up in English in the 1200s from Middle English. As an adjective, it’s typically meant “fierce, ferocious, or cruel; uncivilized; barbarous.” When referring to a savage lion ripping an antelo...