WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim · 2021 · cs.CL · arXiv 2112.09332

89 Pith papers cite this work. Polarity classification is still indexing.

abstract

We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
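
The rejection-sampling step in the abstract can be illustrated with a minimal best-of-n sketch: draw several candidate answers from a policy, score each with a reward model, and keep the highest-scoring one. Everything here is a stand-in for illustration, not the paper's implementation. `toy_policy`, `toy_reward`, and the `CANNED` answers are hypothetical, and the toy reward simply counts bracketed reference markers rather than predicting human preferences.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical canned answers standing in for samples from a fine-tuned model.
CANNED = [
    "Because of gravity.",
    "Tides come from the Moon's gravity [1].",
    "Tides come from the Moon's and Sun's gravity [1][2].",
]


def toy_policy(question: str, rng: random.Random) -> str:
    """Stand-in for sampling one answer from a behavior-cloned model."""
    return rng.choice(CANNED)


def toy_reward(question: str, answer: str) -> float:
    """Stand-in reward model: counts reference markers like [1].

    A real reward model would be trained on human preference comparisons.
    """
    return float(answer.count("["))


def best_of_n(
    question: str,
    sample_answer: Callable[[str, random.Random], str],
    reward: Callable[[str, str], float],
    n: int = 4,
    seed: int = 0,
) -> Tuple[str, List[str]]:
    """Sample n candidates and return the highest-reward one plus all candidates."""
    rng = random.Random(seed)
    candidates = [sample_answer(question, rng) for _ in range(n)]
    best = max(candidates, key=lambda a: reward(question, a))
    return best, candidates


if __name__ == "__main__":
    best, candidates = best_of_n("Why are there tides?", toy_policy, toy_reward, n=8)
    print(f"sampled {len(candidates)} candidates; kept: {best}")
```

Note that this selection happens at inference time and leaves the policy's weights unchanged, at the cost of generating n answers per question.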

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

Revisable by Design: A Theory of Streaming LLM Agent Execution

cs.LG · 2026-04-25 · unverdicted · novelty 8.0

LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.

Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

cs.AI · 2026-04-17 · conditional · novelty 8.0

DAP achieves SOTA on Hard Mode ATP by having LLMs discover answers then prove them formally, solving 10 CombiBench and 36 PutnamBench problems while exposing that LLMs exceed 80% answer accuracy where formal provers stay under 10%.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

WebArena: A Realistic Web Environment for Building Autonomous Agents

cs.AI · 2023-07-25 · accept · novelty 8.0

WebArena provides a realistic multi-domain web environment and benchmark where state-of-the-art LLM agents achieve 14.41% end-to-end task success compared to 78.24% for humans.

Identifying AI Web Scrapers Using Canary Tokens

cs.CR · 2026-05-13 · conditional · novelty 7.0

Unique canary tokens served to visiting scrapers can be recovered from LLM outputs to identify which scrapers feed data to which of 22 tested production LLMs.

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.

Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact density and completeness.

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

cs.LG · 2026-05-07 · conditional · novelty 7.0

A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.

PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization

cs.CR · 2026-05-04 · conditional · novelty 7.0

PIIGuard uses optimized hidden HTML fragments on webpages to block LLMs from leaking contact PII via indirect prompt injection, achieving at least 97% defense success across tested models while preserving benign QA utility.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

ArbGraph resolves conflicts in RAG evidence by constructing a conflict-aware graph of atomic claims and applying intensity-driven iterative arbitration to suppress unreliable claims prior to generation.

DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering

cs.CL · 2026-04-16 · unverdicted · novelty 7.0

DiscoTrace reveals diverse rhetorical strategies across human communities in QA answers, but LLMs lack this diversity and favor breadth over human-like selectivity.

Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

Reinforcement Learning via Value Gradient Flow

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

ClawBench: Can AI Agents Complete Everyday Online Tasks?

cs.CL · 2026-04-09 · unverdicted · novelty 7.0

ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.

GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

cs.CL · 2026-04-05 · unverdicted · novelty 7.0

GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

cs.DL · 2026-04-03 · conditional · novelty 7.0

Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.

Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

cs.LG · 2026-03-30 · unverdicted · novelty 7.0

Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.

citing papers explorer

Showing 50 of 89 citing papers.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence cs.CL · 2026-05-13 · accept · none · ref 27 · internal anchor
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
Revisable by Design: A Theory of Streaming LLM Agent Execution cs.LG · 2026-04-25 · unverdicted · none · ref 4 · internal anchor
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.
Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4 cs.AI · 2026-04-17 · conditional · none · ref 3 · internal anchor
DAP achieves SOTA on Hard Mode ATP by having LLMs discover answers then prove them formally, solving 10 CombiBench and 36 PutnamBench problems while exposing that LLMs exceed 80% answer accuracy where formal provers stay under 10%.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents cs.CR · 2024-06-19 · unverdicted · none · ref 40 · internal anchor
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments cs.AI · 2024-04-11 · accept · none · ref 38 · internal anchor
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines cs.CL · 2023-10-05 · conditional · none · ref 36 · internal anchor
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
WebArena: A Realistic Web Environment for Building Autonomous Agents cs.AI · 2023-07-25 · accept · none · ref 1 · internal anchor
WebArena provides a realistic multi-domain web environment and benchmark where state-of-the-art LLM agents achieve 14.41% end-to-end task success compared to 78.24% for humans.
Identifying AI Web Scrapers Using Canary Tokens cs.CR · 2026-05-13 · conditional · none · ref 45 · internal anchor
Unique canary tokens served to visiting scrapers can be recovered from LLM outputs to identify which scrapers feed data to which of 22 tested production LLMs.
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment cs.LG · 2026-05-13 · unverdicted · none · ref 11 · internal anchor
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.
Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery cs.AI · 2026-05-11 · unverdicted · none · ref 26 · internal anchor
HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact density and completeness.
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs cs.CL · 2026-05-08 · unverdicted · none · ref 27 · internal anchor
RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL cs.LG · 2026-05-07 · conditional · none · ref 32 · internal anchor
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization cs.CR · 2026-05-04 · conditional · none · ref 10 · internal anchor
PIIGuard uses optimized hidden HTML fragments on webpages to block LLMs from leaking contact PII via indirect prompt injection, achieving at least 97% defense success across tested models while preserving benign QA utility.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 11 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation cs.CL · 2026-04-21 · unverdicted · none · ref 42 · internal anchor
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation cs.CL · 2026-04-20 · unverdicted · none · ref 23 · internal anchor
ArbGraph resolves conflicts in RAG evidence by constructing a conflict-aware graph of atomic claims and applying intensity-driven iterative arbitration to suppress unreliable claims prior to generation.
DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering cs.CL · 2026-04-16 · unverdicted · none · ref 1 · internal anchor
DiscoTrace reveals diverse rhetorical strategies across human communities in QA answers, but LLMs lack this diversity and favor breadth over human-like selectivity.
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis cs.LG · 2026-04-16 · unverdicted · none · ref 9 · internal anchor
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
Reinforcement Learning via Value Gradient Flow cs.LG · 2026-04-15 · unverdicted · none · ref 48 · internal anchor
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
ClawBench: Can AI Agents Complete Everyday Online Tasks? cs.CL · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web cs.CV · 2026-04-09 · unverdicted · none · ref 3 · internal anchor
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces cs.CL · 2026-04-05 · unverdicted · none · ref 32 · internal anchor
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation cs.DL · 2026-04-03 · conditional · none · ref 16 · internal anchor
Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback cs.LG · 2026-03-30 · unverdicted · none · ref 9 · internal anchor
Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 59 · 2 links · internal anchor
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Let's Verify Step by Step cs.LG · 2023-05-31 · accept · none · ref 12 · internal anchor
Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency cs.AI · 2023-04-22 · accept · none · ref 59 · internal anchor
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
Reflexion: Language Agents with Verbal Reinforcement Learning cs.AI · 2023-03-20 · conditional · none · ref 17 · internal anchor
Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
A Generalist Agent cs.AI · 2022-05-12 · accept · none · ref 41 · internal anchor
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 271 · internal anchor
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World cs.AI · 2026-05-11 · unverdicted · none · ref 22 · internal anchor
A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks cs.AI · 2026-05-09 · unverdicted · none · ref 22 · internal anchor
SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems cs.AI · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and outperforms static baselines on GAIA, HLE, and DeepResearcher.
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents cs.MA · 2026-05-09 · unverdicted · none · ref 84 · internal anchor
Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph cs.LG · 2026-05-08 · unverdicted · none · ref 39 · internal anchor
GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation cs.LG · 2026-05-06 · unverdicted · none · ref 189 · internal anchor
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
GeoDecider: A Coarse-to-Fine Agentic Workflow for Explainable Lithology Classification cs.AI · 2026-05-05 · unverdicted · none · ref 41 · internal anchor
GeoDecider introduces a coarse-to-fine agentic workflow using LLMs for explainable lithology classification from well logs, combining a base classifier, tool-augmented reasoning, and geological refinement to outperform baselines on benchmarks.
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval cs.AI · 2026-05-04 · unverdicted · none · ref 27 · internal anchor
FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
Hallucinations Undermine Trust; Metacognition is a Way Forward cs.CL · 2026-05-02 · unverdicted · none · ref 28 · internal anchor
LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.
Language Models Refine Mechanical Linkage Designs Through Symbolic Reflection and Modular Optimisation cs.AI · 2026-04-30 · unverdicted · none · ref 12 · internal anchor
A modular LM-plus-optimizer system with symbolic abstraction reduces geometric error by up to 68% and improves structural validity by up to 134% over monolithic baselines across six motion targets.
From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms cs.IR · 2026-04-28 · unverdicted · none · ref 19 · internal anchor
A measurement study of 602 prompts across ChatGPT, Google AI Overview, and Perplexity finds that citation selection breadth and absorption depth diverge, with high-influence pages being longer, structured, and evidence-rich.
SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents cs.CR · 2026-04-28 · unverdicted · none · ref 30 · internal anchor
SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.
Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation cs.CL · 2026-04-27 · unverdicted · none · ref 16 · internal anchor
Turkish speakers show a robust preference for -DI in high-trust contexts and -mIs in low-trust contexts, while LLMs exhibit inconsistent, often reversed, or base-rate-driven behavior.
An AI Agent Execution Environment to Safeguard User Data cs.CR · 2026-04-21 · unverdicted · none · ref 48 · internal anchor
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.
Human-Guided Harm Recovery for Computer Use Agents cs.AI · 2026-04-20 · conditional · none · ref 22 · internal anchor
Introduces harm recovery as a post-execution safeguard for computer-use agents, operationalized via a human-preference rubric, reward model, and BackBench benchmark that shows improved recovery trajectories.
FUSE: Ensembling Verifiers with Zero Labeled Data stat.ML · 2026-04-20 · unverdicted · none · ref 8 · internal anchor
FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and Humanity's Last Exam.
PARM: Pipeline-Adapted Reward Model cs.AI · 2026-04-20 · unverdicted · none · ref 3 · internal anchor
PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
Preregistered Belief Revision Contracts cs.AI · 2026-04-16 · unverdicted · none · ref 32 · internal anchor
PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models cs.CV · 2026-04-16 · unverdicted · none · ref 22 · internal anchor
RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.
MARCA: A Checklist-Based Benchmark for Multilingual Web Search cs.CL · 2026-04-15 · accept · none · ref 17 · internal anchor
MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.
