CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
super hub Canonical reference
WebGPT: Browser-assisted question-answering with human feedback
Canonical reference. 92% of citing Pith papers cite this work as background.
abstract
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using
authors
co-cited works
representative citing papers
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.
DAP achieves SOTA on Hard Mode ATP by having LLMs discover answers then prove them formally, solving 10 CombiBench and 36 PutnamBench problems while exposing that LLMs exceed 80% answer accuracy where formal provers stay under 10%.
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
WebArena provides a realistic multi-domain web environment and benchmark where state-of-the-art LLM agents achieve 14.41% end-to-end task success compared to 78.24% for humans.
API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
Ko-WideSearch is a new Korean breadth-search benchmark spanning 16 categories and three difficulty tiers that evaluates web agents on full set membership plus per-item attributes, showing consistent gaps between set recovery and row completion.
RealClawBench turns 281 real OpenClaw sessions into reproducible tasks that preserve the original distribution and shows the best of 14 models solves only 65.8 percent.
PixelRAG shows that operating RAG entirely over web screenshots outperforms text-based retrieval on NQ, SimpleQA, MMSearch, LiveVQA, and MoNaCo, with up to 18.1% accuracy gains and 3x token savings via image compression.
Introduces the 'innovation' property of LLMs and proves it is an almost characterization of hallucination while deriving new lower bounds on hallucination rates via missing mass.
No confidence-gated RL policy can achieve maximum helpfulness, optimal calibration, and full autonomy under rational oversight when tasks exceed the agent's competence, because non-affine autonomy incentives destroy strict properness of scoring rules and cause confidence inflation.
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
Proposes an equation-anchored tool-use method for MLLMs that writes the pinhole back-projection equation in Chain-of-Thought and substitutes retrieved camera intrinsics and depths to achieve robustness in 3D object detection and visual grounding under rescaled intrinsics.
REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.
Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.
UI traces of actions and timings from LLM browser agents enable identification of the underlying model with up to 96% F1 across 14 models and multiple tasks.
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
PolitNuggets is a multilingual benchmark showing that AI agents struggle with fine-grained accuracy and efficiency when discovering long-tail political facts for elite biographies, linking performance to short-context extraction, multilingual robustness, and tool use.
Unique canary tokens served to visiting scrapers can be recovered from LLM outputs to identify which scrapers feed data to which of 22 tested production LLMs.
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.
citing papers explorer
-
A Survey on Retrieval-Augmented Text Generation for Large Language Models
A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.