SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Nguyen Nguyen; Pratibha Zunjare; Thinh Pham; Tu Vu; Weiyuan Chen; Yu-Min Tseng

arXiv: 2506.01062 · v4 · submitted 2025-06-01 · cs.CL · cs.AI · cs.LG

Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, Tu Vu

Pith reviewed 2026-05-19 11:06 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords SealQA · search-augmented language models · reasoning benchmark · conflicting search results · fact-seeking questions · noisy information · long-context reasoning

The pith

Search-augmented language models fail to reason correctly when web searches return conflicting or unhelpful results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SealQA, a benchmark for testing search-augmented language models on fact-seeking questions where web search returns conflicting, noisy, or unhelpful results. It comes in three variants: Seal-0 (the main set), which covers the hardest questions, on which chat models such as GPT-4.1 typically score near zero; Seal-Hard, which likewise assesses factual accuracy and reasoning; and LongSeal, which tests long-context, multi-document reasoning among many distractors. Frontier models score poorly across all three variants: on Seal-0, the agentic models o3 and o4-mini reach only 17.1 percent and 6.3 percent accuracy, respectively, at their best reasoning efforts. The work also finds that advanced reasoning models remain vulnerable to noisy search results and that extra test-time compute does not reliably raise performance.

Core claim

The paper establishes that frontier search-augmented models have critical limitations in factual accuracy and reasoning when search results are unreliable. On Seal-0, o3 reaches 17.1 percent accuracy and o4-mini 6.3 percent at their best reasoning efforts. Reasoning models such as DeepSeek-R1 and o3-mini are highly vulnerable to noisy search results, increased test-time compute brings no consistent gains, and models do not reliably identify the relevant documents in LongSeal.

What carries the argument

The SealQA benchmark, which supplies fact-seeking questions together with deliberately conflicting or noisy search results to isolate failures in reasoning rather than retrieval alone.

If this is right

Advanced reasoning methods do not overcome the impact of noisy or conflicting search inputs.
Scaling test-time compute produces no reliable accuracy gains and can even reduce performance.
Models continue to miss relevant documents when many distractors are present in long contexts.
Search-augmented systems require new robustness techniques beyond current tool use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Better search result filtering or verification steps may deliver larger gains than further model scaling.
The benchmark could be extended to train models directly on handling uncertain or contradictory information.
Real-world fact-seeking agents may need external verification layers to reach usable reliability.

Load-bearing premise

The selected questions in Seal-0 and Seal-Hard genuinely represent real-world cases where web search yields conflicting or unhelpful results, and the accuracy metric isolates reasoning failures rather than search-tool limitations.

What would settle it

A search-augmented model reaching above 70 percent accuracy on Seal-0 after receiving cleaned or conflict-resolved search results instead of raw noisy ones.

Original abstract

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.
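The "needle-in-a-haystack" setting the abstract describes for LongSeal can be pictured with a small sketch. This is not the paper's construction code: the function names `build_haystack`, `render_prompt`, and `position_sweep`, the prompt layout, and the choice of three positions are all assumptions made here for illustration. The idea is to place one relevant document among distractors at different positions, so that a position-dependent drop in accuracy (the "lost-in-the-middle" issue) can be told apart from a general failure to find the relevant document.

```python
from typing import Iterator


def build_haystack(gold: str, distractors: list[str], position: int) -> list[str]:
    """Insert the single relevant (gold) document at `position` among distractors."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position must be between 0 and len(distractors)")
    return distractors[:position] + [gold] + distractors[position:]


def render_prompt(question: str, docs: list[str]) -> str:
    """Hypothetical prompt layout: numbered documents followed by the question."""
    numbered = "\n\n".join(f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(docs))
    return f"{numbered}\n\nQuestion: {question}\nAnswer:"


def position_sweep(gold: str, distractors: list[str]) -> Iterator[tuple[int, list[str]]]:
    """Yield (position, docs) with the gold document at the start, middle, and end."""
    n = len(distractors)
    for position in (0, n // 2, n):
        yield position, build_haystack(gold, distractors, position)
```

Scoring the same question once per position and comparing the results gives a rough per-position accuracy profile; how LongSeal itself orders or samples documents is not specified in the text above.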

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces SealQA, a new benchmark for search-augmented language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. It comprises Seal-0 (main set of challenging questions where chat models achieve near-zero accuracy), Seal-Hard, and LongSeal (for long-context multi-document reasoning). Evaluations report low performance by frontier models (e.g., 17.1% for o3 and 6.3% for o4-mini on Seal-0), vulnerability of reasoning models to noise, unreliable gains from increased test-time compute, and failures to identify relevant documents amid distractors in LongSeal. The dataset is released on Hugging Face.

Significance. If the benchmark questions are verifiably constructed such that retrieved search results are unhelpful and the evaluation isolates reasoning failures, the results would usefully demonstrate limitations in current agentic and reasoning models' handling of noisy retrieval and long-context settings. The public dataset release is a positive contribution for reproducibility and future work on search-augmented systems.

major comments (1)

[Dataset construction / Seal-0 description] The curation and verification process for Seal-0 is underspecified. The abstract and introduction describe Seal-0 as focusing on 'the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy,' but supply no explicit pipeline for question selection, criteria or metrics confirming that top-k web results do not contain the answer, quantitative checks on conflict/unhelpfulness levels in retrieved passages, or inter-annotator agreement. This is load-bearing for the central claim that the reported accuracies (e.g., 17.1% for o3) reflect reasoning limitations rather than search-tool shortcomings.

minor comments (1)

[Abstract] The acronym 'SEarch-Augmented' in the abstract and title contains inconsistent capitalization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the benchmark's potential contribution. We address the major comment on dataset construction below and have revised the manuscript to incorporate additional details.

Point-by-point responses

Referee: [Dataset construction / Seal-0 description] The curation and verification process for Seal-0 is underspecified. The abstract and introduction describe Seal-0 as focusing on 'the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy,' but supply no explicit pipeline for question selection, criteria or metrics confirming that top-k web results do not contain the answer, quantitative checks on conflict/unhelpfulness levels in retrieved passages, or inter-annotator agreement. This is load-bearing for the central claim that the reported accuracies (e.g., 17.1% for o3) reflect reasoning limitations rather than search-tool shortcomings.

Authors: We agree that the original manuscript's description of Seal-0 curation was high-level and would benefit from greater specificity to fully support the central claims. In the revised version, we have expanded Section 3 with a new subsection detailing the question selection pipeline: questions were drawn from existing fact-seeking sources and filtered by evaluating strong chat models (including GPT-4.1) under search-augmented conditions, retaining only those where accuracy remained near zero. We now explicitly describe the verification process, which combines automated checks (e.g., answer-string absence in top-k passages) with human review to confirm that retrieved results are unhelpful or conflicting. The revision also includes quantitative statistics on noise levels (e.g., fraction of passages with contradictions) and reports inter-annotator agreement for the verification annotations. These changes clarify that the benchmark isolates reasoning failures rather than retrieval deficiencies.

Revision: yes
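The automated check named in the response, answer-string absence in top-k passages, might look roughly like the sketch below. It is an illustration only: the helper names `normalize` and `answer_absent`, the normalization rules, and the default `k=10` are assumptions, not details taken from the paper or the rebuttal.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_accents.lower())
    return " ".join(no_punct.split())


def answer_absent(answers: list[str], passages: list[str], k: int = 10) -> bool:
    """Return True if no accepted answer occurs as a whole-token span in the top-k passages.

    `k` is an assumed cutoff; the source does not state the value used.
    """
    needles = [n for n in (normalize(a) for a in answers) if n]
    # Pad with spaces so "89" does not match inside "1889".
    top_k = [f" {normalize(p)} " for p in passages[:k]]
    return not any(f" {n} " in p for n in needles for p in top_k)
```

A string check of this kind only flags literal matches. It cannot tell whether a passage supports, contradicts, or merely mentions the answer, which is presumably why the response pairs it with human review.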

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with no derivations

Full rationale

The paper introduces SealQA as a new dataset and reports direct empirical accuracies of frontier models (e.g., o3 at 17.1% on Seal-0) without any claimed derivations, first-principles predictions, fitted parameters renamed as outputs, or self-citation chains. All results follow from running existing models on the released test cases; no equations or load-bearing steps reduce to inputs by construction. This is a standard self-contained empirical evaluation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that the curated questions produce genuinely noisy search results and that model accuracy on them measures reasoning capability rather than tool quality.

axioms (1)

domain assumption: The selected questions in Seal-0 are those where chat models typically achieve near-zero accuracy.
Used to define benchmark difficulty level.

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents
cs.CL 2026-06 unverdicted novelty 7.0

FORT synthesizes shortcut-resistant search tasks by controlling four identified shortcut risks across entity selection, graph construction, question formulation, and refinement, producing training data that yields age...

Argus: Evidence Assembly for Scalable Deep Research Agents
cs.CL 2026-05 unverdicted novelty 7.0

Argus coordinates a Navigator and multiple Searchers via an evidence graph to assemble complete, source-traced answers, yielding benchmark gains up to 12.7 points with 8 parallel agents and 86.2 on BrowseComp with 64 agents.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
cs.CL 2026-06 unverdicted novelty 6.0

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

Argus: Evidence Assembly for Scalable Deep Research Agents
cs.CL 2026-05 unverdicted novelty 6.0

Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmar...

APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI
cs.CL 2026-04 unverdicted novelty 6.0

APEX-MEM uses property graphs with temporal events, append-only storage, and an agentic retrieval system to reach 88.88% accuracy on LOCOMO QA and 86.2% on LongMemEval, outperforming prior session-aware methods.

Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents
cs.AI 2026-04 unverdicted novelty 6.0

A fine-tuning policy trains small language models to search reliably and use evidence, improving multi-hop QA performance by 15-17 points to reach large-model levels.

ExpSeek: Self-Triggered Experience Seeking for Web Agents
cs.CL 2026-01 unverdicted novelty 6.0

ExpSeek shifts web agents to self-triggered step-level experience seeking via entropy thresholds, delivering 9.3% and 7.5% absolute gains on Qwen3-8B and 32B models across four benchmarks.

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
cs.CL 2025-11 unverdicted novelty 6.0

MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.

Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents
cs.AI 2026-04 unverdicted novelty 5.0

Tool-augmented LLM reasoning incurs a protocol-induced performance tax that can exceed tool benefits under semantic noise, partially mitigated by a lightweight gate called G-STEP.

Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
cs.LG 2026-04 unverdicted novelty 5.0

Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.

EvoSkill: Automated Skill Discovery for Multi-Agent Systems
cs.AI 2026-03 unverdicted novelty 5.0

EvoSkill evolves agent skills via failure analysis and Pareto frontier selection, raising exact-match accuracy 7.3% on OfficeQA and 12.1% on SealQA with 5.3% zero-shot transfer to BrowseComp.

Kimi K2.5: Visual Agentic Intelligence
cs.CL 2026-02 unverdicted novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.