{"work":{"id":"92de1187-b15f-439e-9005-9178cb024209","openalex_id":null,"doi":null,"arxiv_id":"2506.01062","raw_key":null,"title":"SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models","authors":null,"authors_text":null,"year":2025,"venue":"cs.CL","abstract":"We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in \"needle-in-a-haystack\" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the \"lost-in-the-middle\" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.","external_url":"https://arxiv.org/abs/2506.01062","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-28T17:02:24.428732+00:00","pith_arxiv_id":"2506.01062","created_at":"2026-05-10T13:35:26.517487+00:00","updated_at":"2026-06-28T17:02:24.428732+00:00","title_quality_ok":true,"display_title":"SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models","render_title":"SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models"},"hub":{"state":{"work_id":"92de1187-b15f-439e-9005-9178cb024209","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":11,"external_cited_by_count":null,"distinct_field_count":3,"first_pith_cited_at":"2025-11-14T18:52:07+00:00","last_pith_cited_at":"2026-06-10T10:57:05+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-28T21:37:58.965721+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":1},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}