arXiv preprint arXiv:2602.12413

Soft Contamination Means Benchmarks Test Shallow Generalization · 2026 · arXiv 2602.12413

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

Ko-WideSearch is a new Korean breadth-search benchmark spanning 16 categories and three difficulty tiers that evaluates web agents on full set membership plus per-item attributes, showing consistent gaps between set recovery and row completion.

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

K-BrowseComp is a new Korean web-browsing agent benchmark where frontier LLMs score 30-46% and Korean LLMs score 0-10% on the verified subset.

citing papers explorer

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

cs.CL · 2026-06-25 · unverdicted · none · ref 48

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

cs.CL · 2026-06-01 · unverdicted · none · ref 9
