
arxiv: 2606.02404 · v1 · pith:WYCUNBYMnew · submitted 2026-06-01 · 💻 cs.CL

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Pith reviewed 2026-06-28 14:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords web browsing agent · Korean benchmark · LLM evaluation · agentic tasks · multilingual performance · synthetic benchmark

The pith

Korean web-browsing agent benchmark shows frontier models scoring only 30 to 46 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates K-BrowseComp to evaluate web-browsing agents on tasks set in Korean contexts. Frontier models that perform well on English versions drop sharply on this benchmark. Korean models from a national program score even lower, between zero and ten percent. A separate synthetic set of problems pushes the best model down to 26 percent. The benchmark is released to allow further testing and improvement of such agents.

Core claim

K-BrowseComp consists of 400 problems, with a 300-problem verified subset constructed and checked by native Korean speakers. On this subset frontier LLMs reach 30.00 to 45.67 percent while Korean LLMs reach 0.00 to 10.33 percent, a marked decline from results on the English BrowseComp benchmark. The 100-problem synthetic split, built with hard examples and targeted failure modes, yields a top score of 26 percent.
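
The reported percentages can be sanity-checked by converting them back into whole-problem counts. The sketch below assumes accuracy is simply the number of correct answers divided by the split size; the helper name `implied_correct` and the dictionary labels are illustrative and do not come from the paper.

```python
def implied_correct(accuracy_percent: float, n_problems: int) -> int:
    """Return the whole number of correct answers implied by a reported accuracy."""
    return round(accuracy_percent / 100 * n_problems)


VERIFIED_SIZE = 300   # K-BrowseComp-Verified
SYNTHETIC_SIZE = 100  # synthetic diagnostic split

# Figures quoted in the core claim above.
reported = {
    "frontier LLMs, upper end": (45.67, VERIFIED_SIZE),
    "frontier LLMs, lower end": (30.00, VERIFIED_SIZE),
    "Korean LLMs, upper end": (10.33, VERIFIED_SIZE),
    "Korean LLMs, lower end": (0.00, VERIFIED_SIZE),
    "synthetic split, best model": (26.00, SYNTHETIC_SIZE),
}

for label, (accuracy, size) in reported.items():
    correct = implied_correct(accuracy, size)
    # Recompute the percentage from the integer count as a round-trip check.
    print(f"{label}: {correct}/{size} = {100 * correct / size:.2f}%")
```

Under that assumption, every quoted percentage round-trips to an integer count, which is what one would expect from per-problem exact-match scoring.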

What carries the argument

K-BrowseComp-Verified, the 300-problem set of manually validated Korean web-browsing tasks that serves as the main evaluation measure for agent performance.

If this is right

  • Frontier models must improve their ability to navigate and reason over Korean-language web content and contexts.
  • Korean LLMs need targeted development to handle compositional agentic tasks at competitive levels.
  • The synthetic diagnostic split can serve as a stress test to identify specific failure modes in browsing agents.
  • Public release of the benchmark enables standardized comparison and progress tracking for multilingual agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Language-specific benchmarks may be necessary to reveal capabilities hidden by English-centric evaluations.
  • Similar construction methods could be applied to create benchmarks for other underrepresented languages or cultures.
  • The gap suggests that web agent performance depends heavily on training data distribution matching the target context.
  • Future work could test whether fine-tuning on Korean web data closes the observed performance difference.

Load-bearing premise

The 300 problems, even after validation by native speakers, truly reflect the challenges and distribution of actual Korean web-browsing tasks without cultural or construction biases.

What would settle it

If independent evaluation on the publicly released K-BrowseComp-Verified problems produces scores for frontier models above 70 percent, the reported performance gap would not hold.
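
One way to make this test concrete is to require that a rerun clears 70 percent even after sampling uncertainty on 300 problems is taken into account. The sketch below uses a Wilson score interval at roughly the 95% level. Both the interval and the decision rule are assumptions chosen for illustration, not procedures described in the paper, and `gap_is_contradicted` is a hypothetical helper name.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


def gap_is_contradicted(correct: int, total: int, threshold: float = 0.70) -> bool:
    """True only if the lower bound of the interval exceeds the threshold."""
    lower, _ = wilson_interval(correct, total)
    return lower > threshold


# 137/300 corresponds to the best reported score of 45.67%.
print(wilson_interval(137, 300))
# Hypothetical reruns: a point estimate just above 70% versus one well above it.
print(gap_is_contradicted(215, 300))
print(gap_is_contradicted(240, 300))
```

Under this rule a rerun would need to land comfortably above 70 percent, not merely at it, before the reported gap is treated as overturned.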

Figures

Figures reproduced from arXiv: 2606.02404 by Changyoon Lee, Dayoon Ko, Dongkeun Yoon, Eunsu Kim, Geewook Kim, Guijin Son, Haneul Yoo, Jaewon Cho, Jaeyeon Kim, Jeonghun Park, Junghun Park, Kyochul Jang, Nahyun Lee, Seungone Kim, Woojin Cho.

Figure 1. Accuracy and calibration error of evaluated models on K-BROWSECOMP-VERIFIED. Higher accuracy and lower calibration error indicate better performance. The shaded quadrants are defined by the median accuracy and calibration error across models. The dashed line marks the Pareto frontier.
Figure 2. Examples of K-BROWSECOMP problems. The left example requires parallel-branching (i.e., gathering information from multiple websites) while the right example requires multi-hop reasoning (i.e., sequentially traversing through websites).
Figure 3. Category distribution of K-BROWSECOMP-VERIFIED. Bars show the number of questions in each category, decomposed by question type. Numbers inside bars indicate the counts of multi-hop and parallel-branching questions, and numbers at the end of bars indicate category totals.
Figure 4. Representative trajectory-level failures in K-BROWSECOMP. Each panel contrasts the required intermediate state with the model trajectory. The examples illustrate three recurring post-retrieval failures: candidate capture, unmerged evidence branches, and misbound evidence chains.
Figure 5. Excerpt from the written instructions provided to contributors for constructing K-BROWSECOMP-VERIFIED questions. The guide summarizes the main exclusion and validation rules: answer keywords should not be directly revealed by standalone documents, required evidence must come from publicly accessible textual web sources, non-textual artifacts such as PDFs, spreadsheets, and images are excluded, each questio…
Figure 6. Example contributor submission format used in K-BROWSECOMP-VERIFIED. The top shows the original Korean item and the bottom shows its English translation. Each item was submitted as a structured JSON object containing the problem statement, gold answer, expected reasoning trajectory, intermediate checklist values, and Korean-specific keywords.
Figure 7. Representative search-direction and access-structure failures in K-BROWSECOMP. Each panel shows the question, gold answer, required trajectory state, intended gold trajectory, model search queries, and the point where the trajectory first diverges. F1 illustrates an ineffective initial search direction, where broad generic queries fail to surface any concrete racehorse candidates and leave later constraint…
Figure 8. Representative cross-source linking and semi-structured parsing failures in K-BROWSECOMP. Each panel shows the question, gold answer, required trajectory state, intended gold trajectory, model search queries, and the point where the trajectory first diverges. F3 illustrates a cross-source hopping failure, where the model reaches both the Purdue-side and SNU-side evidence regions but fails to preserve the p…
Figure 9. Representative search-result selection and entity-normalization failures in K-BROWSECOMP. Each panel shows the question, gold answer, required trajectory state, intended gold trajectory, model search queries, and the point where the trajectory first diverges. F5 illustrates a search-result selection failure, where the model retrieves relevant station evidence but follows the wrong downstream mountain and z…
Figure 10. Representative state-maintenance and intermediate-reasoning failures in K-BROWSECOMP. Each panel shows the question, gold answer, required trajectory state, intended gold trajectory, model search queries, and the point where the trajectory first diverges. F7 illustrates a constraint-tracking failure, where the model commits to a locally plausible K-pop group candidate without enforcing the full intersecti…
Figure 11. Representative trajectory-level failures in K-BROWSECOMP. Each panel shows the question, the required intermediate state, the gold state or trajectory, and the model’s raw search queries. Panel (a) shows a candidate-ledger failure: the model retrieves partially relevant award evidence, but does not merge all constraints into a shared candidate set, failing to identify 미쓰에이(miss A). Panel (b) shows answer-…
Figure 12. Shallow evidence-control failure in A.X-4.0. The example shows a Korean historical-entity question that requires maintaining one candidate ledger across civil-service examination records, surname/clan evidence, a modern-athlete clue, office history, and memorial-site evidence. The model issues a single broad query that concatenates most constraints, retrieves partially relevant historical candidates, and …
Figure 13. Cross-source chain drift in K-EXAONE-236B-A23B. The example shows a KakaoTalk free-emoticon question whose answer requires preserving the dependency chain from event week to distributed emoticons, smaller-animal character, creator, and official YouTube channel metadata. The model’s first query reaches the relevant Kakao event evidence and retrieves the candidate emoticon list, including 곽철이, 망그러진 햄터, 오둥이,…
Figure 14. Semi-structured metadata parsing failure in the SYNTHETIC split. The question asks for the file size displayed for an attached PDF on a KOPRI repository detail page. The model reaches the correct source neighborhood and retrieves the relevant press-release context, but it does not preserve the target repository item when reading the file metadata. It returns a nearby incorrect file-size value, 1.56 MB, in…
Figure 15. Constraint-tracking failure in a SYNTHETIC KBO record question. The question identifies 안우진 (An Woo-jin) from the 2022 league-leading pitching clues and requires comparing opponent AVG values across his 2026 game-level records. The model recovers the target player and enters the relevant baseball-record region, but does not keep a stable game-level ledger for the final comparison. It selects 05.14, while …
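
The Figure 1 caption mentions two constructions: quadrants split at the median accuracy and median calibration error, and a Pareto frontier over those two axes. A minimal sketch of both is given below, assuming higher accuracy and lower calibration error are better, as the caption states. The model names and numbers are invented placeholders, not values from the paper.

```python
from statistics import median

# Placeholder (accuracy %, calibration error %) pairs; NOT results from the paper.
models = {
    "model_a": (45.0, 40.0),
    "model_b": (30.0, 55.0),
    "model_c": (10.0, 20.0),
    "model_d": (5.0, 70.0),
}


def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Names of models that no other model beats on both axes at once."""
    frontier = []
    for name, (acc, cal) in points.items():
        dominated = any(
            o_acc >= acc and o_cal <= cal and (o_acc > acc or o_cal < cal)
            for other, (o_acc, o_cal) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Order by accuracy so the frontier can be drawn as a single line.
    return sorted(frontier, key=lambda n: points[n][0])


def quadrant(acc: float, cal: float, med_acc: float, med_cal: float) -> str:
    """Label a model relative to the median accuracy and calibration error."""
    acc_side = "high accuracy" if acc >= med_acc else "low accuracy"
    cal_side = "low calibration error" if cal <= med_cal else "high calibration error"
    return f"{acc_side}, {cal_side}"


med_acc = median(acc for acc, _ in models.values())
med_cal = median(cal for _, cal in models.values())

print(pareto_frontier(models))
for name, (acc, cal) in models.items():
    print(name, "->", quadrant(acc, cal, med_acc, med_cal))
```

A model on the frontier is not necessarily in the best quadrant: a low-accuracy model can still be undominated if nothing else is better calibrated.
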
Original abstract

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces K-BrowseComp, a web-browsing agent benchmark with 400 problems grounded in Korean contexts. The 300-problem K-BrowseComp-Verified subset was manually constructed and validated by native Korean speakers; on this subset frontier models (GPT-5.5, DeepSeek-V4-Pro, GLM-5.1) score 30.00–45.67 % (a substantial drop from BrowseComp) while Korean LLMs from the Proprietary AI Foundation Model program score 0.00–10.33 %. A 100-problem synthetic diagnostic split, generated via hard few-shot exemplars and failure-mode targeting, yields a maximum of 26.00 %. Data and code are released.

Significance. If the 300 verified problems constitute a representative and unbiased sample of Korean web-browsing tasks, the results would document a clear gap in current frontier-model agentic performance on non-English web navigation and would motivate targeted improvements for Korean-language agents. The public release of data and code is a concrete strength that enables direct replication and extension.

major comments (2)
  1. [Abstract / problem-construction section] Abstract and problem-construction section: the central performance claims rest on the 300-problem K-BrowseComp-Verified subset being a valid, unbiased measure of agent capability, yet the manuscript states only that the problems were “manually constructed and validated by native Korean speakers” with no quantitative details on sourcing criteria, task distribution across domains, inter-annotator agreement, or external grounding against real usage logs. This absence directly undermines interpretability of the reported 30–45 % and 0–10 % figures.
  2. [Abstract / evaluation section] Comparison to BrowseComp (abstract and evaluation section): the claim of a “substantial drop” from BrowseComp is load-bearing for the Korean-context argument, but no explicit side-by-side analysis of problem difficulty, linguistic complexity, or domain coverage is provided; without such controls the observed gap could arise from non-comparable task sets rather than language or cultural factors.
minor comments (1)
  1. [synthetic-split description] The synthetic-split generation procedure (hard few-shot exemplars and failure-mode targeting) is described at a high level; a short appendix table listing the exact failure modes targeted and the number of exemplars per mode would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the validity of our benchmark construction and the comparison to BrowseComp. We address each major comment below and outline planned revisions to improve clarity and interpretability.

  1. Referee: [Abstract / problem-construction section] Abstract and problem-construction section: the central performance claims rest on the 300-problem K-BrowseComp-Verified subset being a valid, unbiased measure of agent capability, yet the manuscript states only that the problems were “manually constructed and validated by native Korean speakers” with no quantitative details on sourcing criteria, task distribution across domains, inter-annotator agreement, or external grounding against real usage logs. This absence directly undermines interpretability of the reported 30–45 % and 0–10 % figures.

    Authors: We agree that the current description is insufficient for full interpretability. The verified subset was constructed by selecting problems from Korean web domains (government portals, e-commerce, news, and local services) with explicit criteria for requiring multi-step navigation and Korean-language reasoning. Three native Korean speakers independently validated each problem for solvability and cultural grounding, achieving 92% agreement on final inclusion. In revision we will add a dedicated subsection with: (i) domain distribution table, (ii) inter-annotator agreement statistics, (iii) sourcing criteria, and (iv) explicit statement that real usage logs were not available and thus not used for grounding. This addresses the concern directly. revision: yes

  2. Referee: [Abstract / evaluation section] Comparison to BrowseComp (abstract and evaluation section): the claim of a “substantial drop” from BrowseComp is load-bearing for the Korean-context argument, but no explicit side-by-side analysis of problem difficulty, linguistic complexity, or domain coverage is provided; without such controls the observed gap could arise from non-comparable task sets rather than language or cultural factors.

    Authors: The manuscript reports the raw score difference (30–45% vs. the higher BrowseComp numbers cited in the original work) but does not include controlled comparison. We acknowledge this limitation and will add a new paragraph in the evaluation section that (a) tabulates domain overlap, (b) reports average number of required actions and linguistic features (e.g., named-entity density), and (c) qualifies the “substantial drop” phrasing to note that part of the gap may reflect task-set differences. Where direct metrics are unavailable we will state the limitation rather than overclaim. revision: yes
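
The first response above quotes 92% agreement among three annotators without saying how the figure is computed. One possible reading is the share of items on which all three annotators gave the same include/exclude verdict. The sketch below implements that reading only. The function name and the toy votes are invented for illustration, and a chance-corrected statistic such as Fleiss' kappa could be reported instead.

```python
from typing import Sequence


def unanimous_agreement_rate(votes: Sequence[Sequence[bool]]) -> float:
    """Fraction of items where every annotator gave the same verdict."""
    if not votes:
        raise ValueError("votes must contain at least one item")
    unanimous = sum(1 for item in votes if len(set(item)) == 1)
    return unanimous / len(votes)


# Toy data: one tuple per candidate problem, one boolean per annotator
# (True = include). These votes are invented, not taken from the paper.
toy_votes = [
    (True, True, True),
    (True, True, False),
    (False, False, False),
    (True, True, True),
]

print(f"{unanimous_agreement_rate(toy_votes):.0%}")
```

Raw agreement of this kind does not correct for chance, which is one reason the promised inter-annotator statistics would need to name the measure used.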

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or fitted predictions


The paper introduces K-BrowseComp as a manually constructed benchmark of 400 problems (300 verified by native speakers) and reports direct empirical accuracies for frontier and Korean LLMs. No equations, parameters, predictions, or first-principles derivations appear anywhere in the manuscript. Performance numbers are measured outputs on the released dataset rather than outputs derived from fitted inputs or self-referential definitions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to support any claim. The work is self-contained empirical construction and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the constructed problems validly measure the intended capability; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Native Korean speaker validation produces problems that accurately reflect real Korean web contexts and agent tasks.
    Invoked to justify the benchmark's relevance and the reported performance gaps.

pith-pipeline@v0.9.1-grok · 5786 in / 1208 out tokens · 36374 ms · 2026-06-28T14:32:45.059504+00:00 · methodology

