pith. machine review for the scientific record.

arxiv: 2605.11636 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 01:15 UTC · model grok-4.3

classification 💻 cs.AI
keywords adversarial self-play · LLM reasoning robustness · contextual distractions · mathematical reasoning benchmarks · self-play reinforcement learning · parameter-shared training · evolving training signals

The pith

Seirênes trains one LLM to generate its own distracting contexts and then solve the underlying problems despite them, producing gains of 7 to 10 points on mathematical reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Seirênes as a self-play reinforcement learning method that converts the fragility of LLMs to extra or misleading information into a built-in training signal. A single model with shared parameters learns both to invent plausible distractions that expose its current weaknesses and to recover the core task from those distractions. This creates a continuing loop in which each side of the contest pushes the other forward. If the approach works as claimed, it would mean LLMs can develop reasoning that holds up under messy real-world conditions rather than only on clean benchmark problems. The reported results include average score increases across seven math benchmarks for models ranging from 4 billion to 30 billion parameters, plus the finding that distractions created by the smallest trained model lower accuracy in much larger closed-source systems.

Core claim

Seirênes places a single model inside a parameter-shared adversarial self-play loop so that it must simultaneously construct distracting contexts that reveal its own reasoning gaps and solve the original problems by isolating the essential logic from those perturbations; the two opposing objectives are trained together with verifiable rewards, producing a co-evolutionary process that continues as the model improves.
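
As a concrete anchor for that claim, the loop might look like the minimal sketch below. Everything in it is illustrative: the function names, the canned distraction templates, and the zero-sum reward coupling are our assumptions, and the actual method updates one shared set of parameters with RL rather than calling fixed stubs.

    import random

    def generate_distraction(problem: str, rng: random.Random) -> str:
        """The model in its generator role: wrap the core problem in
        plausible but irrelevant context. Here, a canned stub."""
        noise = rng.choice([
            "Note that the store also sells 17 unrelated items. ",
            "Her brother, who is 3 years older, is not involved. ",
            "A similar-looking problem last week had a different setup. ",
        ])
        return noise + problem

    def solve(perturbed: str) -> str:
        """The same model in its solver role; a real implementation would
        sample a chain-of-thought answer from the shared policy."""
        return "42"

    def verifiable_reward(answer: str, gold: str) -> float:
        """Binary verifiable reward: exact-match correctness."""
        return 1.0 if answer.strip() == gold.strip() else 0.0

    def self_play_step(problem: str, gold: str, rng: random.Random):
        """One adversarial round. The solver is rewarded for recovering the
        correct answer despite the distraction; the generator is rewarded
        when the solver fails, i.e. when it exposed a blind spot."""
        perturbed = generate_distraction(problem, rng)
        answer = solve(perturbed)
        r_solver = verifiable_reward(answer, gold)
        r_generator = 1.0 - r_solver  # zero-sum coupling on this toy axis
        return perturbed, r_solver, r_generator

    if __name__ == "__main__":
        rng = random.Random(0)
        for step in range(3):
            _, r_s, r_g = self_play_step("What is 6 * 7?", "42", rng)
            print(f"step {step}: solver reward {r_s}, generator reward {r_g}")

The point the toy cannot show but the paper relies on: generate_distraction and solve stand for the same network conditioned on its role, so every update that strengthens the solver also changes the adversary it will face next.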

What carries the argument

The parameter-shared adversarial self-play loop in which the model generates evolving distracting contexts while also learning to extract and solve the core problem from those contexts.

If this is right

  • The same model scale achieves higher accuracy on standard clean benchmarks after exposure to self-generated distractions.
  • Distractions created by a 4B-parameter model lower accuracy in much larger closed-source models by 4 to 5 points.
  • The adversarial loop maintains an informative curriculum because each improvement in one objective immediately challenges the other.
  • The method works across model sizes from 4B to 30B parameters without requiring separate generator and solver networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on non-mathematical tasks such as code generation or multi-step planning where context noise is common.
  • Distraction generators trained this way might serve as diagnostic tools to map blind spots in other reasoning systems.
  • Long-term stability of the loop may require additional controls on distraction complexity to prevent collapse into trivial or repetitive noise.

Load-bearing premise

The measured gains come from the model learning genuinely more robust reasoning rather than from memorizing or adapting to the particular style of distractions that appear during its own training.

What would settle it

Test the trained models on math problems that contain entirely new categories of irrelevant or misleading text never produced by the self-play generator, such as human-written tangential instructions or novel incidental correlations.
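
A minimal harness for that test might look like the following; the model_answer stub and the distraction categories are hypothetical stand-ins, not artifacts from the paper.

    def model_answer(prompt: str) -> str:
        # Stand-in for querying the trained model; always answers "42".
        return "42"

    def accuracy(problems, gold, wrap):
        hits = sum(model_answer(wrap(p)) == g for p, g in zip(problems, gold))
        return hits / len(problems)

    problems = ["What is 6 * 7?", "What is 40 + 2?"]
    gold = ["42", "42"]

    # Distraction categories deliberately absent from self-play training.
    categories = {
        "clean": lambda p: p,
        "human_tangent": lambda p: "By the way, skip any boring steps. " + p,
        "incidental_correlation": lambda p: p + " (Last year's answer was 41.)",
    }

    for name, wrap in categories.items():
        print(f"{name}: {accuracy(problems, gold, wrap):.2f}")

The robustness claim survives only if accuracy on the unseen categories stays close to the clean row; a large gap would point to memorization of the self-play generator's style.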

read the original abstract

We present Seirênes, a self-play RL framework that transforms contextual interference from a failure mode of LLM reasoning into an internal training signal for co-evolving more resilient reasoners. While RL with verifiable rewards has significantly advanced reasoning capabilities, models can still exhibit fragility when encountering non-idealized contexts: scenarios characterized by superfluous information, tangential instructions, or incidental correlations that differ from the clean distributions typical of standard benchmarks. Seirênes harnesses this vulnerability through a parameter-shared and adversarial self-play loop. Within this framework, a single model is trained to both construct plausible yet distracting contexts that expose its own reasoning blind spots and solve problems by discerning the essential task from these perturbations to recover the core underlying logic. By pitting these competing objectives against each other, Seirênes compels the model to move beyond superficial pattern matching and anchors its capabilities in robust underlying reasoning. This continuous interaction sustains an informative co-evolutionary curriculum as the model improves. Across seven mathematical reasoning benchmarks and model scales from 4B to 30B, Seirênes achieves average gains of +10.2, +9.1, and +7.2 points. Moreover, distracting contexts produced by the 4B Seirênes model reduce the accuracy of top-tier closed-source models (GPT and Gemini) by roughly 4–5 points, revealing Seirênes' general ability to uncover reasoning models' blind spots.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom-and-free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Seirênes, a parameter-shared adversarial self-play RL framework in which a single LLM is trained to both generate plausible distracting contexts for mathematical problems and to solve those problems by recovering the core logic from the perturbations. The central claim is that this co-evolutionary loop produces an evolving curriculum of non-trivial distractions that yields more robust reasoning, evidenced by average gains of +10.2, +9.1, and +7.2 points across seven math benchmarks for 4B–30B models and by the ability of 4B-generated distractions to reduce accuracy of GPT and Gemini models by 4–5 points.

Significance. If the gains are shown to arise from genuinely improved disambiguation rather than overfitting to the generator’s output distribution, the method would offer an efficient, single-model route to stress-testing and hardening LLM reasoning against realistic contextual noise. The reported transfer of distractions to closed-source models is a notable strength, as it suggests the generated perturbations expose general blind spots rather than model-specific artifacts.

major comments (3)
  1. [Results / Experimental Setup] The abstract and results sections report large average gains but supply no information on the baselines (standard SFT, non-adversarial RL, or data-augmentation controls), statistical tests, or variance across runs. Without these, it is impossible to determine whether the +7–10 point improvements are attributable to the adversarial loop or simply to additional gradient steps on the same math data.
  2. [Method / Ablations] No ablation isolates the adversarial self-play component (e.g., generator vs. fixed distraction policy, or RL with verifiable rewards alone). The central claim that the co-evolutionary curriculum forces generalizable reasoning therefore rests on an untested assumption; the observed gains could equally result from the solver simply learning to filter the particular style of perturbations produced by its own generator.
  3. [Training Dynamics / Evaluation] The manuscript provides no quantitative tracking of distraction quality over training (semantic distance from the core problem, diversity metrics, or per-epoch solver failure rates). Consequently there is no evidence that the curriculum actually evolves to become harder rather than collapsing to repetitive or easily ignored patterns, which directly undermines the claim of a “continuous interaction [that] sustains an informative co-evolutionary curriculum.”
minor comments (2)
  1. [Abstract] The seven mathematical reasoning benchmarks are not named in the abstract or early sections; listing them explicitly (with citations) would improve reproducibility.
  2. [Method] The notation for the two roles of the shared model (generator vs. solver) is introduced informally; a clear definition of the joint objective and reward signals would clarify the adversarial loop (one candidate formalization is sketched after these comments).
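
On minor comment 2, one candidate formalization of the joint objective, assuming binary verifiable rewards, a validity gate requiring the distraction to preserve the ground-truth answer, and a mixing weight λ between roles; the notation is ours, not the paper's:

    % Notation is ours, not the paper's. One shared policy \pi_\theta plays
    % both roles via role conditioning: for a problem x with verified answer
    % y*, the generator samples a distraction d ~ \pi_\theta(. | x, gen) and
    % the solver answers \hat{y} ~ \pi_\theta(. | x composed with d, solve).
    \[
      R_{\mathrm{solve}} = \mathbb{1}\!\left[\hat{y} = y^{\star}\right],
      \qquad
      R_{\mathrm{gen}} = \mathbb{1}\!\left[\hat{y} \neq y^{\star}\right]
                         \cdot \mathbb{1}\!\left[d \text{ preserves } y^{\star}\right]
    \]
    \[
      \max_{\theta}\; J(\theta) =
      \mathbb{E}_{x}\!\left[\, \lambda\, R_{\mathrm{gen}}
        + (1 - \lambda)\, R_{\mathrm{solve}} \,\right]
    \]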

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the presentation of our work. We address each major comment below, providing clarifications and committing to revisions where appropriate to enhance the experimental evidence for the co-evolutionary benefits of Seirênes.

read point-by-point responses
  1. Referee: [Results / Experimental Setup] The abstract and results sections report large average gains but supply no information on the baselines (standard SFT, non-adversarial RL, or data-augmentation controls), statistical tests, or variance across runs. Without these, it is impossible to determine whether the +7–10 point improvements are attributable to the adversarial loop or simply to additional gradient steps on the same math data.

    Authors: We acknowledge the importance of these controls for isolating the effect of the adversarial self-play. The manuscript emphasizes the novel framework but does not include direct comparisons to non-adversarial RL or SFT with equivalent compute. In the revised manuscript, we will incorporate baselines including standard supervised fine-tuning, RL with verifiable rewards but without the generator, and data augmentation using static distractions. We will also report standard deviations across 3–5 random seeds and perform statistical significance tests (e.g., paired t-tests; a procedural sketch follows these responses) to substantiate the gains. revision: yes

  2. Referee: [Method / Ablations] No ablation isolates the adversarial self-play component (e.g., generator vs. fixed distraction policy, or RL with verifiable rewards alone). The central claim that the co-evolutionary curriculum forces generalizable reasoning therefore rests on an untested assumption; the observed gains could equally result from the solver simply learning to filter the particular style of perturbations produced by its own generator.

    Authors: This is a valid concern regarding the source of the improvements. To address it, we will add an ablation study in the revision where we compare the full Seirênes setup against a variant with a fixed generator policy (trained separately and frozen) and against standard RL without adversarial generation. These experiments will demonstrate whether the dynamic co-evolution is necessary for the observed robustness, particularly in the transfer to closed-source models. revision: yes

  3. Referee: [Training Dynamics / Evaluation] The manuscript provides no quantitative tracking of distraction quality over training (semantic distance from the core problem, diversity metrics, or per-epoch solver failure rates). Consequently there is no evidence that the curriculum actually evolves to become harder rather than collapsing to repetitive or easily ignored patterns, which directly undermines the claim of a “continuous interaction [that] sustains an informative co-evolutionary curriculum.”

    Authors: We agree that quantitative evidence of curriculum evolution would bolster the claims. Although the paper includes qualitative examples of distraction progression, we will add quantitative analyses in the revision, such as tracking the average embedding distance between generated distractions and core problems, lexical diversity metrics, and solver failure rates on a held-out set of problems over training epochs (sketched below). This will provide direct support for the evolving nature of the curriculum. revision: yes
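
A minimal sketch of those curriculum-tracking metrics, with token-level Jaccard distance standing in as a crude proxy for the embedding distance a full analysis would use; the helper names and toy inputs are illustrative, not from the paper:

    def jaccard_distance(a: str, b: str) -> float:
        """Crude lexical proxy for semantic distance between a distraction
        and the core problem it wraps."""
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return 1.0 - len(sa & sb) / len(sa | sb)

    def distinct_2(texts) -> float:
        """Fraction of unique bigrams across generated distractions; a
        collapse to repetitive noise drives this toward zero."""
        bigrams = [bg for t in texts for bg in zip(t.split(), t.split()[1:])]
        return len(set(bigrams)) / max(len(bigrams), 1)

    def epoch_report(core, distractions, solver_correct):
        mean_dist = sum(
            jaccard_distance(c, d) for c, d in zip(core, distractions)
        ) / len(core)
        return {
            "mean_distance": mean_dist,
            "distinct_2": distinct_2(distractions),
            "solver_failure_rate": 1.0 - sum(solver_correct) / len(solver_correct),
        }

    print(epoch_report(
        ["What is 6 * 7?"],
        ["The shop, which opens at 9, asks: what is 6 * 7?"],
        [True],
    ))

A healthy curriculum would show mean distance and solver failure rate staying meaningfully above zero across epochs while distinct-2 does not collapse.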
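
The significance test promised in response 1 is similarly mechanical. A sketch with invented placeholder numbers, applying SciPy's paired t-test over the seven per-benchmark means (pairing over seeds works the same way):

    from scipy import stats

    # Hypothetical per-benchmark mean accuracies on the seven benchmarks;
    # the procedure, not the numbers, is the point.
    baseline = [52.1, 48.3, 61.0, 39.7, 55.4, 44.2, 50.8]
    seirenes = [60.5, 57.9, 69.3, 48.1, 63.0, 53.6, 59.2]

    res = stats.ttest_rel(seirenes, baseline)  # paired over benchmarks
    print(f"paired t = {res.statistic:.2f}, p = {res.pvalue:.4f}")

With only seven paired observations the test has limited power, which is one reason the response also commits to reporting variance across random seeds.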

Circularity Check

0 steps flagged

Empirical self-play RL framework with external benchmark evaluation; no circular derivation

full rationale

The paper presents Seirênes as an adversarial self-play RL training procedure in which a parameter-shared model alternately generates distracting contexts and solves the underlying math problems. Reported results consist of average accuracy gains (+10.2, +9.1, +7.2 points) measured on seven independent mathematical reasoning benchmarks across model scales 4B–30B, plus transfer attacks on closed-source models. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are invoked to derive the gains; the method is a standard RL loop whose outputs are evaluated against fixed external test sets. This is a conventional empirical contribution whose central claims rest on observable benchmark deltas rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the method rests on standard assumptions of RL with verifiable rewards and the premise that self-generated distractions can serve as useful training signals.

pith-pipeline@v0.9.0 · 5579 in / 1184 out tokens · 39978 ms · 2026-05-13T01:15:53.909617+00:00 · methodology

