Recognition: 2 Lean theorem links
Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3
The pith
Training LLMs to emit Behavior Cues immediately before behaviors makes reasoning monitorable, letting monitors recover safe actions from 80% of unsafe traces and more than double success rates (46% to 96%).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that fine-tuning LLMs to produce Behavior Cues immediately before target behaviors gives external monitors a compressed but sufficient view of the reasoning process. Acting on cue information alone, monitors can prune 50% of reasoning tokens in math tasks and recover safe actions from 80% of traces that would otherwise end in an unsafe proposal, raising success rates from 46% to 96% in constrained environments, with no loss to base reasoning capability.
What carries the argument
Behavior Cues: special token sequences that the model is trained to emit immediately before specific behaviors, serving as signals and control points for external monitoring and intervention.
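To make the cue mechanism concrete, here is a minimal sketch of extracting the compressed cue view a monitor would consume. The `<cue:...>` surface form and the cue names are assumptions for illustration; the abstract does not specify the cue vocabulary.

```python
import re

# Hypothetical cue-annotated trace; the actual cue format is an assumption.
TRACE = (
    "Let me try substitution. <cue:new_approach> "
    "Set x = 2y and simplify. <cue:verify> "
    "Both sides equal 12, so it checks out. <cue:final_answer> x = 6"
)

CUE = re.compile(r"<cue:(\w+)>")

def compressed_view(trace: str) -> list[str]:
    """Return only the Behavior Cues, in emission order.

    The paper's premise is that this compressed stream alone carries
    enough signal for an external monitor, without the monitor ever
    reading the full chain of thought.
    """
    return CUE.findall(trace)

print(compressed_view(TRACE))  # ['new_approach', 'verify', 'final_answer']
```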
If this is right
- Monitors can prune up to 50% of wasted reasoning tokens in complex math problem solving based on cue information alone.
- An almost optimal rule-based monitor recovers safe actions from 80% of traces ending in unsafe proposals, more than doubling success from 46% to 96% (a monitor sketch follows this list).
- The approach works across two model families and three domains with no degradation to core performance.
- Training the monitored model to reason in a more monitorable way advances scalable oversight techniques.
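To ground the safety-recovery claim above, here is a minimal sketch of a rule-based monitor intervening at an action-proposal cue. The cue name, the `is_safe` predicate, and the `regenerate` callback are hypothetical stand-ins; the paper's actual monitor and environment interface may differ.

```python
def recover_safe_action(trace_prefix, proposed_action, is_safe, regenerate,
                        max_retries=5):
    """Intervene at the action-proposal cue: if the proposed action would
    violate a constraint, discard it and resample a continuation from just
    before the cue instead of letting the trace end unsafely."""
    for _ in range(max_retries):
        if is_safe(proposed_action):
            return proposed_action
        proposed_action = regenerate(trace_prefix)
    return None  # give up; the environment counts this as a failure

# Toy usage with canned continuations standing in for the model.
continuations = iter(["force the locked door", "take key", "unlock door"])
action = recover_safe_action(
    trace_prefix="...reasoning... <cue:propose_action>",
    proposed_action="force the locked door",
    is_safe=lambda a: "force" not in a,
    regenerate=lambda prefix: next(continuations),
)
print(action)  # take key
```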
Where Pith is reading between the lines
- If Behavior Cues prove stable under fine-tuning, they could be applied to guide reasoning in open-ended tasks where current monitors struggle.
- The compressed cue view suggests that full reasoning traces may contain redundant information for oversight purposes, potentially reducing monitoring compute needs.
- Combining this with other oversight methods might allow handling of more complex misalignments that emerge only at the end of long reasoning chains.
Load-bearing premise
LLMs can be fine-tuned to emit Behavior Cues right before the intended behaviors without any drop in their main reasoning quality, and monitors can make effective decisions using only the information those cues surface.
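A minimal sketch of what the data side of this premise could look like, assuming behavior spans in existing traces have already been labeled; the cue names and labeling scheme are placeholders rather than the paper's recipe. Fine-tuning on such targets with the ordinary next-token objective is what the "no drop in reasoning quality" half of the premise requires.

```python
def insert_cues(steps):
    """Prepend a Behavior Cue token to each labeled reasoning step.

    `steps` is a list of (text, behavior_label) pairs; steps whose label
    is None are left untouched, so most of the trace is unchanged.
    """
    annotated = []
    for text, label in steps:
        annotated.append(f"<cue:{label}> {text}" if label else text)
    return " ".join(annotated)

example = [
    ("Let me try a substitution.", "new_approach"),
    ("Set x = 2y and simplify.", None),
    ("Both sides equal 12, so this checks out.", "verify"),
]
print(insert_cues(example))
# <cue:new_approach> Let me try a substitution. Set x = 2y and simplify.
# <cue:verify> Both sides equal 12, so this checks out.
```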
What would settle it
A result where the fine-tuned model emits cues inconsistently or at incorrect times, or where cue-based monitoring shows no significant improvement in pruning efficiency or safety recovery rates compared to monitoring full traces.
read the original abstract
Reasoning in Large Language Models (LLMs) poses a challenge for oversight as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual-purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations result in failure, Behavior Cue Reasoning allows for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work progresses scalable oversight by demonstrating how the monitored model itself can be trained to reason more tractably to oversight. Code to be released at https://github.com/christopherzc/text-games
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Behavior Cue Reasoning (BCR), in which LLMs are fine-tuned to emit special token sequences ('Behavior Cues') immediately before target behaviors. These cues serve as signals for external monitors, enabling token pruning (up to 50% in math domains with no accuracy loss) and recovery of safe actions (80% of otherwise unsafe traces, raising success from 46% to 96% in constrained environments). Results are reported across two model families and three domains, with ablations indicating preserved core reasoning performance and code to be released.
Significance. If the results hold, the work advances scalable oversight by showing that the model itself can be trained to produce more monitorable reasoning traces, improving both efficiency and safety without performance trade-offs. The inclusion of ablations verifying no degradation and the planned code release strengthen reproducibility.
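As an illustration of the pruning mechanism the summary describes, here is a minimal hard-coded stand-in for the monitor's stopping decision; the paper instead trains a weaker external monitor with RL, and the cue names are assumptions.

```python
def should_stop(cues_so_far):
    """Decide from the compressed cue stream alone whether further
    reasoning tokens are likely wasted: here, once an answer has been
    stated and then verified."""
    return cues_so_far[-2:] == ["final_answer", "verify"]

# The model would keep reasoning past cue 4; the monitor cuts it off.
stream = ["new_approach", "verify", "final_answer", "verify",
          "new_approach", "verify"]
for i in range(1, len(stream) + 1):
    if should_stop(stream[:i]):
        print(f"stop after cue {i}; cues {i + 1}..{len(stream)} pruned")
        break
```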
major comments (1)
- [Experiments and ablations] The fine-tuning procedure for cue emission and the exact experimental controls (hyperparameters, dataset construction, and baseline comparisons) are insufficiently detailed to fully substantiate the claim of no degradation to reasoning performance; this detail is load-bearing for the central 'no cost to performance' assertion across domains.
minor comments (2)
- [Safety experiments] Clarify the precise definition and construction of the 'almost optimal rule-based monitor' used for the safety recovery experiments, including how the 80% recovery rate was measured (a back-of-envelope check follows this list).
- [Results] The abstract and main text would benefit from explicit statements on the number of runs, random seeds, and statistical significance for the reported gains (50% pruning, 46% to 96% success).
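On that second point, a back-of-envelope composition of the abstract's own numbers shows why the measurement definition matters; the episode count below is illustrative, not from the paper.

```python
n = 100
baseline_success = 46            # 46% success without intervention
unsafe = n - baseline_success    # assume all 54 failures end in an unsafe proposal
recovered = 0.80 * unsafe        # monitor rescues 80% of those traces

print(baseline_success + recovered)  # 89.2, short of the reported 96%
# The gap means the 80% cannot apply uniformly to every baseline failure:
# either the recovery rate is measured on a different trace population, or
# baseline failures have causes beyond a single unsafe proposal. That is
# the ambiguity the comment above asks the authors to resolve.
```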
Simulated Author's Rebuttal
We thank the referee for their positive assessment, recognition of the work's significance for scalable oversight, and recommendation for minor revision. We address the major comment below and will incorporate clarifications to strengthen the manuscript.
read point-by-point responses
- Referee: [Experiments and ablations] The fine-tuning procedure for cue emission and the exact experimental controls (hyperparameters, dataset construction, and baseline comparisons) are insufficiently detailed to fully substantiate the claim of no degradation to reasoning performance; this detail is load-bearing for the central 'no cost to performance' assertion across domains.
  Authors: We agree that greater specificity on these elements would strengthen the paper and better support the central claim. In the revised manuscript, we will expand the Methods and Experimental Setup sections to include: the precise fine-tuning objective and procedure for cue emission (including loss formulation and training dynamics); complete hyperparameter tables for all models, domains, and ablations; detailed dataset construction protocols (including prompt templates, behavior labeling, and split statistics); and more granular baseline comparisons with quantitative metrics. These additions will directly address the load-bearing nature of the 'no cost' assertion. The planned code release will provide the full implementation for exact reproducibility.
  Revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper's claims rest on empirical results from fine-tuning LLMs to emit Behavior Cues before target behaviors and then measuring monitor performance on compressed cue views in math and constrained environments. Reported quantities such as 80% recovery of safe actions, the doubling of success from 46% to 96%, and 50% token pruning are direct experimental outcomes with ablations for preserved accuracy; they are not quantities defined in terms of parameters fitted to the same data, nor do they reduce, by self-citation or ansatz, to their own inputs. The central premise of improved monitorability is validated externally through the described training and evaluation procedures rather than holding by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLMs can be fine-tuned via RL to emit specific token sequences before target behaviors without harming base performance.
- Domain assumption: A compressed view consisting only of Behavior Cues supplies enough signal for effective monitor decisions.
invented entities (1)
- Behavior Cues (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual-purpose signal and control levers."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "When leveraged by an almost optimal rule-based monitor... recovery of safe actions from 80% of reasoning traces... doubling the success rate from 46% to 96%."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Concrete Problems in AI Safety. Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. arXiv preprint arXiv:1606.06565, 2016.
- [2] Building with extended thinking. Anthropic. https://docs.claude.com/en/docs/build-with-claude/extended-thinking, 2025.
- [3] The internal state of an LLM knows when it's lying. Amos Azaria and Tom Mitchell. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
- [4] Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. 2025.
- [5] Measuring progress on scalable oversight for large language models. Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, et al. arXiv preprint arXiv:2211.03540, 2022.
- [6] Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. In International Conference on Machine Learning, 2024.
- [7] Textworld: A learning environment for text-based games. Marc-Alexandre Côté, Akos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. In Workshop on Computer Games, pages 41–75. Springer, 2018.
- [8] Tales: Text adventure learning environment suite. Christopher Zhang Cui, Xingdi Yuan, Ziang Xiao, Prithviraj Ammanabrolu, and Marc-Alexandre Côté. 2025.
- [9] Deepseek-v4: Towards highly efficient million-token context intelligence. DeepSeek-AI. 2026.
- [10] Group-in-Group Policy Optimization for LLM Agent Training. Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. arXiv preprint arXiv:2505.10978, 2025.
- [11] Think before you speak: Training language models with pause tokens. Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. In The Twelfth International Conference on Learning Representations, 2024.
- [12] Alignment faking in large language models. Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger. 2024.
- [13] Monitoring monitorability. Melody Y. Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y. Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, and Bowen Baker. 2025.
- [14] Interactive fiction games: A colossal adventure. Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7903–7910, 2020.
- [15] Collapse of self-trained language models. David Herel and Tomas Mikolov. 2024.
- [16] The ends justify the thoughts: RL-induced motivated reasoning in LLM CoTs. Nikolaus Howe and Micah Carroll. 2026.
- [17] Deep research agents: A systematic examination and roadmap. Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. arXiv preprint arXiv:2506.18096, 2025.
- [18] Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents. Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. Advances in Neural Information Processing Systems, 37:10088–10116, 2024.
- [19] "Well, keep thinking": Enhancing LLM reasoning with adaptive injection decoding. Hyunbin Jin, Je Won Yeom, Seunghyun Bae, and Taesup Kim. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9989–10018, Vienna, Austria, July 2025.
- [20] On scalable oversight with weak LLMs judging strong LLMs. Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, and Rohin Shah. arXiv preprint arXiv:2407.04622, 2024.
- [21] Learning to insert [pause] tokens for better reasoning. Eunki Kim, Sangryul Kim, and James Thorne. arXiv preprint arXiv:2506.03616, 2025.
- [22] Chain of thought monitorability: A new and fragile opportunity for AI safety. Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, et al. 2025.
- [23] Measuring faithfulness in chain-of-thought reasoning. Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, et al. 2023.
- [24] Early stopping chain-of-thoughts in large language models. Minjia Mao, Bowen Yin, Yu Zhu, and Xiao Fang. 2025.
- [25] s1: Simple test-time scaling. Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
- [26] Mlgym: A new framework and benchmark for advancing AI research agents. Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, et al. arXiv preprint arXiv:2502.14499, 2025.
- [27] Learning to reason with LLMs. OpenAI. https://openai.com/index/learning-to-reason-with-llms/, 2024.
- [28] Balrog: Benchmarking agentic LLM and VLM reasoning on games. Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, et al. arXiv preprint arXiv:2411.13543, 2024.
- [29] Logit-entropy adaptive stopping heuristic for efficient chain-of-thought reasoning. Mohammad Atif Quamar and Mohammad Areeb. 2025.
- [30] Learning a continue-thinking token for enhanced test-time scaling. Liran Ringel, Elad Tolochinsky, and Yaniv Romano. In Proceedings of the 14th International Joint Conference on Natural Language Processing, 2025.
- [31] Proximal policy optimization algorithms. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017.
- [32] Backtracking for safety. Bilgehan Sel, Dingcheng Li, Phillip Wallis, Vaishakh Keshava, Ming Jin, and Siddhartha Reddy Jonnalagadda. arXiv preprint arXiv:2503.08919, 2025.
- [33] Think just enough: Sequence-level entropy as a confidence signal for LLM reasoning. Aman Sharma and Paras Chopra. 2025.
- [34] Alfworld: Aligning text and embodied environments for interactive learning. Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. In International Conference on Learning Representations, 2021.
- [35] Stop when enough: Adaptive early-stopping for chain-of-thought reasoning. Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, and Wei Wang. 2025.
- [36] Concisehint: Boosting efficient reasoning via continuous concise hints during generation. Siao Tang, Xinyin Ma, Gongfan Fang, and Xinchao Wang. arXiv preprint arXiv:2506.18810, 2025.
- [37] Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. 2023.
- [38] Scienceworld: Is your agent smarter than a 5th grader? Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022.
- [39] BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens. Hao Wen, Xinrui Wu, Yi Sun, Feifei Zhang, Liye Chen, Jie Wang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, and Yuanchun Li. arXiv preprint arXiv:2508.17196, 2025.
- [40] Effectively Controlling Reasoning Models Through Thinking Intervention. Tong Wu, Chong Xiang, Jiachen T Wang, G Edward Suh, and Prateek Mittal. arXiv preprint arXiv:2503.24370, 2025.
- [41] Qwen3 technical report. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. arXiv preprint arXiv:2505.09388, 2025.
- [42] Test-time prompt intervention. Chenxu Yang, Qingyi Si, Mz Dai, Dingyu Yao, Mingyu Zheng, Minghui Chen, Zheng Lin, and Weiping Wang. 2025.
- [43] Dynamic early exit in reasoning models. Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang. arXiv preprint arXiv:2504.15895, 2025.
- [44] Swe-agent: Agent-computer interfaces enable automated software engineering. John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. In Proceedings of the 38th International Conference on Neural Information Processing Systems, pages 50528–50652, 2024.
- [45] Safe reinforcement learning with natural language constraints. Tsung-Yen Yang, Michael Y Hu, Yinlam Chow, Peter J Ramadge, and Karthik Narasimhan. Advances in Neural Information Processing Systems, 34:13794–13808, 2021.
- [46] Step back to leap forward: Self-backtracking for boosting reasoning of language models. Xiao-Wen Yang, Xuan-Yi Zhu, Wen-Da Wei, Ding-Chu Zhang, Jie-Jing Shao, Zhi Zhou, Lan-Zhe Guo, and Yu-Feng Li. arXiv preprint arXiv:2502.04404, 2025.
- [47] debug-gym: A text-based environment for interactive debugging. Xingdi Yuan, Morgane M Moss, Charbel El Feghali, Chinmay Singh, Darya Moldavskaya, Drew MacPhee, Lucas Caccia, Matheus Pereira, Minseon Kim, Alessandro Sordoni, and Marc-Alexandre Côté. 2025.
- [48] Reasoning models know when they're right: Probing hidden states for self-verification. Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. arXiv preprint arXiv:2504.05419, 2025.
- [49] Can we steer reasoning direction by thinking intervention? Xingsheng Zhang, Luxi Xing, Chen Zhang, Yanbing Liu, Yifan Deng, Yunpeng Li, Yue Hu, and Chenxu Niu. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3888–3913, Suzhou, China, November 2025.
- [50] Backtracking improves generation safety. Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M Bikel, Jason E Weston, and Eric Michael Smith. In The Thirteenth International Conference on Learning Representations, 2025.
- [51] Activation control for efficiently eliciting long chain-of-thought ability of language models. Zekai Zhao, Qi Liu, Kun Zhou, Zihan Liu, Yifei Shao, Zhiting Hu, and Biwei Huang. arXiv preprint arXiv:2505.17697, 2025.