Narration-of-Thought: Inference-Time Scaffolding for Defeasible Ethical Reasoning in Large Language Models

Alvaro Velasquez; Patrick Cooper

arxiv: 2606.26366 · v1 · pith:KINNZI5Lnew · submitted 2026-06-24 · 💻 cs.AI · cs.CL· cs.CY

Narration-of-Thought: Inference-Time Scaffolding for Defeasible Ethical Reasoning in Large Language Models

Patrick Cooper , Alvaro Velasquez This is my paper

Pith reviewed 2026-06-26 01:34 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CY

keywords narration-of-thoughtchain-of-thoughtethical reasoninglarge language modelsmoral dilemmasinference-timedefeasible reasoningstakeholder analysis

0 comments

The pith

A five-section narration-of-thought prompt structures ethical reasoning in LLMs to reduce stakeholder collapse and uncertainty suppression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces narration-of-thought, a prompt that divides chain-of-thought into protagonist, stakeholders, two-step consequences, uncertainty, and commitment sections. Standard chain-of-thought on moral dilemmas often names only one stakeholder and suppresses uncertainty before deciding on an action. NoT addresses these by forcing explicit consideration of multiple parties and unknowns at inference time without any model changes. This matters because it makes the reasoning process more complete and auditable for use in agents that must handle ethical choices dependably. Experiments on 100 scenarios show large reductions in the failure modes across multiple models.

Core claim

Narration-of-thought structures the reasoning trace into five explicit sections that externalize stakeholders, consequences, and uncertainty before a commitment is made. On the DailyDilemmas benchmark, this reduces stakeholder collapse from as high as 31% to under 1% and uncertainty suppression from as high as 72% to between 1% and 24% across four generators. A matched-budget control confirms the structure itself drives the gains, and the method extends to a debate protocol that achieves near-full consensus in multi-stakeholder settings.

What carries the argument

narration-of-thought (NoT), a system prompt organizing chain-of-thought into five sections: protagonist, stakeholders, two-step consequences, uncertainty, then commitment

If this is right

Stakeholder collapse falls below 1% on 100 DailyDilemmas scenarios across models from three vendors.
Uncertainty suppression drops to 1-24% on the same set.
A verbose chain-of-thought baseline with matched token budget does not achieve the same gains.
Textual-gradient descent starting from NoT can further refine the scaffold.
The approach scales to a five-round debate protocol yielding 95-100% consensus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The explicit structure may support better human oversight by making the basis for decisions transparent.
Using a judge from a different model family than the generator improves evaluation reliability.
This inference-time method could apply to other domains where explicit enumeration of factors improves reasoning quality.

Load-bearing premise

The paper's custom metrics of stakeholder count and uncertainty score accurately reflect better ethical reasoning rather than just prompt compliance.

What would settle it

Human judges rating the quality of reasoning traces on the DailyDilemmas scenarios would need to show no preference for NoT outputs over standard chain-of-thought to falsify the claim of improved ethical reasoning.

Figures

Figures reproduced from arXiv: 2606.26366 by Alvaro Velasquez, Patrick Cooper.

**Figure 2.** Figure 2: Trace-level instance of the two failure modes whose panel-level rates Table [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Failure-mode firing rates (standard CoT vs. NoT) across the four-model panel ( [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Sub-instruction ablation on claude-sonnet-4-6: Cliff’s δ for each dropone-section condition vs. the full NoT control on stakeholder count and uncertainty score. Negative δ means the sub-instruction raised the coded metric, so dropping it reduces it. The Stakeholders sub-instruction (sub-instruction 2 of the NoT prompt) and the Uncertainty sub-instruction (sub-instruction 4) show the largest negative δ on… view at source ↗

**Figure 5.** Figure 5: Consensus rates across four progressively more structured debate designs (five scenarios, two generators). [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Visualisation of Table [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Clean-refusal rate per generator and scaffold on each agentic scenario ( [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Cross-family training (NoT-v3) beats in-family training (NoT-v2) on both axes the optimiser targeted, on [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

read the original abstract

Standard chain-of-thought on moral dilemmas exhibits two failure modes: stakeholder collapse (the trace names at most one party with a stake in the outcome) and uncertainty suppression (no explicit unknowns or hedges before committing to an action). We introduce narration-of-thought (NoT), a system prompt that structures chain-of-thought into five sections: protagonist, stakeholders, two-step consequences, uncertainty, then commitment. NoT adds no training, parameters, or fine-tuning. On 100 DailyDilemmas scenarios across four generators from three vendors, NoT cuts stakeholder collapse from up to 31% to under 1% and uncertainty suppression from up to 72% to 1-24% on every model. A matched-budget verbose-CoT control rules out token spend as the active ingredient; NoT retains Cliff's delta advantages of +0.79 to +0.90 on stakeholder count and +0.65 to +0.93 on uncertainty score for three of four generators, and a section ablation attributes each shift to its specific sub-instruction. Textual-gradient descent initialised at NoT improves the scaffold further; a cross-family training judge (different vendor from the generator) dominates an in-family one on every measured axis. Extended to a five-round multi-stakeholder debate protocol, the scaffold converts a 6% standoff into 95% full consensus on a calibration set and 100% combined convergence on a DailyDilemmas replication. The resulting traces externalise the stakeholders, consequences, and uncertainty grounding each commitment, providing an auditable substrate for dependable agentic deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NoT shows a five-section prompt reliably shifts LLM outputs on custom stakeholder and uncertainty counts across models with solid controls, but the metrics lack external validation so the ethical improvement claim stays unproven.

read the letter

The punchline is that this structured prompt cuts the defined failure modes on the DailyDilemmas set, with the effect holding across four generators and surviving a token-matched verbose baseline plus section ablations.

The work introduces the specific five-section scaffold (protagonist, stakeholders, two-step consequences, uncertainty, commitment) and shows it increases named parties and hedges in the traces. They also extend it to a multi-round debate protocol that raises consensus rates. The experiments are careful: multiple vendors, cross-family judge, and an ablation that ties each change to its section. That level of control is better than most prompting papers.

The soft spot is the evaluation. Stakeholder collapse and uncertainty suppression are author-defined automatic counts with no reported correlation to human ethical ratings or any downstream measure like decision quality or harm reduction. The gains are real for the counts, but it is not shown that more names and hedges equal substantively better defeasible reasoning rather than just prompt-compliant text. The scenario set is fixed and narrow, so broader claims about agentic deployment rest on an assumption that is not tested.

This is for groups working on inference-time methods for LLM safety and alignment. The design is honest enough on its own terms to deserve referee time; reviewers can press on metric validation and generalizability. I would send it out rather than desk reject.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Narration-of-Thought (NoT), an inference-time system prompt that structures LLM chain-of-thought into five fixed sections (protagonist, stakeholders, two-step consequences, uncertainty, commitment) to reduce stakeholder collapse and uncertainty suppression in moral dilemmas. It reports results on 100 DailyDilemmas scenarios across four generators from three vendors, showing reductions from up to 31% to <1% in collapse and up to 72% to 1-24% in suppression, supported by a matched-budget verbose-CoT control, section ablation, textual-gradient descent, cross-family judge comparisons, and extension to a five-round debate protocol achieving high consensus rates.

Significance. If the custom metrics validly measure ethical reasoning quality, the approach offers a parameter-free, training-free scaffold that externalizes stakeholders, consequences, and uncertainty for more auditable outputs. The work is strengthened by explicit controls (matched token budget, section ablation attributing shifts to specific sub-instructions), testing across multiple models and vendors, and the debate extension showing conversion of standoffs to consensus; these elements provide reproducible empirical support for format-driven effects.

major comments (1)

[Evaluation protocol and metrics definition (as described in abstract and results on DailyDilemmas)] The evaluation relies on internally defined metrics of stakeholder count (number of named parties) and uncertainty score (presence of explicit hedges/unknowns), with no reported correlation to human ethical judgments, expert ratings, or downstream performance (e.g., simulated harm reduction). This is load-bearing for the central claim of improved 'defeasible ethical reasoning' because the quantitative gains (Cliff's delta +0.79 to +0.93) could reflect prompt compliance rather than substantive improvement; the matched-budget and ablation results address format vs. length but not external grounding.

minor comments (1)

[Abstract] The abstract and results sections could clarify the exact operational definitions of 'stakeholder collapse' and 'uncertainty suppression' with an example trace to improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the single major comment below.

read point-by-point responses

Referee: The evaluation relies on internally defined metrics of stakeholder count (number of named parties) and uncertainty score (presence of explicit hedges/unknowns), with no reported correlation to human ethical judgments, expert ratings, or downstream performance (e.g., simulated harm reduction). This is load-bearing for the central claim of improved 'defeasible ethical reasoning' because the quantitative gains (Cliff's delta +0.79 to +0.93) could reflect prompt compliance rather than substantive improvement; the matched-budget and ablation results address format vs. length but not external grounding.

Authors: The metrics were constructed to quantify the two concrete failure modes defined in the introduction (stakeholder collapse as naming at most one party; uncertainty suppression as absence of explicit hedges before commitment). The matched-budget verbose-CoT control and section ablation isolate the contribution of the specific five-section structure, while the five-round debate extension demonstrates conversion of standoffs to 95-100% consensus, supplying indirect evidence of utility beyond format compliance. We nevertheless agree that the absence of correlation with human ethical judgments or downstream harm metrics constitutes a genuine limitation for the stronger claim of improved defeasible ethical reasoning. We will add an explicit limitations paragraph acknowledging this gap and outlining planned follow-up studies with human raters and simulated harm metrics. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical prompt evaluation only

full rationale

The paper introduces a fixed five-section prompt template and reports direct counts of named stakeholders and explicit uncertainty expressions on 100 held-out DailyDilemmas scenarios. No equations, fitted parameters, self-referential quantities, or derivations appear; the reported reductions are literal measurements of output compliance with the prompt instructions. No self-citation chain or uniqueness theorem is invoked to support a central claim. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of a new prompt template whose success depends on LLM instruction-following behavior rather than new mathematical derivations or invented entities.

axioms (1)

domain assumption Large language models can reliably follow a complex multi-section system prompt to externalize stakeholders, consequences, and uncertainty without additional training.
The method's reported gains presuppose that the target models will adhere to the five-section format on the tested scenarios.

pith-pipeline@v0.9.1-grok · 5830 in / 1268 out tokens · 25390 ms · 2026-06-26T01:34:37.368491+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

69 extracted references · 8 linked inside Pith

[1]

Ji, Jiaming and Qiu, Tianyi and Chen, Boyuan and Zhang, Borong and Lou, Hantao and Wang, Kaile and Duan, Yawen and He, Zhonghao and Zhou, Jiayi and Zhang, Zhaowei and Zeng, Fanzhi and Ng, Kwan Yee and Dai, Juntao and Pan, Xuehai and O'Gara, Aidan and Lei, Yingshan and Xu, Hua and Tse, Brian and Fu, Jie and McAleer, Stephen and Yang, Yaodong and Wang, Yizh...
[2]

2025 , month = apr, url =

Sycophancy in. 2025 , month = apr, url =

2025
[3]

2025 , url =

Expanding on. 2025 , url =

2025
[4]

and Ritchie, Stuart J

Lynch, Aengus and Wright, Benjamin and Larson, Caleb and Troy, Kevin K. and Ritchie, Stuart J. and Mindermann, S. Agentic. 2025 , month = jun, url =

2025
[5]

Yan, Hanqi and Xu, Hainiu and Qi, Siya and Yang, Shu and He, Yulan , journal =. When
[6]

and others , journal =

Cheng, Myra and Durmus, Esin and Zhang, Miranda and Korbak, Tomasz and Perez, Ethan and Bowman, Samuel R. and others , journal =
[7]

and Cheng, Newton and Durmus, Esin and Hatfield-Dodds, Zac and Johnston, Scott R

Sharma, Mrinank and Tong, Meg and Korbak, Tomasz and Duvenaud, David and Askell, Amanda and Bowman, Samuel R. and Cheng, Newton and Durmus, Esin and Hatfield-Dodds, Zac and Johnston, Scott R. and Kravec, Shauna and Maxwell, Timothy and McCandlish, Sam and Ndousse, Kamal and Rausch, Oliver and Schiefer, Nicholas and Yan, Da and Zhang, Miranda and Perez, Et...

2024
[8]

Sycophancy in

Malmqvist, Lars , journal =. Sycophancy in
[9]

Alignment

Greenblatt, Ryan and Denison, Carson and Wright, Benjamin and Roger, Fabien and MacDiarmid, Monte and Marks, Sam and Treutlein, Johannes and Belrose, Tim and Scheurer, Jonathan and Scheurer, Mikita and others , journal =. Alignment
[10]

Frontier

Meinke, Alexander and Schoen, Bronson and Scheurer, J. Frontier. arXiv preprint arXiv:2412.04984 , year =

Pith/arXiv arXiv
[11]

Turner, Alexander Matt and Smith, Logan Riggs and Shah, Rohin and Critch, Andrew and Tadepalli, Prasad , journal =. Optimal
[12]

Bostrom, Nick , journal =. The
[13]

Survey of

Shayegani, Erfan and Mamun, Md Abdullah Al and Fu, Yu and Zaree, Pedram and Dong, Yue and Abu-Ghazaleh, Nael , journal =. Survey of
[14]

Adversarially

Goldblum, Micah and Fowl, Liam and Feizi, Soheil and Goldstein, Tom , booktitle =. Adversarially
[15]

Concrete

Amodei, Dario and Olah, Chris and Steinhardt, Jacob and Christiano, Paul and Schulman, John and Man. Concrete. arXiv preprint arXiv:1606.06565 , year =

Pith/arXiv arXiv
[16]

and Everitt, Tom and Lefrancq, Andrew and Orseau, Laurent and Legg, Shane , journal =

Leike, Jan and Martic, Miljan and Krakovna, Victoria and Ortega, Pedro A. and Everitt, Tom and Lefrancq, Andrew and Orseau, Laurent and Legg, Shane , journal =
[17]

Jin, Zhijing and Liu, Jiarui and Lyu, Zhiheng and Poff, Spencer and Sachan, Mrinmaya and Mihalcea, Rada and Diab, Mona and Sch. Can. International Conference on Learning Representations (ICLR) , year =
[18]

MacIntyre, Alasdair , year =. After
[19]

Bruner, Jerome , year =. Actual
[20]

Simulacra and

Baudrillard, Jean , year =. Simulacra and
[21]

Yudkowsky, Eliezer , booktitle =. Complex. 2011 , publisher =

2011
[22]

Superintelligence:

Bostrom, Nick , year =. Superintelligence:
[23]

Causality:

Pearl, Judea , year =. Causality:
[24]

Pearl, Judea , journal =. Causal
[25]

Janzing, Dominik and Sch. Causal. IEEE Transactions on Information Theory , volume =
[26]

and Heskes, Tom , booktitle =

Claassen, Tom and Mooij, Joris M. and Heskes, Tom , booktitle =. Learning
[27]

Chickering, David Maxwell , journal =. Optimal
[28]

Li, Ming and Vit. An. 2019 , edition =

2019
[29]

, year =

Fodor, Jerry A. , year =. The
[30]

Psychological

Putnam, Hilary , booktitle =. Psychological. 1967 , publisher =

1967
[31]

Kuleshov on

Kuleshov, Lev , year =. Kuleshov on
[32]

Narration in the

Bordwell, David , year =. Narration in the
[33]

Chollet, Fran. On the. arXiv preprint arXiv:1911.01547 , year =

Pith/arXiv arXiv 1911
[34]

and Mullally, Sin

Maguire, Eleanor A. and Mullally, Sin. Is. Trends in Cognitive Sciences , volume =
[35]

Friston, Karl , journal =. The
[36]

Rao, Rajesh P. N. and Ballard, Dana H. , journal =. Predictive
[37]

Blum, Lenore and Blum, Manuel , journal =. A
[38]

Theoretical

Blum, Lenore and Blum, Manuel , journal =. Theoretical
[39]

and Focella, Elizabeth S

Shaffer, Vicki A. and Focella, Elizabeth S. and Hathaway, Andrew and Scherer, Laura D. and Zikmund-Fisher, Brian J. , journal =. Why
[40]

Bientzle, Martina and Eggeling, Marie and Cress, Ulrike and Kimmerle, Joachim , journal =. The
[41]

Narrative

Bientzle, Martina and Cress, Ulrike and Kimmerle, Joachim , journal =. Narrative
[42]

and Cai, Carrie J

Park, Joon Sung and O'Brien, Joseph C. and Cai, Carrie J. and Morris, Meredith Ringel and Liang, Percy and Bernstein, Michael S. , booktitle =. Generative
[43]

Irving, Geoffrey and Christiano, Paul and Amodei, Dario , journal =
[44]

and Mordatch, Igor , journal =

Du, Yilun and Li, Shuang and Torralba, Antonio and Tenenbaum, Joshua B. and Mordatch, Igor , journal =. Improving
[45]

, journal =

Pollock, John L. , journal =. Defeasible
[46]

Gottweis, Juraj and Weng, Wei-Hung and Daryin, Alexander and Tu, Tao and Palepu, Anil and Sirkovic, Petar and Myaskovsky, Artiom and Weissenberger, Felix and Rong, Keran and Tanno, Ryutaro and Saab, Khaled and Popovici, Dan and Blum, Jacob and Zhang, Fan and Chou, Katherine and Hassidim, Avinatan and Gokturk, Burak and Vahdat, Amin and Kohli, Pushmeet and...

Pith/arXiv arXiv
[47]

Lu, Chris and Lu, Cong and Lange, Robert Tjarko and Foerster, Jakob and Clune, Jeff and Ha, David , journal =. The
[48]

and MacKnight, Robert and Kline, Ben and Gomes, Gabe , journal =

Boiko, Daniil A. and MacKnight, Robert and Kline, Ben and Gomes, Gabe , journal =. Autonomous
[49]

Bran, Andres and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D

M. Bran, Andres and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D. and Schwaller, Philippe , journal =. Augmenting
[50]

Schmidgall, Samuel and Su, Yusheng and Wang, Ze and Sun, Ximeng and Wu, Jialian and Yu, Xiaodong and Liu, Jiang and Moor, Michael and Liu, Zicheng and Barsoum, Emad , journal =. Agent
[51]

and Pak, John E

Swanson, Kyle and Wu, Wesley and Bulaong, Nash L. and Pak, John E. and Zou, James , journal =. The
[52]

Chain-of-

Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed and Le, Quoc and Zhou, Denny , booktitle =. Chain-of-
[53]

Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke , booktitle =. Large
[54]

2025 , note =

Chiu, Yu Ying and Jiang, Liwei and Choi, Yejin , booktitle =. 2025 , note =

2025
[55]

Jin, Zhijing and Levine, Sydney and Gonzalez, Fernando and Kamal, Ojasv and Sap, Maarten and Sachan, Mrinmaya and Mihalcea, Rada and Tenenbaum, Joshua B. and Sch. When to. Advances in Neural Information Processing Systems (NeurIPS) , year =
[56]

Wu, Wenshan and Mao, Shaoguang and Zhang, Yadong and Xia, Yan and Dong, Li and Cui, Lei and Wei, Furu , journal =. Mind's
[57]

Dominance

Cliff, Norman , journal =. Dominance
[58]

Cohen, Jacob , journal =. A
[59]

Divergence

Lin, Jianhua , journal =. Divergence
[60]

arXiv preprint arXiv:2308.01263 , year =

R. arXiv preprint arXiv:2308.01263 , year =

Pith/arXiv arXiv
[61]

and Akinwande, Victor and Al-Nuaimi, Namir and Alfaraj, Najla and Alhussain, Elie and Banovic, Nanna and Barikeri, Soumya and Bartolo, Max and others , journal =

Vidgen, Bertie and Agrawal, Adarsh and Ahmed, Ahmed M. and Akinwande, Victor and Al-Nuaimi, Namir and Alfaraj, Najla and Alhussain, Elie and Banovic, Nanna and Barikeri, Soumya and Bartolo, Max and others , journal =. Introducing
[62]

Yuksekgonul, Mert and Bianchi, Federico and Boen, Joseph and Liu, Sheng and Huang, Zhi and Guestrin, Carlos and Zou, James , journal =
[63]

arXiv preprint arXiv:2309.03409 , year =

Large Language Models as Optimizers , author =. arXiv preprint arXiv:2309.03409 , year =

Pith/arXiv arXiv
[64]

arXiv preprint arXiv:2211.01910 , year =

Large Language Models Are Human-Level Prompt Engineers , author =. arXiv preprint arXiv:2211.01910 , year =

Pith/arXiv arXiv
[65]

arXiv preprint arXiv:2602.23971 , year =

Ask don't tell: Reducing sycophancy in large language models , author =. arXiv preprint arXiv:2602.23971 , year =

Pith/arXiv arXiv
[66]

2025 , note =

Petrov, Ivo and Dekoninck, Jasper and Vechev, Martin , journal =. 2025 , note =

2025
[67]

2025 , note =

Fanous, Ahmed and others , journal =. 2025 , note =

2025
[68]

2004 , publisher =

Content Analysis: An Introduction to Its Methodology , author =. 2004 , publisher =

2004
[69]

2026 , note =

Narration-of-Thought:. 2026 , note =

2026

[1] [1]

Ji, Jiaming and Qiu, Tianyi and Chen, Boyuan and Zhang, Borong and Lou, Hantao and Wang, Kaile and Duan, Yawen and He, Zhonghao and Zhou, Jiayi and Zhang, Zhaowei and Zeng, Fanzhi and Ng, Kwan Yee and Dai, Juntao and Pan, Xuehai and O'Gara, Aidan and Lei, Yingshan and Xu, Hua and Tse, Brian and Fu, Jie and McAleer, Stephen and Yang, Yaodong and Wang, Yizh...

[2] [2]

2025 , month = apr, url =

Sycophancy in. 2025 , month = apr, url =

2025

[3] [3]

2025 , url =

Expanding on. 2025 , url =

2025

[4] [4]

and Ritchie, Stuart J

Lynch, Aengus and Wright, Benjamin and Larson, Caleb and Troy, Kevin K. and Ritchie, Stuart J. and Mindermann, S. Agentic. 2025 , month = jun, url =

2025

[5] [5]

Yan, Hanqi and Xu, Hainiu and Qi, Siya and Yang, Shu and He, Yulan , journal =. When

[6] [6]

and others , journal =

Cheng, Myra and Durmus, Esin and Zhang, Miranda and Korbak, Tomasz and Perez, Ethan and Bowman, Samuel R. and others , journal =

[7] [7]

and Cheng, Newton and Durmus, Esin and Hatfield-Dodds, Zac and Johnston, Scott R

Sharma, Mrinank and Tong, Meg and Korbak, Tomasz and Duvenaud, David and Askell, Amanda and Bowman, Samuel R. and Cheng, Newton and Durmus, Esin and Hatfield-Dodds, Zac and Johnston, Scott R. and Kravec, Shauna and Maxwell, Timothy and McCandlish, Sam and Ndousse, Kamal and Rausch, Oliver and Schiefer, Nicholas and Yan, Da and Zhang, Miranda and Perez, Et...

2024

[8] [8]

Sycophancy in

Malmqvist, Lars , journal =. Sycophancy in

[9] [9]

Alignment

Greenblatt, Ryan and Denison, Carson and Wright, Benjamin and Roger, Fabien and MacDiarmid, Monte and Marks, Sam and Treutlein, Johannes and Belrose, Tim and Scheurer, Jonathan and Scheurer, Mikita and others , journal =. Alignment

[10] [10]

Frontier

Meinke, Alexander and Schoen, Bronson and Scheurer, J. Frontier. arXiv preprint arXiv:2412.04984 , year =

Pith/arXiv arXiv

[11] [11]

Turner, Alexander Matt and Smith, Logan Riggs and Shah, Rohin and Critch, Andrew and Tadepalli, Prasad , journal =. Optimal

[12] [12]

Bostrom, Nick , journal =. The

[13] [13]

Survey of

Shayegani, Erfan and Mamun, Md Abdullah Al and Fu, Yu and Zaree, Pedram and Dong, Yue and Abu-Ghazaleh, Nael , journal =. Survey of

[14] [14]

Adversarially

Goldblum, Micah and Fowl, Liam and Feizi, Soheil and Goldstein, Tom , booktitle =. Adversarially

[15] [15]

Concrete

Amodei, Dario and Olah, Chris and Steinhardt, Jacob and Christiano, Paul and Schulman, John and Man. Concrete. arXiv preprint arXiv:1606.06565 , year =

Pith/arXiv arXiv

[16] [16]

and Everitt, Tom and Lefrancq, Andrew and Orseau, Laurent and Legg, Shane , journal =

Leike, Jan and Martic, Miljan and Krakovna, Victoria and Ortega, Pedro A. and Everitt, Tom and Lefrancq, Andrew and Orseau, Laurent and Legg, Shane , journal =

[17] [17]

Jin, Zhijing and Liu, Jiarui and Lyu, Zhiheng and Poff, Spencer and Sachan, Mrinmaya and Mihalcea, Rada and Diab, Mona and Sch. Can. International Conference on Learning Representations (ICLR) , year =

[18] [18]

MacIntyre, Alasdair , year =. After

[19] [19]

Bruner, Jerome , year =. Actual

[20] [20]

Simulacra and

Baudrillard, Jean , year =. Simulacra and

[21] [21]

Yudkowsky, Eliezer , booktitle =. Complex. 2011 , publisher =

2011

[22] [22]

Superintelligence:

Bostrom, Nick , year =. Superintelligence:

[23] [23]

Causality:

Pearl, Judea , year =. Causality:

[24] [24]

Pearl, Judea , journal =. Causal

[25] [25]

Janzing, Dominik and Sch. Causal. IEEE Transactions on Information Theory , volume =

[26] [26]

and Heskes, Tom , booktitle =

Claassen, Tom and Mooij, Joris M. and Heskes, Tom , booktitle =. Learning

[27] [27]

Chickering, David Maxwell , journal =. Optimal

[28] [28]

Li, Ming and Vit. An. 2019 , edition =

2019

[29] [29]

, year =

Fodor, Jerry A. , year =. The

[30] [30]

Psychological

Putnam, Hilary , booktitle =. Psychological. 1967 , publisher =

1967

[31] [31]

Kuleshov on

Kuleshov, Lev , year =. Kuleshov on

[32] [32]

Narration in the

Bordwell, David , year =. Narration in the

[33] [33]

Chollet, Fran. On the. arXiv preprint arXiv:1911.01547 , year =

Pith/arXiv arXiv 1911

[34] [34]

and Mullally, Sin

Maguire, Eleanor A. and Mullally, Sin. Is. Trends in Cognitive Sciences , volume =

[35] [35]

Friston, Karl , journal =. The

[36] [36]

Rao, Rajesh P. N. and Ballard, Dana H. , journal =. Predictive

[37] [37]

Blum, Lenore and Blum, Manuel , journal =. A

[38] [38]

Theoretical

Blum, Lenore and Blum, Manuel , journal =. Theoretical

[39] [39]

and Focella, Elizabeth S

Shaffer, Vicki A. and Focella, Elizabeth S. and Hathaway, Andrew and Scherer, Laura D. and Zikmund-Fisher, Brian J. , journal =. Why

[40] [40]

Bientzle, Martina and Eggeling, Marie and Cress, Ulrike and Kimmerle, Joachim , journal =. The

[41] [41]

Narrative

Bientzle, Martina and Cress, Ulrike and Kimmerle, Joachim , journal =. Narrative

[42] [42]

and Cai, Carrie J

Park, Joon Sung and O'Brien, Joseph C. and Cai, Carrie J. and Morris, Meredith Ringel and Liang, Percy and Bernstein, Michael S. , booktitle =. Generative

[43] [43]

Irving, Geoffrey and Christiano, Paul and Amodei, Dario , journal =

[44] [44]

and Mordatch, Igor , journal =

Du, Yilun and Li, Shuang and Torralba, Antonio and Tenenbaum, Joshua B. and Mordatch, Igor , journal =. Improving

[45] [45]

, journal =

Pollock, John L. , journal =. Defeasible

[46] [46]

Gottweis, Juraj and Weng, Wei-Hung and Daryin, Alexander and Tu, Tao and Palepu, Anil and Sirkovic, Petar and Myaskovsky, Artiom and Weissenberger, Felix and Rong, Keran and Tanno, Ryutaro and Saab, Khaled and Popovici, Dan and Blum, Jacob and Zhang, Fan and Chou, Katherine and Hassidim, Avinatan and Gokturk, Burak and Vahdat, Amin and Kohli, Pushmeet and...

Pith/arXiv arXiv

[47] [47]

Lu, Chris and Lu, Cong and Lange, Robert Tjarko and Foerster, Jakob and Clune, Jeff and Ha, David , journal =. The

[48] [48]

and MacKnight, Robert and Kline, Ben and Gomes, Gabe , journal =

Boiko, Daniil A. and MacKnight, Robert and Kline, Ben and Gomes, Gabe , journal =. Autonomous

[49] [49]

Bran, Andres and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D

M. Bran, Andres and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D. and Schwaller, Philippe , journal =. Augmenting

[50] [50]

Schmidgall, Samuel and Su, Yusheng and Wang, Ze and Sun, Ximeng and Wu, Jialian and Yu, Xiaodong and Liu, Jiang and Moor, Michael and Liu, Zicheng and Barsoum, Emad , journal =. Agent

[51] [51]

and Pak, John E

Swanson, Kyle and Wu, Wesley and Bulaong, Nash L. and Pak, John E. and Zou, James , journal =. The

[52] [52]

Chain-of-

Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed and Le, Quoc and Zhou, Denny , booktitle =. Chain-of-

[53] [53]

Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke , booktitle =. Large

[54] [54]

2025 , note =

Chiu, Yu Ying and Jiang, Liwei and Choi, Yejin , booktitle =. 2025 , note =

2025

[55] [55]

Jin, Zhijing and Levine, Sydney and Gonzalez, Fernando and Kamal, Ojasv and Sap, Maarten and Sachan, Mrinmaya and Mihalcea, Rada and Tenenbaum, Joshua B. and Sch. When to. Advances in Neural Information Processing Systems (NeurIPS) , year =

[56] [56]

Wu, Wenshan and Mao, Shaoguang and Zhang, Yadong and Xia, Yan and Dong, Li and Cui, Lei and Wei, Furu , journal =. Mind's

[57] [57]

Dominance

Cliff, Norman , journal =. Dominance

[58] [58]

Cohen, Jacob , journal =. A

[59] [59]

Divergence

Lin, Jianhua , journal =. Divergence

[60] [60]

arXiv preprint arXiv:2308.01263 , year =

R. arXiv preprint arXiv:2308.01263 , year =

Pith/arXiv arXiv

[61] [61]

and Akinwande, Victor and Al-Nuaimi, Namir and Alfaraj, Najla and Alhussain, Elie and Banovic, Nanna and Barikeri, Soumya and Bartolo, Max and others , journal =

Vidgen, Bertie and Agrawal, Adarsh and Ahmed, Ahmed M. and Akinwande, Victor and Al-Nuaimi, Namir and Alfaraj, Najla and Alhussain, Elie and Banovic, Nanna and Barikeri, Soumya and Bartolo, Max and others , journal =. Introducing

[62] [62]

Yuksekgonul, Mert and Bianchi, Federico and Boen, Joseph and Liu, Sheng and Huang, Zhi and Guestrin, Carlos and Zou, James , journal =

[63] [63]

arXiv preprint arXiv:2309.03409 , year =

Large Language Models as Optimizers , author =. arXiv preprint arXiv:2309.03409 , year =

Pith/arXiv arXiv

[64] [64]

arXiv preprint arXiv:2211.01910 , year =

Large Language Models Are Human-Level Prompt Engineers , author =. arXiv preprint arXiv:2211.01910 , year =

Pith/arXiv arXiv

[65] [65]

arXiv preprint arXiv:2602.23971 , year =

Ask don't tell: Reducing sycophancy in large language models , author =. arXiv preprint arXiv:2602.23971 , year =

Pith/arXiv arXiv

[66] [66]

2025 , note =

Petrov, Ivo and Dekoninck, Jasper and Vechev, Martin , journal =. 2025 , note =

2025

[67] [67]

2025 , note =

Fanous, Ahmed and others , journal =. 2025 , note =

2025

[68] [68]

2004 , publisher =

Content Analysis: An Introduction to Its Methodology , author =. 2004 , publisher =

2004

[69] [69]

2026 , note =

Narration-of-Thought:. 2026 , note =

2026