HERO'S JOURNEY: Testing Complex Rule Induction with Text Games

Anshun Asher Zheng; David I. Beaver; Junyi Jessy Li; Kanishka Misra

arxiv: 2606.02556 · v1 · pith:4HF35XRJnew · submitted 2026-06-01 · 💻 cs.CL

Pith reviewed 2026-06-28 14:48 UTC · model grok-4.3

classification 💻 cs.CL

keywords rule induction · text games · large language models · benchmark · attribute induction · procedural induction · episodic tasks · steering methods

The pith

Large language models show evidence of rule induction in text games but remain limited and uneven, with execution as a bottleneck.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HERO'S JOURNEY, a benchmark of eight tasks in which agents must infer hidden rules from demonstrations in goal-directed episodic text games and then execute multi-step actions based on those rules. The tasks span attribute and procedural induction families, each with four structural rule forms plus controls for lexical grounding and identifiability. Evaluations of current LLMs find that models can induce some rules from examples, yet performance is inconsistent across tasks. Process execution creates a clear bottleneck, while surface-level semantic changes have little impact. Targeted steering improves attribute tasks but produces no reliable gains on procedural ones.

Core claim

Models show evidence of rule induction from demonstrations in episodic text game tasks, yet this ability remains limited and varies unevenly across the eight tasks in the benchmark. Having to carry out a multi-step process adds a bottleneck of its own, while surface semantics has minimal effect on performance. Induction-specific steering improves results on attribute induction tasks but yields no reliable gains on procedural tasks.

What carries the argument

HERO'S JOURNEY benchmark of eight tasks across attribute and procedural induction families, each with four structural rule forms, controllable lexical grounding, and identifiability conditions.
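
As a rough illustration of how such a task grid could be organized, the sketch below crosses the two induction families with four rule forms. It assumes the eight tasks arise as 2 families × 4 forms, which the A-/P- task names in the figure captions suggest but this summary does not state outright. Only the "Comp" form is named in the text here; the other three labels are hypothetical placeholders, as is the `build_task_grid` helper.

```python
from itertools import product

# Family prefixes follow the A-/P- naming seen in the figure captions.
FAMILIES = {"attribute": "A", "procedural": "P"}

# Only "Comp" appears in this summary; the other three labels are
# hypothetical placeholders for the unnamed structural rule forms.
RULE_FORMS = ("Comp", "Form2", "Form3", "Form4")


def build_task_grid():
    """Return one task descriptor per (family, rule form) pair."""
    return [
        {"task_id": f"{prefix}-{form}", "family": family, "rule_form": form}
        for (family, prefix), form in product(FAMILIES.items(), RULE_FORMS)
    ]


if __name__ == "__main__":
    for task in build_task_grid():
        print(task["task_id"], task["family"], task["rule_form"])
```

Keeping family and rule form as separate fields makes it straightforward to aggregate results along either axis, which is how the attribute versus procedural comparison is framed.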

If this is right

Models that induce rules from demonstrations could generalize across new goal-directed scenarios within the same task families.
The execution bottleneck implies that even correctly induced rules do not guarantee successful multi-step application.
Steering methods effective only on attribute tasks indicate that procedural rule induction requires distinct techniques.
The minimal effect of surface semantics suggests that models rely primarily on structural patterns rather than lexical cues.
The gap in procedural induction remains an open challenge for improving model performance on sequence-based rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark design could be adapted to test rule induction in partially observable or multi-agent text environments.
Persistent procedural gaps may constrain AI agents that must chain actions in planning or game-like settings.
Identifiability conditions could inform construction of training curricula that strengthen rule extraction in LLMs.
Results on steering suggest hybrid methods combining induction with explicit execution planning merit further tests.

Load-bearing premise

The eight tasks and their identifiability conditions isolate rule induction ability without confounding effects from prompting format, game length, or lexical choice.

What would settle it

If models achieve equal performance on procedural and attribute tasks when rules are induced from demonstrations, or if execution accuracy matches induction accuracy once rules are known, the claimed execution bottleneck and uneven induction would be undermined.

Figures

Figures reproduced from arXiv: 2606.02556 by Anshun Asher Zheng, David I. Beaver, Junyi Jessy Li, Kanishka Misra.

**Figure 1.** Overview of HERO’S JOURNEY. (1) Rule interactions: four structural forms varying how entity attributes jointly determine the required item or process. (2) Attribute induction tasks: illustrated with A-Comp tasks (§3.2.1; each attribute independently governs a separate output dimension); entity class (ranger, captain) and role (prophet, chirurgeon). (3) Procedural induction tasks: illustrated with P-Comp tasks …

**Figure 2.** Task curation and the eight induction tasks.

**Figure 3.** ECSR vs. RV across all model×task conditions. Each point represents one model on one task.
From §4 (Evaluation): For each task in §3, we generate 20 variants by randomly sampling surface names from the lexicon and varying the source-gen splits. In each episode the agent receives (see prompt in Appx. H.1): (1) a world listing of all entities, attributes, and locations; (2) source-split demonstrations …

**Figure 4.** ECSR (solid colored lines) and contextual_bias(k) (dotted gray triangles) across different coverage k/k∗, for all eight tasks with GPT-5.4-mini and GPT-OSS-120B. The vertical dotted line marks the identifiability threshold k∗ = 1 (full source split). Bias is zero and omitted for A/P-Comp. … when the rule is not identified, reflecting that humans who miss the rule generally cannot complete the task efficie…

**Figure 5.** Task curation illustrated on the Dragon’s Keep example (cf. Figure …

**Figure 6.** Attribute induction task grids (one instantiation of the source/gen split). Rows index class ( …

**Figure 7.** Procedural induction task grids (one instantiation of the source/gen split). Rows index class ( …

**Figure 8.** Task success rate (left) and normalized efficiency on successful episodes (right) per model and task, with …

**Figure 9.** (A–B) Format gap (QA accuracy − ECSR) per model and task, for attribute (A) and procedural (B) tasks. Bars above zero indicate an execution bottleneck; bars below zero a QA bottleneck. Stars mark gaps significantly non-zero (Bonferroni-corrected by task family; ∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001). (C–D) ECSR and RV under semantic vs. nonce conditions for attribute (C) and procedural (D) tasks. Brackets ma…

**Figure 10.** ECSR and RV for GPT-5.4-mini and Qwen3.5-27B under semantic vs. nonce lexical conditions, for attribute (left) and procedural (right) tasks. Brackets mark significant pairwise differences (Bonferroni-corrected by task family; p < 0.05); unmarked pairs are non-significant.
[Plot residue: axis Δ ECSR; methods ReAct, ACE, IDEA, HR; models GPT and Qwen; task families Attribute and Procedural.]

**Figure 12.** Episodic environment interface for humans to play the games; we use the same instructions as the one …

**Figure 13.** Interface to annotate the rule underlying the demonstrations after each episode.
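
The format-gap comparison described in the Figure 9 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names, the `n_tests` argument (the number of comparisons within one task family), and the sample numbers are all hypothetical, and p-values are taken as given rather than computed.

```python
def format_gap(qa_accuracy: float, ecsr: float) -> float:
    """Format gap as defined in the caption: QA accuracy minus ECSR."""
    return qa_accuracy - ecsr


def bottleneck_label(gap: float) -> str:
    """Positive gaps point to an execution bottleneck, negative to a QA bottleneck."""
    if gap > 0:
        return "execution bottleneck"
    if gap < 0:
        return "QA bottleneck"
    return "no gap"


def bonferroni_stars(p_value: float, n_tests: int) -> str:
    """Star marker after Bonferroni correction over n_tests comparisons."""
    # Bonferroni: multiply the raw p-value by the number of tests, capped at 1.
    corrected = min(1.0, p_value * n_tests)
    for threshold, stars in ((0.001, "***"), (0.01, "**"), (0.05, "*")):
        if corrected < threshold:
            return stars
    return ""


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    gap = format_gap(qa_accuracy=0.80, ecsr=0.55)
    print(round(gap, 2), bottleneck_label(gap), bonferroni_stars(0.004, n_tests=4))
```

Correcting within each task family, as the caption describes, means `n_tests` would count only the comparisons made for attribute tasks or only those for procedural tasks, not all of them pooled.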

Original abstract

We introduce HERO'S JOURNEY, a benchmark for rule induction in goal-directed episodic tasks, where agents must infer hidden rules from demonstrations and act on them through multi-step execution. HERO'S JOURNEY covers eight tasks across attribute and procedural induction families, each with four structural rule forms, controllable lexical grounding, and identifiability conditions. Evaluating state-of-the-art LLMs, we find that models show evidence of rule induction, but the ability is limited and uneven across tasks. Meanwhile, process execution adds an execution bottleneck for models, whereas surface semantics has minimal effect. Induction-specific steering methods improve performance on attribute tasks but show no reliable gains on procedural tasks, suggesting the gap in procedural induction remains an open challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

HERO'S JOURNEY gives a clean new benchmark structure for rule induction in text games, but the isolation claims need the full methods and controls to hold up.

The paper's main contribution is a benchmark family that splits rule induction into attribute and procedural tracks across eight tasks, each with four explicit structural rule forms plus controllable lexical grounding and identifiability conditions. That setup is more structured than most existing text-game evaluations.

It does a reasonable job separating induction from execution and testing whether surface semantics matters. The reported pattern (models show limited, uneven induction; execution adds a further bottleneck; steering helps attribute tasks but not procedural ones) lines up with what people have seen in other reasoning work and flags a concrete gap.

The soft spot is the lack of visible methods, data, or ablations. Without those, it's impossible to confirm that the identifiability conditions actually hold the prompting format, episode length, and lexical choices constant. The stress-test concern about confounds is fair on the abstract alone; if the full paper doesn't show quantitative checks that those variables were inert, the attribution of performance differences to rule induction stays shaky.

This is for people building or using reasoning benchmarks who want something more controlled than standard text adventures. A reader focused on LLM induction limits would get concrete task definitions and a clear open question on procedural cases.

It deserves peer review. The benchmark framing is specific enough that referees can check the controls and stats once the full paper is in front of them.

Referee Report

2 major / 1 minor

Summary. The paper introduces the HERO'S JOURNEY benchmark for evaluating rule induction in LLMs via eight goal-directed episodic text-game tasks spanning attribute and procedural induction families. Each task includes four structural rule forms, controllable lexical grounding, and identifiability conditions. Evaluation of state-of-the-art models shows limited and uneven evidence of rule induction, with process execution as a bottleneck and minimal impact from surface semantics; induction-specific steering improves attribute tasks but yields no reliable gains on procedural tasks.

Significance. If the central empirical claims hold under the stated identifiability conditions, the benchmark offers a structured way to probe complex rule induction beyond surface patterns, with the distinction between attribute and procedural families and the steering results highlighting an open challenge in procedural induction. The design elements of controllable lexical grounding and multiple rule forms per task are positive features for systematic evaluation.

major comments (2)

[Abstract] Abstract and the description of identifiability conditions: the central claim that performance differences can be attributed to rule induction (rather than prompting format, episode length, or lexical choice) rests on the assertion that the eight tasks plus identifiability conditions isolate the target construct; however, the manuscript provides no explicit ablations or quantitative checks demonstrating that these variables were held constant or shown to be inert, leaving the attribution of 'limited and uneven' induction and 'minimal effect' of surface semantics unsecured.
[Abstract] Abstract, results on steering methods: the differential effect (gains on attribute tasks, none on procedural) is load-bearing for the conclusion that 'the gap in procedural induction remains an open challenge,' yet without reported statistical tests, error analysis, or controls confirming that the steering interventions were applied identically across families, the unevenness cannot be confidently localized to induction ability versus execution or prompting confounds.

minor comments (1)

The abstract refers to 'four structural rule forms' per task without enumerating or exemplifying them; adding a brief table or figure in the main text would improve clarity on how these forms vary across the attribute/procedural families.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address the major comments point by point below and will make the necessary revisions to strengthen the paper.

Referee: [Abstract] Abstract and the description of identifiability conditions: the central claim that performance differences can be attributed to rule induction (rather than prompting format, episode length, or lexical choice) rests on the assertion that the eight tasks plus identifiability conditions isolate the target construct; however, the manuscript provides no explicit ablations or quantitative checks demonstrating that these variables were held constant or shown to be inert, leaving the attribution of 'limited and uneven' induction and 'minimal effect' of surface semantics unsecured.

Authors: We agree that explicit ablations would provide stronger evidence that the identifiability conditions effectively isolate rule induction from confounds such as prompting format, episode length, and lexical choice. The manuscript describes the design elements intended to achieve this isolation, including controllable lexical grounding and the four structural rule forms. However, we acknowledge the absence of quantitative checks or ablations in the current version. In the revised manuscript, we will add ablations and analyses to verify that these variables are inert under the stated conditions.

Revision: yes

Referee: [Abstract] Abstract, results on steering methods: the differential effect (gains on attribute tasks, none on procedural) is load-bearing for the conclusion that 'the gap in procedural induction remains an open challenge,' yet without reported statistical tests, error analysis, or controls confirming that the steering interventions were applied identically across families, the unevenness cannot be confidently localized to induction ability versus execution or prompting confounds.

Authors: We concur that statistical tests and additional controls are important to support the differential effects observed with steering methods. The current results indicate gains on attribute tasks but no reliable gains on procedural tasks. To address this, the revised version will include statistical significance tests, error analyses, and explicit confirmation that steering interventions were applied consistently across the attribute and procedural families. This will help localize the effects more confidently.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivation chain

The paper introduces an empirical benchmark (HERO'S JOURNEY) consisting of eight tasks for evaluating LLM rule induction in text games, reports model performance results, and discusses effects of execution bottlenecks and surface semantics. No equations, fitted parameters, predictions derived from inputs, or mathematical derivations are present. The central claims rest on experimental measurements against external model evaluations rather than any self-referential reduction, self-citation chain, or ansatz smuggled via prior work. Identifiability conditions are design choices for the benchmark tasks, not a derivation that collapses to its own inputs. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities; the contribution is an empirical benchmark and evaluation protocol.

pith-pipeline@v0.9.1-grok · 5660 in / 938 out tokens · 22010 ms · 2026-06-28T14:48:25.533947+00:00 · methodology


[10] [17]

Jerry A. Fodor and Zenon W. Pylyshyn. 1988. https://doi.org/10.1016/0010-0277(88)90031-5 Cognition

work page doi:10.1016/0010-0277(88)90031-5 1988

[13] [22]

Nature Communications, 2024. https://doi.org/10.1038/s41467-024-50966-x

work page doi:10.1038/s41467-024-50966-x 2024

[14] [24]

Or Honovich, Uri Shaham, Samuel R. Bowman, and Omer Levy. 2023. https://doi.org/10.18653/v1/2023.acl-long.108 Instruction Induction: From Few Examples to Natural Language Task Descriptions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

work page doi:10.18653/v1/2023.acl-long.108 2023

[17] [30]

Timothy T. Rogers and James L. McClelland. 2004. https://doi.org/10.7551/mitpress/6161.001.0001

work page doi:10.7551/mitpress/6161.001.0001 2004

[18] [31]

Sudeep Bhatia and Russell Richie. Psychological Review

[23] [37]

Sudeep Bhatia. 2024. https://doi.org/10.1037/rev0000446 Inductive Reasoning in Minds and Machines. Psychological Review, 131(6):1373--1391

work page doi:10.1037/rev0000446 2024

[24] [38]

Kedi Chen, Dezhao Ruan, Yuhao Dan, Yaoting Wang, Siyu Yan, Xuecheng Wu, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Biqing Qi, Linyang Li, Qipeng Guo, Xiaoming Shi, and Wei Zhang. 2026. https://arxiv.org/abs/2510.10182 A survey of inductive reasoning for large language models. Preprint, arXiv:2510.10182

Pith/arXiv arXiv 2026

[25] [39]

Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. 2019. https://openreview.net/forum?id=rJeXCo0cYX BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, Ma...

2019

[26] [40]

François Chollet. 2019. https://arxiv.org/abs/1911.01547 On the measure of intelligence. Preprint, arXiv:1911.01547

Pith/arXiv arXiv 2019

[27] [41]

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew J. Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. 2018. https://doi.org/10.1007/978-3-030-24337-1_3 TextWorld: A Learning Environment for Text-Based Games. In Computer Games - 7th Workshop, CGW 2...

work page doi:10.1007/978-3-030-24337-1_3 2018

[28] [42]

Simon Jerome Han, Keith J. Ransom, Andrew Perfors, and Charles Kemp. 2024. https://doi.org/10.1016/j.cogsys.2023.101155 Inductive reasoning in humans and large language models. Cognitive Systems Research, 83:101155

work page doi:10.1016/j.cogsys.2023.101155 2024

[29] [43]

Matthew J. Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. 2020. https://doi.org/10.1609/AAAI.V34I05.6297 Interactive Fiction Games: A Colossal Adventure. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2...

work page doi:10.1609/aaai.v34i05.6297 2020

[30] [44]

Brett K. Hayes and Evan Heit. 2018. https://doi.org/10.1002/wcs.1459 Inductive reasoning 2.0. WIREs Cognitive Science, 9(3):e1459

work page doi:10.1002/wcs.1459 2018

[31] [45]

Kaiyu He, Mian Zhang, Shuo Yan, Peilin Wu, and Zhiyu Chen. 2025. https://doi.org/10.18653/v1/2025.findings-acl.698 IDEA: Enhancing the Rule Learning Ability of Large Language Model Agent through Induction, Deduction, and Abduction. In Findings of the Association for Computational Linguistics: ACL 2025, pages 13563--13597, Vienna, Austria. Association fo...

work page doi:10.18653/v1/2025.findings-acl.698 2025

[32] [46]

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. 2024. https://openreview.net/forum?id=VTF8yNQM66 SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net

2024

[33] [47]

Charles Kemp and Alan Jern. 2014. https://doi.org/10.3758/s13423-013-0467-3 A taxonomy of inductive problems. Psychonomic Bulletin & Review, 21(1):23--46

work page doi:10.3758/s13423-013-0467-3 2014

[34] [48]

Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. https://openreview.net/forum?id=SygcCnNKwr Measuring compositional generalization: A comprehensive method on Realistic...

2020

[35] [49]

Najoung Kim and Tal Linzen. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.731 COGS: A Compositional Generalization Challenge Based on Semantic Interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9087--9105, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.emnlp-main.731 2020

[36] [50]

Brenden M. Lake and Marco Baroni. 2018. http://proceedings.mlr.press/v80/lake18a.html Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Proceedings of ...

2018

[37] [51]

Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. 2015. https://doi.org/10.1126/science.aab3050 Human-level concept learning through probabilistic program induction. Science, 350(6266):1332--1338

work page doi:10.1126/science.aab3050 2015

[38] [52]

Kang-il Lee, Hyukhun Koh, Dongryeol Lee, Seunghyun Yoon, Minsung Kim, and Kyomin Jung. 2025. https://doi.org/10.18653/v1/2025.naacl-long.429 Generating Diverse Hypotheses for Inductive Reasoning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volu...

work page doi:10.18653/v1/2025.naacl-long.429 2025

[39] [53]

Jiachun Li, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, and Jun Zhao. 2025. https://openreview.net/forum?id=tZCqSVncRf MIRAGE: evaluating and explaining inductive reasoning process in Language Models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net

2025

[40] [54]

Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. 2026. https://doi.org/10.1038/s41586-026-10265-5 Towards end-to-end automation of AI research. Nature, 651(8107):914--919

work page doi:10.1038/s41586-026-10265-5 2026

[41] [55]

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.759 Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048--11064, Abu Dhabi,...

work page doi:10.18653/v1/2022.emnlp-main.759 2022

[42] [56]

Kanishka Misra, Julia Rayz, and Allyson Ettinger. 2022. https://escholarship.org/uc/item/6170h6nj A Property Induction Framework for Neural Language Models. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 44

2022

[43] [57]

Daniel N. Osherson, Edward E. Smith, Ormond Wilkie, and Alejandro López. 1990. https://doi.org/10.1037/0033-295x.97.2.185 Category-based induction. Psychological Review, 97(2):185--200

work page doi:10.1037/0033-295x.97.2.185 1990

[44] [58]

Sriram Padmanabhan, Kanishka Misra, Kyle Mahowald, and Eunsol Choi. 2025. https://arxiv.org/abs/2504.09387 On language models' sensitivity to suspicious coincidences. Preprint, arXiv:2504.09387

arXiv 2025

[45] [59]

Linlu Qiu, Liwei Jiang, Ximing Lu, Melanie Sclar, Valentina Pyatkin, Chandra Bhagavatula, Bailin Wang, Yoon Kim, Yejin Choi, Nouha Dziri, and Xiang Ren. 2024. https://openreview.net/forum?id=bNt7oajl2a Phenomenal yet puzzling: Testing inductive reasoning capabilities of language models with hypothesis refinement. In The Twelfth International Conference o...

2024

[46] [60]

Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, and Brenden M. Lake. 2020. https://proceedings.neurips.cc/paper/2020/hash/e5a90182cc81e12ab5e72d66e0b46fe3-Abstract.html A Benchmark for Systematic Generalization in Grounded Language Understanding. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information ...

2020

[47] [61]

Joshua Stewart Rule. 2020. https://dspace.mit.edu/entities/publication/5af05170-125e-401d-a8b7-fe1437468356 The child as hacker: building more human-like models of learning. Ph.D. thesis, Massachusetts Institute of Technology

2020

[48] [62]

S.A. Sloman. 1993. https://doi.org/10.1006/cogp.1993.1006 Feature-Based Induction. Cognitive Psychology, 25(2):231--280

work page doi:10.1006/cogp.1993.1006 1993

[49] [63]

Joshua B. Tenenbaum, Charles Kemp, Thomas L. Griffiths, and Noah D. Goodman. 2011. https://doi.org/10.1126/science.1192788 How to Grow a Mind: Statistics, Structure, and Abstraction. Science, 331(6022):1279--1285

work page doi:10.1126/science.1192788 2011

[50] [64]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. https://openreview.net/forum?id=WE_vluYUL-X ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net

2023
