CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators
Pith reviewed 2026-05-12 02:29 UTC · model grok-4.3
The pith
CauSim lets LLMs incrementally build executable causal simulators that scale while keeping query answers verifiable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CauSim constructs increasingly complex causal simulators as executable structural causal models built incrementally by LLMs. These simulators reach globally complex systems while maintaining verifiable ground-truth answers to causal queries. The framework formalizes non-executable causal knowledge into code for augmentation and translates the simulators back into natural language to enable supervision where it was previously scarce.
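Read concretely, an "executable SCM with verifiable answers" can be as small as a program whose exogenous noise, structural mechanisms, and interventions are ordinary function calls. The sketch below is illustrative only (the variable names and probabilities are ours, not the paper's); it shows how an interventional query receives a ground-truth answer by execution rather than by reasoning.

```python
import random

# Minimal illustrative SCM, not the paper's code. U_* draws are exogenous
# noise; f_* are structural mechanisms, following the paper's naming scheme.
def f_sprinkler(rain, u):
    # The sprinkler rarely runs when it rains.
    return u and not rain

def f_wet(rain, sprinkler):
    return rain or sprinkler

def run_once(seed, do=None):
    """Execute the SCM once; `do` forces values, severing the mechanism."""
    do = do or {}
    rng = random.Random(seed)
    u_rain, u_spr = rng.random() < 0.3, rng.random() < 0.5
    rain = do.get("rain", u_rain)
    sprinkler = do.get("sprinkler", f_sprinkler(rain, u_spr))
    wet = do.get("wet", f_wet(rain, sprinkler))
    return {"rain": rain, "sprinkler": sprinkler, "wet": wet}

# Ground truth for an interventional query by direct execution:
# P(wet = True | do(sprinkler = True)), estimated over 10,000 seeds.
p = sum(run_once(s, do={"sprinkler": True})["wet"] for s in range(10_000)) / 10_000
```

The intervention is implemented by overriding values in topological order, so downstream mechanisms see the forced value while upstream noise is untouched; this is what makes the query answer checkable without any model in the loop.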
What carries the argument
CauSim, a framework that has LLMs incrementally construct executable structural causal models to produce scalable simulators with verifiable causal query answers.
If this is right
- Training LLMs on data generated from these simulators produces consistent gains in causal reasoning performance.
- Performance continues to improve as simulator complexity increases through curriculum ordering and as data volume grows.
- LLMs achieve self-improvement on causal tasks by using simulators they have generated themselves.
- Non-executable domain knowledge can be formalized into executable simulators to augment training data.
Where Pith is reading between the lines
- The same incremental construction process could be tested as a way to bootstrap causal models from observational datasets in specific application areas.
- Generated simulators could serve as controlled environments for measuring the limits of current causal reasoning techniques before applying them to real systems.
- Detecting and correcting construction errors during the incremental build would allow the method to reach even larger scales.
Load-bearing premise
Large language models can incrementally construct these complex simulators without introducing errors that invalidate the ground-truth causal relationships or the correctness of query answers.
What would settle it
Direct verification that answers to causal queries computed from the generated simulator match the answers derived from its underlying causal graph structure, particularly as the number of variables and relations grows.
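One concrete shape such a check could take, sketched here under assumed names rather than the paper's actual procedure, is an execution-level test of a graph-implied invariance: intervening on a variable that is not an ancestor of Y must leave Y's marginal unchanged.

```python
import random

# Hypothetical consistency check: answers computed by executing a simulator
# should agree with its declared causal graph. GRAPH maps node -> parents.
GRAPH = {"rain": [], "sprinkler": ["rain"], "wet": ["rain", "sprinkler"]}

def ancestors(graph, node):
    seen, stack = set(), list(graph[node])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(graph[p])
    return seen

def run_once(seed, do=None):
    do = do or {}
    rng = random.Random(seed)
    # dict.get evaluates its default eagerly, so the noise draws stay
    # aligned across interventions (same seed -> same exogenous values).
    rain = do.get("rain", rng.random() < 0.3)
    sprinkler = do.get("sprinkler", rng.random() < 0.5 and not rain)
    wet = do.get("wet", rain or sprinkler)
    return {"rain": rain, "sprinkler": sprinkler, "wet": wet}

def distribution(var, do, n=2000):
    return sum(run_once(s, do)[var] for s in range(n)) / n

def check_graph_consistency(graph, n=2000, tol=0.05):
    """do(X) must leave Y's marginal unchanged when X is not an ancestor of Y."""
    for y in graph:
        base = distribution(y, None, n)
        for x in graph:
            if x == y or x in ancestors(graph, y):
                continue
            for forced in (True, False):
                if abs(distribution(y, {x: forced}, n) - base) > tol:
                    return False
    return True
```

As the number of variables and relations grows, the quadratic pair loop is what gets expensive; a scaled-up version would sample pairs rather than enumerate them.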
Original abstract
Despite surpassing human performance across mathematics, coding, and other knowledge-intensive tasks, large language models (LLMs) continue to struggle with causal reasoning. A core obstacle is the target data itself: causal systems are complex and often expressed in non-executable forms, while ground-truth answers to causal queries are inherently scarce. We introduce CauSim, a framework that turns causal reasoning from a scarce-label problem into a scalable supervised one. CauSim constructs increasingly complex causal simulators: executable structural causal models (SCMs), incrementally built by LLMs, that scale to globally complex systems while maintaining verifiable answers to causal queries. CauSim operates across representations by formalizing non-executable causal knowledge into code, enabling data augmentation, and translating executable SCMs into natural language, enabling supervision in previously difficult-to-supervise representations. We structure our research into two parts: (1) how to construct increasingly complex causal simulators, and (2) a systematic study of what CauSim enables, demonstrating generalization across representations, consistent gains from curriculum scaling and data volume, LLM self-improvement through self-generated simulators, and data augmentation via formalization of existing domain knowledge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CauSim, a framework that uses LLMs to incrementally construct executable structural causal models (SCMs) as simulators for causal reasoning. These simulators scale in complexity while providing verifiable ground-truth answers to causal queries. The approach formalizes non-executable causal knowledge into code for data augmentation and translates SCMs back to natural language for supervision. Research is structured in two parts: (1) methods for building increasingly complex causal simulators and (2) empirical studies demonstrating generalization across representations, gains from curriculum scaling and data volume, LLM self-improvement via self-generated simulators, and augmentation from domain knowledge.
Significance. If the central claims hold, CauSim would address the scarcity of causal data and supervision signals for LLMs by converting causal reasoning into a scalable supervised learning problem with independent executable ground truth. The curriculum scaling, cross-representation generalization, and self-improvement results could represent a meaningful advance. The use of executable SCMs is a strength for verifiability, though the self-improvement component carries a noted risk of circularity that requires careful validation.
major comments (2)
- The abstract and the Part (1) description of incremental SCM construction by LLMs claim scaling to globally complex systems while maintaining verifiable causal query answers. Without detailed algorithms for error detection, a propagation analysis, or empirical fidelity metrics in the provided text, however, it is unclear whether construction errors invalidate the ground-truth relationships. This is load-bearing for the core claim of reliable simulators.
- In the self-improvement study (Part 2), the loop of LLMs generating and training on their own simulators risks circularity if verification of causal correctness ultimately depends on the same model capabilities rather than fully independent executable checks. Concrete ablation results separating model-generated data from external validation would be needed to support the self-improvement claim.
minor comments (1)
- The abstract would benefit from explicit definitions or metrics for 'increasingly complex' and 'globally complex systems' to make the scaling claims more precise.
Simulated Author's Rebuttal
We thank the referee for their careful review and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications on the mechanisms that support reliable simulator construction and independent verification.
Point-by-point responses
- Referee: The abstract and the Part (1) description of incremental SCM construction by LLMs claim scaling to globally complex systems while maintaining verifiable causal query answers. Without detailed algorithms for error detection, a propagation analysis, or empirical fidelity metrics in the provided text, however, it is unclear whether construction errors invalidate the ground-truth relationships. This is load-bearing for the core claim of reliable simulators.
Authors: We agree that the manuscript would benefit from greater explicitness on these points. The incremental construction procedure incorporates execution-based consistency checks after each addition of a variable or edge: a battery of causal queries is run on the updated executable SCM and compared against results from the prior verified state, and extensions whose inconsistencies indicate construction errors are rejected before the simulator is accepted. While the current text describes this process at a high level, we will add a dedicated subsection with pseudocode for the error-detection routine, a formal propagation analysis, and quantitative fidelity metrics (e.g., query-consistency rates across build steps) in the revised version. (Revision: yes)
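An acceptance test of the kind the authors describe can be sketched as follows; the helper names, the marginal-only query battery, and the tolerance are our assumptions for illustration, not the paper's implementation.

```python
import random

# Sketch: accept a candidate extension only if the answers to a query
# battery over the *existing* variables match the prior verified state.
def make_scm(mechanisms):
    """mechanisms: ordered list of (name, fn) with fn(values, rng) -> value."""
    def run_once(seed, do=None):
        do, rng, values = do or {}, random.Random(seed), {}
        for name, fn in mechanisms:
            values[name] = do[name] if name in do else fn(values, rng)
        return values
    return run_once

def query_battery(run_once, variables, n=500):
    """Marginal of each existing variable under no intervention."""
    return {v: sum(run_once(s)[v] for s in range(n)) / n for v in variables}

def try_extend(mechanisms, new_mech, tol=1e-9):
    old_vars = [name for name, _ in mechanisms]
    before = query_battery(make_scm(mechanisms), old_vars)
    after = query_battery(make_scm(mechanisms + [new_mech]), old_vars)
    ok = all(abs(before[v] - after[v]) <= tol for v in old_vars)
    return (mechanisms + [new_mech]) if ok else mechanisms  # reject on drift

base = [("rain", lambda v, r: r.random() < 0.3),
        ("wet",  lambda v, r: v["rain"])]
accepted = try_extend(base, ("alarm", lambda v, r: v["wet"] or r.random() < 0.1))
rejected = try_extend(base, ("rain",  lambda v, r: True))  # redefines a node: drift
```

A genuine propagation analysis would compare full query answers (interventional and counterfactual), not just marginals; this sketch only shows the reject-on-drift shape of the loop.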
- Referee: In the self-improvement study (Part 2), the loop of LLMs generating and training on their own simulators risks circularity if verification of causal correctness ultimately depends on the same model capabilities rather than fully independent executable checks. Concrete ablation results separating model-generated data from external validation would be needed to support the self-improvement claim.
Authors: Verification of each generated simulator occurs exclusively through direct execution of its code, which supplies ground-truth answers independently of any LLM reasoning. The self-improvement experiments already include controls that train on simulators built from external domain knowledge and compare against purely self-generated ones; performance gains remain when evaluation is restricted to execution-derived labels. To further address the circularity concern, we will expand the results with additional ablations that explicitly isolate model-generated data from any model-assisted validation steps. (Revision: partial)
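The execution-only labeling this response describes can be sketched as follows; the question templates and the toy simulator are invented for illustration, and the point is that only code execution, never a model's judgment, produces the answer field.

```python
import random

# Hedged sketch (our names, not the paper's pipeline): the training label
# for a natural-language causal question is computed by running the
# simulator, so no LLM enters the labeling step.
def run_once(seed, do=None):
    do = do or {}
    rng = random.Random(seed)
    rain = do.get("rain", rng.random() < 0.3)
    wet = do.get("wet", rain)
    return {"rain": rain, "wet": wet}

def execution_label(var, do, n=2000):
    """Monte-Carlo ground truth computed purely by executing the code."""
    return sum(run_once(s, do)[var] for s in range(n)) / n

# Each training pair: a natural-language question plus its executed answer.
dataset = [
    {"question": "Under do(rain = True), how often is the grass wet?",
     "answer": execution_label("wet", {"rain": True})},
    {"question": "Under do(rain = False), how often is the grass wet?",
     "answer": execution_label("wet", {"rain": False})},
]
```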
Circularity Check
No significant circularity detected
Rationale
The provided manuscript text consists only of the abstract and a placeholder for the full paper, with no equations, self-citations, or derivation steps that reduce any claim to its own inputs by construction. The framework's core elements (LLM-driven construction of executable SCMs, formalization across representations, and curriculum scaling) are presented as independent mechanisms that generate verifiable ground truth via code execution rather than through self-referential fitting or imported uniqueness theorems. No load-bearing step matches the enumerated circularity patterns, and the self-improvement loop is grounded in the external executability of the simulators rather than in the model's own judgments.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Structural causal models can be represented as executable programs that provide verifiable ground-truth answers to causal queries.
invented entities (1)
- CauSim framework (no independent evidence)