arxiv: 2607.00924 · v1 · pith:EJ4DZLSBnew · submitted 2026-07-01 · 💻 cs.AI · cond-mat.mtrl-sci · cs.CL · cs.LG

Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination

Pith reviewed 2026-07-02 12:25 UTC · model grok-4.3

classification 💻 cs.AI · cond-mat.mtrl-sci · cs.CL · cs.LG
keywords graph-native reinforcement learning · scientific hypothesis generation · materials science · reasoning traceability · conceptual recombination · Group Relative Policy Optimization · Graph-PRefLexOR · semantic diversity

The pith

Graph-PRefLexOR organizes reasoning into explicit phases via graph-native reinforcement learning, producing more traceable hypotheses than base models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Graph-PRefLexOR, a family of models fine-tuned with Group Relative Policy Optimization to structure scientific reasoning into four explicit phases: mechanism exploration, graph construction, pattern extraction, and hypothesis synthesis. This links neural language generation directly to symbolic relational graphs so that causal connections can be built, inspected, and reused during open-ended materials design tasks. Tested on 100 questions drawn from materials science and mechanics literature, the approach delivers 40-65 percent gains over base models, with the biggest lifts in reasoning traceability and roughly two to three times greater semantic diversity. Embedding and hidden-state analyses confirm tighter alignment between the structured steps and the final answers, while test-time graph expansion shows that extra compute mainly drives long-range conceptual recombination inside a bounded semantic space.

Core claim

Graph-PRefLexOR links neural language generation with symbolic relational structure by organizing reasoning into explicit phases for mechanism exploration, graph construction, pattern extraction, and hypothesis synthesis. This design enables causal connections to be constructed, inspected, and reused, resulting in 40-65% improvements over base models on 100 open-ended materials questions, with the largest gains in reasoning traceability, broader semantic exploration, and stronger alignment between intermediate reasoning and final answers.
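
To make "explicit phases" concrete, here is a minimal sketch of how a phased response could be split for inspection. The tag names (`<think>`, `<brainstorm>`, `<graph>`, `<patterns>`, `<synthesis>`) come from the figure captions; the exact output format of the released models, and the helper names `split_phases` and `missing_phases`, are assumptions made for illustration.

```python
import re

# Phase tags as named in the figure captions; treating them as simple
# non-nested XML-like blocks is an assumption about the output format.
PHASES = ("brainstorm", "graph", "patterns", "synthesis")


def split_phases(response: str) -> dict:
    """Return {phase: text} for each phase found inside <think>...</think>,
    plus the final answer that follows the closing </think> tag."""
    think = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    body = think.group(1) if think else ""
    parsed = {}
    for name in PHASES:
        match = re.search(rf"<{name}>(.*?)</{name}>", body, flags=re.DOTALL)
        if match:
            parsed[name] = match.group(1).strip()
    # Everything after </think> is treated as the final answer; with no
    # <think> block the whole response is the answer.
    parsed["answer"] = response[think.end():].strip() if think else response.strip()
    return parsed


def missing_phases(parsed: dict) -> list:
    """List the phases that are absent, in canonical order."""
    return [name for name in PHASES if name not in parsed]
```

A check like `missing_phases(split_phases(text))` gives a quick structural test of whether a response followed all four phases before any semantic analysis is attempted.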

What carries the argument

Graph-PRefLexOR, the graph-native reasoning model fine-tuned with Group Relative Policy Optimization (GRPO) to enforce phased reasoning that connects language outputs to symbolic graphs for inspection and recombination.
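
For readers unfamiliar with GRPO, the core idea is that each sampled completion is scored relative to the other completions drawn for the same prompt, so no separate value network is needed. The sketch below shows only that group-normalization step. The function name, the use of the population standard deviation, and the `eps` stabilizer are illustrative choices, not details taken from the paper's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled completions.

    Each advantage is (reward - group mean) / (group std + eps), so a
    completion is rewarded only for beating its siblings in the group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; sample std is an equally common choice
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal, which is why the within-group reward standard deviation is a useful training diagnostic.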

If this is right

  • Reasoning steps become inspectable, so users can trace how intermediate graphs support or contradict the final hypothesis.
  • Semantic diversity rises roughly two- to threefold, allowing the model to explore a wider set of conceptual combinations within the same domain.
  • Additional test-time compute increases long-range recombination rather than simply widening the covered semantic space.
  • Layer-wise hidden-state analyses show tighter coupling between the phased reasoning and the generated answer.
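
The semantic-diversity claim is described in the figure captions as an inter-phase centroid distance. One plausible reading, sketched below, averages the pairwise cosine distance between the embedding centroids of each reasoning phase. The embedding model, the distance metric, and the chunking used for base models without phases are not specified in the material reviewed here, so every one of those choices (and the function names) is an assumption.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def inter_phase_diversity(phase_embeddings):
    """Mean pairwise cosine distance between per-phase centroids.

    phase_embeddings maps a phase (or chunk) name to a list of embedding
    vectors for the text in that phase.
    """
    centroids = [centroid(vectors) for vectors in phase_embeddings.values()]
    pairs = list(combinations(centroids, 2))
    if not pairs:
        raise ValueError("need at least two phases to measure diversity")
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)
```

Under this reading, a response whose phases all land in the same region of embedding space scores near zero, while one that moves between distinct conceptual regions scores higher.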

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same phased graph structure could be adapted to hypothesis generation in chemistry or biology where causal mechanisms are also central.
  • If the graph-construction phase can be made fully automatic from raw text, the method might scale to larger corpora without extra human annotation.
  • The bounded semantic space implies the model excels at recombining known concepts but may still require external novelty injection to propose truly paradigm-shifting ideas.

Load-bearing premise

That the 100 questions drawn from existing literature are enough to measure scientific validity, and that the reported gains in traceability stem specifically from the graph-native phased structure rather than from the fine-tuning itself.

What would settle it

An expert panel rates traceability and scientific validity on the same 100 questions for both Graph-PRefLexOR and base models trained to the same compute budget but without the explicit graph-phased structure; if the gap disappears, the central claim is false.
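
Because both systems would be rated on the same 100 questions, the comparison is paired, and a sign-flip permutation test is one simple way to ask whether a gap has "disappeared". The sketch below is an illustration of that test, not a procedure described in the paper; the function name and defaults are hypothetical.

```python
import random


def paired_permutation_pvalue(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-question scores.

    Under the null hypothesis the two systems are exchangeable on every
    question, so each paired difference is equally likely to be + or -.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

A large p-value for the graph-phased model versus a compute-matched non-graph control would be the outcome that undermines the central claim.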

Figures

Figures reproduced from arXiv: 2607.00924 by Markus J. Buehler, Shashwat Sourav, Subhadeep Pal, Tirthankar Ghosal.

Figure 1: (a) Scientific discovery often proceeds through iterative hypothesis generation, validation, re-ideation and refinement. (b) Standard LLM responses to scientific queries can be difficult to trace, leading to untraceability, hallucination, or contradiction. (c) Graph-PRefLexOR addresses this limitation by organizing the <think> section into explicit reasoning phases: <brainstorm> for mechanism exploration, …
Figure 2: Evaluation of structured reasoning across model scales on open-ended scientific questions (N = 100), assessed using Claude Opus-4.7. Metrics include Reasoning Quality, Intellectual Depth, Reasoning Traceability, and Overall score (0–10). (a) Graph-PRefLexOR-8B vs. Qwen3-8B (with no-thinking variant), (b) Graph-PRefLexOR-3B vs. Llama-3.2-3B-Instruct, and (c) Graph-PRefLexOR-1.7B vs. Qwen3-1.7B (with no-thin…
Figure 3: Representative cross-disciplinary hypothesis-generation question used to evaluate Graph-PRefLexOR. The question is derived from Ref. [10] and probes analogical mapping, mechanistic breakdown, and long-horizon adaptive reasoning.
Figure 4: Representative Graph-PRefLexOR-8B reasoning to the benchmark question in …
Figure 5: Graph and pattern representation extracted from the Graph-PRefLexOR-8B response. (a) Directed graph linking biological immune-system concepts, multi-agent AI components, and the proposed bridging mechanism. (b) Higher-order reasoning patterns extracted from the graph, summarizing the main causal motifs used for hypothesis synthesis.
Figure 6: PCA projection of reasoning traces and final answers comparing Graph-PRefLexOR and base models across scales. (a) Graph-PRefLexOR-8B vs. Qwen3-8B reasoning traces, (b) Graph-PRefLexOR-1.7B vs. Qwen3-1.7B reasoning traces, and (c–e) corresponding comparisons of final answers for 8B, 3B, and 1.7B models, respectively. Reasoning traces are decomposed into structured components (<brainstorm>, <graph>, <pattern…
Figure 7: PCA projection of directed (a) reasoning, and (b) answer trajectories between Graph-PRefLexOR-8B and Qwen3-8B. For Graph-PRefLexOR, trajectories explicitly follow structured stages (<brainstorm>, <graph>, <patterns>, and <synthesis>), forming coherent, directional transitions in latent space. In contrast, base model trajectories (shown as sequential chunks) remain more localized and less structured. For an…
Figure 8: Semantic diversity measured via inter-phase centroid distance for (a) reasoning traces and (b) final answers across model scales. Violin plots show the distribution of sample-level semantic diversity scores for Graph-PRefLexOR and the corresponding base models. Individual points denote responses, horizontal black lines indicate medians, black diamonds denote means, and vertical error bars represent one sta…
Figure 9: Semantic backtracking analysis of final answer alignment for Qwen3-8B and Graph-PRefLexOR-8B across 100 open-ended scientific questions. (a) Binary split showing whether each final answer is closest to its own reasoning trace or to the other model’s outputs. (b) Source distribution for Qwen3-8B final answers, which align with its own <think> trace in only 16/100 cases and more often align with Graph-PRefLe…
Figure 10: Internal semantic backtracking of Graph-PRefLexOR-8B final answers. (a) Closest structured reasoning stage for each final answer across 100 benchmark questions. (b) Mean cosine similarity between the final answer and each reasoning phase. Final answers align most frequently and most strongly with the <synthesis> stage, indicating that response generation is primarily grounded in the final integrative reas…
Figure 11: Layer-wise hidden-state divergence between reasoning and final-answer representations for Qwen3-8B and Graph-PRefLexOR-8B. Qwen3-8B exhibits a larger reasoning-answer separation, with a pronounced increase around layers 7-10 and a final-layer spike. In contrast, Graph-PRefLexOR-8B maintains lower divergence across most layers, indicating a more continuous transition from structured reasoning to final-answ…
Figure 12: Backtracking-conditioned layer-wise hidden-state divergence for Qwen3-8B and Graph-PRefLexOR-8B. (a) Qwen3-8B divergence between thinking and final-answer, separated by whether the final answer backtracks to the model’s own thinking trace or to another source. Non-backtracking cases show larger divergence, particularly around layers 7-10 and at the final layer. (b) Graph-PRefLexOR-8B divergence between st…
Figure 13: Graph-native ideation loop for test-time graph expansion. At each iteration, the reasoner answers a question, emits a small ontological graph, and merges it into a growing memory graph Gt using embedding-based de-duplication. An expansion strategy then selects concepts or concept pairs from Gt to generate the next question. The four strategies are frontier, which expands low-degree leaves and central hubs…
Figure 14: Test-time compute expands a bounded idea space through recombination. Four size-robust metrics are shown as a function of reasoning iteration up to 2,000 iterations for the four expansion strategies. The number of distinct concepts continues to increase (a), whereas the explored embedding volume (b) and maximum distance from the seed (c) saturate within a few hundred iterations, indicating that the semant…
Figure 15: Semantic organization and broker concepts in the leap run. (a) Principal-component projection of concept embeddings, with colors indicating greedy-modularity communities and marker size indicating PageRank. The fifteen highest-PageRank concepts are numbered and listed below the map. (b) Broker concepts plotted by degree and betweenness. A small set of high-betweenness concepts mediates most cross-communit…
Figure 16: Growth dynamics of the leap run. The final graph is replayed in birth-iteration order, and each panel is evaluated using embedding geometry or mesoscale community structure rather than raw graph distance. (a) New concepts per iteration bin, separated into novel concepts and consolidating in-fill concepts; the black line shows the fraction of novel concepts. (b) Number of greedy-modularity communities and …
Figure 17: Statistical novelty of mined connections in the leap run. (a) Relational-motif significance relative to a label-shuffled null model. The ten most over-represented relation-typed two-step motifs reach z ≈ 100–160, far exceeding the ordinary significance threshold (z = 1.96), indicating that the graph follows consistent relational templates rather than random associations. The graph is also more modular tha…
Figure 18: ORPO cold start for all three models. Top row: total ORPO loss and its NLL component; bottom row: preference accuracy and reward margin (both in [0, 1]); columns are (a) 1.7B, (b) 3B, (c) 8B, with y shared per row (note the differing ORPO durations, ∼480/480/240 steps). Loss falls and preference accuracy saturates near 1.0 for all backbones; the 1.7B’s much larger reward margin reflects its higher learnin…
Figure 19: Graph-GRPO reward for all three models. Top row: total composite reward; bottom row: the six reward components; columns are (a) 1.7B, (b) 3B, (c) 8B (differing GRPO durations, ∼970/1970/1260 steps), with y shared per row for direct comparison. The 8B starts highest and the 3B climbs most, while graph utility (green) is the lowest component at every scale.
Figure 20: Graph-GRPO dynamics across the three models, with each run rescaled to [0, 1] training progress so the durations align. (a) Reasoning-trace length: mean terminated completion length (top) and fraction of completions truncated at the token budget (bottom). (b) Optimization diagnostics: within-group reward standard deviation, i.e. the scale of the group-normalized advantage of Eq. (2) (top), and policy entr…
Figure 21: Workflow for constructing the open-ended scientific reasoning benchmark from research papers. For each paper, OpenAI gpt-5.4 with high reasoning effort generates one self-contained, research-level evaluation question. The resulting benchmark contains 100 open-ended questions. Each question is assigned to one of five predefined reasoning categories: causal_multiscale_reasoning, tradeoff_and_non_monotonici…
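
The merge step of the test-time expansion loop (Figure 13) can be pictured with a small sketch: each newly emitted concept is either folded into an existing near-duplicate in the memory graph or added as a new node. The class name `MemoryGraph`, the cosine threshold of 0.9, and the triple-based edge format are all hypothetical; the paper's actual de-duplication rule and data structures are not given in the captions.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class MemoryGraph:
    """Growing concept graph with embedding-based de-duplication (illustrative)."""

    def __init__(self, threshold=0.9):
        self.embeddings = {}  # canonical concept label -> embedding
        self.edges = set()    # (source, relation, target) triples
        self.threshold = threshold

    def _canonical(self, label, embedding):
        """Return the most similar existing concept at or above the
        threshold, or register the label as a new concept."""
        best, best_sim = None, self.threshold
        for name, known in self.embeddings.items():
            sim = cosine(embedding, known)
            if sim >= best_sim:
                best, best_sim = name, sim
        if best is None:
            self.embeddings[label] = embedding
            return label
        return best

    def merge(self, nodes, edges):
        """Merge one small graph; nodes maps label -> embedding, and every
        edge endpoint is assumed to appear in nodes."""
        alias = {label: self._canonical(label, emb) for label, emb in nodes.items()}
        for source, relation, target in edges:
            self.edges.add((alias[source], relation, alias[target]))
        return alias
```

With de-duplication in place, the concept count can keep rising only through genuinely new labels, while repeated ideas accumulate as extra edges on existing nodes.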
Original abstract

Accelerating materials discovery requires AI systems that can generate scientifically valid hypotheses through multi-step, domain-grounded reasoning. Standard large language models often produce fluent but weakly traceable responses to open-ended materials design problems, making it difficult to determine whether final answers are supported by coherent intermediate reasoning. We develop Graph-PRefLexOR, a family of graph-native reasoning models fine-tuned with Group Relative Policy Optimization (GRPO) to organize reasoning into explicit phases for mechanism exploration, graph construction, pattern extraction, and hypothesis synthesis. This design links neural language generation with symbolic relational structure, enabling causal connections to be constructed, inspected, and reused. On 100 open-ended questions from materials science and mechanics literature, Graph-PRefLexOR achieves 40-65% improvements over corresponding base models, with the largest gains in reasoning traceability. Embedding analyses show broader semantic exploration and approximately 2-3 times greater semantic diversity than baselines. Semantic backtracking and layer-wise hidden-state analyses further show stronger alignment between structured reasoning and final answers. Finally, test-time graph expansion reveals that additional compute primarily increases long-range conceptual recombination within a bounded semantic space, rather than simply expanding semantic coverage. These results establish graph-native reinforcement learning as a pathway toward interpretable AI systems for scientific hypothesis generation in materials design and other scientific applications.
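
The semantic backtracking analysis named in the abstract can be read as a nearest-source lookup: embed the final answer, embed each candidate source (a reasoning phase, or another model's trace), and report which source is closest by cosine similarity. The sketch below illustrates that reading only. The function names are hypothetical, and the embedding model behind the vectors is not specified in the material reviewed here.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_source(answer_embedding, source_embeddings):
    """Return (best_source_name, {name: similarity}) for one final answer.

    source_embeddings maps a source name (for example a reasoning phase)
    to its embedding vector.
    """
    if not source_embeddings:
        raise ValueError("need at least one candidate source")
    sims = {
        name: cosine(answer_embedding, emb)
        for name, emb in source_embeddings.items()
    }
    best = max(sims, key=sims.get)
    return best, sims
```

Counting how often `closest_source` returns a model's own trace across all benchmark questions yields the kind of tally reported in the backtracking figures.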

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Graph-PRefLexOR, a family of graph-native reasoning models fine-tuned with Group Relative Policy Optimization (GRPO) that structures reasoning into explicit phases (mechanism exploration, graph construction, pattern extraction, hypothesis synthesis) to link neural generation with symbolic relational structure for traceable hypothesis generation in materials science. On 100 open-ended questions from the literature, it claims 40-65% improvements over base models (largest in traceability), ~2-3x greater semantic diversity, stronger reasoning-answer alignment via semantic backtracking and hidden-state analyses, and that test-time graph expansion increases long-range recombination within a bounded space.

Significance. If the performance gains can be shown to arise specifically from the graph-native phased structure (rather than GRPO, data curation, or context length), the approach would offer a concrete mechanism for improving interpretability and traceability in LLM-based scientific reasoning, with potential applicability beyond materials design.

major comments (3)
  1. [Evaluation / Results] Evaluation section (100-question benchmark): the headline 40-65% gains and traceability improvements are reported without ablations that hold training data, compute budget, and base model fixed while removing only the graph-construction / symbolic-recombination components; comparisons appear limited to 'corresponding base models' without non-graph GRPO or standard SFT controls, so it is impossible to attribute gains to the graph-native structure as claimed.
  2. [Methods] Methods / Experimental setup: no definition is provided for how 'reasoning traceability' was quantified (e.g., the precise metric, inter-annotator protocol, or automated proxy used for the largest reported gains), nor are statistical tests or question-selection criteria described, leaving the central performance claims without visible supporting evidence.
  3. [Semantic analyses] § on semantic analyses: the claims of 'broader semantic exploration' and 'approximately 2-3 times greater semantic diversity' rest on embedding analyses whose construction (distance metric, embedding model, normalization) is not specified, preventing verification that these quantities are independent of the training process itself.
minor comments (2)
  1. [Abstract] Abstract and introduction use 'traceable' and 'interpretable' interchangeably without a crisp operational distinction.
  2. [Figures] Figure captions for the layer-wise hidden-state and test-time expansion plots should explicitly state the number of runs and error bars.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and empirical rigor.

Point-by-point responses
  1. Referee: [Evaluation / Results] Evaluation section (100-question benchmark): the headline 40-65% gains and traceability improvements are reported without ablations that hold training data, compute budget, and base model fixed while removing only the graph-construction / symbolic-recombination components; comparisons appear limited to 'corresponding base models' without non-graph GRPO or standard SFT controls, so it is impossible to attribute gains to the graph-native structure as claimed.

    Authors: We agree that the current comparisons to base models do not fully isolate the graph-construction and symbolic-recombination components from GRPO or data effects. To strengthen attribution, we will add the requested ablations in the revision, holding training data, compute budget, and base model fixed while including non-graph GRPO and standard SFT controls. revision: yes

  2. Referee: [Methods] Methods / Experimental setup: no definition is provided for how 'reasoning traceability' was quantified (e.g., the precise metric, inter-annotator protocol, or automated proxy used for the largest reported gains), nor are statistical tests or question-selection criteria described, leaving the central performance claims without visible supporting evidence.

    Authors: We will expand the Methods section in the revision to define the reasoning traceability metric (including the automated proxy and human validation protocol with inter-annotator agreement), report the statistical tests used, and detail the question-selection criteria from the literature. revision: yes

  3. Referee: [Semantic analyses] § on semantic analyses: the claims of 'broader semantic exploration' and 'approximately 2-3 times greater semantic diversity' rest on embedding analyses whose construction (distance metric, embedding model, normalization) is not specified, preventing verification that these quantities are independent of the training process itself.

    Authors: We will specify the full construction of the embedding analyses in the revision, including the embedding model, distance metric, normalization steps, and controls to confirm independence from the training process. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The abstract and provided text describe an empirical method (Graph-PRefLexOR with GRPO) and report performance gains on an external benchmark of 100 literature questions. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations are present that would reduce the claimed improvements to quantities defined by the training process itself. The evaluation is presented as an independent test set, making the result self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the phases and GRPO are introduced but not formalized enough to audit.

pith-pipeline@v0.9.1-grok · 5785 in / 1136 out tokens · 26159 ms · 2026-07-02T12:25:03.909489+00:00 · methodology


Reference graph

Works this paper leans on

66 extracted references · 51 canonical work pages · 21 internal anchors

  1. [1] Real-time segmentation of on-line handwritten arabic script. Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, 2014.

  2. [2] Fast classification of handwritten on-line Arabic characters. Soft Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of, 2014.

  3. [3] Estimate and Replace: A Novel Approach to Integrating Deep Neural Networks with Existing Applications. arXiv preprint arXiv:1804.09028.

  4. [4] In… Advanced Intelligent Discovery, 2025. doi:10.1002/aidi.202500006

  5. [5] Buehler, Markus J. npj Artificial Intelligence, May. doi:10.1038/s44387-025-00003-z

  6. [6] Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and Zheng, Chujie and Liu, Dayiheng and Zhou, Fan and Huang, Fei and Hu, Feng and Ge, Hao and Wei, Haoran and Lin, Huan and Tang, Jialong and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jia...

  7. [7] Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Y. and Guo, Daya. April. doi:10.48550/arXiv.2402.03300

  8. [8] Knowledge… ACM Computing Surveys, 2022. doi:10.1145/3447772

  9. [9] Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. December. doi:10.48550/arXiv.2306.05685

  10. [10] Vera, Henrique Schechter and Dua, Sahil and Zhang, Biao and Salz, Daniel and Mullins, Ryan and Panyam, Sindhu Raghuram and Smoot, Sara and Naim, Iftekhar and Zou, Joe and Chen, Feiyang and Cer, Daniel and Lisak, Alice and Choi, Min and Gonzalez, Lucas and Sanseviero, Omar and Cameron, Glenn and Ballantyne, Ian and Black, Kat and Chen, Kaifeng and Wang, We...

  11. [11] Principal component analysis. WIREs Computational Statistics, 2010. doi:10.1002/wics.101

  12. [12] Scott, David W. Multivariate… Handbook of…, 2012. doi:10.1007/978-3-642-21551-3_19

  13. [13] Marker: Convert PDF to Markdown, JSON, and HTML. 2025.

  14. [14] GPT-4o mini Model. 2024.

  15. [15] GPT-5.5 Model. 2026.

  16. [16] Wang, Hanchen and Fu, Tianfan and Du, Yuanqi and Gao, Wenhao and Huang, Kexin and Liu, Ziming and Chandak, Payal and Liu, Shengchao and Van Katwyk, Peter and Deac, Andreea and Anandkumar, Anima and Bergen, Karianne and Gomes, Carla P. and Ho, Shirley and Kohli, Pushmeet and Lasenby, Joan and Leskovec, Jure and Liu, Tie-Yan and Manrai, Arjun and Marks, Deb...

  17. [17] Buehler, Markus J. Accelerating scientific discovery with generative knowledge extraction, graph-based representation, and multimodal intelligent graph reasoning. Machine Learning: Science and Technology, September. doi:10.1088/2632-2153/ad7228

  18. [18] Nepal, Dhriti and Kang, Saewon and Adstedt, Katarina M. and Kanhaiya, Krishan and Bockstaller, Michael R. and Brinson, L. Catherine and Buehler, Markus J. and Coveney, Peter V. and Dayal, Kaushik and El-Awady, Jaafar A. and Henderson, Luke C. and Kaplan, David L. and Keten, Sinan and Kotov, Nicholas A. and Schatz, George C. and Vignolini, Silvia and Vollr...

  19. [19] Wegst, Ulrike G. K. and Bai, Hao and Saiz, Eduardo and Tomsia, Antoni P. and Ritchie, Robert O. Bioinspired structural materials. Nature Materials, January. doi:10.1038/nmat4089

  20. [20] Swanson, Don R. Undiscovered… The Library Quarterly: Information, Community, Policy.

  21. [21] Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Łukasz and Polosukhin, Illia. Attention is… Advances in…

  22. [22] Language… Advances in Neural Information Processing Systems, 2020.

  23. [23] Zhao, Wayne Xin and Zhou, Kun and Li, Junyi and Tang, Tianyi and Wang, Xiaolei and Hou, Yupeng and Min, Yingqian and Zhang, Beichen and Zhang, Junjie and Dong, Zican and Du, Yifan and Yang, Chen and Chen, Yushuo and Chen, Zhipeng and Jiang, Jinhao and Ren, Ruiyang and Li, Yifan and Tang, Xinyu and Liu, Zikang and Liu, Peiyu and Nie, Jian-Yun and Wen, Ji-R...

  24. [24] Zhang, Yanbo and Khan, Sumeer A. and Mahmud, Adnan and Yang, Huck and Lavin, Alexander and Levin, Michael and Frey, Jeremy and Dunnmon, Jared and Evans, James and Bundy, Alan and Dzeroski, Saso and Tegner, Jesper and Zenil, Hector. Exploring the role of large language models in the scientific method: from hypothesis to discovery. August. vo...

  25. [25] Advanced Science, 2024. doi:10.1002/advs.202306724

  26. [26] Ghafarollahi, A. and Buehler, M. J. January. doi:10.48550/arXiv.2402.04268

  27. [27] Lu, Chris and Lu, Cong and Lange, Robert Tjarko and Foerster, Jakob and Clune, Jeff and Ha, David. The… September. doi:10.48550/arXiv.2408.06292

  28. [28] Hage, Tarjei Paule and Buehler, Markus J. March. doi:10.48550/arXiv.2603.04124

  29. [29] Applied Mechanics Reviews. doi:10.1115/1.4063843

  30. [30] Journal of the Mechanics and Physics of Solids, 2023. doi:10.1016/j.jmps.2023.105454

  31. [31] Advanced Materials, 2025. doi:10.1002/adma.202413523

  32. [32] Ghafarollahi, Alireza and Buehler, Markus J. Sparks… April. doi:10.48550/arXiv.2504.19017

  33. [33] Chain-of-… Advances in Neural Information Processing Systems, 2022.

  34. [34] AI for Science. doi:10.1088/3050-287X/ae61d1

  35. [36] Stewart, Isabella A. and Hage, Tarjei Paule and Hsu, Yu-Chuan and Buehler, Markus J. February. doi:10.48550/arXiv.2602.07491

  36. [37] Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and Küttler, Heinrich and Lewis, Mike and Yih, Wen-tau and Rocktäschel, Tim and Riedel, Sebastian and Kiela, Douwe. Retrieval-… Advances in…

  37. [38] Wang, Fiona Y. and Marom, Lee and Pal, Subhadeep and Luu, Rachel K. and Lu, Wei and Berkovich, Jaime A. and Buehler, Markus J. Autonomous… March. doi:10.48550/arXiv.2603.14312

  38. [39] Venugopal, Vineeth and Olivetti, Elsa. Scientific Data, February. doi:10.1038/s41597-024-03039-z

  39. [40] Ghafarollahi, Alireza and Buehler, Markus J. Automating alloy design and discovery with physics-aware multimodal multiagent… Proceedings of the National Academy of Sciences, January. doi:10.1073/pnas.2414074122

  40. [41] Rapid and automated alloy design with graph neural network-powered large language model-driven multi-agent… MRS Bulletin, 2025. doi:10.1557/s43577-025-00953-4

  41. [42] Gao, Yunfan and Xiong, Yun and Gao, Xinyu and Jia, Kangxiang and Pan, Jinliu and Bi, Yuxi and Dai, Yi and Sun, Jiawei and Wang, Meng and Wang, Haofen. Retrieval-Augmented Generation for Large Language Models: A Survey. March. doi:10.48550/arXiv.2312.10997

  42. [43] Unifying… IEEE Transactions on Knowledge and Data Engineering, 2024. doi:10.1109/TKDE.2024.3352100

  43. [44]

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    Eliciting Latent Predictions from Transformers with the Tuned Lens. arXiv e-prints , keywords =. doi:10.48550/arXiv.2303.08112 , archivePrefix =. 2303.08112 , primaryClass =

  44. [45]

    How to use and interpret activation patching

    How to use and interpret activation patching. arXiv e-prints, arXiv:2404.15255. doi:10.48550/arXiv.2404.15255

  45. [46]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv e-prints, arXiv:2201.11903. doi:10.48550/arXiv.2201.11903

  46. [47]

    Measuring Faithfulness in Chain-of-Thought Reasoning

    Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv e-prints, arXiv:2307.13702. doi:10.48550/arXiv.2307.13702

  47. [48]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv e-prints, arXiv:1908.10084. doi:10.48550/arXiv.1908.10084

  48. [49]

    MTEB: Massive Text Embedding Benchmark

    MTEB: Massive Text Embedding Benchmark. arXiv e-prints, arXiv:2210.07316. doi:10.48550/arXiv.2210.07316

  49. [50]

    Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations

    Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. arXiv e-prints, arXiv:1703.03717. doi:10.48550/arXiv.1703.03717

  50. [51]

    Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning

    Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning. arXiv e-prints, arXiv:2402.13950. doi:10.48550/arXiv.2402.13950

  51. [52]

    A Primer in BERTology: What we know about how BERT works

    A Primer in BERTology: What we know about how BERT works. arXiv e-prints, arXiv:2002.12327. doi:10.48550/arXiv.2002.12327

  52. [53]

    What Does BERT Look At? An Analysis of BERT's Attention

    What Does BERT Look At? An Analysis of BERT's Attention. arXiv e-prints, arXiv:1906.04341. doi:10.48550/arXiv.1906.04341

  53. [54]

    BERT Rediscovers the Classical NLP Pipeline

    BERT Rediscovers the Classical NLP Pipeline. arXiv e-prints, arXiv:1905.05950. doi:10.48550/arXiv.1905.05950

  54. [55]

    SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability

    SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. arXiv e-prints, arXiv:1706.05806. doi:10.48550/arXiv.1706.05806

  55. [56]

    Similarity of Neural Network Representations Revisited

    Similarity of Neural Network Representations Revisited. arXiv e-prints, arXiv:1905.00414. doi:10.48550/arXiv.1905.00414

  56. [57]

    LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models

    LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models. arXiv e-prints, arXiv:2503.11667. doi:10.48550/arXiv.2503.11667

  57. [58]

    Towards Automated Circuit Discovery for Mechanistic Interpretability

    Towards Automated Circuit Discovery for Mechanistic Interpretability. arXiv e-prints, arXiv:2304.14997. doi:10.48550/arXiv.2304.14997

  58. [59]

    On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models

    On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models. arXiv e-prints, arXiv:2406.10625. doi:10.48550/arXiv.2406.10625

  59. [60]

    Understanding intermediate layers using linear classifier probes

    Understanding intermediate layers using linear classifier probes. arXiv e-prints, arXiv:1610.01644. doi:10.48550/arXiv.1610.01644

  60. [61]

    Analysis Methods in Neural Language Processing: A Survey

    Analysis Methods in Neural Language Processing: A Survey. arXiv e-prints, arXiv:1812.08951. doi:10.48550/arXiv.1812.08951

  61. [62]

    C-Pack: Packaged Resources To Advance General Chinese Embedding

    C-Pack: Packaged Resources To Advance General Chinese Embedding. 2023.

  62. [63]

    C-Pack: Packed Resources For General Chinese Embeddings

    C-Pack: Packed Resources For General Chinese Embeddings. arXiv e-prints, arXiv:2309.07597. doi:10.48550/arXiv.2309.07597

  63. [64]

    ORPO: Monolithic Preference Optimization without Reference Model

    ORPO: Monolithic Preference Optimization without Reference Model. 2024.

  64. [65]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP).

  65. [66]

    Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu.

  66. [67]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. 2019.