Recognition: unknown
Learning Chain Of Thoughts Prompts for Predicting Entities, Relations, and even Literals on Knowledge Graphs
Pith reviewed 2026-05-10 15:30 UTC · model grok-4.3
The pith
Chain-of-thought prompts optimized on fewer than 30 examples let language models score and complete knowledge graph triples, including literals and complex OWL expressions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RALP learns string-based chain-of-thought prompts that act as scoring functions for triples, applying Bayesian optimization via the MIPRO algorithm to fewer than 30 training examples and requiring no gradient access. At inference time the same prompts predict missing entities, relations, or literals, return whole triples, and produce confidence scores derived from the language model's output.
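To make the optimization loop concrete, here is a minimal sketch of how such a chain-of-thought triple scorer could be compiled with the open-source DSPy library, whose MIPROv2 optimizer implements MIPRO. The signature fields, metric, model name, and toy data are illustrative assumptions rather than RALP's released pipeline, and exact constructor and compile arguments vary across DSPy versions; the linked repository contains the authors' actual code.

```python
# Hedged sketch: optimizing a chain-of-thought prompt that scores
# knowledge-graph triples, assuming the DSPy library and its MIPROv2
# optimizer (which implements MIPRO). Field names, the metric, the LM,
# and the toy data are illustrative; RALP's real pipeline may differ.
import dspy
from dspy.teleprompt import MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any chat-capable LM

class ScoreTriple(dspy.Signature):
    """Decide whether a knowledge graph triple (head, relation, tail) holds."""
    head = dspy.InputField()
    relation = dspy.InputField()
    tail = dspy.InputField(desc="entity, literal, or OWL class expression")
    plausible = dspy.OutputField(desc="yes or no")

scorer = dspy.ChainOfThought(ScoreTriple)  # CoT reasoning before the verdict

# Fewer than 30 labelled triples, mirroring the paper's data-efficiency setting
# (a realistic run needs the full set so the optimizer can hold out a validation split).
trainset = [
    dspy.Example(head="Berlin", relation="capitalOf", tail="Germany",
                 plausible="yes").with_inputs("head", "relation", "tail"),
    dspy.Example(head="Berlin", relation="population", tail="12",
                 plausible="no").with_inputs("head", "relation", "tail"),
    # ... up to ~30 examples in total
]

def agreement(example, prediction, trace=None):
    # 1 if the model's yes/no verdict matches the gold label, else 0.
    return example.plausible.strip().lower() == prediction.plausible.strip().lower()

# Bayesian search over candidate instructions and demonstrations; no gradients.
optimizer = MIPROv2(metric=agreement, auto="light")
compiled_scorer = optimizer.compile(scorer, trainset=trainset)

# The compiled program can now score candidate completions for a missing slot.
print(compiled_scorer(head="Paris", relation="locatedIn", tail="France").plausible)
```

Ranking candidate heads, relations, or tails by such a scorer's confidence would then yield the link-prediction evaluation discussed below.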
What carries the argument
RALP, which treats learned chain-of-thought prompt strings as scoring functions for knowledge-graph triples and optimizes them via MIPRO Bayesian search on small example sets.
If this is right
- RALP raises the MRR of state-of-the-art KGE models by more than 5 percent across the evaluated datasets.
- On OWL reasoning benchmarks with expressions such as existential and cardinality restrictions, RALP reaches over 88 percent Jaccard similarity to the expected answers (both metrics are sketched in code after this list).
- The method produces high-quality inferred triples that improve generalization on transductive, numerical, and instance-retrieval tasks.
- Prompts can be used at inference to predict any part of a triple or an entire triple while returning a confidence value.
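For reference, the two headline metrics are straightforward to compute; the sketch below uses made-up ranked candidate lists and instance sets, not data from the paper.

```python
# Minimal sketch of the two reported metrics, on illustrative toy data.

def mean_reciprocal_rank(ranked_candidates, gold_answers):
    """MRR over queries: ranked_candidates[i] is a ranked list, gold_answers[i] the correct item."""
    total = 0.0
    for candidates, gold in zip(ranked_candidates, gold_answers):
        if gold in candidates:
            total += 1.0 / (candidates.index(gold) + 1)
    return total / len(gold_answers)

def jaccard(predicted, expected):
    """Set overlap between retrieved and expected instances of an OWL class expression."""
    predicted, expected = set(predicted), set(expected)
    if not predicted and not expected:
        return 1.0
    return len(predicted & expected) / len(predicted | expected)

# Toy usage: two link-prediction queries and one instance-retrieval query.
print(mean_reciprocal_rank([["Paris", "Lyon"], ["Berlin", "Bonn"]],
                           ["Paris", "Bonn"]))               # (1/1 + 1/2) / 2 = 0.75
print(jaccard({"anna", "maria"}, {"anna", "maria", "eva"}))  # 2 / 3 ≈ 0.67
```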
Where Pith is reading between the lines
- Dynamic knowledge graphs could add new literals without retraining embeddings by switching to prompt-based scoring.
- The same optimization loop might be tested on other structured prediction settings where labeled examples are scarce.
- Hybrid systems that combine the learned prompts with existing embedding models could cover both seen and unseen elements more robustly.
Load-bearing premise
Prompts found by optimizing on fewer than 30 examples will reliably generalize to unseen entities, relations, and literals without substantial hallucination or heavy dependence on the base language model's pretraining data.
What would settle it
Apply the learned prompts to a test collection containing only triples whose literals or complex OWL expressions are absent from the 30-example optimization set, and check whether the MRR or Jaccard scores remain above those of the strongest embedding baselines.
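One way to operationalize that test is to restrict the evaluation set to triples whose completed element never appeared during prompt optimization before scoring; a minimal sketch with assumed triple tuples, not the paper's datasets:

```python
# Hedged sketch: keep only test triples whose tail (entity, literal, or OWL
# class expression) is absent from the prompt-optimization examples, then
# compare the learned prompts against embedding baselines on this subset.
# The triple format and the example data are assumptions for illustration.

def novel_tail_only(test_triples, optimization_triples):
    seen = {element for triple in optimization_triples for element in triple}
    return [(h, r, t) for h, r, t in test_triples if t not in seen]

opt_set = [("Berlin", "capitalOf", "Germany"), ("Anna", "age", "34")]
test_set = [("Paris", "capitalOf", "France"),    # novel tail -> kept
            ("Bonn", "locatedIn", "Germany")]    # tail already seen -> dropped

print(novel_tail_only(test_set, opt_set))  # [('Paris', 'capitalOf', 'France')]
# Scoring this filtered subset with the learned prompts and comparing MRR or
# Jaccard against the strongest KGE baselines would address the premise above.
```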
Original abstract
Knowledge graph embedding (KGE) models perform well on link prediction but struggle with unseen entities, relations, and especially literals, limiting their use in dynamic, heterogeneous graphs. In contrast, pretrained large language models (LLMs) generalize effectively through prompting. We reformulate link prediction as a prompt learning problem and introduce RALP, which learns string-based chain-of-thought (CoT) prompts as scoring functions for triples. Using Bayesian Optimization through MIPRO algorithm, RALP identifies effective prompts from fewer than 30 training examples without gradient access. At inference, RALP predicts missing entities, relations or whole triples and assigns confidence scores based on the learned prompt. We evaluate on transductive, numerical, and OWL instance retrieval benchmarks. RALP improves state-of-the-art KGE models by over 5% MRR across datasets and enhances generalization via high-quality inferred triples. On OWL reasoning tasks with complex class expressions (e.g., $\exists hasChild.Female$, $\geq 5 \; hasChild.Female$), it achieves over 88% Jaccard similarity. These results highlight prompt-based LLM reasoning as a flexible alternative to embedding-based methods. We release our implementation, training, and evaluation pipeline as open source: https://github.com/dice-group/RALP .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RALP, a method that reformulates link prediction on knowledge graphs as a prompt-learning task. It uses the MIPRO Bayesian optimization algorithm to discover effective chain-of-thought prompt strings from fewer than 30 training examples; these prompts then act as scoring functions to predict missing entities, relations, or literals and to assign confidence scores. The authors report that RALP improves state-of-the-art KGE models by more than 5% MRR on transductive, numerical, and OWL instance-retrieval benchmarks and reaches over 88% Jaccard similarity on OWL reasoning tasks involving complex class expressions.
Significance. If the performance claims are shown to be robust, the work offers a lightweight, gradient-free alternative to embedding-based link prediction that can handle unseen entities, relations, and literals. The open-source release of the full implementation, training, and evaluation pipeline is a concrete contribution that supports reproducibility.
major comments (2)
- [Evaluation and results] The central empirical claims (≥5% MRR improvement and ≥88% Jaccard on complex OWL expressions) rest on prompt optimization performed on <30 examples; the manuscript provides no controls for prompt sensitivity, variance across random seeds or base LLMs, or explicit tests that the learned prompts generalize to entities/relations whose surface forms are absent from the LLM's pre-training corpus.
- [Experimental setup] The manuscript reports no safeguards against data leakage between the MIPRO optimization set and the test triples, nor does it compare against strong prompt-based or retrieval-augmented baselines that would isolate whether the gains arise from genuine reasoning or from surface-pattern retrieval.
minor comments (2)
- [Abstract] The abstract states performance gains without naming the exact datasets, the precise KGE baselines, or the number of runs; these details should be supplied for immediate clarity.
- [Method] Notation for the learned CoT prompts and the scoring function derived from them should be introduced formally (e.g., as an equation) rather than only described in prose.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications from the manuscript and indicate the revisions we will make to strengthen the presentation and empirical support.
Point-by-point responses
Referee: [Evaluation and results] The central empirical claims (≥5% MRR improvement and ≥88% Jaccard on complex OWL expressions) rest on prompt optimization performed on <30 examples; the manuscript provides no controls for prompt sensitivity, variance across random seeds or base LLMs, or explicit tests that the learned prompts generalize to entities/relations whose surface forms are absent from the LLM's pre-training corpus.
Authors: The use of fewer than 30 examples is a deliberate design choice to demonstrate the data efficiency of MIPRO-based prompt optimization for link prediction. We agree that additional controls would increase confidence in the results. In the revised manuscript we will report performance variance across multiple random seeds for the MIPRO runs on the main benchmarks, which will quantify prompt stability (a minimal harness for such a check is sketched after these responses). While a full sweep across base LLMs is beyond the current scope, we will add a brief discussion of model sensitivity based on preliminary checks with an alternative LLM. For generalization to surface forms absent from pre-training, the OWL instance-retrieval tasks rely on compositional class expressions (e.g., existential and cardinality restrictions) that are not directly present in training corpora, and the numerical literal benchmarks explicitly test unseen numeric values; we will expand the discussion section to highlight these aspects and note the limitation of not having performed an explicit out-of-corpus surface-form ablation. revision: partial
Referee: [Experimental setup] No section reports safeguards against data leakage between the MIPRO optimization set and the test triples, nor does it compare against strong prompt-based or retrieval-augmented baselines that would isolate whether gains arise from genuine reasoning or from surface-pattern retrieval.
Authors: The optimization set for MIPRO is sampled exclusively from the training split of each benchmark, following the standard transductive and numerical data partitions described in the experimental setup; no test triples are used during prompt discovery. We acknowledge that an explicit statement of this partitioning and leakage safeguards is currently absent and will add a dedicated paragraph in the revised experimental setup section detailing the splits and confirming zero overlap. Regarding baselines, the manuscript already compares against state-of-the-art KGE models to establish improvement over embedding methods. To better isolate the contribution of the learned CoT prompts versus surface retrieval, we will include additional experiments against zero-shot and few-shot prompting baselines as well as a simple retrieval-augmented prompt baseline in the updated results tables. revision: yes
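The stability check promised in the first response could be as simple as repeating prompt optimization under several seeds and reporting the spread of the resulting scores. The sketch below is a generic harness under that assumption; optimize_fn and evaluate_fn are hypothetical placeholders for whichever optimizer and evaluation routine the released pipeline exposes.

```python
# Hedged sketch of a seed-variance study: rerun prompt optimization under
# several random seeds and summarize the spread of the evaluation metric.
# optimize_fn and evaluate_fn are hypothetical stand-ins for the optimizer
# (e.g. a MIPRO run) and the benchmark evaluation in the released pipeline.
from statistics import mean, stdev

def stability_report(optimize_fn, evaluate_fn, trainset, testset,
                     seeds=(0, 1, 2, 3, 4)):
    scores = []
    for seed in seeds:
        prompt = optimize_fn(trainset, seed=seed)    # one optimization run per seed
        scores.append(evaluate_fn(prompt, testset))  # e.g. MRR or Jaccard
    return {
        "mean": mean(scores),
        "std": stdev(scores) if len(scores) > 1 else 0.0,
        "per_seed": dict(zip(seeds, scores)),
    }
```

Reporting the per-seed spread alongside the headline numbers would directly address the referee's stability concern.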
Circularity Check
No circularity; empirical prompt optimization evaluated on held-out benchmarks
Full rationale
The paper presents RALP as an empirical method that optimizes CoT prompt strings via the MIPRO Bayesian optimizer on fewer than 30 examples and measures link-prediction MRR and OWL Jaccard scores on standard transductive/numerical/OWL benchmarks. No equations appear that define a target quantity in terms of itself, no fitted parameter is relabeled as a prediction, and the central performance claims rest on train/test splits rather than any self-referential derivation or load-bearing self-citation chain. The reported gains are therefore independent experimental outcomes, not tautological restatements of the method's inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can perform reliable step-by-step reasoning for knowledge-graph completion tasks when given appropriately optimized chain-of-thought prompts.