Automating Categorization of Scientific Texts with In-Context Learning and Prompt-Chaining in Large Language Models
Pith reviewed 2026-05-08 07:13 UTC · model grok-4.3
The pith
Prompt chaining in large language models outperforms in-context learning for hierarchical scientific text classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central finding is that prompt chaining yields superior classification accuracy compared to pure in-context learning, particularly when applied to the nested structure of the taxonomy. Large language models using this approach outperform state-of-the-art models for first-level domain prediction and perform better than an older BERT model for second-level subject prediction, though accuracy at the third-level topic remains around 50 percent even with chaining.
What carries the argument
Prompt chaining: a sequence of linked prompts that breaks the hierarchical decision into ordered steps, feeding each prior output forward to constrain and refine the next choice.
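That chained decision process can be sketched end to end. The taxonomy, the `call_llm` helper, and its keyword-matching stub below are illustrative stand-ins for a real model call, not the paper's actual prompts or the ORKG hierarchy:

```python
# Three-level prompt chain: each step narrows the label space using the
# previous step's output. Taxonomy and stub are illustrative only.
TAXONOMY = {
    "Computer Science": {
        "Artificial Intelligence": ["Natural Language Processing", "Computer Vision"],
        "Software Engineering": ["Testing", "Architecture"],
    },
    "Physics": {
        "Astrophysics": ["Exoplanets", "Cosmology"],
    },
}

def call_llm(prompt: str, options: list[str]) -> str:
    """Stub for a real LLM call: scores each option by how many of its words
    appear in the prompt's Abstract line; ties fall back to the first option."""
    abstract = prompt.splitlines()[0].lower()
    return max(options, key=lambda o: sum(w.lower() in abstract for w in o.split()))

def classify_chained(abstract: str) -> tuple[str, str, str]:
    # Step 1: domain (level 1).
    domains = list(TAXONOMY)
    domain = call_llm(f"Abstract: {abstract}\nPick one domain: {domains}", domains)
    # Step 2: subject (level 2), conditioned on the chosen domain.
    subjects = list(TAXONOMY[domain])
    subject = call_llm(
        f"Abstract: {abstract}\nDomain: {domain}\nPick one subject: {subjects}",
        subjects,
    )
    # Step 3: topic (level 3), conditioned on domain and subject.
    topics = TAXONOMY[domain][subject]
    topic = call_llm(
        f"Abstract: {abstract}\nDomain: {domain}; Subject: {subject}\n"
        f"Pick one topic: {topics}",
        topics,
    )
    return domain, subject, topic
```

The design point the chain illustrates: by the time the model reaches level 3, it chooses among only a handful of topics rather than the whole taxonomy, which is where the reported advantage over flat in-context learning is claimed to come from.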
If this is right
- Large language models equipped with prompt chaining reach higher accuracy than prior state-of-the-art systems for domain-level assignment of scientific texts.
- The same chaining method produces stronger results than BERT models on subject-level classification tasks.
- Topic-level classification within the hierarchy stays limited to roughly 50 percent accuracy with current prompting techniques.
- Temperature settings influence the stability and accuracy of the classification outputs under both prompting strategies.
- Chaining proves especially useful when the classification scheme itself contains multiple nested layers.
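The temperature claim above can be made concrete with a sweep harness. The stubbed classifier below, whose error rate simply grows with temperature, is a placeholder for real LLM calls; only the evaluation loop, not the numbers, reflects the paper:

```python
import random

def stub_classify(true_label: str, labels: list[str], temperature: float,
                  rng: random.Random) -> str:
    # Toy behaviour: the chance of drifting off the correct label grows with
    # temperature (0.1 at t=0.0, 0.5 at t=1.0); a real harness would instead
    # pass `temperature` to the LLM API for each call.
    if rng.random() < min(1.0, 0.1 + 0.4 * temperature):
        return rng.choice(labels)
    return true_label

def sweep(temperatures: list[float], trials: int = 200, seed: int = 0) -> dict[float, float]:
    # Repeat the same classification many times per temperature and record
    # the accuracy, exposing how stability degrades as sampling gets hotter.
    rng = random.Random(seed)
    labels = ["CS", "Physics", "Biology"]
    return {
        t: sum(stub_classify("CS", labels, t, rng) == "CS" for _ in range(trials)) / trials
        for t in temperatures
    }

acc = sweep([0.0, 0.5, 1.0])
```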
Where Pith is reading between the lines
- The sequential prompting pattern could transfer to other hierarchical labeling tasks that are not limited to scientific literature.
- If chaining reduces reliance on task-specific fine-tuning, organizations could adapt the method to new taxonomies with minimal additional data.
- Combining chaining with retrieval of similar past examples might raise topic-level accuracy without changing the base model.
- Testing the same workflow on texts from a single narrow field versus mixed domains would reveal how sensitive the gains are to content variety.
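The retrieval idea in the third bullet could be prototyped with any similarity measure. The Jaccard scorer, example pool, and prompt template below are hypothetical stand-ins; a real system might rank with Sentence-BERT embeddings instead:

```python
def jaccard(a: str, b: str) -> float:
    # Token-overlap similarity between two texts.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_examples(query: str, labeled_pool: list[dict], k: int = 2) -> list[dict]:
    # Rank the labeled pool by similarity to the query and keep the top k.
    ranked = sorted(labeled_pool, key=lambda ex: jaccard(query, ex["text"]), reverse=True)
    return ranked[:k]

def build_prompt(query: str, examples: list[dict]) -> str:
    # Splice the retrieved neighbours into the prompt as in-context shots.
    shots = "\n".join(f"Text: {ex['text']}\nTopic: {ex['topic']}" for ex in examples)
    return f"{shots}\nText: {query}\nTopic:"

pool = [
    {"text": "transformer models for machine translation", "topic": "NLP"},
    {"text": "galaxy formation in the early universe", "topic": "Cosmology"},
    {"text": "image segmentation with convolutional networks", "topic": "Computer Vision"},
]
query = "neural machine translation with transformer models"
prompt = build_prompt(query, retrieve_examples(query, pool))
```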
Load-bearing premise
The labels used as ground truth accurately match the intended hierarchical scheme, and the tested models and strategies will perform similarly on other collections of scientific texts.
What would settle it
Applying the identical chaining procedure and models to an independently labeled collection of scientific texts drawn from a different source and finding that accuracy falls back to or below the in-context learning baseline would show the reported gains do not hold.
Original abstract
The relentless expansion of scientific literature presents significant challenges for navigation and knowledge discovery. Within Research Information Retrieval, established tasks such as text summarization and classification remain crucial for enabling researchers and practitioners to effectively navigate this vast landscape, so that efforts have increasingly been focused on developing advanced research information systems. These systems aim not only to provide standard keyword-based search functionalities but also to incorporate capabilities for automatic content categorization within knowledge-intensive organizations across academia and industry. This study systematically evaluates the performance of off-the-shelf Large Language Models (LLMs) in analyzing scientific texts according to a given classification scheme. We utilized the hierarchical ORKG taxonomy as a classification framework, employing the FORC dataset as ground truth. We investigated the effectiveness of advanced prompt engineering strategies, namely In-Context Learning (ICL) and Prompt Chaining, and experimentally explored the influence of the LLMs' temperature hyperparameter on classification accuracy. Our experiments demonstrate that Prompt Chaining yields superior classification accuracy compared to pure ICL, particularly when applied to the nested structure of the ORKG taxonomy. LLMs with prompt chaining outperform the state-of-the-art models for domain (1st level) prediction and show even better performance for subject (2nd level) prediction compared to the older BERT model. However, LLMs are not yet able to perform well in classifying the topic (3rd level) of research areas based on this specific hierarchical taxonomy, as they only reach about 50% accuracy even with prompt chaining.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates off-the-shelf LLMs for hierarchical classification of scientific texts into the ORKG taxonomy (domain/subject/topic levels), using the FORC dataset as ground truth. It compares in-context learning (ICL) against prompt chaining, reports that chaining yields higher accuracy (outperforming SOTA models at level 1 and BERT at level 2), examines the temperature hyperparameter, and notes that level-3 accuracy remains around 50%.
Significance. If the empirical results prove robust and reproducible, the work indicates that prompt chaining can improve LLM-based hierarchical categorization over standard ICL, offering a practical alternative to fine-tuned models like BERT for organizing scientific literature in research information systems.
major comments (2)
- [Abstract] The central claims (prompt chaining > ICL; outperformance vs. SOTA/BERT at levels 1-2) rest on the assumption that FORC labels are accurate, complete, and correctly aligned with the nested ORKG taxonomy. The abstract states FORC is used as ground truth but provides no mapping procedure, quality checks, or validation of label fidelity; any mismatches would make the reported accuracies (especially the level-3 drop) uninterpretable.
- [Experimental protocol] Performance numbers and comparisons are stated without details on exact prompt templates, number of in-context examples, specific temperature values tested, statistical significance, error bars, or full experimental protocol. This prevents verification of whether gains are robust or sensitive to post-hoc choices.
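One way to supply the requested significance check is a paired bootstrap over per-document correctness. The outcome vectors below are synthetic, and `bootstrap_diff` is an illustrative procedure, not anything reported in the paper:

```python
import random

def bootstrap_diff(icl_correct: list[int], chain_correct: list[int],
                   n_boot: int = 2000, seed: int = 1) -> float:
    """Paired bootstrap: resample documents with replacement and return the
    fraction of resamples in which chaining scores strictly higher than ICL."""
    assert len(icl_correct) == len(chain_correct)
    rng = random.Random(seed)
    n = len(icl_correct)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(chain_correct[i] for i in idx) > sum(icl_correct[i] for i in idx):
            wins += 1
    return wins / n_boot

# Synthetic outcomes: chaining correct on 70/100 documents, ICL on 55/100.
icl = [1] * 55 + [0] * 45
chain = [1] * 70 + [0] * 30
p_win = bootstrap_diff(icl, chain)
```

A fraction near 1.0 would indicate the gain survives resampling; the same resamples also yield percentile error bars for each method's accuracy.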
minor comments (1)
- [Abstract] The abstract mentions exploring the temperature hyperparameter but does not summarize its observed influence; ensure the main text reports these results clearly with any associated figures or tables.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us identify areas for improvement in our manuscript. Below, we provide point-by-point responses to the major comments and indicate how we plan to revise the paper accordingly.
Point-by-point responses
-
Referee: [Abstract] The central claims (prompt chaining > ICL; outperformance vs. SOTA/BERT at levels 1-2) rest on the assumption that FORC labels are accurate, complete, and correctly aligned with the nested ORKG taxonomy. The abstract states FORC is used as ground truth but provides no mapping procedure, quality checks, or validation of label fidelity; any mismatches would make the reported accuracies (especially the level-3 drop) uninterpretable.
Authors: We agree with the referee that the abstract lacks sufficient detail on the FORC dataset's alignment with the ORKG taxonomy, which is crucial for interpreting the results. The manuscript does describe the use of FORC as ground truth, but the mapping procedure and validation are not explicitly outlined. To rectify this, we will revise the abstract to include a brief mention of the alignment and add a detailed description of the mapping process, including any quality checks, in the Methods section of the revised manuscript. This will enhance the interpretability of our accuracy figures, particularly at level 3. revision: yes
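A minimal version of the promised quality check, sketched here with a toy taxonomy and toy records rather than the real ORKG hierarchy and FORC files, would verify that every ground-truth triple resolves to a valid path in the hierarchy:

```python
# Toy taxonomy and records; the real check would load the ORKG hierarchy
# and the FORC ground-truth labels instead.
taxonomy = {
    "Computer Science": {"Artificial Intelligence": {"NLP", "Vision"}},
    "Physics": {"Astrophysics": {"Cosmology"}},
}

def invalid_records(records: list[dict]) -> list[dict]:
    # A record is valid only if its topic sits under its subject, which in
    # turn sits under its domain.
    return [
        r for r in records
        if r["topic"] not in taxonomy.get(r["domain"], {}).get(r["subject"], set())
    ]

records = [
    {"domain": "Computer Science", "subject": "Artificial Intelligence", "topic": "NLP"},
    {"domain": "Physics", "subject": "Astrophysics", "topic": "NLP"},  # misaligned
]
bad = invalid_records(records)
```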
-
Referee: [Experimental protocol] Performance numbers and comparisons are stated without details on exact prompt templates, number of in-context examples, specific temperature values tested, statistical significance, error bars, or full experimental protocol. This prevents verification of whether gains are robust or sensitive to post-hoc choices.
Authors: We concur that a more complete experimental protocol is essential for reproducibility. While the manuscript discusses the temperature hyperparameter and compares ICL with prompt chaining, it does not provide the exact prompt templates or other specifics mentioned. In the revised version, we will include the full experimental details, such as the prompt templates in an appendix, the number of in-context examples used, the specific temperature values tested, and statistical measures including significance tests and error bars. These additions will allow readers to verify the robustness of the reported performance gains. revision: yes
Circularity Check
No circularity: purely empirical comparisons without derivations or self-referential reductions
full rationale
The paper conducts an experimental evaluation of LLM prompting techniques (In-Context Learning and Prompt Chaining) for hierarchical classification on the FORC dataset using the ORKG taxonomy. No equations, parameter fittings, or derivation chains exist; reported results are direct accuracy metrics compared to baselines such as BERT. The central claims rest on empirical performance differences rather than any reduction to inputs by construction. Assumptions about FORC label quality and ORKG alignment constitute data-validity concerns, not circularity. No self-citations are load-bearing for the methodology or results.
Axiom & Free-Parameter Ledger
free parameters (1)
- temperature hyperparameter
axioms (2)
- domain assumption The ORKG taxonomy provides a valid hierarchical classification scheme for scientific texts.
- domain assumption The FORC dataset labels are accurate and aligned with ORKG categories.
Reference graph
Works this paper leans on
- [1] Abburi, H., Suesserman, M., Pudota, N., Veeramani, B., Bowen, E., Bhattacharya, S.: Generative AI text classification using ensemble LLM approaches. arXiv preprint arXiv:2309.07755 (2023)
- [2] Abu Ahmad, R., Borisova, E., Rehm, G.: FoRC@NSLP2024: Overview and insights from the field of research classification shared task. In: International Workshop on Natural Scientific Language Processing and Research Knowledge Graphs. pp. 189–204. Springer (2024)
- [3] Al Nazi, Z., Hossain, M.R., Al Mamun, F.: Evaluation of open and closed-source LLMs for low-resource language with zero-shot, few-shot, and chain-of-thought prompting. Natural Language Processing Journal p. 100124 (2025)
- [4] Auer, S., Mann, S.: Towards an open research knowledge graph. The Serials Librarian 76(1-4), 35–41 (2019)
- [5] Bird, S., Dale, R., Dorr, B.J., Gibson, B.R., Joseph, M.T., Kan, M.Y., Lee, D., Powley, B., Radev, D.R., Tan, Y.F., et al.: The ACL Anthology reference corpus: A reference dataset for bibliographic research in computational linguistics. In: LREC (2008)
- [6] Bornmann, L., Haunschild, R., Mutz, R.: Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanities and Social Sciences Communications 8(1), 1–15 (2021)
- [7] Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., et al.: A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15(3), 1–45 (2024)
- [8] Desale, S.K., Kumbhar, R.M.: Research on automatic classification of documents in library environment: a literature review. KO Knowledge Organization 40(5), 295–304 (2014)
- [9] Devatine, N., Abraham, L.: Assessing human editing effort on LLM-generated texts via compression-based edit distance. arXiv preprint arXiv:2412.17321 (2024)
- [10] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
- [11] Enamoto, L., Santos, A.R., Maia, R., Weigang, L., Filho, G.P.R.: Multi-label legal text classification with BiLSTM and attention. International Journal of Computer Applications in Technology 68(4), 369–378 (2022)
- [12] Feuerriegel, S., Hartmann, J., Janiesch, C., Zschech, P.: Generative AI. Business & Information Systems Engineering 66(1), 111–126 (2024)
- [13] Gao, A.: Prompt engineering for large language models. Available at SSRN 4504303 (2023)
- [14] Giglou, H.B., D'Souza, J., Auer, S.: LLMs4Synthesis: Leveraging large language models for scientific synthesis. arXiv preprint arXiv:2409.18812 (2024)
- [15] Golub, K., Hagelbäck, J., Ardö, A.: Automatic classification of Swedish metadata using Dewey Decimal Classification: a comparison of approaches. Journal of Data and Information Science 5(1), 18–38 (2020)
- [16] Hacker, P., Engel, A., Mauer, M.: Regulating ChatGPT and other large generative AI models. In: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. pp. 1112–1123 (2023)
- [17] Herrmannova, D., Knoth, P.: An analysis of the Microsoft Academic Graph. D-Lib Magazine 22(9/10), 37 (2016)
- [18] Hong, Z., Ward, L., Chard, K., Blaiszik, B., Foster, I.: Challenges and advances in information extraction from scientific literature: a review. JOM 73(11), 3383–3400 (2021)
- [19] Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
- [20] Jaradeh, M.Y., Oelen, A., Farfar, K.E., Prinz, M., D'Souza, J., Kismihók, G., Stocker, M., Auer, S.: Open Research Knowledge Graph: Next generation infrastructure for semantic scholarly knowledge. In: Proceedings of the 10th International Conference on Knowledge Capture. pp. 243–246 (2019)
- [21] Jiang, M., D'Souza, J., Auer, S., Downie, J.S.: Improving scholarly knowledge representation: Evaluating BERT-based models for scientific relation classification. In: Digital Libraries at Times of Massive Societal Transition: 22nd International Conference on Asia-Pacific Digital Libraries, ICADL 2020, Kyoto, Japan, November 30–December 1, 2020, Proceedings … (2020)
- [22] Kalyan, K.S.: A survey of GPT-3 family large language models including ChatGPT and GPT-4. Natural Language Processing Journal p. 100048 (2023)
- [23] Kinney, R., Anastasiades, C., Authur, R., Beltagy, I., Bragg, J., Buraczynski, A., Cachola, I., Candra, S., Chandrasekhar, Y., Cohan, A., et al.: The Semantic Scholar open data platform. arXiv preprint arXiv:2301.10140 (2023)
- [24] Liu, S., Yu, S., Lin, Z., Pathak, D., Ramanan, D.: Language models as black-box optimizers for vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12687–12697 (2024)
- [25] Mahapatra, R., Gayan, M., Jamatia, B., et al.: Artificial intelligence tools to enhance scholarly communication: An exploration based on a systematic review (2024)
- [26] Morgan, J., Chiang, M.: Ollama. https://ollama.com (2024), online; accessed 6 August 2024
- [27] Mosca, E., Abdalla, M.H.I., Basso, P., Musumeci, M., Groh, G.: Distinguishing fact from fiction: A benchmark dataset for identifying machine-generated scientific papers in the LLM era. In: Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023). pp. 190–207 (2023)
- [28] Murphy, K.P.: Probabilistic machine learning: an introduction. MIT Press (2022)
- [29] Nah, F., Cai, J., Zheng, R., Pang, N.: An activity system-based perspective of generative AI: Challenges and research directions. AIS Transactions on Human-Computer Interaction 15(3), 247–267 (2023)
- [30] National Science Foundation: Publication output by region, country, or economy, and by scientific field. https://ncses.nsf.gov/pubs/nsb202333/publication-output-by-region-country-or-economy-and-by-scientific-field (2023), accessed: 2025-09-23
- [31] Pal, S., Bhattacharya, M., Islam, M.A., Chakraborty, C.: AI-enabled ChatGPT or LLM: a new algorithm is required for plagiarism-free scientific writing. International Journal of Surgery 110(2), 1329–1330 (2024)
- [32] Perełkiewicz, M., Poświata, R.: A review of the challenges with massive web-mined corpora used in large language models pre-training. arXiv preprint arXiv:2407.07630 (2024)
- [33] Pertsas, V., Kasapaki, M., Constantopoulos, P.: An annotated dataset for transformer-based scholarly information extraction and linguistic linked data generation. In: Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024. pp. 84–93 (2024)
- [34] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019), https://arxiv.org/abs/1908.10084
- [35] Rous, B.: Major update to ACM's Computing Classification System. Communications of the ACM 55(11), 12–12 (2012)
- [36] Scott, M.L.: Dewey Decimal Classification. Libraries Unlimited (1998)
- [37] Shahi, G., Hummel, O.: On the effectiveness of large language models in automating categorization of scientific texts. In: Proceedings of the 27th International Conference on Enterprise Information Systems - Volume 1: ICEIS. pp. 544–554. INSTICC, SciTePress (2025). https://doi.org/10.5220/0013299100003929
- [38] Shahi, G.K., Hummel, O.: Enhancing research information systems with identification of domain experts. In: Proceedings of the Bibliometric-enhanced Information Retrieval Workshop (BIR) at the European Conference on Information Retrieval (ECIR 2024). CEUR Workshop Proceedings, CEUR-WS.org (March 2024)
- [39] Shahi, G.K., Nandini, D.: FakeCovid – a multilingual cross-domain fact check news dataset for COVID-19. In: Proceedings of the 14th International AAAI Conference on Web and Social Media (2020)
- [40] Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M.S., Love, J., et al.: Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295 (2024)
- [41] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
- [42] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
- [43] Wang, J.: An extensive study on automated Dewey Decimal Classification. Journal of the American Society for Information Science and Technology 60(11), 2269–2286 (2009)
- [44] Wang, K., Shen, Z., Huang, C., Wu, C.H., Dong, Y., Kanakia, A.: Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 1(1), 396–413 (2020)
- [45] Wang, S., Hu, T., Xiao, H., Li, Y., Zhang, C., Ning, H., Zhu, R., Li, Z., Ye, X.: GPT, large language models (LLMs) and generative artificial intelligence (GAI) models in geospatial science: a systematic review. International Journal of Digital Earth 17(1), 2353122 (2024)
- [46] Young, J.S., Lammert, M.: ChatGPT for classification: Evaluation of an automated course mapping method in academic libraries (2024)
- [47] Zhang, C., Tian, L., Chu, H.: Usage frequency and application variety of research methods in library and information science: Continuous investigation from 1991 to 2021. Information Processing & Management 60(6), 103507 (2023)
- [48] Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al.: A survey of large language models. arXiv preprint arXiv:2303.18223 (2023)