Term-Centric Hierarchy Induction from Heterogeneous Corpora

Barbara Plank; Elena Senger; Jan-Peter Bergmann; Rob van der Goot; Yuri Campbell

arxiv: 2606.26963 · v1 · pith:OMY2Q6JVnew · submitted 2026-06-25 · 💻 cs.CL

Pith reviewed 2026-06-26 04:37 UTC · model grok-4.3

classification 💻 cs.CL

keywords term-centric hierarchy induction · heterogeneous corpora · automatic term extraction · taxonomy induction · cross-source alignment · knowledge organization · multi-source benchmark

The pith

A term-centric approach using automatic term extraction induces more coherent hierarchies from heterogeneous corpora than document- or summary-level methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that extracting key terms from documents creates a shared representation space that aligns content from diverse sources more effectively than working with full texts or summaries. A sympathetic reader would care because many practical tasks require turning scattered documents into organized, interpretable knowledge structures without losing source-specific details. The method first maps documents via term extraction, then builds hierarchies by combining domain priors with data-driven clustering. Experiments on a new benchmark of over one million English and German documents show gains in cross-source coherence and overall quality. A case study on German regional innovation further illustrates use for mapping technology landscapes.

Core claim

We propose a term-centric framework for inducing hierarchical taxonomies from heterogeneous corpora that scales to massive document collections. Our approach maps documents from diverse sources into a shared representation space using automatic term extraction, enabling robust cross-source alignment. Based on these representations, we construct interpretable hierarchies that integrate domain priors with data-driven clustering. Experiments on a novel English and German multi-source benchmark of over one million documents demonstrate that our method improves cross-source coherence and hierarchy quality over text- and summary-based baselines.

What carries the argument

The term-centric framework that maps documents via automatic term extraction into a shared representation space for cross-source alignment and hierarchy construction.
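
As a rough illustration of what a shared term space could look like, the sketch below represents each document by counts of extracted terms and compares documents across sources with cosine similarity. The lexicon-matching `extract_terms`, the `TERM_LEXICON` set, and the three toy documents are assumptions made for this example; the abstract does not specify the extraction algorithm or the representation actually used.

```python
from collections import Counter
from math import sqrt

# Hypothetical term lexicon standing in for an automatic term extractor.
TERM_LEXICON = {"solar cell", "battery storage", "machine learning", "neural network"}


def extract_terms(text: str) -> Counter:
    """Toy extractor: count lexicon terms that occur in the lower-cased text."""
    lowered = text.lower()
    return Counter({t: lowered.count(t) for t in TERM_LEXICON if t in lowered})


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy documents keyed by (source, id); the sources differ in style, not in topic.
docs = {
    ("patent", "p1"): "A solar cell module coupled with battery storage.",
    ("news", "n1"): "Startup raises funds for battery storage and solar cell production.",
    ("paper", "a1"): "We train a neural network with machine learning methods.",
}
term_vectors = {key: extract_terms(text) for key, text in docs.items()}

patent_vs_news = cosine(term_vectors[("patent", "p1")], term_vectors[("news", "n1")])
patent_vs_paper = cosine(term_vectors[("patent", "p1")], term_vectors[("paper", "a1")])
```

In this toy setup the patent and news snippets share both of their terms and land close together, while the paper snippet shares none with the patent. A real pipeline would presumably replace the lexicon match with a trained extractor and a denser representation.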

If this is right

Hierarchies integrate domain knowledge across sources while preserving source-specific details.
The approach scales to collections of more than one million documents.
Hierarchy quality and cross-source coherence exceed results from document-level and summary-based baselines.
The framework supports practical tasks such as technology landscape mapping in innovation analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same term-extraction alignment step could be tested on additional languages or specialized domains to check how far the shared space generalizes.
The resulting hierarchies might serve as structured input for downstream models in search or recommendation systems.
One could examine whether the method retains low-frequency domain terms more reliably than whole-document representations.

Load-bearing premise

Automatic term extraction can map documents from diverse sources into a shared representation space that enables robust cross-source alignment without significant loss of domain-specific information.

What would settle it

If the method produces no measurable improvement in cross-source coherence or hierarchy quality metrics on the English and German multi-source benchmark of over one million documents compared with text- and summary-based baselines, the central claim would be falsified.
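
One way such a comparison might be operationalized is sketched below. The paper's actual coherence metric is not given in the abstract, so `source_mixing` (normalized entropy of source labels inside a cluster) is only a stand-in proxy, and the two cluster assignments are invented toy data rather than results.

```python
from collections import Counter
from math import log


def source_mixing(cluster_sources: list[str], n_sources: int) -> float:
    """Normalized entropy of source labels in one cluster (0 = one source, 1 = even mix)."""
    counts = Counter(cluster_sources)
    total = sum(counts.values())
    if total == 0 or n_sources < 2:
        return 0.0
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return entropy / log(n_sources)


def mean_mixing(clusters: list[list[str]], n_sources: int) -> float:
    """Size-weighted average of source mixing over all clusters."""
    n_docs = sum(len(c) for c in clusters)
    if n_docs == 0:
        return 0.0
    return sum(len(c) * source_mixing(c, n_sources) for c in clusters) / n_docs


# Toy cluster assignments: each inner list holds the source label of every document in a cluster.
term_clusters = [["patent", "news", "paper"], ["patent", "news", "paper"]]
text_clusters = [["patent", "patent", "patent"], ["news", "news", "paper"]]

term_score = mean_mixing(term_clusters, n_sources=3)
text_score = mean_mixing(text_clusters, n_sources=3)
```

Under this proxy, the claim would be in trouble if the term-based score were not higher than the baseline score on the real benchmark. Source mixing alone does not show that clusters are topically coherent, so it would need to be paired with a hierarchy quality measure.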

Figures

Figures reproduced from arXiv: 2606.26963 by Barbara Plank, Elena Senger, Jan-Peter Bergmann, Rob van der Goot, Yuri Campbell.

**Figure 2.** Sensitivity of the hierarchy shape to the three construction hyperparameters.

read the original abstract

Organizing knowledge from diverse text sources into interpretable hierarchies is crucial for tasks such as policy analysis, innovation monitoring, and exploratory domain mapping. Existing taxonomy induction methods typically rely on document-level representations that capture entire documents rather than the specific domain concepts relevant for knowledge organization, limiting their ability to generalize across heterogeneous sources. We propose a term-centric framework for inducing hierarchical taxonomies from heterogeneous corpora that scales to massive document collections. Our approach maps documents from diverse sources into a shared representation space using automatic term extraction, enabling robust cross-source alignment. Based on these representations, we construct interpretable hierarchies that integrate domain priors with datadriven clustering. Experiments on a novel English and German multi-source benchmark of over one million documents demonstrate that our method improves cross-source coherence and hierarchy quality over text- and summarybased baselines. A case study on German regional innovation analysis further demonstrates its practical utility for technology landscape mapping.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Term-centric hierarchy induction from heterogeneous text claims better cross-source alignment via automatic term extraction, but the abstract provides no validation for the no-loss assumption and no methods details.

The main takeaway is a shift to term-centric representations for inducing hierarchies from mixed English and German corpora, using automatic term extraction to create a shared space for alignment before clustering with domain priors. This is positioned as an improvement over document-level methods for tasks like policy analysis and innovation monitoring.

The paper does handle scale, reporting experiments on a new benchmark of over one million documents and gains in cross-source coherence and hierarchy quality against text- and summary-based baselines. The German regional innovation case study adds a practical angle.

The soft spot is the unexamined assumption that term extraction maps diverse sources into the shared space without significant loss of domain-specific information. The abstract states the claim but gives no extraction algorithm, no preservation metrics, and no ablation on information loss. If source-specific terms drop out, the hierarchies would likely align on generic concepts, undercutting the coherence improvements. Without the methods or results sections, the reported gains cannot be checked.

This work is aimed at NLP researchers working on taxonomy induction and knowledge organization from heterogeneous text. A reader in that area could get value from the benchmark and case study if the technical details hold.

It deserves a serious referee to review the full experimental setup and verify whether the central claims are supported.

Referee Report

1 major / 1 minor

Summary. The paper proposes a term-centric framework for inducing hierarchical taxonomies from heterogeneous corpora. Documents from diverse sources are mapped into a shared representation space via automatic term extraction to enable cross-source alignment. Interpretable hierarchies are then constructed by integrating domain priors with data-driven clustering. Experiments on a novel English and German multi-source benchmark of over one million documents report improvements in cross-source coherence and hierarchy quality over text- and summary-based baselines. A case study on German regional innovation analysis is presented to illustrate practical utility.

Significance. If the reported gains hold under scrutiny of the full experimental design, the work could advance scalable taxonomy induction for multi-source and multilingual settings, supporting applications in policy analysis and technology landscape mapping. The scale of the introduced benchmark (>1M documents) represents a positive contribution to evaluation resources in the area.

major comments (1)

[Abstract (method description)] The central claim that automatic term extraction maps heterogeneous documents into a shared space 'enabling robust cross-source alignment' without significant loss of domain-specific information is load-bearing for both the cross-source coherence improvements and the German innovation case study, yet the manuscript provides no extraction algorithm details, no preservation metrics (e.g., per-source domain-term recall or overlap), and no ablation isolating information loss effects. This directly matches the weakest assumption identified in the stress-test note.
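
The preservation metrics the referee asks for could be computed along the following lines. This is a minimal sketch: the per-source gold term sets, the extracted sets, and the function names are hypothetical, and the manuscript may define recall or overlap differently.

```python
def term_recall(extracted: set[str], gold: set[str]) -> float:
    """Share of gold domain terms that survive extraction for one source."""
    return len(extracted & gold) / len(gold) if gold else 0.0


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between the extracted term sets of two sources."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical gold domain terms per source (e.g. from manual annotation).
gold_terms = {
    "patent": {"solar cell", "inverter", "photovoltaic module"},
    "news": {"solar cell", "subsidy"},
}
# Hypothetical extractor output per source.
extracted_terms = {
    "patent": {"solar cell", "inverter"},
    "news": {"solar cell", "subsidy"},
}

recall_by_source = {s: term_recall(extracted_terms[s], gold_terms[s]) for s in gold_terms}
cross_source_overlap = jaccard(extracted_terms["patent"], extracted_terms["news"])
```

Low recall for one source combined with high cross-source overlap would be a warning sign that alignment is achieved by dropping source-specific terms, which is the information-loss scenario the comment describes.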

minor comments (1)

[Abstract] Typo: 'datadriven' should be 'data-driven'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The point raised about missing details on automatic term extraction is well-taken and directly affects the clarity of the central methodological claim.

Referee: [Abstract (method description)] The central claim that automatic term extraction maps heterogeneous documents into a shared space 'enabling robust cross-source alignment' without significant loss of domain-specific information is load-bearing for both the cross-source coherence improvements and the German innovation case study, yet the manuscript provides no extraction algorithm details, no preservation metrics (e.g., per-source domain-term recall or overlap), and no ablation isolating information loss effects. This directly matches the weakest assumption identified in the stress-test note.

Authors: We agree that the current manuscript lacks sufficient detail on the automatic term extraction step, including the specific algorithm, quantitative preservation metrics, and an ablation isolating information loss. In the revised version we will add a dedicated subsection describing the extraction procedure, report per-source domain-term recall and overlap statistics, and include an ablation comparing term-centric versus document-level representations. These additions will directly support the cross-source alignment claim. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method validated on external benchmark

The paper describes a term-centric framework that maps documents via automatic term extraction into a shared space, then builds hierarchies integrating priors and clustering. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the abstract or description. Claims rest on experimental results against baselines on a novel >1M-document benchmark, which is externally falsifiable and does not reduce to self-definition or renaming. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be identified from the text.

[1] [1]

Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM -Guided Multi-Aspect Clustering

Zhu, Kun and Liao, Lizi and Gu, Yuxuan and Huang, Lei and Feng, Xiaocheng and Qin, Bing. Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM -Guided Multi-Aspect Clustering. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.788

work page doi:10.18653/v1/2025.emnlp-main.788 2025

[2] [2]

GCD - TM : Graph-Driven Community Detection for Topic Modelling in Psychiatry Texts

Krishnan, Anusuya and Ghebrehiwet, Isaias Mehari. GCD - TM : Graph-Driven Community Detection for Topic Modelling in Psychiatry Texts. Proceedings of the 1st Workshop on NLP for Science (NLP4Science). 2024. doi:10.18653/v1/2024.nlp4science-1.6

work page doi:10.18653/v1/2024.nlp4science-1.6 2024

[3] [3]

Topic Modeling Using Community Detection on a Word Association Graph

Chowdhury, Mahfuzur Rahman and Ahmed, Intesur and Sadeque, Farig and Yanhaona, Muhammad. Topic Modeling Using Community Detection on a Word Association Graph. Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing. 2023

2023

[4] [4]

2023 , url=

Graph2topic: an opensource topic modeling framework based on sentence embedding and community detection , author=. 2023 , url=

2023

[5] [5]

2025 , eprint=

Science Hierarchography: Hierarchical Organization of Science Literature , author=. 2025 , eprint=

2025

[6] [6]

AI for Accelerated Materials Design - NeurIPS 2024 , year=

Scientific Knowledge Graph and Ontology Generation using Open Large Language Models , author=. AI for Accelerated Materials Design - NeurIPS 2024 , year=

2024

[7] [7]

Knowledge Navigator: LLM -guided Browsing Framework for Exploratory Search in Scientific Literature

Katz, Uri and Levy, Mosh and Goldberg, Yoav. Knowledge Navigator: LLM -guided Browsing Framework for Exploratory Search in Scientific Literature. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.516

work page doi:10.18653/v1/2024.findings-emnlp.516 2024

[8] [8]

2025 , eprint=

Taxonomy Tree Generation from Citation Graph , author=. 2025 , eprint=

2025

[9] [9]

Topic Intrusion for Automatic Topic Model Evaluation

Bhatia, Shraey and Lau, Jey Han and Baldwin, Timothy. Topic Intrusion for Automatic Topic Model Evaluation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1098

work page doi:10.18653/v1/d18-1098 2018

[10] [10]

Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data , pages =

Zhang, Tian and Ramakrishnan, Raghu and Livny, Miron , title =. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data , pages =. 1996 , isbn =. doi:10.1145/233269.233324 , abstract =

work page doi:10.1145/233269.233324 1996

[11] [11]

Ward , journal =

Joe H. Ward , journal =. Hierarchical Grouping to Optimize an Objective Function , urldate =

[12] [12]

, title =

Sculley, D. , title =. Proceedings of the 19th International Conference on World Wide Web , pages =. 2010 , isbn =. doi:10.1145/1772690.1772862 , abstract =

work page doi:10.1145/1772690.1772862 2010

[13] [13]

MacQueen, J. B. Some methods for classification and analysis of multivariate observations. Proceedings of the fifth berkeley symposium on mathematical statistics and probability. 1967

1967

[14] [14]

Proceedings of the eighth ACM international conference on Web search and data mining , pages=

Exploring the space of topic coherence measures , author=. Proceedings of the eighth ACM international conference on Web search and data mining , pages=

[15] [15]

2010 , publisher=

Software framework for topic modelling with large corpora , author=. 2010 , publisher=

2010

[16] [16]

ACM Transactions on Information Systems (TOIS) , volume=

A similarity measure for indefinite rankings , author=. ACM Transactions on Information Systems (TOIS) , volume=. 2010 , publisher=

2010

[17] [17]

International conference on applications of Natural Language to information systems , pages=

Word embedding-based topic similarity measures , author=. International conference on applications of Natural Language to information systems , pages=. 2021 , organization=

2021

[18] [18]

and Bouldin, Donald W

Davies, David L. and Bouldin, Donald W. , journal=. A Cluster Separation Measure , year=

[19] [19]

PeerJ Computer Science , volume=

The Silhouette coefficient and the Davies-Bouldin index are more informative than Dunn index, Calinski-Harabasz index, Shannon entropy, and Gap statistic for unsupervised clustering internal evaluation of two convex clusters , author=. PeerJ Computer Science , volume=. 2025 , publisher=

2025

[20] [20]

Proceedings of the forty-eighth annual ACM symposium on Theory of Computing , pages=

A cost function for similarity-based hierarchical clustering , author=. Proceedings of the forty-eighth annual ACM symposium on Theory of Computing , pages=

[21] [21]

Crossing Domains without Labels: Distant Supervision for Term Extraction

Senger, Elena and Campbell, Yuri and Goot, Rob Van Der and Plank, Barbara. Crossing Domains without Labels: Distant Supervision for Term Extraction. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2025. doi:10.18653/v1/2025.emnlp-industry.95

work page doi:10.18653/v1/2025.emnlp-industry.95 2025

[22] [22]

Pattern Recognition Letters , volume=

Data clustering: 50 years beyond K-means , author=. Pattern Recognition Letters , volume=. 2010 , doi=

2010

[23] [23]

Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96) , pages=

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , author=. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96) , pages=

[24] [24]

2017 , doi=

McInnes, Leland and Healy, John and Astels, Steve , journal=. 2017 , doi=

2017

[25] [25]

Introduction to Information Retrieval , author=

[26] [26]

Statistics and Computing , volume=

A tutorial on spectral clustering , author=. Statistics and Computing , volume=. 2007 , doi=

2007

[27] [27]

A Unified Taxonomy-Guided Instruction Tuning Framework for Entity Set Expansion and Taxonomy Expansion

Shen, Yanzhen and Zhang, Yu and Zhang, Yunyi and Han, Jiawei. A Unified Taxonomy-Guided Instruction Tuning Framework for Entity Set Expansion and Taxonomy Expansion. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.167

work page doi:10.18653/v1/2025.findings-acl.167 2025

[28] [28]

CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring , url=

Huang, Jiaxin and Xie, Yiqing and Meng, Yu and Zhang, Yunyi and Han, Jiawei , year=. CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring , url=. doi:10.1145/3394486.3403244 , booktitle=

work page doi:10.1145/3394486.3403244

[29] [29]

Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , pages =

Zhang, Chao and Tao, Fangbo and Chen, Xiusi and Shen, Jiaming and Jiang, Meng and Sadler, Brian and Vanni, Michelle and Han, Jiawei , title =. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , pages =. 2018 , isbn =. doi:10.1145/3219819.3220064 , abstract =

work page doi:10.1145/3219819.3220064 2018

[30] Panchenko, Alexander and Faralli, Stefano and Ruppert, Eugen and Remus, Steffen and Naets, Hubert and Fairon, Cédrick and Ponzetto, Simone Paolo and Biemann, Chris. TAXI at SemEval-2016 Task 13: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling. Proceedings of the 10th International Workshop on Semantic... 2016. doi:10.18653/v1/s16-1206

[31] Shwartz, Vered and Goldberg, Yoav and Dagan, Ido. Improving Hypernymy Detection with an Integrated Path-based and Distributional Method. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1226

[32] Hearst, Marti A. Automatic Acquisition of Hyponyms from Large Text Corpora. COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics. 1992.

[33] Wang, Chi and Danilevsky, Marina and Desai, Nihit and Zhang, Yinan and Nguyen, Phuong and Taula, Thrivikrama and Han, Jiawei. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013. doi:10.1145/2487575.2487631

[34] Liu, Xueqing and Song, Yangqiu and Liu, Shixia and Wang, Haixun. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2012. doi:10.1145/2339530.2339754

[35] Mimno, David and Li, Wei and McCallum, Andrew. Proceedings of the 24th International Conference on Machine Learning. 2007. doi:10.1145/1273496.1273576

[36] BERTopic: Neural topic modeling with a class-based TF-IDF procedure. ArXiv.

[37] Lee, Dongha and Shen, Jiaming and Kang, Seongku and Yoon, Susik and Han, Jiawei and Yu, Hwanjo. Proceedings of the ACM Web Conference 2022. 2022. doi:10.1145/3485447.3512002

[38] Polchar, Jan. 2024. doi:10.1787/84820cd8-en

[39] Hakiman, Kamran and Stull-Lane, Chloe. 2022.

[40] Li, Shaobo and Hu, Jie and Cui, Yuxin and Hu, Jianjun. DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics.

[41] OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. 2022.

[42] 2015. doi:10.2906/112117098108/12

[43] 2022. doi:10.2906/112117098108/20

[44] All Science Journal Classification (ASJC) Codes.

[45] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. 2025.

[46] The Llama 3 Herd of Models. 2024.

[47] Zhu, Xingwei and Ming, Zhao-Yan and Zhu, Xiaoyan and Chua, Tat-Seng. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2013. doi:10.1145/2484028.2484032