hub Mixed citations

BERTopic: Neural topic modeling with a class-based TF-IDF procedure

Mixed citation behavior. Most common role is method (64%).

89 Pith papers citing it
Method 64% of classified citations
abstract

Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approaching topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach to topic modeling.
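The last step of the pipeline described in the abstract, building topic representations with class-based TF-IDF, can be illustrated with a minimal standard-library sketch. It rests on several assumptions that are not stated in the abstract: the weighting `tf(t, c) * log(1 + A / tf(t))`, with `A` the average number of words per cluster, follows the formula as given in the BERTopic paper; the embedding and clustering steps are replaced by hard-coded, hypothetical cluster labels; and tokenization is a plain whitespace split. The names `class_tfidf` and `top_terms` are illustrative and are not part of any library.

```python
import math
from collections import Counter


def class_tfidf(docs, labels):
    """Score terms per cluster with a class-based TF-IDF weighting (sketch)."""
    # Treat all documents of one cluster as a single bag of words.
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.lower().split())

    # Frequency of each term across all clusters, and the average
    # number of words per cluster (A in the assumed formula).
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)

    # Assumed weighting: tf(t, c) * log(1 + A / tf(t)).
    return {
        label: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


def top_terms(scores, n=2):
    """Return the n highest-weighted terms per cluster (ties broken alphabetically)."""
    return {
        label: [t for t, _ in sorted(w.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for label, w in scores.items()
    }


docs = [
    "cat chased mouse today",
    "cat slept while mouse hid",
    "stock market fell today",
    "stock prices rose as market rallied",
]
# Hypothetical cluster assignments; in the described pipeline these would come
# from clustering transformer-based document embeddings.
labels = [0, 0, 1, 1]

scores = class_tfidf(docs, labels)
topics = top_terms(scores, n=2)
print(topics)
```

In this toy corpus the word "today" occurs in both clusters, so it receives a lower weight than a word of equal in-cluster frequency that is unique to one cluster, which is the behavior the class-based weighting is meant to produce.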

citation-role summary

method 9 · background 4 · baseline 1
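The 64% method share appears to be computed over classified citations only, not over all 89 citing papers. A quick arithmetic check, assuming the three role counts listed here are the complete classified set:

```python
# Role counts as listed in the citation-role summary.
role_counts = {"method": 9, "background": 4, "baseline": 1}

# Assumption: these three roles cover every classified citation.
classified = sum(role_counts.values())

# Share of each role among classified citations, rounded to whole percent.
shares = {role: round(100 * n / classified) for role, n in role_counts.items()}
print(classified, shares)
```

Under that assumption, 9 of 14 classified citations are method citations, which rounds to the 64% shown.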

representative citing papers

SemCEB: A Cardinality Estimation Benchmark for Semantic Operators

cs.DB · 2026-06-22 · unverdicted · novelty 7.0

SemCEB is the first benchmark for cardinality estimation over semantic operators, evaluating sampling methods and Semantic Histograms on accuracy, cost, latency, and memory using 102 queries on a real-world dataset.

Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political communication and LLM applications.

A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

A multi-agent LLM system discovers criteria such as Encouraging, Urgent, and Clear for surgical feedback and uses them to score 4.2k instances, outperforming prior content-based approaches in predicting trainee behavior changes and trainer approval.

Algorithmic Cultivation: How Social Media Feeds Shape User Language

cs.SI · 2026-05-16 · unverdicted · novelty 6.0

Quasi-experimental study of 235M Bluesky posts finds that exposure to algorithmic feeds produces greater stylistic accommodation, semantic alignment, and register formalization than in matched controls, with effects varying by feed and strongest for reposting.

citing papers explorer

Showing 50 of 89 citing papers.