Recognition: 2 Lean theorem links
BERTopic: Neural topic modeling with a class-based TF-IDF procedure
Pith reviewed 2026-05-11 14:43 UTC · model grok-4.3
The pith
BERTopic discovers latent topics by clustering transformer embeddings and applying class-based TF-IDF.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach of topic modeling.
What carries the argument
The class-based TF-IDF procedure that computes term importance by treating each document cluster as a distinct class and measuring how distinctive terms are to that class compared to others.
Load-bearing premise
That the clusters formed from transformer embeddings correspond to meaningful latent topics in the data.
What would settle it
If BERTopic produces lower topic coherence scores than LDA on standard benchmarks such as those used in the paper, or if human judges find its topics less interpretable, the claim of competitiveness and coherence would not hold.
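The coherence scores invoked here have standard definitions; as a concrete illustration, the sketch below computes a simplified NPMI coherence, counting co-occurrence at the document level (published evaluations typically use sliding windows over a large reference corpus). The toy corpus and the pairwise averaging are illustrative assumptions, not the paper's evaluation protocol.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all pairs of topic words.
    NPMI(w1, w2) = log(p12 / (p1 * p2)) / -log(p12), bounded in [-1, 1]."""
    doc_sets = [set(d.split()) for d in documents]
    n = len(doc_sets)

    def p(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)   # words never co-occur: minimum score
        elif p12 == 1.0:
            scores.append(1.0)    # words co-occur in every document
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)

docs = ["cat dog pet", "cat dog animal", "stock market fell", "stock price fell"]
coherent = npmi_coherence(["cat", "dog"], docs)      # always co-occur -> 1.0
incoherent = npmi_coherence(["cat", "stock"], docs)  # never co-occur -> -1.0
```

A topic model whose top words behave like the first pair scores well; one whose top words behave like the second scores poorly, which is the sense in which a lower score for BERTopic than LDA would undermine the coherence claim.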
Original abstract
Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approach topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representation through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embedding with pre-trained transformer-based language models, clusters these embeddings, and finally, generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach of topic modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents BERTopic, a topic modeling pipeline that (1) embeds documents with pre-trained transformer language models, (2) reduces dimensionality and clusters the embeddings (via UMAP + HDBSCAN), and (3) extracts topic representations by treating each cluster as a single class and applying a class-based TF-IDF (c-TF-IDF) procedure. It claims that the resulting topics are coherent and that the method remains competitive with classical topic models (e.g., LDA) and other recent clustering-based approaches across multiple benchmarks.
Significance. If the empirical claims hold after proper controls, BERTopic supplies a practical, modular pipeline that leverages modern sentence embeddings for clustering and a simple modification of TF-IDF for topic labeling. This could lower the barrier to producing interpretable topics on large corpora while remaining competitive on standard coherence metrics.
major comments (2)
- [Experiments] Experiments section: no ablation is reported that fixes the document clusters obtained from the embedding + UMAP + HDBSCAN steps and then compares c-TF-IDF against standard TF-IDF (or other cluster-labeling methods) on the identical clusters. Without this isolation, coherence gains cannot be attributed to the class-based TF-IDF step rather than to the quality of the preceding transformer embeddings and clustering; this directly weakens the central novelty claim that the c-TF-IDF procedure is responsible for improved topic representations.
- [Method] Method section (c-TF-IDF description): the procedure is described procedurally but lacks an explicit equation or algorithmic listing that defines how term frequency is aggregated per cluster and how inverse document frequency is computed across clusters. This makes it impossible to verify whether c-TF-IDF is mathematically distinct from simply concatenating documents within each cluster and running ordinary TF-IDF.
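One explicit way to write the definition this comment asks for, reconstructed from the procedural description rather than taken from the paper (the symbols and the smoothing choice are our reconstruction):

```latex
% Term frequency of term t pooled over cluster c, and a class-based IDF
% over N clusters, where df_t counts the clusters containing t.
\mathrm{tf}_{t,c} = \sum_{d \in c} \mathrm{count}(t, d), \qquad
W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\!\Bigl(1 + \frac{N}{\mathrm{df}_t}\Bigr)
```

Under this reading, c-TF-IDF is ordinary TF-IDF applied after concatenating each cluster's documents into one pseudo-document, which is exactly the equivalence the comment asks the authors to confirm or refute.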
minor comments (2)
- [Abstract] The abstract and introduction could briefly state the exact coherence metrics (e.g., NPMI, CV) and the precise baselines used in the benchmark tables to allow readers to assess competitiveness without consulting the full experimental section.
- [Figures/Tables] Figure captions and table footnotes should explicitly note the number of topics, the embedding model, and the clustering hyperparameters used for each reported result, as these are free parameters that affect reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and outline the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: [Experiments] Experiments section: no ablation is reported that fixes the document clusters obtained from the embedding + UMAP + HDBSCAN steps and then compares c-TF-IDF against standard TF-IDF (or other cluster-labeling methods) on the identical clusters. Without this isolation, coherence gains cannot be attributed to the class-based TF-IDF step rather than to the quality of the preceding transformer embeddings and clustering; this directly weakens the central novelty claim that the c-TF-IDF procedure is responsible for improved topic representations.
Authors: We agree that an ablation isolating the contribution of the topic representation step would strengthen the paper. In the revised version we will add an experiment that fixes the clusters obtained from the embedding + UMAP + HDBSCAN pipeline and compares c-TF-IDF against alternative cluster-labeling methods (e.g., raw term-frequency ranking and other simple representations) on those identical clusters. We will also clarify in the text that c-TF-IDF is mathematically equivalent to applying standard TF-IDF after concatenating documents within each cluster; therefore the comparison will focus on distinct labeling alternatives rather than an identical procedure. (Revision: partial)
-
Referee: [Method] Method section (c-TF-IDF description): the procedure is described procedurally but lacks an explicit equation or algorithmic listing that defines how term frequency is aggregated per cluster and how inverse document frequency is computed across clusters. This makes it impossible to verify whether c-TF-IDF is mathematically distinct from simply concatenating documents within each cluster and running ordinary TF-IDF.
Authors: We acknowledge the lack of a formal definition. In the revised manuscript we will insert explicit equations for the c-TF-IDF procedure: term frequency for a word in a cluster is the sum of its occurrences across all documents belonging to that cluster; inverse document frequency is computed as log(N / df) where N is the number of clusters and df is the number of clusters containing the word (with additive smoothing). We will also state explicitly that this formulation is equivalent to concatenating the documents of each cluster and running ordinary TF-IDF with clusters treated as the documents. The revised text will emphasize that the contribution of the work lies in the overall pipeline rather than in a mathematically novel TF-IDF variant. (Revision: yes)
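Both responses can be made concrete in a few lines: the sketch below fixes two toy clusters and compares raw term-frequency labeling against the c-TF-IDF weighting the rebuttal describes (term frequency pooled per cluster, times log(N/df), with a +1 inside the log as additive smoothing). The corpus, the smoothing choice, and the single-word labels are illustrative assumptions, not the paper's experimental setup.

```python
import math
from collections import Counter

# Fixed toy clusters: the ablation holds the clustering constant and varies
# only the labeling step, so any difference in top words is attributable to
# the representation method, not to the embeddings or the clustering.
clusters = {
    0: ["the solar farm generates power", "the wind farm generates energy"],
    1: ["team won the match", "fans cheered the match"],
}
N = len(clusters)

def label_tf(docs, k=1):
    """Baseline labeling: rank terms by raw within-cluster frequency."""
    counts = Counter(w for d in docs for w in d.split())
    return [w for w, _ in counts.most_common(k)]

def label_ctfidf(cid, k=1):
    """c-TF-IDF per the rebuttal: pool each cluster into one pseudo-document,
    then weight term frequency by a smoothed log(N / df) over clusters."""
    tf = Counter(w for d in clusters[cid] for w in d.split())
    df = Counter(w for docs in clusters.values()
                 for w in {t for d in docs for t in d.split()})
    scores = {w: c * math.log(1 + N / df[w]) for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

tf_labels = {cid: label_tf(docs) for cid, docs in clusters.items()}
ctfidf_labels = {cid: label_ctfidf(cid) for cid in clusters}
# Raw frequency surfaces the stopword "the" in both clusters; the
# class-based weighting demotes it because it occurs in every cluster.
```

On this toy data the frequency baseline labels both clusters "the", while the class-based weighting yields "farm" and "match", which is the kind of gap the requested ablation would measure on identical clusters.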
Circularity Check
No circularity: procedural pipeline evaluated on external benchmarks
Full rationale
The paper presents BERTopic as a three-step pipeline (transformer embeddings, UMAP+HDBSCAN clustering, class-based TF-IDF) without any mathematical derivation chain or first-principles claims that reduce to fitted inputs. Topic quality is measured via external benchmarks (NPMI, coherence scores) against classical and clustering baselines, providing independent falsifiability. No self-definitional equations, renamed predictions, or load-bearing self-citations appear in the abstract or described method; the class-based TF-IDF is introduced as a novel labeling step rather than derived from prior outputs by construction. This is a standard empirical method paper whose central claim rests on comparative evaluation, not tautology.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of clusters / topics
- clustering hyperparameters (e.g., UMAP and HDBSCAN parameters)
axioms (2)
- Domain assumption: Pre-trained transformer embeddings capture semantic similarity relevant to topic structure.
- Domain assumption: Class-based TF-IDF produces more coherent topic words than standard TF-IDF or other labeling methods.
Lean theorems connected to this paper
- Cost.FunctionalEquation · washburn_uniqueness_aczel · relevance unclear · "BERTopic generates document embedding with pre-trained transformer-based language models, clusters these embeddings, and finally, generates topic representations with the class-based TF-IDF procedure."
- PhiForcing · phi_equation · relevance unclear · "We present BERTopic, a topic model that extends this process by extracting coherent topic representation through the development of a class-based variation of TF-IDF."
Forward citations
Cited by 40 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation
Adversarial smuggling attacks encode harmful content into human-readable visuals that evade MLLM detection, achieving over 90% attack success rates on models like GPT-5 and Qwen3-VL via the new SmuggleBench benchmark.
-
What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook
AI-only technical discourse on MoltBook is coherent and organized around 12 themes led by security and trust, but it lacks the concrete code, runtime failures, and reproduction steps common in human GitHub discussions.
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
Mapping Emerging Climate Misinformation Playbooks in the Global South
Brazilian YouTube climate videos show a transition from traditional denial of climate science to 'new denial' that undermines solutions, with the latter attracting more engagement from diverse actors.
-
The Platform Is Mostly Not a Platform: Token Economies and Agent Discourse on Moltbook
Moltbook operates as two largely separate layers: a dominant transactional token economy using protocols like MBC-20 and a thinner discursive conversation layer with only 3.6% agent overlap.
-
Participatory provenance as representational auditing for AI-mediated public consultation
Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.
-
Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles
LLMs conditioned on actual psychometric profiles produce life stories from which independent LLMs recover personality scores at mean r=0.75, 85% of human reliability, with emotional patterns replicating in real human data.
-
Discovery-Oriented Faceting: From Coverage to Blind-Spot Discovery
DOF ranks document categories by distinctiveness instead of size to promote blind-spot discovery, surfacing different content than coverage-based methods across four domains.
-
MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval
MIRA is a new benchmark for multi-category integrated retrieval built from real queries on a social science platform, with LLM assistance for topic descriptions and relevance labeling across four item categories.
-
TubeCensus: A Transparent, Replicable, and Large-Scale Census of YouTube Channels and their Subscriber Counts Over Time
TubeCensus provides a transparent longitudinal dataset of YouTube channels and subscriber counts covering creators responsible for 30-36% of platform content, distributed via a pip package.
-
Synthetic Users, Real Differences: an Evaluation Framework for User Simulation in Multi-Turn Conversations
Realsim shows simulated users fail to reproduce communication frictions present in real multi-turn chatbot dialogues, yielding overly optimistic evaluations with domain-dependent variability.
-
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.
-
Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data
An LLM-based topic modeling method with a custom evaluation framework improves topic interpretability, specificity, and polarity consistency over prior approaches when linking corporate review text to external outcome...
-
Detecting and Enhancing Intellectual Humility in Online Political Discourse
Intellectual humility in Reddit political discussions can be measured at scale with a validated classifier and increased via targeted interventions without reducing participation.
-
The Effect of Document Selection on Query-focused Text Analysis
Semantic and hybrid document retrieval methods provide reliable, efficient selection for query-focused text analyses like LDA and BERTopic, outperforming random or keyword-only approaches.
-
Mirroring Minds: Asymmetric Linguistic Accommodation and Diagnostic Identity in ADHD and Autism Reddit Communities
ADHD and autism Reddit users exhibit convergent linguistic accommodation when crossing community boundaries, with diagnosis disclosure showing small and directionally distinct effects on style.
-
Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs
LLM reasoning refines unsupervised text clusters via coherence checks, redundancy removal, and label grounding, yielding better coherence and human-aligned labels on social media data.
-
Discovering Failure Modes in Vision-Language Models using RL
An RL-based questioner agent adaptively generates queries to discover novel failure modes in VLMs without human intervention.
-
Paper Espresso: From Paper Overload to Research Insight
Paper Espresso deploys LLMs to summarize and analyze trends across 13,300+ arXiv papers over 35 months, releasing metadata that shows non-saturating topic growth and higher engagement for novel topics.
-
PRISM: LLM-Guided Semantic Clustering for High-Precision Topics
PRISM distills sparse LLM labels into a fine-tuned embedding model for thresholded clustering that separates fine-grained topics better than prior local models or raw frontier embeddings.
-
In your own words: computationally identifying interpretable themes in free-text survey data
A computational framework identifies more coherent themes in free-text survey data on race, gender, and sexual orientation than previous methods, with applications for survey design, explaining variation, and detectin...
-
Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings
Embeddings reliably capture authorial stylistic features in French literary texts, and these signals persist after LLM rewriting while showing model-specific patterns.
-
Automatic Reflection Level Classification in Hungarian Student Essays
Classical machine learning models outperform Hungarian transformers slightly in overall performance (71% vs 68% average score) for classifying reflection levels in student essays, though transformers handle rare class...
-
A Gated Hybrid Contrastive Collaborative Filtering Recommendation
A gated hybrid contrastive collaborative filtering framework improves hit rate@10 and NDCG@10 on movie review datasets by layer-wise adaptive fusion of semantic and collaborative signals with contrastive objectives.
-
From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
-
Can Large Language Models Assist the Comprehension of ROS2 Software Architectures?
LLMs achieve 98.22% accuracy answering factual questions about ROS2 software architectures, with top models reaching 100%.
-
An Explainable Approach to Document-level Translation Evaluation with Topic Modeling
A topic-modeling framework measures document-level thematic consistency in translations by aligning key tokens across languages with a bilingual dictionary and scoring via cosine similarity, providing explainable insi...
-
Migrant Voices, Local News: Insights on Bridging Community Needs with Media Content
Focus groups reveal topic gaps and readability barriers in local news for migrants, uncovered by applying standard NLP tools to 2000+ hyper-local articles.
-
NIH-MPINet: A Large-Scale Feature-Rich Network Dataset for Mapping the Frontiers of Team Science
NIH-MPINet is a new large-scale feature-rich collaboration network dataset from NIH grants that maps multi-PI teams, communities, and topic trends in biomedicine.
-
Collaboration, Integration, and Thematic Exploration in European Framework Programmes: A Longitudinal Network Analysis
EU Framework Programmes have increased participation equity and integrated new countries through collaboration, yet research remains concentrated on established trajectories rather than broadly exploratory.
-
15 Years of Augmented Human(s) Research: Where Do We Stand?
Scientometric review of 15 years of Augmented Human conference papers shows bimodal submission peaks in 2015 and 2025, dominant topics in haptics and wearables, and an active Japanese community alongside definitional ...
-
Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge
Retrospective of a 2025 AI agent competition finds public-private score misalignment, an inert composite component, multi-account registrations, and guardrail fixes outperforming architectural novelty.
-
Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering
Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.
-
Mapping the Political Discourse in the Brazilian Chamber of Deputies: A Multi-Faceted Computational Approach
Analysis of 450k Brazilian deputy speeches shows stylistic simplification over time, sharp agenda shifts with national crises, and discursive clusters where region and gender outweigh party affiliation.
-
A Community-Based Approach for Stance Distribution and Argument Organization
Unsupervised graph community detection organizes arguments to reveal stance distributions in debates.
-
The Day My Chatbot Changed: Characterizing the Mental Health Impacts of Social AI App Updates via Negative User Reviews
Version-linked review analysis of Character AI shows rating drops with certain updates and negative feedback dominated by technical malfunctions plus occasional psychological framing.
-
Learning AI Without a STEM Background: Mixed-Methods Evidence from a Diverse, Mixed-Cohort AIED Program
A mixed-cohort AI education program emphasizing ethical judgment and applied literacy produces significant gains in confidence and perceived relevance for non-STEM and adult learners.
-
Shifting Patterns of Extremist Discourse on Facebook: Analyzing Trends and Developments During the Israel-Hamas Conflict
Extremist Facebook groups showed rising one-sided activity and negative content at the Israel-Hamas conflict onset, with topic shifts from political to religious in anti-Israel groups and religious to activism in anti...
-
A Guide to Using Social Media as a Geospatial Lens for Studying Public Opinion and Behavior
Social media data functions as passive geospatial sensing for public opinion and behavior via a structured workflow and case studies on topics like COVID-19 vaccines and urban accessibility.