A Shared IPTC Topic Space for Cross-Source Topic Modelling

Aline Villavicencio; Din Iskakov; Marco Idiat; Mendeli Vainstein; Rodrigo Wilkens; Ronaldo Menezes; Sebastian Gonçalves

arxiv: 2606.26845 · v1 · pith:NBY5ZAELnew · submitted 2026-06-25 · 💻 cs.IR

Pith reviewed 2026-06-26 02:37 UTC · model grok-4.3

classification 💻 cs.IR

keywords: topic modelling · shared topic space · taxonomy alignment · cross-source comparison · media topics · guided topic modelling

The pith

A fixed taxonomy creates one shared topic space that aligns models fitted to separate corpora.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that topic models trained on different media sources can be placed into one common space by anchoring every discovered topic to entries in a standard taxonomy. This would matter because separate models normally produce incompatible topic sets that cannot be compared directly without extra alignment steps. The method discovers topics through guided modelling, scores each one against the taxonomy entries using weighted centroids, and assigns it to a parent category by highest similarity. If the mappings prove stable, attention to the same topics can be tracked reproducibly across sources. The procedure was selected on a single news collection through successive tests of mapping choices and thresholds.

Core claim

The paper claims that corpora can be placed in a single shared topic space defined by the IPTC taxonomy. Topics are obtained with guided BERTopic, scored against the ninety-four IPTC Media Topics through weighted keyword and target centroids, and collapsed upward to seventeen IPTC parent topics by a maximum-similarity rule. The framework supplies an externally anchored method that enables reproducible cross-source topic comparison.
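The scoring and collapse steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3-D vectors stand in for real sentence embeddings, cosine similarity is assumed as the similarity measure, and the function names, toy labels, weights, and threshold value are all hypothetical.

```python
import math

def weighted_centroid(vectors, weights):
    """Weighted mean of equal-length vectors (weights need not sum to 1)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]

def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_parent(topic_vec, targets, parent_of, threshold):
    """Score a topic against every level-1 target, keep the best score per
    parent, and return the parent with the highest score (or None if that
    score falls below the assignment threshold)."""
    best_per_parent = {}
    for label, target_vec in targets.items():
        score = cosine(topic_vec, target_vec)
        parent = parent_of[label]
        best_per_parent[parent] = max(score, best_per_parent.get(parent, -1.0))
    parent, score = max(best_per_parent.items(), key=lambda kv: kv[1])
    return (parent, score) if score >= threshold else (None, score)

# Toy keyword embeddings and keyword weights for one discovered topic
# (hypothetical values; real ones would come from the topic model).
keyword_vecs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.0, 0.2]]
keyword_weights = [0.5, 0.3, 0.2]
topic_vec = weighted_centroid(keyword_vecs, keyword_weights)

# Toy level-1 targets and their parents (illustrative labels only).
targets = {
    "government": [1.0, 0.0, 0.0],
    "election": [0.9, 0.1, 0.1],
    "football": [0.0, 1.0, 0.0],
}
parent_of = {"government": "politics", "election": "politics", "football": "sport"}

parent, score = assign_parent(topic_vec, targets, parent_of, threshold=0.5)
print(parent, round(score, 3))
```

Taking the maximum per parent, rather than averaging over a parent's children, means that one strongly matching level-1 entry is enough to pull a topic into that parent. That is one plausible reading of a "maximum-similarity rule"; the paper may define it differently.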

What carries the argument

The shared IPTC topic space, which supplies fixed external labels that discovered topics from any corpus are scored against and assigned to.

If this is right

Topics discovered in any number of separate corpora become directly comparable through their common taxonomy assignments.
The parent-level collapse produces consistent category assignments under a range of similarity thresholds.
Enriching the target construction with parent information improves both coverage and assignment stability.
Coverage declines gradually rather than collapsing when stricter assignment thresholds are applied.
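The threshold behaviour in the last point can be checked with a simple sweep. The sketch below assumes that each discovered topic carries one best parent-similarity score and a document count, and that a document counts as mapped when its topic's score meets the threshold. The scores, sizes, and function name are made up for illustration and are not results from the paper.

```python
def coverage_by_threshold(topic_scores, topic_sizes, thresholds):
    """For each threshold, return the share of documents whose topic's
    best parent similarity is at or above that threshold."""
    total_docs = sum(topic_sizes.values())
    coverage = {}
    for t in thresholds:
        mapped = sum(n for topic, n in topic_sizes.items()
                     if topic_scores[topic] >= t)
        coverage[t] = mapped / total_docs
    return coverage

# Hypothetical best-similarity score and document count per topic id.
scores = {0: 0.82, 1: 0.64, 2: 0.55, 3: 0.31}
sizes = {0: 400, 1: 300, 2: 200, 3: 100}

sweep = coverage_by_threshold(scores, sizes, [0.3, 0.5, 0.6, 0.8])
for threshold, share in sweep.items():
    print(f"threshold={threshold:.1f}  mapped share={share:.2f}")
```

A gradual decline shows up as small steps between neighbouring thresholds; a collapse would appear as one large drop.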

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchoring procedure could be used to compare topic attention in news sources over successive time periods without re-fitting alignment each year.
Replacing the media taxonomy with a different domain-specific taxonomy would allow the method to be tested on scientific literature or social media.
Running the framework on additional corpora beyond the development set would reveal whether the shared space remains stable when content types vary.
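Once documents from each source or period carry a parent label, comparing attention reduces to comparing two distributions over the same label set. The sketch below is one way to do that, using Jensen-Shannon divergence as an assumed comparison measure; the paper excerpt does not name a measure, and the labels and data here are toy values.

```python
import math
from collections import Counter

def attention_shares(doc_parents, parents):
    """Share of mapped documents per parent label; unmapped documents
    (None) are ignored. Assumes at least one document is mapped."""
    counts = Counter(p for p in doc_parents if p is not None)
    total = sum(counts.values())
    return {p: counts.get(p, 0) / total for p in parents}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two distributions
    defined over the same keys (0 = identical, 1 = disjoint)."""
    def kl(a, m):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)
    mix = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

parents = ["politics", "sport"]
source_a = ["politics", "politics", "sport", None]   # toy labels
source_b = ["politics", "sport", "sport", "sport"]

shares_a = attention_shares(source_a, parents)
shares_b = attention_shares(source_b, parents)
print(shares_a, shares_b, js_divergence(shares_a, shares_b))
```

Because both distributions are indexed by the same fixed labels, no re-alignment step is needed between sources or years, which is the point of the shared space.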

Load-bearing premise

The IPTC taxonomy supplies a sufficiently universal and stable topic space so that topics discovered independently in different corpora can be aligned to the same labels without corpus-specific distortion.

What would settle it

Apply the full mapping procedure to two separate corpora, then check whether topics assigned the same IPTC label actually address the same underlying subject when read by independent judges; systematic mismatch would falsify the alignment claim.
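The judging step could be summarised as follows. This sketch assumes that each cross-corpus pair of topics sharing an IPTC label is rated by two judges as "same subject" (True) or not (False); it reports how often both judges confirm the match, and uses Cohen's kappa to check that the judges agree with each other beyond chance. The ratings are invented for illustration.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters over the same items."""
    n = len(ratings_a)
    observed = sum(x == y for x, y in zip(ratings_a, ratings_b)) / n
    labels = set(ratings_a) | set(ratings_b)
    expected = sum((ratings_a.count(l) / n) * (ratings_b.count(l) / n)
                   for l in labels)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

def confirmed_match_rate(ratings_a, ratings_b):
    """Share of same-label topic pairs that both judges call the same subject."""
    both = sum(1 for x, y in zip(ratings_a, ratings_b) if x and y)
    return both / len(ratings_a)

# Hypothetical judgements for six same-label topic pairs.
judge_1 = [True, True, False, True, False, True]
judge_2 = [True, True, False, False, False, True]

kappa = cohen_kappa(judge_1, judge_2)
rate = confirmed_match_rate(judge_1, judge_2)
print(f"kappa={kappa:.3f}  confirmed match rate={rate:.2f}")
```

A low confirmed match rate alongside a high kappa would indicate systematic mismatch of the kind described above, rather than noisy judging.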

Figures

Figures reproduced from arXiv: 2606.26845 by Aline Villavicencio, Din Iskakov, Marco Idiat, Mendeli Vainstein, Rodrigo Wilkens, Ronaldo Menezes, Sebastian Gonçalves.

**Figure 1.** Overview of the shared topic-space workflow. The corpora are pre-processed, modelled with guided BERTopic, and mapped into IPTC labels to place both sources in the common topic space. Once BERTopic has discovered a set of topics, each discovered topic i is represented by a weighted centroid v_i built from its top keyword set. Each IPTC level-1 target j is represented by a centroid c_j built from the parent t…

**Figure 2.** Broad BERTopic screen on the NYT 2011 development corpus. Each run combines a modelling family, a document representation, and a clustering specification, ordered by the mapped-document rate as an initial viability filter before more detailed refinement. The finalist stage then compared the retained guided route against a zero-shot benchmark under stricter assignment thresholds, shown in …

**Figure 3.** Strict finalist comparison on the development corpus. The guided level-1 MMR model retained substantially stronger mapped coverage than the zero-shot benchmark under stricter assignment thresholds, which is why the guided family was carried forward into the annual pipeline. … justified the parent-enriched target space used in the annual analysis. The two panels also showed that the simpler level1 constructi…

**Figure 4.** Target-space ablation on the fixed guided run. The parent-enriched construction improved mapped coverage (left) and strengthened parent-level top-two agreement (right) relative to the simpler target variants, which is why the final analysis uses the parent + level1 + level2 target space. The final step was a threshold sweep on the retained guided route.
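The target variants compared in the ablation can be pictured as different ways of building one centroid per level-1 entry. The sketch below is an assumption about what a "parent + level1 + level2" construction might look like: a weighted mean of the level-1 vector, its parent's vector, and the mean of its level-2 children. The function names, component weights, and toy vectors are hypothetical, and the paper's actual construction may differ.

```python
def mean_vector(vectors):
    """Unweighted mean of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def build_target(level1_vec, parent_vec=None, level2_vecs=(),
                 weights=(1.0, 1.0, 1.0)):
    """Combine whichever components are supplied into one target centroid.

    weights = (parent weight, level-1 weight, level-2 weight); these are
    illustrative free parameters, not values taken from the paper.
    """
    w_parent, w_level1, w_level2 = weights
    parts = [(w_level1, level1_vec)]
    if parent_vec is not None:
        parts.append((w_parent, parent_vec))
    if level2_vecs:
        parts.append((w_level2, mean_vector(list(level2_vecs))))
    total = sum(w for w, _ in parts)
    dim = len(level1_vec)
    return [sum(w * v[k] for w, v in parts) / total for k in range(dim)]

level1 = [1.0, 0.0]
parent = [0.0, 1.0]
children = [[1.0, 1.0], [1.0, 0.0]]

simple = build_target(level1)                       # level-1 only
enriched = build_target(level1, parent, children)   # parent + level1 + level2
print(simple, enriched)
```

Mixing in parent and child vectors shifts each target toward its neighbourhood in the taxonomy, which is one intuition for why an enriched target might catch topics that a bare level-1 label misses.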

**Figure 5.** Threshold sweep for the retained guided route. The figure shows how mapped-document coverage changes as the parent-assignment threshold is tightened, supporting the final choice of a conservative but still usable mapping rule. Taken together, the development-corpus results selected a single configuration: guided BERTopic with parent-enriched hierarchical IPTC mapping, a maximum-similarity parent collapse,…

Original abstract

Comparing topic attention across different media is hindered by a fundamental modelling problem: topic models fitted separately to each corpus produce corpus-specific topic spaces that cannot be aligned directly. This paper presents a reproducible framework that places corpora in a single shared topic space defined by a taxonomy. Discovered topics are obtained with guided BERTopic, scored against the ninety-four IPTC Media Topics' taxonomy topics (level-1) through weighted keyword and target centroids, and then collapsed upward to seventeen IPTC parent topics by a maximum-similarity rule. The framework was developed and selected on a controlled New York Times 2011 corpus through a narrowing sequence: a broad model screen, a focused mapping refinement, a strict finalist comparison, a target-construction ablation, and a threshold calibration. In this corpus, the guided family retained substantially stronger mapped coverage than a zero-shot benchmark under stricter assignment thresholds, a parent-enriched target construction improved both coverage and parent consistency, and coverage declined gradually rather than collapsing as the assignment threshold was tightened. The contribution is an externally anchored method for constructing a shared topic space that enables reproducible cross-source topic comparison.
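For readers unfamiliar with the guided step named in the abstract: BERTopic's guided mode takes a `seed_topic_list`, a list of seed-word lists, one per intended topic. The sketch below shows one way such a list might be derived from taxonomy labels. The helper name, the toy taxonomy, and the choice to include each label as its own seed word are assumptions, and the excerpt does not say how the authors built their seeds. The BERTopic call is left in comments so the block runs without the library installed.

```python
def build_seed_topic_list(taxonomy, max_terms=10):
    """Return one lowercased, de-duplicated seed-word list per taxonomy entry.

    taxonomy maps a label to related terms; the label itself is used as
    the first seed word (an illustrative choice).
    """
    seed_lists = []
    for label, terms in taxonomy.items():
        seen = []
        for term in [label, *terms]:
            cleaned = term.lower().strip()
            if cleaned and cleaned not in seen:
                seen.append(cleaned)
        seed_lists.append(seen[:max_terms])
    return seed_lists

# Toy taxonomy fragment (hypothetical terms).
toy_taxonomy = {
    "politics": ["election", "government", "Election"],
    "sport": ["football", "tournament"],
}
seed_topic_list = build_seed_topic_list(toy_taxonomy)
print(seed_topic_list)

# With BERTopic installed and `docs` holding the corpus as a list of strings:
#   from bertopic import BERTopic
#   topic_model = BERTopic(seed_topic_list=seed_topic_list)
#   topics, probs = topic_model.fit_transform(docs)
```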

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a usable IPTC-anchored pipeline for topic alignment but develops and tunes it on only one corpus, leaving the cross-source claim untested.

The main takeaway is a concrete pipeline that maps guided BERTopic outputs to the IPTC taxonomy via weighted centroids on keywords and targets, then collapses to the 17 parent topics with a max-similarity rule. This specific externally anchored construction is new relative to the cited prior work.

They did the development work carefully: a sequence of broad screening, mapping refinement, finalist comparison, target ablation, and threshold calibration, all on the 2011 NYT corpus. The guided models showed better coverage than zero-shot under stricter thresholds, and the parent-enriched targets improved consistency. That methodical narrowing is a real plus and shows they treated the mapping as something that needed testing rather than assumed.

The soft spot is the single-corpus limit. The central claim is that the IPTC anchor creates a shared, corpus-independent topic space for cross-source comparison. Yet every step of model choice, weight tuning, and threshold setting happened inside the NYT data alone. No second corpus appears to have been run through the same scoring and collapse procedure to check whether the parent assignments stay stable or get pulled by corpus-specific language. The stress-test note is accurate on this point.

This is for IR researchers who need comparable topic labels across news outlets for media analysis. A reader already working on taxonomy-guided topic models could extract the pipeline and the ablation results directly.

It has enough structure and external grounding to go to referees, but the reviewers will need to see multi-corpus checks before the shared-space claim can be taken as demonstrated.

Referee Report

1 major / 1 minor

Summary. The paper presents a reproducible framework for aligning topics from different corpora into a single shared space anchored by the IPTC Media Topics taxonomy. Topics are discovered via guided BERTopic, scored against the 94 level-1 IPTC topics using weighted keyword and target centroids, and collapsed to 17 parent topics by a maximum-similarity rule. The method is developed and selected through a controlled sequence of broad model screening, mapping refinement, finalist comparison, target-construction ablation, and threshold calibration, all performed on the New York Times 2011 corpus; qualitative observations include stronger mapped coverage than zero-shot baselines under stricter thresholds and gradual rather than abrupt coverage decline with tightening thresholds.

Significance. If the cross-corpus stability of the IPTC-anchored mapping holds, the framework would supply an externally referenced, reproducible method for direct topic-attention comparison across media sources, addressing a recognized alignment problem in cross-corpus topic modeling. The controlled, multi-stage development process with explicit ablations and threshold tests is a methodological strength that supports reproducibility claims.

major comments (1)

[Abstract] Abstract (development sequence paragraph): every stage of framework selection and calibration (broad screen, mapping refinement, finalist comparison, target ablation, threshold calibration) was executed exclusively on the single NYT 2011 corpus. Because the central claim is that the resulting weighted-centroid scoring and max-similarity collapse produce a corpus-independent shared space, the absence of any second corpus test leaves the cross-source alignment unverified and is load-bearing for the contribution.

minor comments (1)

[Abstract] Abstract: improvements are stated only qualitatively (“substantially stronger mapped coverage”, “improved both coverage and parent consistency”) with no numeric values, tables, or error bars, which makes the magnitude and reliability of the reported effects difficult to evaluate.
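One way to attach the requested uncertainty to a coverage figure is a percentile bootstrap over documents. The sketch below assumes a list of per-document flags (1 = mapped, 0 = unmapped); the data are invented and the function name is hypothetical. Because mapping is assigned per topic rather than per document, documents are not strictly independent, so resampling whole topics may be the more defensible variant in practice.

```python
import random

def bootstrap_ci(mapped_flags, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for a mapped share."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(mapped_flags)
    estimates = []
    for _ in range(n_boot):
        resample = [mapped_flags[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(resample) / n)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return sum(mapped_flags) / n, lower, upper

# Hypothetical run: 70 of 100 documents mapped.
flags = [1] * 70 + [0] * 30
point, lower, upper = bootstrap_ci(flags)
print(f"coverage={point:.2f}  95% CI=({lower:.2f}, {upper:.2f})")
```

Reporting intervals like this for the guided and zero-shot runs would show whether "substantially stronger" survives sampling noise.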

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the methodological strengths of the controlled development process. We respond to the single major comment below.

Referee: [Abstract] Abstract (development sequence paragraph): every stage of framework selection and calibration (broad screen, mapping refinement, finalist comparison, target ablation, threshold calibration) was executed exclusively on the single NYT 2011 corpus. Because the central claim is that the resulting weighted-centroid scoring and max-similarity collapse produce a corpus-independent shared space, the absence of any second corpus test leaves the cross-source alignment unverified and is load-bearing for the contribution.

Authors: We agree that all stages of model selection and calibration were performed exclusively on the NYT 2011 corpus, as already stated in the abstract. The shared topic space is defined by the external IPTC taxonomy rather than being learned from any corpus; the weighted-centroid scoring and maximum-similarity collapse rules are formulated without corpus-specific parameters and are therefore corpus-agnostic by construction. The single-corpus development sequence was deliberately chosen to enable a controlled, reproducible selection among modeling variants. At the same time, we acknowledge that empirical verification of mapping stability on at least one additional corpus would provide stronger support for the cross-source claim. We will revise the abstract and add an explicit limitations paragraph to clarify this point and to outline future multi-corpus validation.

revision: yes

Circularity Check

0 steps flagged

External IPTC taxonomy anchors alignment; no self-referential reductions

full rationale

The framework maps BERTopic outputs to IPTC level-1 topics via weighted keyword/target centroids then max-similarity collapse to 17 parents. IPTC taxonomy is an independent external reference, not derived from or fitted to the NYT corpus. No equations, parameters, or predictions are shown to equal their own inputs by construction. Development steps (screening, ablation, calibration) on a single corpus constitute method tuning, not circularity. The shared space claim rests on the external taxonomy rather than self-citation chains or renamed fits.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the IPTC taxonomy is an adequate universal anchor and on several tunable parameters whose values were selected on the development corpus.

free parameters (2)

assignment threshold
Calibrated on the NYT 2011 corpus; controls strictness of topic-to-IPTC assignment.
weights for keyword and target centroids
Used in the similarity scoring step; chosen during mapping refinement.

axiom (1)

domain assumption: IPTC Media Topics taxonomy supplies a stable, corpus-independent topic space suitable for aligning independently discovered topics
Invoked when the framework collapses discovered topics to IPTC labels and parents.

Reference graph

Works this paper leans on

14 extracted references · 2 canonical work pages

[1] Angelov, D.: Top2Vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470 (2020)

[2] Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. Advances in neural information processing systems 14 (2001)

[3] Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. pp. 335–336 (1998)

[4] Egger, R., Yu, J.: A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Frontiers in Sociology 7, 886498 (2022)

[5] Galesic, M., Bruine de Bruin, W., Dalege, J., Feld, S.L., Kreuter, F., Olsson, H., Prelec, D., Stein, D.L., van Der Does, T.: Human social sensing is an untapped resource for computational social science. Nature 595(7866), 214–222 (2021)

[6] Grootendorst, M.: BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794 (2022)

[7] Grootendorst, M.: Guided BERTopic: Using seed words to steer topic modelling. https://maartengr.github.io/BERTopic/getting_started/guided/guided.html (2022), accessed: 2025-02-20

[8] Hankar, M., Kasri, M., Beni-Hssane, A.: A comprehensive overview of topic modeling: Techniques, applications and challenges. Neurocomputing 628, 129638 (2025)

[9] Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)

[10] McInnes, L., Healy, J., Astels, S.: hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2(11), 205 (2017). https://doi.org/10.21105/joss.00205

[11] McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018), https://arxiv.org/abs/1802.03426

[12] Tsfati, Y., Boomgaarden, H.G., Strömbäck, J., Vliegenthart, R., Damstra, A., Lindgren, E.: Causes and consequences of mainstream media dissemination of fake news: literature review and synthesis. Annals of the International Communication Association 44(2), 157–173 (2020)

[13] Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359(6380), 1146–1151 (2018). https://doi.org/10.1126/science.aap9559

[14] Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.P., Yan, H., Li, X.: Comparing Twitter and Traditional Media Using Topic Models. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) Advances in Information Retrieval. pp. 338–349. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
