hub Mixed citations

BERTopic: Neural topic modeling with a class-based TF-IDF procedure

Maarten Grootendorst · 2022 · cs.CL · arXiv 2203.05794

Mixed citation behavior. Most common role is method (64%).

89 Pith papers citing it

Method 64% of classified citations

abstract

Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approaching topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach to topic modeling.
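
The last step of the pipeline described in the abstract, the class-based TF-IDF, can be illustrated with a minimal sketch. This is not the BERTopic library's code. The weighting used here, tf(t, c) × log(1 + A / f(t)), with A the average number of words per cluster and f(t) the frequency of term t across all clusters, follows the form given in the paper rather than anything stated on this page. The embedding and clustering steps are skipped: the cluster labels are hard-coded, and the documents, labels, and function name are all hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(docs, labels):
    """Score each term per cluster with a class-based TF-IDF weighting.

    Assumed weighting: tf(t, c) * log(1 + A / f(t)), where A is the average
    number of words per cluster and f(t) is the count of t over all clusters.
    """
    # Merge all documents of a cluster into one bag of words ("class document").
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.lower().split())

    # Term frequency across all clusters, and average words per cluster.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(class_counts)

    return {
        label: {
            term: tf * math.log(1 + avg_words / total[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


# Hypothetical toy corpus; in BERTopic the labels would come from clustering
# document embeddings, which this sketch does not do.
docs = [
    "cat sat on the mat",
    "cat chased the mouse",
    "stocks fell as the market closed",
    "the market rallied and stocks rose",
]
labels = [0, 0, 1, 1]

scores = c_tf_idf(docs, labels)
for label, weights in scores.items():
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    print(label, [term for term, _ in ranked[:3]])
```

In cluster 0, "cat" and "the" both occur twice, but "the" also appears in cluster 1, so its weight is lower. That is the intended effect of the class-level inverse-frequency term.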

citation-role summary

method 9 · background 4 · baseline 1
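
The 64% figure quoted for the method role appears to come from these counts: 9 of the 14 classified citations, not 9 of all 89 citing papers. That reading is an inference from the numbers on this page, and the variable names below are illustrative only.

```python
# Role counts as listed in the citation-role summary.
role_counts = {"method": 9, "background": 4, "baseline": 1}

# Only classified citations enter the denominator (assumption, see lead-in).
classified = sum(role_counts.values())
shares = {role: round(100 * n / classified) for role, n in role_counts.items()}

print(classified, shares)
```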

citation-polarity summary

use method 9 · background 3 · baseline 1 · unclear 1

representative citing papers

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

cs.CV · 2026-04-08 · unverdicted · novelty 8.0

Adversarial smuggling attacks encode harmful content into human-readable visuals that evade MLLM detection, achieving over 90% attack success rates on models like GPT-5 and Qwen3-VL via the new SmuggleBench benchmark.

SemCEB: A Cardinality Estimation Benchmark for Semantic Operators

cs.DB · 2026-06-22 · unverdicted · novelty 7.0

SemCEB is the first benchmark for cardinality estimation over semantic operators, evaluating sampling methods and Semantic Histograms on accuracy, cost, latency, and memory using 102 queries on a real-world dataset.

From Punishment to Protection: Charting Six Decades of U.S. Juvenile Justice Through Topic Modeling and LLM-Assisted Analysis

cs.CY · 2026-05-18 · unverdicted · novelty 7.0

Topic modeling and LLM-assisted analysis of 60k+ juvenile justice opinions identifies 182 topics showing child welfare tripling, punitive declines, vocabulary drift, and risks for AI tools over six decades.

Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political communication and LLM applications.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

Mapping Emerging Climate Misinformation Playbooks in the Global South

cs.SI · 2026-04-27 · unverdicted · novelty 7.0

Brazilian YouTube climate videos show a transition from traditional denial of climate science to 'new denial' that undermines solutions, with the latter attracting more engagement from diverse actors.

The Platform Is Mostly Not a Platform: Token Economies and Agent Discourse on Moltbook

cs.CY · 2026-04-23 · unverdicted · novelty 7.0

Moltbook operates as two largely separate layers: a dominant transactional token economy using protocols like MBC-20 and a thinner discursive conversation layer with only 3.6% agent overlap.

Participatory provenance as representational auditing for AI-mediated public consultation

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.

Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

LLMs conditioned on actual psychometric profiles produce life stories from which independent LLMs recover personality scores at mean r=0.75, 85% of human reliability, with emotional patterns replicating in real human data.

What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network

cs.CL · 2026-03-09 · unverdicted · novelty 7.0

Discourse among AI agents on Moltbook is largely determined by architectural constraints like context windows and identity files, appearing as social learning but actually short-horizon contextual conditioning.

GRAB: A Risk Taxonomy--Grounded Benchmark for Unsupervised Topic Discovery in Financial Disclosures

cs.CL · 2025-09-25 · unverdicted · novelty 7.0

GRAB is a benchmark dataset of 1.61M sentences from 8,247 10-K filings with taxonomy-anchored weak supervision labels for standardized evaluation of unsupervised topic models on financial risk disclosures.

From tools to thieves: Measuring and understanding public perceptions of AI through crowdsourced metaphors

cs.CY · 2025-01-29 · unverdicted · novelty 7.0

Crowdsourced metaphors show rising anthropomorphism and warmth toward AI that predict trust and adoption, with notable demographic differences.

EconSimulacra: A Digital Twin Platform of Socio-Economic Systems Powered by LLM Agents

cs.DL · 2026-06-25 · unverdicted · novelty 6.0

EconSimulacra is a multi-agent LLM simulator that couples economy, mobility, and social networks through shared internal states to reproduce nonlinear relationships between online attention and offline popularity.

Large-scale semantic mapping of learner agency and autonomy reveals what measurement and generative AI research overlook

cs.AI · 2026-06-09 · unverdicted · novelty 6.0

Semantic mapping of 8,954 definitions and 2,700 scales from 14,000+ papers shows learner agency and autonomy span task regulation, personal motivation, and sociocultural dimensions, with existing scales and generative AI research underrepresenting the sociocultural dimension.

A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

A multi-agent LLM system discovers criteria such as Encouraging, Urgent, and Clear for surgical feedback and uses them to score 4.2k instances, outperforming prior content-based approaches in predicting trainee behavior changes and trainer approval.

Synthetic Sources?: Auditing Generative Search Engine Citations for Evidence of AI-Generated Sources

cs.IR · 2026-05-22 · unverdicted · novelty 6.0

Audit of ChatGPT, Copilot, Gemini and Perplexity finds ~16% of cited sources are AI-generated across 712 queries on politics, health and environment.

Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

LFD discovers predictive text features via LLM contrastive proposals, cross-LLM Cohen's kappa screening, and residual held-out gain selection, matching baseline accuracy while achieving higher human agreement and lower label leakage on ten tasks.

Synthesis and Evaluation of Long-term History-aware Medical Dialogue

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

Creates MediLongChat synthetic longitudinal medical dialogues and benchmarks showing state-of-the-art LLMs struggle with in-dialogue, cross-dialogue, and synthesis reasoning tasks.

Algorithmic Cultivation: How Social Media Feeds Shape User Language

cs.SI · 2026-05-16 · unverdicted · novelty 6.0

Quasi-experimental study of 235M Bluesky posts finds that exposure to algorithmic feeds produces greater stylistic accommodation, semantic alignment, and register formalization than in matched controls, with effects varying by feed and strongest for reposting.

Discovery-Oriented Faceting: From Coverage to Blind-Spot Discovery

cs.HC · 2026-05-13 · unverdicted · novelty 6.0

DOF ranks document categories by distinctiveness instead of size to promote blind-spot discovery, surfacing different content than coverage-based methods across four domains.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

REALISTA generates semantically coherent adversarial prompts via latent-space optimization over input-dependent editing directions, achieving stronger hallucination elicitation than prior realistic attacks on open-source and reasoning LLMs.

MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval

cs.IR · 2026-05-11 · unverdicted · novelty 6.0

MIRA is a new benchmark for multi-category integrated retrieval built from real queries on a social science platform, with LLM assistance for topic descriptions and relevance labeling across four item categories.

What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook

cs.SE · 2026-05-08 · unverdicted · novelty 6.0

Empirical analysis of 4707 MoltBook posts shows AI-only technical discourse focuses on security, trust, and abstract topics while lacking concrete runtime and project details found in human GitHub discussions.

TubeCensus: A Transparent, Replicable, and Large-Scale Census of YouTube Channels and their Subscriber Counts Over Time

cs.SI · 2026-05-07 · unverdicted · novelty 6.0

TubeCensus provides a transparent longitudinal dataset of YouTube channels and subscriber counts covering creators responsible for 30-36% of platform content, distributed via a pip package.

citing papers explorer

Showing 50 of 89 citing papers.

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation cs.CV · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
Adversarial smuggling attacks encode harmful content into human-readable visuals that evade MLLM detection, achieving over 90% attack success rates on models like GPT-5 and Qwen3-VL via the new SmuggleBench benchmark.
SemCEB: A Cardinality Estimation Benchmark for Semantic Operators cs.DB · 2026-06-22 · unverdicted · none · ref 8 · internal anchor
SemCEB is the first benchmark for cardinality estimation over semantic operators, evaluating sampling methods and Semantic Histograms on accuracy, cost, latency, and memory using 102 queries on a real-world dataset.
From Punishment to Protection: Charting Six Decades of U.S. Juvenile Justice Through Topic Modeling and LLM-Assisted Analysis cs.CY · 2026-05-18 · unverdicted · none · ref 4 · internal anchor
Topic modeling and LLM-assisted analysis of 60k+ juvenile justice opinions identifies 182 topics showing child welfare tripling, punitive declines, vocabulary drift, and risks for AI tools over six decades.
Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches cs.CL · 2026-05-15 · unverdicted · none · ref 43 · internal anchor
A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political communication and LLM applications.
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment cs.CL · 2026-05-08 · unverdicted · none · ref 242 · internal anchor
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
Mapping Emerging Climate Misinformation Playbooks in the Global South cs.SI · 2026-04-27 · unverdicted · none · ref 26 · internal anchor
Brazilian YouTube climate videos show a transition from traditional denial of climate science to 'new denial' that undermines solutions, with the latter attracting more engagement from diverse actors.
The Platform Is Mostly Not a Platform: Token Economies and Agent Discourse on Moltbook cs.CY · 2026-04-23 · unverdicted · none · ref 5 · internal anchor
Moltbook operates as two largely separate layers: a dominant transactional token economy using protocols like MBC-20 and a thinner discursive conversation layer with only 3.6% agent overlap.
Participatory provenance as representational auditing for AI-mediated public consultation cs.AI · 2026-04-22 · unverdicted · none · ref 12 · internal anchor
Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.
Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles cs.CL · 2026-04-07 · unverdicted · none · ref 12 · internal anchor
LLMs conditioned on actual psychometric profiles produce life stories from which independent LLMs recover personality scores at mean r=0.75, 85% of human reliability, with emotional patterns replicating in real human data.
What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network cs.CL · 2026-03-09 · unverdicted · none · ref 50 · internal anchor
Discourse among AI agents on Moltbook is largely determined by architectural constraints like context windows and identity files, appearing as social learning but actually short-horizon contextual conditioning.
GRAB: A Risk Taxonomy--Grounded Benchmark for Unsupervised Topic Discovery in Financial Disclosures cs.CL · 2025-09-25 · unverdicted · none · ref 9 · internal anchor
GRAB is a benchmark dataset of 1.61M sentences from 8,247 10-K filings with taxonomy-anchored weak supervision labels for standardized evaluation of unsupervised topic models on financial risk disclosures.
From tools to thieves: Measuring and understanding public perceptions of AI through crowdsourced metaphors cs.CY · 2025-01-29 · unverdicted · none · ref 61 · internal anchor
Crowdsourced metaphors show rising anthropomorphism and warmth toward AI that predict trust and adoption, with notable demographic differences.
EconSimulacra: A Digital Twin Platform of Socio-Economic Systems Powered by LLM Agents cs.DL · 2026-06-25 · unverdicted · none · ref 5 · internal anchor
EconSimulacra is a multi-agent LLM simulator that couples economy, mobility, and social networks through shared internal states to reproduce nonlinear relationships between online attention and offline popularity.
Large-scale semantic mapping of learner agency and autonomy reveals what measurement and generative AI research overlook cs.AI · 2026-06-09 · unverdicted · none · ref 21 · internal anchor
Semantic mapping of 8,954 definitions and 2,700 scales from 14,000+ papers shows learner agency and autonomy span task regulation, personal motivation, and sociocultural dimensions, with existing scales and generative AI research underrepresenting the sociocultural dimension.
A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback cs.CL · 2026-05-25 · unverdicted · none · ref 11 · internal anchor
A multi-agent LLM system discovers criteria such as Encouraging, Urgent, and Clear for surgical feedback and uses them to score 4.2k instances, outperforming prior content-based approaches in predicting trainee behavior changes and trainer approval.
Synthetic Sources?: Auditing Generative Search Engine Citations for Evidence of AI-Generated Sources cs.IR · 2026-05-22 · unverdicted · none · ref 69 · internal anchor
Audit of ChatGPT, Copilot, Gemini and Perplexity finds ~16% of cited sources are AI-generated across 712 queries on politics, health and environment.
Interpretable Discriminative Text Representations via Agreement and Label Disentanglement cs.CL · 2026-05-20 · unverdicted · none · ref 10 · internal anchor
LFD discovers predictive text features via LLM contrastive proposals, cross-LLM Cohen's kappa screening, and residual held-out gain selection, matching baseline accuracy while achieving higher human agreement and lower label leakage on ten tasks.
Synthesis and Evaluation of Long-term History-aware Medical Dialogue cs.CL · 2026-05-19 · unverdicted · none · ref 5 · internal anchor
Creates MediLongChat synthetic longitudinal medical dialogues and benchmarks showing state-of-the-art LLMs struggle with in-dialogue, cross-dialogue, and synthesis reasoning tasks.
Algorithmic Cultivation: How Social Media Feeds Shape User Language cs.SI · 2026-05-16 · unverdicted · none · ref 29 · internal anchor
Quasi-experimental study of 235M Bluesky posts finds that exposure to algorithmic feeds produces greater stylistic accommodation, semantic alignment, and register formalization than in matched controls, with effects varying by feed and strongest for reposting.
Discovery-Oriented Faceting: From Coverage to Blind-Spot Discovery cs.HC · 2026-05-13 · unverdicted · none · ref 10 · internal anchor
DOF ranks document categories by distinctiveness instead of size to promote blind-spot discovery, surfacing different content than coverage-based methods across four domains.
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations cs.CL · 2026-05-12 · unverdicted · none · ref 186 · internal anchor
REALISTA generates semantically coherent adversarial prompts via latent-space optimization over input-dependent editing directions, achieving stronger hallucination elicitation than prior realistic attacks on open-source and reasoning LLMs.
MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval cs.IR · 2026-05-11 · unverdicted · none · ref 26 · internal anchor
MIRA is a new benchmark for multi-category integrated retrieval built from real queries on a social science platform, with LLM assistance for topic descriptions and relevance labeling across four item categories.
What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook cs.SE · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
Empirical analysis of 4707 MoltBook posts shows AI-only technical discourse focuses on security, trust, and abstract topics while lacking concrete runtime and project details found in human GitHub discussions.
TubeCensus: A Transparent, Replicable, and Large-Scale Census of YouTube Channels and their Subscriber Counts Over Time cs.SI · 2026-05-07 · unverdicted · none · ref 75 · internal anchor
TubeCensus provides a transparent longitudinal dataset of YouTube channels and subscriber counts covering creators responsible for 30-36% of platform content, distributed via a pip package.
Synthetic Users, Real Differences: an Evaluation Framework for User Simulation in Multi-Turn Conversations cs.CL · 2026-05-04 · unverdicted · none · ref 26 · internal anchor
Realsim shows simulated users fail to reproduce communication frictions present in real multi-turn chatbot dialogues, yielding overly optimistic evaluations with domain-dependent variability.
Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data cs.CL · 2026-04-20 · unverdicted · none · ref 13 · internal anchor
An LLM-based topic modeling method with a custom evaluation framework improves topic interpretability, specificity, and polarity consistency over prior approaches when linking corporate review text to external outcomes such as employee morale.
Detecting and Enhancing Intellectual Humility in Online Political Discourse cs.CY · 2026-04-14 · unverdicted · none · ref 1 · internal anchor
Intellectual humility in Reddit political discussions can be measured at scale with a validated classifier and increased via targeted interventions without reducing participation.
The Effect of Document Selection on Query-focused Text Analysis cs.IR · 2026-04-13 · conditional · none · ref 2 · internal anchor
Semantic and hybrid document retrieval methods provide reliable, efficient selection for query-focused text analyses like LDA and BERTopic, outperforming random or keyword-only approaches.
Mirroring Minds: Asymmetric Linguistic Accommodation and Diagnostic Identity in ADHD and Autism Reddit Communities cs.CL · 2026-04-11 · unverdicted · none · ref 2 · internal anchor
ADHD and autism Reddit users exhibit convergent linguistic accommodation when crossing community boundaries, with diagnosis disclosure showing small and directionally distinct effects on style.
Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs cs.CL · 2026-04-08 · unverdicted · none · ref 18 · internal anchor
LLM reasoning refines unsupervised text clusters via coherence checks, redundancy removal, and label grounding, yielding better coherence and human-aligned labels on social media data.
Discovering Failure Modes in Vision-Language Models using RL cs.CV · 2026-04-06 · unverdicted · none · ref 10 · internal anchor
An RL-based questioner agent adaptively generates queries to discover novel failure modes in VLMs without human intervention.
Paper Espresso: From Paper Overload to Research Insight cs.DL · 2026-04-06 · unverdicted · none · ref 17 · internal anchor
Paper Espresso deploys LLMs to summarize and analyze trends across 13,300+ arXiv papers over 35 months, releasing metadata that shows non-saturating topic growth and higher engagement for novel topics.
PRISM: LLM-Guided Semantic Clustering for High-Precision Topics cs.LG · 2026-04-03 · unverdicted · none · ref 5 · internal anchor
PRISM distills sparse LLM labels into a fine-tuned embedding model for thresholded clustering that separates fine-grained topics better than prior local models or raw frontier embeddings.
In your own words: computationally identifying interpretable themes in free-text survey data cs.CY · 2026-03-27 · unverdicted · none · ref 16 · internal anchor
A computational framework identifies more coherent themes in free-text survey data on race, gender, and sexual orientation than previous methods, with applications for survey design, explaining variation, and detecting identity discordance.
WebExpert: domain-aware web agents with critic-guided expert experience for high-precision search cs.IR · 2026-02-03 · unverdicted · none · ref 11 · internal anchor
WebExpert improves exact-match accuracy by 1.5-3.6 points on GAIA, GPQA, HLE, and WebWalkerQA benchmarks via experience retrieval, automatic facet induction, and preference-optimized planning.
LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering cs.CL · 2025-11-19 · unverdicted · none · ref 6 · internal anchor
LLM-MemCluster gives LLMs stateful memory and prompts that let them decide cluster count and iteratively refine groupings, outperforming baselines on benchmarks in a tuning-free end-to-end setup.
Disentangling Interaction and Bias Effects in Opinion Dynamics of Large Language Models physics.soc-ph · 2025-09-08 · unverdicted · none · ref 40 · internal anchor
A Bayesian framework disentangles topic, agreement, and anchoring biases from interaction effects in LLM multi-turn dialogues, revealing convergence to attractors that shift with fine-tuning.
FLAME: A New Dataset on FLemish Accounts of Momentary Experiences cs.CL · 2025-04-20 · unverdicted · none · ref 5 · internal anchor
Introduces a 25k-narrative Flemish corpus and finds that BERTopic yields more coherent and culturally relevant topics than LDA or K-Means according to human raters, despite LDA scoring higher on automated coherence metrics.
A Computational Method for Measuring "Open Codes" in Qualitative Analysis cs.CL · 2024-11-19 · unverdicted · none · ref 21 · internal anchor
A method merges codebooks via LLM and evaluates human and AI inductive coding with four new metrics on an online conversation dataset.
MMTM: Tri-Modal Topic Modeling for Long-Form Video via Similarity-Gated Fusion cs.LG · 2026-05-28 · unverdicted · none · ref 11 · internal anchor
MMTM improves topic coherence and temporal stability in long-form video by tri-modal similarity-gated fusion of speech, audio, and visual embeddings with BERTopic, shown on German and English news datasets with released code and corpus.
SmartIterator: Visual Analytics Workflows for Supervising Unsupervised Data Grouping cs.HC · 2026-05-27 · unverdicted · none · ref 25 · internal anchor
SmartIterator supplies method-specific workflows and coordinated visualizations to systematically supervise and interpret parameter sweeps of unsupervised data grouping techniques.
Eliot: Interactively Exploring Fast-Changing Scientific Literature Trends with Online Data and Learning cs.IR · 2026-05-26 · unverdicted · none · ref 14 · internal anchor
Eliot is a query-time clustering and temporal visualization system for arXiv literature, evaluated via offline metrics on eight domains and a user survey showing 85% meaningful cluster labels.
Can LLMs extract scientific consensus? A case study in high-temperature superconductivity cs.DL · 2026-05-26 · unverdicted · none · ref 19 · internal anchor
LLMs recover coherent, interpretable structures from HTS literature including family-dependent mechanisms and temporal belief evolution via a constructed knowledge graph.
The Structure and Dynamics of the Online MAHA-sphere cs.SI · 2026-05-19 · unverdicted · none · ref 46 · internal anchor
Reddit analysis finds MAHA users show strong cross-theme belief bundling and network coherence unlike anti-MAHA users, with pandemic-era shifts from anti-fluoride/mask to anti-vaccine to broader anti-science engagement.
ChatGPT vs Teachers vs Students: Large-Scale Analysis of Generative AI Discourse in Education Communities on Reddit cs.CY · 2026-05-18 · unverdicted · none · ref 13 · internal anchor
Large-scale topic modeling of 270k Reddit posts shows GenAI discourse in education shifting from detection-evasion to enforcement, with K-12 teachers emphasizing cognitive dependency, academics focusing on detection, students on career anxiety, and adversarial themes driving engagement and cross-sta
Topical Shifts in the Dark Web: A Longitudinal Analysis of Content from the Cybercrime Ecosystem cs.CR · 2026-05-14 · unverdicted · none · ref 48 · internal anchor
Longitudinal topic modeling on a large dark web dataset finds 75% of discussion volume in persistent core topics with a median lifespan of 75 months and only 3% in short-lived themes.
Analyzing Codes of Conduct for Online Safety in Video Games at Scale cs.CR · 2026-05-14 · unverdicted · none · ref 42 · internal anchor
Large-scale scan of Steam multiplayer games finds CoCs available for just 3.6% of titles, with better coverage of security issues than interpersonal or underage-player harms.
Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings cs.CL · 2026-05-11 · unverdicted · none · ref 13 · internal anchor
Embeddings reliably capture authorial stylistic features in French literary texts, and these signals persist after LLM rewriting while showing model-specific patterns.
Automatic Reflection Level Classification in Hungarian Student Essays cs.CL · 2026-05-04 · unverdicted · none · ref 14 · internal anchor
Classical machine learning models outperform Hungarian transformers slightly in overall performance (71% vs 68% average score) for classifying reflection levels in student essays, though transformers handle rare classes better.
A Gated Hybrid Contrastive Collaborative Filtering Recommendation cs.IR · 2026-04-29 · unverdicted · none · ref 32 · internal anchor
A gated hybrid contrastive collaborative filtering framework improves hit rate@10 and NDCG@10 on movie review datasets by layer-wise adaptive fusion of semantic and collaborative signals with contrastive objectives.
