Score: 2

Topeax -- An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance

Published: January 29, 2026 | arXiv ID: 2601.21465v1

By: Márton Kardos

Potential Business Impact:

Finds better main ideas in long texts.

Business Areas:
Text Analytics Data and Analytics, Software

Text clustering is today the most popular paradigm for topic modelling, both in academia and industry. Despite clustering topic models' apparent success, we identify a number of issues in Top2Vec and BERTopic, which remain largely unsolved. Firstly, these approaches are unreliable at discovering natural clusters in corpora, due to extreme sensitivity to sample size and hyperparameters, the default values of which result in suboptimal behaviour. Secondly, when estimating term importance, BERTopic ignores the semantic distance of keywords to topic vectors, while Top2Vec ignores word counts in the corpus. This results in, on the one hand, less coherent topics due to the presence of stop words and junk words, and lack of variety and trust on the other. In this paper, I introduce a new approach, \textbf{Topeax}, which discovers the number of clusters from peaks in density estimates, and combines lexical and semantic indices of term importance to gain high-quality topic keywords. Topeax is demonstrated to be better at both cluster recovery and cluster description than Top2Vec and BERTopic, while also exhibiting less erratic behaviour in response to changing sample size and hyperparameters.

Country of Origin
🇩🇰 Denmark


Page Count
14 pages

Category
Computer Science:
Artificial Intelligence