Topeax -- An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance
By: Márton Kardos
Potential Business Impact:
Finds better main ideas in long texts.
Text clustering is today the most popular paradigm for topic modelling, both in academia and industry. Despite the apparent success of clustering topic models, I identify a number of issues in Top2Vec and BERTopic which remain largely unsolved. Firstly, these approaches are unreliable at discovering natural clusters in corpora, due to extreme sensitivity to sample size and hyperparameters, the default values of which result in suboptimal behaviour. Secondly, when estimating term importance, BERTopic ignores the semantic distance of keywords to topic vectors, while Top2Vec ignores word counts in the corpus. This results, on the one hand, in less coherent topics due to the presence of stop words and junk words, and, on the other, in a lack of variety and trust. In this paper, I introduce a new approach, Topeax, which discovers the number of clusters from peaks in density estimates, and combines lexical and semantic indices of term importance to obtain high-quality topic keywords. Topeax is demonstrated to be better at both cluster recovery and cluster description than Top2Vec and BERTopic, while also exhibiting less erratic behaviour in response to changing sample size and hyperparameters.
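The two ideas named in the abstract can be illustrated with a toy sketch. This is not the Topeax implementation: the abstract does not give the estimator or the formulas, so everything below is an assumption made for illustration. The function names (`gaussian_kde_1d`, `count_density_peaks`, `combined_importance`), the one-dimensional Gaussian kernel density estimate, the fixed bandwidth, and the geometric mean used to combine the lexical and semantic scores are all hypothetical choices.

```python
import math


def gaussian_kde_1d(points, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid position.

    Assumption: a 1-D KDE with a fixed bandwidth stands in for whatever
    density estimate the paper actually uses.
    """
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - p) / bandwidth) ** 2) for p in points)
        for g in grid
    ]


def count_density_peaks(density):
    """Count strict local maxima; each peak is read as one cluster."""
    return sum(
        1
        for i in range(1, len(density) - 1)
        if density[i - 1] < density[i] > density[i + 1]
    )


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_importance(count_in_topic, count_in_corpus, word_vec, topic_vec):
    """Combine a lexical and a semantic score for one term.

    Assumption: lexical score = share of the term's occurrences that fall
    in the topic; semantic score = cosine similarity to the topic vector,
    clipped at zero; combination = geometric mean. These are illustrative
    choices, not the paper's definitions.
    """
    lexical = count_in_topic / count_in_corpus if count_in_corpus else 0.0
    semantic = max(0.0, cosine(word_vec, topic_vec))
    return math.sqrt(lexical * semantic)


# Two well-separated groups of points on a line.
points = [-0.1, 0.0, 0.05, 0.1, 4.9, 5.0, 5.05, 5.1]
grid = [-2.0 + 0.1 * i for i in range(91)]
n_clusters = count_density_peaks(gaussian_kde_1d(points, grid, bandwidth=0.5))

# A topical word (frequent in the topic, close to the topic vector)
# versus a stop word (frequent everywhere, far from the topic vector).
topical = combined_importance(8, 10, [1.0, 0.0], [1.0, 0.0])
stop_word = combined_importance(50, 1000, [0.1, 1.0], [1.0, 0.0])
```

In this sketch the number of clusters is read off the density estimate rather than set by hand, and a term scores highly only if it is both frequent in the topic and semantically close to it, so a term that satisfies just one of the two criteria is pushed down the ranking.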