Conferences

CFE-CM Statistics Conference

Published: September 13, 2024

It is well known in the literature that document clustering techniques have two main limitations: they operate in a high-dimensional space, and the clusters are difficult to interpret once a partition has been obtained. The proposed method for document clustering employs a two-stage process. First, because the information contained in the Document-Term matrix is highly sparse, applying a clustering technique to it directly would be highly inefficient. Consequently, dimensionality reduction is applied: the proposed strategy employs Latent Dirichlet Allocation (LDA) to identify the main topics in the corpus under analysis. To determine the similarity between two documents, the p-value of a hypothesis test of the homogeneity of their topic distributions is computed. This p-value is used as a similarity measure, upon which three different clustering procedures are built. The first two directly employ the resulting dissimilarity, using a hierarchical approach and a fuzzy relational clustering approach, while the third is a test-based approach to clustering. The performance of the clustering methods is then assessed on benchmark datasets in order to understand the advantages and disadvantages of the proposals.
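The idea of using a homogeneity p-value as a similarity can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the method presented in the talk: the abstract does not say which test is used, so a permutation version of the Pearson chi-square test stands in for it; the per-document topic counts in `docs` are invented toy values that one might obtain from an LDA fit; the dissimilarity is taken to be 1 minus the p-value; and average linkage is just one possible hierarchical rule.

```python
import random
from itertools import combinations


def chi2_stat(a, b):
    """Pearson chi-square statistic for a 2 x K table of topic counts."""
    n_a, n_b = sum(a), sum(b)
    total = n_a + n_b
    stat = 0.0
    for x, y in zip(a, b):
        col = x + y
        if col == 0:
            continue  # topic absent from both documents
        e_a, e_b = n_a * col / total, n_b * col / total
        stat += (x - e_a) ** 2 / e_a + (y - e_b) ** 2 / e_b
    return stat


def homogeneity_pvalue(a, b, n_perm=1000, seed=0):
    """Permutation p-value for H0: both documents share one topic distribution.

    Assumption: this permutation test is a stand-in for the (unspecified)
    homogeneity test mentioned in the abstract.
    """
    rng = random.Random(seed)
    k, n_a = len(a), sum(a)
    # Expand the counts into one topic label per token, pooled over both documents.
    labels = [t for counts in (a, b) for t, c in enumerate(counts) for _ in range(c)]
    observed = chi2_stat(a, b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        pa, pb = [0] * k, [0] * k
        for t in labels[:n_a]:
            pa[t] += 1
        for t in labels[n_a:]:
            pb[t] += 1
        if chi2_stat(pa, pb) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def average_linkage(dissim, n_clusters):
    """Naive agglomerative clustering with average linkage on a dissimilarity matrix."""
    clusters = [[i] for i in range(len(dissim))]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = sum(dissim[p][q] for p in clusters[i] for q in clusters[j])
            d /= len(clusters[i]) * len(clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical topic counts (3 topics) for 4 documents; toy values, not real data.
docs = [[40, 5, 5], [38, 7, 5], [5, 40, 5], [6, 37, 7]]
n = len(docs)

pvals = [[1.0] * n for _ in range(n)]
for i, j in combinations(range(n), 2):
    pvals[i][j] = pvals[j][i] = homogeneity_pvalue(docs[i], docs[j])

# Assumed conversion: a high p-value (similar documents) gives a low dissimilarity.
dissim = [[1.0 - pvals[i][j] for j in range(n)] for i in range(n)]
clusters = average_linkage(dissim, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

In this toy setting, documents dominated by the same topic receive a large p-value and are merged early, while documents with clearly different topic profiles receive a p-value near the permutation floor of 1 / (n_perm + 1) and stay apart.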

Gian Mario Sangiovanni

Conferences

Participation

Violence Against Women (VAW) workshop 2024

Invited speaker

CFE-CM Statistics Conference