Unsupervised Document and Template Clustering using Multimodal Embeddings
By: Phillipe R. Sampaio, Helene Maxcici
Potential Business Impact:
Groups similar documents by their words, appearance, and layout.
This paper investigates a novel approach to unsupervised document clustering by leveraging multimodal embeddings as input to clustering algorithms such as $k$-Means, DBSCAN, a combination of HDBSCAN and $k$-NN, and BIRCH. Our method aims to achieve a finer-grained document understanding by not only grouping documents at the type level (e.g., invoices, purchase orders), but also distinguishing between different templates within the same document category. This is achieved by using embeddings that capture textual content, layout information, and visual features of documents. We evaluated the effectiveness of this approach using embeddings generated by several state-of-the-art pre-trained multimodal models, including SBERT, LayoutLMv1, LayoutLMv3, DiT, Donut, ColPali, Gemma3, and InternVL3. Our findings demonstrate the potential of multimodal embeddings to significantly enhance document clustering, offering benefits for various applications in intelligent document processing, document layout analysis, and unsupervised document classification. This work provides valuable insight into the advantages and limitations of different multimodal models for this task and opens new avenues for future research to understand and organize document collections.
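The core idea above, running an off-the-shelf clustering algorithm over fixed document embeddings, can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the "embeddings" here are mock random vectors standing in for the outputs of models like SBERT or LayoutLMv3, and the k-Means implementation (with deterministic farthest-first seeding) is a simplified stand-in for a library routine.

```python
import numpy as np

def init_centers(X, k):
    # Farthest-first seeding: deterministic and spreads centers apart.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans(X, k, n_iter=50):
    """Minimal k-Means over the rows of X (one embedding per document)."""
    centers = init_centers(X, k)
    for _ in range(n_iter):
        # Assign each embedding to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned embeddings.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Mock "document embeddings": two tight groups, e.g. two invoice templates.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (5, 8)),   # template A
               rng.normal(5.0, 0.1, (5, 8))])  # template B
labels = kmeans(X, k=2)
```

In practice one would replace the mock vectors with real model embeddings and swap in a density-based method such as DBSCAN or HDBSCAN when the number of templates is unknown, since those do not require fixing k in advance.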
Similar Papers
Semantic-Driven Topic Modeling for Analyzing Creativity in Virtual Brainstorming
Computation and Language
Finds good ideas in group brainstorming chats.
DashCLIP: Leveraging multimodal models for generating semantic embeddings for DoorDash
Information Retrieval
Helps online stores show you better stuff.
Optimizing Product Deduplication in E-Commerce with Multimodal Embeddings
Information Retrieval
Finds fake product listings using words and pictures.