DINO-MX: A Modular & Flexible Framework for Self-Supervised Learning
By: Mahmut Selman Gokmen, Cody Bumgardner
Potential Business Impact:
Makes AI see better with less computer power.
Vision Foundation Models (VFMs) have advanced representation learning through self-supervised methods. However, existing training pipelines are often inflexible, domain-specific, or computationally expensive, which limits their usability across different domains and resource settings. DINO-MX is a modular and extensible training framework that combines the core principles of DINO, DINOv2 and DINOv3 within a unified configuration-driven system. It supports a variety of transformer-based architectures and is fully compatible with the Hugging Face ecosystem. The framework includes multiple training strategies such as low-rank adaptation (LoRA), layer freezing, and knowledge distillation, along with support for distributed training through both Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP). DINO-MX is designed to work with both natural and specialized data types, including single- and multi-channel images. Experimental results on diverse datasets show that DINO-MX achieves competitive performance while significantly reducing computational costs. Additionally, it offers interpretability tools and a label-guided data augmentation method that improves attention-based localization without the need for extra detection or segmentation heads. DINO-MX provides a reproducible and scalable foundation for developing, adapting, and benchmarking self-supervised vision models across a range of research and real-world applications.
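The "core principles of DINO" that the abstract refers to can be illustrated with a minimal, dependency-free sketch of DINO-style self-distillation: a student is trained to match a centered, sharpened teacher distribution, and the teacher is updated as an exponential moving average (EMA) of the student. This is not DINO-MX's code or API. The function names and the default temperatures and momenta below are illustrative assumptions chosen to resemble commonly used DINO settings.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student distributions.

    The teacher logits are centered (minus a running mean) and sharpened
    (low temperature); the teacher output is treated as a fixed target.
    Temperatures are illustrative assumptions, not DINO-MX settings.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def update_center(center, teacher_logits_batch, momentum=0.9):
    """Move the running center toward the batch mean of teacher logits."""
    n = len(teacher_logits_batch)
    batch_mean = [sum(column) / n for column in zip(*teacher_logits_batch)]
    return [momentum * c + (1.0 - momentum) * b
            for c, b in zip(center, batch_mean)]


def ema_update(teacher_params, student_params, momentum=0.996):
    """Update teacher weights as an EMA of the student weights."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


if __name__ == "__main__":
    center = [0.0, 0.0, 0.0]
    teacher = [2.0, 0.0, -1.0]
    # A student that agrees with the teacher gets a lower loss
    # than one that ranks the classes in the opposite order.
    print(dino_loss([2.0, 0.0, -1.0], teacher, center))
    print(dino_loss([-1.0, 0.0, 2.0], teacher, center))
```

In a real training loop these operations would run on batched tensors across multiple augmented views, with gradients flowing only through the student. The scalar lists here just keep the arithmetic visible.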
Similar Papers
DINOv3
CV and Pattern Recognition
Teaches computers to see and understand images better.
MM-DINOv2: Adapting Foundation Models for Multi-Modal Medical Image Analysis
Image and Video Processing
Helps doctors see more in medical scans.
DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction
CV and Pattern Recognition
Helps computers judge picture "busyness" better.