BertsWin: Resolving Topological Sparsity in 3D Masked Autoencoders via Component-Balanced Structural Optimization

Published: December 25, 2025 | arXiv ID: 2512.21769v1

By: Evgeny Alves Limarenko, Anastasiia Studenikina

Potential Business Impact:

Helps doctors see inside bodies better, faster.

Business Areas:

Image Recognition Data and Analytics, Software

The application of self-supervised learning (SSL) and Vision Transformer (ViT) approaches has shown promising results in 2D medical imaging, but applying these methods to 3D volumetric images remains difficult. Standard Masked Autoencoders (MAE), the state-of-the-art solution for 2D, struggle to capture three-dimensional spatial relationships, especially when 75% of tokens are discarded during pre-training. We propose BertsWin, a hybrid architecture that combines full BERT-style token masking with Swin Transformer windows to enhance spatial context learning in 3D during SSL pre-training. Unlike the classic MAE, which processes only visible areas, BertsWin introduces a complete 3D grid of tokens (masked and visible), preserving the spatial topology. To mitigate the quadratic complexity of ViT, single-level local Swin windows are used. We introduce a structural priority loss function and evaluate the approach on cone beam computed tomography of the temporomandibular joints (TMJ). The subsequent assessment includes TMJ segmentation on 3D CT scans. We demonstrate that the BertsWin architecture, by maintaining a complete three-dimensional spatial topology, inherently accelerates semantic convergence by a factor of 5.8x compared to standard ViT-MAE baselines. Furthermore, when coupled with our proposed GradientConductor optimizer, the full BertsWin framework achieves a 15-fold reduction in the training epochs (44 vs 660) required to reach state-of-the-art reconstruction fidelity. Analysis reveals that BertsWin achieves this acceleration without the computational penalty typically associated with dense volumetric processing. At canonical input resolutions, the architecture maintains theoretical FLOP parity with sparse ViT baselines, so the faster convergence translates into a significant net reduction in total computational resources.
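The difference between the two input layouts described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The 8x8x8 token grid, the 4x4x4 window size, the function names, and the `MASK` placeholder are all assumptions chosen for illustration; only the 75% mask ratio comes from the abstract. The sketch shows that an MAE-style encoder receives a short flat sequence of visible tokens, whereas a BERT-style full grid keeps every position (with masked ones replaced by a placeholder), so it can still be split into local 3D windows.

```python
import itertools
import random

MASK = None  # hypothetical placeholder standing in for a learned mask-token embedding


def make_mask(grid_shape, mask_ratio, seed=0):
    """Return the set of (z, y, x) grid positions chosen for masking."""
    coords = list(itertools.product(*(range(n) for n in grid_shape)))
    rng = random.Random(seed)
    n_masked = int(len(coords) * mask_ratio)
    return set(rng.sample(coords, n_masked))


def mae_style_input(grid_shape, masked):
    """MAE-style encoder input: only the visible tokens, as a flat sequence."""
    coords = itertools.product(*(range(n) for n in grid_shape))
    return [c for c in coords if c not in masked]


def full_grid_input(grid_shape, masked):
    """BERT-style input: every position is kept; masked ones hold MASK."""
    coords = itertools.product(*(range(n) for n in grid_shape))
    return {c: (MASK if c in masked else c) for c in coords}


def partition_windows(grid_shape, window_shape):
    """Group grid positions into non-overlapping 3D windows."""
    windows = {}
    for c in itertools.product(*(range(n) for n in grid_shape)):
        key = tuple(i // w for i, w in zip(c, window_shape))
        windows.setdefault(key, []).append(c)
    return windows


grid_shape = (8, 8, 8)    # assumed token grid, 512 tokens in total
window_shape = (4, 4, 4)  # assumed local window, 64 tokens each
masked = make_mask(grid_shape, mask_ratio=0.75)

visible_seq = mae_style_input(grid_shape, masked)
full_grid = full_grid_input(grid_shape, masked)
windows = partition_windows(grid_shape, window_shape)

print(len(visible_seq))  # 128 visible tokens reach an MAE-style encoder
print(len(full_grid))    # 512 positions are kept in the full grid
print(len(windows))      # 8 windows of 64 tokens each
```

In this toy configuration, attention inside a window spans 64 tokens rather than all 512 grid positions, which is the general mechanism by which local windows limit the cost of processing the dense grid. The numbers are illustrative and say nothing about the FLOP figures reported in the paper.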

CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework

CV and Pattern Recognition

Teaches computers to see faster and better.

8 Nov 2025 1

87%

Self Pre-training with Adaptive Mask Autoencoders for Variable-Contrast 3D Medical Imaging

Image and Video Processing

Helps doctors find strokes on brain scans better.

15 Jan 2025 0

87%

Structure is Supervision: Multiview Masked Autoencoders for Radiology

CV and Pattern Recognition

Helps doctors find diseases in X-rays better.

27 Nov 2025 1

Page Count

18 pages
