Differentiable Hierarchical Visual Tokenization
By: Marius Aasan , Martine Hjelkrem-Tan , Nico Catalano and more
Potential Business Impact:
Makes computer vision understand pictures better.
Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
Similar Papers
Towards Scalable Pre-training of Visual Tokenizers for Generation
CV and Pattern Recognition
Makes AI pictures better by understanding meaning.
2D Gaussians Meet Visual Tokenizer
CV and Pattern Recognition
Makes computer images look more real.
2D Gaussians Meet Visual Tokenizer
CV and Pattern Recognition
Makes computer images look more real.