The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
By: Weichen Fan, Haiwen Diao, Quan Wang, et al.
Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Our study uncovers a striking and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We formalize it as the Prism Hypothesis: each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, much as a prism splits light into its component frequencies. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel detail through a frequency-band modulator, enabling their seamless coexistence in a single latent space. Extensive experiments on the ImageNet and MS-COCO benchmarks validate that UAE unifies semantic abstraction and pixel-level fidelity with state-of-the-art performance.
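The kind of spectral analysis described above can be reproduced in spirit with a short diagnostic: take an encoder's spatial feature map, compute its 2D Fourier spectrum, radially average the power, and measure how much energy falls below a low-frequency cutoff. The sketch below is illustrative only and is not the paper's implementation; the (C, H, W) feature-map shape, the radial-binning scheme, and the 0.25 cutoff are assumptions made for this example.

```python
# Illustrative sketch (not the authors' code): measure how a feature map's
# energy is distributed across spatial frequencies, per the abstract's
# spectral analysis. Feature maps are assumed to be numpy arrays of shape
# (C, H, W), e.g. reshaped ViT patch tokens or a pixel autoencoder's latents.
import numpy as np


def radial_power_spectrum(feat: np.ndarray, n_bins: int = 16) -> np.ndarray:
    """Radially averaged power spectrum of a (C, H, W) feature map."""
    _, H, W = feat.shape
    # 2D FFT per channel, shifted so the DC (lowest-frequency) term is centered.
    spec = np.fft.fftshift(np.fft.fft2(feat, axes=(-2, -1)), axes=(-2, -1))
    power = (np.abs(spec) ** 2).mean(axis=0)  # average power over channels
    # Radial distance of each spatial frequency from the center, normalized to [0, 1].
    yy, xx = np.mgrid[:H, :W]
    r = np.hypot(yy - H / 2, xx - W / 2)
    r = r / r.max()
    bins = np.minimum((r * n_bins).astype(int), n_bins - 1)
    # Mean power per radial frequency bin: low indices correspond to low frequencies.
    sums = np.bincount(bins.ravel(), weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(bins.ravel(), minlength=n_bins)
    return sums / np.maximum(counts, 1)


def low_freq_energy_ratio(feat: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy below a normalized frequency cutoff."""
    spectrum = radial_power_spectrum(feat)
    k = max(1, int(cutoff * len(spectrum)))
    return float(spectrum[:k].sum() / spectrum.sum())


if __name__ == "__main__":
    # Stand-ins for real encoder outputs: a smooth, low-frequency-dominated map
    # vs. the same map plus high-frequency noise, mimicking semantic vs. pixel features.
    rng = np.random.default_rng(0)
    smooth = np.cumsum(np.cumsum(rng.standard_normal((8, 32, 32)), axis=1), axis=2)
    detailed = smooth + 5.0 * rng.standard_normal((8, 32, 32))
    print("smooth map  :", round(low_freq_energy_ratio(smooth), 3))
    print("detailed map:", round(low_freq_energy_ratio(detailed), 3))
```

Applied to real features, the paper's hypothesis would predict that reshaped tokens from a semantic encoder concentrate a larger share of their energy in the low-frequency bins than the latents of a pixel encoder, which retain additional high-frequency energy.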