MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization
By: Animesh Jain, Alexandros Stergiou
Potential Business Impact:
Shows what computers "see" inside their brains.
Vision Language Models (VLMs) encode multimodal inputs with large, complex, and difficult-to-interpret architectures, which limits transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length, free-form VLM output texts. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.
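The abstract does not spell out the optimization itself, so the sketch below is only a rough, generic illustration of the model-inversion idea it builds on: optimize a synthetic image so that a frozen encoder's features align with a target internal encoding, with a smoothness prior. It is not MIMIC's actual objective; `vlm_vision_encoder`, `target_features`, and the total-variation regularizer are assumptions chosen for the example, written in PyTorch.

```python
import torch
import torch.nn.functional as F

def total_variation(img):
    # Smoothness prior: penalize differences between neighboring pixels.
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

def invert_concept(vlm_vision_encoder, target_features, steps=500, lr=0.05,
                   tv_weight=1e-4, image_size=(1, 3, 224, 224), device="cpu"):
    """Optimize a synthetic image whose encoder features match `target_features`.

    `vlm_vision_encoder` stands in for any frozen module mapping images to
    feature tensors; `target_features` is the internal encoding to visualize.
    Both are placeholders, not MIMIC's actual interfaces.
    """
    image = torch.randn(image_size, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        feats = vlm_vision_encoder(image)
        # Feature-alignment term: pull the synthetic image's encoding
        # toward the target internal representation.
        align_loss = F.mse_loss(feats, target_features)
        # Add the smoothness regularizer (a full method would combine
        # several priors, e.g. for spatial alignment and realism).
        loss = align_loss + tv_weight * total_variation(image)
        loss.backward()
        optimizer.step()

    return image.detach()
```

In this style of inversion, the only trainable parameters are the image pixels themselves; the model stays frozen, so whatever the optimization recovers reflects what the encoder's internal representation responds to.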
Similar Papers
More Images, More Problems? A Controlled Analysis of VLM Failure Modes
CV and Pattern Recognition
Helps computers understand many pictures together.
Rethinking Visual Information Processing in Multimodal LLMs
CV and Pattern Recognition
Lets computers understand pictures and words together better.
Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models
CV and Pattern Recognition
Computers learn better by explaining their answers.