Score: 2

Line of Sight: On Linear Representations in VLLMs

Published: June 5, 2025 | arXiv ID: 2506.04706v1

By: Achyuta Rajaram, Sarah Schwettmann, Jacob Andreas, and more

BigTech Affiliations: Massachusetts Institute of Technology

Potential Business Impact:

Reveals how AI models that process both text and images represent visual concepts internally, enabling targeted edits to model behavior and more transparent, interpretable visual search systems.

Business Areas:
Visual Search, Internet Services

Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LLaVA-NeXT, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream, and we show that these features are causal by performing targeted edits on the model output. To increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
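As a rough illustration of the linear-decodability claim, the sketch below fits a logistic-regression probe on residual-stream activations. The tensors here are synthetic stand-ins: in practice the activations would be extracted from a chosen layer of LLaVA-NeXT (e.g., via forward hooks) and the labels would be ImageNet classes. This is a minimal sketch of the general probing technique, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: real activations would come from one layer of
# LLaVA-NeXT's residual stream (shape: n_samples x d_model), paired
# with an ImageNet class label per image.
d_model, n_samples, n_classes = 4096, 2000, 10
X = rng.normal(size=(n_samples, d_model)).astype(np.float32)
y = rng.integers(0, n_classes, size=n_samples)

# Linear probe: if a concept is linearly decodable, a simple linear
# classifier on raw activations should recover it well above chance.
probe = LogisticRegression(max_iter=1000)
probe.fit(X[:1600], y[:1600])
print("held-out probe accuracy:", probe.score(X[1600:], y[1600:]))
```

In this setting, the probe's learned weight vector also gives a natural direction for causal tests: adding or ablating it in the residual stream at the same layer is a common way to check whether a linear feature actually drives model output. The dictionary-learning step can likewise be sketched as a small sparse autoencoder trained to reconstruct activations under an L1 sparsity penalty; the architecture and hyperparameters below are illustrative assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary: ReLU codes with an L1 penalty on them."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(z), z         # reconstruction and codes

sae = SparseAutoencoder(d_model=4096, d_dict=16384)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# One training step on a stand-in batch; real batches would mix text-
# and image-token activations from the host model.
x = torch.randn(64, 4096)
x_hat, z = sae(x)
loss = ((x_hat - x) ** 2).mean() + 1e-3 * z.abs().mean()
loss.backward()
opt.step()
print("reconstruction + sparsity loss:", loss.item())
```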

Country of Origin
🇺🇸 United States

Repos / Data Links

Page Count
15 pages

Category
Computer Science: Computer Vision and Pattern Recognition