Words That Make Language Models Perceive
By: Sophie L. Wang, Phillip Isola, Brian Cheung
Potential Business Impact:
Makes text-only AI "see" and "hear" with words.
Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by the multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to "see" or "hear", it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.
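To make the idea concrete, the sketch below is one way to probe this kind of effect; it is not the authors' protocol. It prepends a hypothetical "see" sensory prompt to a few captions, mean-pools a small text-only LLM's hidden states (GPT-2, chosen only for size), and measures representational alignment against CLIP's text encoder (a stand-in for a specialist visual encoder, which in the paper's setting would operate on images) using linear centered kernel alignment (CKA). The prompt wording, model choices, pooling scheme, and metric are all assumptions made for illustration.

```python
# Illustrative sketch only: prepend a "sensory prompt" to each caption,
# pool the LLM's hidden states, and compare representational alignment
# against a vision-language encoder's embeddings with linear CKA.
# Model names, prompt wording, and pooling are assumptions, not the paper's setup.
import torch
from transformers import AutoTokenizer, AutoModel, CLIPModel, CLIPProcessor

captions = [
    "a red apple on a wooden table",
    "a dog running across a grassy field",
    "a violin resting on sheet music",
    "waves crashing against a rocky shore",
]
sensory_prompt = "You can see. Describe what you see: "  # hypothetical wording

def llm_embeddings(texts, model_name="gpt2"):
    """Mean-pool the last hidden layer of a text-only LLM over each input."""
    tok = AutoTokenizer.from_pretrained(model_name)
    tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
    model = AutoModel.from_pretrained(model_name).eval()
    batch = tok(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state       # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, D)

def clip_text_embeddings(texts, model_name="openai/clip-vit-base-patch32"):
    """Embeddings from a vision-language encoder, standing in here for a
    specialist visual encoder operating on matched images."""
    proc = CLIPProcessor.from_pretrained(model_name)
    clip = CLIPModel.from_pretrained(model_name).eval()
    batch = proc(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        return clip.get_text_features(**batch)           # (B, D')

def linear_cka(x, y):
    """Linear centered kernel alignment between two sets of representations."""
    x = x - x.mean(0, keepdim=True)
    y = y - y.mean(0, keepdim=True)
    xty = (x.T @ y).norm() ** 2
    return (xty / ((x.T @ x).norm() * (y.T @ y).norm())).item()

plain = llm_embeddings(captions)
prompted = llm_embeddings([sensory_prompt + c for c in captions])
visual = clip_text_embeddings(captions)

print(f"CKA(LLM plain,    visual encoder): {linear_cka(plain, visual):.3f}")
print(f"CKA(LLM prompted, visual encoder): {linear_cka(prompted, visual):.3f}")
```

With a larger concept set and an actual image encoder applied to matched images, a prompted CKA score rising above the plain one would be the kind of alignment shift the abstract describes.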
Similar Papers
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
Machine Learning (CS)
Computers learn to "see" from reading words.
Exploring Multimodal Prompt for Visualization Authoring with Large Language Models
Human-Computer Interaction
Draw pictures to help computers make charts.
The Future of MLLM Prompting is Adaptive: A Comprehensive Experimental Evaluation of Prompt Engineering Methods for Robust Multimodal Performance
Artificial Intelligence
Teaches AI to understand pictures and words better.