
CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning

Published: October 25, 2025 | arXiv ID: 2510.22282v1

By: Tianhui Liu, Hetian Pang, Xin Zhang and more

Potential Business Impact:

Helps understand city wealth from pictures.

Business Areas:

Smart Cities, Real Estate

Urban socio-economic sensing, which harnesses publicly available, large-scale web data such as street view and satellite imagery, is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE, with its emergent reasoning process, significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly for prediction on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.

CityLens: Benchmarking Large Language-Vision Models for Urban Socioeconomic Sensing

Artificial Intelligence

Helps computers understand city life from pictures.

31 May 2025

89%

How Well Do Vision-Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images

CV and Pattern Recognition

Helps computers understand city streets better.

29 Aug 2025

88%

Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models

CV and Pattern Recognition

Helps computers find places from any picture.

17 Jun 2025

Repos / Data Links

github.com

Page Count

17 pages
