Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context
By: Anatole Jacquin de Margerie, Alexis Roger, Irina Rish
Reproducibility remains a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. In this work, we present a detailed reproduction and critical analysis of the Monkey Vision-Language Model (VLM) (Li et al., 2023b), published at CVPR 2024, a recent approach to high-resolution image understanding via image tiling. The original paper proposed splitting large images into tiles to recover fine-grained visual details while maintaining computational efficiency. Our study replicates this strategy using open checkpoints and reimplements the training pipeline. We confirm the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details. We then extend this work by investigating the effect of including global context, which provides practical insights for future high-resolution multimodal modeling. However, we also report deviations from the original results, with the magnitude of these effects depending heavily on task type and tile granularity.
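To make the tiling strategy concrete, the sketch below shows one plausible way to split a high-resolution image into fixed-size local tiles while keeping a downsampled copy of the full image as global context. This is an illustrative assumption, not the authors' exact pipeline: the tile size, padding, and resampling choices are placeholders, and the function and file names are hypothetical.

```python
# Minimal sketch of image tiling with a global-context view (illustrative only;
# tile size, padding, and resampling are assumptions, not the Monkey VLM pipeline).
from PIL import Image

TILE_SIZE = 448  # assumed per-tile resolution fed to the vision encoder


def tile_with_global_context(image: Image.Image, tile_size: int = TILE_SIZE):
    """Return (tiles, global_view) for a high-resolution image."""
    # Pad the image up to a whole number of tiles so no pixels are dropped.
    n_cols = -(-image.width // tile_size)   # ceiling division
    n_rows = -(-image.height // tile_size)
    padded = Image.new("RGB", (n_cols * tile_size, n_rows * tile_size))
    padded.paste(image, (0, 0))

    # Local detail: non-overlapping tiles in row-major order.
    tiles = [
        padded.crop((c * tile_size, r * tile_size,
                     (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(n_rows) for c in range(n_cols)
    ]

    # Global context: a single resized view of the whole image.
    global_view = image.resize((tile_size, tile_size), Image.BICUBIC)
    return tiles, global_view


if __name__ == "__main__":
    img = Image.open("example_high_res.png").convert("RGB")  # hypothetical input
    tiles, global_view = tile_with_global_context(img)
    print(f"{len(tiles)} local tiles + 1 global view")
```

In this setup, the local tiles preserve fine-grained detail at full resolution, while the single resized view stands in for the global context whose contribution our extension investigates.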