
A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model

Published: October 23, 2025 | arXiv ID: 2510.21862v1

By: Muhammad Tayyab Khan, Zane Yong, Lequn Chen and more

Potential Business Impact:

Lets computers understand factory blueprints automatically.

Business Areas:

Image Recognition, Data and Analytics, Software

Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging using manual methods, generic optical character recognition (OCR) systems, or traditional deep learning approaches, due to varied layouts, orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including measures, GD&T symbols, and surface roughness indicators. The third stage employs two Donut-based, OCR-free VLMs for semantic content parsing: the Alphabetical VLM extracts textual and categorical information from title blocks and notes, while the Numerical VLM interprets quantitative data such as measures, GD&T frames, and surface roughness. Two specialized datasets were developed to ensure robustness and generalization: 1,000 drawings for layout detection and 1,406 for annotation-level training. The Alphabetical VLM achieved an overall F1 score of 0.672, while the Numerical VLM reached 0.963, demonstrating strong performance in textual and quantitative interpretation, respectively. The unified JSON output enables seamless integration with CAD and manufacturing databases, providing a scalable solution for intelligent engineering drawing analysis.
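The routing between the detection stages and the two VLMs can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Region` type, the label names, `pick_parser`, `interpret_drawing`, and the JSON layout are all hypothetical, and the detectors and Donut-based parsers are replaced by stand-in callables. It only shows how detected regions might be dispatched to an alphabetical or numerical parser and merged into one JSON document.

```python
import json
from dataclasses import dataclass


@dataclass
class Region:
    """A detected region (hypothetical type, not from the paper)."""
    label: str          # e.g. "title_block", "measure"
    box: tuple          # (x, y, width, height) in pixels
    angle: float = 0.0  # rotation in degrees, as an oriented detector might report


# Assumed label sets, following the split described in the abstract.
ALPHABETICAL_LABELS = {"title_block", "notes"}
NUMERICAL_LABELS = {"measure", "gdt", "surface_roughness"}


def pick_parser(label):
    """Return which parser should handle a region, or None if it is not parsed."""
    if label in ALPHABETICAL_LABELS:
        return "alphabetical"
    if label in NUMERICAL_LABELS:
        return "numerical"
    return None  # e.g. "view" regions are only containers for annotations


def interpret_drawing(regions, parsers):
    """Dispatch each region to its parser and merge the results into one JSON string."""
    output = {"alphabetical": [], "numerical": []}
    for region in regions:
        name = pick_parser(region.label)
        if name is None:
            continue
        output[name].append({
            "label": region.label,
            "box": list(region.box),
            "angle": region.angle,
            "content": parsers[name](region),  # stand-in for a VLM call
        })
    return json.dumps(output, indent=2)


if __name__ == "__main__":
    # Stand-in parsers; a real system would call the fine-tuned models here.
    stub_parsers = {
        "alphabetical": lambda region: {"text": "placeholder"},
        "numerical": lambda region: {"value": "placeholder"},
    }
    detected = [
        Region("view", (0, 0, 400, 300)),
        Region("title_block", (0, 300, 400, 60)),
        Region("measure", (120, 80, 40, 12), angle=90.0),
    ]
    print(interpret_drawing(detected, stub_parsers))
```

Keeping the parsers as plain callables means either stage could be swapped out or mocked without changing the routing logic.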



Country of Origin

🇸🇬 Singapore

Page Count

10 pages
