Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems
By: YuChe Hsu, AnJui Wang, TsaiChing Ni, and more
Potential Business Impact:
Turns drawings and words into working virtual factories.
We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning across textual descriptions, spatial structures, and simulation logic. In parallel, three new evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are proposed for this task to assess structural integrity, parameter fidelity, and simulator executability. Through systematic ablations across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.
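The abstract does not spell out how SVR, PMR, and ESR are computed. As a minimal sketch of how such rates could be scored over a batch of generated FlexScript samples, the Python snippet below treats each metric as a simple success ratio; the helpers parse_structure, extract_parameters, and run_in_simulator are hypothetical stand-ins for illustration only, not part of the paper or of FlexSim's API.

    # Minimal sketch (assumptions, not the paper's implementation): score a batch
    # of generated FlexScript samples with SVR, PMR, and ESR as success ratios.
    import re
    from dataclasses import dataclass

    @dataclass
    class Sample:
        generated_code: str      # FlexScript emitted by the model
        reference_params: dict   # ground-truth parameters from the prompt-sketch-code triplet

    def parse_structure(code: str) -> bool:
        # Stand-in structural check: non-empty code with balanced braces.
        return bool(code.strip()) and code.count("{") == code.count("}")

    def extract_parameters(code: str) -> dict:
        # Stand-in parameter extraction: collect simple name=value assignments.
        return dict(re.findall(r"(\w+)\s*=\s*([\w.]+)", code))

    def run_in_simulator(code: str) -> bool:
        # Stand-in for execution; a real pipeline would launch the simulator
        # and report whether the generated model runs without error.
        return parse_structure(code)

    def evaluate(samples: list[Sample]) -> dict[str, float]:
        valid = executed = matched = total_params = 0
        for s in samples:
            valid += parse_structure(s.generated_code)
            executed += run_in_simulator(s.generated_code)
            predicted = extract_parameters(s.generated_code)
            for name, ref in s.reference_params.items():
                total_params += 1
                matched += predicted.get(name) == ref
        n = max(len(samples), 1)
        return {
            "SVR": valid / n,                       # structural integrity
            "PMR": matched / max(total_params, 1),  # parameter fidelity
            "ESR": executed / n,                    # simulator executability
        }

    # Example: a single toy sample scores 1.0 on all three rates.
    print(evaluate([Sample("Queue1 = { capacity=10 }", {"capacity": "10"})]))

In practice the structural check, parameter comparison, and execution test would be backed by the simulator itself rather than the toy heuristics used here.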
Similar Papers
Synthesizing Visual Concepts as Vision-Language Programs
Artificial Intelligence
Makes AI understand pictures and think logically.
Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation
CV and Pattern Recognition
Makes 3D pictures match words better.
ExploreVLM: Closed-Loop Robot Exploration Task Planning with Vision-Language Models
Robotics
Robots learn to explore and do tasks better.