Multilingual Multimodal Software Developer for Code Generation
By: Linzheng Chai, Jian Yang, Shukai Liu, and more
Potential Business Impact:
Helps computers write code from pictures.
The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids such as diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs, namely Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow), with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset that includes visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, in contrast to prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation that addresses the limitations of existing text-only benchmarks. Our evaluations on MMEval highlight significant remaining challenges for models in precisely capturing visual information, following instructions, and applying advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.
Similar Papers
Unified Modeling Language Code Generation from Diagram Images Using Multimodal Large Language Models
Software Engineering
Turns software pictures into working computer code.
VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models
Computation and Language
Helps computers write code from pictures.
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
CV and Pattern Recognition
Teaches computers to solve math problems with pictures.