Advancing vision-language models in front-end development via data synthesis
By: Tong Ge, Yashu Liu, Jieping Ye, and more
Potential Business Impact:
Helps computers build websites from pictures.
Modern front-end (FE) development, especially when leveraging the unique features of frameworks like React and Vue, presents distinctive challenges. These include managing modular architectures, ensuring synchronization between data and visual outputs for declarative rendering, and adapting reusable components to various scenarios. Such complexities make it particularly difficult for state-of-the-art large vision-language models (VLMs) to generate accurate and functional code directly from design images. To address these challenges, we propose a reflective agentic workflow that synthesizes high-quality image-text data to capture the diverse characteristics of FE development. This workflow automates the extraction of self-contained code snippets (a self-contained snippet is one that encapsulates all necessary logic, styling, and dependencies, ensuring it functions independently without requiring external imports or context) from real-world projects, renders the corresponding visual outputs, and generates detailed descriptions that link design elements to functional code. To further expand the scope and utility of the synthesis, we introduce three data synthesis strategies: Evolution-based synthesis, which enables scalable and diverse dataset expansion; Waterfall-Model-based synthesis, which generates logically coherent code derived from system requirements; and Additive Development synthesis, which iteratively increases the complexity of human-authored components. We build a large vision-language model, Flame, trained on the synthesized datasets and demonstrate its effectiveness in generating React code via the pass@k metric. Our results suggest that a code VLM trained to interpret images before code generation may achieve better performance.
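To make the notion of a self-contained snippet concrete, here is a minimal illustrative sketch of what such an extracted React component could look like; the component name, state, and inline styling are hypothetical examples, not taken from the paper's dataset.

```tsx
// Illustrative sketch only: a "self-contained" React component in the sense
// described above. All logic, state, and styling live in this one snippet,
// so it renders without any project-specific imports or external context.
import React, { useState } from "react";

const styles: Record<string, React.CSSProperties> = {
  card: { padding: 16, borderRadius: 8, boxShadow: "0 1px 4px rgba(0,0,0,0.2)" },
  button: { marginTop: 8, padding: "4px 12px" },
};

export default function LikeCard() {
  // Local state only; no props, stores, or external hooks are required.
  const [likes, setLikes] = useState(0);

  return (
    <div style={styles.card}>
      <p>{likes} people liked this</p>
      <button style={styles.button} onClick={() => setLikes(likes + 1)}>
        Like
      </button>
    </div>
  );
}
```

A snippet like this can be rendered in isolation, which is what allows the workflow to pair each extracted component with a screenshot of its visual output.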
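The summary does not restate how pass@k is computed. Assuming the paper follows the standard unbiased estimator common in code-generation evaluation, where n samples are drawn per task and c of them pass the checks, the metric is:

\[
\text{pass}@k = \mathbb{E}_{\text{tasks}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
\]

Intuitively, this is the probability that at least one of k randomly chosen samples (out of the n generated) passes.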
Similar Papers
Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation
CV and Pattern Recognition
Makes computers create clearer pictures from words.
Data Factory with Minimal Human Effort Using VLMs
CV and Pattern Recognition
Makes computers create realistic pictures from words.
Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains
CV and Pattern Recognition
Helps computers understand many pictures at once.