Cornstarch: Distributed Multimodal Training Must Be Multimodality-Aware
By: Insu Jang, Runyu Lu, Nikhil Bansal, and more
Potential Business Impact:
Trains smart AI models that understand pictures and words.
Multimodal large language models (MLLMs) extend the capabilities of large language models (LLMs) by combining heterogeneous model architectures to handle diverse modalities like images and audio. However, this inherent heterogeneity in MLLM model structure and data types makes makeshift extensions to existing LLM training frameworks unsuitable for efficient MLLM training. In this paper, we present Cornstarch, the first general-purpose distributed MLLM training framework. Cornstarch facilitates modular MLLM construction, enables composable parallelization of constituent models, and introduces MLLM-specific optimizations to pipeline and context parallelism for efficient distributed MLLM training. Our evaluation shows that Cornstarch outperforms state-of-the-art solutions by up to 1.57× in terms of training throughput. Cornstarch is an open-source project available at https://github.com/cornstarch-org/Cornstarch.
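To make the idea of "modular MLLM construction" concrete, here is a minimal PyTorch sketch of the typical composition the abstract refers to: a vision encoder whose outputs pass through a modality projector and are then prepended to the text embeddings of an LLM backbone. This is an illustrative toy model only; the `ToyVisionEncoder`, `ToyLanguageModel`, and `ModularMLLM` classes and their dimensions are assumptions for this sketch and are not Cornstarch's actual API or parallelization machinery.

```python
import torch
import torch.nn as nn


class ToyVisionEncoder(nn.Module):
    """Stand-in for a pretrained vision encoder (e.g., a ViT)."""
    def __init__(self, patch_dim=768, hidden=1024):
        super().__init__()
        self.proj = nn.Linear(patch_dim, hidden)

    def forward(self, pixel_patches):          # (B, num_patches, patch_dim)
        return self.proj(pixel_patches)        # (B, num_patches, hidden)


class ToyLanguageModel(nn.Module):
    """Stand-in for a decoder-only LLM that consumes embedding sequences."""
    def __init__(self, vocab=32000, hidden=2048, layers=2, heads=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        block = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, inputs_embeds):          # (B, seq, hidden)
        return self.lm_head(self.blocks(inputs_embeds))


class ModularMLLM(nn.Module):
    """Encoder -> projector -> LLM composition: projected image tokens are
    prepended to the text token embeddings before the LLM backbone."""
    def __init__(self, encoder, llm, enc_hidden=1024, llm_hidden=2048):
        super().__init__()
        self.encoder = encoder
        self.projector = nn.Linear(enc_hidden, llm_hidden)  # modality projector
        self.llm = llm

    def forward(self, pixel_patches, input_ids):
        vision_tokens = self.projector(self.encoder(pixel_patches))
        text_tokens = self.llm.embed(input_ids)
        fused = torch.cat([vision_tokens, text_tokens], dim=1)
        return self.llm(fused)


model = ModularMLLM(ToyVisionEncoder(), ToyLanguageModel())
logits = model(torch.randn(2, 196, 768), torch.randint(0, 32000, (2, 32)))
print(logits.shape)  # (2, 196 + 32, 32000)
```

The heterogeneity the abstract highlights is visible even in this toy: the encoder and the LLM have different hidden sizes, depths, and compute profiles, which is why the paper argues that each constituent model needs its own parallelization strategy rather than a single scheme stretched across the whole MLLM.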
Similar Papers
Cornserve: Efficiently Serving Any-to-Any Multimodal Models
Machine Learning (CS)
Makes smart AI understand and create mixed-media faster.
Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning
Computer Vision and Pattern Recognition
Helps AI learn better from pictures and words.
Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions
Computation and Language
Lets computers understand text, pictures, and sounds together.