NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows
By: Denis Tarasov, Alexander Nikulin, Ilya Zisman, and more
Potential Business Impact:
Lets robots compute actions faster, enabling real-time high-frequency control.
Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time, or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune it on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
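The core speed argument above is that a normalizing flow maps a noise sample to an action in a single invertible pass, whereas a diffusion decoder iterates many denoising steps. The following is a minimal NumPy sketch of that idea, not the paper's actual architecture: the dimensions, the random "conditioning" weights, and the single affine map standing in for a trained flow are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: VLM embedding dim -> action dim.
EMB_DIM, ACT_DIM = 8, 4

# Toy conditioning weights (stand-ins for a trained flow's networks).
W_mu = rng.normal(size=(EMB_DIM, ACT_DIM)) * 0.1
W_logs = rng.normal(size=(EMB_DIM, ACT_DIM)) * 0.1

def flow_forward(eps, emb):
    """One-shot sampling: a single invertible affine map noise -> action."""
    mu = emb @ W_mu          # conditioning on the VLM embedding
    log_s = emb @ W_logs
    return mu + np.exp(log_s) * eps

def flow_inverse(action, emb):
    """Exact inverse of the same map, as used for likelihood training."""
    mu = emb @ W_mu
    log_s = emb @ W_logs
    return (action - mu) * np.exp(-log_s)

emb = rng.normal(size=EMB_DIM)      # a (fake) VLM embedding
eps = rng.normal(size=ACT_DIM)      # base Gaussian noise

action = flow_forward(eps, emb)     # one pass, no denoising loop
eps_rec = flow_inverse(action, emb) # invertibility check
assert np.allclose(eps, eps_rec)
```

A diffusion decoder would instead run this kind of network tens of times per action inside a denoising loop; the one-pass sample is what makes the flow attractive for high-frequency control. Real flows such as the one in NinA stack many coupling-style invertible layers rather than a single affine map.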
Similar Papers
Normalizing Flows are Capable Visuomotor Policy Learning Models
Robotics
Robots learn tasks faster and know when they're unsure.
SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows
CV and Pattern Recognition
Makes AI create clearer pictures from less data.
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
CV and Pattern Recognition
Teaches robots to do tasks by watching and listening.