OTARo: Once Tuning for All Precisions toward Robust On-Device LLMs
By: Shaoyuan Chen, Zhixuan Chen, Dawei Yang, and more
Potential Business Impact:
Lets on-device AI switch its numerical precision, and thus its speed, on the fly.
Fine-tuning techniques for Large Language Models (LLMs) not only improve adaptability to diverse downstream tasks but also mitigate the adverse effects of model quantization. Conventional quantization, however, suffers from a structural limitation that hinders flexibility during the fine-tuning and deployment stages. Practical on-device tasks demand different quantization precisions (i.e., different bit-widths); for example, understanding tasks tend to tolerate reduced precision better than generation tasks. Because conventional quantization typically relies on scaling factors that are incompatible across bit-widths, it fails to support on-device precision switching in complex real-world scenarios. To overcome this dilemma, we propose OTARo, a novel method that enables on-device LLMs to flexibly switch quantization precisions while maintaining performance robustness through a single fine-tuning pass. OTARo introduces Shared Exponent Floating Point (SEFP), a distinct quantization mechanism that produces different bit-widths through simple mantissa truncations of a single model. Moreover, to achieve bit-width robustness in downstream applications, OTARo performs a learning process over the losses induced by different bit-widths. The method involves two critical strategies: (1) Exploitation-Exploration Bit-Width Path Search (BPS), which iteratively updates the search path via a designed scoring mechanism; and (2) Low-Precision Asynchronous Accumulation (LAA), which performs asynchronous gradient accumulations and delayed updates under low bit-widths. Experiments on popular LLMs, e.g., LLaMA3.2-1B and LLaMA3-8B, demonstrate that OTARo achieves consistently strong and robust performance across all precisions.
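To make the shared-exponent idea concrete, the minimal sketch below quantizes a group of weights against a single shared exponent and derives a lower bit-width by truncating mantissa bits. The function names and format details (e.g., `sefp_quantize`, the signed-integer mantissa layout, the choice of shared exponent) are illustrative assumptions, not the paper's exact SEFP specification.

```python
import numpy as np

def sefp_quantize(weights, mantissa_bits):
    """Quantize a group of weights against one shared exponent (illustrative sketch).

    The shared exponent is taken from the largest magnitude in the group;
    each value is stored as a signed integer mantissa of `mantissa_bits` bits.
    """
    shared_exp = int(np.ceil(np.log2(np.max(np.abs(weights)) + 1e-12)))
    step = 2.0 ** (shared_exp - mantissa_bits)   # value of one mantissa unit
    max_mant = 2 ** mantissa_bits - 1
    mantissa = np.clip(np.round(weights / step), -max_mant, max_mant).astype(np.int32)
    return mantissa, shared_exp

def sefp_truncate(mantissa, from_bits, to_bits):
    """Switch to a lower precision by dropping low-order mantissa bits."""
    # Arithmetic right shift keeps the sign; no re-quantization is needed.
    return mantissa >> (from_bits - to_bits)

def sefp_dequantize(mantissa, shared_exp, mantissa_bits):
    """Reconstruct approximate float weights from the mantissas and shared exponent."""
    return mantissa.astype(np.float64) * 2.0 ** (shared_exp - mantissa_bits)

# Usage: quantize once at 8 mantissa bits, then derive a 4-bit variant on the fly.
w = np.random.randn(16)
m8, e = sefp_quantize(w, mantissa_bits=8)
m4 = sefp_truncate(m8, from_bits=8, to_bits=4)
w8 = sefp_dequantize(m8, e, mantissa_bits=8)
w4 = sefp_dequantize(m4, e, mantissa_bits=4)
```

Because every bit-width reads the same stored mantissas, a device can move between precisions without keeping separate per-bit-width scaling factors, which is the property the abstract contrasts with conventional scaling-factor quantization.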
Similar Papers
HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs
Machine Learning (CS)
Makes AI models train faster and smaller.
LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning
Machine Learning (CS)
Makes smart computer brains work on small devices.
LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits
Machine Learning (CS)
Makes AI learn with less computer power.