Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques
By: Sanjay Surendranath Girija, Shashank Kapoor, Lakshit Arora, and more
Potential Business Impact:
Shrinks large AI models so they can run on phones and other edge devices.
Large Language Models (LLMs) have revolutionized many areas of artificial intelligence (AI), but their substantial resource requirements limit their deployment on mobile and edge devices. This survey paper provides a comprehensive overview of techniques for compressing LLMs to enable efficient inference in resource-constrained environments. We examine three primary approaches: Knowledge Distillation, Model Quantization, and Model Pruning. For each technique, we discuss the underlying principles, present different variants, and provide examples of successful applications. We also briefly discuss complementary techniques such as mixture-of-experts and early-exit strategies. Finally, we highlight promising future directions, aiming to provide a valuable resource for both researchers and practitioners seeking to optimize LLMs for edge deployment.
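To give a flavor of one of the surveyed approaches, the sketch below shows naive symmetric 8-bit post-training weight quantization. It is a toy illustration under simplifying assumptions (per-tensor scaling, no calibration data), not the method of any specific paper covered by the survey; production toolkits typically use per-channel scales and calibration.

```python
# Minimal sketch of symmetric 8-bit post-training weight quantization.
# Illustrative only: real LLM quantization pipelines use calibration data,
# per-channel scales, and handle activations as well as weights.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 values plus a per-tensor scale factor."""
    scale = np.max(np.abs(weights)) / 127.0          # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for computation."""
    return q.astype(np.float32) * scale

# Example: a stand-in weight matrix for one linear layer of a model.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max reconstruction error:", np.max(np.abs(w - dequantize(q, s))))
```

The storage saving comes from keeping `q` (1 byte per weight) instead of the original float32 tensor (4 bytes per weight), at the cost of the small reconstruction error printed above.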
Similar Papers
Resource-Efficient Language Models: Quantization for Fast and Accessible Inference
Artificial Intelligence
Uses quantization to make language model inference faster and less power-hungry.
Energy-Aware LLMs: A step towards sustainable AI for downstream applications
Performance
Takes a step toward more energy-efficient, sustainable LLMs for downstream applications.
Optimizing Large Language Models: Metrics, Energy Efficiency, and Case Study Insights
Machine Learning (CS)
Cuts LLM energy use by almost half.