Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search
By: Kayhan Behdin, Qingquan Song, Sriram Vasudevan, and more
Potential Business Impact:
Makes smart search engines faster and cheaper.
Large Language Models (LLMs) have demonstrated impressive quality on predictive tasks such as relevance ranking and semantic search. However, deploying such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based, decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. In particular, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to 40% while maintaining accuracy. Additionally, we present context compression techniques that allow us to reduce the input context length by up to 10x with minimal loss of accuracy. Finally, we share practical lessons from optimizing the serving infrastructure for deploying such a system on GPUs at scale, serving millions of requests per second. Taken together, these optimizations increase our system's throughput by 10x in a real-world deployment while meeting our quality bar.
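To make the pruning idea concrete, below is a minimal, illustrative sketch of structured depth pruning for a decoder-only transformer, assuming a Hugging Face model. This is not the paper's actual method or model; the base checkpoint, which layers are dropped, and the 40% target are assumptions for illustration, and a pruned model would still need fine-tuning or distillation to recover quality.

```python
# Illustrative sketch only: depth (layer) pruning of a decoder-only transformer.
# The model name and the choice of dropped layers are assumptions, not the paper's method.
import torch
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical base model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Drop roughly 40% of the transformer blocks (here, a contiguous span of middle layers)
# to approximate a 40% reduction in model depth.
layers = model.model.layers
num_drop = int(0.4 * len(layers))
drop_start = 10  # assumed starting index for the dropped span
keep_indices = [i for i in range(len(layers)) if not (drop_start <= i < drop_start + num_drop)]

model.model.layers = torch.nn.ModuleList(layers[i] for i in keep_indices)
model.config.num_hidden_layers = len(model.model.layers)

# In practice, the pruned model would be fine-tuned or distilled on task data
# before being served, to recover any accuracy lost by removing layers.
```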
Similar Papers
Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges
Distributed, Parallel, and Cluster Computing
Smart computers work together for faster, private AI.
From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs
Computation and Language
Makes smart computer programs cheaper and faster.
A Survey on Collaborative Mechanisms Between Large and Small Language Models
Artificial Intelligence
Makes smart AI work on phones and less powerful devices.