Efficient Multi-Model Orchestration for Self-Hosted Large Language Models
By: Bhanu Prakash Vangala, Tanu Malik
Potential Business Impact:
Makes big AI programs run cheaper and faster.
Self-hosting large language models (LLMs) is increasingly appealing for organizations seeking privacy, cost control, and customization. Yet deploying and maintaining in-house models poses challenges in GPU utilization, workload routing, and reliability. We introduce Pick and Spin, a practical framework that makes self-hosted LLM orchestration scalable and economical. Built on Kubernetes, it integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module that balances cost, latency, and accuracy using both keyword heuristics and a lightweight DistilBERT classifier. We evaluate four models (Llama-3 90B, Gemma-3 27B, Qwen-3 235B, and DeepSeek-R1 685B) across eight public benchmark datasets, five inference strategies, and two routing variants, encompassing 31,019 prompts and 163,720 inference runs. Pick and Spin achieves up to 21.6% higher success rates, 30% lower latency, and 33% lower GPU cost per query compared with static deployments of the same models.
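To make the two-stage routing idea concrete, here is a minimal sketch in Python: cheap keyword heuristics short-circuit obvious prompts, and everything else falls back to a lightweight DistilBERT text classifier whose labels name models in the pool. The model names, keyword rules, and the ./distilbert-router checkpoint are illustrative assumptions, not artifacts from the paper.

```python
# Hypothetical sketch of a hybrid router: keyword heuristics first,
# then a fine-tuned DistilBERT classifier. Not the paper's code.
from transformers import pipeline

# Candidate self-hosted models, ordered roughly by GPU cost per query.
MODEL_POOL = ["gemma-3-27b", "llama-3-90b", "qwen-3-235b", "deepseek-r1-685b"]

# Stage 1: keyword heuristics that resolve obvious cases without a model call.
KEYWORD_RULES = {
    "prove": "deepseek-r1-685b",   # long-form reasoning -> largest model
    "translate": "gemma-3-27b",    # simple transformation -> smallest model
}

# Stage 2: a lightweight DistilBERT classifier (assumed fine-tuned so that
# its output labels coincide with the names in MODEL_POOL).
classifier = pipeline("text-classification", model="./distilbert-router")

def route(prompt: str) -> str:
    """Return the name of the pool model that should serve this prompt."""
    lowered = prompt.lower()
    for keyword, model in KEYWORD_RULES.items():
        if keyword in lowered:
            return model
    # Fall back to the learned router for everything the heuristics miss.
    prediction = classifier(prompt, truncation=True)[0]
    return prediction["label"]

print(route("Translate this sentence into French."))  # -> gemma-3-27b
```

Under this setup, routing cost stays near zero for heuristic hits, and only ambiguous prompts pay for a single DistilBERT forward pass before being dispatched.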
Similar Papers
Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture
Distributed, Parallel, and Cluster Computing
Makes smart computer programs run much faster.
Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems
Machine Learning (CS)
Makes computers run programs much faster.
Evaluating Open-Source Large Language Models for Technical Telecom Question Answering
Networking and Internet Architecture
Tests AI for phone network questions.