A Performance Analyzer for a Public Cloud's ML-Augmented VM Allocator
By: Roozbeh Bostandoost, Pooria Namyar, Siva Kesava Reddy Kakarla and more
Potential Business Impact:
Helps cloud computers run much better.
Many operational cloud systems use one or more machine learning models to improve efficiency and performance. But operators lack tools to help them understand how each model, and the interactions among the models, affect end-to-end system performance. SANJESH is such a tool. SANJESH supports a diverse set of performance-related queries, which we answer through a bi-level optimization. We invent novel mechanisms to solve this optimization more quickly. These techniques allow us to solve an optimization that prior work failed to solve even after 24 hours. As a proof of concept, we apply SANJESH to an example production system that uses multiple ML models to optimize virtual machine (VM) placement. These models affect how many servers the operator uses to host VMs and how often it has to live-migrate VMs because servers run out of resources. SANJESH finds scenarios where these models cause $\sim 4\times$ worse performance than what simulation-based approaches detect.
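For readers unfamiliar with the term, a bi-level optimization nests one optimization problem inside another. The generic textbook form is sketched below; the symbols $x$, $y$, $F$, $G$, $\mathcal{X}$, and $\mathcal{Y}(x)$ are illustrative assumptions and are not taken from the paper, whose actual formulation is not given in this summary.

$$
\max_{x \in \mathcal{X}} \; F\bigl(x, y^{*}\bigr)
\quad \text{s.t.} \quad
y^{*} \in \arg\min_{y \in \mathcal{Y}(x)} G(x, y)
$$

One plausible reading in this setting, stated as an assumption: the outer problem searches over inputs $x$ (for example, workloads and model predictions) for the worst end-to-end performance $F$, while the inner problem models how the system responds to those inputs (for example, the placement $y$ it chooses).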
Similar Papers
The SAP Cloud Infrastructure Dataset: A Reality Check of Scheduling and Placement of VMs in Cloud Computing
Distributed, Parallel, and Cluster Computing
Makes computer programs run faster and use less power.
CloudFormer: An Attention-based Performance Prediction for Public Clouds with Unknown Workload
Distributed, Parallel, and Cluster Computing
Predicts computer slowdowns in the cloud.
Machine learning-based cloud resource allocation algorithms: a comprehensive comparative review
Distributed, Parallel, and Cluster Computing
Makes computers use cloud power smarter and cheaper.