Quantifying Autoscaler Vulnerabilities: An Empirical Study of Resource Misallocation Induced by Cloud Infrastructure Faults
By: Gijun Park
Potential Business Impact:
Fixes computer clouds when they break.
Resource autoscaling mechanisms in cloud environments depend on accurate performance metrics to make optimal provisioning decisions. When infrastructure faults including hardware malfunctions, network disruptions, and software anomalies corrupt these metrics, autoscalers may systematically over- or under-provision resources, resulting in elevated operational expenses or degraded service reliability. This paper conducts controlled simulation experiments to measure how four prevalent fault categories affect both vertical and horizontal autoscaling behaviors across multiple instance configurations and service level objective (SLO) thresholds. Experimental findings demonstrate that storage-related faults generate the largest cost overhead, adding up to $258 monthly under horizontal scaling policies, whereas routing anomalies consistently bias autoscalers toward insufficient resource allocation. The sensitivity to fault-induced metric distortions differs markedly between scaling strategies: horizontal autoscaling exhibits greater susceptibility to transient anomalies, particularly near threshold boundaries. These empirically-grounded insights offer actionable recommendations for designing fault-tolerant autoscaling policies that distinguish genuine workload fluctuations from failure artifacts.
Similar Papers
An SLO Driven and Cost-Aware Autoscaling Framework for Kubernetes
Software Engineering
Makes computer programs run better and cheaper.
Experimentally Evaluating the Resource Efficiency of Big Data Autoscaling
Distributed, Parallel, and Cluster Computing
Autoscaling computers for jobs doesn't save resources.
Auto-scaling Approaches for Cloud-native Applications: A Survey and Taxonomy
Distributed, Parallel, and Cluster Computing
Makes apps run better by guessing what they need.