Automated Planning for Optimal Data Pipeline Instantiation
By: Leonardo Rosa Amado, Adriano Vogel, Dalvan Griebler, and more
Potential Business Impact:
Makes data processing faster and cheaper.
Data pipeline frameworks provide abstractions for implementing sequences of data-intensive transformation operators, automating the deployment and execution of such transformations in a cluster. Deploying a data pipeline, however, requires allocating computing resources in a data center, ideally minimizing the overhead of communicating data and executing operators in the pipeline while respecting each operator's execution requirements. In this paper, we model the problem of optimal data pipeline deployment as planning with action costs and propose heuristics aimed at minimizing total execution time. Experimental results indicate that the heuristics can outperform the baseline deployment and that a heuristic based on connections outperforms the other strategies.
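To make the deployment problem concrete, below is a minimal Python sketch of one way such a connection-aware placement could look. It is an illustrative assumption, not the paper's planner or heuristics: the Operator and Node structures, the capacity model, and the greedy scoring are all hypothetical, but they capture the trade-off the abstract describes between operator execution requirements and inter-node communication overhead.

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    cpu_demand: float                               # hypothetical execution cost
    upstream: list = field(default_factory=list)    # names of producer operators

@dataclass
class Node:
    name: str
    capacity: float                                 # remaining capacity on this cluster node

def place_pipeline(operators, nodes, comm_cost=1.0):
    """Greedy, connection-based placement sketch (an assumption, not the
    paper's algorithm): prefer nodes that already host an operator's upstream
    producers, so fewer connections cross node boundaries."""
    placement = {}  # operator name -> node name
    for op in operators:  # operators assumed to be topologically ordered
        candidates = [n for n in nodes if n.capacity >= op.cpu_demand]
        if not candidates:
            raise RuntimeError(f"no node can host {op.name}")

        def score(node):
            # Estimated added cost: execution plus a communication penalty for
            # every upstream producer placed on a different node.
            remote = sum(1 for u in op.upstream if placement.get(u) != node.name)
            return op.cpu_demand + comm_cost * remote

        best = min(candidates, key=score)
        best.capacity -= op.cpu_demand
        placement[op.name] = best.name
    return placement

# Example: a three-operator pipeline placed on a two-node cluster.
ops = [
    Operator("read", 1.0),
    Operator("transform", 2.0, upstream=["read"]),
    Operator("write", 1.0, upstream=["transform"]),
]
cluster = [Node("n1", 3.0), Node("n2", 4.0)]
print(place_pipeline(ops, cluster))
```

In this toy run, "read" and "transform" land on the same node because they are connected, while "write" spills to the second node once capacity runs out; a full planning-with-action-costs formulation would instead search over placements, but the connection-based preference is the intuition behind the heuristic the abstract reports as strongest.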
Similar Papers
PRE-Share Data: Assistance Tool for Resource-aware Designing of Data-sharing Pipelines
Social and Information Networks
Reuses data steps to save time and resources.
Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach
Machine Learning (CS)
Makes big AI models train faster and smarter.
Declarative Data Pipeline for Large Scale ML Services
Distributed, Parallel, and Cluster Computing
Builds better computer programs faster and smarter.