Performance and Stability of Barrier Mode Parallel Systems with Heterogeneous and Redundant Jobs

Published: December 16, 2025 | arXiv ID: 2512.14445v1

By: Brenton Walker, Markus Fidler

In some models of parallel computation, jobs are split into smaller tasks that can be executed completely asynchronously. In other situations, the parallel tasks have constraints that require them to synchronize their start times and possibly their departure times. This is true of many parallelized machine learning workloads, and the popular Apache Spark processing engine has recently added support for Barrier Execution Mode, which allows users to add such barriers to their jobs. These barriers necessarily result in idle periods on some of the workers, which reduces the stability and performance of the system compared to equivalent workloads with no barriers. In this paper we analyze the stability and performance penalties resulting from barriers. We include an analysis of the stability of $(s,k,l)$ barrier systems, which allow jobs to depart after $l$ out of $k$ of their tasks complete. We also derive and evaluate performance bounds for hybrid barrier systems servicing a mix of jobs, both with and without barriers, and with varying degrees of parallelism. For the purely 1-barrier case we compare the bounds and simulation results to benchmark data from a standalone Spark system. We study the overhead in the real system and, based on its distribution, attribute it to the dual event-driven and polling-driven mechanism used to schedule barrier-mode jobs. We develop a model for this type of overhead and validate it against the real system through simulation.
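To make the idle-worker penalty concrete, the following is a minimal sketch of a simplified barrier system, not the model analyzed in the paper. It assumes, purely for illustration, that $s$ is the number of workers, that every job needs $k$ idle workers to start, that task times are i.i.d. exponential with rate `mu`, that arrivals are Poisson with rate `lam`, and that a job releases all $k$ workers together once $l$ of its tasks finish (the remaining tasks are cancelled). All function names are hypothetical.

```python
import heapq
import random


def mean_lth_of_k(k, l, mu=1.0):
    """Mean of the l-th smallest of k i.i.d. Exp(mu) task times.

    Standard order-statistic result: (1/mu) * sum_{i=k-l+1}^{k} 1/i.
    """
    return sum(1.0 / i for i in range(k - l + 1, k + 1)) / mu


def max_stable_rate(s, k, l, mu=1.0):
    """Largest sustainable arrival rate under the assumed model.

    Only floor(s/k) jobs fit on s workers at once, and each holds its
    k workers for the l-th task completion time.
    """
    return (s // k) / mean_lth_of_k(k, l, mu)


def simulate_mean_sojourn(s, k, l, lam, mu=1.0, n_jobs=20000, seed=1):
    """Estimate the mean job sojourn time by FIFO simulation."""
    if s // k < 1:
        raise ValueError("need at least k workers")
    rng = random.Random(seed)
    # Each slot is a group of k workers; store the time it becomes free.
    slots = [0.0] * (s // k)
    now = 0.0
    total = 0.0
    for _ in range(n_jobs):
        now += rng.expovariate(lam)          # next Poisson arrival
        free_at = heapq.heappop(slots)       # earliest available slot
        start = max(now, free_at)            # start barrier: wait for k idle workers
        tasks = sorted(rng.expovariate(mu) for _ in range(k))
        depart = start + tasks[l - 1]        # leave after l of k tasks finish
        heapq.heappush(slots, depart)
        total += depart - now
    return total / n_jobs
```

Under these assumptions, `max_stable_rate(8, 4, 4)` evaluates to 0.96 jobs per unit time. The same workload without barriers, where workers pick up tasks independently, would be limited only by total work, giving `s * mu / k = 2.0`. Lowering `l` below `k` shortens the time each job holds its workers and so raises the barrier system's limit.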

Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs

Distributed, Parallel, and Cluster Computing

Makes AI models train much faster on computers.

4 Nov 2025 2

84%

The Merit of Simple Policies: Buying Performance With Parallelism and System Architecture

Distributed, Parallel, and Cluster Computing

Makes computer jobs finish faster with smart server setups.

20 Mar 2025 0

84%

Parallel/Distributed Tabu Search for Scheduling Microprocessor Tasks in Hybrid Flowshop

Distributed, Parallel, and Cluster Computing

Makes factory jobs finish faster using smart computer rules.

14 Sep 2025 0

