
Batching-Aware Joint Model Onloading and Offloading for Hierarchical Multi-Task Inference

Published: August 18, 2025 | arXiv ID: 2508.13380v1

By: Seohyeon Cha, Kevin Chan, Gustavo de Veciana and more

Potential Business Impact:

Lets phones do many smart jobs at once.

The growing demand for intelligent services on resource-constrained edge devices has spurred the development of collaborative inference systems that distribute workloads across end devices, edge servers, and the cloud. While most existing frameworks focus on single-task, single-model scenarios, many real-world applications (e.g., autonomous driving and augmented reality) require concurrent execution of diverse tasks including detection, segmentation, and depth estimation. In this work, we propose a unified framework to jointly decide which multi-task models to deploy (onload) at clients and edge servers, and how to route queries across the hierarchy (offload) to maximize overall inference accuracy under memory, compute, and communication constraints. We formulate this as a mixed-integer program and introduce J3O (Joint Optimization of Onloading and Offloading), an alternating algorithm that (i) greedily selects models to onload via Lagrangian-relaxed submodular optimization and (ii) determines optimal offloading via constrained linear programming. We further extend J3O to account for batching at the edge, maintaining scalability under heterogeneous task loads. Experiments show J3O consistently achieves over $97\%$ of the optimal accuracy while incurring less than $15\%$ of the runtime required by the optimal solver across multi-task benchmarks.
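To make the onloading step (i) more concrete, here is a minimal sketch of a greedy selection with a Lagrangian-style memory penalty. It is an illustration under stated assumptions, not the paper's J3O algorithm: the objective (`served_accuracy`, a load-weighted best-model accuracy, which is monotone submodular), the function and model names, the penalty weight `lam`, and all numbers are hypothetical. The offloading step (ii) and edge batching are not modeled here.

```python
from typing import Dict, Set

# Hypothetical candidate models: memory cost and per-task accuracy (toy numbers).
MODELS: Dict[str, dict] = {
    "det_small": {"mem": 1.0, "acc": {"detection": 0.6}},
    "seg_small": {"mem": 1.0, "acc": {"segmentation": 0.5}},
    "multi_large": {"mem": 3.0,
                    "acc": {"detection": 0.8, "segmentation": 0.7, "depth": 0.6}},
}
# Hypothetical query load per task (queries per unit time).
LOAD: Dict[str, float] = {"detection": 10, "segmentation": 5, "depth": 2}


def served_accuracy(selected: Set[str], models: Dict[str, dict],
                    load: Dict[str, float]) -> float:
    """Load-weighted accuracy when each task is served by its best selected model."""
    total = 0.0
    for task, rate in load.items():
        best = max((models[m]["acc"].get(task, 0.0) for m in selected), default=0.0)
        total += rate * best
    return total


def greedy_onload(models: Dict[str, dict], load: Dict[str, float],
                  mem_budget: float, lam: float) -> Set[str]:
    """Greedily add the model with the largest penalized marginal gain.

    The score is (accuracy gain) - lam * (memory cost); selection stops when no
    feasible model has a positive score.
    """
    selected: Set[str] = set()
    used = 0.0
    while True:
        base = served_accuracy(selected, models, load)
        best_name, best_score = None, 0.0
        for name, spec in models.items():
            # Skip models already chosen or that would exceed the memory budget.
            if name in selected or used + spec["mem"] > mem_budget:
                continue
            gain = served_accuracy(selected | {name}, models, load) - base
            score = gain - lam * spec["mem"]
            if score > best_score:
                best_name, best_score = name, score
        if best_name is None:
            return selected
        selected.add(best_name)
        used += models[best_name]["mem"]
```

With these toy numbers, `greedy_onload(MODELS, LOAD, mem_budget=3.0, lam=0.5)` selects only the large multi-task model, while a tighter budget of 2.0 falls back to the two small single-task models. In an alternating scheme like the one the abstract describes, a selection of this kind would be followed by a routing step that decides which queries are offloaded up the hierarchy.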

Joint Optimization of Offloading, Batching and DVFS for Multiuser Co-Inference

Distributed, Parallel, and Cluster Computing

Saves phone battery by sharing tasks with a server.

20 Apr 2025

88%

MoA-Off: Adaptive Heterogeneous Modality-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference

Distributed, Parallel, and Cluster Computing

Makes smart computer programs run faster on phones.

21 Sep 2025

88%

Rethinking Inference Placement for Deep Learning across Edge and Cloud Platforms: A Multi-Objective Optimization Perspective and Future Directions

Distributed, Parallel, and Cluster Computing

Makes smart apps run faster and safer.

27 Oct 2025


Page Count

10 pages
