HybridFlow: Adaptive Task Scheduling for Fast and Token-Efficient LLM Inference in Edge-Cloud Collaboration
By: Jiangwen Dong, Jiayu Li, Wanyu Lin
Potential Business Impact:
Splits AI reasoning tasks between phone and cloud.
Large language models (LLMs) exhibit impressive reasoning and problem-solving abilities, yet their substantial inference latency and token consumption pose major challenges for real-time deployment on resource-limited edge devices. Recent efforts toward edge-cloud collaboration have attempted to mitigate this issue, but most existing methods adopt coarse-grained task allocation strategies that assign entire queries either to the edge or to the cloud. Such rigid partitioning fails to exploit fine-grained reasoning parallelism and often leads to redundant computation and inefficient resource utilization. To address this, we propose HybridFlow, a resource-adaptive inference framework that enables fast and token-efficient collaborative reasoning between edge and cloud LLMs. HybridFlow operates in two stages: (1) task decomposition and parallel execution, which dynamically splits a complex query into interdependent subtasks that can execute as soon as their dependencies are resolved; and (2) resource-aware subtask routing, where a learned router adaptively assigns each subtask to the edge or cloud model according to predicted utility gains and real-time budget states. Comprehensive evaluations on GPQA, MMLU-Pro, AIME, and LiveBench-Reasoning demonstrate that HybridFlow effectively reduces end-to-end inference time and overall token usage while maintaining competitive accuracy.
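The sketch below illustrates how the two stages described in the abstract could fit together, assuming hypothetical edge_llm and cloud_llm calls and a per-subtask difficulty score as a stand-in for the learned router's predicted utility gain. The paper's actual decomposition and routing are learned components and are not reproduced here; this only shows the control flow of dependency-triggered parallel execution plus budget-aware routing.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Subtask:
    name: str
    prompt: str
    deps: list[str] = field(default_factory=list)  # prerequisite subtask names
    difficulty: float = 0.5                        # router feature in [0, 1] (assumed)

async def edge_llm(prompt: str) -> str:
    """Stand-in for a small on-device model call (hypothetical)."""
    await asyncio.sleep(0.1)
    return f"[edge] answer to: {prompt}"

async def cloud_llm(prompt: str) -> str:
    """Stand-in for a large cloud model call (hypothetical)."""
    await asyncio.sleep(0.3)
    return f"[cloud] answer to: {prompt}"

def route(task: Subtask, cloud_calls_left: int) -> str:
    """Toy proxy for the learned router: escalate to the cloud only when
    the predicted utility gain is high and budget remains."""
    return "cloud" if task.difficulty > 0.7 and cloud_calls_left > 0 else "edge"

async def hybrid_flow(tasks: list[Subtask], cloud_calls_left: int = 2) -> dict[str, str]:
    finished = {t.name: asyncio.Event() for t in tasks}
    results: dict[str, str] = {}

    async def run(task: Subtask) -> None:
        nonlocal cloud_calls_left
        # Stage 1: a subtask starts as soon as all of its dependencies resolve.
        for dep in task.deps:
            await finished[dep].wait()
        # Stage 2: route to edge or cloud under the shared budget.
        if route(task, cloud_calls_left) == "cloud":
            cloud_calls_left -= 1  # charge the shared budget
            results[task.name] = await cloud_llm(task.prompt)
        else:
            results[task.name] = await edge_llm(task.prompt)
        finished[task.name].set()

    await asyncio.gather(*(run(t) for t in tasks))
    return results

if __name__ == "__main__":
    plan = [
        Subtask("recall", "List relevant facts.", difficulty=0.2),
        Subtask("derive", "Do the hard derivation.", deps=["recall"], difficulty=0.9),
        Subtask("check", "Sanity-check the result.", deps=["derive"], difficulty=0.3),
    ]
    print(asyncio.run(hybrid_flow(plan)))
```

Note that this toy budget counts cloud calls, whereas the paper tracks token budgets in real time, and the routing threshold here replaces a trained predictor of per-subtask utility.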
Similar Papers
Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges
Distributed, Parallel, and Cluster Computing
Smart computers work together for faster, private AI.
Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models
Distributed, Parallel, and Cluster Computing
Makes big AI models run on many computers.
Hyperion: Hierarchical Scheduling for Parallel LLM Acceleration in Multi-tier Networks
Distributed, Parallel, and Cluster Computing
Makes AI answer questions faster on different devices.