Near-Zero-Overhead Freshness for Recommendation Systems via Inference-Side Model Updates
By: Wenjun Yu, Sitian Chen, Cheng Chen, and more
Potential Business Impact:
Keeps online suggestions fresh and accurate.
Deep Learning Recommendation Models (DLRMs) underpin personalized services but face a critical freshness-accuracy tradeoff due to massive parameter synchronization overheads. Production DLRMs run on decoupled training/inference clusters, where synchronizing petabyte-scale embedding tables (EMTs) causes multi-minute staleness, degrading recommendation quality and revenue. We observe that (1) inference nodes exhibit sustained CPU underutilization (peak <= 20%), and (2) EMT gradients possess intrinsic low-rank structure, enabling compact update representation. We present LiveUpdate, a system that eliminates inter-cluster synchronization by colocating Low-Rank Adaptation (LoRA) trainers within inference nodes. LiveUpdate addresses two core challenges: (1) dynamic rank adaptation via singular value monitoring to constrain memory overhead (<2% of EMTs), and (2) NUMA-aware resource scheduling with hardware-enforced QoS to eliminate update-inference contention (P99 latency impact <20ms). Evaluations show LiveUpdate reduces update costs by 2x versus delta-update baselines while achieving higher accuracy within 1-hour windows. By transforming idle inference resources into freshness engines, LiveUpdate delivers online model updates while outperforming state-of-the-art delta-update methods by 0.04% to 0.24% in accuracy.
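To make the core idea concrete, the sketch below shows how a LoRA-style low-rank refresh of an embedding-table shard might look, with the rank chosen by monitoring singular values of the accumulated gradient. This is an illustrative approximation only, not LiveUpdate's actual implementation: the function names (choose_rank, low_rank_update), the 0.99 spectral-energy threshold, the rank cap, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

# Illustrative sketch (not the paper's code): compress an EMT gradient into
# low-rank factors and apply it on the inference node, instead of shipping
# the full embedding table from a separate training cluster.

def choose_rank(singular_values: np.ndarray, energy: float = 0.99, max_rank: int = 32) -> int:
    """Pick the smallest rank whose singular values capture `energy` of the
    gradient's spectral energy (assumed threshold; the paper monitors
    singular values to bound memory overhead)."""
    cum = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    return min(int(np.searchsorted(cum, energy)) + 1, max_rank)

def low_rank_update(emt: np.ndarray, grad: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """Factor the dense gradient as A @ B with adaptively chosen rank r,
    then apply the compact update to the EMT shard."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    r = choose_rank(s)
    A = u[:, :r] * s[:r]          # (num_rows, r)
    B = vt[:r, :]                 # (r, emb_dim)
    return emt - lr * (A @ B)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 10k-row, 64-dim EMT shard.
    emt = rng.standard_normal((10_000, 64)).astype(np.float32)
    # Synthetic gradient with intrinsic low rank (~4), mirroring the observation
    # that EMT gradients admit a compact representation.
    grad = (rng.standard_normal((10_000, 4)) @ rng.standard_normal((4, 64))).astype(np.float32)
    emt = low_rank_update(emt, grad)
    print("refreshed EMT shard:", emt.shape)
```

In this sketch the update costs O(r * (num_rows + emb_dim)) memory instead of the full num_rows * emb_dim table, which is the intuition behind keeping overhead under a small fraction of the EMT; the actual system's rank-adaptation policy and scheduling are described in the paper.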
Similar Papers
Deep Recommender Models Inference: Automatic Asymmetric Data Flow Optimization
Distributed, Parallel, and Cluster Computing
Makes AI recommend things much faster.
Reuse, Don't Recompute: Efficient Large Reasoning Model Inference via Memory Orchestration
Multiagent Systems
Lets computers remember answers to save time.