Lotus: Optimizing Disaggregated Transactions with Disaggregated Locks
By: Zhisheng Hu, Pengfei Zuo, Junliang Hu and more
Disaggregated memory (DM) separates compute and memory resources, allowing flexible scaling to achieve high resource utilization. To ensure atomic and consistent data access on DM, distributed transaction systems have been adapted, where compute nodes (CNs) rely on one-sided RDMA operations to access remote data in memory nodes (MNs). However, we observe that in existing transaction systems, the RDMA network interface cards at MNs become a primary performance bottleneck. This bottleneck arises from the high volume of one-sided atomic operations used for locks, which hinders the system's ability to scale efficiently. To address this issue, this paper presents Lotus, a scalable distributed transaction system with lock disaggregation on DM. The key innovation of Lotus is to disaggregate locks from data and execute all locks on CNs, thus eliminating the bottleneck at MN RNICs. To achieve efficient lock management on CNs, Lotus employs an application-aware lock management mechanism that leverages the locality of the OLTP workloads to shard locks while maintaining load balance. To ensure consistent transaction processing with lock disaggregation, Lotus introduces a lock-first transaction protocol, which separates the locking phase as the first step in each read-write transaction execution. This protocol allows the system to determine the success of lock acquisitions early and proactively abort conflicting transactions, improving overall efficiency. To tolerate lock loss during CN failures, Lotus employs a lock-rebuild-free recovery mechanism that treats locks as ephemeral and avoids their reconstruction, ensuring lightweight recovery for CN failures. Experimental results demonstrate that Lotus improves transaction throughput by up to 2.1$\times$ and reduces latency by up to 49.4% compared to state-of-the-art transaction systems on DM.
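The lock-first idea described above can be illustrated with a small single-process sketch. It is not the Lotus implementation: the names `LockShard`, `Cluster`, `shard_for`, and `run_txn` are hypothetical, integer keys are assumed, the modulo placement is only a stand-in for the application-aware sharding the abstract mentions, and a plain dictionary stands in for data held on memory nodes. It shows only the ordering: acquire every lock on the CN-side lock tables first, abort early on a conflict, and touch data only after all locks are held.

```python
from dataclasses import dataclass, field


@dataclass
class LockShard:
    """Lock table held in one compute node's memory (hypothetical structure)."""
    owners: dict = field(default_factory=dict)  # key -> id of the holding transaction

    def try_lock(self, key, txn_id):
        # Take the lock if it is free; succeed only if txn_id is the holder.
        return self.owners.setdefault(key, txn_id) == txn_id

    def unlock(self, key, txn_id):
        # Release only if txn_id actually holds the lock.
        if self.owners.get(key) == txn_id:
            del self.owners[key]


class Cluster:
    def __init__(self, num_cns):
        self.shards = [LockShard() for _ in range(num_cns)]
        self.memory = {}  # stand-in for records stored on memory nodes

    def shard_for(self, key):
        # Assumption: integer keys and modulo placement, chosen for brevity.
        return self.shards[key % len(self.shards)]

    def run_txn(self, txn_id, write_set):
        """Run a read-write transaction; return True on commit, False on abort."""
        acquired = []
        # Phase 1 (lock first): acquire every lock before touching any data.
        for key in sorted(write_set):
            if self.shard_for(key).try_lock(key, txn_id):
                acquired.append(key)
            else:
                # Conflict detected early: release what we hold and abort.
                for held in acquired:
                    self.shard_for(held).unlock(held, txn_id)
                return False
        # Phase 2: all locks are held, so the writes can be applied.
        for key, value in write_set.items():
            self.memory[key] = value
        # Phase 3: release the locks.
        for key in acquired:
            self.shard_for(key).unlock(key, txn_id)
        return True
```

In this sketch a transaction that loses a lock conflict aborts before any data access, which is the early-abort behaviour the abstract attributes to the lock-first protocol; failure handling and RDMA access are left out entirely.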