Mainframes running z/OS can support RDMA protocols. Using this capability, distributed database systems may actually scale linearly.
1. What is the problem?
Scaling distributed transactions is a challenge. Data placement and hardware provisioning decisions often create bottlenecks, and distributed systems struggle to scale linearly because the CPU spends cycles processing network messages.
2. Why is it important?
Any administrator who runs a distributed system in a cluster wants to get the most performance out of their hardware. Specifically, modern data-driven, high-visibility applications need high-volume throughput. Such throughput can rarely be achieved on a single node - so a cluster is required. Moving to a clustered environment, however, can introduce transaction aborts, contention, or even deadlocks. A solution is needed that mitigates these concerns.
3. Why is it hard?
In the Network-Attached Memory (NAM) architecture, memory servers hold all the data and compute servers execute transactions on the stored data. Though compute servers do not store the data, they must ensure that each transaction has ACID properties. This strict separation of responsibilities is a core intuition of the paper, and it is challenging because the compute servers take over duties that storage servers traditionally own - the memory servers are essentially treated as dumb storage. By using one-sided RDMA operations as much as possible, NAM-DB avoids spending memory-server CPU cycles on message handling, because the compute servers are given byte-level access to the memory servers' memory.
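To make the division of labor concrete, here is a minimal toy sketch in Python (illustrative names only - the real system issues one-sided RDMA READ/WRITE verbs over InfiniBand rather than method calls): memory servers expose raw byte buffers and run no transaction code, while compute servers read and write those buffers directly and execute all transaction logic themselves.

# Toy model of the NAM split: memory servers are passive byte stores,
# compute servers hold all transaction logic. Names are hypothetical,
# not NAM-DB's API; real one-sided RDMA is handled entirely by the NIC,
# so the memory server's CPU is never involved.

class MemoryServer:
    """Passive node: exposes a byte-addressable buffer, runs no transaction code."""
    def __init__(self, size: int):
        self.buf = bytearray(size)

    def rdma_read(self, offset: int, length: int) -> bytes:
        return bytes(self.buf[offset:offset + length])

    def rdma_write(self, offset: int, data: bytes) -> None:
        self.buf[offset:offset + len(data)] = data


class ComputeServer:
    """Active node: executes transactions against remote memory it treats as 'dumb'."""
    def __init__(self, memory_servers):
        self.mem = memory_servers

    def read_record(self, server_id: int, offset: int, length: int) -> bytes:
        return self.mem[server_id].rdma_read(offset, length)

    def write_record(self, server_id: int, offset: int, data: bytes) -> None:
        self.mem[server_id].rdma_write(offset, data)


mem = [MemoryServer(1 << 20) for _ in range(2)]   # two memory servers
compute = ComputeServer(mem)
compute.write_record(0, 0, b"balance=100")
print(compute.read_record(0, 0, 11))              # b'balance=100'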
4. Why existing solutions do not work?
Previous solutions rely on partitioning schemes and other developer-driven data-placement constructs. However, when transactions cross partitions, the database system can suffer unpredictable performance and higher transaction abort rates. Also, if partitioning is handled at the database administrator (DBA) level, the developer might not be able to implement a sharding design. Techniques such as co-partitioning, which partition two tables the same way on the same key to improve performance, help with provisioning but do not scale linearly.
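As an aside, a small hypothetical sketch of co-partitioning (generic Python, not the paper's code): both tables are partitioned by the same hash of the same key, so rows that join with each other always land on the same partition and the join stays local.

# Illustrative co-partitioning: ORDERS and ORDER_LINES are partitioned
# by the same hash of the same key (order_id), so a join on order_id
# never has to cross partitions. Table contents are made up.

NUM_PARTITIONS = 4

def partition_of(key: int) -> int:
    return hash(key) % NUM_PARTITIONS

orders      = [{"order_id": i, "customer": f"c{i % 3}"} for i in range(10)]
order_lines = [{"order_id": i, "item": f"item{i}"}      for i in range(10)]

parts_orders = [[] for _ in range(NUM_PARTITIONS)]
parts_lines  = [[] for _ in range(NUM_PARTITIONS)]

for row in orders:
    parts_orders[partition_of(row["order_id"])].append(row)
for row in order_lines:
    parts_lines[partition_of(row["order_id"])].append(row)

# Every join partner lives in the same partition: the join is local.
for p in range(NUM_PARTITIONS):
    ids_o = {r["order_id"] for r in parts_orders[p]}
    ids_l = {r["order_id"] for r in parts_lines[p]}
    assert ids_o == ids_l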
Previous research on distributed transactions cited issues with contention and the high CPU overhead of processing network messages. Also, on slower networks, latency was a bottleneck that reduced transaction throughput.
5. What is the core intuition for the solution?
The authors' solution, NAM-DB, uses Remote Direct Memory Access (RDMA) to bypass the CPUs of other nodes within a cluster when transferring data from one node to another. With modern RDMA networks, bandwidth is similar to that between CPU and RAM - without this, the premise of the paper would be unachievable. The authors also propose design guidance for using this approach, since a simple migration from previous designs does not automatically guarantee performance improvements. Their algorithms support Snapshot Isolation (SI), handling read-only queries without read-set validation by using a scalable global timestamp counter, and NAM-DB treats all transactions as distributed by default.
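A heavily simplified, single-process sketch of the SI idea (illustrative code, not NAM-DB's RDMA-based protocol, and write-write conflict detection is omitted): readers take a snapshot timestamp from a global counter and never validate, while writers install new versions under a commit timestamp.

# Simplified snapshot isolation: a global counter hands out timestamps,
# records are kept as version lists, and read-only transactions simply read
# the newest version <= their snapshot timestamp -- no validation needed.
# Teaching sketch only; NAM-DB fetches timestamps and versions via
# one-sided RDMA and also detects write-write conflicts, omitted here.
import itertools

global_counter = itertools.count(1)          # stand-in for the timestamp oracle
store = {}                                   # key -> list of (commit_ts, value)

def begin():
    """Take a snapshot (read) timestamp."""
    return next(global_counter)

def read(key, snapshot_ts):
    """Return the newest version visible to this snapshot."""
    versions = store.get(key, [])
    visible = [(ts, v) for ts, v in versions if ts <= snapshot_ts]
    return max(visible)[1] if visible else None

def commit(writes):
    """Install all writes of a transaction under one commit timestamp."""
    commit_ts = next(global_counter)
    for key, value in writes.items():
        store.setdefault(key, []).append((commit_ts, value))
    return commit_ts

commit({"x": 1})
ts = begin()           # snapshot taken here
commit({"x": 2})       # a concurrent writer commits afterwards
print(read("x", ts))   # still prints 1: the reader sees its snapshot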
6. Does the paper prove its claims?
The results show that NAM-DB scales nearly linearly with the number of servers, reaching 3.64 million distributed transactions per second on 56 machines. This grows to 6.5 million TPC-C transactions per second if the system takes advantage of locality. The authors explain how the NAM-DB implementation executes and commits transactions, describing the roles of both compute and memory servers. The authors do a good job discussing the challenges in NAM-DB, such as transaction aborts due to stale snapshots and slow workers.
Further, the authors show that 2-sided RDMA (the previous approach) does not scale with the number of servers. The metric for scalability is throughput in millions of transactions per second, which should correlate positively with cluster size. In the case of 2-sided RDMA it does not - throughput actually decreases. With NAM-DB, however, throughput increases linearly.
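One way to quantify "scales linearly" is a scaling-efficiency calculation like the sketch below. The intermediate throughput numbers are made up for illustration; only the 56-node figure is taken from the result quoted above.

# Quantifying linear scaling: efficiency relative to the smallest cluster.
# Intermediate numbers are illustrative, not the paper's measurements.
measurements = {2: 0.13, 8: 0.52, 16: 1.02, 32: 2.05, 56: 3.64}  # nodes -> M txn/s

base_nodes, base_tput = min(measurements.items())
for nodes, tput in sorted(measurements.items()):
    ideal = base_tput * nodes / base_nodes        # perfect linear scale-up
    efficiency = tput / ideal
    print(f"{nodes:2d} nodes: {tput:.2f} M txn/s, {efficiency:.0%} of linear")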
7. What is the setup of analysis/experiments? is it sufficient?
The authors' experiments use the TPC-C benchmark to show that the system scales linearly to over 6.5 million new-order distributed transactions per second on 56 machines. TPC-C is an industry-standard set of transactions and data. The authors also tested different workload distributions and did see a decrease in throughput with highly skewed data, demonstrating an increased rate of contention.
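For intuition, skewed access is typically modeled with a Zipf-like distribution, as in the generic sketch below (not the authors' TPC-C driver): as the skew factor grows, a few hot keys absorb most accesses, which is exactly where concurrent transactions collide and abort.

# Skewed key access: with a Zipf-like distribution, a few "hot" keys absorb
# most accesses, so concurrent transactions conflict more often.
import random
from collections import Counter

def zipf_like(num_keys: int, skew: float, n_samples: int, seed: int = 0):
    rng = random.Random(seed)
    weights = [1.0 / (k ** skew) for k in range(1, num_keys + 1)]
    return rng.choices(range(num_keys), weights=weights, k=n_samples)

for skew in (0.0, 0.9, 1.5):
    accesses = Counter(zipf_like(num_keys=1000, skew=skew, n_samples=100_000))
    hottest_share = accesses.most_common(1)[0][1] / 100_000
    print(f"skew={skew}: hottest key gets {hottest_share:.1%} of accesses")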
The memory servers used Intel Xeon E5-2660 8-core processors with 256 GB RAM. The compute servers used Intel Xeon E7-4820 8-core processors with 128 GB RAM. The authors scaled the number of servers from 2 to 56, adding compute and memory servers evenly. The difference in hardware emphasizes the intended use of each server type.
8. Are there any gaps in the logic/proof?
When NAM-DB exploits locality, there is a negligible increase in throughput between 40 and 48 nodes. This plateau is not addressed. Since NAM-DB is network-bandwidth bound, it does not make sense that the linear increase would resume above 48 nodes if a network limit had been reached. This observation is not explained by any of the subsequent figures either.
9. Describe at least one possible next step.
One possible next step would be to vary the node hardware. The current tests used uniform hardware within each server role across 56 machines, but not all users can buy all of their hardware at once, so real clusters are often heterogeneous.
As better NIC hardware comes out (specifically NICs with larger caches), the authors should re-run and compare the experiments. While the top-line results are expected to increase, it would be interesting to see whether the slope increases as well. This might help give a sense of how much contention is driven by network latency.
BibTeX Citation
@article{zamanian2017end,
  title     = {The end of a myth: distributed transactions can scale},
  author    = {Zamanian, Erfan and Binnig, Carsten and Harris, Tim and Kraska, Tim},
  journal   = {Proceedings of the VLDB Endowment},
  volume    = {10},
  number    = {6},
  pages     = {685--696},
  year      = {2017},
  publisher = {VLDB Endowment}
}