RUMA (Rewired User-space Memory Access) enables physiological data management by letting developers freely rewire the mappings from virtual to physical memory at run-time.
1. What is the problem?
Programs that rely on standard memory management interfaces (such as malloc and free) rarely consider the effect of their allocation patterns on the system. For example, it is unclear whether a malloc is served from pooled memory (cheap) or from fresh pages (expensive); a fresh page requires the kernel to interrupt the program and install a newly zeroed physical page. When developers implement their own memory pooling, they try to obtain large consecutive memory regions, but such large blocks sacrifice flexibility.
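To make the cost of a fresh page concrete, here is a minimal C sketch (not taken from the paper): the anonymous allocation itself returns immediately, but every first write triggers a page fault in which the kernel hands out a zeroed physical page.

    #define _DEFAULT_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Touch n_pages freshly allocated pages; each first write below causes
     * one soft page fault in which the kernel installs a zeroed page. */
    static void touch_fresh_pages(size_t n_pages, size_t page)
    {
        char *p = mmap(NULL, n_pages * page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return;

        for (size_t i = 0; i < n_pages; i++)
            p[i * page] = 1;   /* first touch: kernel interrupts the program */

        munmap(p, n_pages * page);
    }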
2. Why is it important?
Reducing system overhead in this way enables better throughput. Combining algorithmic and system efficiencies effectively moves the Pareto frontier closer to both the read-optimized and the write-optimized asymptotes. Additionally, by manipulating the mapping of virtual to physical pages at run-time, it is easier to form contiguous memory regions and therefore reduce access overhead. In its current implementation, RUMA requires no changes to the operating system or the kernel, meaning it can be incorporated into existing database systems.
3. Why is it hard?
For RUMA to work, it has to cooperate with the memory mappings maintained by the operating system and exploit them in a way that benefits higher-level tasks. Rewiring has to be compatible with the main data structures and with operations such as resizing; copying, allocation, and storage all have to be supported. Many OLAP databases run long queries against snapshots, and these isolated copies need to take advantage of the reused physical pages as well.
4. Why do existing solutions not work?
Prior solutions require modifying the operating system to expose a page-allocation interface. While that would let developers allocate and map physical pages directly, the authors of RUMA work around the limitation by mapping main-memory files available through Linux. By holding a handle to such a file, they obtain a programmatic representation of physical memory. This approach relies only on existing operating-system features but, by itself, is not fully compatible with the hardware's translation caching (the TLB).
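As a rough sketch of this idea, the following C snippet obtains a handle to a main-memory file. The paper maps files from a main-memory filesystem; the Linux-specific memfd_create is used here only as a convenient stand-in, and the file name and function name are illustrative.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Return a handle to a file that lives entirely in main memory; its
     * offsets act as a programmatic stand-in for physical pages. */
    static int create_memory_file(size_t bytes)
    {
        int fd = memfd_create("ruma_pool", 0);
        if (fd < 0) {
            perror("memfd_create");
            return -1;
        }
        /* Size the file so physical backing can be claimed page by page. */
        if (ftruncate(fd, (off_t)bytes) != 0) {
            perror("ftruncate");
            close(fd);
            return -1;
        }
        return fd;
    }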
5. What is the core intuition for the solution?
To use the memory file system, the authors also have to give the file a main-memory mapping. Linux provides the mmap call for this. When called with NULL as the first parameter, the kernel chooses the address, which is guaranteed to start at a page boundary. Such a mapping can later be rewired onto existing mappings. While establishing a mapping is relatively costly, keeping a reference to the virtual page handle ensures it has to be done only once; accessing the page from then on is no different from accessing normal virtual memory.
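A minimal sketch of these two steps, assuming fd is a main-memory file of at least two pages and page is the system page size; the MAP_FIXED remap over the existing shared mapping stands in for the rewiring operation.

    #include <stddef.h>
    #include <sys/mman.h>

    static char *map_and_rewire(int fd, size_t page)
    {
        /* NULL lets the kernel pick the address; the result is guaranteed
         * to start at a page boundary. */
        char *v = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (v == MAP_FAILED)
            return NULL;

        /* Rewire: re-point the first virtual page at different physical
         * backing (file offset `page`) without copying any data.
         * MAP_FIXED replaces the existing mapping in place. */
        if (mmap(v, page, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, (off_t)page) == MAP_FAILED)
            return NULL;

        return v;
    }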
One of the important advances of rewiring is reference swapping. A conventional swap requires a three-way exchange through a helper page, which entails a physical page copy. With rewiring, no data is copied and no memory is allocated: the references are simply exchanged. This works because virtual memory already provides a level of indirection, and algorithms built around swaps can exploit it directly to improve their efficiency.
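To make the contrast concrete, here is a hypothetical page-granular swap by rewiring, assuming both virtual pages are backed by the same main-memory file: the two MAP_FIXED calls simply exchange the file offsets behind the addresses, so no helper page and no byte copy are needed.

    #include <stddef.h>
    #include <sys/mman.h>

    /* Swap the data visible at two page-aligned addresses a and b, which
     * are currently backed by file offsets off_a and off_b of fd. */
    static int rewired_swap(void *a, void *b, int fd,
                            off_t off_a, off_t off_b, size_t page)
    {
        if (mmap(a, page, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, off_b) == MAP_FAILED)
            return -1;
        if (mmap(b, page, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, off_a) == MAP_FAILED)
            return -1;
        /* a now shows the data previously visible at b, and vice versa. */
        return 0;
    }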
6. Does the paper prove its claims?
The authors conducted a set of micro-benchmarks to understand the impact of virtual memory and rewiring (and to test their implementation). They compare direct access to memory (unfaulted physical pages) with software-indirected memory (a pool of faulted virtual pages reached through an indirection layer), and finally with rewired memory (faulted pages of the pool mapped through a main-memory file). These tests show the influence of mapping and access on performance. Rewired memory is shown to outperform the other two methods except for randomly mapped small pages, where most of the performance loss is due to allocation runtime.
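The difference between the access paths can be sketched as follows; the function names and page geometry are illustrative and not taken from the benchmark code.

    #include <stddef.h>
    #include <stdint.h>

    #define INTS_PER_PAGE 512   /* assuming 4 KiB pages of 8-byte integers */

    /* Direct and rewired memory: one contiguous virtual range, so access
     * is plain array indexing. */
    static inline int64_t read_direct(const int64_t *data, size_t i)
    {
        return data[i];
    }

    /* Software-indirected memory: pooled pages reached through a directory
     * of page pointers, which adds one extra memory hop per access. */
    static inline int64_t read_indirected(int64_t *const *dir, size_t i)
    {
        return dir[i / INTS_PER_PAGE][i % INTS_PER_PAGE];
    }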
7. What is the setup of the analysis/experiments? Is it sufficient?
All experiments were run on two 2.2 GHz quad-core Intel Xeon E5-2407 processors (no hyper-threading, no turbo mode) with 48 GB of main memory. This system is sufficient, as it is representative of production hardware.
RUMA was written in C99, and all experiments were run with 1 billion integer entries, which gives a very predictable total memory footprint. During the experiments, allocation was constrained to a single NUMA (Non-Uniform Memory Access) region.
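The paper does not state the exact binding mechanism; the following libnuma sketch shows one way allocations can be constrained to a single node (running the benchmark under numactl --membind=0 would be an equivalent alternative).

    #include <numa.h>      /* link with -lnuma */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return EXIT_FAILURE;
        }

        /* Allocate 1 GiB whose physical backing comes from node 0 only. */
        size_t bytes = 1UL << 30;
        void *buf = numa_alloc_onnode(bytes, 0);
        if (buf == NULL)
            return EXIT_FAILURE;

        numa_free(buf, bytes);
        return EXIT_SUCCESS;
    }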
Writes are conducted in two passes. Rewired-memory tests were run with both page poolings (sequential, random), both page sizes (huge, small), and both mapping types (private, shared) to compare results. By using large data sets, the authors are able to determine the average cost.
8. Are there any gaps in the logic/proof?
While the authors keep the data set fixed within each analysis, they change it between analyses. This does not invalidate their claims, but it does introduce inconsistency between cost evaluations and therefore requires a correction to reach a common baseline. Additionally, although nothing of concern appears in the results, varying the data set in this way could mask such issues. To present a stronger paper, the authors should use the 7.45 GB set of one billion integers for all cost analyses.
Although they use file handles to gain access to virtual memory, the authors never discuss solid-state drives versus spinning platter disks.
9. Describe at least one possible next step.
To investigate different hardware configurations (some may see no benefit from RUMA), the authors should test on bare metal, virtual machines, and containers. Since the authors did not use hyper-threading or turbo mode in their experiments, attempting integration in a cloud infrastructure seems like a reasonable next step. With regard to containers, a container can also be configured to stay within a single NUMA node.
Additionally, the disk type should be investigated, at least SSD versus HDD. The two device classes also have different page sizes, which will at least affect the design.
BibTeX Citation
@article{schuhknecht2016ruma,
  title     = {RUMA has it: rewired user-space memory access is possible!},
  author    = {Schuhknecht, Felix Martin and Dittrich, Jens and Sharma, Ankur},
  journal   = {Proceedings of the VLDB Endowment},
  volume    = {9},
  number    = {10},
  pages     = {768--779},
  year      = {2016},
  publisher = {VLDB Endowment}
}