- Is fast – especially in a multi-threaded environment
- Wastes little memory
- Can be easily modified
- Does not reach out for the OS
The current result of our work is a fast, multi-threaded, low-overhead, embeddable allocator based on rpmalloc: relatively simple (~1800 lines of C++), optimized for small allocations, and permissively licensed. The code is available on GitHub under the name coherent-rpmalloc and is still evolving.
In this and the next post I’m going to describe the implementation details and how the allocator is different from the initial rpmalloc implementation.
Coherent-rpmalloc, like the original rpmalloc, relies heavily on caches and thread-local pointers. The key to multi-threaded performance is reducing contention between threads. The allocator achieves this by keeping as much data as possible on a per-thread basis and acquiring shared resources very seldom. It is tuned towards applications that allocate relatively small amounts of data – in the tens of megabytes. This is different from the original rpmalloc, which handles even larger heaps well. The tuning can be adjusted by changing the sizes of the different memory subdivisions & caches.
A lot of information on the design can also be found in the documentation of the original rpmalloc; there are, however, differences with coherent-rpmalloc.
Memory subdivision & lingo
The memory in the allocator is handled through 4 major data-types, which form a hierarchy:
- Blocks – each allocation of a certain size falls in a size class, and all allocations of that size class belong together – for instance all the ones between 1 and 16 bytes, all the ones between 17 and 32 bytes, etc. A memory request is rounded up to the upper bound of its size class. A block is one such allocation – the pointer the allocator returns to the user & that it can later free. Blocks have different sizes, depending on their size classes – each represents a single small/medium allocation.
- Pages – fixed-size slices of a span, each carved into blocks of the span’s size class. Pages are purely a memory subdivision and carry no logic of their own.
- Spans – each span serves blocks of a single size class and keeps a free-list of its blocks.
- Segments – the unit of memory obtained from the embedder. Each segment is carved into spans; once all its spans are free, it is returned to the embedder.
By far the most important structures are spans and segments; blocks and pages are just useful memory subdivisions which carry no code logic – think of them as size slices.
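The hierarchy can be pictured as two real structures plus two implicit subdivisions. This is a hypothetical sketch; the field names and layouts in coherent-rpmalloc differ:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only – the real structures in coherent-rpmalloc differ.
struct Span;

struct Segment {
    // The unit obtained from the embedder; carved into spans.
    Span*    first_span;
    uint32_t free_span_count;  // all spans free => return segment to the embedder
};

struct Span {
    // Serves blocks of exactly one size class.
    Segment* owner;        // segment this span was carved from
    uint32_t size_class;   // shared by every block in the span
    void*    free_blocks;  // intrusive free-list of blocks
};

// Blocks and pages carry no structs of their own: a page is a fixed-size
// slice of a span, and a block is one size-class-sized slice of a page.
```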
Allocations are subdivided into different buckets by size – small, medium, large and extra-large. All the complex handling and machinery in the library applies to small, medium & large allocations. In our case I settled for everything up to 8144 bytes. All allocations larger than that are classified as extra-large and are directly handed over to the embedder.
In Hummingbird, requests for chunks bigger than 8KB are almost always made by other internal allocators or for large resources (images, fonts etc.). They happen relatively seldom – only ~1% of memory requests are that large – so their performance impact is negligible overall. I also wanted the library to immediately return such large chunks to the embedder application instead of holding them in internal caches.
Anatomy of an allocation & de-allocation
When the user calls void* rpmalloc(size_t sz), the library has to fulfill the allocation. This happens in multiple steps:
- The allocation's size class is determined. This tells us where to look for free blocks large enough to fulfill the request.
- The current thread's heap is loaded from thread-local storage. Each thread using the library has a local heap (a struct) that contains all the data necessary to allocate and deallocate memory; almost always it can fulfill a memory operation by itself.
- The heap has an active span for each size class – the active span is just the latest span that was used for allocating. If it has free blocks, the first free block will be used.
- If there are no free blocks, the library will try to:
- De-allocate some blocks – mark them as free. Blocks freed from other threads are not immediately marked, but are put in a list. The owning thread (the one that allocated them from its own heap) will mark them on the next allocation.
- Look for an empty span in the local thread cache
- Look for an empty span in the global cache
- If no span with free blocks was found, a new span is requested:
- Look for a segment that has empty spans and take one
- If there are no segments with free spans
- Request a new segment allocation (to external embedder)
- Return the first span and put the segment in the list of segments
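The steps above can be condensed into a pseudocode-level sketch. Everything here – the Heap and Span types, the helper names, the 64-class limit – is hypothetical and far simpler than the real implementation:

```cpp
#include <cstddef>

// Pseudocode-level sketch of the allocation path; all names are hypothetical.
struct Span { void* free_list; };

struct Heap {
    Span* active_span[64];               // one active span per size class
    void* pop_block(Span* s) {           // take the first free block
        void* block = s->free_list;
        if (block) s->free_list = *static_cast<void**>(block);
        return block;
    }
    // Stubs standing in for the real machinery:
    void  drain_deferred_frees() {}                           // blocks freed by other threads
    Span* span_from_thread_cache(size_t) { return nullptr; }  // local thread cache
    Span* span_from_global_cache(size_t) { return nullptr; }  // global cache
    Span* span_from_segments(size_t)     { return nullptr; }  // segments / embedder
};

thread_local Heap t_heap;  // each thread has its own heap

void* alloc_sketch(size_t size_class) {
    Heap& heap = t_heap;                          // loaded from TLS
    Span* span = heap.active_span[size_class];
    if (span) {
        if (void* block = heap.pop_block(span)) return block;
        heap.drain_deferred_frees();              // reclaim cross-thread frees
        if (void* block = heap.pop_block(span)) return block;
        span = nullptr;                           // active span exhausted
    }
    if (!span) span = heap.span_from_thread_cache(size_class);
    if (!span) span = heap.span_from_global_cache(size_class);
    if (!span) span = heap.span_from_segments(size_class);
    if (!span) return nullptr;                    // embedder refused: out of memory
    heap.active_span[size_class] = span;
    return heap.pop_block(span);
}
```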
The de-allocation works in pretty much the opposite way:
- From the pointer – find the owning span. This is an interesting step and very performance-sensitive, I’ll go into the details in the next post.
- Mark the block in the span as empty. Empty blocks are put in a free-list within the span, where the memory of the block itself is used to store the "next" link. As the memory is now garbage, we can safely re-use it for internal data.
- If the span is completely empty (all blocks are free)
- Put it in a cache – depending on configurable ratios, the span will go to the local or global cache, or will need to be freed altogether
- If the span has to be freed altogether
- Mark in the owning segment that the span is free
- If all spans in the segment are free, return the whole segment to the embedder, actually freeing the memory.
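The intrusive free-list from the first two steps can be sketched as follows. The Span here is a hypothetical stand-in, far simpler than the real structure:

```cpp
#include <cassert>

// Sketch of the intrusive free-list: a freed block's own memory stores the
// "next" link, so the bookkeeping costs no extra space. Hypothetical types.
struct Span {
    void*    free_list   = nullptr;
    unsigned used_blocks = 0;
};

void free_block(Span& span, void* block) {
    // The block's memory is garbage now, so reuse its first bytes as a link.
    *static_cast<void**>(block) = span.free_list;
    span.free_list = block;
    --span.used_blocks;
    if (span.used_blocks == 0) {
        // Fully empty: depending on the configured cache ratios the span would
        // now go to the thread cache, the global cache, or back to its segment
        // (and, if the whole segment is free, back to the embedder).
    }
}
```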
The next post in the series will describe the most interesting implementation details of the library and how most of the operations were implemented.