TL;DR: You can check out the latest version of our generic C++ memory allocator on GitHub.
Memory allocations in C++ are a major performance pain point for every real-time application. Most games try to reduce or eliminate heap allocations during a frame and use various techniques to avoid them. I’ve already blogged about some of the techniques we use in Hummingbird: linear allocators for temporary and per-frame memory, pool allocators, etc.
Avoiding any dynamic (generic) memory allocation whatsoever is difficult and an ongoing effort in all AAA software I know of, Hummingbird included. You still need to do a new/delete/malloc/free here and there – hence a generic memory allocator is still needed.
What is a generic memory allocator?
In C and C++ the “generic” memory allocator is the pair of functions malloc/free, which let you use memory from the heap (a.k.a. the free store). In C++ the new and delete operators usually end up calling malloc/free to get the memory needed for objects.
I call it generic, because it should fulfill any memory request – from a single byte to potentially gigabytes of memory, often with specific alignment requirements.
The standard C runtime provides implementations of malloc/free, and their characteristics depend on the specific OS and compiler. Unfortunately, the quality of these implementations varies wildly, which makes it difficult to offer a consistent application experience.
The main features a good allocator should have are:
- Good performance – calling malloc/free should complete relatively quickly
- Small memory overhead – each allocator needs some mechanism to keep track of its internal state, used/free memory etc, which involves some additional overhead.
There are dozens of OSS generic memory allocator implementations with different characteristics. To mention some of the more popular ones: dlmalloc, jemalloc, tcmalloc, ptmalloc, nedmalloc, hoard, rpmalloc, etc. Each involves different design decisions and many claim to be the “fastest”. However, applications have different allocation patterns, and one allocator might be better than another for your particular use case – measure before committing.
Our software also has its own specifics, which define the requirements for our generic allocator.
Specifics of allocations in a library
Hummingbird is a UI software library written in C++. It is designed to work within an application and has to “play nicely” with it. One of the most important aspects is never directly using a resource under the hood that the application can’t control. Memory is a prime example. Hummingbird never allocates any memory directly – it asks the main application for it. This gives the game developer full control to sub-allocate from UI-specific arenas and to track any memory that the library uses.
Until now, Hummingbird has asked the application for memory on each allocation and returned it on each deallocation. This gives complete control to the user but can sometimes be risky. The major issue we’ve seen is that the generic allocators in some engines and on some platforms are quite bad. This means that in some cases the library can be slowed down by factors external to it. It also means that a bug in the game allocator will very likely hit the middleware as well.
We really want consistency of the performance and stability of our library across all platforms and engines, so we decided to go for a hybrid solution:
- Hummingbird will have an internal generic memory allocator that will serve all internal memory needs
- The allocator itself will request memory from the game in larger segments – the user is still in full control of the memory subsystem
In this way the user’s memory system is touched very rarely, which practically eliminates any performance effect it might have and substantially reduces the risk of memory corruption introduced by it.
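To make the hybrid approach concrete, here is a minimal sketch of the idea. It is not Hummingbird’s actual code: the names HostMemoryCallbacks, SegmentAllocator, HostStats, countingAllocate and countingDeallocate are hypothetical, the allocator is a plain bump allocator that can’t free individual blocks, it isn’t thread-safe, and it assumes the host returns memory aligned at least as strictly as malloc does. The only point is to show many small allocations being served from one large segment requested from the application.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical callbacks supplied by the host application (the game).
struct HostMemoryCallbacks {
    void* (*allocate)(std::size_t bytes, void* userData);
    void (*deallocate)(void* ptr, std::size_t bytes, void* userData);
    void* userData;
};

// Toy internal allocator: takes large segments from the host and hands out
// small blocks from the newest segment by bumping an offset.
class SegmentAllocator {
    static constexpr std::size_t kAlign = alignof(std::max_align_t);
    // Each segment starts with a header linking to the previous segment, so
    // the allocator needs no heap memory of its own for bookkeeping.
    struct alignas(std::max_align_t) SegmentHeader { SegmentHeader* previous; };

public:
    SegmentAllocator(HostMemoryCallbacks host, std::size_t segmentSize)
        : host_(host), segmentSize_(segmentSize) {
        assert(segmentSize > sizeof(SegmentHeader) && segmentSize % kAlign == 0);
    }
    ~SegmentAllocator() { releaseAll(); }
    SegmentAllocator(const SegmentAllocator&) = delete;
    SegmentAllocator& operator=(const SegmentAllocator&) = delete;

    void* allocate(std::size_t bytes) {
        if (bytes == 0) bytes = 1;
        // Sketch only: requests that don't fit in one segment are refused.
        if (bytes > segmentSize_ - sizeof(SegmentHeader)) return nullptr;
        bytes = (bytes + kAlign - 1) & ~(kAlign - 1);  // keep blocks aligned
        if (current_ == nullptr || used_ + bytes > segmentSize_) {
            // Only here do we touch the host's memory system. Any space left
            // in the previous segment is simply wasted in this sketch.
            void* raw = host_.allocate(segmentSize_, host_.userData);
            if (raw == nullptr) return nullptr;
            current_ = new (raw) SegmentHeader{current_};
            used_ = sizeof(SegmentHeader);
        }
        void* block = reinterpret_cast<char*>(current_) + used_;
        used_ += bytes;
        return block;
    }

    // Returns every segment to the host; all handed-out blocks become invalid.
    void releaseAll() {
        while (current_ != nullptr) {
            SegmentHeader* previous = current_->previous;
            host_.deallocate(current_, segmentSize_, host_.userData);
            current_ = previous;
        }
        used_ = 0;
    }

private:
    HostMemoryCallbacks host_;
    std::size_t segmentSize_;
    SegmentHeader* current_ = nullptr;
    std::size_t used_ = 0;
};

// Hypothetical host side: forwards to malloc/free and counts the calls.
struct HostStats { int allocations = 0; int deallocations = 0; };

inline void* countingAllocate(std::size_t bytes, void* userData) {
    ++static_cast<HostStats*>(userData)->allocations;
    return std::malloc(bytes);
}

inline void countingDeallocate(void* ptr, std::size_t, void* userData) {
    ++static_cast<HostStats*>(userData)->deallocations;
    std::free(ptr);
}
```

With a 64KB segment, a thousand 32-byte allocations fit in a single segment, so the host callbacks run once for the allocation and once when the allocator is destroyed. A real allocator also has to free individual blocks, reuse them, and hand empty segments back to the host early.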
Looking for an embeddable memory allocator
We quickly settled on the requirements for our allocator:
- Should be “embeddable” – it must accept memory from an outside source (the game in our case) rather than allocate directly from the OS (via mmap, VirtualAlloc etc.)
- Should be fast – especially for small allocations. Large allocations in our software are either resources (images, fonts) or memory pools that we manage internally anyway.
- Should be fast in a multi-threaded environment – Hummingbird is heavily multi-threaded
- Should “return” memory often – many allocators are extremely conservative and hold on to the memory they’ve taken. We want to return any excess memory quickly to the game.
- Should be relatively simple to extend/modify
This list significantly narrowed our options. Almost all OSS memory allocators are designed to be the “one-and-only” generic allocator within an application and don’t fulfill the first and fourth requirements: being embeddable and returning memory often (for instance jemalloc, tcmalloc).
The venerable dlmalloc unfortunately hasn’t aged well and doesn’t perform well in a multi-threaded environment. ptmalloc3 is LGPL and I haven’t really tried it. In the past I’ve used nedmalloc, which fulfills the API requirements but is somewhat slow; the current Windows memory allocator consistently beats it.
In the end we decided to take a closer look at rpmalloc.
- It claims (and achieves – I tested it under real workloads) great performance.
- It can be easily studied & modified
However it also has some downsides for our particular case:
- Can’t be used verbatim because it allocates from the OS – this is trivial to change
- Requires every block of memory it obtains to be 64KB-aligned – this is actually a major issue for us. rpmalloc can get away with this requirement in the generic-application case, because every VirtualAlloc on Windows is 64KB-aligned anyway, while on POSIX it’s relatively easy to map memory so that the requirement is fulfilled. This is NOT a requirement we can impose on game developers. Imagine requesting 1KB and requiring 64KB alignment. If the game can’t easily manipulate the virtual memory mapping, in the worst case it’ll have to allocate 65KB just to make sure the alignment is right – not fun at all.
- Holds on to memory a bit too aggressively for our requirements – even when compiling it with a more “memory-saving” option
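The arithmetic behind the worst case in the 64KB-alignment point is easy to check. The sketch below uses hypothetical helper names (kSpanAlignment, worstCaseRequest, alignUp) and assumes the game’s allocator gives no alignment guarantee at all: to be sure that an aligned block of the requested size fits somewhere inside the returned memory, you have to ask for alignment - 1 extra bytes and then round the returned address up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The alignment discussed in the text: 64KB.
constexpr std::size_t kSpanAlignment = 64 * 1024;

// Bytes to request from an allocator with no alignment guarantee so that an
// aligned block of `bytes` is certain to fit inside the returned range.
constexpr std::size_t worstCaseRequest(std::size_t bytes, std::size_t alignment) {
    return bytes + alignment - 1;
}

// Rounds an address up to the next multiple of `alignment`, which must be a
// power of two. An address that is already aligned is returned unchanged.
inline std::uintptr_t alignUp(std::uintptr_t address, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return (address + mask) & ~mask;
}
```

For a 1KB request, worstCaseRequest(1024, kSpanAlignment) comes to 66559 bytes, which is just under 65KB of memory taken from the game for 1KB of usable space.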
So we decided to give it a try and modify it: keep the awesome performance and overcome the limitations. You can check out our current fork on GitHub.
The next post in the series will get our hands dirty with the details of how rpmalloc works and the changes we made to adapt it for our needs.