Generic memory allocator for C++ (part 2)

In my previous post I highlighted the requirements we have for a generic memory allocator within the Hummingbird game UI middleware. To quickly recap, we want a memory allocator (malloc/free) that:

  • Is fast – especially in a multi-threaded environment
  • Wastes little memory
  • Can be easily modified
  • Does not reach out to the OS

The current result of our work is a fast, multi-threaded, low-overhead, embeddable, relatively simple (~1800 lines of C++), rpmalloc-based, permissively licensed allocator optimized for small allocations. The code is available on GitHub under the name coherent-rpmalloc and is still evolving.

In this and the next post I’m going to describe the implementation details and how the allocator is different from the initial rpmalloc implementation.

Overall goals

Coherent-rpmalloc, like the original rpmalloc, relies heavily on caches and thread-local pointers. The key to multithreaded performance is reducing contention between threads. The allocator achieves this by keeping as much data as possible on a per-thread basis and acquiring shared resources very seldom. It is tuned towards applications that allocate relatively small amounts of data – in the tens of megabytes. This is different from the original rpmalloc, which handles even larger heaps well. The tuning can be adjusted by changing the sizes of the different memory subdivisions & caches.

A lot of information on the design can also be found in the documentation of the original rpmalloc; there are, however, differences with coherent-rpmalloc.

Memory subdivision & lingo

The memory in the allocator is handled through 4 major data-types, which form a hierarchy:


  • Blocks – each allocation of a certain size falls in a size class, and all allocations of that size class belong together – for instance all the ones between 1 and 16 bytes, all the ones between 16 and 32 bytes, etc. A memory request will be rounded up to the maximum of the size class it belongs to. The block is such an allocation – a pointer that the allocator returns to the user and that it can later free. Blocks have different sizes, depending on their size class – each represents a single small/medium allocation.
  • Pages – slices of a span that group blocks together; like blocks, they are purely a memory subdivision with no logic of their own.
  • Spans – hold the blocks of a given size class and are the unit that the thread-local and global caches operate on.
  • Segments – large chunks of memory requested from the embedder; each segment is subdivided into spans and is returned to the embedder once all of its spans are free.


By far the most important structures are spans and segments; blocks and pages are just useful memory subdivisions that carry no code logic – think of them as size slices.

[Figure: subdivision of memory into segments, spans, pages and blocks]

Allocations are subdivided into different buckets by size – small, medium, large and extra-large. All the complex handling and machinery in the library applies to small, medium & large allocations. In our case I settled for everything up to 8144 bytes. All allocations larger than that are classified as extra-large and are directly handed over to the embedder.

Hummingbird requests bigger than 8K are almost always made by other internal allocators or for large resources (images, fonts, etc.). They happen relatively seldom – only ~1% of the memory requests are that large, so their performance impact is negligible overall. I also wanted the library to immediately return such large chunks to the embedder application instead of holding them in internal caches.
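
To make the size-class idea concrete, here is a minimal sketch of how a request could be classified and rounded. The granularity and helper names are illustrative only – they mirror the description above, not the actual coherent-rpmalloc code:

#include <cstddef>

// Illustrative constants – the real class boundaries in coherent-rpmalloc may differ.
const size_t kClassGranularity = 16;   // size classes 1-16, 17-32, 33-48, ...
const size_t kExtraLargeLimit  = 8144; // anything above goes straight to the embedder

// Round a request up to the top of its size class.
size_t SizeClassUpperBound(size_t size)
{
  return ((size + kClassGranularity - 1) / kClassGranularity) * kClassGranularity;
}

// Extra-large allocations bypass the internal machinery entirely.
bool IsExtraLarge(size_t size)
{
  return size > kExtraLargeLimit;
}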

Anatomy of an allocation & de-allocation

When the user calls the void* rpmalloc(size_t sz) function, the library has to fulfill the allocation. This happens in multiple steps (a simplified sketch follows the list below):

  • The allocation's size class is determined. This tells us where to look for free blocks large enough to fulfill the allocation.
  • The current thread's heap is loaded from thread-local storage. Each thread using the library has a local heap (a struct) that contains all the data necessary to allocate and deallocate memory, and almost always it can fulfill a memory operation by itself.
  • The heap has an active span for each size class – the active span is just the latest span that was used for allocating. If it has some free blocks – the first free block will be used.
  • If there are no free blocks, the library will try to:
    • De-allocate some blocks – mark them as free. Blocks freed from other threads are not immediately marked, but are put in a list. The owning thread (the one that allocated them from its own heap) will mark them on the next allocation.
    • Look for an empty span in the local thread cache
    • Look for an empty span in the global cache
  • If no span was found with empty blocks, a new span is requested:
    • Look for a segment that has empty spans and take one
    • If there are no segments with free spans
      • Request a new segment allocation (to external embedder)
      • Return the first span and put the segment in the list of segments
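
Condensed into code, the allocation path looks roughly like this. Keep in mind this is a simplified sketch – all the types and helper names below are made up to mirror the steps above, they are not the real implementation:

#include <cstddef>

// Illustrative stand-ins for the real structures.
struct Span { bool HasFreeBlocks() const; void* PopFreeBlock(); };
struct Heap { Span* activeSpan[64 /* one per size class */]; };

size_t GetSizeClass(size_t size);
Heap*  GetThreadHeap();                            // read from thread-local storage
void   FlushDeferredFrees(Heap* heap, size_t cls); // blocks freed by other threads
Span*  TakeSpanFromThreadCache(Heap* heap, size_t cls);
Span*  TakeSpanFromGlobalCache(size_t cls);
Span*  TakeSpanFromSegments(size_t cls);           // may request a new segment from the embedder

void* AllocateSketch(size_t size)
{
  const size_t cls = GetSizeClass(size);
  Heap* heap = GetThreadHeap();

  Span* span = heap->activeSpan[cls];
  if (span && span->HasFreeBlocks())
    return span->PopFreeBlock();

  // No free block in the active span: reclaim deferred frees, then consult the caches.
  FlushDeferredFrees(heap, cls);
  if (!span || !span->HasFreeBlocks())
  {
    span = TakeSpanFromThreadCache(heap, cls);
    if (!span) span = TakeSpanFromGlobalCache(cls);
    if (!span) span = TakeSpanFromSegments(cls);   // last resort: a fresh segment
    heap->activeSpan[cls] = span;
  }
  return span->PopFreeBlock();
}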

The de-allocation works in pretty much the opposite way (again, a sketch follows the list):

  • From the pointer – find the owning span. This is an interesting step and very performance-sensitive; I'll go into the details in the next post.
  • Mark the block in the span as empty. Empty blocks are put in a free-list within the span, where the memory of the block is used as the “next” id. As the memory is now garbage we can safely re-use it for internal data.
  • If the span is completely empty (all blocks are free)
    • Put in caches – depending on configurable ratios, the span will go to the local or global cache or will need to be freed altogether
  • If the span has to be freed altogether
    • Mark in the owning segment that the span is free
  • If all spans in the segment are free, return the whole segment to the embedder, actually freeing the memory.
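
And the same for the free path – again an illustrative sketch with made-up names, showing only the owner-thread case (frees coming from other threads just go into the deferred list mentioned earlier):

// Illustrative stand-ins for the real structures.
struct Span;
struct Segment;

Span*    SpanFromPointer(void* p);              // the performance-sensitive lookup (next post)
void     PushBlockToFreeList(Span* s, void* p); // the block's own memory stores the "next" link
bool     SpanIsCompletelyFree(const Span* s);
bool     CachesWantSpan(const Span* s);         // decided by the configurable cache ratios
void     CacheSpan(Span* s);                    // thread-local or global cache
Segment* SegmentOfSpan(Span* s);
void     MarkSpanFree(Segment* seg, Span* s);
bool     AllSpansFree(const Segment* seg);
void     ReturnSegmentToEmbedder(Segment* seg);

void FreeSketch(void* p)
{
  Span* span = SpanFromPointer(p);
  PushBlockToFreeList(span, p);

  if (!SpanIsCompletelyFree(span))
    return;

  if (CachesWantSpan(span))
  {
    CacheSpan(span);
    return;
  }

  Segment* segment = SegmentOfSpan(span);
  MarkSpanFree(segment, span);
  if (AllSpansFree(segment))
    ReturnSegmentToEmbedder(segment);           // the memory actually goes back to the game
}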

The next post in the series will describe the most interesting implementation details of the library and how most of the operations are implemented.


Generic memory allocator for C++ (part 1)

TLDR; You can check the latest version of our generic C++ memory allocator on github.

Memory allocations in C++ are a major performance pain-point for every real-time application. Most games try to reduce/remove any need for heap allocations during a frame and use different techniques to avoid them. I've already blogged about some of the techniques we use in Hummingbird – linear allocators for temporary & per-frame memory, pool allocators, etc.

Avoiding any dynamic (generic) memory allocation whatsoever is difficult and an ongoing effort in all AAA software I know of, Hummingbird included. You still need to do a new/delete/malloc/free here and there – hence a generic memory allocator is still needed.

What is a generic memory allocator?

In C and C++ the “generic” memory allocator is the pair of functions malloc/free, which allow you to use memory off the heap (aka free-store). In C++ the new & delete operators also usually end-up in malloc/free to get the memory needed for objects.

I call it generic, because it should fulfill any memory request – from a single byte to potentially gigabytes of memory, often with specific alignment requirements.

The standard C runtime provides implementations of malloc/free, and their characteristics depend on the specific OS/compiler. The quality of these implementations unfortunately varies wildly, which makes it difficult to offer a consistent application experience.

The main features a good allocator should have are:

  • Good performance – calling malloc/free should complete relatively quickly
  • Small memory overhead – each allocator needs some mechanism to keep track of its internal state, used/free memory etc, which involves some additional overhead.

There are dozens of OSS generic memory allocator implementations with different characteristics; to mention some of the more popular: dlmalloc, jemalloc, tcmalloc, ptmalloc, nedmalloc, hoard and rpmalloc. Each involves different design decisions and many claim to be the “fastest”; however, applications have different allocation patterns and one might be better than another for your particular use case – measure before committing.

Our software also has its specifics, which define the requirements for our generic allocator.

Specifics of allocations in a library

Hummingbird is a UI software library in C++, designed to work within an application, and it has to “play nicely” with it. One of the most important aspects is never directly using a resource under the hood that the application can't control. Memory is a prime example. Hummingbird never allocates any memory directly – it asks for memory from the main application. This gives the game developer full control to sub-allocate from UI-specific arenas and to track any memory the library uses.

Until now, Hummingbird has asked the application for memory on each allocation and returned it on each deallocation. This gives complete control to the user but can sometimes be risky. The major issue we've seen is that the generic allocators in some engines and on some platforms are quite bad. This means that in some cases the library can be slowed down by factors external to it. It also means that a bug in the game allocator will very likely hit the middleware as well.

We really want consistent performance and stability of our library across all platforms and engines, so we decided to go for a hybrid solution:

  • Hummingbird will have an internal generic memory allocator that will serve all internal memory needs
  • The allocator itself will request memory from the game in larger segments – the user is still in full control of the memory subsystem

In this way the user’s memory system is touched very seldom, which practically eliminates any performance effect it might have, and substantially reduces the risk of memory corruptions introduced by it.
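In code, the contract between the allocator and the game boils down to a pair of callbacks along these lines. The names here are hypothetical – they just illustrate the shape of the interface, not the actual Hummingbird API:

#include <cstddef>

// Hypothetical interface sketch: the game supplies large segments,
// the internal allocator sub-allocates from them.
struct AllocatorCallbacks
{
  void* (*allocateSegment)(size_t size, size_t alignment, void* userData);
  void  (*releaseSegment)(void* segment, size_t size, void* userData);
  void*  userData; // passed back to the game on every call
};

Because allocateSegment is called rarely and only for large chunks, the behavior of the game's own allocator has almost no influence on the UI performance.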

[Figure: the hybrid allocation scheme – Hummingbird's internal allocator requesting memory segments from the game]

Looking for an embeddable memory allocator

We quickly decided the requirements of our allocator:

  1. Should be “embeddable” – meaning it should allow being given memory from an outside source (the game in our case), and not allocate memory directly from the OS (via mmap, VirtualAlloc, etc.)
  2. Should be fast – especially for small allocations. Large allocations in our software are either resources (images, fonts) or memory pools that we manage internally anyway.
  3. Should be fast in a multi-threaded environment – Hummingbird is heavily MT
  4. Should “return” memory often – many allocators are extremely conservative and hold on to the memory they’ve taken. We want to return any excess memory quickly to the game.
  5. Should be relatively simple to extend/modify

This list significantly narrowed down our options. Almost all OSS memory allocators are designed to be the “one-and-only” generic allocator within an application and don't fulfill 1) and 4) (for instance jemalloc, tcmalloc).

The venerable dlmalloc unfortunately hasn't aged well and is not good in an MT environment. ptmalloc3 is LGPL – I haven't really tried it. In the past I've used nedmalloc, which fulfills the API requirements but is somewhat slow – the current Windows memory allocator beats it all the time.

In the end we decided to take a closer look at rpmalloc.

  • It claims (and achieves – I tested it under real workloads) great performance.
  • It can be easily studied & modified

However it also has some downsides for our particular case:

  • Can't be used verbatim because it allocates from the OS – this is trivial to change
  • Requires all allocations to be 64KB-aligned – this is actually a major issue for us. rpmalloc can get away with this requirement in the generic-application case, because every VirtualAlloc on Windows is 64KB-aligned anyway, while on POSIX it's relatively easy to map memory so that the requirement is fulfilled. This is NOT a requirement we can impose on game developers. Imagine requesting 1KB and requiring 64KB alignment. If the game can't easily manipulate the virtual memory mapping, in the worst case it'll have to allocate 65KB just to make sure the alignment is right – not fun at all (see the sketch after this list).
  • Holds on to memory a bit too aggressively for our requirements – even when compiling it with a more “memory-saving” option
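
Here is what that worst case looks like if the embedder cannot play tricks with the virtual memory mapping – a rough sketch of the brute-force workaround, not something anyone would want to ship:

#include <cstdint>
#include <cstdlib>

// Brute-force way to satisfy a 64KB alignment requirement from a plain malloc:
// over-allocate by (alignment - 1) bytes and align the pointer manually.
void* Allocate64KAligned(size_t size)
{
  const size_t alignment = 64 * 1024;
  // For a 1KB request this reserves ~65KB just to guarantee the alignment.
  void* raw = std::malloc(size + alignment - 1);
  if (!raw)
    return 0;
  uintptr_t aligned = (reinterpret_cast<uintptr_t>(raw) + alignment - 1)
                      & ~uintptr_t(alignment - 1);
  // Note: the original 'raw' pointer has to be remembered somewhere in order to
  // free the block later; that bookkeeping is omitted here.
  return reinterpret_cast<void*>(aligned);
}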

So we decided to give it a try and modify it – keep the awesome performance and overcome the limitations. You can check our current fork on GitHub.

The next post in the series will get our hands dirty with the details of how rpmalloc works and the changes we made to adapt it for our needs.

Rendering HTML at 1000 FPS – Part 2

Squeezing the GPU

The post is also available in the official LensVR blog.

This is the second blog post of the sequence in which I talk about the LensVR rendering engine.

In the first post, I discussed the high-level architecture of the LensVR/Hummingbird rendering. In this post I will get into the specifics of our implementation – how we use the GPU for all drawing. I will also share data from a performance comparison I did with Chrome and Servo Nightly.

Rendering to the GPU

After we have our list of high-level rendering commands, we have to execute them on the GPU. Renoir transforms these commands into low-level graphics API calls (OpenGL, DirectX, etc.). For instance, a “FillRectangle” high-level command will become a series of vertex/index buffer and shader setups and draw calls.

Renoir “sees” all commands that will happen in the frame and can make high-level decisions to optimize for three important constraints:

  • Minimize GPU memory usage
  • Minimize synchronization with the GPU
  • Minimize rendering API calls

When drawing a web page, certain effects require the use of intermediate render targets. Renoir will group as many of those effects as possible in the same target to minimize the memory used and to reduce render-target switches on the GPU, which are fairly slow operations. It'll also aggressively cache textures and targets and try to re-use them to avoid continually creating/destroying resources, which is quite slow.

The rendering API commands are immediately sent to the GPU on the rendering thread; there is no “intermediate” command list, as opposed to Chrome, where a dedicated GPU process is in charge of interacting with the GPU. The “path-to-pixel” is significantly shorter in LensVR compared to all other web browsers, with a lot fewer abstraction layers in-between, which is one of the keys to the speed it gets.

[Figure: rendering in cohtml]

GPU execution

The GPU works as a consumer of commands and memory buffers generated by the CPU; it can complete its work several frames after that work has been submitted. So what happens when the CPU tries to modify some memory (say a vertex or index buffer) that the GPU hasn't processed yet?

Graphics drivers keep track of these situations, called “hazards”, and either stall the CPU until the resource has been consumed or perform a process called “renaming” – basically cloning the resource under the hood and letting the CPU modify the fresh copy. Most of the time the driver will do renaming, but if excessive memory is used, it can also stall.

Both possibilities are not great. Resource renaming increases the CPU time required to complete the API calls because of the bookkeeping involved, while a stall will almost certainly introduce serious jank in the page. Newer graphics APIs such as Metal, Vulkan and DirectX 12 let the developer keep track of and avoid hazards. Renoir was likewise designed to manually track the usage of its resources to prevent renaming and stalls, so it fits the architecture of the new modern APIs perfectly. Renoir has native rendering backend implementations for all major graphics APIs and uses the best one for the platform it is running on. For instance, on Windows it directly uses DirectX 11. In comparison, Chrome has to go through an additional abstraction library called ANGLE, which generates DirectX API calls from OpenGL ones – Chrome (which uses Skia) only understands OpenGL at the time of this post.
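
The gist of such manual tracking can be illustrated in a few lines. This is a conceptual sketch, not Renoir's actual bookkeeping: every buffer remembers the last frame in which the GPU referenced it, and the CPU touches it again only after that frame has provably completed.

#include <cstdint>

// Conceptual sketch of manual hazard tracking.
struct TrackedBuffer
{
  void*    memory;
  uint64_t lastUsedFrame; // the frame in which the GPU last consumed this buffer
};

// Safe to overwrite only if the GPU has finished every frame that referenced the
// buffer; otherwise pick (or allocate) another buffer instead of risking a stall
// or a driver-side rename.
bool CanReuseWithoutHazard(const TrackedBuffer& buffer, uint64_t lastCompletedGpuFrame)
{
  return buffer.lastUsedFrame <= lastCompletedGpuFrame;
}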

Command Batching

Renoir tries very hard to reduce the amount of rendering API calls. The process is called “batching” – combining multiple drawn elements in one draw call.

Most elements in Renoir can be drawn with one shader, which makes them much easier to batch together. A classic way of doing batching in games is combining opaque elements together and relying on the depth buffer to draw them correctly in the final image.

Unfortunately, this is much less effective in modern web pages. A lot of elements have transparency or blending effects and they need to be applied in the z-order of the page, otherwise the final picture will be wrong.

Renoir keeps track of how elements are positioned in the page, and if they don't intersect, it batches them together – in that case the z-order no longer breaks batching. The final result is a substantial reduction of draw calls. Renoir also pre-compiles all the required shaders in the library, which significantly improves “first-use” performance. Other 2D libraries like Skia rely on run-time shader compilation, which can be very slow (seconds on first use) on mobile and introduce stalls.
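
The batching criterion itself is simple to illustrate (again a simplified sketch, not the actual Renoir code): two elements that do not overlap on screen can go into the same draw call regardless of their z-order, because the order in which they are drawn cannot change the final pixels.

#include <vector>

struct Rect { float x, y, w, h; };

bool Intersect(const Rect& a, const Rect& b)
{
  return a.x < b.x + b.w && b.x < a.x + a.w &&
         a.y < b.y + b.h && b.y < a.y + a.h;
}

// An element can join the current batch only if it doesn't overlap anything
// already in it – otherwise the z-order must be respected and the batch is broken.
bool CanAddToBatch(const std::vector<Rect>& batch, const Rect& element)
{
  for (const Rect& r : batch)
    if (Intersect(r, element))
      return false;
  return true;
}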

Results & Comparisons

For the performance comparison I took a page that is used as an example in the WebRender post. I made a small modification, substituting the gradient effect with an “opacity” effect, which is more common and is a good stress test for every web rendering library. I also changed the layout to flex-box, because it's very common in modern web design. Here is how it looks:

[Animation: the modified test page]

Link to page here.

All tests were performed on Windows 10 on an i7-6700HQ @ 2.6GHz, 16GB RAM, NVIDIA GeForce GTX 960M, at 1080p. I measured only the rendering part in the browsers, using Chrome 61.0.3163.100 (stable) with GPU raster ON, Servo nightly from 14 Oct 2017, and LensVR alpha 0.6.

Chrome version 61.0.3163.100 results

The page definitely takes a toll on Chrome's rendering – it struggles to maintain 60 FPS, but it is significantly faster than in the original video. The reasons are probably additional improvements in their code and the fact that the laptop I'm testing on is significantly more powerful than the machine used in the original Mozilla video.

Let’s look at the performance chart:

[Figure: Chrome performance timeline]

I’m not sure why, but the rasterization always happens on one thread. Both raster and GPU tasks are quite heavy and a bottleneck in the page – they dominate the time needed to finish one frame.

On average for “painting” tasks I get ~5.3ms on the main thread with large spikes of 10+ms, ~20ms on raster tasks and ~20ms on the GPU process. Raster and GPU tasks seem to “flow” between frames and to dominate the frame-time.

Servo nightly (14 Oct 2017) results

Servo fares significantly better rendering-wise; unfortunately there are some visual artifacts. I think it's Servo's port for Windows that is still a bit shaky.

You can notice that Servo doesn't achieve 60 FPS either, but that seems to be due to the flex-box layout; we ignore that and look only at the rendering. The rendering part is measured as “Backend CPU time” by WebRender at ~6.36ms.

[Figure: Servo GPU timings]

LensVR alpha 0.6

[Figure: LensVR rendering profile]

Here is one frame of the performance capture zoomed in Chrome's profiling UI, which LensVR uses for its own profiling as well.

The rendering-related tasks are the “Paint” one on top, which interprets the Renoir commands, performs the batching and executes the graphics API calls, and the “RecordRendering” one on the far right, which actually walks the DOM elements and generates the Renoir commands.

The sum of both on average is ~2.6ms.

Summary

The following graphic shows the “linearized” time for all rendering-related work in a browser. While parallelism will shorten the time-to-frame, the overall linear time is a good indicator of the impact on battery life.

[Figure: linearized rendering time comparison]

Both WebRender and Renoir, with their novel approaches to rendering, have a clear advantage. LensVR is faster than WebRender, probably because of better command generation and API interaction code. I plan to do a deeper analysis in a follow-up post.


MSVC mutex is slower than you might expect

TLDR; The std::mutex implementation (and potentially other primitives) in the 2015 MSVC C++ runtime is significantly slower than the equivalent code written by hand. The reason is that the runtime is built with a Windows feature called “Control Flow Guard” and uses function pointers.

While profiling some code in our new HTML5 renderer, I noticed that the “mutex” implementation we use on Windows was unexpectedly slow compared to other platforms. On Windows we were using the std::mutex implementation in the MSVC 2015 runtime. I decided to dig through the code and see what was going on. Everything was running on Windows 10.

Looking at the code revealed that the implementation of the mutex depends on the version of Windows (Microsoft Visual Studio 14.0\VC\crt\src\stl\primitives.h). This makes good sense, because newer versions of Windows include faster synchronization primitives. On Vista+ the locking is implemented with a SRWLOCK, which is very fast.

At that point I thought that their implementation is pretty good and there might be an issue with my profiling, so I did a simple test. In an auxiliary application I ran 3 test suites, each locking a mutex 10000 times, performing a simple computation, unlocking, and measuring the time for all those operations (a sketch of the measurement follows the list of scenarios).

I measured 3 scenarios:

  • std::mutex
  • manual CRITICAL_SECTION
  • manual SRWLOCK
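
The measurement was along these lines – a reconstruction of the test, not the original benchmark code:

#include <windows.h>
#include <chrono>
#include <cstdio>
#include <mutex>

// Lock, do a tiny computation, unlock – 10000 times – and time the whole loop.
template <typename Lock, typename Unlock>
long long MeasureMicroseconds(Lock lock, Unlock unlock)
{
  volatile long long sink = 0;
  const auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 10000; ++i)
  {
    lock();
    sink += i; // the "simple computation" under the lock
    unlock();
  }
  const auto end = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}

int main()
{
  std::mutex mtx;
  CRITICAL_SECTION cs;
  InitializeCriticalSection(&cs);
  SRWLOCK srw = SRWLOCK_INIT;

  std::printf("std::mutex:       %lld us\n",
              MeasureMicroseconds([&] { mtx.lock(); }, [&] { mtx.unlock(); }));
  std::printf("CRITICAL_SECTION: %lld us\n",
              MeasureMicroseconds([&] { EnterCriticalSection(&cs); },
                                  [&] { LeaveCriticalSection(&cs); }));
  std::printf("SRWLOCK:          %lld us\n",
              MeasureMicroseconds([&] { AcquireSRWLockExclusive(&srw); },
                                  [&] { ReleaseSRWLockExclusive(&srw); }));

  DeleteCriticalSection(&cs);
  return 0;
}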

The SRWLOCK implementation was slightly faster than the CRITICAL_SECTION one (~5-10%), which is expected, but std::mutex was 30-40% slower than the rest. At that moment I was really surprised, because essentially the implementation is the same – just locking and unlocking the SRWLOCK. The std::mutex has some additional code – it gets the current thread id (I simulated this too and it's very fast – no noticeable change in perf.), it calls virtual methods, and the execution of the SRWLOCK functions happens through function pointers. None of those should incur such a large difference though.

So I dug into the assembly. It turned out that when the CRT calls __crtAcquireSRWLockExclusive (and all the other SRWLOCK methods), it doesn't go directly into the Windows Kernel32.dll! A lot of checks and code are executed between entering the method and actually arriving in AcquireSRWLockExclusive, which is where we want to go. The reason is a Windows feature called Control Flow Guard. Essentially this is a security feature that instructs the compiler to check every jump through a function pointer and validate that it is a valid target. This is possible because on modern Windows all function addresses are annotated and known to the loader. The feature prevents jumping “in the middle” of functions and makes hacking more difficult.

While CFG might be very important for some classes of applications, its performance impact on a game is unacceptably high. Unfortunately, when using the MSVC runtime you have no choice, because it's pre-built with the feature. While normal function calls are fine and bypass the mechanism, the std::mutex implementation falls flat there. In order to support multiple Windows versions, the authors have to rely on function pointers for routines that might be missing from older Windows versions.

It's unfortunate that the authors have carefully used different approaches on different Windows versions to squeeze out the best performance (the 2015 version is much, much better than older std::mutex implementations), but this compiler feature has essentially defeated their efforts.

Fixing this in your application is really trivial – creating a custom mutex implementation with SRWLOCK is less than 2 minutes of work. I didn't investigate other STL classes that might suffer from the same problem, but I assume there are some among the synchronization mechanisms (where, unfortunately, you want the most performance).
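
Something along these lines is enough – a minimal sketch that models the BasicLockable/Lockable interface, so std::lock_guard and std::unique_lock work with it directly:

#include <windows.h>

class SrwMutex
{
public:
  SrwMutex() { InitializeSRWLock(&m_Lock); }
  SrwMutex(const SrwMutex&) = delete;
  SrwMutex& operator=(const SrwMutex&) = delete;

  void lock()     { AcquireSRWLockExclusive(&m_Lock); }
  void unlock()   { ReleaseSRWLockExclusive(&m_Lock); }
  bool try_lock() { return TryAcquireSRWLockExclusive(&m_Lock) != 0; }

private:
  SRWLOCK m_Lock; // SRWLOCKs need no explicit destruction
};

// Usage: SrwMutex m; std::lock_guard<SrwMutex> guard(m);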

It'd be great if Microsoft provided more versions of the runtime libraries – especially “fast” ones without the security features. It would be even better if they provided all the source for the runtime and allowed developers to compile it themselves.


Voxels 0.6.0 released

I have just uploaded version 0.6.0 of the Voxels library.

Changes include:

  • New API for memory usage statistics
  • Substantially decreased run-time memory usage
  • Slight increase in the polygonization performance
  • Removed all internal Windows-specific dependencies

I’m now working on further improving the polygonization performance as well as some ports and integrations. More info coming soon…

Objective-C++ ARC gotchas

This post has been published also in Coherent Labs’s blog – the company I co-founded and work for.

Lately I've had to mix C++ and Objective-C pretty heavily (it's called Objective-C++, apparently). What I usually need is having Obj-C objects inside C++ ones, which with “Automatic Reference Counting” (ARC) should work out-of-the-box. It actually does – except when it doesn't!

In essence what the compiler probably does is just add ‘retain’ & ‘release’ calls in the proper places (constructors, destructors etc.). It is smart enough to recognize pointers to ARC objects and treat them as non-POD.

This is actually very cool and simplifies the interaction a lot. Alas there are some problems when you try to do some more ‘exotic’ stuff.

A pretty standard C++ way to use your own memory management routines is to allocate some memory with your custom allocator and then use placement new to construct an object in the fresh memory. Destruction goes the other way around – call the destructor explicitly and deallocate your memory.

#include <new> // needed for placement new

class Test {};
int main()
{
  // create manually
  Test *t = static_cast<Test*>(operator new(sizeof(Test)));
  new(t) Test;

  // destroy manually
  t->~Test();
  operator delete(t);

  return 0;
}

In the following code, as you might expect, ‘dealloc’ is called as soon as the C++ object is destroyed – so no leak happens:

#include <iostream>
#import <Foundation/Foundation.h>

@interface TestObj : NSObject
@end

@implementation TestObj

-(id)init {
  self = [super init];
  NSLog(@"init!");
  return self;
}

-(void)dealloc {
  NSLog(@"dealloc!");
}

@end

class Parent
{
public:
  /*virtual */ void aFunc() {} // Uncomment 'virtual' for reference leak!
};

class TestClass : public Parent
{
public:
  TestClass() : m_Obj([[TestObj alloc] init])
  {}

private:
  TestObj* m_Obj;
};

int main(int argc, const char * argv[])
{
  TestClass* cl = new TestClass;

  cl->~TestClass();
  operator delete(cl);

  //delete cl; // Uncomment me and comment the manual mem. management to never leak!

  std::cout << "End!" << std::endl;

  return 0;
}

However, if you uncomment the ‘virtual’ keyword and hence make the hierarchy polymorphic, the ‘dealloc’ method will not be called. The compiler in this case does not create the non-trivial destructor required! If you substitute the manual memory management with the delete operator, then the destructor is synthesized and ‘dealloc’ gets called. The behavior is also prevented if your class is non-POD in C++ terms.

Not having a virtual destructor in a polymorphic hierarchy is arguably very bad; however, breaking the ARC promise is even worse. I admit that stumbling upon this issue is not very easy, because it requires a lot of things to happen, but the leaking references it introduces are still serious enough to prompt me to write about it.

I haven't looked at clang's source about this issue and I can't find a rationale for it in the docs, so I think it's an oversight and I can only speculate why it happens. The version of the compiler that I currently work with and saw the problem in is: “Apple clang version 4.1 (tags/Apple/clang-421.11.66)”.

All that said, if you follow the basic C++ guidelines of having virtual destructors in your polymorphic hierarchies you should be fine when you try to mix C++ and Objective-C.
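
Concretely, following that guideline, the Parent class from the example above would simply gain a virtual destructor, which (per the advice above) avoids the issue:

class Parent
{
public:
  // Declaring a virtual destructor in the polymorphic base keeps the destructor
  // chain intact, so the ARC release for m_Obj in TestClass still runs.
  virtual ~Parent() {}
  virtual void aFunc() {}
};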

Dump References VS add-in

This post has been published also in Coherent Labs’s blog – the company I co-founded and work for.

Motivation

In our current products we have to deal with very big build projects, some of which are divided into more than 200 libraries. This poses huge challenges on the build process and their overall management.

On Windows we build everything with Visual Studio and msbuild; however, I have often stumbled upon two very annoying quirks in them:

  1. If you build a solution through msbuild and project A in it depends on another project B that is NOT in that solution, project B will be compiled and linked in its default configuration. Usually this is the Debug configuration. This might lead to some Debug libraries being linked into a Release project, or some other config inconsistency.
  2. If project A in a solution references project B, also in the solution, but with a GUID that doesn't match B's actual GUID, then Visual Studio fails any compilation attempt with the cryptic “The project file ‘ ’ has been renamed or is no longer in the solution.”, giving no actual info on what's wrong. There is a bug on MSDN about this. It's unclear if and when it'll be fixed, and most solutions suggested by users involve removing all projects, adding them back one-by-one and testing which one fails. This is unfeasible in large solutions.

The second problem was very serious for us. We use premake to generate our projects on different platforms and often change dependencies, upgrade third-party libraries, etc. GUIDs, however, are only used in the VS projects, and they often get out of sync.

We knew we needed an automatic way to check for those issues. One solution would have been a program that just walks and parses all VS projects and issues the warnings we are interested in. I opted, however, to create a VS add-in, because the data we are interested in is already parsed and readily available in the IDE itself.

Dump References Add-In

Dump References is an add-in for Visual Studio that prints the references of all projects loaded in the solution, checks their GUIDs and prints eventual issues. It is intended for C/C++ solutions.

The add-in is fairly simple – it walks all projects in the solution and their references and prints:
– all projects currently in the solution
– all references that each project has
– all unique references in the solution
– possible issues: referenced projects missing from the solution; projects available in the solution but referenced by other projects with another GUID.

To dump the diagnostic just right-click on your solution in the Solution Explorer and click “Dump References”.

The results will be available in the Output window in the Dump References pane.

The add-in’s source is available on GitHub and licensed under the MIT license, so feel free to use it.