Coroutines & Concurrency in C++

The news I’m most excited about in C++20 is the the inclusion of coroutines. While there are other good steps forward, coroutines can significantly simplify the development of concurrent systems and help C++ tackle some of the complexity that is associated with it.

I decided to try and re-imagine some of the systems I’ve worked on in terms of C++ coroutines and see if they will become easier to reason about. To do that I’ll explore a grossly simplified model of loading an HTML page. In this model we only have to do 3 things:

  1. Load an HTML document on any thread
  2. Load a CSS document on any thread
  3. Create the Document on a thread selected as “main” – happens after the HTML and CSS file are loaded.

As you can see the example is simple, but contains some key nuggets – we have both independent and dependent tasks. Also a certain task – the Document creation should happen on a predetermined thread. The need to run some tasks on specific threads is common especially when interfacing with third-party libraries like graphic API, scripting VMs etc.

Task-based approach

The classic model of creating an async I/O system would be posting tasks on a thread pool and relying on callbacks to get notified when a specific task is complete. This is how the popular libuv library works. Callbacks unfortunately make the code difficult to follow and require manually passing resources across tasks, which complicates their management.

In the “cohtml” HTML rendering engine we developed at Coherent Labs, we built a task system where continuations have to be explicitly specified by posting a new task. The code itself is simplified by the use of lambdas. A lot of details on the task system can be found here.

Let’s get down to the code and see the  approach in action.


static const std::filesystem::path testHTML{"../test_res/index.html"};
static const std::filesystem::path testCSS{"../test_res/style.css"};
using DataBuffer = std::unique_ptr<std::uint8_t[]>;
struct Document
{
DataBuffer HTMLData;
DataBuffer CSSData;
std::mutex DataMut;
void MakeDOM()
{
std::lock_guard<std::mutex> l{DataMut};
assert(HTMLData && CSSData);
// Pretend we do something useful here…
std::cout << "Document created" << std::endl;
}
};
void StartCreateDocument()
{
std::shared_ptr<Document> result = std::make_shared<Document>();
Enqueue(Work, [=]()
{
std::lock_guard<std::mutex> l{result->DataMut};
result->HTMLData = std::move(ReadFile(testHTML));
OnDocumentResourceLoaded(result);
});
Enqueue(Work, [=]()
{
std::lock_guard<std::mutex> l{result->DataMut};
result->CSSData = std::move(ReadFile(testCSS));
OnDocumentResourceLoaded(result);
});
}
void OnDocumentResourcesLoaded(std::shared_ptr<Document>& result)
{
if (result->HTMLData && result->CSSData)
{
Enqueue(Main, [=]()
{
result->MakeDOM();
});
}
}

Note that the code is simplified and shortened for the sake of clarity .The actual production interfaces are more complicated but are beyond the scope of this post.
The code is fairly self-explanatory. The new document loading does the 3 steps we mentioned earlier. It posts 2 tasks on any worker thread through the Enqueue method and they will load the text files. The cool thing is that the usage of lambdas does simplify readability – we see the logic that will be executed close to where the operation is initiated.

There are some downsides as well. The sample contains no logic to execute the MakeDOM method (the continuation), when both the HTML and CSS file are loaded. We solve this by calling a function OnDocumentResourcesLoaded that checks that everything required is loaded and then posts on the Main thread the MakeDOM method.  We also have to add synchronization to our Document fields, because we have to make sure the OnDocumentResourcesLoaded sees the changes applied by all thread.

Another downside is that tasks in this design can’t return a value – they only execute logic. We pass around the resulting Document as a shared pointer to record the new data.

Coroutines approach

Coroutines in C++ offer the basis to build more complicated systems around them. This is inline with the philosophy of the language to provide the foundations and let the developers build libraries on top of them (not always followed unfortunately). Other languages offering coroutines like Go and Kotlin bundle a runtime for scheduling and executing them.

This design choice of the standardization committee means that we can’t easily delve into experimenting with coroutines but would have to write a lot of code to make them truly usable. I decided to leverage the excellent cppcoro library by Lewis Baker. The library is still missing some core features – especially on Linux, but nevertheless is a great example on how we can use coroutines for concurrent programming.

I re-wrote the previous example with cppcoro and this is the result:


using namespace cppcoro;
static const std::filesystem::path testCSS{"../test_res/style.css"};
static const std::filesystem::path testHTML{"../test_res/index.html"};
using DataBuffer = std::unique_ptr<std::uint8_t[]>;
task<DataBuffer> ReadFile(const std::filesystem::path& p)
{
DataBuffer result = std::make_unique<std::uint8_t[]>(len);
//Actual file loading happens here…
co_return result;
}
struct Document
{
DataBuffer HTMLData;
DataBuffer CSSData;
void MakeDOM()
{
std::cout << "Document created" << std::endl;
}
};
task<Document> CreateDocument(static_thread_pool& pool)
{
std::cout << "Creating document on thread " << std::this_thread::get_id() << std::endl;
auto readHTML = ReadFile(testHTML);
auto readCSS = ReadFile(testCSS);
Document result;
std::tie(result.HTMLData, result.CSSData) = std::move(co_await when_all(
schedule_on(pool, std::move(readHTML)),
schedule_on(pool, std::move(readCSS))));
result.MakeDOM();
co_return result;
}
Document BuildDocument(static_thread_pool& pool)
{
return sync_wait(CreateDocument(pool));
}
int main()
{
static_thread_pool pool;
auto document = BuildDocument(pool);
return 0;
}

I particularly like the CreateDocumentfunction where is the gist of the logic. The other functions are helpers. static_thread_poolis a thread pool managed by cppcoro that allows us to run coroutines in parallel.

Let’s look in more detail at CreateDocument. In cppcoro, the coroutines are not started until they are awaited, so the calls to ReadFileonly create cppcoro::tasks that contain the coroutine. When we co_await them on lines 34-36 they are scheduled and started on the thread pool. When both files are read, the CreateDocument coroutine is resumed on the Main thread. All the logic is nicely held within just one function. Overall the code is easier to follow and resembles a linear function – very nice.

There is an inefficiency in the design as well unfortunately. While we are loading the files, the Main thread just waits for them to be ready. A more efficient system would actually schedule one of the file loading coroutines on the Main thread and actively cooperate to get the needed result. The Go language scheduler implements such cooperation. This is doable in C++ and requires a more complicated scheduler.

I’m very happy with the final result of the experiment. C++ coroutines did simplify my sample code flow. Unfortunately wide C++ coroutines adoption requires good libraries and they are still few or incomplete. I hope this will soon change.

Advertisement

Generic memory allocator for C++ (part 3)

The post continues the series about how the coherent-rpmalloc generic C++ memory allocator works. If you haven’t read part 1 and part 2, please do before continuing, as they are an integral part in understanding the implementations described below.

The algorithms in part 2 are relatively simple, but have some tricky moments. I’m going to form their descriptions as Q&A.

Q: How does a simple pointer in void rpfree(void* pointer) know which span it belongs to, so that the algorithm can get to the meta-data needed to handle the de-allocation?

A: This is one of the keys to the great performance of the allocator and a very clever thing that Mattias Jansson, the original rpmalloc creator, did. In his implementation every span has to be 64KB aligned. This means that from every pointer allocated by the library, we can get the span it belongs to by just masking a bunch of bits. This is extremely fast. Some allocators achieve this by putting meta-data in front of every allocation, but that increases significantly the memory overhead. The only down-side is that it requires the embedder to always pass to the library 64KB-aligned memory. Rpmalloc can get away with this, because it directly uses the OS to get memory, VirtualAlloc, already returns 64KB-aligned addresses, while on POSIX systems it’s also relatively easy to map memory that way.

This is very inconvenient for us as we would put a heavy burden to games that use Hummingbird, so we had to get rid of the requirement. To put in perspective, if we request from the client 64KB, with a 64KB alignment, at worst the embedding developer will have to get 128KB just to fulfill the alignment requirement – a huge waste.

Q: How does coherent-rpmalloc remove the need for 64KB-aligned memory, while keeping the performance?

A: Basically the problem boils down to:

  • From an allocated pointer (a block), how do we get the address of the span that allocated it, so that we can reach the meta-data and de-allocate it?

At first I tried different techniques to remove the requirement. The first one was looking at other implementations:

  • Some allocators put meta-data in front of every block. This would waste a lot of memory for small allocations and pollute the CPU cache with unnecessary data when used.
  • tcmalloc, which was rpmalloc’s inspiration, uses a radix-tree to associate each allocation with a pointer to its’ owning structure. However for 64-bit pointers that radix-tree can potentially grow up to ~4MB. Hummingbird as a whole seldom uses more than 10MB, so wasting 4MB just on allocator bookkeeping was out of the question.
  • I rolled my own sorted skip-list that associates each allocation to the span it did it. A skip-list can be made lock-free, but my tests even on a single thread showed a significant slow-down compared to bit-masking. While the search has very good big-O complexity, the performance was poor, because very often involved CPU cache misses. While it was possible to squeeze my skip-list in some contiguous memory piece to improve cache coherency, I ultimately decided that the increase in complexity was too large and the results not guaranteed.

In the end I decided that keeping the span-alignment trick is the best thing, but I had to amortize the risk of wasted memory.

What if instead of requesting from the user spans, like rpmalloc does, we request an array of spans? This is why segments got introduced.

Our spans are 8KB and need 8KB alignment for the bit-masking to work.

  • If we request individual spans, like rpmalloc does, then 8KB with 8KB alignment would at worst need a 16KB “real” allocation to fulfill the alignement – a 50% waste
  • If we request 32 spans from the user, that is 256KB with an 8KB alignment, then the possible waste is only ~3.03%. Much better.

This is how segments were born. We only request and return to the user whole segments of 256KB.

This has multiple good upsides:

  • Drastically reduces the risk of memory waste
  • Reduces the communication between the library and the embedder – if their allocator is slow, it’ll have almost no effect over Hummingbird.
  • Slightly improves cache locality

And some downsides:

  • Segments have to be managed and be performant
  • There is a risk of memory waste in degenerate cases. As we only return whole segments to the user, we need to have a segment completely free before we return it, even a single allocation can keep it alive. In practice our workload does not hit such cases, but is something to keep in mind.

segments.png

Q: How are span caches & segments handled?

A: rpmalloc has multiple levels of span caches. The per-thread ones are pretty trivial because they are touched only by a single thread that owns them, so no synchronization is needed to add/remove spans from the local cache.

The global span cache is protected by a spin lock allowing adding/removing spans from multiple threads.

Segments are always shared across multiple threads. All segments are put in a linked list and each segment has 32 spans inside. The algorithm to “get” a fresh span is relatively simple:

  • Walk the list of segments and look for a free span:
    • Each segment has an “unsigned int”, which is used as a bit-mask and indicates, which “slot” is free of the 32 spans in the segment. Updating it in a lock-free way is very easy, just do a CAS operation on the word by trying to flip the bit we want to mark as used or free.
  • If there are no free spans in any segment
    • Allocate a new segment by requesting memory from the user
    • Adding/removing segments from the linked list is done through a readers-writer lock

As segments are iterated much more often from multiple threads than they are added/removed, it is really important to have very good read performance on the list. I implemented it with a readers-writer lock.

A single “unsigned int” is used as the lock primitive – one bit is reserved for the “write” lock, while all the rest are used as counter that holds the number of readers currently holding the lock.

  • When a reader wants to enter the locked section (iterate the linked list), it tries to CAS the unsigned lock to [0 | READERS + 1], where readers is the lock value with the write bit masked.
  • Releasing the read lock will just try to CAS to [READERS – 1]
  • The writer will try to CAS the lock to “1”, which means the lock is taken for writing. The operation requires a reference value of “0”, which means that nobody is reading or writing the linked-list.
  • Releasing the write lock just involves setting the lock to “0”

rwlock.png

Q: How do you support threads whose lifetime can’t control, when there is thread-local data involved?

A: The original rpmalloc has two functions “rpmalloc_thread_initialize” and “rpmalloc_thread_finalize” that need to be called on each thread before using the allocator and after you’re done with it. They set and later clear the thread-local heap object.

Unfortunately for our middleware software this solution was no good. Some game engines – for instance Unreal Engine will move their workload across multiple new threads. The prime example is their rendering thread, which actually changes from time to time without notifying plugins. If we were to use the default rpmalloc way – we’d have the risk of “leaking” heaps. If UE4 destroys a thread where we have initialized a heap and we are not notified, that heap will leak.

Coherent-rpmalloc solves this by having a pool of heaps that are re-assigned to threads working with the library dynamically. All entry points in the Hummingbird library are already marked within our code. When the user calls any method, if the thread-local data is not set for that thread, then a heap is assigned to it. When code execution leaves the library, that heap is again marked as reusable. We have a fixed amount of threads that can run simultaneously, so there is no risk of leaks. Each thread holds a preferred heap, which is used so that it gets all the time the same heap when it enters the library. Only when a new thread comes into play, will it eventually take a heap of another thread. The old thread has probably been destroyed outside the knowledge of the library.

Q: Why did you move the library to C++?

A: The original rpmalloc uses some inline assembly and platform-specific macros for thread-local data and atomic operations. I really wanted to move this to standard C/C++, because I also needed to support additional platforms like game consoles. It would’ve been great to move to C11, but it’s not supported on all compilers we target, so I went for C++11. All atomic operations use standard primitives, while thread-local storage uses pthread on platforms that have no thread_local support.

This pretty much sums-up the most interesting parts of the allocator we crafted. If you have additional questions or suggestions, please don’t hesitate to comment.

Generic memory allocator for C++ (part 2)

In my previous post I highlighted the requirements we have for a generic memory allocator within the Hummingbird game UI middleware. To quickly recap, we want memory allocator (malloc/free) that:

  • Is fast – especially in a multi-threaded environment
  • Wastes little memory
  • Can be easily modified
  • Does not reach out for the OS

The current result of our work is a fast multi-threaded, low-overhead, embeddable, relatively simple (~1800 lines of C++), optimized for small allocations, rpmalloc-based, permissively licensed allocator. The code is available on github under the name coherent-rpmalloc and evolving.

In this and the next post I’m going to describe the implementation details and how the allocator is different from the initial rpmalloc implementation.

Overall goals

Coherent-rpmalloc like the original rpmalloc relies heavily on caches and thread-local pointers. The key for multithreaded performance is reducing the contention between threads. The allocator achieves this by keeping as much data as possible on a per-thread basis and acquiring shared resources very seldom. It is tuned towards applications that allocate relatively small amounts of data – in the tens of megabytes. This is different than the original rpmalloc, which handles well even larger heaps. The tuning can be adjusted by changing the sizes of different memory sizes & caches.

A lot of information on the design can also be found in the documentation of the original rpmalloc, there are however differences with coherent-rpmalloc.

Memory subdivision & lingo

The memory in the allocator is handled through 4 major data-types, which form a hierarchy:

 

  • Blocks – each allocation of a certain size falls in a size class. All allocations of that size class will belong together – for instance all the ones that are between 1 and 16 bytes, all the ones between 16 and 32 bytes etc. A memory request will be rounded to the max of the size class it belongs to. The block is such an allocation – a pointer the allocator will return to the user & that it can later free. Blocks have different sizes, depending on their size classes – they represent a single small/medium allocation.
  • Pages
  • Spans
  • Segments

 

 

 

By far the most important structures are spans and segments, blocks and pages are just useful memory subdivisions, which however carry no code logic – think of them as size slices.

mem_alloc_spans_graphic.png

Allocations are subdivided in different buckets by size – small, medium, large and extra-large. All the complex handling and machinery in the library applies to small, medium & large sized allocations. In our case I settled for everything up-to 8144 bytes. All allocations larger than that are classified as extra-large. They will be directly handed over to the embedder.

Hummingbird requests bigger than 8K chunks are almost always for other internal allocators or for large resources (images, fonts etc.). They happen relatively seldom – only ~1% of the memory requests are so large, so their performance impact is overall negligible. I also wanted the library to immediately return such large chunks to the embedder application instead of holding it in internal caches.

Anatomy of an allocation & de-allocation

When the user calls the void* rpmalloc(size_t sz) method, the library has to fulfill the allocation. This happens in multiple steps:

  • The allocations size-class is determined. This tells us where to look for free blocks large enough to fulfill the allocation.
  • The current threads heap is loaded from thread-local storage. Each thread using the library has a local heap (a struct) that contains all the data necessary to allocate and deallocate memory and almost always it can fulfill a memory operation by itself.
  • The heap has an active span for each size class – the active span is just the latest span that was used for allocating. If it has some free blocks – the first free block will be used.
  • If there are no free blocks, the library will try to:
    • De-allocate some blocks – mark them as free. Blocks freed from other threads are not immediately marked, but are put in a list. The owning thread (the one that allocated them from its own heap) will mark them on the next allocation.
    • Look for an empty span in the local thread cache
    • Look for an empty span in the global cache
  • If no span was found with empty blocks, a new span is requested:
    • Look for a segment that has empty spans and take one
    • If there are no segments with free spans
      • Request a new segment allocation (to external embedder)
      • Return the first span and put the segment in the list of segments

The de-allocation works in pretty much the opposite way:

  • From the pointer – find the owning span. This is an interesting step and very performance-sensitive, I’ll go into the details in the next post.
  • Mark the block in the span as empty. Empty blocks are put in a free-list within the span, where the memory of the block is used as the “next” id. As the memory is now garbage we can safely re-use it for internal data.
  • If the span is completely empty (all blocks are free)
    • Put in caches – depending on configurable ratios, the span will go to the local or global cache or will need to be freed altogether
  • If the span has to be freed altogether
    • Mark in the owning segment that the span is free
  • If all spans in the segment are free, return the whole segment to the embedder, actually freeing the memory.

The next post in the series will describe in the most interesting implementation details of the library and how most of the operations were implemented.