String interning

When we started designing the architecture of Hummingbird, we knew that an HTML rendering engine has to rely heavily on strings because of the nature of CSS style solving. CSS element ids and classes are defined by the standard as strings, and cascade resolution relies on string comparisons: all elements that have the class “my_class” will match a selector for that class.

String operations in C++ are notoriously slow. When solving the styles of a large page you might have to do hundreds or even thousands of string comparisons if you use a vanilla string implementation. These comparisons are slow because most string implementations allocate their memory on the heap and store an internal pointer to the actual “char” array, and the comparison has to walk the whole string. Cache misses will kill the performance.

In Hummingbird we can’t afford such a naive implementation, so we looked for alternatives for representing the “id” and “classes” of HTML elements. The most important requirement is quick comparisons, which are vital to the style solving process.

In a game engine, where all the content is well known when the game is built, many developers rely on “perfect hashes” of their strings, effectively substituting them everywhere with ids. This is a great solution when you know that the whole pool of strings you’ll encounter is fixed. Unfortunately, such a solution cannot work in Hummingbird, where the user can set any id and class in her styles and even generate new ones through JavaScript. Everything has to be dynamic.

We went for a well-known solution called “string interning”, which in essence is a runtime perfect hash.

In our implementation:

  • All strings of a View (basically an HTML page + scripts) are kept in contiguous memory regions.
  • The addresses of all currently interned strings are stored in a hash set.
  • When we encounter a string to be interned, we check the hash set. If the string is already in some buffer, we return its address; otherwise we copy the characters into a free region of our buffers, put the address in the set and return the newly created interned string.

The InternedString object that we use is simple: it contains just a const char* pointer. We guarantee that interned strings won’t move in memory.
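In rough strokes, the interning lookup can be sketched like this. This is a simplified, single-threaded illustration with invented names (`InternContext`, and an std::deque of strings standing in for the contiguous buffers) – not the actual Hummingbird implementation:

```cpp
#include <cassert>
#include <cstring>
#include <deque>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_set>

struct InternedString
{
    const char* Data; // stable - interned strings never move in memory

    bool operator==(InternedString other) const { return Data == other.Data; }
};

// Hypothetical interning context for illustration only.
class InternContext
{
public:
    InternedString Intern(const char* str)
    {
        // Check whether an equal string is already interned.
        auto it = m_Interned.find(str);
        if (it != m_Interned.end())
            return InternedString{ *it };

        // Copy the characters into storage whose elements never move
        // (std::deque stands in for the real buffers here)...
        m_Storage.emplace_back(str);
        const char* stable = m_Storage.back().c_str();
        // ...and remember the address in the hash set.
        m_Interned.insert(stable);
        return InternedString{ stable };
    }

private:
    // Hash/compare the pointed-to characters, not the pointers themselves.
    struct Hash
    {
        size_t operator()(const char* s) const
        {
            return std::hash<std::string_view>{}(s);
        }
    };
    struct Equal
    {
        bool operator()(const char* a, const char* b) const
        {
            return std::strcmp(a, b) == 0;
        }
    };

    std::unordered_set<const char*, Hash, Equal> m_Interned;
    std::deque<std::string> m_Storage; // element addresses are stable
};
```

After interning, equality is just the pointer comparison in InternedString::operator==.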

This representation is very efficient:

  • Each interned string consumes memory just once, no matter how many copies of it exist
  • Two interned strings are the same IFF they have the same address. This makes comparisons super-fast as we only compare pointers!
  • We still have easy access to the contents of the string via the const char*

This representation gives a great performance boost. Our implementation does not try to reclaim strings that are no longer used – in our usage scenarios it is extremely unlikely for interned strings to take too much memory.

Hummingbird is a heavily parallel library, so trying to intern a string from multiple threads is a very real scenario. At the moment the “interning context” is locked. Each View has its own “interning context”. So instead of having a system-wide interning context that risks taking too much memory and becoming a locking bottleneck, we have View-wide ones that reduce contention and practically eliminate the risk of interned strings taking too much memory even during long sessions.

The results are great with ~30% improved style solving performance compared to a version where simple strings are used.


Open-sourcing Voxels

Since I wrote Voxels in 2014, I’ve had interest from developers working on different projects who found it would help them achieve their goals. The previous licensing made this difficult.

While it has always been my intention to open source the library, I’m only now getting to it. I wanted to take some time to clean up the code and document it better, but kept postponing. My day-to-day work at Coherent Labs takes most of my time, and Voxels was always left behind.

Now I’m finally setting it free on GitHub. There are no significant changes since the first release in 2014, and honestly I have no current plans to continue developing the library due to my other work.

The library is licensed under the very permissive 3-clause BSD license, so feel free to modify it and incorporate it into your projects. If you plan to use the library, drop me an e-mail – I’m always happy to see what developers achieve with volume surfaces.

MSVC mutex is slower than you might expect

TL;DR: The 2015 MSVC C++ runtime’s std::mutex implementation (and potentially others) is significantly slower than the equivalent code written by hand. The reason is that the runtime is built with a Windows feature called “Control Flow Guard” and uses function pointers.

While profiling some code in our new HTML5 renderer, I noticed that the “mutex” implementation we use on Windows was unexpectedly slow compared to other platforms. On Windows we were using the std::mutex implementation in the MSVC 2015 runtime. I decided to dig through the code and see what was going on. Everything was running on Windows 10.

Looking at the code revealed that the implementation of the mutex depends on the version of Windows (Microsoft Visual Studio 14.0\VC\crt\src\stl\primitives.h). This makes good sense, because newer versions of Windows include faster synchronization primitives. On Vista+ the locking is implemented with a SRWLOCK, which is very fast.

At that point I thought that their implementation is pretty good and there might be an issue with my profiling, so I did a simple test. In an auxiliary application I ran 3 test-suites, each locking a mutex 10000 times, performing a simple computation, unlocking, and measuring the time for all those operations.
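The measurement loop looked roughly like the sketch below (a simplified stand-alone version for the std::mutex case; TimeLockedWork is a name invented for this example, and the other scenarios just swap in the SRWLOCK or CRITICAL_SECTION calls):

```cpp
#include <cassert>
#include <chrono>
#include <mutex>

// Time N lock/compute/unlock iterations and return milliseconds.
double TimeLockedWork(unsigned iterations)
{
    std::mutex mtx;
    volatile unsigned counter = 0; // volatile keeps the "work" from being optimized away

    auto start = std::chrono::high_resolution_clock::now();
    for (unsigned i = 0; i < iterations; ++i)
    {
        mtx.lock();
        counter = counter + 1; // some simple computation under the lock
        mtx.unlock();
    }
    auto end = std::chrono::high_resolution_clock::now();

    return std::chrono::duration<double, std::milli>(end - start).count();
}
```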

I measured 3 scenarios:

  • std::mutex
  • manual SRWLOCK
  • manual CRITICAL_SECTION

The SRWLOCK implementation was slightly faster than the CRITICAL_SECTION (~5-10%), which is expected, but the std::mutex was 30-40% slower than the rest. At that moment I was really surprised, because essentially the implementation was the same – just locking and unlocking the SRWLOCK. The std::mutex has some additional code – it gets the current thread id (I simulated this too and it’s very fast – no noticeable change in perf.), it calls virtual methods, and the execution of the SRWLOCK functions happens through function pointers. None of those should incur such a large difference, though.

So I dug into the assembly. It turned out that when the CRT calls __crtAcquireSRWLockExclusive (and all the other SRWLOCK methods), the call doesn’t go directly into the Windows Kernel32.dll! A lot of checks and code are executed between entering the method and actually arriving in AcquireSRWLockExclusive, which is where we want to go. The reason is a Windows feature called Control Flow Guard. Essentially this is a security feature that instructs the compiler to check every jump through a function pointer and validate that it is a valid target. This is possible because on modern Windows, all function addresses are annotated and known to the loader. The feature prevents jumping “in the middle” of functions and makes hacking more difficult.

While CFG might be very important for some classes of applications, its performance impact on a game is unacceptably high. Unfortunately, when using the MSVC runtime you have no choice, because it’s pre-built with the feature. While normal function calls are fine and bypass the mechanism, the std::mutex implementation falls flat there. In order to support multiple Windows versions, the authors have to rely on function pointers for routines that might be missing from older Windows versions.

It’s unfortunate that the authors carefully used different approaches on different Windows versions to squeeze out the best performance (the 2015 version is much, much better than older std::mutex implementations), yet this compiler feature has essentially defeated their efforts.

Fixing this in your application is really trivial – creating a custom mutex implementation with SRWLOCK is less than two minutes of work. I didn’t investigate other STL classes that might suffer from the same problem, but I assume there are more, especially in the synchronization mechanisms (where, unfortunately, you want the most performance).
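Such a hand-rolled mutex can look like the sketch below. On Windows it wraps an SRWLOCK directly, so the lock/unlock calls go straight into Kernel32 without the CFG-checked function-pointer indirection; the non-Windows branch is just a std::mutex stand-in so the sketch stays portable:

```cpp
#include <cassert>

#ifdef _WIN32
#include <windows.h>

// Minimal SRWLOCK-based mutex; direct calls, no function pointers.
class SRWMutex
{
public:
    SRWMutex() { ::InitializeSRWLock(&m_Lock); }
    SRWMutex(const SRWMutex&) = delete;
    SRWMutex& operator=(const SRWMutex&) = delete;

    void lock()   { ::AcquireSRWLockExclusive(&m_Lock); }
    void unlock() { ::ReleaseSRWLockExclusive(&m_Lock); }

private:
    SRWLOCK m_Lock;
};
#else
#include <mutex>

// Portable stand-in so the example compiles off Windows.
class SRWMutex
{
public:
    void lock()   { m_Lock.lock(); }
    void unlock() { m_Lock.unlock(); }

private:
    std::mutex m_Lock;
};
#endif
```

Since the class satisfies the BasicLockable requirements, it also works with std::lock_guard.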

It’d be great if Microsoft provided more versions of the runtime libraries – especially “fast” ones without the security features. Even better would be providing all the source for the runtime and letting developers compile it themselves.


Unique values in C++

I was profiling a particular piece of code in one of our products at Coherent Labs and noticed some peculiar results. Basically, each frame we have to keep track of a certain amount of elements that have changed, and at the end of the frame we have to do an operation on all of them. The elements are tracked with pointers that are put in an std::unordered_set, because different logic paths might flag the same element more than once.

I noticed that both adding many elements in the set and then walking it took more time than I’d like. Putting aside the operations on the pointers themselves (and the need to achieve better cache coherency in that domain), I concentrated on the set itself.

The VS 2015 implementation seems to keep the different “buckets” of the set in a list. At that point I was already pretty worried: I saw in the profiler that substantial time was going into an std::list iterator – not exactly the high-performance code I’d like.

I decided to remove some indirection and improve cache hits by replacing the unordered_set with a vector. I blindly push_back elements into the vector. The “uniqueness” of the pointers is required only when we operate on them, so when needed I run an std::sort on the vector followed by an erase with std::unique. This gives me all the elements without duplicates. I have to note that when pushing elements, duplicates happen very rarely. The results I got after changing the std::unordered_set to an std::vector with sort & erase were promising, so I made a larger test.
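The sort & erase dance is just a few lines (a sketch with ints standing in for the element pointers; MakeUnique is a name invented for the example):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// "Push blindly, deduplicate on demand": sort so that duplicates become
// adjacent, then std::unique + erase removes them.
void MakeUnique(std::vector<int>& elements)
{
    std::sort(elements.begin(), elements.end());
    elements.erase(std::unique(elements.begin(), elements.end()),
                   elements.end());
}
```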

You can see the whole test here:

I test 3 cases – an std::unordered_set, an std::vector and Google’s dense_hash_set, which is regarded as very fast. A simple sqrt on each element simulates some “work” done on it. I also allow a “duplicates” multiplier. The results are interesting.

No duplicates:

Time for set: 236.609
Time for vector: 133.017
Time for google set: 234.986

2 duplicates factor:

Time for set: 133.731
Time for vector: 129.611
Time for google set: 131.001

4 duplicates factor:

Time for set: 71.606
Time for vector: 116.688
Time for google set: 77.66

When duplicates are rare, the std::vector substantially outperforms both sets. Things however quickly degrade if we introduce many duplicates. This makes sense, as duplicates put much more pressure on the std::sort and std::unique calls. At that moment I had an idea: in my use-case the work done on each element is relatively light, and doing it twice on an element has no side effects – f(f(x)) == f(x) in this case. I have little duplication anyway, so why not ditch the uniqueness requirement altogether?

I added the “UNIQUE_VEC” define – it controls whether we do the sort & erase on the vector – and re-ran the tests.

No duplicates:

Time for set: 248.518
Time for vector: 66.078
Time for google set: 234.395

2 duplicates factor:

Time for set: 125.65
Time for vector: 58.664
Time for google set: 138.612

4 duplicates factor:

Time for set: 71.207
Time for vector: 51.306
Time for google set: 77.191

Now the vector in the no duplicates case is really the clear winner. Clearly here we are much more dependent on how ‘heavy’ the work-load per-element is, but in duplication factors close to 1 that should be irrelevant, especially when we have many elements.

In my experience duplication is actually rare when we need it in game code. In that case it is almost always better to use an array (vector) and do the uniqueness yourself. Even better if the operation is fast and the implementation supports it – ditch the uniqueness altogether.

Temporary allocations in C++

Heap allocations in C++ are notoriously expensive when they hit the generic allocator. Many high-performance software projects employ custom allocation strategies that fit a scenario and speed up allocations. A great overview of allocation strategies common in games is given by Stefan Reinalter in his Molecular musings blog.

In this post I’ll specifically talk about temporary allocations and how to optimize STL allocations. STL is difficult to use right, as its types do many heap allocations and the allocator is a template parameter, which makes types with different allocators incompatible. C++11 and C++14, however, have added many new features that in my opinion make STL a viable option again, even for performance-critical code.

Temporary objects usually survive only a small scope within the execution of the program, but this scope (function) might be called many times. If objects in it always allocate on the heap, we might suffer a substantial performance hit.

Imagine this function:

void OnTaskComplete()
{
  std::vector<Task*> tasksToSchedule;
  for (auto& task : m_WaitingTasks)
  {
    if (task->CanBeScheduled())
    {
      tasksToSchedule.push_back(task);
      // Do some more stuff
    }
  }
  for (auto& task : tasksToSchedule)
  {
    // Schedule task
  }
}

The function in the example might get called after a task in the job system completes; it checks whether new tasks should be scheduled. It might get called hundreds of times in a frame. The vanilla vector there might hurt performance, as it will do a heap allocation every time at least one task has to be scheduled.

We know that the lifetime of the object is really narrow – it is temporary to the scope of the function. Let’s make it a lot faster.

We will employ a custom allocator for temporary objects that outperforms the standard heap allocations by 7x-10x.

A “linear allocator” is one that pre-allocates a chunk of memory and, each time we request from it, gives back a piece of that chunk. We don’t manage anything except where we currently are in the buffer. We can however “reset” the allocator, which effectively allows us to reuse the memory. It is common in games to use such an allocator for all allocations within a frame and reset it at the beginning of the next one. This allows very fast O(1) allocations within the frame and a super-quick O(1) bulk “deallocation”. We might get in trouble only if we consume more memory within the frame than the allocator has to offer. Assertions are in order.

#include <cassert> // assert
#include <memory>  // std::align

class LinearAllocator
{
public:
  explicit LinearAllocator(size_t size)
    : m_Ptr{ static_cast<char*>(TEMP_MALLOC(size)) }
    , m_TotalSize{ size }
    , m_FreeSpace{ size }
  {}

  void* Allocate(size_t size, unsigned alignment /* power of 2 */)
  {
    assert((alignment & (alignment - 1)) == 0);

    auto currentPtr = static_cast<void*>(m_Ptr + (m_TotalSize - m_FreeSpace));
    // std::align adjusts currentPtr and deducts any padding from m_FreeSpace
    auto retPtr = std::align(alignment, size, currentPtr, m_FreeSpace);

    if (!retPtr)
    {
      assert(false && "Linear allocator full!");
      // no space
      return nullptr;
    }

    m_FreeSpace -= size;
    return retPtr;
  }

  void Free()
  {
    // do nothing - memory is reclaimed in bulk through Reset
  }

  void Reset(size_t freeSpace)
  {
    m_FreeSpace = freeSpace;
  }

  size_t CurrentFreeSpace() const { return m_FreeSpace; }

private:
  char* m_Ptr;
  size_t m_TotalSize;
  size_t m_FreeSpace;
};
In the systems I work with, we have multiple threads that work on “tasks”. This makes defining a frame somewhat difficult. In this sample we will employ a scope object that in its destructor resets the linear allocator.

Our allocator will also be thread-local, so that every thread has its own copy.

struct TempAllocatorScope
{
  TempAllocatorScope() : m_Space(tlsLinearAllocator->CurrentFreeSpace()) {}
  // Return the allocator to where it was when the scope was entered
  ~TempAllocatorScope() { tlsLinearAllocator->Reset(m_Space); }

  size_t m_Space;
};

Finally, we have an STL allocator that forwards all calls to our linear one and allows us to use it with any STL type.
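Such an adapter can be sketched as below. This is a simplified, self-contained version: the linear allocator is reduced to a fixed thread-local buffer, and the names TempStdAllocator and TempVector are invented for the example:

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Simplified thread-local linear buffer standing in for the real allocator.
thread_local char tlsBuffer[1 << 16];
thread_local size_t tlsOffset = 0;

// STL-compatible allocator that forwards to the linear buffer.
// Deallocation is a no-op; memory is reclaimed in bulk by resetting tlsOffset.
template <typename T>
struct TempStdAllocator
{
    using value_type = T;

    TempStdAllocator() = default;
    template <typename U>
    TempStdAllocator(const TempStdAllocator<U>&) {}

    T* allocate(size_t n)
    {
        size_t bytes = n * sizeof(T);
        void* ptr = tlsBuffer + tlsOffset;
        size_t space = sizeof(tlsBuffer) - tlsOffset;
        if (!std::align(alignof(T), bytes, ptr, space))
            throw std::bad_alloc{}; // buffer exhausted
        tlsOffset = static_cast<char*>(ptr) - tlsBuffer + bytes;
        return static_cast<T*>(ptr);
    }

    void deallocate(T*, size_t) { /* bulk-freed on scope reset */ }

    // All instances share the same buffer, so they compare equal.
    template <typename U>
    bool operator==(const TempStdAllocator<U>&) const { return true; }
    template <typename U>
    bool operator!=(const TempStdAllocator<U>&) const { return false; }
};

template <typename T>
using TempVector = std::vector<T, TempStdAllocator<T>>;
```

In the real implementation a scope object would record tlsOffset on entry and restore it on exit, exactly like the TempAllocatorScope above.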

The complete test example is available here.
The sample allows us to switch between the linear allocator and the default one to measure the gains. On my machine it was between 7x – 10x with the VS 2015 runtime.


More on overloading new/delete

In my previous post I highlighted some of the pitfalls encountered while adapting memory management functions in a middleware library.

In this post I’ll continue the discussion of overloading new/delete in the context of writing a middleware library. The specifics here are that when you write middleware you have to be very careful not to inadvertently stomp the client application’s domain – memory, rendering state etc.

Overloading new/delete with some sugar on-top

In a new library we are writing at Coherent Labs, we went with the standard memory tracking technique of overriding all new/delete operators with a memory tag.

It looks something like this:

inline void* operator new(size_t size, MemoryTag memtag) {
  return gAllocator->Allocate((unsigned)size, memtag.Tag);
}

Here gAllocator is a pointer to an interface the client provides to hook in her own allocation methods. Adding a memory tag really helps debugging and shows how memory is distributed among subsystems. On top of these overloaded allocators we have macros that simplify the syntax.
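Put together, the scheme can be sketched like this. The names below (MemTag, IAllocator, MallocAllocator, HB_NEW) are invented for the example – the sketch just forwards to malloc/free where the real library would call the client's allocator:

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical subsystem tags for memory tracking.
enum class MemTag { Layout, Styles, DOM };

struct MemoryTag { MemTag Tag; };

// Interface the client implements to hook her own allocation methods.
struct IAllocator
{
    virtual void* Allocate(unsigned size, MemTag tag) = 0;
    virtual void Deallocate(void* ptr, MemTag tag) = 0;
    virtual ~IAllocator() = default;
};

// Stand-in implementation that just forwards to malloc/free.
struct MallocAllocator : IAllocator
{
    void* Allocate(unsigned size, MemTag) override { return std::malloc(size); }
    void Deallocate(void* ptr, MemTag) override { std::free(ptr); }
};

static MallocAllocator gMalloc;
IAllocator* gAllocator = &gMalloc;

// Tagged placement-form overloads of new/delete.
inline void* operator new(size_t size, MemoryTag memtag)
{
    return gAllocator->Allocate((unsigned)size, memtag.Tag);
}

inline void operator delete(void* ptr, MemoryTag memtag)
{
    gAllocator->Deallocate(ptr, memtag.Tag);
}

// Macro sugar so call sites stay close to plain new:
#define HB_NEW(Tag) new (MemoryTag{ Tag })
```

A call site then reads `auto* p = HB_NEW(MemTag::DOM) MyType(args);`.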

I prefer overloading new/delete as it more or less allows us to continue using the syntax most developers are used to, instead of going for completely custom memory allocation templates.

In C++ you can overload operator delete with additional parameters, but you can’t invoke those overloads with a delete expression – the compiler only calls them when a constructor throws during the matching new. We solve this by substituting delete with a custom function like this:

template <typename T>
inline void DestroyMemory(T* ptr, MemoryTag tag) {
  if (!ptr)
    return;
  ptr->~T();
  operator delete(ptr, tag);
}
The function will call our overload with a tag. In C++ there is a difference between “delete operator” and “operator delete”. We prefer tagging deallocations too because it helps debugging and you can have accurate per-subsystem stats even if you don’t track every single allocation.
If you’ve read my previous post you know that overriding new[]/delete[] has some issues so we simply disable them and go for custom functions for array allocation and deallocation.

Guarding against default new

Now that we have custom macros and functions for all allocations, it is important that everybody always uses them instead of the default new/delete. Developers have to be careful and reviewers should be extra cautious, but this is not a very good way to make sure the policy is enforced.

We make extensive use of STL containers with custom allocators. People have again to use the specialized templates instead of the generic ones.

Things become even more tricky with C++ smart pointers. It is good policy (and important for performance) to use std::make_shared when creating a shared_ptr. Unfortunately, std::make_shared calls the default new. When you need to allocate the pointer through your own custom allocator, you have to go with std::allocate_shared. It is easy to call the wrong function out of habit.
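For illustration, here is how std::allocate_shared routes both the object and its control block through a user-supplied allocator. CountingAllocator and gAllocatedBytes are invented for this sketch, just to make the routing observable:

```cpp
#include <cstddef>
#include <memory>
#include <new>

static size_t gAllocatedBytes = 0;

// Minimal STL-compatible allocator that counts the bytes it hands out.
template <typename T>
struct CountingAllocator
{
    using value_type = T;

    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) {}

    T* allocate(size_t n)
    {
        gAllocatedBytes += n * sizeof(T);
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, size_t) { ::operator delete(p); }

    template <typename U>
    bool operator==(const CountingAllocator<U>&) const { return true; }
    template <typename U>
    bool operator!=(const CountingAllocator<U>&) const { return false; }
};
```

Creating a pointer with `std::allocate_shared<T>(CountingAllocator<T>{}, args...)` then goes through the custom allocator, whereas std::make_shared would have hit the default new.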

We need a way to make sure the default new/delete are never called. A relatively straightforward way is overriding them and asserting in their implementations. This will fail in Debug (albeit at runtime) and is a good solution if your library is dynamically linked. If the library is statically linked, however, the overridden operators will ‘leak’ into the client application! Our library has to be linked statically on some platforms, so this method becomes problematic. Even if we only do the check in Debug, we’d have to disable default allocations in our test applications, which is quite inconvenient.

In the end I came up with a post-processing step that checks whether new/delete are ever called in the library. We do this on POSIX platforms, as those were the ones where we had to link statically. On Windows things should be analogous.

When the library is compiled into the final “.a” file, it contains a symbol list of everything it uses so that the linker can later resolve external symbols. The memory operators are such symbols. We use the “nm” tool to inspect what symbols are referenced in the library. The names are mangled, but can be demangled with “c++filt”.

If the default “new” is called anywhere in the library, it will show up in the “nm” output along with the object file where this was done. We wrote a tool that runs after each build to inspect the library for unwanted calls.

The method can be extended to ‘guard’ against other forbidden functions too.