Open-sourcing Voxels

Since I wrote Voxels in 2014, I've had interest from developers working on different projects who found it would help them achieve their goals. The previous licensing made this difficult.

While it has always been my intention to open-source the library, I'm finally doing it now. I wanted to take some time to clean up the code and document it better, but I kept postponing. My day-to-day work at Coherent Labs takes most of my time and Voxels was always left behind.

Now I'm finally setting it free on GitHub. There are no significant changes since the first release in 2014 and, honestly, I have no current plans to continue developing the library due to my other work.

The library is licensed under the very permissive 3-clause BSD license, so feel free to modify it and incorporate it in your projects. If you plan to use the library, drop me an e-mail; I'm always happy to see what developers achieve with volume surfaces.

MSVC mutex is slower than you might expect

TL;DR: The MSVC 2015 C++ runtime's std::mutex implementation (and potentially others) is significantly slower than the equivalent code written by hand. The reason is that the runtime is built with a Windows feature called "Control Flow Guard" and uses function pointers.

While profiling some code in our new HTML5 renderer, I noticed that the "mutex" implementation we use on Windows was unexpectedly slow compared to other platforms. On Windows we were using the std::mutex implementation in the MSVC 2015 runtime. I decided to dig through the code and see what was going on. Everything was running on Windows 10.

Looking at the code revealed that the implementation of the mutex depends on the version of Windows (Microsoft Visual Studio 14.0\VC\crt\src\stl\primitives.h). This makes good sense, because newer versions of Windows include faster synchronization primitives. On Vista+ the locking is implemented with a SRWLOCK, which is very fast.

At that point I thought that their implementation is pretty good and that there might be an issue with my profiling, so I did a simple test. In an auxiliary application I ran 3 test suites, each locking a mutex 10,000 times, performing a simple computation under the lock, unlocking, and measuring the time for all those operations (a rough sketch of the measurement loop follows the list below).

I measured 3 scenarios:

  • std::mutex
  • manual CRITICAL_SECTION
  • manual SRWLOCK
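
The original test application isn't public, so the loop below is only a rough sketch of what such a measurement can look like; the timing method and the exact work done under the lock are my assumptions.

#include <chrono>
#include <iostream>
#include <mutex>
#include <windows.h>

volatile long long g_Sink = 0; // keeps the "work" from being optimized away

// Times 10000 lock / small computation / unlock iterations.
template <typename LockFn, typename UnlockFn>
double Measure(LockFn lock, UnlockFn unlock)
{
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 10000; ++i)
  {
    lock();
    g_Sink += i * i; // the simple computation under the lock
    unlock();
  }
  auto end = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double, std::micro>(end - start).count();
}

int main()
{
  std::mutex mtx;
  CRITICAL_SECTION cs;
  ::InitializeCriticalSection(&cs);
  SRWLOCK srw = SRWLOCK_INIT;

  std::cout << "std::mutex:       " << Measure([&] { mtx.lock(); }, [&] { mtx.unlock(); }) << " us\n";
  std::cout << "CRITICAL_SECTION: " << Measure([&] { ::EnterCriticalSection(&cs); },
                                               [&] { ::LeaveCriticalSection(&cs); }) << " us\n";
  std::cout << "SRWLOCK:          " << Measure([&] { ::AcquireSRWLockExclusive(&srw); },
                                               [&] { ::ReleaseSRWLockExclusive(&srw); }) << " us\n";

  ::DeleteCriticalSection(&cs);
}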

The SRWLOCK implementation was slightly faster than the CRITICAL_SECTION (~5-10%), which is expected, but the std::mutex was 30-40% slower than the rest. At that moment I was really surprised, because essentially the implementation was the same – just locking and unlocking the SRWLOCK. The std::mutex has some additional code – it gets the current thread id (I simulated this too and it's very fast – no noticeable change in performance), it calls virtual methods, and the execution of the SRWLOCK functions happens through function pointers. None of those should incur such a large difference though.

So I dug into the assembly. It turned out that when the CRT calls __crtAcquireSRWLockExclusive (and all the other SRWLOCK methods), the calls don't go directly into the Windows Kernel32.dll! A lot of checks and code are executed between entering the method and actually arriving in AcquireSRWLockExclusive, which is where we want to go. The reason is a Windows feature called Control Flow Guard. Essentially this is a security feature that instructs the compiler to check every jump through a function pointer and validate that it is a valid target. This is possible because on modern Windows, all function addresses are annotated and known to the loader. The feature prevents jumping "in the middle" of functions and makes exploitation more difficult.

While CFG might be very important for some classes of applications, its performance impact on a game is unacceptably high. Unfortunately, when using the MSVC runtime you have no choice, because it is pre-built with the feature. While normal function calls are fine and bypass the mechanism, the std::mutex implementation falls flat here. In order to support multiple Windows versions the authors have to rely on function pointers for routines that might be missing from older Windows versions.

It's unfortunate that the authors went to the trouble of using different approaches on different Windows versions to squeeze out the best performance (the 2015 version is much, much better than older std::mutex implementations), but this compiler feature has essentially defeated their efforts.

Fixing this in your application is really trivial – creating a custom mutex implementation with SRWLOCK is less than two minutes of work. I didn't investigate other STL classes that might suffer from the same problem, but I assume there are some among the synchronization primitives (where, unfortunately, you want the most performance).
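
For reference, here is a minimal sketch of such a hand-rolled wrapper (the class name is mine); it simply calls the Win32 SRWLOCK functions directly, so the calls go straight into Kernel32 without the CFG-checked function pointers of the runtime:

#include <windows.h>

class SRWMutex
{
public:
  SRWMutex() { ::InitializeSRWLock(&m_Lock); }

  SRWMutex(const SRWMutex&) = delete;
  SRWMutex& operator=(const SRWMutex&) = delete;

  void lock() { ::AcquireSRWLockExclusive(&m_Lock); }
  void unlock() { ::ReleaseSRWLockExclusive(&m_Lock); }
  bool try_lock() { return ::TryAcquireSRWLockExclusive(&m_Lock) != 0; }

private:
  SRWLOCK m_Lock;
};

The lock/unlock/try_lock names match the standard Lockable requirements, so the wrapper can be dropped into std::lock_guard and std::unique_lock.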

It'd be great if Microsoft provided more versions of the runtime libraries – especially "fast" ones without the security features. It would be even better if they provided all the source for the runtime and allowed developers to compile it themselves.

Unique values in C++

I was profiling a particular piece of code in one of our products at Coherent Labs and noticed some peculiar results. Basically, each frame we have to keep track of a certain number of elements that have changed, and at the end of the frame we have to do an operation on all of them. The elements are tracked with pointers that are put in an std::unordered_set, because due to different logic paths we might flag the same element more than once.

I noticed that both adding many elements to the set and walking it afterwards took more time than I'd like. Putting aside the operations on the pointers themselves (and the need to achieve better cache coherency in that domain), I concentrated on the set itself.

The VS 2015 implementation seems to keep the different "buckets" of the set in a list. At that point I was already pretty worried: the profiler showed that substantial time was going into an std::list iterator – not exactly the high-performance code I'd like.

I decided to remove some indirection and improve cache hits by replacing the unordered_set with a vector. I blindly push_back elements into the vector. The "uniqueness" of the pointers is required only when we operate on them, so when needed I run an std::sort on the vector followed by an erase with std::unique. This gives me all the elements without duplicates. At this point I have to note that when pushing elements into the set, duplicates happen very rarely. The results I got after changing the std::unordered_set to an std::vector with sort & erase were promising, so I made a larger test.
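
The idiom itself is just a few lines; here is a minimal sketch (Element is a hypothetical stand-in for the actual tracked type):

#include <algorithm>
#include <vector>

struct Element;

void FlushChanged(std::vector<Element*>& changed)
{
  // Deduplicate only when we actually need uniqueness.
  std::sort(changed.begin(), changed.end());
  changed.erase(std::unique(changed.begin(), changed.end()), changed.end());

  for (Element* e : changed)
  {
    // ... operate on e ...
  }
  changed.clear();
}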

You can see the whole test here: https://gist.github.com/stoyannk/9df2df9d25e1acba7747

I test 3 cases – an std::unordered_set, an std::vector, and Google's dense_hash_set, which is regarded as very fast. A simple sqrt on each element simulates some "work" done on it. I also allow a "duplicates" multiplier. The results are interesting.

No duplicates:

Time for set: 236.609
Time for vector: 133.017
Time for google set: 234.986

2 duplicates factor:

Time for set: 133.731
Time for vector: 129.611
Time for google set: 131.001

4 duplicates factor:

Time for set: 71.606
Time for vector: 116.688
Time for google set: 77.66

When duplicates are rare, the std::vector substantially outperforms both sets. Things however quickly degrade if we introduce many duplicates. This makes sense, as it puts much more pressure on the std::sort and std::unique calls. At that moment I had an idea. In my use case the work done on each element is relatively light and doing it twice on an element has no side effect – f(f(x)) == f(x) in this case. I have little duplication anyway, so why not ditch the uniqueness requirement?

I added a "UNIQUE_VEC" define that controls whether we do the sort & erase on the vector, and re-ran the tests.

No duplicates:

Time for set: 248.518
Time for vector: 66.078
Time for google set: 234.395

2 duplicates factor:

Time for set: 125.65
Time for vector: 58.664
Time for google set: 138.612

4 duplicates factor:

Time for set: 71.207
Time for vector: 51.306
Time for google set: 77.191

Now the vector is the clear winner in the no-duplicates case. Admittedly, here we are much more dependent on how 'heavy' the per-element workload is, but with duplication factors close to 1 that should be irrelevant, especially when we have many elements.

In my experience, when we need uniqueness in game code, duplicates are actually rare. In that case it is almost always better to use an array (vector) and handle the uniqueness yourself. Even better, if the operation is fast and the implementation supports it, ditch the uniqueness altogether.

Temporary allocations in C++

Heap allocations in C++ are notoriously expensive when they hit the generic allocator. Many high-performance software projects employ custom allocation strategies that fit a scenario and speed up allocations. A great overview of allocation strategies common in games is given by Stefan Reinalter in his Molecular musings blog.

In this post I'll specifically talk about temporary allocations and how to optimize STL allocations. STL is difficult to use right, as its types do many heap allocations and the allocator is a template parameter, which makes types with different allocators incompatible. C++11 and C++14, however, have added many new features that in my opinion make STL a viable option again, even for performance-critical code.

Temporary objects usually survive only a small scope within the execution of the program, but this scope (function) might be called many times. If objects in it always allocate on the heap, we might suffer a substantial performance hit.

Imagine this function:

void OnTaskComplete()
{
  std::vector<Task*> tasksToSchedule;
  for(auto& task : m_WaitingTasks)
  {
    if (task->CanBeScheduled())
    {
      tasksToSchedule.push_back(task);
      // Do some more stuff
    }
  }
  for (auto& task : tasksToSchedule)
  {
    // Schedule task
  }
}

The function in the example might get called after a task in the job system completes. It checks whether to schedule some new tasks. It might get called hundreds of times in a frame. The vanilla vector there might impact performance, as it will do a heap allocation every time at least one task has to be scheduled.

We know that the lifetime of the object is really narrow – it is temporary to the scope of the function. Let’s make it a lot faster.

We will employ a custom allocator for temporary objects that outperforms the standard heap allocations by 7x-10x.

A "linear allocator" is one that reserves a predefined block of memory up front and, each time we request an allocation, hands back a piece of that block. We don't manage anything except where we currently are in the buffer. We can however "reset" the allocator, which effectively allows us to reuse the memory. It is common in games to use such an allocator for all allocations within the frame and reset it at the beginning of a new one. This allows for very fast O(1) allocations within the frame and an O(1), super-quick, bulk "deallocation". We might be in trouble only if we consume more memory within the frame than the allocator has to offer. Assertions are in order.

#include <cassert>
#include <cstddef>
#include <memory> // std::align

// TEMP_MALLOC / TEMP_FREE are defined in the full sample (e.g. they can map to malloc/free).
class LinearAllocator
{
public:
  LinearAllocator(size_t size)
    : m_Ptr{ static_cast<char*>(TEMP_MALLOC(size)) }
    , m_TotalSize{ size }
    , m_FreeSpace{ size }
  {}

  ~LinearAllocator()
  {
    TEMP_FREE(m_Ptr);
  }

  void* Allocate(size_t size, unsigned alignment /* power of 2 */)
  {
    assert((alignment & (alignment - 1)) == 0);

    auto currentPtr = static_cast<void*>(m_Ptr + (m_TotalSize - m_FreeSpace));
    auto retPtr = std::align(alignment, size, currentPtr, m_FreeSpace);

    if (!retPtr)
    {
      assert(false && "Linear allocator full!");
      // no space
      return nullptr;
    }

    m_FreeSpace -= size;

    return retPtr;
  }

  void Free()
  {
    // do nothing
  }

  void Reset(size_t freeSpace)
  {
    m_FreeSpace = freeSpace;
  }

  size_t CurrentFreeSpace() const { return m_FreeSpace; }

private:
  char* m_Ptr;
  size_t m_TotalSize;
  size_t m_FreeSpace;
};

In the systems I work with, we have multiple threads that work on "tasks". This makes defining a frame somewhat difficult. In this sample we will employ a scope object that resets the linear allocator in its destructor.

Our allocator will also be thread-local, so that every thread has its own copy.
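
Only the tlsLinearAllocator name is used by the scope object below; how it is initialized here (one per-thread instance of, say, 1 MB) is my assumption:

thread_local LinearAllocator tlsLinearAllocatorInstance{ 1024 * 1024 }; // 1 MB per thread, for example
thread_local LinearAllocator* tlsLinearAllocator = &tlsLinearAllocatorInstance;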

struct TempAllocatorScope
{
public:
  TempAllocatorScope()
    : m_Space(tlsLinearAllocator->CurrentFreeSpace())
  {}

  ~TempAllocatorScope()
  {
    tlsLinearAllocator->Reset(m_Space);
  }
private:
  size_t m_Space;
};

Finally, we will have an STL allocator that forwards all calls to our linear one and allows us to use it with any STL type.
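
A minimal sketch of what that allocator can look like, assuming the thread-local tlsLinearAllocator from above (the real sample linked below may differ in details):

#include <cstddef>

template <typename T>
struct TempStdAllocator
{
  using value_type = T;

  TempStdAllocator() = default;
  template <typename U>
  TempStdAllocator(const TempStdAllocator<U>&) {}

  T* allocate(size_t n)
  {
    return static_cast<T*>(tlsLinearAllocator->Allocate(n * sizeof(T), alignof(T)));
  }
  void deallocate(T*, size_t) {} // the linear allocator frees in bulk on Reset

  template <typename U> bool operator==(const TempStdAllocator<U>&) const { return true; }
  template <typename U> bool operator!=(const TempStdAllocator<U>&) const { return false; }
};

With this in place, the vector from the earlier example can be declared as std::vector<Task*, TempStdAllocator<Task*>>; together with a TempAllocatorScope at the top of the function, it allocates from the per-thread linear buffer and the memory is reclaimed on exit.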

The complete test example is available here.
The sample allows us to switch between the linear allocator and the default one to measure the gains. On my machine it was between 7x – 10x with the VS 2015 runtime.

More on overloading new/delete

In my previous post I highlighted some of the pitfalls encountered while adapting memory management functions in a middleware library.

In this post I'll continue the discussion of overloading new/delete in the context of writing a middleware library. The specifics here are that when you write middleware you have to be very careful not to inadvertently stomp on the client application's domain – memory, rendering state, etc.

Overloading new/delete with some sugar on top

In a new library we are writing at Coherent Labs, we went with the standard memory tracking technique of overriding all new/delete operators with a memory tag.

It looks something like this:

inline void* operator new(size_t size, MemoryTag memtag) {
  return gAllocator->Allocate((unsigned)size, memtag.Tag);
}

Here gAllocator is a pointer to an interface the client provides to hook in her own allocation methods. Adding a memory tag really helps debugging and shows how memory is distributed among subsystems. On top of these overloaded allocators we have macros that simplify the syntax.
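
The macros in our code differ, but a hypothetical version looks roughly like this (assuming MemoryTag can be constructed from a tag value, and using the DestroyMemory helper shown below):

#define MY_NEW(Tag, Type, ...) new (MemoryTag{ Tag }) Type(__VA_ARGS__)
#define MY_DELETE(Tag, Ptr) DestroyMemory(Ptr, MemoryTag{ Tag })

// Usage (names purely illustrative):
// auto* layout = MY_NEW(MemTag_Layout, LayoutNode, width, height);
// MY_DELETE(MemTag_Layout, layout);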

I prefer overloading new/delete as it more or less allows us to continue using the syntax most developers are used to, instead of going for completely custom memory allocation templates.

In C++ you can overload operator delete with extra parameters, but you can't call it with a delete expression. The overloaded instances are only called by the compiler when a constructor throws during the matching new expression. We solve this by substituting delete with a custom function like this:

template <typename T>
inline void DestroyMemory(T* ptr, MemoryTag tag) {
  if (!ptr)
    return;
  ptr->~T();
  operator delete(ptr, tag);
}

The function will call our overload with a tag. In C++ there is a difference between “delete operator” and “operator delete”. We prefer tagging deallocations too because it helps debugging and you can have accurate per-subsystem stats even if you don’t track every single allocation.
If you’ve read my previous post you know that overriding new[]/delete[] has some issues so we simply disable them and go for custom functions for array allocation and deallocation.

Guarding against default new

Now that we have custom macros and functions for all allocations, it is important that everybody always uses them instead of the default new/delete. Developers have to be careful and reviewers should be extra cautious, but this is not a very good way to make sure the policy is enforced.

We make extensive use of STL containers with custom allocators. People again have to use the specialized templates instead of the generic ones.

Things become even trickier with C++ smart pointers. It is good policy (and important for performance) to use std::make_shared when creating a shared_ptr. Unfortunately, std::make_shared calls the default new. When you need to allocate the pointer through your own custom allocator you have to go with std::allocate_shared. It is easy to make a mistake and call the wrong function out of habit.
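
A small illustration of the difference; TrackedAllocator is a hypothetical stand-in for our real STL allocator and only shows the shape of the call:

#include <cstddef>
#include <cstdlib>
#include <memory>

template <typename T>
struct TrackedAllocator
{
  using value_type = T;
  TrackedAllocator() = default;
  template <typename U> TrackedAllocator(const TrackedAllocator<U>&) {}

  T* allocate(std::size_t n) { return static_cast<T*>(std::malloc(n * sizeof(T))); } // would go through gAllocator
  void deallocate(T* p, std::size_t) { std::free(p); }
};

struct Widget { int value = 0; };

void Example()
{
  auto bad = std::make_shared<Widget>();                                // calls the default global new
  auto good = std::allocate_shared<Widget>(TrackedAllocator<Widget>{}); // object + control block via our allocator
}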

We need a way to make sure the default new/delete are never called. A relatively straightforward way is overriding them and asserting in their implementations. This will fail in Debug (although only at runtime) and is a good solution if your library is dynamically linked. If the library is statically linked, however, the overridden functions will 'leak' into the client application! Our library has to be linked statically on some platforms, so this method becomes problematic. Even if we only do the check in Debug, we'd have to disable default allocations in our test applications, which is quite inconvenient.

In the end I came up with a post-processing step that checks whether new/delete are ever called in the library. We do this on POSIX platforms, as those are the ones we have to statically link on. On Windows things should be analogous.

When the library is compiled into the final ".a" file, it contains a symbol list with everything it uses, so that the linker can later properly resolve external symbols. The memory operators are such symbols. We use the "nm" tool to inspect what symbols are referenced in the library. The names are mangled, but can be demangled with "c++filt".

If the default "new" is called anywhere in the library, it will be reflected by "nm", along with the object file where this happened. We wrote a tool that runs after each build to inspect the library for unwanted calls.

The method can be extended to ‘guard’ against other forbidden functions too.

Pitfalls of placement new[]

TL;DR: Mixing custom memory allocation with placement new[] is non-portable. Either override all new/delete combinations and let the compiler take care of array counters or ditch any new[] usage altogether and do everything by yourself.

Custom memory allocation is a very important aspect of a C++ library. By custom I mean the ability of the user to specify the functions used for memory allocation and deallocation. A library that directly uses the CRT facilities can cause serious headaches for the embedder, as its memory will essentially be 'invisible' to her.

There are several ways to write a library with custom memory allocators in C++.

  1. Allocate everything through custom ‘my_malloc’/’my_free’ functions for C libraries.
  2. Allocate everything through specific functions and use placement new for construction. (C++)
  3. Override new/delete and all their flavors. (C++).

The first option is great but is relegated to C libraries. In C++ we must take care of object construction/destruction.

Option 3 is a good one, but care should be taken. It is very important that the new/delete overrides don't 'leak' in any way into the public interface of the library. Otherwise embedders might include the overridden functions and get their own allocations altered!

Most libraries I write work with a combination of options 2 and 3. The allocations I do almost always need additional parameters – memory tags and possibly the memory arenas where they should happen.

I override new/delete with the additional parameters and make the default ones assert so that they are not used by mistake. All this is relegated to the internals of the library and the user never sees these declarations.
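
A sketch of what those asserting defaults can look like (kept in an internal translation unit; a real implementation should also handle allocation failure and cover the other new/delete flavors):

#include <cassert>
#include <cstdlib>
#include <new>

void* operator new(std::size_t size)
{
  assert(false && "Default operator new called - use the tagged overloads!");
  return std::malloc(size); // a real replacement must throw std::bad_alloc on failure
}

void operator delete(void* ptr) noexcept
{
  assert(false && "Default operator delete called - use the tagged overloads!");
  std::free(ptr);
}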

Recently I had to adapt a popular library – Skia – to use custom memory allocations. Skia is a great 2D rendering library that is used in Chrome and Android. It doesn't, however, provide a way to change the memory allocation functions by default. The whole memory management story is a bit messy in the library. There are functions called sk_malloc/sk_free, which is great, but they are used sporadically. Instead, all C++ new/delete calls go through dedicated macros: SkNEW, SkDELETE, etc.

By default they are defined:

#ifndef SkNEW
# define SkNEW(type_name) (new type_name)
# define SkNEW_ARGS(type_name, args) (new type_name args)
# define SkNEW_ARRAY(type_name, count) (new type_name[(count)])
# define SkNEW_PLACEMENT(buf, type_name) (new (buf) type_name)
# define SkNEW_PLACEMENT_ARGS(buf, type_name, args) (new (buf) type_name args)
# define SkDELETE(obj) (delete (obj))
# define SkDELETE_ARRAY(array) (delete[] (array))
#endif

There is also a SK_OVERRIDE_GLOBAL_NEW switch that can enable overriding new/delete. As the Skia developers had already put the macros in place, I decided to modify them to use sk_malloc/sk_free and then modify these functions to go through my own memory allocators.

For the non-array flavors things were pretty straightforward:

# define SkNEW(type_name) (new(sk_malloc_flags(sizeof(type_name), 0)) type_name)
# define SkDELETE(obj) (SkAllocDestroy(obj))

Where SkAllocDestroy is defined in this way:

template <typename T>
inline void SkAllocDestroy(const T* ptr)
{
  if (!ptr) return;
  ptr->~T();
  sk_free(const_cast<T*>(ptr));
}

I placement-new the object in the memory allocated by sk_malloc_flags and then manually destruct it and free the memory.
Things get more complicated with array allocations. My first attempt was something like this:

# define SkNEW_ARRAY(type_name, count) (new (SkAllocArray(count)) type_name[count])
# define SkDELETE_ARRAY(array) (SkFreeArray(array))

I allocate the memory for the array and placement-new it so that all constructors get called. Actually, I allocate a bit more space and put a counter with the actual size of the array in front of it. The SkFreeArray call reads the counter that is always right before the array pointer, calls all the destructors and frees the memory. Seems legit.

I started the application -> crash!

After inspecting the addresses of the allocations and the ones used by the application, I noticed that when I returned the pointer 0x100, the pointer actually assigned to the variable holding it was 0x108. This was also the pointer that eventually arrived in SkFreeArray(). In the assembly it was clear that the placement new[] was bumping my pointer by 8 bytes – most likely the size of the counter it uses.

Long story short – placement new[] will bump the pointer by some bytes to account for a counter… but not always. This doesn't happen if the array members are of a type with a trivial destructor or no destructor at all. It's clear that this is an optimization: in that case you don't need a counter, as you can just delete the whole array without calling any destructors.

An excerpt from the standard explains the behaviour:

(14.3) new T[5] results in a call of operator new[](sizeof(T)*5+x), and

(14.4) new(2,f) T[5] results in a call of operator new[](sizeof(T)*5+y,2,f)

Here, x and y are non-negative unspecified values representing array allocation overhead; the result of the new-expression will be offset by this amount from the value returned by operator new[]. This overhead may be applied in all array new-expressions, including those referencing the library function operator new[](std::size_t, void*) and other placement allocation functions. The amount of overhead may vary from one invocation of new to another.

Indeed, if we override new[], the amount of storage required is calculated by the compiler. While in theory my SkAllocArray code can be fixed to account for the compiler's pointer fixups, it will not be portable. The standard doesn't specify the size of 'x' and 'y' nor exactly when they are used – it is up to the compiler. Something that works today on MSVC might not work on Clang, or might even change in future versions.

In the end I decided to just rewrite my SkAllocArray to do everything itself – it allocates the memory plus a counter, calls the constructors with 'single' placement new when necessary, and returns the adjusted pointer.
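
A simplified sketch of that approach, assuming a size_t counter stored right before the array and ignoring over-alignment for brevity (the real code handles more cases):

#include <cstddef>
#include <new>

template <typename T>
T* SkAllocArray(size_t count)
{
  char* raw = static_cast<char*>(sk_malloc_flags(sizeof(size_t) + sizeof(T) * count, 0));
  *reinterpret_cast<size_t*>(raw) = count;            // stash the element count ourselves
  T* array = reinterpret_cast<T*>(raw + sizeof(size_t));
  for (size_t i = 0; i < count; ++i)
    new (array + i) T();                              // 'single' placement new - no compiler-added offset
  return array;
}

template <typename T>
void SkFreeArray(T* array)
{
  if (!array) return;
  char* raw = reinterpret_cast<char*>(array) - sizeof(size_t);
  const size_t count = *reinterpret_cast<size_t*>(raw);
  for (size_t i = count; i > 0; --i)
    array[i - 1].~T();                                // destroy in reverse order
  sk_free(raw);
}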

The gist of this post is that mixing custom memory allocation with placement new[] is non-portable. Either override all new/delete combinations and let the compiler take care of array counters or ditch any new[] usage altogether and do everything by yourself.