Unique values in C++

I was profiling a particular piece of code in one of our products at Coherent Labs and noticed some peculiar results. Each frame we have to keep track of a certain number of elements that have changed, and at the end of the frame we have to perform an operation on all of them. The elements are tracked with pointers that are put in an std::unordered_set, because different logic paths might flag the same element more than once.

I noticed that both adding many elements to the set and then walking it took more time than I’d like. Putting aside the operations on the pointers themselves (and the need for better cache coherency in that domain), I concentrated on the set itself.

The VS 2015 implementation seems to keep the different “buckets” of the set in a list. That alone was worrying, and indeed the profiler showed substantial time going into an std::list iterator – not exactly the high-performance code I’d like.

I decided to remove some indirection and improve cache hits by replacing the unordered_set with a vector. I blindly push_back elements into the vector. The “uniqueness” of the pointers is required only when we operate on them, so when needed I run std::sort on the vector followed by an erase with std::unique. This gives me all the elements without duplicates. I should note that when pushing elements, duplicates actually happen very rarely. The results after changing the std::unordered_set to an std::vector with sort & erase were promising, so I made a larger test.
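In code the idiom looks roughly like this (a minimal sketch with illustrative names, not the actual product code):

#include <algorithm>
#include <vector>

struct Element; // whatever is being tracked

void Flag(std::vector<Element*>& changed, Element* e)
{
    changed.push_back(e); // duplicates are allowed at this point
}

void Process(std::vector<Element*>& changed)
{
    // Deduplicate only when the elements are actually operated on.
    std::sort(changed.begin(), changed.end());
    changed.erase(std::unique(changed.begin(), changed.end()),
                  changed.end());
    // ... per-element work goes here ...
}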

You can see the whole test here: https://gist.github.com/stoyannk/9df2df9d25e1acba7747

I test three cases – an std::unordered_set, an std::vector, and Google’s dense_hash_set, which is regarded as very fast. A simple sqrt on each element simulates some “work” done per element. I also allow a “duplicates” multiplier. The results are interesting.

No duplicates:

Time for set: 236.609
Time for vector: 133.017
Time for google set: 234.986

2 duplicates factor:

Time for set: 133.731
Time for vector: 129.611
Time for google set: 131.001

4 duplicates factor:

Time for set: 71.606
Time for vector: 116.688
Time for google set: 77.66

When duplicates are rare, the std::vector substantially outperforms both sets. Things quickly degrade, however, if we introduce many duplicates. This makes sense, as duplicates put much more pressure on the std::sort and std::unique calls. At that moment I had an idea: in my use case the work done on each element is relatively light, and doing it twice on an element has no side effects – f(f(x)) == f(x) here. I have little duplication anyway, so why not ditch the uniqueness requirement altogether?

I added a “UNIQUE_VEC” define – it controls whether we do the sort & erase on the vector – and re-ran the tests.
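The define simply gates the deduplication pass; conceptually something like this (a sketch matching the idiom above, not the verbatim gist code):

#include <algorithm>
#include <vector>

template <typename T>
void PrepareForProcessing(std::vector<T*>& changed)
{
#ifdef UNIQUE_VEC
    // Deduplicate before the per-element work.
    std::sort(changed.begin(), changed.end());
    changed.erase(std::unique(changed.begin(), changed.end()),
                  changed.end());
#endif
    // Without UNIQUE_VEC the vector is processed as-is, relying on
    // f(f(x)) == f(x) for the per-element operation.
}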

No duplicates:

Time for set: 248.518
Time for vector: 66.078
Time for google set: 234.395

2 duplicates factor:

Time for set: 125.65
Time for vector: 58.664
Time for google set: 138.612

4 duplicates factor:

Time for set: 71.207
Time for vector: 51.306
Time for google set: 77.191

Now the vector is the clear winner even in the no-duplicates case. Of course, the outcome now depends much more on how ‘heavy’ the per-element workload is, but with duplication factors close to 1 that should be irrelevant, especially when we have many elements.

In my experience, duplication is actually rare when we need this in game code. In that case it is almost always better to use an array (vector) and handle the uniqueness yourself. Even better, if the operation is fast and tolerates being applied twice – ditch the uniqueness altogether.

Pitfalls of placement new[]

TL;DR: Mixing custom memory allocation with placement new[] is non-portable. Either override all new/delete combinations and let the compiler take care of array counters or ditch any new[] usage altogether and do everything by yourself.

Custom memory allocation is a very important aspect of a C++ library. By custom I mean the ability of the user to specify the functions used for memory allocation and de-allocation. A library that directly uses the CRT facilities can cause serious headaches for the embedder, as its memory will essentially be ‘invisible’ to her.

There are several ways to write a library with custom memory allocators in C++.

  1. Allocate everything through custom ‘my_malloc’/‘my_free’ functions. (C)
  2. Allocate everything through specific functions and use placement new for construction. (C++)
  3. Override new/delete and all their flavors. (C++)

The first option is great but is limited to C libraries. In C++ we must also take care of object construction/destruction.

Option 3 is a good one, but care must be taken. It is very important that the new/delete overrides don’t ‘leak’ in any way into the public interface of the library. Otherwise embedders might include the overridden functions and have their own allocations altered!

Most libraries I write use a combination of 2 and 3. The allocations I do almost always need additional parameters – memory tags and possibly the memory arenas where they should happen.

I override new/delete with the additional parameters and make the default ones assert so that they are not used by mistake. All this is relegated to the internals of the library and the user never sees these declarations.
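A minimal sketch of the pattern, with hypothetical names (MemTag, MyAlloc/MyFree), kept strictly in an internal header:

#include <cassert>
#include <cstddef>

enum class MemTag { General, Layout, Rendering }; // hypothetical tags

void* MyAlloc(std::size_t size, MemTag tag); // forwards to the user-supplied allocator
void  MyFree(void* ptr) noexcept;

// Tagged flavors used throughout the library internals:
//   Widget* w = new (MemTag::Layout) Widget;
void* operator new(std::size_t size, MemTag tag) { return MyAlloc(size, tag); }
void  operator delete(void* ptr, MemTag) noexcept { MyFree(ptr); }

// The default flavors assert so an untagged allocation is caught early.
void* operator new(std::size_t size)
{
    assert(false && "untagged operator new used inside the library");
    return MyAlloc(size, MemTag::General);
}
void operator delete(void* ptr) noexcept { MyFree(ptr); }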

Recently I had to adapt a popular library – Skia – to use custom memory allocation. Skia is a great 2D rendering library used in Chrome and Android. By default, though, it has no way to change the memory allocation functions. The whole memory management story is a bit messy in the library. There are functions called sk_malloc/sk_free, which is great, but they are used only sporadically. Instead, all C++ new/delete calls run through dedicated macros – SkNEW, SkDELETE, etc.

By default they are defined:

#ifndef SkNEW
# define SkNEW(type_name) (new type_name)
# define SkNEW_ARGS(type_name, args) (new type_name args)
# define SkNEW_ARRAY(type_name, count) (new type_name[(count)])
# define SkNEW_PLACEMENT(buf, type_name) (new (buf) type_name)
# define SkNEW_PLACEMENT_ARGS(buf, type_name, args) (new (buf) type_name args)
# define SkDELETE(obj) (delete (obj))
# define SkDELETE_ARRAY(array) (delete[] (array))
#endif

There is also a SK_OVERRIDE_GLOBAL_NEW switch that can enable overriding new/delete. As the Skia developers had already put the macros in place, I decided to modify them to use sk_malloc/sk_free and then route those functions through my own memory allocators.

For the non-array flavors things were pretty straightforward:

# define SkNEW(type_name) (new(sk_malloc_flags(sizeof(type_name), 0)) type_name)
# define SkDELETE(obj) (SkAllocDestroy(obj))

Where SkAllocDestroy is defined in this way:

template <typename T>
inline void SkAllocDestroy(const T* ptr)
{
    if (!ptr) return;
    ptr->~T();
    sk_free(const_cast<T*>(ptr));
}

I placement-new the object in the memory allocated by sk_malloc_flags and then manually destruct it and free the memory.
Things get more complicated with array allocations. My first attempt was something like this:

# define SkNEW_ARRAY(type_name, count) (new (SkAllocArray(count)) type_name[count])
# define SkDELETE_ARRAY(array) (SkFreeArray(array))

I allocate the memory for the array and placement-new it so that all the constructors get called. In fact I allocate a bit more space and put a counter with the actual size of the array in front of it. The SkFreeArray call reads the counter, which always sits just before the array pointer, calls all the destructors and frees the memory. Seems legit.

I started the application -> crash!

After inspecting the addresses of the allocations and the ones used by the application, I noticed that when I returned the pointer 0x100, the pointer actually assigned to the variable holding it was 0x108. This was also the pointer that eventually arrived in SkFreeArray(). In the assembly it was clear that the placement new[] was bumping my pointer by 8 bytes – most likely the size of the counter it uses.

Long story short – placement new[] will bump the pointer by some bytes to account for a counter… but not always. It doesn’t happen if the array members are of a type with a trivial or no destructor. This is clearly an optimization: in that case no counter is needed, since the whole array can be deleted without calling any destructors.

An excerpt from the standard explains the behaviour:

(14.3) new T[5] results in a call of operator new[](sizeof(T)*5+x), and

(14.4) new(2,f) T[5] results in a call of operator new[](sizeof(T)*5+y,2,f).

Here, x and y are non-negative unspecified values representing array allocation overhead; the result of the new-expression will be offset by this amount from the value returned by operator new[]. This overhead may be applied in all array new-expressions, including those referencing the library function operator new[](std::size_t, void*) and other placement allocation functions. The amount of overhead may vary from one invocation of new to another.
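The overhead is easy to observe with a small experiment (a sketch; the exact offsets are unspecified by the standard – my MSVC x64 build showed 8 bytes for the non-trivial case, while other compilers may legitimately print 0 there, which is exactly the portability problem):

#include <cstddef>
#include <cstdio>
#include <new>

struct Trivial    { int x; };
struct NonTrivial { int x; ~NonTrivial() {} };

int main()
{
    alignas(std::max_align_t) unsigned char buf[256];

    unsigned char* t  = reinterpret_cast<unsigned char*>(new (buf) Trivial[4]);
    unsigned char* nt = reinterpret_cast<unsigned char*>(new (buf) NonTrivial[4]);

    // Trivial destructor -> no counter needed, offset is typically 0.
    // Non-trivial destructor -> the compiler may reserve room for a
    // counter in front of the array.
    std::printf("Trivial offset:    %td\n", t - buf);
    std::printf("NonTrivial offset: %td\n", nt - buf);
}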

Indeed, if we override new[], the amounts of storage required are calculated by the compiler. While in theory my SkAllocArray code could be fixed to account for the compiler’s pointer fixups, it would not be portable. The standard specifies neither the size of ‘x’ and ‘y’ nor exactly when they are applied – it is up to the compiler. Something that works today on MSVC might not work on Clang, or might even change in a future version.

In the end I decided to just rewrite SkAllocArray to do everything itself – it allocates the memory plus a counter, calls the constructors with ‘single’ placement new when necessary, and returns the adjusted pointer.
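In outline it looks like this (a sketch rather than the actual patch; alignment handling and the “constructors only when necessary” optimization are omitted for brevity):

#include <cstddef>
#include <new>

void* sk_malloc_flags(std::size_t size, unsigned flags); // Skia's allocation hook
void  sk_free(void* ptr);

template <typename T>
T* SkAllocArray(std::size_t count)
{
    // Room for our own counter, followed by the elements.
    void* mem = sk_malloc_flags(sizeof(std::size_t) + sizeof(T) * count, 0);
    std::size_t* counter = static_cast<std::size_t*>(mem);
    *counter = count;
    T* array = reinterpret_cast<T*>(counter + 1);
    // One 'single' placement new per element - no compiler-controlled
    // array overhead is involved anywhere.
    for (std::size_t i = 0; i != count; ++i)
        new (array + i) T();
    return array;
}

template <typename T>
void SkFreeArray(T* array)
{
    if (!array) return;
    std::size_t* counter = reinterpret_cast<std::size_t*>(array) - 1;
    for (std::size_t i = *counter; i != 0; --i)
        array[i - 1].~T();
    sk_free(counter);
}

SkNEW_ARRAY then becomes a plain call along the lines of (SkAllocArray<type_name>(count)), with no array new-expression left in sight.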

The gist of this post is that mixing custom memory allocation with placement new[] is non-portable. Either override all new/delete combinations and let the compiler take care of array counters or ditch any new[] usage altogether and do everything by yourself.