MSVC mutex is slower than you might expect

TL;DR: The 2015 MSVC C++ runtime’s std::mutex implementation (and potentially others) is significantly slower than the equivalent code written by hand. The reason is that the runtime is built with a Windows feature called “Control Flow Guard” and uses function pointers.

While profiling some code in our new HTML5 renderer, I noticed that the mutex implementation we use on Windows was unexpectedly slow compared to other platforms. On Windows we were using the std::mutex implementation in the MSVC 2015 runtime. I decided to dig through the code and see what was going on. Everything was running on Windows 10.

Looking at the code revealed that the implementation of the mutex depends on the version of Windows (see Microsoft Visual Studio 14.0\VC\crt\src\stl\primitives.h). This makes good sense, because newer versions of Windows include faster synchronization primitives. On Vista and later the locking is implemented with an SRWLOCK, which is very fast.

At that point I thought that their implementation was pretty good and the issue might be in my profiling, so I did a simple test. In an auxiliary application I ran three test suites, each locking a mutex 10,000 times, performing a simple computation, unlocking, and measuring the time for all those operations.

I measured 3 scenarios:

  • std::mutex
  • manual CRITICAL_SECTION
  • manual SRWLOCK

The SRWLOCK implementation was slightly faster than the CRITICAL_SECTION (~5-10%), which is expected, but std::mutex was 30-40% slower than the rest. At that moment I was really surprised, because essentially the implementation was the same – just locking and unlocking the SRWLOCK. The std::mutex has some additional code – it gets the current thread id (I simulated this too and it’s very fast – no noticeable change in performance), it calls virtual methods, and the SRWLOCK functions are executed through function pointers. None of those should incur such a large difference, though.
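
For reference, here is a minimal sketch of what the SRWLOCK variant of such a measurement could look like (the timing code is mine, not the original test; the std::mutex and CRITICAL_SECTION cases would be timed the same way):

#include <windows.h>
#include <chrono>
#include <cstdio>

int main()
{
  SRWLOCK lock = SRWLOCK_INIT;
  long long sum = 0;

  const auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 10000; ++i)
  {
    ::AcquireSRWLockExclusive(&lock);
    sum += i; // the simple computation performed under the lock
    ::ReleaseSRWLockExclusive(&lock);
  }
  const auto elapsed = std::chrono::high_resolution_clock::now() - start;

  std::printf("%lld us (sum=%lld)\n",
    static_cast<long long>(std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count()),
    sum);
  return 0;
}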

So I dug into the assembly. It turned out that when the CRT calls __crtAcquireSRWLockExclusive (and all the other SRWLOCK functions), the calls don’t go directly into the Windows Kernel32.dll! A lot of checks and code are executed between entering the method and actually arriving in AcquireSRWLockExclusive, which is where we want to go. The reason is a Windows feature called Control Flow Guard (CFG). Essentially this is a security feature that instructs the compiler to check every jump through a function pointer and validate that it is a valid target. This is possible because on modern Windows, all function addresses are annotated and known to the loader. The feature prevents jumping “in the middle” of functions and makes hacking more difficult.

While CFG might be very important for some classes of applications, its performance impact on a game is unacceptably high. Unfortunately, when using the MSVC runtime you have no choice, because it’s pre-built with the feature. While normal function calls are fine and bypass the mechanism, the std::mutex implementation falls flat there. In order to support multiple Windows versions, the authors have to rely on function pointers for routines that might be missing from older versions of Windows.

It’s unfortunate that the authors took care to use different approaches on different Windows versions to squeeze out the best performance (the 2015 version is much, much better than older std::mutex implementations), yet this compiler feature has essentially defeated their efforts.

Fixing this in your application is really trivial – creating a custom mutex implementation with SRWLOCK is less than two minutes of work. I didn’t investigate other STL classes that might suffer from the same problem, but I assume there are others among the synchronization primitives (where, unfortunately, you want the most performance).
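
For illustration, here is a minimal sketch of such a hand-rolled mutex (the class name is mine; it assumes Vista or later, where SRWLOCK is available):

#include <windows.h>

class SRWMutex
{
public:
  SRWMutex() { ::InitializeSRWLock(&m_Lock); }
  SRWMutex(const SRWMutex&) = delete;
  SRWMutex& operator=(const SRWMutex&) = delete;

  // Direct calls into Kernel32 - no function pointers, so no CFG overhead
  void lock() { ::AcquireSRWLockExclusive(&m_Lock); }
  void unlock() { ::ReleaseSRWLockExclusive(&m_Lock); }
  bool try_lock() { return ::TryAcquireSRWLockExclusive(&m_Lock) != 0; } // Win7+

private:
  SRWLOCK m_Lock;
};

Because it exposes the standard lock/unlock/try_lock names, it works with std::lock_guard and similar scope guards as a drop-in replacement.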

It’d be great if Microsoft provided more versions of the runtime libraries – especially “fast” ones without the security features. Best of all would be if they provided the full source for the runtime and allowed developers to compile it themselves.

Voxels 0.6.0 released

I have just uploaded version 0.6.0 of the Voxels library.

Changes include:

  • New API for memory usage statistics
  • Substantially decreased run-time memory usage
  • Slight increase in the polygonization performance
  • Removed all internal Windows-specific dependencies

I’m now working on further improving the polygonization performance as well as some ports and integrations. More info coming soon…

Objective-C++ ARC gotchas

This post has also been published on Coherent Labs’ blog – the company I co-founded and work for.

Lately I’ve had to mix C++ and Objective-C pretty heavily (it’s called Objective-C++, apparently). What I usually need is to have Obj-C objects inside C++ ones, which with “Automatic Reference Counting” (ARC) should work out-of-the-box. It actually does – except when it doesn’t!

In essence, what the compiler probably does is just add ‘retain’ and ‘release’ calls in the proper places (constructors, destructors, etc.). It is smart enough to recognize pointers to ARC objects and treat them as non-POD.

This is actually very cool and simplifies the interaction a lot. Alas, there are some problems when you try to do more ‘exotic’ stuff.

A pretty standard C++ way to use your own memory management routines is to allocate some memory with your custom allocator and then use placement new to construct an object in the fresh memory. Destruction goes the other way around – call the destructor explicitly and deallocate your memory.

#include <new> // required for placement new

class Test {};

int main()
{
  // allocate raw memory, then construct the object in place
  Test *t = static_cast<Test*>(operator new(sizeof(Test)));
  new (t) Test;

  // destroy manually: explicit destructor call, then deallocate
  t->~Test();
  operator delete(t);

  return 0;
}

In the following code, as you might expect, ‘dealloc’ is called as soon as the C++ object is destroyed – so no leak happens:

#include <iostream>
#import <Foundation/Foundation.h>

@interface TestObj : NSObject
@end

@implementation TestObj

-(id)init {
  self = [super init];
  NSLog(@"init!");
  return self;
}

-(void)dealloc {
  NSLog(@"dealloc!");
}

@end

class Parent
{
public:
  /*virtual */ void aFunc() {} // Uncomment 'virtual' for reference leak!
};

class TestClass : public Parent
{
public:
  TestClass() : m_Obj([[TestObj alloc] init])
  {}

private:
  TestObj* m_Obj;
};

int main(int argc, const char * argv[])
{
  TestClass* cl = new TestClass;

  cl->~TestClass();
  operator delete(cl);

  //delete cl; // Uncomment me and comment the manual mem. management to never leak!

  std::cout << "End!" << std::endl;

  return 0;
}

However, if you uncomment the ‘virtual’ keyword and hence make the hierarchy polymorphic, the ‘dealloc’ method will not be called. In this case the compiler does not create the non-trivial destructor required! If you substitute the manual memory management with the delete operator, the destructor is synthesized and ‘dealloc’ gets called. The behavior is also prevented if your class is non-POD in C++ terms.

Not having a virtual destructor in a polymorphic hierarchy is arguably very bad, but breaking the ARC promise is even worse. I admit that stumbling upon this issue is not very easy, because it requires a lot of things to happen at once, but the leaking references it introduces are serious enough to prompt me to write about it.

I haven’t looked at clang’s source about this issue and I can’t find a rationale for it in the docs, so I think it’s an oversight and can only speculate why it happens. The compiler version I currently work with, and where I saw the problem, is “Apple clang version 4.1 (tags/Apple/clang-421.11.66)”.

All that said, if you follow the basic C++ guidelines of having virtual destructors in your polymorphic hierarchies you should be fine when you try to mix C++ and Objective-C.
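
Concretely, the straightforward fix for the example above is to give the base class an explicit virtual destructor, which also forces the compiler to synthesize the non-trivial destructor of TestClass that releases m_Obj:

class Parent
{
public:
  virtual ~Parent() {}    // explicit virtual destructor avoids the leak
  virtual void aFunc() {}
};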

Dump References VS add-in

This post has also been published on Coherent Labs’ blog – the company I co-founded and work for.

Motivation

In our current products we have to deal with very big build projects, some of which are divided into more than 200 libraries. This poses huge challenges for the build process and its overall management.

On Windows we build everything with Visual Studio and msbuild, however I have often stumbled upon two very annoying quirks in them:

  1. If you build a Solution through msbuild and project A in it depends on another project B that is NOT in that solution, project B will be compiled and linked in its default configuration. Usually this is the Debug configuration. This might lead to some Debug libraries being linked into a Release project, or some other configuration inconsistency.
  2. If project A in a solution references project B, also in the solution, then Visual Studio fails any compilation attempt with the cryptic “The project file ‘ ‘ has been renamed or is no longer in the solution.”, giving no actual information on what’s wrong. There is a bug on MSDN about this. It’s unclear if and when it’ll be fixed, and most solutions suggested by users involve removing all projects, adding them back one by one, and testing which one fails. This is infeasible in large solutions.

The second problem was very serious for us. We use premake to generate our projects on different platforms and often change dependencies, upgrade third-party libraries, etc. GUIDs, however, are only used in VS projects, and they keep getting out of sync.

We knew we needed an automatic way to check for those issues. One solution would have been a program that just walks and parses all VS projects and issues the warnings we are interested in. I opted however to create a VS add-in, because the data we are interested in is already parsed and readily available in the IDE itself.

Dump References Add-In

Dump References is an add-in for Visual Studio that prints the references of all projects loaded in the solution, checks their GUIDs, and prints any issues it finds. It is intended for C/C++ solutions.

The add-in is fairly simple – it walks all projects in the solution and their references and prints:
– all projects currently in the solution
– all references that each project has
– all unique references in the solution
– possible issues: referenced projects missing from the solution; projects available in the solution but referenced by other projects with another GUID.

To dump the diagnostics, just right-click your solution in the Solution Explorer and click “Dump References”.

The results will be available in the Output window in the Dump References pane.

The add-in’s source is available on GitHub and licensed under the MIT license, so feel free to use it.

Building a shared memory IPC implementation – Part II

This post has also been published on Coherent Labs’ blog – the company I co-founded and work for.

This post is a follow-up on Building a shared memory IPC implementation – Part I. I’d like to discuss the way shared memory is handled and allocated for the inter-process queue.
As in all memory management systems, I try to minimize wasted memory on bookkeeping and the number of shared memory allocations, as they might be slow.

For the shared memory IPC queue I opted for the paged memory allocation scheme sketched below:

[Figure: IPC memory layout]

Pages are what get requested from the OS as shared memory regions. For the sake of simplicity and performance, their number is fixed. This incurs a usage limitation, as the queue can run out of memory, but it simplifies the synchronization and management mechanism. The limit should be set reasonably high; reaching it usually means a more serious problem has occurred, for instance the consumer has stopped working or is too slow.
As you can see in the picture, all shared page handles are put in an array. Only the used pages are requested from the OS – if 3 pages are used out of a maximum of 16, then only those will be allocated.
Nodes (messages) are always allocated within one page. If there is not enough room left in a page to accommodate a node, a new page is requested with size = max(newNodeSize*2, DEFAULT_PAGE_SIZE). This formula ensures that pages are never smaller than a preset minimal limit and also allows for huge nodes. Pages are of variable size, which is very handy.
The pair (OwnerID, Offset) uniquely identifies a node.

NB: The identifier of a node is a pair of values, and usually it can’t be atomically updated. Special care should be taken to ensure that operator= (assignment) and operator!= are atomic with respect to each other, as specified in Building a shared memory IPC implementation – Part I.
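
One possible way to get that mutual atomicity (a sketch of mine, not necessarily what our implementation does) is to pack the pair into a single 64-bit word and use atomic loads and stores on it:

#include <atomic>
#include <cstdint>

struct NodeRef
{
  std::uint32_t OwnerID; // id of the page that owns the node
  std::uint32_t Offset;  // byte offset of the node within that page
};

inline std::uint64_t Pack(NodeRef r)
{
  return (std::uint64_t(r.OwnerID) << 32) | r.Offset;
}

inline NodeRef Unpack(std::uint64_t v)
{
  return NodeRef{ std::uint32_t(v >> 32), std::uint32_t(v) };
}

// Stored as a single 64-bit atomic (lock-free, and therefore usable in shared
// memory, on x86/x64 Windows), assignment and comparison can never observe a
// half-written (OwnerID, Offset) pair.
std::atomic<std::uint64_t> tail;
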
A node in the queue knows its own position but also holds the coordinates of its successor. If a node grows too much and has to expand, but there is not enough room in the page, it is moved to another free one and a hole remains. Node allocation always goes forward; there is no attempt to squeeze nodes into empty spaces. Node de-allocation always goes forward too (it’s a queue), so free space is effectively reclaimed as soon as the node just before it is reclaimed:
[Figure: IPC pages]

As soon as Node K is freed, the whole Page 2 will be marked as free.
I never return the allocated memory to the OS. When a page is left empty, it is added to a collection of free pages and reused as soon as a new page is required. In my typical usage the pattern is very predictable and stable, and just marking a page as reusable allows us to skip costly OS calls when consumption oscillates between M and M+1 pages. If memory is tight or usage spikes are expected, you could free the pages and then reallocate them from the OS.
As pages are freed and reused, nodes will have successors in pages that are not directly next to them. This is not a problem with the described addressing scheme.
When a new node is requested, I first check if there is enough room in the page owning the current tail; if not, I call AllocateFirstNodeOnPage(size_t size):
SMDataQueueBase::Node* SMDataQueueBase::AllocateFirstNodeOnPage(size_t size)
{
  if(m_FreePages.empty())
  {
    // garbage collect
    m_UsedPages.erase(std::remove_if(m_UsedPages.begin(), m_UsedPages.end(), [this](unsigned pageId) -> bool
    {
      const bool isFree = m_Pages[pageId]->NodesOwned == 0;
      if (isFree)
      {
        m_FreePages.push_back(pageId);
      }
      return isFree;
    }), m_UsedPages.end() );
  }

  unsigned freePage = MAX_SHARED_PAGES;
  auto freePageIt = m_FreePages.end();
  for(auto it = m_FreePages.begin(); it != m_FreePages.end(); ++it)
  {
    if(m_Pages[*it]->GetDataSize() >= size)
    {
      freePage = *it;
      freePageIt = it;
      break;
    }
  }

  if(freePageIt != m_FreePages.end())
  {
    m_UsedPages.push_back(freePage);
    // erase by iterator: freePage is a page id, not an index into m_FreePages
    m_FreePages.erase(freePageIt);
  }
  else
  {
    AllocatePage(std::max(size*2, DEFAULT_PAGE_SIZE));
    freePage = m_SharedData->m_AllocatedPagesCount - 1;
    m_UsedPages.push_back(freePage);
  }

  Node* node = new (m_Pages[freePage]->GetData()) Node(size, freePage, 0);
  m_Pages[freePage]->AddNode();

  return node;
}

The method reclaims all pages that are now left empty, checks the empty pages for enough room to accommodate the node, and as a last resort allocates a new page from the OS. The NodesOwned field of a page is incremented/decremented with atomic operations.
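
A sketch of that bookkeeping with the Win32 interlocked intrinsics (AddNode appears in the code above; RemoveNode is my assumed name for the decrement side):

#include <windows.h>

struct Page
{
  volatile LONG NodesOwned; // lives in the shared region, touched by both processes

  void AddNode()    { ::InterlockedIncrement(&NodesOwned); }
  void RemoveNode() { ::InterlockedDecrement(&NodesOwned); }
};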

New pages are always allocated by the producer. The consumer also has the same vector of pages but just maps the already allocated memory when it reaches a node in a not-yet-mapped page.
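
As a sketch of what allocating and mapping a page means on Windows (simplified; error handling and the handle bookkeeping needed for later unmapping are omitted, and the function names are mine):

#include <windows.h>

// Producer side: create a named shared memory region and map a view of it
void* CreatePage(const wchar_t* name, size_t size)
{
  HANDLE h = ::CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr,
                                  PAGE_READWRITE, 0, (DWORD)size, name);
  return h ? ::MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size) : nullptr;
}

// Consumer side: open the already created region lazily,
// when a node in a not-yet-mapped page is reached
void* OpenPage(const wchar_t* name, size_t size)
{
  HANDLE h = ::OpenFileMappingW(FILE_MAP_ALL_ACCESS, FALSE, name);
  return h ? ::MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size) : nullptr;
}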

DLL export quirks

This post has also been published on Coherent Labs’ blog – the company I co-founded and work for.

Can you spot an error in this code (compiled with MSVC 2010 SP1 running on Win7; TestDLL is just a simple DLL project and an executable imports and uses it as described):

--- TestDLL project ---

-> Classes.h <-

// TESTDLL_API is the usual export/import macro; assuming the DLL project
// defines TESTDLL_EXPORTS when building itself
#ifdef TESTDLL_EXPORTS
#define TESTDLL_API __declspec(dllexport)
#else
#define TESTDLL_API __declspec(dllimport)
#endif

class MyClass
{
public:
  TESTDLL_API MyClass();
  TESTDLL_API virtual ~MyClass();
};

-> Classes.cpp <-
MyClass::MyClass()
{}

MyClass::~MyClass()
{}

--- EXE that uses TestDLL.dll ---
int _tmain(int argc, _TCHAR* argv[])
{
  MyClass* cl = new MyClass;
  delete cl;
  return 0;
}

Although innocuous-looking, the code results in undefined behavior. What happens is that operator new is called in the EXE, while operator delete is called in the DLL.

A little playing around in the disassembly shows the reason. When you have a virtual destructor, its address is of course put in the vtable of the object created. However, when the compiler sees a class like the one illustrated, it creates two destructors – one that works just as the programmer would expect, destroying all the members etc., and another one that does the same things but also calls operator delete on the object (the ‘deleting destructor’). This second destructor is the one set in the vtable of the object and is responsible for the behavior.

A fix for this problem is exporting the whole class, as pointed out by Microsoft themselves:
class TESTDLL_API MyClass
{
public:
  MyClass();
  virtual ~MyClass();
};
In this case the compiler creates a ‘scalar deleting destructor’ for the class in the exe – calling the vanilla destructor and the operator delete of the executable – and puts it in the vtable, so everything works as expected.
Checking out the assembly shows that, in the first case, the constructor of MyClass sets the address of the destructor to MyClass::`vector deleting destructor’ in the DLL (the one that calls delete) and nothing more.
However, in the export-the-whole-class case, the compiler generates a ‘local vftable’ and overwrites the one created in the DLL.
As it turns out, before version 5.0(!) of VC++ only the first case used to work, and it created all the said problems. So in 5.0 they changed the behavior to the current one, which also has its drawbacks (like the FreeLibrary issue explained nicely in this thread).
If you __declspec(dllimport) a whole class with a virtual destructor, the compiler creates a new virtual table and redirects the destructor to a local version in order to preserve new/delete correctness. This appears to be the ONLY case where it does this, so in all other situations the programmer must be careful.
It is very tempting to export just the methods needed for a task and leave the rest hidden. However, one must be aware of this quirk, which, if left creeping unnoticed, might bring many headaches. In those cases the best solution is to rely on pure interfaces and factory functions, like COM does. This appears to be the most portable solution too. You could also override new and delete for the exported classes, which has the advantage of not forcing you to use factories and can easily be done with a common base class.
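
As an illustration of the pure-interface approach (the names are mine, not from a real project), all allocation and deallocation stay on the DLL side, so a new/delete mismatch cannot occur:

// -> Public header, shared by the DLL and the EXE <-
class IMyClass
{
public:
  virtual void DoWork() = 0;
  virtual void Destroy() = 0; // frees the object with the DLL's operator delete
protected:
  ~IMyClass() {}              // non-public: clients must go through Destroy()
};

extern "C" TESTDLL_API IMyClass* CreateMyClass();

// -> DLL implementation <-
class MyClassImpl : public IMyClass
{
public:
  virtual void DoWork() {}
  virtual void Destroy() { delete this; } // the DLL's operator delete
};

IMyClass* CreateMyClass() { return new MyClassImpl; } // the DLL's operator new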

Building a shared memory IPC implementation – Part I

This post has also been published on Coherent Labs’ blog – the company I co-founded and work for.

Overview

Inter-process communication has become a very widespread pattern in many modern applications. The nice sandboxing that a multi-process architecture provides is a major contributor to its popularity.

Google Chrome, for instance, spawns a process for every tab the user opens, and there are additional system processes running as well. It is obvious that a fast IPC system is a must.

After some research on different transports for an IPC system, the best candidates were:

  • shared memory
  • pipes
  • TCP

After some experimentation on Windows, it was clear that a good shared memory implementation outperformed the other ones (not difficult to guess), but we found that TCP actually has tremendous throughput on loopback when the data packets are reasonably big. Message pipes are very convenient, but the need for a cross-platform solution meant that we’d have to accommodate both Linux and Windows pipes in our system.

In this post I’d like to share an implementation based on shared memory that we use in a commercial project, where the messages are very numerous but small and the overhead per message must be minimal. I’ll explain the interfaces, the algorithm, and the choices we made, using a mix of C++ and pseudo-code. A shared memory transport is more difficult to use, but it provides us with more control and speed, and although there are differences in its behavior between OSs, they are very easy to deal with.

I gathered the following requirements for the queue that would be the backbone of the IPC system:
  • single producer – single consumer
  • lock-free
  • variable-sized messages
  • minimal number of allocations
  • stream-like message queuing

The system is based on a message queue that is SPSC, so to actually have a two-way communication channel we simply use two queues for which the producer and the consumer are exchanged in the two processes.
Of all those requirements, I have to say that the “variable-sized messages” and the “stream-like message queuing” are the ones that make the implementation more complicated. By “stream-like message queuing” I mean this: the user who queues a message does not know its size when she begins queuing it. An obvious solution would be for the user to just compose the message in a buffer and then send the whole buffer – the size would be known at the en-queue moment – but this would conflict with the “minimal number of allocations” requirement. We need a way to let the user write directly IN the queue and expand the nodes when necessary.
The relevant part of the public interface for the producer looks like this:

void* RequestNode(size_t sz);
void* EnqueueMore(size_t sz);
void CommitNode();

The user requests a Node with some size and writes directly in the void* returned by the queue; if she realizes more memory is needed, she calls EnqueueMore with the new size and again gets memory directly in the queue. The previously written-on memory gets transferred, so the user can safely continue adding data. When done, CommitNode() makes the message available to the consumer.
The Consumer interface has just one method:

void* Consume(size_t& size);
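
A hypothetical usage sketch of the producer interface (assuming, as described above, that EnqueueMore takes the new total size):

#include <cstring>

void EnqueueTwoPartMessage(SMDataQueueProducer& queue,
                           const char* header, size_t headerSize,
                           const char* payload, size_t payloadSize)
{
  // Start composing the message directly in the queue's memory
  char* dest = static_cast<char*>(queue.RequestNode(headerSize));
  std::memcpy(dest, header, headerSize);

  // More data turned out to be needed: grow the node; the already-written
  // bytes are transferred if the node has to move
  dest = static_cast<char*>(queue.EnqueueMore(headerSize + payloadSize));
  std::memcpy(dest + headerSize, payload, payloadSize);

  // Publish the message to the consumer
  queue.CommitNode();
}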

The lock-free implementation of the queue is based on a well-known algorithm (it was designed for multi-threading but applies to inter-process communication too):

We never allow the producer and the consumer to reach each other, by keeping a dummy divider node between them. The producer is in charge of freeing all consumed nodes, so only minimal sharing of concurrent variables is needed.

For a detailed description of the algorithm take a look at Marginean’s “Lock-Free Queues”. His algorithm is correct, but the implementation is flawed and not thread-safe, as noted and explained by Sutter in a follow-up, “Writing Lock-Free Code: A Corrected Queue”. Our queue is a correct one, and although I was unaware of Sutter’s article at the time of writing it, the resulting implementation is very close to his.
The implementation is divided into three classes: SMDataQueueBase, SMDataQueueProducer and SMDataQueueConsumer. The base class provides all the shared bookkeeping structures and methods; the producer and consumer classes expose just the methods highlighted above.

The process of en-queuing works like this:

// This is an idealized pseudo-implementation. Details are omitted for clarity
SMDataQueueBase::SMDataQueueBase(bool initializeMemory)
{
  CreateSharedMemoryRegion();

  if(initializeMemory)
  {
    new(m_SharedData) SharedData;
    // Allocate an empty node
    m_SharedData->m_Head = *AllocateNode(sizeof(size_t));
    m_SharedData->m_Tail = m_SharedData->m_Head;
  }
}

// Producer interface
void* SMDataQueueProducer::RequestNode(size_t sz)
{
  m_SharedData->m_NodeInProduction = *AllocateNode(sz);
  return m_SharedData->m_NodeInProduction;
}

void* SMDataQueueProducer::EnqueueMore(size_t sz)
{
  m_SharedData->m_NodeInProduction = *ExpandNode(m_SharedData->m_NodeInProduction, sz);
  return m_SharedData->m_NodeInProduction;
}

void SMDataQueueProducer::CommitNode()
{
  m_SharedData->m_Tail = m_SharedData->m_NodeInProduction;

  m_SharedData->m_NodeInProduction = Node();
}

//Consumer interface
void* SMDataQueueConsumer::Consume(size_t& size)
{
  // operator!= and operator= must be atomic to each other for the implementation to work;
  // changing the identity of the object (so that operator!= starts returning true) must be done atomically
  if(m_SharedData->m_Head != m_SharedData->m_Tail)
  {
    m_SharedData->m_Head = m_SharedData->m_Head->Successor;
    size = GetNodeDataSize(m_SharedData->m_Head); // report the message size to the caller (pseudo-helper)
    return GetNodeData(m_SharedData->m_Head);
  }
  else
  {
    size = 0;
    return nullptr;
  }
}

As you might have noticed, head always points to a Node that has already been consumed (or is invalid at start). When the queue is empty, head == tail, but there is always a ‘guard node’ in the structure. In this way the producer and the consumer are each responsible only for their respective pointers, and in fact the only point where they touch is the head != tail check. It is very important that operator= and operator!= are mutually atomic – operator!= should never return true while the object is mid-copy and thus invalid.

In the next posts I’ll cover the memory management part of the implementation.