Robust C++ Task Systems Through Compile-time Checks

At this year’s devcom 2017 I gave a talk on our C++ task system in Hummingbird; here it is:


Compile-time check on task parameters (part 1)

The Hummingbird HTML rendering engine’s architecture is based on tasks (also known as jobs). Units of work are dispatched to a task scheduler that runs them on different threads to maximize the resource usage of the hardware.

Multi-threaded programming, however, is notoriously difficult because of the risk of race conditions. An awesome overview of how to utilize the hardware is given in the following blog series.

Hummingbird’s task system is very versatile. It allows developers to schedule tasks whose body is a C++ lambda and to put them in different queues for processing. From the inception of the system we decided to maximize the developer's freedom when creating and using tasks. This freedom, however, can be dangerous once object lifetimes and potential data races have to be taken into account.

Conceptually (although identical in code), tasks can be divided into two groups:

  • “Simple” data-oriented tasks that execute some data transformation on an independent work group and have no side effects.
  • High-level tasks that might arise from the interaction of multiple systems and objects with possibly complex lifetimes.

I definitely try to have as much work as possible done in the first type of task. A good example is the layout of a group of elements: the input data are POD structs with the styles needed for the element layout, and the output is the position of each element on the page.

High-level tasks interact with the lifetimes of objects, some of which may be reference counted, and usually those objects can be accessed only from certain threads.

A particularly nasty error that arises from using reference-counted objects is having them destroyed at unexpected moments or on unexpected threads.

In Hummingbird we try to encourage “simple” tasks and to make the higher-level ones hard to accidentally misuse. We introduced a compile-time checking system for task parameters. Classes can be marked to allow or disallow using their instances as task parameters. There are four ways for an object to be passed as a parameter to a task:

  • Passing object by value. This is always OK in our system. The object passed will be a copy private to the task so it shouldn’t involve changing global or shared state. The object can still contain pointers as members or modify global state in a method but this is better caught in design or code reviews and has never caused errors in our code so far.
  • Passing object by pointer. This is generally NOT OK. By default the system will not allow passing pointers unless they are explicitly marked by the programmer. Naked pointers passed to tasks are the source of most errors: the lifetime of the pointed-to object is usually managed outside the task and is not well defined, and there is a chance the object will be accessed concurrently.
  • Passing by shared pointer. DOM objects often have shared ownership due to the design of the standard DOM and the interactions with JavaScript. To pass a shared pointer to a certain type, the developer has to explicitly mark it as OK.
  • Passing weak pointers. In the beginning we implicitly allowed this usage but recently made them require explicit marking as well.

Explicitly marking which classes are OK to be passed between tasks has several benefits:

  • Forces the programmer to think twice about the lifetime and concurrent use of data.
  • Helps in code reviews by signaling to reviewers that they should carefully inspect the usage of certain objects.
  • Implies a contract for how the data will be used and self-documents the code.
  • Avoids inadvertently passing wrong objects to tasks, which can happen due to the easy syntax provided by lambdas.

We have also added a mechanism to disallow passing a class to tasks altogether, even by value.

The implementation of the system is based on templates and some macros to make the syntax easy.

Creating a task is done in the following way:

The key here is the TASK_PARAMS macro, which validates each parameter. In the next blog post I’ll go into detail on how the task validation mechanism is implemented.

Static variables in functions bite

The other day I was investigating the performance profile of a stress test in our Coherent Labs Renoir graphics library. Renoir collects high-level rendering commands like “DrawText”, “DrawPath” etc., and transforms them into low-level API commands for the GPU.

What caught my eye was that the function BlendingMode2State was taking ~1.5% of the time. The function is called for each drawing command (hundreds of times per frame), but this still looked disproportionate. The function is declared with the following signature:

inline BlendingState BlendingMode2State(BlendModes mode)

BlendModes is an enum that contains pre-defined blending modes, and BlendingState is a simple structure that contains the GPU operations required to implement that blend mode (SrcBlend, DestBlend, etc.; for more information on alpha blending in graphics, take a look here).

Looking at the implementation made things clear:

The implementation immediately rang a bell. The static variable holding the mapping of the blend modes is declared at function level. According to the lifetime rules of C++, it’ll be initialized the first time the function is called. This brings the side effect that each call contains a branch checking whether the variable has been initialized. Not good.

Checking the disassembly revealed another inefficiency. The initialization assembly was 2 screens long at the beginning of the function – there are a lot of BlendModes. The assembly did a conditional jump if the variable is initialized and skipped hundreds of bytes worth of initialization code to go to the gist of the function, which is just returning the correct entry in the array.

The third problem was that the initialization code hindered the inlining of the function. It was so large that it made sense that the compiler ignored the “inline” request.

To recap, there were actually 3 linked issues with the function:

  • a static variable at local scope requires a branch on each call
  • massive initialization code causes potential instruction cache thrashing
  • inlining is impossible due to the initialization code within the function

The fix was trivial:

I moved the variable outside the function. This allowed for proper function inlining and gave an overall 1% performance increase of the library in my test. Not bad for 10 minutes work.

In your own code, avoid static variables at function level – they are bad practice anyway, mostly used for singleton objects whose initialization has to be lazy. If you have such an object that is accessed very often, you might see a similar performance hit.

Type annotation in C++

In systems like game engines and our HTML renderer Hummingbird, developers have to work with objects transformed in different coordinate systems. Using one generic type can lead to confusion on what object is required in a particular situation. Errors are often subtle and hard to track. I tried to mitigate this by using stringent static typing in our software. New types are created by annotating them with metadata.

In the HTML visual model, almost everything is a box (a rectangle), and its size and position are calculated by the layout engine. The boxes however have to go through additional transformations until they reach their final positions on-screen – they can be scrolled, 2D/3D-transformed by CSS properties, etc. This is similar to a game engine, where objects also undergo transformations between coordinate systems – model, view, world, shadow and so on. Some calculations in the code require boxes in the layout coordinate system, others need them scrolled, others need them transformed.

We used a Rectangle C++ struct everywhere in Hummingbird to represent a box. It became a common error to pass the wrong Rectangle to an operation. For instance, a certain calculation needs a layout box, but the programmer mistakenly passes a scrolled one. A more explicit system was required.
If we look at a typical C++ function signature in the style:

bool HitTest(float2 coords, const Rectangle& box);

It is unclear what the exact type of the box has to be – in our software the programmer could pass a scrolled box while a layout one is required. Testing this is also tricky: if the box has no scroll (the offset is 0,0), the code will work most of the time but break upon scrolling.

The issue was solved by “documenting” the requirement:

// Requires a layout box
bool HitTest(float2 coords, const Rectangle& layoutBox);

Unfortunately this is still error-prone, as it relies only on the attention of the programmer and the code reviewer.

It’s better to have the C++ static type system help in the situation:

bool HitTest(float2 coords, const LayoutRectangle& box);

Now this is much clearer and will avoid accidental errors.

The idea can be extended to also cover the transformations themselves that lead to a certain box. The transformation (a matrix) is also strongly typed and encodes at compile time the coordinate system it works in. Instead of having a generic Matrix class, we have ScrollMatrix, CSSTransformMatrix, etc. as types. The product of a box transformation is defined by the input box and the matrix applied.

LayoutRectangle lb = LayoutElement(element);
ScrollMatrix scroll = GetScroll(element);
ScrollRectangle sr = lb.Transform(scroll);

The system will automatically generate the correct type from the input parameter and the typed transform.

Implementation

The system has the following requirements:

  • No runtime overhead
  • Types are defined by the transforms they encode
    • DisplayRectangle is Rectangle with {Layout, Scroll and CSSTransform}
    • ScrollRectangle is Rectangle with {Layout and Scroll}
  • Maximum static checking on types and improve code readability
  • Typed transforms must generate correct new types
  • Types that are a subset of another can be assigned to it. For instance:
    • DisplayRectangle ds = ScrollRectangle sr(..);

The final requirement may seem like a defeat for the static type system, but it makes sense. A DisplayRectangle created from a ScrollRectangle is simply one that has an identity CSSTransform. This significantly simplifies intermediate calculations and avoids redundant copies.

The implementation of the system relies on template metaprogramming. It is one of the few cases where I find a good application for its somewhat esoteric constructs.
Coordinate systems are defined as types:

namespace CoordComponents
{
     struct Layout{};
     struct Scroll{};
// ..
}

Typed Rectangle is a thin wrapper around our generic Rectangle class.

template<typename... Components>
class Rectangle
{
// ...
};

Commonly used types are defined as:

using LayoutRectangle = Rectangle<CoordComponents::Layout>;
using ScrollRectangle = Rectangle<CoordComponents::Layout, CoordComponents::Scroll>;

The same principle is applied to the Matrix class which has a list of transformation components as its type signature.
The gist of the method validation and type synthesis can be seen in the Unite2D and Transform methods.

template<typename... RhsComponents>
void Unite2D(const Rectangle<RhsComponents...>& other, bool allowEmpty = false)
{
    static_assert(sizeof...(Components) >= sizeof...(RhsComponents), "Cannot assign to type with less components than operand!");
    static_assert(meta_contains_types<meta_packer<Components...>, meta_packer<RhsComponents...>>::value, "Operand has components that are not part of this object!");
    m_Value.Unite2D(other.m_Value, allowEmpty);
}

Unite2D can take as parameter any other type of Rectangle but we validate:

  1. That the parameter does not have more components than this rectangle. This avoids uniting “broader” rectangles into narrower ones, like uniting a ScrollRectangle into a LayoutRectangle. They belong to different coordinate systems, so the operation is invalid.
  2. The second check handles situations where the two Rectangles have different sets of components. Uniting a Rectangle<Layout, Scroll> with a Rectangle<Layout, Transform> is invalid.

The static_asserts use some meta functions that implement the actual type checking.

The Transform method shows type synthesis:

template<typename... MatrixComponents>
typename meta_unite_params<Rectangle<Components...>, Matrix<MatrixComponents...>>::type
Transform(const Matrix<MatrixComponents...>& mat) const
{
    typename meta_unite_params<Rectangle<Components...>, Matrix<MatrixComponents...>>::type result;
    result.m_Value = m_Value.Transform(mat.Unwrap());
    return result;
}

The declaration is definitely a mouthful, but it basically says: “make a Rectangle whose components are the union of the current one’s and those of the Matrix”. For instance:

LayoutRectangle lr{...};
Matrix<Scroll, Transform> mst{...};
auto result = lr.Transform(mst); /* result is Rectangle<Layout, Scroll, Transform> */

In this case “result” will have the type Rectangle<Layout, Scroll, Transform>, that is, a Rectangle that is a layout box with scrolling and some CSS transformation.

Results

The system is relatively new in our code, so the long-term impact still has to be measured. I find however that local operations and data members are now much clearer in their intent, and the amount of confusion has definitely decreased. I was concerned about compilation times, but found no significant slowdown since the introduction of the system.

While the implementation is somewhat complex due to the metaprogramming, the benefits outweigh that cost and the usage itself is straightforward.

MSVC mutex is slower than you might expect

TLDR: The 2015 MSVC C++ runtime’s std::mutex implementation (and potentially others) is significantly slower than the equivalent code written by hand. The reason is that the runtime is built with a Windows feature called “Control Flow Guard” and uses function pointers.

While profiling some code in our new HTML5 renderer, I noticed that the “mutex” implementation we use on Windows was unexpectedly slow compared to other platforms. On Windows we were using the std::mutex implementation in the MSVC 2015 runtime. I decided to dig through the code and see what was going on. Everything was running on Windows 10.

Looking at the code revealed that the implementation of the mutex depends on the version of Windows (Microsoft Visual Studio 14.0\VC\crt\src\stl\primitives.h). This makes good sense, because newer versions of Windows include faster synchronization primitives. On Vista+ the locking is implemented with a SRWLOCK, which is very fast.

At that point I thought the implementation was pretty good and that there might be an issue with my profiling, so I did a simple test. In an auxiliary application I ran 3 test suites, each locking a mutex 10000 times, performing a simple computation, unlocking, and measuring the time for all those operations.

I measured 3 scenarios:

  • std::mutex
  • manual CRITICAL_SECTION
  • manual SRWLOCK

The SRWLOCK implementation was slightly faster than the CRITICAL_SECTION (~5-10%), which is expected, but the std::mutex was 30-40% slower than the rest. At that moment I was really surprised, because essentially the implementation was the same – just locking and unlocking the SRWLOCK. The std::mutex has some additional code – it gets the current thread id (I simulated this too and it’s very fast – no noticeable change in performance), it calls virtual methods, and the execution of the SRWLOCK functions happens through function pointers. None of those should incur such a large difference, though.

So I dug into the assembly. It turned out that when the CRT calls __crtAcquireSRWLockExclusive (and all the other SRWLOCK methods), they don’t go directly into the Windows Kernel32.dll! A lot of checks and code are executed between entering the method and actually arriving in AcquireSRWLockExclusive, which is where we want to go. The reason is a Windows feature called Control Flow Guard. Essentially this is a security feature that instructs the compiler to check every jump through a function pointer and validate that it is a valid target. This is possible because on modern Windows, all function addresses are annotated and known to the loader. The feature prevents jumping “in the middle” of functions and makes hacking more difficult.

While CFG might be very important for some classes of applications, its performance impact on a game is unacceptably high. Unfortunately, when using the MSVC runtime you have no choice, because it’s pre-built with the feature. While normal function calls are fine and bypass the mechanism, the std::mutex implementation falls flat here. In order to support multiple Windows versions, the authors have to rely on function pointers for routines that might be missing from older Windows versions.

It’s unfortunate that although the authors carefully used different approaches on different Windows versions to squeeze out the best performance (the 2015 version is much, much better than older std::mutex implementations), this compiler feature has essentially defeated their efforts.

Fixing this in your application is really trivial – creating a custom mutex implementation with SRWLOCK is less than 2 minutes of work. I didn’t investigate other STL classes that might suffer from the same problem, but I assume there are others among the synchronization mechanisms (where, unfortunately, you want the most performance).

It’d be great if Microsoft provided more versions of the runtime libraries – especially “fast” ones without the security features. It would be even better if they provided all the source for the runtime and allowed developers to compile it themselves.

Voxels 0.6.0 released

I have just uploaded version 0.6.0 of the Voxels library.

Changes include:

  • New API for memory usage statistics
  • Substantially decreased run-time memory usage
  • Slight increase in the polygonization performance
  • Removed all internal Windows-specific dependencies

I’m now working on further improving the polygonization performance as well as some ports and integrations. More info coming soon…

The Voxels library

Some days ago I finally released the first public alpha of my Voxels library project. For quite some time I’ve been interested in volume rendering for real-time applications. I believe it has a lot of future applications.

The library is the fruit of some of my work in that field, but alas I work on it only in my extremely limited spare time, as I’m concentrating on my company – Coherent Labs. The main ideas behind the library have already been highlighted in my previous posts and in the talk I gave at Chaos Group’s CG2 seminar in October 2013. Preparing the library for release has been a much longer process than I expected, but the Windows version at least is finally downloadable from github here, along with a very detailed sample here.

A hand-sculpted surface started as a ball. Voxels supports multiple materials and normal mapping

Some internal detail

The polygonization algorithm used is TransVoxel. I chose it because it is very fast, proven correct, and easy to parallelize. Eric Lengyel’s Ph.D. thesis on the subject is very interesting, and I recommend it to anyone interested not only in volume rendering but in real-time rendering in general. The algorithm addresses one of the most important issues with volume rendering techniques – the need for LOD. Currently the meshes produced by the library are very “organic” in their shape (due to the Marching Cubes roots of the algorithm) and are best suited for terrains and other earth-like surfaces.

My implementation produces correct meshes relatively fast, scales extremely well, and tries to keep memory usage low. Currently I’m using simple RLE compression on the grid, which works surprisingly well, giving very fast run times and large compression ratios (30x+). Lengyel reports using it in his implementation too, with satisfactory results.

The polygonization process is multi-threaded and uses all the available machine cores. There is still room for API improvement here – to set limits on the resources used and eventually on the exact cores employed.

In the sample application I’ve also added an implementation of an octree LOD class that culls blocks of the mesh and, more importantly, decides which blocks to draw at which LOD level and when to draw the transitions (the transitions are meshes that fill the gaps between adjacent blocks of different LOD levels).

Future

I intend to continue the active development of the library. Currently the focus will be on adding support for Linux and maybe Mac OS X, and on improving the polygonization speed even further – especially when editing the voxel grid. The API also needs some polishing – I’m currently working on an integration of the library with an open-source engine and see some issues with it.

I’d also like to update the sample, or create a new one, that draws the entire current state of the mesh in one draw call through one of the indirect rendering APIs.

Feedback is extremely appreciated. If you find the library interesting and would like to use it for something or have any suggestions/ideas – drop me a line.