Development process in CL

I wrote a blog post about the development process we use at Coherent Labs – check it out.


Generic memory allocator for C++ (part 3)

This post continues the series on how the coherent-rpmalloc generic C++ memory allocator works. If you haven’t read part 1 and part 2, please do before continuing, as they are integral to understanding the implementations described below.

The algorithms in part 2 are relatively simple, but they have some tricky moments, which I’m going to describe in Q&A form.

Q: Given only the bare pointer passed to void rpfree(void* pointer), how does the library find the span it belongs to, so that the algorithm can get to the meta-data needed to handle the de-allocation?

A: This is one of the keys to the great performance of the allocator and a very clever thing that Mattias Jansson, the original rpmalloc creator, did. In his implementation every span has to be 64KB-aligned. This means that from every pointer allocated by the library, we can get the span it belongs to by just masking off a bunch of bits, which is extremely fast. Some allocators achieve this by putting meta-data in front of every allocation, but that significantly increases the memory overhead. The only downside is that it requires the embedder to always pass 64KB-aligned memory to the library. rpmalloc can get away with this because it gets its memory directly from the OS: on Windows, VirtualAlloc already returns 64KB-aligned addresses, while on POSIX systems it’s also relatively easy to map memory that way.
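
Here is a minimal sketch of the trick (the names are illustrative, not the shipping rpmalloc code): given 64KB-aligned spans, the owning span of any pointer falls out of a single bit-mask.

#include <cstdint>

// With spans aligned to 64KB, the low 16 bits of any pointer inside a span
// are just an offset into it; masking them off yields the span header address.
static constexpr uintptr_t SPAN_SIZE = 64 * 1024;
static constexpr uintptr_t SPAN_MASK = ~(SPAN_SIZE - 1);

struct Span; // span meta-data lives at the start of each span

inline Span* span_from_pointer(void* p) {
    return reinterpret_cast<Span*>(reinterpret_cast<uintptr_t>(p) & SPAN_MASK);
}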

This is very inconvenient for us, as it would put a heavy burden on games that use Hummingbird, so we had to get rid of the requirement. To put it in perspective: if we request 64KB from the client with a 64KB alignment, at worst the embedding developer will have to allocate 128KB just to fulfill the alignment requirement – a huge waste.

Q: How does coherent-rpmalloc remove the need for 64KB-aligned memory, while keeping the performance?

A: Basically the problem boils down to:

  • From an allocated pointer (a block), how do we get the address of the span that allocated it, so that we can reach the meta-data and de-allocate it?

At first I tried different techniques to remove the requirement, starting with a look at how other implementations solve the problem:

  • Some allocators put meta-data in front of every block. This would waste a lot of memory for small allocations and pollute the CPU cache with unnecessary data when used.
  • tcmalloc, which was rpmalloc’s inspiration, uses a radix-tree to associate each allocation with a pointer to its owning structure. However, for 64-bit pointers that radix-tree can potentially grow up to ~4MB. Hummingbird as a whole seldom uses more than 10MB, so wasting 4MB just on allocator bookkeeping was out of the question.
  • I rolled my own sorted skip-list that associates each allocation with the span that made it. A skip-list can be made lock-free, but my tests, even on a single thread, showed a significant slow-down compared to bit-masking. While the search has very good big-O complexity, the performance was poor because lookups very often involved CPU cache misses. While it was possible to squeeze my skip-list into a contiguous piece of memory to improve cache locality, I ultimately decided that the increase in complexity was too large and the results not guaranteed.

In the end I decided that keeping the span-alignment trick was the best option, but I had to mitigate the risk of wasted memory.

What if, instead of requesting individual spans from the user like rpmalloc does, we request an array of spans? This is why segments got introduced.

Our spans are 8KB and need 8KB alignment for the bit-masking to work.

  • If we request individual spans, like rpmalloc does, then 8KB with 8KB alignment would at worst need a 16KB “real” allocation to fulfill the alignment – a 50% waste.
  • If we request 32 spans from the user, that is 256KB with an 8KB alignment, then the possible waste is only ~3.03% (at worst ~8KB of alignment padding on a 264KB allocation). Much better.

This is how segments were born. We only request and return to the user whole segments of 256KB.

This approach has multiple upsides:

  • Drastically reduces the risk of memory waste
  • Reduces the communication between the library and the embedder – if their allocator is slow, it’ll have almost no effect on Hummingbird.
  • Slightly improves cache locality

And some downsides:

  • Segments have to be managed, and that management has to be performant
  • There is a risk of memory waste in degenerate cases. As we only return whole segments to the user, a segment has to be completely free before we can return it – even a single live allocation keeps it alive. In practice our workload does not hit such cases, but it is something to keep in mind.

[Figure: segment layout]

Q: How are span caches & segments handled?

A: rpmalloc has multiple levels of span caches. The per-thread ones are pretty trivial, because they are touched only by the single thread that owns them, so no synchronization is needed to add/remove spans from the local cache.

The global span cache is protected by a spin lock allowing adding/removing spans from multiple threads.

Segments are always shared across multiple threads. All segments are put in a linked list and each segment has 32 spans inside. The algorithm to “get” a fresh span is relatively simple:

  • Walk the list of segments and look for a free span:
    • Each segment has an “unsigned int” used as a bit-mask, which indicates which of the 32 span “slots” in the segment are free. Updating it in a lock-free way is very easy: just do a CAS operation on the word, trying to flip the bit we want to mark as used or free (sketched after this list).
  • If there are no free spans in any segment
    • Allocate a new segment by requesting memory from the user
    • Adding/removing segments from the linked list is done through a readers-writer lock
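
A hedged sketch of claiming a span slot with a CAS on the segment’s occupancy mask (I’m assuming here that a set bit means “in use”; the real code may differ):

#include <atomic>
#include <cstdint>

// Try to atomically claim a free span slot in a segment's 32-bit mask.
// Returns false if the segment is full.
bool try_claim_span(std::atomic<uint32_t>& mask, unsigned& slotOut) {
    uint32_t current = mask.load(std::memory_order_relaxed);
    while (current != 0xFFFFFFFFu) {       // at least one free slot remains
        unsigned slot = 0;                 // find the first zero bit
        while (current & (1u << slot)) ++slot;
        const uint32_t desired = current | (1u << slot);
        if (mask.compare_exchange_weak(current, desired,
                                       std::memory_order_acquire)) {
            slotOut = slot;
            return true;
        }
        // CAS failure reloads 'current'; retry with the fresh value
    }
    return false;
}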

As segments are iterated from multiple threads much more often than they are added/removed, it is really important to have very good read performance on the list, so I implemented it with a readers-writer lock.

A single “unsigned int” is used as the lock primitive – one bit is reserved for the “write” lock, while all the rest are used as a counter holding the number of readers currently in the lock:

  • When a reader wants to enter the locked section (iterate the linked list), it tries to CAS the unsigned lock to [0 | READERS + 1], where READERS is the current lock value with the write bit masked off.
  • Releasing the read lock just CASes the value to [READERS – 1].
  • A writer tries to CAS the lock to “1”, which means the lock is taken for writing. The operation requires a reference value of “0”, i.e. nobody is reading or writing the linked list.
  • Releasing the write lock just sets the lock back to “0”.
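
A minimal sketch of such a readers-writer spin lock over a single unsigned int (illustrative; the real implementation may differ in details such as back-off):

#include <atomic>

class RWSpinLock {
    static const unsigned WRITE_BIT = 1u;  // lowest bit: writer present
    std::atomic<unsigned> m_Lock{0};       // remaining bits: reader count

public:
    void lock_read() {
        for (;;) {
            unsigned v = m_Lock.load(std::memory_order_relaxed);
            if (v & WRITE_BIT) continue;   // a writer holds the lock; spin
            if (m_Lock.compare_exchange_weak(v, v + 2,  // READERS + 1
                                             std::memory_order_acquire))
                return;
        }
    }
    void unlock_read() { m_Lock.fetch_sub(2, std::memory_order_release); }

    void lock_write() {
        unsigned expected = 0;             // nobody reading or writing
        while (!m_Lock.compare_exchange_weak(expected, WRITE_BIT,
                                             std::memory_order_acquire))
            expected = 0;
    }
    void unlock_write() { m_Lock.store(0, std::memory_order_release); }
};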

[Figure: readers-writer lock transitions]

Q: How do you support threads whose lifetime you can’t control, when there is thread-local data involved?

A: The original rpmalloc has two functions, “rpmalloc_thread_initialize” and “rpmalloc_thread_finalize”, that need to be called on each thread before using the allocator and after you’re done with it, respectively. They set and later clear the thread-local heap object.

Unfortunately, for our middleware this solution was not viable. Some game engines – Unreal Engine, for instance – move their workload across multiple new threads. The prime example is its rendering thread, which actually changes from time to time without notifying plugins. If we used the default rpmalloc way, we’d risk “leaking” heaps: if UE4 destroys a thread where we have initialized a heap and we are not notified, that heap will leak.

Coherent-rpmalloc solves this by having a pool of heaps that are dynamically re-assigned to threads working with the library. All entry points in the Hummingbird library are already marked within our code. When the user calls any method and the thread-local data is not set for that thread, a heap is assigned to it. When code execution leaves the library, that heap is again marked as reusable. We have a fixed number of threads that can run simultaneously, so there is no risk of leaks. Each thread holds a preferred heap, so that it gets the same heap every time it enters the library. Only when a new thread comes into play will it eventually take the heap of another thread – one that has probably been destroyed without the library’s knowledge.
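
A hedged sketch of the idea (the names and the fixed pool size are illustrative, not the shipping code):

#include <atomic>
#include <cstddef>

struct Heap; // opaque per-thread allocator state

static const std::size_t kMaxThreads = 8; // fixed simultaneous-thread budget

struct HeapSlot {
    Heap* heap = nullptr;
    std::atomic<bool> inUse{false};
};

static HeapSlot g_HeapPool[kMaxThreads];
static thread_local HeapSlot* t_Preferred = nullptr; // heap used last time

// Called on every library entry point.
HeapSlot* AcquireHeap() {
    // Fast path: re-take the heap this thread used on its previous entry.
    if (t_Preferred) {
        bool expected = false;
        if (t_Preferred->inUse.compare_exchange_strong(expected, true))
            return t_Preferred;
    }
    // Slow path: take any free heap - its previous owner has probably been
    // destroyed without notifying the library.
    for (auto& slot : g_HeapPool) {
        bool expected = false;
        if (slot.inUse.compare_exchange_strong(expected, true)) {
            t_Preferred = &slot;
            return &slot;
        }
    }
    return nullptr; // more simultaneous threads than the pool supports
}

// Called when execution leaves the library.
void ReleaseHeap(HeapSlot* slot) {
    slot->inUse.store(false, std::memory_order_release);
}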

Q: Why did you move the library to C++?

A: The original rpmalloc uses some inline assembly and platform-specific macros for thread-local data and atomic operations. I really wanted to move this to standard C/C++, because I also needed to support additional platforms like game consoles. It would’ve been great to move to C11, but it’s not supported on all compilers we target, so I went for C++11. All atomic operations use standard primitives, while thread-local storage uses pthread on platforms that have no thread_local support.
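
For illustration, the TLS fallback can look roughly like this (HAS_THREAD_LOCAL and the function names are my placeholders, not the library’s actual ones):

struct Heap;

#if defined(HAS_THREAD_LOCAL)

static thread_local Heap* t_Heap = nullptr;
inline Heap* get_thread_heap()        { return t_Heap; }
inline void  set_thread_heap(Heap* h) { t_Heap = h; }

#else // fall back to pthread TLS

#include <pthread.h>

static pthread_key_t g_HeapKey; // pthread_key_create(&g_HeapKey, nullptr)
                                // must run once at library initialization
inline Heap* get_thread_heap() {
    return static_cast<Heap*>(pthread_getspecific(g_HeapKey));
}
inline void set_thread_heap(Heap* h) { pthread_setspecific(g_HeapKey, h); }

#endif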

This pretty much sums up the most interesting parts of the allocator we crafted. If you have additional questions or suggestions, please don’t hesitate to comment.

Generic memory allocator for C++ (part 2)

In my previous post I highlighted the requirements we have for a generic memory allocator within the Hummingbird game UI middleware. To quickly recap, we want a memory allocator (malloc/free) that:

  • Is fast – especially in a multi-threaded environment
  • Wastes little memory
  • Can be easily modified
  • Does not reach out for the OS

The current result of our work is a fast, multi-threaded, low-overhead, embeddable, relatively simple (~1800 lines of C++), rpmalloc-based, permissively licensed allocator optimized for small allocations. The code is available on github under the name coherent-rpmalloc and is still evolving.

In this and the next post I’m going to describe the implementation details and how the allocator is different from the initial rpmalloc implementation.

Overall goals

Coherent-rpmalloc, like the original rpmalloc, relies heavily on caches and thread-local pointers. The key to multithreaded performance is reducing the contention between threads. The allocator achieves this by keeping as much data as possible on a per-thread basis and acquiring shared resources very seldom. It is tuned towards applications that allocate relatively small amounts of data – in the tens of megabytes. This is different from the original rpmalloc, which handles even larger heaps well. The tuning can be adjusted by changing the sizes of the different memory subdivisions & caches.

A lot of information on the design can also be found in the documentation of the original rpmalloc; there are, however, differences in coherent-rpmalloc.

Memory subdivision & lingo

The memory in the allocator is handled through 4 major data-types, which form a hierarchy:

  • Blocks – each allocation of a certain size falls in a size class. All allocations of that size class belong together – for instance all the ones between 1 and 16 bytes, all the ones between 16 and 32 bytes etc. A memory request is rounded up to the upper bound of the size class it belongs to. The block is such an allocation – the pointer the allocator returns to the user and that the user can later free. Blocks have different sizes depending on their size classes – they represent a single small/medium allocation.
  • Pages – intermediate memory slices between blocks and spans; like blocks, they are just a size subdivision with no logic of their own (see below).
  • Spans – a span holds blocks of a single size class, along with the meta-data needed to allocate and free them. In coherent-rpmalloc spans are 8KB and 8KB-aligned.
  • Segments – an array of 32 spans (256KB). Whole segments are the unit of communication with the embedder: memory is requested from and returned to it only in segments.


By far the most important structures are spans and segments; blocks and pages are just useful memory subdivisions which carry no code logic – think of them as size slices.

[Figure: memory subdivision – segments, spans, pages and blocks]

Allocations are subdivided into different buckets by size – small, medium, large and extra-large. All the complex handling and machinery in the library applies to small, medium & large allocations. In our case I settled for everything up to 8144 bytes. All allocations larger than that are classified as extra-large and are directly handed over to the embedder.

Hummingbird requests for chunks bigger than 8KB are almost always made by other internal allocators or for large resources (images, fonts etc.). They happen relatively seldom – only ~1% of the memory requests are that large – so their performance impact is overall negligible. I also wanted the library to immediately return such large chunks to the embedder application instead of holding them in internal caches.

Anatomy of an allocation & de-allocation

When the user calls the void* rpmalloc(size_t sz) method, the library has to fulfill the allocation. This happens in multiple steps (a simplified sketch in code follows the list):

  • The allocation’s size class is determined. This tells us where to look for free blocks large enough to fulfill the allocation.
  • The current thread’s heap is loaded from thread-local storage. Each thread using the library has a local heap (a struct) that contains all the data necessary to allocate and deallocate memory, and it can almost always fulfill a memory operation by itself.
  • The heap has an active span for each size class – the active span is just the latest span that was used for allocating. If it has some free blocks – the first free block will be used.
  • If there are no free blocks, the library will try to:
    • De-allocate some blocks – mark them as free. Blocks freed from other threads are not immediately marked, but are put in a list. The owning thread (the one that allocated them from its own heap) will mark them on the next allocation.
    • Look for an empty span in the local thread cache
    • Look for an empty span in the global cache
  • If no span was found with empty blocks, a new span is requested:
    • Look for a segment that has empty spans and take one
    • If there are no segments with free spans
      • Request a new segment allocation (to external embedder)
      • Return the first span and put the segment in the list of segments
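
Put together, the allocation path looks roughly like this – a structural sketch with illustrative names, where every helper function is a stand-in for one of the steps above, not the real coherent-rpmalloc internals:

#include <cstddef>

struct Span { void* take_free_block(); };   // nullptr when the span is full
struct Heap {
    Span* activeSpan[64];                   // one active span per size class
    void  process_deferred_frees();         // adopt blocks freed by others
    Span* cache_pop(unsigned cls);          // per-thread span cache
};

static const size_t MAX_MEDIUM_SIZE = 8144;

void*    embedder_allocate(size_t sz);      // extra-large path
unsigned size_class_for(size_t sz);
Heap*    get_thread_heap();
Span*    global_cache_pop(unsigned cls);
Span*    segment_take_span();               // may request a new segment

void* rp_alloc_sketch(size_t sz) {
    if (sz > MAX_MEDIUM_SIZE)
        return embedder_allocate(sz);       // handed straight to the embedder

    const unsigned cls = size_class_for(sz);    // 1. size class
    Heap* heap = get_thread_heap();             // 2. thread-local heap
    Span* span = heap->activeSpan[cls];         // 3. active span

    if (span)
        if (void* block = span->take_free_block())
            return block;

    heap->process_deferred_frees();             // 4a. deferred frees
    span = heap->cache_pop(cls);                // 4b. local thread cache
    if (!span) span = global_cache_pop(cls);    // 4c. global cache
    if (!span) span = segment_take_span();      // 5. segments / embedder

    heap->activeSpan[cls] = span;
    return span->take_free_block();
}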

The de-allocation works in pretty much the opposite way:

  • From the pointer – find the owning span. This is an interesting step and very performance-sensitive; I’ll go into the details in the next post.
  • Mark the block in the span as empty. Empty blocks are put in a free-list within the span, where the memory of the block itself is used as the “next” id. As that memory is now garbage, we can safely re-use it for internal data (see the sketch after this list).
  • If the span is completely empty (all blocks are free)
    • Put in caches – depending on configurable ratios, the span will go to the local or global cache or will need to be freed altogether
  • If the span has to be freed altogether
    • Mark in the owning segment that the span is free
  • If all spans in the segment are free, return the whole segment to the embedder, actually freeing the memory.
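
The in-place free list costs no extra storage, since freed blocks store the list links themselves. A minimal sketch (assuming 32-bit block indices, blocks of at least 4 bytes, and illustrative field names):

#include <cstddef>
#include <cstdint>

struct SpanFreeList {
    char*    base;        // first block in the span
    uint32_t blockSize;   // block size of this span's size class (>= 4)
    uint32_t freeHead;    // index of the first free block, ~0u if none

    void push(void* block) {
        const uint32_t idx =
            uint32_t(static_cast<char*>(block) - base) / blockSize;
        // the block's memory is garbage now - reuse it as the "next" id
        *static_cast<uint32_t*>(block) = freeHead;
        freeHead = idx;
    }

    void* pop() {
        if (freeHead == ~0u) return nullptr;
        char* block = base + size_t(freeHead) * blockSize;
        freeHead = *reinterpret_cast<uint32_t*>(block);
        return block;
    }
};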

The next post in the series will describe the most interesting implementation details of the library and how most of the operations were implemented.

Generic memory allocator for C++ (part 1)

TLDR; You can check the latest version of our generic C++ memory allocator on github.

Memory allocations in C++ are a major performance pain-point for every real-time application. Most games try to reduce/remove any need for heap allocations during a frame and use different techniques to avoid them. I’ve already blogged about some of the techniques we use in Hummingbird – linear allocators for temporary & per-frame memory, pool allocators etc.

Avoiding any dynamic (generic) memory allocation whatsoever is difficult and an ongoing effort in all AAA software I know of, Hummingbird included. You still need to do a new/delete/malloc/free here and there – hence a generic memory allocator is still needed.

What is a generic memory allocator?

In C and C++ the “generic” memory allocator is the pair of functions malloc/free, which allow you to use memory off the heap (aka free-store). In C++ the new & delete operators also usually end up in malloc/free to get the memory needed for objects.

I call it generic, because it should fulfill any memory request – from a single byte to potentially gigabytes of memory, often with specific alignment requirements.

The standard C runtime provides implementations of malloc/free whose characteristics depend on the specific OS/compiler. The quality of these implementations unfortunately varies wildly, which makes it difficult to offer a consistent application experience.

The main features a good allocator should have are:

  • Good performance – calling malloc/free should complete relatively quickly
  • Small memory overhead – each allocator needs some mechanism to keep track of its internal state, used/free memory etc, which involves some additional overhead.

There are dozens of OSS generic memory allocator implementations with different characteristics; to mention some of the more popular: dlmalloc, jemalloc, tcmalloc, ptmalloc, nedmalloc, hoard, rpmalloc etc. Each involves different design decisions and many claim to be the “fastest”; however, applications have different allocation patterns and one allocator might be better than another for your particular use case – measure before committing.

Our software also has its specifics, which define the requirements for our generic allocator.

Specifics of allocations in a library

Hummingbird is a C++ UI software library designed to work within an application, and it has to “play nicely” with it. One of the most important aspects is never directly using a resource under the hood that the application can’t control. Memory is a prime example: Hummingbird never allocates any memory directly – it asks the main application for memory. This gives the game developer full control to sub-allocate from UI-specific arenas and track any memory that the library uses.

So far Hummingbird has asked the application for memory on each allocation and returned it on each deallocation. This gives complete control to the user, but can sometimes be risky. The major issue we’ve seen is that the generic allocators in some engines and on some platforms are quite bad. This means that in some cases the library can be slowed down by factors external to it. It also means that a bug in the game allocator will very likely hit the middleware as well.

We really want consistent performance and stability for our library across all platforms and engines, so we decided to go for a hybrid solution:

  • Hummingbird will have an internal generic memory allocator that will serve all internal memory needs
  • The allocator itself will request memory from the game in larger segments – the user is still in full control of the memory subsystem

In this way the user’s memory system is touched very seldom, which practically eliminates any performance effect it might have, and substantially reduces the risk of memory corruptions introduced by it.


Looking for an embeddable memory allocator

We quickly decided the requirements of our allocator:

  1. Should be “embeddable” – meaning it should allow being given memory from an outside source (the game in our case) and not allocate memory directly from the OS (via mmap, VirtualAlloc etc.)
  2. Should be fast – especially for small allocations. Large allocations in our software are either resources – images, fonts – or memory pools that we manage internally anyway.
  3. Should be fast in a multi-threaded environment – Hummingbird is heavily MT
  4. Should “return” memory often – many allocators are extremely conservative and hold on to the memory they’ve taken. We want to return any excess memory to the game quickly.
  5. Should be relatively simple to extend/modify

This list significantly narrowed the possibilities. Almost all OSS memory allocators are designed to be the “one-and-only” generic allocator within an application and don’t fulfill 1) and 4) (for instance jemalloc and tcmalloc).

The venerable dlmalloc unfortunately hasn’t aged well and is not good in a MT environment. ptmalloc3 is LGPL – I haven’t really tried it. In the past I’ve used nedmalloc, which fulfills the API requirements but is somewhat slow – the current Windows memory allocator beats it all the time.

In the end we decided to take a closer look at rpmalloc.

  • It claims (and achieves – I tested it under real workloads) great performance.
  • It can be easily studied & modified

However it also has some downsides for our particular case:

  • It can’t be used verbatim, because it allocates from the OS – this is trivial to change.
  • It requires all allocations to be 64KB-aligned – this is actually a major issue for us. rpmalloc can get away with this requirement in the generic-application case, because every VirtualAlloc on Windows is 64KB-aligned anyway, while on POSIX it’s relatively easy to map memory so that the requirement is fulfilled. This is NOT a requirement we can impose on game developers. Imagine requesting 1KB and requiring 64KB alignment. If the game can’t easily manipulate the virtual memory mapping, in the worst case it’ll have to allocate 65KB just to make sure the alignment is right – not fun at all.
  • It holds on to memory a bit too aggressively for our requirements – even when compiled with a more “memory-saving” option.

So we decided to give it a try and modify it – keep the awesome performance and overcome the limitations. You can check our current fork on github.

The next post in the series will get our hands dirty with the details of how rpmalloc works and the changes we made to adapt it for our needs.

Bringing fast HTML5 on embedded devices

Graphical user interfaces are now all around us – no longer only on our PCs, phones and tablets. GUIs are also in our cars, home appliances and even elevators. As an engineer I was curious how those are developed for embedded systems, and I’d like to share what I found. Unsurprisingly, there are two schools of thought – build the UI with native/hardcoded C/C++ code (often generated from visual tools), or use a data-driven, markup-based language combined with scripting – HTML5 & JavaScript. In that regard the embedded GUI eco-system is similar to PC and mobile. Both approaches have upsides and downsides.

The most common opinion is that C and C++ based solutions are faster, while HTML5-based ones are too slow except for the most beefy devices, which relegates them to very expensive cars or machinery. Qt, one of the popular proponents of the C++ camp, even published a whitepaper that compares the two approaches and concludes with the performance superiority of the native solution. As a C++ developer myself (doing a lot of low-level optimization work), I also often tend to share this limited mindset. However, this is dogma – what is the reality of the trade-offs of both approaches?

The touchscreen in a Tesla automobile. Image courtesy of Tesla

There is a very big advantage to creating UI with standard HTML5 & JavaScript: the immense industry know-how in the technology, hundreds of production-grade libraries and tools, and most importantly, enabling front-end developers to work on embedded UIs. The end results of such democratization are better interfaces and happier customers.

With the non-native approach, the problem to overcome is performance. But I believe the problem is not the choice of technology per se, but the specific implementation. Most HTML5-based solutions use Chromium, which simply is not made for embedded architectures. The reluctance of developers to put millions of lines of code on such limited hardware and to use a software solution famous for eating gigabytes of RAM like candy is more than reasonable. This concern is pretty much the same one we faced in video games when we started Coherent Labs in 2012. Developers wanted and saw the advantages of HTML5 as a UI technology, but there was no solution that could cope with the performance requirements of games. We solved these problems and built Coherent GT and Hummingbird, currently powering hundreds of games.

We can draw a significant parallel between the requirements of video games and embedded systems:

  • Need for efficient rendering – in games there is very little time to finish the work before the frame has to be displayed in order to maintain 30 or 60 frames per second. In embedded you have low-powered (read: slow) systems that still have to achieve similar FPS.
  • Low memory footprint – in games, developers want to devote most of the memory to textures, sounds, 3D models and so on, to create the most appealing gameplay and visuals, so the user interface should take very little memory. Embedded devices usually have a small amount of RAM.
  • To create a good user experience, the interfaces of both games and embedded devices need animations, audio and video. The bar for UI complexity and the user expectations for visuals are already very high.

In theory, Hummingbird on embedded will give us the best of both worlds – an awesome workflow and the required performance.

A rear-seat entertainment system would be a great fit for dynamic HTML5 technology. Image courtesy of BMW

I intend to test Hummingbird on some embedded devices and measure how it performs. The first step will be to port it to an embedded OS – I’ll probably start with Linux on ARM and a Raspberry Pi, an easily accessible platform with good tools for a proof of concept.

If you are an embedded UI developer – what are your biggest issues and what platforms do you target? If you’re interested to help me with the experiment, don’t hesitate to write a comment or reach me on Twitter @stoyannk!

Small vector optimization

One of the key performance optimization techniques involves reducing the number of dynamic memory allocations a program does. The reasons are:

  • Generic memory allocations are relatively slow
  • Heap-allocated objects reduce the effectiveness of CPU caches (misses are more probable)

There are numerous techniques to reduce the cost of allocations, but not allocating memory from the heap in the first place would be ideal. This is not possible in most cases, when the data we manipulate is truly dynamic, but we can identify common cases where 99% of the time an allocation can be avoided, because the size is often the same and small.

C++ strings are a prime example of objects that allocate too often. A naive implementation would always keep the data of the string on the heap, which requires at least one allocation and possibly cache misses when code accesses the content. Developers noticed that, and now effectively all std::string implementations implement the so-called “small string optimization”. Each string object contains a small inline buffer that holds the data if it fits; it allocates only if the data can’t fit that buffer. As most strings in practice are small, this saves an allocation and improves cache locality very often.

Small Vectors

The same technique can be implemented for the more generic vector type. It’s also widely implemented in many libraries. A very inspirational lecture on the successes of using “small vector optimization” is given by Chandler Carruth in his talk “High Performance Code 201: Hybrid Data Structures” where he discusses its application in LLVM.

Instead of always allocating the buffer for the objects in the vector on the heap, we have a small inline piece of memory that can hold N objects. The constant N depends on the usage and should be carefully selected for the context, trading saved allocations against the potential memory waste when there are fewer than N objects.

I decided to apply the optimization in certain areas of our Hummingbird software and measure the gains. In our “standard” library I implemented a std::vector substitute creatively named “dyn_array”, from “dynamic array”. It has the same public interface as std::vector, so they are easily interchangeable.

It’s declared like this:

template<typename T, typename Allocator, size_t InlineCapacity = 0>
class dyn_array

The interesting part is the InlineCapacity parameter, which defines how many objects will be stored inline in the structure itself. When InlineCapacity is 0, dyn_array works like a standard vector – everything is allocated through the Allocator, usually from some heap.

To save memory within the dyn_array object, the inline buffer is also used as the pointer to the heap memory when necessary. The declaration is this beauty:

using DataStorage = char [Max_eval<sizeof(T) * InlineCapacity, sizeof(T*)>::value];
alignas(Max_eval<alignof(T), alignof(T*)>::value) DataStorage m_DataStorage = {0};

Edit: @lectem suggested that the same effect is achievable with std::aligned_storage

Max_eval is a meta-function that returns the maximum of two compile-time values. The DataStorage should be big enough to hold at least a pointer, while the field itself has to be aligned properly for the inline objects.
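
For reference, a plausible Max_eval definition would be (an assumption on my side – the actual Hummingbird implementation may differ):

#include <cstddef>

template<size_t A, size_t B>
struct Max_eval
{
    static const size_t value = A > B ? A : B;
};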

I use the current capacity and the InlineCapacity constant to decide how to interpret m_DataStorage. If CurrentCapacity <= InlineCapacity, the memory is “inline” and is interpreted as an array; otherwise the bytes are interpreted as a T*.
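
A condensed sketch of that dual interpretation (with assumed member names, outside the real dyn_array):

#include <cstddef>

template<typename T, size_t InlineCapacity>
struct inline_storage_sketch
{
    // big enough for InlineCapacity objects or one T*, aligned for both
    alignas(alignof(T) > alignof(T*) ? alignof(T) : alignof(T*))
    char m_DataStorage[sizeof(T) * InlineCapacity > sizeof(T*)
                       ? sizeof(T) * InlineCapacity : sizeof(T*)] = {0};
    size_t m_CurrentCapacity = InlineCapacity;

    T* data()
    {
        return m_CurrentCapacity <= InlineCapacity
            ? reinterpret_cast<T*>(m_DataStorage)     // inline array
            : *reinterpret_cast<T**>(m_DataStorage);  // stored heap pointer
    }
};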

When looking at the potential usages in our code, the first target I had in mind was the CSS declarations for animations (Hummingbird is an HTML rendering engine – like a web browser). Users can have as many animation properties within a declaration as they want – so the values have to be dynamic – but they very rarely exceed 1-4. Some declaration values are also enums, which we limit to 1 byte; this means that on 32-bit targets we can squeeze 4 values within a pointer, and 8 on 64-bit – more than enough in 90+% of the cases.

I wrote a small template meta-function that simplifies setting the inline size when you want it to fit exactly in one pointer. When the objects you are going to have in the dyn_array are smaller than a pointer, they are essentially free memory-wise.

The meta-function looks like this:

template<typename T>
struct FitInPointer
{
   static const auto value = sizeof(T*) / sizeof(T); // objects per pointer
};

and is used like this:

dyn_array<Value, CSSAllocator, csl::FitInPointer<Value>::value> Values;

So far I’ve applied dyn_array to all animation CSS properties, and the results have been more than encouraging – a 20% reduction in dynamic allocations in tests and a 10% reduction in page load time.

I’m now looking at other places where the optimization will have good effects.

Future direction

The next step will be to implement a debugging mechanism that detects vectors that are often re-allocated in test scenarios. They will be analyzed for opportunities to use inline sizes. The LLVM folks implemented this to great effect. It’ll be a piece of cake for us, as adding the required debug code to dyn_array is far easier than modifying an STL vector implementation.

Rendering HTML at 1000 FPS – Part 2

Squeezing the GPU

The post is also available in the official LensVR blog.

This is the second blog post in the series in which I talk about the LensVR rendering engine.

In the first post, I discussed the high-level architecture of the LensVR/Hummingbird rendering. In this post I will get into the specifics of our implementation – how we use the GPU for all drawing. I will also share data from a performance comparison I did with Chrome and Servo Nightly.

Rendering to the GPU

After we have our list of high-level rendering commands, we have to execute them on the GPU. Renoir transforms these commands into low-level graphics API calls (OpenGL, DirectX etc.). For instance, a “FillRectangle” high-level command will become a series of vertex/index buffer and shader settings, followed by draw calls.

Renoir “sees” all commands that will happen in the frame and can do high-level decisions to optimize for three important constraints:

  • Minimize GPU memory usage
  • Minimize synchronization with the GPU
  • Minimize rendering API calls

When drawing a web page, certain effects require the use of intermediate render targets. Renoir will group as many of those effects as possible in the same target, to minimize the memory used and to reduce render-target switches on the GPU, which are fairly slow operations. It’ll also aggressively cache textures and targets and try to re-use them, to avoid continually creating/destroying resources, which is quite slow.

The rendering API commands are immediately sent to the GPU on the rendering thread; there is no “intermediate” command list, as opposed to Chrome, where a dedicated GPU process is in charge of interacting with the GPU. The “path-to-pixel” is significantly shorter in LensVR compared to all other web browsers, with a lot fewer abstraction layers in-between, which is one of the keys to the speed it achieves.

[Figure: rendering in cohtml]

GPU execution

The GPU works as a consumer of commands and memory buffers generated by the CPU; it can complete its work several frames after that work has been submitted. So what happens when the CPU tries to modify some memory (say a vertex or index buffer) that the GPU hasn’t processed yet?

Graphics drivers keep track of these situations, called “hazards”, and either stall the CPU until the resource has been consumed or do a process called “renaming” – basically cloning the resource under the hood and letting the CPU modify the fresh copy. Most of the time the driver will do renaming, but if excessive memory is used, it can also stall.

Both possibilities are not great. Resource renaming increases the CPU time required to complete the API calls because of the bookkeeping involved, while a stall will almost certainly introduce serious jank in the page. Newer graphics APIs such as Metal, Vulkan and DirectX 12 let the developer keep track of and avoid hazards. Renoir was likewise designed to manually track the usage of its resources to prevent renaming and stalls, so it fits the architecture of the new modern APIs perfectly. Renoir has native rendering backend implementations for all major graphics APIs and uses the best one for the platform it is running on – for instance, on Windows it directly uses DirectX 11. In comparison, Chrome has to go through an additional abstraction library called ANGLE, which generates DirectX API calls from OpenGL ones – Chrome (which uses Skia) only understands OpenGL at the time of this post.

Command Batching

Renoir tries very hard to reduce the number of rendering API calls through a process called “batching” – combining multiple drawn elements in one draw call.

Most elements in Renoir can be drawn with one shader, which makes them much easier to batch together. A classic way of doing batching in games is combining opaque elements together and relying on the depth buffer to draw them correctly in the final image.

Unfortunately, this is much less effective in modern web pages. A lot of elements have transparency or blending effects and they need to be applied in the z-order of the page, otherwise the final picture will be wrong.

Renoir keeps track of how elements are positioned in the page, and if they don’t intersect it batches them together – in that case the z-order no longer breaks batching. The final result is a substantial reduction of draw calls. Renoir also pre-compiles all the required shaders in the library, which significantly improves “first-use” performance. Other 2D libraries like Skia rely on run-time shader compilation, which can be very slow on mobile on first use (in the seconds) and introduce stalls.
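
A hedged sketch of that batching rule (my own simplification, not Renoir’s actual code): walk the items in z-order and merge an item into the current batch only if it doesn’t overlap anything already in it.

#include <vector>

struct Rect { float x0, y0, x1, y1; };

static bool intersects(const Rect& a, const Rect& b) {
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

struct DrawItem { Rect bounds; /* shader, geometry, ... */ };
struct Batch    { std::vector<const DrawItem*> items; };

// Items arrive in z-order; non-intersecting items can share a draw call
// without changing the final image.
std::vector<Batch> batch_in_z_order(const std::vector<DrawItem>& items) {
    std::vector<Batch> batches;
    for (const DrawItem& item : items) {
        bool merged = false;
        if (!batches.empty()) {
            Batch& last = batches.back();
            merged = true;
            for (const DrawItem* prev : last.items) {
                if (intersects(prev->bounds, item.bounds)) {
                    merged = false;         // overlap: z-order must hold
                    break;
                }
            }
            if (merged)
                last.items.push_back(&item);
        }
        if (!merged)
            batches.push_back({{&item}});   // start a new batch
    }
    return batches;
}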

Results & Comparisons

For a performance comparison I took a page that is used as an example in the WebRender post. I made a small modification, substituting the gradient effect with an “opacity” effect, which is more common and is a good stress test for every web rendering library. I also changed the layout to flex-box, because it’s very common in modern web design. Here is how it looks:

[Figure: animation of the test page]

Link to page here.

All tests were performed on Windows 10 on a i7-6700HQ @ 2.6GHz, 16GB RAM, NVIDIA GeForce GTX 960M, and on 1080p. I measured only the rendering part in the browsers, using Chrome 61.0.3163.100 (stable) with GPU raster ON, Servo nightly from 14 Oct 2017, and LensVR alpha 0.6.

Chrome version 61.0.3163.100 results

The page definitely takes a toll on Chrome’s rendering – it struggles to maintain 60 FPS, but is significantly faster than in the video. The reasons are probably additional improvements in their code, and the fact that the laptop I’m testing on is significantly more powerful than the machine used in the original Mozilla video.

Let’s look at the performance chart:

[Figure: Chrome performance chart]

I’m not sure why, but the rasterization always happens on one thread. Both raster and GPU tasks are quite heavy and are the bottleneck in the page – they dominate the time needed to finish one frame.

On average I get ~5.3ms for “painting” tasks on the main thread with large spikes of 10+ms, ~20ms on raster tasks and ~20ms on the GPU process. Raster and GPU tasks seem to “flow” between frames and dominate the frame time.

Servo nightly (14 Oct 2017) results

Servo fares significantly better rendering-wise; unfortunately there are some visual artifacts. I think it’s Servo’s Windows port, which is still a bit shaky.

You can notice that Servo doesn’t achieve 60 FPS either, but that seems to be due to the flex-box layout; we ignore that and look only at the rendering. The rendering part is measured as “Backend CPU time” by WebRender, at ~6.36ms.

[Figure: Servo WebRender profiler output]

LensVR alpha 0.6

[Figure: LensVR profiling capture]

Here is one performance frame zoomed in inside Chrome’s profiling UI, which LensVR uses for its profiling as well.

The rendering-related tasks are the “Paint” one on top, which interprets the Renoir commands, performs batching and executes the graphics API calls, and the “RecordRendering” one on the far right, which actually walks the DOM elements and generates Renoir commands.

The sum of both on average is ~2.6ms.

Summary

The following graphic shows the “linearized” time for all rendering-related work in a browser. While parallelism shortens the time-to-frame, the overall linear time is a good indicator of battery life impact.

[Figure: linearized rendering-time comparison]

Both WebRender and Renoir, with their novel approaches to rendering, have a clear advantage. LensVR is faster than WebRender, probably because of better command generation and API interaction code. I plan to do a deeper analysis in a follow-up post.