Rendering HTML at 1000 FPS – Part 2

Squeezing the GPU

This post is also available on the official LensVR blog.

This is the second post in the series about the LensVR rendering engine.

In the first post, I discussed the high-level architecture of the LensVR/Hummingbird rendering. In this post I will get into the specifics of our implementation – how we use the GPU for all drawing. I will also share data from a performance comparison I did with Chrome and Servo Nightly.

Rendering to the GPU

After we have our list of high-level rendering commands, we have to execute them on the GPU. Renoir transforms these commands into low-level graphics API calls (OpenGL, DirectX, etc.). For instance, a “FillRectangle” high-level command becomes a series of operations: setting vertex/index buffers, binding shaders, and issuing draw calls.
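To make that expansion concrete, here is a minimal sketch of what executing such a command could look like; all types and method names below (FillRectangleCmd, GraphicsDevice, etc.) are illustrative stand-ins, not Renoir’s actual API:

```cpp
#include <cstdint>

// Illustrative stand-ins, not Renoir's actual API.
struct Color { float r, g, b, a; };

struct FillRectangleCmd {
    uint32_t vertexBuffer; // handle to the rectangle's vertex buffer
    uint32_t indexBuffer;  // handle to the rectangle's index buffer
    uint32_t fillShader;   // handle to a solid-fill shader program
    Color    color;
};

// Thin wrapper over a low-level API such as OpenGL or DirectX.
struct GraphicsDevice {
    void SetVertexBuffer(uint32_t handle);
    void SetIndexBuffer(uint32_t handle);
    void SetShaderProgram(uint32_t handle);
    void SetUniform(const char* name, const Color& value);
    void DrawIndexed(uint32_t indexCount, uint32_t startIndex);
};

// One high-level command fans out into several low-level API calls.
void ExecuteFillRectangle(const FillRectangleCmd& cmd, GraphicsDevice& device)
{
    device.SetVertexBuffer(cmd.vertexBuffer);
    device.SetIndexBuffer(cmd.indexBuffer);
    device.SetShaderProgram(cmd.fillShader);
    device.SetUniform("u_Color", cmd.color);
    device.DrawIndexed(6, 0); // 6 indices = 2 triangles = 1 rectangle
}
```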

Renoir “sees” all commands that will happen in the frame and can make high-level decisions to optimize for three important constraints:

  • Minimize GPU memory usage
  • Minimize synchronization with the GPU
  • Minimize rendering API calls

When drawing a web page, certain effects require the use of intermediate render targets. Renoir will group as many of those effects as possible in the same target, both to minimize the memory used and to reduce render-target switches on the GPU, which are fairly slow operations. It will also aggressively cache textures and targets and try to re-use them, to avoid continually creating and destroying resources, which is quite slow.
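A minimal sketch of the caching/re-use idea, assuming hypothetical RenderTargetPool/RenderTarget types rather than Renoir’s real internals:

```cpp
#include <cstdint>
#include <deque>

// Hypothetical render-target pool; names and layout are illustrative.
struct RenderTarget { uint32_t gpuHandle; uint32_t width, height; bool inUse; };

class RenderTargetPool {
public:
    // Prefer re-using a free target of the right size over creating a new
    // one, since GPU resource creation/destruction is slow.
    RenderTarget* Acquire(uint32_t width, uint32_t height) {
        for (RenderTarget& rt : m_Targets) {
            if (!rt.inUse && rt.width == width && rt.height == height) {
                rt.inUse = true;
                return &rt;
            }
        }
        m_Targets.push_back({CreateGPUTarget(width, height), width, height, true});
        return &m_Targets.back();
    }

    // Targets are released back to the pool, not destroyed, so later
    // frames can pick them up again.
    void Release(RenderTarget* rt) { rt->inUse = false; }

private:
    uint32_t CreateGPUTarget(uint32_t w, uint32_t h); // the actual (slow) GPU allocation
    std::deque<RenderTarget> m_Targets;               // references stay valid on growth
};
```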

The rendering API commands are immediately sent to the GPU on the rendering thread; there is no intermediate command list, unlike Chrome, where a dedicated GPU process is in charge of interacting with the GPU. The “path-to-pixel” is significantly shorter in LensVR than in other web browsers, with far fewer abstraction layers in-between, which is one of the keys to its speed.

[Figure: Rendering pipeline in cohtml]

GPU execution

The GPU works as a consumer of commands and memory buffers generated by the CPU, and it can complete its work several frames after that work has been submitted. So what happens when the CPU tries to modify some memory (say, a vertex or index buffer) that the GPU hasn’t processed yet?

Graphics drivers keep track of these situations, called “hazards”, and either stall the CPU until the resource has been consumed or perform “renaming” – cloning the resource under the hood and letting the CPU modify the fresh copy. Most of the time the driver will do renaming, but if excessive memory is used, it can also stall.

Both possibilities are not great. Resource renaming increases the CPU time required to complete the API calls because of the bookkeeping involved, while a stall will almost certainly introduce serious jank in the page. Newer graphics APIs such as Metal, Vulkan and DirectX 12 let the developer track and avoid hazards manually. Renoir was likewise designed to manually track the usage of its resources to prevent renaming and stalls, so it fits the architecture of the new modern APIs perfectly. Renoir has native rendering backend implementations for all major graphics APIs and uses the best one for the platform it is running on – for instance, on Windows it directly uses DirectX 11. In comparison, Chrome has to go through an additional abstraction library called ANGLE, which generates DirectX API calls from OpenGL ones – Chrome (which uses Skia) only understands OpenGL as of the time of this post.
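To illustrate the idea of manual hazard tracking, here is a rough sketch of one common approach – a ring of per-frame buffers guarded by fences. The Fence/Buffer types and the frames-in-flight count are assumptions for the sketch, not Renoir’s actual implementation:

```cpp
#include <cstdint>

// Sketch of manual hazard avoidance via a ring of per-frame buffers,
// in the spirit of what Vulkan/DX12/Metal encourage. Illustrative types.
constexpr int kFramesInFlight = 3;

struct Fence  { bool IsSignaled() const; void Wait() const; };
struct Buffer { void* Map(); void Unmap(); };

struct PerFrame {
    Buffer vertexBuffer; // the CPU writes here while the GPU reads older copies
    Fence  gpuDone;      // signaled when the GPU has consumed this frame's data
};

PerFrame g_Frames[kFramesInFlight];
uint64_t g_FrameIndex = 0;

void* BeginFrameVertexData()
{
    PerFrame& frame = g_Frames[g_FrameIndex % kFramesInFlight];
    // Instead of letting the driver rename or stall unpredictably, we wait
    // only if the GPU is still using the buffer from kFramesInFlight ago.
    if (!frame.gpuDone.IsSignaled())
        frame.gpuDone.Wait();
    return frame.vertexBuffer.Map(); // now safe to overwrite: no hazard
}
```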

Command Batching

Renoir tries very hard to reduce the number of rendering API calls through a process called “batching” – combining multiple drawn elements into one draw call.

Most elements in Renoir can be drawn with one shader, which makes them much easier to batch together. A classic way of doing batching in games is combining opaque elements together and relying on the depth buffer to draw them correctly in the final image.

Unfortunately, this is much less effective in modern web pages. A lot of elements have transparency or blending effects and they need to be applied in the z-order of the page, otherwise the final picture will be wrong.

Renoir keeps track of how elements are positioned in the page; if they don’t intersect, it batches them together, and in that case the z-order no longer breaks batching. The final result is a substantial reduction in draw calls. Renoir also pre-compiles all the required shaders in the library, which significantly improves “first-use” performance. Other 2D libraries like Skia rely on run-time shader compilation, which can be very slow on mobile (in the seconds on first use) and introduce stalls.
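As a rough illustration of that non-intersection test (the types and structure here are mine, not Renoir’s), consecutive z-ordered elements that share a shader and don’t overlap can be merged into one batch:

```cpp
#include <vector>

// Illustrative sketch of z-order-aware batching, not Renoir's internals.
struct Rect { float x, y, w, h; };

bool Intersects(const Rect& a, const Rect& b)
{
    return a.x < b.x + b.w && b.x < a.x + a.w &&
           a.y < b.y + b.h && b.y < a.y + a.h;
}

struct Element { Rect bounds; int shaderId; };

// Returns how many leading elements of the z-ordered list fit in one batch.
size_t BuildBatch(const std::vector<Element>& zOrdered)
{
    if (zOrdered.empty())
        return 0;
    size_t count = 1;
    for (size_t i = 1; i < zOrdered.size(); ++i) {
        if (zOrdered[i].shaderId != zOrdered[0].shaderId)
            break; // a different shader always ends the batch
        bool overlaps = false;
        for (size_t j = 0; j < i && !overlaps; ++j)
            overlaps = Intersects(zOrdered[i].bounds, zOrdered[j].bounds);
        if (overlaps)
            break; // overlapping elements must be drawn in z-order
        ++count;   // disjoint + same shader: safe to merge into the batch
    }
    return count;
}
```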

Results & Comparisons

For the performance comparison I took a page that is used as an example in the WebRender post. I made a small modification, substituting the gradient effect with an “opacity” effect, which is more common and is a good stress test for every web rendering library. I also changed the layout to flex-box, because it’s very common in modern web design. Here is how it looks:

[Animation: the test page]

Link to page here.

All tests were performed on Windows 10 on an i7-6700HQ @ 2.6GHz, 16GB RAM, NVIDIA GeForce GTX 960M, at 1080p. I measured only the rendering part in each browser, using Chrome 61.0.3163.100 (stable) with GPU raster ON, a Servo nightly from 14 Oct 2017, and LensVR alpha 0.6.

Chrome version 61.0.3163.100 results

The page definitely takes a toll on Chrome’s rendering – it struggles to maintain 60 FPS, though it is significantly faster than in the video from the WebRender post. The reasons are probably additional improvements in Chrome’s code and the fact that the laptop I’m testing on is significantly more powerful than the machine used in the original Mozilla video.

Let’s look at the performance chart:

[Figure: Chrome performance timeline]

I’m not sure why, but rasterization always happens on one thread. Both raster and GPU tasks are quite heavy and are the bottleneck in the page – they dominate the time needed to finish one frame.

On average I get ~5.3ms for “painting” tasks on the main thread with large spikes of 10+ms, ~20ms on raster tasks, and ~20ms on the GPU process. Raster and GPU tasks seem to “flow” between frames and dominate the frame time.

Servo nightly (14 Oct 2017) results

Servo fares significantly better rendering-wise; unfortunately there are some visual artifacts. I think Servo’s Windows port is still a bit shaky.

You can notice that Servo doesn’t achieve 60 FPS either, but that seems to be due to the flex-box layout; we ignore that here and look only at the rendering. The rendering part is measured as “Backend CPU time” by WebRender at ~6.36ms.

[Figure: Servo GPU profiling]

LensVR alpha 0.6

[Figure: LensVR rendering profile]

Here is one performance frame zoomed inside Chrome’s profiling UI, which LensVR uses for its profiling as well.

The rendering-related tasks are the “Paint” one on top, which interprets the Renoir commands, performs batching, and executes the graphics API calls, and the “RecordRendering” one on the far right, which walks the DOM elements and generates Renoir commands.

The sum of both on average is ~2.6ms.

Summary

The following chart shows the “linearized” time for all rendering-related work in a browser. While parallelism will shorten the time-to-frame, the overall linear time is a good indicator of battery-life impact.

[Chart: linearized rendering-related time per browser]

Both WebRender and Renoir, with their novel approaches to rendering, have a clear advantage. LensVR is faster than WebRender, probably because of better command generation and API interaction code. I plan to do a deeper analysis in a follow-up post.

 


Rendering HTML at 1000 FPS – Part 1

This was originally posted on the LensVR blog.

Part 2 is also available here.

Web page rendering is one of the most interesting and active development areas in computer graphics, with multiple approaches, each with its pros and cons. In this post I’ll go into detail about how we do HTML rendering in Coherent Labs’ Hummingbird and the LensVR browser, and how it compares to Chrome and Mozilla’s WebRender.

I’ll split the post in two parts: this first one is dedicated to the high-level architecture and how we decide what to render. The second part – “Squeezing the GPU” – will be about how these decisions get implemented to use the GPU for all drawing, and will give some performance results I measured.

The renderer described is still experimental for general web pages, but is very successfully deployed in game user interfaces across PC, mobile and consoles. The constraints of these platforms led us to somewhat different design decisions compared to what the folks at Chrome and Mozilla do. Now we are applying this approach to more web pages in LensVR, and feedback is most welcome.

Recently, in an awesome post about Mozilla’s WebRender, Lin Clark explained not only how WebRender works but also gave a great overview of the way rendering happens in most web browsers. I advise everybody who is interested in how browsers work to read her post.

To quickly recap, I’ll concentrate on what we internally call rendering of the web page. After the layout engine has positioned all DOM elements on the page and their styles have been calculated, we have to generate the image that the user will actually see.

[Figure: From DOM elements to the rendered image]

The rendering is implemented through the Renoir library. Renoir is a 2D rendering library that has all the features required to draw HTML5 content. It is conceptually similar in its role to Mozilla WebRender and Skia (used in Chrome and Firefox before Quantum).

When designing Renoir, performance was our primary goal and we built it around three major paradigms:

  • All rendering on the GPU
  • Parallelism
  • Data-oriented C++ design

We didn’t have all the burden of years of older implementations and could be very bold in the way we do things to achieve our performance goals.

High-level rendering architecture

Most web browsers split the rendering into two parts – painting and compositing. The whole page is split into “layers”. A layer is initiated by a DOM element (strictly, a stacking context) that has certain styles. The rules differ between implementations, but usually elements with 3D transforms, opacity < 1, etc. become layers. You can think of a layer as a texture (an image) that contains part of the web page content.

The layers are individually “painted” by either the GPU or the CPU. The painting fills in the text, images, effects and so on. After all the needed layers are painted, the whole scene is composited: the layers are positioned and the GPU draws them into the final render target, which is displayed to the user.

Layers were introduced both as a convenience feature to simplify some effects and as a performance optimization. Often elements move around, but their content doesn’t change, so the browser can skip re-painting a layer whose content is static.

You can see the layers that Chrome produces by enabling Rendering -> Layer Borders in DevTools.

[Screenshot: layer borders in Chrome]

Unfortunately, layers also have severe downsides:

  • The implementation of composition is very complex and requires significant computational effort to keep correct. When an element is promoted to a “layer”, the browser has to do a lot of calculations and track which other elements it intersects in order to preserve the proper draw order of elements. Otherwise you risk having elements that don’t properly respect the z-index when rendering.
  • Layers consume huge amounts of GPU memory. When you have multiple elements that are layers one on top of the other, you store multiple pixels for each “final” pixel that the user will see. The problem is especially bad in 4K, where a single full-screen buffer is already ~32 MB (3840 × 2160 pixels × 4 bytes per pixel). Some browsers try to reduce the number of layers by “squashing” them, at the expense of even more complex calculations.

We decided pretty early that layers were not something we wanted in LensVR – we wanted to conserve memory. This proved a big win, as it significantly simplifies the painting code and eliminates the “composing” step entirely.

Mozilla’s WebRender (used in Servo and Firefox Quantum) has a similar design decision – it also has only one conceptual drawing step and no composing. Every other major OS browser uses layers as of the time of this post.

The risk of having no layers is slower frames when large parts of the screen have to be re-painted.

Fortunately GPUs are extremely fast at doing just that. All rendering in Renoir happens exclusively on the GPU. The amount of rendering work that a web page generates is far below what a modern PC or mobile GPU can rasterize. The bottleneck in most web browsers is actually on the CPU side – the generation of commands for the GPU to execute.

Web pages tend to generate a lot of draw calls – done naively, you end up with hundreds of calls per frame, one for each text run, image, effect and so on. The performance can be especially disastrous on mobile, where draw calls are quite expensive.

Renoir implements numerous techniques to reduce the draw call count.

Dirty rectangle tracking

When the page changes due to an animation or another interactive event, usually only a small part actually changes visually. We keep a collection of “dirty” rectangles where the page has potentially changed and which have to be re-drawn. Most OS browsers implement some degree of dirty-rectangle tracking. Notably, Mozilla’s WebRender differs – it re-draws the whole page each frame.

My profiling on typical workloads shows that re-drawing only parts of the screen is still a big win both on the CPU and GPU side, even though more bookkeeping has to be done. The rationale is pretty simple: you do less work compared to re-drawing everything. The important part is keeping the dirty-rect tracking quick. Elements that have to be drawn are culled against the current dirty rects, and anything that doesn’t intersect is thrown out, as sketched below.
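A minimal sketch of that culling step, with illustrative types rather than our actual code:

```cpp
#include <vector>

// Illustrative sketch of dirty-rect culling.
struct Rect { float x, y, w, h; };

bool Intersects(const Rect& a, const Rect& b)
{
    return a.x < b.x + b.w && b.x < a.x + a.w &&
           a.y < b.y + b.h && b.y < a.y + a.h;
}

struct Element { Rect bounds; /* plus whatever is needed to draw it */ };

void Draw(const Element& element); // records the element's draw commands

void RedrawDirtyParts(const std::vector<Element>& pageElements,
                      const std::vector<Rect>& dirtyRects)
{
    for (const Element& element : pageElements) {
        for (const Rect& dirty : dirtyRects) {
            if (Intersects(element.bounds, dirty)) {
                Draw(element); // may have changed visually, so re-draw it
                break;         // draw once, even if it spans several rects
            }
        }
        // Elements intersecting no dirty rect are skipped entirely.
    }
}
```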

In Hummingbird we work as part of a game engine, so we strive for sub-millisecond draw times – far less than the 16.6ms per frame a general browser has to hit – so dirty rects are hugely important. They are a big win for LensVR as well, because we can finish our work quickly and put the CPU core back to sleep on mobile, which saves battery life!

In the screenshot below, only the highlighted rectangle will be re-drawn in Hummingbird/LensVR. A similar visualization is also available in Chrome under Rendering->Paint flashing.

[Screenshot: dirty rectangle visualization]

Rendering commands generation

From the styled and laid-out page we generate a “command buffer” – a list of high-level rendering commands that will later be transformed into low-level graphics API commands. The command buffer generation is kept very simple: the buffer is a linear area of memory, and there are no dynamic allocations or OOP paradigms. All logical objects like images, effects, etc. are simple handles (a number). Command generation happens in all browsers, and it is an area of continuous improvement.
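Here is a small sketch of what such a linear, POD-only command buffer could look like; the command layout and names are illustrative, not Renoir’s actual format:

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Illustrative sketch of a linear command buffer.
using ImageHandle = uint32_t; // logical objects are plain numeric handles

enum class CmdType : uint32_t { FillRectangle, DrawImage };

struct FillRectangleCmd { CmdType type; float x, y, w, h; uint32_t rgba; };
struct DrawImageCmd     { CmdType type; float x, y; ImageHandle image; };

class CommandBuffer {
public:
    // POD commands are copied byte-for-byte into one linear allocation:
    // no per-command heap allocations, no virtual dispatch.
    template <typename Cmd>
    void Push(const Cmd& cmd)
    {
        static_assert(std::is_trivially_copyable_v<Cmd>,
                      "commands must be POD");
        const size_t offset = m_Bytes.size();
        m_Bytes.resize(offset + sizeof(Cmd));
        std::memcpy(m_Bytes.data() + offset, &cmd, sizeof(Cmd));
    }

    const uint8_t* Data() const { return m_Bytes.data(); }
    size_t Size() const { return m_Bytes.size(); }

private:
    std::vector<uint8_t> m_Bytes;
};
```

Playback can then walk the buffer front to back, reading each command’s type field to know how many bytes to consume before moving on – cache-friendly and free of pointer chasing.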

We kept Renoir a “thin” library; this is different from the approach taken by the Skia 2D rendering library used in Chrome and in Firefox before Quantum. Skia is a very object-oriented library with complex object lifetimes, interactions and numerous levels of abstraction. We wanted to keep Renoir very lean, which helped us a lot during the optimization phases. Chromium’s “slimming paint” effort is a way to reduce the abstractions and speed up the “command generation” step.

Parallelism

All command generation and later rendering happen in parallel with the other operations in the web page, like DOM manipulation and JavaScript execution. Other browsers also try to move work off the main thread by parallelizing composition and rasterization. LensVR/Hummingbird go a step further with their task-based architecture, which significantly overlaps computations and uses more CPU cores to finish the rendering faster. Most threads are not “locked” into doing only one specific job, but can do whatever is needed at the moment to get the frame ready as fast as possible. Still, we’re looking to improve this area further, as I see possibilities for even better hardware utilization.

In the next post

In part 2 I’ll explain how we utilize the GPU and share some performance comparisons I did between Renoir, Chrome’s rendering and WebRender in Servo. Stay tuned!

Compile-time check on task parameters (part 1)

The Hummingbird HTML rendering engine’s architecture is based on tasks (also known as jobs). Units of work are dispatched to a task scheduler, which runs them on different threads to maximize the resource usage of the hardware.

Multi-threaded programming, however, is notoriously difficult because of the risk of race conditions. An awesome overview of how to utilize the hardware is given in the following blog series.

Hummingbird’s task system is very versatile. It allows developers to schedule tasks whose body is a C++ lambda and to put them in different queues for processing. From the system’s inception we decided to maximize the developer’s freedom when creating and using tasks. This freedom, however, can be dangerous when object lifetimes and potential data races have to be taken into account.

Tasks can conceptually (although they are the same in code) be divided into two groups:

  • “Simple” data-oriented tasks that execute some data transformation on an independent work group and have no side effects.
  • High-level tasks that might arise from the interaction of multiple systems and objects with possibly complex lifetimes.

I definitely try to have as much work as possible in the first type of task. A good example of such a task is the layout of a group of elements: the input data are POD structs with the styles needed for the element layout, and the output are the positions of each element on the page.

High-level tasks involve interacting with object lifetimes; some of the objects may be reference counted, and usually they can be accessed only on certain threads.

A particularly nasty class of errors arises when reference-counted objects are destroyed at unexpected moments or on unexpected threads.

In Hummingbird we try to encourage “simple” tasks and to make higher-level ones hard to accidentally misuse, so we introduced a compile-time checking system for task parameters. Classes can be marked to allow/disallow using their instances as task parameters (a sketch of the marking mechanism follows the list). There are four ways for an object to be passed as a parameter to a task:

  • Passing an object by value. This is always OK in our system. The object passed will be a copy private to the task, so it shouldn’t involve changing global or shared state. The object can still contain pointers as members or modify global state in a method, but this is better caught in design or code reviews and has never caused errors in our code so far.
  • Passing an object by pointer. This is generally NOT OK. By default the system will not allow passing pointers unless they are explicitly marked by the programmer. Passing naked pointers to tasks is the source of most errors: the lifetime of the pointed-to object is usually not tied to the task and often not well defined, and there is a chance the object will be accessed concurrently.
  • Passing by shared pointer. DOM objects often have shared ownership due to the design of the standard DOM and its interactions with JavaScript. To pass a shared pointer to a certain type, the developer has to explicitly mark that type as OK.
  • Passing weak pointers. In the beginning we implicitly allowed this usage, but recently we made weak pointers require explicit marking as well.
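A minimal sketch of how such compile-time marking can be built with type traits; the trait and macro names here are assumptions, not Hummingbird’s actual ones:

```cpp
#include <type_traits>

// Illustrative sketch of compile-time task-parameter marking.
template <typename T>
struct AllowedAsTaskParamByPointer : std::false_type {};

// Explicit, reviewed opt-in for passing T* to tasks.
#define ALLOW_TASK_PARAM_BY_POINTER(Type) \
    template <> struct AllowedAsTaskParamByPointer<Type> : std::true_type {};

template <typename T>
constexpr void ValidateTaskParam()
{
    if constexpr (std::is_pointer_v<T>) {
        // Naked pointers are rejected unless the pointee is explicitly marked.
        static_assert(
            AllowedAsTaskParamByPointer<std::remove_pointer_t<T>>::value,
            "Passing this type to a task by pointer requires explicit marking");
    }
    // Non-pointer (by-value) parameters are always accepted.
}

struct LayoutInput { float width, height; }; // POD, fine by value
struct Document { /* complex lifetime, reference counted... */ };
ALLOW_TASK_PARAM_BY_POINTER(Document)

void Example()
{
    ValidateTaskParam<LayoutInput>(); // OK: passed by value
    ValidateTaskParam<Document*>();   // OK: explicitly marked
    // ValidateTaskParam<int*>();     // compile error: int is not marked
}
```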

Explicitly marking which classes are OK to be passed between tasks has several benefits:

  • Forces the programmer to think twice about the lifetime and concurrent use of data.
  • Helps in code reviews by signaling to reviewers that they should carefully inspect the usage of certain objects.
  • Implies a contract for how the data will be used and self-documents the code.
  • Avoids inadvertently passing wrong objects to tasks, which can happen due to the easy syntax provided by lambdas.

We have also added a mechanism to disallow passing a class to tasks altogether, even by value.

The implementation of the system is based on templates and some macros to make the syntax easy.

Creating a task is done in the following way:
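A minimal, hypothetical sketch – apart from the TASK_PARAMS macro, every name here (g_Scheduler, EnqueueTask, Document, LayoutInput, PerformLayout) is an illustrative assumption:

```cpp
// Hypothetical usage sketch; only TASK_PARAMS is Hummingbird's real name.
Document* doc = GetCurrentDocument();
LayoutInput input = GatherLayoutInput(*doc);

g_Scheduler.EnqueueTask(TASK_PARAMS(doc, input),
    [](Document* doc, LayoutInput input) {
        // Runs on a worker thread. Each parameter went through the
        // compile-time check: 'input' is a by-value POD (always OK);
        // 'doc' is a pointer, so Document must be explicitly marked.
        PerformLayout(*doc, input);
    });
```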

The key here is the TASK_PARAMS macro, which validates each parameter. In the next blog post I’ll go into detail about how the task validation mechanism is implemented.

W3C WebVR workshop 2016

On October 19 and 20 I had the amazing opportunity to take part in the W3C WebVR Workshop in San Jose. The event affirms VR as a major direction for the future of the web.

At Coherent Labs, we create HTML renderers so it was great to meet so many other browser developers that work on the same problems we do and share a similar mindset.

The first day, the event was scheduled to start at 8:30. My colleague George and I were staying in San Francisco, so this meant an early wake-up at 5 o’clock. After a succession of trolley bus, Caltrain and Uber rides, we arrived at the Samsung offices in San Jose. The content more than paid off the early wake-up and the travel.

I take the opportunity to thank Samsung for hosting the event and organizing it flawlessly with the W3C folks.

The event was packed with browser and VR developers, including people from Mozilla, the Chrome team, Oculus, Samsung, Valve, the Edge team and many others. Everybody was really open to sharing ideas. Browser developers shared the current state of VR support in their products, along with their short-term release plans.

The workshop was divided across two days, with the highlights being an opening and a closing keynote, many “lightning” (5-minute) talks, and “breakout sessions” on the second day, in which different groups discussed ideas in many areas of VR integration with the web.

Currently most VR-related work in browsers centers on the WebVR standard. It is an extension over WebGL that allows interfacing with HMDs and rendering to them via WebGL. This approach opens many opportunities for content creation, as developers don’t need to learn a game engine – they can use their familiar JavaScript and web development skills. The web also simplifies content distribution: users just have to navigate to a web page and be immersed.

Of the lightning talks, I was most interested in the one by Josh Carpenter from the Chrome team and the one by Justin Rogers from Oculus. Josh shared his vision of the future of the web in VR, while Justin focused more on technical aspects and performance – many of the problems are similar to the ones we tackle in Coherent Labs’ products.

In the “breakout” sessions, I attended a very interesting one initiated by Tony Parisi (now Head of AR/VR at Unity) on an eventual future declarative 3D standard for the web. The idea was met with a lot of enthusiasm, although people had different reasons to believe it was important: half the attendees believed it would ease authoring, a smaller group stressed the importance of homogeneity between platforms, and ~15% believed performance is the most important reason. Coming from game development, I also voted for performance.

The workshop ended with a recap and the commitment to reconvene soon (at least by W3C metrics – in about a year). All the materials from the meeting are public and available here.

The discussions identified many areas to work on, and there are still many open questions. WebVR requires writing a completely new website. While this might be feasible for new sites, it doesn’t solve the need to meaningfully display the billions of pages that are currently on the web. The DOM is still completely separated from the VR world, so all the usual layout, styling, etc. are still off-limits in VR.

Performance is an open problem, especially on mobile, where battery drain and overheating can make the experience particularly unpleasant for the user. Current browsers do a lot of things besides just rendering the scene, and that can introduce stuttering and break immersion.

Last but not least, there is no solution for AR yet. Bringing in all the AR specifics – spatial interactions, real-world elements, etc. – will be a real challenge in the following years.

The discussions opened some interesting questions for me personally as well. Our Hummingbird HTML renderer is super fast on mobile; equipping it with a WebVR renderer is a way forward that will bring amazing benefits to VR users, who will be able to stay immersed for longer and with better visuals.

Hummingbird scales great on modern mobile platforms, so overheating and stuttering will not happen.

Overall the W3C workshop was an eye-opening experience for me. The lion’s share of the work however is still ahead of us as a community. I’m very happy to be part of it!