Rendering HTML at 1000 FPS – Part 1

This was originally posted on the LensVR blog.

Part 2 is also available here.

Web page rendering is one of the most interesting and active development areas in computer graphics. There are multiple approaches, each with its pros and cons. In this post I’ll go into detail about how we do HTML rendering in Coherent Labs’ Hummingbird and the LensVR browser, and how it compares to Chrome and Mozilla’s WebRender.

I’ll split the post in two parts: this first one is dedicated to the high-level architecture and how we decide what to render. The second part, “Squeezing the GPU”, will be about how these decisions get implemented to use the GPU for all drawing, and will give some performance results I measured.

The renderer described here is still experimental for general web pages, but has been deployed very successfully in game user interfaces across PC, mobile and consoles. The constraints of these platforms led us to somewhat different design decisions compared to what the folks at Chrome and Mozilla do. Now we are applying this approach to general web pages in LensVR, and feedback is most welcome.

Recently, in an awesome post about Mozilla’s WebRender, Lin Clark explained not only how WebRender works, but also gave a great overview of the way rendering happens in most web browsers. I advise everybody who is interested in how browsers work to read her post.

To quickly recap, I’ll concentrate on what we internally call rendering of the web page. After the Layout engine has positioned all DOM elements on the page and their styles have been calculated, we have to generate the image that the user will actually see.

DOMandRendering.jpg

The rendering is implemented through the Renoir library. Renoir is a 2D rendering library that has all the features required to draw HTML5 content. It is conceptually similar in its role to Mozilla’s WebRender and to Skia (used in Chrome, and in Firefox before Quantum).

When designing Renoir, performance was our primary goal and we built it around three major paradigms:

  • All rendering on the GPU
  • Parallelism
  • Data-oriented C++ design

We didn’t have the burden of years of older implementations and could be very bold in how we do things to achieve our performance goals.

High-level rendering architecture

Most web browsers split the rendering into two parts – painting and compositing. The whole page is split into “layers”. A layer is initiated by a DOM element (strictly, a stacking context) that has certain styles. The rules differ between implementations, but usually elements with 3D transforms, opacity < 1, etc. become layers. You can think of a layer as a texture (an image) that contains part of the web page’s content.

The layers are individually “painted” by either the GPU or the CPU. The painting fills in the text, images, effects and so on. After all the needed layers are painted, the whole scene is “composed”: the layers are positioned and the GPU draws them into the final render target, which is displayed to the user.

Layers were introduced both as a convenience feature to simplify some effects and as a performance optimization. Often elements move around, but their content doesn’t change, so the browser can skip re-painting a layer whose content is static.
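
To make the two phases concrete, here is a minimal, compilable C++ sketch of this paint-then-compose scheme, including the skip-repaint optimization just described. All the types and functions in it (Texture, Layer, Paint, Compose) are illustrative stand-ins, not the code of any real browser:

```cpp
#include <vector>

// Illustrative stand-ins; real browsers use GPU textures and a compositor.
struct Texture {};                     // off-screen pixel storage
struct Layer {
    Texture content;                   // painted content of this layer
    float   x = 0, y = 0;              // position applied at compose time
    float   opacity = 1.0f;
    bool    dirty = true;              // content changed since last frame?
};

void Paint(Layer& layer)   { /* rasterize text, images, effects */ layer.dirty = false; }
void Compose(const Layer&) { /* draw the layer texture into the final target */ }

// Classic two-phase pipeline: paint only what changed, then re-compose
// everything. A layer that merely moved skips the expensive paint phase.
void RenderFrame(std::vector<Layer>& layers) {
    for (Layer& layer : layers)
        if (layer.dirty) Paint(layer);
    for (const Layer& layer : layers)  // back-to-front order
        Compose(layer);
}
```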

You can see the layers that Chrome produces by enabling Rendering -> Layer Borders in DevTools.

Layers_chrome

Unfortunately, layers also have severe downsides:

  • The implementation of composition is very complex and requires significant computational effort to keep correct. When an element is promoted to a “layer”, the browser has to do a lot of calculations and track which other elements it intersects in order to preserve the proper draw order. Otherwise you risk elements that don’t properly respect the z-index when rendered.
  • Layers consume huge amounts of GPU memory. When you have multiple elements that are layers one on top of the other, you store multiple pixels for each “final” pixel that the user will see. The problem is especially bad at 4K, where a single full-screen RGBA buffer is 3840 × 2160 × 4 bytes ≈ 32 MB (see the sketch after this list). Some browsers try to reduce the number of layers by “squashing” them, at the expense of even more complex calculations.
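
To put a number on the memory cost, here is a small back-of-the-envelope sketch; the count of four overlapping layers is a hypothetical example, and real browsers size layer textures to the layer’s bounds rather than the full screen:

```cpp
#include <cstdio>

// Back-of-the-envelope GPU memory cost of stacked full-screen layers.
int main() {
    const double width = 3840, height = 2160;  // 4K resolution
    const double bytesPerPixel = 4;            // RGBA, 8 bits per channel
    const int    layerCount = 4;               // hypothetical overlap depth

    const double oneLayerMB = width * height * bytesPerPixel / (1024 * 1024);
    std::printf("one full-screen buffer: %.1f MB\n", oneLayerMB);  // ~31.6 MB
    std::printf("%d stacked layers:      %.1f MB\n",
                layerCount, layerCount * oneLayerMB);              // ~126.6 MB
}
```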

We decided pretty early on that layers were not something we wanted in LensVR – we wanted to conserve memory. This proved a big win, as it significantly simplifies the painting code, and there is no “composing” step at all.

Mozilla’s WebRender (used in Servo and Firefox Quantum) reflects a similar design decision – it also has a single conceptual drawing step and no composing. Every other major browser uses layers as of the time of this post.

The risk without layers is slower frames when large parts of the screen have to be re-painted.

Fortunately, GPUs are extremely fast at doing just that. All rendering in Renoir happens exclusively on the GPU. The amount of rendering work that a web page generates is far below what a modern PC or mobile GPU can rasterize. The bottleneck in most web browsers is actually on the CPU side – the generation of commands for the GPU to execute.

Web pages tend to generate a lot of draw calls – done naively, you end up with hundreds of calls per frame, one for each text run, image, effect and so on. The performance results can be especially disastrous on mobile, where draw calls are quite expensive.

Renoir implements numerous techniques to reduce the draw call count (a generic example of one such technique is sketched below).
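
One classic technique in this space is batching: consecutive primitives that share the same GPU state are merged into a single draw call. The sketch below is a generic illustration with made-up types, not Renoir’s actual batching code:

```cpp
#include <cstddef>
#include <vector>

// A quad to draw and the GPU state it needs (shader, texture, blend mode),
// reduced here to a single illustrative id.
struct Quad     { float x, y, w, h; int stateId; };
struct DrawCall { int stateId; std::size_t firstQuad; std::size_t quadCount; };

// Merge consecutive quads sharing the same state into one draw call:
// e.g. 500 text glyphs using one font atlas collapse into a single call.
std::vector<DrawCall> Batch(const std::vector<Quad>& quads) {
    std::vector<DrawCall> calls;
    for (std::size_t i = 0; i < quads.size(); ++i) {
        if (!calls.empty() && calls.back().stateId == quads[i].stateId)
            ++calls.back().quadCount;                    // extend current batch
        else
            calls.push_back({quads[i].stateId, i, 1});   // state change: new call
    }
    return calls;
}
```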

Dirty rectangle tracking

When the page changes due to an animation or another interactive event, usually only a small part actually changes visually. We keep a collection of “dirty” rectangles where the page has potentially changed and that have to be re-drawn. Most browsers implement some degree of dirty rectangle tracking. Notably, Mozilla’s WebRender differs – it re-draws the whole page each frame.

My profiling on typical workloads shows that re-drawing only parts of the screen is still a big win on both the CPU and the GPU side, even though more bookkeeping has to be done. The rationale is pretty simple: you do less work compared to re-drawing everything. The important part is keeping the dirty rect tracking quick. Elements that have to be drawn are culled against the current dirty rects, and anything that doesn’t intersect is thrown out, as the sketch below illustrates.
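
A minimal, compilable sketch of that culling step, with illustrative names:

```cpp
#include <vector>

// Minimal sketch of dirty-rect culling. The real bookkeeping would also
// merge overlapping dirty rects and walk the element tree, not flat lists.
struct Rect { float x, y, w, h; };

bool Intersects(const Rect& a, const Rect& b) {
    return a.x < b.x + b.w && b.x < a.x + a.w &&
           a.y < b.y + b.h && b.y < a.y + a.h;
}

// Keep only elements overlapping at least one dirty rect; everything
// else is unchanged on screen and is skipped this frame.
std::vector<Rect> CullAgainstDirtyRects(const std::vector<Rect>& elementBounds,
                                        const std::vector<Rect>& dirtyRects) {
    std::vector<Rect> toRedraw;
    for (const Rect& element : elementBounds) {
        for (const Rect& dirty : dirtyRects) {
            if (Intersects(element, dirty)) {
                toRedraw.push_back(element);
                break;  // one overlap is enough to require a redraw
            }
        }
    }
    return toRedraw;
}
```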

In Hummingbird we work as part of a game engine, where we strive for sub-millisecond draw times – far below the 16.6 ms per frame that a general browser has to hit – so dirty rects are hugely important. For LensVR it’s a big win as well, because we can quickly finish our work and put the CPU core back to sleep on mobile, which saves battery life!

In the screenshot below, only the highlighted rectangle will be re-drawn in Hummingbird/LensVR. A similar visualization is also available in Chrome under Rendering -> Paint flashing.

dirty_rect.png

Rendering commands generation

From the styled and laid-out page we generate a “command buffer” – a list of high-level rendering commands that will later be transformed into low-level graphics API commands. The command buffer generation is kept very simple: the buffer is a linear area of memory, and there are no dynamic allocations or OOP paradigms. All logical objects like images, effects, etc. are simple handles (a number). Command generation happens in all browsers, and this is an area of continuous improvement. A sketch of such a buffer is given below.
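
Here is a compilable sketch of what such a linear command buffer can look like; the command set and layout are invented for illustration and are not Renoir’s actual format:

```cpp
#include <cstdint>
#include <vector>

// Linear command buffer: commands are written back-to-back into one
// memory block, and resources are plain integer handles, so recording
// does no per-command dynamic allocation and no virtual dispatch.
using ImageHandle = std::uint32_t;

enum class CmdType : std::uint8_t { DrawImage, FillRect };

struct DrawImageCmd { CmdType type; ImageHandle image; float x, y, w, h; };
struct FillRectCmd  { CmdType type; std::uint32_t rgba; float x, y, w, h; };

struct CommandBuffer {
    std::vector<std::uint8_t> bytes;  // one linear area of memory

    template <typename Cmd>
    void Push(const Cmd& cmd) {       // memcpy-style append
        const auto* p = reinterpret_cast<const std::uint8_t*>(&cmd);
        bytes.insert(bytes.end(), p, p + sizeof(Cmd));
    }
};

int main() {
    CommandBuffer cb;
    cb.Push(FillRectCmd{CmdType::FillRect, 0xff00ffffu, 0, 0, 1920, 1080});
    cb.Push(DrawImageCmd{CmdType::DrawImage, ImageHandle{42}, 10, 10, 256, 256});
    // A consumer would later walk `bytes`, switch on CmdType, and translate
    // each command into low-level graphics API calls.
}
```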

We kept Renoir a “thin” library; this is different from the approach taken by the Skia 2D rendering library used in Chrome (and in Firefox before Quantum). Skia is a very object-oriented library with complex object lifetimes, interactions and numerous levels of abstraction. We wanted to keep Renoir very lean, which helped us a lot during the optimization phases. Chromium’s “Slimming Paint” effort is a way to reduce the abstractions and speed up the “command generation” step.

Parallelism

All command generation and subsequent rendering happen in parallel with other operations on the web page, like DOM manipulation and JavaScript execution. Other browsers also try to get more work off the main thread by parallelizing composition and rasterization. LensVR/Hummingbird go a step further with their task-based architecture, which significantly overlaps computations and uses more CPU cores to finish the rendering faster. Most threads are not “locked” into doing only one specific job, but can do whatever is needed at the moment to get the frame ready as fast as possible (a simplified illustration follows below). Still, we’re looking to improve this area further, as I see possibilities for even better hardware utilization.
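
As a simplified illustration of overlapping rendering with page logic, here is a sketch using std::async; a real engine would use a task scheduler with work stealing rather than raw futures, and the commented-out steps are hypothetical placeholders:

```cpp
#include <future>

// Simplified illustration of overlapping rendering with page logic.
int main() {
    // Kick off command generation and GPU submission for frame N...
    std::future<void> renderDone = std::async(std::launch::async, [] {
        // GenerateCommands(); ExecuteOnGPU();   // hypothetical steps
    });

    // ...while this thread advances the page for frame N + 1.
    // RunJavaScript(); UpdateDOM(); Layout();   // hypothetical steps

    renderDone.wait();  // synchronize before presenting the frame
}
```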

In the next post

In part 2 I’ll explain how we utilize the GPU and share some performance comparisons I did between Renoir, Chrome’s rendering and WebRender in Servo. Stay tuned!
