String interning

When we started to design the architecture of Hummingbird, we knew that an HTML rendering engine has to rely heavily on strings due to the nature of CSS style solving. CSS element ids and classes are defined by the standard as strings and the cascade resolution relies on string comparisons to work. All elements that have a class “my_class” will match a selector for said class.

String operations in C++ are notoriously slow. When solving the styles of a large page you might have to do hundreds or even thousands of string comparisons if you use a vanilla string implementation. The string comparisons are slow as most string implementations allocate their memory on the heap and contain an internal pointer to the actual “char” array and the memory has to be actually compared over the whole string. Cache misses will kill the performance.

In Hummingbird we can’t afford such a naive implementation so we looked for alternatives for how to represent the “id” and “classes” of HTML elements. The most important thing is to have quick comparisons that are vital to the style solving process.

In a game engine where all the content is well known when the game is being built, many developers rely on doing “perfect hashes” of their strings and effectively substituting them everywhere with ids. This is a great solution when you know that the whole pool of strings you’ll encounter is fixed.Unfortunately, such a solution cannot work in Hummingbird where the user can set any id and class in her styles and even generate new ones through JavaScript. Everything has to be dynamic.

We went for a well-known solution called “string interning”and in essence represents a runtime perfect hash.

In our implementation:

  • All strings of a View (basically an HTML page + scripts) are kept in contiguous memory regions.
  • The addresses of all currently interned strings are stored in a hash set.
  • When we encounter a string to be interned, we check the hash set. If the string is already in some buffer – we return the address to it, otherwise we copy the characters in a free region of our buffers, put the address in the set and return the newly created interned string.

The InternedString object that we use is simple, it contains just a const char* pointer. We guarantee that interned strings won’t move in memory.

This representation is very efficient

  • Each interned string consumes memory just once, no matter how many copies are there
  • Two interned strings are the same IFF they have the same address. This makes comparisons super-fast as we only compare pointers!
  • We still have easy access to the contents of the string via the const char*

This representation gives a great performance boost. Our implementation does not try to reclaim eventually unused strings. It is extremely unlikely in our usage scenarios to have interned strings take too much memory.

Hummingbird is a heavily parallel library, so trying to intern a string from multiple threads is a very real scenario. At the moment the “interning context” is locked. Each View has it’s own “interning context. So instead of having a system-wide interning context that risks taking too much memory and becoming a locking bottleneck, we have View-wide ones that reduce contention and practically eliminate the risk of interned strings talking too much memory even during long sessions.

The results are great with ~30% improved style solving performance compared to a version where simple strings are used.