This post has been published also in Coherent Labs’s blog – the company I co-founded and work for.
When measuring the performance of some code, a critical point to get right is having an accurate and reliable method to get some kind of timestamp at exactly the moments we are interested in.
The obvious way to measure how much time (or maybe cycles) a piece of code takes is:
start <- timestamp
*** code to measure ***
end <- timestamp
delta <- end – start
Pushing aside context switches, interrupts and other sources of error in modern systems, this is pretty straight forward IFF you have a reliable way to get a ‘now’ timestamp. I’ll deal with context switches as a source of error for performance measurement in another post and will concentrate on just getting the timestamp right.
A ton of articles, blog posts and threads have been written on this subject. However people are still confused, so I’ll try to make an overview of the possible approaches and at the end of the post there is an extensive list of references. Keep in mind that I’m looking only at ways to get timestamps for performance measurements and I’ll concentrate on Windows, although some of the techniques are not OS-specific.
1. GetTickCount(), GetSystemTime(), timeGetTime() etc. The resolution provided by those is abysmal and they should not be used for any measurement. The best frequency I achieve with GetSystemTime() is 5 milliseconds. There is way to increases the resolution via timeBeginPeriod() however this also increases the pressure on the Kernel for thread scheduling and the best resolution is still too low – 1 ms.
2. QueryPerformaceCounter() – This is the preferred way according to Microsoft as stated here. The resolution is system dependent but on modern systems my observations show that it uses the RDTSC instruction and the overhead compared to ‘raw’ RDTSC, although significant, is not that bad unless you profile very small functions. The resolution is very high – in the nanosecond range.
3. RDTSC (Read Time-stamp counter) – Here the fun begins. Modern CPUs implement a time stamp counter that starts at 0 on processor reset and steadily increases. There is some misconception around this instruction and indeed it is kind of tricky.
It returns a 64-bit value with the counter in the EDX:EAX registers. You can access it with in-line assembler in C/C++ or with a compiler intrinsic (__rdtsc on VS).
The counter counts the processor cycles which used to be a problem as processors might throttle their frequency or the counter might even stop in some circumstances (low power mode, sleep etc.). As of Pentium 4 processors however the time-stamp counter increments at a constant rate. This is well documented in “Intel® 64 and IA-32 Architectures Software Developer’s Manuals” Volume 3 section 17.13. The feature is called “Invariant TSC” and guarantees a constant increment rate of the counter, so you should not worry about the dynamic change of frequency. This also means that you can no longer measure raw instructions performed – the TSC is now effectively a point in time. The feature can be queried as documented in the manual.
Another concern is that if a thread is migrated on a different processor between two measurements, the counter might skip too much or even “go back”. There are several ways to cope with this issue – you could use SetThreadAffinityMask() on your profiled threads to lock them on a particular processor. The newer RDTSCP instruction can be used to detect such situations – it atomically returns both the TSC and the processor id it has been taken from. On my i7, on any logical processor I run it, it returns the same processor id which makes me think that it’s implemented as one TSC shared by all cores, however it is not stated clearly in the documentation or I can’t find it so, I wouldn’t count on that. The system might also have more physical processors.
Another issue is related to the fact that the processor might not execute all instructions in the order they appear in source (and I mean the assembly too). It might so happen that instructions are moved before/after the measurement which introduces error. If a small piece of code is profiled this error can be quite significant. To ‘serialize’ the instructions we can use CPUID – it works as a of a fence (technically speaking it’s a serialization instruction).
RDTSC actually does something like that – “The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read operation is performed” – as stated in the Intel manual.
A correct implementation on how to take measurements is given here.
Serialization instructions are quite slow and might bias the measurement too much, the profiled code will also probably run slower. If you use them you must subtract their cost from your profile. An explanation is given here.
The out-of-order execution problem applies to QueryPerformaceCounter() too, so beware.
With proper care RDTSC seems the way to go for good and fast performance profiling. Depending on the use cases however, if the added price is not a problem, I’d go with QueryPerformaceCounter() – it does offer many guarantees that RDTSC doesn’t, and getting the frequency if you need to transform the abstract counters to real time is trivial. If you profile many small pices of code however the overhead might add-up.
C++11 introduced chrono in the standard library – it’s a family of classes and functions for time manipulation and measurement. I find the library to be very elegant but it has it’s own very serious problems – or to be more precise, the VS 2012 has serious problems. The high_resolution_clock in VS 2012’s std is NOT high resolution at all! It isn’t steady either. This is because it is just a typedef for system_clock which is implemented with GetSystemTimeAsFiletime().
But it doesn’t end here – the steady_clock & monotonic_clock are also implemented via the system_clock; even though they report
is_steady==true && is_monotonic==true – it’s a lie.
If you change the clock on your machine – bang! all ‘steady’ clocks go back/forward in time.
There is a bug on MSDN about the high_resolution_clock here and I hope they fix the implementation soon, because now it’s almost useless now.
Boost’s implementation does not suffer from these problems and can be safely used – it calls QueryPerformanceCounter on Windows for the high_resolution_clock.
I didn’t check other std implementations on Windows but libstdc++ does it right on POSIX.
Event tracing for Windows
If you do your own tracing sessions you can specify the way timestamps are captured and reported. The default is reported as system time, captured with QPC (Win Vista+) which might not be what you want.
Changes can be done by adding to the mask EVENT_TRACE_LOGFILE.ProcessTraceMode the flag PROCESS_TRACE_MODE_RAW_TIMESTAMP and then in the EVENT_TRACE_PROPERTIES.Wnode.ClientContext you can set:
1 – QPC
2 – System time
3 – RDTSC
The long post confirms that getting accurate and fast timestamps is not trivial on current computer architectures and is easy to get wrong and not notice.
1. “Intel® 64 and IA-32 Architectures Software Developer’s Manuals”, Volume 3
2. Game Timing and Multicore Processors
3. rdtsc in the Age of Sandybridge
4. Using the RDTSC Instruction for Performance Monitoring
5. How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures
6. High Performance C++ Profiling
7. Time Stamp Counter