
Usage Guide

  1. Introduction
  2. Simple Usage:  The Costs of Runtime Primitives
  3. How Measurements are Actually Taken
    • Limits of accuracy
  4. Unpacking the Source
    • Rebuilding MeasureIt
  5. Debugging Optimized Code In Visual Studio (Validating a benchmark)
  6. Making New Benchmarks
    • Scaling Small Benchmarks
    • Modifying the Loop Counts
    • Benchmarks with State
  7. Power Options
  8. Contributions

Introduction

Most programmers should be doing more performance measurements as they design and write code.  It is impossible to create a high-performance application unless you understand the costs of the operations you use.  Because the .NET Framework is a virtual machine, its operations are more 'abstract', which makes it even harder to reason about their cost from first principles.

To solve this problem, it is often suggested that the documentation for classes and methods include information on how expensive the operations are.  Unfortunately, this is problematic because

  • The set of classes and methods is large (and growing), making generating the information expensive.
  • The cost is often not simple to describe.  Often it depends on multiple arguments.  Often there is a large first-time cost.  Sometimes there is a cache, which means random arguments have one cost, but repeated access to arguments used in the recent past has a much lower cost.  Sometimes the cost is highly input dependent (e.g., sorting).
  • The costs can vary depending on the scale of the problem (doing operations on small vectors is typically CPU bound; doing them on very large vectors becomes CPU-cache bound).
  • The costs change over time due to changes in the code or in the underlying support code (e.g., the .NET Runtime or OS version).
  • The costs depend on configuration (did you NGEN, is it a multi-AppDomain scenario, is your code fully trusted or not).
  • The costs depend on CPU architecture, which also changes over time (sometimes dramatically).

These factors make it difficult to provide useful performance information as 'normal' documentation.  As an alternative, the MeasureIt utility attempts to make it really easy for you to measure the costs of operations that are important for YOUR application.  These measurements can then be run on the CPUs, OSes, runtimes, and software versions that are relevant to you (and you can see whether they vary or not).  You can also 'mock up' code that closely resembles the performance-critical code paths in your application to ensure that the types of inputs fed to the methods of interest are 'representative' of your application.  You can then experiment with variations on your design and find the most efficient implementation.

Simple Usage:  The Costs of Runtime Primitives

MeasureIt comes with a set of 'built-in' performance benchmarks that measure the 'basics' of the runtime.  These benchmarks measure the speed of method calls, field accesses, and other common operations.  Running MeasureIt without any parameters runs all of these built-in benchmarks.  For example, running the command

MeasureIt

runs all the 'default' benchmarks.  After completion it displays an HTML page with a table like the (abbreviated) one below.

Scaled where EmptyStaticFunction = 1.0 (6.1 nsec = 1.0 units)

Name                                                                        Median     Mean   StdDev      Min      Max  Samples
NOTHING [count=1000]                                                        -0.438   -0.252    0.541   -0.438    1.371       10
MethodCalls: EmptyStaticFunction() [count=1000 scale=10.0]                   1.000    0.949    0.080    0.851    1.060       10
Loop 1K times [count=1000]                                                 343.472  359.915   29.826  341.746  441.198       10
MethodCalls: EmptyStaticFunction(arg1,...arg5) [count=1000 scale=10.0]       1.556    1.549    0.099    1.403    1.788       10
MethodCalls: aClass.EmptyInstanceFunction() [count=1000 scale=10.0]          1.148    1.160    0.091    1.070    1.334       10
MethodCalls: aClass.Interface() [count=1000 scale=10.0]                      1.597    1.664    0.089    1.592    1.843       10

Often only a small number of benchmarks are of interest.  To allow users to control which benchmarks are run, benchmarks are grouped into 'areas', which can be specified on the command line.  Running MeasureIt /? will give you a list of all the available areas.  You can then simply list the ones of interest to limit which benchmarks are run.  For example, to run just the benchmarks related to method calls and field access, you can run

MeasureIt MethodCalls FieldAccess

Best Practices and Caveats

Fundamentally, the MeasureIt harness measures wall-clock time.  In particular, if you measure on a uniprocessor while other background tasks are running, they will interfere with the measurements (this is not as problematic on a multi-processor).  Generally this is not a big problem because sensitive benchmarks tend not to block and tend to complete within the single quantum of CPU time given to a thread, so they are rarely interrupted.  Still, it is best to avoid other work on the machine while the harness is running.

While MeasureIt does take several samples to highlight variability, it takes all of these samples in the same process and very close together in time.  Thus the measurements may vary significantly when run in a different process, or when spread out more in time within the same process.  Be alert to this: if you see unexplained variability in different runs or under different conditions, keep in mind that the variability metrics MeasureIt provides are a lower bound on the true variability of the measurement.

How Measurements are Actually Taken

MeasureIt goes to some trouble to minimize and highlight the most common sources of measurement error.   These include

  1. 'First time' costs like Just-In-Time (JIT) compilation, memory page faults, and initialization code paths.  These are avoided by running the benchmark once before measurement begins.
  2. Timer resolution.  The timer MeasureIt uses typically has a resolution of < 1 usec.  Thus very small intervals (e.g., a few nsec) need to be 'magnified' by repeating them many times, to bring the actual time being measured up to something more like 100 usec or 1 msec.  Thus benchmarks have a 'count', which represents the number of times the benchmark is repeated for a single measurement.  The value presented, however, represents the time consumed by a single iteration (so you can ignore the count unless you are doing a measurement error analysis).
  3. Overhead associated with starting and stopping the counter (and looping over the benchmark).  This is minimized by measuring an empty benchmark and subtracting that value from every measurement as 'overhead'.  However, since there is variability in this overhead, this can lead to negative times being reported for very small intervals (< 10 usec).  Well-designed benchmarks should never be measuring times this small (see point 2).
  4. Variability associated with loops.  Very small loops (as might happen for short benchmarks) can have execution times that vary depending on the exact memory address of the loop.  It is very difficult to control for this variability.  Instead, a robust solution is to 'macro expand' the benchmark a number of times (typically 10 is enough), so that the time the benchmark takes is large compared to the overhead of the loop itself.  However, it is useful to report the number as if the benchmark had not been cloned.  This factor is called the 'scale' of the benchmark, and the reported number is the measured time (of a single loop iteration) divided by the scale.  Like the count, it can be ignored unless you are doing a measurement error analysis (the sketch after this list shows how count, scale, and overhead combine arithmetically).
  5. Benchmark variability.  The benchmark is run a number of times (typically 10), and the mean, median, min, max, and standard deviation are reported to highlight the variability of the measurement.
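
Putting these pieces together, the arithmetic behind a single reported value is simple: time the loop, subtract the measured overhead, and divide by count times scale.  The sketch below illustrates this.  It is NOT the actual MeasureIt code (the real logic lives in the timer classes in the unpacked source, described later); the names and the Math.Sqrt body are illustrative only.

        using System;
        using System.Diagnostics;

        // A sketch (not the actual MeasureIt implementation) of how the count, the scale,
        // and the empty-body overhead combine into one reported sample.
        class MeasurementSketch
        {
            static double TimeLoopMsec(Action body, int count)
            {
                Stopwatch sw = Stopwatch.StartNew();
                for (int i = 0; i < count; i++)
                    body();
                sw.Stop();
                return sw.Elapsed.TotalMilliseconds;
            }

            static void Main()
            {
                const int count = 1000;    // loop count, as in [count=1000]
                const double scale = 1.0;  // how many times the code is cloned inside the body

                Action body = delegate { Math.Sqrt(42.0); };              // illustrative benchmark body
                body();                                                   // (1) prime: pay first-time costs

                double overheadMsec = TimeLoopMsec(delegate { }, count);  // (3) what the 'NOTHING' row measures
                double totalMsec = TimeLoopMsec(body, count);             // (2) magnify by repeating 'count' times

                // (4) report the cost of ONE logical operation, corrected for overhead and cloning
                double nsecPerOp = (totalMsec - overheadMsec) / (count * scale) * 1000000.0;
                Console.WriteLine("~" + nsecPerOp.ToString("f1") + " nsec per operation (one of the ~10 samples)");
            }
        }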

To highlight variability, the first line in the output table is the 'NOTHING' benchmark.  It measures an empty benchmark body, and its time would be zero if there were no measurement error.  Thus the numbers in this line give us some idea of the best accuracy we can hope for.  In the example above we can see that this measurement varied from -0.438 to +1.371, so numbers smaller than that are dubious.  Keep in mind, however, that all benchmarks with a scale are actually measuring a time that is bigger than what is reported.  For example, the 'EmptyStaticFunction' benchmark has a scale of 10, which means the reported number 1.0 was really measuring 10 units of time.  Thus errors of -0.438 or +1.371 are not 44% or 137% errors but rather roughly 4% to 14% errors.  This is consistent with the min and max values reported for 'EmptyStaticFunction' (which suggest 6% to 15% error).  Thus the 'NOTHING' benchmark gives you a good idea of how much error is an inherent part of the harness.

In addition to the error introduced by the harness itself, the benchmark will have variability of its own.  The standard ways of dealing with this are to compute the mean (average) of the measurements or the median (the middle value in a sorted list).  Generally the mean is the better measurement if you will use the number to compute an aggregate throughput for a large number of items.  The median is a better guess if you want a typical sample.  The median also has the advantage that it is more stable when the sample is noisy (e.g., has large min or max values).  If the median and the mean differ significantly, however, there is a good chance there is too much error for the data to be valuable.

Limits of accuracy

The min and max statistics give you the lower and upper bounds of the expected error of the measurements.  The standard deviation is also a useful error metric.  It is very likely that the error forms what statisticians call a 'normal distribution'.  If this assumption holds, then 68% of all measurements should be within one standard deviation of the mean, and over 95% of all measurements should be within two standard deviations of the mean.  For example, since the 'EmptyStaticFunction' benchmark has a standard deviation of 0.08 and a mean of 0.949, our expectation is that 68% of all samples should be in the range 0.949-0.08 = 0.869 to 0.949+0.08 = 1.029, and that 95% of all samples should be within 0.949-2*0.08 = 0.789 and 0.949+2*0.08 = 1.109.  This is another useful way of quickly getting a handle on the expected error.
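
These ranges can be computed mechanically from a set of samples.  The sketch below shows the arithmetic; the sample values are invented for illustration and are not the measurements from the table above.

        using System;
        using System.Linq;

        // A sketch of the statistics discussed above; the sample values are invented.
        class StatsSketch
        {
            static void Main()
            {
                double[] samples = { 0.851, 0.90, 0.92, 0.93, 0.95, 0.96, 0.97, 0.98, 1.00, 1.06 };

                double mean = samples.Average();
                double[] sorted = samples.OrderBy(x => x).ToArray();
                double median = (sorted[sorted.Length / 2 - 1] + sorted[sorted.Length / 2]) / 2;  // even sample count
                double stdDev = Math.Sqrt(samples.Sum(x => (x - mean) * (x - mean)) / (samples.Length - 1));

                Console.WriteLine("mean={0:f3} median={1:f3} stdDev={2:f3}", mean, median, stdDev);
                // Assuming the error is roughly normally distributed:
                Console.WriteLine("~68% of samples expected in [{0:f3}, {1:f3}]", mean - stdDev, mean + stdDev);
                Console.WriteLine("~95% of samples expected in [{0:f3}, {1:f3}]", mean - 2 * stdDev, mean + 2 * stdDev);
            }
        }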

Unpacking the Source

In performance work, the details matter!  The display for each benchmark gives a short description of what the benchmark does, but if the scenario is at all complicated, you will find yourself wondering exactly what is being measured and under what conditions.  Well, you can find out!  MeasureIt.exe has its complete source code embedded within it.  To unpack it, simply type

  • MeasureIt /edit

which will unpack the source code into a 'MeasureIt.src' directory and launch the Visual Studio .SLN file (if you have Visual Studio installed).  Even without Visual Studio you can view the source code with Notepad or another editor.  A good place to start browsing the source code is the _ReadMe.cs file, which contains an overview of the rest of the source code.  If you use Visual Studio, I strongly recommend that you download the code 'hyperlinking' feature mentioned at the top of the _ReadMe.cs file.  This will allow you to navigate the code base more easily.

All the benchmarks themselves live in the 'MeasureIt.cs' file and you can find the code associated with a particular benchmark by searching for its name. For example the benchmark for 'FullTrustCall' is

        timer1000.Measure("FullTrustCall()", 1, delegate
       {
           PinvokeClass.FullTrustCall(0, 0);
       });

As you can see, a benchmark is simply a call to the 'Measure()' method.   The Measure() method takes three parameters:

  1. The name of the benchmark. This can be any string.
  2. The scale of the benchmark (1 in this case).  This is how many times the code of interest is cloned within the body of the benchmark (to reduce measurement error).  The number reported by the harness is the actual time divided by this number.
  3. A delegate representing the code to be measured. Typically the delegate {} syntax is used to place the code inline.

In this case we see that the benchmark consists of a single call to the 'FullTrustCall()' method.  Searching for the definition we find

        [DllImport("mscoree", EntryPoint = "GetXMLElement"), SuppressUnmanagedCodeSecurityAttribute] 
       public static extern void FullTrustCall(int x, int y); 

which gives you all the details of exactly what is meant by a full trust call (basically, using the SuppressUnmanagedCodeSecurityAttribute).  It also shows that the actual native call happens to be a method called GetXMLElement in the mscoree DLL.   There is a comment in this code that indicates that this particular native method consists of exactly 2 instructions (determined by stepping through it).   Thus all the information needed to understand anomalous performance results is available.

Rebuilding MeasureIt

MeasureIt is also set up to make changing and rebuilding it easy.   Inside Visual Studio this is as easy as invoking the build command (e.g., F6), but it can also be done easily without Visual Studio by simply executing the 'build.bat' file in the MeasureIt.src directory.   The only prerequisite for build.bat is that a current (V2.0 or later) version of the .NET Runtime be installed (Windows Vista comes with it built in).   In either case, the build automatically updates the original MeasureIt.exe file (including incorporating the updated source into it) when it is rebuilt.   Thus with one keystroke you can update your original MeasureIt.exe to incorporate any changes you need.

The build currently has 12 'expected' warnings about links to files within projects.  These result from the technique used to incorporate the source into the EXE, and they can be ignored.

Debugging Optimized Code In Visual Studio (Validating a benchmark)

When measuring a benchmark that does very little (e.g., a single method call or other primitive operation), it is likely that the just-in-time (JIT) compiler will transform the benchmark in a way that yields misleading results.   For example, to measure a single method call the benchmark might call an empty method, but unless special measures are taken the JIT compiler will optimize the method away completely.   Thus it is often a good idea to look at the actual machine code associated with small benchmarks to confirm that you understand the transformations that the runtime and JIT have done.  In a debugger like Visual Studio, this should be as easy as setting a breakpoint in your benchmark code and switching to the disassembly window (Debug -> Windows -> Disassembly). Unfortunately, the default options for Visual Studio are designed to ease debugging, not to do performance investigations, so you need to change two options to make this work:

  1. Clear the checkbox Tools -> Options… -> Debugging -> General -> Suppress JIT Optimization. By default this box is checked, which means that even when debugging code that should be optimized, the debugger tells the runtime not to optimize it. The debugger does this so that optimizations don't interfere with the inspection of local variables, but it also means that you are not looking at the code that is actually run. I always uncheck this option because I strongly believe that debuggers should strive to only inspect, and not change, the program being debugged.  Unsetting this option has no effect on code that was compiled 'Debug', since the runtime would not have optimized that code anyway.
  2. Clear the checkbox Tools -> Options -> Debugging -> General -> Enable Just My Code. The 'Just My Code' feature instructs the debugger not to show you code that you did not write. Generally this feature removes the 'clutter' of call frames that are often not interesting to the application developer. However, this feature assumes that any code that is optimized can't be yours (it assumes your code is compiled 'Debug' or that Suppress JIT Optimization is turned on). If you allow JIT optimizations but don't turn off 'Just My Code', you will find that you never hit any breakpoints because the debugger does not believe the optimized code is yours.

Once you have unset these options, they remain unset for ALL projects. Generally this works out well, but it does mean that you may not get the Just My Code feature when you want it. You may find yourself switching Just My Code on and off as you go from debugging to performance evaluation and back (I personally am happy leaving it off all the time). 

Once you have set up Visual Studio, you can now set breakpoints in your benchmark and view the disassembly to confirm that the benchmark is doing what you think it should be.  This technique is valuable any time a small benchmark has unexpected performance characteristics.
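
One concrete thing to look for in the disassembly is inlining: if the empty method being called gets inlined, the call disappears entirely and the benchmark measures nothing.  The sketch below shows the usual countermeasure; the type and method names are illustrative (MeasureIt's own call targets presumably do something similar, which you can confirm in the unpacked source and in the disassembly window).

        using System.Runtime.CompilerServices;

        static class CallTargets
        {
            // Without this attribute the JIT is free to inline the empty call away,
            // and a 'method call' benchmark would end up measuring an empty loop body.
            [MethodImpl(MethodImplOptions.NoInlining)]
            public static void EmptyStaticFunction() { }
        }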

Making New Benchmarks

The real power of MeasureIt comes from the ease of adding new benchmarks.  In the simplest case, to add a new benchmark

  1. Add an 'area' to the program for your new benchmark by declaring a static method of the form 'MeasureXXXX()' on the MeasureIt class.  The XXXX is the name of your new area.   If this method is public then the area will be run by default (when no area is specified on the command line).  If the method is private, then the area will only be run when it is explicitly mentioned on the command line.
  2. In this new method add one or more benchmarks by calling the 'Measure' method on the 'timer1000' variable of the MeasureIt class. 

That's it!   For example, we might want to learn more about the costs of getting and setting environment variables.  To do this we could add the following code to the MeasureIt class.

    static public void MeasureEnvironmentVars()
    {
        timer1000.Measure("GetEnvironmentVariable('WinDir')", 1, delegate
        {
            Environment.GetEnvironmentVariable("WinDir");
        });
        
        timer1000.Measure("SetEnvironmentVariable('X', '1')", 1, delegate
        {
            Environment.SetEnvironmentVariable("X", "1");
        });
    }

The 'timer1000' variable is useful for benchmarks that take less than 100 usec to run.   As its name suggests, this timer will loop 1000 times over the benchmark.   Thus it will extend the time to something like 100 msec, which makes measurement error small but does not make humans impatient.   As mentioned before, the 'Measure()' method takes the name of the benchmark, the scale, and a delegate representing the body of the benchmark.   Often you can simply name the benchmark after the literal code being run, but otherwise any short, descriptive name will do.   The scale parameter is used for benchmarks of only 50 instructions or fewer and will be explained in the next section; it should be 1 if the benchmark is already large enough.  Finally, there is the delegate that represents the benchmark itself.

When writing the benchmark code it is important to realize that the method will be called many times by the harness.  This works great for benchmarks that are 'stateless', but it can be a problem if the act of calling the benchmark perturbs state in a way that causes the performance to change (for example, adding an entry to a table).  That case is discussed in a later section.

Scaling Small Benchmarks

If the benchmark is less than 50 instructions or so, it is best to make the benchmark bigger by cloning it a number of times in the delegate body being measured.    To keep the results easy to interpret, however, it is important that the measured time is then divided by the number of times the benchmark was cloned.  That is the purpose of the 'scale' parameter of the Measure() method.  Here is the example used for method calls.  10 is passed for the scale because the benchmark is cloned 10 times.

        timer1000.Measure("EmptyStaticFunction(arg1,...arg5)", 10, delegate
        {
            Class.EmptyStaticFunction5Arg(1, 2, 3, 4, 5);
            Class.EmptyStaticFunction5Arg(1, 2, 3, 4, 5);
            Class.EmptyStaticFunction5Arg(1, 2, 3, 4, 5);
            Class.EmptyStaticFunction5Arg(1, 2, 3, 4, 5);
            Class.EmptyStaticFunction5Arg(1, 2, 3, 4, 5);
            Class.EmptyStaticFunction5Arg(1, 2, 3, 4, 5);
            Class.EmptyStaticFunction5Arg(1, 2, 3, 4, 5);
            Class.EmptyStaticFunction5Arg(1, 2, 3, 4, 5);
            Class.EmptyStaticFunction5Arg(1, 2, 3, 4, 5);
            Class.EmptyStaticFunction5Arg(1, 2, 3, 4, 5);
        });
        

Modifying the Loop Counts

As its name suggests, timer1000 runs the benchmark 1000 times for every measurement.   When the benchmark takes longer than 100 usec, a loop count of 1000 will make the benchmark run unnecessarily long.  While you can change the loop count property on timer1000, it is generally not a good idea to do so because it could easily lead to confusion about how many iterations are to be done.  Instead it is better to create new timers to use when different loop counts are needed.  For example, the following code creates a new timer called timer1 that will only loop a single time for each measurement and will take three (instead of the usual 10) samples for computing min, max, and the other statistics.  This would be appropriate for benchmarks that take > 1 sec to run.

        timer1 = new MultiSampleCodeTimer(3, 1);    // Take three samples,  Loop count = 1
        timer1.OnMeasure = logger.AddWithCount;

In the code above, in addition to creating the new timer, the timer also needs to be hooked up to the HTML reporting infrastructure, which is what the second line does.  The 'OnMeasure' event is fired every time a timer has new data, and the second line tells it to call the 'AddWithCount' method on 'logger' when this happens.  The logger then ensures that the data shows up in the HTML report.

Using this technique you can change the loop count to whatever value keeps the measured time long enough to avoid measurement error, but short enough to allow efficient data collection.   For very expensive benchmarks (> 1 min) you may need to cut the number of samples to only 1, but ideally you would still take at least 3, so you get some idea of the variability of the benchmark.

Finally, by default timers 'prime' the benchmark before actually doing a measurement.  This ensures that any 'first time' initialization is done up front and does not skew the results.   However, when benchmarks are very long (> 10 sec), this priming is expensive and often not interesting.   The timers have a property called 'Prime' which defaults to true but can be set to false to skip priming the benchmark and save time.
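
As a sketch, an expensive benchmark using the timer1 created above might be set up as follows; the benchmark name, the file path, and the benchmark body are purely illustrative.

        timer1.Prime = false;    // skip the priming run; for a very long benchmark it is not worth the time
        timer1.Measure("ReadLargeFile", 1, delegate
        {
            System.IO.File.ReadAllBytes(@"C:\temp\large.dat");    // hypothetical expensive operation
        });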

Benchmarks with State

Benchmarks really need to return the system to the same state it was in before the benchmark started.  However, even some very simple scenarios don't have this property.  For this case we need a slightly more complicated 'Measure' routine that knows how to 'reset' the state of the system so that another run of the benchmark can be done.   For example, the following code finds the average cost of adding 1000 entries to a dictionary.

        Dictionary<int, int> intDict = new Dictionary<int, int>();
        timer1.Measure("dictAdd(intKey, intValue)", 1000,   // Scaled by 1000 because we do 1000 Add()s
            delegate             // Benchmark 
            {
                for (int i = 0; i < 1000; i++)
                    intDict.Add(i, i);
            },
            delegate             // Reset routine
            {
                intDict.Clear();
            });

In this example, we pass two delegates to the Measure() routine.  The first delegate is the benchmark code itself, which adds 1000 entries to the dictionary.  Because we pass 1000 as the 'scale' parameter to Measure(), this time will be divided by 1000, and thus the harness reports the average cost of adding a single element.  At the end of every measurement, the harness calls the second delegate to 'reset' the state; in this case we just call the Clear() method.   It is true that there is some measurement error because the cost of the loop is included in the measurement, but that error should be small.   The result is that this code measures the average time to add one element when adding 1000 elements (in order).

It is worth noting how useful it is for someone using this benchmark to see its actual code.   Because we know that 'Dictionary' is implemented as a hash table, the fact that the entries were added in order probably does not skew our results (although we should check by doing some variations).   However, if the Dictionary were implemented as a sorted list, adding entries in order would be much faster than the 'random' case, and our results would be skewed.  As I mentioned before, in performance work details matter, and it is good to be able to see (and change) the details as the need arises.
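
For instance, a variation that adds the same 1000 keys in a shuffled order might look like the sketch below; the shuffle and the fixed random seed are illustrative additions, not part of MeasureIt.

        Dictionary<int, int> intDict = new Dictionary<int, int>();
        int[] keys = new int[1000];
        for (int i = 0; i < keys.Length; i++)
            keys[i] = i;

        Random rand = new Random(12345);             // fixed seed so every run shuffles the same way
        for (int i = keys.Length - 1; i > 0; i--)    // Fisher-Yates shuffle, done once, outside the measurement
        {
            int j = rand.Next(i + 1);
            int temp = keys[i]; keys[i] = keys[j]; keys[j] = temp;
        }

        timer1.Measure("dictAdd(randomOrderKey, intValue)", 1000,   // Scaled by 1000 because we do 1000 Add()s
            delegate             // Benchmark 
            {
                for (int i = 0; i < 1000; i++)
                    intDict.Add(keys[i], i);
            },
            delegate             // Reset routine
            {
                intDict.Clear();
            });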

Power Options

Modern computers have the capability of slowing the frequency of the CPU's clock to save power.    By default the operating system tries to determine the correct balance between power and performance.  Thus it throttles the CPU when the computer is inactive and unthrottles it when it is being used heavily.   This adaptation, however, takes time (typically tens of milliseconds) and always lags behind actual CPU usage.   While in general this adaptation is a good thing, it leads to unreliable benchmark numbers.   Thus for benchmarking it is best to turn off this power-saving feature.

On Windows Vista and later operating systems, MeasureIt does exactly this by default, and no additional user interaction is required.  For older operating systems this needs to be done manually by going to the Control Panel's Power Options and selecting the 'High Performance' power scheme.

While MeasureIt makes certain that its own benchmarks are run under the 'High Performance' power scheme, other benchmarks will not have this capability.  Thus as an additional service MeasureIt exposes the following options for manually controlling the power scheme, so that a non-MeasureIt benchmark can be run under the correct power scheme.

  • measureIt /setHighPerformance
  • measureIt /setBalenced
  • measureIt /setPowerSaver

The power options capability only works on Windows Vista (or Windows Server 2008) and above.
