jasplin
Test
Performance
Posted by jasplin
 in Test, Performance
 on Thursday, November 05, 2009 @ 13:25

Benchmarking is an important concept for evaluating the performance of computer systems. It can be considered essentially a three-step process:

Step 1
Define the benchmarks.
Step 2
Execute the benchmarks and produce numbers.
Step 3
Analyze the benchmark numbers.

Each step has its own challenges: Step 1 is about writing programs that test sufficiently interesting/relevant parts of the target product. In Step 2 one needs to define what resource to measure and ensure that the measurements are as reliable as possible in various software/hardware contexts (note that QTestLib provides some help here through the -median option and by supporting callgrind). Step 3 is concerned with “higher level” analysis problems like how to detect performance regressions as reliably as possible, how to compare different implementations of a particular algorithm and so on.

This article introduces BM, a tool that attempts to solve some of the problems that may arise in Step 3. In particular we focus on how potential performance regressions can be detected. Currently the idea is to evaluate this tool for our internal QA needs, but we hope that others may find it interesting too and give us some feedback.

Benchmark results and execution context

BM is based on QTestLib concepts: First, a benchmark is identified by the combination of a test case, a test function, and a data tag (optional). The benchmark should execute some part(s) of the target product. Second, a benchmark result (i.e. the result of executing a particular benchmark) is essentially a single, non-negative number.

A benchmark result is produced within a context made up of the following components:

Metric
The resource used for measuring the benchmark.
Platform
The compiler and OS used for building and executing the benchmark.
Host
The physical computer on which the benchmark was executed.
Branch
The development branch (of the target product) tested by the executable benchmark program.
Snapshot
The timestamp and ID of the current branch revision.

Note that the host may identify a virtual machine that may run on different physical hosts. In this case the physical hosts should be as identical as possible with respect to both hardware and software.

BM currently assumes that git is used for revision control. The ID in the snapshot is the SHA-1 of the head commit at the time the benchmark was built, and the timestamp is the committer date.

Time series as snapshots of revision history

By varying the snapshot component only, successive runs of a particular benchmark form a time series of benchmark results. The notion of a performance regression is defined with respect to this time series. The question to ask is essentially:

How likely is it that a modification to the product itself is affecting the performance of the benchmark in a negative way?

(It would of course also be interesting to know if the benchmark performance is affected in a positive way — BM makes no distinction between the two cases.)

Ideally, new benchmark results should be produced each time the revision history is advanced. In practice this may be hard to achieve if the time it takes to produce a new set of benchmark results exceeds the product update interval.

Note that any modification done to the host or platform of a particular time series (e.g. reconfigurations or upgrades) will in theory invalidate the history in the sense that it may not make sense to compare results before and after the modification. The BM tool currently has no support for this scenario.

Change and stability

For our purpose, the only really interesting property of a sequence of benchmark results is how the results change along the revision history. Moreover, it is normally the current (i.e. most recent) change that we’re most interested in. Who cares about how the benchmark performance changed one year ago if some code that was submitted yesterday is likely to cause a bad performance regression?

Ideally, Step 2 (defined earlier) will produce reliable numbers completely free of “statistical effects”, but for various reasons this cannot be guaranteed in a real life scenario. Since nobody wants to waste their time chasing “false positives”, we need some mechanism to help us reason about the confidence. BM takes a heuristic approach to this problem by letting the user define a difference tolerance (DT) and a stability tolerance (ST):

  • Two values are considered significantly equal if their relative difference does not exceed DT (i.e. if |(v1 - v2) / v1| ≤ DT)
  • If vn is the last value in the time series, the current change (if any) is the latest value vc before vn that differs significantly from vn.
  • A value v is considered stable if it is preceded consecutively by at least ST values that are significantly equal to v.
  • The current change is stable if both vc and vn are stable.

The following figures show examples of stable and non-stable current changes (ST is 4 in all cases):

Stable current change
Non-stable current change 1
Non-stable current change 2

By sorting the benchmarks on their current changes (considering both sign and magnitude) and on stability, we hope to get a useful indication of likely candidates for performance regressions. Once a potential regression is detected, it may need additional manual verification before eventually pinpointing the code modification(s) that caused it.

Architecture

The BM system consists of the following main components:

  • a bmserver (written in Qt) for managing the database of benchmark results,
  • a bmclient (written in Qt) for uploading data to and retrieving data from the bmserver, and
  • a web client (written in JavaScript) for presenting and exploring the most interesting benchmark results.

New benchmark results are uploaded to the bmserver by invoking the bmclient with the benchmark results (in a QTestLib XML output file) along with the execution context. Benchmark results already in the database may be retrieved directly through the bmclient (which can be useful for debugging), but normally it is better to use the web client. In the latter case the bmclient is invoked as a CGI program by a web server.

The following figure shows the client/server architecture with arrow directions indicating the main data flow:

Client server architecture

The bmserver and bmclient communicate over an XML-based protocol, while communication between the web server and web client is based on Ajax and JSON.

For simplicity, the bmserver currently uses SQLite, but we plan to allow for other database backends (like PostgreSQL) if necessary.

Web interface

The web interface is organized into three main sections:

  • A platform summary section that shows overall statistics for each platform/branch combination.
  • A benchmark summary section that shows more detailed statistics for the most interesting benchmarks of a particular platform/branch combination (the ranking criteria may be defined by the user). Other features in this section include detailed views of individual time series, comparing the latest results of two branches, and comparing results in a local file against results in the database.
  • An overall settings section.

Here are some screenshots:

Platform summary section


Benchmark summary section


Table


Graph


Graph 2

Note: BM is currently known to work properly with Firefox only (partly due to the use of the HTML canvas tag for graphics).

Status and plans

It goes without saying that a proper assessment of the usefulness of BM depends on the database containing a fair amount of production data. As we speak, the Qt benchmarks (under the /tests/benchmarks/ directory) are being run at regular intervals on a Linux platform.

We would like to invite anyone interested to give us feedback on the BM tool. Does this approach make sense? How can it be improved? Etc. The repository is here (and there’s a README to get you started).

8 Responses to “Analyzing benchmark results with BM”

» Posted by AlexBravo
 on Thursday, November 05, 2009 @ 23:01

This tool would be great to see if the changes we, outside developers, are contributing, make Qt slower (or faster). The life of Qt reviewers could be simlified, if anybody could run these benchmarks before submitting a merge request.

Can this tool be used (with small modifications) also as a test runner?
Or is Qt going to release to the public what you guys currently using for running autotests?
Is it called Dungeon?

» Posted by Stephan Sokolow
 on Friday, November 06, 2009 @ 02:15

BM looks like something I’ll find useful when I get around to trying it out… but it seems a bit odd that you wouldn’t make sure it works with the current QWebView release.

» Reply from jasplin
 on Friday, November 06, 2009 @ 07:22
jasplin

AlexBravo:

The BM repository contains bmrunner, a simple console-based tool that simply runs all QTestLib-based benchmark programs it can find in a directory subtree and uploads the results to the database. (In our current experimental setup, however, we find it more convenient to run the benchmark programs by extending a script that we already use for running autotests).

If you were thinking of executing benchmark programs interactively from the web client, there is currently no support for that (although it is an interesting idea!). Note that it is possible to upload a local results file and compare it against results in the database, but in that case the comparison would normally make sense only if the two sets of results are produced on identical systems.

I’m not aware of any plans for releasing any more of our internal QA scripts/tools at this point.

» Reply from jasplin
 on Friday, November 06, 2009 @ 07:53
jasplin

Stephan Sokolow: TODO file updated ;)

» Posted by AlexBravo
 on Friday, November 06, 2009 @ 16:54

@jasplin
> In our current experimental setup, however, we find it more convenient to run the benchmark programs by extending a script that we already use for running autotests

Are you talking about test.pl in \qt-qt\tests\auto\ directory?

> If you were thinking of executing benchmark programs interactively from the web client
It looks like there’s a need to run tests in two ways:
1. Automatically run all tests on a regular basis. That’s what BM tool is great for.
2. Manually run any test as you make changes to the code. Running it from IDE would be the best option.

» Posted by zecke
 on Sunday, November 08, 2009 @ 04:06

Have you looked at the Google Chromium buildbots? E.g. they use the following scheme to identify a real performance change (in contrast to an external one… e.g. switch of physical machine for the vm, more ram, system update, updatedb run in the background).

Effectively on each run they run a “Gold” test using the binary that generated the baseline and the newly built software… If both the gold and the newly built product has a local peek they are pretty confident it was not introduced by the latest revision… There is also a Google Techtalk about this..

» Posted by Rohan McGovern
 on Monday, November 09, 2009 @ 02:02

@AlexBravo

> Are you talking about test.pl in \qt-qt\tests\auto\ directory?

No, AFAIK tests/auto/test.pl is not used by anyone.

Internally the current solution is based on support in the build system (qmake) for running autotests. This ought to be public, but not in time for Qt 4.6.

» Posted by staff
 on Saturday, November 14, 2009 @ 10:24

Can this be used on s60??



© 2008 Nokia Corporation and/or its subsidiaries. Nokia, Qt and their respective logos are trademarks of Nokia Corporation in Finland and/or other countries worldwide.
All other trademarks are property of their respective owners.