Morten
Threads
Qt Concurrent
Posted by Morten
 in Threads, Qt Concurrent
 on Friday, June 27, 2008 @ 12:24

The Wide Finder Project is an informal parallel programming competition where the task is to compute web site statistcs from a 218-million line access log. Each entry will be benchmarked on a Sun T2000 with support for 32 hardware threads, giving lots of opportunities for parallel processing.

What makes this really interesting is that the project is not only about performance, but rather about writing code that scales to many CPU cores with as little extra programmer effort as possible. Some results are already in, with OCaml currently in the lead performance-wise.

Each log line looks something like this:

www.example.com - - [17/Jun/2007:21:37:17 -0700] “GET /ongoing/ongoing.atom HTTP/1.1″ 304 - - “-” “Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4

Eager to bring the performance lead back to C++ where it belongs I started out writing my own implementation using QtConcurrent and the other Qt APIs. Briefly explained, the code uses QtConcurrent::mappedReduced to multi-thread the code, and then QByteArray::split() twice to iterate over each word in each line. The current version computes the number of hits for each page.

Results: (parsing 100K lines on an 8-core 2.8 GHz Mac Pro)

1 Thread
real 0m2.283s
user 0m1.141s
sys 0m0.249s

2 Threads:
real 0m1.446s
user 0m1.853s
sys 0m0.271s

1.6X speedup.. not too bad.

4 Threads:
real 0m3.186s
user 0m10.643s
sys 0m0.407s

1 second slower that the single-threaded version.. this does not bode well.

8 Threads:
real 0m7.000s
user 0m46.922s
sys 0m0.724s

Seven seconds! We get a nice linear scaling of the run-time as we increase the number of threads, but unfortunately in the wrong direction. The program us spending a lot of user time doing something though, so let’s run it through Shark and see what’s going on:

spinlock.png

80% in a spin-lock used by malloc/free. But who is calling malloc that much?

spinlock-split.png

Aha.. QByteArray::split(). While being a very convenient API, split() was clearly not designed for heavy parsing like this. Still, I’d like a less catastrophic impact on the run-time when adding threads, even if the program really is calling malloc/free to often. Let’s try with the ptmalloc memory allocator instead:

8 Threads:
real 0m0.908s
user 0m3.784s
sys 0m0.533s

ptmalloc is used in GNU/Linux though the GNU C library and scales much better on multicore systems. The program itself still does not scale beyond 4 threads, but it does not get significantly worse either when adding threads. I guess it’s debatable whether or not this is qualifies as a bug in the Darwin memory allocator, but at least ptmalloc shows that it is possible to do better.

That’s all for now :) For the next installment I’ll try to get better scaling, at the expense of increasing the developer effort.

gunnar
Uncategorized
Qt
Qt Jambi
WebKit
Qt Concurrent
Graphics View
Patternist
Posted by gunnar
 in Uncategorized, Qt, Qt Jambi, WebKit, Qt Concurrent, Graphics View, Patternist
 on Tuesday, June 10, 2008 @ 09:13

So the time is finally here. Qt 4.4.0 was released a few weeks ago and as promised Qt Jambi is right behind. A lot of effort has gone into this one, in addition to supporting all the new Qt features, like Phonon, Webkit, Widgets in Graphics View, XQuery and Qt Concurrent, we also have a seriously improved deployment system, JDBC support and a compile-time checked signal-slot approach for the paranoid. Its a good time to be a Java developer I tell yah! We already mentioned all the featuers in the Qt Jambi 4.4.0 Preview Blog so we won’t repeat ourselves here… (There is a danger in linking to eskils blog, as it links to others again, which again links to others, which in the end proves to be a fairly complex graph, but then again… we are engineers and like that kind of stuff)

Under the cover we’ve also done quite some work. We also did an overhaul of the garbage collection and memory management subsystem and hopefully ironed out all the bumps and dents. We’ve also done some work on the build system, so that our users that build from source have a bit more substantial buildsystem to work with. Previously it was a complex install document, which has been replaced by a simple ant command which just does it all… I was very happy to see that the deployment system & ANT build scripts works well enough for the webstart to look like a plain, normal webstart app:

<resources>
<j2se version="1.5+"/>
<jar href="qtjambi-examples-4.4.0_01.jar"/>
<jar href="qtjambi-4.4.0_01.jar"/>
</resources>

<resources os="Windows" arch="x86">
<j2se version="1.5+"/>
<jar href="qtjambi-win32-msvc2005-4.4.0_01.jar"/>
</resources>

No magic nativejar or anything like that, just the qtjambi-win32-msvc2005-4.4.0_01.jar in the classpath and that is enough to load it, jpeg and svg plugins and all. The good thing is that the files included in the webstart are produced directly by the ant script with all dependencies etc set up properly… (well… almost properly, it took us an evening last week to get it really working, but now it works properly). Because of the fixes to memory management and deployment Eskil and I got these offical diplomas:

Absolutely last load issue fixed and Last memory managment bug

So, what more is there to say… Try the webstart with its new demos, download the packages and start hacking!

-
The Qt Jambi Team

Morten
Qt
Threads
Qt Concurrent
Posted by Morten
 in Qt, Threads, Qt Concurrent
 on Friday, November 23, 2007 @ 13:17

Just a quick update to say that Qt Concurrent has been integrated into Qt/main and is now available in the snapshots. The documentation is available here, end there’s a couple of examples in the examples/qtconcurrent/ directory in the snapshot package. Enjoy!

Morten
Labs
Threads
Qt Concurrent
Posted by Morten
 in Labs, Threads, Qt Concurrent
 on Thursday, April 26, 2007 @ 08:29

MapReduce was originally developed by Google to simplify writing parallel algorithms for computer clusters. The basic idea is that you divide your algorithm into two parts: one part that can be run in parallel on individual pieces of the input data (’map’), and one sequential part that collects the map results and produces the final result (’reduce’). Your program then sends the map and reduce functions along with your input data to the MapReduce framework which automatically parcels out the data to each cluster node and collects the results afterwards.

MapReduce in Qt Concurrent is implemented to work on shared-memory systems, so instead of managing cluster nodes it manages threads on a single computer. This also means we can drop some of the features that the Google version has, such as fault tolerance. (we assume that processors don’t fail)

The API looks like this:

QFuture<T> mappedReduced(list, mapFunction, reducefunction);

As an example, let’s say we want to do a word frequency count on the contents of several documents. Here the map function will count the word occurrences in each document in parallel, and the reduce function will combine the them into a final frequency count.

The input for the function is a list of text strings that contains the documents:

QList<QString> list;

The map function takes one document and produces a hash that stores the frequency count for each word in the document. This function will be called in parallel by several threads, so it can’t have any side-effects such as modifying global data (or more accurately: any side-effcts must be thread-safe). The number of threads used will be scaled according to the number of CPU cores on the system.

QHash<QString, int> mapFunction(const QString &document);

The reduce function takes one intermediate result hash and aggregates it into the final result. Qt Concurrent will make sure only one thread calls this function at a time. This has two implications: there is no need to use a mutex lock when updating the result variable, and the system can be smarter about how it manages threads. If a thread tries to become the reducer thread while another thread is reducing, the first thread doesn’t have to block but can put the result on the to-be-reduced queue and then call the map function on a new piece of data instead.

void reduceFunction(QHasht<QString, int> &finalResut, const QHash<QString, int> &intermediateResult);

Finally we put it all together like this:

QFuture<QHash<QString, int> >counting =  mappedReduced(list, mapFunction, reduceFunction);

Since mappedReduced returns a QFuture we have several options on how to synchronize with the result. The simplest thing is to just call QFuture::result() which will block until the result is ready. If blocking is inappropriate (say we are in the gui thread) we can use signal-slot connections to get progress and result notifications instead. It’s also possible to cancel mappedReduced by calling QFuture::cancel().

I’ve skipped over the function implementations here, the complete word count example is available in the Qt Concurrent package.

Morten
Labs
Threads
Qt Concurrent
Posted by Morten
 in Labs, Threads, Qt Concurrent
 on Thursday, March 08, 2007 @ 15:55

QtConcurrent::run() runs a function in a worker thread. It returns a QFuture, which is then used to synchronize with the result:

QString foo();
QFuture<QString> f = QtConcurrent::run(foo);
...
QString string = f.result()

Calling f.result() will block the current thread until foo() has returned. The QFuture template argument must match the return type of foo().

If the function you want to run takes arguments, use QtConcurrent::bind to supply values for them:

QString foo(const QString &string);
QFuture<QString> f = QtConcurrent::run(QtConcurrent::bind(foo, QLatin1String("Hello World")));

qDebug() < < f.result();

(bind is based on the excellent boost::bind package.)

It’s usually not a good idea to block the GUI thread to wait for results, so when writing GUI applications it’s possible to use signals and slots instead. The QFutureWatcher class is used to make the connections:

QFutureWatcher *watcher  = new QFutureWatcher();
QObject::connect(watcher, SIGNAL(resultReady(const QVariant&)), myObject, SLOT(handleResult(const QVariant &)));

QString foo();
QFuture<QString> f = QtConcurrent::run(foo);
watcher->setFuture(f);

When foo() returns, QFutureWatcher calls the handleResult slot using a queued signal-slot connection.

Morten
Labs
Threads
Qt Concurrent
Posted by Morten
 in Labs, Threads, Qt Concurrent
 on Friday, February 23, 2007 @ 10:57

As a part of Trolltech Labs, I’m adding the Qt Concurrent project. Qt Concurrent is a high-level threading framework for writing code that automatically scales on multi-core systems.

For example, to make thumbnails of a list of images you can do this:

QImage scaled(const QImage &image)
{
    return image.scaled(QSize(100, 100));
}

...

const QList<QImage> images = ...
const QList<QImage> thumbnails = QtConcurrent::mapped(images, scaled);

QtConcurrent::mapped() runs scaled() on each image in the list, using as many threads as there are CPU cores on the system.

In addition to this Qt Concurrent also provides a MapReduce implementation for shared-memory systems and support for making asynchronous function calls using futures.

You can get the source code at the labs pages, and there is also a forum for questions and general feedback.