Morten
Threads
Qt Concurrent
Posted by Morten
 in Threads, Qt Concurrent
 on Friday, June 27, 2008 @ 12:24

The Wide Finder Project is an informal parallel programming competition where the task is to compute web site statistcs from a 218-million line access log. Each entry will be benchmarked on a Sun T2000 with support for 32 hardware threads, giving lots of opportunities for parallel processing.

What makes this really interesting is that the project is not only about performance, but rather about writing code that scales to many CPU cores with as little extra programmer effort as possible. Some results are already in, with OCaml currently in the lead performance-wise.

Each log line looks something like this:

www.example.com - - [17/Jun/2007:21:37:17 -0700] “GET /ongoing/ongoing.atom HTTP/1.1″ 304 - - “-” “Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4

Eager to bring the performance lead back to C++ where it belongs I started out writing my own implementation using QtConcurrent and the other Qt APIs. Briefly explained, the code uses QtConcurrent::mappedReduced to multi-thread the code, and then QByteArray::split() twice to iterate over each word in each line. The current version computes the number of hits for each page.

Results: (parsing 100K lines on an 8-core 2.8 GHz Mac Pro)

1 Thread
real 0m2.283s
user 0m1.141s
sys 0m0.249s

2 Threads:
real 0m1.446s
user 0m1.853s
sys 0m0.271s

1.6X speedup.. not too bad.

4 Threads:
real 0m3.186s
user 0m10.643s
sys 0m0.407s

1 second slower that the single-threaded version.. this does not bode well.

8 Threads:
real 0m7.000s
user 0m46.922s
sys 0m0.724s

Seven seconds! We get a nice linear scaling of the run-time as we increase the number of threads, but unfortunately in the wrong direction. The program us spending a lot of user time doing something though, so let’s run it through Shark and see what’s going on:

spinlock.png

80% in a spin-lock used by malloc/free. But who is calling malloc that much?

spinlock-split.png

Aha.. QByteArray::split(). While being a very convenient API, split() was clearly not designed for heavy parsing like this. Still, I’d like a less catastrophic impact on the run-time when adding threads, even if the program really is calling malloc/free to often. Let’s try with the ptmalloc memory allocator instead:

8 Threads:
real 0m0.908s
user 0m3.784s
sys 0m0.533s

ptmalloc is used in GNU/Linux though the GNU C library and scales much better on multicore systems. The program itself still does not scale beyond 4 threads, but it does not get significantly worse either when adding threads. I guess it’s debatable whether or not this is qualifies as a bug in the Darwin memory allocator, but at least ptmalloc shows that it is possible to do better.

That’s all for now :) For the next installment I’ll try to get better scaling, at the expense of increasing the developer effort.

6 Responses to “Wide Finding”

» Posted by Adriaan de Groot
 on Friday, June 27, 2008 @ 14:19

Well, at this point you really should be using Studio 12 + Qt on Solaris to get the most out of it; you can do a lot more parallelism testing with the Studio 12 tools. I just saw a very effective demo of determining bottlenecks in parallelism on exactly the test setup you describe. One thing I see is indeed the *huge* number of small allocations going on in Qt apps. You might want to look at libumem (I tihnk that’s the ‘interesting’ allocator under Solaris).

» Posted by babali
 on Saturday, June 28, 2008 @ 11:34

Hello,

With a garbage collector you can do slices without mallocing.
See http://dotnot.org/blog/index.php for example.

» Posted by Alex
 on Saturday, June 28, 2008 @ 15:42

Where is a link to the sample data file or do we have to create it? This is a poorly documented project with no clear rules/data/submit details, feels like I’m in freshman college without directions to the computer lab.

» Posted by Adam Higerd
 on Monday, June 30, 2008 @ 14:33

Actually, babali has a good point, although I wouldn’t necessarily use a garbage collector for it… but it should be feasible to deal in implicitly-shared substrings; I can’t see why it wouldn’t work. (Doesn’t QString even offer some explicitly-shared substring functions?) It wouldn’t even break API/ABI to switch QByteArray::split to return implicitly-shared substrings — might be a fun little experiment.

In other news, I’ve heard jemalloc is designed for this kind of parallelism; you might check it out.

» Reply from Morten
 on Wednesday, July 02, 2008 @ 12:23
Morten

Alex: The instructions for getting the data files can be found here: http://wikis.sun.com/display/WideFinder/Infrastructure

Adam: QStringRef is probably the class you’re thinking about. Modifying QByteArray:split could be possible, but unfortunately you still have to allocate a d-pointer for each returned bytearray (QStringRef shares a d-pointer with the referenced string, making it allocation free.) jemalloc looks promising, it would be interesting to benchmark it against ptmalloc and nedmalloc.

» Posted by Adam Higerd
 on Monday, July 07, 2008 @ 16:50

Yeah, QStringRef is exactly what I was talking about, but that is a point I hadn’t considered; QByteArray::fromData does still have to construct a QByteArray object, just not a buffer for it, and while QByteArrayRef could probably work it wouldn’t be a drop-in replacement — you’d need a new API. *sigh* Well, it was an idea. (And it’ll be fewer and smaller allocations than the current implementation anyway — it would still be a performance benefit for applications using split() even if it isn’t quite as parallelizable.)



© 2008 Nokia Corporation and/or its subsidiaries. Nokia, Qt and their respective logos are trademarks of Nokia Corporation in Finland and/or other countries worldwide.
All other trademarks are property of their respective owners.