Posted by Benjamin in WebKit, Performance on Tuesday, February 02, 2010 @ 11:13

As with the other parts of Qt, great performance is important for QtWebKit.

Traditionally, QtWebKit has been mostly used on desktop computers for advanced layout, hybrid applications or simply to browse the web. On a modern computer, the speed of WebKit is not a problem.

The world has changed, and QtWebKit is now used on mobile phones running Maemo, Symbian or Windows CE, and it is increasingly used in embedded applications on various devices.

Working on what matters

To improve the performance of WebKit, we work with benchmarks. Those benchmarks serve as use cases when profiling WebKit and as a way to evaluate the gain of our patches.

Those benchmarks and the tools are available on Gitorious, in the QtWebKit performance repository.

WebKit gives a lot of possibilities, and we try to focus our work on what matters. For the performance work, we use real webpages, and look for ways to improve WebKit in the way it is used on the Web.

How we benchmark WebKit

The performance suite has three kinds of tools:

  • host tools: to manage the data used by the benchmarks
  • tests: the benchmarks
  • reductions: some benchmarks for specific components of WebKit

Let’s have a look at the tools and the tests. The full documentation of the performance suite is on the WebKit wiki.

Mirroring the web

For the benchmarks, we do not want to access the Internet directly. We want to compare the results from one run to the next, so we don’t want the pages to change arbitrarily. Using the live Web for benchmarking would also put a significant load on the servers.

To use real pages without going online, we create databases of web pages with the mirror application:

[Figure: process of webpage mirroring]

Those databases are snapshots of webpages at a given point in time, and they are used as input of the benchmarks.

The mirror application uses WebKit to load the pages and intercepts all network requests. This means the database also includes resources that are loaded lazily via JavaScript.

Using the benchmarks

There are two ways to use the databases with the benchmarks: the online and offline modes. The difference lies in the way we provide the database’s content to the benchmarks:

[Figure: modes of benchmarking]

In the “online mode”, we use a basic web server to serve the database over HTTP. The benchmarks use the complete stack to load pages, as they would if we were loading the pages from the Internet.

In the “offline mode”, the database is loaded directly by the benchmarks and is used as the source of data. In that case, the network is not involved. This mode is mostly useful for the benchmarks that do not involve the network (like measuring the rendering speed).
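As a rough illustration of the offline mode, the sketch below models a snapshot database as an in-memory map from URL to recorded response. `SnapshotDatabase`, `Snapshot`, `record` and `lookup` are hypothetical names chosen for this example; they are not the performance suite's actual classes or storage format.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical model of one recorded network response.
struct Snapshot {
    std::string mimeType;
    std::string body;
};

class SnapshotDatabase {
public:
    // Mirroring step: remember what the network returned for this URL.
    void record(const std::string &url, Snapshot snapshot) {
        m_entries[url] = std::move(snapshot);
    }

    // Offline benchmarking step: answer from the snapshot only.
    // Returns nullptr when the URL was never recorded.
    const Snapshot *lookup(const std::string &url) const {
        std::map<std::string, Snapshot>::const_iterator it = m_entries.find(url);
        return it == m_entries.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, Snapshot> m_entries;
};
```

Because every lookup is answered from the recorded data, two runs over the same database see exactly the same input, which is the property the benchmarks rely on.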

What is measured

The benchmark suite is still a work in progress. Currently, there are benchmarks for:

  • the page loading performance (with or without rendering)
  • the rendering performance
  • the scrolling performance

How can you use it?

If you use WebKit, and are interested in great performance, you can use the performance suite to profile the use case you are interested in, and optimize those cases.

If you are evaluating WebKit for embedded use, you can use the benchmarks to evaluate how well WebKit performs on the hardware.

If you make patches for WebKit’s performance, have a look at how to contribute. You can also join us on IRC in #qtwebkit on freenode.

Posted by gunnar in Threads, Painting, Graphics Dojo, Performance on Thursday, January 21, 2010 @ 08:18


In this series that we’ve been doing, I wanted to cover threading, a topic that has been actively discussed amongst some of the trolls over the last few months. We’ve had support for rendering into QImages from non-GUI threads since the early Qt 4.0 days, but it’s only in recent versions of Qt (since 4.4, I think) that we got support for rendering text into images. Now that the support is there, it raises the question of how to make proper use of it. Generating the actual content in a thread is one use case, and here is an example of it.

What it means is that instead of rendering all the content of a certain view in the QWidget::paintEvent() or in the QGraphicsItem::paint() function, we use a background thread which produces the cache. The benefit is that even though drawing the actual content can be quite costly, drawing a pre-rendered image is fast, making it possible for the UI to stay 100% responsive while the heavy loading is happening in the background. It does imply that not all content is available at all times, but for many scenarios this is perfectly fine. There is nothing novel about this approach, I just think it’s a nice way to solve a problem that often comes up when dealing with user experience.

This approach is used by Google Maps (actually, what the server does I don’t know, but it sends individual tiles to the browser at least), iPhone and N900 web browsers, and I’ve talked to customers in the past that use this approach for use cases where generating the content is costly, but the user interface needs to stay responsive. In fact, this approach applies to pretty much anything where it is ok that the content is not immediately there, such as data tables like an mp3-index or a contact list, images in a data folder, etc.

The Task

Let’s first look at the task. I’ve done a trivial implementation which looks in a directory and displays all the images in there. Each image is a separate content piece and I’ve put a background, a small frame around it and a drop shadow under it, just so that there is a bit of active work going on. If you are into it, here is the Source Code.

The content pieces could have been tiles in a map of Norway or tiles composing a webpage, but I chose images, because I already had some images around and I figured it made for an ok example. The demo is run on an N900 with compositor disabled using the following command lines:

  • Non-Threaded: ./threaded_tile_generation -no-thread -graphicssystem opengl MyImageFolder
  • Threaded: ./threaded_tile_generation -graphicssystem opengl MyImageFolder

Here’s how it looks when the content is generated in the GUI thread:


The UI is running super-smooth as long as I show only the content that is already loaded. Once work is needed, the entire UI stops and the user experience is really bad. Here is how it looks if we move the work into a background thread.

The algorithm

Don’t use this particular algorithm. It is very crude and written to show an idea. First of all, because I was lazy, I used queued connections rather than a synchronized queue to schedule the pieces to be rendered. This means that the queue is managed by Qt’s event loop, out of my control. So if I pan far out, I will schedule a lot of images to be rendered, then pan beyond them before they are done. In a decent implementation, I would dequeue these and make sure that only the pieces that are directly visible are being processed.

The other thing is that there is no logic to “peek ahead”. I schedule images to be generated only when I need them. If I instead scheduled them based on the current panning direction, in addition to not discarding so aggressively, it would probably result in a situation where most images are rendered ahead of time.
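A minimal sketch of the kind of synchronized queue described above, assuming tiles are identified by an integer index and the visible range is known after each pan. `TileQueue` and its methods are hypothetical names; the demo's source uses queued connections instead, and a real worker would also wait on a condition variable rather than poll.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>

class TileQueue {
public:
    // GUI thread: request a tile unless it is already pending.
    void schedule(int tile) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (std::find(m_pending.begin(), m_pending.end(), tile) == m_pending.end())
            m_pending.push_back(tile);
    }

    // GUI thread, after panning: drop requests outside [first, last].
    // Passing a range widened in the panning direction gives a crude "peek ahead".
    void keepOnlyVisible(int first, int last) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_pending.erase(std::remove_if(m_pending.begin(), m_pending.end(),
                                       [=](int t) { return t < first || t > last; }),
                        m_pending.end());
    }

    // Worker thread: take the next tile to render; false when nothing is pending.
    bool takeNext(int *tile) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_pending.empty())
            return false;
        *tile = m_pending.front();
        m_pending.pop_front();
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_pending.size();
    }

private:
    mutable std::mutex m_mutex;
    std::deque<int> m_pending;
};
```

The point of owning the queue, rather than leaving it to the event loop, is that stale requests can be removed before the worker ever sees them.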

QGraphicsView

It would be kinda cool if this could be applied directly to QGraphicsView: you set a flag on the item and, instead of generating its cache pixmap in the GUI thread, the work is offloaded to the worker thread. This is not straightforward, however, because the GUI thread can, as of today at least, continue to modify the state of the item while it’s being rendered in the worker thread. Syncing these two becomes a bit of a mess, and how to solve it, if at all, is not something we have a plan for. That doesn’t prevent people from doing this kind of work in their own custom paint() functions, of course.

Posted by No'am Rosenthal in WebKit, Graphics View, Graphics, Performance on Wednesday, January 13, 2010 @ 16:07

I’d like to share with the community a project I’m working on, while it’s still in its development phase (isn’t that what labs is for? :))
The goal of the project is to get CSS3 animations to a reasonable FPS performance, mainly on embedded hardware where it’s a pain.

See http://gitorious.org/~noamr/webkit/noamrs-webkit/commits/accel

The idea is to implement WebKit’s GraphicsLayer concept, which allows platform-specific implementations of CSS transform and opacity animations, using the graphics-view and the Qt animation framework as a backend. This would only work for QGraphicsWebView and not for QWebView, as rendering a separate QGraphicsScene inside QWebView would probably not give us much performance benefit.

Preliminary results are very promising: the leaves demo, for example, runs 4 times faster on Maemo Fremantle than it does without the acceleration, and it looks graphically accurate.

The reason this gives us a performance benefit is mainly graphics-item caching: when a CSS animation occurs inside WebKit, the item that’s being animated has to go through a re-layout and re-draw every so often, while with the accelerated approach we draw it once into a QPixmap (QGraphicsView takes care of that) and then it’s just a series of fast and furious pixmap blts. The hardware acceleration becomes relevant when the images are big and the blt itself becomes a bottleneck.

This project is not ready to go upstream, as it supports many delicate use-cases that need to be tested. But if you’re interested in participating (or just commenting!), this has so far been a fun project to hack on.

Instructions:

  1. Get the Git repo from git://gitorious.org/~noamr/webkit/noamrs-webkit.git, branch accel
  2. Build or get a relatively new version of Qt, possibly without building QtWebKit
  3. Build WebKit from the downloaded Git repo:
    export QTDIR=[my-qt-4.6-root]
    export PATH=$QTDIR/bin:$PATH
    ./WebKitTools/Scripts/build-webkit --qt
  4. Run ./WebKitBuild/Release/bin/QGVLauncher --accel: This will enable the necessary web-setting for composite-layer acceleration. You can also create a small QGraphicsWebView example yourself, as long as you enable the new settings flag: QWebSettings::AcceleratedCompositingEnabled
  5. Load a website with CSS transform/opacity animations: like this one or this one.
  6. Hack. Most of the code is in WebCore/platform/graphics/qt/GraphicsLayerQt.cpp, or you could just search for the term USE(ACCELERATED_COMPOSITING)
  7. Send merge requests through Gitorious or comments on Bugzilla

No’am

Posted by alexis.menard in Itemviews, Performance on Friday, January 08, 2010 @ 15:34

Back in Qt 4.4, QFileDialog was rewritten in order to help with performance issues. In addition, a new model was designed to read the hard drive content: QFileSystemModel. The main advantage of QFileSystemModel is its thread: the model is populated asynchronously by a gatherer thread. The benefit is that the GUI is not stuck while the gatherer is reading the file system (except in one case). In 4.4.x, 4.5.x and 4.6.x the focus was on maturing QFileSystemModel to make it fast and stable. I think it has now reached this status, and QDirModel can easily be replaced by QFileSystemModel as it provides the same features and more.

So in 4.7, QDirModel will be obsoleted and QFileSystemModel will be used in our examples. At the same time QCompleter will gain the support of QFileSystemModel. The commit is 319b0262418d74cc416a7dd1f620b54ba45bad22 and you can find it in our trunk.

Posted by TomCooksey in Painting, Graphics Dojo, OpenGL, Performance on Wednesday, January 06, 2010 @ 12:01

Introduction

Here’s the next instalment of the graphics performance blog series. We’ll begin by looking at some background about how OpenGL and QPainter work. We’ll then dive into how the two are married together in the OpenGL 2 paint engine and finish off with some advice about how to get the best out of the engine. Enjoy!

Why OpenGL?

Before I dive into the OpenGL paint engine, I want to make sure we all understand the motivation for the OpenGL 2.0 paint engine. I’ve talked about this before in my article about hardware acceleration, but we still frequently get questions like “Why not implement a Direct2D paint engine?”.

Everyone knows OpenGL means fast graphics, right? Well, this is actually a bit of a misconception. What makes graphics fast is a bit of hardware dedicated to computer graphics called a GPU (Graphics Processing Unit). OpenGL 2.x is a software library which often (but not always) uses a particular class of GPU to help satisfy drawing operations (Note: OpenGL 1.x used a different class of GPU). A modern programmable GPU (e.g. nVidia GTX 295) can usually be programmed via OpenGL, Direct3D and OpenCL. The only difference then is that Direct3D is only available on the Windows platform and OpenCL is not universally supported.

So the reason we are investing our time and effort into OpenGL, rather than Direct3D or OpenCL, is that OpenGL 2.0 is sufficient to give us access to all the GPU features we currently want to use. It is also available on more platforms, especially if you limit yourself to the ES sub-set. We are also looking into restricting ourselves further to only use APIs in OpenGL 3.2 Core Profile.

This might change in the future if we see a new class of GPU, like ones designed for 2D vector graphics which can’t be abstracted by OpenGL 2.0 very well (enter OpenVG), or, if we want to start using GPU features which OpenGL (ES) 2.0 doesn’t give us access to. Having said that, OpenGL is very good at exposing new GPU features through extensions.

History

Qt has had an OpenGL paint engine since early Qt 4.0 days. This engine was designed for the fixed-function hardware available at the time. As time went on and manufacturers added newer bits of hardware to their GPUs, the OpenGL paint engine was adapted to use those features through OpenGL extensions. Over the last 4 years, lots of people have hacked on the engine and added support for things like ARB fragment programs and even adapted the engine to work on OpenGL ES 1.1. The engine is pretty stable and has lots of fall-backs (or original code-paths, depending on how you look at them) for old hardware missing GL extensions the engine can utilise. But, fundamentally, it is an OpenGL 1.x engine.

In early 2008, around the time of the Falcon project (the Falcon Project was an internal project started for Qt 4.5 which focused on painting performance and architecture), it became increasingly clear that Qt needed to support hardware acceleration using the OpenGL ES 2.0 API which was starting to appear on embedded System-On-Chips like the OMAP3. There were two options available: Extend the existing OpenGL paint engine further still, or develop a new paint engine from scratch. When looking at the existing engine, there was a major problem – although it supported fragment programs, it was heavily reliant on fixed-function vertex processing. A further consideration was that the Falcon project had just kicked off and the future of the QPaintEngine API was uncertain. Both of these factors resulted in a new paint engine being written from scratch for OpenGL ES 2.0. This new engine had a distinct advantage over the existing engine: everything I wanted to use from OpenGL was in the core OpenGL ES 2.0 API. This meant I didn’t need to add fallbacks in case of missing functionality, leading to much cleaner and leaner code.

Another point about OpenGL ES 2.0 is that it doesn’t have much in the way of fixed function features – forcing you to write shader programs. While annoying at the time, this is apparently the best way to do things even on desktop GPUs. This point is important because it quickly became apparent that although the engine was designed for GLES2, not only would it also work on desktop OpenGL 2.0, but it would use that API in a way better suited for modern programmable GPUs. So, in Qt 4.6, the new engine is used by default on both GLES2 and on desktop systems which support OpenGL 2.0.

What does OpenGL (ES) 2 provide?

As I’ve already mentioned, OpenGL ES 2.0 is a pretty lean and mean API which models programmable GPUs. The “programmable” bit is fundamental to the API. It means that you write small programs known as shaders, ask OpenGL to compile and then run them on the GPU to process the data you give it. There are two types of shaders: one type processes positions (vertices) and another type processes pixels (fragments), called the vertex shader and fragment shader, respectively. The idea is that you tell OpenGL you want to draw some triangles and the vertex shader is run to determine the position of each of those triangles. Then, the GPU turns each triangle into a bunch of pixels and the fragment shader is run to determine the colour of each of those pixels. The API provides various ways of passing data from the CPU to the GPU (from textures and lists of triangle positions to individual floats) and ways of passing data from the vertex shader to the fragment shader. That’s basically it. All the complexity lives in the shaders you give to the GPU to run.

What does QPainter require?

The rest of this blog assumes you are familiar with the QPainter API (if not, go check the QPainter docs). It might also be a good idea to read through Gunnar’s post about how the Raster engine works.

So, the QPainter API provides more than just triangles. It is therefore the GL paint engine’s job to turn the whole of the QPainter API into “just a bunch of triangles”. To understand its task a little better, you have to split QPainter up into chunks which map better to OpenGL. A great example of this is drawRect(). In QPainter terms, this is a single primitive, but in GL engine terms, it is actually two: A rectangle (the fill) and a (possibly quite complex) line round the outside (the stroke). The OpenGL paint engine tries to keep a fairly clean separation between the shape of something which is drawn and its fill. So, here’s the list of primitives (shapes) QPainter requires the engine to draw:

  • Simple primitives (Rectangles, convex polygons, ellipses, etc.)
  • Text
  • Pixmaps
  • Strokes
  • Complex vector paths (QPainterPath)

In addition to this, we have various fills which we can use on our primitives provided by QBrush:

  • Solid colour
  • Linear gradients
  • Radial gradients
  • Conical gradients
  • Bitmap patterns
  • Textures

Not only do we have different types of fill, but we also support a full 3×3 transformation matrix on the brushes. This allows you to draw a rectangle but use it as a kind-of stencil over (for example) a perspective transformed texture.

Finally, QPainter also requires the engine to implement clipping, different composition modes and support its state stack (QPainter::save() & QPainter::restore()).

Engine Operation

Primitive Rendering

  • Simple Primitives: To render convex primitives such as rounded rectangles, we just generate a GL triangle fan and render it using glDrawArrays.
  • Text: For large text, we convert it to a complex path and render it as such. However, for smaller font sizes, we rasterize the individual font glyphs and upload them as a texture (8-bit texture for bitmap & anti-aliased glyphs and 24-bit RGB for sub-pixel anti-aliased glyphs). This glyph texture is used as a mask in the engine’s pixel pipeline (see below). So, in terms of primitives, text is actually rendered as a set of rectangles - one rectangle for each glyph. When rendering with sub-pixel anti-aliased glyphs, it is possible that the engine will need to do two passes (if the brush is not a solid colour). This is because the engine uses a clever trick and sets the brush’s colour as the glBlendColor and outputs the RGB mask in the fragment shader. It is then able to set a glBlendFunc which combines the two and gives per-sub-pixel blending. If you set a more complex brush, the engine has to do two passes - first apply the mask to the destination, then a second pass to apply the brush, with glBlendFunc set to give the correct result.
  • Pixmaps: A pixmap is actually just a rectangle.
  • Strokes: Strokes can be very complex - just take a look at the pathstroke demo! However, even the most complex dashed pattern with rounded joins and end caps can be turned into a GL triangle strip relatively easily. This is done by the QTriangulatingStroker.
  • Complex vector paths: This is where things get tricky. QPainterPaths can have lots of things which break the “turn lineTo, moveTo and curveTo into vertices and render as triangle fan” algorithm…
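To make the “triangle fan” idea concrete, here is a small sketch of how a fan over n vertices decomposes into n-2 triangles that all share the first vertex. `fanTriangles` is a hypothetical helper written for illustration; the engine itself hands the vertex array to glDrawArrays and lets GL do this decomposition.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index triples (0, i, i+1) that a triangle fan over
// vertexCount vertices is made of. Fewer than 3 vertices give no triangles.
std::vector<std::array<std::size_t, 3>> fanTriangles(std::size_t vertexCount) {
    std::vector<std::array<std::size_t, 3>> triangles;
    for (std::size_t i = 1; i + 1 < vertexCount; ++i) {
        const std::array<std::size_t, 3> triangle = {{0, i, i + 1}};
        triangles.push_back(triangle);
    }
    return triangles;
}
```

For a convex outline these triangles tile the shape exactly, which is why simple primitives need only one pass. For a four-point path ABCD the fan is ABC plus ACD, the same two triangles used in the stencil discussion.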

Rendering Using Stencil Technique

Take the following path as an example:

[Figure: Convex Path (1)]

Here we have a seemingly trivial path with only 4 points. To draw this with GL, you could just convert the path’s points to vertices and draw it as a triangle fan, which results in two triangles: Triangle 1: ABC and Triangle 2: ACD. The problem is that this just looks like a solid triangle, not the path we wanted:

[Figure: Convex Path (2)]

So, to overcome this difficulty, we drop to a 2-pass rendering method which uses the stencil buffer as a temporary scratchpad. So first off, we clear the stencil buffer to all zeros (represented as white):

[Figure: Stencil Buffer (Clear)]

Next, we set the stencil operation to invert, which means that instead of setting the stencil value to ‘1’ when a triangle touches a pixel, GL inverts the existing value. So 0->1 & 1->0. First we render the first triangle (ABC). As all the pixels are currently 0, every pixel touched by the triangle turns to 1 (represented as black):

[Figure: Stencil Buffer (Triangle 1)]

Next, we draw the second triangle (ACD). Note: We are inverting the stencil’s value, so black pixels touched by the second triangle turn to white and white pixels turn to black:

[Figure: Stencil Buffer (Triangle 2)]

So now the stencil buffer contains the silhouette of our path. All we do now is draw a rectangle into the destination window, but with the stencil test enabled.
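The first pass can be simulated on the CPU to see why it works. The sketch below is a software model, not the engine's code: it toggles a 0/1 value for every pixel centre covered by each fan triangle, which is what the invert stencil operation does. `stencilSilhouette` and the coordinates in the usage note are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Sign tells on which side of the edge a->b the point p lies.
double side(Point a, Point b, Point p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// True when p is inside (or on the border of) triangle abc, either winding.
bool insideTriangle(Point a, Point b, Point c, Point p) {
    const double d1 = side(a, b, p), d2 = side(b, c, p), d3 = side(c, a, p);
    const bool hasNegative = d1 < 0 || d2 < 0 || d3 < 0;
    const bool hasPositive = d1 > 0 || d2 > 0 || d3 > 0;
    return !(hasNegative && hasPositive);
}

// Pass 1: draw the path as a triangle fan with the "invert" stencil operation.
// Returns width*height values (0 or 1), sampled at pixel centres.
std::vector<int> stencilSilhouette(const std::vector<Point> &path, int width, int height) {
    std::vector<int> stencil(static_cast<std::size_t>(width) * height, 0); // cleared to 0
    for (std::size_t i = 1; i + 1 < path.size(); ++i) {      // fan triangle (0, i, i+1)
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                const Point centre{x + 0.5, y + 0.5};
                if (insideTriangle(path[0], path[i], path[i + 1], centre))
                    stencil[static_cast<std::size_t>(y) * width + x] ^= 1; // 0->1, 1->0
            }
        }
    }
    return stencil;
}
```

With the hypothetical path A(0,0), B(8,0), C(0,7), D(2,3), point D lies inside triangle ABC, so a plain fan would paint the whole of ABC. In the simulated stencil, pixels covered by both ABC and ACD are toggled twice and end up 0, leaving 1 only inside the actual path. Pass 2 then fills a rectangle wherever the stencil holds 1.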

In addition to the stencil technique, we are also adding experimental support for triangulating QPainterPaths and caching the triangulation. While this is slower for paths which change often or are zoomed in & out, paths which are relatively static can be triangulated once and rendered multiple times without having to re-triangulate.

Filling Primitives

Now we know how all the different QPainter operations get turned into GL primitives, but we’re still missing how they get filled. As already mentioned, the colour of a pixel is determined by the fragment shader. We therefore have lots of different fragment shaders for different types of fill. However, we also need to support text rendering with arbitrary fills (QPainter lets you fill text with a perspective transformed radial gradient). In the future, we also want to support composition modes which OpenGL doesn’t provide. We’ve also found there are ways we can simplify the shaders for certain situations (and thus improve performance). The result is that Qt needs lots of different shaders. At last count, we’d need over 1000 different shaders to cover all situations. That’s a lot of GLSL to maintain and test, far more than the resources we have available. So instead we split the shaders into different interchangeable “stages”. This is achieved by having each stage in its own GLSL function. As an example, let’s take regular, non sub-pixel anti-aliased text rendering with a transformed radial gradient. Note, this is just an example to demonstrate how the engine operates and you probably shouldn’t do it in performance critical situations.

We render gradients by pre-calculating a 1px high texture (like a 1D texture) on the CPU which we sample from in the fragment shader. However, we calculate the texture coordinates in the vertex shader and pass them to the fragment shader as a varying. This is because it’s a good idea to do as much work as possible in the vertex shader rather than the fragment shader, as the vertex shader is run far less frequently.
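A minimal sketch of such a pre-calculated table, assuming a gradient with only two colour stops and nearest-neighbour sampling. `buildGradientTable`, `sampleGradient` and the table size are illustrative choices, not the engine's actual code or parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Colour { float r, g, b, a; };

// CPU side: fill a 1-pixel-high table by interpolating between two stops.
std::vector<Colour> buildGradientTable(Colour start, Colour end, std::size_t size) {
    assert(size >= 2);
    std::vector<Colour> table(size);
    for (std::size_t i = 0; i < size; ++i) {
        const float t = static_cast<float>(i) / static_cast<float>(size - 1);
        table[i] = Colour{start.r + (end.r - start.r) * t,
                          start.g + (end.g - start.g) * t,
                          start.b + (end.b - start.b) * t,
                          start.a + (end.a - start.a) * t};
    }
    return table;
}

// What the fragment shader's texture lookup amounts to: clamp t to [0, 1]
// and read the nearest table entry.
Colour sampleGradient(const std::vector<Colour> &table, float t) {
    assert(!table.empty());
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    const std::size_t index =
        static_cast<std::size_t>(t * static_cast<float>(table.size() - 1) + 0.5f);
    return table[index];
}
```

The per-pixel cost is then a single lookup, however many stops the gradient has; the interpolation work is paid once when the table is built.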

As already mentioned, we render (non sub-pixel) anti-aliased text by using an 8-bit mask texture. We then multiply the fragment colour by a sample taken from this mask. So, if we’re on the edge of a glyph where the alpha value is <1, we adjust the alpha of the srcPixel by that amount (actually, we also adjust the RGB values too as we use pre-multiplied alpha pixel format internally).
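The masking step is small enough to write out. The sketch below models it on the CPU, assuming colours are floats in [0, 1]; `Pixel`, `premultiply` and `applyMask` are hypothetical names that mirror the stage described above rather than the engine's GLSL.

```cpp
#include <cassert>

// A pixel in premultiplied-alpha form: r, g and b are already scaled by a.
struct Pixel { float r, g, b, a; };

// Convert a straight-alpha colour to premultiplied form.
Pixel premultiply(float r, float g, float b, float a) {
    return Pixel{r * a, g * a, b * a, a};
}

// Modulate a premultiplied pixel by the glyph coverage read from the mask.
// Every channel is scaled, not just alpha, so the pixel stays premultiplied.
Pixel applyMask(Pixel src, float coverage) {
    return Pixel{src.r * coverage, src.g * coverage, src.b * coverage, src.a * coverage};
}
```

On a glyph edge with coverage 0.5, an opaque orange pixel (1, 0.5, 0, 1) becomes (0.5, 0.25, 0, 0.5), which is the same colour at half opacity in premultiplied form.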

If there was a non-standard composition mode, we’d then pass the masked pixel to another stage which would blend it with the background (although this isn’t implemented yet!).

So, in the fragment shader, there are 3 different stages. The first stage (srcPixel) determines the brush colour of the fragment. The next stage (applyMask) modulates the pixel by a mask to achieve anti-aliased text rendering. The final stage (compose) then blends the pixel with the background. We also have a similar staging technique for the vertex shader. All this complexity is nicely abstracted by the QGLEngineShaderManager. The paint engine tells the shader manager what it wants to draw and the shader manager selects an appropriate set of shaders. One final note on this: While desktop OpenGL 2 supports linking multiple fragment shaders in a single program, OpenGL ES 2.0 does not. This means that we actually use the different stages by appending them to a single string of GLSL we pass to GL. This also gives the GL implementation the best chance to inline the different stages (without which, performance would suck).
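The string-appending approach can be sketched in a few lines. Everything in the block below is illustrative: the GLSL snippets, uniform names and `buildFragmentShader` are assumptions made for this example and are not the shader manager's real sources. The compose stage is left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical srcPixel stage for a solid colour brush.
const std::string solidSrcPixel =
    "uniform lowp vec4 fragmentColor;\n"
    "lowp vec4 srcPixel() { return fragmentColor; }\n";

// Hypothetical srcPixel stage for a gradient sampled from a lookup texture.
const std::string gradientSrcPixel =
    "uniform sampler2D brushTexture;\n"
    "varying mediump vec2 brushTextureCoords;\n"
    "lowp vec4 srcPixel() { return texture2D(brushTexture, brushTextureCoords); }\n";

// Hypothetical applyMask stage for 8-bit anti-aliased glyph masks.
const std::string textMask =
    "uniform sampler2D maskTexture;\n"
    "varying mediump vec2 maskCoords;\n"
    "lowp vec4 applyMask(lowp vec4 src) { return src * texture2D(maskTexture, maskCoords).a; }\n";

// Append the chosen stages to one source string and add a main() that chains
// them. An empty maskStage means "no mask".
std::string buildFragmentShader(const std::string &srcPixelStage,
                                const std::string &maskStage) {
    std::string source = srcPixelStage + maskStage;
    source += "void main() {\n";
    source += maskStage.empty() ? "    gl_FragColor = srcPixel();\n"
                                : "    gl_FragColor = applyMask(srcPixel());\n";
    source += "}\n";
    return source;
}
```

Calling `buildFragmentShader(gradientSrcPixel, textMask)` yields the gradient-filled text case from the example; swapping in `solidSrcPixel` or an empty mask gives other combinations without writing a new shader. Each slot multiplies the number of possible programs, which is how the count of hand-written shaders would grow so large.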

Texture Management

The OpenGL paint engine makes heavy use of textures. For example, even though it’s perfectly possible to calculate colours for gradients in the fragment shader, we still use a texture as a look-up-table as it is so much faster. Repeatedly uploading textures every time we need them would ruin performance. So instead, we keep a per-context cache of what QPixmap/QImage is already present in texture memory. If two contexts are sharing then we also detect this and don’t duplicate the textures. This functionality is available publicly in QGLContext::bindTexture() too.
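The caching idea can be modelled without any GL calls. In the sketch below, `TextureCache` is a hypothetical class, the integer key stands in for an image identity such as the value returned by QPixmap::cacheKey() (that the engine keys its cache this way is an assumption), and `upload` is a placeholder for the real texture upload.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

class TextureCache {
public:
    // Returns the texture id for an image, uploading only on first use.
    unsigned bind(std::int64_t imageKey) {
        std::map<std::int64_t, unsigned>::const_iterator it = m_textures.find(imageKey);
        if (it != m_textures.end())
            return it->second;             // already in texture memory
        const unsigned id = upload();
        m_textures[imageKey] = id;
        return id;
    }

    int uploadCount() const { return m_uploads; }

private:
    // Placeholder for the expensive step (a glTexImage2D call in real code).
    unsigned upload() {
        ++m_uploads;
        return m_nextId++;
    }

    std::map<std::int64_t, unsigned> m_textures;
    unsigned m_nextId = 1;
    int m_uploads = 0;
};
```

Binding the same image on every frame then costs a map lookup instead of an upload. A real cache would also need to evict entries and to drop them when the image is destroyed or modified.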

On Linux/X11 platforms which support it, Qt will use the GLX/EGL texture-from-pixmap extension. This means that if your QPixmap has a real X11 pixmap backend, we simply bind that X11 pixmap as a texture and avoid copying it. You will be using the X11 pixmap backend if the pixmap was created with QPixmap::fromX11Pixmap() or you’re using the “native” graphics system. Not only does this avoid overhead but it also allows you to write a composition manager or even a widget which shows previews of all your windows.

Antialiasing

The OpenGL paint engine uses OpenGL multisampling to provide anti-aliasing. Typically, this will be 4x/8x FSAA, meaning 4/8 levels of coverage, which is worse quality than the raster engine, which always uses 256 levels of coverage. However, as the DPI of modern displays increases, you can get away with lower-quality anti-aliasing.

Using multisampling also doesn’t affect text rendering, as text is anti-aliased using masks rather than multisampling (for smaller font sizes). So text rendered with the OpenGL engine should look almost as good as text rendered with the raster engine (which also does gamma-correction). The only drawback of using multisampling is that some OpenGL implementations don’t support switching multisampling off. Indeed, the OpenGL ES 2.0 specification doesn’t even provide the API to switch it off. The consequence is that non-anti-aliased (a.k.a. aliased) rendering can be broken (everything gets anti-aliased even when the QPainter::Antialiasing hint isn’t set). There’s little we can do about this. :-(

Clipping

QPainter supports setting an arbitrary clip, including complex QPainterPaths. Qt uses the GL stencil buffer (or more specifically the lower 7 bits of the stencil buffer) to store the clip. The clip is written to in the same way as we render any other primitive, even using the stencil technique for complex paths. However, instead of filling pixel colours into a colour buffer, we fill stencil values into the stencil buffer. The actual value we use depends on the current QPainter stack depth (how many times save() was called minus the number of times restore() was called). This means that if you restrict yourself to intersect clips (Qt::ClipOperation == Qt::IntersectClip), the engine only needs to write to the part of the stencil buffer which is being clipped to. What’s more, the engine doesn’t need to write to the stencil buffer at all when you call restore() - it just changes the value at which the stencil test passes.
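One way to picture this is a small software model. The sketch below is a simplified assumption about how such a scheme can work, not the engine's implementation: each intersect clip writes a fresh, larger value into the pixels that survive, and a pixel passes the test when its stored value is at least the current reference. `ClipStencil` is a hypothetical class, rectangles must lie inside the buffer, and the 7-bit limit (which would force a clear once values run out) is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class ClipStencil {
public:
    ClipStencil(int width, int height)
        : m_width(width), m_values(static_cast<std::size_t>(width) * height, 0) {}

    void save() { m_saved.push_back(m_current); }

    // restore() only changes the reference value; the buffer is untouched.
    void restore() {
        assert(!m_saved.empty());
        m_current = m_saved.back();
        m_saved.pop_back();
    }

    // Intersect the current clip with the rectangle [x0, x1) x [y0, y1).
    // Only pixels inside the rectangle are visited.
    void intersectRect(int x0, int y0, int x1, int y1) {
        const int value = ++m_max;          // larger than anything stored so far
        for (int y = y0; y < y1; ++y)
            for (int x = x0; x < x1; ++x)
                if (passes(x, y)) {
                    m_values[index(x, y)] = value;
                    ++m_writes;
                }
        m_current = value;
    }

    // The stencil test: drawing is allowed where the stored value >= reference.
    bool passes(int x, int y) const { return m_values[index(x, y)] >= m_current; }

    int writes() const { return m_writes; }

private:
    std::size_t index(int x, int y) const {
        return static_cast<std::size_t>(y) * m_width + x;
    }

    int m_width;
    std::vector<int> m_values;
    std::vector<int> m_saved;
    int m_current = 0;
    int m_max = 0;
    int m_writes = 0;
};
```

In this model a nested clip leaves the outer clip's values in place, so going back to the outer clip is just a matter of lowering the reference again. Values left behind by an abandoned inner clip are smaller than any later value, so they never pass a newer, deeper test.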

In addition to using the stencil buffer for clipping, the OpenGL paint engine can also just use glScissor. This only allows a single, untransformed rectangle to be used as the clip, which can be quite restrictive. However, it is by far the fastest way to do clipping. So if performance is more important to you than utility, only ever use untransformed rectangular clips.

Recommendations

Interleaved Rendering

Unlike OpenGL, QPainter allows an arbitrary number of rendering contexts (QPainters) to be active in the same thread at the same time. For example, in your widget’s paint event, you can begin a painter on your widget and begin another painter on a QPixmap and interleave rendering to them:

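void Widget::paintEvent(QPaintEvent*)
{
    QPainter widgetPainter(this);
    widgetPainter.fillRect(rect(), Qt::blue);

    // A second painter is opened while widgetPainter is still active.
    QPixmap pixmap(256, 256);
    QPainter pixmapPainter(&pixmap);
    pixmapPainter.drawPath(myPath);

    // Back to the first painter: drawPixmap() takes the pixmap by reference.
    widgetPainter.drawPixmap(0, 0, pixmap);
}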

While this works ok with the OpenGL graphics system, having to switch from doing something with one painter to doing something with a different painter can be very costly and should be avoided whenever possible.

Mixing QPainter and Native OpenGL

As shown in several examples, it is possible to mix your own OpenGL rendering code with QPainter rendering code. However, as OpenGL is a giant state machine, it is very easy for you to accidentally clobber Qt’s GL state and vice-versa. To overcome this, we’ve added some new API to QPainter in Qt 4.6 - QPainter::beginNativePainting() and QPainter::endNativePainting(). To prevent artifacts, you must enclose your custom painting in beginNativePainting() and endNativePainting(). This is very important - even if you’re not seeing any problems now, you might find your code starts failing in a future Qt release in which the GL paint engine works slightly differently. Also, as beginNativePainting() and endNativePainting() set lots of OpenGL state, they can be quite expensive and thus you should try to use them sparingly. Try to batch up all your custom OpenGL code in a single block.

QGLWidget vs OpenGL Graphics-System

Unlike the raster & OpenVG paint engines, you don’t have to use a specific graphics system to render widgets using the OpenGL paint engine. The QtOpenGL module provides several classes, including QGLWidget, which all use the OpenGL paint engine regardless of what graphics system is being used. QGLWidget is basically a regular widget which always has a native window ID and is always rendered to using OpenGL. You are free to choose whichever method you want to get OpenGL rendering (graphics system or QGLWidget). However, using the opengl graphics system can often be slower than using a QGLWidget, as Qt needs the contents of the “back buffer” (or QWindowSurface) to be preserved when flushing the render to the window system. OpenGL does not guarantee this and it is often not the case, so Qt has to use either an FBO or a PBuffer as the back buffer. When the render needs to be flushed, the FBO or PBuffer is bound to a texture, rendered into the window and then the GL buffers are swapped. This extra overhead is avoided by using a QGLWidget. As a consequence, however, it is not possible to redraw a sub-region of a QGLWidget: whenever a QGLWidget is updated, the entire widget must be re-drawn.

It should also be noted that using the OpenGL paint engine isn’t a silver bullet which makes everything faster. For example, the GL engine really sucks at drawing lots of small geometry with state changes between each drawing operation. While we’re working on improving that use case at the moment, the raster paint engine will probably always be faster just because it has so much less overhead. So QGLWidget might be a great way to get the best of both worlds when combined with the raster graphics system: use QGLWidget for operations which GL excels at and the raster engine for everything else.

Tips for Performance (fps)

As a general rule of thumb, OpenGL state changes are expensive. So, use the knowledge you now have of what’s going on under QPainter and try to minimise the number of OpenGL state changes the paint engine needs to do. For example, if you implement a virtual keyboard, you now know that the engine uses a shader for text rendering and a different shader for pixmaps, so draw all the key pixmaps first, then draw all the text on top. That way, the engine only needs to change shaders twice per frame.
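The virtual keyboard example can be made concrete with a small, Qt-free sketch. The `DrawKind` enum, the `DrawOp` struct and the helper functions are hypothetical names used only for illustration. They model the assumption that each switch between pixmap and text drawing costs one shader change, and are not the paint engine's actual bookkeeping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: every draw call needs one of two shaders.
enum class DrawKind { Pixmap, Text };

struct DrawOp {
    DrawKind kind;
    int keyIndex;  // which key the op belongs to (illustrative only)
};

// Counts how often the active shader has to change, including the
// initial bind for the first op.
inline int countShaderChanges(const std::vector<DrawOp>& ops) {
    int changes = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        if (i == 0 || ops[i].kind != ops[i - 1].kind)
            ++changes;
    }
    return changes;
}

// Naive order: background pixmap, then label text, for every key.
inline std::vector<DrawOp> interleaved(int keys) {
    std::vector<DrawOp> ops;
    for (int k = 0; k < keys; ++k) {
        ops.push_back({DrawKind::Pixmap, k});
        ops.push_back({DrawKind::Text, k});
    }
    return ops;
}

// Batched order: all pixmaps first, then all text. stable_partition
// keeps the relative order inside each group.
inline std::vector<DrawOp> batched(std::vector<DrawOp> ops) {
    std::stable_partition(ops.begin(), ops.end(),
        [](const DrawOp& op) { return op.kind == DrawKind::Pixmap; });
    return ops;
}
```

For a 30-key keyboard, the interleaved order switches shaders on every single op, while the batched order needs only the two switches mentioned above.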

  • Never, ever use anything other than intersecting clips
  • Don’t switch render target in the middle of a render
  • Try to use untransformed rectangular clips whenever possible
  • Minimise changing the brush wherever possible
  • Render batches of primitives of the same types together.
  • Avoid drawing translucent pixels & blending (particularly important on mobile GPUs)
  • Try to cache QPainterPaths and re-use them rather than creating & discarding them in your paintEvent
  • Use QPainterPaths even when there’s a QPainter convenience function. E.g. Rounded rects and ellipses.
  • If you’re drawing lots of small pixmaps, try bunching them up into a single, larger pixmap
  • Prefer to use power-of-two (2^n) widths & heights for QImages and QPixmaps (128×256, 256×256, 512×512, etc)
  • If using QGLWidget and don’t need anti-aliasing, don’t enable sample buffers in the QGLFormat
  • If rendering complex QPainterPaths, try to only use odd-even fill rule

Posted by Rhys Weatherley in Painting, Graphics, Performance on Monday, December 21, 2009 @ 04:34

In previous posts in this series, Gunnar has described the design and performance characteristics of the painting system in Qt, and explored Raster in greater depth.  In this post, I’m going to talk about the unique features of the OpenVG graphics system.

Paint Engine

Unlike the other engines, the OpenVG paint engine was much easier to implement because the OpenVG API itself is very close in functionality to QPainter.  You can read all about the specification on Khronos’ OpenVG Page, but here are the high points:

  • VGPath objects represent geometry made up of MoveTo, LineTo, and CubicTo elements.  This is a very close match for Qt’s QPainterPath and QVectorPath abstractions.
  • VGPaint objects represent brushes and pens for filling paths with pixel patterns.  Solid colors, linear gradients, radial gradients, and pattern brushes are supported, but not conical gradients.
  • VGImage objects represent pixmaps in a large variety of pixel formats.  OpenVG supports a lot more formats than OpenGL/ES which makes it a lot easier to convert QImage’s into VGImage’s.
  • VGFont objects (OpenVG 1.1 only) store glyphs represented as VGImage’s or VGPath’s for quicker rendering of text items.  Under OpenVG 1.0, we fall back to path drawing at present.
  • Scissor for rectangle-based clipping.
  • Alpha mask for clipping to arbitrary shapes.
  • Affine transformation matrices for path and glyph drawing, affine and projective transformation matrices for image drawing.

Transformation Matrices

OpenVG does not support projective transformation matrices for path drawing, which is annoying because QPainter allows any affine or projective QTransform to be used for any drawing operation.  There is a registered OpenVG extension called VG_NDS_projective_geometry but none of the OpenVG engines we have come across support it.  The reason why OpenVG doesn’t support it is because generating paint pixels in perspective can be quite difficult.  Projective matrices are supported for image drawing because drawing a simple image in perspective is a well-understood problem that OpenGL/ES systems do all the time.

When a projective transformation matrix is used for path drawing, we convert the path point-by-point using the QTransform and then draw it as a normal affine path using a default transformation for the window surface.  But what about the paint pixels?  Unfortunately, they won’t be perspective-correct.  In practice this isn’t a big problem because most paths are drawn with a solid color brush, and a solid color looks the same in perspective.
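To illustrate what converting a path point-by-point means, here is a minimal standalone sketch. `Point`, `Matrix3`, `mapPoint()` and `mapPath()` are hypothetical names, not the code in qpaintengine_vg.cpp, and the path is simplified to a plain list of points. The element naming and the formula follow the convention given in the QTransform documentation (x' = m11*x + m21*y + m31, and so on, followed by a division by w' for projective matrices).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Point { double x, y; };

// Hypothetical 3x3 matrix using QTransform-style element names.
struct Matrix3 {
    double m11, m12, m13;
    double m21, m22, m23;
    double m31, m32, m33;
};

// Maps one point, including the perspective divide by w.
inline Point mapPoint(const Matrix3& m, Point p) {
    const double x = m.m11 * p.x + m.m21 * p.y + m.m31;
    const double y = m.m12 * p.x + m.m22 * p.y + m.m32;
    const double w = m.m13 * p.x + m.m23 * p.y + m.m33;
    assert(w != 0.0);  // points that map to infinity are not handled here
    return { x / w, y / w };
}

// Transforms every point up front, so the result can be drawn with a
// plain affine (or identity) matrix afterwards.
inline std::vector<Point> mapPath(const Matrix3& m,
                                  const std::vector<Point>& path) {
    std::vector<Point> out;
    out.reserve(path.size());
    for (const Point& p : path)
        out.push_back(mapPoint(m, p));
    return out;
}
```

Only the geometry is transformed this way; any gradient or pattern used to fill the result is still generated without perspective, which is the limitation described above.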

In general however, we discourage people from using projective transformations with paths.  If you really want to draw a scene in perspective, first draw it into a QPixmap and then draw the pixmap using a projective transformation.  You’ll probably want to do this anyway because perspective transformations mostly occur during “flip” animation effects - drawing every tiny path in perspective every frame during the flip would be too slow.

Path Transformation and Drawing

Most of the path transformation logic is done in vectorPathToVGPath() and painterPathToVGPath() in qpaintengine_vg.cpp. We detect the presence of affine vs projective transformation matrices and use an appropriate conversion. We convert both QVectorPath and QPainterPath using specialized routines. The other paint engines typically convert everything into a QVectorPath first. The QPainterPath conversion can improve performance slightly when arbitrary paths are drawn during SVG rendering and the like - there’s no point creating a QVectorPath if it is going to be quickly thrown away.

Path drawing takes a lazy update approach, attempting to minimize the number of OpenVG state changes from request to request:

  • If the draw requires a pen, then the penPaint object is updated with the current QPen if it was different from last time.
  • If the draw requires a brush, then the brushPaint object is updated with the current QBrush if it was different from last time.
  • The path transformation matrix is updated if it has changed since the last path drawing operation.
  • The path is drawn using vgDrawPath().

Most of the OpenVG state persists across paint events so if the same pen is used from one frame to the next, then it will be set once and never changed.  The state is also shared between all windows because there is only one OpenVG context for the entire system.

In an earlier version of the OpenVG paint engine, I just uploaded the state changes whenever they were made without trying to be lazy about it.  That was a mistake!  Applications that use QPainter, particularly those using QGraphicsView, can be very chatty - constantly saving and restoring the painter state.  It was quite common for brushes, pens, and transformation matrices to be changed, then changed again, without anything being drawn.  Now, it will only update the OpenVG state at the point where an actual drawing operation is about to happen.  This house-keeping does have a cost though, so if you can avoid unnecessary QPainter state changes in your application, then please do so.
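The lazy approach can be sketched in a few lines. This is a simplified model, not the engine's code: `Backend` and `LazyPainter` are hypothetical classes, a pen is reduced to an integer id, and the backend merely counts how many expensive uploads it receives.

```cpp
#include <cassert>

// Hypothetical stand-in for the OpenVG side; counts expensive uploads.
struct Backend {
    int currentPen = -1;  // pen the backend was last told about
    int penUploads = 0;
    int drawCalls = 0;

    void uploadPen(int pen) { currentPen = pen; ++penUploads; }
    void drawPath() { ++drawCalls; }
};

class LazyPainter {
public:
    explicit LazyPainter(Backend& backend) : m_backend(backend) {}

    // Cheap: only records what the application asked for.
    void setPen(int pen) { m_requestedPen = pen; }

    // State reaches the backend only when something is actually drawn,
    // and only if it differs from what the backend already has.
    void drawPath() {
        if (m_requestedPen != m_backend.currentPen)
            m_backend.uploadPen(m_requestedPen);
        m_backend.drawPath();
    }

private:
    Backend& m_backend;
    int m_requestedPen = -1;
};
```

A chatty caller that sets pen 1, then 2, then 1 again before drawing, and later flips to pen 3 and back to 1 before the next draw, causes a single upload for two draw calls in this model.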

Preallocated Paths

Rectangles, lines, points, and rounded rectangles feature quite heavily in many applications, with constantly changing co-ordinates.  OpenVG makes us create and destroy a VGPath every time.  To alleviate this, we’ve provided some pre-allocated paths for simple drawing operations, which we update with vgModifyPathCoords() rather than allocate GPU memory for a new path.  However, some chipsets can be slower at modifying a path than just making a new one!  On those chipsets, compile Qt with the QVG_NO_MODIFY_PATH macro.

Image Drawing

The best image drawing will be achieved with QPixmap rather than QImage.  With QPixmap, the image is converted into a VGImage once and then drawn multiple times.  With QImage, the image must be converted into a VGImage every time it is drawn.

The OpenVG drawing primitive vgDrawImage() is very primitive - it draws the selected VGImage at the origin of the current transformation.  There is no in-built support for sub-rectangle drawing. Fortunately, OpenVG has vgChildImage() which allows a sub-region to be quickly extracted, with the pixel data shared with the parent.  However, “quickly” is a very relative term - I’ve seen frame rates almost halve when using vgChildImage() compared to drawing a full image.  So if you can, draw entire QPixmap’s when using OpenVG and limit the use of sub-rectangles.

Another source of slowdown is drawing images with opacity.  OpenVG has a way to multiply a VGPaint object with a VGImage to produce a destination image.  This is a very cheap way to achieve opacity effects and is quite fast.  Except!  And there is always an Except!  Except when the image is drawn with a projective transformation matrix.  Remember - paint pixels cannot be generated in perspective - so we cannot use a paint object to generate an “opacity color” even though the solid opacity color will be the same in perspective from all angles!  This is very annoying - the OpenVG committee could have made a special exception for solid color VGPaint objects.

When the OpenVG paint engine draws an image with opacity, and a projective transformation is in effect, we have to generate a copy of the VGImage and use vgColorMatrix() to adjust the opacity.  This isn’t too bad if you are drawing the same image over and over with the same opacity, but it is very inefficient if you are animating the opacity.  So avoid opacity animations with OpenVG if you can.

Painting into a QPixmap isn’t currently accelerated with OpenVG - it uses the raster paint engine instead, so we recommend painting pixmaps once rather than constantly updating them.  We will be addressing this in future versions.  Even when we do implement painting into a pixmap, there will be a cost: switching rendering surfaces from a window to a VGImage and back again is not cheap - on some chipsets it can be as heavy as a full EGL context switch.  So try to avoid switching painting surfaces if you can.

Clipping

Clipping is the bane of my existence!  It seems so easy to application writers - set a clip rectangle and it will be efficient, drawing fewer pixels!  If only!

There are three techniques that can be used to achieve clipping with OpenVG:

  • Scissor rectangle list.
  • Alpha mask for arbitrary clip shapes.
  • Scissor rectangle for simple clips and alpha mask for complex clips.

The last is the default in the OpenVG paint engine, and there are #define’s that can be used to enable the other modes.  However, on some PowerVR chipset versions there is a bug where if the scissor is combined with the alpha mask, performance drops off a cliff - down to 2 frames per second in some cases!  So on such devices you may want to turn on scissor-only or mask-only clipping.

Better still is to not use clipping at all if you can avoid it.  Draw everything in your scene in bottom-up order and let the GPU do the heavy lifting.  Remember, modern OpenGL/ES GPUs can crank out thousands of triangles per second, with clever algorithms for hidden-surface removal that are much cleverer than anything you can do by setting a clip.  OpenVG uses the same GPU in many cases.  If you set a clip, you may end up confusing the GPU into taking a slower path internally than it would otherwise.

If you must clip, try to use single-rectangle regions that can be set via the scissor.

Window Surfaces

Below the OpenVG paint engine is the window surface logic in the graphics system.  This is usually where platform-specific customizations are required to get pixels onto the screen as fast as possible.  The QVGWindowSurface class wraps a QVGEGLWindowSurfacePrivate object, which provides the heavy lifting.  The default EGL implementation is QVGEGLWindowSurfaceDirect which writes pixels into the window back buffer and calls eglSwapBuffers() to transfer it to the screen.  It is possible to enable single-buffered operation with QVG_DIRECT_TO_WINDOW, but the cost may be tearing artifacts on-screen.

If your platform has some clever EGL extension mechanism for getting pixels onto the screen, then you will need to write a new graphics system plugin and implement your own QVGEGLWindowSurfacePrivate subclass.  The QtOpenVG module has been structured to make it relatively easy to do this without touching the core Qt code.

Memory Usage

Everything you do with graphics uses memory - in the CPU and in the GPU.  Window surfaces, VG rendering contexts, VGPath objects, VGImage objects, and so on.  It can get quite tight in the GPU on embedded systems.  We’ve taken some steps to manage this; e.g. destroying older VGImage objects when trying to upload a new QPixmap, and destroying all OpenVG objects when an application goes into the background to free up memory for foreground applications.

The more complex your application, the more likely it is that you’ll hit the GPU memory limit.  There’s only so much the QtOpenVG module can do for you.  We can take emergency measures to recover, but that’s about it.  So keep an eye on how many pixmaps and windows you have in use and see if you can simplify your application a little.  Definitely avoid uploading very large jpeg photographs as a single QPixmap - split them up into smaller “tiles” that can be released when GPU memory gets tight.
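Splitting a large image into tiles mostly comes down to computing the tile rectangles. The sketch below is one way to do that; `TileRect`, `splitIntoTiles()` and the tile size of 256 used in the example are assumptions for illustration, not something the QtOpenVG module provides.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical rectangle type; in Qt code this would map onto a QRect.
struct TileRect { int x, y, width, height; };

// Covers an imageWidth x imageHeight area with tiles of at most
// tileSize x tileSize pixels. Tiles on the right and bottom edges are
// clamped so that they never extend past the image.
inline std::vector<TileRect> splitIntoTiles(int imageWidth, int imageHeight,
                                            int tileSize) {
    assert(tileSize > 0);
    std::vector<TileRect> tiles;
    for (int y = 0; y < imageHeight; y += tileSize) {
        for (int x = 0; x < imageWidth; x += tileSize) {
            tiles.push_back({x, y,
                             std::min(tileSize, imageWidth - x),
                             std::min(tileSize, imageHeight - y)});
        }
    }
    return tiles;
}
```

A 1000×600 photograph split with a tile size of 256 yields 4 columns and 3 rows of tiles. Each rectangle could then be cut out of the source image, for example with QImage::copy(), and turned into its own QPixmap so that tiles can be dropped individually.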

Summary

The following tips summarise the performance suggestions from the previous section:

  • Avoid projective transformation matrices when drawing paths.
  • Minimize state changes on pens, brushes, transforms, etc.
  • Use QPixmap in preference to QImage where possible.
  • Avoid drawing images using sub-rectangles.
  • When drawing images with opacity, use an affine transformation matrix, or only a single opacity level.
  • Avoid switching painting surfaces, particularly between windows and pixmaps.
  • Don’t use clipping if you can paint your scene in bottom-up order instead.
  • Split large images up into smaller pieces to avoid overloading GPU memory.

What’s Next?

There’s always more that can be done to improve any software system.  QtOpenVG is no different:

  • Painting into QPixmap’s using OpenVG.
  • Smarter VGImage pooling to deal with out of GPU memory conditions.
  • Qt/Embedded and Lighthouse screen drivers.

Posted by gunnar in Painting, Graphics Dojo, Performance on Friday, December 18, 2009 @ 09:21

Today’s topic is the raster engine, Qt’s software rasterizer. It’s the reference implementation and the only paint engine that implements all possible feature combinations that QPainter offers.

History

The story of Qt’s software engine started around December 2004, if my memory serves me. My colleague Trond and I had been working for a while on the new painting architecture for Qt 4, codenamed “Arthur”. Trond had been working on the X11 and OpenGL 1.x engines and I was focusing on the combined Win32 GDI/GDI+ engine along with QPainter and surrounding APIs. We had introduced a few new features, such as antialiasing, alpha transparency for QColor, full world transformation support and linear gradients. As few of these new features were supported by GDI, it meant that using any of these features implied switching to GDI+, which at the time was insanely slow, at least on all the machines we had in the Oslo office back then. Actually, enabling the GDI advanced graphics mode to do transformations was also not very fast.

Then we came upon this toolkit called Anti-Grain Geometry (AGG) which did everything in software, in plain C++, and we were just amazed at what it could do. Our immediate reaction was to curl up on the floor in agony, thinking that we were going about this all wrong. Using these native APIs was not helping us at all. In fact it was preventing us from getting the feature set we wanted with a performance that was acceptable. Once we settled down again, our first idea was to try to implement a custom AGG paint engine which would just delegate all drawing into the AGG pipeline. But alas, the template nature of the AGG API combined with the extremely generic QPainter API bloated up into a pipeline that didn’t perform nearly as well as the demos we had seen.

So we took our Christmas vacation and started over in January of 2005. Still quite depressed over the new feature set that didn’t perform combined with being limited by a minimal subset of native API’s, I went to Matthias and Lars and asked if I could get three weeks of time to hack together a software only paint engine as a proof of concept. I got an “OK” and spent the following weeks implementing software pixmap transformation, bi-linear filtering, clipping support in the crudest possible way and three weeks later I had a running software paint engine and quite proudly announced that I was “just about done”. I’ve reconstructed an image of how I remember it:

groupboxes

The system clipping was all over the place, bitmap patterns were broken, but perhaps worst of all, all text was rendered using QPainterPaths, and all drawing was antialiased. Despite it not looking 100% good, the performance of the various features was pretty ok. It was agreed that this was a good start, but that we needed a bit more work. And so started the sprint for the Qt 4.0 beta a few months later.

The initial version that was released with Qt 4.0 worked quite well in terms of features, but in hindsight the performance was far from what our users demanded from Qt. As a result, we harvested a lot of criticism over the first year of Qt 4.0. Since then, we’ve done a lot, and I mean a LOT, and my gut feeling is that it is the engine that performs the best for average Qt usage, so I think we made a good choice back then in dropping GDI and GDI+. And, as I outlined in my previous post, we are toying with making raster the default across all desktop systems for the sake of speed and consistency.

Overall structure

The overall structure of the engine is that all drawing is decomposed into horizontal bands with a coverage value, called spans. Many spans will together form the “mask” for a shape and each pixel that is inside the mask is filled using a span function.

antialiasing

The image highlights one scanline of a polygon which is filled with a linear gradient. There are 4 spans: one which fades in the opacity of the polygon, one which covers its interior at full opacity, and two which fade the opacity out again. For each pixel in the polygon, the gradient function is called and we write the pixel to the destination, possibly alpha blending it, if the coverage value is other than full opacity or if the pixel we got from the gradient function contains alpha.
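As a rough sketch of the idea, here is a span list and a span function for a solid fill. It is heavily simplified: the `Span` and `Surface` types are hypothetical, the destination is 8-bit grayscale rather than one of the real image formats, and the source is a single solid value instead of a gradient.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One horizontal run of pixels that share a coverage value (0..255).
struct Span {
    int x;
    int y;
    int length;
    std::uint8_t coverage;
};

// Simplified 8-bit grayscale destination, initially all zero.
struct Surface {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;

    Surface(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w * h)) {}

    std::uint8_t& at(int x, int y) { return pixels[y * width + x]; }
};

// Span function for a solid fill: full coverage is a plain write,
// partial coverage blends the color with what is already there.
inline void fillSolidSpans(Surface& dst, const std::vector<Span>& spans,
                           std::uint8_t color) {
    for (const Span& s : spans) {
        for (int i = 0; i < s.length; ++i) {
            std::uint8_t& d = dst.at(s.x + i, s.y);
            if (s.coverage == 255) {
                d = color;
            } else {
                const int c = s.coverage;
                d = static_cast<std::uint8_t>(
                    (color * c + d * (255 - c) + 127) / 255);
            }
        }
    }
}
```

A scanline with one fade-in span, one fully covered span and two fade-out spans, like the one in the image, would be passed in as four `Span` entries with coverage values such as 64, 255, 128 and 32.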

Clipping also uses the same mechanism. The span function for clipping takes the incoming spans, intersects them with the set of spans that defines the clip and calls the actual filling span function.
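The intersection step could look roughly like the sketch below. The `Span` struct and `clipSpans()` are hypothetical, the nested loop is deliberately naive, and multiplying the two coverage values is an assumption about how partially covered clip edges might be combined. The result is what would be handed on to the filling span function.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Span {
    int x;
    int y;
    int length;
    unsigned char coverage;  // 0..255
};

// Intersects every incoming span with every clip span on the same
// scanline and returns the overlapping parts.
inline std::vector<Span> clipSpans(const std::vector<Span>& spans,
                                   const std::vector<Span>& clip) {
    std::vector<Span> out;
    for (const Span& s : spans) {
        for (const Span& c : clip) {
            if (c.y != s.y)
                continue;
            const int left = std::max(s.x, c.x);
            const int right = std::min(s.x + s.length, c.x + c.length);
            if (left < right) {
                // Assumption: coverage of shape and clip multiply.
                const unsigned char cov = static_cast<unsigned char>(
                    (s.coverage * c.coverage + 127) / 255);
                out.push_back({left, s.y, right - left, cov});
            }
        }
    }
    return out;
}
```

A real implementation would presumably exploit the fact that spans are sorted by scanline instead of comparing every pair.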

clipspans

All operations follow this pattern. When a drawRect call comes in, we generate a list of spans for each scan line and set up a span function according to the current brush. A pixmap is similar: we create a list of spans and use a pixmap span function. A polygon is passed to a scan converter which produces a span list, etc. We have two scan converters, one for antialiased and one for aliased drawing. The antialiased one is pretty much a fork of FreeType’s grayraster.c, with some minor tweaks; I think we needed to add support for odd-even fills, for instance. Text is also converted into spans.

Lines, Polylines and Path Strokes

These primitives are passed to a separate processor called a stroker. The stroker creates a new path that visually matches the fillable shape that the outline represents. There is a public API for this too, in QPainterPathStroker. This fillable shape is then passed to one of the scan converters which in turn scan converts the shape into spans. For dashed outlines, the same process happens, and the resulting fillable shape is a path with a potentially very large number of subpaths. Naturally, such a path is costly to scan convert, which is part of the reason why we explicitly do not put dashed lines on the list of high-performance features. In fact, in many cases, line dashing is one of the slowest operations available in the raster engine, so use it with extreme caution.

A hacky alternative, which performs much better, is to set a 2×2 black/white or black/transparent pixmap brush and draw the stroke using a pen with that brush. A bit more to set up, but if that’s what it takes to get it running fast, then that’s what it takes.

State changes

Any setBrush, setTransform or any other state change on QPainter will result in a different set of span functions being set up. Each brush (or fill type, if you like, as pens on this level are essentially just fills too) has a special span function associated with it, and we also pass per-brush span data. For solid color fills the span data contains the color; for transformed pixmap drawing it contains the inverse matrix, a source pixel pointer, bytes per line and other required information. For clips it contains the span function to call after the spans have been clipped. The thing to notice about state changes is that each time you switch from one brush to another brush or from one transformation to another, these structures do need to be updated. Up to Qt 4.4, this was in many cases a noticeable performance problem, bubbling up to 10-15% in profilers when rendering graphics view scenes, but since 4.5 the impact of this is minimal.

Well, perhaps not minimal compared to drawing a 2 pixel long line, but minimal compared to filling a 64×64 rectangle. The point is that though the raster engine is probably the engine that handles state changes best of all our engines, there are some use cases where the cost still shows up, so state changes should still be minimized.

Span functions

The task of the span functions is to generate a pixel and combine it with the destination according to the current state of the painter. Though the raster engine supports rendering to any of our image formats except 8-bit indexed, it will internally do all rendering in ARGB32_Premultiplied. Premultiplied alpha has the benefit that we don’t have to multiply the alpha into the color channels and it saves us a division in the blending. The reason for doing all rendering in one format is that the alternative simply doesn’t scale. Just think of the combination of composition modes multiplied with the number of image formats a source image can have multiplied with what formats the destination can have. To support all combinations we have a generic approach where we for each span do:

  • Get the source pixels, e.g. from a gradient, pixmap, image or solid color, and convert them to ARGB32_Premultiplied.
  • Get the destination pixels and convert them to ARGB32_Premultiplied.
  • Blend the source into the destination using the current composition mode.
  • Convert the result to destination format and write it back.
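To make the middle steps more tangible, here is a minimal sketch of a conversion to premultiplied alpha and a SourceOver blend on 0xAARRGGBB pixels. The helper names `mul8()`, `premultiply()` and `sourceOver()` are made up for this example and this is not the code in the raster engine. It does show why the premultiplied format is convenient: the blend is a multiply and an add per channel, with no division by alpha.

```cpp
#include <cassert>
#include <cstdint>

// Multiplies two 8-bit values where 255 means 1.0, with rounding and
// without an actual division.
inline std::uint32_t mul8(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

// Converts a non-premultiplied 0xAARRGGBB pixel to premultiplied form
// by scaling each color channel with the alpha.
inline std::uint32_t premultiply(std::uint32_t argb) {
    const std::uint32_t a = argb >> 24;
    const std::uint32_t r = mul8((argb >> 16) & 0xff, a);
    const std::uint32_t g = mul8((argb >> 8) & 0xff, a);
    const std::uint32_t b = mul8(argb & 0xff, a);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

// SourceOver on premultiplied pixels:
// result = src + dst * (1 - srcAlpha), applied to all four channels.
inline std::uint32_t sourceOver(std::uint32_t src, std::uint32_t dst) {
    const std::uint32_t inverseAlpha = 255 - (src >> 24);
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const std::uint32_t s = (src >> shift) & 0xff;
        const std::uint32_t d = (dst >> shift) & 0xff;
        std::uint32_t c = s + mul8(d, inverseAlpha);
        if (c > 255)
            c = 255;  // guards against input that is not truly premultiplied
        result |= c << shift;
    }
    return result;
}
```

With an opaque source the inverse alpha is zero and the destination contributes nothing, which hints at why a fully opaque fill can be treated as a plain memory write.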

This may seem like a lot of work, so luckily the story doesn’t end there.

Special casing and Optimizations

As I outlined in the QPainter documentation patch that I added recently, which was the start of this blog series, it’s all about defining which scenarios we want to be fast and which scenarios we just need working. Over the years since the initial release of the raster engine in the summer of 2005, we’ve added tons of special cases to support what we experience as the functions that are called the most and which have the most impact.

  • First of all, if you look at the things we do for each span above, you see that we convert into ARGB32_Premultiplied. Solid colors are easy to represent, gradients are generated in this format directly, so conversion only happens for images and pixmaps. If the image is ARGB32_Premultiplied, then no conversion is needed, and we just use the scanline pointer directly, without any copying. Our RGB32 format is specified to be 0xffRRGGBB, with the alpha set to 0xff. This means it is pixel-wise compatible with ARGB32_Premultiplied, which again means that it can also be used directly. If the source is ARGB32, you’ll get a memcpy for each scanline where the ARGB32 data is copied into a temporary buffer and converted to ARGB32_Premultiplied. What can you read from that? Do not draw ARGB32 images into the raster engine. Secondly, don’t open a painter on an ARGB32 image, as that implies exactly the same, but when reading and writing the destination pixels. Now you know why QPixmaps prefer to be in these formats too.
  • Source composition modes are special cased for most operations. For instance, we don’t read the destination for source operations because we know there is no blending involved, unless the spans have partial coverage that is. This means that Source is effectively just a memory write.
  • SourceOver is usually special cased to be inlined and merged with the coverage opacity, so it is also usually faster than the other composition modes. As for the other optimizations down below, these only hold for Source and SourceOver, so if you want best performance, make sure that this is what you are using. SourceOver is the default in QPainter, by the way.
  • For gradients and pixmaps, we need to create an array of source data. For solid colors, it’s just a single pixel, so this is faster. A solid color source also benefits from only having to traverse memory for the destination, where you write to, so cache misses are significantly reduced.
  • Rectangle fills are very common, both through QPainter::fillRect and through QPainter::drawRect. In 4.4 both of these implied a state change. Actually, fillRect implied two state changes because it set the brush to what was passed to fillRect and then set it back to what the painter state was. In 4.5, as part of this Falcon project, we introduced a new internal QPaintEngine subclass which supports a state-less fillRect with a color. This matches how applications normally use the painter anyway.
  • In addition to being stateless, the fillRect function is special cased for a number of use cases. For instance, for RGB16, we write two pixels at a time, and for Intel machines there is an SSE/MMX optimized version. The special cased fillRect also has the benefit that it doesn’t require spans; it’s just a tight 2D for loop, which also saves us quite a bit of work, at least if the spans are short.
  • Duff’s Device. I cannot take credit for its addition, but it’s used in a lot of different places in the raster engine today. It’s all about loop unrolling. If you’re not familiar with it yet, read up on it. It’s a beautiful abuse of the C++ language to make things potentially faster.
  • Rectangular clipping is also special cased, at least as long as there is no transformation set on the painter. Translate is of course special cased, but scaling and rotating disable this optimization. The benefit we get from doing rectangular clipping is that finding the spans to fill is done on the QRect level, rather than on the per-span level, which makes it significantly faster.
  • So if you have Source or SourceOver, a non-perspective, non-smooth transform and the clip is a rectangular clip, you also get the benefit of our pixmap blend functions. These were added in Qt 4.5 and are the reason why pixmap drawing is quite a bit faster now than in the earlier versions. In Qt 4.5, we had blend functions for scale and translate only, and in Qt 4.6 we added rotations to the list as well. Again, we focus on a selected subset of formats matching what QPixmap will be using; we only have these for:
    • ARGB32_Premultiplied on ARGB32_Premultiplied
    • ARGB32_Premultiplied on RGB32
    • ARGB32_Premultiplied on RGB16
    • ARGB8565_Premultiplied on RGB16
    • RGB32 on RGB32
    • RGB16 on RGB16

    I think that was all of them.

  • The outlines are processed via the stroker in the general case. However, there are again a number of special cases where we drop to doing a midpoint-algorithm instead. Lines, polylines and paths that only contain line segments will be rendered using the fast midpoint approach as long as the pen width is equal to or less than 1. We also support dashing line segments for 1 pixel wide lines using this method. For any pen width greater than 1, curved paths or antialiasing, we drop to the stroker approach which works, but is far less optimal. Actually, I think there is a special-case for antialiased dashed lines too, as long as they are thin.
  • When antialiasing is enabled, we often need to fall back to the stroker for outlines, which is quite a bit slower than the plain case. In addition to that, a lot more spans are generated for antialiased content, due to the fade-in, fade-out effect on the edge of the primitive, so expect antialiasing to be a significant cost.
  • Text drawing is since 4.5 highly optimized for most engines, to the point where the major bottleneck these days is in doing the actual text layout on the string. We’re working on an API to cache this, so text drawing can be made truly fast, but based on the current API, it’s as good as it gets. However, if the transformation is a rotate/scale, then we fall back to path drawing. Only the Windows version of the raster engine supports drawing glyphs at rotated angles using the fast paths, so beware of that.

A lot of details, but it gives an idea of what to consider when you write code for this engine. If all you are drawing is 1024×1024 pixmaps, then none of these things matter because all the time is spent in the span function that does pixmap blending anyway, but the second you have more content, several lines, several polygons, which are smaller in size, then these things are critical to achieve good performance.

The overall performance of the engine, when used according to how it’s outlined above, can be thought of as:

Overhead + O(pixelsTouched * memoryAndBusCapacity)

There is nothing scientific about that formula, but when you’re hitting the optimal path, all time should be spent in one of the many for loops inside qdrawhelper_xxx.cpp or, even better, qblendfunctions.cpp. These loops will spend all their time on per pixel processing. If these functions could be made faster by doing the algorithms slightly differently, then great, but if you see in your profiling that all time is spent in, for instance, qt_blend_argb32_on_argb32, then that means you told us to blend alpha pixmaps together and we’re doing that as fast as we can and you have zero loss between your app and actual processing. If all time is spent processing pixels, then that is a good thing. The overhead here is the time spent in state changes, function call overhead, and similar.
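Since Duff's device came up in the list of optimizations, here is what the trick looks like when applied to a per-pixel fill loop. This is a generic illustration with a made-up function name, `fillDuff()`, and not a copy of anything in qdrawhelper. The switch jumps into the middle of an 8-way unrolled do/while loop to handle the remainder, after which the loop runs in whole blocks of eight.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fills `count` pixels starting at `dest` with `color`.
inline void fillDuff(std::uint32_t* dest, std::size_t count,
                     std::uint32_t color) {
    if (count == 0)
        return;  // the do/while below assumes at least one pixel

    std::size_t passes = (count + 7) / 8;  // number of loop iterations
    switch (count % 8) {                   // jump in to handle the remainder
    case 0: do { *dest++ = color;
    case 7:      *dest++ = color;
    case 6:      *dest++ = color;
    case 5:      *dest++ = color;
    case 4:      *dest++ = color;
    case 3:      *dest++ = color;
    case 2:      *dest++ = color;
    case 1:      *dest++ = color;
            } while (--passes > 0);
    }
}
```

For 19 pixels, the first pass writes 3 pixels and the two remaining passes write 8 each. Whether this beats a plain loop depends on the compiler and the platform, as optimizers may unroll simple loops on their own, so it is worth measuring.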

    Some numbers

    I got some feedback on one of the previous blogs that a few bar charts would be nice, so I’ll post some numbers on what kind of throughput is possible with the raster paint engine. I’ve timed it on both my Windows desktop machine and on my N900 to get a comparison. The operations range from several million pr second to only a few hundred so the scale is logarithmic, keep that in mind as you look at them.

    Raster Results

    As you can see, the fill-rate is more or less tied to the number of pixels involved. For some operations it takes a little bit longer to do something, like drawPixmap with scaling is somewhat slower than drawPixmap without, but you see that the rough formula I gave above holds quite often. Double the size of the primitive in each direction and you have one quarter the performance. It was also not my intention to trick you with using different numbers for drawPixmap, its just how the test was set up.

    If you compare the three 4×4 rectangle drawing versions, you see that they differ when the rectangles are small. drawRect without brush change is fastest at around 7.4Mops/sec, followed by fillRect at ~6.1Mops/sec and then drawRect with brush change at 1.8Mops/sec. At 128×128 there is just a little difference between the two, which is what I was getting at with the state changes above. It is possible to do them and if you’re drawing semi-large areas, it doesn’t matter, but if you’re plotting pixels, doing loads of small lines here and there or particle effects with 8×8 pixmaps, then you want to do that in a tight loop with nothing else happening.

    You can also see that the speed of non-smooth scaling is holding its own vs non-scaled pixmap drawing.

    Finally, if you compare the N900 to the desktop Windows machine, you see that despite the Windows machine only having a 4 times faster processor, the N900 is often around 10 times slower. Why? Because the CPU isn’t the only limitation; bus/memory capacity is also a limiting factor, and, to be honest, it’s not a fair comparison…

    I hope you enjoyed this post and more will come in 2010.

    Posted by gunnar
     in Painting, Graphics Dojo, OpenGL, Performance
     on Wednesday, December 16, 2009 @ 06:54

    For this blog series that I’m doing, I figure it’s nice to start with an overview of the whole painter, pixmaps, widgets, graphicsview, backingstore idea.

    At the centre of all Qt graphics is the QPainter class. It can render to surfaces through the QPaintDevice class. Examples of paint devices are QImages, QPixmaps and QWidgets. The way it works is that for a given QPaintDevice implementation we return a custom paint engine which supports rendering to that surface. This is all part of our documentation so perhaps not too interesting. Let’s look at this in more detail.

    QWidgets and QWindowSurface

    Even though QWidget is a QPaintDevice subclass, one will never render directly into a QWidget’s surface. Instead, during the paintEvent, the painting is redirected to an offscreen surface which is represented by the internal class QWindowSurface. This was traditionally implemented using QPainter::setRedirected(), but has since been replaced by an internal mechanism between QPainter and QWidget which is slightly more optimal.

    Sometimes we refer to this surface as “the backingstore”, but it really is just a 2D surface. If you ever looked through the Qt source code and found a class QWidgetBackingStore, this class is responsible for figuring out which parts of the window surface need to be updated prior to showing it on screen, so it’s really a repaint manager. When the concept of backingstore was introduced in Qt 4.1, the two classes were the same, but the introduction of more varying ways to get content to screen made us split it in two.

    In the old days widgets were rendered “on screen”. Though the option to paint on screen is still available, it is not recommended to use it. I believe the only system that remotely supports it is X11, but it is more or less untested and thus often causes artifacts in the more complex styles. Setting the flag Qt::WA_PaintOnScreen means that the repaint manager inside Qt ignores that widget when repainting the window surface and instead sends a special paintEvent to that widget only. Prior to Qt 4.5 there was a significant speed gain to be had when 10-100 widgets updated at max fps, but in Qt 4.5 the repaint manager was optimized to handle this better, so on-screen painting is usually worse than buffered painting.

    Back to the window surface. All widgets are composited into the window surface top to bottom and the top-level widget will fill the surface with its background, or with transparent pixels if the Qt::WA_TranslucentBackground attribute is set. All other widgets are considered transparent. A label only draws a bit of text, but doesn’t touch anything else. What that means for the repaint manager is that every widget that overlaps with the label, but stacks behind it, needs to be drawn before it. If the application knows that a certain widget is opaque and will draw every single pixel for every paint event, then one should set the Qt::WA_OpaquePaintEvent attribute, which causes the repaint manager to exclude the widget’s region when painting the widgets behind it.

    Since all widgets are repainted into the same surface, we need to make sure that widgets don’t accidentally paint outside their own boundaries and into other widgets. Since there is no guarantee that widgets will paint inside their bounds, this could potentially lead to painting artifacts, so we set up a clip behind QPainter’s back called the “system clip”. For most widgets the system clip is a rectangle and looking at the performance section of the QPainter docs, we see that that is not so bad. Rectangular clips, when pixel aligned, are fast. A masked widget, on the other hand, is a performance disaster. It is slower to set up and slower to render. The system clip is the same clip that is passed to the paint event, except that the clip in the paint event has been translated to be relative to the top-left of the widget, rather than to the top-left of the surface. Do NOT set the paint event’s region as a clip on the painter. It is already set up, and we don’t detect that it is the exact same region and just process it fully again. The purpose of the region/rect in the paint event is so that widgets can decide to not draw certain parts. This is primarily useful when you have big scenes in the widgets, such as a map application, graphics view or similar.
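
    The last point, using the paint event’s rectangle to skip work rather than to clip, can be sketched without any Qt types. The Rect struct and the itemsToDraw function below are hypothetical stand-ins invented for this illustration; in a real widget the exposed rectangle would come from the paint event and the loop would issue painter calls instead of collecting indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a rectangle type; this is not Qt's QRect.
struct Rect {
    int x, y, w, h;

    // True if the two rectangles share at least one pixel.
    bool intersects(const Rect &o) const
    {
        return x < o.x + o.w && o.x < x + w &&
               y < o.y + o.h && o.y < y + h;
    }
};

// Returns the indices of the items that overlap the exposed rectangle.
// A paint handler would draw only these and skip everything else.
std::vector<std::size_t> itemsToDraw(const std::vector<Rect> &items,
                                     const Rect &exposed)
{
    assert(exposed.w >= 0 && exposed.h >= 0);
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (items[i].intersects(exposed))
            result.push_back(i);
    }
    return result;
}
```

    For a map or gallery with thousands of items, this kind of test is where the exposed region pays off: nothing is clipped twice, the widget simply never issues the draw calls for items outside the rectangle.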

    In addition to the system clip which is set up prior to calling paintEvent, the painter also needs to be in a clean state, which means setting up brushes, pens, fonts and others. It’s not a huge amount, but if you have many widgets it adds up. So, though widgets are no longer native window handles (aka Alien), there is still a price tag involved in repainting them. Be aware of that when you design your application. For instance, implementing a photo gallery using QLabels with pixmaps in a QScrollArea doesn’t scale. You would have to set up clipping and all the other states per label, even though the label only draws a pixmap. A single “view” widget would scale much better, because the widget can then implement a tight loop that draws pixmaps in the right places.

    This whole backingstore and window surface logic only holds for Mac OS X when the raster or opengl graphics systems are used. Personally I would strongly recommend using raster: it implements the full feature set, it is often faster, it has the same performance profile as Qt on Windows, and painting bugs are prioritized higher for raster than for the CoreGraphics backend. In qt/main we plan to switch the default for Mac OS X to raster; we just have to iron out some window system integration issues.

    Graphics systems

    The concept of a graphics system was introduced in Qt 4.5. The idea is to be able to select at startup time, on an application level, what kind of graphics stack you should be using. The graphics system is responsible for creating the pixmap backends and the window surface. We currently have graphics systems for raster, OpenGL 1.x, OpenGL/ES 2.0, OpenVG and X11. You can select a graphics system by starting the application with the command line option -graphicssystem raster|opengl|opengl1|x11|native, where “native” means to use the system default. Another option is to pass the exact same option to configure, which will set that graphics system for all applications using Qt. Finally there is the function QApplication::setGraphicsSystem, which hardcodes the graphics system for a given application.

    In later blogs, we plan to go into each of the paint engines in more detail, but for now, let’s just look at the highlights.

    Raster

    The raster graphics system is the reference implementation of QPainter. It implements all the features we specify and does it all in software. When a new port is started, such as with S60, we usually start with getting raster running. It is currently the default on Windows, Embedded, S60 and will also be on Mac OS X.

    Just a thought. What do you think of raster on X11? If you ignore for a second that you currently get a process-local font cache, it performs quite nicely on X11 and I’ve seen many people switch to it at runtime. If we consider remote displays, this seems daunting, but it still may not be too bad. The way it works in the X11 paint engine today is that any gradient and pixmap transform is done in software anyway and uploaded as an image on a per painter-command level. Why not just do it all client side and upload only the parts that need updating? We can watch HD videos (for some definition of HD, anyway) on YouTube, so certainly we can afford to upload a few pixels. This is bound to generate comments on XRender and server-side gradients and transforms, but these have been tried numerous times and the performance is simply not good enough.

    The window system integration is handcoded for each platform to make the most out of it. On Windows the window surface is a QImage which shares bits with a DIBSECTION, which results in pretty good blitting speed. On X11 we use MIT Shared Memory Images. We used to use Shared Memory Pixmaps, which were removed from Xorg, but we got this awesome patch from the community, so we’re back up and running. On Mac OS X, we’re experimenting with using GL texture streaming for getting the backbuffer to screen and we’re seeing some promising numbers with that, so I hope that will make it into Qt 4.7 too.

    Because the window surface is just an array of bytes, most native APIs have the ability to render into the same buffer we do. This makes integration with native theming quite straightforward, which is one of the reasons why raster is attractive as a default desktop graphics system, despite not being hardware accelerated.

    OpenGL

    We have two OpenGL based graphics systems in Qt. One is for OpenGL 1.x and is primarily implemented using the fixed functionality pipeline in combination with a few ARB fragment programs. It was written for desktops back in the Qt 4.0 days (2004-2005) and has grown quite a bit since. You can enable it by writing -graphicssystem opengl1 on the command line. It is currently in life-support mode, which means that we will fix critical things like crashes, but otherwise leave it be. It is not a focus for performance from our side, though it does perform quite nicely for many scenarios.

    Our primary focus is the OpenGL/ES 2.0 graphics system, which is written to run on modern graphics hardware. It does not use a fixed functionality pipeline, only vertex shaders and fragment shaders. Since Qt 4.6, this is the default paint engine used for QGLWidget. Only when the required feature set is not available will we fall back to using the 1.x engine instead. When we refer to our OpenGL paint engine, it’s the 2.0 engine we’re talking about.

    We’ve wanted to have GL as a default graphics system on all our desktop systems for a while, but there are two major problems with it. First, aliased drawing is a pain: it is close to impossible to guarantee that a line goes where you want it for certain drivers. Second, integration with native theming is a pain: it is rarely possible to pass a GL context to a theming function and tell it to draw itself, hence we need to use temporary pixmaps for style elements. On Mac OS X, there is a function to get a CGContext from a GL context, but we’ve so far not managed to get any sensible results out of it. On the other hand, much of the UI content doesn’t depend on these features, which makes GL optimal for typical scene rendering, such as the viewport of a QGraphicsView or a photo gallery view. So as far as how the default setup in Qt will look in the future, we’re considering that the best default setup for desktop may be a combination of raster for the natively themed widgets and GL for one or two high-performance widgets. Nothing is decided on this topic though; we’re just looking at alternatives.

    Another problem with using GL by default is font sharing. With raster we could theoretically share pre-rendered glyphs between processes in a cross-platform manner using shared memory; with GL this becomes a bit more difficult. On X11, there is an extension to bind textures as XPixmaps which can be shared across processes, but this will usually force the textures into a less optimal format which makes them somewhat slower to draw, so it is still not optimal. On Windows, Mac OS X, S60 or QWS, we would need driver-level support for sharing texture ids, which we currently don’t have.

    OpenVG

    I’m actually quite blank in this area. I’ve not been involved with writing it or getting it up and running. It sits on top of EGL, which makes it quite similar to the OpenGL graphics systems. We expect that OpenVG will be used in a number of mid-range embedded devices.

    The cool thing about OpenVG is that it matches the QPainter API quite nicely. It supports paths, pens, brushes, gradients and composition modes, so in theory, the vectorial APIs should run optimally.

    Rhys, who wrote the OpenVG paint engine, plans to do a post on the OpenVG paint engine’s internals in full in the near future.

    Images and Pixmaps

    The difference between these two is mostly covered in the documentation, but I would like to highlight a few things nonetheless.

    Our documentation says: “QImage is designed and optimized for I/O, and for direct pixel access and manipulation, while QPixmap is designed and optimized for showing images on screen.”

    Raster

    When using the raster graphics system, pixmaps are implemented as a QImage, with a potentially significant difference. When converting a QImage to a QPixmap, we do a few things.

    First, the image is converted to a pixel format that is fast to render to the backbuffer, meaning ARGB32_Premultiplied, RGB32, ARGB8565_Premultiplied or RGB16. When images are loaded from disk using the PNG plugin or when they are generated in software by the application, the format is often ARGB32 (non-premultiplied) as this is an easy format to work on, pixel-wise. I’ve measured drawing ARGB32_Premultiplied onto RGB32 to be about 2-4x faster than drawing non-premultiplied ARGB32, depending on the use case.
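
    To make the premultiplied case a bit more concrete, here is a minimal sketch of the arithmetic involved. It is not the code from the raster engine; the function names are invented for this example, and it only shows the usual reasoning: once the color channels are already multiplied by alpha, the source-over blend needs one multiply per channel (for the destination) instead of two.

```cpp
#include <cassert>
#include <cstdint>

// Rounded division by 255, valid for x in [0, 255 * 255].
static inline uint32_t div255(uint32_t x)
{
    assert(x <= 255u * 255u);
    const uint32_t t = x + 128;
    return (t + (t >> 8)) >> 8;
}

// Convert one straight-alpha ARGB32 pixel (0xAARRGGBB) to premultiplied form.
// This is the one-time cost paid when the image is converted.
uint32_t premultiply(uint32_t argb)
{
    const uint32_t a = argb >> 24;
    const uint32_t r = div255(((argb >> 16) & 0xff) * a);
    const uint32_t g = div255(((argb >> 8) & 0xff) * a);
    const uint32_t b = div255((argb & 0xff) * a);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

// Source-over of a premultiplied pixel onto an opaque RGB32 destination:
// result = src + dst * (255 - srcAlpha) / 255, per channel.
uint32_t blendPremultipliedOnRgb32(uint32_t src, uint32_t dst)
{
    const uint32_t inv = 255 - (src >> 24);
    uint32_t out = 0xff000000u;
    for (int shift = 0; shift <= 16; shift += 8) {
        const uint32_t s = (src >> shift) & 0xff;
        const uint32_t d = (dst >> shift) & 0xff;
        uint32_t c = s + div255(d * inv);
        if (c > 255)
            c = 255;   // only reachable if src was not really premultiplied
        out |= c << shift;
    }
    return out;
}
```

    A straight-alpha source would have to run the equivalent of premultiply() for every pixel on every draw, which is the extra per-pixel work the conversion to QPixmap avoids.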

    Secondly, we check the pixel data for transparent pixels and convert it to an opaque format if none are found. This means that if a “.png” file is loaded as ARGB32 from disk, but only contains opaque pixels, it will be rendered as an RGB32, which is also about 2-4x faster.
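
    The opacity check described above amounts to a single pass over the alpha bytes. The sketch below is an illustration under assumptions, not the Qt implementation: the PixelFormat enum and both function names are hypothetical, and pixels are assumed to be packed as 0xAARRGGBB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if every ARGB32 pixel (0xAARRGGBB) has alpha == 255.
bool isFullyOpaque(const uint32_t *pixels, std::size_t count)
{
    assert(pixels != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        if ((pixels[i] >> 24) != 0xff)
            return false;   // found a (semi-)transparent pixel, stop early
    }
    return true;
}

// Hypothetical format tags, named after the formats mentioned in the text.
enum class PixelFormat { RGB32, ARGB32_Premultiplied };

// Pick the cheaper opaque format when the alpha channel is not actually used.
PixelFormat chooseFormat(const uint32_t *pixels, std::size_t count)
{
    return isFullyOpaque(pixels, count) ? PixelFormat::RGB32
                                        : PixelFormat::ARGB32_Premultiplied;
}
```

    The scan costs one read per pixel at conversion time, which is paid back quickly if the pixmap is drawn more than once.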

    OpenGL

    When using the OpenGL graphics system the actual implementation of the QPixmap varies a bit from setup to setup. The most ideal option gets enabled when your GL implementation supports Frame Buffer Objects (FBOs) in combination with the GL_EXT_framebuffer_blit extension. In this case, the pixmap is represented as an OpenGL texture id, and whenever a QPainter is opened on the pixmap we grab an FBO from an internal pool and use the FBO to render into the texture.

    Without these extensions available, which is typically the case for OpenGL/ES 2.0 devices, the implementation is a QImage (in optimal layout, same as raster) which is backed by a texture id. When you open a QPainter on the pixmap, you render into the QImage and when the pixmap is drawn to the screen, the texture id is used. Internally there is a syncing process between the two representations, so there will be a one-time hit of re-uploading the texture after drawing into it.
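
    The syncing between the two representations can be pictured as a dirty flag. The class below is a hypothetical model written for this illustration only: a second std::vector stands in for the GPU texture, and copying into it stands in for the upload. It shows why drawing into the pixmap costs exactly one re-upload, no matter how often the pixmap is drawn to screen afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a pixmap kept as CPU pixels plus a GPU-side copy.
class SyncedPixmap
{
    std::vector<uint32_t> image;     // CPU-side pixels, this is what gets painted into
    std::vector<uint32_t> texture;   // stand-in for the GPU texture contents
    bool textureDirty = true;
    int uploads = 0;

public:
    explicit SyncedPixmap(std::size_t pixelCount) : image(pixelCount, 0)
    {
        assert(pixelCount > 0);
    }

    // Painting touches only the CPU copy and marks the texture as stale.
    void fill(uint32_t color)
    {
        for (uint32_t &p : image)
            p = color;
        textureDirty = true;
    }

    // Drawing to screen uses the texture, re-uploading once if it is stale.
    const std::vector<uint32_t> &textureForDrawing()
    {
        if (textureDirty) {
            texture = image;        // stand-in for the texture upload
            textureDirty = false;
            ++uploads;
        }
        return texture;
    }

    int uploadCount() const { return uploads; }
};
```

    Filling once and drawing twice gives an upload count of 1; a second fill followed by another draw raises it to 2.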

    In general

    If you intend to draw the same QImage twice, always convert it to a QPixmap.

    There are some use cases where QPixmap is potentially worse though. We have these functions, QPixmap::scaled(), QPixmap::transformed() and friends, which historically are there because we wanted QImage and QPixmap to have similar APIs. We have support for reimplementing this functionality on a per pixmap-backend basis, but currently no engine does this, so for the GL case, or X11 for that matter, calling QPixmap::transformed() implies a conversion from QPixmap into QImage, a transformation in software, and then a conversion back to the original format.

    By default a QPixmap is treated as opaque. When doing QPixmap::fill(Qt::transparent), it will be made into a pixmap with an alpha channel, which is slower to draw. If the pixmap is going to end up as opaque, initialize it with QPixmap::fill(Qt::white). You can even skip the initialization step altogether when you know that all pixels will be written as opaque when the pixmap is painted into.

    Before moving on to something else, I’ll just give a small warning on the functions setAlphaChannel and setMask and the innocent-looking alphaChannel() and mask(). These functions are part of the Qt 3 legacy that we didn’t quite manage to clean up when moving to Qt 4. In the past the alpha channel of a pixmap, or its mask, was stored separately from the pixmap data. Depending on which platform you were on, the actual implementation was a bit different. For instance on X11, you had one 1-bit pixmap mask + an 8-bit alpha channel + a 24-bit color buffer. On Windows you had a 1-bit mask + a packed 32-bit ARGB pixel buffer. In Qt 4 we merged all this into one API, so that QPixmap is to be considered a packed datastructure of ARGB pixels. What we did not do, however, was remove the functions implementing the old API. In fact, we even added the alpha channel accessors, so we made it worse. The API was to some extent convenient, but all four of those functions imply touching all the data and either merging the source with the pixmap or extracting a new pixmap from the current pixmap content. Bottom line: just don’t call them. With composition modes, you can manipulate the alpha channel of the pixmaps using QPainter. This also has the benefit that it will potentially be SSE optimized for raster or done in hardware on OpenGL, so it has potential for being quite a bit faster. There is also the QGraphicsOpacityEffect which allows you to set a mask on widgets and graphics items, but as of today, it is not as fast as we would like it to be.

    QGraphicsView

    I’ll do at least one separate post on graphicsview alone, so I’ll just comment quickly on the difference between using QGraphicsView with items vs QWidgets. QGraphicsView with its scene populated with items is in many ways very similar to the widgets and their repaint handling. With the addition of layouts and QGraphicsWidgets the line is even more blurry. So which solution should you pick? More and more often, we’re seeing that people choose to create their UIs in graphics view rather than creating them using traditional widgets.

    Compared to widgets, items in a graphics view are very cheap. If we consider the photo gallery again, then using a separate item for each of the items in the view may (I say may) be reasonable. A widget is repainted through its paintEvent. A QGraphicsItem is repainted through its paint function. The good thing with the item’s paint function is that there is no QPainter::begin, as the painter is already properly set up for rendering. Another good thing is that the painter has less guaranteed state than in the widget case. There may be a transformation and some clip, but no guarantees about fonts, pens or brushes. This makes the setup a bit cheaper.

    Another huge improvement over widgets is that items are not clipped by default. They have a bounding rectangle and there is a contract between the subclass implementer and the scene that the item does not paint outside. If we compare this to the system clip we need to set for widgets, then again there is less work to be done for the items. If the item violates this there will be rendering artifacts, but for graphicsview this has proven an acceptable compromise.

    Most UI elements are rather simple. A button, for instance, can be composed of a background image and a short text. In QPainter terms that is one call to drawPixmap and one call to drawText. The less time spent between painter calls, the better the performance. The fewer state changes between painter calls, the better the performance. Looking back at how much happens between these calls for a button, you quickly realize that the traditional widgets are quite heavy. If widgets are going to survive the test of time, then they need to behave more like QGraphicsItems.

    Some final words

    I’ve been rambling on for a while, but hopefully there was some useful information in here. You may have noticed that I do not mention printing, PDF or SVG generation, nor do I focus on the X11 or CoreGraphics paint engines in great detail. This is because, as outlined in the painter performance docs, we focus our performance efforts on only a few backends which we consider critical for Qt.

    Posted by gunnar
     in Painting, Graphics Dojo, Performance
     on Monday, December 14, 2009 @ 12:19

    On Friday I added the following to the QPainter documentation:

    
        \section1 Performance

        QPainter is a rich framework that allows developers to do a great
        variety of graphical operations, such as gradients, composition
        modes and vector graphics. And QPainter can do this across a
        variety of different hardware and software stacks. Naturally the
        underlying combination of hardware and software has some
        implications for performance, and ensuring that every single
        operation is fast in combination with all the various combinations
        of composition modes, brushes, clipping, transformation, etc, is
        close to an impossible task because of the number of
        permutations. As a compromise we have selected a subset of the
        QPainter API and backends, where performance is guaranteed to be as
        good as we can sensibly get it for the given combination of
        hardware and software.

        The backends we focus on as high-performance engines are:

        \list

        \o Raster - This backend implements all rendering in pure software
        and is always used to render into QImages. For optimal performance
        only use the format types QImage::Format_ARGB32_Premultiplied,
        QImage::Format_RGB32 or QImage::Format_RGB16. Any other format,
        including QImage::Format_ARGB32, has significantly worse
        performance. This engine is also used by default on Windows and on
        QWS. It can be used as default graphics system on any
        OS/hardware/software combination by passing \c {-graphicssystem
        raster} on the command line

        \o OpenGL 2.0 (ES) - This backend is the primary backend for
        hardware accelerated graphics. It can be run on desktop machines
        and embedded devices supporting the OpenGL 2.0 or OpenGL/ES 2.0
        specification. This includes most graphics chips produced in the
        last couple of years. The engine can be enabled by using QPainter
        onto a QGLWidget or by passing \c {-graphicssystem opengl} on the
        command line when the underlying system supports it.

        \o OpenVG - This backend implements the Khronos standard for 2D
        and Vector Graphics. It is primarily for embedded devices with
        hardware support for OpenVG.  The engine can be enabled by
        passing \c {-graphicssystem openvg} on the command line when
        the underlying system supports it.

        \endlist

        These operations are:

        \list

        \o Simple transformations, meaning translation and scaling, plus
        0, 90, 180, 270 degree rotations.

        \o \c drawPixmap() in combination with simple transformations and
        opacity with non-smooth transformation mode
        (\c QPainter::SmoothPixmapTransform not enabled as a render hint).

        \o Text drawing with regular font sizes with simple
        transformations with solid colors using no or 8-bit antialiasing.

        \o Rectangle fills with solid color, two-color linear gradients
        and simple transforms.

        \o Rectangular clipping with simple transformations and intersect
        clip.

        \o Composition Modes \c QPainter::CompositionMode_Source and
        QPainter::CompositionMode_SourceOver

        \o Rounded rectangle filling using solid color and two-color
        linear gradients fills.

        \o 3x3 patched pixmaps, via qDrawBorderPixmap.

        \endlist
    
        This list gives an indication of which features to safely use in
        an application where performance is critical. For certain setups,
        other operations may be fast too, but before making extensive use
        of them, it is recommended to benchmark and verify them on the
        system where the software will run in the end. There are also
        cases where expensive operations are ok to use, for instance when
        the result is cached in a QPixmap.
    

    I suspect it’s a piece of documentation many of you have been lacking for a while, and it’s something we should have put in a long time ago, but I can only say “sorry for not doing it sooner”. At least it’s getting done now. Note: The patch is not visible in the public repository at the time of publishing. It should be there shortly.

    The urge to get these things into the docs has sprung from a number of dialogues I’ve had recently which all went pretty much like this:

    • TheOtherGirlOrGuy: My application is running slow… What do I do?
    • Me: What is it doing?
    • TheOtherGirlOrGuy: Well, it’s using QGraphicsView and QPainter and is doing this and that…
    • Me: That doesn’t sound too bad.
    • TheOtherGirlOrGuy: And then it’s really slow when doing this…
    • Me: Yeah… That doesn’t work very well. What you should be doing is this…
    • TheOtherGirlOrGuy: Is that written down someplace? How am I supposed to know that?
    • Me: Eh…

    To remedy this, I’m going to put into action something I’ve had at the back of my head for a while now, a blog series on Qt Graphics and Performance. Along the way, I’ll also try to get parts of this into the documentation or into examples/demos as best practice use-cases.

    I just have to point out that this blog series is not a request for more features. It is about us sharing with you what we consider best practices and what our priorities are. Of course if you think our focus is way off, then let us know, but my primary intent with this blog series is to share some thoughts.

    With the help of some of my co-workers, we plan to go through some Qt Graphics fundamentals, the “high-performance” engines, and use cases for graphicsview and widgets. If you have special use cases that you find interesting, then by all means let me know and maybe I can cover those too.

    I need to add a small comment to the “drawText” case. It is currently not super optimal, because we have to do layout on the text each time you call it. Because there is no “handle” in the function we don’t have the ability to cache the layout either. If we started caching based on a qHash of all the strings that were passed to drawText() then we would end up caching a lot of single-shot text drawing… The option that we provide today to work around this is to use a QTextLayout with caching enabled, which is memory-wise quite hungry… I think in the range of 100-300 bytes per character! So as an alternative, we are working on an API for static text which encapsulates the layout work with very little memory overhead. It’s currently called QStaticText and we’re aiming for it to go into 4.7. Once it is in place, we’ll update the drawText comment in the performance documentation to be for these static texts…

    As time permits we plan to push out blogs on the following topics:

    • An overview of the various components involved
    • The raster paint engine in detail
    • The OpenGL paint engine in detail
    • The OpenVG paint engine in detail
    • QGraphicsView optimization flags and cache modes
    Posted by Thiago Macieira
     in Qt, Performance
     on Friday, December 11, 2009 @ 22:55

    I don’t know if this is showing up for the community, but we in Qt have been dedicating a lot of effort to performance improvements in Qt. For Qt 4.5, we had a project codenamed “Falcon” whose job was to improve the graphics engines and make them perform much faster. From that project, we got the graphics engines, including the raster and OpenGL ones.

    For Qt 4.6, there was a lot of work done on Graphics View. For Qt 4.7, we’re going to do some more. Where, I don’t know yet.

    Among the many ideas, one I’m interested in seems very small, but may bring an important benefit: removing volatile from QAtomicInt and QAtomicPointer.

    Here’s what happens: QAtomicInt derives from the internal class QBasicAtomicInt, which is a struct of one member: a volatile int _q_value. Similarly, QAtomicPointer derives from QBasicAtomicPointer, which is a struct of one member: T * volatile _q_value. The idea here is to remove those two “volatile” keywords.

    Before you cry foul and tell me that I’m going to break your code, let me quote the Qt documentation for these two classes:

    For convenience, QAtomicPointer provides pointer comparison, cast, dereference, and assignment operators. Note that these operators are not atomic.

    (emphasis is in the documentation)

    With that card up my sleeve, I claim that I’m not breaking any contracts. All of the atomic operations that these two classes support (fetch-and-add, test-and-set, fetch-and-store) are implemented in assembly, which means the compiler cannot optimise them anyway. And the assembly code will not be influenced by any caching of the variable contents that the compiler may want to produce. What’s more, we also tell the compiler that we changed the value, so that it will discard its cache.

    So, why are we considering this?

    Well, the reason I hinted at above: the compiler caching the value. The whole point of the volatile keyword is that the compiler may not cache the value. It must reload the value every time it tries to access the variable.

    And if we look at any of the Qt tool classes, the non-const functions start by calling detach(), which is generally implemented like this (extracted from qlist.h):

    inline void detach() { if (d->ref != 1) detach_helper(); }

    That is, “if our reference count is not one (i.e., if we’re being shared), do the detaching.”

    And since QAtomicInt::operator int() simply returns _q_value, which is volatile, the compiler has to reload the value every single time. Then it must actually compare that value to 1 and generate the proper branching.

    What the compiler doesn’t know, is that once a container detaches, the reference count will remain 1. It can only increase from 1 in a way that is visible to the compiler: that is, either in the same thread or, if the container is a globally-visible variable, after mutex locking/unlocking.

    So, if we remove the volatile keyword, the compiler is allowed to cache the value of the reference count. Once the first detaching happens, the compiler knows that the reference count is 1. It can therefore optimise out all the next checks, because it also knows that the reference count remains 1.
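
    For anyone who wants to compare what the compiler generates in the two cases, here is a minimal, single-threaded copy-on-write sketch. CowBuffer and everything in it is hypothetical and is not Qt code; the Counter template parameter is only there so the same class can be instantiated with volatile int (mimicking the current member) and with plain int (mimicking the proposed change). It demonstrates the detach pattern itself and makes no claim about whether a given compiler actually removes the repeated checks.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, single-threaded copy-on-write buffer. Counter is the type of
// the reference count: `volatile int` or plain `int`.
template <typename Counter>
class CowBuffer
{
    struct Data {
        Counter ref;
        std::vector<int> values;
    };
    Data *d;

    void detach_helper()
    {
        assert(d->ref > 1);               // only called while shared
        Data *copy = new Data{1, d->values};
        d->ref = d->ref - 1;              // the other sharers keep the old block
        d = copy;
    }

public:
    CowBuffer() : d(new Data{1, {}}) {}
    CowBuffer(const CowBuffer &other) : d(other.d) { d->ref = d->ref + 1; }
    CowBuffer &operator=(const CowBuffer &) = delete;   // left out to keep the sketch short
    ~CowBuffer()
    {
        d->ref = d->ref - 1;
        if (d->ref == 0)
            delete d;
    }

    // Same shape as the qlist.h line quoted above.
    void detach() { if (d->ref != 1) detach_helper(); }

    // Every non-const function starts by detaching.
    void append(int v) { detach(); d->values.push_back(v); }

    int size() const { return static_cast<int>(d->values.size()); }
    int refCount() const { return d->ref; }
};
```

    Calling append() in a loop on a CowBuffer<volatile int> and on a CowBuffer<int>, then looking at the assembly for each, is one way to check the theory: with the volatile counter the load and compare must stay inside the loop, while with the plain counter the compiler is at least permitted to hoist them.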

    This would mean that our reference counting system would be a lot more efficient (hence the title of this blog). It might turn out to be the best ratio of performance gain vs effort ever. After all, it’s a one-word change in one header file.

    That’s the theory anyway. I haven’t yet tested to see if the compiler really knows how to optimise this the way I expect it to.

    PS: credit where credit is due: this optimisation was not my idea. It was Olivier’s. And at first I resisted, saying it would break stuff and wouldn’t work. But now I’m in favour of it. :-)



    © 2008 Nokia Corporation and/or its subsidiaries. Nokia, Qt and their respective logos are trademarks of Nokia Corporation in Finland and/or other countries worldwide.
    All other trademarks are property of their respective owners.