Posted by gunnar in Threads, Painting, Graphics Dojo, Performance on Thursday, January 21, 2010 @ 08:18

In this series, I wanted to cover threading, a topic that has been actively discussed amongst some of the trolls over the last few months. We’ve had support for rendering into QImages from non-GUI threads since the early Qt 4.0 days, but it’s only in recent versions of Qt (4.4, I think) that we got support for rendering text into images. Now that the support is there, it raises the question of how to make proper use of it. Generating the actual content in a thread is one use case, and here is an example of it.

What it means is that instead of rendering all the content of a certain view in QWidget::paintEvent() or QGraphicsItem::paint(), we use a background thread which produces the cache. The benefit is that even though drawing the actual content can be quite costly, drawing a pre-rendered image is fast, so the UI can stay fully responsive while the heavy loading happens in the background. It does imply that not all content is available at all times, but for many scenarios this is perfectly fine. There is nothing novel about this approach, I just think it’s a nice way to solve a problem that often comes up when dealing with user experience.

This approach is used by Google Maps (I don’t know what the server does, but it at least sends individual tiles to the browser) and by the iPhone and N900 web browsers. I’ve also talked to customers in the past who use it for use cases where generating the content is costly but the user interface needs to stay responsive. In fact, it applies to pretty much anything where it is OK that the content is not immediately there, such as data tables like an mp3 index or a contact list, images in a data folder, etc.

The Task

Let’s first look at the task. I’ve done a trivial implementation which looks in a directory and displays all the images in there. Each image is a separate content piece, and I’ve put a background, a small frame around it and a drop shadow under it, just so that there is a bit of active work going on. If you are into it, here is the Source Code.

The content pieces could have been tiles in a map of Norway or tiles composing a webpage, but I chose images because I already had some around and I figured they made for an OK example. The demo is run on an N900 with the compositor disabled, using the following command lines:

  • Non-Threaded: ./threaded_tile_generation -no-thread -graphicssystem opengl MyImageFolder
  • Threaded: ./threaded_tile_generation -graphicssystem opengl MyImageFolder

Here’s how it looks when the content is generated in the GUI thread:


The UI is running super-smooth as long as I show only the content that is already loaded. Once work is needed, the entire UI stops and the user experience is really bad. Here is how it looks if we move the work into a background thread.

The algorithm

Don’t use this particular algorithm. It is very crude and written to show an idea. First of all, because I was lazy, I used queued connections rather than a synchronized queue to schedule the pieces to be rendered. This means that the queue is managed by Qt’s event loop, out of my control. So if I pan far out, I will schedule a lot of images to be rendered, then pan beyond them before they are done. In a decent implementation, I would dequeue these and make sure that only the pieces that are directly visible are being processed.
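As a rough illustration of the "synchronized queue" idea, here is a minimal sketch in plain standard C++ rather than Qt. It is not the code from the demo. The class name TileRequestQueue, its member functions and the use of plain int tile ids are all assumptions made for the example; QMutex and QWaitCondition could play the same roles in a Qt program.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <set>

// Hypothetical request queue shared between the GUI thread and one worker.
// Tile ids are plain ints here; a real viewer would carry more data.
class TileRequestQueue {
public:
    // GUI thread: ask for a tile, unless it is already queued.
    void request(int tileId)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        for (int id : m_pending)
            if (id == tileId)
                return;
        m_pending.push_back(tileId);
        m_wakeup.notify_one();
    }

    // GUI thread: after panning, drop every request that is no longer visible.
    void retainOnly(const std::set<int> &visible)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        std::deque<int> kept;
        for (int id : m_pending)
            if (visible.count(id) != 0)
                kept.push_back(id);
        m_pending.swap(kept);
    }

    // Worker thread: block until a request arrives or the queue is closed.
    // Returns false once the queue is closed and empty.
    bool take(int &tileId)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_wakeup.wait(lock, [this] { return m_closed || !m_pending.empty(); });
        if (m_pending.empty())
            return false;
        tileId = m_pending.front();
        m_pending.pop_front();
        return true;
    }

    // GUI thread: wake the worker so it can exit.
    void close()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_closed = true;
        }
        m_wakeup.notify_all();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_wakeup;
    std::deque<int> m_pending;
    bool m_closed = false;
};
```

The point of owning the queue is retainOnly(): when the view pans away, stale requests can be thrown out before the worker ever touches them, which is not possible once they are sitting in the event loop as queued signal deliveries.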

The other thing is that there is no logic to “peek ahead”. I schedule images to be generated only when I need them. If I instead scheduled them based on the current panning direction, in addition to not discarding so aggressively, it would probably result in a situation where most images are rendered ahead of time.
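To make the “peek ahead” idea concrete, here is a small sketch that picks which tiles to schedule. It assumes the tiles sit in a single row indexed from 0, which is a simplification, and the function name tilesToSchedule and its parameters are made up for this example.

```cpp
#include <vector>

// Returns the tile indices to schedule, in priority order: the visible
// tiles first, then up to lookAhead tiles in the panning direction.
// direction: +1 when panning toward higher indices, -1 toward lower, 0 when idle.
std::vector<int> tilesToSchedule(int firstVisible, int lastVisible,
                                 int direction, int lookAhead, int tileCount)
{
    std::vector<int> order;
    for (int i = firstVisible; i <= lastVisible; ++i)
        if (i >= 0 && i < tileCount)
            order.push_back(i);
    for (int step = 1; step <= lookAhead && direction != 0; ++step) {
        const int i = direction > 0 ? lastVisible + step : firstVisible - step;
        if (i >= 0 && i < tileCount)
            order.push_back(i);
    }
    return order;
}
```

With tiles 3 to 5 visible and the user panning toward higher indices, a look-ahead of 2 would also schedule tiles 6 and 7, after the visible ones.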

QGraphicsView

It would be kinda cool if this could be applied directly to QGraphicsView. You would set a flag on the item, and instead of generating its cache pixmap in the GUI thread, the work would be offloaded to the worker thread. This is not straightforward however, because the GUI thread can, as of today at least, continue to modify the state of the item while it’s being rendered in the worker thread. Syncing these two becomes a bit of a mess, and how to solve it, if at all, is not something we have a plan for. That doesn’t prevent people from doing this kind of work in their own custom paint() functions of course.

Posted by gunnar in Painting, Graphics Dojo, OpenGL on Monday, January 18, 2010 @ 10:00

In my previous post, The Cost of Convenience, we saw quite clearly that text drawing was a major bottleneck. Text drawing is quite common in GUI applications though, so we need a solution for that. If we break down what happens behind QPainter::drawText(), it is split into two distinct parts. The first part converts the string into a set of positioned glyphs. This is often referred to as “text layout” because it positions the glyphs, wraps the text and adjusts for alignment. The second part passes the glyphs to the paint engine to be rendered. When the text is the same all the time, the first part could be done once and the glyphs and positions just reused.

We have a class in Qt which allows you to cache the first part and only do the drawing for each frame. The class is QTextLayout. This is a low-level class, throwing asserts at you for the most trivial of mistakes. It also comes with a really inconvenient API, but it does reduce the most costly step of text drawing, which is the layout part. It is also only fair to mention that QTextLayout uses a lot more memory than just the glyph-array and positions array, as one could expect, so in a memory constrained setup, it should be used with caution. In 4.7, we plan to introduce an API for static text, which takes care of all the layout and stores only the required parts, reducing the overall memory footprint, but for now, QTextLayout is how you do it.

Going back to my virtual keyboard, updated Source Code, I’ve changed the “-buttonview” example to make use of QTextLayout. In the constructor, I build the layout:

    ButtonView() {
        // One character per line, so that each one gets its own QTextLine.
        QString content;
        for (int i = '0'; i <= 'Z'; ++i) {
            content += QLatin1Char(i);
            content += QChar(QChar::LineSeparator);
        }
        m_layout = new QTextLayout(content, font());
        QFontMetricsF fm(font());
        m_layout->beginLayout();
        for (int i = 0; i < content.size() / 2; ++i) {
            QTextLine line = m_layout->createLine();
            line.setNumColumns(1);
            // Cell of this character in the 10-column grid of 32x32 buttons.
            int x = i % 10;
            int y = i / 10;
            QSizeF s = fm.boundingRect(content.at(i * 2)).size();
            line.setPosition(QPointF(x * 32, y * 32) + QPointF(16 + s.width() / 2, 16 + s.height() / 2));
        }
        m_layout->endLayout();
        m_layout->setCacheEnabled(true);
    }

If you look at the source code, there is more stuff going on in the constructor than I show above. This is because I extracted only the parts relevant to the text layout. So what we do is build a string of the characters. Between each character I insert a LineSeparator. Without this, I wouldn’t be able to split the text into multiple QTextLine objects. From the content string, I construct the layout. For each character, I find its position in the grid, construct a QTextLine and move the line to its position. Each line is one column/character wide. Finally, I enable caching on the layout. This is the step where we start caching the laid out text.

When it comes to the paint method, the code is rather straightforward. All the text is contained inside a single layout object so I can just call its draw function.

    void paint(QPainter *p, const QStyleOptionGraphicsItem *, QWidget *) {

        // Draw background pixmaps...

        m_layout->draw(p, QPointF(0, 0));
    }

Now, let’s have a look at what this gains us:

Text Layouts

The graph shows the number of milliseconds per frame including the blit. Measured on an N900 with composition disabled. Smaller is better!

If we compare the “-no-indexing -optimize-flags” run to the one with “-no-indexing -optimize-flags -text-layout”, we see that there is a significant reduction per frame. Raster goes from 9.3 ms per frame down to 5.5 ms, and OpenGL drops from 16 ms per frame to 9.1 ms when using a text layout. A drop of about 4 ms is also visible in the X11 paint engine.

Needless to say, using the QTextLayout class introduces a huge benefit, but it requires a bit more setup to get there. In this implementation I merged all the text into a single object which also makes it impossible for me to move one item relative to the others, such as adding an offset when a button is pressed. I could have one QTextLayout for each item, which would have been roughly the same performance, but at a higher memory cost.

Until next time, take care!

PS: A small comment on the item cache / X11 numbers. The connection is asynchronous and Qt completes its job at about 2.7 ms per frame. Passing “-sync” on the command line, which makes all X calls synchronous, raises the time to about 10 ms per frame. If I had put a QApplication::syncX() into each frame, synchronizing once per frame which is essentially what GL and VG are doing, I would probably get a number that is in between these two. What this means is that the numbers for X11 in this test are actually quite a bit worse than the graphs show.

Posted by gunnar in Graphics View, Painting, Graphics Dojo, OpenGL on Monday, January 11, 2010 @ 09:25

So, it’s time for my next post. Today’s topic is how convenience relates to performance, specifically in the context of QGraphicsView. My goal is to illustrate that the way to achieve fast graphics is to pack your QPainter draw calls as tightly together as possible. The more stuff that happens in the middle, the slower it gets.

To illustrate this, I’ve implemented a virtual keyboard. Granted, it’s not a very common layout nor is it usable, but the rendering is the point here, not the functionality. The full source code is here and it looks like this:

Virtual Keyboard Image

I’ve implemented the keyboard using three different approaches: one using proxy widgets, one using graphics items and one where the entire view is one graphics item. In addition to that, I added a number of options to tweak various properties, such as whether or not the text is drawn. I measured this on an N900 rather than a desktop because the difference becomes more pronounced on a small device. On the desktop it is easy to be fooled because most things complete in a matter of microseconds anyway. It is only when the entire application comes together that one notices things are not as smooth as in the prototype, and by then so much work has been invested in the current design that one loses out on the super-slick feeling application.

QGraphicsProxyWidget

Since we’re implementing a series of clickable buttons, a natural and convenient starting point is to use an existing button class, such as QPushButton. It already implements the logic for mouse/keyboard interaction and has signals for clicking and all sorts of other useful functionality. To get widgets into QGraphicsView, we use a QGraphicsProxyWidget. To make the test “fair”, I actually use a plain QWidget which just paints a pixmap and draws some text. Had I gone through the styling API, these numbers would have been even worse.

ProxyWidget Results
Milliseconds spent per frame including blit to screen when using QGraphicsProxyWidgets. Low is better!

If we look at the plain “-proxywidgets” run, the fastest engine was the raster engine, running at 26ms per frame. If I wanted to slide this keyboard onto the screen, I have 16ms available per frame if I want it running at 60 FPS and 33ms available if I want to do it at 30 FPS. When each frame takes 26ms, I can barely do 30 FPS, and with only a little bit of slack, so if another process is soaking up CPU time, even that number is a bit difficult to reach. So, not very good. (BTW, the exact numbers in the graphs are listed as a comment at the top of the .cpp file I linked above).
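The 16ms and 33ms figures are simply 1000 divided by the target frame rate, rounded down. As a tiny sketch of that arithmetic (the function names frameBudgetMs and slackMs are made up for this example):

```cpp
// Time available per frame, in milliseconds, at a given target frame rate.
double frameBudgetMs(double targetFps)
{
    return 1000.0 / targetFps;
}

// Slack left in the budget for a measured frame time.
// A negative result means the target frame rate is missed.
double slackMs(double frameMs, double targetFps)
{
    return frameBudgetMs(targetFps) - frameMs;
}
```

For the 26ms raster number, slackMs(26.0, 30.0) leaves a little over 7ms of slack, while slackMs(26.0, 60.0) is negative, so 60 FPS is out of reach.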

The first thing I noticed with this approach was that each button now had a gray background. This is of course the widget background. A QWidget embedded in QGraphicsView will be treated as a top-level and will therefore draw its background. I added an option “-no-widget-background” which sets Qt::WA_NoBackground on the widget. This brings the rendering time with raster down to 22ms. 4ms saved per frame, just by setting a flag, is not too bad, but still pretty far from being awesome.

I’ve mentioned before that text drawing is not as fast as we would like it, so just to compare how it looks without text, I added a “-no-text” option to the test. This brings the raster results down to 13ms. That is pretty nice and below the 16ms threshold required to achieve 60 FPS, but only with a small margin. And I’m not drawing any text! Before I give up on this approach, I’ll enable item caching. By setting ItemCoordinateCache on each button, I cache both the background pixmap and the text in one single pixmap. This brings the raster results down to 8.5ms, and it’s starting to look acceptable. But at a very high memory cost… In my original use case I had one shared pixmap for all the button backgrounds, but now I have one per button.

You may notice that there is a vast difference between item caching and the proxy widget drawing the pixmap. One thing that adds to the proxy widget cost is that the QPainter is recreated and initialized for each button in the button’s paint event. Also, as I mentioned in my previous post, An Overview, each widget has a system clip and there is an overhead involved with calling paintEvent(). For items in QGraphicsView, there is already a painter, and I don’t need a clip, nor do I need any of the other stuff that goes on behind the scenes there. When we enable item coordinate caching, we don’t leave the graphics view world and we don’t enter the widget world. This crossing is expensive, so by not going into the widget world, we save a lot.

So, if there is a lesson to be learned it is that QGraphicsProxyWidget should be used with extreme caution. If you really need it, use very few of them.

QGraphicsWidget

If proxy widgets are too slow to be usable in this scenario, then the next best thing is to use a QGraphicsWidget. This is a subclass of both QObject and QGraphicsItem, which gives me signals, slots and properties, but it’s not a QWidget and therefore still fairly lightweight. The numbers are as follows:

GraphicsWidgets Results
Milliseconds spent per frame including blit to screen when using QGraphicsWidgets. Lower is better!

Compared to the proxy widgets approach we’re starting out quite a bit better, with raster at 13 ms per frame, OpenGL at 20ms and X11 at 22ms. Below this line is a new line: “-no-indexing -optimize-flags”. QGraphicsView will by default put all the items in a view into a BSP tree for fast lookup, this is beneficial when the scene contains many items and you often need to find items that intersect with a small portion of the scene. In the testcase we’re always doing a full update, so there is no benefit from the index, so it can be disabled by calling scene->setItemIndexMethod(QGraphicsScene::NoIndex). Having a BSP is the default behaviour because graphics view was initially intended to be a static scene for many items. The most common usecase today is a few (a few hundred at max) items which tend to move a lot. For this reason, it is always a good idea to try to disable the BSP and see if it makes a difference in performance. If it helps, then leave it off.

I also know that the items play nice, meaning that they don’t change the clip, translate the painter, change the composition mode or modify any other state that would propagate to other items. This means I can safely set the DontSavePainterState optimization flag. Actually, based on an old habit, I set all possible optimization flags. I only consider unsetting them if my drawing code starts to look weird, at which point I would rather fix the drawing code and keep the flags set. Disabling indexing and enabling the optimization flags shaves off 2ms per frame for all rendering backends, so that is definitely worth it.

If I don’t do text, rendering is about twice as fast. Again we see that text drawing is a huge cost. We’re working on an API to fix this and we’ll have more information for you when we do. You may notice that enabling item caching drops the performance a bit compared to the “-no-text” case. There isn’t much overhead inside QGraphicsView for this path. A likely reason for the decrease is that reading from multiple memory sources (multiple pixmaps) results in a lot of cache misses, compared to the straight approach which draws the same pixmap over and over.

ButtonView Item

In my previous post I briefly mentioned that there is a slight overhead involved with the use of a QGraphicsItem too. Prior to calling the paint function, the painter is transformed to the coordinate system of the item and the painter state is saved. If the item draws a big polygon, this setup cost can be ignored, but when drawing just a pixmap and a few pixels of text, then it may be worth considering. In the spirit of “The more direct the painting code is, the faster it gets”, I implemented the keyboard as a single item. The numbers are as follows:

ButtonView Results
Milliseconds per frame including blit to screen when using a single item. Lower is better!

Raster is now down to 10ms, which is 1ms better than the QGraphicsWidget approach when all optimizations were enabled, so even though graphics items are cheaper than widgets, they still cost a bit. The keyboard is now rendered in a tight loop, and the major difference in performance here is caused by the fact that items in the scene have a transform associated with them. Prior to calling paint(), a transform is set to match the painter to the item’s local coordinate system. This causes a state change in the paint engine. For each button we’re drawing a 32×32 pixmap, which means alpha blending 1024 pixels, followed by doing text layout and drawing a single character. Even then we save about 10% of the time by not having a QPainter::translate() in the midst, so bear that in mind. By enabling the optimization flags and disabling the index, raster drops a bit more, so having those is still a good idea.

You may have noticed that there is one dataset that is named “cheat” for OpenGL. I was reluctant to include this, because it’s using a private API that is not, and I really mean NOT, subject to binary compatibility rules. You cannot call this from your application. We’re going to add a public API for this in the future, hopefully in 4.7, so until it’s there, wait. In the interest of showing what we are thinking internally, I thought I would show it.

OpenGL is really great for accelerating graphics, but its way of working does not map optimally to how Qt works. GL is really good at taking a few large datasets of triangles and rendering them, but it’s not so good at drawing loads of small things. Small things like button backgrounds, icons, single text items, etc. However, all the button backgrounds are the same pixmap, so what if I could tell QPainter to draw the same pixmap in multiple places at once? In GL this would correspond to setting up a texture and one vertex and texture coordinate array and drawing some 40 pixmaps in one go. This fits much better with how GL is made to work. The result is that drawing the buttons drops from 5.2ms to 3.9ms, so another piece of juice squeezed out. Naturally, the more times the pixmap is drawn and the smaller the pixmap gets, the more benefit you get from batching commands like this.
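To give an idea of what “one vertex array for all the buttons” could look like, here is a sketch in plain C++ that only builds the array. It is not the private Qt API used for the “cheat” numbers, and the Rect struct and the buildBatchedVertices name are made up for this example.

```cpp
#include <vector>

// Hypothetical rectangle type: top-left corner plus size.
struct Rect { float x, y, w, h; };

// Packs every rectangle into one vertex array: two triangles per rectangle,
// six vertices each, stored as interleaved x/y pairs (12 floats per rectangle).
std::vector<float> buildBatchedVertices(const std::vector<Rect> &rects)
{
    std::vector<float> vertices;
    vertices.reserve(rects.size() * 12);
    for (const Rect &r : rects) {
        const float x0 = r.x, y0 = r.y;
        const float x1 = r.x + r.w, y1 = r.y + r.h;
        const float quad[12] = {
            x0, y0,  x1, y0,  x1, y1,   // first triangle
            x0, y0,  x1, y1,  x0, y1    // second triangle
        };
        vertices.insert(vertices.end(), quad, quad + 12);
    }
    return vertices;
}
```

An array like this, together with a matching texture coordinate array built the same way, could then be handed to GL in a single glDrawArrays(GL_TRIANGLES, ...) call instead of one draw call per button.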

There is a second option to OpenGL for the button view case, which is the “-ordered”. This was done after Tom brought to my attention that the testcase would do a shader program update for each painter call. In the default buttonview implementation we do:

                    for (int i=0; i < m_rects.size(); ++i) {
                        p->drawPixmap(m_rects.at(i), *theButtonPixmap);
                        p->drawText(m_rects.at(i), Qt::AlignCenter, m_texts.at(i));
                    }

Because pixmaps use one shader pipeline and text drawing uses another, the pipeline needs to be switched and reset all the time, which renders at 16 ms per frame. To see if it makes a difference, I added a second alternative rendering, “-ordered”, where I do all the pixmaps first, then all the text:

                    for (int i=0; i < m_rects.size(); ++i)
                        p->drawPixmap(m_rects.at(i), *theButtonPixmap);
                    for (int i=0; i < m_rects.size(); ++i)
                        p->drawText(m_rects.at(i), Qt::AlignCenter, m_texts.at(i));

This prevents the shader pipeline updates and brings the rendering time per frame down to 13ms, so definitely worth it.
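The effect of the ordering can be modelled without any GL at all, simply by counting how often two consecutive draw calls need a different pipeline. This is a toy model and not a measurement. The Pipeline enum and the helper functions are made up for this example, and 40 buttons is just a round number.

```cpp
#include <cstddef>
#include <vector>

enum class Pipeline { Pixmap, Text };

// Counts how often two consecutive draw calls need a different pipeline.
int countPipelineSwitches(const std::vector<Pipeline> &calls)
{
    int switches = 0;
    for (std::size_t i = 1; i < calls.size(); ++i)
        if (calls[i] != calls[i - 1])
            ++switches;
    return switches;
}

// Default order: pixmap, text, pixmap, text, ...
std::vector<Pipeline> interleavedCalls(int buttons)
{
    std::vector<Pipeline> calls;
    for (int i = 0; i < buttons; ++i) {
        calls.push_back(Pipeline::Pixmap);
        calls.push_back(Pipeline::Text);
    }
    return calls;
}

// "-ordered": all pixmaps first, then all text.
std::vector<Pipeline> orderedCalls(int buttons)
{
    std::vector<Pipeline> calls(buttons, Pipeline::Pixmap);
    calls.insert(calls.end(), buttons, Pipeline::Text);
    return calls;
}
```

For 40 buttons the interleaved order gives 79 switches within a frame, while the ordered variant gives exactly one.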

Summing Up

Virtual Keyboard Combined Results
Milliseconds per frame including blit to screen for proxy widgets, graphics widgets and a single item. Lower is better!

OpenGL comes out rather badly in this testcase, which I was a bit disappointed to see, but it did send Tom into an optimization frenzy, so we’re hoping to remove some of the constant overhead. It should also be said that when using the OpenGL graphics system, we enable multisampling by default, which increases rendering time on the N900 by around 30%. A plain QGLWidget would thus perform slightly better. Another aspect of OpenGL is that it uses a dedicated low-power chip, so even though it runs at half the speed for this particular use case, it also uses a lot less battery, so it may still be the right choice. OpenGL will also scale significantly better than raster and X11 as the pixmaps get bigger or if the content of the button is slightly more advanced, say like a horizontal gradient.

The best numbers are definitely in the button view case, where all the content is rendered as one item, which is what I wanted to highlight with this blog. The button view item also opens up for other optimizations such as batching. We don’t have that many batching functions in QPainter today, it’s only drawRects(), drawLines() and drawPoints(), but we’re considering adding more. We are just not sure how the APIs would look yet.

The bottom line is still that how Qt is used defines how well it performs. On one hand there may be an easy and convenient way to get the job done which performs quite sub-optimally. On the other hand there may be a more involved implementation which performs very well. I’m not trying to suggest that you do one or the other, there are a lot of good reasons for picking either one. But I hope that I’ve illustrated that some features come at a cost and that this is kept in mind, along with what the target is, when designs are evaluated and chosen.

I’ll round off with a question. If you were to implement a particle effect when you press a button, which approach would you choose, having seen the numbers above?

Posted by TomCooksey in Painting, Graphics Dojo, OpenGL, Performance on Wednesday, January 06, 2010 @ 12:01

Introduction

Here’s the next instalment of the graphics performance blog series. We’ll begin by looking at some background about how OpenGL and QPainter work. We’ll then dive into how the two are married together in the OpenGL 2 paint engine and finish off with some advice about how to get the best out of the engine. Enjoy!

Why OpenGL?

Before I dive into the OpenGL paint engine, I want to make sure we all understand the motivation for the OpenGL 2.0 paint engine. I’ve talked about this before in my article about hardware acceleration, but we still frequently get questions like “Why not implement a Direct2D paint engine?”.

Everyone knows OpenGL means fast graphics, right? Well, this is actually a bit of a misconception. What makes graphics fast is a bit of hardware dedicated to computer graphics called a GPU (Graphics Processing Unit). OpenGL 2.x is a software library which often (but not always) uses a particular class of GPU to help satisfy drawing operations (Note: OpenGL 1.x used a different class of GPU). A modern programmable GPU (e.g. nVidia GTX 295) can usually be programmed via any of OpenGL, Direct3D and OpenCL. The only difference then is that Direct3D is only available on the Windows platform and OpenCL is not universally supported.

So the reason we are investing our time and effort into OpenGL, rather than Direct3D or OpenCL, is that OpenGL 2.0 is sufficient to give us access to all the GPU features we currently want to use. It is also available on more platforms, especially if you limit yourself to the ES sub-set. We are also looking into restricting ourselves further to only use APIs in OpenGL 3.2 Core Profile.

This might change in the future if we see a new class of GPU, like ones designed for 2D vector graphics which can’t be abstracted by OpenGL 2.0 very well (enter OpenVG), or, if we want to start using GPU features which OpenGL (ES) 2.0 doesn’t give us access to. Having said that, OpenGL is very good at exposing new GPU features through extensions.

History

Qt has had an OpenGL paint engine since early Qt 4.0 days. This engine was designed for the fixed-function hardware available at the time. As time went on and manufacturers added newer bits of hardware to their GPUs, the OpenGL paint engine was adapted to use those features through OpenGL extensions. Over the last 4 years, lots of people have hacked on the engine and added support for things like ARB fragment programs and even adapted the engine to work on OpenGL ES 1.1. The engine is pretty stable and has lots of fall-backs (or original code-paths, depending on how you look at them) for old hardware missing GL extensions the engine can utilise. But, fundamentally, it is an OpenGL 1.x engine.

In early 2008, around the time of the Falcon project (the Falcon Project was an internal project started for Qt 4.5 which focused on painting performance and architecture), it became increasingly clear that Qt needed to support hardware acceleration using the OpenGL ES 2.0 API which was starting to appear on embedded System-On-Chips like the OMAP3. There were two options available: Extend the existing OpenGL paint engine further still, or develop a new paint engine from scratch. When looking at the existing engine, there was a major problem – although it supported fragment programs, it was heavily reliant on fixed-function vertex processing. A further consideration was that the Falcon project had just kicked off and the future of the QPaintEngine API was uncertain. Both of these factors resulted in a new paint engine being written from scratch for OpenGL ES 2.0. This new engine had a distinct advantage over the existing engine: everything I wanted to use from OpenGL was in the core OpenGL ES 2.0 API. This meant I didn’t need to add fallbacks in case of missing functionality, leading to much cleaner and leaner code.

Another point about OpenGL ES 2.0 is that it doesn’t have much in the way of fixed function features – forcing you to write shader programs. While annoying at the time, this is apparently the best way to do things even on desktop GPUs. This point is important because it quickly became apparent that although the engine was designed for GLES2, not only would it also work on desktop OpenGL 2.0, but it would use that API in a way better suited for modern programmable GPUs. So, in Qt 4.6, the new engine is used by default on both GLES2 and on desktop systems which support OpenGL 2.0.

What does OpenGL (ES) 2 provide?

As I’ve already mentioned, OpenGL ES 2.0 is a pretty lean and mean API which models programmable GPUs. The “programmable” bit is fundamental to the API. It means that you write small programs known as shaders, ask OpenGL to compile and then run them on the GPU to process the data you give it. There are two types of shaders: one type processes positions (vertices) and another type processes pixels (fragments), called the vertex shader and fragment shader, respectively. The idea is that you tell OpenGL you want to draw some triangles and the vertex shader is run to determine the position of each of those triangles. Then, the GPU turns each triangle into a bunch of pixels and the fragment shader is run to determine the colour of each of those pixels. The API provides various ways of passing data from the CPU to the GPU (from textures and lists of triangle positions to individual floats) and ways of passing data from the vertex shader to the fragment shader. That’s basically it. All the complexity lives in the shaders you give to the GPU to run.

What does QPainter require?

The rest of this blog assumes you are familiar with the QPainter API (if not, go check the QPainter docs). It might also be a good idea to read through Gunnar’s post about how the Raster engine works.

So, the QPainter API provides more than just triangles. It is therefore the GL paint engine’s job to turn the whole of the QPainter API into “just a bunch of triangles”. To understand its task a little better, you have to split QPainter up into chunks which map better to OpenGL. A great example of this is drawRect(). In QPainter terms, this is a single primitive, but in GL engine terms, it is actually two: A rectangle (the fill) and a (possibly quite complex) line round the outside (the stroke). The OpenGL paint engine tries to keep a fairly clean separation between the shape of something which is drawn and its fill. So, here’s the list of primitives (shapes) QPainter requires the engine to draw:

  • Simple primitives (Rectangles, convex polygons, ellipses, etc.)
  • Text
  • Pixmaps
  • Strokes
  • Complex vector paths (QPainterPath)

In addition to this, we have various fills which we can use on our primitives provided by QBrush:

  • Solid colour
  • Linear gradients
  • Radial gradients
  • Conical gradients
  • Bitmap patterns
  • Textures

Not only do we have different types of fill, but we also support a full 3×3 transformation matrix on the brushes. This allows you to draw a rectangle but use it as a kind-of stencil over (for example) a perspective transformed texture.

Finally, QPainter also requires the engine to implement clipping, different composition modes and support its state stack (QPainter::save() & QPainter::restore()).

Engine Operation

Primitive Rendering

  • Simple Primitives: To render convex primitives such as rounded rectangles, we just generate a GL triangle fan and render it using glDrawArrays
  • Text: For large text, we convert it to a complex path and render it as such. However, for smaller font sizes, we rasterize the individual font glyphs and upload them as a texture (8-bit texture for bitmap & anti-aliased glyphs and 24-bit RGB for sub-pixel anti-aliased glyphs). This glyph texture is used as a mask in the engine’s pixel pipeline (see below). So, in terms of primitives, text is actually rendered as a set of rectangles - one rectangle for each glyph. When rendering with sub-pixel anti-aliased glyphs, it is possible that the engine will need to do two passes (if the brush is not a solid colour). This is because the engine uses a clever trick and sets the brush’s colour as the glBlendColor and outputs the RGB mask in the fragment shader. It is then able to set a glBlendFunc which combines the two and gives per-sub-pixel blending. If you set a more complex brush, the engine has to do two passes - first apply the mask to the destination, then a second pass to apply the brush, with glBlendFunc set to give the correct result.
  • Pixmaps: A pixmap is actually just a rectangle.
  • Strokes: Strokes can be very complex - just take a look at the pathstroke demo! However, even the most complex dashed pattern with rounded joins and end caps can be turned into a GL triangle strip relatively easily. This is done by the QTriangulatingStroker.
  • Complex vector paths: This is where things get tricky. QPainterPaths can have lots of things which break the “turn lineTo, moveTo and curveTo into vertices and render as triangle fan” algorithm…

Rendering Using Stencil Technique

Take the following path as an example:

Convex Path (1)

Here we have a seemingly trivial path with only 4 points. To draw this with GL, you could just convert the path’s points to vertices and draw it as a triangle fan, which results in two triangles: Triangle 1: ABC and Triangle 2: ACD. The problem is that this just looks like a solid triangle, not the path we wanted:

Convex Path (2)

So, to overcome this difficulty, we drop to a 2-pass rendering method which uses the stencil buffer as a temporary scratchpad. So first off, we clear the stencil buffer to all zeros (represented as white):

Stencil Buffer (Clear)

Next, we set the stencil operation to invert, which means that instead of setting the stencil value to ‘1’ when a triangle touches a pixel, we invert the existing value. So 0->1 & 1->0. First we render the first triangle (ABC). As all the pixels are currently 0, every pixel touched by the triangle turns to 1 (represented as black):

Stencil Buffer (Triangle 1)

Next, we draw the second triangle (ACD). Note: We are inverting the stencil’s value, so black pixels touched by the second triangle turn to white and white pixels turn to black:

Stencil Buffer (Triangle 2)

So now the stencil buffer contains the silhouette of our path. All we do now is draw a rectangle into the destination window, but with the stencil test enabled.
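To make the two passes concrete, here is a small CPU-side sketch of the invert trick. It is not the engine's code: the 8×8 grid, the point-in-triangle test and the names stencilFill and insideTriangle are all assumptions made for illustration, and a real implementation would let the GPU do the rasterization with the stencil operation set to invert.

```cpp
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Signed area test: positive if p is to the left of the directed edge a->b.
static double edge(Point a, Point b, Point p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// True if p lies strictly inside triangle abc, whichever way it is wound.
static bool insideTriangle(Point a, Point b, Point c, Point p) {
    const double e0 = edge(a, b, p), e1 = edge(b, c, p), e2 = edge(c, a, p);
    return (e0 > 0 && e1 > 0 && e2 > 0) || (e0 < 0 && e1 < 0 && e2 < 0);
}

using Stencil = std::vector<std::vector<int>>;

// Pass 1 of the technique: draw the path as a triangle fan around path[0],
// inverting the stencil value of every pixel each triangle touches.
Stencil stencilFill(const std::vector<Point>& path, int width, int height) {
    Stencil stencil(height, std::vector<int>(width, 0));   // cleared to zero
    for (std::size_t i = 1; i + 1 < path.size(); ++i) {
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                const Point centre{x + 0.5, y + 0.5};       // sample at pixel centre
                if (insideTriangle(path[0], path[i], path[i + 1], centre))
                    stencil[y][x] ^= 1;                     // invert: 0->1, 1->0
            }
        }
    }
    return stencil;   // 1 = inside the path, ready for the stencil-tested rectangle
}
```

With the arrow-shaped path A(0,0), B(4,3), C(8,0), D(4,8), pixels in the notch below B are touched by both triangles and end up back at 0, while the rest of triangle ACD is touched once and stays at 1.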

In addition to the stencil technique, we are also adding experimental support for triangulating QPainterPaths and caching the triangulation. While this is slower for paths which change often or are zoomed in & out, paths which are relatively static can be triangulated once and rendered multiple times without having to re-triangulate.

Filling Primitives

Now we know how all the different QPainter operations get turned into GL primitives, but we’re still missing how they get filled. As already mentioned, the colour of a pixel is determined by the fragment shader. We therefore have lots of different fragment shaders for different types of fill. However, we also need to support text rendering with arbitrary fills (QPainter lets you fill text with a perspective transformed radial gradient). In the future, we also want to support composition modes which OpenGL doesn’t provide. We’ve also found there are ways we can simplify the shaders for certain situations (and thus improve performance). The result is that Qt needs lots of different shaders. At last count, we’d need over 1000 different shaders to cover all situations. That’s a lot of GLSL to maintain and test, far more than the resources we have available. So instead we split the shaders into different interchangeable “stages”. This is achieved by having each stage in its own GLSL function. As an example, let’s take regular, non sub-pixel anti-aliased text rendering with a transformed radial gradient. Note, this is just an example to demonstrate how the engine operates and you probably shouldn’t do it in performance critical situations.

We render gradients by pre-calculating a 1px high texture (like a 1D texture) on the CPU which we sample from in the fragment shader. However, we calculate the texture coordinates in the vertex shader and pass it to the fragment shader as a varying. This is because it’s a good idea to do as much work as possible in the vertex shader rather than the fragment shader as it is called so much less frequently.
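As a rough illustration of the look-up-table idea, the sketch below builds a 256-entry table for a two-stop gradient on the CPU and then "samples" it the way a fragment stage would. The table size, the two-stop restriction and the names buildGradientLut and sampleGradient are assumptions for this example, not the engine's actual code.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

struct Rgba { std::uint8_t r, g, b, a; };

constexpr std::size_t kLutSize = 256;   // assumed width of the 1px-high texture

// Linearly interpolate one 8-bit channel, rounding to nearest.
static std::uint8_t lerp8(std::uint8_t from, std::uint8_t to, double t) {
    return static_cast<std::uint8_t>(from + (to - from) * t + 0.5);
}

// Done once on the CPU whenever the gradient changes.
std::array<Rgba, kLutSize> buildGradientLut(Rgba start, Rgba end) {
    std::array<Rgba, kLutSize> lut{};
    for (std::size_t i = 0; i < kLutSize; ++i) {
        const double t = static_cast<double>(i) / (kLutSize - 1);
        lut[i] = Rgba{lerp8(start.r, end.r, t), lerp8(start.g, end.g, t),
                      lerp8(start.b, end.b, t), lerp8(start.a, end.a, t)};
    }
    return lut;
}

// Done per pixel: a clamped fetch, which is all the fragment stage has to do.
Rgba sampleGradient(const std::array<Rgba, kLutSize>& lut, double t) {
    t = std::max(0.0, std::min(1.0, t));
    return lut[static_cast<std::size_t>(t * (kLutSize - 1) + 0.5)];
}
```

The per-pixel cost is a single indexed read regardless of how many colour stops the gradient has, which is the reason for preferring the table over evaluating the gradient in the shader.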

As already mentioned, we render (non sub-pixel) anti-aliased text by using an 8-bit mask texture. We then multiply the fragment colour by a sample taken from this mask. So, if we’re on the edge of a glyph where the alpha value is <1, we adjust the alpha of the srcPixel by that amount (actually, we also adjust the RGB values too as we use pre-multiplied alpha pixel format internally).
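The arithmetic of that mask step is simple enough to show on the CPU. This is only a sketch of the idea: the PremulPixel struct and the mul255 helper are assumptions for illustration, and the real work happens in GLSL on floating-point colours.

```cpp
#include <cstdint>

// A pixel in premultiplied-alpha form: r, g and b are already scaled by a.
struct PremulPixel { std::uint8_t r, g, b, a; };

// Multiply two 8-bit values treated as fractions of 255, rounding to nearest.
static std::uint8_t mul255(std::uint8_t value, std::uint8_t factor) {
    return static_cast<std::uint8_t>((value * factor + 127) / 255);
}

// Modulate a source pixel by an 8-bit coverage sample from the glyph mask.
// Because the pixel is premultiplied, all four channels are scaled alike.
PremulPixel applyMask(PremulPixel src, std::uint8_t mask) {
    return {mul255(src.r, mask), mul255(src.g, mask),
            mul255(src.b, mask), mul255(src.a, mask)};
}
```

For example, opaque red (255, 0, 0, 255) under a mask value of 128 becomes (128, 0, 0, 128), which is still a valid premultiplied pixel at roughly half opacity.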

If there was a non-standard composition mode, we’d then pass the masked pixel to another stage which would blend it with the background (although this isn’t implemented yet!).

So the fragment shader has three different stages. The first stage (srcPixel) determines the brush colour of the fragment. The next stage (applyMask) modulates the pixel by a mask to achieve anti-aliased text rendering. The final stage (compose) then blends the pixel with the background. We also have a similar staging technique for the vertex shader. All this complexity is nicely abstracted by the QGLEngineShaderManager. The paint engine tells the shader manager what it wants to draw and the shader manager selects an appropriate set of shaders. One final note on this: While desktop OpenGL 2 supports linking multiple fragment shaders in a single program, OpenGL ES 2.0 does not. This means that we actually use the different stages by appending them to a single string of GLSL we pass to GL. This also gives the GL implementation the best chance to inline the different stages (without which, performance would suck).
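The string-appending approach can be sketched in a few lines of C++. The GLSL snippets below are made up for this example (they are not Qt's shader sources, and the uniform and varying names are hypothetical); only the stage names srcPixel and applyMask come from the description above. The mask sample assumes an alpha-only texture format.

```cpp
#include <string>

// Two interchangeable implementations of the srcPixel() stage.
const std::string kSrcPixelSolid =
    "uniform lowp vec4 fragmentColor;\n"
    "lowp vec4 srcPixel() { return fragmentColor; }\n";
const std::string kSrcPixelGradient =
    "uniform sampler2D gradientTexture;\n"
    "varying mediump vec2 gradientCoord;\n"
    "lowp vec4 srcPixel() { return texture2D(gradientTexture, gradientCoord); }\n";

// Two interchangeable implementations of the applyMask() stage.
const std::string kNoMask =
    "lowp vec4 applyMask(lowp vec4 src) { return src; }\n";
const std::string kApplyMask =
    "uniform sampler2D maskTexture;\n"
    "varying mediump vec2 maskCoord;\n"
    "lowp vec4 applyMask(lowp vec4 src)\n"
    "{ return src * texture2D(maskTexture, maskCoord).a; }\n";

// main() only knows the stage names, not which variant was chosen.
const std::string kMain =
    "void main() { gl_FragColor = applyMask(srcPixel()); }\n";

// Append the chosen stages into the single source string handed to GL.
std::string buildFragmentShader(const std::string& srcStage,
                                const std::string& maskStage) {
    return srcStage + maskStage + kMain;
}
```

Two variants per stage already give four complete shaders from five snippets; a compose stage would be appended in the same way, and the saving grows multiplicatively with every stage and variant added.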

Texture Management

The OpenGL paint engine makes heavy use of textures. For example, even though it’s perfectly possible to calculate colours for gradients in the fragment shader, we still use a texture as a look-up-table as it is so much faster. Repeatedly uploading textures every time we need them would ruin performance. So instead, we keep a per-context cache of what QPixmap/QImage is already present in texture memory. If two contexts are sharing then we also detect this and don’t duplicate the textures. This functionality is available publicly in QGLContext::bindTexture() too.
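A minimal sketch of such a cache, assuming that images are identified by a 64-bit key (in the spirit of QImage::cacheKey()) and that contexts which share textures are given the same share-group number. The class and its members are hypothetical and the actual GL upload is only indicated by a comment.

```cpp
#include <cstdint>
#include <map>
#include <utility>

using TextureId = unsigned int;
using ImageKey = std::int64_t;   // stand-in for an image's cache key
using ShareGroup = int;          // sharing contexts get the same group number

class TextureCache {
public:
    // Returns the texture for (group, key), uploading only on a cache miss.
    TextureId bind(ShareGroup group, ImageKey key) {
        const auto id = std::make_pair(group, key);
        const auto it = m_textures.find(id);
        if (it != m_textures.end())
            return it->second;            // already in texture memory
        // A real engine would create and upload a GL texture here.
        const TextureId texture = m_nextId++;
        ++m_uploads;
        m_textures.emplace(id, texture);
        return texture;
    }

    int uploads() const { return m_uploads; }

private:
    std::map<std::pair<ShareGroup, ImageKey>, TextureId> m_textures;
    TextureId m_nextId = 1;
    int m_uploads = 0;
};
```

Binding the same image twice from contexts in the same share group costs one upload; binding it from an unrelated context costs a second one.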

On Linux/X11 platforms which support it, Qt will use the glX/EGL texture-from-pixmap extension. This means that if your QPixmap has a real X11 pixmap backend, we simply bind that X11 pixmap as a texture and avoid copying it. You will be using the X11 pixmap backend if the pixmap was created with QPixmap::fromX11Pixmap() or you’re using the “native” graphics system. Not only does this avoid overhead but it also allows you to write a composition manager or even a widget which shows previews of all your windows.

Antialiasing

The OpenGL paint engine uses OpenGL multisampling to provide anti-aliasing. Typically, this will be 4x/8x FSAA, meaning 4/8 levels of coverage, which is worse quality than the raster engine, which always uses 256 levels of coverage. However, as the DPI of modern displays increases, you can get away with lower-quality anti-aliasing.

Using multisampling also doesn’t affect text rendering as text is anti-aliased using masks rather than multisampling (for smaller font sizes). So text rendered with the OpenGL engine should look almost as good as text rendered with the raster engine (which also does gamma-correction). The only drawback of using multisampling is that some OpenGL implementations don’t support switching multisampling off. Indeed, the OpenGL ES 2.0 specification doesn’t even provide the API to switch it off. The consequence is that non-anti-aliased (a.k.a. aliased) rendering can be broken (Everything gets anti-aliased even when the QPainter::Antialiasing hint isn’t set). There’s little we can do about this. :-(

Clipping

QPainter supports setting an arbitrary clip, including complex QPainterPaths. Qt uses the GL stencil buffer (or more specifically the lower 7 bits of the stencil buffer) to store the clip. The clip is written to in the same way as we render any other primitive, even using the stencil technique for complex paths. However, instead of filling pixel colours into a colour buffer, we fill stencil values into the stencil buffer. The actual value we use depends on the current QPainter stack depth (how many times save() was called minus the number of times restore() was called). This means that if you restrict yourself to intersect clips (Qt::ClipOperation == Qt::IntersectClip), the engine only needs to write to the part of the stencil buffer which is being clipped to. What’s more, the engine doesn’t need to write to the stencil buffer at all when you call restore() - it just changes the value at which the stencil test passes.

In addition to using the stencil buffer for clipping, the OpenGL paint engine can also just use glScissor. This only allows a single, untransformed rectangle to be used as the clip, which can be quite restrictive. However, it is by far the fastest way to do clipping. So if performance is more important to you than utility, only ever use untransformed rectangular clips.

Recommendations

Interleaved Rendering

Unlike OpenGL, QPainter allows an arbitrary number of rendering contexts (QPainters) to be active in the same thread at the same time. For example, in your widget’s paint event, you can begin a painter on your widget and begin another painter on a QPixmap and interleave rendering to them:

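void Widget::paintEvent(QPaintEvent*)
{
    // First painter: renders to the widget itself.
    QPainter widgetPainter(this);
    widgetPainter.fillRect(rect(), Qt::blue);

    // Second painter: opened on a pixmap while the first is still active.
    QPixmap pixmap(256, 256);
    QPainter pixmapPainter(&pixmap);
    pixmapPainter.drawPath(myPath);

    // Back to the widget painter; drawPixmap() takes the pixmap by reference.
    widgetPainter.drawPixmap(0, 0, pixmap);
}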

While this works ok with the OpenGL graphics system, having to switch from doing something with one painter to doing something with a different painter can be very costly and should be avoided whenever possible.

Mixing QPainter and Native OpenGL

As shown in several examples, it is possible to mix your own OpenGL rendering code with QPainter rendering code. However, as OpenGL is a giant state machine, it is very easy for you to accidentally clobber Qt’s GL state and vice-versa. To overcome this, we’ve added some new API to QPainter in Qt 4.6 - QPainter::beginNativePainting() and QPainter::endNativePainting(). To prevent artifacts, you must enclose your custom painting in beginNativePainting() and endNativePainting(). This is very important - even if you’re not seeing any problems now, you might find your code starts failing in a future Qt release in which the GL paint engine works slightly differently. Also, as beginNativePainting and endNativePainting set lots of OpenGL state, they can be quite expensive and thus you should try to use them sparingly. Try to batch up all your custom OpenGL code in a single block.

QGLWidget vs OpenGL Graphics-System

Unlike the raster & OpenVG paint engines, you don’t have to use a specific graphics system to render widgets using the OpenGL paint engine. The QtOpenGL module provides several classes, including QGLWidget, which all use the OpenGL paint engine regardless of what graphics system is being used. QGLWidget is basically a regular widget which always has a native window ID and is always rendered to using OpenGL. You are free to choose whichever method you want to get OpenGL rendering (graphics system or QGLWidget). However, using the opengl graphics system can often be slower than using a QGLWidget, as Qt needs the contents of the “back buffer” (or QWindowSurface) to be preserved when flushing the render to the window system. OpenGL does not guarantee this and it is often not the case so Qt has to use either an FBO or a PBuffer as the back buffer. When the render needs to be flushed, the FBO or PBuffer is bound to a texture, rendered into the window and then the GL buffers are swapped. This extra overhead is avoided by using a QGLWidget, however as a consequence, it is not possible to redraw a sub-region of a QGLWidget: Whenever a QGLWidget is updated, the entire widget must be re-drawn.

It should also be noted that using the OpenGL paint engine isn’t a silver bullet which makes everything faster. For example, the GL engine really sucks at drawing lots of small geometry with state changes between each drawing operation. While we’re working on improving that use case at the moment, the raster paint engine will probably always be faster just because it has so much less overhead. So QGLWidget might be a great way to get the best of both worlds when combined with the raster graphicssystem - Use QGLWidget for operations which GL excels at and the raster engine for everything else.

Tips for Performance (fps)

As a general rule of thumb, OpenGL state changes are expensive. So, use the knowledge you now have of what’s going on under QPainter and try to minimise the number of OpenGL state changes the paint engine needs to do. For example, if you implement a virtual keyboard, you now know that the engine uses a shader for text rendering and a different shader for pixmaps, so draw all the key pixmaps first, then draw all the text on top. That way, the engine only needs to change shaders twice per frame.

  • Never, ever use anything other than intersecting clips
  • Don’t switch render target in the middle of a render
  • Try to use untransformed rectangular clips whenever possible
  • Minimise changing the brush wherever possible
  • Render batches of primitives of the same types together.
  • Avoid drawing translucent pixels & blending (particularly important on mobile GPUs)
  • Try to cache QPainterPaths and re-use them rather than creating & discarding them in your paintEvent
  • Use QPainterPaths even when there’s a QPainter convenience function. E.g. Rounded rects and ellipses.
  • If you’re drawing lots of small pixmaps, try bunching them up into a single, larger pixmap
  • Prefer to use power-of-two (2^n) widths & heights for QImages and QPixmaps (128×256, 256×256, 512×512, etc)
  • If using QGLWidget and don’t need anti-aliasing, don’t enable sample buffers in the QGLFormat
  • If rendering complex QPainterPaths, try to only use odd-even fill rule
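
The virtual keyboard example from the start of this section can be put into numbers with a toy model. It assumes exactly one shader per primitive type and counts a change whenever the next draw needs a different shader from the one currently bound; the enum and function names are made up for this sketch.

```cpp
#include <vector>

enum class Shader { None, Pixmap, Text };

// Count how often the bound shader has to change for a given draw order.
// Nothing is bound at the start of the frame, so the first draw counts too.
int countShaderSwitches(const std::vector<Shader>& drawOrder) {
    Shader bound = Shader::None;
    int switches = 0;
    for (Shader needed : drawOrder) {
        if (needed != bound) {
            bound = needed;
            ++switches;
        }
    }
    return switches;
}
```

Drawing three keys as pixmap, text, pixmap, text, pixmap, text costs six changes in this model, while drawing the three pixmaps first and the three labels afterwards costs two, however many keys there are.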
Posted by Rhys Weatherley
 in Painting, Graphics, Performance
 on Monday, December 21, 2009 @ 04:34

In previous posts in this series, Gunnar has described the design and performance characteristics of the painting system in Qt, and explored Raster in greater depth.  In this post, I’m going to talk about the unique features of the OpenVG graphics system.

Paint Engine

Unlike the other engines, the OpenVG paint engine was much easier to implement because the OpenVG API itself is very close in functionality to QPainter.  You can read all about the specification on Khronos’ OpenVG Page, but here are the high points:

  • VGPath objects represent geometry made up of MoveTo, LineTo, and CubicTo elements.  This is a very close match for Qt’s QPainterPath and QVectorPath abstractions.
  • VGPaint objects represent brushes and pens for filling paths with pixel patterns.  Solid colors, linear gradients, radial gradients, and pattern brushes are supported, but not conical gradients.
  • VGImage objects represent pixmaps in a large variety of pixel formats.  OpenVG supports a lot more formats than OpenGL/ES which makes it a lot easier to convert QImage’s into VGImage’s.
  • VGFont objects (OpenVG 1.1 only) store glyphs represented as VGImage’s or VGPath’s for quicker rendering of text items.  Under OpenVG 1.0, we fall back to path drawing at present.
  • Scissor for rectangle-based clipping.
  • Alpha mask for clipping to arbitrary shapes.
  • Affine transformation matrices for path and glyph drawing, affine and projective transformation matrices for image drawing.

Transformation Matrices

OpenVG does not support projective transformation matrices for path drawing, which is annoying because QPainter allows any affine or projective QTransform to be used for any drawing operation.  There is a registered OpenVG extension called VG_NDS_projective_geometry but none of the OpenVG engines we have come across support it.  The reason why OpenVG doesn’t support it is because generating paint pixels in perspective can be quite difficult.  Projective matrices are supported for image drawing because drawing a simple image in perspective is a well-understood problem that OpenGL/ES systems do all the time.

When a projective transformation matrix is used for path drawing, we convert the path point-by-point using the QTransform and then draw it as a normal affine path using a default transformation for the window surface.  But what about the paint pixels?  Unfortunately, they won’t be perspective-correct.  In practice this isn’t a big problem because most paths are drawn with a solid color brush, and a solid color looks the same in perspective.
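The point-by-point conversion amounts to a 3×3 matrix multiply followed by a perspective divide. The sketch below uses the same row-vector convention and element names as QTransform (m11 to m33), but the Matrix3x3 struct and the mapPoint and mapPath functions are stand-ins written for this example. It only handles the points themselves and says nothing about how curve segments are treated.

```cpp
#include <vector>

struct PointF { double x, y; };

// [x' y' w'] = [x y 1] * M, with the elements named as in QTransform.
struct Matrix3x3 {
    double m11, m12, m13;
    double m21, m22, m23;
    double m31, m32, m33;
};

// Map one point and divide by w' to get back to 2D coordinates.
PointF mapPoint(const Matrix3x3& m, PointF p) {
    const double x = m.m11 * p.x + m.m21 * p.y + m.m31;
    const double y = m.m12 * p.x + m.m22 * p.y + m.m32;
    const double w = m.m13 * p.x + m.m23 * p.y + m.m33;
    return {x / w, y / w};
}

// Pre-transform every point of a path on the CPU; the result can then be
// drawn with a plain affine matrix.
std::vector<PointF> mapPath(const Matrix3x3& m, const std::vector<PointF>& path) {
    std::vector<PointF> mapped;
    mapped.reserve(path.size());
    for (const PointF& p : path)
        mapped.push_back(mapPoint(m, p));
    return mapped;
}
```

With m13 set to 0.5 and everything else as in the identity matrix, the point (2, 4) gets w' = 2 and lands on (1, 2).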

In general however, we discourage people from using projective transformations with paths.  If you really want to draw a scene in perspective, first draw it into a QPixmap and then draw the pixmap using a projective transformation.  You’ll probably want to do this anyway because perspective transformations mostly occur during “flip” animation effects - drawing every tiny path in perspective every frame during the flip would be too slow.

Path Transformation and Drawing

Most of the path transformation logic is done in vectorPathToVGPath() and painterPathToVGPath() in qpaintengine_vg.cpp. We detect the presence of affine vs projective transformation matrices and use an appropriate conversion. We convert both QVectorPath and QPainterPath using specialized routines. The other paint engines typically convert everything into a QVectorPath first. The QPainterPath conversion can improve performance slightly when arbitrary paths are drawn during SVG rendering and the like - there’s no point creating a QVectorPath if it is going to be quickly thrown away.

Path drawing takes a lazy update approach, attempting to minimize the number of OpenVG state changes from request to request:

  • If the draw requires a pen, then the penPaint object is updated with the current QPen if it was different from last time.
  • If the draw requires a brush, then the brushPaint object is updated with the current QBrush if it was different from last time.
  • The path transformation matrix is updated if it has changed since the last path drawing operation.
  • The path is drawn using vgDrawPath().

Most of the OpenVG state persists across paint events so if the same pen is used from one frame to the next, then it will be set once and never changed.  The state is also shared between all windows because there is only one OpenVG context for the entire system.

In an earlier version of the OpenVG paint engine, I just uploaded the state changes whenever they were made without trying to be lazy about it.  That was a mistake!  Applications that use QPainter, particularly those using QGraphicsView, can be very chatty - constantly saving and restoring the painter state.  It was quite common for brushes, pens, and transformation matrices to be changed, then changed again, without anything being drawn.  Now, it will only update the OpenVG state at the point where an actual drawing operation is about to happen.  This house-keeping does have a cost though, so if you can avoid unnecessary QPainter state changes in your application, then please do so.
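Here is the lazy pattern reduced to a single piece of state. The class is hypothetical and uses an integer id in place of a real QBrush; the comments mark where a real engine would talk to OpenVG.

```cpp
// Lazy brush handling: setBrush() only records the request, and the
// (expensive) upload happens at draw time, and only if something changed.
class LazyBrushState {
public:
    void setBrush(int brushId) { m_requested = brushId; }   // cheap bookkeeping

    void drawPath() {
        if (m_requested != m_uploaded) {
            // A real engine would update the VGPaint object here.
            m_uploaded = m_requested;
            ++m_uploads;
        }
        ++m_draws;   // stands in for vgDrawPath()
    }

    int uploads() const { return m_uploads; }
    int draws() const { return m_draws; }

private:
    int m_requested = 0;
    int m_uploaded = -1;   // nothing uploaded yet
    int m_uploads = 0;
    int m_draws = 0;
};
```

A chatty sequence such as setBrush(1), setBrush(2), setBrush(1), drawPath(), drawPath() results in one upload and two draws, where an eager implementation would have uploaded three times.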

Preallocated Paths

Rectangles, lines, points, and rounded rectangles feature quite heavily in many applications, with constantly changing co-ordinates.  OpenVG makes us create and destroy a VGPath every time.  To alleviate this, we’ve provided some pre-allocated paths for simple drawing operations, which we update with vgModifyPathCoords() rather than allocate GPU memory for a new path.  However, some chipsets can be slower at modifying a path than just making a new one!  On those chipsets, compile Qt with the QVG_NO_MODIFY_PATH macro.

Image Drawing

The best image drawing performance will be achieved with QPixmap rather than QImage.  With QPixmap, the image is converted into a VGImage once and then drawn multiple times.  With QImage, the image must be converted into a VGImage every time it is drawn.

The OpenVG drawing primitive vgDrawImage() is very primitive - it draws the selected VGImage at the origin of the current transformation.  There is no in-built support for sub-rectangle drawing. Fortunately, OpenVG has vgChildImage() which allows a sub-region to be quickly extracted, with the pixel data shared with the parent.  However, “quickly” is a very relative term - I’ve seen frame rates almost halve when using vgChildImage() compared to drawing a full image.  So if you can, draw entire QPixmap’s when using OpenVG and limit the use of sub-rectangles.

Another source of slowdown is drawing images with opacity.  OpenVG has a way to multiply a VGPaint object with a VGImage to produce a destination image.  This is a very cheap way to achieve opacity effects and is quite fast.  Except!  And there is always an Except!  Except when the image is drawn with a projective transformation matrix.  Remember - paint pixels cannot be generated in perspective - so we cannot use a paint object to generate an “opacity color” even though the solid opacity color will be the same in perspective from all angles!  This is very annoying - the OpenVG committee could have made a special exception for solid color VGPaint objects.

When the OpenVG paint engine draws an image with opacity, and a projective transformation is in effect, we have to generate a copy of the VGImage and use vgColorMatrix() to adjust the opacity.  This isn’t too bad if you are drawing the same image over and over with the same opacity, but it is very inefficient if you are animating the opacity.  So avoid opacity animations with OpenVG if you can.

Painting into a QPixmap isn’t currently accelerated with OpenVG - it uses the raster paint engine instead, so we recommend painting pixmaps once rather than constantly updating them.  We will be addressing this in future versions.  Even when we do implement painting into a pixmap, there will be a cost: switching rendering surfaces from a window to a VGImage and back again is not cheap - on some chipsets it can be as heavy as a full EGL context switch.  So try to avoid switching painting surfaces if you can.

Clipping

Clipping is the bane of my existence!  It seems so easy to application writers - set a clip rectangle and it will be efficient, drawing fewer pixels!  If only!

There are three techniques that can be used to achieve clipping with OpenVG:

  • Scissor rectangle list.
  • Alpha mask for arbitrary clip shapes.
  • Scissor rectangle for simple clips and alpha mask for complex clips.

The last is the default in the OpenVG paint engine, and there are #define’s that can be used to enable the other modes.  However, on some PowerVR chipset versions there is a bug where if the scissor is combined with the alpha mask, performance drops off a cliff - down to 2 frames per second in some cases!  So on such devices you may want to turn on scissor-only or mask-only clipping.

Better is to not use clipping at all if you can avoid it.  Draw everything in your scene in bottom-up order and let the GPU do the heavy lifting.  Remember, modern OpenGL/ES GPU’s can crank out thousands of triangles per second, with clever algorithms for hidden-surface removal that are much cleverer than anything you can do by setting a clip.  OpenVG uses the same GPU in many cases.  If you set a clip, you may end up confusing the GPU into taking a slower path internally than it would otherwise.

If you must clip, try to use single-rectangle regions that can be set via the scissor.

Window Surfaces

Below the OpenVG paint engine is the window surface logic in the graphics system.  This is usually where platform-specific customizations are required to get pixels onto the screen as fast as possible.  The QVGWindowSurface class wraps a QVGEGLWindowSurfacePrivate object, which provides the heavy lifting.  The default EGL implementation is QVGEGLWindowSurfaceDirect which writes pixels into the window back buffer and calls eglSwapBuffers() to transfer it to the screen.  It is possible to enable single-buffered operation with QVG_DIRECT_TO_WINDOW, but the cost may be tearing artifacts on-screen.

If your platform has some clever EGL extension mechanism for getting pixels onto the screen, then you will need to write a new graphics system plugin and implement your own QVGEGLWindowSurfacePrivate subclass.  The QtOpenVG module has been structured to make it relatively easy to do this without touching the core Qt code.

Memory Usage

Everything you do with graphics uses memory - in the CPU and in the GPU.  Window surfaces, VG rendering contexts, VGPath objects, VGImage objects, and so on.  It can get quite tight in the GPU on embedded systems.  We’ve taken some steps to manage this; e.g. destroying older VGImage objects when trying to upload a new QPixmap, and destroying all OpenVG objects when an application goes into the background to free up memory for foreground applications.

The more complex your application, the more likely it is that you’ll hit the GPU memory limit.  There’s only so much the QtOpenVG module can do for you.  We can take emergency measures to recover, but that’s about it.  So keep an eye on how many pixmaps and windows you have in use and see if you can simplify your application a little.  Definitely avoid uploading very large jpeg photographs as a single QPixmap - split them up into smaller “tiles” that can be released when GPU memory gets tight.
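Working out the tile rectangles is straightforward. The sketch below is one way to do it; the Tile struct, the function name and the tile size used in the example are arbitrary choices for illustration, and each rectangle would then be copied out of the source image into its own pixmap.

```cpp
#include <algorithm>
#include <vector>

struct Tile { int x, y, width, height; };

// Cover an image with square tiles; tiles on the right and bottom edges are
// trimmed so that no rectangle reaches outside the image.
std::vector<Tile> splitIntoTiles(int imageWidth, int imageHeight, int tileSize) {
    std::vector<Tile> tiles;
    if (tileSize <= 0)
        return tiles;
    for (int y = 0; y < imageHeight; y += tileSize) {
        for (int x = 0; x < imageWidth; x += tileSize) {
            tiles.push_back({x, y,
                             std::min(tileSize, imageWidth - x),
                             std::min(tileSize, imageHeight - y)});
        }
    }
    return tiles;
}
```

A 1000×600 image with 256-pixel tiles gives 4 columns and 3 rows, so 12 tiles, with the bottom-right one trimmed to 232×88.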

Summary

The following tips summarise the performance suggestions from the previous section:

  • Avoid projective transformation matrices with drawing paths.
  • Minimize state changes on pens, brushes, transforms, etc.
  • Use QPixmap in preference to QImage where possible.
  • Avoid drawing images using sub-rectangles.
  • When drawing images with opacity, use an affine transformation matrix, or only a single opacity level.
  • Avoid switching painting surfaces, particularly between windows and pixmaps.
  • Don’t use clipping if you can paint your scene in bottom-up order instead.
  • Split large images up into smaller pieces to avoid overloading GPU memory.

What’s Next?

There’s always more that can be done to improve any software system.  QtOpenVG is no different:

  • Painting into QPixmap’s using OpenVG.
  • Smarter VGImage pooling to deal with out of GPU memory conditions.
  • Qt/Embedded and Lighthouse screen drivers.
Posted by gunnar
 in Painting, Graphics Dojo, Performance
 on Friday, December 18, 2009 @ 09:21

Today’s topic is the raster engine, Qt’s software rasterizer. It’s the reference implementation and the only paint engine that implements all possible feature combinations that QPainter offers.

History

The story of Qt’s software engine started around December 2004, if my memory serves me. My colleague Trond and I had been working for a while on the new painting architecture for Qt 4, codenamed “Arthur”. Trond had been working on the X11 and OpenGL 1.x engines and I was focusing on the combined Win32 GDI/GDI+ engine along with QPainter and surrounding APIs. We had introduced a few new features, such as antialiasing, alpha transparency for QColor, full world transformation support and linear gradients. As few of these new features were supported by GDI, it meant that using any of these features implied switching to GDI+, which at the time was insanely slow, at least on all the machines we had in the Oslo office back then. Actually, enabling the GDI advanced graphics mode to do transformations was also not very fast.

Then we came upon this toolkit called Anti-Grain Geometry (AGG) which did everything in software, in plain C++, and we were just amazed at what it could do. Our immediate reaction was to curl up on the floor in agony, thinking that we were going about this all wrong. Using these native APIs was not helping us at all. In fact it was preventing us from getting the feature set we wanted with a performance that was acceptable. Once we settled down again, our first idea was to try to implement a custom AGG paint engine which would just delegate all drawing into the AGG pipeline. But alas, the template nature of the AGG API combined with the extremely generic QPainter API bloated up into a pipeline that didn’t perform nearly as well as the demos we had seen.

So we took our Christmas vacation and started over in January of 2005. Still quite depressed over the new feature set that didn’t perform combined with being limited by a minimal subset of native API’s, I went to Matthias and Lars and asked if I could get three weeks of time to hack together a software only paint engine as a proof of concept. I got an “OK” and spent the following weeks implementing software pixmap transformation, bi-linear filtering, clipping support in the crudest possible way and three weeks later I had a running software paint engine and quite proudly announced that I was “just about done”. I’ve reconstructed an image of how I remember it:

groupboxes

The system clipping was all over the place, bitmap patterns were broken, but perhaps worst of all, all text was rendered using QPainterPaths, and all drawing was antialiased. Despite it not looking 100% good, the performance of the various features was pretty ok. It was agreed that this was a good start, but that we needed a bit more work. And so started the sprint for the Qt 4.0 beta a few months later.

The initial version that was released with Qt 4.0 worked quite well in terms of features, but in hindsight the performance was far from what our users demanded from Qt. As a result, we harvested a lot of criticism over the first year of Qt 4.0. Since then, we’ve done a lot, and I mean a LOT, and my gut feeling is that it is the engine that performs the best for average Qt usage, so I think we made a good choice back then in dropping GDI and GDI+. And, as I outlined in my previous post, we are toying with making raster the default across all desktop systems for the sake of speed and consistency.

Overall structure

The overall structure of the engine is that all drawing is decomposed into horizontal bands with a coverage value, called spans. Many spans will together form the “mask” for a shape and each pixel that is inside the mask is filled using a span function.

antialiasing

The image highlights one scanline of a polygon which is filled with a linear gradient. There are 4 spans, one which fades in the opacity of the polygon and two which fade out the opacity of the gradient. For each pixel in the polygon, the gradient function is called and we write the pixel to the destination, possibly alpha blending it, if the coverage value is other than full opacity or if the pixel we got from the gradient function contains alpha.
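In code, filling one span boils down to the loop below. This is a simplified sketch rather than the engine's implementation: pixels are kept as four separate premultiplied channels instead of packed 32-bit values, the "gradient function" is a plain function pointer, and all the names are made up for the example.

```cpp
#include <cstdint>
#include <vector>

struct Premul { std::uint8_t a, r, g, b; };                  // premultiplied ARGB
struct Span { int x, y, length; std::uint8_t coverage; };    // one horizontal run

using SourceFn = Premul (*)(int x, int y);   // brush colour at a given pixel

// Multiply two 0..255 values treated as fractions of 255, rounding to nearest.
static std::uint8_t mul255(int value, int factor) {
    return static_cast<std::uint8_t>((value * factor + 127) / 255);
}

// Source-over for one premultiplied channel: src + dst * (1 - srcAlpha).
static std::uint8_t over(std::uint8_t src, std::uint8_t dst, int inverseAlpha) {
    return static_cast<std::uint8_t>(src + mul255(dst, inverseAlpha));
}

void fillSpan(std::vector<Premul>& image, int imageWidth,
              const Span& span, SourceFn source) {
    for (int i = 0; i < span.length; ++i) {
        const int x = span.x + i;
        Premul src = source(x, span.y);
        // Fold the span's coverage into the source pixel.
        src = Premul{mul255(src.a, span.coverage), mul255(src.r, span.coverage),
                     mul255(src.g, span.coverage), mul255(src.b, span.coverage)};
        Premul& dst = image[span.y * imageWidth + x];
        const int inverseAlpha = 255 - src.a;
        dst = Premul{over(src.a, dst.a, inverseAlpha), over(src.r, dst.r, inverseAlpha),
                     over(src.g, dst.g, inverseAlpha), over(src.b, dst.b, inverseAlpha)};
    }
}

// Example source: an opaque white brush.
Premul solidWhite(int, int) { return Premul{255, 255, 255, 255}; }
```

Filling a full-coverage span with solidWhite overwrites the destination, while a span with coverage 128 over opaque black leaves a mid grey that is still fully opaque.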

Clipping also uses the same mechanism. The span function for clipping takes the incoming spans, intersects them with the set of spans that defines the clip and calls the actual filling span function.

clipspans
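
The intersection step can be sketched like this. It is a deliberately naive version that compares every incoming span with every clip span; the Span struct and the function name are assumptions for this example, and a real implementation would presumably take advantage of the spans being sorted.

```cpp
#include <algorithm>
#include <vector>

struct Span { int x, y, length; unsigned char coverage; };

// Intersect the spans of a shape with the spans that define the clip.
// Overlapping runs on the same scanline survive, and their coverage values
// are multiplied so that antialiased clip edges are respected.
std::vector<Span> clipSpans(const std::vector<Span>& spans,
                            const std::vector<Span>& clip) {
    std::vector<Span> result;
    for (const Span& s : spans) {
        for (const Span& c : clip) {
            if (s.y != c.y)
                continue;
            const int left = std::max(s.x, c.x);
            const int right = std::min(s.x + s.length, c.x + c.length);
            if (left < right) {
                const int coverage = (s.coverage * c.coverage + 127) / 255;
                result.push_back({left, s.y, right - left,
                                  static_cast<unsigned char>(coverage)});
            }
        }
    }
    return result;   // these are then handed to the actual filling function
}
```

A shape span covering x = 2 to 12 on scanline 0, clipped against runs covering 0 to 5 and 8 to 10, comes out as two shorter spans on that scanline.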

All operations follow this pattern. When a drawRect call comes in, we generate a list of spans for each scan line and set up a span function according to the current brush. A pixmap is similar, we create a list of spans and use a pixmap span function. A polygon is passed to a scanconverter which produces a span list, etc. We have two scan converters, one for antialiased and one for aliased drawing. The antialiased one is pretty much a fork of FreeType’s grayraster.c, with some minor tweaks, I think we needed to add support for odd-even fills, for instance. Text is also converted into spans.

Lines, Polylines and Path Strokes

These primitives are passed to a separate processor called a stroker. The stroker creates a new path that visually matches the fillable shape that the outline represents. There is a public API for this too, in QPainterPathStroker. This fillable shape is then passed to one of the scan converters which in turn scan converts the shape into spans. For dashed outlines, the same process happens, and the resulting fillable shape is a path with a potentially very large number of subpaths. Naturally, such a path is costly to scan convert, which is part of the reason why we explicitly do not put dashed lines on the list of high-performance features. In fact, in many cases, line dashing is one of the slowest operations available in the raster engine, so use it with extreme caution.

A hacky alternative which performs much better is to set a 2×2 black/white or black/transparent pixmap brush and draw the stroke using a pen with that brush. A bit more to set up, but if that’s what it takes to get it running fast, then that’s what it takes.
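
As a rough illustration of why the pixmap brush trick is cheap, here is a sketch of filling one span from a repeating 2×2 tile. The pattern comes from a per-pixel lookup into the tile, so the outline can stay a single solid shape instead of one subpath per dash. The function name and buffer layout are assumptions for the example, not the engine's actual span function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fill `len` pixels starting at (x, y) from a 2x2 tile that repeats in both directions.
// `dest` is a row-major 32-bit buffer that is `width` pixels wide (illustrative layout).
void fillSpanTiled2x2(std::vector<std::uint32_t> &dest, int width,
                      int x, int y, int len, const std::uint32_t tile[2][2])
{
    assert(x >= 0 && y >= 0 && len >= 0 && x + len <= width);
    assert(static_cast<std::size_t>(y + 1) * width <= dest.size());
    std::uint32_t *row = dest.data() + static_cast<std::size_t>(y) * width;
    const std::uint32_t *tileRow = tile[y & 1]; // tile row alternates per scanline
    for (int i = x; i < x + len; ++i)
        row[i] = tileRow[i & 1];                // tile column alternates per pixel
}
```

With a black/white checkerboard tile this produces a dotted look at the cost of a plain fill.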

State changes

Any setBrush, setTransform or other state change on QPainter will result in a different set of span functions being set up. Each brush (or fill type, if you like, since pens on this level are essentially just fills too) has a special span function associated with it, and we also pass per-brush span data. For solid color fills the span data contains the color; for transformed pixmap drawing it contains the inverse matrix, a source pixel pointer, bytes per line and other required information. For clips it contains the span function to call after the spans have been clipped. The thing to notice about state changes is that each time you switch from one brush to another or from one transformation to another, these structures need to be updated. Up to Qt 4.4, this was in many cases a noticeable performance problem, showing up as 10-15% in profilers when rendering graphics view scenes, but since 4.5 the impact of this is minimal.

Well, perhaps not minimal compared to drawing a 2 pixel long line, but minimal compared to filling a 64×64 rectangle. The point is that though the raster engine is the engine that probably handles state changes best of all our engines, there are some usecases where it still shows up, and it should still be minimized.

Span functions

The task of the span functions is to generate a pixel and combine it with the destination according to the current state of the painter. Though the raster engine supports rendering to any of our image formats except 8-bit indexed, it will internally do all rendering in ARGB32_Premultiplied. Premultiplied alpha has the benefit that we don’t have to multiply the alpha into the color channels and it saves us a division in the blending. The reason for doing all rendering in one format is that the alternative simply doesn’t scale. Just think of the combination of composition modes multiplied with the number of image formats a source image can have multiplied with what formats the destination can have. To support all combinations we have a generic approach where we for each span do:

  • Get the source pixels, e.g. from a gradient, pixmap, image or solid color, and convert them to ARGB32_Premultiplied.
  • Get the destination pixels and convert them to ARGB32_Premultiplied.
  • Blend the source into the destination using the current composition mode.
  • Convert the result to the destination format and write it back.
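
The generic steps above can be sketched roughly as follows for the SourceOver mode. This is a simplified illustration with hypothetical helper names, not the code in the engine: it assumes 0xAARRGGBB pixels, a destination that is already premultiplied (so the destination conversions are no-ops) and full coverage.

```cpp
#include <cassert>
#include <cstdint>

// Rounding division by 255 for 8-bit channel math.
inline std::uint32_t div255(std::uint32_t v) { return (v + 127) / 255; }

// Convert one non-premultiplied ARGB32 pixel to ARGB32_Premultiplied.
inline std::uint32_t premultiply(std::uint32_t p)
{
    const std::uint32_t a = p >> 24;
    const std::uint32_t r = div255(((p >> 16) & 0xff) * a);
    const std::uint32_t g = div255(((p >> 8) & 0xff) * a);
    const std::uint32_t b = div255((p & 0xff) * a);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

// SourceOver on premultiplied pixels, per channel:
// result = src + dst * (255 - srcAlpha) / 255
inline std::uint32_t sourceOver(std::uint32_t src, std::uint32_t dst)
{
    const std::uint32_t inverseAlpha = 255 - (src >> 24);
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const std::uint32_t s = (src >> shift) & 0xff;
        const std::uint32_t d = (dst >> shift) & 0xff;
        std::uint32_t c = s + div255(d * inverseAlpha);
        if (c > 255)
            c = 255; // guard against rounding overshoot
        result |= c << shift;
    }
    return result;
}

// Blend one span of premultiplied source pixels into a premultiplied destination.
void blendSpan(std::uint32_t *dst, const std::uint32_t *src, int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; ++i)
        dst[i] = sourceOver(src[i], dst[i]);
}
```

Note that the blend itself needs no division by alpha and no multiplication of the source color by alpha, which is the benefit of premultiplied pixels mentioned above.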

This may seem like a lot of work, so luckily the story doesn’t end there.

Special casing and Optimizations

As I outlined in the QPainter documentation patch that I added recently, which was the start of this blog series, it’s all about defining which scenarios we want to be fast and which scenarios we just need working. Over the years since the initial release of the raster engine in the summer of 2005, we’ve added tons of special cases to support what we experience as the functions that are called the most and which have the most impact.

  • First of all, if you look at the things we do for each span above, you see that we convert into ARGB32_Premultiplied. Solid colors are easy to represent, gradients are generated in this format directly, so conversion only happens for images and pixmaps. If the image is ARGB32_Premultiplied, then no conversion is needed, and we just use the scanline pointer directly, without any copying. Our RGB32 format is specified to be 0xffRRGGBB, with the alpha set to 0xff. This means it is pixel-wise compatible with ARGB32_Premultiplied, which again means that it can also be used directly. If the source is ARGB32, you’ll get a memcpy for each scanline where the ARGB32 data is copied into a temporary buffer and converted to ARGB32_Premultiplied. What can you read from that? Do not draw ARGB32 images into the raster engine. Secondly, don’t open a painter on an ARGB32 image, as that implies the exact same cost, but when reading and writing the destination pixels. Now you know why QPixmaps prefer to be in these formats too.
  • Source composition modes are special cased for most operations. For instance, we don’t read the destination for source operations because we know there is no blending involved, unless the spans have partial coverage that is. This means that Source is effectively just a memory write.
  • SourceOver is usually special cased to be inlined and merged with the coverage opacity, so it is also usually faster than the other composition modes. As for the other optimizations down below, these only hold for Source and SourceOver, so if you want best performance, make sure that this is what you are using. SourceOver is the default in QPainter, by the way.
  • For gradients and pixmaps, we need to create an array of source data. For solid colors, it’s just a single pixel, so this is faster. Solid color fills also benefit from only having to traverse memory for the destination, where you write to, so the cache misses are significantly reduced.
  • Rectangle fills are very common, both through QPainter::fillRect and through QPainter::drawRect. In 4.4 both of these implied a state change. Actually, fillRect implied two state changes because it set the brush to what was passed to fillRect and then set it back to what the painter state was. In 4.5, as part of this Falcon project, we introduced a new internal QPaintEngine subclass which supports a state-less fillRect with a color. This matches how applications normally use the painter anyway.
  • In addition to being stateless, the fillRect function is special cased for a number of use cases. For instance, for RGB16, we write two pixels at a time, and for Intel machines there is an SSE/MMX optimized version. The special cased fillRect also has the benefit that it doesn’t require spans; it’s just a tight 2D for loop, which also saves us quite a bit of work, at least if the spans are short.
  • Duff’s Device. I cannot take credit for its addition, but it’s used in a lot of different places in the raster engine today. It’s all about loop unrolling. If you’re not familiar with it yet, read up on it. It’s a beautiful abuse of the C++ language to make things potentially faster.
  • Rectangular clipping is also special cased, at least as long as there is no transformation set on the painter. Translate is of course special cased, but scaling and rotating disable this optimization. The benefit we get from doing rectangular clipping is that finding the spans to fill is done on the QRect level, rather than on the per-span level, which makes it significantly faster.
  • So if you have Source or SourceOver, a non-perspective, non-smooth transform and the clip is a rectangular clip, you also get the benefit of our pixmap blend functions. These were added in Qt 4.5 and are the reason why pixmap drawing is quite a bit faster now than in the earlier versions. In Qt 4.5, we had blend functions for scale and translate only, and in Qt 4.6 we added rotations to the list as well. Again, we focus on a selected subset of formats, matching what QPixmap will be using. We only have these for:
    • ARGB32_Premultiplied on ARGB32_Premultiplied
    • ARGB32_Premultiplied on RGB32
    • ARGB32_Premultiplied on RGB16
    • ARGB8565_Premultiplied on RGB16
    • RGB32 on RGB32
    • RGB16 on RGB16

    I think that was all of them.

  • The outlines are processed via the stroker in the general case. However, there are again a number of special cases where we drop to doing a midpoint-algorithm instead. Lines, polylines and paths that only contain line segments will be rendered using the fast midpoint approach as long as the pen width is equal to or less than 1. We also support dashing line segments for 1 pixel wide lines using this method. For any pen width greater than 1, curved paths or antialiasing, we drop to the stroker approach which works, but is far less optimal. Actually, I think there is a special-case for antialiased dashed lines too, as long as they are thin.
  • When antialiasing is enabled, we often need to fall back to the stroker for outlines, which is quite a bit slower than the plain case. In addition to that, a lot more spans are generated for antialiased content, due to the fade-in, fade-out effect on the edge of the primitive, so expect antialiasing to be a significant cost.
  • Text drawing is since 4.5 highly optimized for most engines, to the point where the major bottleneck these days is in doing the actual text layout on the string. We’re working on an API to cache this, so text drawing can be made truly fast, but based on the current API, it’s as good as it gets. However, if the transformation is a rotate/scale, then we fall back to path drawing. Only the Windows version of the raster engine supports drawing glyphs at rotated angles using the fast paths, so beware of that.

    A lot of details, but it gives an idea of what to consider when you write code for this engine. If all you are drawing is 1024×1024 pixmaps, then none of these things matter because all the time is spent in the span function that does pixmap blending anyway, but the second you have more content, several lines, several polygons, which are smaller in size, then these things are critical to achieve good performance.
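
For readers who have not seen Duff’s device or a span-less rectangle fill, here is a small sketch combining the two ideas from the list above. It is not the code in the raster engine; the function names and the `pixelsPerLine` stride parameter are assumptions for the example, and it only handles 32-bit pixels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fill `count` pixels with `color` using Duff's device: the switch jumps into the
// middle of an 8x unrolled loop, so the remainder is handled during the first pass.
void fillDuff(std::uint32_t *dest, int count, std::uint32_t color)
{
    if (count <= 0)
        return;
    int passes = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *dest++ = color;
    case 7:      *dest++ = color;
    case 6:      *dest++ = color;
    case 5:      *dest++ = color;
    case 4:      *dest++ = color;
    case 3:      *dest++ = color;
    case 2:      *dest++ = color;
    case 1:      *dest++ = color;
            } while (--passes > 0);
    }
}

// Stateless rectangle fill as a tight 2D loop, with no span list involved.
// `pixelsPerLine` is the row stride of `buffer` in pixels (hypothetical name).
void fillRect(std::uint32_t *buffer, int pixelsPerLine,
              int x, int y, int w, int h, std::uint32_t color)
{
    assert(x >= 0 && y >= 0 && w >= 0 && h >= 0 && x + w <= pixelsPerLine);
    for (int row = y; row < y + h; ++row)
        fillDuff(buffer + static_cast<std::size_t>(row) * pixelsPerLine + x, w, color);
}
```

Whether the manual unrolling actually beats what a modern compiler generates for a plain loop is something to measure rather than assume.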

    The overall performance of the engine, when used according to how it’s outlined above, can be thought of as:

    Overhead + O(pixelsTouched * memoryAndBusCapacity)

    There is nothing scientific about that formula, but when you’re hitting the optimal path, all time should be spent in one of the many for loops inside qdrawhelper_xxx.cpp or even better qblendfunctions.cpp. These loops will spend all their time on per pixel processing. If these functions could be made faster by doing the algorithms slightly differently, then great, but if you see in your profiling that all time is spent in for instance qt_blend_argb32_on_argb32, then that means you told us to blend alpha pixmaps together and we’re doing that as fast as we can and you have zero loss between your app and actual processing. If all time is spent processing pixels, then that is a good thing. The overhead here is the time spent in state changes, function call overhead, and similar.

    Some numbers

    I got some feedback on one of the previous blogs that a few bar charts would be nice, so I’ll post some numbers on what kind of throughput is possible with the raster paint engine. I’ve timed it on both my Windows desktop machine and on my N900 to get a comparison. The operations range from several million per second to only a few hundred, so the scale is logarithmic; keep that in mind as you look at them.

    Raster Results

    As you can see, the fill-rate is more or less tied to the number of pixels involved. For some operations it takes a little bit longer to do something, like drawPixmap with scaling is somewhat slower than drawPixmap without, but you see that the rough formula I gave above holds quite often. Double the size of the primitive in each direction and you have one quarter the performance. It was also not my intention to trick you with using different numbers for drawPixmap, it’s just how the test was set up.

    If you compare the three 4×4 rectangle drawing versions, you see that they differ when the rectangles are small. drawRect without brush change is fastest at around 7.4Mops/sec, followed by fillRect at ~6.1Mops/sec and then drawRect with brush change at 1.8Mops/sec. At 128×128 there is just a little difference between them, which is what I was getting at with the state changes above. It is possible to do them and if you’re drawing semi-large areas, it doesn’t matter, but if you’re plotting pixels, doing loads of small lines here and there or particle effects with 8×8 pixmaps, then you want to do that in a tight loop with nothing else happening.

    You can also see that the speed of non-smooth scaling is holding its own vs non-scaled pixmap drawing.

    Finally, if you compare the N900 to the desktop Windows machine, you see that despite the Windows machine only having a 4 times faster processor, the N900 is often around 10 times slower. Why? Because the CPU isn’t the only limitation; bus/memory capacity is also a limiting factor, and to be honest it’s not a fair comparison…

    I hope you enjoyed this post and more will come in 2010.

    Posted by gunnar
     in Painting, Graphics Dojo, OpenGL, Performance
     on Wednesday, December 16, 2009 @ 06:54

    For this blog series that I’m doing, I figure it’s nice to start with an overview of the whole painter, pixmaps, widgets, graphicsview, backingstore idea.

    At the centre of all Qt graphics is the QPainter class. It can render to surfaces, through the QPaintDevice class. Examples of paint devices are QImages, QPixmaps and QWidgets. The way it works is that for a given QPaintDevice implementation we return a custom paint engine which supports rendering to that surface. This is all part of our documentation so perhaps not too interesting. Let’s look at this in more detail.

    QWidgets and QWindowSurface

    Even though QWidget is a QPaintDevice subclass, one will never render directly into a QWidget’s surface. Instead, during the paintEvent, the painting is redirected to an offscreen surface which is represented by the internal class QWindowSurface. This was traditionally implemented using the QPainter::setRedirected(), but has since been replaced by an internal mechanism between QPainter and QWidget which is slightly more optimal.

    Sometimes we refer to this surface as “the backingstore”, but it really is just a 2D surface. If you ever looked through the Qt source code and found a class QWidgetBackingStore, this class is responsible for figuring out which parts of the window surface need to be updated prior to showing it to screen, so it’s really a repaint manager. When the concept of backingstore was introduced in Qt 4.1, the two classes were the same, but the introduction of more varying ways to get content to screen made us split it in two.

    In the old days widgets were rendered “on screen”. Though the option to paint on screen is still available, it is not recommended to use it. I believe the only system that remotely supports it is X11, but it is more or less untested and thus often causes artifacts in the more complex styles. Setting the flag Qt::WA_PaintOnScreen means that the repaint manager inside Qt ignores that widget when repainting the window surface and instead sends a special paintEvent to that widget only. Prior to Qt 4.5 there was a significant speed gain to be had when 10-100 widgets updated at max fps, but in Qt 4.5 the repaint manager was optimized to handle this better, so on-screen painting is usually worse than buffered.

    Back to the window surface. All widgets are composited into the window surface top to bottom and the top-level widget will fill the surface with its background or with transparent pixels if the Qt::WA_TranslucentBackground attribute is set. All other widgets are considered transparent. A label only draws a bit of text, but doesn’t touch anything else. What that means for the repaint manager is that every widget that overlaps with the label, but stacks behind it, needs to be drawn before it. If the application knows that a certain widget is opaque and will draw every single pixel for every paint event, then one should set the Qt::WA_OpaquePaintEvent attribute, which causes the repaint manager to exclude the widget’s region when painting the widgets behind it.

    Since all widgets are repainted into the same surface, we need to make sure that widgets don’t accidentally paint outside their own boundaries and into other widgets. Since there is no guarantee that widgets will paint inside their bounds, this could potentially lead to painting artifacts, so we set up a clip behind QPainter’s back called the “system clip”. For most widgets the system clip is a rectangle and looking at the performance section of the QPainter docs, we see that that is not so bad. Rectangular clips, when pixel aligned, are fast. A masked widget, on the other hand, is a performance disaster. It is slower to set up and slower to render. The system clip is the same clip that is passed to the paint event, except that the clip in the paint event has been translated to be relative to the top-left of the widget, rather than to the top-left of the surface. Do NOT set the paint event’s region as a clip on the painter. It is already set up, and we don’t detect that it is the exact same region and just process it fully again. The purpose of the region/rect in the paint event is so that widgets can decide to not draw certain parts. This is primarily useful when you have big scenes in the widgets, such as a map application, graphics view or similar.
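
To illustrate using the paint event’s rect to skip work rather than to set a clip, here is a small sketch in plain C++ for a view that draws a grid of equally sized tiles. `Rect` is a stand-in for QRect, and `TileRange` and `tilesToPaint` are hypothetical names for the example. In a real widget the exposed rectangle would come from QPaintEvent::rect().

```cpp
#include <algorithm>
#include <cassert>

// Minimal stand-in for QRect: top-left corner plus size, in pixels.
struct Rect { int x, y, w, h; };

// Inclusive range of tile columns and rows; empty when first > last.
struct TileRange { int firstCol, lastCol, firstRow, lastRow; };

// Work out which tiles of a cols x rows grid (each tileSize x tileSize pixels)
// intersect the exposed rectangle, so only those need to be drawn.
TileRange tilesToPaint(const Rect &exposed, int tileSize, int cols, int rows)
{
    assert(tileSize > 0 && cols >= 0 && rows >= 0);
    // Clamp the exposed rectangle to the area covered by the grid.
    const int left = std::max(exposed.x, 0);
    const int top = std::max(exposed.y, 0);
    const int right = std::min(exposed.x + exposed.w, cols * tileSize);
    const int bottom = std::min(exposed.y + exposed.h, rows * tileSize);
    if (left >= right || top >= bottom)
        return {0, -1, 0, -1}; // nothing to paint
    return {left / tileSize, (right - 1) / tileSize,
            top / tileSize, (bottom - 1) / tileSize};
}
```

The paint code can then loop over just that range and issue one drawPixmap per tile, leaving the clip that Qt already set up untouched.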

    In addition to the system clip which is set up prior to calling paintEvent, the painter also needs to be in a clean state, which means setting up brushes, pens, fonts and others. It’s not a huge amount, but if you have many widgets it adds up. So, though widgets are no longer native window handles (aka Alien), there is still a price tag involved in repainting them. Be aware of that when you design your application. For instance, implementing a photo gallery using QLabels with pixmaps in a QScrollArea doesn’t scale. You would have to set up clipping and all the other states per label, even though the label only draws a pixmap. A single “view” widget would scale much better, because the widget can then implement a tight loop that draws pixmaps in the right places.

    This whole backingstore and window surface logic only holds for Mac OS X when the raster or opengl graphics systems are used. Personally I would strongly recommend using raster: it implements the full feature set, it is often faster, has the same performance profile as Qt on Windows, and painting bugs are prioritized higher for raster than for the CoreGraphics backend. In qt/main we plan to switch the default for Mac OS X to raster; we just have to iron out some window system integration issues.

    Graphics systems

    The concept of a graphics system was introduced in Qt 4.5. The idea is to be able to select at startup time, on an application level, what kind of graphics stack you should be using. The graphics system is responsible for creating the pixmap backends and the window surface. We currently have graphics systems for raster, OpenGL 1.x, OpenGL/ES 2.0, OpenVG and X11. You can select a graphics system by starting the application with the command line option -graphicssystem raster|opengl|opengl1|x11|native, where “native” means to use the system default. Another option is to pass the exact same option to configure, which will set that option for all applications using Qt. Finally there is the function QApplication::setGraphicsSystem, which hardcodes the graphics system for a given application.

    In later blogs, we plan to go into each of the paint engines in more detail, but for now, let’s just look at the highlights.

    Raster

    The raster graphics system is the reference implementation of QPainter. It implements all the features we specify and does it all in software. When a new port is started, such as with S60, we usually start with getting raster running. It is currently the default on Windows, Embedded, S60 and will also be on Mac OS X.

    Just a thought. What do you think of raster on X11? If you ignore for a second that you currently get a process-local font cache, it performs quite nicely on X11 and I’ve seen many people switch to it at runtime. If we consider remote displays, this seems daunting, but it still may not be too bad. The way it works in the X11 paint engine today is that any gradient and pixmap transform is done in software anyway and uploaded as an image on a per painter-command level. Why not just do it all client side and upload only the parts that need updating? We can watch HD videos (for some definition of HD, anyway) on youtube, so certainly we can afford to upload a few pixels. This is bound to generate comments on XRender and server-side gradients and transforms, but these have been tried numerous times and the performance is simply not good enough.

    The window system integration is handcoded for each platform to make the most out of it. On Windows the window surface is a QImage which shares bits with a DIBSECTION, which results in pretty good blitting speed. On X11 we use MIT Shared Memory Images. We used to use Shared Memory Pixmaps, but support for those was removed from Xorg. We got this awesome patch from the community, though, so we’re back up and running. On Mac OS X, we’re experimenting with using GL texture streaming for getting the backbuffer to screen and we’re seeing some promising numbers with that, so I hope that will make it into Qt 4.7 too.

    Because it is just an array of bytes, most native API’s have the ability to render into the same buffer we do. This makes integration with native theming quite straightforward, which is one of the reasons why this is attractive as a default desktop graphics system, despite not being hardware accelerated.

    OpenGL

    We have two OpenGL based graphics systems in Qt. One for OpenGL 1.x, which is primarily implemented using the fixed functionality pipeline in combination with a few ARB fragment programs. It was written for desktops back in the Qt 4.0 days (2004-2005) and has grown quite a bit since. You can enable it by writing -graphicssystem opengl1 on the command line. It is currently in life-support mode, which means that we will fix critical things like crashes, but otherwise leave it be. It is not a focus for performance from our side, though it does perform quite nicely for many scenarios.

    Our primary focus is the OpenGL/ES 2.0 graphics system, which is written to run on modern graphics hardware. It does not use a fixed functionality pipeline, only vertex shaders and fragment shaders. Since Qt 4.6, this is the default paint engine used for QGLWidget. Only when the required feature set is not available will we fall back to using the 1.x engine instead. When we refer to our OpenGL paint engine, it’s the 2.0 engine we’re talking about.

    We’ve wanted to have GL as a default graphics system on all our desktop systems for a while, but there are two major problems with it. Aliased drawing is a pain: it is close to impossible to guarantee that a line goes where you want it for certain drivers. Integration with native theming is a pain too. It is rarely possible to pass a GL context to a theming function and tell it to draw itself, hence we need to use temporary pixmaps for style elements. On Mac OS X, there is a function to get a CGContext from a GL context, but we’ve so far not managed to get any sensible results out of it. On the other hand, much of the UI content doesn’t depend on these features, which makes GL optimal for typical scene rendering, such as the viewport of a QGraphicsView or a photo gallery view. So as far as how the default setup in Qt will look in the future, we’re considering that the best default setup for desktop may be a combination of raster for the natively themed widgets and GL for one or two high-performance widgets. Nothing is decided on this topic though, we’re just looking at alternatives.

    Another problem with using GL by default is font sharing. With raster we could theoretically share pre-rendered glyphs between processes in a cross platform manner using shared memory, with GL this becomes a bit more difficult. On X11, there is an extension to bind textures as XPixmaps which can be shared across processes, but this will usually force the textures into a less optimal format which makes them somewhat slower to draw, so it is still not optimal. On Windows, Mac OS X, S60 or QWS, we would need driver-level support for sharing texture ids, which we currently don’t have.

    OpenVG

    I’m actually quite blank in this area. I’ve not been involved with writing it nor getting it up and running. It sits on top of EGL which makes it quite similar to the OpenGL graphics systems. We expect that OpenVG will be used in a number of mid-range embedded devices.

    The cool thing about OpenVG is that it matches the QPainter API quite nicely. It supports paths, pens, brushes, gradients and composition modes, so in theory, the vectorial APIs should run optimally.

    Rhys, who wrote the OpenVG paint engine, plans to do a post on the OpenVG paint engine’s internals in full in the near future.

    Images and Pixmaps

    The difference between these two is mostly covered in the documentation, but I would like to highlight a few things nonetheless.

    Our documentation says: “QImage is designed and optimized for I/O, and for direct pixel access and manipulation, while QPixmap is designed and optimized for showing images on screen.”

    Raster

    When using the raster graphics system, pixmaps are implemented as a QImage, with a potentially significant difference. When converting a QImage to a QPixmap, we do a few things.

    The image is converted to a pixel format that is fast to render to the backbuffer, meaning ARGB32_Premultiplied, RGB32, ARGB8565_Premultiplied or RGB16. When images are loaded from disk using the PNG plugin or when they are generated in software by the application, the format is often ARGB32 (non-premultiplied) as this is an easy format to work on, pixel-wise. I’ve measured ARGB32_Premultiplied onto RGB32 to be about 2-4x faster than drawing an ARGB32 non-premultiplied depending on the usecase.

    Secondly, we check the pixel data for transparent pixels and convert it to an opaque format if none are found. This means that if a “.png” file is loaded as ARGB32 from disk, but only contains opaque pixels, it will be rendered as an RGB32, which is also about 2-4x faster.
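
The opaque check described above could look something like the following sketch. It is an illustration under assumed names (`PixelFormat`, `hasTransparentPixels` and `chooseRenderFormat` are not Qt API) and assumes 0xAARRGGBB pixels in one contiguous buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The two 32-bit targets discussed above; a hypothetical enum for the sketch.
enum class PixelFormat { RGB32, ARGB32_Premultiplied };

// Returns true if any 0xAARRGGBB pixel has an alpha value below 0xff.
bool hasTransparentPixels(const std::uint32_t *pixels, std::size_t count)
{
    assert(pixels != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        if ((pixels[i] >> 24) != 0xff)
            return true; // early out on the first non-opaque pixel
    }
    return false;
}

// Pick the cheaper opaque format when the data allows it.
PixelFormat chooseRenderFormat(const std::uint32_t *pixels, std::size_t count)
{
    return hasTransparentPixels(pixels, count) ? PixelFormat::ARGB32_Premultiplied
                                               : PixelFormat::RGB32;
}
```

The scan touches every pixel once at conversion time, which is paid back each time the pixmap is drawn through the faster opaque path.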

    OpenGL

    When using the OpenGL graphics system the actual implementation of the QPixmap varies a bit from setup to setup. The most ideal option gets enabled when your GL implementation supports Frame Buffer Objects (FBOs) in combination with the GL_EXT_framebuffer_blit extension. In this case, the pixmap is represented as an OpenGL texture id, and whenever a QPainter is opened on the pixmap we grab an FBO from an internal pool and use the FBO to render into the texture.

    Without these extensions available, which is typically the case for OpenGL/ES 2.0 devices, the implementation is a QImage (in optimal layout, same as raster) which is backed by a texture id. When you open a QPainter on the pixmap, you render into the QImage and when the pixmap is drawn to the screen, the texture id is used. Internally there is a syncing process between the two representations, so there will be a one-time hit of re-uploading the texture after drawing into it.

    In general

    If you intend to draw the same QImage twice, always convert it to a QPixmap.

    There are some use cases where QPixmap is potentially worse though. We have these functions, QPixmap::scaled(), QPixmap::transformed() and friends, which historically are there because we wanted QImage and QPixmap to have similar APIs. We have support for reimplementing this functionality on a per pixmap-backend basis, but currently no engine does this, so for the GL case, or X11 for that matter, calling QPixmap::transformed() implies a conversion from QPixmap into QImage, a transformation in software, and then a conversion back to the original format.

    By default a QPixmap is treated as opaque. When doing QPixmap::fill(Qt::transparent), it will be made into a pixmap with alpha channel which is slower to draw. If the pixmap is going to end up as opaque, initialize it with QPixmap::fill(Qt::white). You can even skip the initialization step altogether when you know that all pixels will be written as opaque when the pixmap is painted into.

    Before moving on to something else, I’ll just give a small warning on the functions setAlphaChannel and setMask and the innocent-looking alphaChannel() and mask(). These functions are part of the Qt 3 legacy that we didn’t quite manage to clean up when moving to Qt 4. In the past the alpha channel of a pixmap, or its mask, was stored separately from the pixmap data. Depending on which platform you were on, the actual implementation was a bit different. For instance on X11, you had one 1-bit pixmap mask + an 8-bit alpha channel + a 24-bit color buffer. On Windows you had a 1-bit mask + a packed 32-bit ARGB pixel buffer. In Qt 4 we merged all this into one API, so that QPixmap is to be considered a packed data structure of ARGB pixels. What we did not do, however, was remove the functions implementing the old API. In fact, we even added the alpha channel accessors, so we made it worse. The API was to some extent convenient, but all four of those functions imply touching all the data and either merging the source with the pixmap or extracting a new pixmap from the current pixmap content. Bottom line: just don’t call them. With composition modes, you can manipulate the alpha channel of the pixmaps using QPainter. This also has the benefit that it will potentially be SSE optimized for raster or done in hardware on OpenGL, so it has potential for being quite a bit faster. There is also the QGraphicsOpacityEffect which allows you to set a mask on widgets and graphics items, but as of today, it is not as fast as we would like it to be.
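
As an example of what a composition mode does to the alpha channel, here is a sketch of the per-pixel math behind a DestinationIn style operation (QPainter::CompositionMode_DestinationIn in the public API), where the destination is kept but scaled by the alpha of the source. The helper names are made up for the example, pixels are assumed to be premultiplied 0xAARRGGBB, and the engine’s own implementation will differ.

```cpp
#include <cassert>
#include <cstdint>

// Rounding division by 255 for 8-bit channel math.
inline std::uint32_t div255(std::uint32_t v) { return (v + 127) / 255; }

// DestinationIn on premultiplied pixels: every destination channel,
// alpha included, is scaled by the source alpha.
inline std::uint32_t destinationIn(std::uint32_t src, std::uint32_t dst)
{
    const std::uint32_t sourceAlpha = src >> 24;
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8)
        result |= div255(((dst >> shift) & 0xff) * sourceAlpha) << shift;
    return result;
}

// Use the alpha of `mask` to cut out `pixmap` in place.
void applyMask(std::uint32_t *pixmap, const std::uint32_t *mask, int count)
{
    assert(count >= 0);
    for (int i = 0; i < count; ++i)
        pixmap[i] = destinationIn(mask[i], pixmap[i]);
}
```

With QPainter the same effect would be reached by setting the composition mode on a painter opened on the pixmap and drawing the mask image over it, which lets the engine pick its optimized path.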

    QGraphicsView

    I’ll do at least one separate post on graphicsview alone, so I’ll just comment quickly on the difference between using QGraphicsView with items vs QWidget’s. QGraphicsView with its scene populated with items is in many ways very similar to the widgets and their repaint handling. With the addition of layouts and QGraphicsWidgets the line is even more blurry. So which solution should you pick? More and more often, we’re seeing that people choose to create their UI’s in graphics view rather than creating them using traditional widgets.

    Compared to widgets, items in a graphics view are very cheap. If we consider the photo gallery again, then using a separate item for each of the items in the view may (I say may) be reasonable. A widget is repainted through its paintEvent. A QGraphicsItem is repainted through its paint function. The good thing with the item’s paint function is that there is no QPainter::begin as the painter is already properly set up for rendering. Another good thing is that the painter has less guaranteed state than in the widget case. There may be a transformation and some clip, but no guarantees about fonts, pens or brushes. This makes the setup a bit cheaper.

    Another huge improvement over widgets is that items are not clipped by default. They have a bounding rectangle and there is a contract between the subclass implementer and the scene that the item does not paint outside. If we compare this to the system clip we need to set for widgets, then again there is less work to be done for the items. If the item violates this there will be rendering artifacts, but for graphicsview this has proven an acceptable compromise.

    Most UI elements are rather simple. A button, for instance, can be composed of a background image and a short text. In QPainter terms that is one call to drawPixmap and one call to drawText. The less time spent between painter calls, the better the performance. The fewer state changes between painter calls, the better the performance. Looking back at how much happens between these calls for a button, you quickly realize that the traditional widgets are quite heavy. If widgets are going to survive the test of time, then they need to behave more like QGraphicsItems.

    Some final words

    I’ve been rambling on for a while, but hopefully there was some useful information in here. You may have noticed that I do not mention printing, PDF or SVG generation, nor do I focus on X11 or CoreGraphics paint engines in great detail. This is because, as outlined in the painter performance docs, we focus our performance efforts on only a few backends which we consider critical for Qt.

    gunnar
    Painting
    Graphics Dojo
    Performance
    Posted by gunnar
     in Painting, Graphics Dojo, Performance
     on Monday, December 14, 2009 @ 12:19

    On Friday I added the following to the QPainter documentation:

    
        \section1 Performance

        QPainter is a rich framework that allows developers to do a great
        variety of graphical operations, such as gradients, composition
        modes and vector graphics. And QPainter can do this across a
        variety of different hardware and software stacks. Naturally the
        underlying combination of hardware and software has some
        implications for performance, and ensuring that every single
        operation is fast in combination with all the various combinations
        of composition modes, brushes, clipping, transformation, etc, is
        close to an impossible task because of the number of
        permutations. As a compromise we have selected a subset of the
        QPainter API and backends, where performance is guaranteed to be as
        good as we can sensibly get it for the given combination of
        hardware and software.

        The backends we focus on as high-performance engines are:

        \list

        \o Raster - This backend implements all rendering in pure software
        and is always used to render into QImages. For optimal performance
        only use the format types QImage::Format_ARGB32_Premultiplied,
        QImage::Format_RGB32 or QImage::Format_RGB16. Any other format,
        including QImage::Format_ARGB32, has significantly worse
        performance. This engine is also used by default on Windows and on
        QWS. It can be used as default graphics system on any
        OS/hardware/software combination by passing \c {-graphicssystem
        raster} on the command line

        \o OpenGL 2.0 (ES) - This backend is the primary backend for
        hardware accelerated graphics. It can be run on desktop machines
        and embedded devices supporting the OpenGL 2.0 or OpenGL/ES 2.0
        specification. This includes most graphics chips produced in the
        last couple of years. The engine can be enabled by using QPainter
        onto a QGLWidget or by passing \c {-graphicssystem opengl} on the
        command line when the underlying system supports it.

        \o OpenVG - This backend implements the Khronos standard for 2D
        and Vector Graphics. It is primarily for embedded devices with
        hardware support for OpenVG.  The engine can be enabled by
        passing \c {-graphicssystem openvg} on the command line when
        the underlying system supports it.

        \endlist

        These operations are:

        \list

        \o Simple transformations, meaning translation and scaling, plus
        0, 90, 180, 270 degree rotations.

        \o \c drawPixmap() in combination with simple transformations and
        opacity with non-smooth transformation mode
        (\c QPainter::SmoothPixmapTransform not enabled as a render hint).

        \o Text drawing with regular font sizes with simple
        transformations with solid colors using no or 8-bit antialiasing.

        \o Rectangle fills with solid color, two-color linear gradients
        and simple transforms.

        \o Rectangular clipping with simple transformations and intersect
        clip.

        \o Composition Modes \c QPainter::CompositionMode_Source and
        QPainter::CompositionMode_SourceOver

        \o Rounded rectangle filling using solid color and two-color
        linear gradients fills.

        \o 3x3 patched pixmaps, via qDrawBorderPixmap.

        \endlist

        This list gives an indication of which features to safely use in
        an application where performance is critical. For certain setups,
        other operations may be fast too, but before making extensive use
        of them, it is recommended to benchmark and verify them on the
        system where the software will run in the end. There are also
        cases where expensive operations are ok to use, for instance when
        the result is cached in a QPixmap.
    

    I suspect it’s a piece of documentation many of you have been lacking for a while, and it’s something we should have put in a long time ago, but I can only say “sorry for not doing it sooner”. At least it’s getting done now. Note: The patch is not visible in the public repository at the time of publishing. It should be there shortly.

    The urge to get these things into the docs has sprung from a number of dialogues I’ve had recently which all went pretty much like this:

    • TheOtherGirlOrGuy: My application is running slow… What do I do?
    • Me: What is it doing?
    • TheOtherGirlOrGuy: Well, it’s using QGraphicsView and QPainter and is doing this and that…
    • Me: That doesn’t sound too bad.
    • TheOtherGirlOrGuy: And then it’s really slow when doing this…
    • Me: Yeah… That doesn’t work very well. What you should be doing is this…
    • TheOtherGirlOrGuy: Is that written down someplace? How am I supposed to know that?
    • Me: Eh…

    To remedy this, I’m going to put into action something I’ve had at the back of my head for a while now, a blog series on Qt Graphics and Performance. Along the way, I’ll also try to get parts of this into the documentation or into examples/demos as best practice use-cases.

    I just have to point out, that this blog series is not a request for more features. It is about us sharing with you what we consider best practises and what our priorities are. Of course if you think our focus is way off, then let us know, but my primary intent with this blog series is to share some thoughts.

    With the help of some of my co-workers, we plan to go through some Qt Graphics fundamentals, the “high-performance” engines, and usecases for graphicsview and widgets. If you have special usecases that you find interesting, then by all means let me know and maybe I can cover those too.

    I need to add a small comment to the “drawText” case. It is currently not super optimal, because we have to do layout on the text each time you call it. Because there is no “handle” in the function, we don’t have the ability to cache the layout either. If we started caching based on a qHash of all the strings that were passed to drawText(), we would end up caching a lot of single-shot text drawing… The option that we provide today to work around this is to use a QTextLayout with caching enabled, which is memory-wise quite hungry… I think in the range of 100-300 bytes per character! So as an alternative, we are working on an API for static text which encapsulates the layout work with very little memory overhead. It’s currently called QStaticText and we’re aiming for it to go into 4.7. Once it is in place, we’ll update the drawText comment in the performance documentation to be for these static texts…
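
    To make the “handle” argument concrete, here is a minimal sketch in plain C++. It is not QStaticText and not how Qt lays out text: StaticLabel, layoutText and the fixed 8-pixel advance are all hypothetical stand-ins, chosen only to show why an object that owns its layout can skip repeated work while a handle-less function cannot.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a text layout: one x offset per character.
// Real layout involves font shaping; a fixed advance keeps the sketch small.
struct Layout {
    std::vector<int> xOffsets;
};

// Counts how often the "expensive" layout step runs.
static int g_layoutRuns = 0;

Layout layoutText(const std::string &text, int advance)
{
    ++g_layoutRuns;
    Layout layout;
    layout.xOffsets.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i)
        layout.xOffsets.push_back(static_cast<int>(i) * advance);
    return layout;
}

// Width covered by a layout with the given per-character advance.
int layoutWidth(const Layout &layout, int advance)
{
    return layout.xOffsets.empty() ? 0 : layout.xOffsets.back() + advance;
}

// Handle-less drawing: there is nothing to attach a cache to,
// so every call has to lay the text out again.
int drawTextUncached(const std::string &text)
{
    return layoutWidth(layoutText(text, 8), 8);
}

// Handle-based drawing: the object owns its layout and computes it once.
class StaticLabel {
public:
    explicit StaticLabel(std::string text) : m_text(std::move(text)) {}

    int draw()
    {
        if (!m_ready) {
            m_layout = layoutText(m_text, 8);
            m_ready = true;
        }
        return layoutWidth(m_layout, 8);
    }

private:
    std::string m_text;
    Layout m_layout;
    bool m_ready = false;
};
```

    Drawing the same string three times through drawTextUncached() runs the layout three times, while three calls to StaticLabel::draw() run it once. The cost is that the caller has to keep the object alive between frames.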

    As time permits we plan to push out blogs on the following topics:

    • An overview of the various components involved
    • The raster paint engine in detail
    • The OpenGL paint engine in detail
    • The OpenVG paint engine in detail
    • QGraphicsView optimization flags and cache modes
    Donald Carr
    Qt
    Graphics View
    Painting
    OpenGL
    Performance
    Embedded
    Build system
    Posted by Donald Carr
     in Qt, Graphics View, Painting, OpenGL, Performance, Embedded, Build system
     on Friday, November 20, 2009 @ 00:53

    Introduction

    Texas Instruments has a wiki which documents what is required to bring Qt
    up on the Beagle board with full OpenGL ES (1/2) support:

    http://www.tiexpressdsp.com/index.php/Building_Qt

    and I would like to thank one of their engineers, Varun, for his quick turn
    around times in addressing any questions I raised.

    This blog entry is intended to serve a similar purpose, but is more verbose regarding
    Qt considerations and the initial beagle board bring up. It attempts to serve
    as a comprehensive independent source of information on getting Qt built
    for the Beagle board with full OpenGL ES 2 support.

    These instructions are intended for use with Qt 4.6 (and beyond), so grab
    the release candidate or check Qt 4.6 out from the public git repository prior
    to proceeding.

    You can choose to use either Qt/Embedded or Qt/X11, both can
    be successfully integrated with the Beagle board’s SGX GPU and the only
    point of divergence in these instructions will be at (Qt) configure time
    and the client side system (run time) configuration. Both implementations
    offer window management, via QWS and X11 respectively, and operate at
    around 27fps and 22fps respectively when running our hellogl_es2 example.
    (16bit color depth at 1280×720)

    I personally deploy Ångström on my Beagle board. It handles a large amount
    of the logistics surrounding cross compilation and is generally very
    agreeable, so these instructions are going to be bolted to Ångström for
    completeness. If you prefer another distribution, feel free to establish
    an environment capable of showing the OpenGL ES examples TI provide, then
    follow the Qt level considerations (Configuring Qt) accordingly.

    For those holding a dormant Beagle board who are open to the author’s
    distribution preferences:

    Building the Ångström rootfs

    Open Embedded is manifested in a git repository: in this posting we are
    working within origin/stable/2009. Please follow the instructions given
    here; they are comprehensive and got me completely off the ground.

    http://www.angstrom-distribution.org/building-angstrom

    These instructions end in you running:

    bitbake base-image ; bitbake console-image ; bitbake x11-image

    which actually builds an X11 angstrom image for your Beagle board. Please
    note, you will need to build the X11-image if you want to build and deploy
    the SGX packages (we will do this in the next section) via Ångström as opkg considers
    X11 to be a required dependency of libgles-omap3_3.00.00.09. This is due
    to one of the encapsulated windowing system libraries being X11 centric:

    libpvrPVR2D_X11WSEGL.so

    Regardless of the indicated X11 dependency, this package will bestow the required
    kernel module on you for general OpenGL ES usage (console or X11). We will be
    building our own QWS centric (libpvrPVR2D_X11WSEGL.so equivalent) library
    behind the scenes for QWS in the Qt/Embedded instructions given later.

    Ångström SGX integration

    You now need to integrate the SGX drivers on your Ångström system.

    You need to get your paws on:

    OMAP35x_Graphics_SDK_setuplinux_3_00_00_09.bin

    with the following MD5 checksum:

    e15147ad76ddbe7c5aec682f5455b774

    Getting this involves following the above link and going through the required registration/request process.
    Once you have this file, you drop it in:

    $OETREE/openembedded/recipes/powervr-drivers/libgles-omap3

    and then run:

    bitbake libgles-omap3-3.00.00.09

    which generates the following packages:

    libgles-omap3_3.00.00.09-r1.1_armv7a.ipk
    libgles-omap3-dbg_3.00.00.09-r1.1_armv7a.ipk
    libgles-omap3-demos_3.00.00.09-r1.1_armv7a.ipk
    libgles-omap3-dev_3.00.00.09-r1.1_armv7a.ipk
    libgles-omap3-tests_3.00.00.09-r1.1_armv7a.ipk

    Deploy the x11-image to an sd-card, and copy these packages to the sd-card
    for deployment on the target. If your beagle board does not have internet
    access you will probably also require:

    *  devmem2
    *  libx11-6 (Only if you insisted on using a console build!)

    as opkg will not be able to automatically install the required dependencies
    from its repositories and you would hit the following error at deployment:

    ———————————————-
    root@beagleboard:/opt/deploy# opkg install ./libgles-omap3_3.00.00.09-r1.1_armv7a.ipk
    Installing libgles-omap3 (3.00.00.09-r1.1) to root…
    libgles-omap3: unsatisfied recommendation for libgles-omap3-tests
    Collected errors:
    * ERROR: Cannot satisfy the following dependencies for libgles-omap3:
    *  devmem2 *  libx11-6 (>= 1.1.5) *
    ———————————————-

    Once you have installed all the above packages, please reboot the board.

    Your bootargs in U-Boot should look something like:

    console=ttyS0,115200n8=noinitrd ip=dhcp rw root=/dev/mmcblk0p2 omapfb.mode=dvi:1280x720MR-16@60

    assuming you want to output via DVI and are running a similar kernel
    version (2.6.29-omap1 on my beagle) which accepts the same kernel
    arguments indicated in the bootargs variable above.

    Please note that we are specifying a 16 bit color depth, which is intentional
    and discussed in the “color depth considerations” section in the appendix.

    Please run the powervr demos (under X11) to establish that your drivers are
    successfully installed and usable.

    Configuring Qt

    In order to build Qt now, all that is required for each target is an
    appropriate mkspec:

    For Qt/X11

    You would fork your mkspec off the linux-g++ mkspec, the resulting mkspec’s
    qmake.conf would resemble:

    ==================================================================
    ………….
    include(../common/linux.conf)

    # modifications to g++.conf
    # These release optimization flags are TI supplied
    # and a little more aggressive than Qt standard (gentoo types rejoice!)
    QMAKE_CFLAGS_RELEASE     = -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp
    QMAKE_CXXFLAGS_RELEASE     = $$QMAKE_CFLAGS_RELEASE

    QMAKE_CC         = $FULLY_QUALIFIED_COMPILER_PREFIX-gcc
    QMAKE_CXX         = $FULLY_QUALIFIED_COMPILER_PREFIX-g++
    QMAKE_LINK         = $FULLY_QUALIFIED_COMPILER_PREFIX-g++
    QMAKE_LINK_SHLIB     = $FULLY_QUALIFIED_COMPILER_PREFIX-g++

    # modifications to linux.conf
    QMAKE_LIBS_EGL         = -lEGL -lIMGegl -lsrv_um
    QMAKE_LIBS_OPENGL_QT     = -lEGL -lGLESv2 -lGLES_CM -lIMGegl -lsrv_um
    QMAKE_LIBS_OPENVG     = -lEGL -lGLESv2 -lGLES_CM -lIMGegl -lsrv_um -lOpenVG -lOpenVGU

    QMAKE_INCDIR         = $TARGET_STAGING_PATH/usr/include
    QMAKE_LIBDIR         = $TARGET_STAGING_PATH/usr/lib

    QMAKE_AR         = $FULLY_QUALIFIED_COMPILER_PREFIX-ar cqs
    QMAKE_OBJCOPY         = $FULLY_QUALIFIED_COMPILER_PREFIX-objcopy
    QMAKE_STRIP         = $FULLY_QUALIFIED_COMPILER_PREFIX-strip

    load(qt_config)
    ==================================================================

    and you would configure Qt with:

    configure -arch arm -xplatform linux-omap3-g++ -opengl es2 -openvg

    all that remains is to adjust /etc/powervr.ini on the target to be:

    [default]
    WindowSystem=libpvrPVR2D_FLIPWSEGL.so

    Now compile an example, eg:

    ./examples/opengl/hellogl_es2

    deploy it and Qt to the target and enjoy.

    For Qt/Embedded

    Since we don’t have the X11 abstraction, we have to interface with the
    underlying hardware/interfaces with Qt/Embedded’s gfx abstraction layer. We
    are going to be making some heavy use of the powervr driver resident under:

    $QTSRCTREE/src/plugins/gfxdrivers/powervr

    There is a README file in the powervr directory that is definitely
    recommended reading, and lends some serious insight into our powervr driver
    and Qt/Embedded in general. The same driver is used for MBX/SGX targets and
    hence sees a fair amount of usage on a variety of target devices.

    You would fork your mkspec off the qws/linux-arm-g++ mkspec, the resulting mkspec’s
    qmake.conf would resemble:

    ==================================================================
    …………….
    include(../../common/qws.conf)

    # modifications to g++.conf
    QMAKE_CFLAGS_RELEASE     = -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp
    QMAKE_CXXFLAGS_RELEASE     = $$QMAKE_CFLAGS_RELEASE

    QMAKE_CC         = $FULLY_QUALIFIED_COMPILER_PREFIX-gcc
    QMAKE_CXX         = $FULLY_QUALIFIED_COMPILER_PREFIX-g++
    QMAKE_LINK         = $FULLY_QUALIFIED_COMPILER_PREFIX-g++
    QMAKE_LINK_SHLIB     = $FULLY_QUALIFIED_COMPILER_PREFIX-g++

    # modifications to linux.conf
    QMAKE_INCDIR         = $TARGET_STAGING_PATH/usr/include
    QMAKE_LIBDIR         = $TARGET_STAGING_PATH/usr/lib

    QMAKE_LIBS_EGL         = -lEGL -lIMGegl -lsrv_um
    QMAKE_LIBS_OPENGL_QT     = -lEGL -lGLESv2 -lGLES_CM -lIMGegl -lsrv_um
    QMAKE_LIBS_OPENVG     = -lEGL -lGLESv2 -lGLES_CM -lIMGegl -lsrv_um -lOpenVG -lOpenVGU

    QMAKE_AR         = $FULLY_QUALIFIED_COMPILER_PREFIX-ar cqs
    QMAKE_OBJCOPY         = $FULLY_QUALIFIED_COMPILER_PREFIX-objcopy
    QMAKE_STRIP         = $FULLY_QUALIFIED_COMPILER_PREFIX-strip

    #These defines are documented in the powervr README, please read it
    DEFINES += QT_QWS_CLIENTBLIT QT_NO_QWS_CURSOR

    load(qt_config)
    ==================================================================

    and you would configure Qt with:

    /opt/dev/source/qt-beagle-4.6/configure -embedded arm -little-endian -xplatform qws/linux-omap3-g++ -opengl es2 -openvg -plugin-gfx-powervr

    all that remains is to adjust /etc/powervr.ini on the target to be:

    [default]
    WindowSystem=libpvrQWSWSEGL.so

    Now compile an example, eg:

    ./examples/opengl/hellogl_es2

    deploy it and Qt to your board, and after shutting down X, run the example with
    the following arguments:

    ./hellogl_es2 -qws -display powervr

    -qws - starts the application as the QWS server with exclusive access to the
    system hardware which manages all subsequent Qt “client” applications

    -display powervr - indicates that Qt should use the powervr driver we
    compiled earlier

    Summary

    I hope that this posting encourages people to go forward and experiment
    with a fully accelerated Qt 4.6 on the beagle board. Offloading the
    painting work onto the GPU drastically reduces the load on the CPU and
    broadens the range of applications which can feasibly be run on this
    broadly available (cheap!) embedded hardware. The Beagle board has
    really nice hardware, and it would be infinitely useful for us to have external
    people using our powervr driver and getting it as broadly used/refined as
    possible.

    Appendix

    Additional Benefits to OpenGL ES acceleration

    If you take any Qt Graphics View based example and set a QGLWidget as its
    viewport, a large amount of work will be offloaded on the GPU leaving your
    CPU free to frolic. To put this in perspective, a modified version of:

    ./examples/animation/animatedtiles

    which continually transitions runs smoothly at 720p on the beagle board
    when using software, but consumes 100% CPU time according to top (99.3% to
    be fair). It is therefore CPU bound and you are not going to be doing
    anything else in the background.

    When backed by a QGLWidget, the CPU usage drops to 20% on the exact same
    example in the exact same conditions (720p, at 16bit color depth). The
    frame rate suffers slightly, but at least this is mandated by the GPU.

    Minor clipping issue evident in hellogl_es2

    The bubbles are evidently clipped on the right hand side, I will hopefully
    beat you to reporting this at: http://bugreports.qt.nokia.com/secure/Dashboard.jspa

    I have not seen any other artifacts, please file any additional bugs you
    may encounter at the above URL.

    Are these instructions applicable to OMAP3 targets in general?

    Yes. There is no theoretical reason these instructions would not suffice
    for any OMAP3 based target, although I have not personally verified them
    outside of Beagle board usage. Caveat emptor.

    No Scratchbox2 usage when cross compiling

    The more astute of you will recognize that I bypassed Scratchbox2 when
    configuring Qt/X11 this time around. I paid dearly for it, and this X11
    build has no fontconfig, dbus or glib support even though the Ångström
    subsystem I am building against has support for all of them. If you want a
    full fledged X11 build with decent font support and OpenGL ES support,
    please either:

    1) Invest your time in physically adjusting your MKSPEC (and/or wrestling
    pkg-config) to get all desired dependencies detected and built against

    -Or-

    2) Take the easy road, refer to my previous blog posting “Cross compiling
    Qt/X11” and merge the above mkspec changes into the:

    ./mkspecs/unsupported/linux-scratchbox2-g++

    mkspec in your Qt 4.6 source tree.

    The same goes for Qt/Embedded. It is more self-sufficient, but it will
    still be built without dbus, glib and other external dependencies unless
    you either modify the mkspec/environment further or use Scratchbox2 to
    abstract this away.

    Color depth considerations

    1) The powervr implementation we are relying on does not support
    PVRSRV_PIXEL_FORMAT_RGB888 (24bit color depths). It does, however, support
    PVRSRV_PIXEL_FORMAT_RGB565 and PVRSRV_PIXEL_FORMAT_ARGB8888.

    2) Ångström is busybox based, and the fbset command you will need to set 32
    bit color depths on the console will not work with the default fbset
    busybox symlink. You will therefore have to install and use fbset(.real)
    in order to get 32bit color depths, which is a simple opkg install away for
    the connected Beagle board and a bitbake away for the stranded.

    Please note the color depth specified in the boot arguments

    console=ttyS0,115200n8=noinitrd ip=dhcp rw root=/dev/mmcblk0p2 omapfb.mode=dvi:1280x720MR-16@60

    if you want 32 bit color depth, use:

    console=ttyS0,115200n8=noinitrd ip=dhcp rw root=/dev/mmcblk0p2 omapfb.mode=dvi:1280x720MR-24@60

    followed by:

    /usr/sbin/fbset.real -depth 32 -rgba 8/16,8/8,8/0,8/24

    after your Linux kernel drops you in userspace with a kiss on the cheek. A
    brave man once tried leaving the color depth at 16 in his boot args, and
    jumping all the way to 32bit with fbset so he could change between the more
    performant 16 bit color space and the hardware compositing ARGB offering.
    Running the dedicated fbset command halved his vertical resolution
    regardless of any other parameters he tried to pass fbset and he eventually
    ran off to fight another day.

    There is a clear performance hit of 7 fps when running hellogl_es2 in
    32bit rather than 16bit, taking you down to 20 fps. This hit is even more
    pronounced when setting a QGLWidget on the viewport of a QGraphicsView. I
    am not sure who is responsible for this, and will be personally
    investigating it in the future. Any conjecture/feedback/research performed
    by the reader would be greatly appreciated.
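
    To put the numbers above in perspective, here is a small plain C++ sketch of the arithmetic. It only restates the figures quoted in this post (1280×720, 16 versus 32 bits per pixel, 27 versus 20 fps) and makes no claim about the cause of the slowdown; frameTimeMs and framebufferBytes are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Time budget of a single frame, in milliseconds, at a given frame rate.
double frameTimeMs(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Size of one full-screen buffer in bytes for a given color depth.
std::uint64_t framebufferBytes(std::uint32_t width, std::uint32_t height,
                               std::uint32_t bitsPerPixel)
{
    assert(bitsPerPixel % 8 == 0);
    return static_cast<std::uint64_t>(width) * height * (bitsPerPixel / 8);
}
```

    At 1280×720, framebufferBytes() gives 1,843,200 bytes per frame at 16 bit and 3,686,400 bytes at 32 bit, so every full-screen write moves twice as much memory. Over the same change, the frame time goes from roughly 37 ms (27 fps) to 50 ms (20 fps).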

    *Edited: Introduce rudimentary formatting to make the blog look less Vim forged

    Rhys Weatherley
    Painting
    OpenGL
    Posted by Rhys Weatherley
     in Painting, OpenGL
     on Monday, November 09, 2009 @ 22:44

    For the last year, we have been investigating APIs that Qt needs to support 3D applications and clever 2.5D effects with OpenGL.  When we started all this a year ago, the problem was broken down into three main areas:

    • Enablers - Basic building blocks like matrices, shaders, vertex buffers, etc.
    • Portability API - APIs that make it easier to write code that ports between desktop OpenGL and embedded OpenGL/ES.  Particularly OpenGL/ES 2.0 which does not have a fixed function pipeline.
    • Real 3D - APIs that take Qt into new application spaces beyond animations and 2D effects.

    Obviously that covers a lot of ground, so in this post we will just focus on a few of the Enablers - specifically the ones that made it into 4.6 as the first taste of Qt/3D.  In future posts, we’ll publish Qt/3D repository details and show you more of our plans for later Qt/3D releases.

    Math3d

    Traditionally, Qt has relied upon the OpenGL library to provide mathematical primitives, using functions like glOrtho(), glRotate(), and so on to manipulate matrices and vectors.  However, with the advent of OpenGL/ES 2.0 it is no longer possible to rely upon the OpenGL library to do the heavy-lifting - the programmer has to do all the work. Also, the traditional OpenGL functions are really only useful when drawing objects - they aren’t of much use when building object meshes in memory and transforming them prior to uploading to the GPU.

    So we really needed a hardcore 3D math library, just like the other 3D toolkits (Coin3D, Ogre, OpenSceneGraph, etc).  But we didn’t want to go overboard - it is very easy to re-invent all of linear algebra and lose sight of the core goal: make typical 3D mathematical operations fast and elegant.  We recognized that libraries like Eigen were very good at doing everything in mathematics, but our own goals were more focused.  So what did we do?

    The central workhorse is of course QMatrix4x4, which is highly optimized for 3D operations.  Internally it keeps track of its “type” - whether it is a translation, scale, rotation, etc - so that it can more efficiently build up transformations than a naive “make matrices and multiply” implementation might.  QTransform does the same thing for 2D transformation matrices. The following is an excerpt from the hellogl_es2 example in Qt 4.6 which builds up a modelview matrix and sets it on a shader program:

    QMatrix4x4 modelview;
    modelview.rotate(m_fAngle, 0.0f, 1.0f, 0.0f);
    modelview.rotate(m_fAngle, 1.0f, 0.0f, 0.0f);
    modelview.rotate(m_fAngle, 0.0f, 0.0f, 1.0f);
    modelview.scale(m_fScale);
    modelview.translate(0.0f, -0.2f, 0.0f);
    program1.setUniformValue(matrixUniform1, modelview);

    As can be seen, it is very similar to the traditional OpenGL functions:

    glRotatef(m_fAngle, 0.0f, 1.0f, 0.0f);
    glRotatef(m_fAngle, 1.0f, 0.0f, 0.0f);
    glRotatef(m_fAngle, 0.0f, 0.0f, 1.0f);
    glScalef(m_fScale, m_fScale, m_fScale);
    glTranslatef(0.0f, -0.2f, 0.0f);

    The choice to make the functions similar was deliberate: code that uses the existing OpenGL functions can be quickly converted into more portable code that uses QMatrix4x4.
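
    The “type” tracking mentioned earlier can be illustrated with a small plain C++ sketch. This is not QMatrix4x4’s actual implementation: Mat4, its Type enum and its members are hypothetical, and only translation and uniform scale are covered. It shows how knowing that a matrix is still a pure translation lets translate() avoid the general multiply.

```cpp
#include <cassert>

// Hypothetical 4x4 matrix that remembers what kind of transform it holds.
// Row-major storage: m[row][col]; the translation lives in column 3.
class Mat4 {
public:
    enum Type { Identity, Translation, General };

    Mat4()
    {
        for (int r = 0; r < 4; ++r)
            for (int c = 0; c < 4; ++c)
                m[r][c] = (r == c) ? 1.0f : 0.0f;
    }

    // Equivalent to *this = *this * translationMatrix(x, y, z).
    void translate(float x, float y, float z)
    {
        if (m_type == Identity || m_type == Translation) {
            // Fast path: the upper-left 3x3 is still the identity, so the
            // product only adds to the translation column.
            m[0][3] += x;
            m[1][3] += y;
            m[2][3] += z;
            m_type = Translation;
            return;
        }
        // General path: column 3 picks up a combination of columns 0..2.
        for (int r = 0; r < 4; ++r)
            m[r][3] += m[r][0] * x + m[r][1] * y + m[r][2] * z;
    }

    // Equivalent to *this = *this * uniformScaleMatrix(s).
    void scale(float s)
    {
        for (int r = 0; r < 4; ++r)
            for (int c = 0; c < 3; ++c)
                m[r][c] *= s;
        m_type = General;
    }

    Type type() const { return m_type; }

    float at(int row, int col) const
    {
        assert(row >= 0 && row < 4 && col >= 0 && col < 4);
        return m[row][col];
    }

private:
    float m[4][4];
    Type m_type = Identity;
};
```

    Two successive translate() calls on a fresh Mat4 cost six additions in total, whereas a naive implementation would build two full matrices and perform two 4x4 multiplications. Once scale() has been called, the sketch falls back to the general path.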

    The QGenericMatrix template is used for creating “other” matrix sizes that commonly crop up in OpenGL work: 2×2, 2×3, 2×4, 3×2, 3×3, 3×4, 4×2, and 4×3.  It can do a lot more of course, being a template, although we did draw the line at supporting sparse matrices - the matrix sizes that occur in 3D code are rarely very large.  A common question is why we didn’t make QMatrix4x4 an instance or subclass of QGenericMatrix.  The main reason is performance - the 4×4 class needs to be very fast and it is easier to performance-tune a concrete class that isn’t at the mercy of the compiler’s template expansion system.  The other reason is to reduce user confusion - the API for all QGenericMatrix sizes is exactly the same, but QMatrix4x4 is extremely rich in the additional operations it provides.

    QVector2D, QVector3D, QVector4D provide vector classes of various sizes to complement QMatrix4x4. An interesting feature for the purposes of OpenGL is that these classes are guaranteed to use the same floating-point type internally as GLfloat on the system. QPointF wasn’t suitable for our 2D vector needs because it uses qreal, which can either be float or double depending upon the compilation flags passed to Qt’s configure. The GLfloat guarantee is very important when building large 3D object meshes: you want to get the vertex data into the most efficient format as early as possible. If we had made the internal type qreal, then Qt/3D would have needed to do a lot of floating-point conversions when uploading vertex data to the GPU.
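
    The cost of the wrong internal type can be sketched in plain C++. Vec3f and Vec3d below are hypothetical stand-ins for a float-based vector class and for one built on a qreal that happens to be double; neither is a Qt type, and the sketch assumes GLfloat is a 32-bit float, as it is on common platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3f { float x, y, z; };   // same representation as three GLfloats
struct Vec3d { double x, y, z; };  // what a qreal == double build would store

// An array of Vec3f is only usable as a raw float array if it has no padding.
static_assert(sizeof(Vec3f) == 3 * sizeof(float),
              "Vec3f must be three tightly packed floats");

// Float storage: the vertex array can be handed to the GL as-is.
const float *uploadPointer(const std::vector<Vec3f> &vertices)
{
    return vertices.empty() ? nullptr : &vertices[0].x;
}

// Double storage: every component must be converted and copied first.
std::vector<float> convertForUpload(const std::vector<Vec3d> &vertices)
{
    std::vector<float> out;
    out.reserve(vertices.size() * 3);
    for (const Vec3d &v : vertices) {
        out.push_back(static_cast<float>(v.x));
        out.push_back(static_cast<float>(v.y));
        out.push_back(static_cast<float>(v.z));
    }
    return out;
}
```

    With float storage the upload path is a pointer hand-off. With double storage it is an allocation plus a conversion per component, repeated whenever the mesh changes.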

    The QQuaternion class is the last in our current math3d set. It provides an efficient implementation of rotations in 3D space for use with camera positioning, rotation, and animation.
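
    For readers who have not used quaternions before, this is roughly the operation such a class wraps. The sketch is plain C++ and independent of QQuaternion: Quat, Vec3, fromAxisAngle and rotate are hypothetical names, and the code favors clarity over speed.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Quat { float w, x, y, z; };

// Unit quaternion for a rotation of `degrees` around `axis`.
Quat fromAxisAngle(Vec3 axis, float degrees)
{
    const float len = std::sqrt(axis.x * axis.x + axis.y * axis.y + axis.z * axis.z);
    assert(len > 0.0f);
    const float half = degrees * 3.14159265358979f / 360.0f; // half angle in radians
    const float s = std::sin(half) / len;
    return { std::cos(half), axis.x * s, axis.y * s, axis.z * s };
}

// Hamilton product a * b.
Quat multiply(Quat a, Quat b)
{
    return {
        a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
        a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
        a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
        a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w
    };
}

// Rotates v by the unit quaternion q, computed as q * (0, v) * conjugate(q).
Vec3 rotate(Quat q, Vec3 v)
{
    const Quat p = { 0.0f, v.x, v.y, v.z };
    const Quat conj = { q.w, -q.x, -q.y, -q.z };
    const Quat r = multiply(multiply(q, p), conj);
    return { r.x, r.y, r.z };
}
```

    Rotating the x axis by 90 degrees around the z axis yields the y axis, up to floating-point rounding. Composing two rotations is a single multiply() of their quaternions, which is one reason they are convenient for camera animation.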

    Shader Programs

    The fixed function pipeline in OpenGL is getting very “old school”.  These days, OpenGL is all about shaders, shaders, shaders.  But resolving the extensions and managing the compilation, linking, and use of shader programs can be quite daunting.  In Qt 4.5, we had no fewer than three different internal shader program wrappers for pixmap filters, the OpenGL2 paint engine, and the boxes demo.  So in Qt 4.6 we have merged all of these efforts and devised a new public API to wrap the extensions.  The result is the QGLShader and QGLShaderProgram classes, which:

    • Support the GLSL and GLSL/ES shader languages.
    • Handle vertex and fragment shaders (geometry shaders are coming in future versions of Qt).
    • Support writing portable shaders that work on both GLSL and GLSL/ES.

    That last point is probably the most interesting for Qt.  GLSL has a lot of built-in variables like gl_Vertex, gl_Normal, gl_ModelViewProjectionMatrix, etc that don’t exist in GLSL/ES.  In turn, GLSL/ES has additional type qualifiers like highp, mediump, and lowp that are used to specify the desired precision.  These issues can make it a pain to port existing shader code from desktop to embedded.  We didn’t want to have to write two sets of shaders for the OpenGL2 paint engine, so a solution needed to be found.

    The solution we chose was to use GLSL/ES as the primary language for writing shaders in Qt, and provide #define’s for the extra keywords to make the code compile on desktop GLSL systems.  It is still possible to use the full GLSL language if you want to, but portability will suffer.
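
    The #define approach can be sketched as a small source-to-source step in plain C++. This is an illustration of the idea only, not Qt’s internal code: withPrecisionDefines and its targetIsGlslEs parameter are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Returns shader source that compiles on the chosen target.
// GLSL/ES understands the precision qualifiers natively; on desktop GLSL
// they are defined away so the same shader text can be reused unchanged.
std::string withPrecisionDefines(const std::string &glslEsSource, bool targetIsGlslEs)
{
    if (targetIsGlslEs)
        return glslEsSource;

    const std::string defines =
        "#define lowp\n"
        "#define mediump\n"
        "#define highp\n";
    return defines + glslEsSource;
}
```

    On a desktop target, “uniform mediump vec4 color;” then preprocesses to “uniform  vec4 color;”, which plain GLSL accepts. A shader that starts with a #version directive would need the defines inserted after that line, which this sketch does not handle.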

    The following example demonstrates how to compile and link a simple shader program that can be used to draw triangles with a flat color:

    program.addShaderFromSourceCode(QGLShader::Vertex,
        "attribute highp vec4 vertex;\n"
        // The matrix is the same for every vertex, so it is a uniform.
        "uniform mediump mat4 matrix;\n"
        "void main(void)\n"
        "{\n"
        "   gl_Position = matrix * vertex;\n"
        "}");
    program.addShaderFromSourceCode(QGLShader::Fragment,
        "uniform mediump vec4 color;\n"
        "void main(void)\n"
        "{\n"
        "   gl_FragColor = color;\n"
        "}");
    program.link();
    program.bind();

    // Look up the locations once; they are used when drawing.
    int vertexLocation = program.attributeLocation("vertex");
    int matrixLocation = program.uniformLocation("matrix");
    int colorLocation = program.uniformLocation("color");

    The highp and mediump keywords are added to keep GLSL/ES happy - on desktop they #define to an empty string. Also, we have deliberately used user variables for the vertex position, matrix, and color rather than relying upon the desktop-specific gl_Vertex, gl_ModelViewProjectionMatrix, and gl_Color variables. We can then draw a green triangle as follows:

    QVector3D triangleVertices[] = {
        QVector3D(60.0f,  10.0f,  0.0f),
        QVector3D(110.0f, 110.0f, 0.0f),
        QVector3D(10.0f,  110.0f, 0.0f)
    }; 
    
    QMatrix4x4 pmvMatrix;
    pmvMatrix.ortho(rect()); 
    
    program.enableAttributeArray(vertexLocation);
    program.setAttributeArray(vertexLocation, triangleVertices);
    program.setUniformValue(matrixLocation, pmvMatrix);
    program.setUniformValue(colorLocation, QColor(0, 255, 0, 255)); 
    
    glDrawArrays(GL_TRIANGLES, 0, 3); 
    
    program.disableAttributeArray(vertexLocation);

    Note the use of QMatrix4x4 above to create an orthographic projection matrix to pass to the vertex shader, and the use of QVector3D to build the vertex array.  And that’s basically it!  Shaders 101.

    What’s Next?

    Lots and lots of stuff.  Wrapper classes for vertex buffers and textures will probably go into Qt in the near future.  Geometry handling for building object models.  Special-purpose 3D viewing widgets. Integration with Declarative UI for quickly building 3D applications.  And the portability API.  More to come on these in the next post …


