C.3. Reduce Pipeline Bottlenecks

As described in Chapter 1, "An Introduction to OpenGL," OpenGL is a pipeline architecture with two main paths: the vertex path and the pixel path (refer to Figure 1-2). Although some applications make extensive use of the pixel path during rendering, most applications load pixel data at initialization time, and as a result, the rate at which OpenGL processes pixels in the pixel path is rarely a performance issue.

Applications can be bottlenecked in the vertex path in three very broad areas:

Host computer processing, or CPU limited.
Per-vertex operations, or geometry limited.
Per-fragment operations, or fill limited.

To optimize your application, first identify where the application is bottlenecked. Then make modifications to your application to minimize its demand on that portion of the pipeline.

Before attempting to identify the bottleneck, go over your code and ensure that any OpenGL features that your application doesn't need are disabled. Only two OpenGL features are enabled by default, GL_DITHER and GL_MULTISAMPLE, and both could cause your application to be fill limited. Disable both unless your application specifically requires them. Likewise, check for and disable any unneeded state that was left enabled, such as GL_ALPHA_TEST.

C.3.1. CPU Limited

CPU-limited applications are bottlenecked by the processing power of the host computer. To determine whether your application is CPU limited, try the following tests:

Run the application on less-powerful or more-powerful host computers. For this test to be valid, the graphics hardware and device-driver version should be identical between the two systems. In a classic CPU-limited application, performance scales with processor speed. If you measure a one-third drop in performance running on a system with a 2GHz processor compared with a similar system with a 3GHz processor, your application almost certainly is CPU limited.
Change the graphics card in your system. If you measure the same application performance regardless of the class of graphics hardware you're using, your application probably is CPU limited. CPU-limited applications can't take advantage of increased graphics hardware horsepower, so performance will not increase if you run on a higher-end device. Likewise, lower-end devices are often capable of handling the load from CPU-limited applications, so swapping in a less-powerful graphics card won't decrease performance.

Note

If changing systems or graphics cards is impractical, you might be able to change the clock speed of either the CPU or graphics hardware. Visit the hardware vendor's Web site for information on availability of such vendor-specific utilities.
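
The arithmetic behind the first test can be sketched in a few lines of C. This is only an illustration of ideal linear scaling; the function names are hypothetical, and real systems rarely scale this cleanly because memory and bus speeds also differ between machines.

```c
/* Hypothetical helper: frame rate you would expect on another CPU if
   performance scaled linearly with clock speed, as it does in a classic
   CPU-limited application. */
double predicted_fps(double measured_fps, double measured_ghz, double other_ghz)
{
    return measured_fps * (other_ghz / measured_ghz);
}

/* Fractional drop in performance going from the faster result to the
   slower one (0.0 = no change, 1.0 = no frames rendered at all). */
double fractional_drop(double fast_fps, double slow_fps)
{
    return (fast_fps - slow_fps) / fast_fps;
}
```

For example, an application that runs at 60 frames per second on the 3GHz system would be predicted to run at about 40 frames per second on the 2GHz system, which is the one-third drop described above. A measured drop much smaller than the prediction suggests that the bottleneck lies elsewhere.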

CPU-limited applications can be bottlenecked in the host computer for many reasons, including inefficient code, memory or bus bandwidth, and excessive cache misses or paging. Use platform-specific resource monitors or third-party optimization software to determine the nature of your bottleneck; then tune accordingly. Tuning a CPU-limited OpenGL application is no different from tuning any nongraphics application. Such optimization tips are beyond the scope of this book.

C.3.2. Geometry Limited

Geometry-limited applications are bottlenecked by how fast OpenGL performs per-vertex operations. To determine whether your application is geometry limited, disable OpenGL per-vertex operations, such as lighting and fog, and see whether performance increases.

Reducing the number of vertices is also a good test, as long as you still render a comparable number of pixels. Consider rendering a simple cylinder by approximating it with 1,024 quadrilaterals, or 2,050 vertices. If you reduce the approximation to 128 quadrilaterals (258 vertices) and maintain the same overall cylinder dimensions, OpenGL will still render approximately the same number of pixels. If this results in a significant boost in performance, rendering the cylinder is geometry limited.
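
The vertex counts quoted above are consistent with drawing the cylinder wall as a single quad strip, which is an assumption on our part rather than something the text states. Under that assumption, a hypothetical helper for the count looks like this:

```c
/* Vertices needed to draw `quads` quadrilaterals around a cylinder wall
   as one quad strip (an assumed layout): two vertices per column of the
   strip, plus one extra column to close the loop. */
unsigned int quad_strip_vertex_count(unsigned int quads)
{
    return 2u * (quads + 1u);
}
```

With 1,024 quadrilaterals this gives 2,050 vertices, and with 128 quadrilaterals it gives 258, so the reduced cylinder sends roughly one-eighth as many vertices while covering about the same screen area.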

To improve the performance of geometry-limited applications, modify your application to send fewer vertices. Consider some of the following suggestions:

Implement frustum culling in your application, and use the occlusion query feature to avoid sending geometry to OpenGL that won't be visible in the final rendered scene.
If your geometry consists of enclosed hulls, enable face culling with glEnable( GL_CULL_FACE ). Face culling is described in Chapter 2, "Drawing Primitives." Many OpenGL implementations perform face culling earlier in the pipeline than required by the OpenGL specification to reduce per-vertex operations.
Don't send more vertices than necessary. To use the cylinder example again, it doesn't make much sense to approximate a cylinder with 1,024 quadrilaterals if your application always renders it so small that it will never occupy more than 100 pixels of screen real estate. Consider implementing a LOD algorithm so that your application sends less-complex geometry for models when their final rendered size is small.
Along the same line as the previous bullet point, improve detail with texture mapping, and use fewer vertices. Applications sometimes increase the vertex count to obtain acceptable specular highlights, for example. Geometry-limited applications should send fewer vertices, however, and use cube mapping to generate the specular highlight.
Note that this technique effectively pushes processing from one part of the pipeline (per-vertex operations) to another part (per-fragment operations). After making a modification like this, remeasure performance to ensure that you haven't created a new bottleneck in a different part of the pipeline.
If your application uses lighting, avoid scaling your geometry with glScale*(). Because this also scales your normals, your application must use either normalization or normal rescaling to restore unit-length normals. Although normal rescaling is less expensive than normalization, both have a nonzero per-vertex expense. You can avoid using glScale*() by manually scaling your geometry at initialization time. The example code uses this strategy to size its cylinder, torus, sphere, and plane primitives.
If your application doesn't use the texture matrix to transform texture coordinates, don't use glLoadMatrix*() to set the texture matrix to the identity. Instead, explicitly set the matrix to the identity with a call to glLoadIdentity(). Most OpenGL implementations will optimize for this case and eliminate the texture-coordinate transformation.
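
The first suggestion in this list, frustum culling, can be sketched as a bounding-sphere test against the planes of the view volume. This is a minimal sketch, not the book's example code: the Plane and Sphere types and the sphere_in_frustum() function are hypothetical, and the code assumes that your application has already extracted the planes with unit-length normals that point into the view volume.

```c
#include <stddef.h>

/* Plane a*x + b*y + c*z + d = 0. Assumed: (a, b, c) is unit length and
   points toward the inside of the view volume. */
typedef struct { float a, b, c, d; } Plane;

/* Bounding sphere of one object, in the same coordinate space as the
   planes. */
typedef struct { float x, y, z, radius; } Sphere;

/* Returns 1 if the sphere might be visible, or 0 if it lies completely
   outside at least one plane and the object can be skipped. */
int sphere_in_frustum(const Plane planes[], size_t plane_count,
                      const Sphere *s)
{
    size_t i;
    for (i = 0; i < plane_count; ++i) {
        /* Signed distance from the sphere center to the plane. */
        float dist = planes[i].a * s->x + planes[i].b * s->y
                   + planes[i].c * s->z + planes[i].d;
        if (dist < -s->radius)
            return 0;   /* entirely outside this plane: cull */
    }
    return 1;           /* inside or intersecting every plane */
}
```

An application would call this once per object before issuing any OpenGL commands for it. Note that the test runs on the host, so culling trades per-vertex work for CPU work; remeasure afterward to confirm that the application hasn't become CPU limited.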

Note that using buffer objects doesn't guarantee maximum performance. Specifically, the OpenGL glDrawRangeElements() command could perform less optimally if the buffer data is too large. Fortunately, applications can query these implementation-dependent size thresholds with glGetIntegerv(), as shown below:

    GLint maxVertices, maxIndices;
    glGetIntegerv( GL_MAX_ELEMENTS_VERTICES, &maxVertices );
    glGetIntegerv( GL_MAX_ELEMENTS_INDICES, &maxIndices );

Applications should use these size thresholds to limit the amount of vertex and index data sent to OpenGL via glDrawRangeElements(). For large amounts of vertex data, multiple small glDrawRangeElements() commands could perform better than a single large command.
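
One way to break a large draw into smaller commands is to split the index list on the host before rendering. The following is a minimal sketch that assumes GL_TRIANGLES primitives and unsigned int indices; the DrawBatch type and split_triangle_batches() function are hypothetical names, and the code makes no OpenGL calls itself.

```c
#include <stddef.h>

/* Describes one glDrawRangeElements() command: it draws `count` indices
   starting at indices[first], which reference only vertices in the
   range [min_index, max_index]. */
typedef struct {
    size_t first;
    size_t count;
    unsigned int min_index;
    unsigned int max_index;
} DrawBatch;

/* Splits a triangle index list into batches of at most max_indices
   indices each, rounded down to whole triangles. Returns the number of
   batches written. Returns 0 if there is nothing to draw, if
   max_indices is less than 3, or if batches[] is too small. */
size_t split_triangle_batches(const unsigned int *indices, size_t index_count,
                              size_t max_indices,
                              DrawBatch *batches, size_t batch_capacity)
{
    size_t per_batch = (max_indices / 3) * 3;
    size_t first = 0, n = 0;

    if (per_batch == 0)
        return 0;

    while (first < index_count) {
        size_t count = index_count - first;
        size_t i;
        DrawBatch b;

        if (count > per_batch)
            count = per_batch;
        if (n == batch_capacity)
            return 0;

        /* Record the vertex range this batch touches; these become the
           start and end arguments of glDrawRangeElements(). */
        b.first = first;
        b.count = count;
        b.min_index = b.max_index = indices[first];
        for (i = first + 1; i < first + count; ++i) {
            if (indices[i] < b.min_index) b.min_index = indices[i];
            if (indices[i] > b.max_index) b.max_index = indices[i];
        }
        batches[n++] = b;
        first += count;
    }
    return n;
}
```

Pass the queried maxIndices value as max_indices, and issue one glDrawRangeElements() command per batch, using min_index and max_index as the start and end parameters. The sketch limits only the index count; to respect GL_MAX_ELEMENTS_VERTICES as well, compare max_index - min_index + 1 against maxVertices and reorder or split your data further if a batch exceeds it. Because the thresholds are recommendations rather than hard limits, measure both approaches before committing to one.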

C.3.3. Fill Limited

Fill-limited applications are bottlenecked by how fast OpenGL performs per-fragment operations. To determine whether your application is fill limited, try the following tests:

Make the window smaller. Reducing the window width and height by one-half reduces the number of pixels filled to one-fourth of the original. If performance improves accordingly, your application almost certainly is fill limited.
Note that some LOD-based algorithms could send less geometry when rendering to a smaller window. For this test to be valid, make sure that your application continues to send the same amount of geometry.
Disable texture mapping and other per-fragment operations used by your application. If performance increases when you reduce per-fragment operations, your application is fill limited.
Modern OpenGL hardware is optimized to render texture mapped primitives with the depth test enabled, but it's still common for primitives to render faster when texture mapping and depth testing are disabled.
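
The pixel arithmetic behind the window-size test is easy to check. The helper below is a hypothetical illustration, not part of any OpenGL API:

```c
/* Number of pixels in a window of the given dimensions. */
unsigned long window_pixels(unsigned int width, unsigned int height)
{
    return (unsigned long)width * (unsigned long)height;
}
```

A 1,024 x 768 window contains 786,432 pixels, whereas a 512 x 384 window contains 196,608, or one-fourth as many. A purely fill-limited application would therefore render each frame in roughly one-fourth the time in the smaller window; a much smaller improvement suggests that another part of the pipeline is also limiting performance.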

To optimize a fill-limited application, consider the following suggestions:

Minimize texture size. It doesn't make sense to use a 1,024 x 1,024 texture on a primitive that will never occupy more than a few hundred pixels of screen real estate. Reducing texture size increases texture cache coherency in the graphics hardware and also allows more textures to fit in graphics-card RAM.
Use GL_NEAREST_MIPMAP_LINEAR instead of GL_LINEAR_MIPMAP_LINEAR. This change reduces the number of texel fetches per fragment at the expense of some visual quality.
Measure the depth complexity of your scene, and take steps to reduce it if it's too high. The scene depth complexity is the average number of times a pixel was overdrawn to produce the final scene. The higher the depth complexity, the slower the rendering for fill-limited applications.
The optimal depth complexity of 1.0 is rarely realized in professional OpenGL applications. Typical scenes produce a depth complexity of around 1.5. To reduce depth complexity, organize your geometry to render front to back. In extreme cases, you should render a depth-only pass first, followed by a color pass, which produces a maximum depth complexity of 2.0.
Reduce or eliminate other per-fragment operations when possible, such as multisample, alpha test, stencil test, depth test, and blending.
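
To measure depth complexity, you need a count of how many times each pixel was drawn. One possible approach, which is an assumption here rather than a method the text prescribes, is to clear the stencil buffer to zero, render the scene with the stencil operation set to increment, and read the stencil values back with glReadPixels(). Given such an array of per-pixel counts, the average can be computed as in this sketch; the function name is hypothetical.

```c
#include <stddef.h>

/* Average number of times a pixel was drawn. counts[i] holds the draw
   count for pixel i (for example, values read back from an 8-bit
   stencil buffer, which is an assumed source). Pixels that were never
   drawn contribute zero to the average. */
double depth_complexity(const unsigned char *counts, size_t pixel_count)
{
    unsigned long total = 0;
    size_t i;

    if (pixel_count == 0)
        return 0.0;

    for (i = 0; i < pixel_count; ++i)
        total += counts[i];

    return (double)total / (double)pixel_count;
}
```

For example, if half the pixels in the window were drawn once and the other half were drawn twice, the function returns 1.5. Keep in mind that an 8-bit counter saturates or wraps at 255, depending on the stencil operation, so extremely high overdraw is underreported.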

C.3.4. Closing Thoughts on Pipeline Bottlenecks

Rendering style often indicates whether applications are geometry limited or fill limited. Applications that send relatively few vertices and draw relatively large primitives, for example, tend to be fill limited. Conversely, applications that send many vertices but draw comparatively few pixels tend to be geometry limited.

It's entirely possible for your application to have several bottlenecks throughout the pipeline in the course of rendering a single frame. Consider a simulation application that renders a scene with a robot walking on terrain under a sky. The terrain uses a LOD algorithm and is CPU limited. The sky dome contains very few vertices but covers a large area; therefore, it is fill limited. The robot is extremely complex and detailed; therefore, it is geometry limited. To optimize such scenes effectively, first identify the part of the scene that consumes the most rendering time, and focus your efforts there. Optimize other parts of the scene later.