Performance Analysis Suggestions

Identifying bottlenecks in an application that uses the GPU is difficult, because every application behaves differently and has unique performance characteristics. A single frame often contains multiple bottlenecks, so there may be several ways to improve overall performance. Additionally, as objects move in and out of view, the object that was the bottleneck in one frame or scene may not be the bottleneck in the next. The following suggestions show how to use the features of the Frame Profiler to identify the expensive draw calls in your application and to determine an approach for improving overall performance.

Pausing The Application

When profiling, it is especially important that the application does not change its draw calls in any way. Many counters cannot be collected in a single frame, so multiple frames must be rendered in order to collect all the data. Each frame that is rendered while collecting data is called a "pass". If the draw calls change between passes, the results of the profile may not be accurate, or the profiler may detect the varying draw calls and be forced to exit early. GPU PerfStudio supports several methods for pausing the application, although it is best if the application can pause itself so that the CPU and GPU load stays consistent. A paused camera helps a great deal, but if trees are swaying or clouds are still moving, the GPU load may still vary. If you are confident that the application is GPU bound (see the next suggestion), GPU PerfStudio's Frame Capture feature is very useful for profiling applications that cannot be paused by another method.

Profile GPU Bound Applications

The Analysis button on the initial page of the profiler helps you identify whether your application is CPU bound or GPU bound. Make sure VSync is disabled before running this check, because some GPU bound applications may appear CPU bound when VSync is enabled. Catalyst Control Center can force VSync off if it cannot be changed from within your application. Only GPU bound applications will benefit from optimizing GPU performance. If your application is CPU bound, use AMD CodeAnalyst for CPU performance analysis.

GPU PerfStudio's Frame Profiler has an analysis mode to easily determine if the application is GPU or CPU bound. If the application is running in windowed mode on the same machine as the client, it will automatically be brought into the foreground and given focus during this analysis pass and for each profile that you perform. If an application is not in the foreground, it gets less time on the CPU and GPU and will not have the same performance characteristics as when someone is using the application. It is recommended that you do not interact with any open applications while an analysis or profile is running, as the returned data may indicate an incorrect bottleneck. GPU PerfStudio does not foreground the application if it is running on a remote machine, as it is assumed that the application is already foregrounded.

Keep in mind that the latest graphics hardware is very fast, and simple applications may easily render at hundreds or even thousands of frames per second. At such high frame rates, there is not much benefit in attempting to improve performance further. However, the profiler may still be useful for determining why one effect allows such high frame rates while another effect is much slower.

Selecting Counter Groups

The counters exposed by the profiler are organized into groups to make profiling easier. If your application can support multiple passes, it is recommended to first profile using the Timing group. This will include a 'GPUTime' counter which represents the amount of time (in milliseconds) the GPU spent processing each draw call. The draw calls with the highest 'GPUTime' are where the most improvement is likely to be seen. There is also a 'Busy' counter for each stage in the pipeline. These values represent the percentage of time that stage was active. Because the GPU is highly parallelized, these values are likely to sum to more than 100 percent. The stage with the highest percentage is indicative of the bottleneck for that draw call, but if two or more stages are equally high, you may also consider them as potentially needing improvement.
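The triage described above can be sketched in a few lines of Python. This is only an illustration, assuming the profile results have been copied into a list of dictionaries by hand: the counter names come from this page, but the draw call ids, the values, the one-point tolerance, and the helper names 'rank_by_gpu_time' and 'busiest_stages' are hypothetical and are not part of GPU PerfStudio.

```python
# Hypothetical profile rows: counter names follow the text, values are made up.
draw_calls = [
    {"id": 12, "GPUTime": 7.1, "ShaderBusy": 99.5, "PrimitiveAssemblyBusy": 99.2,
     "TexUnitBusy": 6.0, "DepthStencilTestBusy": 3.0},
    {"id": 40, "GPUTime": 0.4, "ShaderBusy": 35.0, "PrimitiveAssemblyBusy": 10.0,
     "TexUnitBusy": 80.0, "DepthStencilTestBusy": 5.0},
    {"id": 27, "GPUTime": 3.2, "ShaderBusy": 99.0, "PrimitiveAssemblyBusy": 99.6,
     "TexUnitBusy": 4.0, "DepthStencilTestBusy": 6.5},
]


def rank_by_gpu_time(calls):
    """Return the draw calls ordered from most to least expensive."""
    return sorted(calls, key=lambda call: call["GPUTime"], reverse=True)


def busiest_stages(call, tolerance=1.0):
    """Return the 'Busy' counters within `tolerance` points of the highest one.

    More than one name is returned when several stages are about equally busy,
    since each of them is then a candidate for improvement.
    """
    busy = {name: value for name, value in call.items() if name.endswith("Busy")}
    top = max(busy.values())
    return sorted(name for name, value in busy.items() if top - value <= tolerance)


for call in rank_by_gpu_time(draw_calls):
    print(call["id"], call["GPUTime"], busiest_stages(call))
```

Returning every stage within a small tolerance of the maximum, rather than a single winner, mirrors the advice that two or more equally busy stages should all be considered.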

Below is an example profile showing the most expensive draw calls for an application. After sorting on 'GPUTime', the most expensive draw calls are easy to pick out: there are three very expensive calls (~7 ms) and three moderately expensive calls (~3 ms). In this particular case, all six calls render the same object. The three most expensive calls render into render targets, and the other three render into a depth buffer.

FPSuggestionsTimingGroup.png

In this example, 'ShaderBusy' and 'PrimitiveAssemblyBusy' are both between 99 and 100 percent, while 'TexUnitBusy' and 'DepthStencilTestBusy' are both below 7 percent. There are six additional 'ShaderBusy*' counters, one for each programmable pipeline stage. These values represent the percentage of work done by the shader unit for the specified shader. On further inspection, we see that 'ShaderBusyDS' has a much higher percentage (between 62 and 96) than the other shader types (the second highest is between 2 and 36 percent), suggesting that the Domain Shader is the likely bottleneck for these draw calls. Since the Primitive Assembly stage (clipping and culling) is as busy as the shader unit, one can suspect that the tessellator is generating many polygons. This causes the Domain Shader to be executed many times, and the clipping and culling of the generated polygons takes longer. If 'PrimitiveAssemblyBusy' were not as high, it could be expected that the Domain Shader itself was expensive.

To confirm our theory, a second profile is usually necessary. For each 'Busy' counter in the Timing group, there is a corresponding group to help identify performance issues within that stage. Since the Domain Shader was identified as a potential area of improvement, a second profile is performed with the 'DomainShader' counter group and two additional counters: 'GPUTime' (so we can identify the expensive calls again) and 'HSVerticesIn' (so that we can compare the number of vertices going into the tessellator to the number of vertices going out).

FPSuggestionsDSGroup.png

In the second profile, we can see that the number of vertices going into the Domain Shader is much higher than the number that went into the Hull Shader; in fact, it is 97 times higher. To rule out the Domain Shader itself being overly expensive, the 'DSTexInstCount' (texture instructions) and 'DSALUInstCount' (ALU instructions) counters can be observed, along with the associated percentage of time that each type of instruction took ('DSTexBusy' and 'DSALUBusy'). The texture instructions took only about 0.3 percent of the 'GPUTime', and the ALU instructions took ~9 percent. Since the shader is executed for each of the generated input vertices, reducing the number of ALU instructions will reduce the overall execution time of the shader unit and likely of the draw call. However, reducing the tessellation factor may be easier and may yield a far greater gain. If the tessellation factor is set to a very high value, reducing it slightly may have no noticeable visual impact, but the reduction in polygon count may significantly improve performance. This is especially true in this example: since the same object is rendered six times, several milliseconds may be gained.
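The arithmetic behind this reasoning can be written out as a short Python sketch. The vertex counts below are made-up values chosen to reproduce the 97x ratio from the example, the ~7 ms figure is the approximate 'GPUTime' of the most expensive calls, and 'ds_vertices_in' is a hypothetical name for whichever counter reports the vertices entering the Domain Shader. The sketch also assumes, as the text describes, that 'DSTexBusy' and 'DSALUBusy' are percentages of 'GPUTime'.

```python
def amplification(hs_vertices_in, ds_vertices_in):
    """Ratio of vertices entering the Domain Shader to those entering the Hull Shader."""
    if hs_vertices_in <= 0:
        raise ValueError("hs_vertices_in must be positive")
    return ds_vertices_in / hs_vertices_in


def instruction_time_ms(gpu_time_ms, busy_percent):
    """Approximate time spent on one instruction type, given its share of GPUTime."""
    return gpu_time_ms * busy_percent / 100.0


# Illustrative numbers only; they are not taken from a real profile.
ratio = amplification(hs_vertices_in=3_000, ds_vertices_in=291_000)
alu_ms = instruction_time_ms(gpu_time_ms=7.0, busy_percent=9.0)
tex_ms = instruction_time_ms(gpu_time_ms=7.0, busy_percent=0.3)

print(f"tessellation amplification: {ratio:.0f}x")
print(f"ALU time: {alu_ms:.2f} ms, texture time: {tex_ms:.3f} ms")
```

With these inputs the ALU and texture instructions together account for well under 1 ms of a ~7 ms draw call, which is why lowering the vertex count is the more promising change than trimming shader instructions.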

Although it was not the case in this example, if 'TexUnitBusy' had the highest busy percentage, there could be several causes: many texture instructions in any of the shaders, expensive filtering, or slow texture formats. Check the percentage of time each shader stage spent on texture instructions (e.g. 'PSTexBusy') to see whether any percentage seems high and which stage is the culprit. A separate profile of the 'TextureFormat' group can help identify whether any slow texture formats are being used, in which case switching to compressed formats may provide a benefit. Similarly, a profile of the 'TextureUnit' group can help indicate whether expensive filtering is being performed.

State Groups

The State Bucket Options in the Data tab of the profile results allow selection of state groups. This provides a very simple way to group the draw calls that share the same shader, render into the same render target (or depth or stencil buffer), or are in the same PerfMarker group. If a particular shader is the bottleneck in one draw call, other draw calls that use the same shader will also improve if the shader is optimized. In some cases, it may be preferable to consider only the most expensive state group, because this lets you identify improvements that have a positive effect over the entire frame rather than on a single draw call. The state groups are shown as orange rows in the results table of the Profile Window, with the draw calls using that state shown in blue within the state bucket.
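The idea of ranking state groups by their combined cost can be sketched outside the tool as well. The following Python example is a minimal illustration, assuming the draw calls have been exported by hand into dictionaries: the 'shader' key, the shader names, the timings, and the helper 'total_gpu_time_by_state' are all hypothetical and do not reflect how the Profile Window computes its buckets.

```python
from collections import defaultdict


def total_gpu_time_by_state(calls, state_key):
    """Sum GPUTime per state value and return (state, total) pairs, costliest first."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[state_key]] += call["GPUTime"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up draw calls: three share one shader, one uses another.
draw_calls = [
    {"id": 1, "shader": "terrain_ds", "GPUTime": 7.0},
    {"id": 2, "shader": "terrain_ds", "GPUTime": 3.0},
    {"id": 3, "shader": "sky_ps", "GPUTime": 0.5},
    {"id": 4, "shader": "terrain_ds", "GPUTime": 7.0},
]

buckets = total_gpu_time_by_state(draw_calls, "shader")
for state, total_ms in buckets:
    print(f"{state}: {total_ms:.1f} ms")
```

The same function works for any other state, such as a render target or PerfMarker name, by passing a different key.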