GPU PerfStudio's Frame Profiler has an analysis mode that determines whether the application is GPU bound or CPU bound. If the application is running in windowed mode on the same machine as the client, it is automatically brought to the foreground and given focus during this analysis pass and for each profile that you perform. An application that is not in the foreground gets less time on the CPU and GPU, and does not have the same performance characteristics as when someone is using it. Do not interact with any open applications while an analysis or profile is running, because the returned data may then indicate an incorrect bottleneck. GPU PerfStudio does not foreground an application that is running on a remote machine, because it assumes that the application is already in the foreground.
Similarly, the latest graphics hardware is incredibly fast, and simple applications may easily render at hundreds or even thousands of frames per second. At such high frame rates, there is not much benefit in attempting to improve performance further. However, the profiler may still help determine why one effect allows such high frame rates while another effect is much slower.
Below is an example profile showing the most expensive draw calls for an application. After sorting on 'GPUTime', the most expensive draw calls are easily picked out: there are three very expensive calls (~7 ms) and three semi-expensive calls (~3 ms). In this particular case, all six calls are rendering the same object. The three most expensive calls are rendering into render targets, and the three semi-expensive calls are rendering into a depth buffer.
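The sort-and-pick step described above can also be reproduced outside the tool. The following is a minimal Python sketch, assuming the profile rows are available as dictionaries with a 'GPUTime' value in milliseconds. The `draw_calls` data, the `DrawIndex` key, and the `most_expensive` helper are hypothetical, and the numbers are illustrative rather than measured.

```python
# Hypothetical profile rows: one dict per draw call, GPUTime in milliseconds.
# The values are made up for illustration only.
draw_calls = [
    {"DrawIndex": 1, "GPUTime": 0.4},
    {"DrawIndex": 2, "GPUTime": 7.1},
    {"DrawIndex": 3, "GPUTime": 3.0},
    {"DrawIndex": 4, "GPUTime": 6.9},
    {"DrawIndex": 5, "GPUTime": 0.2},
    {"DrawIndex": 6, "GPUTime": 3.2},
]


def most_expensive(calls, top_n):
    """Return the top_n draw calls, sorted by descending GPUTime."""
    return sorted(calls, key=lambda c: c["GPUTime"], reverse=True)[:top_n]


# Print the three most expensive draw calls in this sample.
for call in most_expensive(draw_calls, top_n=3):
    print(call["DrawIndex"], call["GPUTime"])
```

Sorting the whole list and slicing keeps the sketch short; for very large profiles, `heapq.nlargest` would avoid sorting every row.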
In this example 'ShaderBusy' and 'PrimitiveAssemblyBusy' are both between 99 and 100 percent busy, while 'TexUnitBusy' and 'DepthStencilTestBusy' are both below 7 percent. There are six additional 'ShaderBusy*' counters, one for each programmable pipeline stage. These values represent the percentage of work done by the shader unit for the specified shader. On further inspection, we see that 'ShaderBusyDS' has a much higher percentage (between 62 and 96) than the other shader types (the second highest is between 2 and 36 percent), suggesting that the Domain Shader is the likely bottleneck for these draw calls. Since the Primitive Assembly stage (clipping and culling) is as busy as the shader unit, one can suspect that the tessellator is generating many polygons: the Domain Shader is then executed many times, and clipping and culling the generated polygons takes longer. If 'PrimitiveAssemblyBusy' were not as high, it could instead be expected that the Domain Shader itself was expensive.
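The reasoning above (find the busiest unit, then the busiest shader stage within it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `counters` dictionary and both helper functions are hypothetical, only 'ShaderBusyDS' is named in the text (the other per-stage counter names are assumed by analogy), and the values are made up to resemble the example.

```python
# Hypothetical counter values (percent busy) for one expensive draw call.
# Only 'ShaderBusyDS' is named in the text; the other per-stage names are
# assumed by analogy, and all numbers are illustrative.
counters = {
    "ShaderBusy": 99.6,
    "PrimitiveAssemblyBusy": 99.3,
    "TexUnitBusy": 5.0,
    "DepthStencilTestBusy": 6.5,
    "ShaderBusyVS": 1.0,
    "ShaderBusyHS": 2.0,
    "ShaderBusyDS": 90.0,
    "ShaderBusyGS": 0.0,
    "ShaderBusyPS": 6.0,
    "ShaderBusyCS": 0.0,
}

# The four unit-level 'Busy' counters discussed in the text.
UNIT_COUNTERS = (
    "ShaderBusy",
    "PrimitiveAssemblyBusy",
    "TexUnitBusy",
    "DepthStencilTestBusy",
)


def busiest_unit(row):
    """Return the name of the unit-level counter with the highest value."""
    return max(UNIT_COUNTERS, key=lambda name: row[name])


def busiest_shader_stage(row):
    """Return the per-stage 'ShaderBusy*' counter with the highest value."""
    per_stage = {
        name: value
        for name, value in row.items()
        if name.startswith("ShaderBusy") and name != "ShaderBusy"
    }
    return max(per_stage, key=per_stage.get)


print(busiest_unit(counters), busiest_shader_stage(counters))
```

A single maximum hides near-ties such as 'ShaderBusy' and 'PrimitiveAssemblyBusy' in this example, so it is worth printing the full set of values before drawing a conclusion.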
To confirm our theory, a second profile is usually necessary. For each 'Busy' counter in the Timing group, there is a corresponding group to help identify performance issues within that stage. Since the Domain Shader was identified as a potential area of improvement, a second profile is performed with the 'DomainShader' counter group and two additional counters: 'GPUTime' (so we can identify the expensive calls again) and 'HSVerticesIn' (so that we can compare the number of vertices going into the tessellator to the number of vertices going out).
In the second profile, we can see that far more vertices go into the Domain Shader than went into the Hull Shader; in fact, 97 times as many. To rule out the Domain Shader itself being overly expensive, 'DSTexInstCount' (texture instructions) and 'DSALUInstCount' (ALU instructions) can be observed, along with the associated percentage of time that each type of instruction took ('DSTexBusy' and 'DSALUBusy'). The texture instructions took only about 0.3 percent of the 'GPUTime' and the ALU instructions took ~9 percent. Since the shader is executed for each of the generated input vertices, reducing the number of ALU instructions will reduce the overall execution time of the shader unit and likely of the draw call. However, reducing the tessellation factor may be easier and may yield a far greater gain. If the tessellation factor is at a very high value, reducing it slightly may have no noticeable visual impact, while the reduction in polygon count may significantly improve performance. This is especially true in this particular example: since the same object is rendered six times, several milliseconds may be gained.
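The vertex comparison above amounts to a simple ratio. The following Python sketch computes it, assuming the two vertex counts have been read from the profile; the `amplification` helper and its parameter names are hypothetical, and the sample counts are chosen only to reproduce the 97-times figure from the example.

```python
def amplification(hs_vertices_in, ds_vertices_in):
    """Return how many Domain Shader input vertices exist per Hull Shader input vertex."""
    if hs_vertices_in <= 0:
        raise ValueError("hs_vertices_in must be positive")
    return ds_vertices_in / hs_vertices_in


# Illustrative counts only: 1,000 vertices into the Hull Shader and
# 97,000 vertices into the Domain Shader.
ratio = amplification(hs_vertices_in=1_000, ds_vertices_in=97_000)
print(f"{ratio:.0f}x")  # prints "97x"
```

A ratio this large, combined with a low share of time spent on texture and ALU instructions, points toward the amount of tessellation rather than the cost of each Domain Shader invocation.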
Although it was not the case in this example, if 'TexUnitBusy' had the highest busy percentage, there could be several causes: many texture instructions from any of the shaders, expensive filtering, or slow texture formats. It is suggested to check the percentage of time each shader stage spent on texture instructions (e.g. 'PSTexBusy') to see whether any percentage seems high and which stage is the culprit. A separate profile of the 'TextureFormat' group can help identify whether any slow texture formats are being used, in which case switching to compressed formats may provide a benefit. Similarly, a profile of the 'TextureUnit' group can help indicate whether expensive filtering is being performed.