Available Counters

The counters available in GPU PerfStudio are organized into groups to make the available data easier to navigate. The following table lists each counter group and its associated counters.

Counter Groups and Associated Counters

Timing: CSBusy, DepthStencilTestBusy, DSBusy, GPUTime, GPUBusy, GSBusy, HSBusy, InterpBusy, PrimitiveAssemblyBusy, PSBusy, ShaderBusy, ShaderBusyVS, ShaderBusyGS, ShaderBusyPS, ShaderBusyHS, ShaderBusyDS, ShaderBusyCS, TessellatorBusy, TexUnitBusy, VSBusy
VertexShader: VertexMemFetched, VertexMemFetchedCost, VSALUBusy, VSALUEfficiency, VSALUInstCount, VSALUTexRatio, VSSALUBusy, VSSALUInstCount, VSTexBusy, VSTexInstCount, VSVALUBusy, VSVALUInstCount, VSVerticesIn
HullShader *: HSALUBusy, HSALUEfficiency, HSALUInstCount, HSALUTexRatio, HSTexBusy, HSTexInstCount, HSPatches, HSSALUBusy, HSSALUInstCount, HSVALUBusy, HSVALUInstCount
GeometryShader: GSALUBusy, GSALUEfficiency, GSALUInstCount, GSALUTexRatio, GSExportPct, GSPrimsIn, GSSALUBusy, GSSALUInstCount, GSTexBusy, GSTexInstCount, GSVALUBusy, GSVALUInstCount, GSVerticesOut
PrimitiveAssembly: ClippedPrims, CulledPrims, PAStalledOnRasterizer, PrimitivesIn, PAPixelsPerTriangle
DomainShader *: DSALUBusy, DSALUEfficiency, DSALUInstCount, DSALUTexRatio, DSTexBusy, DSTexInstCount, DSVerticesIn
PixelShader: PSALUBusy, PSALUEfficiency, PSALUInstCount, PSALUTexRatio, PSExportStalls, PSPixelsIn, PSPixelsOut, PSSALUBusy, PSSALUInstCount, PSTexBusy, PSTexInstCount, PSVALUBusy, PSVALUInstCount
ComputeShader *: CSALUBusy, CSALUFetchRatio, CSALUInsts, CSALUPacking, CSALUStalledByLDS, CSCacheHit, CSCompletePath, CSFastPath, CSFetchInsts, CSFetchSize, CSGDSInsts, CSLDSBankConflict, CSLDSFetchInsts, CSLDSWriteInsts, CSMemUnitBusy, CSMemUnitStalled, CSPathUtilization, CSSALUBusy, CSSALUInsts, CSTexBusy, CSThreadGroups, CSThreads, CSVALUBusy, CSVALUInsts, CSVALUUtilization, CSVFetchInsts, CSVWriteInsts, CSWaveFronts, CSWriteInsts, CSWriteSize
TextureUnit: TexAveAnisotropy, TexCacheStalled, TexCostOfFiltering, TexelFetchCount, TexMemBytesRead, TexMissRate, TexTriFilteringPct, TexVolFilteringPct
TextureFormat: Pct64SlowTexels, Pct128SlowTexels, PctCompressedTexels, PctDepthTexels, PctInterlacedTexels, PctTex1D, PctTex1DArray, PctTex2D, PctTex2DArray, PctTex2DMSAA, PctTex2DMSAAArray, PctTex3D, PctTexCube, PctTexCubeArray, PctUncompressedTexels, PctVertex64SlowTexels, PctVertex128SlowTexels, PctVertexTexels
DepthAndStencil: HiZQuadsCulled, HiZTilesAccepted, PostZQuads, PostZSamplesFailingS, PostZSamplesFailingZ, PostZSamplesPassing, PreZQuadsCulled, PreZSamplesFailingS, PreZSamplesFailingZ, PreZSamplesPassing, PreZTilesDetailCulled, ZUnitStalled
ColorBuffer **: CBMemRead, CBMemWritten, CBSlowPixelPct
* - available only for 5000 and 6000 series hardware
** - available for 4000, 5000, and 6000 series hardware
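The footnote markers can be captured in a small lookup, which may be handy when deciding which groups to request on a given card. This is only an illustrative sketch, not part of GPU PerfStudio: the names RESTRICTED_GROUPS and is_group_available are hypothetical, and treating unmarked groups as unrestricted is an assumption, since the footnotes only describe the marked groups.

```python
# Hardware series on which each footnoted group is available,
# taken from the * and ** footnotes above.
RESTRICTED_GROUPS = {
    "HullShader": {5000, 6000},         # *
    "DomainShader": {5000, 6000},       # *
    "ComputeShader": {5000, 6000},      # *
    "ColorBuffer": {4000, 5000, 6000},  # **
}


def is_group_available(group, series):
    """Return True if the counter group is available on the given series."""
    allowed = RESTRICTED_GROUPS.get(group)
    if allowed is None:
        # Assumption: groups without a footnote marker carry no restriction.
        return True
    return series in allowed


if __name__ == "__main__":
    for name in ("Timing", "HullShader", "ColorBuffer"):
        print(name, is_group_available(name, 4000))
```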

Counter Names and Descriptions

Here is a list of the available counters with a brief description of each. In some cases, there is also a comment about how to interpret the values, so that you can tell whether they indicate that a change should be made in your application. This description is also shown if you hover the mouse cursor over the counter name.

ClippedPrims: The number of primitives that required one or more clipping operations due to intersecting the view volume or user clip planes.
CulledPrims: The number of culled primitives. Typical reasons include scissor, the primitive having zero area, and back or front face culling.
DepthStencilTestBusy: Percentage of GPUTime spent performing depth and stencil tests.
GPUBusy: Percentage of time the GPU was busy.
GPUTime: Time this API call took to execute on the GPU in milliseconds. Does not include time that draw calls are processed in parallel.
GSALUBusy: The percentage of GPUTime ALU instructions are being processed by the GS.
GSALUEfficiency: ALU vector packing efficiency. Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.
GSALUInstCount: Average number of ALU instructions executed in the GS. Affected by flow control.
GSALUTexRatio: The ratio of ALU to texture instructions in the GS. This can be tuned appropriately to match the target hardware.
GSPrimsIn: The number of primitives passed into the GS.
GSTexBusy: The percentage of GPUTime texture instructions are being processed by the GS.
GSTexInstCount: Average number of texture instructions executed in the GS. Affected by flow control.
GSVerticesOut: The number of vertices output by the GS.
HiZQuadsCulled: Percentage of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized.
HiZTilesAccepted: Percentage of tiles accepted by HiZ that will be rendered to the depth or color buffers.
PAStalledOnRasterizer: Percentage of GPUTime that primitive assembly waits for rasterization to be ready to accept data. This roughly indicates for what percentage of time the pipeline is bottlenecked by pixel operations.
Pct128SlowTexels: Percentage of texture fetches from a 128-bit texture (slow path). There are also 128-bit formats that take a fast path; they are included in PctUncompressedTexels.
PctCompressedTexels: Percentage of texture fetches from compressed textures.
PctDepthTexels: Percentage of texture fetches from depth textures.
PctInterlacedTexels: Percentage of texture fetches from interlaced textures.
PctTex1D: Percentage of texture fetches from a 1D texture.
PctTex1DArray: Percentage of texture fetches from a 1D texture array.
PctTex2D: Percentage of texture fetches from a 2D texture.
PctTex2DArray: Percentage of texture fetches from a 2D texture array.
PctTex2DMSAA: Percentage of texture fetches from a 2D MSAA texture.
PctTex2DMSAAArray: Percentage of texture fetches from a 2D MSAA texture array.
PctTex3D: Percentage of texture fetches from a 3D texture.
PctTexCube: Percentage of texture fetches from a cube map.
PctUncompressedTexels: Percentage of texture fetches from uncompressed textures. Does not include depth or interlaced textures.
PostZQuads: Percentage of quads for which the pixel shader will run and may be postZ tested.
PostZSamplesFailingS: Number of samples tested for Z after shading that failed the stencil test.
PostZSamplesFailingZ: Number of samples tested for Z after shading that failed the Z test.
PostZSamplesPassing: Number of samples tested for Z after shading that passed.
PreZQuadsCulled: Percentage of quads rejected because they were not actually covered by a primitive. High values here suggest that very small primitives were being rendered and a lower mesh LOD could improve performance.
PreZSamplesFailingS: Number of samples tested for Z before shading that failed the stencil test.
PreZSamplesFailingZ: Number of samples tested for Z before shading that failed the Z test.
PreZSamplesPassing: Number of samples tested for Z before shading that passed.
PreZTilesDetailCulled: Percentage of tiles rejected because the associated prim had no contributing area.
PrimitiveAssemblyBusy: Percentage of GPUTime that primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate a pixel shader or output buffer bottleneck.
PrimitivesIn: The number of primitives received by the hardware.
PSALUBusy: The percentage of GPUTime ALU instructions are being processed by the PS.
PSALUEfficiency: ALU vector packing efficiency. Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.
PSALUInstCount: Average number of ALU instructions executed in the PS. Affected by flow control.
PSALUTexRatio: The ratio of ALU to texture instructions in the PS. This can be tuned appropriately to match the target hardware.
PSExportStalls: Percentage of GPUTime that PS output is stalled. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer.
PSPixelsIn: The number of pixels processed by the PS. Does not count pixels culled out by early Z or stencil tests.
PSPixelsOut: The number of pixels exported from the shader to color buffers. Does not include killed or alpha tested pixels. If there are multiple render targets, each receives one export, so this will be 2 for 1 pixel written to two RTs.
PSTexBusy: The percentage of GPUTime texture instructions are being processed by the PS.
PSTexInstCount: Average number of texture instructions executed in the PS. Affected by flow control.
TexAveAnisotropy: The average degree of anisotropy applied. A number between 1 and 16. The anisotropic filtering algorithm only applies samples where they are required (e.g. there will be no extra anisotropic samples if the view vector is perpendicular to the surface) so this can be much lower than the requested anisotropy.
TexCacheStalled: Percentage of GPUTime the texture cache is stalled. Try reducing the number of textures or reducing the number of bits per pixel (i.e., use compressed textures) if possible.
TexCostOfFiltering: The effective cost of all texture filtering. Percentage indicating the cost relative to all filtering being done as bilinear. Should always be greater than or equal to 100 percent. Significantly higher values indicate heavy usage of trilinear or anisotropic filtering.
TexelFetchCount: The total number of texels fetched. This includes all shader types, and any extra fetches caused by trilinear filtering, anisotropic filtering, color formats, and volume textures.
TexMemBytesRead: Texture memory read in bytes.
TexMissRate: Texture cache miss rate (bytes/texel). A normal value for mipmapped textures on typical scenes is approximately (texture_bpp / 4). For 1:1 mapping, it will be texture_bpp.
TexTriFilteringPct: Percentage of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified).
TexUnitBusy: Percentage of GPUTime the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account.
TexVolFilteringPct: Percentage of pixels that received volume filtering.
VertexMemFetched: Number of bytes read from memory due to vertex cache miss.
VSALUBusy: The percentage of GPUTime ALU instructions are being processed by the VS.
VSALUEfficiency: ALU vector packing efficiency. Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.
VSALUInstCount: Average number of ALU instructions executed in the VS. Affected by flow control.
VSALUTexRatio: The ratio of ALU to texture instructions in the VS. This can be tuned appropriately to match the target hardware.
VSTexBusy: The percentage of GPUTime texture instructions are being processed by the VS.
VSTexInstCount: Average number of texture instructions executed in the VS. Affected by flow control.
VSVerticesIn: The number of vertices processed by the VS.
ZUnitStalled: Percentage of GPUTime the depth buffer spends waiting for the color buffer to be ready to accept data. High figures here indicate a bottleneck in color buffer operations.
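A few of the descriptions above state explicit numeric rules: ALU efficiency below 70 percent, a nonzero PSExportStalls, and TexCostOfFiltering never dropping below 100 percent. The sketch below applies only those rules to a set of counter values. It assumes the values have already been exported into a plain dictionary keyed by counter name; the interpret function and that dictionary layout are hypothetical and not part of GPU PerfStudio.

```python
def interpret(counters):
    """Return notes for counters that cross thresholds stated in the descriptions."""
    notes = []

    # *ALUEfficiency: values below 70 percent suggest ALU dependency chains.
    for name, value in counters.items():
        if name.endswith("ALUEfficiency") and value < 70:
            notes.append(
                f"{name}: {value} is below 70 percent; ALU dependency chains "
                "may be preventing full utilization"
            )

    # PSExportStalls: expected to be zero unless late Z or the color buffer
    # is the bottleneck.
    if counters.get("PSExportStalls", 0) > 0:
        notes.append(
            "PSExportStalls: nonzero; possible bottleneck in late Z testing "
            "or in the color buffer"
        )

    # TexCostOfFiltering: should always be at least 100 percent, so a lower
    # value points at a bad reading rather than a tuning opportunity.
    cost = counters.get("TexCostOfFiltering")
    if cost is not None and cost < 100:
        notes.append("TexCostOfFiltering: below 100 percent; reading looks invalid")

    return notes


if __name__ == "__main__":
    sample = {
        "PSALUEfficiency": 55.0,
        "VSALUEfficiency": 92.0,
        "PSExportStalls": 3.5,
        "TexCostOfFiltering": 140.0,
    }
    for note in interpret(sample):
        print(note)
```

Counters whose descriptions only say "high" or "low" are deliberately left out, because the text gives no numeric cutoff for them.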

GSBusy: Percentage of GPUTime that GS is busy.
InterpBusy: Percentage of GPUTime that the interpolator is busy.
PSBusy: Percentage of GPUTime that PS is busy.
VSBusy: Percentage of GPUTime that VS is busy.

CBMemRead: Number of bytes read from the color buffer.
CBMemWritten: Number of bytes written to the color buffer.
Pct64SlowTexels: Percentage of texture fetches from a 64-bit texture (slow path). There are also 64-bit formats that take a fast path; they are included in PctUncompressedTexels.
PctTexCubeArray: Percentage of texture fetches from a cube map array.
PctVertex64SlowTexels: Percentage of texture fetches from a 64-bit vertex texture (slow path). There are also 64-bit formats that take a fast path; they are included in PctVertexTexels.
PctVertex128SlowTexels: Percentage of texture fetches from a 128-bit vertex texture (slow path). There are also 128-bit formats that take a fast path; they are included in PctVertexTexels.
PctVertexTexels: Percentage of texture fetches from vertex textures.
VertexMemFetchedCost: The percentage of GPUTime that is spent fetching from vertex memory due to cache miss. Improve vertex reuse or use smaller vertex formats to reduce this cost.

CBSlowPixelPct: Percentage of pixels written to the color buffer using a half-rate or quarter-rate format.
PAPixelsPerTriangle: The ratio of rasterized pixels to the number of triangles after culling. This does not account for triangles generated due to clipping.
ShaderBusy: Percentage of GPUTime that the shader unit is busy.
ShaderBusyGS: Percentage of work done by shader units for GS.
ShaderBusyPS: Percentage of work done by shader units for PS.
ShaderBusyVS: Percentage of work done by shader units for VS.
CSWriteInsts: The average number of write instructions executed in compute shader per execution. Affected by flow control.

CSBusy: The percentage of time the ShaderUnit has compute shader work to do.
DSBusy: The percentage of time the ShaderUnit has domain shader work to do.
GSBusy: The percentage of time the ShaderUnit has geometry shader work to do.
HSBusy: The percentage of time the ShaderUnit has hull shader work to do.
PSBusy: The percentage of time the ShaderUnit has pixel shader work to do.
VSBusy: The percentage of time the ShaderUnit has vertex shader work to do.
CSFetchSize: The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
CSGDSInsts: The average number of instructions to/from the GDS executed per work-item (affected by flow control).
CSMemUnitBusy: The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
CSMemUnitStalled: The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
CSSALUBusy: The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
CSSALUInsts: The average number of scalar ALU instructions executed per work-item (affected by flow control).
CSThreadGroups: Total number of thread groups.
CSVALUBusy: The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
CSVALUInsts: The average number of vector ALU instructions executed per work-item (affected by flow control).
CSVALUUtilization: The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).
CSVFetchInsts: The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control).
CSVWriteInsts: The average number of vector write instructions to the video memory executed per work-item (affected by flow control).
CSWaveFronts: The total number of wavefronts used for the CS.
CSWriteSize: The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
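The CSVALUUtilization note about work-group sizes that are not a multiple of 64 can be made concrete with a little arithmetic. The sketch below estimates the best utilization a given work-group size could reach, assuming a wavefront holds 64 work-items (as the "multiple of 64" wording implies) and that the unused lanes of a partially filled last wavefront count as inactive. The function name is hypothetical, and the result ignores thread divergence, so a measured value may be lower.

```python
import math

WAVEFRONT_SIZE = 64  # assumption: work-items per wavefront, per the "multiple of 64" note


def max_valu_utilization(work_group_size):
    """Upper bound on CSVALUUtilization (percent) for one work-group size."""
    if work_group_size <= 0:
        raise ValueError("work_group_size must be positive")
    # A work-group occupies whole wavefronts; the last one may be partly empty.
    wavefronts = math.ceil(work_group_size / WAVEFRONT_SIZE)
    return 100.0 * work_group_size / (wavefronts * WAVEFRONT_SIZE)


if __name__ == "__main__":
    for size in (32, 64, 100, 128):
        print(size, max_valu_utilization(size))
```

Under these assumptions a work-group of 100 work-items needs two wavefronts (128 lanes), so at most 78.125 percent of the lanes can be active, while sizes of 64 or 128 can reach 100 percent.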

GSExportPct: The percentage of GS work that is related to exporting primitives.
GSSALUBusy: The percentage of GPUTime scalar ALU instructions are being processed by the GS.
GSSALUInstCount: Average number of scalar ALU instructions executed in the GS. Affected by flow control.
GSVALUBusy: The percentage of GPUTime vector ALU instructions are being processed by the GS.
GSVALUInstCount: Average number of vector ALU instructions executed in the GS. Affected by flow control.
HSSALUBusy: The percentage of GPUTime scalar ALU instructions are being processed by the HS.
HSSALUInstCount: Average number of scalar ALU instructions executed in the HS. Affected by flow control.
HSVALUBusy: The percentage of GPUTime vector ALU instructions are being processed by the HS.
HSVALUInstCount: Average number of vector ALU instructions executed in the HS. Affected by flow control.
PSSALUBusy: The percentage of GPUTime scalar ALU instructions are being processed by the PS.
PSSALUInstCount: Average number of scalar ALU instructions executed in the PS. Affected by flow control.
PSVALUBusy: The percentage of GPUTime vector ALU instructions are being processed by the PS.
PSVALUInstCount: Average number of vector ALU instructions executed in the PS. Affected by flow control.
VSSALUBusy: The percentage of GPUTime scalar ALU instructions are being processed by the VS.
VSSALUInstCount: Average number of scalar ALU instructions executed in the VS. Affected by flow control.
VSVALUBusy: The percentage of GPUTime vector ALU instructions are being processed by the VS.
VSVALUInstCount: Average number of vector ALU instructions executed in the VS. Affected by flow control.

DSALUBusy: The percentage of GPUTime ALU instructions are being processed by the DS.
DSALUEfficiency: ALU vector packing efficiency. Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.
DSALUInstCount: Average number of ALU instructions executed in the DS. Affected by flow control.
DSALUTexRatio: The ratio of ALU to texture instructions. This can be tuned appropriately to match the target hardware.
DSTexBusy: The percentage of GPUTime texture instructions are processed by the DS.
DSTexInstCount: Average number of texture instructions executed in the DS. Affected by flow control.
DSVerticesIn: The number of vertices processed by the DS.
HSALUBusy: The percentage of GPUTime ALU instructions are processed by the HS.
HSALUEfficiency: ALU vector packing efficiency. Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.
HSALUInstCount: Average number of ALU instructions executed in the HS. Affected by flow control.
HSALUTexRatio: The ratio of ALU to texture instructions. This can be tuned appropriately to match the target hardware.
HSPatches: The number of patches processed by the HS.
HSTexBusy: The percentage of GPUTime texture instructions are processed by the HS.
HSTexInstCount: Average number of texture instructions executed in the HS. Affected by flow control.
ShaderBusyDS: Percentage of work done by shader units for DS.
ShaderBusyHS: Percentage of work done by shader units for HS.
TessellatorBusy: Percentage of time the tessellation engine is busy.

CSALUBusy: The percentage of GPUTime ALU instructions are processed by the CS.
CSALUInsts: The number of ALU instructions executed in the CS. Affected by flow control.
CSALUPacking: ALU vector packing efficiency. Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.
CSALUStalledByLDS: The percentage of GPUTime ALU units are stalled because the LDS input queue is full and the output queue is not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible.
CSALUFetchRatio: The ratio of ALU to fetch instructions. This can be tuned appropriately to match the target hardware.
CSCacheHit: The percentage of fetches from the global memory that hit the L1 cache.
CSCompletePath: The total bytes read and written through the CompletePath. This includes extra bytes needed for addressing, atomics, etc. This number indicates a big performance impact (higher number equals lower performance). Reduce it by removing atomics and non-32-bit types, or move them into a second shader.
CSFastPath: The total bytes written through the FastPath (no atomics, 32-bit type only). This includes extra bytes needed for addressing.
CSFetchInsts: Average number of fetch instructions executed in compute shader per execution. Affected by flow control.
CSTexBusy: The percentage of GPUTime texture instructions are being processed by the CS.
CSThreads: The number of CS threads processed by the hardware.
CSLDSBankConflict: The percentage of GPUTime LDS is stalled by bank conflicts.
CSLDSFetchInsts: The average number of fetch instructions from the local memory executed per thread (affected by flow control).
CSLDSWriteInsts: The average number of write instructions to the local memory executed per thread (affected by flow control).
CSPathUtilization: The percentage of bytes read and written through the FastPath or CompletePath compared to the total number of bytes transferred over the bus. To increase the path utilization, remove atomics and non-32-bit types.
ShaderBusyCS: Percentage of work done by shader units for CS.