Summary Report

Similar to the Summary window available in the GUI, the summary report provides overall performance data for your target. Intel® VTune™ Amplifier automatically generates the summary report when data collection completes. To disable this report, use the -no-summary option in your command when performing a collect or collect-with action.

Use the following syntax to generate the Summary report from a preexisting result:

$ amplxe-cl -report summary -result-dir <result_path>

The summary report output depends on the collection type:

User-mode Sampling and Tracing Collection Summary Report
Hardware Event-based Sampling Collection Summary Report

User-mode Sampling and Tracing Collection Summary Report

For User-mode Sampling and Tracing Collection results, the summary report includes the following sections:

Collection and Platform Information
CPU Information
Summary per basic analysis metrics

Example 1: Basic Hotspots Summary

This example generates the summary report for the r000hs Basic Hotspots analysis result on Linux*:

$ amplxe-cl -report summary -r r000hs

Collection and Platform Info
----------------------------
Parameter                 r000hs
------------------------  -------------------------------------------------------------
Application Command Line  /home/tachyon/find_hotspots
Operating System          Ubuntu 11.04
Computer Name             My Computer
Result Size               2926817
Collection start time     11:17:06 13/06/2017 UTC
Collection stop time      11:17:20 13/06/2017 UTC

CPU
---
Parameter               r000hs
----------------------  -------------------------------------------------
Name                    4th generation Intel® Core™ Processor family
Frequency               2494226458
Logical CPU Core Count  4

Summary
-------
Elapsed Time:       14.094
CPU Time:           11.319
Average CPU Usage:  0.749
amplxe: Executing actions 100 % done

Example 2: Locks and Waits Summary

This example generates a summary report for the Locks and Waits analysis result r003lw. The summary portion of the report shows that the multithreaded target spent 64 seconds waiting, with an average concurrency of only 1.073:

$ amplxe-cl -report summary -r r003lw

Summary
-------
Average Concurrency:  1.073
Elapsed Time:         13.911
CPU Time:             11.031
Wait Time:            64.468
Average CPU Usage:    0.768

To identify the cause of the wait, view the result in the GUI performance pane, or generate a performance report.
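
If you post-process the summary in a script, the "Name: value" lines shown above are straightforward to read. The following is a minimal sketch, assuming the report text has already been captured into a string; the function name parse_summary and the SAMPLE constant are hypothetical and not part of the product:

```python
def parse_summary(text):
    """Parse 'Name: value' lines from a summary section into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # headings and separator lines contain no colon
        try:
            metrics[name.strip()] = float(value.strip())
        except ValueError:
            continue  # the value is not a plain number, so skip the line
    return metrics


# Summary section copied from Example 2.
SAMPLE = """\
Summary
-------
Average Concurrency:  1.073
Elapsed Time:         13.911
CPU Time:             11.031
Wait Time:            64.468
Average CPU Usage:    0.768
"""

summary = parse_summary(SAMPLE)
print(summary["Wait Time"], summary["Average Concurrency"])
```

Lines without a numeric value, such as progress messages, are skipped rather than reported as errors.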

Hardware Event-based Sampling Collection Summary Report

For Hardware Event-based Sampling Collection results, the summary report includes the following information (if available):

Collection and Platform information
General Exploration metrics
CPU information
GPU information
Summary per basic analysis metrics
Event summary
Uncore Event summary

For some analysis types, the command-line summary report provides an issue description for metrics that exceed the predefined threshold. If you want to skip issues in the summary report, do one of the following:

Use the -report-knob show-issues=false option when generating the report, for example: $ amplxe-cl -report summary -r r001hpc -report-knob show-issues=false
Use the -format=csv option to view the report in the CSV format, for example: $ amplxe-cl -report summary -r r001hpc -format=csv
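
When the report is generated from a script, the two options above can be assembled programmatically. This is a minimal sketch that only builds the argument list from the options shown in this topic; the helper name summary_command is hypothetical, and actually running the command (for example with subprocess.run) assumes amplxe-cl is on your PATH:

```python
def summary_command(result_dir, show_issues=True, csv=False):
    """Build the amplxe-cl argument list for a summary report."""
    cmd = ["amplxe-cl", "-report", "summary", "-r", result_dir]
    if not show_issues:
        # Skip issue descriptions in the text report.
        cmd += ["-report-knob", "show-issues=false"]
    if csv:
        # Request CSV output.
        cmd.append("-format=csv")
    return cmd


print(" ".join(summary_command("r001hpc", show_issues=False)))
```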

Example 3: Advanced Hotspots Summary

This example generates the summary report for the r001ah Advanced Hotspots analysis result:

$ amplxe-cl -report summary -r r001ah

Collection and Platform Info
----------------------------
Parameter                 r001ah
------------------------  --------------------------------------------------------------------------------------------
Application Command Line  C:\tachyon\vc10\find_hotspots_Win32_Release\find_hotspots.exe
Operating System          Microsoft Windows 8.1
Computer Name             My Computer
Result Size               37188680
Collection start time     09:59:01 11/05/2017 UTC
Collection stop time      09:59:28 11/05/2017 UTC

CPU
---
Parameter               r001ah
----------------------  -------------------------------------------------
Name                    4th generation Intel® Core™ Processor family
Frequency               2494232562
Logical CPU Core Count  4

Summary
-------
Elapsed Time:       26.785
CPU Time:           16.394
Average CPU Usage:  0.610
CPI Rate:           0.413

Event summary
-------------
Hardware Event Type       Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------  -------------------------  --------------------------------  -----------------
INST_RETIRED.ANY                       110633200000                             58228            1900000
CPU_CLK_UNHALTED.THREAD                 45653200000                             24028            1900000
CPU_CLK_UNHALTED.REF_TSC                40889900000                             21521            1900000
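
The columns of the event summary are related arithmetically, which the sketch below checks with the Example 3 values. It assumes that the event count is the sample count multiplied by Events Per Sample, and that the CPI Rate is CPU_CLK_UNHALTED.THREAD divided by INST_RETIRED.ANY; both assumptions agree with the numbers shown, but treat them as an illustration rather than a definition. The names EVENTS and estimated_count are hypothetical:

```python
# (event count, sample count, events per sample), copied from the table above.
EVENTS = {
    "INST_RETIRED.ANY": (110633200000, 58228, 1900000),
    "CPU_CLK_UNHALTED.THREAD": (45653200000, 24028, 1900000),
    "CPU_CLK_UNHALTED.REF_TSC": (40889900000, 21521, 1900000),
}


def estimated_count(samples, events_per_sample):
    """Scale the number of samples by the sampling interval."""
    return samples * events_per_sample


# True if every reported count equals samples * events per sample.
counts_match = all(
    estimated_count(samples, per_sample) == count
    for count, samples, per_sample in EVENTS.values()
)

# Cycles per instruction: clockticks divided by instructions retired.
cpi = EVENTS["CPU_CLK_UNHALTED.THREAD"][0] / EVENTS["INST_RETIRED.ANY"][0]

print(counts_match)
print(f"CPI Rate: {cpi:.3f}")
```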

Use the Elapsed Time metric as the baseline against which you measure your optimizations. The Average CPU Usage metric is the total CPU time divided by the Elapsed time, which gives the average CPU utilization. For example, 'Average CPU Usage: 5.907' means that, on average, nearly 6 cores were busy over the run of your program. On an 8-core system, for example, this leaves some room to parallelize the code further.
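
The arithmetic behind this metric can be sketched as follows. The CPU time and elapsed time passed to average_cpu_usage are hypothetical values chosen to reproduce the 5.907 figure from the text; the function names are illustrative only. Dividing the usage by the number of logical CPUs gives the fraction of the machine that was busy, which for the HPC Performance Characterization example in this topic (13.920 out of 24 logical CPUs) matches its reported CPU Utilization of 58.0%:

```python
def average_cpu_usage(cpu_time, elapsed_time):
    """Total CPU time divided by elapsed time (average number of busy CPUs)."""
    return cpu_time / elapsed_time


def utilization(avg_usage, logical_cpus):
    """Fraction of the available logical CPUs that were busy on average."""
    return avg_usage / logical_cpus


# Hypothetical run: 47.256 s of CPU time over 8 s of elapsed time.
usage = average_cpu_usage(47.256, 8.0)
print(f"{usage:.3f}")                    # average number of busy CPUs
print(f"{utilization(usage, 8):.1%}")    # share of an 8-core system in use
print(f"{utilization(13.920, 24):.1%}")  # values from the HPC example
```

The remaining share (about a quarter of the 8-core system in the hypothetical run) is the headroom that additional parallelism could use.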

Example 4: HPC Performance Characterization Summary

This command generates the summary report for the HPC Performance Characterization analysis result and skips issue descriptions:

$ amplxe-cl -report summary -r r001hpc -report-knob show-issues=false

Elapsed Time: 23.182s
GFLOPS: 14.748
CPU Utilization: 58.0%
    Average CPU Usage: 13.920 Out of 24 logical CPUs
    Serial Time: 0.069s (0.3%)
    Parallel Region Time: 23.113s (99.7%)
        Estimated Ideal Time: 14.010s (60.4%)
        OpenMP Potential Gain: 9.103s (39.3%)
Memory Bound: 0.446
    Cache Bound: 0.175
    DRAM Bound: 0.216
    NUMA: % of Remote Accesses: 38.3%
FPU Utilization: 2.7%
    GFLOPS: 14.748
        Scalar GFLOPS: 4.801
        Packed GFLOPS: 9.947
Collection and Platform Info
    Application Command Line: ./sp.B.x
    User Name: vtune
    Operating System: 3.10.0-327.el7.x86_64 NAME="Red Hat Enterprise Linux Server" VERSION="7.2 (Maipo)" ID="rhel" ID_LIKE="fedora" VERSION_ID="7.2" PRETTY_NAME="Red Hat Enterprise Linux Server 7.2 (Maipo)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:7.2:GA:server" HOME_URL="https://www.redhat.com/" BUG_REPORT_URL="https://bugzilla.redhat.com/"  REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7" REDHAT_BUGZILLA_PRODUCT_VERSION=7.2 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="7.2"
    Computer Name: nntvtune235
    Result Size: 1 GB
    Collection start time: 19:04:30 13/06/2017 UTC
    Collection stop time: 19:04:53 13/06/2017 UTC
    Name: Intel® Xeon® E5/E7 v2 Processor code named Ivytown
    Frequency: 2.694 GHz
    Logical CPU Count: 24
    CPU
        Name: Intel® Xeon® E5/E7 v2 Processor code named Ivytown
        Frequency: 2.694 GHz
        Logical CPU Count: 24
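
The time metrics in this output are nested: judging from the sample numbers, Serial Time plus Parallel Region Time adds up to Elapsed Time, Estimated Ideal Time plus OpenMP Potential Gain adds up to Parallel Region Time, and the percentages are relative to Elapsed Time. That reading is inferred from this one example, not stated by the report itself. A minimal sketch that checks it (the helper name share is hypothetical):

```python
import math

# Values in seconds, copied from the Example 4 output above.
elapsed = 23.182
serial = 0.069
parallel = 23.113
ideal = 14.010
gain = 9.103


def share(part, whole):
    """Return part/whole formatted as a percentage with one decimal place."""
    return f"{part / whole:.1%}"


# Check that the child metrics add up to their parent within rounding.
parts_match = (
    math.isclose(serial + parallel, elapsed, abs_tol=0.001)
    and math.isclose(ideal + gain, parallel, abs_tol=0.001)
)

print(parts_match)
print("OpenMP Potential Gain:", share(gain, elapsed))
print("Estimated Ideal Time:", share(ideal, elapsed))
```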

Example 5: Memory Access Summary

This command generates the summary report for the Memory Access analysis result collected on Windows and shows issue descriptions:

$ amplxe-cl -report summary -r r001macc


Elapsed Time: 7.917s
    CPU Time: 6.473s
    Memory Bound: 21.9% of Pipeline Slots
     | The metric value is high. This may indicate that a significant fraction
     | of execution pipeline slots could be stalled due to demand memory load
     | and stores. Explore the metric breakdown by memory hierarchy, memory
     | bandwidth information, and correlation by memory objects.
     |
        L1 Bound: 8.0% of Clockticks
         | This metric shows how often machine was stalled without missing the
         | L1 data cache. The L1 cache typically has the shortest latency.
         | However, in certain cases like loads blocked on older stores, a load
         | might suffer a high latency even though it is being satisfied by the
         | L1.
         |
        L2 Bound: 3.0% of Clockticks
        L3 Bound: 5.0% of Clockticks
         | This metric shows how often CPU was stalled on L3 cache, or contended
         | with a sibling Core. Avoiding cache misses (L2 misses/L3 hits)
         | improves the latency and increases performance.
         |
        DRAM Bound: 4.1% of Clockticks
            DRAM Bandwidth Bound: 0.4% of Elapsed Time
            Memory Latency: 0.000
    Loads: 10,137,704,122
    Stores: 3,208,896,264
    LLC Miss Count: 1,750,105
    Average Latency (cycles): 11
    Total Thread Count: 21
    Paused Time: 0s
System Bandwidth
    Max DRAM System Bandwidth: 15 GB 

Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average Bandwidth  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -----------------  ---------------------------------------------
DRAM, GB/sec      15                          11.300              2.836                                           0.4%
Collection and Platform Info
    Application Command Line: C:\samples\tachyon\vc10\analyze_locks_Win32_Release\analyze_locks.exe "C:\samples\tachyon\dat\balls.dat" 
    Operating System: Microsoft Windows 10
    Computer Name: My Computer
    Result Size: 31 MB 
    Collection start time: 09:33:44 07/06/2017 UTC
    Collection stop time: 09:33:52 07/06/2017 UTC
    CPU
        Name: Intel® Processor code named Skylake ULT
        Frequency: 2.496 GHz
        Logical CPU Count: 4

The Bandwidth Utilization section in the summary report shows the following metrics:

Platform Maximum: Expected maximum bandwidth for the system. This value can be estimated automatically by a micro-benchmark run at the start of the analysis, or hard-coded based on theoretical bandwidth limits.
Observed Maximum: Maximum bandwidth observed during the analysis. If the value is close to the Platform Maximum, your workload is probably bandwidth-limited.
Average Bandwidth: Average bandwidth utilization during the analysis.
% of Elapsed Time with High BW Utilization: Percentage of Elapsed time spent heavily utilizing system bandwidth.

This information is provided for every bandwidth domain present in the result (DRAM, MCDRAM, QPI, and so on).
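
To judge how close the Observed Maximum is to the Platform Maximum, you can compare the two values directly. The sketch below uses the DRAM row from Example 5; the 0.9 threshold for "close" is an arbitrary assumption made for illustration, not a value defined by the product, and the function names are hypothetical:

```python
def bandwidth_ratio(observed_max, platform_max):
    """Observed maximum bandwidth as a fraction of the platform maximum."""
    return observed_max / platform_max


def looks_bandwidth_limited(observed_max, platform_max, threshold=0.9):
    """Flag a domain whose observed maximum is close to the platform maximum.

    The default threshold of 0.9 is an assumption chosen for this sketch.
    """
    return bandwidth_ratio(observed_max, platform_max) >= threshold


# DRAM row from Example 5: Platform Maximum 15 GB/sec, Observed Maximum 11.300 GB/sec.
ratio = bandwidth_ratio(11.300, 15)
print(f"{ratio:.1%}")
print(looks_bandwidth_limited(11.300, 15))
```

With these values the observed maximum reaches roughly three quarters of the platform maximum, so the check does not flag the workload under the assumed threshold.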