Intel® VTune™ Amplifier 2018 Help
Like the Summary window in the GUI, the summary report provides overall performance data for your target. Intel® VTune™ Amplifier automatically generates the summary report when data collection completes. To disable this report, use the no-summary option in your command when performing a collect or collect-with action.
Use the following syntax to generate the summary report from an existing result:
$ amplxe-cl -report summary -result-dir <result_path>
The summary report output depends on the collection type:
For User-Mode Sampling and Tracing Collection results, the summary report includes the following sections:
Collection and Platform Information
CPU Information
Summary per basic analysis metrics
This example generates the summary report for the r000hs Basic Hotspots analysis result on Linux*:
$ amplxe-cl -report summary -r r000hs
Collection and Platform Info
----------------------------
Parameter r000hs
------------------------ -------------------------------------------------------------
Application Command Line /home/tachyon/find_hotspots
Operating System Ubuntu 11.04
Computer Name My Computer
Result Size 2926817
Collection start time 11:17:06 13/06/2017 UTC
Collection stop time 11:17:20 13/06/2017 UTC
CPU
---
Parameter r000hs
---------------------- -------------------------------------------------
Name 4th generation Intel® Core™ Processor family
Frequency 2494226458
Logical CPU Core Count 4
Summary
-------
Elapsed Time: 14.094
CPU Time: 11.319
Average CPU Usage: 0.749
amplxe: Executing actions 100 % done
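The Summary section is plain "Name: value" text, so it is straightforward to post-process in a script. The following Python sketch assumes the layout shown in the example above; parse_summary is a hypothetical helper written for this illustration, not part of VTune Amplifier.

```python
import re

# Matches lines such as "Elapsed Time: 14.094" or "Elapsed Time: 23.182s".
_METRIC = re.compile(r"^\s*([^:|]+?):\s*([-+]?\d+(?:\.\d+)?)s?\s*$")


def parse_summary(text):
    """Return {metric name: float} for every 'Name: number' line in text."""
    metrics = {}
    for line in text.splitlines():
        match = _METRIC.match(line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


# Sample text copied from the r000hs report above.
sample = """Summary
-------
Elapsed Time: 14.094
CPU Time: 11.319
Average CPU Usage: 0.749
amplxe: Executing actions 100 % done
"""
print(parse_summary(sample))
```

Lines that do not end in a number, such as the section title or the progress message, are skipped.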
This example generates a summary report for the Locks and Waits analysis result r003lw. The summary portion of the report shows that the multithreaded target spent 64 seconds waiting, with an average concurrency of only 1.073:
$ amplxe-cl -report summary -r r003lw
Summary
-------
Average Concurrency: 1.073
Elapsed Time: 13.911
CPU Time: 11.031
Wait Time: 64.468
Average CPU Usage: 0.768
To identify the cause of the wait, view the result in the GUI performance pane, or generate a performance report.
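In this report, Wait Time (64.468) is several times larger than Elapsed Time (13.911), which is consistent with wait time being accumulated over all threads. Under that assumption, dividing one by the other gives a rough count of threads that were waiting at any given moment. A minimal Python sketch of the arithmetic (the function name is illustrative only):

```python
def average_waiting_threads(wait_time, elapsed_time):
    """Rough number of threads waiting at once, assuming Wait Time is summed over threads."""
    if elapsed_time <= 0:
        raise ValueError("elapsed_time must be positive")
    return wait_time / elapsed_time


# Values from the r003lw summary above.
print(round(average_waiting_threads(64.468, 13.911), 2))
```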
For Hardware Event-based Sampling Collection results, the summary report includes the following information (if available):
For some analysis types, the command-line summary report provides an issue description for metrics that exceed the predefined threshold. If you want to skip issues in the summary report, do one of the following:
Use the -report-knob show-issues=false option when generating the report, for example: $ amplxe-cl -report summary -r r001hpc -report-knob show-issues=false
Use the -format=csv option to view the report in the CSV format, for example: $ amplxe-cl -report summary -r r001hpc -format=csv
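When reports are generated from a script, the two options above can be assembled programmatically. The sketch below only builds the argument list from the options shown in this section; summary_report_command is a hypothetical helper, and running the command assumes amplxe-cl is on the PATH.

```python
import shlex


def summary_report_command(result_dir, show_issues=True, csv=False):
    """Build the amplxe-cl argument list for a summary report."""
    args = ["amplxe-cl", "-report", "summary", "-r", result_dir]
    if not show_issues:
        # Suppress issue descriptions in the text report.
        args += ["-report-knob", "show-issues=false"]
    if csv:
        # Request CSV output instead of the default text layout.
        args.append("-format=csv")
    return args


cmd = summary_report_command("r001hpc", show_issues=False)
print(" ".join(shlex.quote(a) for a in cmd))
# To execute: subprocess.run(cmd, check=True), assuming amplxe-cl is installed.
```

Passing the arguments as a list avoids shell quoting problems when the result path contains spaces.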
This example generates the summary report for the r001ah Advanced Hotspots analysis result.
$ amplxe-cl -report summary -r r001ah
Collection and Platform Info
----------------------------
Parameter r001ah
------------------------ --------------------------------------------------------------------------------------------
Application Command Line C:\tachyon\vc10\find_hotspots_Win32_Release\find_hotspots.exe
Operating System Microsoft Windows 8.1
Computer Name My Computer
Result Size 37188680
Collection start time 09:59:01 11/05/2017 UTC
Collection stop time 09:59:28 11/05/2017 UTC
CPU
---
Parameter r001ah
---------------------- -------------------------------------------------
Name 4th generation Intel® Core™ Processor family
Frequency 2494232562
Logical CPU Core Count 4
Summary
-------
Elapsed Time: 26.785
CPU Time: 16.394
Average CPU Usage: 0.610
CPI Rate: 0.413
Event summary
-------------
Hardware Event Type       Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------  -------------------------  --------------------------------  -----------------
INST_RETIRED.ANY          110633200000               58228                             1900000
CPU_CLK_UNHALTED.THREAD   45653200000                24028                             1900000
CPU_CLK_UNHALTED.REF_TSC  40889900000                21521                             1900000
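The numbers in the Event summary can be cross-checked by hand. The Python sketch below uses only the values printed above and assumes the usual definition of CPI (unhalted core clockticks divided by instructions retired); it confirms that each event count equals the sample count multiplied by Events Per Sample, and that the ratio reproduces the reported CPI Rate.

```python
# Rows copied from the Event summary above: (event count, sample count, events per sample).
events = {
    "INST_RETIRED.ANY": (110633200000, 58228, 1900000),
    "CPU_CLK_UNHALTED.THREAD": (45653200000, 24028, 1900000),
    "CPU_CLK_UNHALTED.REF_TSC": (40889900000, 21521, 1900000),
}

# Each event count equals the number of samples times the events per sample.
for name, (count, samples, per_sample) in events.items():
    assert count == samples * per_sample, name

# CPI = unhalted core clockticks / instructions retired.
cpi = events["CPU_CLK_UNHALTED.THREAD"][0] / events["INST_RETIRED.ANY"][0]
print(f"CPI Rate: {cpi:.3f}")
```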
Use the Elapsed Time metric as your performance baseline when evaluating optimizations. The Average CPU Usage metric is the total CPU time divided by the Elapsed Time, and shows how many cores were busy on average. For example, 'Average CPU Usage: 5.907' means that nearly 6 cores were running on average over the whole run. On an 8-core system, for example, this leaves some potential to parallelize the code further.
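The relationship between these metrics can be sketched in a few lines of Python. The function names are illustrative, and the quotient is only an approximation of the reported value: for the r001ah result it gives about 0.612, while the report prints 0.610.

```python
def average_cpu_usage(cpu_time, elapsed_time):
    """Average number of busy CPUs: total CPU time over wall-clock time."""
    return cpu_time / elapsed_time


def idle_cpus(avg_usage, cpu_count):
    """CPUs left unused on average."""
    return cpu_count - avg_usage


# r001ah summary above: 16.394 s of CPU time over 26.785 s elapsed.
print(round(average_cpu_usage(16.394, 26.785), 2))

# The 8-core example from the text: about two cores remain idle on average.
print(round(idle_cpus(5.907, 8), 3))

# Dividing Average CPU Usage by the CPU count gives a utilization percentage,
# here for the HPC example later in this section (13.920 of 24 logical CPUs).
print(f"{13.920 / 24:.1%}")
```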
This command generates the summary report for the HPC Performance Characterization analysis result and skips issue descriptions:
$ amplxe-cl -report summary -r r001hpc -report-knob show-issues=false
Elapsed Time: 23.182s
GFLOPS: 14.748
CPU Utilization: 58.0%
Average CPU Usage: 13.920 Out of 24 logical CPUs
Serial Time: 0.069s (0.3%)
Parallel Region Time: 23.113s (99.7%)
Estimated Ideal Time: 14.010s (60.4%)
OpenMP Potential Gain: 9.103s (39.3%)
Memory Bound: 0.446
Cache Bound: 0.175
DRAM Bound: 0.216
NUMA: % of Remote Accesses: 38.3%
FPU Utilization: 2.7%
GFLOPS: 14.748
Scalar GFLOPS: 4.801
Packed GFLOPS: 9.947
Collection and Platform Info
Application Command Line: ./sp.B.x
User Name: vtune
Operating System: 3.10.0-327.el7.x86_64 NAME="Red Hat Enterprise Linux Server" VERSION="7.2 (Maipo)" ID="rhel" ID_LIKE="fedora" VERSION_ID="7.2" PRETTY_NAME="Red Hat Enterprise Linux Server 7.2 (Maipo)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:7.2:GA:server" HOME_URL="https://www.redhat.com/" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7" REDHAT_BUGZILLA_PRODUCT_VERSION=7.2 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="7.2"
Computer Name: nntvtune235
Result Size: 1 GB
Collection start time: 19:04:30 13/06/2017 UTC
Collection stop time: 19:04:53 13/06/2017 UTC
CPU
Name: Intel® Xeon® E5/E7 v2 Processor code named Ivytown
Frequency: 2.694 GHz
Logical CPU Count: 24
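Several metrics in the HPC Performance Characterization summary are related arithmetically, which makes the report easier to read. The Python sketch below uses only the values printed above and rests on relationships inferred from those numbers: the percentages are taken relative to Elapsed Time, OpenMP Potential Gain is Parallel Region Time minus Estimated Ideal Time, and GFLOPS is the sum of the scalar and packed figures.

```python
# Values copied from the HPC Performance Characterization summary above.
elapsed, serial, parallel, ideal = 23.182, 0.069, 23.113, 14.010
scalar_gflops, packed_gflops = 4.801, 9.947


def share(part, whole=elapsed):
    """Format part as a percentage of whole, with one decimal place."""
    return f"{part / whole:.1%}"


potential_gain = parallel - ideal  # OpenMP Potential Gain

print(f"Serial + parallel time: {serial + parallel:.3f}s")
print(f"Serial Time share: {share(serial)}")
print(f"Parallel Region Time share: {share(parallel)}")
print(f"Estimated Ideal Time share: {share(ideal)}")
print(f"OpenMP Potential Gain: {potential_gain:.3f}s ({share(potential_gain)})")
print(f"GFLOPS: {scalar_gflops + packed_gflops:.3f}")
```

In this result the potential gain is more than a third of the elapsed time, so the parallel regions are the natural place to look first.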
This command generates the summary report for the Memory Access analysis result collected on Windows and shows issue descriptions:
$ amplxe-cl -report summary -r r001macc
Elapsed Time: 7.917s
CPU Time: 6.473s
Memory Bound: 21.9% of Pipeline Slots
| The metric value is high. This may indicate that a significant fraction
| of execution pipeline slots could be stalled due to demand memory load
| and stores. Explore the metric breakdown by memory hierarchy, memory
| bandwidth information, and correlation by memory objects.
|
L1 Bound: 8.0% of Clockticks
| This metric shows how often machine was stalled without missing the
| L1 data cache. The L1 cache typically has the shortest latency.
| However, in certain cases like loads blocked on older stores, a load
| might suffer a high latency even though it is being satisfied by the
| L1.
|
L2 Bound: 3.0% of Clockticks
L3 Bound: 5.0% of Clockticks
| This metric shows how often CPU was stalled on L3 cache, or contended
| with a sibling Core. Avoiding cache misses (L2 misses/L3 hits)
| improves the latency and increases performance.
|
DRAM Bound: 4.1% of Clockticks
DRAM Bandwidth Bound: 0.4% of Elapsed Time
Memory Latency: 0.000
Loads: 10,137,704,122
Stores: 3,208,896,264
LLC Miss Count: 1,750,105
Average Latency (cycles): 11
Total Thread Count: 21
Paused Time: 0s
System Bandwidth
Max DRAM System Bandwidth: 15 GB
Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average Bandwidth  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -----------------  ---------------------------------------------
DRAM, GB/sec      15                11.300            2.836              0.4%
Collection and Platform Info
Application Command Line: C:\samples\tachyon\vc10\analyze_locks_Win32_Release\analyze_locks.exe "C:\samples\tachyon\dat\balls.dat"
Operating System: Microsoft Windows 10
Computer Name: My Computer
Result Size: 31 MB
Collection start time: 09:33:44 07/06/2017 UTC
Collection stop time: 09:33:52 07/06/2017 UTC
CPU
Name: Intel® Processor code named Skylake ULT
Frequency: 2.496 GHz
Logical CPU Count: 4
The Bandwidth Utilization section in the summary report shows the following metrics:
Platform Maximum: Expected maximum bandwidth for the system. This value can be estimated automatically with a micro-benchmark at the start of the analysis, or hard-coded based on theoretical bandwidth limits.
Observed Maximum: Maximum bandwidth observed during the analysis. If the value is close to the Platform Maximum, your workload is probably bandwidth-limited.
Average Bandwidth: Average bandwidth utilization during the analysis.
% of Elapsed Time with High BW Utilization: Percentage of Elapsed time spent heavily utilizing system bandwidth.
This information is provided for every bandwidth domain present in the result (DRAM, MCDRAM, QPI, and so on).
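To judge how close a workload comes to the platform limit, the observed and average values can be expressed as fractions of the Platform Maximum. The Python sketch below uses the DRAM row from the Memory Access example above; the function name and the 0.9 cut-off are assumptions made for illustration, not thresholds defined by VTune Amplifier.

```python
def bandwidth_ratios(platform_max, observed_max, average):
    """Return observed and average bandwidth as fractions of the platform maximum."""
    return observed_max / platform_max, average / platform_max


# DRAM row from the Memory Access example above, in GB/sec.
peak_ratio, avg_ratio = bandwidth_ratios(15, 11.300, 2.836)
print(f"peak: {peak_ratio:.0%}, average: {avg_ratio:.0%}")

# The 0.9 cut-off is an arbitrary illustration, not a VTune Amplifier threshold.
print("likely bandwidth-limited" if peak_ratio > 0.9 else "headroom remains")
```

For this result the peak reaches about three quarters of the platform maximum and the average is far lower, which agrees with the small share of elapsed time (0.4%) reported as high bandwidth utilization.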