x86/broadwell/bdw-metrics.json

3         "BriefDescription": "C2 residency percent per package",
4         "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC",
10         "BriefDescription": "C3 residency percent per core",
11         "MetricExpr": "cstate_core@c3\\-residency@ / TSC",
17         "BriefDescription": "C3 residency percent per package",
18         "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC",
24         "BriefDescription": "C6 residency percent per core",
25         "MetricExpr": "cstate_core@c6\\-residency@ / TSC",
31         "BriefDescription": "C6 residency percent per package",
32         "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC",
38         "BriefDescription": "C7 residency percent per core",
39         "MetricExpr": "cstate_core@c7\\-residency@ / TSC",
45         "BriefDescription": "C7 residency percent per package",
46         "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC",
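Each C-state metric above divides a residency counter by the TSC over the same window, so the result is the fraction of elapsed reference time spent in that state. The sketch below assumes you already have both counters as deltas over one measurement interval. The helper name and the sample numbers are hypothetical. Multiplying by 100 to get the "percent" that the descriptions mention is also an assumption about presentation, because the expressions themselves stop at the ratio.

```python
def residency_percent(residency_delta: int, tsc_delta: int) -> float:
    """C-state residency as a percentage: residency counter / TSC, scaled by 100.

    Both arguments are counter deltas taken over the same measurement window.
    """
    if tsc_delta <= 0:
        raise ValueError("TSC delta must be positive")
    return 100.0 * residency_delta / tsc_delta


# Hypothetical deltas: 1.5e9 of 3.0e9 TSC ticks were spent in the C-state.
print(residency_percent(1_500_000_000, 3_000_000_000))  # 50.0
```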
52         "BriefDescription": "Uncore frequency per die [GHz]",
59         "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)",
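Read literally, this expression reports the share of APERF cycles that the `cycles` event did not count, and reports zero when no SMI was observed. The premise, which this file does not state and which is an assumption here, is that APERF keeps counting while the core is in System Management Mode whereas `cycles` is frozen (perf arranges this through its `freeze_on_smi` setting). The function name and inputs below are hypothetical, and the inputs are counter deltas over one interval. Note that `a / b if c else 0` groups as `(a / b) if c else 0`, the same way Python parses it.

```python
def smi_cost(aperf: int, cycles: int, smi_count: int) -> float:
    """Mirror of the MetricExpr: (aperf - cycles) / aperf if smi > 0 else 0.

    Assumption: aperf counts through SMM while `cycles` does not, so the
    difference approximates cycles spent handling SMIs.
    """
    if smi_count > 0 and aperf > 0:
        return (aperf - cycles) / aperf
    return 0.0


print(smi_cost(aperf=1_000_000, cycles=990_000, smi_count=3))  # 0.01
print(smi_cost(aperf=1_000_000, cycles=990_000, smi_count=0))  # 0.0
```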
78 …sible; which incur a few cycles of load re-issue. However; the short re-issue duration is often hidde…
96 …er-cases for operations that cannot be handled natively by the execution pipeline. For example; wh…
102         "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)",
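The four level-1 Top-down (TMA) categories partition all issue slots, so this expression derives Backend Bound as whatever Frontend Bound, Bad Speculation and Retiring leave over. A minimal sketch with hypothetical fractions (chosen to be exact in binary so the result prints cleanly):

```python
def tma_backend_bound(frontend_bound: float, bad_speculation: float, retiring: float) -> float:
    """Level-1 TMA remainder: 1 - (frontend + bad speculation + retiring)."""
    return 1 - (frontend_bound + bad_speculation + retiring)


print(tma_backend_bound(frontend_bound=0.25, bad_speculation=0.125, retiring=0.375))  # 0.25
```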
107 …processor core where the out-of-order scheduler dispatches ready uops into their respective execut…
112 …"MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / …
117 …s for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For…
128 …etched from an incorrectly speculated program path; or stalls when the out-of-order part of the ma…
137 … corrected path; following all sorts of mispredicted branches. For example; branchy code with lo…
143         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
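This expression subtracts one estimated fraction from another and clamps at zero. A plausible reading, stated as an assumption rather than taken from the file, is that the parent and child come from different events, so their raw difference can dip slightly below zero and the clamp keeps the result a valid fraction. The helper name below is hypothetical:

```python
def clamped_difference(parent: float, child: float) -> float:
    """max(0, parent - child): the part of `parent` not explained by `child`.

    The clamp hides small negative values caused by inconsistent estimates.
    """
    return max(0.0, parent - child)


print(clamped_difference(0.125, 0.0625))  # 0.0625
print(clamped_difference(0.05, 0.06))     # 0.0
```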
147 … as in the case of read-modify-write as an example. Since these instructions require multiple uops…
166 …sted accesses occur when data written by one Logical Processor are read by another Logical Process…
170 …"BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of…
172         "MetricExpr": "tma_backend_bound - tma_memory_bound",
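This line and the `tma_frontend_bound - tma_fetch_latency`, `tma_retiring - tma_heavy_operations` and `tma_bad_speculation - tma_branch_mispredicts` expressions further down share one pattern: a level-2 node is computed as its parent minus the sibling that is measured directly. The sketch below applies all four expressions to a dictionary of already-computed fractions. The output keys (`tma_core_bound`, `tma_fetch_bandwidth`, `tma_light_operations`, `tma_machine_clears`) are our labels; they match the usual TMA node names, but the MetricName fields are not shown in this listing. The input values are hypothetical.

```python
from typing import Dict


def derive_remainders(m: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of `m` with each derived node added as parent minus measured child."""
    out = dict(m)
    out["tma_core_bound"] = m["tma_backend_bound"] - m["tma_memory_bound"]
    out["tma_fetch_bandwidth"] = m["tma_frontend_bound"] - m["tma_fetch_latency"]
    out["tma_light_operations"] = m["tma_retiring"] - m["tma_heavy_operations"]
    out["tma_machine_clears"] = m["tma_bad_speculation"] - m["tma_branch_mispredicts"]
    return out


# Hypothetical level-1 values that sum to 1.0, plus the measured level-2 children.
measured = {
    "tma_frontend_bound": 0.25, "tma_fetch_latency": 0.125,
    "tma_bad_speculation": 0.125, "tma_branch_mispredicts": 0.0625,
    "tma_retiring": 0.375, "tma_heavy_operations": 0.0625,
    "tma_backend_bound": 0.25, "tma_memory_bound": 0.125,
}
derived = derive_remainders(measured)
print(derived["tma_core_bound"], derived["tma_light_operations"])  # 0.125 0.3125
```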
177 …-memory issues were of a bottleneck.  Shortage in hardware compute resources; or dependencies in s…
181 …n of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
187 … cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Dat…
202 …"MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UO…
211 …"MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_core_clks…
224 …o switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-en…
233 …-aside Buffers) are processor caches for recently used entries out of the Page Tables that are use…
237 …: "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store …
242 …-level data TLB store misses.  As with ordinary data caching; focus on improving data locality and…
251 …hreading hiccup; where multiple Logical Processors contend on different data-elements mapped into …
266         "MetricExpr": "tma_frontend_bound - tma_fetch_latency",
281 …he CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB mi…
285 …"BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations frac…
290 …-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may excee…
294 …"BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction …
299 …"PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction…
303 …"BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction …
308 …"PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction…
312 …tric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors",
317 … approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May…
321 …tric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors",
326 … approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May…
330 …"BriefDescription": "This category represents fraction of slots where the processor's Frontend und…
336 …processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor c…
340 … slots where the CPU was retiring heavy-weight operations -- instructions that require two or more…
346 …he CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro…
358 …"BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower…
365 …"BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lo…
372 …"BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
378         "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)",
384         "BriefDescription": "Floating Point Operations Per Cycle",
390 …"BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardle…
394 …per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-wi…
397 …cription": "Instruction-Level-Parallelism (average number of uops executed when there is execution…
411 …"BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower nu…
417         "BriefDescription": "Branch instructions per taken branch.",
430 …"BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurre…
435 …"PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurr…
438 …"BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number mean…
443 …"PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number mea…
446 …"BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means h…
451 …"PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means …
454 …"BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower num…
459 …"PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower nu…
462 …"BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower num…
467 …"PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower nu…
470         "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
477 …    "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)",
484 …"BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occ…
491         "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)",
498         "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)",
505         "BriefDescription": "Instructions per taken branch",
510 …"PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_…
513         "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]",
519         "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]",
525         "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]",
531 …      "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
537         "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads",
543         "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
549 …"BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculati…
550         "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY",
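L2 hits are not counted directly here: the expression takes all L2 requests minus L2 misses and normalizes per thousand retired instructions. A minimal sketch with hypothetical counter deltas (the function name is ours):

```python
def l2_hits_pki(l2_references: int, l2_misses: int, instructions: int) -> float:
    """1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY"""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return 1e3 * (l2_references - l2_misses) / instructions


# 30,000 hits over 10 million instructions is 3 hits per kilo instruction.
print(l2_hits_pki(l2_references=50_000, l2_misses=20_000, instructions=10_000_000))  # 3.0
```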
555 …"BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculati…
561         "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads",
567 …"BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (inc…
573 …"BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (inc…
579         "BriefDescription": "Offcore requests (L2 cache miss) per kilo instruction for demand RFOs",
585         "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
591         "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads",
615 …"BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core…
622 …"BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand loads when there is…
627 …cription": "Memory-Level-Parallelism (average number of L1 miss demand loads when there is at least…
637 …"BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is …
674         "BriefDescription": "Giga Floating Point Operations Per Second",
678 …ting Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar…
681 …"BriefDescription": "Instructions per Far Branch (Far Branches apply upon transition from applica…
688         "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
702 …"MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #S…
719 …   "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
725         "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
731         "BriefDescription": "The ratio of Executed- by Issued-Uops",
735 …"PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop m…
738         "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
744 …"BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor…
750         "BriefDescription": "Uops Per Instruction",
757         "BriefDescription": "Uops per taken branch",
774 …"MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_thr…
778 … TLB. These cases are characterized by execution unit stalls; while some non-completed demand load…
783 …"MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_…
820 …slots where the CPU was retiring light-weight operations -- instructions that require no more than…
821         "MetricExpr": "tma_retiring - tma_heavy_operations",
826 …-weight operations -- instructions that require no more than one uop (micro-operation). This corre…
832 …HED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT…
852         "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts",
857 …-of-order portion of the machine needs to recover its state after the clear. For example; this can…
861 …as likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM…
866 …- DRAM ([SPR-HBM] and/or HBM).  The underlying heuristic assumes that a similar off-core traffic i…
870 …e the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM…
871 …EAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
875 …e the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM…
881 …CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ipc > 1.8 else UOPS…
886 …o demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory d…
909 …"MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_core_cl…
913 …the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or …
922 … Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legac…
944 …ion of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads…
949 …ion of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads…
953 …ion of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads…
958 …ion of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads…
962 …is metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)",
967 …sents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Sample with: U…
989 …ents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)",
994 …action of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address). Sample with…
998 … the CPU performance was potentially limited due to Core computation issues (non divider-related)",
1000 …- (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2…
1004 …-related).  Two distinct categories can be attributed to this metric: (1) heavy data-dependency …
1008 … fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL…
1009 …ED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYC…
1013 …ted no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherw…
1017 …on of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Proce…
1018 …_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECU…
1022 …per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwi…
1026 …ts fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Proce…
1027 …_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (UOPS_EXECU…
1031 …per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwi…
1035 …ion of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Proce…
1049 …ions-per-cycle (see IPC metric). Note that a high Retiring value does not necessarily mean there is …
1053 … estimates fraction of cycles handling memory load split accesses - loads that cross 64-byte cache …
1059 … estimates fraction of cycles handling memory load split accesses - loads that cross 64-byte cache …
1068 …resents rate of split store accesses.  Consider aligning your data to the 64-byte cache line granu…
1072 …f cycles where the Super Queue (SQ) was full taking into account all request-types and both hardwa…
1077 …f cycles where the Super Queue (SQ) was full taking into account all request-types and both hardwa…
1081 … CPU was stalled due to RFO store memory accesses; RFO stores issue a read-for-ownership request b…
1086 …ses; RFO stores issue a read-for-ownership request before the write. Even though store accesses do …
1095 …perations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writi…
1101 …"MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STO…
1105 …-of-order core performance; however; holding resources for a longer time can lead to undesired imp…
1118         "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tma_clears_resteers",
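Here the residual is two subtractions deep: branch resteers minus those attributed to mispredicts and to machine clears. Unlike the `max(0, ...)` expression earlier in this listing, this one is not clamped, so a negative value would indicate that the two component estimates over-account for the parent. The sketch below (hypothetical name and values) surfaces that case with a warning instead of hiding it; the warning is our design choice, not something perf does.

```python
import warnings


def residual_resteers(branch_resteers: float, mispredicts_resteers: float,
                      clears_resteers: float) -> float:
    """branch_resteers - mispredicts_resteers - clears_resteers, unclamped.

    Emits a warning when the components exceed the parent, then returns the raw value.
    """
    residual = branch_resteers - mispredicts_resteers - clears_resteers
    if residual < 0:
        warnings.warn(f"component resteers exceed parent by {-residual:.4f}")
    return residual


print(residual_resteers(0.125, 0.0625, 0.03125))  # 0.03125
```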