cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

perf-arm-spe.txt (8260B)


perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice of
all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures that only one sampled operation is in flight at a time.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point no decoding of the SPE data has been done by either the kernel or Perf. Only when
the recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording. These samples are the same as if normal sampling was done by Perf without using SPE,
although they may have more attributes associated with them. For example, a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
 hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including: Full PC of instruction, data virtual and physical
 addresses.

 - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source
 indicates which particular cache was hit, but the meaning is implementation defined because
 different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs. It therefore cannot be used as an exact
number for samples dropped that would have made it through the filter, but it can be a rough
guide.
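
As a rough illustration of using the collision count as a guide, the sketch below turns two
counts into a percentage of lost samples. The helper name 'spe_collision_percent' and the input
values are hypothetical. It also assumes that no filter is configured, so that every sample which
did not collide ended up as a written record. With a filter in place the result is only a loose
upper-level estimate, for the reason given above.

```shell
# Hypothetical helper: percentage of attempted samples lost to collisions.
# $1 = 'sample_collision' count, $2 = number of records actually written.
# Assumes no filtering, so attempted samples = collisions + written records.
spe_collision_percent() {
    awk -v coll="$1" -v written="$2" 'BEGIN {
        total = coll + written
        if (total <= 0) { print "n/a"; exit 1 }
        printf "%.1f\n", 100 * coll / total
    }'
}

# Illustrative numbers only: 50 collisions against 950 written records.
spe_collision_percent 50 950
```

If the percentage is more than a small fraction, increasing the sampling interval is the obvious
first thing to try.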

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.
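
One plausible way to form that estimate is to divide the two counts, which gives the average
number of sample-population operations per retired instruction. This is a sketch under that
assumption, not a method prescribed by the architecture or by Perf. The helper name
'spe_ops_per_instruction' and the counts used are hypothetical.

```shell
# Hypothetical helper: average sample-population operations per retired instruction.
# $1 = 'sample_pop' count, $2 = 'inst_retired' count, measured over the same run.
spe_ops_per_instruction() {
    awk -v pop="$1" -v retired="$2" 'BEGIN {
        if (retired <= 0) { print "n/a"; exit 1 }
        printf "%.2f\n", pop / retired
    }'
}

# Illustrative numbers only: 1.5 million operations for 1 million instructions.
spe_ops_per_instruction 1500000 1000000
```

A ratio close to 1 suggests that sample counts can be read roughly as instruction counts. A larger
ratio means that instructions which expand into several operations, or speculative operations, are
over-represented and per-instruction results need to be scaled down accordingly.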

Kernel Requirements
-------------------

The ARM_SPE_PMU config option must be enabled, either as a module or built in.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled but is still enabled, SPE setup fails with the console
message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".
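
A quick way to see whether the requirements are met is to check whether an SPE PMU has been
registered. The sketch below assumes that the PMU shows up under /sys/bus/event_source/devices
with a name starting with 'arm_spe_' (for example 'arm_spe_0'). The helper name 'spe_pmu_present'
is hypothetical, and the optional directory argument exists only so that the check can be pointed
at another location.

```shell
# Hypothetical helper: print the name of the first registered SPE PMU, if any.
# $1 = directory to search (defaults to the perf event source directory in sysfs).
spe_pmu_present() {
    dir=${1:-/sys/bus/event_source/devices}
    for pmu in "$dir"/arm_spe_*; do
        # If nothing matches, the pattern is left unexpanded and -e fails.
        if [ -e "$pmu" ]; then
            printf '%s\n' "${pmu##*/}"
            return 0
        fi
    done
    return 1
}

if name=$(spe_pmu_present); then
    echo "SPE PMU found: $name"
else
    echo "no arm_spe PMU registered"
fi
```

If no PMU is found, the kernel log is the next place to look for the KPTI message quoted above.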

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set with the -c option. Because the minimum interval is used by default,
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.
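
As a small illustration, the sketch below assembles a record command with an explicit period. The
helper name 'spe_record_cmd' is hypothetical and the period of 10000 is an arbitrary example
value, not a recommendation. The helper only prints the command so that it can be inspected before
being run.

```shell
# Hypothetical helper: print a 'perf record' command line with an explicit sample period.
# $1 = sample period (passed to -c), remaining arguments = workload to profile.
spe_record_cmd() {
    period=$1
    case $period in
        ''|*[!0-9]*) echo "period must be an unsigned integer" >&2; return 1 ;;
    esac
    shift
    printf 'perf record -e arm_spe// -c %s -- %s\n' "$period" "$*"
}

# Example value only; a suitable period depends on the workload and the CPU.
spe_record_cmd 10000 ./mybench
```

A larger period produces fewer records, which also reduces the chance of collisions and the size
of the resulting perf.data file.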

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and are comma separated. For example '-e
arm_spe/load_filter=1,min_latency=10/'

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.
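
When several parameters are combined it can be convenient to build the event string from a list.
The sketch below does only that: it joins 'key=value' arguments with commas and places them
between the slashes. The helper name 'spe_event_spec' is hypothetical, and it does not check that
the parameter names are valid.

```shell
# Hypothetical helper: join key=value arguments into an arm_spe event specification.
spe_event_spec() {
    params=""
    for kv in "$@"; do
        # Prefix a comma only when something has already been collected.
        params="${params:+$params,}$kv"
    done
    printf 'arm_spe/%s/\n' "$params"
}

spe_event_spec load_filter=1 min_latency=10   # prints arm_spe/load_filter=1,min_latency=10/
spe_event_spec                                # prints arm_spe//

# Possible use, shown as a comment so that nothing is recorded by accident:
#   perf record -e "$(spe_event_spec load_filter=1 min_latency=10)" -- ./mybench
```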

Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench
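
The mask values in these commands follow directly from the bit numbers: bit n contributes 1 << n.
The sketch below computes a mask from a list of bit positions. The helper name
'spe_event_filter_mask' is hypothetical. How the hardware treats a mask with more than one bit set
is defined by the architecture (PMSEVFR) and is not covered here.

```shell
# Hypothetical helper: build an event_filter mask from event bit positions.
spe_event_filter_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

spe_event_filter_mask 1      # instruction retired: prints 0x2
spe_event_filter_mask 7      # mispredict: prints 0x80
spe_event_filter_mask 3 5    # L1D refill and TLB refill bits set: prints 0x28
```

The first two results match the values used in the commands above (2 and 0x80).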

Viewing the data
~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example, perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch-miss
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples. The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled (see above), or running on a VM

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets, but these only increase the accuracy of
   assigning PIDs to kernel samples. For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase sampling interval (see above)


SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]