crc32.rst - cachepc-linux - Fork of AMDESE/linux with modifications for CachePC side-channel attack

	cachepc-linux Fork of AMDESE/linux with modifications for CachePC side-channel attack
	git clone https://git.sinitax.com/sinitax/cachepc-linux
	Log \| Files \| Refs \| README \| LICENSE \| sfeed.txt
crc32.rst (8851B)
      1=================================
      2brief tutorial on CRC computation
      3=================================
      4
      5A CRC is a long-division remainder.  You add the CRC to the message,
      6and the whole thing (message+CRC) is a multiple of the given
      7CRC polynomial.  To check the CRC, you can either check that the
      8CRC matches the recomputed value, *or* you can check that the
      9remainder computed on the message+CRC is 0.  This latter approach
     10is used by a lot of hardware implementations, and is why so many
     11protocols put the end-of-frame flag after the CRC.
     12
     13It's actually the same long division you learned in school, except that:
     14
     15- We're working in binary, so the digits are only 0 and 1, and
     16- When dividing polynomials, there are no carries.  Rather than add and
     17  subtract, we just xor.  Thus, we tend to get a bit sloppy about
     18  the difference between adding and subtracting.
     19
     20Like all division, the remainder is always smaller than the divisor.
     21To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial.
     22Since it's 33 bits long, bit 32 is always going to be set, so usually the
     23CRC is written in hex with the most significant bit omitted.  (If you're
     24familiar with the IEEE 754 floating-point format, it's the same idea.)
     25
     26Note that a CRC is computed over a string of *bits*, so you have
     27to decide on the endianness of the bits within each byte.  To get
     28the best error-detecting properties, this should correspond to the
     29order they're actually sent.  For example, standard RS-232 serial is
     30little-endian; the most significant bit (sometimes used for parity)
     31is sent last.  And when appending a CRC word to a message, you should
     32do it in the right order, matching the endianness.
     33
     34Just like with ordinary division, you proceed one digit (bit) at a time.
     35Each step of the division you take one more digit (bit) of the dividend
     36and append it to the current remainder.  Then you figure out the
     37appropriate multiple of the divisor to subtract to being the remainder
     38back into range.  In binary, this is easy - it has to be either 0 or 1,
     39and to make the XOR cancel, it's just a copy of bit 32 of the remainder.
     40
     41When computing a CRC, we don't care about the quotient, so we can
     42throw the quotient bit away, but subtract the appropriate multiple of
     43the polynomial from the remainder and we're back to where we started,
     44ready to process the next bit.
     45
     46A big-endian CRC written this way would be coded like::
     47
     48	for (i = 0; i < input_bits; i++) {
     49		multiple = remainder & 0x80000000 ? CRCPOLY : 0;
     50		remainder = (remainder << 1 | next_input_bit()) ^ multiple;
     51	}
     52
     53Notice how, to get at bit 32 of the shifted remainder, we look
     54at bit 31 of the remainder *before* shifting it.
     55
     56But also notice how the next_input_bit() bits we're shifting into
     57the remainder don't actually affect any decision-making until
     5832 bits later.  Thus, the first 32 cycles of this are pretty boring.
     59Also, to add the CRC to a message, we need a 32-bit-long hole for it at
     60the end, so we have to add 32 extra cycles shifting in zeros at the
     61end of every message.
     62
     63These details lead to a standard trick: rearrange merging in the
     64next_input_bit() until the moment it's needed.  Then the first 32 cycles
     65can be precomputed, and merging in the final 32 zero bits to make room
     66for the CRC can be skipped entirely.  This changes the code to::
     67
     68	for (i = 0; i < input_bits; i++) {
     69		remainder ^= next_input_bit() << 31;
     70		multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
     71		remainder = (remainder << 1) ^ multiple;
     72	}
     73
     74With this optimization, the little-endian code is particularly simple::
     75
     76	for (i = 0; i < input_bits; i++) {
     77		remainder ^= next_input_bit();
     78		multiple = (remainder & 1) ? CRCPOLY : 0;
     79		remainder = (remainder >> 1) ^ multiple;
     80	}
     81
     82The most significant coefficient of the remainder polynomial is stored
     83in the least significant bit of the binary "remainder" variable.
     84The other details of endianness have been hidden in CRCPOLY (which must
     85be bit-reversed) and next_input_bit().
     86
     87As long as next_input_bit is returning the bits in a sensible order, we don't
     88*have* to wait until the last possible moment to merge in additional bits.
     89We can do it 8 bits at a time rather than 1 bit at a time::
     90
     91	for (i = 0; i < input_bytes; i++) {
     92		remainder ^= next_input_byte() << 24;
     93		for (j = 0; j < 8; j++) {
     94			multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
     95			remainder = (remainder << 1) ^ multiple;
     96		}
     97	}
     98
     99Or in little-endian::
    100
    101	for (i = 0; i < input_bytes; i++) {
    102		remainder ^= next_input_byte();
    103		for (j = 0; j < 8; j++) {
    104			multiple = (remainder & 1) ? CRCPOLY : 0;
    105			remainder = (remainder >> 1) ^ multiple;
    106		}
    107	}
    108
    109If the input is a multiple of 32 bits, you can even XOR in a 32-bit
    110word at a time and increase the inner loop count to 32.
    111
    112You can also mix and match the two loop styles, for example doing the
    113bulk of a message byte-at-a-time and adding bit-at-a-time processing
    114for any fractional bytes at the end.
    115
    116To reduce the number of conditional branches, software commonly uses
    117the byte-at-a-time table method, popularized by Dilip V. Sarwate,
    118"Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM
    119v.31 no.8 (August 1998) p. 1008-1013.
    120
    121Here, rather than just shifting one bit of the remainder to decide
    122in the correct multiple to subtract, we can shift a byte at a time.
    123This produces a 40-bit (rather than a 33-bit) intermediate remainder,
    124and the correct multiple of the polynomial to subtract is found using
    125a 256-entry lookup table indexed by the high 8 bits.
    126
    127(The table entries are simply the CRC-32 of the given one-byte messages.)
    128
    129When space is more constrained, smaller tables can be used, e.g. two
    1304-bit shifts followed by a lookup in a 16-entry table.
    131
    132It is not practical to process much more than 8 bits at a time using this
    133technique, because tables larger than 256 entries use too much memory and,
    134more importantly, too much of the L1 cache.
    135
    136To get higher software performance, a "slicing" technique can be used.
    137See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm",
    138ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf
    139
    140This does not change the number of table lookups, but does increase
    141the parallelism.  With the classic Sarwate algorithm, each table lookup
    142must be completed before the index of the next can be computed.
    143
    144A "slicing by 2" technique would shift the remainder 16 bits at a time,
    145producing a 48-bit intermediate remainder.  Rather than doing a single
    146lookup in a 65536-entry table, the two high bytes are looked up in
    147two different 256-entry tables.  Each contains the remainder required
    148to cancel out the corresponding byte.  The tables are different because the
    149polynomials to cancel are different.  One has non-zero coefficients from
    150x^32 to x^39, while the other goes from x^40 to x^47.
    151
    152Since modern processors can handle many parallel memory operations, this
    153takes barely longer than a single table look-up and thus performs almost
    154twice as fast as the basic Sarwate algorithm.
    155
    156This can be extended to "slicing by 4" using 4 256-entry tables.
    157Each step, 32 bits of data is fetched, XORed with the CRC, and the result
    158broken into bytes and looked up in the tables.  Because the 32-bit shift
    159leaves the low-order bits of the intermediate remainder zero, the
    160final CRC is simply the XOR of the 4 table look-ups.
    161
    162But this still enforces sequential execution: a second group of table
    163look-ups cannot begin until the previous groups 4 table look-ups have all
    164been completed.  Thus, the processor's load/store unit is sometimes idle.
    165
    166To make maximum use of the processor, "slicing by 8" performs 8 look-ups
    167in parallel.  Each step, the 32-bit CRC is shifted 64 bits and XORed
    168with 64 bits of input data.  What is important to note is that 4 of
    169those 8 bytes are simply copies of the input data; they do not depend
    170on the previous CRC at all.  Thus, those 4 table look-ups may commence
    171immediately, without waiting for the previous loop iteration.
    172
    173By always having 4 loads in flight, a modern superscalar processor can
    174be kept busy and make full use of its L1 cache.
    175
    176Two more details about CRC implementation in the real world:
    177
    178Normally, appending zero bits to a message which is already a multiple
    179of a polynomial produces a larger multiple of that polynomial.  Thus,
    180a basic CRC will not detect appended zero bits (or bytes).  To enable
    181a CRC to detect this condition, it's common to invert the CRC before
    182appending it.  This makes the remainder of the message+crc come out not
    183as zero, but some fixed non-zero value.  (The CRC of the inversion
    184pattern, 0xffffffff.)
    185
    186The same problem applies to zero bits prepended to the message, and a
    187similar solution is used.  Instead of starting the CRC computation with
    188a remainder of 0, an initial remainder of all ones is used.  As long as
    189you start the same way on decoding, it doesn't make a difference.