.. SPDX-License-Identifier: GPL-2.0

======================================================
Timekeeping Virtualization for X86-Based Architectures
======================================================

:Author: Zachary Amsden <zamsden@redhat.com>
:Copyright: (c) 2010, Red Hat.  All rights reserved.

.. Contents

   1) Overview
   2) Timing Devices
   3) TSC Hardware
   4) Virtualization Problems

1. Overview
===========

One of the most complicated parts of the X86 platform, and specifically of
the virtualization of this platform, is the plethora of timing devices
available and the complexity of emulating those devices.  In addition,
virtualization of time introduces a new set of challenges because it
introduces a multiplexed division of time beyond the control of the guest CPU.

First, we will describe the various timekeeping hardware available, then
present some of the problems which arise and solutions available, giving
specific recommendations for certain classes of KVM guests.

The purpose of this document is to collect data and information relevant to
timekeeping which may be difficult to find elsewhere, specifically,
information relevant to KVM and hardware-based virtualization.

2. Timing Devices
=================

First we discuss the basic hardware devices available.  TSC and the related
KVM clock are special enough to warrant a full exposition and are described in
the following section.

2.1. i8254 - PIT
----------------

One of the first timer devices available is the programmable interval timer,
or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three
channels which can be programmed to deliver periodic or one-shot interrupts.
These three channels can be configured in different modes and have individual
counters.  Channels 1 and 2 were not available for general use in the original
IBM PC, and historically were connected to control RAM refresh and the PC
speaker.  Now the PIT is typically integrated as part of an emulated chipset
and a separate physical PIT is not used.

The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done
using single or multiple byte access to the I/O ports.  There are 6 modes
available, but not all modes are available to all timers, as only timer 2
has a connected gate input, required for modes 1 and 5.  The gate line is
controlled by port 61h, bit 0, as illustrated in the following diagram::

  --------------             ----------------
  |            |           |                |
  |  1.1932 MHz|---------->| CLOCK      OUT | ---------> IRQ 0
  |    Clock   |   |       |                |
  --------------   |    +->| GATE  TIMER 0  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
                   |       |                |            (aka /dev/null)
                   |    +->| GATE  TIMER 1  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
                           |                |      |
  Port 61h, bit 0 -------->| GATE  TIMER 2  |       \_.----   ____
                            ----------------         _|    )--|LPF|---Speaker
                                                    / *----   \___/
  Port 61h, bit 1 ---------------------------------/

The timer modes are now described.

Mode 0: Single Timeout.
 This is a one-shot software timeout that counts down
 when the gate is high (always true for timers 0 and 1).  When the count
 reaches zero, the output goes high.

Mode 1: Triggered One-shot.
 The output is initially set high.  When the gate
 line is set high, a countdown is initiated (which does not stop if the gate is
 lowered), during which the output is set low.  When the count reaches zero,
 the output goes high.

Mode 2: Rate Generator.
 The output is initially set high.  When the countdown
 reaches 1, the output goes low for one count and then returns high.  The value
 is reloaded and the countdown automatically resumes.  If the gate line goes
 low, the count is halted.  If the output is low when the gate is lowered, the
 output automatically goes high (this only affects timer 2).

Mode 3: Square Wave.
 This generates a high / low square wave.  The count
 determines the length of the pulse, which alternates between high and low
 when zero is reached.  The count only proceeds when gate is high and is
 automatically reloaded on reaching zero.  The count is decremented twice at
 each clock to generate a full high / low cycle at the full periodic rate.
 If the count is even, the clock remains high for N/2 counts and low for N/2
 counts; if the count is odd, the clock is high for (N+1)/2 counts and low
 for (N-1)/2 counts.  Only even values are latched by the counter, so odd
 values are not observed when reading.  This is the intended mode for timer 2,
 which generates sine-like tones by low-pass filtering the square wave output.

Mode 4: Software Strobe.
 After programming this mode and loading the counter,
 the output remains high until the counter reaches zero.  Then the output
 goes low for 1 clock cycle and returns high.  The counter is not reloaded.
 Counting only occurs when gate is high.

Mode 5: Hardware Strobe.
 After programming and loading the counter, the
 output remains high.  When the gate is raised, a countdown is initiated
 (which does not stop if the gate is lowered).  When the counter reaches zero,
 the output goes low for 1 clock cycle and then returns high.  The counter is
 not reloaded.
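
As a rough illustration of the arithmetic behind the periodic modes, the
sketch below derives a reload count from a desired interrupt rate and splits
a Mode 3 count into its high and low phases.  The function names are
hypothetical, and the treatment of 65536 as the largest count (written to the
device as 0) is the usual 8254 convention rather than something stated above.

```python
PIT_HZ = 1_193_182  # fixed base clock quoted in the text (1.193182 MHz)


def pit_reload_for_rate(target_hz):
    """Return the reload count whose period is closest to 1/target_hz."""
    count = round(PIT_HZ / target_hz)
    # Assumption: a 16-bit counter holds 1..65536, with 65536 written as 0.
    return max(1, min(count, 65536))


def square_wave_split(n):
    """Mode 3: return (high_counts, low_counts) for reload value n."""
    if n % 2 == 0:
        return n // 2, n // 2
    return (n + 1) // 2, (n - 1) // 2


if __name__ == "__main__":
    n = pit_reload_for_rate(1000)
    print(n, square_wave_split(n))
```

Because the counter is only 16 bits wide, the slowest periodic rate a single
channel can produce is about 18.2 Hz (1193182 / 65536).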

In addition to normal binary counting, the PIT supports BCD counting.  The
command port, 0x43, is used to set the counter and mode for each of the three
timers.

PIT commands are issued to port 0x43 using the following bit encoding::

  Bit 7-4: Command (See table below)
  Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
  Bit 0  : Binary (0) / BCD (1)

Command table::

  0000 - Latch Timer 0 count for port 0x40
        sample and hold the count to be read in port 0x40;
        additional commands ignored until counter is read;
        mode bits ignored.

  0001 - Set Timer 0 LSB mode for port 0x40
        set timer to read LSB only and force MSB to zero;
        mode bits set timer mode

  0010 - Set Timer 0 MSB mode for port 0x40
        set timer to read MSB only and force LSB to zero;
        mode bits set timer mode

  0011 - Set Timer 0 16-bit mode for port 0x40
        set timer to read / write LSB first, then MSB;
        mode bits set timer mode

  0100 - Latch Timer 1 count for port 0x41 - as described above
  0101 - Set Timer 1 LSB mode for port 0x41 - as described above
  0110 - Set Timer 1 MSB mode for port 0x41 - as described above
  0111 - Set Timer 1 16-bit mode for port 0x41 - as described above

  1000 - Latch Timer 2 count for port 0x42 - as described above
  1001 - Set Timer 2 LSB mode for port 0x42 - as described above
  1010 - Set Timer 2 MSB mode for port 0x42 - as described above
  1011 - Set Timer 2 16-bit mode for port 0x42 - as described above

  1101 - General counter latch
        Latch combination of counters into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0
        Bit 0 = Unused

  1110 - Latch timer status
        Latch combination of counter mode into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0

        The output of ports 0x40-0x42 following this command will be:

        Bit 7 = Output pin
        Bit 6 = Count loaded (0 if timer has expired)
        Bit 5-4 = Read / Write mode
            01 = LSB only
            10 = MSB only
            11 = LSB / MSB (16-bit)
        Bit 3-1 = Mode
        Bit 0 = Binary (0) / BCD mode (1)
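
The per-timer rows of the command table follow a regular pattern: the upper
two bits of the command nibble select the timer and the lower two select the
access type.  A minimal sketch of assembling the byte written to port 0x43 is
shown here; the helper and constant names are hypothetical, and the readback
commands (1101 and 1110) are deliberately left out.

```python
# Access selectors: low two bits of the 4-bit command field.
LATCH, LSB_ONLY, MSB_ONLY, LSB_THEN_MSB = 0b00, 0b01, 0b10, 0b11


def pit_command(timer, access, mode=0, bcd=False):
    """Build a command byte for timer 0-2 (hypothetical helper)."""
    if timer not in (0, 1, 2):
        raise ValueError("timer must be 0, 1 or 2")
    if access not in (LATCH, LSB_ONLY, MSB_ONLY, LSB_THEN_MSB):
        raise ValueError("invalid access selector")
    if not 0 <= mode <= 5:
        raise ValueError("mode must be 0..5")
    command = (timer << 2) | access          # bits 7-4
    return (command << 4) | (mode << 1) | int(bcd)


if __name__ == "__main__":
    # Timer 0, 16-bit access, Mode 2 (rate generator), binary counting.
    print(hex(pit_command(0, LSB_THEN_MSB, mode=2)))
    # Timer 2, 16-bit access, Mode 3 (square wave), binary counting.
    print(hex(pit_command(2, LSB_THEN_MSB, mode=3)))
```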

2.2. RTC
--------

The second device which was available in the original PC was the MC146818 real
time clock.  The original device is now obsolete, and usually emulated by the
system chipset, sometimes by an HPET and some frankenstein IRQ routing.

The RTC is accessed through CMOS variables, which use an index register to
control which bytes are read.  Since there is only one index register, read
of the CMOS and read of the RTC require lock protection (in addition, it is
dangerous to allow userspace utilities such as hwclock to have direct RTC
access, as they could corrupt kernel reads and writes of CMOS memory).

The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt
can function as a periodic timer, an additional once a day alarm, and can issue
interrupts after an update of the CMOS registers by the MC146818 is complete.
The type of interrupt is signalled in the RTC status registers.

The RTC will update the current time fields by battery power even while the
system is off.  The current time fields should not be read while an update is
in progress, as indicated in the status register.

The clock uses a 32.768 kHz crystal, so bits 6-4 of register A should be
programmed to a 32 kHz divider if the RTC is to count seconds.

This is the RAM map originally used for the RTC/CMOS::

  Location    Size    Description
  ------------------------------------------
  00h         byte    Current second (BCD)
  01h         byte    Seconds alarm (BCD)
  02h         byte    Current minute (BCD)
  03h         byte    Minutes alarm (BCD)
  04h         byte    Current hour (BCD)
  05h         byte    Hours alarm (BCD)
  06h         byte    Current day of week (BCD)
  07h         byte    Current day of month (BCD)
  08h         byte    Current month (BCD)
  09h         byte    Current year (BCD)
  0Ah         byte    Register A
                       bit 7   = Update in progress
                       bit 6-4 = Divider for clock
                                  000 = 4.194 MHz
                                  001 = 1.049 MHz
                                  010 = 32 kHz
                                  10X = test modes
                                  110 = reset / disable
                                  111 = reset / disable
                       bit 3-0 = Rate selection for periodic interrupt
                                 0000 = periodic timer disabled
                                 0001 = 3.90625 ms
                                 0010 = 7.8125 ms
                                 0011 = .122070 ms
                                 0100 = .244141 ms
                                    ...
                                 1101 = 125 ms
                                 1110 = 250 ms
                                 1111 = 500 ms
  0Bh         byte    Register B
                       bit 7   = Run (0) / Halt (1)
                       bit 6   = Periodic interrupt enable
                       bit 5   = Alarm interrupt enable
                       bit 4   = Update-ended interrupt enable
                       bit 3   = Square wave output enable
                       bit 2   = BCD calendar (0) / Binary (1)
                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
                       bit 0   = 0 (DST off) / 1 (DST enabled)
  0Ch         byte    Register C (read only)
                       bit 7   = interrupt request flag (IRQF)
                       bit 6   = periodic interrupt flag (PF)
                       bit 5   = alarm interrupt flag (AF)
                       bit 4   = update interrupt flag (UF)
                       bit 3-0 = reserved
  0Dh         byte    Register D (read only)
                       bit 7   = RTC has power
                       bit 6-0 = reserved
  32h         byte    Current century BCD (*)
  (*) location vendor specific and now determined from ACPI global tables
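
To make the RAM map more concrete, here is a small sketch that decodes the
BCD time fields and derives the periodic interrupt frequency from register A.
It assumes the 32.768 kHz time base, BCD calendar format and 24-hour mode;
the function names and the dictionary standing in for CMOS reads are
hypothetical.

```python
def bcd_to_int(value):
    """Convert one packed-BCD byte (e.g. 0x59) to an integer (59)."""
    return (value >> 4) * 10 + (value & 0x0F)


def rtc_periodic_hz(register_a):
    """Periodic interrupt frequency selected by bits 3-0 of register A."""
    rs = register_a & 0x0F
    if rs == 0:
        return 0                    # periodic timer disabled
    if rs <= 2:
        rs += 7                     # selectors 1 and 2 alias 256 Hz and 128 Hz
    return 32768 >> (rs - 1)


def decode_time(cmos):
    """Return (hour, minute, second) from a mapping of location -> byte."""
    return (bcd_to_int(cmos[0x04]),
            bcd_to_int(cmos[0x02]),
            bcd_to_int(cmos[0x00]))


if __name__ == "__main__":
    sample = {0x00: 0x45, 0x02: 0x30, 0x04: 0x12}
    print(decode_time(sample), rtc_periodic_hz(0x2D), "Hz")
```

With selector 1101 the sketch yields 8 Hz, which matches the 125 ms entry in
the table, and selector 1111 yields 2 Hz (500 ms).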

2.3. APIC
---------

On Pentium and later processors, an on-board timer is available to each CPU
as part of the Advanced Programmable Interrupt Controller.  The APIC is
accessed through memory-mapped registers and provides interrupt service to each
CPU, used for IPIs and local timer interrupts.

Although in theory the APIC is a safe and stable source for local interrupts,
in practice, many bugs and glitches have occurred due to the special nature of
the APIC CPU-local memory-mapped hardware.  Beware that CPU errata may affect
the use of the APIC and that workarounds may be required.  In addition, some of
these workarounds pose unique constraints for virtualization - requiring either
extra overhead incurred from extra reads of memory-mapped I/O or additional
functionality that may be more computationally expensive to implement.

Since the APIC is documented quite well in the Intel and AMD manuals, we will
avoid repetition of the detail here.  It should be pointed out that the APIC
timer is programmed through the LVT (local vector table) timer register, is
capable of one-shot or periodic operation, and is based on the bus clock
divided down by the programmable divider register.
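
The relationship between bus clock, divider and timer period can be sketched
as follows.  This is only an arithmetic illustration: the function name and
the example bus frequency are hypothetical, and the set of divider values and
the 32-bit width of the count are taken from the vendor manuals rather than
from this document.

```python
VALID_DIVIDERS = (1, 2, 4, 8, 16, 32, 64, 128)


def apic_initial_count(bus_hz, divider, target_hz):
    """Initial count giving roughly target_hz expirations per second."""
    if divider not in VALID_DIVIDERS:
        raise ValueError("divider must be a power of two from 1 to 128")
    count = bus_hz // divider // target_hz
    if not 0 < count <= 0xFFFFFFFF:
        raise ValueError("count does not fit the 32-bit counter")
    return count


if __name__ == "__main__":
    # Hypothetical 100 MHz bus clock, divide by 16, 1000 interrupts/second.
    print(apic_initial_count(100_000_000, 16, 1000))
```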

2.4. HPET
---------

HPET is quite complex, and was originally intended to replace the PIT / RTC
support of the X86 PC.  It remains to be seen whether that will be the case, as
the de facto standard of PC hardware is to emulate these older devices.  Some
systems designated as legacy free may support only the HPET as a hardware timer
device.

The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
but allowing implementation freedom to support many more.  It also imposes no
fixed rate on the timer frequency, but does impose some extremal values on
frequency, error and slew.

In general, the HPET is recommended as a high precision (compared to PIT / RTC)
time source which is independent of local variation (as there is only one HPET
in any given system).  The HPET is also memory-mapped, and its presence is
indicated through ACPI tables by the BIOS.

Detailed specification of the HPET is beyond the current scope of this
document, as it is also very well documented elsewhere.

2.5. Offboard Timers
--------------------

Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
timing chips built into the cards which may have registers which are accessible
to kernel or user drivers.  To the author's knowledge, using these to generate
a clocksource for a Linux or other kernel has not yet been attempted and is in
general frowned upon as not playing by the agreed rules of the game.  Such a
timer device would require additional support to be virtualized properly and is
not considered important at this time as no known operating system does this.

3. TSC Hardware
===============

The TSC or time stamp counter is relatively simple in theory; it counts
instruction cycles issued by the processor, which can be used as a measure of
time.  In practice, due to a number of problems, it is the most complicated
timekeeping device to use.

The TSC is represented internally as a 64-bit MSR which can be read with the
RDMSR, RDTSC, or RDTSCP (when available) instructions.  In the past, hardware
limitations made it possible to write the TSC, but generally on old hardware it
was only possible to write the low 32 bits of the 64-bit counter, and the upper
32 bits of the counter were cleared.  Now, however, on Intel processors family
0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
has been lifted and all 64 bits are writable.  On AMD systems, the ability to
write the TSC MSR is not an architectural guarantee.

The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software, by
means of the CR4.TSD bit, which, when set, disables CPL > 0 TSC access.

Some vendors have implemented an additional instruction, RDTSCP, which returns
atomically not just the TSC, but an indicator which corresponds to the
processor number.  This can be used to index into an array of TSC variables to
determine offset information in SMP systems where TSCs are not synchronized.
The presence of this instruction must be determined by consulting CPUID feature
bits.
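
The array lookup described above can be sketched as follows.  The TSC value
and processor indicator are passed in as plain integers, since the
instruction itself cannot be issued from this example; the function name and
the offset table are hypothetical.

```python
MASK64 = (1 << 64) - 1


def normalized_tsc(raw_tsc, cpu_index, per_cpu_offset):
    """Apply the per-CPU correction selected by the RDTSCP indicator."""
    return (raw_tsc + per_cpu_offset[cpu_index]) & MASK64


if __name__ == "__main__":
    # Hypothetical system in which CPU 1 started counting 480 cycles late.
    offsets = [0, 480]
    print(normalized_tsc(1000, 0, offsets))
    print(normalized_tsc(520, 1, offsets))
```

The value of the atomic read is that the counter and the index used for the
lookup are guaranteed to come from the same CPU, even if the thread migrates
immediately afterwards.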

Both VMX and SVM provide extension fields in the virtualization hardware which
allow the guest visible TSC to be offset by a constant.  Newer implementations
promise to allow the TSC to additionally be scaled, but this hardware is not
yet widely available.
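
A minimal model of offsetting and scaling, assuming 64-bit wraparound
arithmetic and a fixed-point ratio, might look like this.  The function
names and the choice of 32 fractional bits are illustrative assumptions and
do not describe any particular hardware format.

```python
MASK64 = (1 << 64) - 1


def guest_tsc(host_tsc, offset, ratio=None, frac_bits=32):
    """Guest-visible TSC: optionally scaled host TSC plus a constant offset."""
    scaled = host_tsc if ratio is None else (host_tsc * ratio) >> frac_bits
    return (scaled + offset) & MASK64


def offset_for(host_tsc, desired_guest_tsc):
    """Offset that makes the guest read desired_guest_tsc right now."""
    return (desired_guest_tsc - host_tsc) & MASK64


if __name__ == "__main__":
    off = offset_for(10_000, 0)          # guest TSC starts at zero
    print(guest_tsc(10_500, off))        # 500 cycles later
    print(guest_tsc(1000, 0, ratio=3 << 31))  # ratio of 1.5
```

The offset is stored modulo 2^64, so a "negative" offset is simply a large
unsigned value that wraps when added.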

3.1. TSC synchronization
------------------------

The TSC is a CPU-local clock in most implementations.  This means, on SMP
platforms, the TSCs of different CPUs may start at different times depending
on when the CPUs are powered on.  Generally, CPUs on the same die will share
the same clock; however, this is not always the case.

The BIOS may attempt to resynchronize the TSCs during the power-on process and
the operating system or other system software may attempt to do this as well.
Several hardware limitations make the problem worse - if it is not possible to
write the full 64 bits of the TSC, it may be impossible to match the TSC in
newly arriving CPUs to that of the rest of the system, resulting in
unsynchronized TSCs.  Resynchronization may be attempted by BIOS or system
software, but in practice, getting a perfectly synchronized TSC will not be
possible unless all values are read from the same clock, which generally is
only possible on single socket systems or those with special hardware support.

3.2. TSC and CPU hotplug
------------------------

As touched on already, CPUs which arrive later than the boot time of the system
may not have a TSC value that is synchronized with the rest of the system.
Either system software, BIOS, or SMM code may actually try to establish the TSC
to a value matching the rest of the system, but a perfect match is usually not
a guarantee.  This can have the effect of bringing a system from a state where
TSC is synchronized back to a state where TSC synchronization flaws, however
small, may be exposed to the OS and any virtualization environment.

3.3. TSC and multi-socket / NUMA
--------------------------------

Multi-socket systems, especially large multi-socket systems, are likely to have
individual clocksources rather than a single, universally distributed clock.
Since these clocks are driven by different crystals, they will not have
perfectly matched frequency, and temperature and electrical variations will
cause the CPU clocks, and thus the TSCs, to drift over time.  Depending on the
exact clock and bus design, the drift may or may not be fixed in absolute
error, and may accumulate over time.

In addition, very large systems may deliberately slew the clocks of individual
cores.  This technique, known as spread-spectrum clocking, reduces EMI at the
clock frequency and harmonics of it, which may be required to pass FCC
standards for telecommunications and computer equipment.

It is recommended not to trust the TSCs to remain synchronized on NUMA or
multiple socket systems for these reasons.

3.4. TSC and C-states
---------------------

C-states, or idling states of the processor, especially C1E and deeper sleep
states, may be problematic for TSC as well.  The TSC may stop advancing in such
a state, resulting in a TSC which is behind that of other CPUs when execution
is resumed.  Such CPUs must be detected and flagged by the operating system
based on CPU and chipset identifications.

The TSC in such a case may be corrected by catching it up to a known external
clocksource.
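
One way to picture the catch-up step: record the TSC and an external clock
before entering the idle state, then on wakeup compute what the TSC would
have reached had it kept running.  The sketch below assumes the external
clock reports nanoseconds and that the TSC frequency is known; both the
function name and its parameters are hypothetical.

```python
NSEC_PER_SEC = 1_000_000_000


def tsc_after_sleep(tsc_before, ref_ns_before, ref_ns_after, tsc_hz):
    """TSC value to restore on wakeup, based on an external clock."""
    elapsed_ns = ref_ns_after - ref_ns_before
    return tsc_before + elapsed_ns * tsc_hz // NSEC_PER_SEC


if __name__ == "__main__":
    # 2 ms asleep on a hypothetical 3 GHz TSC.
    print(tsc_after_sleep(1_000_000, 0, 2_000_000, 3_000_000_000))
```

The corrected value is only as accurate as the external clock, so some
residual skew against other CPUs should still be expected.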

3.5. TSC frequency change / P-states
------------------------------------

To make things slightly more interesting, some CPUs may change frequency.  They
may or may not run the TSC at the same rate, and because the frequency change
may be staggered or slewed, at some points in time, the TSC rate may not be
known other than falling within a range of values.  In this case, the TSC will
not be a stable time source, and must be calibrated against a known, stable,
external clock to be a usable source of time.
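
Calibration against an external clock amounts to counting TSC cycles over a
known number of reference ticks.  A minimal sketch, with hypothetical names
and with the two samples supplied by the caller:

```python
def calibrate_tsc_hz(tsc_start, tsc_end, ref_ticks, ref_hz):
    """Estimate TSC frequency from a window measured on a reference clock.

    ref_ticks is the number of reference clock ticks in the window and
    ref_hz is the (stable) frequency of that reference clock.
    """
    if ref_ticks <= 0:
        raise ValueError("measurement window must be positive")
    return (tsc_end - tsc_start) * ref_hz // ref_ticks


if __name__ == "__main__":
    # 30,000,000 TSC cycles observed during 10 ticks of a 1 kHz reference.
    print(calibrate_tsc_hz(0, 30_000_000, 10, 1000))
```

A longer window reduces the relative error contributed by the latency of
reading either clock, at the cost of a slower calibration.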

Whether the TSC runs at a constant rate or scales with the P-state is model
dependent and must be determined by inspecting CPUID, chipset or vendor
specific MSR fields.

In addition, some vendors have known bugs where the P-state is actually
compensated for properly during normal operation, but when the processor is
inactive, the P-state may be raised temporarily to service cache misses from
other processors.  In such cases, the TSC on halted CPUs could advance faster
than that of non-halted processors.  AMD Turion processors are known to have
this problem.

3.6. TSC and STPCLK / T-states
------------------------------

External signals given to the processor may also have the effect of stopping
the TSC.  This is typically done for thermal emergency power control to prevent
an overheating condition, and typically, there is no way to detect that this
condition has happened.

3.7. TSC virtualization - VMX
-----------------------------

VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner.  In
addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
field specified in the VMCS.  Special instructions must be used to read and
write the VMCS field.

3.8. TSC virtualization - SVM
-----------------------------

SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner.  In
addition, SVM allows passing through the host TSC plus an additional offset
field specified in the SVM control block.

3.9. TSC feature bits in Linux
------------------------------

In summary, there is no way to guarantee the TSC remains in perfect
synchronization unless it is explicitly guaranteed by the architecture.  Even
if so, the TSCs in multi-sockets or NUMA systems may still run independently
despite being locally consistent.

The following feature bits are used by Linux to signal various TSC attributes,
but they can only be taken to be meaningful for UP or single node systems.

=========================       =======================================
X86_FEATURE_TSC                 The TSC is available in hardware
X86_FEATURE_RDTSCP              The RDTSCP instruction is available
X86_FEATURE_CONSTANT_TSC        The TSC rate is unchanged with P-states
X86_FEATURE_NONSTOP_TSC         The TSC does not stop in C-states
X86_FEATURE_TSC_RELIABLE        TSC sync checks are skipped (VMware)
=========================       =======================================

4. Virtualization Problems
==========================

Timekeeping is especially problematic for virtualization because a number of
challenges arise.  The most obvious problem is that time is now shared between
the host and, potentially, a number of virtual machines.  Thus the virtual
operating system does not run with 100% usage of the CPU, despite the fact that
it may very well make that assumption.  It may expect it to remain true to very
exacting bounds when interrupt sources are disabled, but in reality only its
virtual interrupt sources are disabled, and the machine may still be preempted
at any time.  This causes problems as the passage of real time, the injection
of machine interrupts and the associated clock sources are no longer completely
synchronized with real time.

This same problem can occur on native hardware to a degree, as SMM may steal
cycles from the operating system on X86 systems where SMM is used by the
BIOS, but not in such an extreme fashion.  However, the fact that SMM may
cause similar problems to virtualization makes it a good justification for
solving many of these problems on bare metal.

4.1. Interrupt clocking
-----------------------

One of the most immediate problems that occurs with legacy operating systems
is that the system timekeeping routines are often designed to keep track of
time by counting periodic interrupts.  These interrupts may come from the PIT
or the RTC, but the problem is the same: the host virtualization engine may not
be able to deliver the proper number of interrupts per second, and so guest
time may fall behind.  This is especially problematic if a high interrupt rate
is selected, such as 1000 Hz, which is unfortunately the default for many Linux
guests.

There are three approaches to solving this problem; first, it may be possible
to simply ignore it.  Guests which have a separate time source for tracking
'wall clock' or 'real time' may not need any adjustment of their interrupts to
maintain proper time.  If this is not sufficient, it may be necessary to inject
additional interrupts into the guest in order to increase the effective
interrupt rate.  This approach leads to complications in extreme conditions,
where host load or guest lag is too much to compensate for, and thus another
solution to the problem has arisen: the guest may need to become aware of lost
ticks and compensate for them internally.  Although promising in theory, the
implementation of this policy in Linux has been extremely error prone, and a
number of buggy variants of lost tick compensation are distributed across
commonly used Linux systems.
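
The second approach, injecting extra interrupts to catch up, can be
sketched as a simple backlog calculation on the host side.  The function
name, its parameters and the burst limit are hypothetical; the cap stands in
for the "extreme conditions" in which the backlog cannot reasonably be
replayed.

```python
NSEC_PER_SEC = 1_000_000_000


def ticks_to_inject(elapsed_ns, tick_hz, delivered, max_burst):
    """Number of timer interrupts to inject now to reduce the backlog.

    elapsed_ns: real time since the guest timer was programmed
    tick_hz:    interrupt rate the guest asked for
    delivered:  interrupts injected so far
    max_burst:  upper bound on catch-up interrupts per call
    """
    due = elapsed_ns * tick_hz // NSEC_PER_SEC
    backlog = max(0, due - delivered)
    return min(backlog, max_burst)


if __name__ == "__main__":
    # One second elapsed at 1000 Hz, but only 900 ticks were delivered.
    print(ticks_to_inject(NSEC_PER_SEC, 1000, 900, max_burst=50))
```

Injecting the backlog faster than the nominal rate makes guest time advance
faster than real time for a while, which is the interrupt slewing this
section refers to.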

Windows uses periodic RTC clocking as a means of keeping time internally, and
thus requires interrupt slewing to keep proper time.  It does use a low enough
rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
practice.

4.2. TSC sampling and serialization
-----------------------------------

As the highest precision time source available, the cycle counter of the CPU
has aroused much interest from developers.  As explained above, this timer has
many problems unique to its nature as a local, potentially unstable and
potentially unsynchronized source.  One issue which is not unique to the TSC,
but is highlighted because of its very precise nature is sampling delay.  By
definition, the counter, once read is already old.  However, it is also
possible for the counter to be read ahead of the actual use of the result.
This is a consequence of the superscalar execution of the instruction stream,
which may execute instructions out of order.  Such execution is called
non-serialized.  Forcing serialized execution is necessary for precise
measurement with the TSC, and requires a serializing instruction, such as CPUID
or an MSR read.

Since CPUID may actually be virtualized by a trap and emulate mechanism, this
serialization can pose a performance issue for hardware virtualization.  An
accurate time stamp counter reading may therefore not always be available, and
it may be necessary for an implementation to guard against "backwards" reads of
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
system.
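
A common shape for such a guard is to remember the largest value handed out
so far and never return anything smaller.  The class below is a
single-threaded sketch with a hypothetical name; a real implementation shared
between CPUs would need an atomic compare-and-exchange on the stored value.

```python
class MonotonicTsc:
    """Clamp TSC samples so that callers never observe time going back."""

    def __init__(self):
        self.last = 0

    def observe(self, sample):
        # A sample older than one already returned is replaced by the
        # last returned value instead of being exposed to the caller.
        if sample < self.last:
            return self.last
        self.last = sample
        return sample


if __name__ == "__main__":
    guard = MonotonicTsc()
    print([guard.observe(s) for s in (100, 105, 103, 110)])
```

The cost of the clamp is that two reads may return the same value, so callers
must tolerate a zero delta.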

4.3. Timespec aliasing
----------------------

Additionally, this lack of serialization from the TSC poses another challenge
when using results of the TSC when measured against another time source.  As
the TSC is much higher precision, many possible values of the TSC may be read
while another clock is still expressing the same value.

That is, you may read (T, T+10) while external clock C maintains the same
value.  Due to non-serialized reads, you may actually end up with a range which
fluctuates - from (T-1, T+10).  Thus, any time calculated from a TSC, but
calibrated against an external value may have a range of valid values.
Re-calibrating this computation may actually cause time, as computed after the
calibration, to go backwards, compared with time computed before the
calibration.

This problem is particularly pronounced with an internal time source in Linux,
the kernel time, which is expressed in the theoretically high resolution
timespec - but which advances in much larger granularity intervals, sometimes
at the rate of jiffies, and possibly in catchup modes, at a much larger step.

This aliasing requires care in the computation and recalibration of kvmclock
and any other values derived from TSC computation (such as TSC virtualization
itself).

4.4. Migration
--------------

Migration of a virtual machine raises problems for timekeeping in two ways.
First, the migration itself may take time, during which interrupts cannot be
delivered, and after which, the guest time may need to be caught up.  NTP may
be able to help to some degree here, as the clock correction required is
typically small enough to fall in the NTP-correctable window.

An additional concern is that timers based off the TSC (or HPET, if the raw bus
clock is exposed) may now be running at different rates, requiring compensation
in some way in the hypervisor by virtualizing these timers.  In addition,
migrating to a faster machine may preclude the use of a passthrough TSC, as a
faster clock cannot be made visible to a guest without the potential of time
advancing faster than usual.  A slower clock is less of a problem, as it can
always be caught up to the original rate.  KVM clock avoids these problems by
simply storing multipliers and offsets against the TSC for the guest to convert
back into nanosecond resolution values.
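
The conversion the guest performs can be sketched as a fixed-point multiply.
The layout below (a TSC snapshot, a system time in nanoseconds, a 32-bit
fractional multiplier and a shift) is modeled loosely on the paravirtual
clock interface, but the function and parameter names here are illustrative
assumptions rather than the actual ABI.

```python
MASK64 = (1 << 64) - 1
NSEC_PER_SEC = 1_000_000_000


def mul_for_frequency(tsc_hz):
    """Nanoseconds per TSC cycle as a 32.32 fixed-point multiplier."""
    return (NSEC_PER_SEC << 32) // tsc_hz


def tsc_to_ns(tsc, tsc_timestamp, system_time_ns, mul, shift):
    """Convert a TSC reading to nanoseconds using host-provided parameters."""
    delta = (tsc - tsc_timestamp) & MASK64
    delta = delta << shift if shift >= 0 else delta >> -shift
    return system_time_ns + ((delta * mul) >> 32)


if __name__ == "__main__":
    mul = mul_for_frequency(2_000_000_000)       # hypothetical 2 GHz host
    print(tsc_to_ns(1000 + 2_000_000, 1000, 5_000_000, mul, 0))
```

After migration to a host with a different TSC rate, the hypervisor only has
to publish a new snapshot and multiplier; the guest-side arithmetic stays the
same.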

4.5. Scheduling
---------------

Since scheduling may be based on precise timing and firing of interrupts, the
scheduling algorithms of an operating system may be adversely affected by
virtualization.  In theory, the effect is random and should be uniformly
distributed, but in contrived as well as real scenarios (guest device access,
causes of virtualization exits, possible context switch), this may not always
be the case.  The effect of this has not been well studied.

In an attempt to work around this, several implementations have provided a
paravirtualized scheduler clock, which reveals the true amount of CPU time for
which a virtual machine has been running.
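
Conceptually, such a clock accumulates only the intervals during which the
virtual CPU was actually scheduled on a physical CPU.  The class below is a
toy model with hypothetical names, driven by explicit timestamps rather than
a real host scheduler.

```python
class SchedClock:
    """Accumulate the time a virtual CPU has actually been running."""

    def __init__(self):
        self.run_ns = 0
        self._since = None      # host time at which the vCPU was scheduled in

    def vcpu_in(self, now_ns):
        self._since = now_ns

    def vcpu_out(self, now_ns):
        self.run_ns += now_ns - self._since
        self._since = None

    def read(self, now_ns):
        running = 0 if self._since is None else now_ns - self._since
        return self.run_ns + running


if __name__ == "__main__":
    clock = SchedClock()
    clock.vcpu_in(0)
    clock.vcpu_out(100)     # ran for 100 ns, then was preempted
    clock.vcpu_in(300)      # 200 ns were spent elsewhere
    print(clock.read(350))
```

A guest scheduler that charges tasks using this clock instead of wall time
avoids penalizing a task for time during which the whole virtual machine was
descheduled.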

4.6. Watchdogs
--------------

Watchdog timers, such as the lockup detector in Linux, may fire accidentally
when running under hardware virtualization due to timer interrupts being
delayed or misinterpretation of the passage of real time.  Usually, these
warnings are spurious and can be ignored, but in some circumstances it may be
necessary to disable such detection.

4.7. Delays and precision timing
--------------------------------

Precise timing and delays may not be possible in a virtualized system.  This
matters if the system is controlling physical hardware, or issues delays to
compensate for slower I/O to and from devices.  The first issue is not solvable
in general for a virtualized system; hardware control software can't be
adequately virtualized without a full real-time operating system, which would
require an RT aware virtualization platform.

The second issue may cause performance problems, but this is unlikely to be a
significant issue.  In many cases these delays may be eliminated through
configuration or paravirtualization.

4.8. Covert channels and leaks
------------------------------

In addition to the above problems, time information will inevitably leak to the
guest about the host in anything but a perfect implementation of virtualized
time.  This may allow the guest to infer the presence of a hypervisor (as in a
red-pill type detection), and it may allow information to leak between guests
by using CPU utilization itself as a signalling channel.  Preventing such
problems would require completely isolated virtual time which may not track
real time any longer.  This may be useful in certain security or QA contexts,
but in general isn't recommended for real-world deployment scenarios.
    637
    638In addition to the above problems, time information will inevitably leak to the
    639guest about the host in anything but a perfect implementation of virtualized
    640time.  This may allow the guest to infer the presence of a hypervisor (as in a
    641red-pill type detection), and it may allow information to leak between guests
    642by using CPU utilization itself as a signalling channel.  Preventing such
    643problems would require completely isolated virtual time which may not track
    644real time any longer.  This may be useful in certain security or QA contexts,
    645but in general isn't recommended for real-world deployment scenarios.