.. _mm_concepts:

=================
Concepts overview
=================

Memory management in Linux is a complex system that has evolved over the
years to include more and more functionality and to support a variety of
systems, from MMU-less microcontrollers to supercomputers. Memory
management for systems without an MMU is called ``nommu``, and it
definitely deserves a dedicated document, which hopefully will
eventually be written. Although some of the concepts are the same,
here we assume that an MMU is available and that the CPU can translate a
virtual address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and,
even for systems that support memory hotplug, there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture, have different views
of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex, and
the concept of virtual memory was developed to avoid this complexity.

Virtual memory abstracts the details of physical memory from the
application software, allows keeping only the needed information in
physical memory (demand paging), and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads from (or
writes to) system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at kernel build time by setting an
appropriate kernel configuration option.

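The page size chosen for a running kernel can be queried from user
space. The snippet below is a minimal sketch, not part of the kernel
sources; the helper name ``page_size`` is an assumption made for
illustration.

```python
import mmap
import os


def page_size():
    """Return the base page size, in bytes, of the running system."""
    try:
        # Available on Unix-like systems.
        return os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # Fallback that the mmap module always provides.
        return mmap.PAGESIZE


if __name__ == "__main__":
    print(f"page size: {page_size()} bytes")
```

The reported value is the base page size only; it says nothing about
the huge page sizes a system may additionally support.
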
Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from a virtual address used by programs to the physical
memory address. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy with the next bits of the virtual address as the index to
that level page table. The lowest bits in the virtual address define
the offset inside the actual page.

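The way a virtual address is carved into per-level indices and a page
offset can be sketched as follows. The layout is an assumption chosen
for illustration (four levels, 9 index bits per level, 4 KiB pages, as
commonly used on x86-64); other architectures and configurations use
different widths. The function name is hypothetical.

```python
PAGE_SHIFT = 12      # assumed: 4 KiB pages, so 12 offset bits
BITS_PER_LEVEL = 9   # assumed: 512 entries per page table
LEVELS = 4           # assumed: four-level page table hierarchy


def split_virtual_address(vaddr):
    """Return ([top-level index, ..., lowest-level index], page offset)."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    index_mask = (1 << BITS_PER_LEVEL) - 1
    indices = []
    # Walk from the highest bits (top-level table) down to the lowest level.
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        indices.append((vaddr >> shift) & index_mask)
    return indices, offset


if __name__ == "__main__":
    indices, offset = split_virtual_address(0x00007F1234567ABC)
    print("table indices:", indices, "offset:", hex(offset))
```

Each index selects one entry in the table at that level; the entry
found there supplies the physical address of the next table, until the
lowest level yields the page itself.
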
Huge Pages
==========

The address translation requires several memory accesses, and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called the Translation Lookaside Buffer (or
TLB). Usually the TLB is a pretty scarce resource, and applications with
a large memory working set will experience a performance hit because of
TLB misses.

Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on x86,
it is possible to map 2M and even 1G pages using entries in the second
and the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the TLB,
improves the TLB hit rate and thus improves overall system performance.

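A back-of-the-envelope calculation shows why larger pages relieve TLB
pressure: the memory a TLB can cover is the number of entries
multiplied by the page size. The sketch below assumes a hypothetical
TLB with 1024 entries; real TLBs differ in size and often keep separate
entries for each page size.

```python
KiB, MiB, GiB = 1 << 10, 1 << 20, 1 << 30


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_size


if __name__ == "__main__":
    entries = 1024  # hypothetical TLB size, for illustration only
    for name, size in (("4K", 4 * KiB), ("2M", 2 * MiB), ("1G", 1 * GiB)):
        reach = tlb_reach(entries, size)
        print(f"{name} pages: TLB covers {reach // MiB} MiB")
```

Under these assumptions, moving from 4K to 2M pages increases the
covered memory by a factor of 512 without adding a single TLB entry.
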
There are two mechanisms in Linux that enable mapping of the physical
memory with huge pages. The first one is the `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem, the data resides in
memory and is mapped using huge pages. The hugetlbfs is described in
:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.

Another, more recent, mechanism that enables use of huge pages is
called `Transparent HugePages`, or THP. Unlike hugetlbfs, which
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by huge pages, THP
manages such mappings transparently to the user, hence the
name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.

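Whether THP is active on a given machine can be checked through sysfs.
The following sketch assumes the
``/sys/kernel/mm/transparent_hugepage/enabled`` file, whose content
lists the available modes with the active one in square brackets (for
example ``always [madvise] never``). The helper names are illustrative,
and the function returns ``None`` when the file is absent.

```python
import re

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_selected(text):
    """Return the bracketed (active) word from a sysfs mode list."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def thp_mode(path=THP_ENABLED_PATH):
    """Return the active THP mode, or None if it cannot be determined."""
    try:
        with open(path) as f:
            return parse_selected(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", thp_mode())
```
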
Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL will
contain normally addressed pages.

The actual layout of the memory zones is hardware dependent, as not all
architectures define all zones, and requirements for DMA are different
for different platforms.

Nodes
=====

Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.

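On a running system, the zones present in each node are listed in
``/proc/zoneinfo``. The sketch below groups zone names by node; it
assumes section headers of the form ``Node 0, zone   Normal`` and
ignores all other lines. The function names are illustrative.

```python
import re

ZONE_HEADER = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")


def zones_by_node(zoneinfo_text):
    """Map each node id to the list of zone names found in the text."""
    result = {}
    for line in zoneinfo_text.splitlines():
        match = ZONE_HEADER.match(line)
        if match:
            node, zone = int(match.group(1)), match.group(2)
            result.setdefault(node, []).append(zone)
    return result


def read_zones(path="/proc/zoneinfo"):
    """Read and parse the zone list; returns {} if the file is unavailable."""
    try:
        with open(path) as f:
            return zones_by_node(f.read())
    except OSError:
        return {}


if __name__ == "__main__":
    for node, zones in sorted(read_zones().items()):
        print(f"node {node}: {', '.join(zones)}")
```

Which zones appear depends on the architecture and kernel
configuration, so the output differs from machine to machine.
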
Page cache
==========

The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
the subsequent reads. Similarly, when one writes to a file, the data
is placed in the page cache and eventually gets into the backing
storage device. The written pages are marked as `dirty` and when Linux
decides to reuse them for other purposes, it makes sure to synchronize
the file contents on the device with the updated data.

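The current size of the page cache and the amount of dirty data can be
observed in ``/proc/meminfo``. The following sketch parses that file
into a dictionary; it assumes lines of the form ``Name:   value kB``
and reads the ``Cached`` and ``Dirty`` fields. The helper name is
illustrative.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            # Most values are in kB; the unit suffix, if any, is dropped.
            fields[name.strip()] = int(parts[0])
    return fields


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
    except OSError:
        info = {}
    print("Cached:", info.get("Cached"), "kB")
    print("Dirty: ", info.get("Dirty"), "kB")
```
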
Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for a program's stack and heap, or by explicit calls to the mmap(2)
system call. Usually, the anonymous mappings only define virtual memory
areas that the program is allowed to access. Read accesses will result
in the creation of a page table entry that references a special physical
page filled with zeroes. When the program performs a write, a regular
physical page will be allocated to hold the written data. The page
will be marked dirty and, if the kernel decides to repurpose it,
the dirty page will be swapped out.

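The user-visible side of this behaviour can be seen with an explicit
anonymous mapping. The sketch below uses Python's ``mmap`` module on a
Unix-like system; the function name is illustrative. It only shows what
a program observes (zeroes before the first write, data afterwards) and
cannot show the page table changes the kernel makes underneath.

```python
import mmap


def demo_anonymous_mapping(length=mmap.PAGESIZE):
    """Read from, then write to, a private anonymous mapping."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    # fileno=-1 means the mapping is not backed by any file.
    with mmap.mmap(-1, length, flags=flags) as mapping:
        before = bytes(mapping[:8])   # untouched memory reads as zeroes
        mapping[:5] = b"hello"        # first write to the page
        after = bytes(mapping[:8])
    return before, after


if __name__ == "__main__":
    before, after = demo_anonymous_mapping()
    print("before write:", before)
    print("after write: ", after)
```
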
Reclaim
=======

Throughout the system lifetime, a physical page can be used for storing
different types of data. It can be kernel internal data structures,
DMA'able buffers for device driver use, data read from a filesystem,
memory allocated by user space processes, etc.

Depending on the page usage, it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache the data available elsewhere, for instance, on a
hard disk, or because they can be swapped out, again, to the hard
disk, are called `reclaimable`. The most notable categories of the
reclaimable pages are page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in certain
circumstances, even pages occupied with kernel data structures can be
reclaimed. For instance, in-memory caches of filesystem metadata can
be re-read from the storage device and therefore it is possible to
discard them from the main memory when the system is under memory
pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is free
and allocation requests will be satisfied immediately from the free
pages supply. As the load increases, the number of free pages goes
down and when it reaches a certain threshold (low watermark), an
allocation request will awaken the ``kswapd`` daemon. It will
asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict them to the backing storage
device (remember those dirty pages?). As memory usage increases even
more and reaches another threshold - min watermark - an allocation
will trigger `direct reclaim`. In this case allocation is stalled
until enough memory pages are reclaimed to satisfy the request.

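The watermark logic described above can be summarised with a toy
decision function. This is a deliberately simplified model, not the
kernel's allocator: the function name, the returned strings and the
numbers in the usage example are all assumptions made for illustration.

```python
def reclaim_action(free_pages, low_watermark, min_watermark):
    """Return what an allocation request triggers in this toy model."""
    if min_watermark > low_watermark:
        raise ValueError("min watermark must not exceed low watermark")
    if free_pages <= min_watermark:
        # The allocation stalls until enough pages are reclaimed.
        return "direct reclaim"
    if free_pages <= low_watermark:
        # The allocation proceeds; kswapd reclaims in the background.
        return "wake kswapd"
    return "allocate"


if __name__ == "__main__":
    low, minimum = 200, 100  # hypothetical watermarks, in pages
    for free in (1000, 150, 50):
        print(free, "free pages ->", reclaim_action(free, low, minimum))
```
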
Compaction
==========

As the system runs, tasks allocate and free the memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes it is
necessary to allocate large physically contiguous memory areas. Such
a need may arise, for instance, when a device driver requires a large
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
addresses the fragmentation issue. This mechanism moves occupied pages
from the lower part of a memory zone to free pages in the upper part
of the zone. When a compaction scan is finished, free pages are grouped
together at the beginning of the zone and allocations of large
physically contiguous areas become possible.

Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
daemon or synchronously as a result of a memory allocation request.

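The compaction scan described above can be modelled with two scanners
that walk a zone from opposite ends. The following is a toy simulation
on a Python list, where ``None`` stands for a free page frame; it
assumes every occupied page is movable and is not a description of the
kernel's actual implementation.

```python
def compact(zone):
    """Return a copy of zone with occupied pages moved to the upper part."""
    pages = list(zone)
    migrate = 0              # scans upward for occupied pages
    free = len(pages) - 1    # scans downward for free pages
    while True:
        while migrate < free and pages[migrate] is None:
            migrate += 1
        while migrate < free and pages[free] is not None:
            free -= 1
        if migrate >= free:
            break            # the scanners met: the scan is finished
        # Move the occupied page into the free frame found higher up.
        pages[free], pages[migrate] = pages[migrate], None
    return pages


if __name__ == "__main__":
    print(compact(["A", None, "B", None, None, "C"]))
```

After the scan, the free frames form one contiguous run at the start of
the list, which is what makes a large contiguous allocation possible.
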
OOM killer
==========

It is possible that on a loaded machine memory will be exhausted and the
kernel will be unable to reclaim enough memory to continue to operate. In
order to save the rest of the system, it invokes the `OOM killer`.

The `OOM killer` selects a task to sacrifice for the sake of the overall
system health. The selected task is killed in the hope that after it exits
enough memory will be freed to continue normal operation.