==========================
BFQ (Budget Fair Queueing)
==========================

BFQ is a proportional-share I/O scheduler, with some extra
low-latency capabilities. In addition to cgroups support (blkio or io
controllers), BFQ's main features are:

- BFQ guarantees a high system and application responsiveness, and a
  low latency for time-sensitive applications, such as audio or video
  players;
- BFQ distributes bandwidth, and not just time, among processes or
  groups (switching back to time distribution when needed to keep
  throughput high).

In its default configuration, BFQ privileges latency over
throughput. So, when needed for achieving a lower latency, BFQ builds
schedules that may lead to a lower throughput. If your main or only
goal, for a given device, is to achieve the maximum-possible
throughput at all times, then do switch off all low-latency heuristics
for that device, by setting low_latency to 0. See Section 3 for
details on how to configure BFQ for the desired tradeoff between
latency and throughput, or on how to maximize throughput.

Like every I/O scheduler, BFQ adds some overhead to per-I/O-request
processing. To give an idea of this overhead, the total,
single-lock-protected, per-request processing time of BFQ (i.e., the
sum of the execution times of the request insertion, dispatch and
completion hooks) is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
(dated CPU for notebooks; time measured with simple code
instrumentation, and using the throughput-sync.sh script of the S
suite [3], in performance-profiling mode). To put this result into
context, the total, single-lock-protected, per-request execution time
of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7
us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ).

Scheduling overhead further limits the maximum IOPS that a CPU can
process (already limited by the execution of the rest of the I/O
stack).
To give an idea of the limits with BFQ, on slow or average
CPUs, here are, first, the limits of BFQ for three different CPUs, on,
respectively, an average laptop, an old desktop, and a cheap embedded
system, in case full hierarchical support is enabled (i.e.,
CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is not
set (Section 4-2):

- Intel i7-4850HQ: 400 KIOPS
- AMD A8-3850: 250 KIOPS
- ARM Cortex-A53 Octa-core: 80 KIOPS

If CONFIG_BFQ_CGROUP_DEBUG is set (and of course full hierarchical
support is enabled), then the sustainable throughput with BFQ
decreases, because all blkio.bfq* statistics are created and updated
(Section 4-2). For BFQ, this leads to the following maximum
sustainable throughputs, on the same systems as above:

- Intel i7-4850HQ: 310 KIOPS
- AMD A8-3850: 200 KIOPS
- ARM Cortex-A53 Octa-core: 56 KIOPS

BFQ works for multi-queue devices too.

.. The table of contents follows. Impatient readers can jump to Section 3.

.. CONTENTS

   1. When may BFQ be useful?
    1-1 Personal systems
    1-2 Server systems
   2. How does BFQ work?
   3. What are BFQ's tunables and how to properly configure BFQ?
   4. BFQ group scheduling
    4-1 Service guarantees provided
    4-2 Interface

1. When may BFQ be useful?
==========================

BFQ provides the following benefits on personal and server systems.

1-1 Personal systems
--------------------

Low latency for interactive applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Regardless of the actual background workload, BFQ guarantees that, for
interactive tasks, the storage device is virtually as responsive as if
it were idle.
For example, even if one or more of the following
background workloads are being executed:

- one or more large files are being read, written or copied,
- a tree of source files is being compiled,
- one or more virtual machines are performing I/O,
- a software update is in progress,
- indexing daemons are scanning filesystems and updating their
  databases,

starting an application or loading a file from within an application
takes about the same time as if the storage device was idle. As a
comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
applications experience high latencies, or even become unresponsive
until the background workload terminates (also on SSDs).

Low latency for soft real-time applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Soft real-time applications, such as audio and video
players/streamers, also enjoy a low latency and a low drop rate,
regardless of the background I/O workload. As a consequence, these
applications suffer almost no glitches due to the background workload.

Higher speed for code-development tasks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If some additional workload happens to be executed in parallel, then
BFQ executes the I/O-related components of typical code-development
tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
NOOP or DEADLINE.

High throughput
^^^^^^^^^^^^^^^

On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
up to 150% higher throughput than DEADLINE and NOOP, with all the
sequential workloads considered in our tests. With random workloads,
and with all the workloads on flash-based devices, BFQ achieves,
instead, about the same throughput as the other schedulers.

Strong fairness, bandwidth and delay guarantees
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

BFQ distributes the device throughput, and not just the device time,
among I/O-bound applications in proportion to their weights, with any
workload and regardless of the device parameters. From these bandwidth
guarantees, it is possible to compute tight per-I/O-request delay
guarantees by a simple formula. If not configured for strict service
guarantees, BFQ switches to time-based resource sharing (only) for
applications that would otherwise cause a throughput loss.

1-2 Server systems
------------------

Most benefits for server systems follow from the same service
properties as above. In particular, regardless of whether additional,
possibly heavy workloads are being served, BFQ guarantees:

* audio and video-streaming with zero or very low jitter and drop
  rate;

* fast retrieval of web pages and embedded objects;

* real-time recording of data in live-dumping applications (e.g.,
  packet logging);

* responsiveness in local and remote access to a server.


2. How does BFQ work?
=====================

BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.

- Each process doing I/O on a device is associated with a weight and a
  `(bfq_)queue`.

- BFQ grants exclusive access to the device, for a while, to one queue
  (process) at a time, and implements this service model by
  associating every queue with a budget, measured in number of
  sectors.

  - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

  - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
      holding the device for too long and dramatically reducing
      throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
      sync requests may not be expired immediately when it empties.
      Instead, BFQ may idle the device for a short time interval,
      giving the process the chance to go on being served if it issues
      a new request in time. Device idling typically boosts the
      throughput on rotational devices and on non-queueing flash-based
      devices, if processes do synchronous and sequential I/O. In
      addition, under BFQ, device idling is also instrumental in
      guaranteeing the desired throughput fraction to processes
      issuing sync requests (see the description of the slice_idle
      tunable in this document, or [1, 2], for more details).

      - With respect to idling for service guarantees, if several
        processes are competing for the device at the same time, but
        all processes and groups have the same weight, then BFQ
        guarantees the expected throughput distribution without ever
        idling the device. Throughput is thus as high as possible in
        this common scenario.

      - On flash-based storage with internal queueing of commands
        (typically NCQ), device idling happens to be always detrimental
        for throughput. So, with these devices, BFQ performs idling
        only when strictly needed for service guarantees, i.e., for
        guaranteeing low latency or fairness. In these cases, overall
        throughput may be sub-optimal. No solution currently exists to
        provide both strong service guarantees and optimal throughput
        on devices with internal queueing.

  - If low-latency mode is enabled (default configuration), BFQ
    executes some special heuristics to detect interactive and soft
    real-time applications (e.g., video or audio players/streamers),
    and to reduce their latency. The most important action taken to
    achieve this goal is to give to the queues associated with these
    applications more than their fair share of the device
    throughput. For brevity, we simply call "weight-raising" the whole
    set of actions taken by BFQ to privilege these queues. In
    particular, BFQ provides a milder form of weight-raising for
    interactive applications, and a stronger form for soft real-time
    applications.

  - BFQ automatically deactivates idling for queues born in a burst of
    queue creations. In fact, these queues are usually associated with
    the processes of applications and services that benefit mostly
    from a high throughput. Examples are systemd during boot, or git
    grep.

  - Like CFQ, BFQ merges queues performing interleaved I/O, i.e.,
    performing random I/O that becomes mostly sequential if
    merged. Differently from CFQ, BFQ achieves this goal with a more
    reactive mechanism, called Early Queue Merge (EQM). EQM is so
    responsive in detecting interleaved I/O (cooperating processes),
    that it enables BFQ to achieve a high throughput, by queue
    merging, even for queues for which CFQ needs a different
    mechanism, preemption, to get a high throughput. As such, EQM is a
    unified mechanism to achieve a high throughput with interleaved
    I/O.

  - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling, details in Section 4.

  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service.
    In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

  - The last, budget-independence, property (although probably
    counterintuitive in the first place) is definitely beneficial, for
    the following reasons:

    - First, with any proportional-share scheduler, the maximum
      deviation with respect to an ideal service is proportional to
      the maximum budget (slice) assigned to queues. As a consequence,
      BFQ can keep this deviation tight not only because of the
      accurate service of B-WF2Q+, but also because BFQ *does not*
      need to assign a larger budget to a queue to let the queue
      receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
      budget that best fits the needs of the process, or best
      leverages the I/O pattern of the process. In particular, BFQ
      updates queue budgets with a simple feedback-loop algorithm that
      allows a high throughput to be achieved, while still providing
      tight latency guarantees to time-sensitive applications. When
      the in-service queue expires, this algorithm computes the next
      budget of the queue so as to:

      - Let large budgets be eventually assigned to the queues
        associated with I/O-bound applications performing sequential
        I/O: in fact, the longer these applications are served once
        they get access to the device, the higher the throughput is.

      - Let small budgets be eventually assigned to the queues
        associated with time-sensitive applications (which typically
        perform sporadic and short I/O), because, the smaller the
        budget assigned to a queue waiting for service is, the sooner
        B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).

- If several processes are competing for the device at the same time,
  but all processes and groups have the same weight, then BFQ
  guarantees the expected throughput distribution without ever idling
  the device. It uses preemption instead. Throughput is then much
  higher in this common scenario.

- ioprio classes are served in strict priority order, i.e.,
  lower-priority queues are not served as long as there are
  higher-priority queues. Among queues in the same class, the
  bandwidth is distributed in proportion to the weight of each
  queue. A very thin extra bandwidth is however guaranteed to
  the Idle class, to prevent it from starving.


3. What are BFQ's tunables and how to properly configure BFQ?
=============================================================

Most BFQ tunables affect service guarantees (basically latency and
fairness) and throughput. For full details on how to choose the
desired tradeoff between service guarantees and throughput, see the
parameters slice_idle, strict_guarantees and low_latency. For details
on how to maximize throughput, see slice_idle, timeout_sync and
max_budget. The other performance-related parameters have been
inherited from, and have been preserved mostly for compatibility with,
CFQ. So far, no performance improvement has been reported after
changing the latter parameters in BFQ.

In particular, the tunables back_seek_max, back_seek_penalty,
fifo_expire_async and fifo_expire_sync below are the same as in
CFQ. Their description is just copied from that for CFQ. Some
considerations in the description of slice_idle are copied from CFQ
too.

per-process ioprio and weight
-----------------------------

Unless the cgroups interface is used (see "4.
BFQ group scheduling"),
weights can be assigned to processes only indirectly, through I/O
priorities, and according to the relation:
weight = (IOPRIO_BE_NR - ioprio) * 10.

Beware that, if low_latency is set, then BFQ automatically raises the
weight of the queues associated with interactive and soft real-time
applications. Unset this tunable if you need/want to control weights.

slice_idle
----------

This parameter specifies how long BFQ should idle, waiting for the
next I/O request, when certain sync BFQ queues become empty. By default
slice_idle is a non-zero value. Idling has a double purpose: boosting
throughput and making sure that the desired throughput distribution is
respected (see the description of how BFQ works, and, if needed, the
papers referenced there).

As for throughput, idling can be very helpful on highly seeky media
like single-spindle SATA/SAS disks, where it cuts down on the overall
number of seeks and thus improves throughput.

Setting slice_idle to 0 will remove all the idling on queues and one
should see an overall improved throughput on faster storage devices
like multiple SATA/SAS disks in hardware RAID configuration, as well
as flash-based storage with internal command queueing (and
parallelism).

So depending on storage and workload, it might be useful to set
slice_idle=0. In general, for SATA/SAS disks and software RAID of
SATA/SAS disks, keeping slice_idle enabled should be useful. For any
configurations where there are multiple spindles behind a single LUN
(host-based hardware RAID controller or storage arrays), or with
flash-based fast storage, setting slice_idle=0 might end up in better
throughput and acceptable latencies.

Idling is however necessary to have service guarantees enforced in
case of differentiated weights or differentiated I/O-request lengths.
To see why, suppose that a given BFQ queue A must get several I/O
requests served for each request served for another queue B. Idling
ensures that, if A makes a new I/O request slightly after becoming
empty, then no request of B is dispatched in the middle, and thus A
does not lose the possibility to get more than one request dispatched
before the next request of B is dispatched. Note that idling
guarantees the desired differentiated treatment of queues only in
terms of I/O-request dispatches. To guarantee that the actual service
order then corresponds to the dispatch order, the strict_guarantees
tunable must be set too.

There is an important flipside for idling: apart from the above cases
where it is beneficial also for throughput, idling can severely impact
throughput. One important case is random workload. Because of this
issue, BFQ tends to avoid idling as much as possible, when it is not
beneficial also for throughput (as detailed in Section 2). As a
consequence of this behavior, and of further issues described for the
strict_guarantees tunable, short-term service guarantees may be
occasionally violated. And, in some cases, these guarantees may be
more important than guaranteeing maximum throughput. For example, in
video playing/streaming, a very low drop rate may be more important
than maximum throughput. In these cases, consider setting the
strict_guarantees parameter.

slice_idle_us
-------------

Controls the same tuning parameter as slice_idle, but in microseconds.
Either tunable can be used to set idling behavior. Afterwards, the
other tunable will reflect the newly set value in sysfs.
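
As an illustration of how slice_idle and slice_idle_us might be
inspected or changed from a script, here is a minimal Python sketch. It
assumes that the tunables of the active scheduler are exposed under
/sys/block/<device>/queue/iosched/, that BFQ is the active scheduler
for the device, and that slice_idle is expressed in milliseconds. The
helper names and the device name "sda" are hypothetical, and writing
to sysfs requires root privileges.

```python
from pathlib import Path


def tunable_path(device, name, sysfs_root="/sys"):
    """Build the path of an I/O-scheduler tunable for a block device."""
    return Path(sysfs_root) / "block" / device / "queue" / "iosched" / name


def read_tunable(device, name, sysfs_root="/sys"):
    """Return the current integer value of a tunable."""
    return int(tunable_path(device, name, sysfs_root).read_text().strip())


def write_tunable(device, name, value, sysfs_root="/sys"):
    """Write an integer value to a tunable (needs root on a real system)."""
    tunable_path(device, name, sysfs_root).write_text(f"{value}\n")


def ms_to_us(milliseconds):
    """Convert milliseconds to microseconds (assumes slice_idle is in ms)."""
    return milliseconds * 1000


# Example on a hypothetical device "sda" (not executed here):
#   write_tunable("sda", "slice_idle", 0)      # disable idling
#   read_tunable("sda", "slice_idle_us")       # should then reflect the change
```

The sysfs_root parameter only exists so that the helpers can be tried
against a scratch directory before touching a real device.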

strict_guarantees
-----------------

If this parameter is set (default: unset), then BFQ

- always performs idling when the in-service queue becomes empty;

- forces the device to serve one I/O request at a time, by dispatching a
  new request only if there is no outstanding request.

In the presence of differentiated weights or I/O-request sizes, both
the above conditions are needed to guarantee that every BFQ queue
receives its allotted share of the bandwidth. The first condition is
needed for the reasons explained in the description of the slice_idle
tunable. The second condition is needed because all modern storage
devices reorder internally-queued requests, which may trivially break
the service guarantees enforced by the I/O scheduler.

Setting strict_guarantees may evidently affect throughput.

back_seek_max
-------------

This specifies, given in Kbytes, the maximum "distance" for backward seeking.
The distance is the amount of space from the current head location to the
sectors that are backward in terms of distance.

This parameter allows the scheduler to anticipate requests in the "backward"
direction and consider them as being the "next" if they are within this
distance from the current head location.

back_seek_penalty
-----------------

This parameter is used to compute the cost of backward seeking. If the
backward distance of a request is just 1/back_seek_penalty of the distance
of a "front" request, then the seeking cost of the two requests is considered
equivalent.

So the scheduler will not bias toward one or the other request (otherwise the
scheduler will bias toward the front request). The default value of
back_seek_penalty is 2.

fifo_expire_async
-----------------

This parameter is used to set the timeout of asynchronous requests. The
default value is 250ms.

fifo_expire_sync
----------------

This parameter is used to set the timeout of synchronous requests. The
default value is 125ms. To favor synchronous requests over asynchronous
ones, this value should be decreased relative to fifo_expire_async.

low_latency
-----------

This parameter is used to enable/disable BFQ's low latency mode. By
default, low latency mode is enabled. If enabled, interactive and soft
real-time applications are privileged and experience a lower latency,
as explained in more detail in the description of how BFQ works.

DISABLE this mode if you need full control on bandwidth
distribution. In fact, if it is enabled, then BFQ automatically
increases the bandwidth share of privileged applications, as the main
means to guarantee a lower latency to them.

In addition, as already highlighted at the beginning of this document,
DISABLE this mode if your only goal is to achieve a high throughput.
In fact, privileging the I/O of some application over the rest may
entail a lower throughput. To achieve the highest-possible throughput
on a non-rotational device, setting slice_idle to 0 may be needed too
(at the cost of giving up any strong guarantee on fairness and low
latency).

timeout_sync
------------

Maximum amount of device time that can be given to a task (queue) once
it has been selected for service. On devices with costly seeks,
increasing this time usually increases maximum throughput. On the
other hand, increasing this time coarsens the granularity of the
short-term bandwidth and latency guarantees, especially if the
following parameter is set to zero.

max_budget
----------

Maximum amount of service, measured in sectors, that can be provided
to a BFQ queue once it is set in service (of course within the limits
of the above timeout).
As explained in the description of
the algorithm, larger values increase the throughput in proportion to
the percentage of sequential I/O requests issued. The price of larger
values is that they coarsen the granularity of short-term bandwidth
and latency guarantees.

The default value is 0, which enables auto-tuning: BFQ sets max_budget
to the maximum number of sectors that can be served during
timeout_sync, according to the estimated peak rate.

For specific devices, some users have occasionally reported reaching
a higher throughput by setting max_budget explicitly, i.e., by
setting max_budget to a higher value than 0. In particular, they have
set max_budget to higher values than those to which BFQ would have set
it with auto-tuning. An alternative way to achieve this goal is to
just increase the value of timeout_sync, leaving max_budget equal to 0.

4. Group scheduling with BFQ
============================

BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
blkio and io. In particular, BFQ supports weight-based proportional
share. To activate cgroups support, set BFQ_GROUP_IOSCHED.

4-1 Service guarantees provided
-------------------------------

With BFQ, proportional share means true proportional share of the
device bandwidth, according to group weights. For example, a group
with weight 200 gets twice the bandwidth, and not just twice the time,
of a group with weight 100.

BFQ supports hierarchies (group trees) of any depth. Bandwidth is
distributed among groups and processes in the expected way: for each
group, the children of the group share the whole bandwidth of the
group in proportion to their weights. In particular, this implies
that, for each leaf group, every process of the group receives the
same share of the whole group bandwidth, unless the ioprio of the
process is modified.
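
The distribution rule above can be worked through with a small Python
sketch. It is only an idealized model, not BFQ's implementation: it
assumes that every process is continuously backlogged, that
weight-raising is off (low_latency unset), that IOPRIO_BE_NR is 8 and
that the default best-effort ioprio is 4. The tree layout and the
function names are hypothetical.

```python
from fractions import Fraction

IOPRIO_BE_NR = 8  # assumed number of best-effort priority levels


def ioprio_to_weight(ioprio):
    """Weight of a process, from the relation given in Section 3."""
    return (IOPRIO_BE_NR - ioprio) * 10


def shares(node, fraction=Fraction(1), path=""):
    """Return {leaf path: fraction of device bandwidth} for a group tree.

    Each node is a dict with a "weight" and, for groups, a "children"
    dict. The children of a group split the group's whole fraction in
    proportion to their weights.
    """
    children = node.get("children")
    if not children:
        return {path: fraction}
    total = sum(child["weight"] for child in children.values())
    result = {}
    for name, child in children.items():
        child_fraction = fraction * child["weight"] / total
        result.update(shares(child, child_fraction, f"{path}/{name}"))
    return result


def process(ioprio=4):
    """Leaf node for a process with the given (assumed default 4) ioprio."""
    return {"weight": ioprio_to_weight(ioprio)}


# Group A (weight 200) holds two processes, group B (weight 100) one.
tree = {"weight": 100, "children": {
    "A": {"weight": 200, "children": {"p1": process(), "p2": process()}},
    "B": {"weight": 100, "children": {"p1": process()}},
}}

for leaf, share in shares(tree).items():
    print(leaf, share)
```

In this model, group A receives 2/3 of the bandwidth and splits it
evenly between its two processes, while the only process of group B
receives the remaining 1/3. Changing the ioprio of one process of A
(e.g., process(ioprio=0)) changes the split inside A only.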

The resource-sharing guarantee for a group may partially or totally
switch from bandwidth to time, if providing bandwidth guarantees to
the group lowers the throughput too much. This switch occurs on a
per-process basis: if a process of a leaf group causes throughput loss
if served in such a way as to receive its share of the bandwidth, then
BFQ switches back to just time-based proportional share for that
process.

4-2 Interface
-------------

To get proportional sharing of bandwidth with BFQ for a given device,
BFQ must of course be the active scheduler for that device.

Within each group directory, the names of the files associated with
BFQ-specific cgroup parameters and stats begin with the "bfq."
prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
BFQ-specific files is "blkio.bfq." or "io.bfq.", respectively. For
example, the group parameter to set the weight of a group with BFQ is
blkio.bfq.weight or io.bfq.weight.

As for cgroups-v1 (blkio controller), the exact set of stat files
created, and kept up-to-date by bfq, depends on whether
CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all
the stat files documented in
Documentation/admin-guide/cgroup-v1/blkio-controller.rst. If, instead,
CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files::

  blkio.bfq.io_service_bytes
  blkio.bfq.io_service_bytes_recursive
  blkio.bfq.io_serviced
  blkio.bfq.io_serviced_recursive

The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximum
throughput sustainable with bfq, because updating the blkio.bfq.*
stats is rather costly, especially for some of the stats enabled by
CONFIG_BFQ_CGROUP_DEBUG.

Parameters
----------

For each group, the following parameters can be set:

  weight
        This specifies the default weight for the cgroup inside its parent.
        Available values: 1..1000 (default: 100).

        For cgroup v1, it is set by writing the value to `blkio.bfq.weight`.

        For cgroup v2, it is set by writing the value to `io.bfq.weight`
        (with an optional prefix of `default` and a space).

        The linear mapping between ioprio and weights, described at the beginning
        of the tunable section, is still valid, but all weights higher than
        IOPRIO_BE_NR*10 are mapped to ioprio 0.

        Recall that, if low_latency is set, then BFQ automatically raises the
        weight of the queues associated with interactive and soft real-time
        applications. Unset this tunable if you need/want to control weights.

  weight_device
        This specifies a per-device weight for the cgroup. The syntax is
        `major:minor weight`. A weight of `0` may be used to reset to the default
        weight.

        For cgroup v1, it is set by writing the value to `blkio.bfq.weight_device`.

        For cgroup v2, the file name is `io.bfq.weight`.


[1]
    P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
    Scheduler", Proceedings of the First Workshop on Mobile System
    Technologies (MST-2015), May 2015.

    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

[2]
    P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.

    Slightly extended version:

    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

[3]
    https://github.com/Algodev-github/S