Improving performance
This article covers system diagnostics relating to performance, as well as concrete steps to optimize system performance, for example by reducing resource consumption.
The basics
Know your system
The best way to tune a system is to find the bottleneck, i.e. the subsystem that slows everything down. Looking at the system's specifications can help identify the problem.
- If lag occurs when running several large applications at the same time (such as LibreOffice and Firefox), check whether the amount of RAM is sufficient. Use the following command and check the "available" column:
$ free -h
- If boot time is slow, and applications take a long time to load (only) the first time they are started, the hard drive is probably too slow. Hard drive speed can be measured with the hdparm command:
# hdparm -t /dev/sdX
- If applications that use direct rendering (i.e. the GPU), such as video players, games or even the window manager, run slowly, improving GPU performance should help. The first step is to check whether direct rendering is actually enabled, using the glxinfo command from mesa-demos:
$ glxinfo | grep "direct rendering"
If it is enabled, the command returns direct rendering: Yes.
Benchmarking
To quantitatively evaluate the results of an optimization, use benchmarking.
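As a rough illustration of disk benchmarking, a simple sequential write test can be run with dd; this is only a sketch, the output path and size are arbitrary, and dd gives far less rigorous results than dedicated benchmarking tools. Delete the test file afterwards:
$ dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync status=progress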
Storage devices
Multiple hardware paths
An internal hardware path is how the storage device is connected to your motherboard, for example TCP/IP through a NIC, or devices plugged in directly via PCIe/PCI, FireWire, a RAID card, USB, etc. Spreading storage devices across these connection points maximizes the capability of the motherboard; for example, connecting six hard drives over USB would be slower than connecting three over USB and three over FireWire. The reason is that each entry point into the motherboard is like a pipe, and there is a limit to how much can go through a pipe at any one time. The good news is that a motherboard usually has several pipes.
As a further example, if the computer has two USB sockets on the front and four on the back, then plugging two drives in at the front and two at the back should be faster than plugging one in at the front and three at the back. This is because the front sockets are likely on a separate root hub from the back ones, meaning more data can be sent at the same time.
Use the following commands to determine whether there are multiple paths on your machine:
USB device tree:
$ lsusb -tv
PCI device tree:
$ lspci -tv
Partitioning
Make sure that your partitions are properly aligned.
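Alignment can be checked with parted, for example (the device and partition number below are placeholders for the ones you want to verify):
# parted /dev/sdX
(parted) align-check optimal 1
1 aligned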
Multiple drives
If you have multiple disks available, setting them up as a software RAID can bring serious speed improvements.
Creating swap space on a separate disk can also help considerably, especially if the machine swaps frequently.
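If swap areas on several disks are given the same priority, the kernel distributes swap writes across them in a round-robin fashion, which behaves much like striping. A minimal fstab sketch, with placeholder device paths:
/etc/fstab
/dev/sdb2 none swap defaults,pri=10 0 0
/dev/sdc2 none swap defaults,pri=10 0 0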
Layout on HDDs
If you are using a traditional spinning hard drive, your partition layout can influence the system's performance. Sectors at the beginning of the drive (closer to the outside of the disk) are faster than those at the end. Also, a smaller partition requires less movement of the drive's head, which speeds up disk operations. Therefore, it is advised to create a small partition (10 GB, more or less depending on your needs) for your system, as close to the beginning of the drive as possible. Other data (pictures, videos) should be kept on a separate partition; this is usually achieved by separating the home directory (/home/user) from the system (/).
Choosing and tuning your filesystem
Choosing the best filesystem for a specific system is very important because each has its own strengths. The File systems article provides a short summary of the most popular ones; related articles can also be found in Category:File systems.
Mount options
The noatime option is known to improve performance of the filesystem.
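For illustration, noatime is simply added to the options field of the relevant fstab entry; a sketch with placeholder device, mount point and filesystem:
/etc/fstab
/dev/sda2  /  ext4  defaults,noatime  0 1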
Other mount options are filesystem specific, therefore see the relevant articles for the filesystems:
- Ext3
- Ext4#Improving performance
- JFS Filesystem#Optimizations
- XFS#Performance
- Btrfs#Defragmentation, Btrfs#Compression, and btrfs(5)
- ZFS#Tuning
- NTFS#Improving performance
Reiserfs
The data=writeback mount option improves speed, but may corrupt data during power loss. The notail mount option increases the space used by the filesystem by about 5%, but also improves overall speed. You can also reduce disk load by putting the journal and the data on separate drives. This is done when creating the filesystem:
# mkreiserfs -j /dev/sda1 /dev/sdb1
Replace /dev/sda1 with the partition reserved for the journal, and /dev/sdb1 with the partition for data. You can learn more about reiserfs in the Funtoo Filesystem Guide.
Tuning kernel parameters
There are several key tunables affecting the performance of block devices, see sysctl#Virtual memory for more information.
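For instance, lowering the dirty-page ratios makes background writeback start earlier so that less dirty data accumulates in RAM. A sketch only, with illustrative values and a hypothetical file name; see sysctl#Virtual memory before changing anything:
/etc/sysctl.d/99-vm.conf
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10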
Input/output schedulers
Background information
The input/output (I/O) scheduler is the kernel component that decides in which order the block I/O operations are submitted to storage devices. It is useful to remind here some specifications of two main drive types because the goal of the I/O scheduler is to optimize the way these are able to deal with read requests:
- An HDD has spinning disks and a head that moves physically to the required location. Therefore, random latency is quite high ranging between 3 and 12ms (whether it is a high end server drive or a laptop drive and bypassing the disk controller write buffer) while sequential access provides much higher throughput. The typical HDD throughput is about 200 I/O operations per second (IOPS).
- An SSD does not have moving parts, random access is as fast as sequential one, typically under 0.1ms, and it can handle multiple concurrent requests. The typical SSD throughput is greater than 10,000 IOPS, which is more than needed in common workload situations.
If there are many processes making I/O requests to different storage parts, thousands of IOPS can be generated while a typical HDD can handle only about 200 IOPS. There is a queue of requests that have to wait for access to the storage. This is where the I/O scheduler plays an optimization role.
The scheduling algorithms
One way to improve throughput is to linearize access: by ordering waiting requests by their logical address and grouping the closest ones. Historically this was the first Linux I/O scheduler called elevator.
One issue with the elevator algorithm is that it is not optimal for a process doing sequential access: reading a block of data, processing it for several microseconds then reading next block and so on. The elevator scheduler does not know that the process is about to read another block nearby and, thus, moves to another request by another process at some other location. The anticipatory I/O scheduler overcomes the problem: it pauses for a few milliseconds in anticipation of another close-by read operation before dealing with another request.
While these schedulers try to improve total throughput, they might leave some unlucky requests waiting for a very long time. As an example, imagine the majority of processes make requests at the beginning of the storage space while an unlucky process makes a request at the other end of storage. This potentially infinite postponement of the process is called starvation. To improve fairness, the deadline algorithm was developed. It has a queue ordered by address, similar to the elevator, but if some request sits in this queue for too long then it moves to an "expired" queue ordered by expire time. The scheduler checks the expire queue first and processes requests from there and only then moves to the elevator queue. Note that this fairness has a negative impact on overall throughput.
The Completely Fair Queuing (CFQ) scheduler approaches the problem differently by allocating a timeslice and a number of allowed requests per queue depending on the priority of the process submitting them. It supports cgroups, which allow reserving some amount of I/O for a specific collection of processes. This is particularly useful for shared and cloud hosting: users who paid for some IOPS want to get their share whenever needed. It also idles at the end of synchronous I/O, waiting for other nearby operations, taking over this feature from the anticipatory scheduler and bringing some enhancements. Both the anticipatory and the elevator schedulers were eventually decommissioned from the Linux kernel and replaced by the more advanced alternatives presented below.
The Budget Fair Queuing (BFQ) is based on CFQ code and brings some enhancements. It does not grant the disk to each process for a fixed time-slice but assigns a "budget" measured in number of sectors to the process and uses heuristics. It is a relatively complex scheduler, it may be more adapted to rotational drives and slow SSDs because its high per-operation overhead, especially if associated with a slow CPU, can slow down fast devices. The objective of BFQ on personal systems is that for interactive tasks, the storage device is virtually as responsive as if it was idle. In its default configuration it focuses on delivering the lowest latency rather than achieving the maximum throughput.
Kyber is a recent scheduler inspired by active queue management techniques used for network routing. The implementation is based on "tokens" that serve as a mechanism for limiting requests. A queuing token is required to allocate a request, this is used to prevent starvation of requests. A dispatch token is also needed and limits the operations of a certain priority on a given device. Finally, a target read latency is defined and the scheduler tunes itself to reach this latency goal. The implementation of the algorithm is relatively simple and it is deemed efficient for fast devices.
Kernel's I/O schedulers
While some of the early algorithms have now been decommissioned, the official Linux kernel supports a number of I/O schedulers which can be split into two categories:
- The multi-queue schedulers are available by default with the kernel. The Multi-Queue Block I/O Queuing Mechanism (blk-mq) maps I/O queries to multiple queues, the tasks are distributed across threads and therefore CPU cores. Within this framework the following schedulers are available:
- None, where no queuing algorithm is applied.
- mq-deadline, the adaptation of the deadline scheduler (see below) to multi-threading.
- Kyber
- BFQ
- The single-queue schedulers are legacy schedulers:
- NOOP is the simplest scheduler, it inserts all incoming I/O requests into a simple FIFO queue and implements request merging. In this algorithm, there is no re-ordering of the request based on the sector number. Therefore it can be used if the ordering is dealt with at another layer, at the device level for example, or if it does not matter, for SSDs for instance.
- Deadline
- CFQ
- Note: Single-queue schedulers were removed from the kernel in Linux 5.0.
Changing the I/O scheduler
To list the available schedulers for a device and the active scheduler (in brackets):
$ cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
To list the available schedulers for all devices:
$ grep "" /sys/block/*/queue/scheduler
/sys/block/pktcdvd0/queue/scheduler:none
/sys/block/sda/queue/scheduler:mq-deadline kyber [bfq] none
/sys/block/sr0/queue/scheduler:[mq-deadline] kyber bfq none
To change the active I/O scheduler to bfq for device sda, use:
# echo bfq > /sys/block/sda/queue/scheduler
The process to change I/O scheduler, depending on whether the disk is rotating or not can be automated and persist across reboots. For example the udev rule below sets the scheduler to none for NVMe, mq-deadline for SSD/eMMC, and bfq for rotational drives:
/etc/udev/rules.d/60-ioschedulers.rules
# set scheduler for NVMe
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
# set scheduler for SSD and eMMC
ACTION=="add|change", KERNEL=="sd[a-z]|mmcblk[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# set scheduler for rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
Reboot or force udev#Loading new rules.
Tuning I/O schedulers
Each of the kernel's I/O schedulers has its own tunables, such as the latency time, the expiry time or the FIFO parameters. They are helpful in adjusting the algorithm to a particular combination of device and workload, typically to achieve a higher throughput or a lower latency for a given utilization. The tunables and their descriptions can be found within the kernel documentation.
To list the available tunables for a device, in the example below sdb which is using deadline, use:
$ ls /sys/block/sdb/queue/iosched
fifo_batch front_merges read_expire write_expire writes_starved
To improve deadline's throughput at the cost of latency, one can increase fifo_batch
with the command:
# echo 32 > /sys/block/sdb/queue/iosched/fifo_batch
Power management configuration
When dealing with traditional rotational disks (HDDs), you may want to lower or completely disable power saving features.
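For example, hdparm can raise the Advanced Power Management level and disable the standby (spin-down) timeout of a drive. This is only a sketch, and not every drive supports these settings; -B 254 selects the most performance-oriented APM level short of disabling APM, and -S 0 disables the standby timeout:
# hdparm -B 254 -S 0 /dev/sda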
Reduce disk reads/writes
Avoiding unnecessary access to slow storage drives is good for performance and also increases the lifetime of the devices, although on modern hardware the difference in life expectancy is usually negligible.
Show disk writes
The iotop package can sort by disk writes, and show how much and how frequently programs are writing to the disk. See iotop(8) for details.
Relocate files to tmpfs
Relocate files, such as your browser profile, to a tmpfs file system, for improvements in application response as all the files are now stored in RAM:
- Refer to Profile-sync-daemon for syncing browser profiles. Certain browsers might need special attention, see e.g. Firefox on RAM.
- Refer to Anything-sync-daemon for syncing any specified folder.
- Refer to Makepkg#Improving compile times for improving compile times by building packages in tmpfs.
File systems
Refer to corresponding file system page in case there were performance improvements instructions, e.g. Ext4#Improving performance and XFS#Performance.
Swap space
Writeback interval and buffer size
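The relevant knobs are virtual memory sysctl parameters. For example, increasing the writeback interval lets more writes be batched together, at the cost of keeping dirty data in RAM longer; the value below is purely illustrative, see sysctl#Virtual memory:
# sysctl vm.dirty_writeback_centisecs=1500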
Storage I/O scheduling with ionice
Many tasks, such as backups, do not rely on a short storage I/O delay or a high storage I/O bandwidth to fulfil their task; they can be classified as background tasks. On the other hand, quick I/O is necessary for good UI responsiveness on the desktop. Therefore it is beneficial to reduce the amount of storage bandwidth available to background tasks while other tasks are in need of storage I/O. This can be achieved by making use of the Linux I/O scheduler CFQ, which allows setting different priorities for processes.
The I/O priority of a background process can be reduced to the "Idle" level by starting it with
# ionice -c 3 command
See a short introduction to ionice and ionice(1) for more information.
CPU
Overclocking
Overclocking improves the computational performance of the CPU by increasing its peak clock frequency. The ability to overclock depends on the combination of CPU model and motherboard model, and it is usually done through the BIOS. Overclocking also has risks and disadvantages; it is neither recommended nor discouraged here.
Many Intel chips will not correctly report their clock frequency to acpi_cpufreq and most other utilities. This will result in excessive messages in dmesg, which can be avoided by unloading and blacklisting the kernel module acpi_cpufreq.
To read their clock speed use i7z from the i7z package. To check for correct operation of an overclocked CPU, it is recommended to do stress testing.
Frequency scaling
Tweak default scheduler (CFS) for responsiveness
The default CPU scheduler in the mainline Linux kernel is CFS.
The upstream default settings are tweaked for high throughput, which can make desktop applications unresponsive under heavy CPU load.
The cfs-zen-tweaksAUR package contains a script that sets up the CFS to use the same settings as the linux-zen kernel. To run the script on startup, enable/start set-cfs-tweaks.service.
Alternative CPU schedulers
- MuQSS — Multiple Queue Skiplist Scheduler. Available with the -ck patch set developed by Con Kolivas.
- PDS — Priority and Deadline based Skiplist multiple queue scheduler focused on desktop responsiveness.
Real-time kernel
Some applications, such as running a TV tuner card at full HD resolution (1080p), may benefit from using a realtime kernel.
Adjusting priorities of processes
See also nice(1) and renice(1).
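For example, a long-running job can be started at the lowest priority, or an already running process can be reniced to a higher nice value (i.e. lower priority); the command and PID below are placeholders:
$ nice -n 19 make
$ renice -n 10 -p 2845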
Ananicy
Ananicy is a daemon, available in the ananicy-gitAUR package, for auto adjusting the nice levels of executables. The nice level represents the priority of the executable when allocating CPU resources.
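As a rough, hypothetical sketch, a rule giving compiler processes the lowest priority might look like the following; the file name is arbitrary and the exact rule syntax should be checked against the Ananicy documentation:
/etc/ananicy.d/00-custom.rules
{"name": "cc1", "nice": 19}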
cgroups
See cgroups.
Cpulimit
Cpulimit is a program to limit the CPU usage percentage of a specific process. After installing cpulimit, you may limit the CPU usage of a process's PID using a scale of 0 to 100 times the number of CPU cores that the computer has. For example, with eight CPU cores the percentage range is 0 to 800. Usage:
$ cpulimit -l 50 -p 5081
irqbalance
The purpose of irqbalance is to distribute hardware interrupts across processors on a multiprocessor system in order to increase performance. It can be controlled by the provided irqbalance.service.
Turn off CPU exploit mitigations
Turning off CPU exploit mitigations may improve performance. Use the kernel parameter below to disable them all:
mitigations=off
The explanations of all the switches it toggles are given at kernel.org. You can use spectre-meltdown-checkerAUR for vulnerability check.
Graphics
Xorg configuration
Graphics performance may depend on the settings in /etc/X11/xorg.conf; see the NVIDIA, ATI and Intel articles. Improper settings may stop Xorg from working, so proceed with caution.
Mesa configuration
The performance of the Mesa drivers can be configured via drirc. GUI configuration tools are available:
- adriconf (Advanced DRI Configurator) — GUI tool to configure Mesa drivers by setting options and writing them to the standard drirc file.
- DRIconf — Configuration applet for the Direct Rendering Infrastructure. It allows customizing performance and visual quality settings of OpenGL drivers on a per-driver, per-screen and/or per-application level.
Hardware video acceleration
Hardware video acceleration makes it possible for the video card to decode/encode video.
Overclocking
As with the CPU, overclocking can directly improve performance, but is generally advised against. Overclocking tools in the AUR include rovclockAUR (ATI cards), rocm-smi-libAUR (recent AMD cards), nvclockAUR (old NVIDIA cards up to Geforce 9), and nvidia-utils for recent NVIDIA cards.
See AMDGPU#Overclocking or NVIDIA/Tips and tricks#Enabling overclocking.
Enabling PCI Resizable BAR
- On some systems enabling PCI Resizable BAR can result in a significant loss of performance. Benchmark your system to make sure it increases performance.
- The Compatibility Support Module (CSM) must be disabled for this to take effect.
The PCI specification allows larger Base Address Registers to be used for exposing PCI device memory to the PCI controller. This can result in a performance increase for video cards. Having access to the full video memory improves performance, but also enables optimizations in the graphics driver. The combination of resizable BAR, above 4G decoding and these driver optimizations is what AMD calls AMD Smart Access Memory, available at first on AMD Series 500 chipset motherboards, later expanded to AMD Series 400 and Intel Series 300 and later through UEFI updates. This setting may not be available on all motherboards, and is known to sometimes cause boot problems on certain boards.
If the BAR has a 256M size, the feature is not enabled or not supported:
# dmesg | grep BAR=
[drm] Detected VRAM RAM=8176M, BAR=256M
To enable it, enable the setting named "Above 4G Decode" or ">4GB MMIO" in your motherboard settings. Verify that the BAR is now larger:
# dmesg | grep BAR=
[drm] Detected VRAM RAM=8176M, BAR=8192M
RAM, swap and OOM handling
Clock frequency and timings
RAM can run at different clock frequencies and timings, which can be configured in the BIOS. Memory performance depends on both values. Selecting the highest preset presented by the BIOS usually improves the performance over the default setting. Note that increasing the frequency to values not supported by both motherboard and RAM vendor is overclocking, and similar risks and disadvantages apply, see #Overclocking.
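The currently configured memory speed can be inspected from the SMBIOS tables, for example with dmidecode (requires root, and the dmidecode package if not already installed):
# dmidecode --type memory | grep -i speed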
Root on RAM overlay
If running off a slow writing medium (USB, spinning HDDs) and storage requirements are low, the root may be run on a RAM overlay on top of a read-only root (on disk). This can vastly improve performance at the cost of a limited writable space to root. See liverootAUR.
zram or zswap
The kernel module zram (previously called compcache) provides a compressed block device in RAM. If it is used as swap space, the RAM can hold much more data, at the cost of additional CPU usage; it is still much faster than swap on a hard drive. On a system that frequently falls back to swap, zram can improve responsiveness. Using zram also reduces disk reads and writes, which helps extend the lifespan of an SSD when swap is located on one.
zswap brings similar benefits (and similar costs). The difference is that zswap compresses pages before swapping them out to a swap device, whereas zram swaps them into RAM. See zswap for the differences between the two.
Example: to set up a 32 GiB zram device with the lz4 compression algorithm and a high priority (for the current session only):
# modprobe zram
# echo lz4 > /sys/block/zram0/comp_algorithm
# echo 32G > /sys/block/zram0/disksize
# mkswap --label zram0 /dev/zram0
# swapon --priority 100 /dev/zram0
To disable it, either reboot or run:
# swapoff /dev/zram0
# rmmod zram
Detailed steps, options and potential problems are described in the official documentation of the zram module.
zram-generator provides a [email protected] unit to automatically initialize zram devices. The unit does not need to be enabled/started. The following provides the information needed to use it:
"The generator will be invoked by systemd early at boot"; therefore, using it only requires creating a configuration file and rebooting. A simple configuration example is provided at /usr/share/doc/zram-generator/zram-generator.conf.example. The state of zram can be checked via the swap status, or via the status of systemd-zram-setup@zramN.service, where /dev/zramN is the device defined in the configuration file.
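A minimal configuration sketch, adapted from the shipped example (the size and algorithm are illustrative; see zram-generator.conf(5) for the supported options):
/etc/systemd/zram-generator.conf
[zram0]
zram-size = ram / 2
compression-algorithm = zstd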
The package zramswapAUR provides an automated script for setting up a swap with a higher priority and a default size of 20% of the RAM size of your system. To do this automatically on every boot, enable zramswap.service.
Alternatively, zramdAUR sets up zram automatically, using the zstd algorithm by default. Its configuration file is located at /etc/default/zramd and the zramd.service unit needs to be enabled/started.
Swap on zram using a udev rule
The example below describes how to set up swap on zram automatically at boot with a single udev rule. No extra package should be needed to make this work.
First, enable the module:
/etc/modules-load.d/zram.conf
zram
Configure the number of /dev/zram nodes you need.
/etc/modprobe.d/zram.conf
options zram num_devices=2
Create the udev rule as shown in the example.
/etc/udev/rules.d/99-zram.rules
KERNEL=="zram0", ATTR{disksize}="512M" RUN="/usr/bin/mkswap /dev/zram0", TAG+="systemd"
KERNEL=="zram1", ATTR{disksize}="512M" RUN="/usr/bin/mkswap /dev/zram1", TAG+="systemd"
Add /dev/zram to your fstab.
/etc/fstab
/dev/zram0 none swap defaults 0 0
/dev/zram1 none swap defaults 0 0
Swap on video RAM
In the rare case that a system has very little RAM and a surplus of video RAM, the video RAM can be used as swap space. See Swap on video RAM.
Improving system responsiveness under low-memory conditions
On a traditional GNU/Linux system, especially on graphical workstations, when allocated memory is overcommitted, the overall system responsiveness may degrade to a nearly unusable state before either triggering the in-kernel OOM-killer or having a sufficient amount of memory freed (which is unlikely to happen quickly when the system is unresponsive, since you can hardly close any memory-hungry applications that may continue to allocate more memory). The behaviour also depends on specific setups and conditions; returning to a normal responsive state may take from a few seconds to more than half an hour, which can be painful to wait for in a serious scenario such as a conference presentation.
Check whether the value of /proc/sys/vm/oom_kill_allocating_task is 0 and consider changing it. [2]
While the behaviour of the kernel, as well as that of userspace, under low-memory conditions may improve in the future, as discussed on the kernel and Fedora mailing lists, users have more feasible and effective options than hard-resetting the system or tuning the vm.overcommit_* sysctl parameters:
- Manually trigger the kernel OOM-killer with the Magic SysRq key, namely Alt+SysRq+f.
- Use a userspace OOM daemon to tackle these automatically (or interactively).
Sometimes a user may prefer an OOM daemon over SysRq because with the kernel OOM-killer you cannot prioritize which processes to terminate (or to spare). To list some OOM daemons:
- systemd-oomd — Provided by systemd as systemd-oomd.service, which uses cgroups-v2 and pressure stall information (PSI) to monitor and take action on processes before an OOM occurs in kernel space.
- earlyoom — Simple userspace OOM-killer implementation written in C.
- oomd — OOM-killer implementation based on PSI, requires Linux kernel version 4.20+. Configuration is in JSON and is quite complex. Confirmed to work in Facebook's production environment.
- nohang — Sophisticated OOM handler written in Python, with optional PSI support, more configurable than earlyoom.
- low-memory-monitor — GNOME developer's effort that aims to provide better communication to userspace applications to indicate the low memory state; besides that, it can be configured to trigger the kernel OOM-killer. Based on PSI, requires Linux 5.2+.
- uresourced — A small daemon that enables cgroup based resource protection for the active graphical user session.
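As a usage example, systemd-oomd is part of systemd, so it only needs its service to be started (the other daemons listed above ship with their own packages and units):
# systemctl enable --now systemd-oomd.service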
Network
- Kernel networking: see Sysctl#Improving performance
- NIC: see Network configuration#Set device MTU and queue length
- DNS: consider using a caching DNS resolver, see Domain name resolution#DNS servers
- Samba: see Samba#Improve throughput
Watchdogs
According to Wikipedia:Watchdog timer:
- A watchdog timer [...] is an electronic timer that is used to detect and recover from computer malfunctions. During normal operation, the computer regularly resets the watchdog timer [...]. If, [...], the computer fails to reset the watchdog, the timer will elapse and generate a timeout signal [...] used to initiate corrective [...] actions [...] typically include placing the computer system in a safe state and restoring normal system operation.
Many users need this feature due to their system's mission-critical role (i.e. servers), or because of the lack of power reset (i.e. embedded devices). Thus, this feature is required for a good operation in some situations. On the other hand, normal users (i.e. desktop and laptop) do not need this feature and can disable it.
To disable watchdog timers (both software and hardware), append nowatchdog
to your boot parameters.
To check the new configuration do:
# cat /proc/sys/kernel/watchdog
or use:
# wdctl
After you have disabled watchdogs, you can optionally avoid loading the module responsible for the hardware watchdog, too. Do it by blacklisting the related module, e.g. iTCO_wdt.
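A minimal sketch of such a blacklist file (the file name is arbitrary; iTCO_wdt is the module for Intel TCO watchdogs and your hardware may use a different one):
/etc/modprobe.d/disable-watchdog.conf
blacklist iTCO_wdt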
Note: Some users report that the nowatchdog parameter does not work as expected, but that they have successfully disabled the watchdog (at least the hardware one) by blacklisting the above-mentioned module.
Either action will speed up boot and shutdown, because one less module is loaded. Additionally, disabling watchdog timers increases performance and lowers power consumption.
See [3], [4], [5], and [6] for more information.