EDIT: this is a full benchmark I run on my pool: https://gist.github.com/thegabriele97/9d82ddfbf0f4ec00dbcebc4d6cda29b3.
Hi! I ran into this issue since I started mu homelab adventure a couple of months ago, so I am still very noob, sorry for this.
I decided today to understand what happens and why it happens but I need your help to understand it better.
My homelab consists of a proxmox setup with three 1 TB HDD s in raidz1 (ZFS) (I know the downsides of this and I took my decisions) and 8 GB of RAM, of which 3.5 are assigned to a VM. The remaining parts are used by some LXC containers.
During high worloads (i.e. copying a file, downloading something via torrent/jdownloader) everything is very slow and other services start to be unresponsive due to the high IO delay.
I decided to test the three single devices with this command:
fio --ioengine=libaio --filename=/dev/sda --size=4G --time_based --name=fio --group_reporting --runtime=10 --direct=1 --sync=1 --iodepth=1 --rw=randread --bs=4k --numjobs=32
And more or less they (sda, sdb, sdc) give this results:
Jobs: 32 (f=32): [r(32)][100.0%][r=436KiB/s][r=109 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=3350293: Sat Jun 24 11:07:02 2023
read: IOPS=119, BW=479KiB/s (490kB/s)(4968KiB/10378msec)
slat (nsec): min=4410, max=40660, avg=12374.56, stdev=5066.56
clat (msec): min=17, max=780, avg=260.78, stdev=132.27
lat (msec): min=17, max=780, avg=260.79, stdev=132.27
clat percentiles (msec):
| 1.00th=[ 26], 5.00th=[ 50], 10.00th=[ 80], 20.00th=[ 140],
| 30.00th=[ 188], 40.00th=[ 230], 50.00th=[ 264], 60.00th=[ 296],
| 70.00th=[ 326], 80.00th=[ 372], 90.00th=[ 430], 95.00th=[ 477],
| 99.00th=[ 617], 99.50th=[ 634], 99.90th=[ 768], 99.95th=[ 785],
| 99.99th=[ 785]
bw ( KiB/s): min= 256, max= 904, per=100.00%, avg=484.71, stdev= 6.17, samples=639
iops : min= 64, max= 226, avg=121.14, stdev= 1.54, samples=639
lat (msec) : 20=0.32%, 50=4.91%, 100=8.13%, 250=32.85%, 500=49.68%
lat (msec) : 750=3.86%, 1000=0.24%
cpu : usr=0.01%, sys=0.00%, ctx=1246, majf=11, minf=562
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=1242,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=479KiB/s (490kB/s), 479KiB/s-479KiB/s (490kB/s-490kB/s), io=4968KiB (5087kB), run=10378-10378msec
Disk stats (read/write):
sda: ios=1470/89, merge=6/7, ticks=385624/14369, in_queue=405546, util=96.66%
Am I wrong or it is a very bad results? Why? The three identical HDs are this one: https://smarthdd.com/database/APPLE-HDD-HTS541010A9E662/JA0AB560/
I jope you can help me. Thank you!
I have a zfs raid1 with 5 disks, and had some very bad performance. I used atop to figure out that one disk was the problem. I replaced that disk, resynced, and now performance is as expected.
At one point I was having intermittent performance issues with my pool, and the issue turned out to be scrubs being too aggressive (even though most all the documentation I read said scrubs should not adversely impact user I/O, they totally did)