Discussion:
Impact of page cache on OSD read performance for SSD
Somnath Roy
2014-09-23 18:05:17 UTC
Hi Sage,
I have created the following setup in order to examine how a single OSD behaves when, say, ~80-90% of the I/Os are hitting the SSD.

My test includes the following steps.

1. Created a single OSD cluster.
2. Created two rbd images (110GB each) on 2 different pools.
3. Populated both images entirely, so my working set is ~210GB. My system memory is ~16GB.
4. Dropped the page cache before every run (a quick sketch follows this list).
5. Ran fio_rbd (QD 32, 8 instances) in parallel on these two images.
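
(For reference, a minimal sketch of step 4; this is the usual way to drop the page cache between runs, run as root:)

sync                                  # flush dirty data first
echo 3 > /proc/sys/vm/drop_caches     # drop page cache plus dentries/inodes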

Here is my disk iops/bandwidth..

***@emsclient:~/fio_test# fio rad_resd_disk.job
random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0 iops] [eta 00m:00s]
random-reads: (groupid=0, jobs=1): err= 0: pid=1431
read : io=9316.4MB, bw=158994KB/s, iops=39748 , runt= 60002msec

My fio_rbd config..

[global]
ioengine=rbd
clientname=admin
pool=rbd1
rbdname=ceph_regression_test1
invalidate=0 # mandatory
rw=randread
bs=4k
direct=1
time_based
runtime=2m
size=109G
numjobs=8
[rbd_iodepth32]
iodepth=32
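
(A rough sketch of how the two images were loaded in parallel; the job file names and the second job file targeting the other pool/image are hypothetical, assumed to mirror the config above:)

fio rbd_randread_pool1.job &    # pool=rbd1, rbdname=ceph_regression_test1 (the config above)
fio rbd_randread_pool2.job &    # same settings against the second pool/image
wait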

Now, I have run Giant Ceph on top of that..

1. OSD config with 25 shards/1 thread per shard :
-------------------------------------------------------

avg-cpu: %user %nice %system %iowait %steal %idle
22.04 0.00 16.46 45.86 0.00 15.64

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 9.00 0.00 6.00 0.00 92.00 30.67 0.01 1.33 0.00 1.33 1.33 0.80
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 181.00 0.00 34961.00 0.00 176740.00 0.00 10.11 102.71 2.92 2.92 0.00 0.03 100.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00


ceph -s:
----------
***@emsclient:~# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e498: 1 osds: 1 up, 1 in
pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 75215 kB/s rd, 18803 op/s

cpu util:
----------
Gradually decreases from ~21 cores (while serving from cache) to ~10 cores (while serving from disk).

My Analysis:
-----------------
In this case "All is Well" till ios are served from cache (XFS is smart enough to cache some data ) . Once started hitting disks and throughput is decreasing. As you can see, disk is giving ~35K iops , but, OSD throughput is only ~18.8K ! So, cache miss in case of buffered io seems to be very
expensive. Half of the iops are waste. Also, looking at the bandwidth, it is obvious, not everything is 4K read, May be kernel read_ahead is kicking (?).
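
(If read_ahead is the suspect, it can be checked and disabled per device; sdh is the OSD data disk in the iostat output above:)

cat /sys/block/sdh/queue/read_ahead_kb        # default is usually 128
echo 0 > /sys/block/sdh/queue/read_ahead_kb   # disable readahead; blockdev --setra 0 /dev/sdh also works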


Next, I thought of making the Ceph disk reads direct_io and repeating the same experiment. I changed FileStore::read to use direct_io only; the rest was kept as is. Here is the result with that.


Iostat:
-------

avg-cpu: %user %nice %system %iowait %steal %idle
24.77 0.00 19.52 21.36 0.00 34.36

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 25295.00 0.00 101180.00 0.00 8.00 12.73 0.50 0.50 0.00 0.04 100.80
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

ceph -s:
--------
***@emsclient:~/fio_test# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e522: 1 osds: 1 up, 1 in
pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 100 MB/s rd, 25618 op/s

cpu util:
--------
~14 cores while serving from disk.

My Analysis:
---------------
No surprises here. Ceph throughput almost matches the disk throughput.
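
(As a side note, the buffered-vs-direct gap can be sanity-checked outside Ceph by running fio against a large file on the OSD's XFS mount; the path below is only an example, and the page cache should be dropped before the buffered run:)

fio --name=buffered --ioengine=libaio --rw=randread --bs=4k --iodepth=32 \
    --direct=0 --time_based --runtime=60 --size=100G \
    --filename=/var/lib/ceph/osd/ceph-0/fio_testfile

fio --name=direct --ioengine=libaio --rw=randread --bs=4k --iodepth=32 \
    --direct=1 --time_based --runtime=60 --size=100G \
    --filename=/var/lib/ceph/osd/ceph-0/fio_testfile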


Let's tweak the shard/thread settings and see the impact.
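
(For reference, the shard/thread settings tweaked below are the sharded op-queue options in ceph.conf; as far as I know the OSD needs a restart for them to take effect. A sketch for run #2:)

cat >> /etc/ceph/ceph.conf <<'EOF'     # or edit the existing [osd] section instead of appending
[osd]
    osd_op_num_shards = 36
    osd_op_num_threads_per_shard = 1
EOF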


2. OSD config with 36 shards and 1 thread/shard:
-----------------------------------------------------------

Buffered read:
------------------
No change, output is very similar to 25 shards.


direct_io read:
------------------
Iostat:
----------
avg-cpu: %user %nice %system %iowait %steal %idle
33.33 0.00 28.22 23.11 0.00 15.34

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 2.00 0.00 12.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 31987.00 0.00 127948.00 0.00 8.00 18.06 0.56 0.56 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

ceph -s:
--------------
***@emsclient:~/fio_test# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e525: 1 osds: 1 up, 1 in
pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 127 MB/s rd, 32763 op/s

cpu util:
--------------
~19 cores while serving from disk.

Analysis:
------------------
It is scaling with the increased number of shards/threads; the parallelism also increased significantly.


3. OSD config with 48 shards and 1 thread/shard:
----------------------------------------------------------
Buffered read:
-------------------
No change, output is very similar to 25 shards.


direct_io read:
-----------------
Iostat:
--------

avg-cpu: %user %nice %system %iowait %steal %idle
37.50 0.00 33.72 20.03 0.00 8.75

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 35360.00 0.00 141440.00 0.00 8.00 22.25 0.62 0.62 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

ceph -s:
--------------
***@emsclient:~/fio_test# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e534: 1 osds: 1 up, 1 in
pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 138 MB/s rd, 35582 op/s

cpu util:
----------------
~22.5 cores while serving from disk.

Analysis:
--------------------
It is scaling with the increased number of shards/threads; the parallelism also increased significantly.



4. OSD config with 64 shards and 1 thread/shard:
---------------------------------------------------------
Buffered read:
------------------
No change, output is very similar to 25 shards.


direct_io read:
-------------------
Iostat:
---------
avg-cpu: %user %nice %system %iowait %steal %idle
40.18 0.00 34.84 19.81 0.00 5.18

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 39114.00 0.00 156460.00 0.00 8.00 35.58 0.90 0.90 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

ceph -s:
---------------
***@emsclient:~/fio_test# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e537: 1 osds: 1 up, 1 in
pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 153 MB/s rd, 39172 op/s

cpu util:
----------------
~24.5 cores while serving from disk; only ~3% CPU left.

Analysis:
------------------
It is scaling with the increased number of shards/threads, and the parallelism also increased significantly. It is disk bound now.


Summary:

So, it seems buffered I/O has a significant impact on performance when the backend is an SSD.
My question is: if the workload is very random and the storage (SSD) is very large compared to system memory, shouldn't we always go for direct_io instead of buffered I/O in Ceph?

Please share your thoughts/suggestions on this.

Thanks & Regards
Somnath

Milosz Tanski
2014-09-23 19:09:23 UTC
Somnath,

I wonder if there's a bottleneck or a point of contention in the kernel. For an entirely uncached workload I'd expect the page cache lookup to cause some slowdown (since the lookup is wasted). What I wouldn't expect is a 45% performance drop. Memory speed should be an order of magnitude faster than a modern SATA SSD, so the overhead should be closer to negligible.

Is there any way you could perform the same test but monitor what's going on in the OSD process with the perf tool? Whatever the default CPU-time hardware counter is will be fine. Make sure you have the kernel debug-info package installed so you can get symbol information for kernel and module calls. With any luck, the diff of the perf output from the two runs will show us the culprit.
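
(Something along these lines, assuming a single ceph-osd process on the box; the 60s sampling window is arbitrary:)

perf record -g -p $(pidof ceph-osd) -- sleep 60   # sample the OSD while the fio workload is running
perf report --stdio > perf_buffered.txt           # repeat for the direct_io run and diff the two reports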

Also, can you tell us what OS/kernel version you're using on the OSD machines?

- Milosz
--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: ***@adfin.com
Somnath Roy
2014-09-23 19:24:44 UTC
Milosz,
Thanks for the response. I will see if I can get any information out of perf.

Here is my OS information.

***@emsclient:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
***@emsclient:~# uname -a
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

BTW, it's not just a 45% drop; as you can see, by tuning the OSD parameters I was able to get almost *2X* performance improvement with direct_io.
It's not only the page cache (memory) lookup; with buffered I/O the following could also be problems:

1. Double copy (disk -> file buffer cache, file buffer cache -> user buffer).

2. As the iostat output shows, it is not reading only 4K; it is reading more data from disk than required, which in the end is wasted for a random workload.

Thanks & Regards
Somnath

Sage Weil
2014-09-23 19:29:48 UTC
It might be worth using blktrace to see what I/Os it is actually issuing: which ones are > 4K, and what they point to...
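
(Something like the following on the OSD data disk; sdh as in the iostat output:)

blktrace -d /dev/sdh -w 60 -o sdh_trace    # capture 60s of block-layer events
blkparse -i sdh_trace | less               # completed reads show the request size in sectors after the '+'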

sage
Haomai Wang
2014-09-24 02:06:48 UTC
Good point, but have you considered the impact on write ops? And if we skip the page cache, is FileStore then responsible for data caching?
--
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Somnath Roy
2014-09-24 02:29:20 UTC
Permalink
Haomai,
I am considering only random reads here, and the changes I made affect only reads. For writes, I have not measured yet; but yes, the page cache may be helpful for write coalescing. I still need to evaluate how it behaves compared to direct_io on SSD, though. I think the Ceph code path would be much shorter if we used direct_io in the write path where it actually executes the transactions. Probably the sync thread and all would not be needed.

I am trying to analyze where the extra reads are coming from in the buffered io case by using blktrace etc. This should give us a clear understanding of what exactly is going on there, and it may turn out that by tuning kernel parameters alone we can achieve performance similar to direct_io.
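(As an aside, and purely an assumption to check rather than something measured here: if blktrace shows the extra bytes are readahead, it can be suppressed per file descriptor with posix_fadvise(POSIX_FADV_RANDOM), or capped globally via /sys/block/<dev>/queue/read_ahead_kb, without going all the way to O_DIRECT. A minimal sketch, with a made-up function name:)

#include <fcntl.h>
#include <cstdio>

// Open a file for random reads and ask the kernel to skip readahead on
// this fd. Data still flows through the page cache, so a 4K request should
// stay a 4K disk read instead of pulling in neighbouring pages.
int open_for_random_reads(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0)
    return -1;
  int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
  if (rc != 0)
    fprintf(stderr, "posix_fadvise failed: %d\n", rc);
  return fd;
}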

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:***@gmail.com]
Sent: Tuesday, September 23, 2014 7:07 PM
To: Sage Weil
Cc: Somnath Roy; Milosz Tanski; ceph-***@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

Good point, but have you considered the impact on write ops? And if we skip the page cache, is FileStore then responsible for data caching?
Post by Sage Weil
Post by Somnath Roy
Milosz,
Thanks for the response. I will see if I can get any information out of perf.
Here is my OS information.
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46
UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters I was able to get almost *2X* performance improvement with direct_io.
It's not only the page cache (memory) lookup; in the case of buffered_io the following could be problems.
1. Double copy (disk -> file buffer cache, file buffer cache -> user buffer)
2. As the iostat output shows, it is not reading 4K only, it is
reading more data from disk than required and in the end it will be
wasted in the case of a random workload.
It might be worth using blktrace to see what the IOs it is issuing are.
Which ones are > 4K and what they point to...
sage
Post by Somnath Roy
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 12:09 PM
To: Somnath Roy
Subject: Re: Impact of page cache on OSD read performance for SSD
Somnath,
I wonder if there's a bottleneck or a point of contention in the kernel. For an entirely uncached workload I expect the page cache lookup to cause a slowdown (since the lookup is wasted). What I wouldn't expect is a 45% performance drop. Memory speed should be an order of magnitude faster than a modern SATA SSD drive (so the overhead should be closer to negligible).
Is there any way you could perform the same test but monitor what's going on with the OSD process using the perf tool? Whatever the default cpu-time-spent hardware counter is will be fine. Make sure you have the kernel debug info package installed so you can get symbol information for kernel and module calls. With any luck the diff of the perf output from the two runs will show us the culprit.
Also, can you tell us what OS/kernel version you're using on the OSD machines?
- Milosz
Haomai Wang
2014-09-24 04:01:03 UTC
Permalink
I agree that direct reads will help for disk reads. But if the read
data is hot and small enough to fit in memory, the page cache is a good
place to hold the data cache. If we discard the page cache, we need to
implement a cache of our own with an effective lookup implementation.

BTW, on whether to use direct io we can refer to MySQL's InnoDB engine
(direct io) and PostgreSQL (page cache).
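(To make the "effective lookup" point concrete, here is a toy sketch of the kind of structure that would be needed if the page cache were bypassed: O(1) hash lookup combined with LRU eviction. Keys and values are plain strings purely for illustration; a real FileStore cache would also need locking, memory accounting and invalidation on writes, so this is not a proposal for the actual implementation.)

#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy object-data cache: unordered_map gives O(1) lookup, the list keeps
// LRU order, and splice() moves an entry to the front without reallocating.
class ObjectDataCache {
  typedef std::list<std::pair<std::string, std::string> > Lru;
  size_t capacity_;
  Lru lru_;  // front = most recently used
  std::unordered_map<std::string, Lru::iterator> index_;

public:
  explicit ObjectDataCache(size_t capacity) : capacity_(capacity) {}

  // On a hit, copy the data out and bump the entry to the MRU position.
  bool get(const std::string& key, std::string* out) {
    auto it = index_.find(key);
    if (it == index_.end())
      return false;
    lru_.splice(lru_.begin(), lru_, it->second);
    *out = it->second->second;
    return true;
  }

  // Insert or refresh an entry, evicting the least recently used one
  // when the cache grows past its capacity.
  void put(const std::string& key, const std::string& data) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = data;
      lru_.splice(lru_.begin(), lru_, it->second);
      return;
    }
    lru_.push_front(std::make_pair(key, data));
    index_[key] = lru_.begin();
    if (lru_.size() > capacity_) {
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
  }
};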
--
Best Regards,
Wheat
Sage Weil
2014-09-24 12:38:08 UTC
Permalink
Post by Haomai Wang
I agree that direct reads will help for disk reads. But if the read data
is hot and small enough to fit in memory, the page cache is a good place to
hold the data cache. If we discard the page cache, we need to implement a
cache of our own with an effective lookup implementation.
This is true for some workloads, but not necessarily for all. Many
clients (notably RBD) will be caching at the client side (in the VM's fs, and
possibly in librbd itself) such that caching at the OSD is largely wasted
effort. For RGW the opposite is likely true, unless there is a varnish cache
or something in front.

We should probably have a direct_io config option for filestore. But even
better would be some hint from the client about whether it is caching or
not so that FileStore could conditionally cache...
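(A rough sketch of what that conditional path could look like; the option and hint names below are made up for illustration and are not existing Ceph settings. A real hint would presumably ride on the op itself.)

#include <fcntl.h>

// Hypothetical per-read decision: use O_DIRECT when a config option forces
// it, or when the client has signalled that it caches on its own side
// (librbd cache, guest page cache), since caching the same data again in
// the OSD's page cache would mostly be wasted.
struct ReadHints {
  bool filestore_direct_io;   // made-up config option
  bool client_is_caching;     // made-up per-op hint from the client
};

inline int choose_read_flags(const ReadHints& h) {
  int flags = O_RDONLY;
  if (h.filestore_direct_io || h.client_is_caching)
    flags |= O_DIRECT;
  return flags;
}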

sage
Mark Nelson
2014-09-24 13:27:40 UTC
Permalink
Post by Sage Weil
We should probably have a direct_io config option for filestore. But even
better would be some hint from the client about whether it is caching or
not so that FileStore could conditionally cache...
I like the hinting idea. Having said that, if the effect being seen is
due to page cache, it seems like something is off. We've seen
performance issues in the kernel before so it's not unprecedented.
Working around it with direct IO could be the right way to go, but it
might be that this is something that could be fixed higher up and
improve performance in other scenarios too. I'd hate to let it go by
the wayside if we could find something actionable.
Milosz Tanski
2014-09-24 14:29:44 UTC
Permalink
Post by Sage Weil
I agree that direct reads will help for disk reads. But if the read data
is hot and small enough to fit in memory, the page cache is a good place to
hold the data cache. If we discard the page cache, we need to implement a
cache that provides an effective lookup implementation.
This is true for some workloads, but not necessarily true for all. Many
clients (notably RBD) will be caching at the client side (in the VM's fs, and
possibly in librbd itself) such that caching at the OSD is largely wasted
effort. For RGW the same is likely true, unless there is a varnish cache
or something in front.
We should probably have a direct_io config option for filestore. But even
better would be some hint from the client about whether it is caching or
not so that FileStore could conditionally cache...
I like the hinting idea. Having said that, if the effect being seen is due
to the page cache, it seems like something is off. We've seen performance
issues in the kernel before, so it's not unprecedented. Working around it
with direct IO could be the right way to go, but it might be that this is
something that could be fixed higher up and improve performance in other
scenarios too. I'd hate to let it go by the wayside if we could find
something actionable.
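(To make the hinting idea a bit more concrete, here is a rough sketch of a read helper that drops the pages it just read when the client says it caches on its own side. Purely illustrative: the flag and the helper are invented for this example, this is not existing FileStore code.

#include <fcntl.h>
#include <unistd.h>

// Illustrative only: 'client_is_caching' would come from a client-supplied
// hint; the helper name and plumbing are invented for this sketch.
ssize_t read_with_cache_hint(int fd, void *buf, size_t len, off_t off,
                             bool client_is_caching)
{
  ssize_t r = pread(fd, buf, len, off);        // normal buffered read
  if (r > 0 && client_is_caching) {
    // The client keeps its own cache, so these pages are unlikely to be
    // re-read here; ask the kernel to drop them from the page cache.
    posix_fadvise(fd, off, r, POSIX_FADV_DONTNEED);
  }
  return r;
}
)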
Post by Sage Weil
sage
BTW, on whether to use direct IO, we can refer to MySQL's InnoDB engine,
which uses direct IO, and PostgreSQL, which uses the page cache.
Post by Somnath Roy
Haomai,
I am only considering random reads, and the changes I made only
affect reads. For writes, I have not measured yet. But, yes, the page cache
may be helpful for write coalescing. I still need to evaluate how it behaves
compared to direct_io on SSD, though. I think the Ceph code path will be
much shorter if we use direct_io in the write path, where it is actually
executing the transactions. Probably, the sync thread and all will not be
needed.
I am trying to analyze where the extra reads are coming from in the case of
buffered IO by using blktrace etc. This should give us a clear understanding
of what exactly is going on there, and it may turn out that by tuning kernel
parameters alone we can achieve performance similar to direct_io.
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 7:07 PM
To: Sage Weil
Subject: Re: Impact of page cache on OSD read performance for SSD
Good point, but have you considered the impact on write ops?
And if we skip the page cache, is FileStore responsible for the data cache?
Post by Sage Weil
Post by Somnath Roy
Milosz,
Thanks for the response. I will see if I can get any information out of perf.
Here is my OS information.
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46
UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters
I was able to get almost a *2X* performance improvement with direct_io.
It's not only the page cache (memory) lookup; in the case of buffered_io the
following could be problems.
1. Double copy (disk -> file buffer cache, file buffer cache -> user buffer)
2. As the iostat output shows, it is not reading only 4K; it is
reading more data from disk than required, and in the end it will be
wasted in the case of a random workload.
It might be worth using blktrace to see what the IOs it is issuing are.
Which ones are > 4K and what they point to...
sage
Post by Somnath Roy
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 12:09 PM
To: Somnath Roy
Subject: Re: Impact of page cache on OSD read performance for SSD
Somnath,
I wonder if there's a bottleneck or a point of contention in the
kernel. For an entirely uncached workload I expect the page cache lookup to
cause a slowdown (since the lookup is wasted). What I wouldn't
expect is a 45% performance drop. Memory should be an order of magnitude
faster than a modern SATA SSD drive (so the overhead should be closer to
negligible).
Is there any way you could perform the same test but monitor what's
going on with the OSD process using the perf tool? Whatever the default
CPU-time hardware counter is, that's fine. Make sure you have the kernel debug
info package installed so you can get symbol information for kernel and module
calls. With any luck the diff of the perf output from the two runs will show us the
culprit.
Also, can you tell us what OS/kernel version you're using on the OSD machines?
- Milosz
I wonder how much (if any) using posix_fadvise with the
POSIX_FADV_RANDOM hint would help in this case, as that tells the kernel
not to perform (aggressive) read-ahead.

Sadly, POSIX_FADV_NOREUSE is a no-op in current kernels, although
there have been patches floating around over the years to implement it:
http://lxr.free-electrons.com/source/mm/fadvise.c#L113 and
http://thread.gmane.org/gmane.linux.file-systems/61511
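(For reference, the hint itself is a one-liner on the open descriptor; a minimal sketch, assuming the object file is already open:

#include <fcntl.h>

// Sketch: tell the kernel to expect random access on this descriptor,
// which disables (aggressive) read-ahead. offset=0, len=0 covers the
// whole file.
int expect_random_access(int fd)
{
  return posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}
)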
--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: ***@adfin.com
Haomai Wang
2014-09-24 16:05:40 UTC
Permalink
Post by Sage Weil
I agree that direct reads will help for disk reads. But if the read
data is hot and small enough to fit in memory, the page cache is a good
place to hold the data cache. If we discard the page cache, we need to
implement a cache that provides an effective lookup implementation.
This is true for some workloads, but not necessarily true for all.
Many clients (notably RBD) will be caching at the client side (in the VM's
fs, and possibly in librbd itself) such that caching at the OSD is
largely wasted effort. For RGW the same is likely true, unless there
is a varnish cache or something in front.
Still, I don't think the librbd cache can meet all the cache demands for
rbd usage. Even if we had an effective librbd cache implementation, we would
still need a buffer cache at the ObjectStore level, just like databases do.
Client cache and host cache are both needed.
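(As a rough illustration of what an ObjectStore-level buffer cache could look like: a toy LRU keyed by object extent. This is purely a sketch, not Ceph code, and the key/value types are simplified to strings.

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class LruBlockCache {
public:
  explicit LruBlockCache(size_t max_entries) : max_(max_entries) {}

  // Returns true and fills *out on a hit; promotes the entry to MRU.
  bool get(const std::string &key, std::string *out) {
    auto it = index_.find(key);
    if (it == index_.end())
      return false;
    lru_.splice(lru_.begin(), lru_, it->second);   // move to front
    *out = it->second->second;
    return true;
  }

  // Inserts or refreshes an entry, evicting the LRU one if over capacity.
  void put(const std::string &key, std::string data) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = std::move(data);
      lru_.splice(lru_.begin(), lru_, it->second);
      return;
    }
    lru_.emplace_front(key, std::move(data));
    index_[key] = lru_.begin();
    if (index_.size() > max_) {
      index_.erase(lru_.back().first);             // drop least recently used
      lru_.pop_back();
    }
  }

private:
  typedef std::list<std::pair<std::string, std::string> > Entries;
  size_t max_;
  Entries lru_;                                     // front = most recent
  std::unordered_map<std::string, Entries::iterator> index_;
};

// Usage sketch: cache.put("obj123:0", data); std::string d; cache.get("obj123:0", &d);
)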
Post by Sage Weil
We should probably have a direct_io config option for filestore. But even
better would be some hint from the client about whether it is caching or
not so that FileStore could conditionally cache...
Yes, I remember we already did some early work along those lines.
Somnath Roy
2014-09-24 23:49:11 UTC
Permalink
Hi,
After going through the blktrace output, I think I have figured out what is going on. Kernel read_ahead is causing the extra reads in the buffered-read case. If I set read_ahead = 0, the performance I get is similar to (or better than, when a cache hit actually happens) direct_io :-)
IMHO, if a user wants to avoid these kernel effects and is sure of the random workload pattern, we should provide a configurable direct_io read option (direct_io writes also need to be quantified), as Sage suggested.
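(For anyone who wants to try the same thing, the knob in question is the per-device read_ahead_kb setting; a small sketch that sets it to 0. The device name sdh matches the data disk in the iostat output above, adjust as needed; it has to run as root.

#include <fstream>

int main()
{
  // Equivalent to: echo 0 > /sys/block/sdh/queue/read_ahead_kb
  std::ofstream ra("/sys/block/sdh/queue/read_ahead_kb");
  if (!ra)
    return 1;               // wrong device name, or not running as root
  ra << 0 << std::endl;     // 0 KB read-ahead: no speculative reads
  return ra.good() ? 0 : 1;
}
)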

Thanks & Regards
Somnath
Haomai Wang
2014-09-25 02:55:59 UTC
Permalink
Post by Somnath Roy
Hi,
After going through the blktrace output, I think I have figured out what is going on. Kernel read_ahead is causing the extra reads in the buffered-read case. If I set read_ahead = 0, the performance I get is similar to (or better than, when a cache hit actually happens) direct_io :-)
Hmm, BTW, if you set read_ahead=0, how does sequential read performance
compare to before?
Post by Somnath Roy
IMHO, if any user doesn't want these nasty kernel effects and be sure of the random work pattern, we should provide a configurable direct_io read option (Need to quantify direct_io write also) as Sage suggested.
Thanks & Regards
Somnath
-----Original Message-----
Sent: Wednesday, September 24, 2014 9:06 AM
To: Sage Weil
Subject: Re: Impact of page cache on OSD read performance for SSD
Post by Sage Weil
Post by Haomai Wang
I agree with that direct read will help for disk read. But if read
data is hot and small enough to fit in memory, page cache is a good
place to hold data cache. If discard page cache, we need to implement
a cache to provide with effective lookup impl.
This is true for some workloads, but not necessarily true for all.
Many clients (notably RBD) will be caching at the client side (in VM's
fs, and possibly in librbd itself) such that caching at the OSD is
largely wasted effort. For RGW the often is likely true, unless there
is a varnish cache or something in front.
Still now, I don't think librbd cache can meet all the cache demand for rbd usage. Even though we have a effective librbd cache impl, we still need a buffer cache in ObjectStore level just like what database did. Client cache and host cache are both needed.
Post by Sage Weil
We should probably have a direct_io config option for filestore. But
even better would be some hint from the client about whether it is
caching or not so that FileStore could conditionally cache...
Yes, I remember we already did some early works like it.
Post by Sage Weil
sage
Post by Haomai Wang
BTW, whether to use direct io we can refer to MySQL Innodb engine
with direct io and PostgreSQL with page cache.
Post by Somnath Roy
Haomai,
I am considering only about random reads and the changes I made only affecting reads. For write, I have not measured yet. But, yes, page cache may be helpful for write coalescing. Still need to evaluate how it is behaving comparing direct_io on SSD though. I think Ceph code path will be much shorter if we use direct_io in the write path where it is actually executing the transactions. Probably, the sync thread and all will not be needed.
I am trying to analyze where is the extra reads coming from in case of buffered io by using blktrace etc. This should give us a clear understanding what exactly is going on there and it may turn out that tuning kernel parameters only we can achieve similar performance as direct_io.
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 7:07 PM
To: Sage Weil
Subject: Re: Impact of page cache on OSD read performance for SSD
Good point, but do you have considered that the impaction for write ops? And if skip page cache, FileStore is responsible for data cache?
Post by Sage Weil
Post by Somnath Roy
Milosz,
Thanks for the response. I will see if I can get any information out of perf.
Here is my OS information.
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9
16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
BTW, it's not a 45% drop, as you can see, by tuning the OSD parameter I was able to get almost *2X* performance improvement with direct_io.
It's not only page cache (memory) lookup, in case of buffered_io the following could be problem.
1. Double copy (disk -> file buffer cache, file buffer cache ->
user
buffer)
2. As the iostat output shows, it is not reading 4K only, it is
reading more data from disk as required and in the end it will be
wasted in case of random workload..
It might be worth using blktrace to see what the IOs it is issueing are.
Which ones are > 4K and what they point to...
sage
Post by Somnath Roy
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 12:09 PM
To: Somnath Roy
Subject: Re: Impact of page cache on OSD read performance for SSD
Somnath,
I wonder if there's a bottleneck or a point of contention for the kernel. For a entirely uncached workload I expect the page cache lookup to cause a slow down (since the lookup should be wasted). What I wouldn't expect is a 45% performance drop. Memory speed should be one magnitude faster then a modern SATA SSD drive (so it should be more negligible overhead).
Is there anyway you could perform the same test but monitor what's going on with the OSD process using the perf tool? Whatever is the default cpu time spent hardware counter is fine. Make sure you have the kernel debug info package installed so can get symbol information for kernel and module calls. With any luck the diff in perf output in two runs will show us the culprit.
Also, can you tell us what OS/kernel version you're using on the OSD machines?
- Milosz
Post by Somnath Roy
Hi Sage,
I have created the following setup in order to examine how a single OSD is behaving if say ~80-90% of ios hitting the SSDs.
My test includes the following steps.
1. Created a single OSD cluster.
2. Created two rbd images (110GB each) on 2 different pools.
3. Populated entire image, so my working set is ~210GB. My system memory is ~16GB.
4. Dumped page cache before every run.
5. Ran fio_rbd (QD 32, 8 instances) in parallel on these two images.
Here is my disk iops/bandwidth..
random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0 iops] [eta 00m:00s]
random-reads: (groupid=0, jobs=1): err= 0: pid=1431
read : io=9316.4MB, bw=158994KB/s, iops=39748 , runt= 60002msec
My fio_rbd config..
[global]
ioengine=rbd
clientname=admin
pool=rbd1
rbdname=ceph_regression_test1
invalidate=0 # mandatory
rw=randread
bs=4k
direct=1
time_based
runtime=2m
size=109G
numjobs=8
[rbd_iodepth32]
iodepth=32
Now, I have run Giant Ceph on top of that..
-------------------------------------------------------
avg-cpu: %user %nice %system %iowait %steal %idle
22.04 0.00 16.46 45.86 0.00 15.64
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 9.00 0.00 6.00 0.00 92.00 30.67 0.01 1.33 0.00 1.33 1.33 0.80
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 181.00 0.00 34961.00 0.00 176740.00 0.00 10.11 102.71 2.92 2.92 0.00 0.03 100.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
----------
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e498: 1 osds: 1 up, 1 in
pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 75215 kB/s rd, 18803 op/s
----------
Gradually decreases from ~21 core (serving from cache) to ~10 core (while serving from disks).
-----------------
In this case "All is Well" till ios are served from cache
(XFS is smart enough to cache some data ) . Once started hitting disks and throughput is decreasing. As you can see, disk is giving ~35K iops , but, OSD throughput is only ~18.8K ! So, cache miss in case of buffered io seems to be very expensive. Half of the iops are waste. Also, looking at the bandwidth, it is obvious, not everything is 4K read, May be kernel read_ahead is kicking (?).
Now, I thought of making ceph disk read as direct_io and do the same experiment. I have changed the FileStore::read to do the direct_io only. Rest kept as is. Here is the result with that.
-------
avg-cpu: %user %nice %system %iowait %steal %idle
24.77 0.00 19.52 21.36 0.00 34.36
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 25295.00 0.00 101180.00 0.00 8.00 12.73 0.50 0.50 0.00 0.04 100.80
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
--------
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e522: 1 osds: 1 up, 1 in
pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 100 MB/s rd, 25618 op/s
--------
~14 core while serving from disks.
---------------
No surprises here. Whatever is disk throughput ceph throughput is almost matching.
Let's tweak the shard/thread settings and see the impact.
-----------------------------------------------------------
------------------
No change, output is very similar to 25 shards.
------------------
----------
avg-cpu: %user %nice %system %iowait %steal %idle
33.33 0.00 28.22 23.11 0.00 15.34
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 2.00 0.00 12.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 31987.00 0.00 127948.00 0.00 8.00 18.06 0.56 0.56 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
--------------
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e525: 1 osds: 1 up, 1 in
pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 127 MB/s rd, 32763 op/s
--------------
~19 core while serving from disks.
------------------
It is scaling with increased number of shards/threads. The parallelism also increased significantly.
----------------------------------------------------------
-------------------
No change, output is very similar to 25 shards.
-----------------
--------
avg-cpu: %user %nice %system %iowait %steal %idle
37.50 0.00 33.72 20.03 0.00 8.75
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 35360.00 0.00 141440.00 0.00 8.00 22.25 0.62 0.62 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
--------------
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e534: 1 osds: 1 up, 1 in
pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 138 MB/s rd, 35582 op/s
----------------
~22.5 core while serving from disks.
--------------------
It is scaling with increased number of shards/threads. The parallelism also increased significantly.
---------------------------------------------------------
------------------
No change, output is very similar to 25 shards.
-------------------
---------
avg-cpu: %user %nice %system %iowait %steal %idle
40.18 0.00 34.84 19.81 0.00 5.18
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 39114.00 0.00 156460.00 0.00 8.00 35.58 0.90 0.90 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
---------------
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e537: 1 osds: 1 up, 1 in
pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 153 MB/s rd, 39172 op/s
----------------
~24.5 core while serving from disks. ~3% cpu left.
------------------
It is scaling with the increased number of shards/threads. The parallelism has also increased significantly. It is disk bound now.
So, it seems buffered IO has a significant impact on performance when the backend is an SSD.
My question is: if the workload is very random and the storage (SSD) is very large compared to system memory, shouldn't we always go for direct_io instead of buffered IO from Ceph?
Please share your thoughts/suggestions on this.
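
For reference, here is a minimal sketch of what a direct-IO read path has to do (illustration only, not the actual FileStore::read change; the 4096-byte alignment value is an assumption for a typical page/logical block size):

// Illustration only (not the actual FileStore::read patch): an aligned
// O_DIRECT pread that bypasses the page cache. O_DIRECT requires the
// buffer, the file offset and the length to be block aligned.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

ssize_t direct_read(const char *path, off_t offset, size_t len, char *out)
{
  const size_t align = 4096;                  // assumed block/page size
  int fd = open(path, O_RDONLY | O_DIRECT);   // bypass the page cache
  if (fd < 0)
    return -1;

  // Round the request out to aligned boundaries.
  off_t  aligned_off = offset & ~static_cast<off_t>(align - 1);
  size_t front_pad   = static_cast<size_t>(offset - aligned_off);
  size_t aligned_len = ((front_pad + len + align - 1) / align) * align;

  void *buf = nullptr;
  if (posix_memalign(&buf, align, aligned_len) != 0) {
    close(fd);
    return -1;
  }

  ssize_t ret = -1;
  ssize_t got = pread(fd, buf, aligned_len, aligned_off);
  if (got >= static_cast<ssize_t>(front_pad)) {
    size_t usable = static_cast<size_t>(got) - front_pad;
    ret = static_cast<ssize_t>(usable < len ? usable : len);
    // Copy back only the range the caller actually asked for.
    memcpy(out, static_cast<char *>(buf) + front_pad, ret);
  }
  free(buf);
  close(fd);
  return ret;
}

The alignment handling and the extra copy back into the caller's buffer are part of the cost of going direct, which is probably one more reason this should stay a configurable option rather than the default.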
Thanks & Regards
Somnath
________________________________
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
Somnath Roy
2014-09-25 03:15:12 UTC
Permalink
It (sequential read performance) will definitely be hampered.
There is no single setting that fits all workloads; these parameters need to be tuned based on the workload.

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:***@gmail.com]
Sent: Wednesday, September 24, 2014 7:56 PM
To: Somnath Roy
Cc: Sage Weil; Milosz Tanski; ceph-***@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD
Post by Somnath Roy
Hi,
After going through the blktrace, I think I have figured out what is
going on there. I think kernel read_ahead is causing the extra reads
in the buffered-read case. If I set read_ahead = 0, the performance I
am getting is similar to (or better than, when a cache hit actually
happens) direct_io :-)
Hmm, BTW if set read_ahead=0, what about seq read performance compared to before?
Post by Somnath Roy
IMHO, if a user doesn't want these nasty kernel effects and is sure of the random workload pattern, we should provide a configurable direct_io read option (direct_io writes also need to be quantified), as Sage suggested.
Thanks & Regards
Somnath
Chen, Xiaoxi
2014-09-25 05:00:18 UTC
Permalink
Have you ever seen a large read_ahead_kb hurt random read performance?

We usually set it to a very large value (2M) and the random read performance stays steady, even in an all-SSD setup. Maybe with your optimization code for OP_QUEUE things are different?

Somnath Roy
2014-09-25 07:10:48 UTC
Permalink
Well, you never know!
It depends upon a lot of factors, starting from your workload, different kernel params, RAID controller, etc. I have shared my observation in my environment with a 4K pseudo-random fio_rbd workload. A truly random workload should not kick off read_ahead, though.
The OP_QUEUE optimization brings more parallelism to the filestore reads, so more reads going to disk in parallel may have exposed this.
Anyway, I am in the process of analyzing why the default read_ahead is causing a problem for me; I will update if I find anything.
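
For anyone reproducing the comparison, the block-layer readahead can be changed per device at runtime through /sys/block/<dev>/queue/read_ahead_kb. A small sketch (the device name sdh and the default value of 0 are assumptions matching my setup):

// Sketch: set read_ahead_kb for one block device via sysfs (needs root).
// The default device name "sdh" is an assumption; pass your own, e.g.
//   ./set_ra sdh 0
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char **argv)
{
  std::string dev = (argc > 1) ? argv[1] : "sdh";
  std::string kb  = (argc > 2) ? argv[2] : "0";   // 0 disables readahead

  std::ofstream f("/sys/block/" + dev + "/queue/read_ahead_kb");
  if (!f) {
    std::cerr << "cannot open read_ahead_kb for " << dev << " (are you root?)\n";
    return 1;
  }
  f << kb << std::endl;                           // endl flushes the write
  return f.good() ? 0 : 1;
}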

Thanks & Regards
Somnath

Somnath Roy
2014-09-30 00:23:46 UTC
Permalink
Hi,
I did some digging on the blktrace output to understand why this read_ahead_kb setting is impacting performance in my setup (which is a single-OSD cluster). Here is the result.

99% of the IOs during the blktrace collection window are performed by the following processes.

1. The ceph-osd process (including the "unknown" processes, which I figured out are just different threads of the OSD):


Events               Read_ahead_kb=128   Read_ahead_kb=0   Direct_io
Reads Queued                   4140687           4168816     4042634
Read Dispatches                7734617           5660597     4839428
Reads Requeued                 4574032           1789149      944688
Reads Completed                2532893           2996269     3027387
Reads Merges                      6415                 2           0
IO unplugs                     3380175            100911     4042714

2. Swapper process

Events               Read_ahead_kb=128   Read_ahead_kb=0   Direct_io
Reads Queued                         0                 0           0
Read Dispatches                  1836K            459028      258743
Reads Requeued                   1129K            254808      132605
Reads Completed                  1175K            937138      891107
Reads Merges                         0                 0           0
IO unplugs                           0                 0           0


Now, comparing the total amount of reads that happened during this time across the 3 different settings:

Events               Read_ahead_kb=128   Read_ahead_kb=0   Direct_io
Reads Queued                     4140K             4168K       4042K
Read Dispatches                 10390K             6363K       5151K
Reads Requeued                   6256K             2194K       1108K
Reads Completed                  4134K             4168K       4042K
Reads Merges                      6415                 2           0
IO unplugs                     3380183            100924     4042721


Here is my analysis of this.

1. There are a lot more read dispatches (~4M more than with read_ahead_kb = 0) when read_ahead_kb = 128.
2. The swapper process (which I think is doing the read-ahead?) issues a lot more reads when read_ahead_kb = 128.
3. Read merges are almost 0 in all cases other than the 1st one, which suggests the workload is very random (?). The higher merge count in the 1st case is probably due to read_ahead (?).

Some open questions:

1. Why are the completed reads fewer than the dispatches? Is "completed" the ceph read completions plus the swapper read completions? Even then it does not match the dispatch count.
2. Why are IO unplugs huge for read_ahead_kb = 128 and direct_io compared to read_ahead_kb = 0?
3. Why so many requeues?
4. Does requeued + queued = dispatched? (see the quick check below)
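
On question 4, a quick check against the totals table above (using my own numbers, rounded to thousands) suggests queued + requeued does roughly equal dispatched in all three cases:

// Quick arithmetic check of "queued + requeued == dispatched?" using the
// rounded totals (in K) from the table above.
#include <cstdio>

int main()
{
  struct Row { const char *name; long queued, requeued, dispatched; };
  const Row rows[] = {
    { "read_ahead_kb=128", 4140, 6256, 10390 },
    { "read_ahead_kb=0",   4168, 2194,  6363 },
    { "direct_io",         4042, 1108,  5151 },
  };
  for (const Row &r : rows)
    std::printf("%-18s queued+requeued=%5ldK  dispatched=%5ldK\n",
                r.name, r.queued + r.requeued, r.dispatched);
  return 0;
}

The sums come out to 10396K, 6362K and 5150K, which match the dispatch counts within rounding, so the answer to (4) looks like yes.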

I tried setting different kernel parameters like nr_requests/scheduler/rq_affinity/vm_cache_pressure etc., but in my workload I still consistently get ~50% improvement by setting read_ahead_kb = 0.

I don't have much expertise in the Linux block layer, so I am reaching out to the community for answers/suggestions.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Thursday, September 25, 2014 12:11 AM
To: 'Chen, Xiaoxi'; Haomai Wang
Cc: Sage Weil; Milosz Tanski; ceph-***@vger.kernel.org
Subject: RE: Impact of page cache on OSD read performance for SSD

Well, you never know !
It depends upon lot of factors starting from your workload/different kernel params/RAID controller etc. etc. I have shared my observation in my environment with 4K pseudo random fio_rbd workload. True random, should not kick off read_ahead though.
OP_QUEUE optimization is bringing more parallelism in the filestore read , so, more read going to disk in parallel may have exposed this.
Anyways, I am in process of analyzing why default read_ahead is causing problem for me, will update if I find any..

Thanks & Regards
Somnath

-----Original Message-----
From: Chen, Xiaoxi [mailto:***@intel.com]
Sent: Wednesday, September 24, 2014 10:00 PM
To: Somnath Roy; Haomai Wang
Cc: Sage Weil; Milosz Tanski; ceph-***@vger.kernel.org
Subject: RE: Impact of page cache on OSD read performance for SSD

Have you ever seen large readahead_kb would hear random performance?

We usually set it to very large (2M) , the random read performance keep steady, even in all SSD setup. Maybe with your optimization code for OP_QUEUE, the things may different?

-----Original Message-----
From: ceph-devel-***@vger.kernel.org [mailto:ceph-devel-***@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, September 25, 2014 11:15 AM
To: Haomai Wang
Cc: Sage Weil; Milosz Tanski; ceph-***@vger.kernel.org
Subject: RE: Impact of page cache on OSD read performance for SSD

It will be definitely hampered.
There will not be a single solution fits all. These parameters needs to be tuned based on the workload.

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:***@gmail.com]
Sent: Wednesday, September 24, 2014 7:56 PM
To: Somnath Roy
Cc: Sage Weil; Milosz Tanski; ceph-***@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD
Post by Somnath Roy
Hi,
After going through the blktrace, I think I have figured out what is going on there. I think kernel read_ahead is causing the extra reads in the buffered read case. If I set read_ahead = 0, the performance I am getting is similar to direct_io (or better when cache hits actually happen) :-)
Hmm, BTW, if you set read_ahead = 0, what about sequential read performance compared to before?
Post by Somnath Roy
IMHO, if a user doesn't want these nasty kernel effects and is sure of the random workload pattern, we should provide a configurable direct_io read option (direct_io writes also need to be quantified), as Sage suggested.
Thanks & Regards
Somnath
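
To make the comparison concrete, this is roughly what a direct_io read looks like at the syscall level. It is a standalone sketch, not the actual FileStore::read change: the file is opened with O_DIRECT and read into a block-aligned buffer (the path and 4K block size are placeholders; compile with g++, which defines _GNU_SOURCE so O_DIRECT is visible).

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  const char* path = argc > 1 ? argv[1] : "testfile";  // placeholder path
  const size_t kBlock = 4096;                          // assumes 4K logical blocks

  int fd = open(path, O_RDONLY | O_DIRECT);
  if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

  void* buf = nullptr;
  if (posix_memalign(&buf, kBlock, kBlock) != 0) { close(fd); return 1; }

  // With O_DIRECT the buffer, length and file offset must all be block
  // aligned; the read bypasses the page cache entirely.
  ssize_t r = pread(fd, buf, kBlock, 0);
  if (r < 0) perror("pread");
  else printf("read %zd bytes, bypassing the page cache\n", r);

  free(buf);
  close(fd);
  return r < 0 ? 1 : 0;
}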
-----Original Message-----
Sent: Wednesday, September 24, 2014 9:06 AM
To: Sage Weil
Subject: Re: Impact of page cache on OSD read performance for SSD
Post by Sage Weil
Post by Haomai Wang
I agree that direct reads will help for disk reads. But if the read
data is hot and small enough to fit in memory, the page cache is a good
place to hold cached data. If we discard the page cache, we need to
implement a cache that provides an effective lookup.
This is true for some workloads, but not necessarily true for all.
Many clients (notably RBD) will be caching at the client side (in
VM's fs, and possibly in librbd itself) such that caching at the OSD
is largely wasted effort. For RGW the same is likely true, unless
there is a varnish cache or something in front.
Even now, I don't think the librbd cache can meet all the cache demands for rbd usage. Even with an effective librbd cache implementation, we still need a buffer cache at the ObjectStore level, just like databases have. Client cache and host cache are both needed.
Post by Sage Weil
We should probably have a direct_io config option for filestore. But
even better would be some hint from the client about whether it is
caching or not so that FileStore could conditionally cache...
Yes, I remember we already did some early work like this.
Post by Sage Weil
sage
Post by Haomai Wang
BTW, on the question of whether to use direct IO, we can look at MySQL's
InnoDB engine (direct IO) versus PostgreSQL (page cache).
Post by Somnath Roy
Haomai,
I am only considering random reads, and the changes I made affect only reads. For writes, I have not measured yet. But, yes, the page cache may be helpful for write coalescing; I still need to evaluate how it behaves compared to direct_io on SSD, though. I think the Ceph code path will be much shorter if we use direct_io in the write path where it actually executes the transactions. Probably the sync thread and all will not be needed.
I am trying to analyze where the extra reads are coming from in the buffered IO case by using blktrace etc. This should give us a clear understanding of exactly what is going on there, and it may turn out that by tuning kernel parameters alone we can achieve performance similar to direct_io.
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 7:07 PM
To: Sage Weil
Subject: Re: Impact of page cache on OSD read performance for SSD
Good point, but have you considered the impact on write ops? And if we skip the page cache, is FileStore then responsible for the data cache?
Post by Sage Weil
Post by Somnath Roy
Milosz,
Thanks for the response. I will see if I can get any information out of perf.
Here is my OS information.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9
16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters I was able to get almost a *2X* performance improvement with direct_io.
It's not only the page cache (memory) lookup; in the buffered IO case the following could be problems:
1. Double copy (disk -> file buffer cache, file buffer cache -> user buffer)
2. As the iostat output shows, it is not reading only 4K; it is reading more data from disk than required, and in the end that extra data is wasted for a random workload.
It might be worth using blktrace to see what IOs it is issuing.
Which ones are > 4K and what they point to...
sage
Post by Somnath Roy
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 12:09 PM
To: Somnath Roy
Subject: Re: Impact of page cache on OSD read performance for SSD
Somnath,
I wonder if there's a bottleneck or a point of contention in the kernel. For an entirely uncached workload I expect the page cache lookup to cause a slowdown (since the lookup should be wasted). What I wouldn't expect is a 45% performance drop. Memory speed should be an order of magnitude faster than a modern SATA SSD drive (so the overhead should be closer to negligible).
Is there any way you could perform the same test but monitor what's going on with the OSD process using the perf tool? Whatever the default CPU-time hardware counter is will be fine. Make sure you have the kernel debug info package installed so you can get symbol information for kernel and module calls. With any luck the diff of the perf output from the two runs will show us the culprit.
Also, can you tell us what OS/kernel version you're using on the OSD machines?
- Milosz
--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016
p: 646-253-9055
--
Best Regards,
Wheat
Sage Weil
2014-09-25 14:29:50 UTC
Permalink
Post by Somnath Roy
It will definitely be hampered.
There is no single solution that fits all; these parameters need to be tuned based on the workload.
Can you do a test to see if fadvise with FADV_RANDOM is sufficient to
prevent the readahead behavior? If so, we can potentially accomplish this
with proper IO hinting from the clients.

sage
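
For reference, a minimal sketch of that experiment using the raw syscall rather than any Ceph code: POSIX_FADV_RANDOM asks the kernel to disable readahead on the descriptor while reads stay buffered through the page cache. The file path below is a placeholder.

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
  const char* path = argc > 1 ? argv[1] : "testfile";  // placeholder path
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }

  // Offset 0, length 0 means "the whole file"; the advice only affects
  // this descriptor. Reads remain buffered, but readahead is dropped.
  int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
  if (rc != 0) fprintf(stderr, "posix_fadvise: error %d\n", rc);

  char buf[4096];
  ssize_t r = pread(fd, buf, sizeof(buf), 0);  // should stay a single 4K read
  if (r < 0) perror("pread");

  close(fd);
  return 0;
}

If a 4K random read run with this hint matches the read_ahead_kb = 0 numbers, the hint alone would be enough and FileStore would not need to switch to O_DIRECT for reads.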