Discussion:
Impact of page cache on OSD read performance for SSD
Somnath Roy
2014-09-23 18:05:17 UTC
Hi Sage,
I have created the following setup in order to examine how a single OSD behaves when, say, ~80-90% of the I/Os are hitting the SSD.

My test includes the following steps.

1. Created a single OSD cluster.
2. Created two rbd images (110GB each) on 2 different pools.
3. Populated both images entirely, so my working set is ~210GB. My system memory is ~16GB.
4. Dropped the page cache before every run (a quick sketch follows this list).
5. Ran fio_rbd (QD 32, 8 instances) in parallel on these two images.
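
(For reference, a minimal sketch of step 4; this is the usual way to drop the page cache between runs, run as root:)

sync                                  # flush dirty data first
echo 3 > /proc/sys/vm/drop_caches     # drop page cache plus dentries/inodes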

Here is my disk iops/bandwidth..

***@emsclient:~/fio_test# fio rad_resd_disk.job
random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0 iops] [eta 00m:00s]
random-reads: (groupid=0, jobs=1): err= 0: pid=1431
read : io=9316.4MB, bw=158994KB/s, iops=39748 , runt= 60002msec

My fio_rbd config..

[global]
ioengine=rbd
clientname=admin
pool=rbd1
rbdname=ceph_regression_test1
invalidate=0 # mandatory
rw=randread
bs=4k
direct=1
time_based
runtime=2m
size=109G
numjobs=8
[rbd_iodepth32]
iodepth=32
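
(A rough sketch of how the two images were loaded in parallel; the job file names and the second job file targeting the other pool/image are hypothetical, assumed to mirror the config above:)

fio rbd_randread_pool1.job &    # pool=rbd1, rbdname=ceph_regression_test1 (the config above)
fio rbd_randread_pool2.job &    # same settings against the second pool/image
wait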

Now, I have run Giant Ceph on top of that..

1. OSD config with 25 shards/1 thread per shard :
-------------------------------------------------------

avg-cpu: %user %nice %system %iowait %steal %idle
22.04 0.00 16.46 45.86 0.00 15.64

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 9.00 0.00 6.00 0.00 92.00 30.67 0.01 1.33 0.00 1.33 1.33 0.80
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 181.00 0.00 34961.00 0.00 176740.00 0.00 10.11 102.71 2.92 2.92 0.00 0.03 100.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00


ceph -s:
----------
***@emsclient:~# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e498: 1 osds: 1 up, 1 in
pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 75215 kB/s rd, 18803 op/s

cpu util:
----------
Gradually decreases from ~21 cores (while serving from cache) to ~10 cores (while serving from disk).

My Analysis:
-----------------
In this case "All is Well" till ios are served from cache (XFS is smart enough to cache some data ) . Once started hitting disks and throughput is decreasing. As you can see, disk is giving ~35K iops , but, OSD throughput is only ~18.8K ! So, cache miss in case of buffered io seems to be very
expensive. Half of the iops are waste. Also, looking at the bandwidth, it is obvious, not everything is 4K read, May be kernel read_ahead is kicking (?).
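
(If read_ahead is the suspect, it can be checked and disabled per device; sdh is the OSD data disk in the iostat output above:)

cat /sys/block/sdh/queue/read_ahead_kb        # default is usually 128
echo 0 > /sys/block/sdh/queue/read_ahead_kb   # disable readahead; blockdev --setra 0 /dev/sdh also works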


Next, I thought of making the Ceph disk reads direct_io and repeating the same experiment. I changed FileStore::read to use direct_io only; the rest was kept as is. Here is the result with that.


Iostat:
-------

avg-cpu: %user %nice %system %iowait %steal %idle
24.77 0.00 19.52 21.36 0.00 34.36

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 25295.00 0.00 101180.00 0.00 8.00 12.73 0.50 0.50 0.00 0.04 100.80
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

ceph -s:
--------
***@emsclient:~/fio_test# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e522: 1 osds: 1 up, 1 in
pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 100 MB/s rd, 25618 op/s

cpu util:
--------
~14 cores while serving from disk.

My Analysis:
---------------
No surprises here. Ceph throughput almost matches the disk throughput.
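
(As a side note, the buffered-vs-direct gap can be sanity-checked outside Ceph by running fio against a large file on the OSD's XFS mount; the path below is only an example, and the page cache should be dropped before the buffered run:)

fio --name=buffered --ioengine=libaio --rw=randread --bs=4k --iodepth=32 \
    --direct=0 --time_based --runtime=60 --size=100G \
    --filename=/var/lib/ceph/osd/ceph-0/fio_testfile

fio --name=direct --ioengine=libaio --rw=randread --bs=4k --iodepth=32 \
    --direct=1 --time_based --runtime=60 --size=100G \
    --filename=/var/lib/ceph/osd/ceph-0/fio_testfile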


Let's tweak the shard/thread settings and see the impact.
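
(For reference, the shard/thread settings tweaked below are the sharded op-queue options in ceph.conf; as far as I know the OSD needs a restart for them to take effect. A sketch for run #2:)

cat >> /etc/ceph/ceph.conf <<'EOF'     # or edit the existing [osd] section instead of appending
[osd]
    osd_op_num_shards = 36
    osd_op_num_threads_per_shard = 1
EOF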


2. OSD config with 36 shards and 1 thread/shard:
-----------------------------------------------------------

Buffered read:
------------------
No change, output is very similar to 25 shards.


direct_io read:
------------------
Iostat:
----------
avg-cpu: %user %nice %system %iowait %steal %idle
33.33 0.00 28.22 23.11 0.00 15.34

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 2.00 0.00 12.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 31987.00 0.00 127948.00 0.00 8.00 18.06 0.56 0.56 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

ceph -s:
--------------
***@emsclient:~/fio_test# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e525: 1 osds: 1 up, 1 in
pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 127 MB/s rd, 32763 op/s

cpu util:
--------------
~19 cores while serving from disk.

Analysis:
------------------
It is scaling with the increased number of shards/threads; the parallelism also increased significantly.


3. OSD config with 48 shards and 1 thread/shard:
----------------------------------------------------------
Buffered read:
-------------------
No change, output is very similar to 25 shards.


direct_io read:
-----------------
Iostat:
--------

avg-cpu: %user %nice %system %iowait %steal %idle
37.50 0.00 33.72 20.03 0.00 8.75

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 35360.00 0.00 141440.00 0.00 8.00 22.25 0.62 0.62 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

ceph -s:
--------------
***@emsclient:~/fio_test# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e534: 1 osds: 1 up, 1 in
pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 138 MB/s rd, 35582 op/s

cpu util:
----------------
~22.5 cores while serving from disk.

Analysis:
--------------------
It is scaling with the increased number of shards/threads; the parallelism also increased significantly.



4. OSD config with 64 shards and 1 thread/shard:
---------------------------------------------------------
Buffered read:
------------------
No change, output is very similar to 25 shards.


direct_io read:
-------------------
Iostat:
---------
avg-cpu: %user %nice %system %iowait %steal %idle
40.18 0.00 34.84 19.81 0.00 5.18

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 39114.00 0.00 156460.00 0.00 8.00 35.58 0.90 0.90 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

ceph -s:
---------------
***@emsclient:~/fio_test# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e537: 1 osds: 1 up, 1 in
pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 153 MB/s rd, 39172 op/s

cpu util:
----------------
~24.5 cores while serving from disk; only ~3% CPU left.

Analysis:
------------------
It is scaling with the increased number of shards/threads, and the parallelism also increased significantly. It is disk bound now.


Summary:

So, it seems buffered I/O has a significant impact on performance when the backend is an SSD.
My question is: if the workload is very random and the storage (SSD) is very large compared to system memory, shouldn't we always go for direct_io instead of buffered I/O in Ceph?

Please share your thoughts/suggestions on this.

Thanks & Regards
Somnath

Milosz Tanski
2014-09-23 19:09:23 UTC
Somnath,

I wonder if there's a bottleneck or a point of contention in the kernel. For an entirely uncached workload I'd expect the page cache lookup to cause some slowdown (since the lookup is wasted). What I wouldn't expect is a 45% performance drop. Memory speed should be an order of magnitude faster than a modern SATA SSD, so the overhead should be closer to negligible.

Is there any way you could perform the same test but monitor what's going on in the OSD process with the perf tool? Whatever the default CPU-time hardware counter is will be fine. Make sure you have the kernel debug-info package installed so you can get symbol information for kernel and module calls. With any luck, the diff of the perf output from the two runs will show us the culprit.
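
(Something along these lines, assuming a single ceph-osd process on the box; the 60s sampling window is arbitrary:)

perf record -g -p $(pidof ceph-osd) -- sleep 60   # sample the OSD while the fio workload is running
perf report --stdio > perf_buffered.txt           # repeat for the direct_io run and diff the two reports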

Also, can you tell us what OS/kernel version you're using on the OSD machines?

- Milosz
--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: ***@adfin.com
Somnath Roy
2014-09-23 19:24:44 UTC
Milosz,
Thanks for the response. I will see if I can get any information out of perf.

Here is my OS information.

***@emsclient:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
***@emsclient:~# uname -a
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

BTW, it's not just a 45% drop; as you can see, by tuning the OSD parameters I was able to get almost *2X* performance improvement with direct_io.
It's not only the page cache (memory) lookup; with buffered I/O the following could also be problems:

1. Double copy (disk -> file buffer cache, file buffer cache -> user buffer).

2. As the iostat output shows, it is not reading only 4K; it is reading more data from disk than required, which in the end is wasted for a random workload.

Thanks & Regards
Somnath

Sage Weil
2014-09-23 19:29:48 UTC
It might be worth using blktrace to see what I/Os it is actually issuing: which ones are > 4K, and what they point to...
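
(Something like the following on the OSD data disk; sdh as in the iostat output:)

blktrace -d /dev/sdh -w 60 -o sdh_trace    # capture 60s of block-layer events
blkparse -i sdh_trace | less               # completed reads show the request size in sectors after the '+'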

sage
Haomai Wang
2014-09-24 02:06:48 UTC
Good point, but have you considered the impact on write ops? And if we skip the page cache, is FileStore then responsible for data caching?
--
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Somnath Roy
2014-09-24 02:29:20 UTC
Permalink
Haomai,
I am considering only random reads here, and the changes I made affect only reads. For writes, I have not measured yet; but yes, the page cache may be helpful for write coalescing. I still need to evaluate how it behaves compared to direct_io on SSD, though. I think the Ceph code path would be much shorter if we used direct_io in the write path where it actually executes the transactions. Probably the sync thread and all would not be needed.

I am trying to analyze where the extra reads are coming from in the buffered io case by using blktrace etc. This should give us a clear understanding of what exactly is going on there, and it may turn out that by tuning kernel parameters alone we can achieve performance similar to direct_io.
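(As an aside, and purely an assumption to check rather than something measured here: if blktrace shows the extra bytes are readahead, it can be suppressed per file descriptor with posix_fadvise(POSIX_FADV_RANDOM), or capped globally via /sys/block/<dev>/queue/read_ahead_kb, without going all the way to O_DIRECT. A minimal sketch, with a made-up function name:)

#include <fcntl.h>
#include <cstdio>

// Open a file for random reads and ask the kernel to skip readahead on
// this fd. Data still flows through the page cache, so a 4K request should
// stay a 4K disk read instead of pulling in neighbouring pages.
int open_for_random_reads(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0)
    return -1;
  int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
  if (rc != 0)
    fprintf(stderr, "posix_fadvise failed: %d\n", rc);
  return fd;
}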

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:***@gmail.com]
Sent: Tuesday, September 23, 2014 7:07 PM
To: Sage Weil
Cc: Somnath Roy; Milosz Tanski; ceph-***@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

Good point, but have you considered the impact on write ops? And if we skip the page cache, is FileStore then responsible for data caching?
Post by Sage Weil
Post by Somnath Roy
Milosz,
Thanks for the response. I will see if I can get any information out of perf.
Here is my OS information.
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46
UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters I was able to get almost *2X* performance improvement with direct_io.
It's not only the page cache (memory) lookup; in the case of buffered_io the following could be problems.
1. Double copy (disk -> file buffer cache, file buffer cache -> user buffer)
2. As the iostat output shows, it is not reading 4K only, it is
reading more data from disk than required and in the end it will be
wasted in the case of a random workload.
It might be worth using blktrace to see what the IOs it is issuing are.
Which ones are > 4K and what they point to...
sage
Post by Somnath Roy
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 12:09 PM
To: Somnath Roy
Subject: Re: Impact of page cache on OSD read performance for SSD
Somnath,
I wonder if there's a bottleneck or a point of contention in the kernel. For an entirely uncached workload I expect the page cache lookup to cause a slowdown (since the lookup is wasted). What I wouldn't expect is a 45% performance drop. Memory speed should be an order of magnitude faster than a modern SATA SSD drive (so the overhead should be closer to negligible).
Is there any way you could perform the same test but monitor what's going on with the OSD process using the perf tool? Whatever the default cpu-time-spent hardware counter is will be fine. Make sure you have the kernel debug info package installed so you can get symbol information for kernel and module calls. With any luck the diff of the perf output from the two runs will show us the culprit.
Also, can you tell us what OS/kernel version you're using on the OSD machines?
- Milosz
Haomai Wang
2014-09-24 04:01:03 UTC
Permalink
I agree that direct reads will help for disk reads. But if the read
data is hot and small enough to fit in memory, the page cache is a good
place to hold the data cache. If we discard the page cache, we need to
implement a cache of our own with an effective lookup implementation.

BTW, on whether to use direct io we can refer to MySQL's InnoDB engine
(direct io) and PostgreSQL (page cache).
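(To make the "effective lookup" point concrete, here is a toy sketch of the kind of structure that would be needed if the page cache were bypassed: O(1) hash lookup combined with LRU eviction. Keys and values are plain strings purely for illustration; a real FileStore cache would also need locking, memory accounting and invalidation on writes, so this is not a proposal for the actual implementation.)

#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy object-data cache: unordered_map gives O(1) lookup, the list keeps
// LRU order, and splice() moves an entry to the front without reallocating.
class ObjectDataCache {
  typedef std::list<std::pair<std::string, std::string> > Lru;
  size_t capacity_;
  Lru lru_;  // front = most recently used
  std::unordered_map<std::string, Lru::iterator> index_;

public:
  explicit ObjectDataCache(size_t capacity) : capacity_(capacity) {}

  // On a hit, copy the data out and bump the entry to the MRU position.
  bool get(const std::string& key, std::string* out) {
    auto it = index_.find(key);
    if (it == index_.end())
      return false;
    lru_.splice(lru_.begin(), lru_, it->second);
    *out = it->second->second;
    return true;
  }

  // Insert or refresh an entry, evicting the least recently used one
  // when the cache grows past its capacity.
  void put(const std::string& key, const std::string& data) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = data;
      lru_.splice(lru_.begin(), lru_, it->second);
      return;
    }
    lru_.push_front(std::make_pair(key, data));
    index_[key] = lru_.begin();
    if (lru_.size() > capacity_) {
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
  }
};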
--
Best Regards,
Wheat
Sage Weil
2014-09-24 12:38:08 UTC
Permalink
Post by Haomai Wang
I agree that direct reads will help for disk reads. But if the read data
is hot and small enough to fit in memory, the page cache is a good place to
hold the data cache. If we discard the page cache, we need to implement a
cache of our own with an effective lookup implementation.
This is true for some workloads, but not necessarily for all. Many
clients (notably RBD) will be caching at the client side (in the VM's fs, and
possibly in librbd itself) such that caching at the OSD is largely wasted
effort. For RGW the opposite is likely true, unless there is a varnish cache
or something in front.

We should probably have a direct_io config option for filestore. But even
better would be some hint from the client about whether it is caching or
not so that FileStore could conditionally cache...
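(A rough sketch of what that conditional path could look like; the option and hint names below are made up for illustration and are not existing Ceph settings. A real hint would presumably ride on the op itself.)

#include <fcntl.h>

// Hypothetical per-read decision: use O_DIRECT when a config option forces
// it, or when the client has signalled that it caches on its own side
// (librbd cache, guest page cache), since caching the same data again in
// the OSD's page cache would mostly be wasted.
struct ReadHints {
  bool filestore_direct_io;   // made-up config option
  bool client_is_caching;     // made-up per-op hint from the client
};

inline int choose_read_flags(const ReadHints& h) {
  int flags = O_RDONLY;
  if (h.filestore_direct_io || h.client_is_caching)
    flags |= O_DIRECT;
  return flags;
}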

sage
Mark Nelson
2014-09-24 13:27:40 UTC
Permalink
Post by Sage Weil
We should probably have a direct_io config option for filestore. But even
better would be some hint from the client about whether it is caching or
not so that FileStore could conditionally cache...
I like the hinting idea. Having said that, if the effect being seen is
due to page cache, it seems like something is off. We've seen
performance issues in the kernel before so it's not unprecedented.
Working around it with direct IO could be the right way to go, but it
might be that this is something that could be fixed higher up and
improve performance in other scenarios too. I'd hate to let it go by
the wayside if we could find something actionable.
Milosz Tanski
2014-09-24 14:29:44 UTC
Permalink
Post by Sage Weil
I agree that direct reads will help for disk reads. But if the read data
is hot and small enough to fit in memory, the page cache is a good place to
hold the data cache. If we discard the page cache, we need to implement a
cache that provides an effective lookup implementation.
This is true for some workloads, but not necessarily true for all. Many
clients (notably RBD) will be caching at the client side (in the VM's fs, and
possibly in librbd itself) such that caching at the OSD is largely wasted
effort. For RGW the same is likely true, unless there is a varnish cache
or something in front.
We should probably have a direct_io config option for filestore. But even
better would be some hint from the client about whether it is caching or
not so that FileStore could conditionally cache...
I like the hinting idea. Having said that, if the effect being seen is due
to the page cache, it seems like something is off. We've seen performance
issues in the kernel before, so it's not unprecedented. Working around it
with direct IO could be the right way to go, but it might be that this is
something that could be fixed higher up and improve performance in other
scenarios too. I'd hate to let it go by the wayside if we could find
something actionable.
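(To make the hinting idea a bit more concrete, here is a rough sketch of a read helper that drops the pages it just read when the client says it caches on its own side. Purely illustrative: the flag and the helper are invented for this example, this is not existing FileStore code.

#include <fcntl.h>
#include <unistd.h>

// Illustrative only: 'client_is_caching' would come from a client-supplied
// hint; the helper name and plumbing are invented for this sketch.
ssize_t read_with_cache_hint(int fd, void *buf, size_t len, off_t off,
                             bool client_is_caching)
{
  ssize_t r = pread(fd, buf, len, off);        // normal buffered read
  if (r > 0 && client_is_caching) {
    // The client keeps its own cache, so these pages are unlikely to be
    // re-read here; ask the kernel to drop them from the page cache.
    posix_fadvise(fd, off, r, POSIX_FADV_DONTNEED);
  }
  return r;
}
)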
Post by Sage Weil
sage
BTW, on whether to use direct IO, we can refer to MySQL's InnoDB engine,
which uses direct IO, and PostgreSQL, which uses the page cache.
Post by Somnath Roy
Haomai,
I am only considering random reads, and the changes I made only
affect reads. For writes, I have not measured yet. But, yes, the page cache
may be helpful for write coalescing. I still need to evaluate how it behaves
compared to direct_io on SSD, though. I think the Ceph code path will be
much shorter if we use direct_io in the write path, where it is actually
executing the transactions. Probably, the sync thread and all will not be
needed.
I am trying to analyze where the extra reads are coming from in the case of
buffered IO by using blktrace etc. This should give us a clear understanding
of what exactly is going on there, and it may turn out that by tuning kernel
parameters alone we can achieve performance similar to direct_io.
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 7:07 PM
To: Sage Weil
Subject: Re: Impact of page cache on OSD read performance for SSD
Good point, but have you considered the impact on write ops?
And if we skip the page cache, is FileStore responsible for the data cache?
Post by Sage Weil
Post by Somnath Roy
Milosz,
Thanks for the response. I will see if I can get any information out of perf.
Here is my OS information.
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46
UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters
I was able to get almost a *2X* performance improvement with direct_io.
It's not only the page cache (memory) lookup; in the case of buffered_io the
following could be problems.
1. Double copy (disk -> file buffer cache, file buffer cache -> user buffer)
2. As the iostat output shows, it is not reading only 4K; it is
reading more data from disk than required, and in the end it will be
wasted in the case of a random workload.
It might be worth using blktrace to see what the IOs it is issuing are.
Which ones are > 4K and what they point to...
sage
Post by Somnath Roy
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 12:09 PM
To: Somnath Roy
Subject: Re: Impact of page cache on OSD read performance for SSD
Somnath,
I wonder if there's a bottleneck or a point of contention in the
kernel. For an entirely uncached workload I expect the page cache lookup to
cause a slowdown (since the lookup is wasted). What I wouldn't
expect is a 45% performance drop. Memory should be an order of magnitude
faster than a modern SATA SSD drive (so the overhead should be closer to
negligible).
Is there any way you could perform the same test but monitor what's
going on with the OSD process using the perf tool? Whatever the default
CPU-time hardware counter is, that's fine. Make sure you have the kernel debug
info package installed so you can get symbol information for kernel and module
calls. With any luck the diff of the perf output from the two runs will show us the
culprit.
Also, can you tell us what OS/kernel version you're using on the OSD machines?
- Milosz
I wonder how much (if any) using posix_fadvise with the
POSIX_FADV_RANDOM hint would help in this case, as that tells the kernel
not to perform (aggressive) read-ahead.

Sadly, POSIX_FADV_NOREUSE is a no-op in current kernels, although
there have been patches floating around over the years to implement it:
http://lxr.free-electrons.com/source/mm/fadvise.c#L113 and
http://thread.gmane.org/gmane.linux.file-systems/61511
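(For reference, the hint itself is a one-liner on the open descriptor; a minimal sketch, assuming the object file is already open:

#include <fcntl.h>

// Sketch: tell the kernel to expect random access on this descriptor,
// which disables (aggressive) read-ahead. offset=0, len=0 covers the
// whole file.
int expect_random_access(int fd)
{
  return posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}
)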
--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: ***@adfin.com
Haomai Wang
2014-09-24 16:05:40 UTC
Permalink
Post by Sage Weil
I agree that direct reads will help for disk reads. But if the read
data is hot and small enough to fit in memory, the page cache is a good
place to hold the data cache. If we discard the page cache, we need to
implement a cache that provides an effective lookup implementation.
This is true for some workloads, but not necessarily true for all.
Many clients (notably RBD) will be caching at the client side (in the VM's
fs, and possibly in librbd itself) such that caching at the OSD is
largely wasted effort. For RGW the same is likely true, unless there
is a varnish cache or something in front.
Still, I don't think the librbd cache can meet all the cache demands for
rbd usage. Even if we had an effective librbd cache implementation, we would
still need a buffer cache at the ObjectStore level, just like databases do.
Client cache and host cache are both needed.
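(As a rough illustration of what an ObjectStore-level buffer cache could look like: a toy LRU keyed by object extent. This is purely a sketch, not Ceph code, and the key/value types are simplified to strings.

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class LruBlockCache {
public:
  explicit LruBlockCache(size_t max_entries) : max_(max_entries) {}

  // Returns true and fills *out on a hit; promotes the entry to MRU.
  bool get(const std::string &key, std::string *out) {
    auto it = index_.find(key);
    if (it == index_.end())
      return false;
    lru_.splice(lru_.begin(), lru_, it->second);   // move to front
    *out = it->second->second;
    return true;
  }

  // Inserts or refreshes an entry, evicting the LRU one if over capacity.
  void put(const std::string &key, std::string data) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = std::move(data);
      lru_.splice(lru_.begin(), lru_, it->second);
      return;
    }
    lru_.emplace_front(key, std::move(data));
    index_[key] = lru_.begin();
    if (index_.size() > max_) {
      index_.erase(lru_.back().first);             // drop least recently used
      lru_.pop_back();
    }
  }

private:
  typedef std::list<std::pair<std::string, std::string> > Entries;
  size_t max_;
  Entries lru_;                                     // front = most recent
  std::unordered_map<std::string, Entries::iterator> index_;
};

// Usage sketch: cache.put("obj123:0", data); std::string d; cache.get("obj123:0", &d);
)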
Post by Sage Weil
We should probably have a direct_io config option for filestore. But even
better would be some hint from the client about whether it is caching or
not so that FileStore could conditionally cache...
Yes, I remember we already did some early work along those lines.
Somnath Roy
2014-09-24 23:49:11 UTC
Permalink
Hi,
After going through the blktrace output, I think I have figured out what is going on. Kernel read_ahead is causing the extra reads in the buffered-read case. If I set read_ahead = 0, the performance I get is similar to (or better than, when a cache hit actually happens) direct_io :-)
IMHO, if a user wants to avoid these kernel effects and is sure of the random workload pattern, we should provide a configurable direct_io read option (direct_io writes also need to be quantified), as Sage suggested.
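(For anyone who wants to try the same thing, the knob in question is the per-device read_ahead_kb setting; a small sketch that sets it to 0. The device name sdh matches the data disk in the iostat output above, adjust as needed; it has to run as root.

#include <fstream>

int main()
{
  // Equivalent to: echo 0 > /sys/block/sdh/queue/read_ahead_kb
  std::ofstream ra("/sys/block/sdh/queue/read_ahead_kb");
  if (!ra)
    return 1;               // wrong device name, or not running as root
  ra << 0 << std::endl;     // 0 KB read-ahead: no speculative reads
  return ra.good() ? 0 : 1;
}
)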

Thanks & Regards
Somnath
Haomai Wang
2014-09-25 02:55:59 UTC
Permalink
Post by Somnath Roy
Hi,
After going through the blktrace output, I think I have figured out what is going on. Kernel read_ahead is causing the extra reads in the buffered-read case. If I set read_ahead = 0, the performance I get is similar to (or better than, when a cache hit actually happens) direct_io :-)
Hmm, BTW, if you set read_ahead=0, how does sequential read performance
compare to before?
Post by Somnath Roy
IMHO, if any user doesn't want these nasty kernel effects and be sure of the random work pattern, we should provide a configurable direct_io read option (Need to quantify direct_io write also) as Sage suggested.
Thanks & Regards
Somnath
-----Original Message-----
Sent: Wednesday, September 24, 2014 9:06 AM
To: Sage Weil
Subject: Re: Impact of page cache on OSD read performance for SSD
Post by Sage Weil
Post by Haomai Wang
I agree with that direct read will help for disk read. But if read
data is hot and small enough to fit in memory, page cache is a good
place to hold data cache. If discard page cache, we need to implement
a cache to provide with effective lookup impl.
This is true for some workloads, but not necessarily true for all.
Many clients (notably RBD) will be caching at the client side (in VM's
fs, and possibly in librbd itself) such that caching at the OSD is
largely wasted effort. For RGW the often is likely true, unless there
is a varnish cache or something in front.
Still now, I don't think librbd cache can meet all the cache demand for rbd usage. Even though we have a effective librbd cache impl, we still need a buffer cache in ObjectStore level just like what database did. Client cache and host cache are both needed.
Post by Sage Weil
We should probably have a direct_io config option for filestore. But
even better would be some hint from the client about whether it is
caching or not so that FileStore could conditionally cache...
Yes, I remember we already did some early works like it.
Post by Sage Weil
sage
Post by Haomai Wang
BTW, whether to use direct io we can refer to MySQL Innodb engine
with direct io and PostgreSQL with page cache.
Post by Somnath Roy
Haomai,
I am considering only about random reads and the changes I made only affecting reads. For write, I have not measured yet. But, yes, page cache may be helpful for write coalescing. Still need to evaluate how it is behaving comparing direct_io on SSD though. I think Ceph code path will be much shorter if we use direct_io in the write path where it is actually executing the transactions. Probably, the sync thread and all will not be needed.
I am trying to analyze where is the extra reads coming from in case of buffered io by using blktrace etc. This should give us a clear understanding what exactly is going on there and it may turn out that tuning kernel parameters only we can achieve similar performance as direct_io.
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 7:07 PM
To: Sage Weil
Subject: Re: Impact of page cache on OSD read performance for SSD
Good point, but do you have considered that the impaction for write ops? And if skip page cache, FileStore is responsible for data cache?
Post by Sage Weil
Post by Somnath Roy
Milosz,
Thanks for the response. I will see if I can get any information out of perf.
Here is my OS information.
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9
16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
BTW, it's not a 45% drop, as you can see, by tuning the OSD parameter I was able to get almost *2X* performance improvement with direct_io.
It's not only page cache (memory) lookup, in case of buffered_io the following could be problem.
1. Double copy (disk -> file buffer cache, file buffer cache ->
user
buffer)
2. As the iostat output shows, it is not reading 4K only, it is
reading more data from disk as required and in the end it will be
wasted in case of random workload..
It might be worth using blktrace to see what the IOs it is issueing are.
Which ones are > 4K and what they point to...
sage
Post by Somnath Roy
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 12:09 PM
To: Somnath Roy
Subject: Re: Impact of page cache on OSD read performance for SSD
Somnath,
I wonder if there's a bottleneck or a point of contention for the kernel. For a entirely uncached workload I expect the page cache lookup to cause a slow down (since the lookup should be wasted). What I wouldn't expect is a 45% performance drop. Memory speed should be one magnitude faster then a modern SATA SSD drive (so it should be more negligible overhead).
Is there anyway you could perform the same test but monitor what's going on with the OSD process using the perf tool? Whatever is the default cpu time spent hardware counter is fine. Make sure you have the kernel debug info package installed so can get symbol information for kernel and module calls. With any luck the diff in perf output in two runs will show us the culprit.
Also, can you tell us what OS/kernel version you're using on the OSD machines?
- Milosz
Post by Somnath Roy
Hi Sage,
I have created the following setup in order to examine how a single OSD is behaving if say ~80-90% of ios hitting the SSDs.
My test includes the following steps.
1. Created a single OSD cluster.
2. Created two rbd images (110GB each) on 2 different pools.
3. Populated entire image, so my working set is ~210GB. My system memory is ~16GB.
4. Dumped page cache before every run.
5. Ran fio_rbd (QD 32, 8 instances) in parallel on these two images.
Here is my disk iops/bandwidth..
random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0 iops] [eta 00m:00s]
random-reads: (groupid=0, jobs=1): err= 0: pid=1431
read : io=9316.4MB, bw=158994KB/s, iops=39748 , runt= 60002msec
My fio_rbd config..
[global]
ioengine=rbd
clientname=admin
pool=rbd1
rbdname=ceph_regression_test1
invalidate=0 # mandatory
rw=randread
bs=4k
direct=1
time_based
runtime=2m
size=109G
numjobs=8
[rbd_iodepth32]
iodepth=32
Now, I have run Giant Ceph on top of that..
-------------------------------------------------------
avg-cpu: %user %nice %system %iowait %steal %idle
22.04 0.00 16.46 45.86 0.00 15.64
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 9.00 0.00 6.00 0.00 92.00 30.67 0.01 1.33 0.00 1.33 1.33 0.80
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 181.00 0.00 34961.00 0.00 176740.00 0.00 10.11 102.71 2.92 2.92 0.00 0.03 100.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
----------
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e498: 1 osds: 1 up, 1 in
pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 75215 kB/s rd, 18803 op/s
----------
Gradually decreases from ~21 core (serving from cache) to ~10 core (while serving from disks).
-----------------
In this case "All is Well" till ios are served from cache
(XFS is smart enough to cache some data ) . Once started hitting disks and throughput is decreasing. As you can see, disk is giving ~35K iops , but, OSD throughput is only ~18.8K ! So, cache miss in case of buffered io seems to be very expensive. Half of the iops are waste. Also, looking at the bandwidth, it is obvious, not everything is 4K read, May be kernel read_ahead is kicking (?).
Now, I thought of making ceph disk read as direct_io and do the same experiment. I have changed the FileStore::read to do the direct_io only. Rest kept as is. Here is the result with that.
-------
avg-cpu: %user %nice %system %iowait %steal %idle
24.77 0.00 19.52 21.36 0.00 34.36
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 25295.00 0.00 101180.00 0.00 8.00 12.73 0.50 0.50 0.00 0.04 100.80
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
--------
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e522: 1 osds: 1 up, 1 in
pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 100 MB/s rd, 25618 op/s
--------
~14 core while serving from disks.
---------------
No surprises here. Whatever is disk throughput ceph throughput is almost matching.
Let's tweak the shard/thread settings and see the impact.
-----------------------------------------------------------
------------------
No change, output is very similar to 25 shards.
------------------
----------
avg-cpu: %user %nice %system %iowait %steal %idle
33.33 0.00 28.22 23.11 0.00 15.34
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 2.00 0.00 12.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 31987.00 0.00 127948.00 0.00 8.00 18.06 0.56 0.56 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
--------------
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e525: 1 osds: 1 up, 1 in
pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 127 MB/s rd, 32763 op/s
--------------
~19 core while serving from disks.
------------------
It is scaling with increased number of shards/threads. The parallelism also increased significantly.
----------------------------------------------------------
-------------------
No change, output is very similar to 25 shards.
-----------------
--------
avg-cpu: %user %nice %system %iowait %steal %idle
37.50 0.00 33.72 20.03 0.00 8.75
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 35360.00 0.00 141440.00 0.00 8.00 22.25 0.62 0.62 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
--------------
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e534: 1 osds: 1 up, 1 in
pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 138 MB/s rd, 35582 op/s
----------------
~22.5 core while serving from disks.
--------------------
It is scaling with increased number of shards/threads. The parallelism also increased significantly.
---------------------------------------------------------
------------------
No change, output is very similar to 25 shards.
-------------------
---------
avg-cpu: %user %nice %system %iowait %steal %idle
40.18 0.00 34.84 19.81 0.00 5.18
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 39114.00 0.00 156460.00 0.00 8.00 35.58 0.90 0.90 0.00 0.03 100.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
---------------
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_OK
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e537: 1 osds: 1 up, 1 in
pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects
366 GB used, 1122 GB / 1489 GB avail
832 active+clean
client io 153 MB/s rd, 39172 op/s
----------------
~24.5 core while serving from disks. ~3% cpu left.
------------------
It is scaling with the increased number of shards/threads. The parallelism has also increased significantly. It is disk bound now.
So, it seems buffered IO has a significant impact on performance when the backend is an SSD.
My question is: if the workload is very random and the storage (SSD) is very large compared to system memory, shouldn't we always go for direct_io instead of buffered IO from Ceph?
Please share your thoughts/suggestions on this.
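
For reference, here is a minimal sketch of what a direct-IO read path has to do (illustration only, not the actual FileStore::read change; the 4096-byte alignment value is an assumption for a typical page/logical block size):

// Illustration only (not the actual FileStore::read patch): an aligned
// O_DIRECT pread that bypasses the page cache. O_DIRECT requires the
// buffer, the file offset and the length to be block aligned.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

ssize_t direct_read(const char *path, off_t offset, size_t len, char *out)
{
  const size_t align = 4096;                  // assumed block/page size
  int fd = open(path, O_RDONLY | O_DIRECT);   // bypass the page cache
  if (fd < 0)
    return -1;

  // Round the request out to aligned boundaries.
  off_t  aligned_off = offset & ~static_cast<off_t>(align - 1);
  size_t front_pad   = static_cast<size_t>(offset - aligned_off);
  size_t aligned_len = ((front_pad + len + align - 1) / align) * align;

  void *buf = nullptr;
  if (posix_memalign(&buf, align, aligned_len) != 0) {
    close(fd);
    return -1;
  }

  ssize_t ret = -1;
  ssize_t got = pread(fd, buf, aligned_len, aligned_off);
  if (got >= static_cast<ssize_t>(front_pad)) {
    size_t usable = static_cast<size_t>(got) - front_pad;
    ret = static_cast<ssize_t>(usable < len ? usable : len);
    // Copy back only the range the caller actually asked for.
    memcpy(out, static_cast<char *>(buf) + front_pad, ret);
  }
  free(buf);
  close(fd);
  return ret;
}

The alignment handling and the extra copy back into the caller's buffer are part of the cost of going direct, which is probably one more reason this should stay a configurable option rather than the default.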
Thanks & Regards
Somnath
________________________________
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
Somnath Roy
2014-09-25 03:15:12 UTC
Permalink
It (sequential read performance) will definitely be hampered.
There is no single setting that fits all workloads; these parameters need to be tuned based on the workload.

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:***@gmail.com]
Sent: Wednesday, September 24, 2014 7:56 PM
To: Somnath Roy
Cc: Sage Weil; Milosz Tanski; ceph-***@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD
Post by Somnath Roy
Hi,
After going through the blktrace, I think I have figured out what is
going on there. I think kernel read_ahead is causing the extra reads
in the buffered-read case. If I set read_ahead = 0, the performance I
am getting is similar to (or better than, when a cache hit actually
happens) direct_io :-)
Hmm, BTW if set read_ahead=0, what about seq read performance compared to before?
Post by Somnath Roy
IMHO, if a user doesn't want these nasty kernel effects and is sure of the random workload pattern, we should provide a configurable direct_io read option (direct_io writes also need to be quantified), as Sage suggested.
Thanks & Regards
Somnath
Chen, Xiaoxi
2014-09-25 05:00:18 UTC
Permalink
Have you ever seen a large read_ahead_kb hurt random read performance?

We usually set it to a very large value (2M) and the random read performance stays steady, even in an all-SSD setup. Maybe with your optimization code for OP_QUEUE things are different?

Somnath Roy
2014-09-25 07:10:48 UTC
Permalink
Well, you never know!
It depends upon a lot of factors, starting from your workload, different kernel params, RAID controller, etc. I have shared my observation in my environment with a 4K pseudo-random fio_rbd workload. A truly random workload should not kick off read_ahead, though.
The OP_QUEUE optimization brings more parallelism to the filestore reads, so more reads going to disk in parallel may have exposed this.
Anyway, I am in the process of analyzing why the default read_ahead is causing a problem for me; I will update if I find anything.
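
For anyone reproducing the comparison, the block-layer readahead can be changed per device at runtime through /sys/block/<dev>/queue/read_ahead_kb. A small sketch (the device name sdh and the default value of 0 are assumptions matching my setup):

// Sketch: set read_ahead_kb for one block device via sysfs (needs root).
// The default device name "sdh" is an assumption; pass your own, e.g.
//   ./set_ra sdh 0
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char **argv)
{
  std::string dev = (argc > 1) ? argv[1] : "sdh";
  std::string kb  = (argc > 2) ? argv[2] : "0";   // 0 disables readahead

  std::ofstream f("/sys/block/" + dev + "/queue/read_ahead_kb");
  if (!f) {
    std::cerr << "cannot open read_ahead_kb for " << dev << " (are you root?)\n";
    return 1;
  }
  f << kb << std::endl;                           // endl flushes the write
  return f.good() ? 0 : 1;
}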

Thanks & Regards
Somnath

Somnath Roy
2014-09-30 00:23:46 UTC
Permalink
Hi,
I did some digging on the blktrace output to understand why this read_ahead_kb setting is impacting performance in my setup (which is a single-OSD cluster). Here is the result.

99% of the IOs during the blktrace collection window are performed by the following processes.

1. The ceph-osd process (including the "unknown" processes, which I figured out are just different threads of the OSD):


Events               Read_ahead_kb=128   Read_ahead_kb=0   Direct_io
Reads Queued                   4140687           4168816     4042634
Read Dispatches                7734617           5660597     4839428
Reads Requeued                 4574032           1789149      944688
Reads Completed                2532893           2996269     3027387
Reads Merges                      6415                 2           0
IO unplugs                     3380175            100911     4042714

2. Swapper process

Events               Read_ahead_kb=128   Read_ahead_kb=0   Direct_io
Reads Queued                         0                 0           0
Read Dispatches                  1836K            459028      258743
Reads Requeued                   1129K            254808      132605
Reads Completed                  1175K            937138      891107
Reads Merges                         0                 0           0
IO unplugs                           0                 0           0


Now, comparing the total amount of reads that happened during this time across the 3 different settings:

Events               Read_ahead_kb=128   Read_ahead_kb=0   Direct_io
Reads Queued                     4140K             4168K       4042K
Read Dispatches                 10390K             6363K       5151K
Reads Requeued                   6256K             2194K       1108K
Reads Completed                  4134K             4168K       4042K
Reads Merges                      6415                 2           0
IO unplugs                     3380183            100924     4042721


Here is my analysis of this.

1. There are a lot more read dispatches (~4M more than with read_ahead_kb = 0) when read_ahead_kb = 128.
2. The swapper process (which I think is doing the read-ahead?) issues a lot more reads when read_ahead_kb = 128.
3. Read merges are almost 0 in all cases other than the 1st one, which suggests the workload is very random (?). The higher merge count in the 1st case is probably due to read_ahead (?).

Some open questions:

1. Why are the completed reads fewer than the dispatches? Is "completed" the ceph read completions plus the swapper read completions? Even then it does not match the dispatch count.
2. Why are IO unplugs huge for read_ahead_kb = 128 and direct_io compared to read_ahead_kb = 0?
3. Why so many requeues?
4. Does requeued + queued = dispatched? (see the quick check below)
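
On question 4, a quick check against the totals table above (using my own numbers, rounded to thousands) suggests queued + requeued does roughly equal dispatched in all three cases:

// Quick arithmetic check of "queued + requeued == dispatched?" using the
// rounded totals (in K) from the table above.
#include <cstdio>

int main()
{
  struct Row { const char *name; long queued, requeued, dispatched; };
  const Row rows[] = {
    { "read_ahead_kb=128", 4140, 6256, 10390 },
    { "read_ahead_kb=0",   4168, 2194,  6363 },
    { "direct_io",         4042, 1108,  5151 },
  };
  for (const Row &r : rows)
    std::printf("%-18s queued+requeued=%5ldK  dispatched=%5ldK\n",
                r.name, r.queued + r.requeued, r.dispatched);
  return 0;
}

The sums come out to 10396K, 6362K and 5150K, which match the dispatch counts within rounding, so the answer to (4) looks like yes.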

I tried setting different kernel parameters like nr_requests/scheduler/rq_affinity/vm_cache_pressure etc., but in my workload I still consistently get ~50% improvement by setting read_ahead_kb = 0.

I don't have much expertise in the Linux block layer, so I am reaching out to the community for answers/suggestions.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Thursday, September 25, 2014 12:11 AM
To: 'Chen, Xiaoxi'; Haomai Wang
Cc: Sage Weil; Milosz Tanski; ceph-***@vger.kernel.org
Subject: RE: Impact of page cache on OSD read performance for SSD

Well, you never know !
It depends upon lot of factors starting from your workload/different kernel params/RAID controller etc. etc. I have shared my observation in my environment with 4K pseudo random fio_rbd workload. True random, should not kick off read_ahead though.
OP_QUEUE optimization is bringing more parallelism in the filestore read , so, more read going to disk in parallel may have exposed this.
Anyways, I am in process of analyzing why default read_ahead is causing problem for me, will update if I find any..

Thanks & Regards
Somnath

-----Original Message-----
From: Chen, Xiaoxi [mailto:***@intel.com]
Sent: Wednesday, September 24, 2014 10:00 PM
To: Somnath Roy; Haomai Wang
Cc: Sage Weil; Milosz Tanski; ceph-***@vger.kernel.org
Subject: RE: Impact of page cache on OSD read performance for SSD

Have you ever seen large readahead_kb would hear random performance?

We usually set it to very large (2M) , the random read performance keep steady, even in all SSD setup. Maybe with your optimization code for OP_QUEUE, the things may different?

-----Original Message-----
From: ceph-devel-***@vger.kernel.org [mailto:ceph-devel-***@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, September 25, 2014 11:15 AM
To: Haomai Wang
Cc: Sage Weil; Milosz Tanski; ceph-***@vger.kernel.org
Subject: RE: Impact of page cache on OSD read performance for SSD

It will be definitely hampered.
There will not be a single solution fits all. These parameters needs to be tuned based on the workload.

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:***@gmail.com]
Sent: Wednesday, September 24, 2014 7:56 PM
To: Somnath Roy
Cc: Sage Weil; Milosz Tanski; ceph-***@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD
Post by Somnath Roy
Hi,
After going through the blktrace, I think I have figured out what is going on there. I think kernel read_ahead is causing the extra reads in the buffered read case. If I set read_ahead = 0, the performance I am getting is similar to direct_io (or better when cache hits actually happen) :-)
Hmm, BTW, if you set read_ahead = 0, what about sequential read performance compared to before?
Post by Somnath Roy
IMHO, if a user doesn't want these nasty kernel effects and is sure of the random workload pattern, we should provide a configurable direct_io read option (direct_io writes also need to be quantified), as Sage suggested.
Thanks & Regards
Somnath
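
To make the comparison concrete, this is roughly what a direct_io read looks like at the syscall level. It is a standalone sketch, not the actual FileStore::read change: the file is opened with O_DIRECT and read into a block-aligned buffer (the path and 4K block size are placeholders; compile with g++, which defines _GNU_SOURCE so O_DIRECT is visible).

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  const char* path = argc > 1 ? argv[1] : "testfile";  // placeholder path
  const size_t kBlock = 4096;                          // assumes 4K logical blocks

  int fd = open(path, O_RDONLY | O_DIRECT);
  if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

  void* buf = nullptr;
  if (posix_memalign(&buf, kBlock, kBlock) != 0) { close(fd); return 1; }

  // With O_DIRECT the buffer, length and file offset must all be block
  // aligned; the read bypasses the page cache entirely.
  ssize_t r = pread(fd, buf, kBlock, 0);
  if (r < 0) perror("pread");
  else printf("read %zd bytes, bypassing the page cache\n", r);

  free(buf);
  close(fd);
  return r < 0 ? 1 : 0;
}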
-----Original Message-----
Sent: Wednesday, September 24, 2014 9:06 AM
To: Sage Weil
Subject: Re: Impact of page cache on OSD read performance for SSD
Post by Sage Weil
Post by Haomai Wang
I agree that direct reads will help for disk reads. But if the read
data is hot and small enough to fit in memory, the page cache is a good
place to hold cached data. If we discard the page cache, we need to
implement a cache that provides an effective lookup.
This is true for some workloads, but not necessarily true for all.
Many clients (notably RBD) will be caching at the client side (in
VM's fs, and possibly in librbd itself) such that caching at the OSD
is largely wasted effort. For RGW the same is likely true, unless
there is a varnish cache or something in front.
Even now, I don't think the librbd cache can meet all the cache demands for rbd usage. Even with an effective librbd cache implementation, we still need a buffer cache at the ObjectStore level, just like databases have. Client cache and host cache are both needed.
Post by Sage Weil
We should probably have a direct_io config option for filestore. But
even better would be some hint from the client about whether it is
caching or not so that FileStore could conditionally cache...
Yes, I remember we already did some early work like this.
Post by Sage Weil
sage
Post by Haomai Wang
BTW, on the question of whether to use direct IO, we can look at MySQL's
InnoDB engine (direct IO) versus PostgreSQL (page cache).
Post by Somnath Roy
Haomai,
I am only considering random reads, and the changes I made affect only reads. For writes, I have not measured yet. But, yes, the page cache may be helpful for write coalescing; I still need to evaluate how it behaves compared to direct_io on SSD, though. I think the Ceph code path will be much shorter if we use direct_io in the write path where it actually executes the transactions. Probably the sync thread and all will not be needed.
I am trying to analyze where the extra reads are coming from in the buffered IO case by using blktrace etc. This should give us a clear understanding of exactly what is going on there, and it may turn out that by tuning kernel parameters alone we can achieve performance similar to direct_io.
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 7:07 PM
To: Sage Weil
Subject: Re: Impact of page cache on OSD read performance for SSD
Good point, but have you considered the impact on write ops? And if we skip the page cache, is FileStore then responsible for the data cache?
Post by Sage Weil
Post by Somnath Roy
Milosz,
Thanks for the response. I will see if I can get any information out of perf.
Here is my OS information.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9
16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters I was able to get almost a *2X* performance improvement with direct_io.
It's not only the page cache (memory) lookup; in the buffered IO case the following could be problems:
1. Double copy (disk -> file buffer cache, file buffer cache -> user buffer)
2. As the iostat output shows, it is not reading only 4K; it is reading more data from disk than required, and in the end that extra data is wasted for a random workload.
It might be worth using blktrace to see what IOs it is issuing.
Which ones are > 4K and what they point to...
sage
Post by Somnath Roy
Thanks & Regards
Somnath
-----Original Message-----
Sent: Tuesday, September 23, 2014 12:09 PM
To: Somnath Roy
Subject: Re: Impact of page cache on OSD read performance for SSD
Somnath,
I wonder if there's a bottleneck or a point of contention in the kernel. For an entirely uncached workload I expect the page cache lookup to cause a slowdown (since the lookup should be wasted). What I wouldn't expect is a 45% performance drop. Memory speed should be an order of magnitude faster than a modern SATA SSD drive (so the overhead should be closer to negligible).
Is there any way you could perform the same test but monitor what's going on with the OSD process using the perf tool? Whatever the default CPU-time hardware counter is will be fine. Make sure you have the kernel debug info package installed so you can get symbol information for kernel and module calls. With any luck the diff of the perf output from the two runs will show us the culprit.
Also, can you tell us what OS/kernel version you're using on the OSD machines?
- Milosz
--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016
p: 646-253-9055
--
Best Regards,
Wheat
Sage Weil
2014-09-25 14:29:50 UTC
Permalink
Post by Somnath Roy
It will definitely be hampered.
There is no single solution that fits all; these parameters need to be tuned based on the workload.
Can you do a test to see if fadvise with FADV_RANDOM is sufficient to
prevent the readahead behavior? If so, we can potentially accomplish this
with proper IO hinting from the clients.

sage
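
For reference, a minimal sketch of that experiment using the raw syscall rather than any Ceph code: POSIX_FADV_RANDOM asks the kernel to disable readahead on the descriptor while reads stay buffered through the page cache. The file path below is a placeholder.

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
  const char* path = argc > 1 ? argv[1] : "testfile";  // placeholder path
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }

  // Offset 0, length 0 means "the whole file"; the advice only affects
  // this descriptor. Reads remain buffered, but readahead is dropped.
  int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
  if (rc != 0) fprintf(stderr, "posix_fadvise: error %d\n", rc);

  char buf[4096];
  ssize_t r = pread(fd, buf, sizeof(buf), 0);  // should stay a single 4K read
  if (r < 0) perror("pread");

  close(fd);
  return 0;
}

If a 4K random read run with this hint matches the read_ahead_kb = 0 numbers, the hint alone would be enough and FileStore would not need to switch to O_DIRECT for reads.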