Discussion:
puzzled with the design pattern of ceph journal, really ruining performance
姚宁
2014-09-17 06:29:22 UTC
Permalink
Hi, guys

I have been analyzing the architecture of the Ceph source code.

I know that, in order to keep the journal atomic and consistent, the
journal must be opened with O_DSYNC, or fdatasync() must be called
after every write operation. However, this kind of operation really
kills performance and causes high commit latency, even if an SSD is
used as the journal disk. If the SSD has a capacitor to keep data safe
when the system crashes, we can set the nobarrier mount option, or the
SSD itself will simply ignore the FLUSH requests, so performance would
be better.
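
For illustration, a minimal C sketch of the two ways to get that
durability (this is not Ceph's journal code; the file path and record
size are placeholders):

/* Durable journal append, two variants -- illustrative only, not Ceph's
 * actual journal code; "/var/lib/journal" and the 4 KB record size are
 * placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char record[4096];
    memset(record, 0xab, sizeof(record));       /* pretend journal entry */

    /* Variant 1: O_DSYNC -- every write() returns only after the data
     * has reached stable media. */
    int fd = open("/var/lib/journal", O_WRONLY | O_CREAT | O_DSYNC, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, record, sizeof(record)) != (ssize_t)sizeof(record)) { perror("write"); return 1; }
    close(fd);

    /* Variant 2: buffered write followed by an explicit fdatasync(),
     * which forces a device cache flush unless barriers are disabled. */
    fd = open("/var/lib/journal", O_WRONLY | O_APPEND);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, record, sizeof(record)) != (ssize_t)sizeof(record)) { perror("write"); return 1; }
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
    close(fd);
    return 0;
}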

So can the journal be replaced by other strategies?
As far as I am concerned, the most important parts are pg_log and
pg_info. They guide a crashed OSD in recovering its objects from its
peers. Therefore, if we can keep pg_log at a consistent point, we can
recover data without the journal. So can we just use an "undo"
strategy on pg_log and drop the Ceph journal? It would save a lot of
bandwidth, and based on a consistent pg_log epoch we can always
recover data from the peering OSDs, right? But this will lead to
recovering more objects if an OSD crashes.

Nicheal
Somnath Roy
2014-09-17 07:29:49 UTC
Permalink
Hi Nicheal,
Not only recovery. IMHO, the main purpose of the Ceph journal is to provide transaction semantics, since XFS doesn't have them. I don't think that can be achieved with pg_log/pg_info alone.

Thanks & Regards
Somnath

Chen, Xiaoxi
2014-09-17 07:59:37 UTC
Permalink
Hi Nicheal,

1. The main purpose of the journal is to provide transaction semantics (preventing partial updates). Peering is not enough for this, because Ceph writes all replicas at the same time, so after a crash you have no idea which replica has the right data. For example, say we have 2 replicas and a user updates a 4M object: the primary OSD may crash when the first 2MB has been written, and the secondary OSD may also fail when the first 3MB has been written. The versions on the primary and secondary are then neither the new value nor the old value, and there is no way to recover. So, following the same idea as databases, we need a journal to support transactions and prevent this from happening. For a backend that supports transactions, BTRFS for instance, we don't need the journal for correctness; we can write the journal and the data disk at the same time, and the journal there just helps performance, since it only does sequential writes and we expect it to be faster than the backend OSD disk.
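
To make the write-ahead idea concrete, here is a small C sketch (a
simplification, not Ceph's FileStore journal format; the record header
and file names are invented): the whole update is appended to the
journal and synced before the data file is touched, so after a crash
the record is either completely present (replay it) or absent (the old
data is untouched).

/* Simplified write-ahead journaling sketch -- not Ceph's on-disk format;
 * the record header and file names are invented, and a real journal
 * would also checksum each record to detect a torn tail on replay. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct rec_hdr {
    uint64_t offset;   /* where in the data file this update applies */
    uint64_t length;   /* payload bytes that follow this header */
};

static int journal_then_apply(int journal_fd, int data_fd,
                              uint64_t offset, const void *buf, uint64_t len)
{
    struct rec_hdr hdr = { .offset = offset, .length = len };

    /* 1. Append the whole transaction (header + payload) to the journal. */
    if (write(journal_fd, &hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr)) return -1;
    if (write(journal_fd, buf, len) != (ssize_t)len) return -1;

    /* 2. Make the journal record durable BEFORE touching the data file.
     *    This fdatasync() is the cost the thread is discussing. */
    if (fdatasync(journal_fd) != 0) return -1;

    /* 3. Apply the update in place.  If we crash anywhere after step 2,
     *    replaying the journal record reproduces the same final state;
     *    if we crash before it, the old data is still intact. */
    if (pwrite(data_fd, buf, len, (off_t)offset) != (ssize_t)len) return -1;
    return 0;
}

int main(void)
{
    int jfd = open("journal.bin", O_WRONLY | O_CREAT | O_APPEND, 0600);
    int dfd = open("object.bin", O_WRONLY | O_CREAT, 0600);
    if (jfd < 0 || dfd < 0) { perror("open"); return 1; }

    char payload[4096];
    memset(payload, 0x42, sizeof(payload));
    if (journal_then_apply(jfd, dfd, 0, payload, sizeof(payload)) != 0)
        perror("journal_then_apply");

    close(jfd);
    close(dfd);
    return 0;
}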

2. Have you got any data to prove that O_DSYNC or fdatasync kills the performance of the journal? In our previous tests, the journal SSD (we use a partition of an SSD as the journal for a particular OSD, and 4 OSDs share the same SSD) could reach its peak performance (300-400MB/s).

Xiaoxi
Alexandre DERUMIER
2014-09-17 14:20:44 UTC
Permalink
Post by Chen, Xiaoxi
2. Have you got any data to prove that O_DSYNC or fdatasync kills the performance of the journal? In our previous tests, the journal SSD (we use a partition of an SSD as the journal for a particular OSD, and 4 OSDs share the same SSD) could reach its peak performance (300-400MB/s).
Hi,

I have done some benchmarks here:

http://www.mail-archive.com/ceph-***@lists.ceph.com/msg12950.html

Some SSD models have really bad performance with O_DSYNC (the Crucial M550 does 312 iops with 4k blocks).
Benching 1 OSD, I can see big latencies for a few seconds whenever the O_DSYNC writes occur.


crucial m550
------------
fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
bw=1249.9KB/s, iops=312

intel s3500
-----------
fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
bw=41794KB/s, iops=10448
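
For reference, the --sync=1 runs above boil down to the following kind
of loop (a rough C analogue, not fio itself; fio additionally sets
O_DIRECT, omitted here because it requires aligned buffers): every 4k
write has to wait for the drive to make it durable, which is exactly
the per-write latency being compared.

/* Rough C analogue of the fio --bs=4k --sync=1 runs above -- a sketch,
 * not fio itself.  WARNING: if you point it at a block device it will
 * overwrite data, just like the fio command. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "testfile";
    const int iterations = 1000;
    char buf[4096];
    memset(buf, 0x5a, sizeof(buf));

    /* O_DSYNC: each 4k write waits for the drive to make it durable. */
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        if (pwrite(fd, buf, sizeof(buf), (off_t)i * 4096) != (ssize_t)sizeof(buf)) {
            perror("pwrite");
            return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d O_DSYNC 4k writes in %.3f s => %.0f IOPS\n",
           iterations, secs, iterations / secs);
    close(fd);
    return 0;
}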

Mark Nelson
2014-09-17 15:01:06 UTC
Permalink
Post by Chen, Xiaoxi
2. Have you got any data to prove that O_DSYNC or fdatasync kills the performance of the journal? In our previous tests, the journal SSD (we use a partition of an SSD as the journal for a particular OSD, and 4 OSDs share the same SSD) could reach its peak performance (300-400MB/s).
Post by Alexandre DERUMIER
Hi,
Some SSD models have really bad performance with O_DSYNC (the Crucial M550 does 312 iops with 4k blocks).
Benching 1 OSD, I can see big latencies for a few seconds whenever the O_DSYNC writes occur.

FWIW, the journal will coalesce writes quickly when there are many
concurrent 4k client writes. Once you hit around 8 4k IOs per OSD, the
journal will start coalescing. For say 100-150 IOPs (what a spinning
disk can handle), expect around 9ish 100KB journal writes (with padding
and a header/footer for each client IO). What we've seen is that some
drives that aren't that great at 4K O_DSYNC writes are still reasonably
good with 8+ concurrent larger O_DSYNC writes.
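
A rough sketch of that coalescing behaviour in C (purely illustrative;
the per-entry header, padding and sizes are invented, not FileJournal's
actual on-disk format): several queued client writes are packed into
one larger buffer and flushed with a single sync, so the cost of one
flush is amortized across all of them.

/* Illustrative sketch of journal write coalescing -- not Ceph's
 * FileJournal code; the per-entry header, padding and sizes are made up.
 * Eight pending 4 KB client writes are packed into one buffer, then
 * written and synced once, so a single device flush covers all of them. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define ENTRY_SIZE 4096u     /* client IO size in this example */
#define ALIGN       512u     /* pad each journal entry to a sector boundary */

struct entry_hdr {
    uint64_t seq;            /* transaction sequence number */
    uint32_t len;            /* payload bytes that follow */
    uint32_t pad;            /* padding bytes after the payload */
};

/* Pack `count` pending client writes into `out`; return bytes used. */
static size_t coalesce(char (*pending)[ENTRY_SIZE], int count,
                       char *out, size_t out_cap)
{
    size_t used = 0;
    for (int i = 0; i < count; i++) {
        size_t entry  = sizeof(struct entry_hdr) + ENTRY_SIZE;
        size_t padded = (entry + ALIGN - 1) / ALIGN * ALIGN;
        struct entry_hdr hdr = { .seq = (uint64_t)i, .len = ENTRY_SIZE,
                                 .pad = (uint32_t)(padded - entry) };
        if (used + padded > out_cap)
            break;
        memcpy(out + used, &hdr, sizeof(hdr));
        memcpy(out + used + sizeof(hdr), pending[i], ENTRY_SIZE);
        memset(out + used + entry, 0, hdr.pad);
        used += padded;
    }
    return used;
}

int main(void)
{
    static char pending[8][ENTRY_SIZE];          /* 8 queued client writes */
    static char batch[8 * (ENTRY_SIZE + ALIGN)]; /* room for headers + padding */
    memset(pending, 0x11, sizeof(pending));

    int fd = open("journal.bin", O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0) { perror("open"); return 1; }

    size_t n = coalesce(pending, 8, batch, sizeof(batch));
    if (write(fd, batch, n) != (ssize_t)n) { perror("write"); return 1; }
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }  /* one flush for 8 client IOs */

    printf("flushed %zu bytes covering 8 client writes with one sync\n", n);
    close(fd);
    return 0;
}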
Alexandre DERUMIER
2014-09-17 21:13:16 UTC
Permalink
Post by Mark Nelson
FWIW, the journal will coalesce writes quickly when there are many concurrent 4k client writes. Once you hit around 8 4k IOs per OSD, the journal will start coalescing. For say 100-150 IOPs (what a spinning disk can handle), expect around 9ish 100KB journal writes (with padding and a header/footer for each client IO). What we've seen is that some drives that aren't that great at 4K O_DSYNC writes are still reasonably good with 8+ concurrent larger O_DSYNC writes.
Yes, indeed, it's not that bad. (hopefully ;)

When benching the Crucial M550, I only see, from time to time (maybe every 30s, I don't remember exactly), IOs slowing down to 200 for 1 or 2 seconds, then going back up to a normal speed of around 4000 iops.

With the Intel S3500, I get constant writes at 5000 iops.

BTW, does the rbd client cache also help with coalescing writes (on the client side), and therefore also help the journal?
Chen, Xiaoxi
2014-09-18 01:05:18 UTC
Permalink
Post by Alexandre DERUMIER
When benching the Crucial M550, I only see, from time to time (maybe every 30s, I don't remember exactly), IOs slowing down to 200 for 1 or 2 seconds, then going back up to a normal speed of around 4000 iops.
Wow, that indicates the M550 is busy with garbage collection. Maybe just try to overprovision it a bit (say, if you have a 400G SSD, only partition ~300G of it); overprovisioning an SSD generally helps both performance and durability. Actually, if you look at the spec differences between the Intel S3500 and S3700, the root cause is the different overprovisioning ratio :)

Mark Nelson
2014-09-18 01:23:28 UTC
Permalink
Post by Alexandre DERUMIER
When benching the Crucial M550, I only see, from time to time (maybe every 30s, I don't remember exactly), IOs slowing down to 200 for 1 or 2 seconds, then going back up to a normal speed of around 4000 iops.
Post by Chen, Xiaoxi
Wow, that indicates the M550 is busy with garbage collection. Maybe just try to overprovision it a bit (say, if you have a 400G SSD, only partition ~300G of it); overprovisioning an SSD generally helps both performance and durability. Actually, if you look at the spec differences between the Intel S3500 and S3700, the root cause is the different overprovisioning ratio :)

Hrm, I thought the S3700 uses MLC-HET cells while the S3500 uses regular MLC?