Discussion:
slow performance even when using SSDs
Stefan Priebe - Profihost AG
2012-05-10 12:09:13 UTC
Permalink
Dear List,

I'm doing a test setup with ceph v0.46 and wanted to know how fast ceph is.

My test setup:
3 servers with Intel Xeon X3440, 180GB SSD Intel 520 Series, 4GB RAM, 2x
1Gbit/s LAN each

All 3 are running as mon a-c and osd 0-2. Two of them are also running
as mds.2 and mds.3 (these two have 8GB RAM instead of 4GB).

All machines run ceph v0.46 and a vanilla Linux kernel v3.0.30, and all of
them use btrfs on the SSD, which serves /srv/{osd,mon}.X. All of them use
eth0+eth1 as bond0 (mode 6).
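For reference, mode 6 is balance-alb; the bonding is set up roughly like the
Debian-style stanza below (the address is a placeholder and the exact options
differ slightly per node):

  auto bond0
  iface bond0 inet static
      address 192.168.0.x
      netmask 255.255.255.0
      bond-slaves eth0 eth1
      bond-mode balance-alb
      bond-miimon 100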

This gives me:
rados -p rbd bench 60 write

...
Total time run: 61.465323
Total writes made: 776
Write size: 4194304
Bandwidth (MB/sec): 50.500

Average Latency: 1.2654
Max latency: 2.77124
Min latency: 0.170936

Shouldn't it be at least 100 MB/s? (1 Gbit/s / 8 = 125 MB/s)

And rados -p rbd bench 60 write -b 4096 gives pretty bad results:
Total time run: 60.221130
Total writes made: 6401
Write size: 4096
Bandwidth (MB/sec): 0.415

Average Latency: 0.150525
Max latency: 1.12647
Min latency: 0.026599

All btrfs ssds are also mounted with noatime.

Thanks for your help!

Greets Stefan
Stefan Priebe - Profihost AG
2012-05-10 13:09:55 UTC
Permalink
OK, here are some retests. I had the SSDs connected to an old RAID controller,
even though I used them as JBODs (oops).

Here are two new tests (using kernel 3.4-rc6); it would be great if
someone could tell me whether they're fine or bad.

New tests with all 3 SSDs connected to the mainboard.

#~ rados -p rbd bench 60 write
Total time run: 60.342419
Total writes made: 2021
Write size: 4194304
Bandwidth (MB/sec): 133.969

Average Latency: 0.477476
Max latency: 0.942029
Min latency: 0.109467

#~ rados -p rbd bench 60 write -b 4096
Total time run: 60.726326
Total writes made: 59026
Write size: 4096
Bandwidth (MB/sec): 3.797

Average Latency: 0.016459
Max latency: 0.874841
Min latency: 0.002392

Another test with only the OSD data on the disk and the journal in memory / tmpfs:
#~ rados -p rbd bench 60 write
Total time run: 60.513240
Total writes made: 2555
Write size: 4194304
Bandwidth (MB/sec): 168.889

Average Latency: 0.378775
Max latency: 4.59233
Min latency: 0.055179

#~ rados -p rbd bench 60 write -b 4096
Total time run: 60.116260
Total writes made: 281903
Write size: 4096
Bandwidth (MB/sec): 18.318

Average Latency: 0.00341067
Max latency: 0.720486
Min latency: 0.000602

Another problem I have is that I'm always getting:
"2012-05-10 15:05:22.140027 mon.0 192.168.0.100:6789/0 19 : [WRN]
message from mon.2 was stamped 0.109244s in the future, clocks not
synchronized"

even though ntp is running fine on all systems.
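I guess I could raise the monitors' allowed clock drift so the warning goes
away; if I read the docs right the knob looks something like this in ceph.conf
(value in seconds, the default is around 0.05):

  [mon]
      mon clock drift allowed = 0.2

But I'd rather fix the actual skew than hide the warning.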

Stefan
Calvin Morrow
2012-05-10 18:24:16 UTC
Permalink
I was getting roughly the same results as your tmpfs test using
spinning disks for OSDs with a 160GB Intel 320 SSD being used for the
journal. Theoretically the 520 SSD should give better performance
than my 320s.

Keep in mind that even with balance-alb, multiple GigE connections
will only be used if there are multiple TCP sessions being used by
Ceph.

You don't mention it in your email, but if you're using kernel 3.4+
you'll want to make sure you create your btrfs filesystem using the
large node & leaf size (Big Metadata - I've heard recommendations of
32k instead of the default 4k) so your performance doesn't degrade over
time.
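If I remember the mkfs options right, that would be something along these
lines (the device path is a placeholder):

  mkfs.btrfs -l 32768 -n 32768 /dev/<osd-disk>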

I'm curious what speed you're getting from dd in a streaming write.
You might try running a "dd if=/dev/zero of=<intel ssd partition>
bs=128k count=something" to see what the SSD will spit out without
Ceph in the picture.
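For example (the target partition is a placeholder; conv=fdatasync makes dd
flush to the device before reporting a number, so you're not just measuring
the page cache):

  dd if=/dev/zero of=/dev/<ssd-partition> bs=128k count=8192 conv=fdatasync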

Calvin

Stefan Priebe - Profihost AG
2012-05-10 13:23:40 UTC
Permalink
Hi,

the "Designing a cluster guide"
http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
still leaves some questions unanswered.

It mentions for example "Fast CPU" for the mds system. What does fast
mean? Just the speed of one core? Or is ceph designed to use multiple cores?
Which is more important, multiple cores or more clock speed?

The Cluster Design Recommendations mention separating all daemons onto
dedicated machines. Is this also useful for the MONs? As they're so
lightweight, why not run them on the OSD nodes?

Regarding the OSDs: is it fine to use an SSD RAID 1 for the journal and
perhaps 22x SATA disks in a RAID 10 for the FS, or is this quite absurd
and you should go for 22x SSD disks in a RAID 6? Is it more useful to
use a RAID 6 HW controller or btrfs RAID?

Should we use single-socket Xeons for the OSDs or dual-socket?

Thanks and greets
Stefan
Gregory Farnum
2012-05-17 21:27:26 UTC
Permalink
Sorry this got left for so long...

On Thu, May 10, 2012 at 6:23 AM, Stefan Priebe - Profihost AG
Post by Stefan Priebe - Profihost AG
Hi,
The "Designing a cluster guide"
(http://wiki.ceph.com/wiki/Designing_a_cluster) is pretty good, but it
still leaves some questions unanswered.
It mentions for example "Fast CPU" for the mds system. What does fast
mean? Just the speed of one core? Or is ceph designed to use multiple cores?
Which is more important, multiple cores or more clock speed?
Right now, it's primarily the speed of a single core. The MDS is
highly threaded but doing most things requires grabbing a big lock.
How fast is a qualitative rather than quantitative assessment at this
point, though.
Post by Stefan Priebe - Profihost AG
The Cluster Design Recommendations mention separating all daemons onto
dedicated machines. Is this also useful for the MONs? As they're so
lightweight, why not run them on the OSD nodes?
It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will add
*some* load. More important is their disk access patterns: they have
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel & glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)
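(A quick way to check what a given box has:

  uname -r
  getconf GNU_LIBC_VERSION

If I remember right the syncfs() syscall showed up around Linux 2.6.39 and
the glibc wrapper around glibc 2.14, but check the syncfs(2) man page rather
than trusting my memory.)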
Post by Stefan Priebe - Profihost AG
Regarding the OSDs: is it fine to use an SSD RAID 1 for the journal and
perhaps 22x SATA disks in a RAID 10 for the FS, or is this quite absurd
and you should go for 22x SSD disks in a RAID 6?
You'll need to do your own failure calculations on this one, I'm
afraid. Just take note that you'll presumably be limited to the speed
of your journaling device here.
Given that Ceph is going to be doing its own replication, though, I
wouldn't want to add in another whole layer of replication with raid10:
do you really want to multiply your storage requirements by another
factor of two?
Post by Stefan Priebe - Profihost AG
Is it more useful to use a RAID 6 HW controller or btrfs RAID?
I would use the hardware controller over btrfs raid for now; it allows
more flexibility in eg switching to xfs. :)
Post by Stefan Priebe - Profihost AG
Should we use single-socket Xeons for the OSDs or dual-socket?
Dual socket servers will be overkill given the setup you're
describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
daemon. You might consider it if you decided you wanted to do an OSD
per disk instead (that's a more common configuration, but it requires
more CPU and RAM per disk and we don't know yet which is the better
choice).
-Greg
Stefan Priebe
2012-05-19 08:37:01 UTC
Permalink
Hi Greg,
Post by Gregory Farnum
It mentions for example "Fast CPU" for the mds system. What does fast
mean? Just the speed of one core? Or is ceph designed to use multiple cores?
Which is more important, multiple cores or more clock speed?
Right now, it's primarily the speed of a single core. The MDS is
highly threaded but doing most things requires grabbing a big lock.
How fast is a qualitative rather than quantitative assessment at this
point, though.
So would you recommend a fast (higher GHz) Core i3 instead of a single
Xeon for this system? (The price per GHz is better.)
Post by Gregory Farnum
It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will add
*some* load. More important is their disk access patterns: they have
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel & glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)
Which kernel and which glibc version support this? I have searched
Google but haven't found an exact version. We're using Debian lenny
squeeze with a custom kernel.
Post by Gregory Farnum
Regarding the OSDs: is it fine to use an SSD RAID 1 for the journal and
perhaps 22x SATA disks in a RAID 10 for the FS, or is this quite absurd
and you should go for 22x SSD disks in a RAID 6?
You'll need to do your own failure calculations on this one, I'm
afraid. Just take note that you'll presumably be limited to the speed
of your journaling device here.
Yeah, that's why I wanted to use a RAID 1 of SSDs for the journaling. Or
is this still too slow? Another idea was to use only a ramdisk for the
journal, back the files up to disk while shutting down, and restore
them after boot.
Post by Gregory Farnum
Given that Ceph is going to be doing its own replication, though, I
wouldn't want to add in another whole layer of replication with raid10:
do you really want to multiply your storage requirements by another
factor of two?
OK, correct, bad idea.
Post by Gregory Farnum
Is it more useful to use a RAID 6 HW controller or btrfs RAID?
I would use the hardware controller over btrfs raid for now; it allows
more flexibility in eg switching to xfs. :)
OK, but overall you would recommend running one OSD per disk, right? So
instead of using a RAID 6 with for example 10 disks you would run 6 OSDs
on this machine?
Post by Gregory Farnum
Should we use single-socket Xeons for the OSDs or dual-socket?
Dual socket servers will be overkill given the setup you're
describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
daemon. You might consider it if you decided you wanted to do an OSD
per disk instead (that's a more common configuration, but it requires
more CPU and RAM per disk and we don't know yet which is the better
choice).
Is there also a rule of thumb for the memory?

My biggest problem with ceph right now is the awfully slow speed while
doing random reads and writes.

Sequential reads and writes are at 200 MB/s (that's pretty good for bonded
dual Gbit/s). But random reads and writes are only at 0.8 - 1.5 MB/s,
which is definitely too slow.

Stefan
Alexandre DERUMIER
2012-05-19 16:15:22 UTC
Permalink
Hi,

For your journal, if you have money, you can use the STEC ZeusRAM SSD
drive (around 2000€ / 8GB / 100,000 IOPS read/write with 4k blocks).
I'm using them with a ZFS SAN; they rock for journals.
http://www.stec-inc.com/product/zeusram.php

Another interesting product is the DDRdrive:
http://www.ddrdrive.com/

--
Alexandre Derumier
Systems Engineer
Phone: 03 20 68 88 90
Fax: 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France
Stefan Priebe
2012-05-20 07:56:21 UTC
Permalink
Hi,
Post by Alexandre DERUMIER
For your journal, if you have money, you can use the STEC ZeusRAM SSD
drive (around 2000€ / 8GB / 100,000 IOPS read/write with 4k blocks).
I'm using them with a ZFS SAN; they rock for journals.
http://www.stec-inc.com/product/zeusram.php
Another interesting product is the DDRdrive:
http://www.ddrdrive.com/
Great products, but really expensive. The question is whether we really need
this in the case of an rbd block device.

Stefan
Alexandre DERUMIER
2012-05-20 08:13:03 UTC
Permalink
I think that depends on how much random write IO you have and the
acceptable latency you need.

(As the purpose of the journal is to take random IO and then flush it
sequentially to the slow storage.)

Maybe some slower SSDs will fill your needs.
(Just be careful of performance degradation over time, trim, ....)




Christian Brunner
2012-05-20 08:19:33 UTC
Permalink
Hi,
For your journal, if you have money, you can use the STEC ZeusRAM SSD
drive (around 2000€ / 8GB / 100,000 IOPS read/write with 4k blocks).
I'm using them with a ZFS SAN; they rock for journals.
http://www.stec-inc.com/product/zeusram.php
Another interesting product is the DDRdrive:
http://www.ddrdrive.com/
Great products, but really expensive. The question is whether we really need
this in the case of an rbd block device.
I think it depends on what you are planning to do. I was calculating
different storage types for our cloud solution lately. I think that
there are three different types that make sense (at least for us):

- Cheap Object Storage (S3):

Many 3.5" SATA drives for the storage (probably in a RAID config)
A small and cheap SSD for the journal

- Basic Block Storage (RBD):

Many 2.5" SATA drives for the storage (RAID10 and/or multiple OSDs)
Small MaxIOPS SSDs for each OSD journal

- High performance Block Storage (RBD):

Many large SATA SSDs for the storage (probably in a RAID5 config)
A STEC ZeusRAM SSD drive for the journal

Regards,
Christian
Stefan Priebe
2012-05-20 08:27:10 UTC
Permalink
Post by Christian Brunner
Many 3,5'' SATA Drives for the storage (probably in a RAID config)
A small and cheap SSD for the journal
Many 2,5'' SATA Drives for the storage (RAID10 and/or mutliple OSDs)
Small MaxIOPS SSDs for each OSD journal
- High performance Block Storage (RBD)
Many large SATA SSDs for the storage (prbably in a RAID5 config)
stec zeusram ssd drive for the journal
That's exactly what I thought too, but then you need a separate ceph /
rbd cluster for each type.

Which will result in a minimum of:
3x mon servers per type
4x osd servers per type
---

So you'll need a minimum of 12x osd systems and 9x mon systems.

Regards,
Stefan
Christian Brunner
2012-05-20 08:31:01 UTC
Permalink
Many 3.5" SATA drives for the storage (probably in a RAID config)
A small and cheap SSD for the journal
Many 2.5" SATA drives for the storage (RAID10 and/or multiple OSDs)
Small MaxIOPS SSDs for each OSD journal
- High performance Block Storage (RBD)
Many large SATA SSDs for the storage (probably in a RAID5 config)
A STEC ZeusRAM SSD drive for the journal
That's exactly what I thought too, but then you need a separate ceph / rbd
cluster for each type.
3x mon servers per type
4x osd servers per type
---
so you'll need a minimum of 12x osd systems and 9x mon systems.
You can arrange the storage types in different pools, so that you
don't need separate mon servers (this can be done by adjusting the
crushmap) and you could even run multiple OSDs per server.

Christian
Stefan Priebe - Profihost AG
2012-05-21 08:22:29 UTC
Permalink
Post by Christian Brunner
That's exactly what i thought too but then you need a seperate ceph / rbd
cluster for each type.
3x mon servers per type
4x osd servers per type
---
so you'll need a minimum of 12x osd systems and 9x mon systems.
You can arrange the storage types in different pools, so that you
don't need separate mon servers (this can be done by adjusting the
crushmap) and you could even run multiple OSDs per server.
That sounds great. Can you give me a hint on how to set up pools? Right now
I have data, metadata and rbd => the default pools. But I wasn't able to
find any page in the wiki which describes how to set up pools.

Thanks,
Stefan
Christian Brunner
2012-05-21 15:03:26 UTC
Permalink
Post by Stefan Priebe - Profihost AG
Post by Christian Brunner
That's exactly what i thought too but then you need a seperate ceph / rbd
cluster for each type.
3x mon servers per type
4x osd servers per type
---
so you'll need a minimum of 12x osd systems and 9x mon systems.
You can arrange the storage types in different pools, so that you
don't need separate mon servers (this can be done by adjusting the
crushmap) and you could even run multiple OSDs per server.
That sounds great. Can you give me a hint how to setup pools? Right now
i have data, metadata and rbd => the default pools. But i wasn't able to
find any page in the wiki which described how to setup pools.
rados mkpool <pool-name> [123 [4]]    create pool <pool-name>
                                      [with auid 123 [and using crush rule 4]]
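For example (the pool name is just an example; to actually pin a pool to the
SSD OSDs you would also add a matching rule to your crushmap and point the
pool at it):

  rados mkpool ssd-rbd
  rados lspools
  ceph osd dump | grep pool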

Christian
Tim O'Donovan
2012-05-20 08:56:46 UTC
Permalink
Post by Christian Brunner
- High performance Block Storage (RBD)
Many large SATA SSDs for the storage (prbably in a RAID5 config)
stec zeusram ssd drive for the journal
How do you think standard SATA disks would perform in comparison to
this, and is a separate journaling device really necessary?

Perhaps three servers, each with 12 x 1TB SATA disks configured in
RAID10, an osd on each server and three separate mon servers.

Would this be suitable as the storage backend for a small OpenStack
cloud, performance-wise, for instance?


Regards,
Tim O'Donovan
Stefan Priebe
2012-05-20 09:24:49 UTC
Permalink
Post by Tim O'Donovan
Post by Christian Brunner
- High performance Block Storage (RBD)
Many large SATA SSDs for the storage (prbably in a RAID5 config)
stec zeusram ssd drive for the journal
How do you think standard SATA disks would perform in comparison to
this, and is a separate journaling device really necessary?
Perhaps three servers, each with 12 x 1TB SATA disks configured in
RAID10, an osd on each server and three separate mon servers.
He's talking about SSDs, not normal SATA disks.

Stefan
Tim O'Donovan
2012-05-20 09:46:18 UTC
Permalink
Post by Stefan Priebe
He's talking about ssd's not normal sata disks.
I realise that. I'm looking for similar advice and have been following
this thread. It didn't seem off topic to ask here.


Regards,
Tim O'Donovan
Stefan Priebe
2012-05-20 09:49:07 UTC
Permalink
No, sorry, I just wanted to clarify since you quoted the SSD part.

Stefan
Post by Tim O'Donovan
Post by Stefan Priebe
He's talking about ssd's not normal sata disks.
I realise that. I'm looking for similar advice and have been following
this thread. It didn't seem off topic to ask here.
Regards,
Tim O'Donovan
Christian Brunner
2012-05-21 14:59:47 UTC
Permalink
Post by Tim O'Donovan
Post by Christian Brunner
- High performance Block Storage (RBD)
Many large SATA SSDs for the storage (probably in a RAID5 config)
A STEC ZeusRAM SSD drive for the journal
How do you think standard SATA disks would perform in comparison to
this, and is a separate journaling device really necessary?
A journaling device improves write latency a lot, and write latency
is directly related to the throughput you get in your virtual
machine. If you have a RAID controller with a battery-backed write
cache, you could try to put the journal on a separate, small partition
of your SATA disk. I haven't tried this, but I think this could work.
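In ceph.conf that would roughly look like this per OSD (the hostname and
device path below are placeholders):

  [osd.0]
      host = node-a
      osd journal = /dev/sda2    # small partition on the SATA disk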

Apart from that, you should calculate the sum of the IOPS your guests
generate. In the end everything has to be written to your backend
storage, and it has to be able to deliver those IOPS.

With the journal you might be able to compensate for short write peaks,
and there might be a gain from merging write requests on the OSDs, but for a
solid sizing I would neglect this. Read requests can be delivered from
the OSDs' cache (RAM), but again this will probably give you only a
small gain.

For a single SATA disk you can calculate with 100-150 IOPS (depending
on the speed of the disk). SSDs can deliver much higher IOPS values.
Post by Tim O'Donovan
Perhaps three servers, each with 12 x 1TB SATA disks configured in
RAID10, an osd on each server and three separate mon servers.
With a replication level of two this would be 1350 IOPS:

150 IOPS per disk * 12 disks * 3 servers / 2 for the RAID10 / 2 for
ceph replication

Comments on this formula would be welcome...
Post by Tim O'Donovan
Would this be suitable for the storage backend for a small OpenStack
cloud, performance wise, for instance?
That depends on what you are doing in your guests.

Regards,
Christian
Stefan Priebe - Profihost AG
2012-05-21 15:05:03 UTC
Permalink
Post by Christian Brunner
Apart from that you should calculate the sum of the IOPS your guests
genereate. In the end everything has to be written on your backend
storage and is has to be able to deliver the IOPS.
How do you measure the IOPS of an actual dedicated (physical) system?

Stefan
Tomasz Paszkowski
2012-05-21 15:12:00 UTC
Permalink
If you're using Qemu/KVM you can use the 'info blockstats' command for
measuring I/O on a particular VM.
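If the guests are managed through libvirt, something along these lines works
from the host (the domain and device names are placeholders):

  virsh qemu-monitor-command <domain> --hmp 'info blockstats'
  virsh domblkstat <domain> vda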


--
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299
Tomasz Paszkowski
2012-05-21 15:36:12 UTC
Permalink
The project is indeed very interesting, but it requires patching the kernel
source. For me, using an LKM is safer ;)
Post by Kiran Patil
Hello,
Has someone looked into bcache (http://bcache.evilpiepirate.org/)?
It seems it is superior to flashcache.
LWN.net article: https://lwn.net/Articles/497024/
Mailing list: http://news.gmane.org/gmane.linux.kernel.bcache.devel
Source code: http://evilpiepirate.org/cgi-bin/cgit.cgi/linux-bcache.git/
Thanks,
Kiran Patil.
Damien Churchill
2012-05-21 18:15:51 UTC
Permalink
Post by Tomasz Paszkowski
Project is indeed very interesting, but requires to patch a kernel
source. For me using lkm is safer ;)
I believe bcache is actually in the process of being mainlined and
moved to a device mapper target, although I could be wrong about one or
more of those things.
Stefan Priebe
2012-05-21 20:11:15 UTC
Permalink
Post by Tomasz Paszkowski
If you're using Qemu/KVM you can use 'info blockstats' command for
measruing I/O on particular VM.
I want to migrate physical servers to KVM. Any idea for that?

Stefan
Tomasz Paszkowski
2012-05-21 20:13:21 UTC
Permalink
Just to clarify: you'd like to measure I/O on those systems which are
currently running on physical machines?
Stefan Priebe
2012-05-21 20:14:33 UTC
Permalink
Post by Tomasz Paszkowski
Just to clarify. You'd like to measure I/O on those system which are
currently running on physical machines ?
IOPS, not just I/O.

Stefan
Tomasz Paszkowski
2012-05-21 20:19:30 UTC
Permalink
On Linux boxes you may use the output from iostat -x /dev/sda and connect
it to any monitoring system like Zabbix or Cacti :-)
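For example, sampling every 5 seconds (the r/s and w/s columns are the read
and write IOPS; iostat comes from the sysstat package, and the device path
is a placeholder):

  iostat -x /dev/sda 5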
Post by Stefan Priebe
Post by Tomasz Paszkowski
Just to clarify. You'd like to measure I/O on those system which are
currently running on physical machines ?
IOPs not just I/O.
Stefan
--
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299
Tomasz Paszkowski
2012-05-21 15:07:25 UTC
Permalink
Another great thing that should be mentioned is:
https://github.com/facebook/flashcache/. It gives really huge
performance improvements for reads/writes (especially on FusionIO
drives) even without using librbd caching :-)
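For anyone curious, creating a write-back cache device looks roughly like
this (the cache name and device paths are placeholders; the resulting
/dev/mapper/osd0cache is what you would then put the OSD filesystem on):

  flashcache_create -p back osd0cache /dev/<ssd> /dev/<spinning-disk>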
--
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299
Sławomir Skowron
2012-05-21 21:22:14 UTC
Permalink
Maybe two cheap MLC Intel drives on SandForce (320/520), 120GB or 240GB,
with the HPA changed to 20-30GB, would be good for the journal, used only
for separate journaling partitions with hardware RAID1.

I'd like to test a setup like this, but maybe someone has some real-life info??
--
-----
Regards,

Sławek "sZiBis" Skowron
Quenten Grasso
2012-05-21 23:52:31 UTC
Permalink
Hi All,


I've been thinking about this issue myself for the past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146GB disks
in RAID 10 inside a 2U server (for the journals), with JBODs attached to the server for the actual storage.

Can someone help clarify this one:

Once the data is written to the journal disk, then read from the journal disk and written to the storage disk, and this is complete, is it considered a successful write by the client?
Or
Once the data is written to the journal disk, is this considered successful by the client?
Or
Once the data is written to the journal disk and written to the storage disk at the same time, and once this is complete, is it considered a successful write by the client? (If this is the case, SSDs may not be so useful.)


Pros
Quite fast write throughput to the journal disks
No write wear-out of SSDs
RAID 10 with a 1GB cache controller also helps improve things (if really keen you could use CacheCade as well)


Cons
Not as fast as SSDs
More rackspace required per server.


Regards,
Quenten

��칻�&�~�&���+-��ݶ��w��˛���m��^��b��^n�r���z���h�����&���G���h�
Gregory Farnum
2012-05-22 00:30:14 UTC
Permalink
Post by Quenten Grasso
Hi All,
I've been thinking about this issue myself past few days, and an idea
I've come up with is running 16 x 2.5" 15K 72/146GB Disks,
in raid 10 inside a 2U Server with JBOD's attached to the server for
actual storage.
Can someone help clarify this one,
Once the data is written to the (journal disk) and then read from the
(journal disk) then written to the (storage disk) once this is complete
this is considered a successful write by the client?
Or
Once the data is written to the (journal disk) is this considered
successful by the client?
This one — the write is considered "safe" once it is on-disk on all
OSDs currently responsible for hosting the object.

Every time anybody mentions RAID10 I have to remind them of the
storage amplification that entails, though. Are you sure you want that
on top of (well, underneath, really) Ceph's own replication?
Post by Quenten Grasso
Or
Once the data is written to the (journal disk) and written to the (storage
disk) at the same time, once complete this is considered a successful write
by the client? (if this is the case SSD's may not be so useful)
Pros
Quite fast Write throughput to the journal disks,
No write wareout of SSD's
RAID 10 with 1GB Cache Controller also helps improve things (if really
keen you could use a cachecade as well)
Cons
Not as fast as SSD's
More rackspace required per server.
Regards,
Quenten
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Quenten Grasso
2012-05-22 00:42:52 UTC
Permalink
Hi Greg,

I'm only talking about journal disks not storage. :)



Regards,
Quenten


-----Original Message-----
From: ceph-devel-***@vger.kernel.org [mailto:ceph-devel-***@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Tuesday, 22 May 2012 10:30 AM
To: Quenten Grasso
Cc: ceph-***@vger.kernel.org
Subject: Re: Designing a cluster guide
Post by Quenten Grasso
Hi All,
I've been thinking about this issue myself past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146GB Disks,
in raid 10 inside a 2U Server with JBOD's attached to the server for actual storage.
Can someone help clarify this one,
Once the data is written to the (journal disk) and then read from the (journal disk) then written to the (storage disk) once this is complete this is considered a successful write by the client?
Or
Once the data is written to the (journal disk) is this considered successful by the client?
This one — the write is considered "safe" once it is on-disk on all
OSDs currently responsible for hosting the object.

Every time anybody mentions RAID10 I have to remind them of the
storage amplification that entails, though. Are you sure you want that
on top of (well, underneath, really) Ceph's own replication?
Post by Quenten Grasso
Or
Once the data is written to the (journal disk) and written to the (storage disk) at the same time, once complete this is considered a successful write by the client? (if this is the case SSD's may not be so useful)
Pros
Quite fast Write throughput to the journal disks,
No write wareout of SSD's
RAID 10 with 1GB Cache Controller also helps improve things (if really keen you could use a cachecade as well)
Cons
Not as fast as SSD's
More rackspace required per server.
Regards,
Quenten
-----Original Message-----
Sent: Tuesday, 22 May 2012 7:22 AM
Cc: Tomasz Paszkowski
Subject: Re: Designing a cluster guide
Maybe good for journal will be two cheap MLC Intel drives on Sandforce
(320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
separate journaling partitions with hardware RAID1.
I like to test setup like this, but maybe someone have any real life info ??
Post by Tomasz Paszkowski
https://github.com/facebook/flashcache/. It gives really huge
performance improvements for reads/writes (especialy on FunsionIO
drives) event without using librbd caching :-)
Hi,
For your journal , if you have money, you can use
stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block).
I'm using them with zfs san, they rocks for journal.
http://www.stec-inc.com/product/zeusram.php
another interessesting product is ddrdrive
http://www.ddrdrive.com/
----- Mail original -----
Envoyé: Samedi 19 Mai 2012 10:37:01
Objet: Re: Designing a cluster guide
Hi Greg,
Post by Gregory Farnum
Post by Stefan Priebe - Profihost AG
It mentions for example "Fast CPU" for the mds system. What does fast
mean? Just the speed of one core? Or is ceph designed to use multi core?
Is multi core or more speed important?
Right now, it's primarily the speed of a single core. The MDS is
highly threaded but doing most things requires grabbing a big lock.
How fast is a qualitative rather than quantitative assessment at this
point, though.
So would you recommand a fast (more ghz) Core i3 instead of a single
xeon for this system? (price per ghz is better).
Post by Gregory Farnum
It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will add
*some* load. More important is their disk access patterns — they have
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel&glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)
Which kernel and which glibc version supports this? I have searched
google but haven't found an exact version. We're using debian lenny
squeeze with a custom kernel.
Post by Gregory Farnum
Post by Stefan Priebe - Profihost AG
Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
and you should go for 22x SSD Disks in a Raid 6?
You'll need to do your own failure calculations on this one, I'm
afraid. Just take note that you'll presumably be limited to the speed
of your journaling device here.
Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or
is this still too slow? Another idea was to use only a ramdisk for the
journal and backup the files while shutting down to disk and restore
them after boot.
Post by Gregory Farnum
Given that Ceph is going to be doing its own replication, though, I
wouldn't want to add in another whole layer of replication with raid10
— do you really want to multiply your storage requirements by another
factor of two?
OK correct bad idea.
Post by Gregory Farnum
Post by Stefan Priebe - Profihost AG
Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
I would use the hardware controller over btrfs raid for now; it allows
more flexibility in eg switching to xfs. :)
OK but overall you would recommand running one osd per disk right? So
instead of using a Raid 6 with for example 10 disks you would run 6 osds
on this machine?
Post by Gregory Farnum
Post by Stefan Priebe - Profihost AG
Use single socket Xeon for the OSDs or Dual Socket?
Dual socket servers will be overkill given the setup you're
describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
daemon. You might consider it if you decided you wanted to do an OSD
per disk instead (that's a more common configuration, but it requires
more CPU and RAM per disk and we don't know yet which is the better
choice).
Is there also a rule of thumb for the memory?
My biggest problem with ceph right now is the awful slow speed while
doing random reads and writes.
Sequential read and writes are at 200Mb/s (that's pretty good for bonded
dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s
which is def. too slow.
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
--
       Alexandre D erumier
Ingénieur Système
Fixe : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
-----
Pozdrawiam
Sławek "sZiBis" Skowron
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Quenten Grasso
2012-05-22 00:46:50 UTC
Permalink
I should have added: for storage I'm considering something like enterprise nearline SAS 3TB disks, run as individual disks (not RAIDed) with a rep level of 2, as suggested :)


Regards,
Quenten


-----Original Message-----
From: ceph-devel-***@vger.kernel.org [mailto:ceph-devel-***@vger.kernel.org] On Behalf Of Quenten Grasso
Sent: Tuesday, 22 May 2012 10:43 AM
To: 'Gregory Farnum'
Cc: ceph-***@vger.kernel.org
Subject: RE: Designing a cluster guide

Hi Greg,

I'm only talking about journal disks not storage. :)



Regards,
Quenten


-----Original Message-----
From: ceph-devel-***@vger.kernel.org [mailto:ceph-devel-***@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Tuesday, 22 May 2012 10:30 AM
To: Quenten Grasso
Cc: ceph-***@vger.kernel.org
Subject: Re: Designing a cluster guide
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Sławomir Skowron
2012-05-22 05:51:42 UTC
Permalink
I get around 320MB/s of rbd performance on a VM from a 3-node cluster,
but that is with 10GE and 26 2.5" SAS drives in every machine, so it's
not everything the hardware could do.
Every OSD drive is a single-drive raid0 behind the battery-backed nvram
cache of the hardware raid controller.
Every OSD takes a lot of RAM for caching.

That's why I'm thinking about changing 2 drives to SSDs in raid1, with
the HPA tuned down to increase the drives' durability for journaling -
but only if this will work ;)

The newest drives can theoretically reach 500MB/s with a long queue
depth. This means that in theory I can improve the bandwidth score, get
lower latency, and handle multiple IO writes from many hosts better.
Reads are cached in RAM by the OSD daemon, in the kernel VFS, in the
nvram of the controller, and in the near future also by the cache in
kvm (I need to test that - it should improve performance).

But if the SSD drive slows down, it can drag the whole write performance
down with it. It is very delicate.

Regards,

iSS
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Quenten Grasso
2012-05-29 07:25:14 UTC
Permalink
Interesting. I've been thinking about this and I think most Ceph installations could benefit from more nodes and fewer disks per node.

For example:

We have a replica level of 2 and an RBD block size of 4MB. You start writing a 10GB file; this is effectively divided into 4MB chunks.

The first chunk goes to node 1 and node 2 (at the same time, I assume), where it is written to a journal and then replayed to the data file system.

The second chunk might be sent to nodes 2 and 3 at the same time, again written to a journal and then replayed (we now have overlap with chunk 1).

The third chunk might be sent to nodes 1 and 3 (more overlap with chunks 1 and 2), and as you can see this quickly becomes an issue.

So if we have 10 nodes vs. 3 nodes with the same number of disks, we should see better write and read performance as you would have less "overlap".

Now take BTRFS into the picture: as I understand it, journals are not necessary due to the way it writes/snapshots and reads data, so this alone would be a major performance increase on a BTRFS Raid level (like ZFS RAIDZ).

Side note: this may sound crazy, but the more I read about SSD's the less I wish to use/rely on them, and RAM SSD's are crazily priced imo. =)

Regards,
Quenten


-----Original Message-----
From: ceph-devel-***@vger.kernel.org [mailto:ceph-devel-***@vger.kernel.org] On Behalf Of Slawomir Skowron
Sent: Tuesday, 22 May 2012 3:52 PM
To: Quenten Grasso
Cc: Gregory Farnum; ceph-***@vger.kernel.org
Subject: Re: Designing a cluster guide

I have some performance from rbd cluster near 320MB/s on VM from 3
node cluster, but with 10GE, and with 26 2.5" SAS drives used on every
machine it's not everything that can be.
Every osd drive is raid0 with one drive via battery cached nvram in
hardware raid ctrl.
Every osd take much ram for caching.

That's why i'am thinking about to change 2 drives for SSD in raid1
with hpa tuned for increase durability of drive for journaling - but
if this will work ;)

With newest drives can theoreticaly get 500MB/s with a long queue
depth. This means that i can in theory improve bandwith score, and
take lower latency, and better handling of multiple IO writes, from
many hosts.
Reads are cached in ram from OSD daemon, VFS in kernel, nvram in ctrl,
and in near future improve from cache in kvm (i need to test that -
this will improve performance)

But if SSD drive goes slower, it can get whole performance down in
writes. It's is very delicate.

Pozdrawiam

iSS
Post by Quenten Grasso
I Should have added For storage I'm considering something like Enterprise nearline SAS 3TB disks running individual disks not raided with rep level of 2 as suggested :)
Regards,
Quenten
-----Original Message-----
Sent: Tuesday, 22 May 2012 10:43 AM
To: 'Gregory Farnum'
Subject: RE: Designing a cluster guide
Hi Greg,
I'm only talking about journal disks not storage. :)
Regards,
Quenten
-----Original Message-----
Sent: Tuesday, 22 May 2012 10:30 AM
To: Quenten Grasso
Subject: Re: Designing a cluster guide
Post by Quenten Grasso
Hi All,
I've been thinking about this issue myself past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146GB Disks,
in raid 10 inside a 2U Server with JBOD's attached to the server for actual storage.
Can someone help clarify this one,
Once the data is written to the (journal disk) and then read from the (journal disk) then written to the (storage disk) once this is complete this is considered a successful write by the client?
Or
Once the data is written to the (journal disk) is this considered successful by the client?
This one — the write is considered "safe" once it is on-disk on all
OSDs currently responsible for hosting the object.
Every time anybody mentions RAID10 I have to remind them of the
storage amplification that entails, though. Are you sure you want that
on top of (well, underneath, really) Ceph's own replication?
Post by Quenten Grasso
Or
Once the data is written to the (journal disk) and written to the (storage disk) at the same time, once complete this is considered a successful write by the client? (if this is the case SSD's may not be so useful)
Pros
Quite fast Write throughput to the journal disks,
No write wareout of SSD's
RAID 10 with 1GB Cache Controller also helps improve things (if really keen you could use a cachecade as well)
Cons
Not as fast as SSD's
More rackspace required per server.
Regards,
Quenten
-----Original Message-----
Sent: Tuesday, 22 May 2012 7:22 AM
Cc: Tomasz Paszkowski
Subject: Re: Designing a cluster guide
Maybe good for journal will be two cheap MLC Intel drives on Sandforce
(320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
separate journaling partitions with hardware RAID1.
I like to test setup like this, but maybe someone have any real life info ??
Post by Tomasz Paszkowski
https://github.com/facebook/flashcache/. It gives really huge
performance improvements for reads/writes (especialy on FunsionIO
drives) event without using librbd caching :-)
Hi,
For your journal , if you have money, you can use
stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block).
I'm using them with zfs san, they rocks for journal.
http://www.stec-inc.com/product/zeusram.php
another interessesting product is ddrdrive
http://www.ddrdrive.com/
----- Mail original -----
Envoyé: Samedi 19 Mai 2012 10:37:01
Objet: Re: Designing a cluster guide
Hi Greg,
Post by Gregory Farnum
Post by Stefan Priebe - Profihost AG
It mentions for example "Fast CPU" for the mds system. What does fast
mean? Just the speed of one core? Or is ceph designed to use multi core?
Is multi core or more speed important?
Right now, it's primarily the speed of a single core. The MDS is
highly threaded but doing most things requires grabbing a big lock.
How fast is a qualitative rather than quantitative assessment at this
point, though.
So would you recommand a fast (more ghz) Core i3 instead of a single
xeon for this system? (price per ghz is better).
Post by Gregory Farnum
It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will add
*some* load. More important is their disk access patterns — they have
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel&glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)
Which kernel and which glibc version supports this? I have searched
google but haven't found an exact version. We're using debian lenny
squeeze with a custom kernel.
Post by Gregory Farnum
Post by Stefan Priebe - Profihost AG
Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
and you should go for 22x SSD Disks in a Raid 6?
You'll need to do your own failure calculations on this one, I'm
afraid. Just take note that you'll presumably be limited to the speed
of your journaling device here.
Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or
is this still too slow? Another idea was to use only a ramdisk for the
journal and backup the files while shutting down to disk and restore
them after boot.
Post by Gregory Farnum
Given that Ceph is going to be doing its own replication, though, I
wouldn't want to add in another whole layer of replication with raid10
— do you really want to multiply your storage requirements by another
factor of two?
OK correct bad idea.
Post by Gregory Farnum
Post by Stefan Priebe - Profihost AG
Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
I would use the hardware controller over btrfs raid for now; it allows
more flexibility in eg switching to xfs. :)
OK but overall you would recommand running one osd per disk right? So
instead of using a Raid 6 with for example 10 disks you would run 6 osds
on this machine?
Post by Gregory Farnum
Post by Stefan Priebe - Profihost AG
Use single socket Xeon for the OSDs or Dual Socket?
Dual socket servers will be overkill given the setup you're
describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
daemon. You might consider it if you decided you wanted to do an OSD
per disk instead (that's a more common configuration, but it requires
more CPU and RAM per disk and we don't know yet which is the better
choice).
Is there also a rule of thumb for the memory?
My biggest problem with ceph right now is the awful slow speed while
doing random reads and writes.
Sequential read and writes are at 200Mb/s (that's pretty good for bonded
dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s
which is def. too slow.
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
--
Alexandre D erumier
Ingénieur Système
Fixe : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
-----
Pozdrawiam
Sławek "sZiBis" Skowron
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
N�����r��y���b�X��ǧv�^�)޺{.n�+���z�]z���{ay�ʇڙ�,j
��f���h���z��w���
���j:+v���w�j�m����
����zZ+�����ݢj"��!�i
N�����r��y���b�X��ǧv�^�)޺{.n�+���z�]z�{ay�ʇڙ�,j ��f���h���z��w���
���j:+v���w�j�m���� ����zZ+��ݢj"��!�i
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Tommi Virtanen
2012-05-29 16:50:40 UTC
Permalink
Post by Quenten Grasso
So if we have 10 nodes vs. 3 nodes with the same number of disks we should see better write and read performance as you would have less "overlap".
First of all, a typical way to run Ceph is with say 8-12 disks per
node, and an OSD per disk. That means your 3-10 node clusters actually
have 24-120 OSDs on them. The number of physical machines is not
really a factor, number of OSDs is what matters.

Secondly, 10-node or 3-node clusters are fairly uninteresting for
Ceph. The real challenge is at the hundreds, thousands and above
range.
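To put rough numbers on the first point, here is a toy sketch (not CRUSH, just uniform random placement, with made-up constants: an assumed 100MB/s per disk, replica count 2, and a 2x penalty for keeping the journal on the same disk) of how the 4MB chunks of a single 10GB write spread across the OSDs and what aggregate bandwidth you might hope for:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int osds = 24;                 /* e.g. 3 nodes x 8 disks, one OSD per disk */
    const int replicas = 2;
    const int chunks = 10 * 1024 / 4;    /* a 10GB write in 4MB chunks */
    const double per_osd_mb_s = 100.0;   /* assumed per-disk streaming write speed */
    const double journal_penalty = 2.0;  /* journal + data on the same disk */

    int load[1024] = {0};
    srand(42);
    for (int c = 0; c < chunks; c++)
        for (int r = 0; r < replicas; r++)
            load[rand() % osds]++;       /* toy placement, ignores CRUSH rules */

    int busiest = 0;
    for (int i = 0; i < osds; i++)
        if (load[i] > busiest)
            busiest = load[i];

    printf("replicated chunks landing on the busiest OSD: %d of %d\n",
           busiest, chunks * replicas);
    printf("rough aggregate write bandwidth: %.0f MB/s\n",
           osds * per_osd_mb_s / (replicas * journal_penalty));
    return 0;
}

In this toy estimate, splitting the same disks across more or fewer machines changes nothing; only the OSD count, replica count and per-disk speed matter (per-node network and CPU are ignored here).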
Post by Quenten Grasso
Now take BTRFS into the picture: as I understand it, journals are not necessary due to the way it writes/snapshots and reads data, so this alone would be a major performance increase on a BTRFS Raid level (like ZFS RAIDZ).
A journal is still needed on btrfs, snapshots just enable us to write
to the journal in parallel to the real write, instead of needing to
journal first.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jerker Nyberg
2012-05-22 09:04:34 UTC
Permalink
This one — the write is considered "safe" once it is on-disk on all
OSDs currently responsible for hosting the object.
Is it possible to configure the client to consider the write successful
when the data is hitting RAM on all the OSDs but not yet committed to
disk?

Also, the IBM zFS research file system is talking about cooperative cache
and Lustre about a collaborative cache. Do you have any thoughts of this
regarding Ceph?

Regards,
Jerker Nyberg, Uppsala, Sweden.
Gregory Farnum
2012-05-23 05:31:57 UTC
Permalink
This one — the write is considered "safe" once it is on-disk on all
OSDs currently responsible for hosting the object.

Is it possible to configure the client to consider the write successful
when the data is hitting RAM on all the OSDs but not yet committed to
disk?
Direct users of the RADOS object store (i.e., librados) can do all kinds
of things with the integrity guarantee options. But I don't believe there's
currently a way to make the filesystem do so — among other things, you're
running through the page cache and other writeback caches anyway, so it
generally wouldn't be useful except when running an fsync or similar. And at
that point you probably really want to not be lying to the application
that's asking for it.
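For illustration, here is a minimal librados sketch of that distinction, using only the plain C API; the pool name, object name and ceph.conf path are placeholders and error handling is abbreviated. The first wait returns once all replicas have the write in memory, the second once they have committed it to disk.

/* build roughly with: gcc -o ack_vs_safe ack_vs_safe.c -lrados */
#include <stdio.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rados_completion_t comp;
    const char buf[] = "hello ceph";

    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, "/etc/ceph/ceph.conf") < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "could not connect to cluster\n");
        return 1;
    }
    if (rados_ioctx_create(cluster, "rbd", &io) < 0) {
        fprintf(stderr, "could not open pool\n");
        rados_shutdown(cluster);
        return 1;
    }

    /* no callbacks, we just block on the two completion stages */
    rados_aio_create_completion(NULL, NULL, NULL, &comp);
    rados_aio_write(io, "test-object", comp, buf, sizeof(buf), 0);

    rados_aio_wait_for_complete(comp);  /* acked: in memory on the replicas */
    printf("acked (in RAM on the OSDs)\n");

    rados_aio_wait_for_safe(comp);      /* safe: committed to disk */
    printf("safe (on disk on the OSDs)\n");

    rados_aio_release(comp);
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}

The filesystem clients don't currently expose that choice, which is the limitation described above.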
Also, the IBM zFS research file system is talking about cooperative cache
and Lustre about a collaborative cache. Do you have any thoughts of this
regarding Ceph?
I haven't heard of this before, but assuming I'm understanding my brief
read directly, this isn't on the current Ceph roadmap. I sort of see how
it's useful, but I think it's less useful for a system like Ceph — we're
more scale-out in terms of CPU and memory correlating with added disk
space compared to something like Lustre where the object storage (OST)
and the object handlers (OSS) are divorced, and we stripe files across
more servers than I believe Lustre tends to do.
But perhaps I'm missing something — do you have a use case on Ceph?
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jerker Nyberg
2012-05-23 19:47:03 UTC
Permalink
Direct users of the RADOS object store (i.e., librados) can do all kinds
of things with the integrity guarantee options. But I don't believe
there's currently a way to make the filesystem do so — among other
things, you're running through the page cache and other writeback caches
anyway, so it generally wouldn't be useful except when running an fsync
or similar. And at that point you probably really want to not be lying
to the application that's asking for it.
I am comparing with in-memory databases. If replication and failovers are
used, couldn't in-memory in some cases be good enough? And faster.
do you have a use case on Ceph?
Currently of interest:

* Scratch file system for HPC. (kernel client)
* Scratch file system for research groups. (SMB, NFS, SSH)
* Backend for simple disk backup. (SSH/rsync, AFP, BackupPC)
* Metropolitan cluster.
* VDI backend. KVM with RBD.

Regards,
Jerker Nyberg, Uppsala, Sweden.
Gregory Farnum
2012-05-23 21:47:24 UTC
Permalink
Post by Jerker Nyberg
I am comparing with in-memory databases. If replication and failovers are
used, couldn't in-memory in some cases be good enough? And faster.
do you have a use case on Ceph?
 * Scratch file system for HPC. (kernel client)
 * Scratch file system for research groups. (SMB, NFS, SSH)
 * Backend for simple disk backup. (SSH/rsync, AFP, BackupPC)
 * Metropolitan cluster.
 * VDI backend. KVM with RBD.
Hmm. Sounds to me like scratch filesystems would get a lot out of not
having to hit disk on the commit, but not much out of having separate
caching locations versus just letting the OSD page cache handle it. :)
The others, I don't really see collaborative caching helping much either.

So basically it sounds like you want to be able to toggle off Ceph's
data safety requirements. That would have to be done in the clients;
it wouldn't even be hard in ceph-fuse (although I'm not sure about the
kernel client). It's probably a pretty easy way to jump into the code
base.... :)
Anyway, make a bug for it in the tracker (I don't think one exists
yet, though I could be wrong) and someday when we start work on the
filesystem again we should be able to get to it. :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jerker Nyberg
2012-05-24 08:33:32 UTC
Permalink
Post by Gregory Farnum
 * Scratch file system for HPC. (kernel client)
 * Scratch file system for research groups. (SMB, NFS, SSH)
 * Backend for simple disk backup. (SSH/rsync, AFP, BackupPC)
 * Metropolitan cluster.
 * VDI backend. KVM with RBD.
Hmm. Sounds to me like scratch filesystems would get a lot out of not
having to hit disk on the commit, but not much out of having separate
caching locations versus just letting the OSD page cache handle it. :)
The others, I don't really see collaborative caching helping much either.
Oh, sorry, those were my use cases for ceph in general. Yes, scratch is
mostly of interest, but also fast backup. Currently IOPS is limiting our
backup speed on a small cluster with many files but not much data. I have
problems scanning through and backing up all changed files every night.
Currently I am backing up to ZFS, but Ceph might help with scaling up
performance and size. Another option is going for SSDs instead of
mechanical drives.
Post by Gregory Farnum
Anyway, make a bug for it in the tracker (I don't think one exists
yet, though I could be wrong) and someday when we start work on the
filesystem again we should be able to get to it. :)
Thank you for your thoughts on this. I hope to be able to do that soon.

Regards,
Jerker Nyberg, Uppsala, Sweden.
Stefan Priebe - Profihost AG
2012-05-22 06:30:05 UTC
Permalink
Post by Quenten Grasso
Maybe good for journal will be two cheap MLC Intel drives on Sandforce
(320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
separate journaling partitions with hardware RAID1.
I like to test setup like this, but maybe someone have any real life
info ??
HPA?

That was also my idea but most of the people here still claim that
they're too slow and you need something MORE powerful like.

zeus ram: http://www.stec-inc.com/product/zeusram.php
fusion io: http://www.fusionio.com/platforms/iodrive2/

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Sławomir Skowron
2012-05-22 06:59:10 UTC
Permalink
http://en.wikipedia.org/wiki/Host_protected_area

On Tue, May 22, 2012 at 8:30 AM, Stefan Priebe - Profihost AG
Post by Stefan Priebe - Profihost AG
Post by Quenten Grasso
Maybe good for journal will be two cheap MLC Intel drives on Sandforce
(320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
separate journaling partitions with hardware RAID1.
I like to test setup like this, but maybe someone have any real life
info ??
HPA?
That was also my idea but most of the people here still claim that
they're too slow and you need something MORE powerful like.
zeus ram: http://www.stec-inc.com/product/zeusram.php
fusion io: http://www.fusionio.com/platforms/iodrive2/
But with commodity hardware, or cheap servers, even mid-range
machines, the cost of a PCIe flash/RAM card is too high, even in a
small cluster.
Post by Stefan Priebe - Profihost AG
Stefan
--
-----
Regards,

Sławek "sZiBis" Skowron
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Gregory Farnum
2012-05-21 18:13:10 UTC
Permalink
Post by Stefan Priebe
Hi Greg,
Post by Gregory Farnum
It mentions for example "Fast CPU" for the mds system. What does fast
mean? Just the speed of one core? Or is ceph designed to use multi core?
Is multi core or more speed important?
Right now, it's primarily the speed of a single core. The MDS is
highly threaded but doing most things requires grabbing a big lock.
How fast is a qualitative rather than quantitative assessment at this
point, though.
So would you recommand a fast (more ghz) Core i3 instead of a single xeon
for this system? (price per ghz is better).
If that's all the MDS is doing there, probably? (It would also depend
on cache sizes and things; I don't have a good sense for how that
impacts the MDS' performance.)
Post by Stefan Priebe
Post by Gregory Farnum
It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will add
*some* load. More important is their disk access patterns: they have
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel & glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)
Which kernel and which glibc version supports this? I have searched Google
but haven't found an exact version. We're using Debian lenny/squeeze with a
custom kernel.
syncfs is in Linux 2.6.39; I'm not sure about glibc but from a quick
web search it looks like it might have appeared in glibc 2.15?
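
For anyone checking their own boxes: syncfs() needs both kernel and glibc support (the glibc wrapper landed in glibc 2.14), so a quick way to verify an existing system is:

uname -r                    # kernel must be 2.6.39 or newer
getconf GNU_LIBC_VERSION    # glibc must be 2.14 or newer for the syncfs() wrapper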
Post by Stefan Priebe
Post by Gregory Farnum
Regarding the OSDs, is it fine to use an SSD RAID 1 for the journal and
perhaps 22x SATA disks in a RAID 10 for the FS, or is this quite absurd
and you should go for 22x SSD disks in a RAID 6?
You'll need to do your own failure calculations on this one, I'm
afraid. Just take note that you'll presumably be limited to the speed
of your journaling device here.
Yeah, that's why I wanted to use a RAID 1 of SSDs for the journaling. Or is
this still too slow? Another idea was to use only a ramdisk for the journal,
back up the files to disk while shutting down, and restore them after
boot.
Well, RAID1 isn't going to make it any faster than just the single
SSD, which is why I pointed that out.
I wouldn't recommend using a ramdisk for the journal: that will
guarantee local data loss in the event the server doesn't shut down
properly, and if it happens to several servers at once you get a good
chance of losing client writes.
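
(For pure benchmark runs where losing the journal is acceptable, a tmpfs-backed journal like the one tested earlier in this thread can be set up roughly as follows; paths are made up, and this must never be used where the data matters:)

mount -t tmpfs -o size=1G tmpfs /srv/journal

; and in ceph.conf:
[osd.0]
        osd journal = /srv/journal/osd.0.journal
        osd journal size = 512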
Post by Stefan Priebe
Post by Gregory Farnum
Is it more useful to use a RAID 6 HW controller or the btrfs raid?
I would use the hardware controller over btrfs raid for now; it allows
more flexibility in e.g. switching to xfs. :)
OK, but overall you would recommend running one osd per disk, right? So
instead of using a RAID 6 with for example 10 disks you would run 6 osds on
this machine?
Right now all the production systems I'm involved in are using 1 OSD
per disk, but honestly we don't know if that's the right answer or
not. It's a tradeoff: more OSDs increase cpu and memory requirements
(per storage space) but also localize failure a bit more.
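
As a rough illustration of the 1-OSD-per-disk layout described above, a hedged ceph.conf fragment for a single host with three data disks might look like this (device paths, mount points and the host name are made up):

[osd.0]
        host = node1
        osd data = /srv/osd.0    ; /dev/sdb mounted here (btrfs or xfs)
[osd.1]
        host = node1
        osd data = /srv/osd.1    ; /dev/sdc
[osd.2]
        host = node1
        osd data = /srv/osd.2    ; /dev/sdd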
Post by Stefan Priebe
Post by Gregory Farnum
Use a single-socket Xeon for the OSDs or dual socket?
Dual socket servers will be overkill given the setup you're
describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
daemon. You might consider it if you decided you wanted to do an OSD
per disk instead (that's a more common configuration, but it requires
more CPU and RAM per disk and we don't know yet which is the better
choice).
Is there also a rule of thumb for the memory?
About 200MB per daemon right now, plus however much you want the page
cache to be able to use. :) This might go up a bit during peering, but
under normal operation it shouldn't be more than another couple
hundred MB.
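
Applying those two rules of thumb to a hypothetical 12-disk box running one OSD per disk gives a rough, deliberately hedged budget:

12 OSDs x ~1 GHz  = ~12 GHz of modern CPU in aggregate
12 OSDs x ~200 MB = ~2.4 GB RAM baseline,
                    plus a few hundred MB per OSD of peering headroom,
                    plus whatever you want left over for the page cache.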
Post by Stefan Priebe
My biggest problem with ceph right now is the awfully slow speed while doing
random reads and writes.
Sequential reads and writes are at 200 MB/s (that's pretty good for bonded
dual Gbit/s). But random reads and writes are only at 0.8 - 1.5 MB/s, which is
definitely too slow.
Hmm. I'm not super-familiar with where our random IO performance is right
now (and lots of other people seem to have advice on journaling
devices :), but that's about in line with what you get from a hard
disk normally. Unless you've designed your application very carefully
(lots and lots of parallel IO), an individual client doing synchronous
random IO is unlikely to be able to get much faster than a regular
drive.
-Greg
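
One cheap way to see the effect of parallelism described above is to raise the number of in-flight operations in the small-block rados bench run; -t sets the number of concurrent ops (16 by default), so something along these lines should scale noticeably better than a single synchronous client, though exact numbers will vary:

rados -p rbd bench 60 write -b 4096 -t 64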
Stefan Priebe - Profihost AG
2012-05-22 06:20:59 UTC
Permalink
Post by Gregory Farnum
Post by Stefan Priebe
So would you recommend a fast (higher GHz) Core i3 instead of a single Xeon
for this system? (The price per GHz is better.)
If that's all the MDS is doing there, probably? (It would also depend
on cache sizes and things; I don't have a good sense for how that
impacts the MDS' performance.)
As I'm only using KVM / rbd, I don't have any MDS.
Post by Gregory Farnum
Well, RAID1 isn't going to make it any faster than just the single
SSD, which is why I pointed that out.
I wouldn't recommend using a ramdisk for the journal: that will
guarantee local data loss in the event the server doesn't shut down
properly, and if it happens to several servers at once you get a good
chance of losing client writes.
Sure, but it's the same when NOT using a RAID 1 for the journal, isn't it?

Stefan
Gregory Farnum
2012-06-29 18:07:05 UTC
Permalink
Post by Gregory Farnum
Sorry this got left for so long...
On Thu, May 10, 2012 at 6:23 AM, Stefan Priebe - Profihost AG
Post by Stefan Priebe - Profihost AG
Hi,
the "Designing a cluster guide"
http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
still leaves some questions unanswered.
It mentions for example "Fast CPU" for the mds system. What does fast
mean? Just the speed of one core? Or is ceph designed to use multi core?
Is multi core or more speed important?
Post by Gregory Farnum
Right now, it's primarily the speed of a single core. The MDS is
highly threaded but doing most things requires grabbing a big lock.
How fast is a qualitative rather than quantitative assessment at this
point, though.
Post by Stefan Priebe - Profihost AG
The Cluster Design Recommendations mention separating all daemons onto
dedicated machines. Is this also useful for the MON? As they're so
lightweight, why not run them on the OSDs?
Post by Gregory Farnum
It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will add
*some* load. More important is their disk access patterns: they have
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel & glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)
I just had it pointed out to me that I rather overstated the
importance of syncfs if you were going to do this. The monitor mostly
does fsync, not sync/syncfs(), so that's not so important. What is
important is that it has highly seeky disk behavior, so you don't want
a ceph-osd and ceph-mon daemon to be sharing a disk. :)
-Greg
Brian Edmonds
2012-06-29 18:42:41 UTC
Permalink
Post by Stefan Priebe - Profihost AG
the "Designing a cluster guide"
http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
still leaves some questions unanswered.
Oh, thank you. I've been poking through the Ceph docs, but somehow
had not managed to turn up the wiki yet.

What are the likely and worst case scenarios if the OSD journal were
to simply be on a garden variety ramdisk, no battery backing? In the
case of a single node losing power, and thus losing some data, surely
Ceph can recognize this, and handle it through normal redundancy? I
could see it being an issue if the whole cluster lost power at once.
Anything I'm missing?

Brian.
Gregory Farnum
2012-06-29 18:50:48 UTC
Permalink
Post by Brian Edmonds
What are the likely and worst case scenarios if the OSD journal were
to simply be on a garden variety ramdisk, no battery backing? In the
case of a single node losing power, and thus losing some data, surely
Ceph can recognize this, and handle it through normal redundancy? I
could see it being an issue if the whole cluster lost power at once.
Anything I'm missing?
If you lose a journal, you lose the OSD. The end. We could potentially
recover much of the data through developer-driven manual data
inspection, but I suspect it's roughly equivalent to what a lot of
data forensics services offer: expensive for everybody and not
something to rely on.
Ceph can certainly handle losing *one* OSD, but if you have a
correlated failure of more than one, you're almost certain to lose
some amount of data (how much depends on how many OSDs you have,
and how you've replicated that data). If that's an acceptable tradeoff
for you, go for it...but I doubt that it is when you come down to it.
-Greg
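
For completeness, writing off an OSD whose journal is gone and rebuilding it usually amounts to telling the cluster the OSD is lost and re-initializing it. A hedged sketch only - the exact commands and the required auth/keyring steps vary by version, and osd.1 is purely an example:

ceph osd out 1
ceph osd lost 1 --yes-i-really-mean-it
ceph-osd -i 1 --mkfs --mkjournal    # re-create an empty data dir and journal
ceph osd in 1                       # then let recovery restore the replicas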
Brian Edmonds
2012-06-29 20:59:58 UTC
Permalink
Post by Gregory Farnum
If you lose a journal, you lose the OSD.
Really? Everything? Not just recent commits? I would have hoped it
would just come back up in an old state. Replication should have
already been taking care of regaining redundancy for the stuff that
was on it, particularly the newest stuff that wouldn't return with it
and say "Hi, I'm back."

I suppose it makes the design easier though. =)

Brian.
Gregory Farnum
2012-06-29 21:11:22 UTC
Permalink
Post by Gregory Farnum
If you lose a journal, you lose the OSD.
Really? Everything? Not just recent commits? I would have hoped it
would just come back up in an old state. Replication should have
already been taking care of regaining redundancy for the stuff that
was on it, particularly the newest stuff that wouldn't return with it
and say "Hi, I'm back."
I suppose it makes the design easier though. =)
Well, actually this depends on the filesystem you're using. With
btrfs, the OSD will roll back to a consistent state, but you don't
know how out-of-date that state is. (Practically speaking, it's pretty
new, but if you were doing any writes it is going to be data loss.)
With xfs/ext4/other, the OSD can't create consistency points the same
way it can with btrfs, and so the loss of a journal means that it
can't repair itself.

Sorry for not mentioning the distinction earlier; I didn't think we'd
implemented the rollback on btrfs. :)
-Greg
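
The btrfs consistency points mentioned above are taken as btrfs snapshots by the OSD's filestore; in ceph.conf of this era that behavior is governed by a filestore option (a hedged sketch - it should already default to on when the backing filesystem is btrfs):

[osd]
        filestore btrfs snap = true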
Brian Edmonds
2012-06-29 21:18:20 UTC
Permalink
Post by Gregory Farnum
Well, actually this depends on the filesystem you're using. With
btrfs, the OSD will roll back to a consistent state, but you don't
know how out-of-date that state is.
Ok, so assuming btrfs, then a single machine failure with a ramdisk
journal should not result in any data loss, assuming replication is
working? The cluster would then be at risk of data loss primarily
from a full power outage. (In practice I'd expect either one machine
to die, or a power loss to take out all of them, and smaller but
non-unitary losses would be uncommon.)

Something to play with, perhaps.

Brian.
Gregory Farnum
2012-06-29 21:30:09 UTC
Permalink
Post by Brian Edmonds
Post by Gregory Farnum
Well, actually this depends on the filesystem you're using. With
btrfs, the OSD will roll back to a consistent state, but you don't
know how out-of-date that state is.
Ok, so assuming btrfs, then a single machine failure with a ramdisk
journal should not result in any data loss, assuming replication is
working? The cluster would then be at risk of data loss primarily
from a full power outage. (In practice I'd expect either one machine
to die, or a power loss to take out all of them, and smaller but
non-unitary losses would be uncommon.)
That's correct. And replication will be working: it's all
synchronous, so if the replication isn't working, you won't be able to
write. :) There are some edge cases here: if an OSD is "down" but not
"out" then you might not have the same number of data copies as
normal, but that's all configurable.
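
The down-versus-out behavior mentioned above is governed by how long the monitors wait before marking a down OSD out and re-replicating its data. A hedged ceph.conf sketch (the option name is from configs of this era; the value is just an example):

[mon]
        ; mark a "down" OSD "out" after 5 minutes, triggering re-replication
        mon osd down out interval = 300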
Sage Weil
2012-06-29 21:33:20 UTC
Permalink
Post by Brian Edmonds
Post by Gregory Farnum
Well, actually this depends on the filesystem you're using. With
btrfs, the OSD will roll back to a consistent state, but you don't
know how out-of-date that state is.
Ok, so assuming btrfs, then a single machine failure with a ramdisk
journal should not result in any data loss, assuming replication is
working? The cluster would then be at risk of data loss primarily
from a full power outage. (In practice I'd expect either one machine
to die, or a power loss to take out all of them, and smaller but
non-unitary losses would be uncommon.)
Right. From a data-safety perspective ("the cluster said my writes were
safe.. are they?") consider journal loss an OSD failure. If there aren't
other surviving replicas, something may be lost.
From a recovery perspective, it is a partial failure; not everything was
lost, and recovery will be quick (only recent objects get copied around).
Maybe your application can tolerate that, maybe it can't.

sage
