slow performance even when using SSDs

Discussion:

Stefan Priebe - Profihost AG

2012-05-10 12:09:13 UTC

Dear List,

i'm doing a testsetup with ceph v0.46 and wanted to know how fast ceph is.

my testsetup:
3 servers with Intel Xeon X3440, 180GB SSD Intel 520 Series, 4GB RAM, 2x
1Gbit/s LAN each

All 3 are running as mon a-c and osd 0-2. Two of them are also running
as mds.2 and mds.3 (has 8GB RAM instead of 4GB).

All machines run ceph v0.46 and vanilla Linux Kernel v3.0.30 and all of
them use btrfs on the ssd which serves /srv/{osd,mon}.X. All of them use
eth0+eth1 as bond0 (mode 6).

This gives me:
rados -p rbd bench 60 write

...
Total time run: 61.465323
Total writes made: 776
Write size: 4194304
Bandwidth (MB/sec): 50.500

Average Latency: 1.2654
Max latency: 2.77124
Min latency: 0.170936

Shouldn't it be at least 100MB/s? (1Gbit/s / 8)

And rados -p rbd bench 60 write -b 4096 gives pretty bad results:
Total time run: 60.221130
Total writes made: 6401
Write size: 4096
Bandwidth (MB/sec): 0.415

Average Latency: 0.150525
Max latency: 1.12647
Min latency: 0.026599

All btrfs ssds are also mounted with noatime.

Thanks for your help!

Greets Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Stefan Priebe - Profihost AG

2012-05-10 13:09:55 UTC

Permalink

OK, here some retests. I had the SDDs conected to an old Raid controller
even i did used them as JBODs (oops).

Here are two new Tests (using kernel 3.4-rc6) it would be great if
someone could tell me if they're fine or bad.

New tests with all 3 SSDs connected to the mainboard.

#~ rados -p rbd bench 60 write
Total time run: 60.342419
Total writes made: 2021
Write size: 4194304
Bandwidth (MB/sec): 133.969

Average Latency: 0.477476
Max latency: 0.942029
Min latency: 0.109467

#~ rados -p rbd bench 60 write -b 4096
Total time run: 60.726326
Total writes made: 59026
Write size: 4096
Bandwidth (MB/sec): 3.797

Average Latency: 0.016459
Max latency: 0.874841
Min latency: 0.002392

Another test with only osd on the disk and the journal in memory / tmpfs:
#~ rados -p rbd bench 60 write
Total time run: 60.513240
Total writes made: 2555
Write size: 4194304
Bandwidth (MB/sec): 168.889

Average Latency: 0.378775
Max latency: 4.59233
Min latency: 0.055179

#~ rados -p rbd bench 60 write -b 4096
Total time run: 60.116260
Total writes made: 281903
Write size: 4096
Bandwidth (MB/sec): 18.318

Average Latency: 0.00341067
Max latency: 0.720486
Min latency: 0.000602

Another problem i have is i'm always getting:
"2012-05-10 15:05:22.140027 mon.0 192.168.0.100:6789/0 19 : [WRN]
message from mon.2 was stamped 0.109244s in the future, clocks not
synchronized"

even on all systems ntp is running fine.

Stefan

Post by Stefan Priebe - Profihost AG
Dear List,
i'm doing a testsetup with ceph v0.46 and wanted to know how fast ceph is.
3 servers with Intel Xeon X3440, 180GB SSD Intel 520 Series, 4GB RAM, 2x
1Gbit/s LAN each
All 3 are running as mon a-c and osd 0-2. Two of them are also running
as mds.2 and mds.3 (has 8GB RAM instead of 4GB).
All machines run ceph v0.46 and vanilla Linux Kernel v3.0.30 and all of
them use btrfs on the ssd which serves /srv/{osd,mon}.X. All of them use
eth0+eth1 as bond0 (mode 6).
rados -p rbd bench 60 write
...
Total time run: 61.465323
Total writes made: 776
Write size: 4194304
Bandwidth (MB/sec): 50.500
Average Latency: 1.2654
Max latency: 2.77124
Min latency: 0.170936
Shouldn't it be at least 100MB/s? (1Gbit/s / 8)
Total time run: 60.221130
Total writes made: 6401
Write size: 4096
Bandwidth (MB/sec): 0.415
Average Latency: 0.150525
Max latency: 1.12647
Min latency: 0.026599
All btrfs ssds are also mounted with noatime.
Thanks for your help!
Greets Stefan

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Calvin Morrow

2012-05-10 18:24:16 UTC

Permalink

I was getting roughly the same results of your tmpfs test using
spinning disks for OSDs with a 160GB Intel 320 SSD being used for the
journal. Theoretically the 520 SSD should give better performance
than my 320s.

Keep in mind that even with balance-alb, multiple GigE connections
will only be used if there are multiple TCP sessions being used by
Ceph.

You don't mention it in your email, but if you're using kernel 3.4+
you'll want to make sure your create your btrfs filesystem using the
large node & leaf size (Big Metadata - I've heard recommendations of
32k instead of default 4k) so your performance doesn't degrade over
time.

I'm curious what speed you're getting from dd in a streaming write.
You might try running a "dd if=3D/dev/zero of=3D<intel ssd partition>
bs=3D128k count=3Dsomething" to see what the SSD will spit out without
Ceph in the picture.

Calvin

On Thu, May 10, 2012 at 7:09 AM, Stefan Priebe - Profihost AG

OK, here some retests. I had the SDDs conected to an old Raid control=

ler

even i did used them as JBODs (oops).
Here are two new Tests (using kernel 3.4-rc6) it would be great if
someone could tell me if they're fine or bad.
New tests with all 3 SSDs connected to the mainboard.
#~ rados -p rbd bench 60 write
Total time run: =A0 =A0 =A0 =A060.342419
Total writes made: =A0 =A0 2021
Write size: =A0 =A0 =A0 =A0 =A0 =A04194304
Bandwidth (MB/sec): =A0 =A0133.969
Average Latency: =A0 =A0 =A0 0.477476
Max latency: =A0 =A0 =A0 =A0 =A0 0.942029
Min latency: =A0 =A0 =A0 =A0 =A0 0.109467
#~ rados -p rbd bench 60 write -b 4096
Total time run: =A0 =A0 =A0 =A060.726326
Total writes made: =A0 =A0 59026
Write size: =A0 =A0 =A0 =A0 =A0 =A04096
Bandwidth (MB/sec): =A0 =A03.797
Average Latency: =A0 =A0 =A0 0.016459
Max latency: =A0 =A0 =A0 =A0 =A0 0.874841
Min latency: =A0 =A0 =A0 =A0 =A0 0.002392
Another test with only osd on the disk and the journal in memory / tm=
#~ rados -p rbd bench 60 write
Total time run: =A0 =A0 =A0 =A060.513240
Total writes made: =A0 =A0 2555
Write size: =A0 =A0 =A0 =A0 =A0 =A04194304
Bandwidth (MB/sec): =A0 =A0168.889
Average Latency: =A0 =A0 =A0 0.378775
Max latency: =A0 =A0 =A0 =A0 =A0 4.59233
Min latency: =A0 =A0 =A0 =A0 =A0 0.055179
#~ rados -p rbd bench 60 write -b 4096
Total time run: =A0 =A0 =A0 =A060.116260
Total writes made: =A0 =A0 281903
Write size: =A0 =A0 =A0 =A0 =A0 =A04096
Bandwidth (MB/sec): =A0 =A018.318
Average Latency: =A0 =A0 =A0 0.00341067
Max latency: =A0 =A0 =A0 =A0 =A0 0.720486
Min latency: =A0 =A0 =A0 =A0 =A0 0.000602
"2012-05-10 15:05:22.140027 mon.0 192.168.0.100:6789/0 19 : [WRN]
message from mon.2 was stamped 0.109244s in the future, clocks not
synchronized"
even on all systems ntp is running fine.
Stefan

Post by Stefan Priebe - Profihost AG
Dear List,
i'm doing a testsetup with ceph v0.46 and wanted to know how fast ce=

ph is.

Post by Stefan Priebe - Profihost AG
3 servers with Intel Xeon X3440, 180GB SSD Intel 520 Series, 4GB RAM=

, 2x

Post by Stefan Priebe - Profihost AG
1Gbit/s LAN each
All 3 are running as mon a-c and osd 0-2. Two of them are also runni=

Post by Stefan Priebe - Profihost AG
as mds.2 and mds.3 (has 8GB RAM instead of 4GB).
All machines run ceph v0.46 and vanilla Linux Kernel v3.0.30 and all=

Post by Stefan Priebe - Profihost AG
them use btrfs on the ssd which serves /srv/{osd,mon}.X. All of them=

use

Post by Stefan Priebe - Profihost AG
eth0+eth1 as bond0 (mode 6).
rados -p rbd bench 60 write
...
Total time run: =A0 =A0 =A0 =A061.465323
Total writes made: =A0 =A0 776
Write size: =A0 =A0 =A0 =A0 =A0 =A04194304
Bandwidth (MB/sec): =A0 =A050.500
Average Latency: =A0 =A0 =A0 1.2654
Max latency: =A0 =A0 =A0 =A0 =A0 2.77124
Min latency: =A0 =A0 =A0 =A0 =A0 0.170936
Shouldn't it be at least 100MB/s? (1Gbit/s / 8)
Total time run: =A0 =A0 =A0 =A060.221130
Total writes made: =A0 =A0 6401
Write size: =A0 =A0 =A0 =A0 =A0 =A04096
Bandwidth (MB/sec): =A0 =A00.415
Average Latency: =A0 =A0 =A0 0.150525
Max latency: =A0 =A0 =A0 =A0 =A0 1.12647
Min latency: =A0 =A0 =A0 =A0 =A0 0.026599
All btrfs ssds are also mounted with noatime.
Thanks for your help!
Greets Stefan

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel"=

More majordomo info at =A0http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Stefan Priebe - Profihost AG

2012-05-10 13:23:40 UTC

Permalink

Hi,

the "Designing a cluster guide"
http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
still leaves some questions unanswered.

It mentions for example "Fast CPU" for the mds system. What does fast
mean? Just the speed of one core? Or is ceph designed to use multi core?
Is multi core or more speed important?

The Cluster Design Recommendations mentions to seperate all Daemons on
dedicated machines. Is this also for the MON useful? As they're so
leightweight why not running them on the OSDs?

Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
and you should go for 22x SSD Disks in a Raid 6? Is it more useful the
use a Raid 6 HW Controller or the btrfs raid?

Use single socket Xeon for the OSDs or Dual Socket?

Thanks and greets
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Gregory Farnum

2012-05-17 21:27:26 UTC

Permalink

Sorry this got left for so long...

On Thu, May 10, 2012 at 6:23 AM, Stefan Priebe - Profihost AG

Post by Stefan Priebe - Profihost AG
Hi,
the "Designing a cluster guide"
http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
still leaves some questions unanswered.
It mentions for example "Fast CPU" for the mds system. What does fast
mean? Just the speed of one core? Or is ceph designed to use multi co=

re?

Post by Stefan Priebe - Profihost AG
Is multi core or more speed important?

Right now, it's primarily the speed of a single core. The MDS is
highly threaded but doing most things requires grabbing a big lock.
How fast is a qualitative rather than quantitative assessment at this
point, though.

Post by Stefan Priebe - Profihost AG
The Cluster Design Recommendations mentions to seperate all Daemons o=

Post by Stefan Priebe - Profihost AG
dedicated machines. Is this also for the MON useful? As they're so
leightweight why not running them on the OSDs?

It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will add
*some* load. More important is their disk access patterns =97 they have
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel&glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)

Post by Stefan Priebe - Profihost AG
Regarding the OSDs is it fine to use an SSD Raid 1 for the journal an=

Post by Stefan Priebe - Profihost AG
perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absur=

Post by Stefan Priebe - Profihost AG
and you should go for 22x SSD Disks in a Raid 6?

You'll need to do your own failure calculations on this one, I'm
afraid. Just take note that you'll presumably be limited to the speed
of your journaling device here.
Given that Ceph is going to be doing its own replication, though, I
wouldn't want to add in another whole layer of replication with raid10
=97 do you really want to multiply your storage requirements by another
factor of two?

Post by Stefan Priebe - Profihost AG
Is it more useful the use a Raid 6 HW Controller or the btrfs raid?

I would use the hardware controller over btrfs raid for now; it allows
more flexibility in eg switching to xfs. :)

Post by Stefan Priebe - Profihost AG
Use single socket Xeon for the OSDs or Dual Socket?

Dual socket servers will be overkill given the setup you're
describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
daemon. You might consider it if you decided you wanted to do an OSD
per disk instead (that's a more common configuration, but it requires
more CPU and RAM per disk and we don't know yet which is the better
choice).
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Stefan Priebe

2012-05-19 08:37:01 UTC

Permalink

Hi Greg,

Post by Gregory Farnum

It mentions for example "Fast CPU" for the mds system. What does fas=

Post by Gregory Farnum

mean? Just the speed of one core? Or is ceph designed to use multi c=

ore?

Post by Gregory Farnum

Is multi core or more speed important?

So would you recommand a fast (more ghz) Core i3 instead of a single=20
xeon for this system? (price per ghz is better).

Post by Gregory Farnum
It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will ad=

Post by Gregory Farnum
*some* load. More important is their disk access patterns =97 they ha=

Post by Gregory Farnum
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel&glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)

Which kernel and which glibc version supports this? I have searched=20
google but haven't found an exact version. We're using debian lenny=20
squeeze with a custom kernel.

Post by Gregory Farnum

Regarding the OSDs is it fine to use an SSD Raid 1 for the journal a=

Post by Gregory Farnum

perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absu=

Post by Gregory Farnum

and you should go for 22x SSD Disks in a Raid 6?

You'll need to do your own failure calculations on this one, I'm
afraid. Just take note that you'll presumably be limited to the speed
of your journaling device here.

Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or=
=20
is this still too slow? Another idea was to use only a ramdisk for the=20
journal and backup the files while shutting down to disk and restore=20
them after boot.

Post by Gregory Farnum
Given that Ceph is going to be doing its own replication, though, I
wouldn't want to add in another whole layer of replication with raid1=

Post by Gregory Farnum
=97 do you really want to multiply your storage requirements by anoth=

Post by Gregory Farnum
factor of two?

OK correct bad idea.

Post by Gregory Farnum

Is it more useful the use a Raid 6 HW Controller or the btrfs raid?

I would use the hardware controller over btrfs raid for now; it allow=

Post by Gregory Farnum
more flexibility in eg switching to xfs. :)

OK but overall you would recommand running one osd per disk right? So=20
instead of using a Raid 6 with for example 10 disks you would run 6 osd=
s=20
on this machine?

Post by Gregory Farnum

Use single socket Xeon for the OSDs or Dual Socket?

Is there also a rule of thumb for the memory?

My biggest problem with ceph right now is the awful slow speed while=20
doing random reads and writes.

Sequential read and writes are at 200Mb/s (that's pretty good for bonde=
d=20
dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s=20
which is def. too slow.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Alexandre DERUMIER

2012-05-19 16:15:22 UTC

Permalink

Hi,

=46or your journal , if you have money, you can use

stec zeusram ssd drive. (around 2000=E2=82=AC /8GB / 100000 iops read/w=
rite with 4k block).
I'm using them with zfs san, they rocks for journal.=20
http://www.stec-inc.com/product/zeusram.php

another interessesting product is ddrdrive
http://www.ddrdrive.com/

----- Mail original -----=20

De: "Stefan Priebe" <***@profihost.ag>=20
=C3=80: "Gregory Farnum" <***@inktank.com>=20
Cc: ceph-***@vger.kernel.org=20
Envoy=C3=A9: Samedi 19 Mai 2012 10:37:01=20
Objet: Re: Designing a cluster guide=20

Hi Greg,=20

Am 17.05.2012 23:27, schrieb Gregory Farnum:=20

It mentions for example "Fast CPU" for the mds system. What does fas=

t=20

mean? Just the speed of one core? Or is ceph designed to use multi c=

ore?=20

Is multi core or more speed important?=20

Right now, it's primarily the speed of a single core. The MDS is=20
highly threaded but doing most things requires grabbing a big lock.=20
How fast is a qualitative rather than quantitative assessment at this=

=20

point, though.=20

So would you recommand a fast (more ghz) Core i3 instead of a single=20
xeon for this system? (price per ghz is better).=20

It depends on what your nodes look like, and what sort of cluster=20
you're running. The monitors are pretty lightweight, but they will ad=

d=20

*some* load. More important is their disk access patterns =E2=80=94 t=

hey have=20

to do a lot of syncs. So if they're sharing a machine with some other=

=20

daemon you want them to have an independent disk and to be running a=20
new kernel&glibc so that they can use syncfs rather than sync. (The=20
only distribution I know for sure does this is Ubuntu 12.04.)=20

Which kernel and which glibc version supports this? I have searched=20
google but haven't found an exact version. We're using debian lenny=20
squeeze with a custom kernel.=20

Regarding the OSDs is it fine to use an SSD Raid 1 for the journal a=

nd=20

perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absu=

rd=20

and you should go for 22x SSD Disks in a Raid 6?=20

You'll need to do your own failure calculations on this one, I'm=20
afraid. Just take note that you'll presumably be limited to the speed=

=20

of your journaling device here.=20

Given that Ceph is going to be doing its own replication, though, I=20
wouldn't want to add in another whole layer of replication with raid1=

0=20

=E2=80=94 do you really want to multiply your storage requirements by=

another=20

factor of two?=20

OK correct bad idea.=20

Is it more useful the use a Raid 6 HW Controller or the btrfs raid?=20

I would use the hardware controller over btrfs raid for now; it allow=

s=20

more flexibility in eg switching to xfs. :)=20

OK but overall you would recommand running one osd per disk right? So=20
instead of using a Raid 6 with for example 10 disks you would run 6 osd=
s=20
on this machine?=20

Use single socket Xeon for the OSDs or Dual Socket?=20

Dual socket servers will be overkill given the setup you're=20
describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD=20
daemon. You might consider it if you decided you wanted to do an OSD=20
per disk instead (that's a more common configuration, but it requires=

=20

more CPU and RAM per disk and we don't know yet which is the better=20
choice).=20

Is there also a rule of thumb for the memory?=20

My biggest problem with ceph right now is the awful slow speed while=20
doing random reads and writes.=20

Sequential read and writes are at 200Mb/s (that's pretty good for bonde=
d=20
dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s=20
which is def. too slow.=20

Stefan=20
--=20
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n=20
the body of a message to ***@vger.kernel.org=20
More majordomo info at http://vger.kernel.org/majordomo-info.html=20

--=20

--=20

Alexandre D erumier=20
Ing=C3=A9nieur Syst=C3=A8me=20
=46ixe : 03 20 68 88 90=20
=46ax : 03 20 68 90 81=20
45 Bvd du G=C3=A9n=C3=A9ral Leclerc 59100 Roubaix - France=20
12 rue Marivaux 75002 Paris - France=20
=09
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Stefan Priebe

2012-05-20 07:56:21 UTC

Permalink

Hi,
For your journal , if you have money, you can use
stec zeusram ssd drive. (around 2000=E2=82=AC /8GB / 100000 iops read=

/write with 4k block).

I'm using them with zfs san, they rocks for journal.
http://www.stec-inc.com/product/zeusram.php
another interessesting product is ddrdrive
http://www.ddrdrive.com/

Great products but really expensive. The question is do we really need=20
this in case of rbd block device.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Alexandre DERUMIER

2012-05-20 08:13:03 UTC

Permalink

I think that depend how much random writes io you have and acceptable l=
atency you need.

(As the purpose of the journal is to take random io then flush them seq=
uentially to slow storage).

Maybe some slower ssd will fill your needs.
(just be carefull of performance degradation in time, trim,....)

----- Mail original -----=20

De: "Stefan Priebe" <***@profihost.ag>=20
=C3=80: "Alexandre DERUMIER" <***@odiso.com>=20
Cc: ceph-***@vger.kernel.org, "Gregory Farnum" <***@inktank.com>=20
Envoy=C3=A9: Dimanche 20 Mai 2012 09:56:21=20
Objet: Re: Designing a cluster guide=20

Am 19.05.2012 18:15, schrieb Alexandre DERUMIER:=20

Hi,=20
=20
For your journal , if you have money, you can use=20
=20
stec zeusram ssd drive. (around 2000=E2=82=AC /8GB / 100000 iops read=

/write with 4k block).=20

I'm using them with zfs san, they rocks for journal.=20
http://www.stec-inc.com/product/zeusram.php=20
=20
another interessesting product is ddrdrive=20
http://www.ddrdrive.com/=20

Great products but really expensive. The question is do we really need=20
this in case of rbd block device.=20

Stefan=20

--=20

--=20

Alexandre D erumier=20
Ing=C3=A9nieur Syst=C3=A8me=20
=46ixe : 03 20 68 88 90=20
=46ax : 03 20 68 90 81=20
45 Bvd du G=C3=A9n=C3=A9ral Leclerc 59100 Roubaix - France=20
12 rue Marivaux 75002 Paris - France=20
=09
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Christian Brunner

2012-05-20 08:19:33 UTC

Permalink

Hi,
For your journal , if you have money, you can use
stec zeusram ssd drive. (around 2000=80 /8GB / 100000 iops read/writ=

e with

4k block).
I'm using them with zfs san, they rocks for journal.
http://www.stec-inc.com/product/zeusram.php
another interessesting product is ddrdrive
http://www.ddrdrive.com/

Great products but really expensive. The question is do we really nee=

d this

in case of rbd block device.

I think it depends, what you are planning to do. I was calculating
different storage type for our cloud solution lately. I think that
there are three different types that make sense (at least for us):

- Cheap Object Storage (S3):

Many 3,5'' SATA Drives for the storage (probably in a RAID config)
A small and cheap SSD for the journal

- Basic Block Storage (RBD):

Many 2,5'' SATA Drives for the storage (RAID10 and/or mutliple OSDs)
Small MaxIOPS SSDs for each OSD journal

- High performance Block Storage (RBD)

Many large SATA SSDs for the storage (prbably in a RAID5 config)
stec zeusram ssd drive for the journal

Regards,
Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Stefan Priebe

2012-05-20 08:27:10 UTC

Permalink

Post by Christian Brunner
Many 3,5'' SATA Drives for the storage (probably in a RAID config)
A small and cheap SSD for the journal
Many 2,5'' SATA Drives for the storage (RAID10 and/or mutliple OSDs)
Small MaxIOPS SSDs for each OSD journal
- High performance Block Storage (RBD)
Many large SATA SSDs for the storage (prbably in a RAID5 config)
stec zeusram ssd drive for the journal

That's exactly what i thought too but then you need a seperate ceph /
rbd cluster for each type.

Which will result in a minimum of:
3x mon servers per type
4x osd servers per type
---

so you'll need a minimum of 12x osd systems and 9x mon systems.

Regards,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Christian Brunner

2012-05-20 08:31:01 UTC

Permalink

=A0 Many 3,5'' SATA Drives for the storage (probably in a RAID confi=

=A0 A small and cheap SSD for the journal
=A0 Many 2,5'' SATA Drives for the storage (RAID10 and/or mutliple O=

SDs)

=A0 Small MaxIOPS SSDs for each OSD journal
- High performance Block Storage (RBD)
=A0 Many large SATA SSDs for the storage (prbably in a RAID5 config)
=A0 stec zeusram ssd drive for the journal

That's exactly what i thought too but then you need a seperate ceph /=

rbd

cluster for each type.
3x mon servers per type
4x osd servers per type
---
so you'll need a minimum of 12x osd systems and 9x mon systems.

You can arrange the storage types in different pools, so that you
don't need separate mon servers (this can be done by adjusting the
crushmap) and you could even run multiple OSDs per server.

Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Stefan Priebe - Profihost AG

2012-05-21 08:22:29 UTC

Permalink

Post by Christian Brunner

That's exactly what i thought too but then you need a seperate ceph / rbd
cluster for each type.
3x mon servers per type
4x osd servers per type
---
so you'll need a minimum of 12x osd systems and 9x mon systems.

You can arrange the storage types in different pools, so that you
don't need separate mon servers (this can be done by adjusting the
crushmap) and you could even run multiple OSDs per server.

That sounds great. Can you give me a hint how to setup pools? Right now
i have data, metadata and rbd => the default pools. But i wasn't able to
find any page in the wiki which described how to setup pools.

Thanks,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Christian Brunner

2012-05-21 15:03:26 UTC

Permalink

Post by Stefan Priebe - Profihost AG

Post by Christian Brunner

You can arrange the storage types in different pools, so that you
don't need separate mon servers (this can be done by adjusting the
crushmap) and you could even run multiple OSDs per server.

rados mkpool <pool-name> [123[ 4]] create pool <pool-name>'
[with auid 123[and using crush rule 4]]

Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Tim O'Donovan

2012-05-20 08:56:46 UTC

Permalink

Post by Christian Brunner
- High performance Block Storage (RBD)
Many large SATA SSDs for the storage (prbably in a RAID5 config)
stec zeusram ssd drive for the journal

How do you think standard SATA disks would perform in comparison to
this, and is a separate journaling device really necessary?

Perhaps three servers, each with 12 x 1TB SATA disks configured in
RAID10, an osd on each server and three separate mon servers.

Would this be suitable for the storage backend for a small OpenStack
cloud, performance wise, for instance?

Regards,
Tim O'Donovan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Stefan Priebe

2012-05-20 09:24:49 UTC

Permalink

Post by Tim O'Donovan

Post by Christian Brunner
- High performance Block Storage (RBD)
Many large SATA SSDs for the storage (prbably in a RAID5 config)
stec zeusram ssd drive for the journal

He's talking about ssd's not normal sata disks.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Tim O'Donovan

2012-05-20 09:46:18 UTC

Permalink

Post by Stefan Priebe
He's talking about ssd's not normal sata disks.

I realise that. I'm looking for similar advice and have been following
this thread. It didn't seem off topic to ask here.

Regards,
Tim O'Donovan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Stefan Priebe

2012-05-20 09:49:07 UTC

Permalink

No sorry just wanted to clarify as you quoted the ssd part.

Stefan

Post by Tim O'Donovan

Post by Stefan Priebe
He's talking about ssd's not normal sata disks.

I realise that. I'm looking for similar advice and have been following
this thread. It didn't seem off topic to ask here.
Regards,
Tim O'Donovan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Christian Brunner

2012-05-21 14:59:47 UTC

Permalink

Post by Tim O'Donovan

Post by Christian Brunner
- High performance Block Storage (RBD)
=A0 Many large SATA SSDs for the storage (prbably in a RAID5 config)
=A0 stec zeusram ssd drive for the journal

How do you think standard SATA disks would perform in comparison to
this, and is a separate journaling device really necessary?

A journaling device is improving write latency a lot and the write
latency is directly related to the throughput you get in your virtual
machine. If you have a raid controller with a battery backed write
cache you could try to put the journal on a separate, small partition
of your SATA disk. I haven't tried this, but I think this could work.

Apart from that you should calculate the sum of the IOPS your guests
genereate. In the end everything has to be written on your backend
storage and is has to be able to deliver the IOPS.

With the journal you might be able to compensate short write peaks and
there might be a gain by merging write requests on the OSDs, but for a
solid sizing I would neglect this. Read requests can be delivered for
the OSDs cache (RAM), but again this will probably give you only a
small gain.

=46or a single SATA disk you can calculate with 100-150 IOPS (depending
on the speed of the disk). SSDs can deliver much higher IOPS values.

Post by Tim O'Donovan
Perhaps three servers, each with 12 x 1TB SATA disks configured in
RAID10, an osd on each server and three separate mon servers.

With a replication level of two this would be 1350 IOPS:

150 IOPS per disk * 12 disks * 3 servers / 2 for the RAID10 / 2 for
ceph replication

Comments on this formula would be welcome...

Post by Tim O'Donovan
Would this be suitable for the storage backend for a small OpenStack
cloud, performance wise, for instance?

That depends on what you are doing in your guests.

Regards,
Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Stefan Priebe - Profihost AG

2012-05-21 15:05:03 UTC

Permalink

Post by Christian Brunner
Apart from that you should calculate the sum of the IOPS your guests
genereate. In the end everything has to be written on your backend
storage and is has to be able to deliver the IOPS.

How to measure the IOPs of a dedicated actual system?

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Tomasz Paszkowski

2012-05-21 15:12:00 UTC

Permalink

If you're using Qemu/KVM you can use 'info blockstats' command for
measruing I/O on particular VM.

On Mon, May 21, 2012 at 5:05 PM, Stefan Priebe - Profihost AG

Post by Stefan Priebe - Profihost AG

How to measure the IOPs of a dedicated actual system?
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel"=

Post by Stefan Priebe - Profihost AG
More majordomo info at =A0http://vger.kernel.org/majordomo-info.html

--=20
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Tomasz Paszkowski

2012-05-21 15:36:12 UTC

Permalink

Project is indeed very interesting, but requires to patch a kernel
source. For me using lkm is safer ;)

Hello,
Has someone looked into bcache (http://bcache.evilpiepirate.org/) ?
It seems, it is superior to flashcache.
Lwn.net article: https://lwn.net/Articles/497024/
Mailing list: http://news.gmane.org/gmane.linux.kernel.bcache.devel
Source code: http://evilpiepirate.org/cgi-bin/cgit.cgi/linux-bcache.g=

it/

Thanks,
Kiran Patil.

Post by Tomasz Paszkowski
If you're using Qemu/KVM you can use 'info blockstats' command for
measruing I/O on particular VM.
On Mon, May 21, 2012 at 5:05 PM, Stefan Priebe - Profihost AG

Post by Stefan Priebe - Profihost AG

Apart from that you should calculate the sum of the IOPS your gue=

sts