Discussion:
Crushmap ruleset for rack aware PG placement
Amit Vijairania
2014-09-15 07:47:42 UTC
Hello!

In a two (2) rack Ceph cluster, with 15 hosts per rack (10 OSDs per
host / 150 OSDs per rack), is it possible to create a ruleset for a
pool such that the Primary and Secondary PGs/replicas are placed in
one rack and the Tertiary PG/replica is placed in the other rack?

root standard {
        id -1           # do not change unnecessarily
        # weight 734.400
        alg straw
        hash 0          # rjenkins1
        item rack1 weight 367.200
        item rack2 weight 367.200
}

Given there are only two (2) rack buckets, but three (3) replicas, is it
even possible?

I think the following Giant blueprint is trying to address the scenario I
described above. Is this blueprint targeted for the Giant release?
http://wiki.ceph.com/Planning/Blueprints/Giant/crush_extension_for_more_flexible_object_placement


Regards,
Amit Vijairania | Cisco Systems, Inc.
--*--
Sage Weil
2014-09-15 15:28:11 UTC
Hi Amit,
Post by Amit Vijairania
Hello!
In a two (2) rack Ceph cluster, with 15 hosts per rack (10 OSD per
host / 150 OSDs per rack), is it possible to create a ruleset for a
pool such that the Primary and Secondary PGs/replicas are placed in
one rack and Tertiary PG/replica is placed in the other rack?
root standard {
id -1 # do not change unnecessarily
# weight 734.400
alg straw
hash 0 # rjenkins1
item rack1 weight 367.200
item rack2 weight 367.200
}
Given there are only two (2) buckets, but three (3) replica, is it
even possible?
Yes:

rule myrule {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
}

That will give you 4 OSDs, spread across 2 hosts in each rack. The pool
size (replication factor) is 3, so RADOS will just use the first three (2
hosts in the first rack, 1 host in the second rack).
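
To actually use it, something along these lines should work (the pool name
here is just an example; double-check the flags against your release):

  # compile the edited map and load it into the cluster
  crushtool -c crushmap.txt -o crushmap.bin
  ceph osd setcrushmap -i crushmap.bin

  # point the pool at ruleset 1 and keep 3 replicas
  ceph osd pool set mypool crush_ruleset 1
  ceph osd pool set mypool size 3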

sage
Post by Amit Vijairania
I think following Giant blueprint is trying to address scenario I
described above.. Is the following blueprint targeted for Giant
release?
http://wiki.ceph.com/Planning/Blueprints/Giant/crush_extension_for_more_flexible_object_placement
Regards,
Amit Vijairania | Cisco Systems, Inc.
--*--
Amit Vijairania
2014-09-15 18:21:54 UTC
Thanks, Sage! We will test this and share our observations.

Regards,
Amit

Amit Vijairania | 415.610.9908
--*--
Daniel Swarbrick
2014-09-16 10:02:36 UTC
Post by Sage Weil
rule myrule {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type rack
step chooseleaf firstn 2 type host
step emit
}
That will give you 4 osds, spread across 2 hosts in each rack. The pool
size (replication factor) is 3, so RADOS will just use the first three (2
hosts in first rack, 1 host in second rack).
I have a similar requirement, where we currently have four nodes, two in
each fire zone, with pool size 3. At the moment, due to the number of
nodes, we are guaranteed at least one replica in each fire zone (which
we represent with bucket type "room"). If we add more nodes in future,
the current ruleset may cause all three replicas of a PG to land in a
single zone.

I tried the ruleset suggested above (replacing "rack" with "room"), but
when testing it with crushtool --test --show-utilization, I simply get
segfaults. No amount of fiddling around seems to make it work - even
adding two new hypothetical nodes to the crushmap doesn't help.
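
For reference, the test command I'm running is roughly the following
(file names are just placeholders):

  crushtool -c crushmap.txt -o crushmap.bin
  crushtool -i crushmap.bin --test --show-utilization --rule 1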

What could I perhaps be doing wrong?
Johnu George (johnugeo)
2014-09-17 14:42:45 UTC
Adding ceph-devel.

Could you resend with ceph-devel in cc? It's better for archive purposes
;-)

Hi Sage,
I was looking at the crash that was reported in this mail chain.
I am seeing that the crash happens when the number of replicas configured
is less than the total number of OSDs to be selected as per the rule. This
is because the CRUSH temporary buffers are allocated according to num_rep
(the scratch array has size num_rep * 3), so when the number of OSDs to be
selected is larger, a buffer overflow happens and causes an error/crash. I
saw your earlier comment in this mail, where you asked to create a rule
that selects two OSDs per rack (2 racks) with num_rep=3. I feel the buffer
overflow issue should happen in that situation too, and can cause 'out of
array' access. Am I wrong somewhere, or am I missing something?
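
To illustrate, here is a rough sketch of the pattern I am looking at
(simplified from CrushWrapper::do_rule; not the exact firefly code, and it
assumes the crush headers and <vector> are available):

  // Temporary buffers are sized from the requested replica count
  // (maxout == num_rep), not from what the rule can actually emit.
  void do_rule_sketch(const struct crush_map *map, int ruleno, int x,
                      std::vector<int>& out, int maxout,
                      const __u32 *weights, int nweight) {
    int rawout[maxout];        // result buffer: maxout entries
    int scratch[maxout * 3];   // working space: 3 * maxout entries
    // If the rule selects more than maxout items at a step (e.g. the
    // "choose firstn 2 rack / chooseleaf firstn 2 host" rule emits 4
    // OSDs while num_rep is 3 or 1), crush_do_rule can write past
    // these buffers.
    int numrep = crush_do_rule(map, ruleno, x, rawout, maxout,
                               weights, nweight, scratch);
    if (numrep < 0)
      numrep = 0;
    out.assign(rawout, rawout + numrep);
  }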

Johnu
On 9/16/14, 9:39 AM, "Daniel Swarbrick"

Hi Loic,
Thanks for providing a detailed example. I'm able to run the example
that you provide, and also got my own live crushmap to produce some
results, when I appended the "--num-rep 3" option to the command.
Without that option, even your example is throwing segfaults - maybe a
bug in crushtool?
One other area I wasn't sure about - can the final "chooseleaf" step
specify "firstn 0" for simplicity's sake (and to automatically handle a
larger pool size in future)? Would there be any downside to this?
Cheers
Hi Daniel,

When I run

  crushtool --outfn crushmap --build --num_osds 100 host straw 2 rack straw 10 default straw 0
  crushtool -d crushmap -o crushmap.txt
  cat >> crushmap.txt <<EOF
  rule myrule {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take default
          step choose firstn 2 type rack
          step chooseleaf firstn 2 type host
          step emit
  }
  EOF
  crushtool -c crushmap.txt -o crushmap
  crushtool -i crushmap --test --show-utilization --rule 1 --min-x 1 --max-x 10 --num-rep 3

I get

  rule 1 (myrule), x = 1..10, numrep = 3..3
  CRUSH rule 1 x 1 [79,69,10]
  CRUSH rule 1 x 2 [56,58,60]
  CRUSH rule 1 x 3 [30,26,19]
  CRUSH rule 1 x 4 [14,8,69]
  CRUSH rule 1 x 5 [7,4,88]
  CRUSH rule 1 x 6 [54,52,37]
  CRUSH rule 1 x 7 [69,67,19]
  CRUSH rule 1 x 8 [51,46,83]
  CRUSH rule 1 x 9 [55,56,35]
  CRUSH rule 1 x 10 [54,51,95]
  rule 1 (myrule) num_rep 3 result size == 3: 10/10

What command are you running to get a core dump?

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
Loic Dachary
2014-09-17 16:10:03 UTC
Hi,

If the number of replicas desired is 1, then

https://github.com/ceph/ceph/blob/firefly/src/crush/CrushWrapper.h#L915

will be called with maxout = 1 and scratch will be maxout * 3. But if the rule always selects 4 items, then it overflows. Is that what you also read?

Cheers
--
Loïc Dachary, Artisan Logiciel Libre
Johnu George (johnugeo)
2014-09-17 20:03:16 UTC
Loic,
You are right. Are we planning to support configurations where the
replica number is different from the number of OSDs selected by a rule?
If not, one solution is to add a validation check when a rule is activated
for a pool with a specific replica count.

Johnu
Loic Dachary
2014-09-17 20:11:44 UTC
Post by Johnu George (johnugeo)
Loic,
You are right. Are we planning to support configurations where
replica number is different from the number of osds selected from a rule?
I think crush should support it, yes. If a rule can provide 10 OSDs there is no reason for it to fail to provide just one.
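
For example, mapping with the same rule but only one replica requested
should also work (the flags below are the usual crushtool test options;
with the current scratch sizing this is exactly the overflow case
discussed above):

  crushtool -i crushmap --test --show-mappings --rule 1 \
      --min-x 1 --max-x 10 --num-rep 1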

Cheers
--
Loïc Dachary, Artisan Logiciel Libre
Johnu George (johnugeo)
2014-09-17 22:40:39 UTC
In such a case, we can initialize the scratch array in
crush/CrushWrapper.h#L919 with the maximum number of OSDs that can be
selected. Since we know the rule number, it should be possible to
calculate the maximum number of OSDs the rule can select.
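
Something along these lines, for example (rough sketch only; it assumes
crush/crush.h for the structs and opcodes, and the helper name is made up):

  // Walk the rule's steps and compute an upper bound on how many items
  // it can emit, so the result/scratch buffers can be sized by
  // max(num_rep, rule_max_out) instead of num_rep alone.
  int get_rule_max_out(const struct crush_map *map, int ruleno, int num_rep) {
    if (ruleno < 0 || ruleno >= (int)map->max_rules || !map->rules[ruleno])
      return -1;
    const struct crush_rule *rule = map->rules[ruleno];
    int cur = 1, max_out = 0;
    for (unsigned s = 0; s < rule->len; ++s) {
      const struct crush_rule_step *step = &rule->steps[s];
      switch (step->op) {
      case CRUSH_RULE_CHOOSE_FIRSTN:
      case CRUSH_RULE_CHOOSE_INDEP:
      case CRUSH_RULE_CHOOSELEAF_FIRSTN:
      case CRUSH_RULE_CHOOSELEAF_INDEP: {
        // arg1 > 0 asks for that many items; arg1 <= 0 means
        // "num_rep + arg1" (the usual "firstn 0" case).
        int n = step->arg1 > 0 ? step->arg1 : num_rep + step->arg1;
        cur *= (n > 0 ? n : 0);   // each already-chosen item fans out again
        break;
      }
      case CRUSH_RULE_EMIT:
        max_out += cur;
        cur = 1;
        break;
      default:
        break;
      }
    }
    return max_out;
  }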

Johnu
Chen, Xiaoxi
2014-09-18 02:03:36 UTC
The rule has max_size, can we just use that value?
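
Something like this, perhaps (sketch only; mask.max_size is the rule's
declared max_size, and the helper name is made up):

  // Derive the buffer bound from the rule's declared max_size rather
  // than from the pool's num_rep alone.
  int buffer_bound(const struct crush_map *map, int ruleno, int num_rep) {
    const struct crush_rule *rule = map->rules[ruleno];
    return std::max(num_rep, (int)rule->mask.max_size);   // needs <algorithm>
  }
  // then: int rawout[bound]; int scratch[bound * 3];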
