Sage Weil
2014-10-20 15:03:42 UTC
As you said at https://github.com/ceph/ceph/pull/2199, adjusting weight or
crush weight should both be effective in any case. We've encountered a situation
in which adjusting weight seems far less effective than adjusting
crush weight.
We use 6 racks with host counts {9, 5, 9, 4, 9, 4} and 11 OSDs on each host.
ruleset ecrule {
...
min_size 11
max_size 11
step set_chooseleaf_tries 50
step take default
step choose firstn 4 type rack // we want the distribution to be {3, 3, 3, 2} for k=8 m=3
step chooseleaf indep 3 type host
step emit
}
After creating the pool, we ran osd reweight-by-pg many times; the best
result it could reach was:
Average PGs/OSD (expected): 225.28
Max PGs/OSD: 307
Min PGs/OSD: 164
Then we ran our own tool to reweight (same strategy as reweight-by-pg, just
Average PGs/OSD (expected): 225.28
Max PGs/OSD: 241
Min PGs/OSD: 207
This is much better than the previous result.
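To put the two results side by side, the spread around the expected average can be expressed as a percentage. The following is only a small arithmetic sketch over the figures quoted above; the function name and the dictionary labels are made up for illustration. It works out to roughly -27% / +36% for osd reweight-by-pg versus roughly -8% / +7% for the custom tool.

```python
def deviation_percent(value, average):
    """Signed deviation of value from average, in percent."""
    return (value / average - 1.0) * 100.0

# Figures quoted in the message: expected average, then (min, max) PGs per OSD.
average = 225.28
results = {
    "osd reweight-by-pg": (164, 307),
    "own tool": (207, 241),
}

# Map each run to its (min deviation, max deviation) in percent.
spread = {
    name: (deviation_percent(lo, average), deviation_percent(hi, average))
    for name, (lo, hi) in results.items()
}

for name, (lo_pct, hi_pct) in spread.items():
    print(f"{name}: min {lo_pct:+.1f}%, max {hi_pct:+.1f}%")
```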
As I understand it, because the number of hosts differs across racks, the
"step choose firstn 4 type rack" step behaves as follows:
1. If we adjust the osd weight, this step is almost unaffected and
will dispatch an almost even number of PGs to each rack. Thus the hosts in the
racks that have fewer hosts will take more PGs, no matter how we adjust the
weight.
2. If we adjust the osd crush weight, this step is affected and will try to
dispatch more PGs to the racks that have a higher crush weight, thus
the result can be even.
Am I right about this?
I think so, yes. I am a bit surprised that this is a problem, though. We
will still be distributing PGs based on the relative CRUSH weights, and I
would not expect the expected variation to lead to very much skew
between racks.
It may be that CRUSH is, at baseline, having trouble respecting your
weights. You might try creating a single straw bucket with 6 OSDs and
those weights (9, 5, 9, 4, 9, 4) and see if it is able to achieve a
correct distribution. When there is a lot of variation in weights and the
total number of items is small, it can be hard for it to get to the right
result. (We were just looking into a similar problem on another cluster
on Friday.)
For a more typical chooseleaf the osd weight will have the intended
behavior, but when the initial step is a regular choose, only the CRUSH
weights affect the decision. My guess is that your process is skewing the
CRUSH weights pretty dramatically, which is able to compensate for the
difficulty/improbability of randomly choosing racks with the right
frequency...
sage
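One way to see why picking 4 of 6 unevenly weighted racks is hard to get right is to work out how often each rack would be chosen under an idealized model. The sketch below is not CRUSH: it assumes each rack is drawn with probability proportional to its weight and that already-chosen racks are simply excluded from later draws. It also ignores the truncation from 12 to 11 OSDs and any osd reweight values. The function and variable names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def inclusion_probabilities(weights, k):
    """Exact chance that each item is among k weight-proportional draws
    made without replacement (an idealized stand-in for 'choose firstn k')."""
    total = sum(weights)
    probs = [Fraction(0)] * len(weights)
    # Enumerate every ordered selection of k distinct items.
    for order in permutations(range(len(weights)), k):
        remaining = total
        p = Fraction(1)
        for i in order:
            p *= Fraction(weights[i], remaining)  # draw i from what is left
            remaining -= weights[i]
        for i in order:
            probs[i] += p
    return probs

hosts_per_rack = [9, 5, 9, 4, 9, 4]   # rack weights taken as host counts
racks_chosen = 4                       # step choose firstn 4 type rack
hosts_per_chosen_rack = 3              # step chooseleaf indep 3 type host

rack_prob = inclusion_probabilities(hosts_per_rack, racks_chosen)
# Target if each rack were chosen exactly in proportion to its weight.
ideal = [Fraction(racks_chosen * h, sum(hosts_per_rack)) for h in hosts_per_rack]
# Relative share of placements landing on one host of each rack.
host_load = [p * hosts_per_chosen_rack / h for p, h in zip(rack_prob, hosts_per_rack)]

for h, p, q, load in zip(hosts_per_rack, rack_prob, ideal, host_load):
    print(f"rack with {h} hosts: chosen {float(p):.3f} "
          f"(proportional target {float(q):.3f}), per-host share {float(load):.3f}")
```

Under these assumptions the large racks are chosen less often than a weight-proportional target would require and the small racks more often, so hosts in the small racks carry a larger share. With equal weights, every rack is chosen with probability 4/6 and the skew disappears.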
We then did a further test with 6 racks and 9 hosts in each rack. In this
situation, adjusting weight or adjusting crush weight has almost the same
effect.
So, do weight and crush weight impact the result of CRUSH in different
ways?