Sage Weil

2014-10-20 15:03:42 UTC

As you said at https://github.com/ceph/ceph/pull/2199, adjusting weight or

crush weight should both be effective in any case. We've encountered a situation

in which adjusting weight seems far less effective than adjusting

crush weight.

We use 6 racks with host counts {9, 5, 9, 4, 9, 4} and 11 OSDs on each host.

ruleset ecrule {

...

min_size 11

max_size 11

step set_chooseleaf_tries 50

step take default

step choose firstn 4 type rack // we want the distribution to be {3, 3, 3, 2} for k=8 m=3

step chooseleaf indep 3 type host

step emit

}

After creating the pool, we ran osd reweight-by-pg many times; the best

result it could reach was:

Average PGs/OSD (expected): 225.28

Max PGs/OSD: 307

Min PGs/OSD: 164

Then we ran our own tool to reweight (same strategy as reweight-by-pg, just

adjusting crush weight instead of weight), and the best result was:

Average PGs/OSD (expected): 225.28

Max PGs/OSD: 241

Min PGs/OSD: 207

Which is much better than the previous one.
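To put the two result sets above on a common scale, the reported extremes can be expressed as relative deviations from the expected average. This is only a small arithmetic sketch in Python; the helper name `relative_spread` is made up for illustration, and the inputs are the figures quoted above.

```python
def relative_spread(average, maximum, minimum):
    """Return (over, under): how far max and min sit from the average, as fractions."""
    over = (maximum - average) / average
    under = (average - minimum) / average
    return over, under


# Figures quoted in this thread (PGs per OSD).
first_run = relative_spread(225.28, 307, 164)   # after osd reweight-by-pg
second_run = relative_spread(225.28, 241, 207)  # after the second reweighting run

for label, (over, under) in [("first run", first_run), ("second run", second_run)]:
    print(f"{label}: max is {over:.1%} above average, min is {under:.1%} below")
```

That works out to roughly +36% / -27% for the first run against roughly +7% / -8% for the second.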

According to my understanding, due to the uneven host counts across racks, for

the rack-selection step (step choose firstn 4 type rack):

1. If we adjust osd weight, this step is almost unaffected and

will dispatch an almost even number of PGs to each rack. Thus the hosts in the

racks which have fewer hosts will take more PGs, no matter how we adjust

weight.

2. If we adjust osd crush weight, this step is affected and will try to

dispatch more PGs to the racks which have a higher crush weight, thus

the result can be even.

Am I right about this?

I think so, yes. I am a bit surprised that this is a problem, though. We

will still be distributing PGs based on the relative CRUSH weights, and I

would not expect that the expected variation will lead to very much skew

between racks.

It may be that CRUSH is, at baseline, having trouble respecting your

weights. You might try creating a single straw bucket with 6 OSDs and

those weights (9, 5, 9, 4, 9, 4) and see if it is able to achieve a

correct distribution. When there is a lot of variation in weights and the

total number of items is small it can be hard for it to get to the right

result. (We were just looking into a similar problem on another cluster

on Friday.)
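One way to see why a small, unevenly weighted bucket is hard to balance is to compute how often each item ends up selected when 4 of 6 items are drawn one after another, without replacement, with probability proportional to weight. The Python sketch below does that exactly for the weights (9, 5, 9, 4, 9, 4). It is an idealized model and an assumption on my part, not CRUSH's actual straw bucket code; the function name `inclusion_probabilities` is made up for illustration.

```python
def inclusion_probabilities(weights, picks):
    """Exact chance that each item is among `picks` sequential weighted draws
    without replacement (each draw is proportional to the remaining weights)."""
    n = len(weights)
    incl = [0.0] * n

    def walk(chosen_mask, prob, depth):
        if depth == picks:
            # Credit this outcome's probability to every item it selected.
            for i in range(n):
                if (chosen_mask >> i) & 1:
                    incl[i] += prob
            return
        remaining = sum(w for i, w in enumerate(weights)
                        if not (chosen_mask >> i) & 1)
        for i, w in enumerate(weights):
            if not (chosen_mask >> i) & 1:
                walk(chosen_mask | (1 << i), prob * w / remaining, depth + 1)

    walk(0, 1.0, 0)
    return incl


weights = [9, 5, 9, 4, 9, 4]  # hosts per rack, used as rack weights
picks = 4                     # "choose firstn 4 type rack"

actual = inclusion_probabilities(weights, picks)
ideal = [picks * w / sum(weights) for w in weights]  # what even per-host load needs

for w, a, i in zip(weights, actual, ideal):
    print(f"weight {w}: selected {a:.3f} of the time, ideal {i:.3f}, "
          f"per-host share {a / w:.4f}")
```

Under this model the heavy racks are selected less often than their weight calls for and the light racks more often, so hosts in the 4-host racks carry more than hosts in the 9-host racks even before any reweighting.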

For a more typical chooseleaf the osd weight will have the intended

behavior, but when the initial step is a regular choose only the CRUSH

weights affect the decision. My guess is that your process is skewing the

CRUSH weights pretty dramatically, which is able to compensate for the

difficulty/improbability of randomly choosing racks with the right

frequency...
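The difference between the two rule shapes can be sketched with a toy rejection model in Python. The assumptions here are mine and deliberately simplified: a single placement is modeled, the osd weight is treated as a per-rack average acceptance probability, and a rejected OSD under a single chooseleaf restarts the descent from the top. The function names and the acceptance values are hypothetical.

```python
def rack_shares_choose_then_chooseleaf(rack_crush_weights, rack_accept_prob):
    """Racks are fixed by a plain 'choose' step using CRUSH weights only;
    a later OSD rejection retries inside the rack, so rack shares do not move."""
    total = sum(rack_crush_weights)
    return [w / total for w in rack_crush_weights]


def rack_shares_single_chooseleaf(rack_crush_weights, rack_accept_prob):
    """A rejected OSD restarts the descent from the top, so a rack's share is
    proportional to its CRUSH weight times its acceptance probability."""
    effective = [w * a for w, a in zip(rack_crush_weights, rack_accept_prob)]
    total = sum(effective)
    return [e / total for e in effective]


crush = [9, 5, 9, 4, 9, 4]               # rack CRUSH weights (hosts per rack)
accept = [1.0, 0.8, 1.0, 0.6, 1.0, 0.6]  # made-up average osd weight per rack

print(rack_shares_choose_then_chooseleaf(crush, accept))
print(rack_shares_single_chooseleaf(crush, accept))
```

In this model lowering the osd weight in a rack shifts load away from that rack only in the single chooseleaf case; with a separate choose step the rack shares stay pinned to the CRUSH weights.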

sage

We then did a further test with 6 racks and 9 hosts in each rack. In this

situation, adjusting weight or adjusting crush weight has almost the same

effect.

So, weight and crush weight do impact the result of CRUSH in a different

way?
