weight VS crush weight when doing osd reweight

Discussion:

Sage Weil

2014-10-20 15:03:42 UTC

As you said at https://github.com/ceph/ceph/pull/2199, adjusting weight or
crush weight should both be effective in any case. We've encountered a situation
in which adjusting weight seems far less effective than adjusting
crush weight.
We use 6 racks with host counts {9, 5, 9, 4, 9, 4} and 11 OSDs on each host.
ruleset ecrule {
...
min_size 11
max_size 11
step set_chooseleaf_tries 50
step take default
step choose firstn 4 type rack // we want the distribution to be {3, 3, 3, 2} for k=8 m=3
step chooseleaf indep 3 type host
step emit
}
After creating the pool, we ran osd reweight-by-pg many times. The best
result it could reach was
Average PGs/OSD (expected): 225.28
Max PGs/OSD: 307
Min PGs/OSD: 164
Then we ran our own tool to reweight (same strategy as reweight-by-pg, just
Average PGs/OSD (expected): 225.28
Max PGs/OSD: 241
Min PGs/OSD: 207
Which is much better than the previous one.
As I understand it, because the number of hosts differs across racks:
1. If we adjust the OSD weight, the rack choose step is almost unaffected and
will dispatch an almost even number of PGs to each rack. Thus the hosts in the
racks that have fewer hosts will take more PGs, no matter how we adjust the
weight.
2. If we adjust the OSD CRUSH weight, that step is affected and will try to
dispatch more PGs to the racks that have a higher CRUSH weight, so
the result can be even.
Am I right about this?

I think so, yes. I am a bit surprised that this is a problem, though. We
will still be distributing PGs based on the relative CRUSH weights, and I
would not expect that the expected variation will lead to very much skew
between racks.
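
To put rough numbers on the two extremes, here is a minimal sketch (plain arithmetic, not CRUSH itself). It assumes each PG picks 4 racks and then 3 hosts per picked rack, treats hosts within a rack as equal, and compares two hypothetical rack-selection behaviours: racks included in proportion to their host count, and racks included uniformly. The function name and both models are assumptions made for illustration.

```python
from fractions import Fraction

HOSTS_PER_RACK = [9, 5, 9, 4, 9, 4]   # from the cluster description
RACKS_PER_PG = 4                      # step choose firstn 4 type rack
HOSTS_PER_PICKED_RACK = 3             # step chooseleaf indep 3 type host


def per_host_share(rack_probability):
    """Expected placement slots per PG that land on one host of each rack.

    rack_probability[i] is the assumed chance that rack i is among the
    racks picked for a PG; hosts inside a rack are treated as equal.
    """
    return [
        p * Fraction(HOSTS_PER_PICKED_RACK, hosts)
        for p, hosts in zip(rack_probability, HOSTS_PER_RACK)
    ]


total_hosts = sum(HOSTS_PER_RACK)

# Model 1 (idealised): inclusion probability proportional to host count.
proportional = [Fraction(RACKS_PER_PG * h, total_hosts) for h in HOSTS_PER_RACK]

# Model 2 (hypothetical extreme): every rack equally likely to be picked.
uniform = [Fraction(RACKS_PER_PG, len(HOSTS_PER_RACK))] * len(HOSTS_PER_RACK)

proportional_share = per_host_share(proportional)
uniform_share = per_host_share(uniform)

for hosts, p, u in zip(HOSTS_PER_RACK, proportional_share, uniform_share):
    print(f"rack with {hosts} hosts: proportional {p}, uniform {u}")
```

Under the proportional model every host receives the same share (3/10 of a slot per PG). Under the uniform model a host in a 4-host rack would receive 1/2 and a host in a 9-host rack only 2/9, which is the kind of skew described in the question.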

It may be that CRUSH is, at baseline, having trouble respecting your
weights. You might try creating a single straw bucket with 6 OSDs and
those weights (9, 5, 9, 4, 9, 4) and see if it is able to achieve a
correct distribution. When there is a lot of variation in weights and the
total number of items is small it can be hard for it to get to the right
result. (We were just looking into a similar problem on another cluster
on Friday.)
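
One way to see why a small, unevenly weighted set is hard is a simplified model, sketched here under stated assumptions: each of the 4 positions draws an item in proportion to its weight among the items not yet chosen (equivalent to redrawing on a collision). This is not the straw bucket implementation, and `inclusion_probabilities` is a hypothetical helper; the sketch only enumerates that model exactly and compares it with the ideal of 4 * weight / total.

```python
from fractions import Fraction
from itertools import permutations


def inclusion_probabilities(weights, picks):
    """Exact chance that each item appears among `picks` distinct items.

    Simplified model (an assumption, not the straw bucket code): each
    position draws an item in proportion to its weight among the items
    not yet chosen, which matches redrawing whenever a collision occurs.
    """
    totals = [Fraction(0)] * len(weights)
    for order in permutations(range(len(weights)), picks):
        remaining = sum(weights)
        prob = Fraction(1)
        for item in order:
            prob *= Fraction(weights[item], remaining)
            remaining -= weights[item]
        # Every item in this ordered selection is credited with its probability.
        for item in order:
            totals[item] += prob
    return totals


weights = [9, 5, 9, 4, 9, 4]
picks = 4
actual = inclusion_probabilities(weights, picks)
ideal = [Fraction(picks * w, sum(weights)) for w in weights]

for w, a, i in zip(weights, actual, ideal):
    print(f"weight {w}: model {float(a):.3f}  ideal {float(i):.3f}")
```

In this model the weight-9 items are selected less often than their ideal 0.9, and the weight-4 and weight-5 items more often than their ideal 0.4 and 0.5, so the lighter items end up over-represented.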

For a more typical chooseleaf the OSD weight will have the intended
behavior, but when the initial step is a regular choose only the CRUSH
weights affect the decision. My guess is that your process is skewing the
CRUSH weights pretty dramatically, which is able to compensate for the
difficulty/improbability of randomly choosing racks with the right
frequency...

sage

We then did a further test with 6 racks and 9 hosts in each rack. In this
situation, adjusting weight and adjusting crush weight have almost the same
effect.
So, do weight and crush weight impact the result of CRUSH in different
ways?

Lei Dong

2014-10-21 02:15:02 UTC

Thanks Sage!
So you mean:

1. The choose step is not affected by OSD weight (only by CRUSH weight).
2. The chooseleaf step is affected by both weights. But with a
big variation in CRUSH weight and a small number of OSDs, CRUSH works
inefficiently at making the distribution even, even though we can adjust the OSD
weight.

Right?

LeiDong

Sage Weil

2014-10-21 13:24:34 UTC

Post by Lei Dong
Thanks Sage!
1. The choose step is not affected by OSD weight (only by CRUSH weight).

Yes, if the choose type is not 'osd'.

Post by Lei Dong
2. The chooseleaf step is affected by both weights. But with a
big variation in CRUSH weight and a small number of OSDs, CRUSH works
inefficiently at making the distribution even, even though we can adjust the OSD
weight.
Right?

Right.

As a simple example, let's say we're picking 2 replicas and the weights
are [1, 2, 1]. It's pretty obvious that the only two choices are a,b and
b,c, but CRUSH will have a very hard time with this because it is doing an
independent selection for each position. Things get harder as the number
of replicas increases.

sage
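
The [1, 2, 1] example can be worked through exactly under a simplified model, sketched here as an assumption rather than as CRUSH's actual code: each position draws an item in proportion to its weight, and a collision with the first pick is redrawn. A perfectly weighted result would put b in every set and never produce the pair a,c.

```python
from fractions import Fraction

weights = {"a": 1, "b": 2, "c": 1}
total = sum(weights.values())

# Chance that each item ends up in the 2-replica set.
included = {name: Fraction(0) for name in weights}
# Chance of each unordered pair.
pairs = {}

for first, w1 in weights.items():
    for second, w2 in weights.items():
        if second == first:
            continue  # a collision is redrawn, so it never appears in the result
        # First pick by weight, second pick by weight among the remaining items.
        prob = Fraction(w1, total) * Fraction(w2, total - w1)
        included[first] += prob
        included[second] += prob
        key = "".join(sorted((first, second)))
        pairs[key] = pairs.get(key, Fraction(0)) + prob

print(included)
print(pairs)
```

In this model b appears in only 5/6 of the sets instead of all of them, a and c each appear in 7/12 instead of 1/2, and the pair a,c still occurs 1/6 of the time.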
