Very interesting! I will need to sit down and read through the
attachments tomorrow. Did you find that this method was superior to
simply reweighting the OSDs based on the quality of the distribution
using the Jenkins hash? Clearly this isn't easy with lots of pools,
though with some fancy crush rule management you might be able to get
around it to a limited extent.
I haven't done any extensive analysis, but I've also wondered if simply
replacing Jenkins with something like City, Murmur3, or Spooky might
improve distribution quality (and especially whether it would improve
distribution quality given small changes in the input).
Really glad you guys are looking into this!
Mark
Post by Zhang, Jian
Thanks.

Created a feature here: http://tracker.ceph.com/issues/9410, to include all the attachments:
http://tracker.ceph.com/attachments/download/1383/adaptive-crush-modify.patch
http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf
http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf

Hi all,
Several months ago we met read performance issues (17% degradation) when working on Ceph object storage performance evaluation with 10M objects (scaling from 10K objects to 1 million objects), and found the root cause to be unbalanced PG distribution among all OSD disks, leading to unbalanced data distribution. We did some further investigation and identified that CRUSH failed to map PGs evenly to each OSD. Please refer to the attached pdf (crush_proposals) for details.

As mentioned in the attached pdf, we described possible optimization proposals (http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf) for CRUSH and got some feedback from the community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979). Sage suggested we take the idea of "Change placement strategy only for the step of selecting devices from hosts", by adding a new bucket type called "linear" and applying a modulo-like hash function to this kind of bucket to achieve balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm, with new hash methods and an adaptive module. Please refer to the Design and Implementation part for details. We also wrote a POC for it; see the attached patch. As a result, we got more than 10% read performance improvement using the optimized CRUSH algorithm.

1. Problem Identification

1.1 Input key (pps) space of CRUSH is not uniform
Since the PG# on the nested devices of a host is not uniform even if we select the device using a simple modulo operation, we decided to change the algorithm for hashing the raw pg to pps.

1.2 Algorithm of selecting items from buckets is not uniform
After we get a uniform input key space, we should also make the procedure of selecting devices from a host uniform. Since the current CRUSH algorithm uses Jenkins-hash-based strategies and fails to reach this goal, we decided to add a new bucket type and apply a new (modulo-based) hash algorithm to it.

2. Design

2.1 New pps hash algorithm
We design the new pps hash algorithm based on the "Congruential pseudo-random number generator" (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf). It defines a bijection between the original sequence {0, ..., 2^N-1} and some permutation of it. In other words, given different keys between 0 and 2^N-1, the generator will produce different integers, but within the same range {0, ..., 2^N-1}.
Assume there are np PGs in a pool; we can regard pgid (0 <= pgid < 2^n, np <= 2^n < 2*np) as the key, and it will be hashed into a pps value between 0 and 2^n-1. Since the PG# in a pool is usually a power of two, the generator just shuffles the original pgid sequence as output in this case, making the key space a permutation of {0, ..., 2^n-1}, which achieves the best uniformity. Moreover, poolid can be regarded as a seed in the generator, producing different pps values even for the same pgid under different poolids. Therefore, the pgid sequences of different pools are mapped into distinct pps sequences, getting rid of PG overlapping.

2.2 New bucket type: Linear
We introduce a new bucket type called "linear" and apply a new modulo-based hash algorithm to it. The pps values assigned to each host are a pseudo-random subset of the original permutation and may themselves be non-uniform; in that case, applying a modulo operation directly to the integers in the subset cannot produce a balanced distribution among the disks in the host. To decrease the deviation of the subset, we apply a balance parameter 1/balance_param to the key before performing the modulo operation.
For OSD failure and recovery, we assume that items nested in this kind of bucket will not be removed and no new items will be added, the same as with the UNIFORM bucket. A linear host will therefore not introduce more data movement than a uniform bucket.
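As a rough sketch of the choose step for such a linear bucket (the function name and the exact way 1/balance_param is applied are illustrative; see the attached patch for the actual implementation):

    /* Illustrative sketch of a "linear" bucket choose step; assumes
     * balance_param >= 1 and num_items > 0.  Not the identifiers
     * used in the real patch. */
    #include <stdint.h>

    static int linear_bucket_choose(uint32_t pps, uint32_t balance_param,
                                    unsigned num_items)
    {
        /* Scaling the key by 1/balance_param smooths the pseudo-random
         * subset of pps values a host receives before the final modulo
         * over its nested devices. */
        uint32_t scaled = pps / balance_param;
        return (int)(scaled % num_items);
    }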

2.3 Adaptive Strategy
Since there is no single constant balance parameter that yields the best PG distribution in all cases, we make this an adaptive procedure: the balance parameter is adjusted automatically while preparing to create a new pool, according to the cluster topology, PG# and replica#, in order to obtain the most uniform distribution.
1) Try different balance_param values when preparing a new pool:
   a. Iteratively call CRUSH(map, ruleno, x, balance_param) to get the corresponding PG distribution for each balance_param
   b. Calculate the stdev of PG# among all OSDs
   c. Choose the balance_param with the minimal stdev
2) Add a member variable to the pool struct pg_pool_t to save the best balance_param value

The adaptive procedure can be described as follows:
Input: cluster map, total PG number m, adaptive retry times n
Output: local optimal balance parameter balance_param

min_pg_stdev = MAX; balance_param = a;   // initial value
for trial from 0 to n {
    for pgid from 0 to m {
        calculate pps using the new generator in 2.1;
        for bucket b in cluster map            // apply CRUSH algorithm
            apply the corresponding bucket hashing algorithm and get an osd list for pgid
    }
    calculate pg_stdev_a by gathering all osd lists;   // stdev of PG distribution among all osds
    if pg_stdev_a < min_pg_stdev {
        min_pg_stdev = pg_stdev_a;
        balance_param = a;
    }
    adjust a to a new value;
}
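For reference, a minimal self-contained sketch of the selection criterion driving this loop (illustrative, not code from the attached patch): the mapping of each pgid to an osd list is done by CRUSH with the trial balance_param as in the pseudocode, and only the stdev computation over the resulting per-osd PG counts is shown here.

    /* Illustrative only: compute the stdev of PG counts per osd for one
     * trial balance_param; the adaptive loop keeps the balance_param
     * that minimizes this value. */
    #include <math.h>
    #include <stddef.h>

    static double pg_stdev(const int *pg_per_osd, size_t num_osds)
    {
        double mean = 0.0, var = 0.0;
        size_t i;

        for (i = 0; i < num_osds; i++)
            mean += pg_per_osd[i];
        mean /= (double)num_osds;

        for (i = 0; i < num_osds; i++) {
            double d = (double)pg_per_osd[i] - mean;
            var += d * d;
        }
        return sqrt(var / (double)num_osds);
    }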

We evaluated the performance of the optimized and the current CRUSH in a cluster consisting of 4 hosts, each with 10x1TB disks. We designed two test cases: the first creates a pool with 2048 pgs and 2 replicas, prepares 100 million 128KB objects, and then evaluates the read performance of these objects; the second creates a pool with 2048 pgs and 3 replicas, prepares 1 million 10MB objects, and then evaluates the read performance of these objects.
We compared the PG & data distribution and read performance of the two algorithms:
1. PG and data distribution is more balanced using the optimized CRUSH algorithm
   a) For 2048 PGs with 3 replicas, stdev of PG# on 40 osds decreases from 12.13 to 5.17; for 2048 PGs with 2 replicas, stdev of PG# on 40 osds decreases from 10.09 to 6.50
   b) For 1 million 10MB objects with 3 replicas, stdev of disk use% on 40 osds decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, stdev of disk use% on 40 osds decreases from 0.067 to 0.042
2. Large-scale performance is improved since data distribution is more balanced
   a) More than 10% performance improvement for 128K and 10M reads
   b) Write performance is not impacted

Detailed performance data can be found in http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf.

We also created a pull request: https://github.com/ceph/ceph/pull/2402

Thanks
Jian
-----Original Message-----
Sent: Wednesday, September 10, 2014 9:06 AM
To: Zhang, Jian
Subject: RE: FW: CURSH optimization for unbalanced pg distribution

The lists are rejecting the email because of the big attachments. Send with links instead?

Yujie sent out the following email yesterday, but it seems it was missed. Resending it.
-----Original Message-----
Sent: Tuesday, September 09, 2014 9:36 PM
Subject: Re: FW: CURSH optimization for unbalanced pg distribution
Forwarding per Sage's suggestion.

Very interesting discussion :-) For the record the corresponding pull request is https://github.com/ceph/ceph/pull/2402
-----Original Message-----
Sent: Wednesday, March 19, 2014 11:29 PM
To: Mark Nelson
Cc: Zhang, Jian; Duan, Jiangang; He, Yujie
Subject: Re: CURSH optimization for unbalanced pg distribution
For more detailed data, please refer to the *Testing results* part.

*Optimization proposals:*
After we dived into the source code of CRUSH and related papers, we came up with the following proposals:
1. Add different hash algorithms as an alternative to the Jenkins hash, e.g. an algorithm that produces even values when the range of the input value (pg#) is relatively small, or add a new bucket type at the same time if necessary.
This *might* work, but I don't have a strong intuition about it. The modeling we've done now has essentially assumed a statistically uniform distribution, which has some inherent imbalance for low values of n (num pgs in our case). I have generally assumed we can't do better than "random" and still have the other properties we want (independent, deterministic placement), but it may be possible.
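As a rough back-of-the-envelope check of this point (an estimate, not a figure from the original thread): if R = pg_num * replica_count placements are spread uniformly at random over N osds, each osd's PG count is approximately Binomial(R, 1/N), so

    stdev ~= sqrt((R/N) * (1 - 1/N))

For 2048 PGs, 3 replicas and 40 osds that gives sqrt(153.6 * 0.975) ~= 12.2, essentially the 12.13 reported elsewhere in this thread for stock CRUSH, i.e. stock CRUSH already behaves close to a statistically uniform placement.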
2. Find a better replica placement strategy instead of the current retry logic of crush_choose_firstn, which may cause CRUSH to behave badly.
We found there are several retry-count thresholds by referring to the code: choose_total_tries, choose_local_tries and choose_local_fallback_tries. They are used to decide whether to do a retry_bucket, a retry_descent, or use the permutation to do an exhaustive bucket search.
a) Backtracking retry. Currently the logic of crush_choose_firstn can only issue a retry either from the initial bucket (retry_descent) or from the current bucket (retry_bucket); how about retrying the intervening buckets?
b) Adjust the retry thresholds to other values. We did notice that the 'optimal' crush tunables could be used for this, but we still encounter unbalanced pg distribution when using the optimal strategy. Please refer to point 4 of the Testing results part.
c) Add a mechanism that can adjust the above-mentioned thresholds adaptively. Maybe we can record the retry counts of previous CRUSH calls and adjust the retry thresholds automatically according to that record.
I suggest ignoring all of this retry logic. The original version of CRUSH has the local retries to try to make data move "less far", but when we went back a year ago and did a statistical analysis of the distribution we found that *all* of these hacks degraded the quality of the placement, and by turning them all off (setting the 'optimal' values, which zeroes them all out except for total_retries) we got something that was indistinguishable from a uniform distribution.
3. Add soft links for pg directories. During pg creation, we can create soft links for pgs if the pg# on the selected osd is more than some threshold, say 10% more than the desired average, to move objects that will be stored in those pgs to another osd. Balanced disk utilization may be gained this way.
I think you need to be careful, but yes, this is an option. There is a similar exception mechanism in place that is used for other purposes and something similar could be done here. The main challenge will be in ensuring that the soft links/exceptions follow the same overall policy that CRUSH does after the raw mapping is performed. This is an option, but I would put it toward the bottom of the list...
4. Change placement strategy only for the step of selecting devices from hosts. We found in our testing results that pg distribution was balanced among hosts, which is reasonable since the pg# of each host is above 1K (according to the current BKM that pg# per osd should be about 100). So how about we apply CRUSH only to the internal buckets and find another simple but more balanced method to choose the osd from the host?
If you know the chassis can hold 12 disks, you can force the bucket size to twelve and somehow prevent users from adjusting the structure of the tree. Then you can use a simple mapping that is truly flat (like a linear mapping, disk = x % num_disks) for that bucket/subtree.
The downside of course is that if you remove a disk *everything* reshuffles, hence some sort of guardrails to prevent a user from inadvertently doing that. If a disk *does* fail, you just need to make sure the disk is marked "out" but not removed from the CRUSH hierarchy and the normal retry will kick in.
Note that all this is really doing is increasing the size of the "buckets" that we are (pseudo)randomly distributing over. It is still a random/uniform distribution, but the N value is 12 times bigger (for a 12-disk chassis) and as a result the variance is substantially lower.
I would suggest making a new bucket type that is called 'linear' and does a simple modulo, and trying this out. We will need a bunch of additional safety checks to help users avoid doing silly things (like adjusting the number of items in the linear buckets, which reshuffles everything) but that wouldn't be needed for an initial analysis of the performance impact.
Do you mind if we shift this thread over to ceph-devel? I think there are lots of people who would be interested in this discussion. We can of course leave off your attachment if you prefer.
Thanks!
sage
--
Loïc Dachary, Artisan Logiciel Libre