Haomai Wang
2014-10-22 03:09:54 UTC
[cc to ceph-devel]
Hmm, let me try to understand the meaning. Does this blueprint aim to make
ObjectStore::Transaction more flexible, so that ObjectStore's
successors can easily be aware of the data layout in a transaction? Let me
try to summarize the performance optimizations for this blueprint:
1. FileStore/KeyValueStore can be aware of the size of the write data and
handle it specially, which would be nice for large files
2. For a complex transaction that contains several ops for one object,
the redundant lookups can be reduced
But I think the component that actually consumes time is transaction
encode/decode, especially encoding/decoding the ghobject_t and collection
structures.
Combined with Message encode/decode, from what we have measured the
encode/decode logic plays an important role in op latency. Let me explain
what I want to do:
All Messages will be restructured to share a common header, and all
members of a Message will be fixed-size. I know some critical members, such
as ghobject_t, will be hard to pin down. So on the
Messenger side, ghobject_t and other variable-size structures will get
separate representations: ghobject_t, for example, will be translated to a
Message::object that is packed into a fixed-size piece of memory. The
Messenger can then pick structures directly out of messages without memory
copies or parsing. On the PG-layer side,
ObjectStore::Transaction will be refactored into a simple class: a list
of ops describes the sequence, and all the data used in the PG layer is
referenced directly. This may leave ObjectStore's
successors less flexible, but there is still room to adapt. For
subops, the raw message from the client will be validated in the primary PG,
the necessary extra info will be inserted at fixed positions in the message,
and the result will be propagated to the replica PGs.
Please correct me if any of this sounds wrong.
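As a rough illustration of the fixed-size idea above, translating a variable-length object id into a fixed wire slot might look like the sketch below. All names here (fixed_object_t, pack_object) are invented for this example, not actual Ceph structures, and a real design would need an overflow path for long names:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical fixed-size wire form of an object id (the real ghobject_t
// is variable-length).
struct fixed_object_t {
  int64_t  pool;
  uint32_t hash;
  uint64_t snap;
  char     name[64];   // truncated/padded; long names need a side channel
};

// Pack a variable-length name into the fixed slot.
inline fixed_object_t pack_object(int64_t pool, uint32_t hash, uint64_t snap,
                                  const std::string& name) {
  fixed_object_t o{};
  o.pool = pool;
  o.hash = hash;
  o.snap = snap;
  std::memcpy(o.name, name.data(),
              name.size() < sizeof(o.name) ? name.size()
                                           : sizeof(o.name) - 1);
  return o;
}

// Because every field sits at a fixed offset, a receiver can lift the id
// straight out of the frame buffer with one memcpy and no parsing.
inline fixed_object_t unpack_object(const char* buf) {
  fixed_object_t o;
  std::memcpy(&o, buf, sizeof(o));
  return o;
}
```

This is what "pick up structures in messages without memory copy and parsing" would buy: the decode step degenerates to pointer arithmetic plus at most one copy.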
Hi Haomai,
You and your team have been doing great work and I'm very happy that you
are working with Ceph! The performance gains you've seen are very
encouraging.
1. Use AsyncMessenger for both client and OSD
I would like to get this into the tree. I made a few cosmetic changes and
pushed a wip-msgr into ceph.git to make sure it builds okay. Once giant
is out we can mix this into the QA.
2. Use ObjectContext Cache
I saw an earlier version of this that didn't break things down per-PG;
have Sam's comments been addressed? IIRC the most recent issue was that
the cache was reset in PG::start_peering_interval.
This should make a big difference. +1 :)
3. Avoid extra calculations in the PG layers
I haven't seen this one?
I hope Ceph can compete with commercial storage systems, so how
to make Ceph's latency shorter is my main concern.
Over the past year I have dived into the full Ceph IO stack, from librbd
down to FileStore. Besides the attempts mentioned above, I think the main
bottleneck is the encode/decode work that exists in the Messenger and in
ObjectStore transactions.
As a first step, FileStore could accept inputs directly, without bufferlist
encodes/decodes. Now I am trying to send the MOSDOp payload directly to the
replica PGs and avoid the full ObjectStore::Transaction that the
replicated PG uses. The replica PG may need to calculate things again, but
as we measured, the main time consumed in the PG layer is transaction
encode/decode. KeyValueStore and FileStore would both be happy to adopt
this; then the main IO logic, such as read/write ops, would not need
encode/decode at all.
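The idea of letting each replica rebuild its store operations from the raw client ops, instead of decoding a transaction encoded by the primary, might be sketched as below. Every type here (ClientOp, StoreOp, build_transaction) is invented for illustration; these are not the real Ceph classes:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum class OpKind { Write, SetXattr };

// What the client op payload carries (hypothetical).
struct ClientOp {
  OpKind      kind = OpKind::Write;
  uint64_t    off = 0;
  std::string data;
  std::string key, value;
};

// What the backend actually executes (hypothetical).
struct StoreOp {
  OpKind      kind;
  uint64_t    off;
  std::string payload;
};

// Each replica derives the store ops locally from the raw client ops, so
// no decode of a primary-built transaction structure is needed.
std::vector<StoreOp> build_transaction(const std::vector<ClientOp>& ops) {
  std::vector<StoreOp> txn;
  for (const auto& op : ops) {
    switch (op.kind) {
      case OpKind::Write:
        txn.push_back({OpKind::Write, op.off, op.data});
        break;
      case OpKind::SetXattr:
        txn.push_back({OpKind::SetXattr, 0, op.key + "=" + op.value});
        break;
    }
  }
  return txn;
}
```

Note this only works when the translation is deterministic on every replica, which is exactly the caveat about object classes raised below.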
Can you send a message to ceph-devel with a bit more detail? We used to
do this, actually (prepare the transaction on the replicas instead of
encoding the one from the primary) but it was a bit less flexible when it
came to the object classes (which might not be deterministic).
I agree that encode/decode is a serious issue, but before
avoiding it for transactions I'd like to see what Matt Benjamin
is able to accomplish with his changes, or look at ways to
make transaction encoding in particular more efficient (e.g.
with fixed size structures). Also, you might be interested in
https://wiki.ceph.com/Planning/Blueprints/Hammer/osd%3A_update_Transaction_encoding
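The "fixed size structures" suggestion could look roughly like the following: each transaction op becomes one POD record, so encoding a transaction is a single memcpy and decoding is the reverse. The OpRecord layout is hypothetical, chosen only to show the shape of the idea:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// One fixed-size record per transaction op (illustrative field names).
struct OpRecord {
  uint32_t op_type;
  uint32_t flags;
  uint64_t offset;
  uint64_t length;
  uint64_t data_off;   // where this op's payload lives in the data segment
};
static_assert(sizeof(OpRecord) == 32, "fixed wire size");

// Encode: one bulk copy instead of per-field bufferlist appends.
std::vector<uint8_t> encode_ops(const std::vector<OpRecord>& ops) {
  std::vector<uint8_t> out(ops.size() * sizeof(OpRecord));
  std::memcpy(out.data(), ops.data(), out.size());
  return out;
}

// Decode: the receiver recovers the op list with one copy and no parsing.
std::vector<OpRecord> decode_ops(const std::vector<uint8_t>& in) {
  std::vector<OpRecord> ops(in.size() / sizeof(OpRecord));
  std::memcpy(ops.data(), in.data(), in.size());
  return ops;
}
```

Variable-length pieces (names, attribute values) would still need a separate data segment that the fixed records index into by offset.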
Next, I hope we can move to a new Message protocol. The main pain is
that the new Message protocol won't be compatible with the old one. Each
message is expected to have a common header, and the memory layout for the
data in a Message will be force-aligned and used directly. This is expected
to eliminate the whole-message encode/decode, which is the main bottleneck
in AsyncMessenger. And with the new Messenger, a subop can be directly
constructed via the common header, so the overall encode/decode logic can
be discarded under the new Message layout.
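A common, force-aligned header as described above might be sketched like this. The fields (magic, version, type, payload_len, tid) are illustrative guesses, not an actual Ceph wire format:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical fixed prefix shared by every message in the new protocol:
// a receiver can dispatch on type and locate the payload without decoding.
struct alignas(8) CommonHeader {
  uint32_t magic;        // protocol marker
  uint16_t version;      // header version, for future evolution
  uint16_t type;         // message type (op, reply, ...)
  uint64_t payload_len;  // bytes following the header
  uint64_t tid;          // transaction/sequence id
};
static_assert(sizeof(CommonHeader) == 24, "header must be fixed-size");

// Read a header out of a raw frame; memcpy sidesteps aliasing and
// alignment pitfalls while still compiling to a trivial load.
CommonHeader read_header(const uint8_t* frame) {
  CommonHeader h;
  std::memcpy(&h, frame, sizeof(h));
  return h;
}
```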
I'm also open to changes here, as long as we can make it somewhat
transparent to the user (perhaps only use it on the backend network, or
even better, detect/negotiate the protocol for backward compatibility).
But I think in general we can probably constrain the problem: it is only
the MOSD[Sub]Op[Reply] messages that have a real impact here, so we can
probably focus on changing just those messages' encoding. (Is that what
you're suggesting?)
Thanks!
sage
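The detect/negotiate fallback suggested above could be as simple as intersecting feature bits at connect time. The flag name and bit position here are invented for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical feature bit advertised by peers that support the new
// fixed-layout encoding.
constexpr uint64_t FEATURE_FIXED_ENCODING = 1ull << 40;

enum class Encoding { Classic, Fixed };

// Choose the wire encoding from the intersection of both feature sets,
// so an old peer transparently falls back to the classic encoding.
Encoding negotiate(uint64_t my_features, uint64_t peer_features) {
  uint64_t common = my_features & peer_features;
  return (common & FEATURE_FIXED_ENCODING) ? Encoding::Fixed
                                           : Encoding::Classic;
}
```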
--
Best Regards,
Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html