Filestore throttling

Discussion:

GuangYang

2014-10-23 03:52:05 UTC

Hello Cephers,
During our testing, I found that the filestore throttling became a limiting factor for performance. The four settings (with default values) are:
filestore queue max ops = 50
filestore queue max bytes = 100 << 20
filestore queue committing max ops = 500
filestore queue committing max bytes = 100 << 20

My understanding is that if we lift the thresholds, the end-to-end response time for ops could improve a lot under high load, and that is one reason to have a journal. The downside is that if a read follows a successful write, the read might be stuck longer because the object has not been flushed yet.

Is my understanding correct here?

If that is the tradeoff and read-after-write is not a concern in our use case, can I raise the parameters to the values below?
filestore queue max ops = 500
filestore queue max bytes = 200 << 20
filestore queue committing max ops = 500
filestore queue committing max bytes = 200 << 20

It turns out to be very helpful during the PG peering stage (e.g. OSD down and up).

Thanks,
Guang
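
To make the numbers above concrete, here is a minimal Python sketch of an admission budget with an op limit and a byte limit. It is not Ceph code: the class `QueueThrottle`, the helper `admitted_before_blocking`, and the blocking rule (block when admitting one more op would exceed either limit) are assumptions made for illustration only. It also shows that `100 << 20` is 100 MiB expressed in bytes.

```python
from dataclasses import dataclass


@dataclass
class QueueThrottle:
    """Hypothetical model of an ops + bytes budget (illustration, not Ceph code)."""
    max_ops: int
    max_bytes: int
    ops: int = 0
    queued_bytes: int = 0

    def would_block(self, op_bytes: int) -> bool:
        # Assumed rule: block if one more op would exceed either limit.
        return (self.ops + 1 > self.max_ops
                or self.queued_bytes + op_bytes > self.max_bytes)

    def admit(self, op_bytes: int) -> bool:
        # Admit the op and charge it against both budgets, or refuse.
        if self.would_block(op_bytes):
            return False
        self.ops += 1
        self.queued_bytes += op_bytes
        return True


def admitted_before_blocking(throttle: QueueThrottle, op_bytes: int) -> int:
    """Count how many equally sized ops fit before the throttle blocks."""
    count = 0
    while throttle.admit(op_bytes):
        count += 1
    return count


MIB = 1 << 20  # "100 << 20" in the settings is 100 MiB in bytes

default_small = admitted_before_blocking(QueueThrottle(50, 100 * MIB), 4096)
default_large = admitted_before_blocking(QueueThrottle(50, 100 * MIB), 4 * MIB)
raised_small = admitted_before_blocking(QueueThrottle(500, 200 * MIB), 4096)
raised_large = admitted_before_blocking(QueueThrottle(500, 200 * MIB), 4 * MIB)
```

Under these assumptions, small ops hit the op limit first and large ops hit the byte limit first, so raising only one of the two limits helps only one kind of workload.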

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Sage Weil

2014-10-23 04:06:21 UTC

On Thu, 23 Oct 2014, GuangYang wrote:
> Hello Cephers,
> During our testing, I found that the filestore throttling became a limiting factor for performance, the four settings (with default value) are:
> filestore queue max ops = 50
> filestore queue max bytes = 100 << 20
> filestore queue committing max ops = 500
> filestore queue committing max bytes = 100 << 20
>
> My understanding is, if we lift the threshold, the response for op (end to end) could be improved a lot during high load, and that is one reason to have journal. The downside is that if there is a read following a successful write, the read might stuck longer as the object is not flushed.
>
> Is my understanding correct here?
>
> If that is the tradeoff and read after write is not a concern in our use case, can I lift the parameters to below values?
> filestore queue max ops = 500
> filestore queue max bytes = 200 << 20
> filestore queue committing max ops = 500
> filestore queue committing max bytes = 200 << 20
>
> It turns out very helpful during PG peering stage (e.g. OSD down and up).

That looks reasonable to me.

For peering, I think there isn't really any reason to block sooner rather
than later. I wonder if we should try to mark those transactions such
that they don't run up against the usual limits...

Is this firefly or something later? Sometime after firefly Sam made some
changes so that the OSD is more careful about waiting for PG metadata to
be persisted before sharing state. I wonder if you will still see the
same improvement now...

sage

GuangYang

2014-10-23 04:30:27 UTC

Thanks Sage for the quick response!

We are using firefly (v0.80.4 with a couple of back-ports). One observation is that during the peering stage (especially if the OSD was down/in for several hours under high load), the peering ops contend with normal ops and cause extremely long latency (up to minutes) for client ops. The contention happens in the filestore for the throttling budget, and also at the dispatcher/op threads. I will send another email with more details after more investigation.

As for this one, I created pull request #2779 to change the default value of filestore_queue_max_ops to 500 (which is what the documentation specifies, but the code is inconsistent with it). Do you think we should make the other values the defaults as well?

Thanks,
Guang

Sage Weil

2014-10-23 13:58:58 UTC

On Thu, 23 Oct 2014, GuangYang wrote:
> Thanks Sage for the quick response!
>
> We are using firefly (v0.80.4 with a couple of back-ports). One
> observation we have is that during peering stage (especially if the OSD
> got down/in for several hours with high load), the peering OPs are in
> contention with normal OPs and thus bring extremely long latency (up to
> minutes) for client OPs, the contention happened in filestore for
> throttling budget, it also happened at dispatcher/op threads, I will
> send another email with more details after more investigation.

It sounds like the problem here is that when the pg logs are long (1000's
of entries) the MOSDPGLog messages are big and generate a big
ObjectStore::Transaction. This can be mitigated by shortening the logs,
but that means shortening the duration that an OSD can be down without
triggering a backfill. Part of the answer is probably to break the PGLog
messages into smaller pieces.

> As for this one, I created a pull request #2779 to change the default
> value of filestore_queue_max_ops to 500 (which is specified in the
> document but code is inconsistent), do you think we should make others
> as default as well?

We reduced it to 50 almost 2 years ago, in this commit:

commit 44dca5c8c5058acf9bc391303dc77893793ce0be
Author: Sage Weil <***@inktank.com>
Date:   Sat Jan 19 17:33:25 2013 -0800

    filestore: disable extra committing queue allowance

    The motivation here is if there is a problem draining the op queue
    during a sync. For XFS and ext4, this isn't generally a problem: you
    can continue to make writes while a syncfs(2) is in progress. There
    are currently some possible implementation issues with btrfs, but we
    have not demonstrated them recently.

    Meanwhile, this can cause queue length spikes that screw up latency.
    During a commit, we allow too much into the queue (say, recovery
    operations). After the sync finishes, we have to drain it out before
    we can queue new work (say, a higher priority client request). Having
    a deep queue below the point where priorities order work limits the
    value of the priority queue.

    Signed-off-by: Sage Weil <***@inktank.com>

I'm not sure it makes sense to increase it in the general case. It might
make sense for your workload, or we may want to make peering transactions
some sort of special case...?

sage
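
The suggestion above to break the PGLog messages into smaller pieces can be sketched generically as chunking a long list of log entries, so that each piece would produce a bounded transaction. This is only an illustration in Python: `split_into_chunks` and the chunk size are hypothetical, and nothing here reflects how Ceph actually encodes or sends MOSDPGLog messages.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(entries: Sequence[T], max_entries: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `entries`, each at most `max_entries` long."""
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    for start in range(0, len(entries), max_entries):
        # Order is preserved, so applying the pieces in sequence is
        # equivalent to applying the original list in one go.
        yield list(entries[start:start + max_entries])


# A log with "1000's of entries", split into pieces of 500 (arbitrary size).
log_entries = list(range(3000))
pieces = list(split_into_chunks(log_entries, 500))
```

The tradeoff in this sketch is more messages and transactions in exchange for each one being smaller, which gives a throttle or scheduler more points at which other work can be interleaved.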

GuangYang

2014-10-24 04:07:47 UTC

On Thu, 23 Oct 2014, Sage Weil wrote:
> On Thu, 23 Oct 2014, GuangYang wrote:
>> Thanks Sage for the quick response!
>>
>> We are using firefly (v0.80.4 with a couple of back-ports). One
>> observation we have is that during peering stage (especially if the OSD
>> got down/in for several hours with high load), the peering OPs are in
>> contention with normal OPs and thus bring extremely long latency (up to
>> minutes) for client OPs, the contention happened in filestore for
>> throttling budget, it also happened at dispatcher/op threads, I will
>> send another email with more details after more investigation.
>
> It sounds like the problem here is that when the pg logs are long (1000's
> of entries) the MOSDPGLog messages are big and generate a big
> ObjectStore::Transaction. This can be mitigated by shortening the logs,
> but that means shortening the duration that an OSD can be down without
> triggering a backfill. Part of the answer is probably to break the PGLog
> messages into smaller pieces.

Making the transactions smaller should help; let me test that and get back with more information.

>> As for this one, I created a pull request #2779 to change the default
>> value of filestore_queue_max_ops to 500 (which is specified in the
>> document but code is inconsistent), do you think we should make others
>> as default as well?
>
> We reduced it to 50 almost 2 years ago, in this commit:
>
> commit 44dca5c8c5058acf9bc391303dc77893793ce0be
> Author: Sage Weil <sage-4GqslpFJ+***@public.gmane.org>
> Date: Sat Jan 19 17:33:25 2013 -0800
>
> filestore: disable extra committing queue allowance
>
> The motivation here is if there is a problem draining the op queue
> during a sync. For XFS and ext4, this isn't generally a problem: you
> can continue to make writes while a syncfs(2) is in progress. There
> are currently some possible implementation issues with btrfs, but we
> have not demonstrated them recently.
>
> Meanwhile, this can cause queue length spikes that screw up latency.
> During a commit, we allow too much into the queue (say, recovery
> operations). After the sync finishes, we have to drain it out before
> we can queue new work (say, a higher priority client request). Having
> a deep queue below the point where priorities order work limits the
> value of the priority queue.
>
> Signed-off-by: Sage Weil <sage-4GqslpFJ+***@public.gmane.org>
>
> I'm not sure it makes sense to increase it in the general case. It might
> make sense for your workload, or we may want to make peering transactions
> some sort of special case...?

It is actually another commit:

commit 40654d6d53436c210b2f80911217b044f4d7643a

    filestore: filestore_queue_max_ops 500 -> 50

    Having a deep queue limits the effectiveness of the priority queues
    above by adding additional latency.

I don't quite understand how increasing this value might add latency; would you mind elaborating?

Sage Weil

2014-10-24 04:26:07 UTC

On Fri, 24 Oct 2014, GuangYang wrote:
> > commit 44dca5c8c5058acf9bc391303dc77893793ce0be
> > Author: Sage Weil <sage-4GqslpFJ+***@public.gmane.org>
> > Date: Sat Jan 19 17:33:25 2013 -0800
> >
> > filestore: disable extra committing queue allowance
> >
> > The motivation here is if there is a problem draining the op queue
> > during a sync. For XFS and ext4, this isn't generally a problem: you
> > can continue to make writes while a syncfs(2) is in progress. There
> > are currently some possible implementation issues with btrfs, but we
> > have not demonstrated them recently.
> >
> > Meanwhile, this can cause queue length spikes that screw up latency.
> > During a commit, we allow too much into the queue (say, recovery
> > operations). After the sync finishes, we have to drain it out before
> > we can queue new work (say, a higher priority client request). Having
> > a deep queue below the point where priorities order work limits the
> > value of the priority queue.
> >
> > Signed-off-by: Sage Weil <sage-4GqslpFJ+***@public.gmane.org>
> >
> > I'm not sure it makes sense to increase it in the general case. It might
> > make sense for your workload, or we may want to make peering transactions
> > some sort of special case...?
> It is actually another commit:
>
> commit 40654d6d53436c210b2f80911217b044f4d7643a
> filestore: filestore_queue_max_ops 500 -> 50
> Having a deep queue limits the effectiveness of the priority queues
> above by adding additional latency.

Ah, you're right.

> I don't quite understand the use case that it might add additional
> latency by increasing this value, would you mind elaborating?

There is a priority queue a bit further up the stack (OpWQ), in which high
priority items (e.g., client IO) can move ahead of low priority items
(e.g., recovery). If the queue beneath that (the filestore one) is very
deep, the client IO will only have a marginal advantage over the recovery
IO since it will still sit in the second queue for a long time. Ideally,
we want the priority queue to be the deepest one (so that we maximize the
amount of stuff we can reorder) and the queues above and below to be as
shallow as possible.

I think the peering operations are different because they can't be
reordered with respect to anything else in the same PG (unlike, say,
client vs recovery io for that pg). On the other hand, there may be
client IO on other PGs that we want to reorder and finish more quickly.
Allowing all of the right reordering and also getting the priority
inheritance right here is probably a hugely complex undertaking, so we
probably just want to go for a reasonably simple strategy that avoids the
worst instances of priority inversion (where an important thing is stuck
behind a slow thing). :/

In any case, though, I'm skeptical that making the lowest-level queue
deeper is going to help in general, even if it addresses the peering
case specifically...

sage
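
The argument above about a deep queue sitting below the priority queue can be illustrated with a toy two-stage model in Python. Everything in it is an assumption made for the sake of the example: the function `client_wait_ticks`, the one-op-per-tick service rate, and the backlog size are hypothetical and are not taken from the OSD or FileStore code. The upper stage is a priority queue where a client op overtakes queued recovery ops; the lower stage is a bounded FIFO that is already full of recovery ops when the client op arrives.

```python
import heapq
from collections import deque


def client_wait_ticks(lower_depth: int, recovery_backlog: int = 1000) -> int:
    """Ticks until a client op completes when the lower FIFO is already full."""
    if lower_depth < 1:
        raise ValueError("lower_depth must be at least 1")

    # Upper stage: priority queue, lower number means higher priority.
    upper = []
    seq = 0
    for _ in range(recovery_backlog):
        heapq.heappush(upper, (1, seq, "recovery"))
        seq += 1

    # Lower stage: bounded FIFO, filled from the upper stage up to its depth.
    lower = deque()

    def refill() -> None:
        while upper and len(lower) < lower_depth:
            lower.append(heapq.heappop(upper)[2])

    refill()

    # The client op arrives now and jumps ahead of everything still in
    # the upper queue, but it cannot overtake ops already in the FIFO.
    heapq.heappush(upper, (0, seq, "client"))

    ticks = 0
    while True:
        ticks += 1
        finished = lower.popleft()  # assumed: one op completes per tick
        if finished == "client":
            return ticks
        refill()


shallow = client_wait_ticks(lower_depth=50)
deep = client_wait_ticks(lower_depth=500)
```

In this model the client op always waits for roughly one full lower queue of recovery work, so its latency grows linearly with the depth of the lower queue regardless of how well the upper queue prioritizes it.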