snap_trimming + backfilling is inefficient with many purged

Discussion:

snap_trimming + backfilling is inefficient with many purged_snaps

Dan Van Der Ster

2014-09-18 12:50:44 UTC

(moving this discussion to -devel)

Date: 17 Sep 2014 18:02:09 CEST
Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
On Wed, Sep 17, 2014 at 5:42 PM, Dan Van Der Ster

Sent: Sep 17, 2014 5:33 PM
To: Dan Van Der Ster
Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
On Wed, Sep 17, 2014 at 5:24 PM, Dan Van Der Ster

Hi Florian,

Hi Craig,
just dug this up in the list archives.

In the interest of removing variables, I removed all snapshots on all
pools,
then restarted all ceph daemons at the same time. This brought up osd.8
as
well.

So just to summarize this: your 100% CPU problem at the time went away
after you removed all snapshots, and the actual cause of the issue was
never found?
I am seeing a similar issue now, and have filed
http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
again. Can you take a look at that issue and let me know if anything
in the description sounds familiar?

Could your ticket be related to the snap trimming issue I’ve finally
narrowed down in the past couple days?
http://tracker.ceph.com/issues/9487
Bump up debug_osd to 20 then check the log during one of your incidents.
If it is busy logging the snap_trimmer messages, then it’s the same issue.
(The issue is that rbd pools have many purged_snaps, but sometimes after
backfilling a PG the purged_snaps list is lost and thus the snap trimmer
becomes very busy whilst re-trimming thousands of snaps. During that time (a
few minutes on my cluster) the OSD is blocked.)

That sounds promising, thank you! debug_osd=10 should actually be
sufficient as those snap_trim messages get logged at that level. :)
Do I understand your issue report correctly in that you have found
setting osd_snap_trim_sleep to be ineffective, because it's being
applied when iterating from PG to PG, rather than from snap to snap?
If so, then I'm guessing that that can hardly be intentional…

I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs.

We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep.

To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516
Breaking out of the trimmer like that should allow IOs to the trimming PG to get through.

The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(10000).

Looking forward to any ideas someone might have.

Cheers, Dan

N��r��y��b�X��ǧv�^�)޺{.n�+��z�]z��{ay�ʇڙ�,j��f��h��z��w��

Florian Haas

2014-09-18 17:03:59 UTC

Permalink

Hi Dan,

saw the pull request, and can confirm your observations, at least
partially. Comments inline.

On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster

Do I understand your issue report correctly in that you have found
setting osd_snap_trim_sleep to be ineffective, because it's being
applied when iterating from PG to PG, rather than from snap to snap=

If so, then I'm guessing that that can hardly be intentional=E2=80=A6

I=E2=80=99m beginning to agree with you on that guess. AFAICT, the no=

rmal behavior of the snap trimmer is to trim one single snap, the one w=
hich is in the snap_trimq but not yet in purged_snaps. So the only time=
the current sleep implementation could be useful is if we rm=E2=80=99d=
a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. Bu=
t those aren=E2=80=99t a huge problem anyway since you=E2=80=99d at mos=
t need to trim O(100) PGs.

Hmm. I'm actually seeing this in a system where the problematic snaps
could *only* have been RBD snaps.

We could move the snap trim sleep into the SnapTrimmer state machine,=

for example in ReplicatedPG::NotTrimming::react. This should allow oth=
er IOs to get through to the OSD, but of course the trimming PG would r=
emain locked. And it would be locked for even longer now due to the sle=
ep.

To solve that we could limit the number of trims per instance of the =

SnapTrimmer, like I=E2=80=99ve done in this pull req: https://github.co=
m/ceph/ceph/pull/2516

Breaking out of the trimmer like that should allow IOs to the trimmin=

g PG to get through.

The second aspect of this issue is why are the purged_snaps being los=

t to begin with. I=E2=80=99ve managed to reproduce that on my test clus=
ter. All you have to do is create many pool snaps (e.g. of a nearly emp=
ty pool), then rmsnap all those snapshots. Then use crush reweight to m=
ove the PGs around. With debug_osd>=3D10, you will see "adding snap 1 t=
o purged_snaps=E2=80=9D, which is one signature of this lost purged_sna=
ps issue. To reproduce slow requests the number of snaps purged needs t=
o be O(10000).

Hmmm, I'm not sure if I confirm that. I see "adding snap X to
purged_snaps", but only after the snap has been purged. See
https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
fact that the OSD tries to trim a snap only to get an ENOENT is
probably indicative of something being fishy with the snaptrimq and/or
the purged_snaps list as well.

Looking forward to any ideas someone might have.

So am I. :)

Cheers,
=46lorian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Florian Haas

2014-09-18 19:03:18 UTC

Permalink

On Thu, Sep 18, 2014 at 8:56 PM, Mango Thirtyfour

Hi Florian,

Post by Florian Haas
Hi Dan,
saw the pull request, and can confirm your observations, at least
partially. Comments inline.
On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster

Do I understand your issue report correctly in that you have fou=

Post by Florian Haas

setting osd_snap_trim_sleep to be ineffective, because it's bein=

Post by Florian Haas

applied when iterating from PG to PG, rather than from snap to s=

nap?

Post by Florian Haas

If so, then I'm guessing that that can hardly be intentional=E2=80=

=A6

Post by Florian Haas

I=E2=80=99m beginning to agree with you on that guess. AFAICT, the=

normal behavior of the snap trimmer is to trim one single snap, the on=
e which is in the snap_trimq but not yet in purged_snaps. So the only t=
ime the current sleep implementation could be useful is if we rm=E2=80=99=
d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. B=
ut those aren=E2=80=99t a huge problem anyway since you=E2=80=99d at mo=
st need to trim O(100) PGs.

Post by Florian Haas
Hmm. I'm actually seeing this in a system where the problematic snap=

Post by Florian Haas
could *only* have been RBD snaps.

True, as am I. The current sleep is useful in this case, but since we=

'd normally only expect up to ~100 of these PGs per OSD, the trimming o=
f 1 snap across all of those PGs would finish rather quickly anyway. La=
tency would surely be increased momentarily, but I wouldn't expect 90s =
slow requests like I have with the 30000 snap_trimq single PG.

Possibly the sleep is useful in both places.

Post by Florian Haas

We could move the snap trim sleep into the SnapTrimmer state machi=

ne, for example in ReplicatedPG::NotTrimming::react. This should allow =
other IOs to get through to the OSD, but of course the trimming PG woul=
d remain locked. And it would be locked for even longer now due to the =
sleep.

Post by Florian Haas

To solve that we could limit the number of trims per instance of t=

he SnapTrimmer, like I=E2=80=99ve done in this pull req: https://github=
=2Ecom/ceph/ceph/pull/2516

Post by Florian Haas

Breaking out of the trimmer like that should allow IOs to the trim=

ming PG to get through.

Post by Florian Haas

The second aspect of this issue is why are the purged_snaps being =

lost to begin with. I=E2=80=99ve managed to reproduce that on my test c=
luster. All you have to do is create many pool snaps (e.g. of a nearly =
empty pool), then rmsnap all those snapshots. Then use crush reweight t=
o move the PGs around. With debug_osd>=3D10, you will see "adding snap =
1 to purged_snaps=E2=80=9D, which is one signature of this lost purged_=
snaps issue. To reproduce slow requests the number of snaps purged need=
s to be O(10000).

Post by Florian Haas
Hmmm, I'm not sure if I confirm that. I see "adding snap X to
purged_snaps", but only after the snap has been purged. See
https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
fact that the OSD tries to trim a snap only to get an ENOENT is
probably indicative of something being fishy with the snaptrimq and/=

Post by Florian Haas
the purged_snaps list as well.

With such a long snap_trimq there in your log, I suspect you're seein=

g the exact same behavior as I am. In my case the first snap trimmed is=
snap 1, of course because that is the first rm'd snap, and the content=
s of your pool are surely different. I also see the ENOENT messages... =
again confirming those snaps were already trimmed. Anyway, what I've ob=
served is that a large snap_trimq like that will block the OSD until th=
ey are all re-trimmed.

That's... a mess.

So what is your workaround for recovery? My hunch would be to

- stop all access to the cluster;
- set nodown and noout so that other OSDs don't mark spinning OSDs
down (which would cause all sorts of primary and PG reassignments,
useless backfill/recovery when mon osd down out interval expires,
etc.);
- set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
so that at least *between* PGs, the OSD has a chance to respond to
heartbeats and do whatever else it needs to do;
- let the snap trim play itself out over several hours (days?).

That sounds utterly awful, but if anyone has a better idea (other than
"wait until the patch is merged"), I'd be all ears.

Cheers
=46lorian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Dan van der Ster

2014-09-18 19:12:47 UTC

Permalink

Hi,

Post by Florian Haas
=20

Hi Florian,
=20

Post by Florian Haas
Hi Dan,
=20
saw the pull request, and can confirm your observations, at least
partially. Comments inline.
=20
On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster

Do I understand your issue report correctly in that you have fou=

Post by Florian Haas

setting osd_snap_trim_sleep to be ineffective, because it's bein=

Post by Florian Haas

applied when iterating from PG to PG, rather than from snap to s=

nap?

Post by Florian Haas

If so, then I'm guessing that that can hardly be intentional=E2=80=

=A6

Post by Florian Haas

=20
=20
I=E2=80=99m beginning to agree with you on that guess. AFAICT, the=

normal behavior of the snap trimmer

Post by Florian Haas

Post by Florian Haas
to trim one single snap, the one which is in the snap_trimq but not=

yet in purged_snaps. So the

Post by Florian Haas

Post by Florian Haas
only time the current sleep implementation could be useful is if we=

rm=E2=80=99d a snap across many PGs

Post by Florian Haas

Post by Florian Haas
once, e.g. rm a pool snap or an rbd snap. But those aren=E2=80=99t =

a huge problem anyway since you=E2=80=99d at

Post by Florian Haas

Post by Florian Haas
most need to trim O(100) PGs.
=20
Hmm. I'm actually seeing this in a system where the problematic sna=

Post by Florian Haas

Post by Florian Haas
could *only* have been RBD snaps.

=20
True, as am I. The current sleep is useful in this case, but since w=

e'd normally only expect up

Post by Florian Haas
to

~100 of these PGs per OSD, the trimming of 1 snap across all of thos=

e PGs would finish rather

Post by Florian Haas

quickly anyway. Latency would surely be increased momentarily, but I=

wouldn't expect 90s slow

Post by Florian Haas

requests like I have with the 30000 snap_trimq single PG.
=20
Possibly the sleep is useful in both places.
=20

Post by Florian Haas

We could move the snap trim sleep into the SnapTrimmer state machi=

ne, for example in

Post by Florian Haas

Post by Florian Haas
ReplicatedPG::NotTrimming::react. This should allow other IOs to ge=

t through to the OSD, but of

Post by Florian Haas

Post by Florian Haas
course the trimming PG would remain locked. And it would be locked =

for even longer now due to

Post by Florian Haas
the

Post by Florian Haas
sleep.

=20
To solve that we could limit the number of trims per instance of t=

he SnapTrimmer, like I=E2=80=99ve

Post by Florian Haas
done

Post by Florian Haas
in this pull req: https://github.com/ceph/ceph/pull/2516

Breaking out of the trimmer like that should allow IOs to the trim=

ming PG to get through.

Post by Florian Haas

=20
The second aspect of this issue is why are the purged_snaps being =

lost to begin with. I=E2=80=99ve

Post by Florian Haas

Post by Florian Haas
managed to reproduce that on my test cluster. All you have to do is=

create many pool snaps (e.g.

Post by Florian Haas

Post by Florian Haas
a nearly empty pool), then rmsnap all those snapshots. Then use cru=

sh reweight to move the PGs

Post by Florian Haas

Post by Florian Haas
around. With debug_osd>=3D10, you will see "adding snap 1 to purged=

_snaps=E2=80=9D, which is one signature

Post by Florian Haas

Post by Florian Haas
this lost purged_snaps issue. To reproduce slow requests the number=

of snaps purged needs to be

Post by Florian Haas

Post by Florian Haas
O(10000).
=20
Hmmm, I'm not sure if I confirm that. I see "adding snap X to
purged_snaps", but only after the snap has been purged. See
https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
fact that the OSD tries to trim a snap only to get an ENOENT is
probably indicative of something being fishy with the snaptrimq and=

/or

Post by Florian Haas

Post by Florian Haas
the purged_snaps list as well.

=20
With such a long snap_trimq there in your log, I suspect you're seei=

ng the exact same behavior as

Post by Florian Haas
I

am. In my case the first snap trimmed is snap 1, of course because t=

hat is the first rm'd snap,

Post by Florian Haas
and

the contents of your pool are surely different. I also see the ENOEN=

T messages... again

Post by Florian Haas
confirming

those snaps were already trimmed. Anyway, what I've observed is that=

a large snap_trimq like that

Post by Florian Haas

will block the OSD until they are all re-trimmed.

=20
That's... a mess.
=20
So what is your workaround for recovery? My hunch would be to
=20
- stop all access to the cluster;
- set nodown and noout so that other OSDs don't mark spinning OSDs
down (which would cause all sorts of primary and PG reassignments,
useless backfill/recovery when mon osd down out interval expires,
etc.);
- set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
so that at least *between* PGs, the OSD has a chance to respond to
heartbeats and do whatever else it needs to do;
- let the snap trim play itself out over several hours (days?).
=20

What I've been doing is I just continue draining my OSDs, two at a time=
=2E Each time, 1-2 other OSDs become blocked for a couple minutes (out =
of the ~1 hour it takes to drain) while a single PG re-trims, leading t=
o ~100 slow requests. The OSD must still be responding to the peer ping=
s, since other OSDs do not mark it down. Luckily this doesn't happen wi=
th every single movement of our pool 5 PGs, otherwise it would be a dis=
aster like you said.

Cheers, Dan

Post by Florian Haas
That sounds utterly awful, but if anyone has a better idea (other tha=

Post by Florian Haas
"wait until the patch is merged"), I'd be all ears.
=20
Cheers
Florian

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Florian Haas

2014-09-18 21:19:34 UTC

Permalink

On Thu, Sep 18, 2014 at 9:12 PM, Dan van der Ster

Post by Dan Van Der Ster
Hi,

On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster <daniel.vanderster=

Hi Florian,

Post by Florian Haas
Hi Dan,
saw the pull request, and can confirm your observations, at least
partially. Comments inline.
On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster

Do I understand your issue report correctly in that you have fo=

und