Discussion:
snap_trimming + backfilling is inefficient with many purged_snaps
Dan Van Der Ster
2014-09-18 12:50:44 UTC
(moving this discussion to -devel)

From the earlier ceph-users thread "RGW hung, 2 OSDs using 100% CPU", 17 Sep 2014:
Hi Florian,
Hi Craig,
just dug this up in the list archives.
In the interest of removing variables, I removed all snapshots on all pools, then restarted all ceph daemons at the same time. This brought up osd.8 as well.
So just to summarize this: your 100% CPU problem at the time went away
after you removed all snapshots, and the actual cause of the issue was
never found?
I am seeing a similar issue now, and have filed
http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
again. Can you take a look at that issue and let me know if anything
in the description sounds familiar?
Could your ticket be related to the snap trimming issue I’ve finally
narrowed down in the past couple days?
http://tracker.ceph.com/issues/9487
Bump up debug_osd to 20 then check the log during one of your incidents.
If it is busy logging the snap_trimmer messages, then it’s the same issue.
(The issue is that rbd pools have many purged_snaps, but sometimes after
backfilling a PG the purged_snaps list is lost and thus the snap trimmer
becomes very busy whilst re-trimming thousands of snaps. During that time (a
few minutes on my cluster) the OSD is blocked.)
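For example, something along these lines (a rough sketch; osd.8 and the log path are only examples, and the exact message wording can differ between releases):

    # bump the osd debug level at runtime
    ceph tell osd.8 injectargs '--debug-osd 20'
    # during an incident, look for snap trimmer activity in that OSD's log
    grep snap_trim /var/log/ceph/ceph-osd.8.log | tail -n 20
    # restore the default level afterwards
    ceph tell osd.8 injectargs '--debug-osd 0/5'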
That sounds promising, thank you! debug_osd=10 should actually be
sufficient as those snap_trim messages get logged at that level. :)
Do I understand your issue report correctly in that you have found
setting osd_snap_trim_sleep to be ineffective, because it's being
applied when iterating from PG to PG, rather than from snap to snap?
If so, then I'm guessing that that can hardly be intentional…
I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs.

We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep.

To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516
Breaking out of the trimmer like that should allow IOs to the trimming PG to get through.

The second aspect of this issue is why the purged_snaps are being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(10000).
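As a rough command sketch (pool name, snapshot count and osd id are placeholders, and I haven't re-run it in exactly this form):

    # create many pool snapshots of a nearly empty test pool
    for i in $(seq 1 1000); do ceph osd pool mksnap testpool snap-$i; done
    # remove them all again, queueing them for trimming
    for i in $(seq 1 1000); do ceph osd pool rmsnap testpool snap-$i; done
    # move PGs around so they get backfilled onto other OSDs
    ceph osd crush reweight osd.3 0.5
    # with debug_osd >= 10, the re-trimming shows up in the new OSD's log
    grep 'adding snap' /var/log/ceph/ceph-osd.3.log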

Looking forward to any ideas someone might have.

Cheers, Dan



Florian Haas
2014-09-18 17:03:59 UTC
Hi Dan,

saw the pull request, and can confirm your observations, at least
partially. Comments inline.

On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
Post by Dan Van Der Ster
So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs.
Hmm. I'm actually seeing this in a system where the problematic snaps
could *only* have been RBD snaps.
Post by Dan Van Der Ster
With debug_osd>=10, you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(10000).

Hmmm, I'm not sure if I confirm that. I see "adding snap X to
purged_snaps", but only after the snap has been purged. See
https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
fact that the OSD tries to trim a snap only to get an ENOENT is
probably indicative of something being fishy with the snaptrimq and/or
the purged_snaps list as well.
Looking forward to any ideas someone might have.
So am I. :)

Cheers,
Florian
Florian Haas
2014-09-18 19:03:18 UTC
On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster wrote:
Hi Florian,
Post by Florian Haas
Hmm. I'm actually seeing this in a system where the problematic snaps
could *only* have been RBD snaps.
True, as am I. The current sleep is useful in this case, but since we'd normally only expect up to ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs would finish rather quickly anyway. Latency would surely be increased momentarily, but I wouldn't expect 90s slow requests like I have with the 30000-snap snap_trimq on a single PG.
Possibly the sleep is useful in both places.
Post by Florian Haas
Hmmm, I'm not sure if I confirm that. I see "adding snap X to
purged_snaps", but only after the snap has been purged. See
https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
fact that the OSD tries to trim a snap only to get an ENOENT is
probably indicative of something being fishy with the snaptrimq and/or
the purged_snaps list as well.
With such a long snap_trimq there in your log, I suspect you're seeing the exact same behavior as I am. In my case the first snap trimmed is snap 1, of course because that is the first rm'd snap, and the contents of your pool are surely different. I also see the ENOENT messages... again confirming those snaps were already trimmed. Anyway, what I've observed is that a large snap_trimq like that will block the OSD until they are all re-trimmed.

That's... a mess.

So what is your workaround for recovery? My hunch would be to

- stop all access to the cluster;
- set nodown and noout so that other OSDs don't mark spinning OSDs
down (which would cause all sorts of primary and PG reassignments,
useless backfill/recovery when mon osd down out interval expires,
etc.);
- set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
so that at least *between* PGs, the OSD has a chance to respond to
heartbeats and do whatever else it needs to do;
- let the snap trim play itself out over several hours (days?).

That sounds utterly awful, but if anyone has a better idea (other than
"wait until the patch is merged"), I'd be all ears.

Cheers,
Florian
Dan van der Ster
2014-09-18 19:12:47 UTC
Hi,
Post by Florian Haas
That's... a mess.
So what is your workaround for recovery?
=20
What I've been doing is I just continue draining my OSDs, two at a time. Each time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour it takes to drain) while a single PG re-trims, leading to ~100 slow requests. The OSD must still be responding to the peer pings, since other OSDs do not mark it down. Luckily this doesn't happen with every single movement of our pool 5 PGs, otherwise it would be a disaster like you said.
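(For reference, a drain can be done by marking the OSD out or crush-reweighting it to zero; a sketch with example ids, not a recommendation:)

    # take two OSDs out so their PGs backfill elsewhere
    ceph osd out 10
    ceph osd out 11
    # equivalently: ceph osd crush reweight osd.10 0
    # watch the data move off and keep an eye on slow requests
    watch -n 10 'ceph -s'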

Cheers, Dan
Florian Haas
2014-09-18 21:19:34 UTC
On Thu, Sep 18, 2014 at 9:12 PM, Dan van der Ster
Post by Dan Van Der Ster
What I've been doing is I just continue draining my OSDs, two at a time. Each time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour it takes to drain) while a single PG re-trims, leading to ~100 slow requests. The OSD must still be responding to the peer pings, since other OSDs do not mark it down. Luckily this doesn't happen with every single movement of our pool 5 PGs, otherwise it would be a disaster like you said.

So just to clarify, what you're doing is out of the OSDs that are
spinning, you mark 2 out and wait for them to go empty?

What I'm seeing in my environment is that the OSDs *do* go down.
Marking them out seems not to help much as the problem then promptly
pops up elsewhere.

So, disaster is a pretty good description. Would anyone from the core
team like to suggest another course of action or workaround, or are
Dan and I generally on the right track to make the best out of a
pretty bad situation?

It would be helpful for others that bought into the "snapshots are
awesome, cheap and you can have as many as you want" mantra, so as to
perhaps not have their cluster blow up in their faces at some point.
Because right now, to me it seems that as you go past maybe a few
thousand snapshots and then at some point want to remove lots of them
at the same time, you'd better be scared. Happy to stand corrected,
though. :)

Cheers,
Florian
Sage Weil
2014-09-18 22:27:06 UTC
Post by Florian Haas
Hi Sage,
was the off-list reply intentional?
Whoops! Nope :)
Post by Florian Haas
So, disaster is a pretty good description. Would anyone from the core
team like to suggest another course of action or workaround, or are
Dan and I generally on the right track to make the best out of a
pretty bad situation?
The short term fix would probably be to just prevent backfill for the time
being until the bug is fixed.
As in, osd max backfills = 0?
Yeah :)
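Concretely, something like this at runtime (just a sketch; restore whatever value you normally run with afterwards):

    # stop new backfills on all OSDs for now
    ceph tell osd.* injectargs '--osd-max-backfills 0'
    # once the fix is in, restore your usual setting, e.g.
    ceph tell osd.* injectargs '--osd-max-backfills 10'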

Just managed to reproduce the problem...

sage
The root of the problem seems to be that it is trying to trim snaps that
aren't there. I'm trying to reproduce the issue now! Hopefully the fix
is simple...
http://tracker.ceph.com/issues/9487
Thanks!
sage
Thanks. :)
Cheers,
Florian
Florian Haas
2014-09-19 06:12:57 UTC
Post by Sage Weil
Just managed to reproduce the problem...
sage
Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks!

Cheers,
Florian
Dan Van Der Ster
2014-09-19 08:41:56 UTC
Post by Florian Haas
Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks!
Indeed :) Thanks Sage!
wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now…
Cheers, Dan
Dan van der Ster
2014-09-19 12:58:44 UTC
On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster
Post by Dan Van Der Ster
wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now…

Final update, after 4 hours in prod and after draining 8 OSDs -- zero
slow requests :)

Thanks again!

Dan
Sage Weil
2014-09-19 15:19:40 UTC
Post by Dan van der Ster
Final update, after 4 hours in prod and after draining 8 OSDs -- zero
slow requests :)
That's great news!

But, please be careful. This code hasn't been reviewed yet or been through
any testing! I would hold off on further backfills until it's merged.

Thanks!
sage
Dan van der Ster
2014-09-19 15:37:01 UTC
Post by Sage Weil
But, please be careful. This code hasn't been reviewed yet or been through
any testing! I would hold off on further backfills until it's merged.
Roger; I've been watching it very closely and so far it seems to work very well. Looking forward to that merge :)

Cheers, Dan
Post by Sage Weil
Thanks!
sage
Florian Haas
2014-09-21 13:33:09 UTC
Real field testing and proven workouts are better than any unit testing ... I
would follow Dan's notice of resolution because it is based on a real problem and
not a phony test ground.
That statement is almost an insult to the authors and maintainers of
the testing framework around Ceph. Therefore, I'm taking the liberty
to register my objection.

That said, I'm not sure that wip-9487-dumpling is the final fix to the
issue. On the system where I am seeing the issue, even with the fix
deployed, OSDs still not only go crazy snap trimming (which by itself
would be understandable, as the system has indeed recently had
thousands of snapshots removed), but they also still produce the
previously seen ENOENT messages indicating they're trying to trim
snaps that aren't there.

That system, however, has PGs marked as recovering, not backfilling as
in Dan's system. Not sure if wip-9487 falls short of fixing the issue
at its root. Sage, whenever you have time, would you mind commenting?

Cheers,
Florian
Dan van der Ster
2014-09-21 14:26:51 UTC
Hi Florian,
Post by Florian Haas
That said, I'm not sure that wip-9487-dumpling is the final fix to the
issue. On the system where I am seeing the issue, even with the fix
deployed, osd's still not only go crazy snap trimming (which by itself
would be understandable, as the system has indeed recently had
thousands of snapshots removed), but they also still produce the
previously seen ENOENT messages indicating they're trying to trim
snaps that aren't there.
You should be able to tell exactly how many snaps need to be trimmed. Check the current purged_snaps with

ceph pg x.y query

and also check the snap_trimq from debug_osd=10. The problem fixed in wip-9487 is the (mis)communication of purged_snaps to a new OSD. But if in your cluster purged_snaps is "correct" (which it should be after the fix from Sage), and it still has lots of snaps to trim, then I believe the only thing to do is let those snaps all get trimmed. (my other patch linked sometime earlier in this thread might help by breaking up all that trimming work into smaller pieces, but that was never tested).
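Concretely, something along these lines (the pg id and osd id are examples, and the exact layout of the query output differs between versions):

    # inspect the purged_snaps interval set for one PG
    ceph pg 5.3f query | grep -A 2 purged_snaps
    # raise the debug level so the snap_trimq shows up in the OSD log
    ceph tell osd.8 injectargs '--debug-osd 10'
    grep snap_trimq /var/log/ceph/ceph-osd.8.log | tail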

Entering the realm of speculation, I wonder if your OSDs are getting interrupted, marked down, out, or crashing before they have the opportunity to persist purged_snaps? purged_snaps is updated in ReplicatedPG::WaitingOnReplicas::react, but if the primary is too busy to actually send that transaction to its peers, then eventually it (or the new primary) needs to start again, and no progress is ever made. If this is what is happening on your cluster, then again, perhaps my osd_snap_trim_max patch could be a solution.

Cheers, Dan
Post by Florian Haas
That system, however, has PGs marked as recovering, not backfilling as
in Dan's system. Not sure if wip-9487 falls short of fixing the issue
at its root. Sage, whenever you have time, would you mind commenting?
Cheers,
Florian
Florian Haas
2014-09-21 15:27:53 UTC
On Sun, Sep 21, 2014 at 4:26 PM, Dan van der Ster
Hi Florian,
Post by Dan van der Ster
You should be able to tell exactly how many snaps need to be trimmed. Check the current purged_snaps with
ceph pg x.y query
and also check the snap_trimq from debug_osd=10. The problem fixed in wip-9487 is the (mis)communication of purged_snaps to a new OSD. But if in your cluster purged_snaps is "correct" (which it should be after the fix from Sage), and it still has lots of snaps to trim, then I believe the only thing to do is let those snaps all get trimmed. (my other patch linked sometime earlier in this thread might help by breaking up all that trimming work into smaller pieces, but that was never tested).
Yes, it does indeed look like the system does have thousands of
snapshots left to trim. That said, since the PGs are locked during
this time, this creates a situation where the cluster is becoming
unusable with no way for the user to recover.
Post by Dan van der Ster
Entering the realm of speculation, I wonder if your OSDs are getting interrupted, marked down, out, or crashing before they have the opportunity to persist purged_snaps?
Since the snap trimmer immediately jacks the affected OSDs up to 100%
CPU utilization, and they stop even responding to heartbeats, yes they
do get marked down and that makes the issue much worse. Even when
setting nodown, though, then that doesn't change the fact that the
affected OSDs just spin practically indefinitely.

So, even with the patch for 9487, which fixes *your* issue of the
cluster trying to trim tons of snaps when in fact it should be
trimming only a handful, the user is still in a world of pain when
they do indeed have tons of snaps to trim. And obviously, neither of
osd max backfills nor osd recovery max active help here, because even
a single backfill/recovery makes the OSD go nuts.

There is the silly option of setting osd_snap_trim_sleep to say 61
minutes, and restarting the ceph-osd daemons before the snap trim can
kick in, i.e. hourly, via a cron job. Of course, while this prevents
the OSD from going into a death spin, it only perpetuates the problem
until a patch for this issue is available, because snap trimming never
even runs, let alone completes.

This is particularly bad because a user can get themselves a
non-functional cluster simply by trying to delete thousands of
snapshots at once. If you consider a tiny virtualization cluster of
just 100 persistent VMs, out of which you take one snapshot an hour,
then deleting the snapshots taken in one month puts you well above
that limit. So we're not talking about outrageous numbers here. I
don't think anyone can fault any user for attempting this.

What makes the situation even worse is that there is no cluster-wide
limit to the number of snapshots, or even say snapshots per RBD
volume, or snapshots per PG, nor any limit on the number of snapshots
deleted concurrently.

So yes, I think your patch absolutely still has merit, as would any
means of reducing the number of snapshots an OSD will trim in one go.
As it is, the situation looks really really bad, specifically
considering that RBD and RADOS are meant to be super rock solid, as
opposed to say CephFS which is in an experimental state. And contrary
to CephFS snapshots, I can't recall any documentation saying that RBD
snapshots will break your system.

Cheers,
Florian
Sage Weil
2014-09-21 19:52:28 UTC
Post by Florian Haas
So yes, I think your patch absolutely still has merit, as would any
means of reducing the number of snapshots an OSD will trim in one go.
As it is, the situation looks really really bad, specifically
considering that RBD and RADOS are meant to be super rock solid, as
opposed to say CephFS which is in an experimental state. And contrary
to CephFS snapshots, I can't recall any documentation saying that RBD
snapshots will break your system.
Yeah, it sounds like a separate issue, and no, the limit is not
documented because it's definitely not the intended behavior. :)

...and I see you already have a log attached to #9503. Will take a look.

Thanks!
sage

Florian Haas
2014-09-23 13:20:01 UTC
Post by Florian Haas
I've already updated that issue in Redmine, but for the list archives
I should also add this here: Dan's patch for #9503, together with
Sage's for #9487, makes the problem go away in an instant. I've
already pointed out that I owe Dan dinner, and Sage, well I already
owe Sage pretty much lifelong full board. :)
Looks like I was a bit too eager: while the cluster is behaving nicely
with these patches while nothing happens to any OSDs, it does flag PGs
as incomplete when an OSD goes down. Once the mon osd down out
interval expires things seem to recover/backfill normally, but it's
still disturbing to see this in the interim.

I've updated http://tracker.ceph.com/issues/9503 with a pg query from
one of the affected PGs, within the mon osd down out interval, while
it was marked incomplete.

Dan or Sage, any ideas as to what might be causing this?

Cheers,
Florian
Gregory Farnum
2014-09-23 20:00:18 UTC
Permalink
Post by Florian Haas
Post by Florian Haas
Post by Sage Weil
Post by Florian Haas
So yes, I think your patch absolutely still has merit, as would any
means of reducing the number of snapshots an OSD will trim in one go.
As it is, the situation looks really really bad, specifically
considering that RBD and RADOS are meant to be super rock solid, as
opposed to say CephFS which is in an experimental state. And contrary
to CephFS snapshots, I can't recall any documentation saying that RBD
snapshots will break your system.
Yeah, it sounds like a separate issue, and no, the limit is not
documented because it's definitely not the intended behavior. :)
...and I see you already have a log attached to #9503. Will take a look.
I've already updated that issue in Redmine, but for the list archives
I should also add this here: Dan's patch for #9503, together with
Sage's for #9487, makes the problem go away in an instant. I've
already pointed out that I owe Dan dinner, and Sage, well I already
owe Sage pretty much lifelong full board. :)
Looks like I was a bit too eager: while the cluster is behaving nicely
with these patches while nothing happens to any OSDs, it does flag PGs
as incomplete when an OSD goes down. Once the mon osd down out
interval expires things seem to recover/backfill normally, but it's
still disturbing to see this in the interim.
I've updated http://tracker.ceph.com/issues/9503 with a pg query from
one of the affected PGs, within the mon osd down out interval, while
it was marked incomplete.
Dan or Sage, any ideas as to what might be causing this?
That *looks* like it's just because the pool has both size and
min_size set to 2?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Florian Haas
2014-10-16 09:04:51 UTC
Hi Greg,

sorry, this somehow got stuck in my drafts folder.
Post by Gregory Farnum
That *looks* like it's just because the pool has both size and
min_size set to 2?
Correct. But the documentation did not reflect that this is a
perfectly expected side effect of having min_size > 1.

pg-states.rst says:

*Incomplete*
Ceph detects that a placement group is missing a necessary period of history
from its log. If you see this state, report a bug, and try to start any
failed OSDs that may contain the needed information.

So if min_size > 1 and replicas < min_size, then the incomplete state
is not a bug but a perfectly expected occurrence, correct?
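(For comparison, checking and relaxing the setting on a pool looks roughly like this; "rbd" is just an example pool name, and whether min_size 1 is an acceptable trade-off is a separate question:)

    # check the current replication settings of the pool
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    # allow I/O to continue with a single remaining replica
    ceph osd pool set rbd min_size 1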

It's still a bit weird in that the PG seems to behave differently
depending on min_size. If min_size == 1 (default), then a PG with no
remaining replicas is stale, unless a replica failed first and the
primary was written to, after which it also failed, and the replica
then comes up and can't go primary because it now has outdated data,
in which case the PG goes "down". It never goes "incomplete".

So is the documentation wrong, or is there something fishy with the
reported state of the PGs?

Cheers,
Florian
Gregory Farnum
2014-10-16 13:54:05 UTC
Post by Florian Haas
So is the documentation wrong, or is there something fishy with the
reported state of the PGs?
I guess the documentation is wrong, although I thought we'd fixed that
particular one. :/ Giant actually distinguishes between these
conditions by adding an "undersized" state to the PG, so it'll be
easier to diagnose.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Sage Weil
2014-09-24 00:05:31 UTC
Sam and I discussed this on IRC and we think we have two simpler patches that
solve the problem more directly. See wip-9487. Queued for testing now.
Once that passes we can backport and test for firefly and dumpling too.

Note that this won't make the next dumpling or firefly point releases
(which are imminent). Should be in the next ones, though.

Upside is it looks like Sam found #9113 (snaptrimmer memory leak) at the
same time, yay!

sage

Florian Haas
2014-09-24 23:01:51 UTC
Post by Sage Weil
Sam and I discussed this on IRC and we think we have two simpler patches that
solve the problem more directly. See wip-9487.
So I understand this makes Dan's patch (and the config parameter that
it introduces) unnecessary, but is it correct to assume that just like
Dan's patch yours too will not be effective unless osd snap trim sleep > 0?
Post by Sage Weil
Queued for testing now.
Once that passes we can backport and test for firefly and dumpling too.
Note that this won't make the next dumpling or firefly point releases
(which are imminent). Should be in the next ones, though.
OK, just in case anyone else runs into problems after removing tons of
snapshots with <=0.67.11, what's the plan to get them going again
until 0.67.12 comes out? Install the autobuild package from the wip
branch?

Cheers,
Florian
Sage Weil
2014-09-21 19:41:03 UTC
Real field testing and proven workouts are better than any unit testing ... I
would follow Dan's notice of resolution because it is based on a real problem and
not a phony test ground.
It's been reviewed and looks right, but the rados torture tests are
pretty ... torturous, and this code is delicate. I would still wait.
Sage, apart from that problem, is there a solution to the ever-expanding replicas
problem?
Discard for the kernel RBD client should go upstream this cycle.

As for RADOS consuming more data when RBD blocks are overwritten, I still
have yet to see any actual evidence of this, and have a hard time seeing
how it could happen. A sequence of steps to reproduce would be the next
step.

sage
Dan Van Der Ster
2014-10-15 14:47:32 UTC
Hi Sage,
Post by Sage Weil
But, please be careful. This code hasn't been reviewed yet or been through
any testing! I would hold off on further backfills until it's merged.
Any news on those merges? It would be good to get this fixed on the dumpling and firefly branches. We're kind of stuck at the moment :(

Cheers, Dan


Samuel Just
2014-10-15 17:50:45 UTC
It's in giant; the firefly backport will happen once we are happy with
the fallout from the 80.7 thing.
-Sam

On Wed, Oct 15, 2014 at 7:47 AM, Dan Van Der Ster
Any news on those merges? It would be good to get this fixed on the dumpling and firefly branches. We're kind of stuck at the moment :(
Cheers, Dan
Dan van der Ster
2014-09-18 19:31:08 UTC
-- Dan van der Ster || Data & Storage Services || CERN IT Department --

September 18 2014 9:12 PM, "Dan van der Ster" <***@cern.ch> wrote:
Post by Dan van der Ster
What I've been doing is I just continue draining my OSDs, two at a time. Each time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour it takes to drain) while a single PG re-trims, leading to ~100 slow requests. Luckily this doesn't happen with every single movement of our pool 5 PGs, otherwise it would be a disaster like you said.
Two other more risky work-arounds that I didn't try yet are:

1. lower the osd_snap_trim_thread_timeout from 3600s to something like 10 or 20s, so that these long trim operations are just killed. I have no idea if this is safe.
2. pay close attention to the slow requests and manually mark the affected OSDs down when they become blocked. By marking the trimming OSD down, the IOs should go elsewhere until the OSD can recover once again later. But I don't know how the backfilling OSD will behave if it is manually marked down while trimming.
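In command form those would be roughly the following (untested; the osd id is an example, and if the timeout can't be injected at runtime it would have to go into ceph.conf before a restart):

    # work-around 1: shorten the snap trim thread timeout from its 3600s default
    ceph tell osd.* injectargs '--osd-snap-trim-thread-timeout 20'
    # work-around 2: manually mark a blocked, trimming OSD down so client IO
    # fails over to the other replicas while it catches up
    ceph osd down 12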

Cheers, Dan