Dan Van Der Ster
2014-09-18 12:50:44 UTC
(moving this discussion to -devel)
We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep.
To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516
Breaking out of the trimmer like that should allow IOs to the trimming PG to get through.
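To make that concrete, here is a rough, self-contained sketch of the intended behavior. To be clear, this is a toy model and not the actual ReplicatedPG/SnapTrimmer code: only osd_snap_trim_sleep corresponds to a real option, and max_trims_per_pass is just a made-up stand-in for the limit in the pull request.
// Toy model: sleep between individual snap trims, and cap the number of
// trims per pass so the PG lock is dropped and queued client ops can run.
#include <chrono>
#include <cstdio>
#include <mutex>
#include <set>
#include <thread>

struct ToyPG {
  std::mutex lock;                  // stands in for the PG lock
  std::set<unsigned> snap_trimq;    // snaps still to trim
  std::set<unsigned> purged_snaps;  // snaps already trimmed
};

// Illustrative knobs (names are not the real config options, except that
// osd_snap_trim_sleep exists; max_trims_per_pass is hypothetical).
const double snap_trim_sleep_secs = 0.01;
const unsigned max_trims_per_pass = 16;

// One pass of the trimmer: trim at most max_trims_per_pass snaps, sleeping
// between individual snaps.  The PG stays locked for the whole pass (the
// downside noted above); returning early is what lets queued client IO on
// this PG run again before the next pass.
bool trim_pass(ToyPG &pg) {
  std::lock_guard<std::mutex> l(pg.lock);
  for (unsigned n = 0; n < max_trims_per_pass; ++n) {
    if (pg.snap_trimq.empty())
      return false;                 // nothing left; no need to requeue
    unsigned snap = *pg.snap_trimq.begin();
    pg.snap_trimq.erase(pg.snap_trimq.begin());
    pg.purged_snaps.insert(snap);
    std::printf("trimmed snap %u\n", snap);
    std::this_thread::sleep_for(
        std::chrono::duration<double>(snap_trim_sleep_secs));
  }
  return true;                      // budget used up; requeue for another pass
}

int main() {
  ToyPG pg;
  for (unsigned s = 1; s <= 100; ++s)
    pg.snap_trimq.insert(s);
  while (trim_pass(pg)) {
    // In the OSD this would be a requeue of the trim work, during which
    // other ops on the PG can take the lock; here we just loop.
  }
  return 0;
}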
The second aspect of this issue is why the purged_snaps are being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all of those snapshots, then use crush reweight to move the PGs around. With debug_osd >= 10 you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests, the number of snaps purged needs to be O(10000).
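For reference, the snapshot part of that reproducer as a small librados program (just a sketch: the pool name, snap names and the 10000 count are arbitrary, error checking is omitted, and the crush reweight step is still done from the CLI afterwards).
// Create a throwaway pool, create many pool snaps, then remove them all.
// Build with something like: g++ -std=c++11 repro.cc -lrados
#include <rados/librados.hpp>
#include <cstdio>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init(NULL);                  // connect as client.admin
  cluster.conf_read_file(NULL);        // default ceph.conf search path
  cluster.connect();

  const char *pool = "snaptrim-test";  // arbitrary pool name
  cluster.pool_create(pool);           // a nearly empty pool is enough

  librados::IoCtx io;
  cluster.ioctx_create(pool, io);

  const int nsnaps = 10000;            // slow requests need O(10000) snaps
  for (int i = 0; i < nsnaps; ++i) {
    std::string name = "snap-" + std::to_string(i);
    io.snap_create(name.c_str());      // pool snapshot, like "rados mksnap"
  }
  for (int i = 0; i < nsnaps; ++i) {
    std::string name = "snap-" + std::to_string(i);
    io.snap_remove(name.c_str());      // like "rados rmsnap"
  }
  std::printf("created and removed %d pool snaps on %s\n", nsnaps, pool);

  io.close();
  cluster.shutdown();
  // Now move the PGs around (e.g. "ceph osd crush reweight ..."), and watch
  // the OSD log with debug_osd >= 10 for "adding snap ... to purged_snaps".
  return 0;
}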
Looking forward to any ideas someone might have.
Cheers, Dan
Date: 17 Sep 2014 18:02:09 CEST
Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
On Wed, Sep 17, 2014 at 5:42 PM, Dan Van Der Ster wrote:
I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap: the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway, since you’d at most need to trim O(100) PGs.
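A toy illustration of that bookkeeping, using plain std::set rather than Ceph’s interval_set; this is only a model of the description above, not the OSD code.
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <set>

// to_trim() returns the snaps that are in snap_trimq but not yet in
// purged_snaps -- normally a single snap, per the description above.
std::set<unsigned> to_trim(const std::set<unsigned> &snap_trimq,
                           const std::set<unsigned> &purged_snaps) {
  std::set<unsigned> out;
  std::set_difference(snap_trimq.begin(), snap_trimq.end(),
                      purged_snaps.begin(), purged_snaps.end(),
                      std::inserter(out, out.begin()));
  return out;
}

int main() {
  std::set<unsigned> trimq, purged;
  for (unsigned s = 1; s <= 10000; ++s)
    trimq.insert(s);
  purged = trimq;
  purged.erase(10000u);  // one freshly removed snap has not been trimmed yet

  // Normal case: exactly one snap to trim.
  std::printf("normal: %zu snap(s) to trim\n", to_trim(trimq, purged).size());
  // If purged_snaps is lost (e.g. after a backfill), the same computation
  // suddenly yields thousands of snaps to re-trim.
  std::printf("purged_snaps lost: %zu snaps to trim\n",
              to_trim(trimq, std::set<unsigned>()).size());
  return 0;
}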
Sent: Sep 17, 2014 5:33 PM
To: Dan Van Der Ster
Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
That sounds promising, thank you! debug_osd=10 should actually be sufficient as those snap_trim messages get logged at that level. :)
Do I understand your issue report correctly in that you have found setting osd_snap_trim_sleep to be ineffective, because it's being applied when iterating from PG to PG, rather than from snap to snap? If so, then I'm guessing that that can hardly be intentional…
On Wed, Sep 17, 2014 at 5:24 PM, Dan Van Der Ster wrote:
Hi Florian,
Could your ticket be related to the snap trimming issue I’ve finally narrowed down in the past couple of days?
http://tracker.ceph.com/issues/9487
Bump up debug_osd to 20, then check the log during one of your incidents. If it is busy logging the snap_trimmer messages, then it’s the same issue.
(The issue is that rbd pools have many purged_snaps, but sometimes after backfilling a PG the purged_snaps list is lost, and thus the snap trimmer becomes very busy whilst re-trimming thousands of snaps. During that time (a few minutes on my cluster) the OSD is blocked.)
Hi Craig,
just dug this up in the list archives.
In the interest of removing variables, I removed all snapshots on all pools, then restarted all ceph daemons at the same time. This brought up osd.8 as well.
So just to summarize this: your 100% CPU problem at the time went away after you removed all snapshots, and the actual cause of the issue was never found?
I am seeing a similar issue now, and have filed http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost again. Can you take a look at that issue and let me know if anything in the description sounds familiar?