Sage Weil
2014-10-16 19:52:03 UTC
There are lots of scenarios in a large cluster where an admin mistake or
misconfiguration can cause a bajillion PGs to pile up on one OSD. This
causes various problems, most of which are difficult to diagnose and get
out of because things are so heavily loaded.
We need to have a way to prevent people/clusters from shooting themselves
in the foot in this particular way. Here's a rough proposal:
- config option 'osd max pgs = 500' (or something similar)
- if the osd is at or above pg_max (pg_map.size()), it will silently drop
any pg peering messages or requests for pgs it doesn't already have
- if the osd reaches pg_max, it will set a bit or flag in the osd_stat_t
it reports to the monitor.
- when the osd drops below pg_max, it will clear that bit
...and the tricky part...
- when the mon sees that osd bit clear, it will do something to the
osdmap and issue a new epoch. That something will either trigger an
interval change for the osd or otherwise induce any unpeered pgs
including that osd to restart peering or resend messages to that osd.
I'm not sure we want to simply trigger an interval change, as that will
restart peering on already-peered PGs at a time when the OSD is under
load. A more targeted way to kick peering (and resend the potentially
dropped messages) just for unpeered PGs would be ideal.
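The targeted mon-side reaction might look something like the sketch
below: when the OSD's over-limit flag clears, bump the map epoch and
kick peering only for PGs that are not yet peered. Again, MonSketch,
on_osd_stat, and PGState are made-up names for illustration:

```cpp
#include <vector>

// Minimal stand-in for per-PG peering state.
struct PGState {
  bool peered;
  int peering_restarts = 0;
};

// Hypothetical sketch of the mon reacting to the osd_stat_t flag.
struct MonSketch {
  int osdmap_epoch = 1;
  bool last_flag = true; // last reported over-limit bit for this osd

  // Called when the OSD reports a new osd_stat_t flag value.
  void on_osd_stat(bool over_limit, std::vector<PGState> &pgs) {
    if (last_flag && !over_limit) {
      ++osdmap_epoch; // issue a new epoch
      for (auto &pg : pgs)
        if (!pg.peered)
          ++pg.peering_restarts; // targeted kick: unpeered PGs only
    }
    last_flag = over_limit;
  }
};
```

The point of the targeted loop is exactly the concern above: peered PGs
are left alone, so the newly-unloaded OSD isn't hit with a full
interval change.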
Eventually I'd like to see us include this in the thrashing tests by
setting pg_max to something reasonably low (2x or 3x the
target/average pgs per osd) and making one of the thrashing operations
skew the crush weights, ideally in a way that forces an overloaded OSD
to drain previous PGs before it can accept new ones.
Thoughts?
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html