Firefly maintenance release schedule

Discussion:

Dmitry Borodaenko

2014-10-15 16:39:49 UTC

On Tue, Sep 30, 2014 at 6:49 PM, Dmitry Borodaenko wrote:

Last stable Firefly release (v0.80.5) was tagged on July 29 (over 2
months ago). Since then, there were twice as many commits merged into
$ git log --oneline --no-merges v0.80..v0.80.5|wc -l
122
$ git log --oneline --no-merges v0.80.5..firefly|wc -l
227
Is this a one time aberration in the process or should we expect the
gap between maintenance updates for LTS releases of Ceph to keep
growing?

I didn't get a response to that nag other than the v0.80.6 release
announcement on the day after, so I guess it wasn't completely ignored
:)

Except it turned out v0.80.6 was slightly less than useful as a
maintenance release:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-October/043701.html

Two weeks later we have v0.80.7 with 3 more commits that hopefully
make it actually usable. There are many ways to look at that from
a release management perspective.

Good: 2 weeks is much better than 2 months.
Bad: that's 2.5 months since the last *stable* Firefly release.
Ugly: that's 2 weeks for 3 commits, and now we have 54 more waiting
for the next release...
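
The gaps in this list can be checked with a little date arithmetic. The sketch below is illustrative only: the v0.80.5 date is the one stated in this thread, while the v0.80.6 date (the day after the Sep 30 message) and the v0.80.7 date (taken to be the date of this message) are assumptions inferred from the thread, not read from the git tags.

```python
from datetime import date

# Release dates. Only v0.80.5 is stated outright in this thread; the other two
# are assumptions inferred from the message dates, not read from the git tags.
RELEASES = {
    "v0.80.5": date(2014, 7, 29),
    "v0.80.6": date(2014, 10, 1),   # assumed: day after the Sep 30 message
    "v0.80.7": date(2014, 10, 15),  # assumed: date of this message
}


def gap_days(earlier: str, later: str) -> int:
    """Number of days between two releases listed in RELEASES."""
    return (RELEASES[later] - RELEASES[earlier]).days


if __name__ == "__main__":
    pairs = [("v0.80.5", "v0.80.6"),
             ("v0.80.6", "v0.80.7"),
             ("v0.80.5", "v0.80.7")]
    for earlier, later in pairs:
        days = gap_days(earlier, later)
        # 30.44 is the average month length in days; the month figure is approximate.
        print(f"{earlier} -> {later}: {days} days (~{days / 30.44:.1f} months)")
```

Under these assumed dates, the span from v0.80.5 to v0.80.7 comes to 78 days, which matches the "2.5 months" figure.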

Wait what?! Oh right, 54 more commits were merged from firefly-next as
soon as v0.80.7 was tagged:
$ git log --oneline --no-merges v0.80.7..firefly|wc -l
54
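
The same count can be scripted when several ranges need comparing. This is a minimal sketch, assuming a local clone of the Ceph repository and git on the PATH; the helper names `log_command`, `count_lines` and `count_commits` are made up for illustration.

```python
import subprocess
from typing import List


def log_command(rev_range: str) -> List[str]:
    """Build the git log invocation used in this message for a revision range."""
    return ["git", "log", "--oneline", "--no-merges", rev_range]


def count_lines(output: str) -> int:
    """Count non-empty lines, roughly what piping the log through `wc -l` does."""
    return sum(1 for line in output.splitlines() if line.strip())


def count_commits(rev_range: str, repo: str = ".") -> int:
    """Count non-merge commits in rev_range inside the clone at `repo`."""
    result = subprocess.run(
        log_command(rev_range),
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. on an unknown tag or branch
    )
    return count_lines(result.stdout)
```

Calling `count_commits("v0.80.7..firefly")` from inside a Ceph checkout should reproduce the count above, provided the firefly branch has not moved since.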

Some of these are fixes for Urgent priority bugs, crashes, and data loss:
http://tracker.ceph.com/issues/9492
http://tracker.ceph.com/issues/9039
http://tracker.ceph.com/issues/9582
http://tracker.ceph.com/issues/9307
etc.

So what is a Ceph deployer supposed to do with this? Wait another couple
of weeks (hopefully) for v0.80.8? Take v0.80.7 and hope not to
encounter any of these bugs? Or label Firefly as "not production ready
yet" and go back to Dumpling? My personal preference obviously would
be the first option, but waiting for 2.5 more months is not going to
fit my schedule :(

--
Dmitry Borodaenko
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Gregory Farnum

2014-10-15 16:59:03 UTC

Take v0.80.7. All of the bugs you've cited, you are supremely unlikely
to run into. The "Urgent" tag is a measure of planning priority, not
of impact to users; here it generally means "we found a bug on a
stable branch that we can reproduce". Taking them in order:
http://tracker.ceph.com/issues/9492: only happens if you try and cheat
with your CRUSH rules, and obviously nobody did that until Sage
suggested it as a solution to the problem somebody had 29 days ago
when this was discovered.
http://tracker.ceph.com/issues/9039: The most serious here, but only
happens if you're using RGW, and storing user data in multiple pools,
and issue a COPY command to copy data between different pools.
http://tracker.ceph.com/issues/9582: Only happens if you're using the
op timeout feature of librados with the C bindings OR the op timeout
feature *and* the user-provided buffers in the C++ interface. (To the
best of my knowledge, the people who discovered this are the only ones
using op timeouts.)
http://tracker.ceph.com/issues/9307: I'm actually not sure what's
going on here; looks like some kind of extremely rare race when
authorizing requests? (i.e., fixed by a retry)

We messed up the v0.80.6 release in a very specific way (and if you
were deploying a new cluster it wasn't a problem), but you're
extrapolating too much from the presence of patches about what their
impact is and what the system's stability is. These are largely
cleaning up rough edges around user interfaces, and smoothing out
issues in the new functionality that a standard deployment isn't going
to experience. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

Dmitry Borodaenko

2014-10-15 21:45:04 UTC

Gregory,

Thanks for the prompt response, we'll go with v0.80.7.

It would still be nice if v0.80.8 doesn't take as long as v0.80.6. I
suspect one of the reasons you messed it up was too many commits
without intermediate releases.
