RHEL 6.5 shared library upgrade safety

Discussion:

Loic Dachary

2014-08-18 11:57:16 UTC

Hi Ceph,

In RHEL 6.5, is the following scenario possible:

a) an OSD dlopens a shared library for erasure-code,
b) the shared library file is replaced while the OSD is running,
c) the OSD starts using the new file instead of the old one.

It seems unlikely but it would explain a weird stack trace at http://tracker.ceph.com/issues/9153#note-5 so I'm double checking ;-)

Cheers

--
Loïc Dachary, Artisan Logiciel Libre

Wido den Hollander

2014-08-18 12:11:00 UTC

Post by Loic Dachary
Hi Ceph,
a) an OSD dlopens a shared library for erasure-code,
b) the shared library file is replaced while the OSD is running,
c) the OSD starts using the new file instead of the old one.
It seems unlikely but it would explain a weird stack trace at http://tracker.ceph.com/issues/9153#note-5 so I'm double checking ;-)

Well, it could be that it does so. I'm not 100% sure, but afaik it could
happen that when you replace a library certain parts might not be in memory.

See:
http://stackoverflow.com/questions/7767325/replacing-shared-object-so-file-while-main-program-is-running
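
Whether step (c) can happen depends partly on how the file is replaced. The sketch below is only an analogy, written in Python rather than with dlopen: an open file handle stands in for the mapping that dlopen creates, and the file name libplugin.so is made up. It shows that a rename-style replacement leaves the old contents reachable through the old handle, while an in-place overwrite changes what the old handle sees.

```python
import os
import tempfile


def read_via_open_handle(replace_atomically: bool) -> bytes:
    """Open a file, change it on disk, then read through the old handle."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "libplugin.so")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"OLD")

        # Stands in for the mapping a process gets from dlopen (assumption:
        # this is an analogy, not a real shared-library load).
        handle = open(path, "rb")
        try:
            if replace_atomically:
                # Write a new file and rename it over the old path. The path
                # now names a new inode; the old inode stays alive for as
                # long as something still holds it open.
                new_path = path + ".new"
                with open(new_path, "wb") as f:
                    f.write(b"NEW")
                os.replace(new_path, path)
            else:
                # Overwrite the existing inode in place.
                with open(path, "r+b") as f:
                    f.write(b"NEW")
            return handle.read()
        finally:
            handle.close()
```

On Linux, read_via_open_handle(True) returns the old bytes and read_via_open_handle(False) returns the new ones. Neither case protects a daemon that has not yet opened the library: it will open whatever file is at the path when it first needs it.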

--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Loic Dachary

2014-08-18 14:06:31 UTC

Hi Wido,

Post by Wido den Hollander
Well, it could be that it does so. I'm not 100% sure, but afaik it could happen that when you replace a library certain parts might not be in memory.
See: http://stackoverflow.com/questions/7767325/replacing-shared-object-so-file-while-main-program-is-running

As it turns out, the problem is simpler, but I still have no clue how it can happen.

http://tracker.ceph.com/issues/9153 shows

 ceph version 0.80.5-164-gcc4e625 (cc4e6258d67fb16d4a92c25078a0822a9849cd77)
 1: ceph-osd() [0x9b58c1]
 2: (()+0xf710) [0x7f06a3e24710]
 3: (memcpy()+0x15b) [0x7f06a2d4daab]
 4: (jerasure_matrix_dotprod()+0xc8) [0x7f067fd11618]
 5: (jerasure_matrix_encode()+0x75) [0x7f067fd11865]
 6: (ErasureCodeJerasureReedSolomonVandermonde::jerasure_encode(char**, char**, int)+0x21) [0x7f067fd294b1]
 7: (ErasureCodeJerasure::encode_chunks(std::set<int, std::less<int>, std::allocator<int> > const&, std::map<int, ceph::buffer::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list> > >*)+0x607) [0x7f067fd2a807]

This means a Firefly ceph-osd crashed while trying to use a jerasure plugin coming from master. That is no surprise: the plugin API is incompatible, even though the data encoding is compatible.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre

Sage Weil

2014-08-18 15:17:35 UTC

Post by Loic Dachary
Hi Ceph,
a) an OSD dlopens a shared library for erasure-code,
b) the shared library file is replaced while the OSD is running,
c) the OSD starts using the new file instead of the old one.
It seems unlikely but it would explain a weird stack trace at
http://tracker.ceph.com/issues/9153#note-5 so I'm double checking ;-)

I think this is possible and likely. We had similar problems with the
rados classes and eventually just made them load all available plugins on
startup (and also on demand in case one is installed later).

The simplest thing is probably to do that here as well...
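
A minimal sketch of the load-everything-at-startup idea, written in Python with ctypes rather than in Ceph's C++. The function name preload_plugins and the libec_*.so file pattern are assumptions made for illustration; this is not the rados class loader or any actual Ceph code.

```python
import ctypes
import glob
import os


def preload_plugins(directory, pattern="libec_*.so"):
    """dlopen every matching library now, instead of lazily on first use.

    Returns (loaded, failed): loaded maps file name to the ctypes handle,
    failed maps file name to the loader's error message.
    """
    loaded, failed = {}, {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        name = os.path.basename(path)
        try:
            # ctypes.CDLL calls dlopen() on POSIX systems. Keeping the
            # handle referenced keeps the library loaded in this process.
            loaded[name] = ctypes.CDLL(path)
        except OSError as exc:
            # A file that cannot be loaded is recorded, not fatal.
            failed[name] = str(exc)
    return loaded, failed
```

Calling this once at daemon start means a later package upgrade cannot hand a new plugin to an old process on first use. The same function could be called again on demand to pick up a plugin installed later.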

sage

Loic Dachary

2014-08-18 15:25:50 UTC

This will not solve the upgrade problem for Firefly daemons that are already running, unfortunately. Stopping the daemons while the package is being upgraded seems safer and more generic (see the other thread). Or are there issues with this approach?

Cheers

--
Loïc Dachary, Artisan Logiciel Libre

Sage Weil

2014-08-18 15:32:07 UTC

Post by Loic Dachary
This will not solve the upgrade problem for Firefly daemons which are
already running, unfortunately. Stopping the daemons while the
package is being upgraded seems safer and more generic (see the other
thread). Or are there issues with this approach ?

Operationally it is not something people want to do. Usually admins
upgrade and then do the restarts in a controlled way. At least, that's
what I've heard anecdotally.

FWIW the crash is also something that testing turns up but is unlikely to
happen in production. In testing, the workload is just starting when we
start upgrading so the plugins haven't always loaded. In production, it
is unlikely that a user will be *just* starting to use the EC features
right as they are also doing an upgrade. Unless they forgot to restart
daemons...
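
For the "forgot to restart" case, one possible check on Linux is to look for mapped files marked "(deleted)" in /proc/<pid>/maps, which is how the kernel labels a mapping whose file has since been unlinked or replaced. The sketch below is an illustration under that assumption, not a tool mentioned in this thread; the function name stale_mappings and the paths in SAMPLE are made up.

```python
def stale_mappings(maps_text):
    """Return paths of mapped files that were deleted or replaced on disk."""
    suffix = " (deleted)"
    stale = []
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].endswith(suffix):
            path = fields[5][: -len(suffix)]
            if path not in stale:
                stale.append(path)
    return stale


# Made-up excerpt in the /proc/<pid>/maps format, for illustration only.
SAMPLE = """\
00400000-00a00000 r-xp 00000000 fd:01 1001 /usr/bin/ceph-osd (deleted)
7f06a2d00000-7f06a2e00000 r-xp 00000000 fd:01 2002 /lib64/libc-2.12.so
7f067fd00000-7f067fd40000 r-xp 00000000 fd:01 3003 /usr/lib64/ceph/erasure-code/libec_jerasure.so (deleted)
7f067fd40000-7f067fd41000 rw-p 00040000 fd:01 3003 /usr/lib64/ceph/erasure-code/libec_jerasure.so (deleted)
"""

print(stale_mappings(SAMPLE))
```

For a live process, the input would be the contents of /proc/<pid>/maps. A daemon that reports any stale mapping was started before the files were replaced and is a candidate for a controlled restart.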

sage