Discussion:
Ceph daemon memory utilization: 'heap release' drops use by 50%
David McBride
2014-04-14 12:28:55 UTC
Hello,

I'm currently experimenting with a Ceph deployment, and am noting that
some of my machines are having processes killed by the OOM killer,
despite provisioning 32GB for a 12-OSD machine.

(This tended to correlate with reshaping the cluster, which is not
surprising given that OSD memory utilization is documented to spike when
recovery operations are in progress.)

While the recently-added zRAM kernel facility appears to be helping
somewhat in stretching the available resources, I've been reviewing the
heap utilization statistics displayed via `ceph tell osd.$i heap stats`.
osd.0 tcmalloc heap stats:------------------------------------------------
MALLOC:       593850280 (  566.3 MiB) Bytes in use by application
MALLOC: +    1621073920 ( 1546.0 MiB) Bytes in page heap freelist
MALLOC: +     117159712 (  111.7 MiB) Bytes in central cache freelist
MALLOC: +       2987008 (    2.8 MiB) Bytes in transfer cache freelist
MALLOC: +      84780344 (   80.9 MiB) Bytes in thread cache freelists
MALLOC: +      13119640 (   12.5 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =    2432970904 ( 2320.3 MiB) Actual memory used (physical + swap)
MALLOC: +      44449792 (   42.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =    2477420696 ( 2362.7 MiB) Virtual address space used
MALLOC:           60887              Spans in use
MALLOC:             775              Thread heaps in use
MALLOC:            8192              Tcmalloc page size
------------------------------------------------
I noticed there's a huge amount of memory (1.5 GB) on the main
freelist. As an experiment, I ran `ceph tell osd.$i heap release`, and
got:
osd.0 tcmalloc heap stats:------------------------------------------------
MALLOC:       581434648 (  554.5 MiB) Bytes in use by application
MALLOC: +      11509760 (   11.0 MiB) Bytes in page heap freelist
MALLOC: +     105904144 (  101.0 MiB) Bytes in central cache freelist
MALLOC: +       2070848 (    2.0 MiB) Bytes in transfer cache freelist
MALLOC: +      97882520 (   93.3 MiB) Bytes in thread cache freelists
MALLOC: +      13119640 (   12.5 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =     811921560 (  774.3 MiB) Actual memory used (physical + swap)
MALLOC: +    1665499136 ( 1588.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =    2477420696 ( 2362.7 MiB) Virtual address space used
MALLOC:           60733              Spans in use
MALLOC:             803              Thread heaps in use
MALLOC:            8192              Tcmalloc page size
------------------------------------------------
This was consistent across all 12 OSDs; running this command on every
OSD on a machine dropped memory utilization by ~15GB, or roughly half
the RAM in my machine.
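
For anyone wanting to reproduce this, the per-host sweep is just a loop
over the local OSD IDs; a minimal sketch, assuming those IDs are 0-11
(substitute your own):

  # Ask tcmalloc in each locally-hosted OSD to return its page-heap
  # freelist to the OS. OSD IDs 0-11 are an assumption; adjust to match
  # your own host's layout.
  for i in $(seq 0 11); do
      ceph tell osd.$i heap release
  done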

Is this expected behaviour? Would it be prudent to treat this as the
amount of memory the Ceph OSDs genuinely require at peak demand?
(If so, that indicates that I need to be looking to increase the spec of
my storage nodes...)
For comparison, here are the equivalent stats for one of my monitors,
before and after a `heap release`:

mon.ceph-sm000 tcmalloc heap stats:------------------------------------------------
MALLOC:       599497240 (  571.7 MiB) Bytes in use by application
MALLOC: +     806297600 (  768.9 MiB) Bytes in page heap freelist
MALLOC: +      32448368 (   30.9 MiB) Bytes in central cache freelist
MALLOC: +       1684080 (    1.6 MiB) Bytes in transfer cache freelist
MALLOC: +      23270408 (   22.2 MiB) Bytes in thread cache freelists
MALLOC: +       5091480 (    4.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =    1468289176 ( 1400.3 MiB) Actual memory used (physical + swap)
MALLOC: +      30859264 (   29.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =    1499148440 ( 1429.7 MiB) Virtual address space used
MALLOC:           18309              Spans in use
MALLOC:             122              Thread heaps in use
MALLOC:            8192              Tcmalloc page size
------------------------------------------------
mon.ceph-sm000 tcmalloc heap stats:------------------------------------------------
MALLOC:       600108520 (  572.3 MiB) Bytes in use by application
MALLOC: +      17342464 (   16.5 MiB) Bytes in page heap freelist
MALLOC: +      32392208 (   30.9 MiB) Bytes in central cache freelist
MALLOC: +        964240 (    0.9 MiB) Bytes in transfer cache freelist
MALLOC: +      23402360 (   22.3 MiB) Bytes in thread cache freelists
MALLOC: +       5091480 (    4.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =     679301272 (  647.8 MiB) Actual memory used (physical + swap)
MALLOC: +     819847168 (  781.9 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =    1499148440 ( 1429.7 MiB) Virtual address space used
MALLOC:           16396              Spans in use
MALLOC:             122              Thread heaps in use
MALLOC:            8192              Tcmalloc page size
------------------------------------------------
The tcmalloc documentation suggests that memory should be gradually
returned to the operating system:

http://gperftools.googlecode.com/svn/trunk/doc/tcmalloc.html#runtime

Given these OSDs were largely idle over the weekend prior to running
this experiment, it seems clear that this process is not operating as
designed.

I've looked through the environment of my running processes and the Ceph
source, and can see no reference to TCMALLOC_RELEASE_RATE or
SetMemoryReleaseRate().
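
For completeness, this is roughly how I checked the live environment; a
sketch, assuming that sampling one ceph-osd process is representative:

  # /proc/<pid>/environ is NUL-delimited, hence the tr; grep for any
  # tcmalloc tuning knobs that might already be set.
  pid=$(pidof ceph-osd | awk '{print $1}')
  tr '\0' '\n' < /proc/$pid/environ | grep -i tcmalloc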

I'm currently running an experiment whereby I define
"env TCMALLOC_RELEASE_RATE=10" in
/etc/init/ceph-{osd,mon}.conf.override; I'll see if this has any impact
on memory usage over time.
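
For reference, a sketch of the change as a standard upstart override;
note that stock upstart names override files <job>.override, so the
exact filename may differ from the path above, and the restart syntax
assumes Ceph's usual upstart instance jobs:

  # Add the env stanza on top of the packaged job definitions.
  echo 'env TCMALLOC_RELEASE_RATE=10' > /etc/init/ceph-osd.override
  echo 'env TCMALLOC_RELEASE_RATE=10' > /etc/init/ceph-mon.override
  # Daemons only see the variable after being restarted, e.g.:
  restart ceph-osd id=0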

(I suspect that my current Ceph cluster placement-group count is
excessive; with 144 OSDs, I'm running about a dozen pools, each with
~8000 PGs. It's not clear how the guidelines for PG sizing should be
adjusted for multiple-pool configurations; at some point I'll see what
effect wiping my cluster and using a much smaller per-pool PG count
has. See the back-of-envelope numbers below.)
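
A quick check against the commonly cited heuristic, assuming a uniform
3x replication across all pools (which may not match every pool here):

  # Rule of thumb: total PGs across ALL pools ~= (OSDs * 100) / replica
  # count, rounded to a power of two.
  echo $(( 144 * 100 / 3 ))   # => 4800 suggested in total
  echo $(( 12 * 8000 ))       # => 96000 actually configured, ~20x over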

Cheers,
David
-- 
David McBride <***@cam.ac.uk>
Unix Specialist, University Information Services
Gregory Farnum
2014-04-14 13:53:43 UTC
What distro are you running on?
-Greg
David McBride
2014-04-14 14:04:44 UTC
Hi Greg,

This is sitting atop a pre-release build of Ubuntu 14.04, using the
packages provided by same.

dpkg-query -W ceph shows:

ceph 0.79-0ubuntu1

ceph --version shows:

ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e)

Note that I'm currently half-way through nuking my existing cluster (to
reduce my PG count, and experiment with different OSD filesystems); I'm
happy to run experiments, but there'll be a small lag while I bring the
cluster back up!

Cheers,
David
--
David McBride <***@cam.ac.uk>
Unix Specialist, University Information Services
Gregory Farnum
2014-04-14 14:10:58 UTC
Hum. We see scattered reports of this occasionally (although it seems
to clump), but usually it's on self-built distros. It's not behavior
we've encountered on any regular basis, and it's not expected. If
you're getting all your packages from Canonical, you should probably
report the bug to them as well; it might be a tcmalloc issue they can
resolve in their repo.
-Greg
David McBride
2014-04-14 14:14:54 UTC
Hi,

That's useful information. I'll investigate further and flag it
upstream if it looks like a repeatable problem.

Thanks!

Cheers,
David
-- 
David McBride <***@cam.ac.uk>
Unix Specialist, University Information Services