Discussion:
A problem when restarting OSD
Wang, Zhiqiang
2014-08-21 07:19:47 UTC
Hi all,

I ran into a problem when restarting an OSD.

Here is my OSD tree before restarting the OSD:

# id weight type name up/down reweight
-6 8 root ssd
-4 4 host zqw-s1-ssd
16 1 osd.16 up 1
17 1 osd.17 up 1
18 1 osd.18 up 1
19 1 osd.19 up 1
-5 4 host zqw-s2-ssd
20 1 osd.20 up 1
21 1 osd.21 up 1
22 1 osd.22 up 1
23 1 osd.23 up 1
-1 14.56 root default
-2 7.28 host zqw-s1
0 0.91 osd.0 up 1
1 0.91 osd.1 up 1
2 0.91 osd.2 up 1
3 0.91 osd.3 up 1
4 0.91 osd.4 up 1
5 0.91 osd.5 up 1
6 0.91 osd.6 up 1
7 0.91 osd.7 up 1
-3 7.28 host zqw-s2
8 0.91 osd.8 up 1
9 0.91 osd.9 up 1
10 0.91 osd.10 up 1
11 0.91 osd.11 up 1
12 0.91 osd.12 up 1
13 0.91 osd.13 up 1
14 0.91 osd.14 up 1
15 0.91 osd.15 up 1

After I restart one of the OSDs with an id from 16 to 23, say osd.16, it moves to 'root default' under 'host zqw-s1', and the ceph cluster begins to rebalance. This is surely not what I want.

# id weight type name up/down reweight
-6 7 root ssd
-4 3 host zqw-s1-ssd
17 1 osd.17 up 1
18 1 osd.18 up 1
19 1 osd.19 up 1
-5 4 host zqw-s2-ssd
20 1 osd.20 up 1
21 1 osd.21 up 1
22 1 osd.22 up 1
23 1 osd.23 up 1
-1 15.56 root default
-2 8.28 host zqw-s1
0 0.91 osd.0 up 1
1 0.91 osd.1 up 1
2 0.91 osd.2 up 1
3 0.91 osd.3 up 1
4 0.91 osd.4 up 1
5 0.91 osd.5 up 1
6 0.91 osd.6 up 1
7 0.91 osd.7 up 1
16 1 osd.16 up 1
-3 7.28 host zqw-s2
8 0.91 osd.8 up 1
9 0.91 osd.9 up 1
10 0.91 osd.10 up 1
11 0.91 osd.11 up 1
12 0.91 osd.12 up 1
13 0.91 osd.13 up 1
14 0.91 osd.14 up 1
15 0.91 osd.15 up 1

After digging into the problem, I found that the ceph init script changes the OSD's crush location on startup. It uses the 'ceph-crush-location' script to get the crush location for the restarting OSD from ceph.conf. If there is no such entry in ceph.conf, it falls back to the default 'host=$(hostname -s) root=default'. Since I don't have a crush location configured in my ceph.conf (I guess most people don't), osd.16 goes to 'root default' and 'host zqw-s1' when I restart it.

Here is a fix for this:
When the ceph init script uses 'ceph osd crush create-or-move' to change the OSD's crush location, do a check first: if the OSD already exists in the crush map, return without changing its location. This change is at: https://github.com/wonzhq/ceph/commit/efdfa23664caa531390d141bd1539878761412fe
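
As a rough illustration of the idea (this is only a sketch, not the actual commit; it assumes the init script already has the OSD's $id, $weight and $location variables, and it uses 'ceph osd tree' output merely to test for the OSD's presence):

    # Only set the location if the OSD is not already placed in the crush map.
    if ceph osd tree | grep -qw "osd.$id"; then
        : # osd.$id already has a crush location; leave it alone
    else
        ceph osd crush create-or-move -- "$id" "$weight" $location
    fi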

What do you think?
Sage Weil
2014-08-21 15:28:11 UTC
The goal of this behavior is to allow hot-swapping of devices. You can
pull disks out of one host and put them in another and the udev machinery
will start up the daemon, update the crush location, and the disk and data
will become available. It's not 'ideal' in the sense that there will be
rebalancing, but it does make the data available to the cluster to
preserve data safety.

We haven't come up with a great scheme for managing multiple trees yet.
The idea is that the ceph-crush-location hook can be customized to do
whatever is necessary, for example by putting root=ssd if the device type
appears to be an ssd (maybe look at the sysfs metadata, or put a marker
file in the osd data directory?). You can point to your own hook for your
environment with

osd crush location hook = /path/to/my/script
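
For example, a custom hook along those lines might look like the sketch below (purely illustrative: the marker-file name, the data directory path and the exact arguments the hook is called with are assumptions; as far as I understand, the script just needs to print the location key=value pairs on stdout):

    #!/bin/sh
    # Hypothetical crush location hook: OSDs whose data directory contains
    # an 'ssd_marker' file go under root=ssd, everything else under the
    # default root.
    id=""
    while [ $# -ge 1 ]; do
        case "$1" in
            --id) id="$2"; shift ;;
        esac
        shift
    done
    if [ -e "/var/lib/ceph/osd/ceph-$id/ssd_marker" ]; then
        echo "host=$(hostname -s)-ssd root=ssd"
    else
        echo "host=$(hostname -s) root=default"
    fi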

sage



Wang, Zhiqiang
2014-08-22 01:36:13 UTC
Hi Sage,

Yes, I understand that we can customize the crush location hook to make the OSD go to the right location. But is the ceph user aware of this when he/she has more than one root in the crush map? At least I didn't know it at the beginning. We need to either emphasize this or handle it for the user in some way.

One question about the hot-swapping support for moving an OSD to another host: what if the journal is not located on the same disk as the OSD? Is the OSD still able to become available in the cluster?

David Moreau Simard
2014-08-22 02:33:34 UTC
I'm glad you mention this because I've also been running into the same issue and this took me a while to figure out too.

Is this new behaviour? I don't remember running into this before...

Sage does mention multiple trees but I've had this happen with a single root.
It is definitely not my expectation that restarting an OSD would move things around in the crush map.

I'm in the process of developing a crush map, which looks like this (note: unfinished and does not make much sense as is):
http://pastebin.com/6vBUQTCk
This results in this tree:
# id weight type name up/down reweight
-1 18 root default
-2 9 host osd02
-4 2 disktype osd02_ssd
3 1 osd.3 up 1
9 1 osd.9 up 1
-5 7 disktype osd02_spinning
8 1 osd.8 up 1
17 1 osd.17 up 1
5 1 osd.5 up 1
11 1 osd.11 up 1
1 1 osd.1 up 1
13 1 osd.13 up 1
15 1 osd.15 up 1
-3 9 host osd01
-6 2 disktype osd01_ssd
2 1 osd.2 up 1
7 1 osd.7 up 1
-7 7 disktype osd01_spinning
0 1 osd.0 up 1
4 1 osd.4 up 1
12 1 osd.12 up 1
6 1 osd.6 up 1
14 1 osd.14 up 1
10 1 osd.10 up 1
16 1 osd.16 up 1

Merely restarting the OSDs on both hosts modifies the crush map:
http://pastebin.com/rP8Y8qcH
With the resulting tree:
# id weight type name up/down reweight
-1 18 root default
-2 9 host osd02
-4 0 disktype osd02_ssd
-5 0 disktype osd02_spinning
13 1 osd.13 up 1
3 1 osd.3 up 1
5 1 osd.5 up 1
1 1 osd.1 up 1
11 1 osd.11 up 1
15 1 osd.15 up 1
17 1 osd.17 up 1
8 1 osd.8 up 1
9 1 osd.9 up 1
-3 9 host osd01
-6 0 disktype osd01_ssd
-7 0 disktype osd01_spinning
0 1 osd.0 up 1
10 1 osd.10 up 1
12 1 osd.12 up 1
14 1 osd.14 up 1
16 1 osd.16 up 1
2 1 osd.2 up 1
4 1 osd.4 up 1
7 1 osd.7 up 1
6 1 osd.6 up 1

Would a hook really be the solution I need?
--
David Moreau Simard
Wang, Zhiqiang
2014-08-22 02:57:58 UTC
Hi David,

Yes, I think adding a hook in your ceph.conf can solve your problem. At least this is what I did, and it solves the problem.

For example:

[osd.3]
osd crush location = "host=osd02 root=default disktype=osd02_ssd"

You need to add this for every osd.
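
For instance (purely illustrative, following the disktype buckets in your tree), the spinning-disk OSDs on osd02 would get entries along the lines of:

[osd.8]
osd crush location = "host=osd02 root=default disktype=osd02_spinning"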

David Moreau Simard
2014-08-22 14:02:40 UTC
Hi Wang,

Thanks, I'll try that for the time being. This still raises a few questions I'd like to discuss.

I'm convinced we can agree that the CRUSH map is ultimately the authority on where the devices currently are.
My understanding is that we are relying on another source for device location when (in this case) restarting OSDs: the ceph.conf file.

1) Does this imply that we probably shouldn't specify device locations directly in the crush map but in our ceph.conf file instead?
2) If what is in the crush map differs from what is configured in ceph.conf, how does Ceph decide which is the authority? Shouldn't it be the crush map? In this case, it appears to be the ceph.conf file.

Just trying to wrap my head around the vision of how things should be managed.
--
David Moreau Simard


Sage Weil
2014-08-22 14:06:59 UTC
Generally speaking, you have three options:

- 'osd crush update on start = false' and do it all manually, like you're used to.
- set 'crush location = a=b c=d e=f' in ceph.conf. The expectation is that chef or puppet or whatever will fill this in with "host=foo rack=bar dc=asdf".
- customize ceph-crush-location to do something trickier (like multiple trees).
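
To make the first two concrete, a ceph.conf sketch might look like this (the exact values and section placement are illustrative assumptions):

# Option 1: disable automatic placement entirely.
[osd]
osd crush update on start = false

# Option 2: pin each daemon's location explicitly (filled in by
# chef/puppet or by hand), e.g. for osd.16 from the tree above:
[osd.16]
osd crush location = "root=ssd host=zqw-s1-ssd"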

sage
David Moreau Simard
2014-08-22 14:14:11 UTC
Ah, that does clear things up!

I didn't even know that there was a toggle for 'osd crush update on start' - my bad.
I searched through the documentation and couldn't find anything on that topic.

Perhaps we should add a bit about that here:
http://ceph.com/docs/master/rados/operations/crush-map/#crush-location

I'll open a pull request.
--
David Moreau Simard