Discussion:
A problem when restarting OSD
Wang, Zhiqiang
2014-08-21 07:19:47 UTC
Hi all,

I ran into a problem when restarting an OSD.

Here is my OSD tree before restarting the OSD:

# id weight type name up/down reweight
-6 8 root ssd
-4 4 host zqw-s1-ssd
16 1 osd.16 up 1
17 1 osd.17 up 1
18 1 osd.18 up 1
19 1 osd.19 up 1
-5 4 host zqw-s2-ssd
20 1 osd.20 up 1
21 1 osd.21 up 1
22 1 osd.22 up 1
23 1 osd.23 up 1
-1 14.56 root default
-2 7.28 host zqw-s1
0 0.91 osd.0 up 1
1 0.91 osd.1 up 1
2 0.91 osd.2 up 1
3 0.91 osd.3 up 1
4 0.91 osd.4 up 1
5 0.91 osd.5 up 1
6 0.91 osd.6 up 1
7 0.91 osd.7 up 1
-3 7.28 host zqw-s2
8 0.91 osd.8 up 1
9 0.91 osd.9 up 1
10 0.91 osd.10 up 1
11 0.91 osd.11 up 1
12 0.91 osd.12 up 1
13 0.91 osd.13 up 1
14 0.91 osd.14 up 1
15 0.91 osd.15 up 1

After I restart one of the OSDs with an id from 16 to 23, say osd.16, it moves to 'root default' under 'host zqw-s1', and the ceph cluster begins to rebalance. This is surely not what I want.

# id weight type name up/down reweight
-6 7 root ssd
-4 3 host zqw-s1-ssd
17 1 osd.17 up 1
18 1 osd.18 up 1
19 1 osd.19 up 1
-5 4 host zqw-s2-ssd
20 1 osd.20 up 1
21 1 osd.21 up 1
22 1 osd.22 up 1
23 1 osd.23 up 1
-1 15.56 root default
-2 8.28 host zqw-s1
0 0.91 osd.0 up 1
1 0.91 osd.1 up 1
2 0.91 osd.2 up 1
3 0.91 osd.3 up 1
4 0.91 osd.4 up 1
5 0.91 osd.5 up 1
6 0.91 osd.6 up 1
7 0.91 osd.7 up 1
16 1 osd.16 up 1
-3 7.28 host zqw-s2
8 0.91 osd.8 up 1
9 0.91 osd.9 up 1
10 0.91 osd.10 up 1
11 0.91 osd.11 up 1
12 0.91 osd.12 up 1
13 0.91 osd.13 up 1
14 0.91 osd.14 up 1
15 0.91 osd.15 up 1

After digging into the problem, I found that the ceph init script changes the OSD's crush location on startup. It uses the 'ceph-crush-location' script to get the crush location for the restarting OSD from ceph.conf. If there is no such entry in ceph.conf, it falls back to the default 'host=$(hostname -s) root=default'. Since I don't have a crush location configured in my ceph.conf (I guess most people don't), osd.16 goes to 'root default' and 'host zqw-s1' when I restart it.

Here is a fix for this:
When the ceph init script uses 'ceph osd crush create-or-move' to change the OSD's crush location, do a check first: if the OSD already exists in the crush map, return without changing its location. This change is at: https://github.com/wonzhq/ceph/commit/efdfa23664caa531390d141bd1539878761412fe
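
As a rough illustration of the idea (this is only a sketch, not the actual commit; it assumes the init script already has the OSD's $id, $weight and $location variables, and it uses 'ceph osd tree' output merely to test for the OSD's presence):

    # Only set the location if the OSD is not already placed in the crush map.
    if ceph osd tree | grep -qw "osd.$id"; then
        : # osd.$id already has a crush location; leave it alone
    else
        ceph osd crush create-or-move -- "$id" "$weight" $location
    fi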

What do you think?
Sage Weil
2014-08-21 15:28:11 UTC
The goal of this behavior is to allow hot-swapping of devices. You can
pull disks out of one host and put them in another and the udev machinery
will start up the daemon, update the crush location, and the disk and data
will become available. It's not 'ideal' in the sense that there will be
rebalancing, but it does make the data available to the cluster to
preserve data safety.

We haven't come up with a great scheme for managing multiple trees yet.
The idea is that the ceph-crush-location hook can be customized to do
whatever is necessary, for example by putting root=ssd if the device type
appears to be an ssd (maybe look at the sysfs metadata, or put a marker
file in the osd data directory?). You can point to your own hook for your
environment with

osd crush location hook = /path/to/my/script
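
For example, a custom hook along those lines might look like the sketch below (purely illustrative: the marker-file name, the data directory path and the exact arguments the hook is called with are assumptions; as far as I understand, the script just needs to print the location key=value pairs on stdout):

    #!/bin/sh
    # Hypothetical crush location hook: OSDs whose data directory contains
    # an 'ssd_marker' file go under root=ssd, everything else under the
    # default root.
    id=""
    while [ $# -ge 1 ]; do
        case "$1" in
            --id) id="$2"; shift ;;
        esac
        shift
    done
    if [ -e "/var/lib/ceph/osd/ceph-$id/ssd_marker" ]; then
        echo "host=$(hostname -s)-ssd root=ssd"
    else
        echo "host=$(hostname -s) root=default"
    fi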

sage



Wang, Zhiqiang
2014-08-22 01:36:13 UTC
Hi Sage,

Yes, I understand that we can customize the crush location hook to make the OSD go to the right location. But is the ceph user aware of this when he/she has more than one root in the crush map? At least I didn't know it at the beginning. We need to either emphasize this or handle it for the user in some way.

One question about the hot-swapping support for moving an OSD to another host: what if the journal is not located on the same disk as the OSD? Is the OSD still able to become available in the cluster?

David Moreau Simard
2014-08-22 02:33:34 UTC
I'm glad you mention this because I've also been running into the same issue and this took me a while to figure out too.

Is this new behaviour? I don't remember running into this before...

Sage does mention multiple trees but I've had this happen with a single root.
It is definitely not my expectation that restarting an OSD would move things around in the crush map.

I'm in the process of developing a crush map, which looks like this (note: unfinished and does not make much sense as is):
http://pastebin.com/6vBUQTCk
This results in this tree:
# id weight type name up/down reweight
-1 18 root default
-2 9 host osd02
-4 2 disktype osd02_ssd
3 1 osd.3 up 1
9 1 osd.9 up 1
-5 7 disktype osd02_spinning
8 1 osd.8 up 1
17 1 osd.17 up 1
5 1 osd.5 up 1
11 1 osd.11 up 1
1 1 osd.1 up 1
13 1 osd.13 up 1
15 1 osd.15 up 1
-3 9 host osd01
-6 2 disktype osd01_ssd
2 1 osd.2 up 1
7 1 osd.7 up 1
-7 7 disktype osd01_spinning
0 1 osd.0 up 1
4 1 osd.4 up 1
12 1 osd.12 up 1
6 1 osd.6 up 1
14 1 osd.14 up 1
10 1 osd.10 up 1
16 1 osd.16 up 1

Merely restarting the OSDs on both hosts modifies the crush map:
http://pastebin.com/rP8Y8qcH
With the resulting tree:
# id weight type name up/down reweight
-1 18 root default
-2 9 host osd02
-4 0 disktype osd02_ssd
-5 0 disktype osd02_spinning
13 1 osd.13 up 1
3 1 osd.3 up 1
5 1 osd.5 up 1
1 1 osd.1 up 1
11 1 osd.11 up 1
15 1 osd.15 up 1
17 1 osd.17 up 1
8 1 osd.8 up 1
9 1 osd.9 up 1
-3 9 host osd01
-6 0 disktype osd01_ssd
-7 0 disktype osd01_spinning
0 1 osd.0 up 1
10 1 osd.10 up 1
12 1 osd.12 up 1
14 1 osd.14 up 1
16 1 osd.16 up 1
2 1 osd.2 up 1
4 1 osd.4 up 1
7 1 osd.7 up 1
6 1 osd.6 up 1

Would a hook really be the solution I need?
--
David Moreau Simard
Wang, Zhiqiang
2014-08-22 02:57:58 UTC
Hi David,

Yes, I think adding a hook in your ceph.conf can solve your problem. At least this is what I did, and it solves the problem.

For example:

[osd.3]
osd crush location = "host=osd02 root=default disktype=osd02_ssd"

You need to add this for every osd.
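
For instance (purely illustrative, following the disktype buckets in your tree), the spinning-disk OSDs on osd02 would get entries along the lines of:

[osd.8]
osd crush location = "host=osd02 root=default disktype=osd02_spinning"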

David Moreau Simard
2014-08-22 14:02:40 UTC
Hi Wang,

Thanks, I'll try that for the time being. This still raises a few questions I'd like to discuss.

I'm convinced we can agree that the CRUSH map is ultimately the authority on where the devices currently are.
My understanding is that we are relying on another source for device location when (in this case) restarting OSDs: the ceph.conf file.

1) Does this imply that we probably shouldn't specify device locations directly in the crush map but in our ceph.conf file instead?
2) If what is in the crush map differs from what is configured in ceph.conf, how does Ceph decide which is the authority? Shouldn't it be the crush map? In this case, it appears to be the ceph.conf file.

Just trying to wrap my head around the vision of how things should be managed.
--
David Moreau Simard


Sage Weil
2014-08-22 14:06:59 UTC
Generally speaking, you have three options:

- 'osd crush update on start = false' and do it all manually, like you're used to.
- set 'crush location = a=b c=d e=f' in ceph.conf. The expectation is that chef or puppet or whatever will fill this in with "host=foo rack=bar dc=asdf".
- customize ceph-crush-location to do something trickier (like multiple trees).
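
To make the first two concrete, a ceph.conf sketch might look like this (the exact values and section placement are illustrative assumptions):

# Option 1: disable automatic placement entirely.
[osd]
osd crush update on start = false

# Option 2: pin each daemon's location explicitly (filled in by
# chef/puppet or by hand), e.g. for osd.16 from the tree above:
[osd.16]
osd crush location = "root=ssd host=zqw-s1-ssd"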

sage
David Moreau Simard
2014-08-22 14:14:11 UTC
Ah, that does clear things up!

I didn't even know that there was a toggle for 'osd crush update on start' - my bad.
I searched through the documentation and couldn't find anything on that topic.

Perhaps we should add a bit about that here:
http://ceph.com/docs/master/rados/operations/crush-map/#crush-location

I'll open a pull request.
--
David Moreau Simard