Discussion:
question about the client's cluster awareness
yue longguang
2014-09-23 09:53:20 UTC
hi, all

my question comes from a test I ran.
let's take an example: object1 (4MB) --> pg 0.1 --> osd 1,2,3 (osd1 is the primary)

while the client is writing object1, osd1 goes down. suppose 2MB has
already been written.
1.
when the connection to osd1 drops, what does the client do? does it ask
the monitor for a new osdmap, or only the pg map?

2.
now the client gets a newer map and continues the write; the primary osd
should now be osd2, and the remaining 2MB is written out.
what does ceph do to integrate the two parts of the data, and to
guarantee there are enough replicas?

3.
where is the code? please tell me where the code is.

it is a very difficult question.

Thanks so much
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Sage Weil
2014-09-25 14:13:57 UTC
Hi Yue,
Post by yue longguang
---------- Forwarded message ----------
Date: Tue, Sep 23, 2014 at 5:53 PM
Subject: question about the client's cluster awareness
hi, all
my question comes from a test I ran.
let's take an example: object1 (4MB) --> pg 0.1 --> osd 1,2,3 (osd1 is the primary)
while the client is writing object1, osd1 goes down. suppose 2MB has already been written.
1.
when the connection to osd1 drops, what does the client do? does it ask
the monitor for a new osdmap, or only the pg map?
For a client that is mostly idle and has only a single IO in progress to
the failed machine, it will wait for N seconds before asking the monitor
for an updated OSDMap. Usually, though, it will get that incremental map
update/diff from another OSD in the cluster. Any time the client sends a
request to any OSD, that OSD will share map incrementals/diffs if it has a
newer map. So for a cluster with, say, 100 OSDs, perhaps 99% of the time it
will find out about the failure from another OSD.
Post by yue longguang
2.
now the client gets a newer map and continues the write; the primary osd
should now be osd2, and the remaining 2MB is written out.
The client will resend any request that hasn't been acked to the new
primary. If it was a single 2MB write, that means it will resend the
whole write. If it was two 1MB writes, it will resend whichever
portions haven't been acked (probably both, if the failure happened
mid-write).
Post by yue longguang
now what does ceph do to integrate the two parts of the data, and to
guarantee there are enough replicas?
If the new primary has that write on disk already (because it had
completed the write before it crashed) it will reply immediately--
operations have unique IDs and are idempotent. If it hasn't seen
the write yet, it will do it then.
Post by yue longguang
3.
where is the code? please tell me where the code is.
osdc/Objecter.cc scan_requests() is where we decide what to resend
(specifically, look at where we call recalc_target).

You'll find the dup request check either in OSD.cc handle_op or in
ReplicatedPG.cc do_request. The map sharing code is in OSD.cc in
_share_map_incoming (or something like that).

Hope that helps!
sage