According to ceph -s, the cluster is in recovery, backfill, etc.
  data:
    pools:   7 pools, 19656 pgs
    objects: 65.02M objects, 248 TiB
    usage:   761 TiB used, 580 TiB / 1.3 PiB avail
    pgs:     16.173% pgs unknown
             0.493% pgs not active
             890328/195069177 objects degraded (0.456%)
             828080/195069177 objects misplaced (0.425%)
             15733 active+clean
             3179  unknown
             215   active+undersized+degraded+remapped+backfilling
             152   active+undersized+degraded+remapped+backfill_wait
             135   active+remapped+backfill_wait
             107   active+remapped+backfilling
             65    down
             31    undersized+degraded+peered
             18    active+recovering
             7     active+recovery_wait
             6     active+recovery_wait+degraded
             4     active+recovering+degraded
             1     active+recovery_wait+remapped
             1     peering
             1     active+remapped+backfill_toofull
             1     active+undersized+degraded+remapped+backfill_wait+backfill_toofull

  io:
    client:   607 B/s rd, 134 MiB/s wr, 0 op/s rd, 34 op/s wr
    recovery: 1.9 GiB/s, 511 objects/s
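For context, a quick way to drill into that state (a sketch using standard Ceph CLI commands; nothing here is specific to this cluster):

    # show which PGs are stuck inactive/unknown and why
    ceph health detail
    ceph pg dump_stuck inactive

    # list only the OSDs currently marked down, grouped by host
    ceph osd tree down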
On 09.12.2019 at 13:44, Paul Emmerich wrote:
An OSD that is down does not recover or backfill.
Faster recovery or backfill will not resolve down OSDs.
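(A minimal sketch of that distinction: check whether the daemon behind a down OSD is actually running, here using osd.374 from the log further down as an example.)

    # on the host carrying the OSD: is the process alive?
    systemctl status ceph-osd@374

    # compare with what the cluster map thinks
    ceph osd dump | grep '^osd.374 '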
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at
https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Mon, Dec 9, 2019 at 1:42 PM Thomas Schneider
<74cmonty(a)gmail.com> wrote:
Hi,
I think I can speed up the recovery / backfill.
What are the recommended settings for
osd_max_backfills
osd_recovery_max_active
?
THX
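For reference, a minimal sketch of raising these at runtime on a Nautilus-era cluster; the defaults are osd_max_backfills = 1 and osd_recovery_max_active = 3, and the value 4 below is purely illustrative, not a recommendation:

    # persist in the config database (Mimic and later)
    ceph config set osd osd_max_backfills 4
    ceph config set osd osd_recovery_max_active 4

    # or inject into the running daemons without persisting
    ceph tell 'osd.*' injectargs '--osd-max-backfills 4 --osd-recovery-max-active 4'

Note that, as Paul points out above, no amount of backfill tuning brings a down OSD back.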
On 09.12.2019 at 13:36, Paul Emmerich wrote:
This message is expected.

But your current situation is a great example of why having a separate cluster network is a bad idea in most situations.

First thing I'd do in this scenario is to get rid of the cluster network and see if that helps.
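(A rough sketch of what removing the cluster network typically involves; the config path, subnet, and systemd unit below are the usual defaults and placeholders, not taken from this cluster.)

    # 1. avoid rebalancing while OSDs restart
    ceph osd set noout

    # 2. on each OSD host, comment out the cluster network in /etc/ceph/ceph.conf:
    #      [global]
    #      # cluster network = 192.168.0.0/24   <- placeholder subnet
    # 3. restart the OSDs one host at a time
    systemctl restart ceph-osd.target

    # 4. once everything is back up and clean
    ceph osd unset noout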
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at
https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Mon, Dec 9, 2019 at 11:22 AM Thomas Schneider
<74cmonty(a)gmail.com> wrote:
Hi,
I had a failure on 2 of 7 OSD nodes.
This caused a server reboot, and unfortunately the cluster network failed to come up.
This resulted in many OSDs being marked down.

I decided to stop all services (OSD, MGR, MON) and to start them sequentially.

Now I have multiple OSDs marked as down although the service is running.
None of these down OSDs belongs to the 2 failed nodes.
In the OSD logs I can see multiple entries like this:
2019-12-09 11:13:10.378 7f9a372fb700  1 osd.374 pg_epoch: 493189 pg[11.1992( v 457986'92619 (303558'88266,457986'92619] local-lis/les=466724/466725 n=4107 ec=8346/8346 lis/c 466724/466724 les/c/f 466725/466725/176266 468956/493184/468423) [203,412] r=-1 lpr=493184 pi=[466724,493184)/1 crt=457986'92619 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
I tried to restart the impacted OSDs without success, i.e. the relevant OSDs are still marked as down.

Is there a procedure to overcome this issue, i.e. to get all OSDs up?
THX
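Not a definitive procedure, but a sketch of the usual checks when an OSD process is running yet stays marked down; failed heartbeats over a broken cluster network are a common cause in exactly this situation (osd.374 and the log path are examples, the peer address is a placeholder):

    # which OSDs does the map consider down?
    ceph osd tree down

    # look for heartbeat failures in the OSD's log
    grep -i heartbeat /var/log/ceph/ceph-osd.374.log | tail

    # verify the host reaches its peers on the cluster network
    ping -c 3 <peer-cluster-network-ip>

    # once the network is fixed, restart the daemon
    systemctl restart ceph-osd@374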
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io