[ceph-users] Re: [FORGED] Lost all Monitors in Nautilus Upgrade, best way forward?

19 Feb 2020

On 2/19/20 5:45 AM, Sean Matheny wrote:
...
  I wanted to add a specific question to the previous
post, in the hopes it’s easier to answer.

 We have a Luminous monitor restored from the OSDs using ceph-object-tool, which seems
like the best chance of any success. We followed this rough process:

 https://tracker.ceph.com/issues/24419

 The monitor has come up (as a single monitor cluster), but it’s reporting wildly
inaccurate info, such as the number of osds that are down (157 but all 223 are down), and
hosts (1, but all 14 are down).

Have you verified that the MON's database has the same epoch of the
OSDMap (or newer) as all the other OSDs?

If the OSDs have a newer epoch of the OSDMap than the MON it won't work.

...
  The OSD Daemons are still off, but I’m not sure if
starting them back up with this monitor will make things worse. The fact that this mon
daemon can’t even see how many OSDs are correctly down makes me think that nothing good
will come from turning the OSDs back on.

 Do I run risk of further corruption (i.e. on the Ceph side, not client data as the
cluster is paused) if I proceed and turn on the osd daemons? Or is it worth a shot?

 Are these Ceph health metrics commonly inaccurate until it can talk to the daemons?

The PG stats will be inaccurate indeed and the number of OSDs can vary
as long as they aren't able to peer with each other and the MONs.

> 
> (Also other commands like `ceph osd tree` agree with the below `ceph -s` so far)
> 
> Many thanks for any wisdom… I just don’t want to make things (unnecessarily) much
worse.
> 
> Cheers,
> Sean
> 
> 
> root@ntr-mon01:/var/log/ceph# ceph -s
>   cluster:
>     id:     ababdd7f-1040-431b-962c-c45bea5424aa
>     health: HEALTH_WARN
>             pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub flag(s) set
>             157 osds down
>             1 host (15 osds) down
>             Reduced data availability: 12225 pgs inactive, 885 pgs down, 673 pgs
peering
>             Degraded data redundancy: 14829054/35961087 objects degraded (41.236%),
2869 pgs degraded, 2995 pgs undersized  services:
>     mon: 1 daemons, quorum ntr-mon01
>     mgr: ntr-mon01(active)
>     osd: 223 osds: 66 up, 223 in
>          flags pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub  data:
>     pools:   14 pools, 15220 pgs
>     objects: 10.58M objects, 40.1TiB
>     usage:   43.0TiB used, 121TiB / 164TiB avail
>     pgs:     70.085% pgs unknown
>              10.237% pgs not active
>              14829054/35961087 objects degraded (41.236%)
>              10667 unknown
>              2869  active+undersized+degraded
>              885   down
>              673   peering
>              126   active+undersized
> 
> 
> On 19/02/2020, at 10:18 AM, Sean Matheny
<s.matheny@auckland.ac.nz<mailto:s.matheny@auckland.ac.nz>> wrote:
> 
> Hi folks,
> 
> Our entire cluster is down at the moment.
> 
> We started upgrading from 12.2.13 to 14.2.7 with the monitors. The first monitor we
upgraded crashed. We reverted to luminous on this one and tried another, and it was fine.
We upgraded the rest, and they all worked.
> 
> Then we upgraded the first one again, and after it became the leader, it died. Then
the second one became the leader, and it died. Then the third became the leader, and it
died, leaving mon 4 and 5 unable to form a quorum.
> 
> We tried creating a single monitor cluster by editing the monmap of mon05, and it
died in the same way, just without the paxos negotiation first.
> 
> We have tried to revert to a luminous (12.2.12) monitor backup taken a few hours
before the crash. The mon daemon will start, but is flooded with blocked requests and
unknown pgs after a while. For better or worse we removed the “noout” flag and 144 of 232
OSDs are now showing as down:
> 
>  cluster:
>    id:     ababdd7f-1040-431b-962c-c45bea5424aa
>    health: HEALTH_ERR
>            noout,nobackfill,norecover flag(s) set
>            101 osds down
>            9 hosts (143 osds) down
>            1 auth entities have invalid capabilities
>            Long heartbeat ping times on back interface seen, longest is 15424.178
msec
>            Long heartbeat ping times on front interface seen, longest is 14763.145
msec
>            Reduced data availability: 521 pgs inactive, 48 pgs stale
>            274 slow requests are blocked > 32 sec
>            88 stuck requests are blocked > 4096 sec
>            1303 slow ops, oldest one blocked for 174 sec, mon.ntr-mon01 has slow ops
>            too many PGs per OSD (299 > max 250)  services:
>    mon: 1 daemons, quorum ntr-mon01 (age 3m)
>    mgr: ntr-mon01(active, since 30m)
>    mds: cephfs:1 {0=akld2e18u42=up:active(laggy or crashed)}
>    osd: 223 osds: 66 up, 167 in
>         flags noout,nobackfill,norecover
>    rgw: 2 daemons active (ntr-rgw01, ntr-rgw02)  data:
>    pools:   14 pools, 15220 pgs
>    objects: 35.26M objects, 134 TiB
>    usage:   379 TiB used, 1014 TiB / 1.4 PiB avail
>    pgs:     3.423% pgs unknown
>             14651 active+clean
>             521   unknown
>             48    stale+active+clean  io:
>    client:   20 KiB/s rd, 439 KiB/s wr, 7 op/s rd, 54 op/s wr
> 
> These luminous OSD daemons are not down, but are all in fact running. They just have
no comms with the monitor:
> 
> 2020-02-19 10:12:33.565680 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[100.37as3( v
129516'2 (0'0,129516'2] local-lis/les=297268/297269 n=0 ec=129502/129502 lis/c
297268/297268 les/c/f 297269/297358/0 297268/297268/161526) [41,192,216,0,160,117]p41(0)
r=3 lpr=305101 crt=129516'2 lcod 0'0 unknown NOTIFY mbc={}] state<Start>:
transitioning to Stray
> 2020-02-19 10:12:33.565861 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[4.53c( v
305046'1933429 (304777'1931907,305046'1933429] local-lis/les=298009/298010
n=7350 ec=768/768 lis/c 298009/298009 les/c/f 298010/298010/0 297268/298009/298009)
[0,61,103] r=0 lpr=305101 crt=305046'1933429 lcod 0'0 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
> 2020-02-19 10:12:33.566742 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[100.des4( v
129516'1 (0'0,129516'1] local-lis/les=292010/292011 n=1 ec=129502/129502 lis/c
292010/292010 les/c/f 292011/292417/0 292010/292010/280955) [149,62,209,187,0,98]p149(0)
r=4 lpr=305072 crt=129516'1 lcod 0'0 unknown NOTIFY mbc={}] state<Start>:
transitioning to Stray
> 2020-02-19 10:12:33.566896 7ff23ccd9e00  0 osd.0 305104 done with init, starting boot
process
> 2020-02-19 10:12:33.566956 7ff23ccd9e00  1 osd.0 305104 start_boot
> 
> One oddity in our deployment is that there was a test mds instance, and it is running
mimic. I shut it down, as the monitor trace has an MDS call in it, but the nautilus
monitors still die the same way.
> 
>   "mds": {
>        "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic
(stable)": 1
>    },
> 
> 
> ...
>   -11> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02(a)1(leader).paxos(paxos
recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804429
lease_expire=0.000000 has v0 lc 85449502
>   -10> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02(a)1(leader).paxos(paxos
recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804446
lease_expire=0.000000 has v0 lc 85449502
>    -9> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02(a)1(leader).paxos(paxos
recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804460
lease_expire=0.000000 has v0 lc 85449502
>    -8> 2020-02-18 09:50:00.800 7fd164a1a700  4 set_mon_vals no callback set
>    -7> 2020-02-18 09:50:00.800 7fd164a1a700  4 mgrc handle_mgr_map Got map version
2301191
>    -6> 2020-02-18 09:50:00.804 7fd164a1a700  4 mgrc handle_mgr_map Active mgr is
now v1:10.31.88.17:6801/2924412
>    -5> 2020-02-18 09:50:00.804 7fd164a1a700  0 log_channel(cluster) log [DBG] :
monmap e25: 5 mons at
{ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
>    -4> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client _send_to_mon log to
self
>    -3> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client  log_queue is 3 last_log
3 sent 2 num 3 unsent 1 sending 1
>    -2> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client  will send 2020-02-18
09:50:00.806845 mon.ntr-mon02 (mon.1) 3 : cluster [DBG] monmap e25: 5 mons at
{ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
>    -1> 2020-02-18 09:50:00.804 7fd164a1a700  5 mon.ntr-mon02(a)1(leader).paxos(paxos
active c 85448935..85449502) is_readable = 1 - now=2020-02-18 09:50:00.806920
lease_expire=2020-02-18 09:50:05.804479 has v0 lc 85449502
>     0> 2020-02-18 09:50:00.812 7fd164a1a700 -1 *** Caught signal (Aborted) **
> in thread 7fd164a1a700 thread_name:ms_dispatch
> 
> ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
> 1: (()+0x11390) [0x7fd171e98390]
> 2: (gsignal()+0x38) [0x7fd1715e5428]
> 3: (abort()+0x16a) [0x7fd1715e702a]
> 4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7fd173673bf5]
> 5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7fd173667bd6]
> 6: (()+0x8b6c21) [0x7fd173667c21]
> 7: (()+0x8c2e34) [0x7fd173673e34]
> 8: (std::__throw_out_of_range(char const*)+0x3f) [0x7fd17367f55f]
> 9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x79ae00]
> 10: (MDSMonitor::tick()+0xc9) [0x79c669]
> 11: (MDSMonitor::on_active()+0x28) [0x785e88]
> 12: (PaxosService::_active()+0xdd) [0x6d4b2d]
> 13: (Context::complete(int)+0x9) [0x600789]
> 14: (void finish_contexts<std::__cxx11::list<Context*,
std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*,
std::allocator<Context*> >&, int)+0xa8) [0x6299a8]
> 15: (Paxos::finish_round()+0x76) [0x6cb276]
> 16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xbff) [0x6cc47f]
> 17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x24b) [0x6ccf2b]
> 18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5)
[0x5fa6f5]
> 19: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x5fad42]
> 20: (Monitor::ms_dispatch(Message*)+0x26) [0x62b046]
> 21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26)
[0x6270b6]
> 22: (DispatchQueue::entry()+0x1219) [0x7fd1732b7e59]
> 23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fd17336836d]
> 24: (()+0x76ba) [0x7fd171e8e6ba]
> 25: (clone()+0x6d) [0x7fd1716b741d]
> ...
> 
> Ceph versions output
> 
> {
>    "mon": {
>        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous
(stable)": 1,
>        "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus
(stable)": 4
>    },
>    "mgr": {
>        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)": 1,
>        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous
(stable)": 1,
>        "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus
(stable)": 2
>    },
>    "osd": {
>        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous
(stable)": 175,
>        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)": 32,
>        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous
(stable)": 16
>    },
>    "mds": {
>        "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic
(stable)": 1
>    },
>    "rgw": {
>        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)": 2
>    },
>    "overall": {
>        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous
(stable)": 175,
>        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)": 35,
>        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous
(stable)": 18,
>        "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic
(stable)": 1,
>        "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus
(stable)": 6
>    }
> }
> 
> We’ve filed a bug report with the actions of the actual cascading crash described
above (when we upgraded mon01 and it became the leader):
> https://tracker.ceph.com/issues/44185 (parts here copied from that report)
> 
> Right now we’re not sure what the best path to some sort of recovery would be. All
OSD Daemons are still on Luminous, so AFAICT, we could build the monitor db from the OSDs
with
https://github.com/ceph/ceph/blob/luminous/doc/rados/troubleshooting/troubl…
which describes using this script:
> 
> 
> #!/bin/bash
> hosts="ntr-sto01 ntr-sto02"
> ms=/tmp/mon-store/
> mkdir $ms
> # collect the cluster map from OSDs
> for host in $hosts; do
>  echo $host
>  rsync -avz $ms root@$host:$ms
>  rm -rf $ms
>  ssh root@$host <<EOF
>    for osd in /var/lib/ceph/osd/ceph-*; do
>      ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms
>    done
> EOF
>  rsync -avz root@$host:$ms $ms
> done
> 
> If this is our best idea to try, should we try the mon store from the above script on
a luminous or nautilus mon daemon? Any other ideas to try at this dark hour? : \
> 
> Cheers,
> Sean
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
> To unsubscribe send an email to
ceph-users-leave@ceph.io<mailto:ceph-users-leave@ceph.io>
> 
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
> 

2024

2023

2022

2021

2020

2019

[ceph-users] Re: [FORGED] Lost all Monitors in Nautilus Upgrade, best way forward?