Hi Troy,
Looks like we hit the same issue today -- Sage posted some observations
here: https://tracker.ceph.com/issues/39525#note-6
Did it happen again in your cluster?
Cheers, Dan
On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan <tablan(a)gmail.com> wrote:
>
> While I'm still unsure how this happened, this is what was done to solve
> this.
>
> Started the OSD in the foreground with debug 10 and watched for the most
> recent osdmap epoch mentioned before abort(). For example, if it
> mentioned that it had just tried to load epoch 80896 and then crashed:
>
> # ceph osd getmap -o osdmap.80896 80896
> # ceph-objectstore-tool --op set-osdmap --data-path
> /var/lib/ceph/osd/ceph-77/ --file osdmap.80896
>
> Then I restarted the OSD in foreground/debug, and repeated for the next
> osdmap epoch until it got past the first few seconds. This process
> worked for all but two OSDs. For the ones that succeeded, I'd ^C and
> then start the OSD via systemd.
>
> For the remaining two, it would try loading the incremental map and then
> crash. I had the presence of mind to make dd images of every OSD before
> starting this process, so I reverted these two to the state before
> injecting the osdmaps.
>
> I then injected the last 15 or so epochs of the osdmap in sequential
> order before starting the osd, with success.
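> 
> Roughly, that bulk injection amounted to a loop like the following (the
> epoch range and OSD path here are only examples -- adjust them to
> whatever the OSD is asking for):
> 
> for e in $(seq 80890 80905); do
>     # fetch the full map for this epoch from the monitors
>     ceph osd getmap -o /tmp/osdmap.$e $e
>     # inject it into the stopped OSD's store
>     ceph-objectstore-tool --op set-osdmap \
>         --data-path /var/lib/ceph/osd/ceph-77/ --file /tmp/osdmap.$e
> done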
>
> This leads me to believe that the step-wise injection didn't work there
> because the OSD had more subtle corruption that it initially got past,
> but it then got confused when it requested the next incremental delta.
>
> Thanks again to Brad/badone for the guidance!
>
> Tracker issue updated.
>
> Here's the closing IRC dialogue regarding this issue (timestamps are UTC-0700):
>
> 2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you reaching out
> yesterday, you've helped a ton, twice now :) I'm still concerned
> because I don't know how this happened. I'll feel better once
> everything's active+clean, but it's all at least active.
>
> 2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion with
> Josh earlier and he shares my opinion this is likely somehow related to
> these drives or perhaps controllers, or at least specific to these machines
>
> 2019-08-19 16:31:04 < badone> however, there is a possibility you are
> seeing some extremely rare race that no one up to this point has seen before
>
> 2019-08-19 16:31:20 < badone> that is less likely though
>
> 2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire
> successfully but wrote it out to disk in a format that it could not then
> read back in (unlikely) or...
>
> 2019-08-19 16:33:21 < badone> the map "changed" after it had been
> written to disk
>
> 2019-08-19 16:33:46 < badone> the second is considered most likely by us
> but I recognise you may not share that opinion
We're considering using bluestore compression for some of our data, and
I'm not entirely sure how to interpret compression results. As an
example, one of the osd perf dump results shows:
"bluestore_compressed": 28089935,
"bluestore_compressed_allocated": 115539968,
"bluestore_compressed_original": 231079936,
Am I right to interpret this as "Originally 231079936 bytes (~231 MB) were
compressed to 28089935 bytes (~28 MB) and were stored in 115539968 bytes
(~115 MB)"? If so, why is there such a huge discrepancy between the
compressed_allocated and compressed numbers (~115 MB vs ~28 MB)?
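My working guess, which I would like confirmed, is that this is just
allocation rounding: BlueStore still has to write each compressed blob in
whole min_alloc_size units, so assuming the 64 KiB HDD default and roughly
128 KiB blobs (both figures are assumptions on my part), the allocated
space can never drop below half of the original no matter how well the
data compresses. A quick sketch of that arithmetic:

min_alloc=$((64 * 1024))    # assumed bluestore_min_alloc_size_hdd
blob=$((128 * 1024))        # assumed uncompressed blob size
compressed=$((blob / 8))    # suppose the data compresses 8:1
allocated=$(( (compressed + min_alloc - 1) / min_alloc * min_alloc ))
echo "original=$blob compressed=$compressed allocated=$allocated"
# -> allocated is still 65536 (one 64 KiB unit), i.e. half the original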
Also, ceph df detail shows some compression stats:
POOL                ID  STORED   OBJECTS  USED     %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY    USED COMPR  UNDER COMPR
cephfs_data_ec63_c  5   1.2 TiB  311.93k  1.7 TiB  0.02   5.4 PiB    N/A            N/A          311.93k  116 GiB     232 GiB
Would this indicate that the compressed data takes up exactly half the
space (which coincides with what the OSDs are saying)? It is odd that
the allocated numbers are always exactly half the original numbers ...
This is ceph Nautilus with all the defaults for compression (except
turning it on for the pool in question). Any insights here would be
appreciated.
Thanks,
Andras
Hi,
the PG balancer is not working at all, and if I call status, the plugin does not respond; it just hangs forever. An mgr restart doesn't help.
I now have a PG distribution issue. How can I fix this?
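Concretely, the call that hangs is simply the balancer status command:

ceph balancer status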
v 14.2.5
kind regards
This evening I was awakened by an error message:
  cluster:
    id:     9b4468b7-5bf2-4964-8aec-4b2f4bee87ad
    health: HEALTH_ERR
            Module 'telemetry' has failed: ('Connection aborted.', error(101, 'Network is unreachable'))

  services:
I have not seen any other problems with anything else on the cluster. I disabled and re-enabled the telemetry module, and health returned to OK status. Any ideas on what could cause the issue? As far as I understand, telemetry is a module that sends reports to an external Ceph server outside of our network.
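For reference, the disable/enable cycle was just the standard mgr module
commands, followed by a health check:

ceph mgr module disable telemetry
ceph mgr module enable telemetry
ceph health detail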
Thank you for any advice,
Hello,
I run a Ceph Nautilus cluster where we use RBD for data storage and
retrieval. The RBD image metadata is in a replicated pool with 1+2
copies. The data is placed in a pool that uses erasure coding with a
4+2 profile.
Now I am unsure about the exact meaning of min_size for both pools. The
metadata pool has a min_size of 2; the erasure-coded pool has a min_size
of 5. The cluster's redundancy (failure domain) is based on hosts.
As far as I understand "min_size", this would mean that the cluster
would still be able to access data in the pools if one host died.
If two hosts were offline at the same time, the metadata pool would
stop delivering data. The EC pool would basically still be able to
deliver data, since (as I understand it) min_size is irrelevant for EC
pools; instead the pool's K value (==4) determines the minimal number
of available OSDs needed to continue working. In the end I would no
longer be able to read the data in this situation, because only one
working OSD would remain for the replicated pool while its min_size is 2.
If this is correct, then does it make sense to increase the replication
from 3 to 4 for the metadata pool, so that I would still be able to read
and write data even if two hosts went down at the same time?
Is my understanding correct?
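For context, these are the settings I am looking at (pool names are
placeholders for my metadata and EC data pools):

ceph osd pool get <metadata-pool> size
ceph osd pool get <metadata-pool> min_size
ceph osd pool get <ec-data-pool> min_size
# and, if raising the replication is indeed the right answer:
ceph osd pool set <metadata-pool> size 4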
Thank you very much for your help
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
Hello Marc,
On Tue, Feb 11, 2020 at 1:41 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
> Thanks Samy I will give this a try.
>
> It would be helpful if there were some value that shows cache misses,
> so you have a more precise idea of how much you need to increase the
> cache. I have now added a couple of GBs to see if it is being used and
> helps speed things up.
The perf dump for the MDS includes "ino+", "ino", and "ino-" counters
(same for "dn") which tell you how many inodes have been added to the
cache ("+"), are in the cache (no suffix), and have been removed ("-").
You can tell you're getting a lot of cache misses if you're seeing a
lot of growth in ino+ and ino-.
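For example, something like this should show the relevant counters (I
believe they live under the "mds_mem" section of the perf dump; adjust
the MDS name to yours):

ceph daemon mds.<name> perf dump | \
    jq '.mds_mem | {"ino": .ino, "ino+": ."ino+", "ino-": ."ino-",
                    "dn": .dn, "dn+": ."dn+", "dn-": ."dn-"}'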
--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Hi folks,
Our entire cluster is down at the moment.
We started upgrading from 12.2.13 to 14.2.7 with the monitors. The first monitor we upgraded crashed. We reverted to luminous on this one and tried another, and it was fine. We upgraded the rest, and they all worked.
Then we upgraded the first one again, and after it became the leader, it died. Then the second one became the leader, and it died. Then the third became the leader, and it died, leaving mon 4 and 5 unable to form a quorum.
We tried creating a single monitor cluster by editing the monmap of mon05, and it died in the same way, just without the paxos negotiation first.
We have tried to revert to a luminous (12.2.12) monitor backup taken a few hours before the crash. The mon daemon will start, but is flooded with blocked requests and unknown pgs after a while. For better or worse we removed the “noout” flag and 144 of 232 OSDs are now showing as down:
  cluster:
    id:     ababdd7f-1040-431b-962c-c45bea5424aa
    health: HEALTH_ERR
            noout,nobackfill,norecover flag(s) set
            101 osds down
            9 hosts (143 osds) down
            1 auth entities have invalid capabilities
            Long heartbeat ping times on back interface seen, longest is 15424.178 msec
            Long heartbeat ping times on front interface seen, longest is 14763.145 msec
            Reduced data availability: 521 pgs inactive, 48 pgs stale
            274 slow requests are blocked > 32 sec
            88 stuck requests are blocked > 4096 sec
            1303 slow ops, oldest one blocked for 174 sec, mon.ntr-mon01 has slow ops
            too many PGs per OSD (299 > max 250)

  services:
    mon: 1 daemons, quorum ntr-mon01 (age 3m)
    mgr: ntr-mon01(active, since 30m)
    mds: cephfs:1 {0=akld2e18u42=up:active(laggy or crashed)}
    osd: 223 osds: 66 up, 167 in
         flags noout,nobackfill,norecover
    rgw: 2 daemons active (ntr-rgw01, ntr-rgw02)

  data:
    pools:   14 pools, 15220 pgs
    objects: 35.26M objects, 134 TiB
    usage:   379 TiB used, 1014 TiB / 1.4 PiB avail
    pgs:     3.423% pgs unknown
             14651 active+clean
             521   unknown
             48    stale+active+clean

  io:
    client: 20 KiB/s rd, 439 KiB/s wr, 7 op/s rd, 54 op/s wr
These Luminous OSD daemons are not actually down; they are all in fact running. They just have no comms with the monitor:
2020-02-19 10:12:33.565680 7ff222e24700 1 osd.0 pg_epoch: 305104 pg[100.37as3( v 129516'2 (0'0,129516'2] local-lis/les=297268/297269 n=0 ec=129502/129502 lis/c 297268/297268 les/c/f 297269/297358/0 297268/297268/161526) [41,192,216,0,160,117]p41(0) r=3 lpr=305101 crt=129516'2 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2020-02-19 10:12:33.565861 7ff222e24700 1 osd.0 pg_epoch: 305104 pg[4.53c( v 305046'1933429 (304777'1931907,305046'1933429] local-lis/les=298009/298010 n=7350 ec=768/768 lis/c 298009/298009 les/c/f 298010/298010/0 297268/298009/298009) [0,61,103] r=0 lpr=305101 crt=305046'1933429 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2020-02-19 10:12:33.566742 7ff222e24700 1 osd.0 pg_epoch: 305104 pg[100.des4( v 129516'1 (0'0,129516'1] local-lis/les=292010/292011 n=1 ec=129502/129502 lis/c 292010/292010 les/c/f 292011/292417/0 292010/292010/280955) [149,62,209,187,0,98]p149(0) r=4 lpr=305072 crt=129516'1 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2020-02-19 10:12:33.566896 7ff23ccd9e00 0 osd.0 305104 done with init, starting boot process
2020-02-19 10:12:33.566956 7ff23ccd9e00 1 osd.0 305104 start_boot
One oddity in our deployment is that there was a test mds instance, and it is running mimic. I shut it down, as the monitor trace has an MDS call in it, but the nautilus monitors still die the same way.
"mds": {
"ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1
},
...
-11> 2020-02-18 09:50:00.800 7fd164a1a700 5 mon.ntr-mon02(a)1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804429 lease_expire=0.000000 has v0 lc 85449502
-10> 2020-02-18 09:50:00.800 7fd164a1a700 5 mon.ntr-mon02(a)1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804446 lease_expire=0.000000 has v0 lc 85449502
-9> 2020-02-18 09:50:00.800 7fd164a1a700 5 mon.ntr-mon02(a)1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804460 lease_expire=0.000000 has v0 lc 85449502
-8> 2020-02-18 09:50:00.800 7fd164a1a700 4 set_mon_vals no callback set
-7> 2020-02-18 09:50:00.800 7fd164a1a700 4 mgrc handle_mgr_map Got map version 2301191
-6> 2020-02-18 09:50:00.804 7fd164a1a700 4 mgrc handle_mgr_map Active mgr is now v1:10.31.88.17:6801/2924412
-5> 2020-02-18 09:50:00.804 7fd164a1a700 0 log_channel(cluster) log [DBG] : monmap e25: 5 mons at {ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
-4> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client _send_to_mon log to self
-3> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client log_queue is 3 last_log 3 sent 2 num 3 unsent 1 sending 1
-2> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client will send 2020-02-18 09:50:00.806845 mon.ntr-mon02 (mon.1) 3 : cluster [DBG] monmap e25: 5 mons at {ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
-1> 2020-02-18 09:50:00.804 7fd164a1a700 5 mon.ntr-mon02(a)1(leader).paxos(paxos active c 85448935..85449502) is_readable = 1 - now=2020-02-18 09:50:00.806920 lease_expire=2020-02-18 09:50:05.804479 has v0 lc 85449502
0> 2020-02-18 09:50:00.812 7fd164a1a700 -1 *** Caught signal (Aborted) **
in thread 7fd164a1a700 thread_name:ms_dispatch
ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
1: (()+0x11390) [0x7fd171e98390]
2: (gsignal()+0x38) [0x7fd1715e5428]
3: (abort()+0x16a) [0x7fd1715e702a]
4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7fd173673bf5]
5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7fd173667bd6]
6: (()+0x8b6c21) [0x7fd173667c21]
7: (()+0x8c2e34) [0x7fd173673e34]
8: (std::__throw_out_of_range(char const*)+0x3f) [0x7fd17367f55f]
9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x79ae00]
10: (MDSMonitor::tick()+0xc9) [0x79c669]
11: (MDSMonitor::on_active()+0x28) [0x785e88]
12: (PaxosService::_active()+0xdd) [0x6d4b2d]
13: (Context::complete(int)+0x9) [0x600789]
14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x6299a8]
15: (Paxos::finish_round()+0x76) [0x6cb276]
16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xbff) [0x6cc47f]
17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x24b) [0x6ccf2b]
18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5) [0x5fa6f5]
19: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x5fad42]
20: (Monitor::ms_dispatch(Message*)+0x26) [0x62b046]
21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x6270b6]
22: (DispatchQueue::entry()+0x1219) [0x7fd1732b7e59]
23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fd17336836d]
24: (()+0x76ba) [0x7fd171e8e6ba]
25: (clone()+0x6d) [0x7fd1716b741d]
...
Ceph versions output
{
"mon": {
"ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 1,
"ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 4
},
"mgr": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 1,
"ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 1,
"ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 2
},
"osd": {
"ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 175,
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 32,
"ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 16
},
"mds": {
"ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1
},
"rgw": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 2
},
"overall": {
"ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 175,
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 35,
"ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 18,
"ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1,
"ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 6
}
}
We’ve filed a bug report covering the actual cascading crash described above (when we upgraded mon01 and it became the leader):
https://tracker.ceph.com/issues/44185 (parts here copied from that report)
Right now we’re not sure what the best path to some sort of recovery would be. All OSD daemons are still on Luminous, so AFAICT we could rebuild the monitor DB from the OSDs with https://github.com/ceph/ceph/blob/luminous/doc/rados/troubleshooting/troubl… which describes using this script:
#!/bin/bash
hosts="ntr-sto01 ntr-sto02"
ms=/tmp/mon-store/
mkdir $ms
# collect the cluster map from OSDs
for host in $hosts; do
echo $host
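# push the store collected so far to this host so its OSDs can append to it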
rsync -avz $ms root@$host:$ms
rm -rf $ms
ssh root@$host <<EOF
for osd in /var/lib/ceph/osd/ceph-*; do
ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms
done
EOF
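# pull the updated store back before moving on to the next host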
rsync -avz root@$host:$ms $ms
done
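If I am reading the linked page correctly, once the store has been
collected from every OSD host it then gets rebuilt with something along
these lines (the keyring path is just an example; it has to contain the
mon. and client.admin keys):

ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /path/to/admin.keyring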
If this is our best idea to try, should we try the mon store from the above script on a luminous or nautilus mon daemon? Any other ideas to try at this dark hour? : \
Cheers,
Sean
Hi all,
I have noticed that Red Hat is willing to support 2x replication with
NVMe drives. Additionally, I have seen a CERN presentation where they
use 2x replication with NVMe for a hyperconverged/HPC/CephFS
solution.
I would like to hear some opinions on whether this is really a good
idea for production. Is this setup (NVMe/2x replication) really only
meant to be used for data that is backed up and/or can be lost without
causing a catastrophe?
Thanks,
Frank
Hi,
I am seeing an unusual slowdown using VMware with iSCSI gateways. I have
two iSCSI gateways with two RBD images. I found the following in the
logs:
Dec 24 09:00:26 ceph-iscsi2 tcmu-runner: 2019-12-24 09:00:26.040 969 [INFO] alua_implicit_transition:562 rbd/pool1.vmware_iscsi1: Starting lock acquisition operation.
2019-12-24 09:00:26.040 969 [INFO] alua_implicit_transition:557 rbd/pool1.vmware_iscsi1: Lock acquisition operation is already in process.
2019-12-24 09:00:26.973 969 [WARN] tcmu_rbd_lock:744 rbd/pool1.vmware_iscsi1: Acquired exclusive lock.
Dec 24 09:00:26 ceph-iscsi2 tcmu-runner: tcmu_rbd_lock:744 rbd/pool1.vmware_iscsi1: Acquired exclusive lock.
Dec 24 09:00:28 ceph-iscsi2 tcmu-runner: 2019-12-24 09:00:28.099 969 [WARN] tcmu_notify_lock_lost:201 rbd/pool1.vmware_iscsi1: Async lock drop. Old state 1
Dec 24 09:00:28 ceph-iscsi2 tcmu-runner: tcmu_notify_lock_lost:201 rbd/pool1.vmware_iscsi1: Async lock drop. Old state 1
Dec 24 09:00:28 ceph-iscsi2 tcmu-runner: alua_implicit_transition:562 rbd/pool1.vmware_iscsi1: Starting lock acquisition operation.
Dec 24 09:00:28 ceph-iscsi2 tcmu-runner: 2019-12-24 09:00:28.824 969 [INFO] alua_implicit_transition:562 rbd/pool1.vmware_iscsi1: Starting lock acquisition operation.
2019-12-24 09:00:28.990 969 [WARN] tcmu_rbd_lock:744 rbd/pool1.vmware_iscsi1: Acquired exclusive lock.
Dec 24 09:00:28 ceph-iscsi2 tcmu-runner: tcmu_rbd_lock:744 rbd/pool1.vmware_iscsi1: Acquired exclusive lock.
Can anyone help me, please?
Gesiel