February 2020 - ceph-users

by Ramanathan S

Hi all,

I have just created a Ceph cluster to use CephFS. After creating the pools and the file system, the cluster reports the errors below.

# ceph osd pool create cephfs_data 128
pool 'cephfs_data' created
# ceph osd pool create cephfs_metadata 128
pool 'cephfs_metadata' created
# ceph fs new cephfs cephfs_metadata cephfs_data
new fs with metadata pool 6 and data pool 5
# ceph -s
  cluster:
    id:     1c27def45-f0f9-494d-sfke-eb4323432fd
    health: HEALTH_ERR
            1 filesystem is offline
            1 filesystem is online with fewer MDS than max_mds

  services:
    mon: 2 daemons, quorum ceph-mon01,ceph-mon02
    mgr: ceph-adm01(active)
    mds: cephfs-0/0/1 up
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   2 pools, 256 pgs
    objects: 0 objects, 0 B
    usage:   12 GiB used, 588 GiB / 600 GiB avail
    pgs:     256 active+clean

But when I check max_mds for the file system, it says 1:

# ceph fs get cephfs | grep max_mds
max_mds 1

Can anyone tell me what I am missing here? Any input is much appreciated.

Regards,
Ram
Ceph-explorer
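The `mds:` line in the status output above can be read as three counters. The sketch below assumes the older summary format `<fs>-<up>/<in>/<max_mds> up`; that reading is an assumption, not something the post confirms, and the function name `parse_mds_summary` is made up for illustration. Under that assumption, `cephfs-0/0/1` means max_mds is 1 but no MDS daemon currently holds a rank, which would be consistent with both health messages.

```python
import re

# Assumed layout of the older "mds:" summary: <fs>-<up>/<in>/<max_mds> up
MDS_SUMMARY = re.compile(r"^(?P<fs>.+)-(?P<up>\d+)/(?P<in_map>\d+)/(?P<max_mds>\d+) up")


def parse_mds_summary(line: str) -> dict:
    """Split a summary such as 'cephfs-0/0/1 up' into its counters."""
    m = MDS_SUMMARY.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised mds summary: {line!r}")
    return {
        "fs": m.group("fs"),
        "up": int(m.group("up")),
        "in_map": int(m.group("in_map")),
        "max_mds": int(m.group("max_mds")),
    }


def lacks_active_mds(summary: dict) -> bool:
    """True when fewer ranks are up than max_mds asks for."""
    return summary["up"] < summary["max_mds"]


summary = parse_mds_summary("cephfs-0/0/1 up")
print(summary, "needs an MDS:", lacks_active_mds(summary))
```

If the counters do mean this, the next thing to check would be whether any ceph-mds daemon has been deployed and started at all; the post does not say.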

kernel client osdc ops stuck and mds slow reqs

by Dan van der Ster

Hi all,

We are quite regularly (a couple of times per week) seeing:

HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond to capability release client_id: 52919162
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs

This is caused by osdc ops stuck in a kernel client, e.g.:

10:57:18 root hpc-be028 /root
→ cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
REQUESTS 9 homeless 0
46559317 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
46559322 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
46559323 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559341 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559342 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559345 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559621 osd243 3.6313e8ef 3.8ef [243,330,521]/243 [243,330,521]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.0000007a 0x400014 1 read
46559629 osd243 3.b280c852 3.852 [243,113,539]/243 [243,113,539]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.0000007f 0x400014 1 read
46559928 osd243 3.1ee7bab4 3.ab4 [243,332,94]/243 [243,332,94]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f 0x400024 1 write
LINGER REQUESTS
BACKOFFS

We can unblock those requests by doing `ceph osd down osd.243` (or restarting osd.243).

This is ceph v14.2.6 and the client kernel is el7 3.10.0-957.27.2.el7.x86_64.

Is there a better way to debug this?

Best Regards,
Dan
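When several clients are affected, it can help to tally the stuck requests per OSD instead of reading the osdc file by eye. The following is a minimal sketch that assumes the layout shown in the listing above (a `REQUESTS` header, then one request per line with the tid first and the target OSD second); the function name `stuck_requests_per_osd` is hypothetical and the sample is shortened from the post.

```python
from collections import Counter


def stuck_requests_per_osd(osdc_text: str) -> Counter:
    """Count in-flight requests per OSD from a debugfs osdc dump."""
    counts = Counter()
    in_requests = False
    for raw in osdc_text.splitlines():
        line = raw.strip()
        if line.startswith("REQUESTS"):
            in_requests = True       # request lines follow this header
            continue
        if line.startswith(("LINGER REQUESTS", "BACKOFFS")):
            in_requests = False      # later sections are not counted
            continue
        if not in_requests or not line:
            continue
        fields = line.split()
        # Assumed field order: <tid> <osdN> <pgid> ...
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].startswith("osd"):
            counts[fields[1]] += 1
    return counts


sample = """REQUESTS 3 homeless 0
46559317 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 obj.00000057 0x400014 1 read
46559323 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 obj.00000056 0x400014 1 read
46559928 osd243 3.1ee7bab4 3.ab4 [243,332,94]/243 [243,332,94]/243 e678697 obj.0000073f 0x400024 1 write
LINGER REQUESTS
BACKOFFS
"""

per_osd = stuck_requests_per_osd(sample)
print(per_osd.most_common())
```

A single OSD dominating the tally, as osd243 does in the post, points at that OSD or the client's session to it rather than at the MDS.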

Re: mds lost very frequently

by Stefan Kooman

Hi,

After setting:

ceph config set mds mds_recall_max_caps 10000 (5000 before change)

and

ceph config set mds mds_recall_max_decay_rate 1.0 (2.5 before change)

and then:

ceph tell 'mds.*' injectargs '--mds_recall_max_caps 10000'
ceph tell 'mds.*' injectargs '--mds_recall_max_decay_rate 1.0'

our up:active MDS stopped responding and the standby-replay stepped in ... and hit an assert (same as in this thread):

2020-02-06 16:42:16.712 7ff76a528700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2020-02-06 16:42:17.616 7ff76ff1b700 0 mds.beacon.mds2 MDS is no longer laggy
2020-02-06 16:42:20.348 7ff76d716700 -1 /build/ceph-13.2.8/src/mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7ff76d716700 time 2020-02-06 16:42:20.351124
/build/ceph-13.2.8/src/mds/Locker.cc: 5307: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)
 ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7ff7759939de]
 2: (()+0x287b67) [0x7ff775993b67]
 3: (()+0x28a9ea) [0x5585eb2b79ea]
 4: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
 5: (MDSRank::active_start()+0x135) [0x5585eb146be5]
 6: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
 7: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
 8: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
 9: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
 10: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
 12: (()+0x76db) [0x7ff7752846db]
 13: (clone()+0x3f) [0x7ff77446a88f]
2020-02-06 16:42:20.348 7ff76d716700 -1 *** Caught signal (Aborted) ** in thread 7ff76d716700 thread_name:ms_dispatch
 ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
 1: (()+0x12890) [0x7ff77528f890]
 2: (gsignal()+0xc7) [0x7ff774387e97]
 3: (abort()+0x141) [0x7ff774389801]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7ff775993ae6]
 5: (()+0x287b67) [0x7ff775993b67]
 6: (()+0x28a9ea) [0x5585eb2b79ea]
 7: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
 8: (MDSRank::active_start()+0x135) [0x5585eb146be5]
 9: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
 10: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
 11: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
 12: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
 13: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
 15: (()+0x76db) [0x7ff7752846db]
 16: (clone()+0x3f) [0x7ff77446a88f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Quoting Yan, Zheng (ukernel(a)gmail.com):

> Please try below patch if you can compile ceph from source. If you
> can't compile ceph or the issue still happens, please set debug_mds =
> 10 for standby mds (change debug_mds to 0 after mds becomes active).
>
> Regards
> Yan, Zheng
>
> diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
> index 1e8b024b8a..d1150578f1 100644
> --- a/src/mds/MDSRank.cc
> +++ b/src/mds/MDSRank.cc
> @@ -1454,8 +1454,8 @@ void MDSRank::rejoin_done()
>  void MDSRank::clientreplay_start()
>  {
>    dout(1) << "clientreplay_start" << dendl;
> -  finish_contexts(g_ceph_context, waiting_for_replay); // kick waiters
>    mdcache->start_files_to_recover();
> +  finish_contexts(g_ceph_context, waiting_for_replay); // kick waiters
>    queue_one_replay();
>  }
>
> @@ -1487,8 +1487,8 @@ void MDSRank::active_start()
>
>    mdcache->clean_open_file_lists();
>    mdcache->export_remaining_imported_caps();
> -  finish_contexts(g_ceph_context, waiting_for_replay); // kick waiters
>    mdcache->start_files_to_recover();
> +  finish_contexts(g_ceph_context, waiting_for_replay); // kick waiters
>
>    mdcache->reissue_all_caps();
>    mdcache->activate_stray_manager();

AFAICT this patch has never been tested and never committed. Do you still think this might fix the issue? Any hints on how we might reproduce this issue (failing the active MDS and hitting this specific recovery scenario)? We will happily apply this patch and do testing to check if it really fixes the issue.

Gr. Stefan

P.s. For my understanding: the MDS should never stop responding by setting these parameters, right?

--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl

MDS rejects clients causing hanging mountpoint on linux kernel client

by Florian Pritz

Hi,

We are running a ceph cluster on Ubuntu 18.04 machines with ceph 14.2.4. Our cephfs clients are using the kernel module and we have noticed that some of them are sometimes (at least once) hanging after an MDS restart. The only way to resolve this is to unmount and remount the mountpoint, or reboot the machine if unmounting is not possible.

After some investigation, the problem seems to be that the MDS denies reconnect attempts from some clients during restart even though the reconnect interval is not yet reached. In particular, I see the following log entries. Note that there are supposedly 9 sessions. 9 clients reconnect (one client has two mountpoints) and then two more clients reconnect after the MDS already logged "reconnect_done". These two clients were hanging after the event. The kernel log of one of them is shown below too.

Running `ceph tell mds.0 client ls` after the clients have been rebooted/remounted also shows 11 clients instead of 9.

Do you have any ideas what is wrong here and how it could be fixed? I'm guessing that the issue is that the MDS apparently has an incorrect session count and stops the reconnect process too soon. Is this indeed a bug and if so, do you know what is broken?

Regardless, I also think that the kernel should be able to deal with a denied reconnect and that it should try again later. Yet, even after 10 minutes, the kernel does not attempt to reconnect. Is this a known issue or maybe fixed in newer kernels? If not, is there a chance to get this fixed?

Thanks,
Florian

MDS log:

> 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.server reconnect_clients -- 9 sessions
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21019865 v1:10.1.7.22:0/3752632657 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21020717 v1:10.1.7.115:0/2841046616 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24171153 v1:10.1.7.243:0/1127767158 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.23978093 v1:10.1.4.71:0/824226283 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24209569 v1:10.1.4.157:0/1271865906 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190930 v1:10.1.4.240:0/3195698606 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190912 v1:10.1.4.146:0/852604154 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.59 reconnect_done
> 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1087700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.49:0/1483641729 conn(0x55af50053f80 0x55af50140800 :6801 s=OPENED pgs=21 cs=1 l=0).fault server, going to standby
> 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.30586072 v1:10.1.67.140:0/3664284158 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1888700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.140:0/3664284158 conn(0x55af50055600 0x55af50143000 :6801 s=OPENED pgs=8 cs=1 l=0).fault server, going to standby

Hanging client (10.1.67.49) kernel log:

> 2019-09-26T16:08:27.481676+02:00 hostnamefoo kernel: [708596.227148] ceph: mds0 reconnect start
> 2019-09-26T16:08:27.488943+02:00 hostnamefoo kernel: [708596.233145] ceph: mds0 reconnect denied
> 2019-09-26T16:16:17.541041+02:00 hostnamefoo kernel: [709066.287601] libceph: mds0 10.1.4.203:6801 socket closed (con state NEGOTIATING)
> 2019-09-26T16:16:18.068934+02:00 hostnamefoo kernel: [709066.813064] ceph: mds0 rejected session
> 2019-09-26T16:16:18.068955+02:00 hostnamefoo kernel: [709066.814843] ceph: get_quota_realm: ino (10000000008.fffffffffffffffe) null i_snap_realm
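The mismatch described above (9 expected sessions, 9 accepted reconnects, then 2 denied ones) can be pulled out of an MDS log mechanically. This is a rough sketch that assumes the three message shapes seen in the quoted log; the helper name `summarise_reconnect` and the shortened sample lines are illustrative only.

```python
import re

# Patterns modelled on the log lines quoted in the post.
EXPECTED = re.compile(r"reconnect_clients -- (\d+) sessions")
ACCEPTED = re.compile(r"reconnect by (client\.\d+)")
DENIED = re.compile(r"denied reconnect attempt .* from (client\.\d+)")


def summarise_reconnect(log_lines):
    """Return (expected session count, accepted clients, denied clients)."""
    expected = None
    accepted, denied = [], []
    for line in log_lines:
        if (m := EXPECTED.search(line)):
            expected = int(m.group(1))
        elif (m := DENIED.search(line)):
            denied.append(m.group(1))
        elif (m := ACCEPTED.search(line)):
            accepted.append(m.group(1))
    return expected, accepted, denied


sample_log = [
    "1 mds.0.server reconnect_clients -- 2 sessions",
    "0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0",
    "0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0",
    "1 mds.0.59 reconnect_done",
    "0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) "
    "from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45)",
]

expected, accepted, denied = summarise_reconnect(sample_log)
print(f"expected={expected} accepted={len(accepted)} denied={denied}")
```

Any client that shows up only in the denied list is a candidate for the hanging mountpoints described in the post.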

Provide more documentation for MDS performance tuning on large file systems

by Janek Bevendorff

Hello,

Over the last week I have tried optimising the performance of our MDS nodes for the large number of files and concurrent clients we have. It turns out that despite various stability fixes in recent releases, the default configuration still doesn't appear to be optimal for keeping the cache size under control and avoiding intermittent I/O blocks.

Unfortunately, it is very hard to tweak the configuration to something that works, because the tuning parameters needed are largely undocumented or only described in very technical terms in the source code, making them quite unapproachable for administrators not familiar with all the CephFS internals. I would therefore like to ask if it would be possible to document the "advanced" MDS settings more clearly as to what they do and in what direction they have to be tuned for more or less aggressive cap recall, for instance (sometimes it is not clear if a threshold is a min or a max threshold).

I am in the very (un)fortunate situation of having folders with several 100K direct subfolders or files (and one extreme case with almost 7 million dentries), which is a pretty good benchmark for measuring cap growth while performing operations on them. For the time being, I came up with this configuration, which seems to work for me, but is still far from optimal:

mds  basic     mds_cache_memory_limit     10737418240
mds  advanced  mds_cache_trim_threshold   131072
mds  advanced  mds_max_caps_per_client    500000
mds  advanced  mds_recall_max_caps        17408
mds  advanced  mds_recall_max_decay_rate  2.000000

The parameters I am least sure about (because I understand the least how they actually work) are mds_cache_trim_threshold and mds_recall_max_decay_rate. Despite reading the description in src/common/options.cc, I understand only half of what they're doing and I am also not quite sure in which direction to tune them for optimal results.

Another point where I am struggling is the correct configuration of mds_recall_max_caps. The default of 5K doesn't work too well for me, but values above 20K also don't seem to be a good choice. While high values result in fewer blocked ops and better performance without destabilising the MDS, they also lead to slow but unbounded cache growth, which seems counter-intuitive. 17K was the maximum I could go. Higher values work for most use cases, but when listing very large folders with millions of dentries, the MDS cache size slowly starts to exceed the limit after a few hours, since the MDSs are failing to keep clients below mds_max_caps_per_client despite not showing any "failing to respond to cache pressure" warnings.

With the configuration above, I do not have cache size issues any more, but it comes at the cost of performance and slow/blocked ops. A few hints as to how I could optimise my settings for better client performance would be much appreciated and so would be additional documentation for all the "advanced" MDS settings.

Thanks a lot
Janek
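For anyone wanting to try the same values, the listing above can be turned into `ceph config set` commands, the same command form used elsewhere in this archive. The sketch assumes each listing row has exactly four whitespace-separated fields (who, level, option, value); the helper names are made up, and the commands are only printed, not executed.

```python
def to_config_set(rows):
    """Build 'ceph config set' command strings from who/level/option/value rows."""
    commands = []
    for row in rows:
        fields = row.split()
        if len(fields) != 4:
            continue  # skip anything that does not match the assumed layout
        who, _level, option, value = fields
        commands.append(f"ceph config set {who} {option} {value}")
    return commands


def bytes_to_gib(n: int) -> float:
    """Convert a byte count to GiB."""
    return n / 2**30


rows = [
    "mds basic mds_cache_memory_limit 10737418240",
    "mds advanced mds_cache_trim_threshold 131072",
    "mds advanced mds_max_caps_per_client 500000",
    "mds advanced mds_recall_max_caps 17408",
    "mds advanced mds_recall_max_decay_rate 2.000000",
]

commands = to_config_set(rows)
for cmd in commands:
    print(cmd)
print("cache limit in GiB:", bytes_to_gib(10737418240))
```

The conversion shows that the cache limit in the post is exactly 10 GiB.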

atime with cephfs

by Oliver Freyermuth

Dear Cephers,

we are currently mounting CephFS with relatime, using the FUSE client (version 13.2.6):

ceph-fuse on /cephfs type fuse.ceph-fuse (rw,relatime,user_id=0,group_id=0,allow_other)

For the first time, I wanted to use atime to identify old unused data. My expectation with "relatime" was that the access time stamp would be updated less often, for example, only if the last file access was >24 hours ago. However, that does not seem to be the case:

----------------------------------------------
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
$ cat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root > /dev/null
$ sync
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
----------------------------------------------

I also tried this via an nfs-ganesha mount, and via a ceph-fuse mount with admin caps, but atime never changes.

Is atime really never updated with CephFS, or is this configurable? Something as coarse as "update at maximum once per day only" would be perfectly fine for the use case.

Cheers,
Oliver
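For reference, the rule the Linux VFS applies to relatime mounts of local filesystems can be written down in a few lines: atime is refreshed when it is not newer than mtime or ctime, or when it is at least a day old. The sketch below encodes that rule only; whether ceph-fuse applies it is exactly the open question in the post, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta


def relatime_needs_update(atime, mtime, ctime, now, max_age=timedelta(days=1)):
    """Linux relatime rule: refresh atime if it is not newer than
    mtime/ctime, or if it is at least max_age old."""
    return atime <= mtime or atime <= ctime or (now - atime) >= max_age


# Timestamps from the stat output in the post (microsecond precision).
atime = datetime(2019, 4, 10, 15, 50, 4, 975959)
mtime = datetime(2019, 4, 10, 15, 50, 5, 651613)
ctime = datetime(2019, 4, 10, 15, 50, 6, 141006)
read_at = datetime(2019, 4, 10, 16, 0, 0)  # hypothetical time of the `cat`

should_update = relatime_needs_update(atime, mtime, ctime, read_at)
print("atime should be refreshed:", should_update)
```

With the timestamps from the post, atime is older than both mtime and ctime, so a filesystem following this rule would refresh atime on the read, which makes the unchanged value in the second `stat` notable.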

Choosing suitable SSD for Ceph cluster

by Hermann Himmelbauer

Hi,

I am running a nice ceph (proxmox 4 / debian-8 / ceph 0.94.3) cluster on 3 nodes (supermicro X8DTT-HIBQF), 2 OSDs each (2TB SATA hard disks), interconnected via Infiniband 40.

The problem is that the ceph performance is quite bad (approx. 30 MiB/s reading, 3-4 MiB/s writing), so I thought about plugging a PCIe to NVMe/M.2 adapter into each node and installing SSDs. The idea is to have faster ceph storage and also some storage extension.

The question is now which SSDs I should use. If I understand it right, not every SSD is suitable for ceph, as is noted at the links below:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-i…

or here:

https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark

In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a fast SSD for ceph. As the 950 is not available anymore, I ordered a Samsung 970 1TB for testing; unfortunately, the "EVO" instead of the PRO.

Before equipping all nodes with these SSDs, I did some tests with "fio" as recommended, e.g. like this:

fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

The results are as follows:

-----------------------
1) Samsung 970 EVO NVMe M.2 with PCIe adapter

Jobs: 1:
read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec

Jobs: 4:
read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec

Jobs: 10:
read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
-----------------------

So the read speed is impressive, but the write speed is really bad.

Therefore I ordered the Samsung 970 PRO (1TB) as it has faster NAND chips (MLC instead of TLC). The results are, however, even worse for writing:

-----------------------
Samsung 970 PRO NVMe M.2 with PCIe adapter

Jobs: 1:
read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec

Jobs: 4:
read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec

Jobs: 10:
read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
-----------------------

I did some research and found out that the "--sync" flag sets the flag "O_DSYNC", which seems to disable the SSD cache, which leads to these horrid write speeds.

It seems that this relates to the fact that the write cache stays enabled only for SSDs which implement some kind of battery buffer that guarantees a data flush to the flash in case of a power loss.

However, it seems impossible to find out which SSDs do have this power loss protection. Moreover, these enterprise SSDs are crazy expensive compared to the SSDs above, and it's unclear if power loss protection is even available in the NVMe form factor. So building a 1 or 2 TB cluster seems not really affordable/viable.

So, can anyone please give me hints on what to do? Is it possible to ensure that the write cache is not disabled in some way (my server is situated in a data center, so there will probably never be a loss of power)?

Or is the link above already outdated as newer ceph releases somehow deal with this problem? Or maybe a later Debian release (10) will handle the O_DSYNC flag differently?

Perhaps I should simply invest in faster (and bigger) hard disks and forget the SSD-cluster idea?

Thank you in advance for any help,

Best Regards,
Hermann

--
hermann(a)qwer.tk
PGP/GPG: 299893C7 (on keyservers)
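The write figures above are easier to compare once it is clear that, at a fixed 4 KiB block size, bandwidth is just IOPS times block size. A small sketch of that arithmetic, using the single-job write IOPS reported in the post (the function name is illustrative):

```python
def bandwidth_mib_s(iops: float, block_size_bytes: int = 4096) -> float:
    """Bandwidth in MiB/s for a given IOPS rate at a fixed block size."""
    return iops * block_size_bytes / 2**20


# Single-job sync write IOPS from the post.
evo_write = bandwidth_mib_s(1052)   # Samsung 970 EVO
pro_write = bandwidth_mib_s(830)    # Samsung 970 PRO

print(f"970 EVO: {evo_write:.1f} MiB/s, 970 PRO: {pro_write:.1f} MiB/s")
```

So about 1000 synchronous 4 KiB writes per second corresponds to only about 4 MiB/s, which is why the sync-write IOPS figure, not the headline sequential throughput, is the number to compare between drives for this workload.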

add debian buster stable support for ceph-deploy

by Jelle de Jong

Hello everybody,

Can somebody add Debian buster support to ceph-deploy? See https://tracker.ceph.com/issues/42870

Highly appreciated,

Regards,
Jelle de Jong

bluestore_default_buffered_write = true

by Adam Koczarski

Has anyone ever tried using this feature? I've added it to the [global] section of the ceph.conf on my POC cluster but I'm not sure how to tell if it's actually working. I did find a reference to this feature via Google and they had it in their [OSD] section?? I've tried that too.. TIA Adam

ERROR: osd init failed: (1) Operation not permitted

by Ml Ml

Hello List,

first of all: Yes, I made mistakes. Now I am trying to recover :-/

I had a healthy 3-node cluster which I wanted to convert to a single one. My goal was to reinstall a fresh 3-node cluster and start with 2 nodes.

I was able to bring it from a 3-node cluster down to a 2-node cluster while keeping it healthy. Then the problems began.

I started to change size=1 and min_size=1. Health was okay up to this point. Then all of a sudden both nodes got fenced... one node refused to boot, mons were missing, etc. To make a long story short, here is where I am right now:

root@node03:~ # ceph -s
    cluster b3be313f-d0ef-42d5-80c8-6b41380a47e3
     health HEALTH_WARN
            53 pgs stale
            53 pgs stuck stale
     monmap e4: 2 mons at {0=10.15.15.3:6789/0,1=10.15.15.2:6789/0}
            election epoch 298, quorum 0,1 1,0
     osdmap e6097: 14 osds: 9 up, 9 in
      pgmap v93644673: 512 pgs, 1 pools, 1193 GB data, 304 kobjects
            1088 GB used, 32277 GB / 33366 GB avail
                 459 active+clean
                  53 stale+active+clean

root@node03:~ # ceph osd tree
ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 32.56990 root default
-2 25.35992     host node03
 0  3.57999         osd.0        up  1.00000          1.00000
 5  3.62999         osd.5        up  1.00000          1.00000
 6  3.62999         osd.6        up  1.00000          1.00000
 7  3.62999         osd.7        up  1.00000          1.00000
 8  3.62999         osd.8        up  1.00000          1.00000
19  3.62999         osd.19       up  1.00000          1.00000
20  3.62999         osd.20       up  1.00000          1.00000
-3  7.20998     host node02
 3  3.62999         osd.3        up  1.00000          1.00000
 4  3.57999         osd.4        up  1.00000          1.00000
 1        0 osd.1              down        0          1.00000
 9        0 osd.9              down        0          1.00000
10        0 osd.10             down        0          1.00000
17        0 osd.17             down        0          1.00000
18        0 osd.18             down        0          1.00000

My main mistakes seem to have been:
--------------------------------
ceph osd out osd.1
ceph auth del osd.1
systemctl stop ceph-osd@1
ceph osd rm 1
umount /var/lib/ceph/osd/ceph-1
ceph osd crush remove osd.1

As far as I can tell, ceph waits for and needs data from that OSD.1 (which I removed):

root@node03:~ # ceph health detail
HEALTH_WARN 53 pgs stale; 53 pgs stuck stale
pg 0.1a6 is stuck stale for 5086.552795, current state stale+active+clean, last acting [1]
pg 0.142 is stuck stale for 5086.552784, current state stale+active+clean, last acting [1]
pg 0.1e is stuck stale for 5086.552820, current state stale+active+clean, last acting [1]
pg 0.e0 is stuck stale for 5086.552855, current state stale+active+clean, last acting [1]
pg 0.1d is stuck stale for 5086.552822, current state stale+active+clean, last acting [1]
pg 0.13c is stuck stale for 5086.552791, current state stale+active+clean, last acting [1]
[...] SNIP [...]
pg 0.e9 is stuck stale for 5086.552955, current state stale+active+clean, last acting [1]
pg 0.87 is stuck stale for 5086.552939, current state stale+active+clean, last acting [1]

When I try to start OSD.1 manually, I get:
--------------------------------------------
2020-02-10 18:48:26.107444 7f9ce31dd880 0 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af), process ceph-osd, pid 10210
2020-02-10 18:48:26.134417 7f9ce31dd880 0 filestore(/var/lib/ceph/osd/ceph-1) backend xfs (magic 0x58465342)
2020-02-10 18:48:26.184202 7f9ce31dd880 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: FIEMAP ioctl is supported and appears to work
2020-02-10 18:48:26.184209 7f9ce31dd880 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2020-02-10 18:48:26.184526 7f9ce31dd880 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2020-02-10 18:48:26.184585 7f9ce31dd880 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_feature: extsize is disabled by conf
2020-02-10 18:48:26.309755 7f9ce31dd880 0 filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2020-02-10 18:48:26.633926 7f9ce31dd880 1 journal _open /var/lib/ceph/osd/ceph-1/journal fd 20: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2020-02-10 18:48:26.642185 7f9ce31dd880 1 journal _open /var/lib/ceph/osd/ceph-1/journal fd 20: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2020-02-10 18:48:26.664273 7f9ce31dd880 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2020-02-10 18:48:26.732154 7f9ce31dd880 0 osd.1 6002 crush map has features 1107558400, adjusting msgr requires for clients
2020-02-10 18:48:26.732163 7f9ce31dd880 0 osd.1 6002 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
2020-02-10 18:48:26.732167 7f9ce31dd880 0 osd.1 6002 crush map has features 1107558400, adjusting msgr requires for osds
2020-02-10 18:48:26.732179 7f9ce31dd880 0 osd.1 6002 load_pgs
2020-02-10 18:48:31.939810 7f9ce31dd880 0 osd.1 6002 load_pgs opened 53 pgs
2020-02-10 18:48:31.940546 7f9ce31dd880 -1 osd.1 6002 log_to_monitors {default=true}
2020-02-10 18:48:31.942471 7f9ce31dd880 1 journal close /var/lib/ceph/osd/ceph-1/journal
2020-02-10 18:48:31.969205 7f9ce31dd880 -1 ** ERROR: osd init failed: (1) Operation not permitted

It's mounted:

/dev/sdg1 3.7T 127G 3.6T 4% /var/lib/ceph/osd/ceph-1

Is there any way I can get OSD.1 back in?

Thanks a lot,
mario
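To confirm that every stale PG really points at the removed OSD, the `ceph health detail` output above can be grouped by its "last acting" set. A minimal sketch, assuming the line format shown in the post; the helper name is hypothetical and the sample is a subset of the quoted lines.

```python
import re
from collections import Counter

# Matches lines such as:
# pg 0.1a6 is stuck stale for 5086.552795, current state stale+active+clean, last acting [1]
STALE = re.compile(r"pg (\S+) is stuck stale .* last acting \[([\d,]+)\]")


def stale_pgs_by_acting_set(health_detail: str) -> Counter:
    """Count stuck-stale PGs per 'last acting' OSD set."""
    counts = Counter()
    for line in health_detail.splitlines():
        m = STALE.search(line)
        if m:
            counts[m.group(2)] += 1
    return counts


sample_detail = """HEALTH_WARN 53 pgs stale; 53 pgs stuck stale
pg 0.1a6 is stuck stale for 5086.552795, current state stale+active+clean, last acting [1]
pg 0.142 is stuck stale for 5086.552784, current state stale+active+clean, last acting [1]
pg 0.1e is stuck stale for 5086.552820, current state stale+active+clean, last acting [1]
"""

by_set = stale_pgs_by_acting_set(sample_detail)
print(dict(by_set))
```

In the quoted excerpt every stale PG lists `[1]` as its last acting set, and the OSD log reports "load_pgs opened 53 pgs", matching the 53 stale PGs in the health summary.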
