Hi,
After upgrading to Ceph 16.2.14 we had several OSD crashes
in the bstore_kv_sync thread:
1. "assert_thread_name": "bstore_kv_sync",
2. "backtrace": [
3. "/lib64/libpthread.so.0(+0x12cf0) [0x7ff2f6750cf0]",
4. "gsignal()",
5. "abort()",
6. "(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1a9) [0x564dc5f87d0b]",
7. "/usr/bin/ceph-osd(+0x584ed4) [0x564dc5f87ed4]",
8. "(RocksDBBlueFSVolumeSelector::sub_usage(void*, bluefs_fnode_t
const&)+0x15e) [0x564dc6604a9e]",
9. "(BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, unsigned
long)+0x77d) [0x564dc66951cd]",
10. "(BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0x90)
[0x564dc6695670]",
11. "(BlueFS::fsync(BlueFS::FileWriter*)+0x18b) [0x564dc66b1a6b]",
12. "(BlueRocksWritableFile::Sync()+0x18) [0x564dc66c1768]",
13. "(rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions
const&, rocksdb::IODebugContext*)+0x1f) [0x564dc6b6496f]",
14. "(rocksdb::WritableFileWriter::SyncInternal(bool)+0x402)
[0x564dc6c761c2]",
15. "(rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x564dc6c77808]",
16. "(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup
const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned
long)+0x309) [0x564dc6b780c9]",
17. "(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned
long, bool, unsigned long*, unsigned long,
rocksdb::PreReleaseCallback*)+0x2629) [0x564dc6b80c69]",
18. "(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
rocksdb::WriteBatch*)+0x21) [0x564dc6b80e61]",
19. "(RocksDBStore::submit_common(rocksdb::WriteOptions&,
std::shared_ptr<KeyValueDB::TransactionImpl>)+0x84) [0x564dc6b1f644]",
20. "(RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x9a)
[0x564dc6b2004a]",
21. "(BlueStore::_kv_sync_thread()+0x30d8) [0x564dc6602ec8]",
22. "(BlueStore::KVSyncThread::entry()+0x11) [0x564dc662ab61]",
23. "/lib64/libpthread.so.0(+0x81ca) [0x7ff2f67461ca]",
24. "clone()"
25. ],
I am attaching two instances of crash info for further reference:
https://pastebin.com/E6myaHNU
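For reference, these dumps are essentially what the crash module records; the same details can be listed and retrieved on the cluster with the commands below (<crash-id> is a placeholder):

ceph crash ls
ceph crash info <crash-id>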
OSD configuration is rather simple and close to default:
osd.6  dev       bluestore_cache_size_hdd   4294967296
osd.6  dev       bluestore_cache_size_ssd   4294967296
osd    advanced  debug_rocksdb              1/5
osd    advanced  osd_max_backfills          2
osd    basic     osd_memory_target          17179869184
osd    advanced  osd_recovery_max_active    2
osd    advanced  osd_scrub_sleep            0.100000
osd    advanced  rbd_balance_parent_reads   false
debug_rocksdb is a recent change; otherwise this configuration has been
running without issues for months. The crashes happened on two different
hosts with identical hardware, and neither the hosts nor the storage
(NVMe DB/WAL, HDD block) exhibit any issues. We have not experienced such
crashes with Ceph < 16.2.14.
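If more data would help, I can also pull the BlueFS usage details from the
affected OSDs via the admin socket (osd.6 is just an example id):

ceph daemon osd.6 bluefs stats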
Is this a known issue, or should I open a bug report?
Best regards,
Zakhar
Hi all,
I have some trouble with my backup script because there are a few files, in a deep sub-directory, with a creation/modification date in the future (for example: 2040-02-06 18:00:00). As my script uses the ceph.dir.rctime extended attribute to identify the files and directories to back up, it now browses and syncs a lot of unchanged sub-directories…
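For reference, the script reads the attribute roughly like this (mount point and path are just examples) and compares the returned timestamp against the time of the previous backup run:

getfattr -n ceph.dir.rctime /mnt/cephfs/some/dir

Because of those future-dated files, every parent directory up the chain now reports an rctime in 2040, so everything looks "newer" than the last backup.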
I tried a lot of things, including removing and recreating the files so that they now have the current datetime, but rctime is never updated. Even when I remove the last directory (i.e. the one where the files are located), rctime is not updated on the parent directories.
Does anyone have a trick to reset rctime to the current datetime (or any other way to get rid of this inconsistent rctime value) on Quincy (17.2.6)?
Regards,
Arnaud
Greetings -
Forgive me if this is an elementary question - I am fairly new to running Ceph and have searched but didn't find anything specific.
Is there any way to disable the disk space warnings (CephNodeDiskspaceWarning) for specific drives or filesystems on my Ceph servers?
I am running 18.2.0, installed with cephadm on Ubuntu 22.04 on Arm. I keep seeing these warnings in the Dashboard for /boot/firmware, which, in my opinion, shouldn't really be something Ceph needs to worry about - or at least should be something I can configure it to ignore.
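Ideally I'd be able to express this as a silence or an exclusion for just that mountpoint. Purely to illustrate what I mean - I don't know whether the alert actually carries a mountpoint label, and the Alertmanager host below is just a placeholder for the one cephadm deploys - something like:

curl -X POST http://<alertmanager-host>:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers": [
         {"name": "alertname", "value": "CephNodeDiskspaceWarning", "isRegex": false},
         {"name": "mountpoint", "value": "/boot/firmware", "isRegex": false}],
       "startsAt": "2023-10-01T00:00:00Z", "endsAt": "2024-10-01T00:00:00Z",
       "createdBy": "dan", "comment": "ignore /boot/firmware"}'

But maybe there is a cleaner, supported way to do this from the Ceph side?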
Thanks in advance.
Dan.
Hi Everyone,
My company is dealing with a quite large Ceph cluster (>10k OSDs, >60 PB of data). It is entirely dedicated to object storage with an S3 interface. Maintenance and expansion are getting more and more problematic and time-consuming. We are considering splitting it into two or more completely separate clusters (without replicating data among them) and creating an S3 abstraction layer with some additional metadata that will allow us to use these 2+ physically independent instances as one logical cluster. Additionally, the newest data is the most in demand, so we have to spread it equally among the clusters to avoid skew in cluster load.
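To make it a bit more concrete: the idea is a thin routing layer that keeps a bucket-to-cluster mapping and forwards each S3 request to the physical cluster that owns the bucket, so clients still see one logical service. Purely as an illustration (endpoints and bucket names are invented), today that decision boils down to which endpoint a given bucket lives behind:

aws --endpoint-url https://s3-cluster-a.example.internal s3 ls s3://fresh-data-2023
aws --endpoint-url https://s3-cluster-b.example.internal s3 ls s3://archive-2019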
Do you have any similar experience? How did you handle it? Maybe you have some advice? I'm not a Ceph expert - I'm just a Ceph user and software developer who doesn't like to duplicate someone else's work.
Best,
Paweł
Hello,
we removed an SSD cache tier and its pool.
The PGs for the pool do still exist.
The cluster is healthy.
The PGs are empty and they reside on the cache tier pool's SSDs.
We would like to take out the disks, but that is not possible: the cluster
still sees the PGs and responds with a HEALTH_WARN.
Because of the 3x replication there are still 128 PGs on three of
the 24 OSDs. We were able to remove the other OSDs.
Summary:
- pool removed
- 3 x 128 empty PGs still exist
- 3 of 24 OSDs still exist
How is it possible to remove these empty and healthy PGs?
The only way I found was something like:
ceph pg {pg-id} mark_unfound_lost delete
Is that the right way?
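Before doing anything destructive I would of course double-check that pool 3 is really gone and that the PGs hold no data, e.g. with:

ceph osd pool ls detail
ceph pg 3.0 query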
Some output of:
ceph pg ls-by-osd 23
PG   OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED         UP            ACTING        SCRUB_STAMP                      DEEP_SCRUB_STAMP
3.0  0        0         0          0        0      0            0           0    active+clean  27h    0'0      2627265:196316   [15,6,23]p15  [15,6,23]p15  2023-09-28T12:41:52.982955+0200  2023-09-27T06:48:23.265838+0200
3.1  0        0         0          0        0      0            0           0    active+clean  9h     0'0      2627266:19330    [6,23,15]p6   [6,23,15]p6   2023-09-29T06:30:57.630016+0200  2023-09-27T22:58:21.992451+0200
3.2  0        0         0          0        0      0            0           0    active+clean  2h     0'0      2627265:1135185  [23,15,6]p23  [23,15,6]p23  2023-09-29T13:42:07.346658+0200  2023-09-24T14:31:52.844427+0200
3.3  0        0         0          0        0      0            0           0    active+clean  13h    0'0      2627266:193170   [6,15,23]p6   [6,15,23]p6   2023-09-29T01:56:54.517337+0200  2023-09-27T17:47:24.961279+0200
3.4  0        0         0          0        0      0            0           0    active+clean  14h    0'0      2627265:2343551  [23,6,15]p23  [23,6,15]p23  2023-09-29T00:47:47.548860+0200  2023-09-25T09:39:51.259304+0200
3.5  0        0         0          0        0      0            0           0    active+clean  2h     0'0      2627265:194111   [15,6,23]p15  [15,6,23]p15  2023-09-29T13:28:48.879959+0200  2023-09-26T15:35:44.217302+0200
3.6  0        0         0          0        0      0            0           0    active+clean  6h     0'0      2627265:2345717  [23,15,6]p23  [23,15,6]p23  2023-09-29T09:26:02.534825+0200  2023-09-27T21:56:57.500126+0200
Best regards,
Malte
Hi all,
I'm affected by a stuck MDS warning for 2 clients: "failing to respond to cache pressure". This is a false alarm, as no MDS is under any cache pressure, and the warning has been stuck for a couple of days now. I found some old threads about cases where the MDS does not update the flags/triggers for this warning in certain situations; they date back to Luminous and I'm probably hitting one of these.
In these threads I could find a lot, except for instructions on how to clear this warning in a nice way. Is there something I can do on the clients to clear it? I don't want to evict/reboot just because of that.
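The only client-side workaround I can think of (for kernel clients) is to force the dentry/inode caches to be dropped so the client releases its caps, e.g.:

sync
echo 2 > /proc/sys/vm/drop_caches

but I don't know whether that actually makes the MDS clear the flag.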
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
Use case:
* Ceph cluster with old nodes having 6TB HDDs
* Add new node with new 12TB HDDs
Is it supported/recommended to pack two 6TB HDDs, currently handled by 2 old
OSDs, into one 12TB LVM volume handled by 1 new OSD?
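To make the question concrete, what I have in mind is roughly the following on an old node, after the two old OSDs have been destroyed (device and VG/LV names are just examples):

pvcreate /dev/sdb /dev/sdc
vgcreate vg_osd12 /dev/sdb /dev/sdc
lvcreate -l 100%FREE -n lv_osd12 vg_osd12
ceph-volume lvm create --data vg_osd12/lv_osd12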
Regards,
Renaud Miel
Hello,
I have a Nautilus cluster built using Ceph packages from Debian 10
Backports, deployed with Ceph-Ansible.
I see that Debian does not offer Ceph 15/Octopus packages. However,
download.ceph.com does offer such packages.
Question: Is it safe to install the download.ceph.com packages
on top of the buster-backports packages?
If so, the next question is how to deploy this. Should I pull down an
appropriate version of Ceph-Ansible and use the rolling-upgrade playbook?
Or just apt-get -f dist-upgrade the new Ceph packages into place?
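For context, I assume the repository switch itself would look roughly like this on each node before any upgrade is attempted (following the Ceph docs for Debian; please correct me if this is the wrong approach):

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
echo "deb https://download.ceph.com/debian-octopus/ buster main" | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update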
BTW, in the long run I'll probably want to get to container-based Reef, but
I need to keep a stable cluster throughout.
Any advice or reassurance much appreciated.
Thanks.
-Dave
--
Dave Hall
Binghamton University
kdhall(a)binghamton.edu