Hi,
I am writing to seek guidance and best practices for a maintenance operation
in my Ceph cluster. I have an older cluster in which the Monitors (Mons)
and Object Storage Devices (OSDs) are currently deployed on the same host.
I am interested in separating them while ensuring zero downtime and
minimizing risks to the cluster's stability.
The primary goal is to deploy new Monitors on different servers without
causing service interruptions or disruptions to data availability.
The challenge arises because updating the configuration to add new Monitors
typically requires a restart of all OSDs, which is less than ideal in terms
of maintaining cluster availability.
One approach I considered is to reweight all OSDs on the host to zero,
allowing data to gradually transfer to other OSDs. Once all data has been
safely migrated, I would proceed to remove the old OSDs. Afterward, I would
deploy the new Monitors on a different server with the previous IP
addresses and deploy the OSDs on the old Monitors' host with new IP
addresses.
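Concretely, the drain step I have in mind would look roughly like this (just a sketch; the OSD IDs are placeholders, I assume a CRUSH reweight so the change is permanent, and I have not run this yet):

ceph osd crush reweight osd.12 0    # repeat for every OSD on the old host
# wait for backfill to finish and the OSDs to become empty:
ceph osd df tree
ceph -s
# once they hold no data, take them out and remove them:
ceph osd out osd.12
ceph osd purge osd.12 --yes-i-really-mean-it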
While this approach seems to minimize risks, it can be time-consuming and
may not be the most efficient way to achieve the desired separation.
I would greatly appreciate the community's insights and suggestions on the
best approach to achieve this separation of Mons and OSDs with zero
downtime and minimal risk. If there are alternative methods or best
practices that can be recommended, please share your expertise.
Hi all, I have made a weird observation: 8 out of 12 MDS daemons no longer seem to report to the cluster:
# ceph fs status
con-fs2 - 1625 clients
=======
RANK STATE MDS ACTIVITY DNS INOS
0 active ceph-16 Reqs: 0 /s 0 0
1 active ceph-09 Reqs: 128 /s 4251k 4250k
2 active ceph-17 Reqs: 0 /s 0 0
3 active ceph-15 Reqs: 0 /s 0 0
4 active ceph-24 Reqs: 269 /s 3567k 3567k
5 active ceph-11 Reqs: 0 /s 0 0
6 active ceph-14 Reqs: 0 /s 0 0
7 active ceph-23 Reqs: 0 /s 0 0
POOL TYPE USED AVAIL
con-fs2-meta1 metadata 2169G 7081G
con-fs2-meta2 data 0 7081G
con-fs2-data data 1248T 4441T
con-fs2-data-ec-ssd data 705G 22.1T
con-fs2-data2 data 3172T 4037T
STANDBY MDS
ceph-08
ceph-10
ceph-12
ceph-13
VERSION DAEMONS
None ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) ceph-09, ceph-24, ceph-08, ceph-13
The version shows as "None" for these, and there are no stats. "ceph versions" reports only 4 of the 12 MDSes; 8 are not shown at all:
[root@gnosis ~]# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "osd": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1282
    },
    "mds": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 4
    },
    "overall": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1296
    }
}
Ceph status reports everything as up and OK:
[root@gnosis ~]# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 2w)
    mgr: ceph-03(active, since 61s), standbys: ceph-25, ceph-01, ceph-02, ceph-26
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1284 osds: 1282 up (since 31h), 1282 in (since 33h); 567 remapped pgs

  data:
    pools:   14 pools, 25065 pgs
    objects: 2.14G objects, 3.7 PiB
    usage:   4.7 PiB used, 8.4 PiB / 13 PiB avail
    pgs:     79908208/18438361040 objects misplaced (0.433%)
             23063 active+clean
             1225  active+clean+snaptrim_wait
             317   active+remapped+backfill_wait
             250   active+remapped+backfilling
             208   active+clean+snaptrim
             2     active+clean+scrubbing+deep

  io:
    client:   596 MiB/s rd, 717 MiB/s wr, 4.16k op/s rd, 3.04k op/s wr
    recovery: 8.7 GiB/s, 3.41k objects/s
My first thought was that the status module had failed. However, I cannot restart it (it is an always-on module), and an MGR fail-over did not help either.
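In case it helps, this is the kind of cross-check I intend to run next (just a sketch; the daemon names are taken from the listing above):

ceph tell mds.ceph-16 version    # one of the daemons that shows "None"
ceph mds metadata ceph-16
ceph tell mds.ceph-09 version    # one that still reports a version, for comparison
ceph mds metadata ceph-09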
Any ideas what is going on here?
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
We're running the latest Pacific on our production cluster and we've been
seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out
after 15.000000954s" error. We have reason to believe this happens each
time the RocksDB compaction process is launched on an OSD. My question
is: when the cluster detects that an OSD has timed out, does that interrupt
the compaction process? This seems to be what's happening, but it's not
immediately obvious. We are currently facing a seemingly endless loop of
random OSDs timing out, and if the compaction process is being interrupted
before it finishes, that might explain it.
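As a possible stop-gap while we investigate, this is roughly what I have in mind (just a sketch; the timeout value is arbitrary and osd.123 is a placeholder):

ceph config set osd osd_op_thread_timeout 60   # default is 15, matching the message above
ceph tell osd.123 compact                      # trigger a compaction manually during a quiet window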
--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
Hi all,
I have a 4-node Ceph cluster.
After I shut the cluster down, I tried to start it again, but this failed
because the ceph orch commands (such as "ceph orch status") hang.
How should I recover from this problem?
root@ceph-manager:/# ceph orch status ==> hung
^CInterrupted
root@ceph-manager:/# ceph status
  cluster:
    id:     4588ed80-352b-11ee-9eae-157ca4325420
    health: HEALTH_ERR
            2 failed cephadm daemon(s)
            1 filesystem is degraded
            1 filesystem is offline
            pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
            10 slow ops, oldest one blocked for 3736 sec, mon.ceph-osd0 has slow ops

  services:
    mon: 4 daemons, quorum ceph-manager,ceph-osd0,ceph-osd1,ceph-osd2 (age 64m)
    mgr: ceph-manager.kurjlh(active, since 64m), standbys: ceph-osd0.jodevs
    mds: 0/1 daemons up (1 failed), 2 standby
    osd: 3 osds: 3 up (since 64m), 3 in (since 2w)
         flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover

  data:
    volumes: 0/1 healthy, 1 failed
    pools:   11 pools, 243 pgs
    objects: 3.01k objects, 9.4 GiB
    usage:   28 GiB used, 2.8 TiB / 2.8 TiB avail
    pgs:     243 active+clean
root@ceph-manager:/# ceph health detail
HEALTH_ERR 2 failed cephadm daemon(s); 1 filesystem is degraded; 1 filesystem is offline; pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; 10 slow ops, oldest one blocked for 3741 sec, mon.ceph-osd0 has slow ops
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon rgw.sno_rgw.ceph-manager.umzmku on ceph-manager is in error state
    daemon rgw.sno_rgw.ceph-osd2.vfpmbs on ceph-osd2 is in error state
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs sno_cephfs is degraded
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs sno_cephfs is offline because no MDS is active for it.
[WRN] OSDMAP_FLAGS: pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
[WRN] SLOW_OPS: 10 slow ops, oldest one blocked for 3741 sec, mon.ceph-osd0 has slow ops
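For reference, my understanding is that the flags shown above need to be cleared before anything can recover; this is what I plan to run once the cluster responds again (just a sketch, please correct me if the order matters):

ceph osd unset pause        # clears both pauserd and pausewr
ceph osd unset nodown
ceph osd unset noout
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover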
Hi,
I want to create a new OSD on a 4TB Samsung MZ1L23T8HBLA-00A07
enterprise nvme device.
Creating the OSD works, but it cannot be initialized and therefore does not
start.
In the log I see an entry about a failed assert.
./src/os/bluestore/fastbmap_allocator_impl.cc: 405: FAILED ceph_assert((aligned_extent.length % l0_granularity) == 0)
Is this the culprit?
In addition, at the end of the log file a failed mount and a failed OSD init
are mentioned.
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluefs _check_allocations OP_FILE_UPDATE_INC invalid extent 1: 0x140000~10000: duplicate reference, ino 30
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluefs mount failed to replay log: (14) Bad address
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 20 bluefs _stop_alloc
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-43) _open_bluefs failed bluefs mount: (14) Bad address
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 10 bluefs maybe_verify_layout no memorized_layout in bluefs superblock
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-43) _open_db failed to prepare db environment:
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 1 bdev(0x5565c261fc00 /var/lib/ceph/osd/ceph-43/block) close
2023-09-11T16:30:04.940+0200 7f99aa28f3c0 -1 osd.43 0 OSD:init: unable to mount object store
2023-09-11T16:30:04.940+0200 7f99aa28f3c0 -1 ** ERROR: osd init failed: (5) Input/output error
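If it helps with the diagnosis, this is what I plan to try next (just a sketch; the device path is a placeholder, and zapping of course destroys whatever is on the OSD):

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-43
ceph-volume lvm zap --destroy /dev/nvmeXnY   # placeholder device, only if the OSD has to be recreated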
--
Regards,
ppa. Martin Konold
--
Martin Konold - Prokurist, CTO
KONSEC GmbH - make things real
Amtsgericht Stuttgart, HRB 23690
Geschäftsführer: Andreas Mack
Im Köller 3, 70794 Filderstadt, Germany
Dear Cephers,
Today brought us an eventful CTL meeting: it looks like Jitsi recently started
requiring user authentication
<https://jitsi.org/blog/authentication-on-meet-jit-si/> (anonymous users
will get a "Waiting for a moderator" modal), but authentication didn't work
against Google or GitHub accounts, so we had to move to the good old Google
Meet.
As a result of this, Neha has kindly set up a new private Slack channel
(#clt) to allow for quicker communication among CLT members (if you usually
attend the CLT meeting and have not been added, please ping any CLT member
to request that).
Now, let's move on to the important stuff:
*The latest Pacific Release (v16.2.14)*
*The Bad*
The 14th drop of the Pacific release has landed with a few hiccups:
- Some .deb packages were made available on downloads.ceph.com before
the release process was complete. Although this is not the first time this
has happened, we want to make sure it is the last, so we'd like to gather
ideas for improving the release publishing process. Neha encouraged everyone
to share ideas here:
- https://tracker.ceph.com/issues/62671
- https://tracker.ceph.com/issues/62672
- v16.2.14 also hit issues during the ceph-container stage. Laura
wanted to raise awareness of its current setbacks
<https://pad.ceph.com/p/16.2.14-struggles> and collect ideas to tackle
them:
- Enforce reviews and mandatory CI checks
- Rework the current approach to use simple Dockerfiles
<https://github.com/ceph/ceph/pull/43292>
- Call the Ceph community for help: ceph-container is currently
maintained part-time by a single contributor (Guillaume Abrioux). This
sub-project would benefit from the sound expertise on containers
among Ceph
users. If you have ever considered contributing to Ceph, but felt a bit
intimidated by C++, Paxos and race conditions, ceph-container is a good
place to shed your fear.
*The Good*
Not everything about v16.2.14 was going to be bleak: David Orman brought us
really good news. They tested v16.2.14 on a large production cluster
(10gbit/s+ RGW and ~13PiB raw) and found that it solved a major issue
affecting RGW in Pacific <https://github.com/ceph/ceph/pull/52552>.
*The Ugly*
During that testing, they noticed that ceph-mgr was occasionally OOM killed
(nothing new to 16.2.14, as it was previously reported). They already tried:
- Disabling modules (like the restful one, which was a suspect)
- Enabling debug 20
- Turning the pg autoscaler off
Debugging will continue to characterize this issue:
- Enable profiling (Mark Nelson)
- Try Bloomberg's Python mem profiler
<https://github.com/bloomberg/memray> (Matthew Leonard)
*Infrastructure*
*Reminder: Infrastructure Meeting Tomorrow. **11:30-12:30 Central Time*
Patrick brought up the following topics:
- Need to reduce the OVH spending ($72k/year, which is a sizeable chunk of
the Ceph Foundation budget; that's a lot fewer avocado sandwiches for the
next Cephalocon):
- Move services (e.g.: Chacra) to the Sepia lab
- Re-use CentOS (and any spare/unused) machines for devel purposes
- Current Ceph sys admins are overloaded, so devel/community involvement
would be much appreciated.
- More to be discussed in tomorrow's meeting. Please join if you
think you can help solve/improve the Ceph infrastructure!
*BTW*: today's CDM will be canceled, since no topics were proposed.
Kind Regards,
Ernesto
Hi all,
I recently started observing that our MGR seems to execute the same "config rm" commands over and over again, in the audit log:
9/10/23 10:03:24 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch
9/10/23 10:03:24 AM[INF]from='mgr.252911336 192.168.32.68:0/63' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch
9/10/23 10:03:19 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: dispatch
9/10/23 10:03:19 AM[INF]from='mgr.252911336 192.168.32.68:0/63' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: dispatch
9/10/23 10:02:24 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch
We don't have mirroring; it's a single cluster. What is going on here, and how can I stop it? I have already restarted all MGR daemons, to no avail.
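In case it is relevant, this is how I would check whether any such schedules are configured at all (just a sketch; I would expect both lists to be empty on a cluster without mirroring):

rbd mirror snapshot schedule ls --recursive
rbd trash purge schedule ls --recursive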
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello,
I am interested in the best-practice guidance for the following situation.
There is a Ceph cluster with CephFS deployed. There are three servers
dedicated to running MDS daemons: one active, one standby-replay, and one
standby. There is only a single rank.
Sometimes, servers need to be rebooted for reasons unrelated to Ceph.
What's the proper procedure to follow when restarting a node that currently
contains an active MDS server? The goal is to minimize the client downtime.
Ideally, they should not notice even if they play MP3s from the CephFS
filesystem (note that I haven't tested this exact scenario) - is this
achievable?
I tried to use the "ceph mds fail mds02" command while mds02 was active and
mds03 was standby-replay, to force the fail-over to mds03 so that I could
reboot mds02. Result: mds02 became standby, while mds03 went through
reconnect (30 seconds), rejoin (another 30 seconds), and replay (5 seconds)
phases. During the "reconnect" and "rejoin" phases, the "Activity" column
of "ceph fs status" is empty, which concerns me. It looks like I just
caused a 65-second downtime. After all of that, mds02 became
standby-replay, as expected.
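For completeness, the sequence I used, plus the check I would do beforehand (a sketch; <fsname> stands in for the actual filesystem name):

ceph fs get <fsname>    # look for allow_standby_replay in the flags line
ceph mds fail mds02
ceph fs status          # watch the standby-replay daemon take over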
Is there a better way? Or should I have simply rebooted mds02 without giving
it much thought?
--
Alexander E. Patrakov
Hello fellow ceph users,
I've been looking for a way to reduce the recovery time of a CephFS mount when the client's source IP changes. The sessions seem to stop working properly when that happens, and a reconnect takes ages. Is there a way to reduce that interval, or to make CephFS cope with roaming sessions? Another good example of where this happens is IPv6 privacy addresses.
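For illustration, the kind of manual intervention that seems to be needed today (a sketch; the daemon name and client ID are placeholders, and note that evicting also blocklists the old client address by default):

ceph tell mds.<daemon> session ls                   # find the stale session of the client that moved
ceph tell mds.<daemon> client evict id=<client-id>  # let the client establish a fresh session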
--
Alex D.
RedXen System & Infrastructure Administration
https://redxen.eu/