Hi,
I am writing to seek guidance and best practices for a maintenance operation
in my Ceph cluster. I have an older cluster in which the Monitors (Mons)
and Object Storage Devices (OSDs) are currently deployed on the same host.
I am interested in separating them while ensuring zero downtime and
minimizing risks to the cluster's stability.
The primary goal is to deploy new Monitors on different servers without
causing service interruptions or disruptions to data availability.
The challenge arises because updating the configuration to add new Monitors
typically requires a restart of all OSDs, which is less than ideal in terms
of maintaining cluster availability.
One approach I considered is to reweight all OSDs on the host to zero,
allowing data to gradually transfer to other OSDs. Once all data has been
safely migrated, I would proceed to remove the old OSDs. Afterward, I would
deploy the new Monitors on a different server with the previous IP
addresses and deploy the OSDs on the old Monitors' host with new IP
addresses.
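Concretely, the drain step I have in mind would look roughly like this (just a sketch; the OSD IDs are placeholders, I assume a CRUSH reweight so the change is permanent, and I have not run this yet):

ceph osd crush reweight osd.12 0    # repeat for every OSD on the old host
# wait for backfill to finish and the OSDs to become empty:
ceph osd df tree
ceph -s
# once they hold no data, take them out and remove them:
ceph osd out osd.12
ceph osd purge osd.12 --yes-i-really-mean-it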
While this approach seems to minimize risks, it can be time-consuming and
may not be the most efficient way to achieve the desired separation.
I would greatly appreciate the community's insights and suggestions on the
best approach to achieve this separation of Mons and OSDs with zero
downtime and minimal risk. If there are alternative methods or best
practices that can be recommended, please share your expertise.
Hi all, I have made a weird observation: 8 out of 12 MDS daemons no longer seem to report to the cluster:
# ceph fs status
con-fs2 - 1625 clients
=======
RANK STATE MDS ACTIVITY DNS INOS
0 active ceph-16 Reqs: 0 /s 0 0
1 active ceph-09 Reqs: 128 /s 4251k 4250k
2 active ceph-17 Reqs: 0 /s 0 0
3 active ceph-15 Reqs: 0 /s 0 0
4 active ceph-24 Reqs: 269 /s 3567k 3567k
5 active ceph-11 Reqs: 0 /s 0 0
6 active ceph-14 Reqs: 0 /s 0 0
7 active ceph-23 Reqs: 0 /s 0 0
POOL TYPE USED AVAIL
con-fs2-meta1 metadata 2169G 7081G
con-fs2-meta2 data 0 7081G
con-fs2-data data 1248T 4441T
con-fs2-data-ec-ssd data 705G 22.1T
con-fs2-data2 data 3172T 4037T
STANDBY MDS
ceph-08
ceph-10
ceph-12
ceph-13
VERSION DAEMONS
None ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) ceph-09, ceph-24, ceph-08, ceph-13
The version shows as "None" for these, and there are no stats. "ceph versions" reports only 4 of the 12 MDSes; 8 are not shown at all:
[root@gnosis ~]# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "osd": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1282
    },
    "mds": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 4
    },
    "overall": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1296
    }
}
Ceph status reports everything as up and OK:
[root@gnosis ~]# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 2w)
    mgr: ceph-03(active, since 61s), standbys: ceph-25, ceph-01, ceph-02, ceph-26
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1284 osds: 1282 up (since 31h), 1282 in (since 33h); 567 remapped pgs

  data:
    pools:   14 pools, 25065 pgs
    objects: 2.14G objects, 3.7 PiB
    usage:   4.7 PiB used, 8.4 PiB / 13 PiB avail
    pgs:     79908208/18438361040 objects misplaced (0.433%)
             23063 active+clean
             1225  active+clean+snaptrim_wait
             317   active+remapped+backfill_wait
             250   active+remapped+backfilling
             208   active+clean+snaptrim
             2     active+clean+scrubbing+deep

  io:
    client:   596 MiB/s rd, 717 MiB/s wr, 4.16k op/s rd, 3.04k op/s wr
    recovery: 8.7 GiB/s, 3.41k objects/s
My first thought was that the status module had failed. However, I cannot restart it (it is an always-on module), and an MGR fail-over did not help either.
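In case it helps, this is the kind of cross-check I intend to run next (just a sketch; the daemon names are taken from the listing above):

ceph tell mds.ceph-16 version    # one of the daemons that shows "None"
ceph mds metadata ceph-16
ceph tell mds.ceph-09 version    # one that still reports a version, for comparison
ceph mds metadata ceph-09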
Any ideas what is going on here?
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
We're running the latest Pacific on our production cluster and we've been
seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out
after 15.000000954s" error. We have reason to believe this happens each
time the RocksDB compaction process is launched on an OSD. My question
is: when the cluster detects that an OSD has timed out, does that interrupt
the compaction process? This seems to be what's happening, but it's not
immediately obvious. We are currently facing a seemingly endless loop of
random OSDs timing out, and if the compaction process is being interrupted
before it finishes, that might explain it.
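As a possible stop-gap while we investigate, this is roughly what I have in mind (just a sketch; the timeout value is arbitrary and osd.123 is a placeholder):

ceph config set osd osd_op_thread_timeout 60   # default is 15, matching the message above
ceph tell osd.123 compact                      # trigger a compaction manually during a quiet window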
--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
Hi all,
I have a 4-node Ceph cluster.
After I shut the cluster down, I tried to start it again, but this failed
because the ceph orch commands (such as "ceph orch status") hang.
How should I recover from this problem?
root@ceph-manager:/# ceph orch status ==> hung
^CInterrupted
root@ceph-manager:/# ceph status
  cluster:
    id:     4588ed80-352b-11ee-9eae-157ca4325420
    health: HEALTH_ERR
            2 failed cephadm daemon(s)
            1 filesystem is degraded
            1 filesystem is offline
            pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
            10 slow ops, oldest one blocked for 3736 sec, mon.ceph-osd0 has slow ops

  services:
    mon: 4 daemons, quorum ceph-manager,ceph-osd0,ceph-osd1,ceph-osd2 (age 64m)
    mgr: ceph-manager.kurjlh(active, since 64m), standbys: ceph-osd0.jodevs
    mds: 0/1 daemons up (1 failed), 2 standby
    osd: 3 osds: 3 up (since 64m), 3 in (since 2w)
         flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover

  data:
    volumes: 0/1 healthy, 1 failed
    pools:   11 pools, 243 pgs
    objects: 3.01k objects, 9.4 GiB
    usage:   28 GiB used, 2.8 TiB / 2.8 TiB avail
    pgs:     243 active+clean
root@ceph-manager:/# ceph health detail
HEALTH_ERR 2 failed cephadm daemon(s); 1 filesystem is degraded; 1 filesystem is offline; pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; 10 slow ops, oldest one blocked for 3741 sec, mon.ceph-osd0 has slow ops
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon rgw.sno_rgw.ceph-manager.umzmku on ceph-manager is in error state
    daemon rgw.sno_rgw.ceph-osd2.vfpmbs on ceph-osd2 is in error state
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs sno_cephfs is degraded
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs sno_cephfs is offline because no MDS is active for it.
[WRN] OSDMAP_FLAGS: pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
[WRN] SLOW_OPS: 10 slow ops, oldest one blocked for 3741 sec, mon.ceph-osd0 has slow ops
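For reference, my understanding is that the flags shown above need to be cleared before anything can recover; this is what I plan to run once the cluster responds again (just a sketch, please correct me if the order matters):

ceph osd unset pause        # clears both pauserd and pausewr
ceph osd unset nodown
ceph osd unset noout
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover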
Hi,
I want to create a new OSD on a 4TB Samsung MZ1L23T8HBLA-00A07
enterprise nvme device.
Creating the OSD works, but it cannot be initialized and therefore does not
start.
In the log I see an entry about a failed assert.
./src/os/bluestore/fastbmap_allocator_impl.cc: 405: FAILED ceph_assert((aligned_extent.length % l0_granularity) == 0)
Is this the culprit?
In addition, at the end of the log file a failed mount and a failed OSD init
are mentioned.
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluefs _check_allocations OP_FILE_UPDATE_INC invalid extent 1: 0x140000~10000: duplicate reference, ino 30
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluefs mount failed to replay log: (14) Bad address
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 20 bluefs _stop_alloc
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-43) _open_bluefs failed bluefs mount: (14) Bad address
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 10 bluefs maybe_verify_layout no memorized_layout in bluefs superblock
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-43) _open_db failed to prepare db environment:
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 1 bdev(0x5565c261fc00 /var/lib/ceph/osd/ceph-43/block) close
2023-09-11T16:30:04.940+0200 7f99aa28f3c0 -1 osd.43 0 OSD:init: unable to mount object store
2023-09-11T16:30:04.940+0200 7f99aa28f3c0 -1 ** ERROR: osd init failed: (5) Input/output error
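If it helps with the diagnosis, this is what I plan to try next (just a sketch; the device path is a placeholder, and zapping of course destroys whatever is on the OSD):

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-43
ceph-volume lvm zap --destroy /dev/nvmeXnY   # placeholder device, only if the OSD has to be recreated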
--
Regards,
ppa. Martin Konold
--
Martin Konold - Prokurist, CTO
KONSEC GmbH - make things real
Amtsgericht Stuttgart, HRB 23690
Geschäftsführer: Andreas Mack
Im Köller 3, 70794 Filderstadt, Germany
Dear Cephers,
Today brought us an eventful CTL meeting: it looks like Jitsi recently started
requiring user authentication
<https://jitsi.org/blog/authentication-on-meet-jit-si/> (anonymous users
will get a "Waiting for a moderator" modal), but authentication didn't work
against Google or GitHub accounts, so we had to move to the good old Google
Meet.
As a result of this, Neha has kindly set up a new private Slack channel
(#clt) to allow for quicker communication among CLT members (if you usually
attend the CLT meeting and have not been added, please ping any CLT member
to request that).
Now, let's move on to the important stuff:
*The latest Pacific Release (v16.2.14)*
*The Bad*
The 14th drop of the Pacific release has landed with a few hiccups:
- Some .deb packages were made available on downloads.ceph.com before
the release process was complete. Although this is not the first time this
has happened, we want to make sure it is the last, so we'd like to gather
ideas for improving the release publishing process. Neha encouraged everyone
to share ideas here:
- https://tracker.ceph.com/issues/62671
- https://tracker.ceph.com/issues/62672
- v16.2.14 also hit issues during the ceph-container stage. Laura
wanted to raise awareness of its current setbacks
<https://pad.ceph.com/p/16.2.14-struggles> and collect ideas to tackle
them:
- Enforce reviews and mandatory CI checks
- Rework the current approach to use simple Dockerfiles
<https://github.com/ceph/ceph/pull/43292>
- Call the Ceph community for help: ceph-container is currently
maintained part-time by a single contributor (Guillaume Abrioux). This
sub-project would benefit from the sound expertise on containers
among Ceph
users. If you have ever considered contributing to Ceph, but felt a bit
intimidated by C++, Paxos and race conditions, ceph-container is a good
place to shed your fear.
*The Good*
Not everything about v16.2.14 was going to be bleak: David Orman brought us
really good news. They tested v16.2.14 on a large production cluster
(10gbit/s+ RGW and ~13PiB raw) and found that it solved a major issue
affecting RGW in Pacific <https://github.com/ceph/ceph/pull/52552>.
*The Ugly*
During that testing, they noticed that ceph-mgr was occasionally OOM killed
(nothing new to 16.2.14, as it was previously reported). They already tried:
- Disabling modules (like the restful one, which was a suspect)
- Enabling debug 20
- Turning the pg autoscaler off
Debugging will continue to characterize this issue:
- Enable profiling (Mark Nelson)
- Try Bloomberg's Python mem profiler
<https://github.com/bloomberg/memray> (Matthew Leonard)
*Infrastructure*
*Reminder: Infrastructure Meeting Tomorrow. **11:30-12:30 Central Time*
Patrick brought up the following topics:
- Need to reduce the OVH spending ($72k/year, which is a sizeable chunk of
the Ceph Foundation budget; that's a lot fewer avocado sandwiches for the
next Cephalocon):
- Move services (e.g.: Chacra) to the Sepia lab
- Re-use CentOS (and any spare/unused) machines for devel purposes
- Current Ceph sys admins are overloaded, so devel/community involvement
would be much appreciated.
- More to be discussed in tomorrow's meeting. Please join if you
think you can help solve/improve the Ceph infrastructure!
*BTW*: today's CDM will be canceled, since no topics were proposed.
Kind Regards,
Ernesto
Hi all,
I recently started observing that our MGR seems to execute the same "config rm" commands over and over again, in the audit log:
9/10/23 10:03:24 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch
9/10/23 10:03:24 AM[INF]from='mgr.252911336 192.168.32.68:0/63' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch
9/10/23 10:03:19 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: dispatch
9/10/23 10:03:19 AM[INF]from='mgr.252911336 192.168.32.68:0/63' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: dispatch
9/10/23 10:02:24 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch
We don't have mirroring; it's a single cluster. What is going on here, and how can I stop it? I have already restarted all MGR daemons, to no avail.
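In case it is relevant, this is how I would check whether any such schedules are configured at all (just a sketch; I would expect both lists to be empty on a cluster without mirroring):

rbd mirror snapshot schedule ls --recursive
rbd trash purge schedule ls --recursive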
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello,
I am interested in the best-practice guidance for the following situation.
There is a Ceph cluster with CephFS deployed. There are three servers
dedicated to running MDS daemons: one active, one standby-replay, and one
standby. There is only a single rank.
Sometimes, servers need to be rebooted for reasons unrelated to Ceph.
What's the proper procedure to follow when restarting a node that currently
contains an active MDS server? The goal is to minimize the client downtime.
Ideally, they should not notice even if they play MP3s from the CephFS
filesystem (note that I haven't tested this exact scenario) - is this
achievable?
I tried to use the "ceph mds fail mds02" command while mds02 was active and
mds03 was standby-replay, to force the fail-over to mds03 so that I could
reboot mds02. Result: mds02 became standby, while mds03 went through
reconnect (30 seconds), rejoin (another 30 seconds), and replay (5 seconds)
phases. During the "reconnect" and "rejoin" phases, the "Activity" column
of "ceph fs status" is empty, which concerns me. It looks like I just
caused a 65-second downtime. After all of that, mds02 became
standby-replay, as expected.
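For completeness, the sequence I used, plus the check I would do beforehand (a sketch; <fsname> stands in for the actual filesystem name):

ceph fs get <fsname>    # look for allow_standby_replay in the flags line
ceph mds fail mds02
ceph fs status          # watch the standby-replay daemon take over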
Is there a better way? Or should I have simply rebooted mds02 without giving
it much thought?
--
Alexander E. Patrakov
Hello fellow ceph users,
I've been looking for a way to reduce the recovery time of a CephFS mount when the client's source IP changes. The sessions seem to stop working properly when that happens, and a reconnect takes ages. Is there a way to reduce that interval, or to make CephFS cope with roaming sessions? Another good example of where this happens is IPv6 privacy addresses.
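For illustration, the kind of manual intervention that seems to be needed today (a sketch; the daemon name and client ID are placeholders, and note that evicting also blocklists the old client address by default):

ceph tell mds.<daemon> session ls                   # find the stale session of the client that moved
ceph tell mds.<daemon> client evict id=<client-id>  # let the client establish a fresh session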
--
Alex D.
RedXen System & Infrastructure Administration
https://redxen.eu/