Release: 16.2.7 (pacific)
Infra: 4 x Nodes (4xOSD HDD), 3 x Nodes (mon/mds, 1 x OSD NVMe)
We recently had a couple of nodes go offline unexpectedly, triggering a rebalance which is still ongoing.
The OSDs on the restarted nodes are marked down and keep logging `authenticated timed out`; after a period of time they get automatically marked `out`.
We tried setting `noout` on the cluster, which stopped them from being marked out, but they still never authenticate.
We can access all the Ceph tooling from those nodes, which indicates they can reach the mons.
The node keyrings and clocks are both in sync.
We are at a loss as to why we cannot get the OSDs to authenticate.
Any help would be appreciated.
```
  cluster:
    id:     d5126e5a-882e-11ec-954e-90e2baec3d2c
    health: HEALTH_WARN
            7 failed cephadm daemon(s)
            2 stray daemon(s) not managed by cephadm
            insufficient standby MDS daemons available
            nodown,noout flag(s) set
            8 osds down
            2 hosts (8 osds) down
            Degraded data redundancy: 195930251/392039621 objects degraded (49.977%), 160 pgs degraded, 160 pgs undersized
            2 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum ceph5,ceph7,ceph6 (age 38h)
    mgr: ceph2.tofizp(active, since 9M), standbys: ceph1.vnkagp
    mds: 3/3 daemons up
    osd: 19 osds: 11 up (since 38h), 19 in (since 45h); 5 remapped pgs
         flags nodown,noout

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 257 pgs
    objects: 102.94M objects, 67 TiB
    usage:   68 TiB used, 50 TiB / 118 TiB avail
    pgs:     195930251/392039621 objects degraded (49.977%)
             3205811/392039621 objects misplaced (0.818%)
             155 active+undersized+degraded
             97 active+clean
             3 active+undersized+degraded+remapped+backfill_wait
             2 active+undersized+degraded+remapped+backfilling

  io:
    client:   511 B/s rd, 102 KiB/s wr, 0 op/s rd, 2 op/s wr
    recovery: 13 MiB/s, 16 objects/s
```
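
If it helps with diagnosis, these are checks we can run and share output from (a rough sketch; `osd.8` just stands in for any of the down OSDs):

```
# run from any node with an admin keyring
ceph auth get osd.8            # the key and caps the monitors expect for this OSD
ceph time-sync-status          # confirm the monitor clocks agree

# run on the affected node
cephadm logs --name osd.8      # the daemon's own log, to see the auth timeout in context
```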
Due to the ongoing South African energy crisis
<https://en.wikipedia.org/wiki/South_African_energy_crisis>, our datacenter
experienced a sudden power loss. We are running Ceph 17.2.5 deployed with
cephadm. Two of our OSDs did not start correctly, failing with the error below:
```
# ceph-bluestore-tool fsck --path /var/lib/ceph/ed7b2c16-b053-45e2-a1fe-bf3474f90508/osd.27/
2023-01-15T08:38:04.289+0200 7f2a2a03c540 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
/build/ceph-17.2.5/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::read_allocation_from_onodes(SimpleBitmap*, BlueStore::read_alloc_stats_t&)' thread 7f2a2a03c540 time 2023-01-15T08:39:31.304968+0200
/build/ceph-17.2.5/src/os/bluestore/BlueStore.cc: 18968: FAILED ceph_assert(collection_ref)
2023-01-15T08:39:31.298+0200 7f2a2a03c540 -1 bluestore::NCB::read_allocation_from_onodes::stray object 2#55:ffffffff:::2000055f327.00002287:head# not owned by any collection
 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x7f2a2acc07c6]
 2: /usr/lib/ceph/libceph-common.so.2(+0x27c9d8) [0x7f2a2acc09d8]
 3: (BlueStore::read_allocation_from_onodes(SimpleBitmap*, BlueStore::read_alloc_stats_t&)+0xa24) [0x560d6baf5754]
 4: (BlueStore::reconstruct_allocations(SimpleBitmap*, BlueStore::read_alloc_stats_t&)+0x5f) [0x560d6baf66ff]
 5: (BlueStore::read_allocation_from_drive_on_startup()+0x99) [0x560d6baf68b9]
 6: (BlueStore::_init_alloc(std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > >*)+0xaca) [0x560d6bb0c15a]
 7: (BlueStore::_open_db_and_around(bool, bool)+0x35c) [0x560d6bb380dc]
 8: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x250) [0x560d6bb3a8c0]
 9: main()
 10: __libc_start_main()
 11: _start()
*** Caught signal (Aborted) **
 in thread 7f2a2a03c540 thread_name:ceph-bluestore-
2023-01-15T08:39:31.306+0200 7f2a2a03c540 -1 /build/ceph-17.2.5/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::read_allocation_from_onodes(SimpleBitmap*, BlueStore::read_alloc_stats_t&)' thread 7f2a2a03c540 time 2023-01-15T08:39:31.304968+0200
/build/ceph-17.2.5/src/os/bluestore/BlueStore.cc: 18968: FAILED ceph_assert(collection_ref)
```
(complete log: https://gist.github.com/pvanheus/5c57455cacdc91afc9ce27fd489cae25)
Is there a way to recover from this? Or should I accept the OSDs as lost
and rebuild them?
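If it comes to that, my working assumption (corrections welcome) is to first attempt a repair, and otherwise write the OSD off and let cephadm redeploy it; the host and device names below are placeholders:

```
# attempt a repair first (it may well hit the same assert as fsck)
ceph-bluestore-tool repair --path /var/lib/ceph/ed7b2c16-b053-45e2-a1fe-bf3474f90508/osd.27/

# if the OSD has to be written off (assuming the remaining replicas are healthy)
ceph osd out 27
ceph osd destroy 27 --yes-i-really-mean-it          # keeps the OSD id available for reuse
ceph orch device zap <host> /dev/nvme0n1 --force    # placeholder device; lets cephadm redeploy onto it
```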
Thanks,
Peter
Hello,
I'm adding OSDs to a 5-node cluster running Quincy 17.2.5. The network is
a bonded 2x10G link. The issue I'm having is that the rebalance
operation seems to impact client I/O and the running VMs. The OSDs are
big 6.4 TB NVMe drives, so there will be a lot of data to move.
With previous versions it was easy to throttle the rebalance with
`ceph config set osd osd_max_backfills`, but since Quincy uses mClock,
those values are not used. In fact, the defaults are overridden to 1000.
If I'm understanding the mClock behavior correctly, it will use the estimated
osd_mclock_max_capacity_iops_ssd (benchmarked at OSD deploy time) and
allocate client/rebalance/backfill/trim/scrub I/O up to that many IOPS,
weighted according to what is defined in osd_mclock_profile (the default is
high_client_ops). Am I correct?
How could I throttle down the rebalance so it gives more headroom for
client I/O?
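The knobs I have found so far are these (`osd.0` is just an example daemon); I am not sure they are the right ones:

```
# the values mClock is currently working with
ceph config show osd.0 osd_mclock_profile
ceph config show osd.0 osd_mclock_max_capacity_iops_ssd

# keep the scheduler biased towards client traffic (this should already be the default)
ceph config set osd osd_mclock_profile high_client_ops
```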
Many thanks in advance.
--
Hi there
I am running Ceph version 17.2.5 and have deployed centralised logging as
per this guide:
https://ceph.io/en/news/blog/2022/centralized_logging/
The logs from the OSDs are not, however, showing up in the Grafana
dashboard, as per this screenshot:
[image: image.png]
The Promtail daemons are running on every node, including the OSD nodes. The Loki
server and Grafana are running on one of our monitor nodes.
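One thing I was not sure about: my understanding (possibly wrong) is that Promtail scrapes the Ceph log files on disk, so file logging has to be enabled for the OSD logs to show up at all. These are the settings and checks I would look at, following the blog post:

```
# cephadm deployments log to journald by default; enable logging to /var/log/ceph
ceph config set global log_to_file true
ceph config set global mon_cluster_log_to_file true

# confirm the promtail daemons are up on every host
ceph orch ps --daemon-type promtail
```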
Thanks for any clarifications you can provide.
Peter
Dear Ceph users,
my cluster is built with old hardware on a gigabit network, so I often
experience warnings like OSD_SLOW_PING_TIME_BACK. These in turn trigger
alert mails too often, forcing me to disable alerts entirely, which is not
sustainable. So my question is: is it possible to tell Ceph to ignore
(or at least not send alerts for) a given class of warnings?
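The closest thing I have found so far is muting individual health codes, e.g.:

```
# mute one health code so it no longer raises HEALTH_WARN (and the alerts built on it)
ceph health mute OSD_SLOW_PING_TIME_BACK 1w        # mute for one week
ceph health mute OSD_SLOW_PING_TIME_BACK --sticky  # or keep it muted until explicitly unmuted
ceph health unmute OSD_SLOW_PING_TIME_BACK
```

but I don't know whether that is the recommended approach or whether there is a proper per-warning alert filter.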
Thank you,
Nicola
Hello ceph community,
I have some questions about the PG autoscaler. I have a cluster with several pools. One of them is a CephFS pool which is behaving in an expected / sane way, and another is an RBD pool with an EC profile of k=2, m=2.
The cluster has about 60 drives across about 10 failure domains. (The failure domain is set to “chassis”; some chassis hold 4 blades each, and the rest have 1 host per chassis.)
The RBD EC pool has 66 TiB stored with 128 PGs. Each PG has about 500k objects in it, which seems like quite a lot. When rebalancing, this EC pool is always the long pole.
The confusing part is that I am getting inconsistent output on the status on the autoscaler. For example:
```
root@vis-mgmt:~# ceph osd pool autoscale-status | grep rbd_ec
rbd_ec   67241G   2.0   856.7T   0.1533   1.0   64   on   False
```
Which tells me I have 64 PG_NUM (a lie).
```
root@vis-mgmt:~# ceph osd pool ls detail | grep rbd_ec
pool 4 'rbd_ec' erasure profile ec22 size 4 min_size 3 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 120 pg_num_target 64 pgp_num_target 64 autoscale_mode on last_change 83396 lfor 0/83395/83393 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192 application rbd
```
This tells me I have 128 PGs (correct), but with a pgp_num that is not a power of 2 (120). Also, I am not sure what pg_num_target and pgp_num_target are and why they differ from pg_num and pgp_num.
Is there anything I can look into to find out whether the autoscaler is working correctly for this pool? Any other tweaks I need to do? It seems to me that with that capacity it ought to have more than 128 PGs…
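My reading of the output (please correct me if this is wrong) is that pg_num_target/pgp_num_target are the values the autoscaler wants to reach (64), while pg_num/pgp_num are the current values while the merge is still in progress, hence the 128 and the non-power-of-two 120. If the autoscaler really is misbehaving for this pool, my fallback plan would be something like:

```
# take the pool out of the autoscaler's hands and pin pg_num manually
ceph osd pool set rbd_ec pg_autoscale_mode off
ceph osd pool set rbd_ec pg_num 256

# or keep autoscaling on but stop it from shrinking the pool below its current size
ceph osd pool set rbd_ec pg_num_min 128
```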
Thank you!
George
Hi,
What is the safest way to add disk(s) to each of the nodes in the cluster?
Should it be done one by one, or can I add them all at once and let the cluster rebalance?
My concern is that if I add them all at once, the host-based EC rule means the rebalance will load every host at the same time.
On the other hand, if I add them one by one, one node will temporarily have more OSDs (and more load) than the others, which is also not a good setup, so I wonder which is the safer way.
(We have 9 nodes with host-based EC 4+2, and each new disk is going to carry 4 OSDs.)
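One approach I am considering (just an assumption on my part, not something I have validated) is to add all the disks at once but hold back data movement until every node has its new OSDs:

```
# pause data movement while the new OSDs are created on all nodes
ceph osd set norebalance
ceph osd set nobackfill
# ... deploy the new OSDs on all 9 nodes ...
ceph osd unset nobackfill
ceph osd unset norebalance
```

so the cluster only rebalances once, with the final CRUSH weights in place.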
Thank you
Hi,
I have two Ceph clusters in a multi-zone setup. The first one (master zone) is accessible to users for their interaction via RGW.
The second one is set to sync from the master zone, with the zone's tier type set to archive (to keep versions of all objects).
My question is: is there an option to set a lifecycle policy for the object versions saved in the archive zone? For example, keep only 5 versions per object, or delete versions older than one year?
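What I had in mind is a standard S3 lifecycle rule on the bucket, something like the sketch below (bucket name and endpoint are placeholders). What I don't know is whether the archive zone actually executes lifecycle rules on the versions it creates, or whether a "keep only N versions" rule is supported in my release:

```
# expire noncurrent versions older than one year (placeholder bucket/endpoint)
aws --endpoint-url https://rgw.example.com s3api put-bucket-lifecycle-configuration \
  --bucket mybucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-old-versions",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "NoncurrentVersionExpiration": {"NoncurrentDays": 365}
    }]
  }'
```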
Thanks a lot.
Dear Ceph-Users,
I am struggling to replace a disk. My Ceph cluster is not replacing the
old OSD even though I did:
`ceph orch osd rm 232 --replace`
OSD 232 is still shown in the OSD list, but the new HDD gets deployed as
a brand-new OSD. This wouldn't bother me much if the new OSD also got its
BlueStore DB placed on the NVMe, but it doesn't.
My steps:
1. `ceph orch osd rm 232 --replace`
2. Remove the failed HDD.
3. Add the new one.
4. Convert the disk in the server's BIOS so that the node has direct access to it; it shows up as /dev/sdt.
5. Enter maintenance mode and reboot the server; the drive is now /dev/sdm (the name the old drive had).
6. `ceph orch device zap node-x /dev/sdm`
7. A new OSD is placed on the cluster.
Can you give me a hint as to where I took a wrong turn? Why is the disk
not being reused as OSD 232?
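If it helps with diagnosis, these are the checks I can still run; my (possibly wrong) understanding is that the old id is only reused if OSD 232 is still marked as destroyed at the time the new disk is zapped:

```
ceph orch osd rm status            # is the removal/replacement still pending?
ceph osd tree | grep destroyed     # was osd.232 kept in the CRUSH map as "destroyed"?
ceph orch device ls node-x         # does cephadm see the new drive as available?
```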
Best
Ken