Release: 16.2.7 (pacific)
Infra: 4 x Nodes (4xOSD HDD), 3 x Nodes (mon/mds, 1 x OSD NVMe)
We recently had a couple of nodes go offline unexpectedly, triggering a rebalance which is still ongoing.
The OSDs on the restarted nodes are marked down and keep logging `authenticated timed out`; after a period of time they get automatically marked `out`.
We tried setting `noout` on the cluster, which stopped them from being marked out, but they still never authenticate.
We can access all the Ceph tooling from those nodes, which indicates they can reach the mons.
The node keyrings and clocks are both in sync.
We are at a loss as to why we cannot get the OSDs to authenticate.
Any help would be appreciated.
```
  cluster:
    id:     d5126e5a-882e-11ec-954e-90e2baec3d2c
    health: HEALTH_WARN
            7 failed cephadm daemon(s)
            2 stray daemon(s) not managed by cephadm
            insufficient standby MDS daemons available
            nodown,noout flag(s) set
            8 osds down
            2 hosts (8 osds) down
            Degraded data redundancy: 195930251/392039621 objects degraded (49.977%), 160 pgs degraded, 160 pgs undersized
            2 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum ceph5,ceph7,ceph6 (age 38h)
    mgr: ceph2.tofizp(active, since 9M), standbys: ceph1.vnkagp
    mds: 3/3 daemons up
    osd: 19 osds: 11 up (since 38h), 19 in (since 45h); 5 remapped pgs
         flags nodown,noout

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 257 pgs
    objects: 102.94M objects, 67 TiB
    usage:   68 TiB used, 50 TiB / 118 TiB avail
    pgs:     195930251/392039621 objects degraded (49.977%)
             3205811/392039621 objects misplaced (0.818%)
             155 active+undersized+degraded
             97 active+clean
             3 active+undersized+degraded+remapped+backfill_wait
             2 active+undersized+degraded+remapped+backfilling

  io:
    client:   511 B/s rd, 102 KiB/s wr, 0 op/s rd, 2 op/s wr
    recovery: 13 MiB/s, 16 objects/s
```
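
If it helps with diagnosis, these are checks we can run and share output from (a rough sketch; `osd.8` just stands in for any of the down OSDs):

```
# run from any node with an admin keyring
ceph auth get osd.8            # the key and caps the monitors expect for this OSD
ceph time-sync-status          # confirm the monitor clocks agree

# run on the affected node
cephadm logs --name osd.8      # the daemon's own log, to see the auth timeout in context
```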
Due to the ongoing South African energy crisis
<https://en.wikipedia.org/wiki/South_African_energy_crisis>, our datacenter
experienced a sudden power loss. We are running Ceph 17.2.5 deployed with
cephadm. Two of our OSDs did not start correctly, failing with the error below:
```
# ceph-bluestore-tool fsck --path /var/lib/ceph/ed7b2c16-b053-45e2-a1fe-bf3474f90508/osd.27/
2023-01-15T08:38:04.289+0200 7f2a2a03c540 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
/build/ceph-17.2.5/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::read_allocation_from_onodes(SimpleBitmap*, BlueStore::read_alloc_stats_t&)' thread 7f2a2a03c540 time 2023-01-15T08:39:31.304968+0200
/build/ceph-17.2.5/src/os/bluestore/BlueStore.cc: 18968: FAILED ceph_assert(collection_ref)
2023-01-15T08:39:31.298+0200 7f2a2a03c540 -1 bluestore::NCB::read_allocation_from_onodes::stray object 2#55:ffffffff:::2000055f327.00002287:head# not owned by any collection
 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x7f2a2acc07c6]
 2: /usr/lib/ceph/libceph-common.so.2(+0x27c9d8) [0x7f2a2acc09d8]
 3: (BlueStore::read_allocation_from_onodes(SimpleBitmap*, BlueStore::read_alloc_stats_t&)+0xa24) [0x560d6baf5754]
 4: (BlueStore::reconstruct_allocations(SimpleBitmap*, BlueStore::read_alloc_stats_t&)+0x5f) [0x560d6baf66ff]
 5: (BlueStore::read_allocation_from_drive_on_startup()+0x99) [0x560d6baf68b9]
 6: (BlueStore::_init_alloc(std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > >*)+0xaca) [0x560d6bb0c15a]
 7: (BlueStore::_open_db_and_around(bool, bool)+0x35c) [0x560d6bb380dc]
 8: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x250) [0x560d6bb3a8c0]
 9: main()
 10: __libc_start_main()
 11: _start()
*** Caught signal (Aborted) **
 in thread 7f2a2a03c540 thread_name:ceph-bluestore-
2023-01-15T08:39:31.306+0200 7f2a2a03c540 -1 /build/ceph-17.2.5/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::read_allocation_from_onodes(SimpleBitmap*, BlueStore::read_alloc_stats_t&)' thread 7f2a2a03c540 time 2023-01-15T08:39:31.304968+0200
/build/ceph-17.2.5/src/os/bluestore/BlueStore.cc: 18968: FAILED ceph_assert(collection_ref)
```
(complete log: https://gist.github.com/pvanheus/5c57455cacdc91afc9ce27fd489cae25)
Is there a way to recover from this? Or should I accept the OSDs as lost
and rebuild them?
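If it comes to that, my working assumption (corrections welcome) is to first attempt a repair, and otherwise write the OSD off and let cephadm redeploy it; the host and device names below are placeholders:

```
# attempt a repair first (it may well hit the same assert as fsck)
ceph-bluestore-tool repair --path /var/lib/ceph/ed7b2c16-b053-45e2-a1fe-bf3474f90508/osd.27/

# if the OSD has to be written off (assuming the remaining replicas are healthy)
ceph osd out 27
ceph osd destroy 27 --yes-i-really-mean-it          # keeps the OSD id available for reuse
ceph orch device zap <host> /dev/nvme0n1 --force    # placeholder device; lets cephadm redeploy onto it
```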
Thanks,
Peter
Hello,
I'm adding OSDs to a 5-node cluster running Quincy 17.2.5. The network is
a bonded 2x10G link. The issue I'm having is that the rebalance
operation seems to impact client I/O and the running VMs. The OSDs are
big 6.4 TB NVMe drives, so there will be a lot of data to move.
With previous versions it was easy to throttle the rebalance with
`ceph config set osd osd_max_backfills`, but since Quincy uses mClock,
those values are not used. In fact, the defaults are overridden to 1000.
If I'm understanding the mClock behavior correctly, it will use the estimated
osd_mclock_max_capacity_iops_ssd (benchmarked at OSD deploy time) and
allocate client/rebalance/backfill/trim/scrub I/O up to that many IOPS,
weighted according to what is defined in osd_mclock_profile (the default is
high_client_ops). Am I correct?
How could I throttle down the rebalance so it gives more headroom for
client I/O?
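The knobs I have found so far are these (`osd.0` is just an example daemon); I am not sure they are the right ones:

```
# the values mClock is currently working with
ceph config show osd.0 osd_mclock_profile
ceph config show osd.0 osd_mclock_max_capacity_iops_ssd

# keep the scheduler biased towards client traffic (this should already be the default)
ceph config set osd osd_mclock_profile high_client_ops
```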
Many thanks in advance.
--
Hi there
I am running Ceph version 17.2.5 and have deployed centralised logging as
per this guide:
https://ceph.io/en/news/blog/2022/centralized_logging/
The logs from the OSDs are not, however, showing up in the Grafana
dashboard, as per this screenshot:
[image: image.png]
The Promtail daemons are running on every node, including the OSD nodes. The Loki
server and Grafana are running on one of our monitor nodes.
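One thing I was not sure about: my understanding (possibly wrong) is that Promtail scrapes the Ceph log files on disk, so file logging has to be enabled for the OSD logs to show up at all. These are the settings and checks I would look at, following the blog post:

```
# cephadm deployments log to journald by default; enable logging to /var/log/ceph
ceph config set global log_to_file true
ceph config set global mon_cluster_log_to_file true

# confirm the promtail daemons are up on every host
ceph orch ps --daemon-type promtail
```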
Thanks for any clarifications you can provide.
Peter
Dear Ceph users,
my cluster is built with old hardware on a gigabit network, so I often
experience warnings like OSD_SLOW_PING_TIME_BACK. These in turn trigger
alert mails too often, forcing me to disable alerts entirely, which is not
sustainable. So my question is: is it possible to tell Ceph to ignore
(or at least not send alerts for) a given class of warnings?
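The closest thing I have found so far is muting individual health codes, e.g.:

```
# mute one health code so it no longer raises HEALTH_WARN (and the alerts built on it)
ceph health mute OSD_SLOW_PING_TIME_BACK 1w        # mute for one week
ceph health mute OSD_SLOW_PING_TIME_BACK --sticky  # or keep it muted until explicitly unmuted
ceph health unmute OSD_SLOW_PING_TIME_BACK
```

but I don't know whether that is the recommended approach or whether there is a proper per-warning alert filter.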
Thank you,
Nicola
Hello ceph community,
I have some questions about the PG autoscaler. I have a cluster with several pools. One of them is a CephFS pool which is behaving in an expected / sane way, and another is an RBD pool with an EC profile of k=2, m=2.
The cluster has about 60 drives across about 10 failure domains. (The failure domain is set to “chassis”; some chassis hold 4 blades each, and the rest have 1 host per chassis.)
The RBD EC pool has 66 TiB stored with 128 PGs. Each PG has about 500k objects in it, which seems like quite a lot. When rebalancing, this EC pool is always the long pole.
The confusing part is that I am getting inconsistent output on the status on the autoscaler. For example:
```
root@vis-mgmt:~# ceph osd pool autoscale-status | grep rbd_ec
rbd_ec   67241G   2.0   856.7T   0.1533   1.0   64   on   False
```
Which tells me I have 64 PG_NUM (a lie).
```
root@vis-mgmt:~# ceph osd pool ls detail | grep rbd_ec
pool 4 'rbd_ec' erasure profile ec22 size 4 min_size 3 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 120 pg_num_target 64 pgp_num_target 64 autoscale_mode on last_change 83396 lfor 0/83395/83393 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192 application rbd
```
This tells me I have 128 PGs (correct), but with a pgp_num that is not a power of 2 (120). Also, I am not sure what pg_num_target and pgp_num_target are and why they differ from pg_num and pgp_num.
Is there anything I can look into to find out whether the autoscaler is working correctly for this pool? Any other tweaks I need to do? It seems to me that with that capacity it ought to have more than 128 PGs…
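My reading of the output (please correct me if this is wrong) is that pg_num_target/pgp_num_target are the values the autoscaler wants to reach (64), while pg_num/pgp_num are the current values while the merge is still in progress, hence the 128 and the non-power-of-two 120. If the autoscaler really is misbehaving for this pool, my fallback plan would be something like:

```
# take the pool out of the autoscaler's hands and pin pg_num manually
ceph osd pool set rbd_ec pg_autoscale_mode off
ceph osd pool set rbd_ec pg_num 256

# or keep autoscaling on but stop it from shrinking the pool below its current size
ceph osd pool set rbd_ec pg_num_min 128
```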
Thank you!
George
Hi,
What is the safest way to add disk(s) to each of the nodes in the cluster?
Should it be done one by one, or can I add them all at once and let the cluster rebalance?
My concern is that if I add them all at once, the host-based EC rule means the rebalance will load every host at the same time.
On the other hand, if I add them one by one, one node will temporarily have more OSDs (and more load) than the others, which is also not a good setup, so I wonder which is the safer way.
(We have 9 nodes with host-based EC 4+2, and each new disk is going to carry 4 OSDs.)
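One approach I am considering (just an assumption on my part, not something I have validated) is to add all the disks at once but hold back data movement until every node has its new OSDs:

```
# pause data movement while the new OSDs are created on all nodes
ceph osd set norebalance
ceph osd set nobackfill
# ... deploy the new OSDs on all 9 nodes ...
ceph osd unset nobackfill
ceph osd unset norebalance
```

so the cluster only rebalances once, with the final CRUSH weights in place.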
Thank you
Hi,
I have two Ceph clusters in a multi-zone setup. The first one (master zone) is accessible to users for their interaction via RGW.
The second one is set to sync from the master zone, with the zone's tier type set to archive (to keep versions of all objects).
My question is: is there an option to set a lifecycle policy for the object versions saved in the archive zone? For example, keep only 5 versions per object, or delete versions older than one year?
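What I had in mind is a standard S3 lifecycle rule on the bucket, something like the sketch below (bucket name and endpoint are placeholders). What I don't know is whether the archive zone actually executes lifecycle rules on the versions it creates, or whether a "keep only N versions" rule is supported in my release:

```
# expire noncurrent versions older than one year (placeholder bucket/endpoint)
aws --endpoint-url https://rgw.example.com s3api put-bucket-lifecycle-configuration \
  --bucket mybucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-old-versions",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "NoncurrentVersionExpiration": {"NoncurrentDays": 365}
    }]
  }'
```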
Thanks a lot.
Dear Ceph-Users,
I am struggling to replace a disk. My Ceph cluster is not replacing the
old OSD even though I did:
`ceph orch osd rm 232 --replace`
OSD 232 is still shown in the OSD list, but the new HDD gets deployed as
a brand-new OSD. This wouldn't bother me much if the new OSD also got its
BlueStore DB placed on the NVMe, but it doesn't.
My steps:
1. `ceph orch osd rm 232 --replace`
2. Remove the failed HDD.
3. Add the new one.
4. Convert the disk in the server's BIOS so that the node has direct access to it; it shows up as /dev/sdt.
5. Enter maintenance mode and reboot the server; the drive is now /dev/sdm (the name the old drive had).
6. `ceph orch device zap node-x /dev/sdm`
7. A new OSD is placed on the cluster.
Can you give me a hint as to where I took a wrong turn? Why is the disk
not being reused as OSD 232?
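If it helps with diagnosis, these are the checks I can still run; my (possibly wrong) understanding is that the old id is only reused if OSD 232 is still marked as destroyed at the time the new disk is zapped:

```
ceph orch osd rm status            # is the removal/replacement still pending?
ceph osd tree | grep destroyed     # was osd.232 kept in the CRUSH map as "destroyed"?
ceph orch device ls node-x         # does cephadm see the new drive as available?
```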
Best
Ken