I had to temporarily disconnect the network on my entire Ceph cluster, so I
prepared the cluster by following what appears to be some incomplete
advice.
I did the following before disconnecting the network:
#ceph osd set noout
#ceph osd set norecover
#ceph osd set norebalance
#ceph osd set nobackfill
#ceph osd set nodown
#ceph osd set pause
Now, all the ceph services are still running, but I cannot undo any flags:
root@proxmox01:~# ceph osd unset pause
2024-02-22T13:16:02.220+0000 7f0aab5a26c0 0 monclient(hunting):
authenticate timed out after 300
[errno 110] RADOS timed out (error connecting to the cluster)
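For reference, this is the sequence I expect to run once the monitors respond again; I can check them locally via the admin socket, since the normal client path times out (assuming the mon id matches the hostname):
#ceph daemon mon.$(hostname -s) mon_status
Once quorum is back, clear the flags in roughly the reverse order:
#ceph osd unset pause
#ceph osd unset nodown
#ceph osd unset nobackfill
#ceph osd unset norebalance
#ceph osd unset norecover
#ceph osd unset noout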
Any advice on how to recover would be greatly appreciated.
Thank you,
-Chip
Update: we have run fsck and reshard on all BlueStore volumes; it seems sharding had not been applied.
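For reference, the per-OSD pass looked roughly like this on a non-containerized deployment (the OSD id and sharding string below are illustrative, not our exact values):
systemctl stop ceph-osd@0
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0
ceph-bluestore-tool show-sharding --path /var/lib/ceph/osd/ceph-0
ceph-bluestore-tool reshard --path /var/lib/ceph/osd/ceph-0 --sharding "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P"
systemctl start ceph-osd@0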
Unfortunately scrubs and deep-scrubs are still stuck on the PGs of the pool suffering the issue, while other PGs scrub fine.
The next step will be to remove the cache tier as suggested, but that is not possible yet, as the PGs need to be scrubbed before the tier agent can activate.
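For reference, the plan once the PGs scrub is the workaround from the thread plus the documented removal, roughly as below; the base pool name vms is an assumption taken from the commands earlier in the thread:
ceph osd pool set vms_cache hit_set_count 0
ceph osd tier cache-mode vms_cache proxy
rados -p vms_cache cache-flush-evict-all
ceph osd tier remove-overlay vms
ceph osd tier remove vms vms_cache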
As we are struggling to make this cluster work again, any help would be greatly appreciated.
Cédric
> On 20 Feb 2024, at 20:22, Cedric <yipikai7(a)gmail.com> wrote:
>
> Thanks Eugen, sorry about the missed reply to all.
>
> The reason we still have the cache tier is that we were not able to flush all dirty entries to remove it (as per the procedure); the cluster was migrated from HDD/SSD to NVMe a while ago, but the tiering remains, unfortunately.
>
> So actually we are trying to understand the root cause
>
> On Tue, Feb 20, 2024 at 1:43 PM Eugen Block <eblock(a)nde.ag> wrote:
>>
>> Please don't drop the list from your response.
>>
>> The first question that comes to mind is: why do you have a cache tier if
>> all your pools are on NVMe devices anyway? I don't see any benefit here.
>> Did you try the suggested workaround and disable the cache-tier?
>>
>> Zitat von Cedric <yipikai7(a)gmail.com>:
>>
>>> Thanks Eugen, see attached infos.
>>>
>>> Some more details:
>>>
>>> - commands that actually hang: ceph balancer status ; rbd -p vms ls ;
>>> rados -p vms_cache cache-flush-evict-all
>>> - all scrubs running on vms_cache PGs stall / restart in a loop
>>> without actually doing anything
>>> - all I/O is 0, both in ceph status and in iostat on the nodes
>>>
>>> On Tue, Feb 20, 2024 at 10:00 AM Eugen Block <eblock(a)nde.ag> wrote:
>>>>
>>>> Hi,
>>>>
>>>> some more details would be helpful, for example what's the pool size
>>>> of the cache pool? Did you issue a PG split before or during the
>>>> upgrade? This thread [1] deals with the same problem; the described
>>>> workaround was to set hit_set_count to 0 and disable the cache layer
>>>> until that is resolved. Afterwards you could enable the cache layer
>>>> again. But keep in mind that the code for cache tiering is entirely
>>>> removed in Reef (IIRC).
>>>>
>>>> Regards,
>>>> Eugen
>>>>
>>>> [1]
>>>> https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-addin…
>>>>
>>>> Zitat von Cedric <yipikai7(a)gmail.com>:
>>>>
>>>>> Hello,
>>>>>
>>>>> Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13), we
>>>>> encountered an issue with a cache pool becoming completely stuck,
>>>>> relevant messages below:
>>>>>
>>>>> pg xx.x has invalid (post-split) stats; must scrub before tier agent
>>>>> can activate
>>>>>
>>>>> In the OSD logs, scrubs keep starting in a loop without ever succeeding
>>>>> for all PGs of this pool.
>>>>>
>>>>> What we already tried without luck so far:
>>>>>
>>>>> - shutdown / restart OSD
>>>>> - rebalance pg between OSD
>>>>> - raise the memory on OSD
>>>>> - repeer PG
>>>>>
>>>>> Any idea what is causing this? Any help would be greatly appreciated.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Cédric
Hi Folks,
We are excited to announce plans for building a larger Ceph-S3 setup.
To ensure its success, extensive testing is needed in advance.
Some of these tests don't need a full-blown Ceph cluster on hardware
but still require meeting specific logical requirements, such as a
multi-site S3 setup. To address this, we're pleased to introduce our
ceph-s3-box test environment, which you can access on GitHub:
https://github.com/hetznercloud/ceph-s3-box
In the spirit of collaboration and knowledge sharing, we've made this
testing environment publicly available today. We hope that it proves
as beneficial to you as it has been for us.
If you have any questions or suggestions, please don't hesitate to reach out.
Cheers,
Ansgar
Good morning,
I am trying to understand Ceph snapshot sizing. For example, if I have a 2.7
GB volume and I create a snap on it, the sizing says:
(BEFORE SNAP)
rbd du volumes/volume-d954915c-1dc1-41cb-8bf0-0c67e7b6e080
NAME PROVISIONED USED
volume-d954915c-1dc1-41cb-8bf0-0c67e7b6e080 10 GiB 2.7 GiB
(AFTER SNAP)
rbd du volumes/volume-d954915c-1dc1-41cb-8bf0-0c67e7b6e080
NAME PROVISIONED USED
volume-d954915c-1dc1-41cb-8bf0-0c67e7b6e080@snap01 10 GiB 2.7 GiB
volume-d954915c-1dc1-41cb-8bf0-0c67e7b6e080 10 GiB 0 B
<TOTAL> 10 GiB 2.7 GiB
Why is the snap 2.7 GB? Shouldn't it be 0 GB in the beginning, and only grow
once COW starts doing its thing (copying the original blocks to the snap
before they are overwritten with new ones)?
Am I wrong?
Thank you.
Hi folks,
I am trying to set up a new Ceph S3 multisite setup, and it looks to me
like DNS-style S3 is broken in multisite: when rgw_dns_name is
configured, the `radosgw-admin period update --commit` from the newly
added member will not succeed!
It looks like whenever hostnames are configured, it breaks on the newly
added cluster:
https://docs.ceph.com/en/reef/radosgw/multisite/#setting-a-zonegroup
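For reference, the hostname change that seems to trigger it is done roughly as in the linked docs; the zonegroup name "default" below is just a placeholder:
radosgw-admin zonegroup get --rgw-zonegroup=default > zonegroup.json
# edit the "hostnames" list in zonegroup.json, e.g. "hostnames": ["s3.example.com"]
radosgw-admin zonegroup set --rgw-zonegroup=default < zonegroup.json
radosgw-admin period update --commit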
Thanks for any advice!
Ansgar
Hello,
I deployed RGW and NFSGW services over a Ceph (version 17.2.6) cluster. Both services are accessed through 2 (separate) ingresses, which work as expected when contacted by clients.
However, I'm experiencing a problem when running both ingresses on the same cluster.
The keepalived logs are full of "(VI_0) received an invalid passwd!" lines, because both ingresses use the same virtual router id, so I'm trying to introduce an additional parameter in the service definition manifests to work around the problem (first_virtual_router_id, default value is 50). Below are the manifest contents:
service_type: ingress
service_id: ingress.rgw
service_name: ingress.rgw
placement:
  hosts:
    - c00.domain.org
    - c01.domain.org
    - c02.domain.org
spec:
  backend_service: rgw.rgw
  frontend_port: 8080
  monitor_port: 1967
  virtual_ips_list:
    - X.X.X.200/24
  first_virtual_router_id: 60
service_type: ingress
service_id: nfs.nfsgw
service_name: ingress.nfs.nfsgw
placement:
  count: 2
spec:
  backend_service: nfs.nfsgw
  frontend_port: 2049
  monitor_port: 9049
  virtual_ip: X.X.X.222/24
  first_virtual_router_id: 70
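For completeness, I apply the manifests with the orchestrator roughly like this (the file names are just placeholders):
ceph orch apply -i ingress-rgw.yaml
ceph orch apply -i ingress-nfs.yaml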
When I apply the manifests I get the following error, for both ingress definitions:
Error EINVAL: ServiceSpec: __init__() got an unexpected keyword argument 'first_virtual_router_id'
even though the documentation for the Quincy version describes the option and includes a similar example at: https://docs.ceph.com/en/quincy/cephadm/services/rgw
Both manifests are working smoothly if I remove the first_virtual_router_id line.
Any ideas on how I can troubleshoot the issue?
Thanks in advance
Ramon
--
Ramon Orrù
Servizio di Calcolo
Laboratori Nazionali di Frascati
Istituto Nazionale di Fisica Nucleare
Via E. Fermi, 54 - 00044 Frascati (RM) Italy
Tel. +39 06 9403 2345
Estimate on release timeline for 17.2.8?
- after pacific 16.2.15 and reef 18.2.2 hotfix
(https://tracker.ceph.com/issues/64339,
https://tracker.ceph.com/issues/64406)
Estimate on release timeline for 19.2.0?
- target April, depending on testing and RCs
- Testing plan for Squid beyond dev freeze (regression and upgrade
tests, performance tests, RCs)
Can we fix old.ceph.com?
- continued discussion about the need to revive the pg calc tool
T release name?
- please add and vote for suggestions in https://pad.ceph.com/p/t
- need name before we can open "t kickoff" pr
I have logged this as https://tracker.ceph.com/issues/64213
On 16/01/2024 14:18, DERUMIER, Alexandre wrote:
> Hi,
>
>>> ImportError: PyO3 modules may only be initialized once per
>>> interpreter
>>> process
>>>
>>> and ceph -s reports "Module 'dashboard' has failed dependency: PyO3
>>> modules may only be initialized once per interpreter process
> We have the same problem on Proxmox 8 (based on Debian 12) with Ceph
> Quincy or Reef.
>
> It seems to be related to the Python version on Debian 12.
>
> (We have no fix for this currently.)
>
>
>
Hi everyone,
You are invited to join us at the User + Dev meeting this week Thursday,
February 22 at 10:00 AM Eastern Time!
Focus Topic: CephFS Snapshots Evaluation
Presented by: Enrico Bocchi and Abhishek Lekshmanan, Ceph operators from
CERN
From the presenters:
Ceph at CERN provides block, object, and file storage backing the IT
infrastructure of the Organization. CephFS, in particular, is largely used
through the integration with OpenStack Manila by container-based workloads
(Kubernetes, OpenShift), HPC MPI clusters, and as a general-purpose
networked file system for enterprise groupware and open infrastructure
technologies (code/software repositories, monitoring, analytics, etc.).
Our presentation focuses on CephFS snapshots and their implications on
performance and stability. Snapshots would be a valuable addition to our
existing CephFS service, as they allow for storage rollback and disaster
recovery through mirroring. According to our observations, however, they
introduce a non-negligible performance penalty and may jeopardize the
stability of the file system.
In particular, we would like to discuss:
1. Experiences with CephFS snapshots from other operators in the Ceph
community.
2. Tools and strategies one can deploy to pre-empt or mitigate issues.
3. How to effectively contribute with upstream developers and interested
community users to address the identified limitations.
Feel free to add questions or additional topics under the "Open Discussion"
section on the agenda: https://pad.ceph.com/p/ceph-user-dev-monthly-minutes
If you have an idea for a focus topic you'd like to present at a future
meeting, you are welcome to submit it to this Google Form:
https://docs.google.com/forms/d/e/1FAIpQLSdboBhxVoBZoaHm8xSmeBoemuXoV_rmh4v…
Any Ceph user or developer is eligible to submit!
Thanks,
Neha