Hi Ceph users
We are using Ceph Pacific (16) in this specific deployment.
In our use case we do not want our users to be able to generate signature v4 (presigned) URLs, because these bypass the policies we set on buckets (e.g. IP restrictions).
Currently we have a sidecar reverse proxy running that filters out requests carrying signature-URL-specific query parameters.
This is obviously not very efficient and we are looking to replace this somehow in the future.
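For context, this is the kind of URL generation we want to block (endpoint, bucket and object names are just examples, using the AWS CLI against RGW):
```bash
# Any user with valid credentials can mint a signature v4 URL; the resulting
# link then works from any IP, bypassing the bucket policy.
aws --endpoint-url https://rgw.example.com s3 presign \
    s3://somebucket/somefile.bin --expires-in 3600
# The URL carries X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Signature=... query
# parameters, which is what our sidecar proxy currently matches to return 403.
```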
1. Is there an option in RGW to disable these signed URLs (e.g. returning status 403)?
2. If not is this planned or would it make sense to add it as a configuration option?
3. Or is the behaviour of signature v4 URLs not respecting bucket policies in RGW a bug, and should the policies actually be applied?
Thank you for your help, and let me know if you have any questions.
Marc Singer
Hi folks,
I am currently testing erasure-code-lrc [1] in a multi-room, multi-rack setup.
The idea is to be able to repair disk failures within the rack itself to
lower bandwidth usage.
```bash
ceph osd erasure-code-profile set lrc_hdd \
plugin=lrc \
crush-root=default \
crush-locality=rack \
crush-failure-domain=host \
crush-device-class=hdd \
mapping=__DDDDD__DDDDD__DDDDD__DDDDD \
layers='
[
[ "_cDDDDD_cDDDDD_cDDDDD_cDDDDD", "" ],
[ "cDDDDDD_____________________", "" ],
[ "_______cDDDDDD______________", "" ],
[ "______________cDDDDDD_______", "" ],
[ "_____________________cDDDDDD", "" ],
]' \
crush-steps='[
[ "choose", "room", 4 ],
[ "choose", "rack", 1 ],
[ "chooseleaf", "host", 7 ],
]'
```
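For reference, a pool can be created from this profile and the resulting mapping inspected like so (pool name and PG count are examples):
```bash
ceph osd pool create lrc_test 32 32 erasure lrc_hdd   # EC pool using the profile
ceph osd crush rule dump lrc_test                     # show the generated multi-step rule
ceph pg ls-by-pool lrc_test                           # check which OSDs each PG maps to
```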
The rule picks 4 out of 5 rooms and keeps each PG within one rack, as expected!
However, it looks like the PGs will not move to another room if a PG is
undersized or an entire room or rack is down!
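The behaviour can also be checked offline against the compiled CRUSH map (the rule id and osd id below are examples):
```bash
ceph osd getcrushmap -o crushmap.bin                      # export the binary CRUSH map
crushtool -i crushmap.bin --test --rule 1 --num-rep 28 --show-mappings
crushtool -i crushmap.bin --test --rule 1 --num-rep 28 \
    --weight 12 0 --show-bad-mappings                     # re-run with osd.12 "failed"
```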
Questions:
* Am I missing something to allow LRC PGs to move across racks/rooms for repair?
* Is it even possible to build such a 'multi-stage' CRUSH map?
Thanks for your help,
Ansgar
[1] https://docs.ceph.com/en/quincy/rados/operations/erasure-code-lrc/
I want to perform a non-cephadm upgrade from Quincy to Reef. The reason for not using cephadm is that we do not want to run Ceph in containers.
My test deployment is as given below.
Total cluster hosts : 5
ceph-mon hosts: 3
ceph-mgr hosts: 3 (the active ceph-mgr on one node, and the other ceph-mgr daemons each on a ceph-mon host)
ceph-mds : 1
ceph-osd : 5 (one ceph-osd on each host in the cluster)
While following the steps at https://docs.ceph.com/en/latest/releases/reef/#upgrading-non-cephadm-cluste… I got stuck at the step "Upgrade monitors by installing the new packages and restarting the monitor daemons." When I try to upgrade only ceph-mon using "apt upgrade ceph-mon", it upgrades all packages including ceph-mgr, ceph-mds, ceph-osd, etc., as the ceph-mon package depends on them.
My question is: does this mean I need to upgrade all Ceph packages (ceph, ceph-common) and restart only the monitor daemons first? Or is there a way to upgrade only the ceph-mon package first, then ceph-mgr, ceph-osd, and so on?
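In other words, is the intended flow roughly the following (a sketch, assuming the Debian packages and their systemd targets)?
```bash
# On each mon host: upgrade the binaries, but restart only the monitor.
apt update
apt install --only-upgrade ceph ceph-common ceph-mon ceph-mgr ceph-osd ceph-mds
systemctl restart ceph-mon.target   # running daemons keep the old code until restarted
ceph versions                       # the "mon" section should now report the Reef version
# ...then restart ceph-mgr.target, ceph-osd.target and ceph-mds.target on
# their hosts, in the order given in the release notes.
```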
I have deployed a Ceph cluster with cephadm which has three monitors and
three OSDs.
Each node has one interface on the 192.168.0.0/24 network.
I want to change the addresses of the machines to the 10.4.4.0/24 range.
Is there a solution for this change without data loss or downtime?
I changed the public_network in the mon config and changed the node IPs,
but it did not work.
How can I solve this problem?
```
ceph orch host ls
HOST     ADDR           LABELS      STATUS
ceph-01  192.168.0.130  _admin,rgw
ceph-02  192.168.0.131  _admin,rgw
ceph-03  192.168.0.132  _admin,rgw
3 hosts in cluster
```
```
[root@ceph-01 ~]# ceph config get mon public_network
192.168.0.0/24
```
```
[root@ceph-01 ~]# ceph orch ls
NAME                               PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                       ?:9093,9094  1/1      112s ago   9M   count:1
ceph-exporter                                   3/3      114s ago   8M   *
crash                                           3/3      114s ago   9M   *
grafana                            ?:3000       1/1      112s ago   8M   count:1
mgr                                             2/2      113s ago   9M   count:2
mon                                             3/3      114s ago   8M   count:3
node-exporter                      ?:9100       3/3      114s ago   9M   *
osd.dashboard-admin-1685787597651               6        114s ago   8M   *
prometheus                         ?:9095       1/1      112s ago   3M   count:1
```
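For reference, this is roughly what I tried (a sketch; new addresses are examples):
```bash
ceph config set mon public_network 10.4.4.0/24   # point the mons at the new network
# ...then changed each node's interface to 10.4.4.x in the OS network config,
# but the cluster did not come back up cleanly.
# Do I also need something like the following for each host?
# ceph orch host set-addr ceph-01 10.4.4.130
```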
Hi,
We have 2 clusters (v18.2.1), primarily used for RGW, which hold over 2 billion RGW objects. They are also in a multisite configuration totaling 2 zones, and we have around 2 Gbps of dedicated (P2P) bandwidth for the multisite traffic. Using "radosgw-admin sync status" on zone 2, we see that all 128 shards are recovering, and unfortunately there is very little data transfer from the primary zone, i.e., the link utilization is barely 100 Mbps out of 2 Gbps. Our objects are also quite small, averaging 1 MB in size.
On further inspection, we noticed that the RGW access logs at the primary site mostly yield "304 Not Modified" for requests coming from the site-2 RGWs. Is this expected? Here are some of the logs (information is redacted):
```
root@host-04:~# tail -f /var/log/haproxy-msync.log
Feb 12 05:06:51 host-04 haproxy[971171]: 10.1.85.14:33730 [12/Feb/2024:05:06:51.047] https~ backend/host-04-msync 0/0/0/2/2 304 143 - - ---- 56/55/1/0/0 0/0 "GET /bucket1/object1.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-sync-cloudtiered&rgwx-skip-decrypt&rgwx-if-not-replicated-to=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7 HTTP/1.1"
Feb 12 05:06:51 host-04 haproxy[971171]: 10.1.85.14:59730 [12/Feb/2024:05:06:51.048] https~ backend/host-04-msync 0/0/0/2/2 304 143 - - ---- 56/55/3/1/0 0/0 "GET /bucket1/object91.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-sync-cloudtiered&rgwx-skip-decrypt&rgwx-if-not-replicated-to=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7 HTTP/1.1"
```
We also took a look at our Grafana instance: out of 1000 requests/second, 200 are "200 OK" and 800 are "304 Not Modified". Sync threads run on only 2 RGW daemons per zone, which sit behind a load balancer. "radosgw-admin sync error list" also contains around 20 errors, which are mostly automatically recoverable.
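For completeness, these are the commands we have been using to inspect the sync state (the bucket name is an example):
```bash
radosgw-admin sync status                          # shows all 128 data shards recovering
radosgw-admin sync error list                      # ~20 mostly auto-recoverable errors
radosgw-admin bucket sync status --bucket=bucket1  # per-bucket view of the backlog
```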
Do we understand correctly that this means the RGW multisite sync logs in the log pool are yet to be generated, or something of that sort? Please give us some insight and let us know how to resolve this.
Thanks,
Saif
Just in case anybody is interested: Using dm-cache works and boosts
performance -- at least for my use case.
The "challenge" was to get 100 (identical) Linux-VMs started on a three
node hyperconverged cluster. The hardware is nothing special, each node
has a Supermicro server board with a single CPU with 24 cores and 4 x 4
TB hard disks. And there's that extra 1 TB NVMe...
I know that the general recommendation is to use the NVMe for WAL and
metadata, but this didn't seem appropriate for my use case, and I'm still
not quite sure about the failure scenarios with that configuration. So
instead I made each drive a logical volume (managed by an OSD) and added
85 GiB of NVMe to each LV as a read-only cache.
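In case anyone wants to replicate it, the per-node setup looks roughly like this (a sketch; device and VG/LV names are examples, and I attach the cache in writethrough mode so the NVMe never holds the only copy of any data):
```bash
# One VG spanning the 4 HDDs plus the NVMe; each OSD LV is pinned to one HDD.
vgcreate ceph-vg /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/nvme0n1
for d in a b c d; do
    lvcreate -n osd-sd$d -l 100%PVS ceph-vg /dev/sd$d    # data LV on a single HDD
    lvcreate -n cache-sd$d -L 85G ceph-vg /dev/nvme0n1   # 85 GiB cache LV on the NVMe
    lvconvert -y --type cache --cachevol cache-sd$d \
        --cachemode writethrough ceph-vg/osd-sd$d        # attach dm-cache to the data LV
done
```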
Each VM uses as its system disk an RBD image cloned from a snapshot of the
master image. The idea was that with this configuration, all VMs should
share most (actually almost all) of the data on their system disks, and
this data should be available from the cache.
Well, it works. When booting the 100 VMs, almost all read operations are
satisfied from the cache. So I get close to NVMe speed but have paid
for conventional hard drives only (well, SSDs aren't that much more
expensive nowadays, but the hardware is 4 years old).
So, nothing sophisticated, but as I couldn't find anything about this
kind of setup, it might be of interest nevertheless.
- Michael
Hey ceph-users,
I just noticed issues with ceph-crash using the Debian/Ubuntu packages
(package: ceph-base):
While the /var/lib/ceph/crash/posted folder is created by the package
install, it is not properly chowned to ceph:ceph by the postinst script.
This might also affect RPM-based installs somehow, but I did not look
into that.
I opened a bug report with all the details and two ideas to fix this:
https://tracker.ceph.com/issues/64548
The wrong ownership causes ceph-crash to NOT work at all. I myself
missed quite a few crash reports. All of them were just sitting around
on the machines, but were reported right after I ran:
```
chown ceph:ceph /var/lib/ceph/crash/posted
systemctl restart ceph-crash.service
```
You might want to check whether you are affected as well.
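A quick way to check (paths as created by the Debian/Ubuntu packages):
```bash
stat -c '%U:%G' /var/lib/ceph/crash/posted   # should print ceph:ceph, not root:root
ls /var/lib/ceph/crash/                      # unposted crash dirs accumulate here otherwise
```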
Failing to post crashes to the local cluster also means they are not
reported back via telemetry.
Regards
Christian
Please don't drop the list from your response.
The first question coming to mind is: why do you have a cache tier if
all your pools are on NVMe devices anyway? I don't see any benefit here.
Did you try the suggested workaround and disable the cache tier?
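Roughly, that would be the following (a sketch; pool names taken from your output):
```bash
ceph osd pool set vms_cache hit_set_count 0   # workaround from the thread below
ceph osd tier cache-mode vms_cache proxy      # pass all ops through to the base pool
rados -p vms_cache cache-flush-evict-all      # flush/evict the remaining objects
ceph osd tier remove-overlay vms              # detach the cache tier from the base pool
```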
Quoting Cedric <yipikai7(a)gmail.com>:
> Thanks Eugen, see attached info.
>
> Some more details:
>
> - commands that actually hang: ceph balancer status ; rbd -p vms ls ;
> rados -p vms_cache cache-flush-evict-all
> - all scrubs running on vms_cache PGs stall / restart in a loop
> without actually doing anything
> - all I/O is 0, both from ceph status and from iostat on the nodes
>
> On Tue, Feb 20, 2024 at 10:00 AM Eugen Block <eblock(a)nde.ag> wrote:
>>
>> Hi,
>>
>> some more details would be helpful, for example what's the pool size
>> of the cache pool? Did you issue a PG split before or during the
>> upgrade? This thread [1] deals with the same problem, the described
>> workaround was to set hit_set_count to 0 and disable the cache layer
>> until that is resolved. Afterwards you could enable the cache layer
>> again. But keep in mind that the code for cache tier is entirely
>> removed in Reef (IIRC).
>>
>> Regards,
>> Eugen
>>
>> [1]
>> https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-addin…
>>
>> Quoting Cedric <yipikai7(a)gmail.com>:
>>
>> > Hello,
>> >
>> > Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13), we
>> > encountered an issue with a cache pool becoming completely stuck;
>> > relevant messages below:
>> >
>> > pg xx.x has invalid (post-split) stats; must scrub before tier agent
>> > can activate
>> >
>> > In the OSD logs, scrubs are starting in a loop without succeeding for
>> > all PGs of this pool.
>> >
>> > What we already tried without luck so far:
>> >
>> > - shutdown / restart OSD
>> > - rebalance pg between OSD
>> > - raise the memory on OSD
>> > - repeer PG
>> >
>> > Any idea what is causing this? Any help will be greatly appreciated.
>> >
>> > Thanks
>> >
>> > Cédric
Hello. We have a requirement to change the hostname on some of our OSD
nodes. All of our nodes are Ubuntu 22.04 based and have been deployed
using the 17.2.7 orchestrator.
1. Is there a procedure to rename an existing node without rebuilding it,
and have it detected by the Ceph Orchestrator?
If not,
2. To minimize the impact on the cluster (rebuilding OSDs / rebalancing,
etc.), is it possible to REINTRODUCE the existing OSDs into the cluster on
the newly rebuilt node? Is there a ceph orch process to scan a node's
local OSDs, detect them, and create the OSD daemons?
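For example, would something like this be the supported path after the rebuild (a sketch; hostname and address are examples)?
```bash
ceph orch host add new-hostname 10.0.0.5   # register the renamed/rebuilt node
ceph cephadm osd activate new-hostname     # scan for existing OSD LVs and adopt them
```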
Thank you.