Hi,
I am still evaluating ceph rgw for specific use cases.
My question is about keeping the bucket namespace under the control of
rgw admins.
Normal S3 users have the ability to create new buckets as they see fit.
This opens opportunities for creating excessive numbers of buckets, for
blocking desirable bucket names from other uses, or even for using
bucket-name typosquatting as an attack vector.
In AWS, I can create IAM users and grant them per-bucket access via
bucket or IAM user policies. These IAM users can't create new buckets
on their own. By handing out only those IAM credentials to users and
applications, I can ensure no bucket-namespace pollution occurs.
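To make that concrete, this is roughly the AWS-side pattern I mean (a sketch only; the user, policy and bucket names are made up): the IAM user gets object access to one bucket and is simply never granted s3:CreateBucket.
```
# Sketch: allow an IAM user to use one existing bucket, without ever
# granting s3:CreateBucket (user/bucket/policy names are placeholders)
aws iam put-user-policy --user-name app-user --policy-name app-bucket-only \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::app-bucket", "arn:aws:s3:::app-bucket/*"]
    }]
  }'
```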
Ceph rgw does not have IAM users (yet?). What could I use here to
prevent certain S3 users from creating buckets on their own?
Regards
Matthias
Hi,
I've just upgraded our object storage clusters to the latest Pacific
version (16.2.14) and the autoscaler is acting weird.
On one cluster it just shows nothing:
~# ceph osd pool autoscale-status
~#
On the other clusters it shows this when it is set to warn:
~# ceph health detail
...
[WRN] POOL_TOO_MANY_PGS: 2 pools have too many placement groups
Pool .rgw.buckets.data has 1024 placement groups, should have 1024
Pool device_health_metrics has 1 placement groups, should have 1
Version 16.2.13 seems to act normal.
Is this a known bug?
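For reference, these are the checks I can run on the affected clusters to gather more detail; this is just a sketch using default module and pool names, nothing cluster-specific.
```
# Is the pg_autoscaler module loaded, and has the mgr crashed recently?
ceph mgr module ls | grep -A3 pg_autoscaler
ceph crash ls
# Per-pool autoscale mode plus current/target PG counts
ceph osd pool get .rgw.buckets.data pg_autoscale_mode
ceph osd pool ls detail
# Fail over the active mgr in case its autoscaler state is stale
# (on older releases the mgr name must be given explicitly)
ceph mgr fail
```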
--
The "UTF-8 problems" self-help group will, as an exception, meet in the
large hall this time.
I have an 8-node cluster with old hardware. A week ago 4 nodes went down and the Ceph cluster went nuts.
All PGs became unknown and the monitors took too long to get in sync.
So I reduced the number of mons to one, and mgrs to one as well.
Now recovery starts with 100% unknown PGs, and then PGs start to move to inactive. It generally fails partway through and starts from scratch.
It's old hardware, and the OSDs have lots of slow ops and probably a number of bad sectors as well.
Any suggestions on how to tackle this? It's a Nautilus cluster on pretty old (8-year-old) hardware.
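In case it helps frame suggestions: a sketch of the flags I am considering to calm things down while the remaining OSDs stabilise (assuming it is acceptable to pause recovery temporarily).
```
# Stop the cluster from rebalancing/recovering while flapping OSDs settle
ceph osd set noout
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover
# Watch PG states and stuck PGs
ceph -s
ceph pg dump_stuck unclean
# Once OSDs stay up and the mon is stable, unset the flags again
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset noout
```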
Thanks
Hi Team, Milind,
*Ceph-version:* Quincy, Reef
*OS:* Almalinux 8
*Issue:* snap_schedule only starts working 1 hour after the scheduled start time
*Description:*
We are working on a 3-node Ceph cluster and are currently exploring the
scheduled snapshot capability of the ceph-mgr module.
To enable/configure scheduled snapshots, we followed this link:
https://docs.ceph.com/en/quincy/cephfs/snap-schedule/
We were able to create snap schedules for the subvolumes as suggested.
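For reference, the schedules were created roughly as follows (the path and start time are just our test values):
```
# Create an hourly schedule for the subvolume path, with an explicit start time
ceph fs snap-schedule add /volumes/subvolgrp/test3 1h 2023-10-04T07:20:00
# Verify the schedule
ceph fs snap-schedule status /volumes/subvolgrp/test3
```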
But we have observed two very strange behaviours:
1. The snap_schedules only work after we restart the ceph-mgr service on the
active mgr node. After restarting the mgr service it still took another hour
before snapshots started getting created. I am attaching the log file from
after the restart. This behaviour looks abnormal.
2. The first snapshot is only created one hour after the configured start time.
So, for example, consider the output below:
```
[root@storagenode-1 ~]# ceph fs snap-schedule status
/volumes/subvolgrp/test3
{"fs": "cephfs", "subvol": null, "path": "/volumes/subvolgrp/test3",
"rel_path": "/volumes/subvolgrp/test3", "schedule": "1h", "retention": {},
"start": "2023-10-04T07:20:00", "created": "2023-10-04T07:18:41", "first":
"2023-10-04T08:20:00", "last": "2023-10-04T09:20:00", "last_pruned": null,
"created_count": 2, "pruned_count": 0, "active": true}
[root@storagenode-1 ~]#
```
As we can see in the above output, we created the schedule at
2023-10-04T07:18:41. The schedule was supposed to start at
2023-10-04T07:20:00, but the first snapshot was only created at
2023-10-04T08:20:00.
Any input regarding this will be of great help.
Thanks and Regards
Kushagra Gupta
Hi Everyone,
I've been trying to get S3 Select working on our system and whenever I
send a query I get the following in the Payload (Result 200 from RGW):
# aws --endpoint-url http://cephtest1 s3api select-object-content \
    --bucket test1 --expression-type SQL \
    --input-serialization '{"CSV": {"FieldDelimiter": ",", "QuoteCharacter": "\"", "RecordDelimiter": "\n", "QuoteEscapeCharacter": "\\", "FileHeaderInfo": "USE"}, "CompressionType": "NONE"}' \
    --output-serialization '{"CSV": {"FieldDelimiter": ":", "RecordDelimiter": "\t", "QuoteFields": "ALWAYS"}}' \
    --key sample_data.csv --expression 'SELECT * from s3object' /dev/stderr
<Payload>
<Records>
<Payload>
failure -->SELECT * from s3object<---
</Payload></Records></Payload>
I also get the same behaviour when accessing via boto3/python. The
same command/code works when accessing other S3 services. Am I
missing some config or something?
The test1/sample_data.csv file is there and the account is able to get
the sample_data.csv data. I've tried uploading versions of the data file
with both Unix and DOS line endings. It is the data file from
the AWS example.
We're running Pacific and the RGWs are deployed as podman containers
on Rocky 8.8:
ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific (stable)
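If it helps, I can raise the RGW log level to capture more detail about the failing query; a sketch only (the container name is a placeholder for whatever podman uses on the gateway host):
```
# Temporarily raise RGW debug logging (centralised/cephadm config assumed)
ceph config set client.rgw debug_rgw 20
# Follow the RGW container logs on the gateway host; <rgw-container> is a placeholder
podman logs -f <rgw-container>
# Reset to the default afterwards
ceph config set client.rgw debug_rgw 1/5
```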
Thanks,
-Dave
Hi Ceph users and developers,
We are gearing up for the next User + Developer Monthly Meeting, happening
October 19th at 10am EST.
If you are interested in being a guest speaker, you are invited to submit a
focus topic to this Google form:
https://docs.google.com/forms/d/e/1FAIpQLSdboBhxVoBZoaHm8xSmeBoemuXoV_rmh4v…
Examples of what we're looking for in a focus topic include:
- Feature requests / RFEs
- User feedback
- Knowledge sharing (upgrades, workloads, etc.)
- Ideas for long-term improvement (user-facing)
- Share the use case of your cluster
Any Ceph user or developer is eligible to submit! Reach out to me with any
questions.
- Laura Flores
--
Laura Flores
She/Her/Hers
Software Engineer, Ceph Storage <https://ceph.io>
Chicago, IL
lflores(a)ibm.com | lflores(a)redhat.com
M: +17087388804
Hi everybody,
I tried to reshard a bucket belonging to the tenant "test-tenant", but got a "No such file or directory" error.
$ radosgw-admin reshard add --bucket test-tenant/test-bucket --num-shards 40
$ radosgw-admin reshard process
2023-10-04T12:12:52.470+0200 7f654237afc0 0 process_single_logshard: Error during resharding bucket test-tenant/test-bucket:(2) No such file or directory
$ radosgw-admin reshard list
[
{
"time": "2023-10-04T10:12:46.528741Z",
"tenant": "",
"bucket_name": "test-tenant/test-bucket",
"bucket_id": "43e570fd-7573-403f-ab86-a75e12e60146.24142.3",
"new_instance_id": "",
"old_num_shards": 23,
"tentative_new_num_shards": 40
}
]
(I also tried --bucket test-bucket)
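In case it is useful for reproducing, the bucket's metadata and current shard count can be checked like this (same tenant/bucket names as above; a sketch, not what finally fixed it):
```
# Should list the bucket and show "num_shards" plus the bucket id
radosgw-admin bucket stats --bucket test-tenant/test-bucket
# Bucket entrypoint metadata, including the tenant field
radosgw-admin metadata get bucket:test-tenant/test-bucket
```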
I then tried to remove the incomplete reshard job:
$ radosgw-admin reshard cancel --bucket test-tenant/test-bucket
$ radosgw-admin reshard list
[]
$ radosgw-admin reshard process
2023-10-04T12:33:04.251+0200 7fb728e16fc0 0 INFO: RGWReshardLock::lock found lock on reshard.0000000000 to be held by another RGW process; skipping for now
With
$ radosgw-admin lc reshard fix --bucket test-tenant/test-bucket
and restarting the rgw containers, I could get rid of the reshard.0000000000 lock.
Unfortunately updating to the latest 17.2 version did not help.
$ ceph orch upgrade start --ceph-version 17.2.6
$ ceph --version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
But after transferring the bucket to a different RGW user without a tenant, resharding worked fine:
$ radosgw-admin reshard add --bucket test-bucket-without-tenant-user --num-shards 40
$ radosgw-admin reshard process
2023-10-04T12:44:43.599+0200 7f75c908ffc0 1 execute INFO: reshard of bucket "test-bucket-without-tenant-user" from "test-bucket-without-tenant-user:43e570fd-7573-403f-ab86-a75e12e60146.74748.1" to "test-bucket-without-tenant-user:43e570fd-7573-403f-ab86-a75e12e60146.74819.1" completed successfully
This solved my problem, but I wanted to report it, as it may affect other installations too.
Is this problem known? I could not find any information about it.
Hi Folks,
We are currently running with one nearfull OSD and 15 nearfull pools. The most full OSD is about 86% full, but the average is 58%. The balancer is skipping a pool on which the autoscaler is trying to complete a pg_num reduction from 131,072 to 32,768 (the default.rgw.buckets.data pool). The autoscaler has been working on this for the last 20 days: it works through a list of misplaced objects, but when it gets close to the end, more objects get added to the list.
This morning I observed the list get down to c. 7,000 misplaced objects with 2 PGs active+remapped+backfilling; one PG completed its backfill, and then the list shot up to c. 70,000 misplaced objects with 3 PGs active+remapped+backfilling.
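For context, these are the sort of commands I have been using to watch the merge progress (pool name as above; the last one is, as I understand it, the mgr knob that limits how much misplacement automated pg_num changes will create at once):
```
# Current vs. target PG counts for the pool being merged
ceph osd pool get default.rgw.buckets.data pg_num
ceph osd pool get default.rgw.buckets.data pgp_num
ceph osd pool ls detail | grep buckets.data
# Overall misplaced/backfilling state
ceph -s
# How much misplacement the mgr allows in flight at once (default 0.05)
ceph config get mgr target_max_misplaced_ratio
```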
Has anyone come across this behaviour before? If so, what was your remediation?
Thanks in advance for sharing.
Bruno
Cluster details:
3,068 OSDs when all running, c. 60 per storage node
OS: Ubuntu 20.04
Ceph: Pacific 16.2.13 from Ubuntu Cloud Archive
Use case:
S3 storage and OpenStack backend, all pools three-way replicated