Hi,
I am still evaluating ceph rgw for specific use cases.
My question is about keeping the bucket namespace under the control of
rgw admins.
Normal S3 users have the ability to create new buckets as they see fit.
This opens opportunities for creating excessive numbers of buckets, for
blocking desirable bucket names from other uses, or even for using
bucket-name typosquatting as an attack vector.
In AWS, I can create IAM users and grant them per-bucket access via
bucket or IAM user policies. These IAM users can't create new buckets
on their own. By handing out only those IAM credentials to users and
applications, I can ensure no bucket-namespace pollution occurs.
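To make that concrete, this is roughly the AWS-side pattern I mean (a sketch only; the user, policy and bucket names are made up): the IAM user gets object access to one bucket and is simply never granted s3:CreateBucket.
```
# Sketch: allow an IAM user to use one existing bucket, without ever
# granting s3:CreateBucket (user/bucket/policy names are placeholders)
aws iam put-user-policy --user-name app-user --policy-name app-bucket-only \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::app-bucket", "arn:aws:s3:::app-bucket/*"]
    }]
  }'
```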
Ceph rgw does not have IAM users (yet?). What could I use here to
prevent certain S3 users from creating buckets on their own?
Regards
Matthias
Hi,
I've just upgraded our object storage clusters to the latest Pacific
version (16.2.14) and the autoscaler is acting weird.
On one cluster it just shows nothing:
~# ceph osd pool autoscale-status
~#
On the other clusters it shows this when it is set to warn:
~# ceph health detail
...
[WRN] POOL_TOO_MANY_PGS: 2 pools have too many placement groups
Pool .rgw.buckets.data has 1024 placement groups, should have 1024
Pool device_health_metrics has 1 placement groups, should have 1
Version 16.2.13 seems to act normal.
Is this a known bug?
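For reference, these are the checks I can run on the affected clusters to gather more detail; this is just a sketch using default module and pool names, nothing cluster-specific.
```
# Is the pg_autoscaler module loaded, and has the mgr crashed recently?
ceph mgr module ls | grep -A3 pg_autoscaler
ceph crash ls
# Per-pool autoscale mode plus current/target PG counts
ceph osd pool get .rgw.buckets.data pg_autoscale_mode
ceph osd pool ls detail
# Fail over the active mgr in case its autoscaler state is stale
# (on older releases the mgr name must be given explicitly)
ceph mgr fail
```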
--
The "UTF-8 problems" self-help group will, as an exception, meet in the
large hall this time.
I have an 8-node cluster with old hardware. A week ago 4 nodes went down and the Ceph cluster went nuts.
All PGs became unknown and the monitors took too long to get in sync.
So I reduced the number of mons to one, and mgrs to one as well.
Now recovery starts with 100% unknown PGs, and then PGs start to move to inactive. It generally fails partway through and starts from scratch.
It's old hardware, and the OSDs have lots of slow ops and probably a number of bad sectors as well.
Any suggestions on how to tackle this? It's a Nautilus cluster on pretty old (8-year-old) hardware.
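In case it helps frame suggestions: a sketch of the flags I am considering to calm things down while the remaining OSDs stabilise (assuming it is acceptable to pause recovery temporarily).
```
# Stop the cluster from rebalancing/recovering while flapping OSDs settle
ceph osd set noout
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover
# Watch PG states and stuck PGs
ceph -s
ceph pg dump_stuck unclean
# Once OSDs stay up and the mon is stable, unset the flags again
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset noout
```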
Thanks
Hi Team, Milind,
*Ceph-version:* Quincy, Reef
*OS:* Almalinux 8
*Issue:* snap_schedule only starts working 1 hour after the scheduled start time
*Description:*
We are working on a 3-node Ceph cluster and are currently exploring the
scheduled snapshot capability of the ceph-mgr module.
To enable/configure scheduled snapshots, we followed this link:
https://docs.ceph.com/en/quincy/cephfs/snap-schedule/
We were able to create snap schedules for the subvolumes as suggested.
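For reference, the schedules were created roughly as follows (the path and start time are just our test values):
```
# Create an hourly schedule for the subvolume path, with an explicit start time
ceph fs snap-schedule add /volumes/subvolgrp/test3 1h 2023-10-04T07:20:00
# Verify the schedule
ceph fs snap-schedule status /volumes/subvolgrp/test3
```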
But we have observed two very strange behaviours:
1. The snap_schedules only work after we restart the ceph-mgr service on the
active mgr node. After restarting the mgr service it still took another hour
before snapshots started getting created. I am attaching the log file from
after the restart. This behaviour looks abnormal.
2. The first snapshot is only created one hour after the configured start time.
So, for example, consider the output below:
```
[root@storagenode-1 ~]# ceph fs snap-schedule status
/volumes/subvolgrp/test3
{"fs": "cephfs", "subvol": null, "path": "/volumes/subvolgrp/test3",
"rel_path": "/volumes/subvolgrp/test3", "schedule": "1h", "retention": {},
"start": "2023-10-04T07:20:00", "created": "2023-10-04T07:18:41", "first":
"2023-10-04T08:20:00", "last": "2023-10-04T09:20:00", "last_pruned": null,
"created_count": 2, "pruned_count": 0, "active": true}
[root@storagenode-1 ~]#
```
As we can see in the above output, we created the schedule at
2023-10-04T07:18:41. The schedule was supposed to start at
2023-10-04T07:20:00, but the first snapshot was only created at
2023-10-04T08:20:00.
Any input regarding this will be of great help.
Thanks and Regards
Kushagra Gupta
Hi Everyone,
I've been trying to get S3 Select working on our system and whenever I
send a query I get the following in the Payload (Result 200 from RGW):
# aws --endpoint-url http://cephtest1 s3api select-object-content \
    --bucket test1 --expression-type SQL \
    --input-serialization '{"CSV": {"FieldDelimiter": ",", "QuoteCharacter": "\"", "RecordDelimiter": "\n", "QuoteEscapeCharacter": "\\", "FileHeaderInfo": "USE"}, "CompressionType": "NONE"}' \
    --output-serialization '{"CSV": {"FieldDelimiter": ":", "RecordDelimiter": "\t", "QuoteFields": "ALWAYS"}}' \
    --key sample_data.csv --expression 'SELECT * from s3object' /dev/stderr
<Payload>
<Records>
<Payload>
failure -->SELECT * from s3object<---
</Payload></Records></Payload>
I also get the same behaviour when accessing via boto3/python. The
same command/code works when accessing other S3 services. Am I
missing some config or something?
The test1/sample_data.csv file is there and the account is able to get
the sample_data.csv data. I've tried uploading versions of the data file
with both Unix and DOS line endings. It is the data file from
the AWS example.
We're running Pacific and the RGWs are deployed as podman containers
on Rocky 8.8:
ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific (stable)
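If it helps, I can raise the RGW log level to capture more detail about the failing query; a sketch only (the container name is a placeholder for whatever podman uses on the gateway host):
```
# Temporarily raise RGW debug logging (centralised/cephadm config assumed)
ceph config set client.rgw debug_rgw 20
# Follow the RGW container logs on the gateway host; <rgw-container> is a placeholder
podman logs -f <rgw-container>
# Reset to the default afterwards
ceph config set client.rgw debug_rgw 1/5
```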
Thanks,
-Dave
Hi Ceph users and developers,
We are gearing up for the next User + Developer Monthly Meeting, happening
October 19th at 10am EST.
If you are interested in being a guest speaker, you are invited to submit a
focus topic to this Google form:
https://docs.google.com/forms/d/e/1FAIpQLSdboBhxVoBZoaHm8xSmeBoemuXoV_rmh4v…
Examples of what we're looking for in a focus topic include:
- Feature requests / RFEs
- User feedback
- Knowledge sharing (upgrades, workloads, etc.)
- Ideas for long-term improvement (user-facing)
- Share the use case of your cluster
Any Ceph user or developer is eligible to submit! Reach out to me with any
questions.
- Laura Flores
--
Laura Flores
She/Her/Hers
Software Engineer, Ceph Storage <https://ceph.io>
Chicago, IL
lflores(a)ibm.com | lflores(a)redhat.com
M: +17087388804
Hi everybody,
I tried to reshard a bucket belonging to the tenant "test-tenant", but got a "No such file or directory" error.
$ radosgw-admin reshard add --bucket test-tenant/test-bucket --num-shards 40
$ radosgw-admin reshard process
2023-10-04T12:12:52.470+0200 7f654237afc0 0 process_single_logshard: Error during resharding bucket test-tenant/test-bucket:(2) No such file or directory
$ radosgw-admin reshard list
[
{
"time": "2023-10-04T10:12:46.528741Z",
"tenant": "",
"bucket_name": "test-tenant/test-bucket",
"bucket_id": "43e570fd-7573-403f-ab86-a75e12e60146.24142.3",
"new_instance_id": "",
"old_num_shards": 23,
"tentative_new_num_shards": 40
}
]
(I also tried --bucket test-bucket)
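In case it is useful for reproducing, the bucket's metadata and current shard count can be checked like this (same tenant/bucket names as above; a sketch, not what finally fixed it):
```
# Should list the bucket and show "num_shards" plus the bucket id
radosgw-admin bucket stats --bucket test-tenant/test-bucket
# Bucket entrypoint metadata, including the tenant field
radosgw-admin metadata get bucket:test-tenant/test-bucket
```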
I then tried to remove the incomplete reshard job:
$ radosgw-admin reshard cancel --bucket test-tenant/test-bucket
$ radosgw-admin reshard list
[]
$ radosgw-admin reshard process
2023-10-04T12:33:04.251+0200 7fb728e16fc0 0 INFO: RGWReshardLock::lock found lock on reshard.0000000000 to be held by another RGW process; skipping for now
With
$ radosgw-admin lc reshard fix --bucket test-tenant/test-bucket
and restarting the rgw containers, I could get rid of the reshard.0000000000 lock.
Unfortunately updating to the latest 17.2 version did not help.
$ ceph orch upgrade start --ceph-version 17.2.6
$ ceph --version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
But after transferring the bucket to a different RGW user without a tenant, resharding worked fine:
$ radosgw-admin reshard add --bucket test-bucket-without-tenant-user --num-shards 40
$ radosgw-admin reshard process
2023-10-04T12:44:43.599+0200 7f75c908ffc0 1 execute INFO: reshard of bucket "test-bucket-without-tenant-user" from "test-bucket-without-tenant-user:43e570fd-7573-403f-ab86-a75e12e60146.74748.1" to "test-bucket-without-tenant-user:43e570fd-7573-403f-ab86-a75e12e60146.74819.1" completed successfully
This solved my problem, but I wanted to report it, as it may affect other installations too.
Is this problem known? I could not find any information about it.
Hi Folks,
We are currently running with one nearfull OSD and 15 nearfull pools. The most full OSD is about 86% full, but the average is 58%. The balancer is skipping a pool on which the autoscaler is trying to complete a pg_num reduction from 131,072 to 32,768 (the default.rgw.buckets.data pool). The autoscaler has been working on this for the last 20 days: it works through a list of misplaced objects, but when it gets close to the end, more objects get added to the list.
This morning I observed the list get down to c. 7,000 misplaced objects with 2 PGs active+remapped+backfilling; one PG completed its backfill, and then the list shot up to c. 70,000 misplaced objects with 3 PGs active+remapped+backfilling.
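For context, these are the sort of commands I have been using to watch the merge progress (pool name as above; the last one is, as I understand it, the mgr knob that limits how much misplacement automated pg_num changes will create at once):
```
# Current vs. target PG counts for the pool being merged
ceph osd pool get default.rgw.buckets.data pg_num
ceph osd pool get default.rgw.buckets.data pgp_num
ceph osd pool ls detail | grep buckets.data
# Overall misplaced/backfilling state
ceph -s
# How much misplacement the mgr allows in flight at once (default 0.05)
ceph config get mgr target_max_misplaced_ratio
```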
Has anyone come across this behaviour before? If so, what was your remediation?
Thanks in advance for sharing.
Bruno
Cluster details:
3,068 OSDs when all running, c. 60 per storage node
OS: Ubuntu 20.04
Ceph: Pacific 16.2.13 from Ubuntu Cloud Archive
Use case:
S3 storage and OpenStack backend, all pools three-way replicated