Hello,
Setting up first ceph cluster in lab.
Rocky 8.6
Ceph quincy
Using curl install method
Following cephadm deployment steps
Everything works as expected, except that
ceph orch device ls --refresh
only displays the NVMe devices on the Ceph host, not the SATA SSDs.
I tried:
sgdisk --zap-all /dev/sda
wipefs -a /dev/sda
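(One more cleanup step I'm considering, in case LVM leftovers on the disk are the problem; my understanding is that the orchestrator command below wraps ceph-volume's zap:)
```
# have the orchestrator zap the device, clearing LVM/bluestore leftovers
# that sgdisk/wipefs don't touch
ceph orch device zap ceph-a /dev/sda --force
```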
When I add a SATA OSD manually with
ceph orch daemon add osd ceph-a:data_devices=/dev/sda
I get:
Created no osd(s) on host ceph-a; already created?
An NVMe OSD gets added without issue.
I have looked in the ceph-volume log on the node and the monitor log on the
admin server and have not seen anything that looks like an obvious clue.
I can see commands running successfully against /dev/sda in the logs.
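(For completeness, this is how I've been checking what the orchestrator thinks of the device; the --wide flag and the shell invocation are from my reading of the docs:)
```
# list devices with their reject reasons, bypassing the cached inventory
ceph orch device ls --wide --refresh

# ask ceph-volume itself what it makes of the disk
cephadm shell -- ceph-volume inventory /dev/sda
```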
Ideas?
Thanks,
cb
Hi to all, and thanks for sharing your experience on Ceph!
We have a simple setup with 9 OSDs, all HDD, on 3 nodes: 3 OSDs per node.
We started the cluster to test how it works with HDDs, using the default, easy bootstrap. Then we decided to add SSDs and create a pool that uses only SSDs.
In order to have some pools on HDDs and some pools on SSDs only, we edited the crushmap to add the class hdd.
We have not entered anything about the SSDs so far, neither disks nor rules; we only added the class to the default rules.
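(The edit itself was done with the usual decompile/edit/recompile cycle; the file names are ours:)
```
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: add "class hdd" to every "step take default"
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin
```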
So here are the rules before introducing class hdd:
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule erasure-code {
    id 1
    type erasure
    min_size 3
    max_size 4
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
rule erasure2_1 {
    id 2
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
rule erasure-pool.meta {
    id 3
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
rule erasure-pool.data {
    id 4
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
And here are the rules after:
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule erasure-code {
    id 1
    type erasure
    min_size 3
    max_size 4
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
rule erasure2_1 {
    id 2
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
rule erasure-pool.meta {
    id 3
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
rule erasure-pool.data {
    id 4
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
Just doing this triggered the misplacement of all PGs bound to the EC pools.
Is that expected? And why?
Best regards
Alessandro Bolgia
I've set up RadosGW with STS on top of my Ceph cluster. It works great, but I'm also trying to set up authentication with an OpenID Connect provider. I'm having a hard time troubleshooting issues because the radosgw log file doesn't have much information in it. For example, when I try to use the `sts:AssumeRoleWithWebIdentity` API, it fails with `{'Code': 'AccessDenied', ...}`, and all I see is the beast log showing an HTTP 403.
Is there a way to enable more verbose logging so I can see what is failing and why I'm getting certain errors from the STS, S3, or IAM APIs?
My ceph.conf looks like this for each node (mildly redacted):
```
[client.radosgw.pve4]
host = pve4
keyring = /etc/pve/priv/ceph.client.radosgw.keyring
log file = /var/log/ceph/client.radosgw.$host.log
rgw_dns_name = s3.lab
rgw_frontends = beast endpoint=0.0.0.0:7480 ssl_endpoint=0.0.0.0:443 ssl_certificate=/etc/pve/priv/ceph/s3.lab.crt ssl_private_key=/etc/pve/priv/ceph/s3.lab.key
rgw_sts_key = 1111111111111111
rgw_s3_auth_use_sts = true
rgw_enable_apis = s3, s3website, admin, sts, iam
```
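(For reference, the only knobs I've found so far are the generic subsystem debug levels; this is a sketch of what I mean, assuming `debug_rgw` is the right subsystem:)
```
# raise rgw verbosity at runtime via the mon config db
ceph config set client.radosgw.pve4 debug_rgw 20/20
ceph config set client.radosgw.pve4 debug_ms 1/5

# or persistently, in the [client.radosgw.pve4] section of ceph.conf:
#   debug rgw = 20/20
#   debug ms = 1
```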
Hello
We are planning to start QE validation of the release next week.
If you have PRs that should be part of it, please let us know by
adding the "needs-qa" label and the 'quincy' milestone to them ASAP.
Thx
YuriW
Hello,
Our ceph cluster performance has become horrifically slow over the past few
months.
Nobody here is terribly familiar with ceph and we're inheriting this
cluster without much direction.
Architecture: 40Gbps QDR IB fabric between all ceph nodes and our ovirt VM
hosts. 11 OSD nodes with a total of 163 OSDs. 14 pools, 3616 PGs, 1.19PB
total capacity.
Ceph versions:
{
    "mon": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 118,
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 22,
        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 19
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 124,
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 22,
        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 19
    }
}
The majority of the disks are spindles, but there are also NVMe SSDs. There is
a lot of variability in drive sizes: two different sets of admins added disks
sized between 6TB and 16TB, and I suspect this, plus imbalanced weighting, is
to blame.
Performance on the ovirt VMs can dip as low as several *kilobytes* per second
(!) on reads and a few MB/sec on writes. There are also several scrub errors.
In short, it's a complete wreck.
STATUS:
[root@ceph-admin davei]# ceph -s
cluster:
    id:     1b8d958c-e50b-40ef-a681-16cfeb9390b8
    health: HEALTH_ERR
            3 scrub errors
            Possible data damage: 3 pgs inconsistent
services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3
    mgr: ceph3(active), standbys: ceph2, ceph1
    osd: 163 osds: 159 up, 158 in
data:
    pools:   14 pools, 3616 pgs
    objects: 46.28M objects, 174TiB
    usage:   527TiB used, 694TiB / 1.19PiB avail
    pgs:     3609 active+clean
             4    active+clean+scrubbing+deep
             3    active+clean+inconsistent
io:
    client: 74.3MiB/s rd, 96.0MiB/s wr, 3.85kop/s rd, 3.68kop/s wr
---
HEALTH:
[root@ceph-admin davei]# ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 3 pgs inconsistent
pg 2.8a is active+clean+inconsistent, acting [13,152,127]
pg 2.ce is active+clean+inconsistent, acting [145,13,152]
pg 2.e8 is active+clean+inconsistent, acting [150,162,42]
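(Side note: for the inconsistent PGs, the only repair path I've found in the docs so far is the one below; I haven't dared to run it yet:)
```
# ask the primary OSD to repair a scrub-inconsistent pg
ceph pg repair 2.8a
```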
---
CEPH OSD DF:
(not going to paste that all in here): https://pastebin.com/CNW5RKWx
What else should I be sharing with you all?
Any advice on how we should reweight these OSDs to get the performance to
improve?
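(The knobs I've found so far are sketched below; I haven't run any of them yet, and whether the mgr balancer behaves well on luminous is an assumption on my part:)
```
# dry run: show what utilization-based reweighting would change
ceph osd test-reweight-by-utilization

# apply, only touching OSDs above 110% of the mean utilization
ceph osd reweight-by-utilization 110

# alternative: let the mgr balancer shift weights gradually
ceph mgr module enable balancer
ceph balancer mode crush-compat
ceph balancer on
```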
Thanks all,
-Dave
Hi all,
digging around while debugging why our (small: 10 hosts / ~60 OSDs) cluster is so slow even while recovering, I found out that one of our key issues is some SSDs with an SLC cache (in our case Samsung SSD 870 EVO), which we had recycled from other use cases in the hope of speeding up our mainly HDD-based cluster. We knew it would be a little random which objects get accelerated when the drives are not used as a cache.
However, the opposite was the case. This type of SSD is only fast while operating in its SLC cache, which is only several gigabytes in a multi-TB SSD [1]. When doing a big write or a backfill onto these SSDs, we got really low IO rates (around 10 MB/s, even with 4M objects).
But it got even worse. Disclaimer: this is my view as a user; maybe a more technically involved person can correct me. The cause seems to be the mclock scheduler, which measures the IOPS an OSD is able to do. As measured in the blog post [2], this is usually a good thing, as some profiling is done and queuing is handled differently. But in our case, osd_mclock_max_capacity_iops_ssd was very low for most of the corresponding OSDs, though not for all of them. I assume it depends on when the mclock scheduler measured the IOPS capacity. That led to broken scheduling, where backfills ran at low speed while the SSD itself showed nearly no disk utilization, because it was operating in its cache again and could have worked faster. The issue could be solved by switching the affected OSDs back to the wpq scheduler, which seems to just queue up IOs without throttling at a measured IOPS maximum. Now we still see a bad IO situation because of the slow SSDs, but at least they are operating at their maximum (with the typical settings like osd_recovery_max_active and osd_recovery_sleep* tuned).
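(Roughly, the switch looked like this; the OSD id is a placeholder for each affected one, and the exact restart command depends on how the OSDs are deployed:)
```
# see what mclock measured for a suspect OSD
ceph config get osd.12 osd_mclock_max_capacity_iops_ssd

# switch that OSD back to wpq (takes effect after an OSD restart)
ceph config set osd.12 osd_op_queue wpq
systemctl restart ceph-osd@12   # on the OSD's host
```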
We are going to replace the SSDs with ones that hopefully perform more consistently (even if their peak performance is not as good).
I hope this may help somebody in the future who is stuck in a low-performance recovery.
Refs:
[1] https://www.tomshardware.com/reviews/samsung-870-evo-sata-ssd-review-the-be…
[2] https://ceph.io/en/news/blog/2022/mclock-vs-wpq-testing-with-background-ops…
Happy Storing!
Michael Wodniok
--
Michael Wodniok M.Sc.
WorNet AG
Bürgermeister-Graf-Ring 28
82538 Geretsried
Simply42 and SecuMail are trademarks of WorNet AG.
http://www.wor.net/
Commercial register: Amtsgericht München (HRB 129882)
Executive board: Christian Eich
Chairman of the supervisory board: Dirk Steinkopf
I'm trying to find documentation on which mount options are supported directly by the kernel module. For example, in the kernel module included in Rocky Linux 8 and 9, the secretfile option isn't supported even though the documentation seems to imply it is. The documentation seems to assume you'll always be using the mount.ceph helper, and I'm trying to find out which options are supported if you don't have the mount.ceph helper.
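(To illustrate, a sketch of the helper-less invocations I mean; the monitor address and key are placeholders, and my reading that secretfile= is parsed by the helper rather than the kernel is exactly what I'm trying to confirm:)
```
# works without mount.ceph: the key is passed inline
mount -t ceph 192.0.2.10:6789:/ /mnt/cephfs -o name=admin,secret=AQD...==

# fails without mount.ceph: secretfile= seems to be handled by the helper
mount -t ceph 192.0.2.10:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
```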
Thanks
Shawn
Hi Cephers,
I have large OMAP objects on one of my clusters (most likely due to a big
bucket deletion, with things not completely purged).
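(For scale, this is roughly how I've been sizing the problem; the pool name and the bucket id are placeholders from my setup:)
```
# the health warning names the offending pgs/objects
ceph health detail

# count the omap keys on a suspected bucket index object
rados -p default.rgw.buckets.index listomapkeys '.dir.<bucket-id>' | wc -l
```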
Since there is no tool to either reconstruct the index from the data or purge
an unused index, I thought I could use multisite replication.
As I am in a multisite configuration, and the cluster is not in the
master zone:
will all the data be recovered from the master zone if I stop radosgw,
delete the RGW index and data pools, and restart radosgw?
Or will it definitely not be that simple?
The data pool reports 8TB used.
Even if it works, it will take ages...
If someone has another idea to remove those large OMAP objects...
I've seen that question several times on the mailing list, but never saw
a response that worked or was adapted to my use case.
The rgw-orphan-list script could be a solution, but it would take too long to
run on my cluster.
And still, I would have to know whether I can really delete the objects, and I
have no clue...
Hi,
I noticed that CompleteMultipartUploadResult returns an empty ETag
field when completing a multipart upload in v17.2.3.
I haven't had the chance to verify in which version this changed,
and I can't find anything in the changelog saying it is fixed in a newer version.
The response looks like:
<?xml version="1.0" encoding="UTF-8"?>
<CompleteMultipartUploadResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Location>s3.myceph.com/test-bucket/test.file</Location>
<Bucket>test-bucket</Bucket>
<Key>test.file</Key>
<ETag></ETag>
</CompleteMultipartUploadResult>
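(For reference, a sketch of how I reproduce it with awscli; the endpoint is my setup's, and the upload id and part ETag are elided:)
```
aws --endpoint-url https://s3.myceph.com s3api create-multipart-upload \
    --bucket test-bucket --key test.file
aws --endpoint-url https://s3.myceph.com s3api upload-part \
    --bucket test-bucket --key test.file --part-number 1 \
    --body ./part1 --upload-id <UploadId>
aws --endpoint-url https://s3.myceph.com s3api complete-multipart-upload \
    --bucket test-bucket --key test.file --upload-id <UploadId> \
    --multipart-upload 'Parts=[{ETag=<ETagFromUploadPart>,PartNumber=1}]'
```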
I have found an old issue with the same problem that was closed around 9 years
ago, so I guess this has been fixed once before:
https://tracker.ceph.com/issues/6830
It looks like my tracker account is still not activated, so I
can't create or comment on the issue.
Best regards,
Lars Dunemark