Hi Marc,
I uploaded all scripts and a rudimentary README to https://github.com/frans42/cephfs-bench . I hope it is sufficient to get started. I'm afraid it's very much tailored to our deployment and I can't make it fully configurable anytime soon. I hope it serves a purpose though - at least I discovered a few bugs with it.
We actually kept the benchmark running through an upgrade from mimic to octopus. It was quite interesting to see how certain performance properties change with that. This benchmark makes it possible to compare versions with live timings coming in.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Marc <Marc(a)f1-outsourcing.eu>
Sent: Monday, May 15, 2023 11:28 PM
To: Frank Schilder
Subject: RE: [ceph-users] Re: CEPH Version choice
> I planned to put it on-line. The hold-back is that the main test is
> untarring a nasty archive and this archive might contain personal
> information, so I can't just upload it as is. I can try to put together
> a similar archive from public sources. Please give me a bit of time. I'm
> also a bit under stress right now with our users being hit by an FS
> metadata corruption. That's also why I'm a bit trigger-happy.
>
Ok thanks, very nice, no hurry!!!
Hi all,
I have a problem with exporting two different sub-folder CephFS kernel mounts via nfsd to the same IP address. The top-level structure on the ceph fs is something like /A/S1 and /A/S2. On a file server I mount /A/S1 and /A/S2 as two different file systems under /mnt/S1 and /mnt/S2 using the CephFS kernel client. Then these two mounts are exported with lines like these in /etc/exports:
/mnt/S1 -options NET
/mnt/S2 -options IP
IP is an element of NET, meaning that the host at IP should be the only host able to access both /mnt/S1 and /mnt/S2. What we observe is that any attempt to mount the export /mnt/S1 on the host at IP results in /mnt/S2 being mounted instead.
My first guess was that we have a clash of fsids here: the ceph fs simply reports the same fsid for both mounts and, hence, nfsd thinks both mount points contain the same file system. So I modified the second export line to
/mnt/S2 -options,fsid=100 IP
to no avail. The two folders are completely disjoint, with neither symlinks nor hard links between them, so it should be safe to export them as two different file systems.
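What I plan to try next (untested so far) is giving both exports explicit and distinct fsids, in case the kernel client reports the same fsid for both mounts and setting it on only the second export is not enough:
/mnt/S1 -options,fsid=101 NET
/mnt/S2 -options,fsid=102 IP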
Exporting such constructs to non-overlapping networks/IPs works as expected - even when exporting subdirs of a dir (like exporting /A/B and /A/B/C from the same file server to strictly different IPs). It seems to be the same-IP config that breaks expectations.
Am I missing a magic -yes-i-really-know-what-i-am-doing hack here? The file server is on AlmaLinux release 8.7 (Stone Smilodon) and all ceph packages match the latest octopus version of our cluster.
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi guys,
I have been awake for 36 hours trying to restore a broken Ceph pool (2 PGs incomplete).
My VMs are all broken; some boot, some don't...
I also have 5 removed disks with data from that pool "in my hands" - don't ask...
So my question: is it possible to restore the data from these removed disks and "add" it back to the others for healing?
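What I had in mind (just a guess on my side, IDs and PG names are placeholders) is exporting the incomplete PGs from the removed disks with ceph-objectstore-tool and importing them into one of the cluster's OSDs, roughly like:
```
# on a host with one of the removed disks attached, with that OSD not running:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<old-id> \
    --pgid <incomplete-pgid> --op export --file /tmp/<incomplete-pgid>.export

# then stop a target OSD in the cluster, import the PG and start it again:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<new-id> \
    --op import --file /tmp/<incomplete-pgid>.export
```
Is that the right direction, or complete nonsense?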
Best regards
Ben
Hello.
I think I found a bug in cephadm/ceph orch:
Redeploying a container image (tested with alertmanager) after removing a
custom `mgr/cephadm/container_image_alertmanager` value deploys the
previous container image rather than the default container image.
I'm running `cephadm` from ubuntu 22.04 pkg 17.2.5-0ubuntu0.22.04.3 and
`ceph` version 17.2.6.
Here is an example. Node clrz20-08 is the node alertmanager is running
on, clrz20-01 is the node I'm controlling ceph from:
* Get alertmanager version
```
root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name == "alertmanager") | .container_image_name'
"quay.io/prometheus/alertmanager:v0.23.0"
```
* Set alertmanager image
```
root@clrz20-01:~# ceph config set mgr mgr/cephadm/container_image_alertmanager quay.io/prometheus/alertmanager
root@clrz20-01:~# ceph config get mgr mgr/cephadm/container_image_alertmanager
quay.io/prometheus/alertmanager
```
* Redeploy alertmanager
```
root@clrz20-01:~# ceph orch redeploy alertmanager
Scheduled to redeploy alertmanager.clrz20-08 on host 'clrz20-08'
```
* Get alertmanager version
```
root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name == "alertmanager") | .container_image_name'
"quay.io/prometheus/alertmanager:latest"
```
* Remove alertmanager image setting, revert to default:
```
root@clrz20-01:~# ceph config rm mgr mgr/cephadm/container_image_alertmanager
root@clrz20-01:~# ceph config get mgr mgr/cephadm/container_image_alertmanager
quay.io/prometheus/alertmanager:v0.23.0
```
* Redeploy alertmanager
```
root@clrz20-01:~# ceph orch redeploy alertmanager
Scheduled to redeploy alertmanager.clrz20-08 on host 'clrz20-08'
```
* Get alertmanager version
```
root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name == "alertmanager") | .container_image_name'
"quay.io/prometheus/alertmanager:latest"
```
-> `mgr/cephadm/container_image_alertmanager` is set to
`quay.io/prometheus/alertmanager:v0.23.0`, but redeploy uses
`quay.io/prometheus/alertmanager:latest`. This looks like a bug.
* Set alertmanager image explicitly to the default value
```
root@clrz20-01:~# ceph config set mgr mgr/cephadm/container_image_alertmanager quay.io/prometheus/alertmanager:v0.23.0
root@clrz20-01:~# ceph config get mgr mgr/cephadm/container_image_alertmanager
quay.io/prometheus/alertmanager:v0.23.0
```
* Redeploy alertmanager
```
root@clrz20-01:~# ceph orch redeploy alertmanager
Scheduled to redeploy alertmanager.clrz20-08 on host 'clrz20-08'
```
* Get alertmanager version
```
root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name == "alertmanager") | .container_image_name'
"quay.io/prometheus/alertmanager:v0.23.0"
```
-> Explicitly setting `mgr/cephadm/container_image_alertmanager` to the default
value works around the issue.
Bests,
Daniel
Bonjour,
Reading Karan's blog post from last year about benchmarking the insertion of billions of objects into Ceph via S3 / RGW[0], it reads:
> we decided to lower bluestore_min_alloc_size_hdd to 18KB and re-test. As represented in chart-5, the object creation rate found to be notably reduced after lowering the bluestore_min_alloc_size_hdd parameter from 64KB (default) to 18KB. As such, for objects larger than the bluestore_min_alloc_size_hdd , the default values seems to be optimal, smaller objects further require more investigation if you intended to reduce bluestore_min_alloc_size_hdd parameter.
There is also a mail thread from 2018 on this topic, with the same conclusion, although it uses RADOS directly rather than RGW[3]. I read the RGW data layout page in the documentation[1] and concluded that, by default, every object inserted via S3 / RGW will indeed use at least 64 KB. A pull request from last year[2] seems to confirm this and also suggests that modifying bluestore_min_alloc_size_hdd has adverse side effects.
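(To put a rough number on it: a billion 4 KiB objects would then allocate on the order of 65 TB before replication instead of about 4 TB, if I am reading this right.)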
That being said, I'm curious to know whether people have developed strategies to cope with this overhead. Someone mentioned packing objects together client-side to make them larger. But maybe there are simpler ways to achieve the same?
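To make the packing idea concrete, here is a rough sketch of what it could look like against the S3 endpoint, assuming the aws CLI and a client-maintained index of offsets (bucket, key and file names are made up; this is not something RGW does for you):
```
# pack three small objects into one larger object and upload it once
cat obj1 obj2 obj3 > pack-0001
aws s3api put-object --bucket mybucket --key pack-0001 --body pack-0001

# later, read one small object back with a ranged GET, using the offset and
# length recorded in the client-side index (here: the first 1024 bytes)
aws s3api get-object --bucket mybucket --key pack-0001 --range bytes=0-1023 obj1.out
```
The obvious cost is that the client has to maintain the index and handle deletes/compaction itself.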
Cheers
[0] https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
[1] https://docs.ceph.com/en/latest/radosgw/layout/
[2] https://github.com/ceph/ceph/pull/32809
[3] https://www.spinics.net/lists/ceph-users/msg45755.html
--
Loïc Dachary, Artisan Logiciel Libre
Hi,
I've been seeing relatively large fragmentation numbers on all my OSDs:
ceph daemon osd.13 bluestore allocator score block
{
"fragmentation_rating": 0.77251526920454427
}
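(Side note: I believe the raw free-extent list behind this score can be dumped with "ceph daemon osd.13 bluestore allocator dump block", for anyone who wants to look at the actual extent sizes.)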
These aren't that old, as I recreated them all around July last year.
They mostly hold CephFS data with erasure coding, with a mix of large
and small files. The OSDs are at around 80%-85% utilization right now.
Most of the data was written sequentially when the OSDs were created (I
rsynced everything from a remote backup). Since then more data has been
added, but not particularly quickly.
At some point I noticed pathologically slow writes, and I couldn't
figure out what was wrong. Eventually I did some block tracing and
noticed the I/Os were very small, even though CephFS-side I was just
writing one large file sequentially, and that's when I stumbled upon the
free space fragmentation problem. Indeed, deleting some large files
opened up some larger free extents and resolved the problem, but only
until those get filled up and I'm back to fragmented tiny extents. So
effectively I'm stuck at the current utilization, as trying to fill them
up any more just slows down to an absolute crawl.
I'm adding a few more OSDs and plan on doing the dance of removing one
OSD at a time and replacing it with another one to hopefully improve the
situation, but obviously this is going to take forever.
Is there any plan for offering a defrag tool of some sort for bluestore?
- Hector
Hi,
After a wrong manipulation, the admin key no longer works; it seems it has
been modified.
My cluster is built using containers.
When I execute ceph -s I get
[root@controllera ceph]# ceph -s
2023-05-31T11:33:20.940+0100 7ff7b2d13700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-05-31T11:33:20.940+0100 7ff7b1d11700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-05-31T11:33:20.940+0100 7ff7b2512700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
[errno 13] RADOS permission denied (error connecting to the cluster)
From the log file I am getting:
May 31 11:03:02 controllera docker[214909]: debug 2023-05-31T11:03:02.714+0100 7fcfc0c91700 0 cephx server client.admin: unexpected key: req.key=5fea877f2a68548b expected_key=8c2074e03ffa449a
How can I recover the correct key?
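I was wondering whether something like the following, run inside the mon container, could read the correct key back out of the monitor's own auth database - but I'm not sure, and the keyring path is a guess for a containerized deployment:
```
# authenticate as mon. with the monitor's own keyring, then dump client.admin;
# the recovered key could then be written back to /etc/ceph/ceph.client.admin.keyring
ceph -n mon. -k /var/lib/ceph/mon/ceph-controllera/keyring auth get client.admin
```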
Regards.
I had been running 17.2.5 since October and just upgraded to 17.2.6; now the "mtime" property on all my buckets is 0.000000.
On all previous versions going back to Nautilus this wasn't an issue, and we do like to have that value present: radosgw-admin has no quick way to get the last object in the bucket.
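(For anyone reproducing this: if I recall correctly, the field is visible in the output of "radosgw-admin bucket stats --bucket=<name>".)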
Here's my tracker submission:
https://tracker.ceph.com/issues/61264#change-239348
Dear All,
we are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:
<https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/>
Symptoms: the MDS SSD pool (2TB) filled completely over the weekend (it
normally uses less than 400GB), resulting in an MDS crash.
We added 4 extra SSDs to increase the pool capacity to 3.5TB, but the MDS
did not recover:
# ceph fs status
cephfs2 - 0 clients
=======
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 failed
1 resolve wilma-s3 8065 8063 8047 0
2 resolve wilma-s2 901k 802k 34.4k 0
POOL TYPE USED AVAIL
mds_ssd metadata 2296G 3566G
primary_fs_data data 0 3566G
ec82pool data 2168T 3557T
STANDBY MDS
wilma-s1
wilma-s4
setting "ceph mds repaired 0" causes rank 0 to restart, and then
immediately fail.
Following the disaster-recovery-experts guide, the first step we did was
to export the MDS journals, e.g:
# cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0
So far so good; however, when we try to back up the final MDS journal, the
process consumes all available RAM (470GB) and needs to be killed after 14 minutes:
# cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2
similarly, "recover_dentries summary" consumes all RAM when applied to MDS 2
# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary
We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event
recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1
event recover_dentries summary"
At this point we tried to follow the instructions and make a RADOS-level
copy of the journal data; however, the link in the docs doesn't explain how
to do this and just points to
<http://tracker.ceph.com/issues/9902>
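Our best guess for the RADOS-level copy (is this the right idea?) is that the rank 2 journal lives in inode 0x202 of the metadata pool, i.e. objects named 202.*, so something like:
```
# save all rank-2 journal objects (inode 0x202) from the metadata pool
mkdir -p /root/journal_backup
rados -p mds_ssd ls | grep '^202\.' > /root/journal.rank2.objects
while read obj; do
    rados -p mds_ssd get "$obj" /root/journal_backup/"$obj"
done < /root/journal.rank2.objects
```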
We are now tempted to reset the journal on rank 2, but wanted to get a
feeling from others first: how dangerous could this be?
We have a backup, but as there is 1.8PB of data, it's going to take a
few weeks to restore....
any ideas gratefully received.
Jake
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Hey,
I have a test setup with a 3-node Samba cluster. This cluster consists
of three VMs storing their locks on a replicated Gluster volume.
I want to switch to two physical SMB gateways for performance reasons
(not enough money for three). Since a 2-node cluster can't get quorum,
I hope to switch to storing the CTDB lock in Ceph and hope that will
work reliably. (Any experiences with 2-node SMB clusters?)
I am looking into the ctdb rados helper:
[cluster]
    recovery lock = !/usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper ceph client.tenant1 cephfs_metadata ctdb_lock
Now I do have a bit of experience with CephFS, RBD and RGW, but not with
raw RADOS. How do I give the user client.tenant1 the required permissions?
We have a single CephFS with 4 different tenants (departments). Each
department has its own Samba cluster. We're using CephFS permissions
to limit the tenants to their own path (I hope).
example of ceph auth:
client.tenant1
key: *****
caps: [mds] allow rws fsname=cephfs path=/tenant1
caps: [mon] allow r fsname=cephfs
caps: [osd] allow rw tag cephfs data=cephfs
If I try some stuff manually (without really knowing how to specify
objects or what that means), I get this permission denied error:
root@tenant1-1:~# /usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper ceph client.tenant1 cephfs_metadata tenant1/ctdb_lock 1
/usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper: Failed to get lock on RADOS object 'tenant1/ctdb_lock' - (Operation not permitted)
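Would extending the caps of client.tenant1 along these lines be the right direction? Just guessing here, and the dedicated pool for the lock objects is hypothetical (it would also mean pointing the helper at that pool instead of cephfs_metadata):
```
# hypothetical: small dedicated pool for the ctdb lock objects
ceph osd pool create ctdb-locks 8

# 'ceph auth caps' replaces all caps, so the existing ones are restated here
ceph auth caps client.tenant1 \
    mds 'allow rws fsname=cephfs path=/tenant1' \
    mon 'allow r fsname=cephfs' \
    osd 'allow rw tag cephfs data=cephfs, allow rwx pool=ctdb-locks'
```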
Angelo.