Hi,
today I did the first update from octopus to pacific, and it looks like the
avg apply latency went up from 1ms to 2ms.
All 36 OSDs are 4TB SSDs and nothing else changed.
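For context, I mean the per-OSD apply/commit latencies, e.g. as reported by:
ceph osd perf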
Does anyone know if this is an issue, or am I just missing a config value?
Cheers
Boris
ceph version: 17.2.0 on Ubuntu 22.04
non-containerized ceph from Ubuntu repos
cluster started on luminous
I have been using bcache on filestore on rotating disks for many years
without problems. Now converting OSDs to bluestore, there are some
strange effects.
If I create the bcache device, set its rotational flag to '1', then do
ceph-volume lvm create ... --crush-device-class=hdd
the OSD comes up with the right parameters and much improved latency
compared to an OSD directly on /dev/sdX.
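For reference, the creation sequence is roughly the following (the bcache device name is just an example):
echo 1 > /sys/block/bcache0/queue/rotational
ceph-volume lvm create --data /dev/bcache0 --crush-device-class=hdd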
ceph osd metadata ...
shows
"bluestore_bdev_type": "hdd",
"rotational": "1"
But after reboot, the bcache rotational flag is set to '0' again, and the
OSD now comes up with "rotational": "0".
Latency immediately starts to increase (and continues to increase over
the next days, possibly due to accumulating fragmentation).
These wrong settings stay in place even if I stop the OSD, set the
bcache rotational flag back to '1' and restart the OSD. I have found no
way to get back to the original settings other than destroying and
recreating the OSD. I guess I am just not seeing something obvious, like
where these settings get pulled from at OSD startup.
I even created udev rules to set bcache rotational=1 at boot time,
before any ceph daemon starts, but it did not help. Something running
after these rules resets the bcache rotational flags back to '0'.
I haven't found the culprit yet, but I am not sure it even matters.
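For what it's worth, the rule is along these lines (file name and matching are approximate):
# /etc/udev/rules.d/99-bcache-rotational.rules
ACTION=="add|change", KERNEL=="bcache*", ATTR{queue/rotational}="1"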
Are these OSD settings (bluestore_bdev_type, rotational) persisted
somewhere and can they be edited and pinned?
Alternatively, can I manually set and persist the relevant bluestore
tunables (per OSD / per device class) so as to make the bcache
rotational flag irrelevant after the OSD is first created?
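What I have in mind is something like the following (whether these are even the right knobs is part of my question; the OSD id is just an example):
# per OSD
ceph config set osd.12 bluestore_prefer_deferred_size 32768
# or per device class
ceph config set osd/class:hdd bluestore_prefer_deferred_size 32768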
Regards
Matthias
On Fri, Apr 08, 2022 at 03:05:38PM +0300, Igor Fedotov wrote:
> Hi Frank,
>
> in fact this parameter impacts OSD behavior both at build time and during
> regular operation. It simply replaces the hdd/ssd auto-detection with a
> manual specification, and hence the relevant config parameters are applied.
> If a setting is persisted at OSD creation, e.g. min_alloc_size, it wouldn't
> be updated; but if a setting can be changed at run time, it would be altered.
>
> So the proper usage would definitely be to select the ssd/hdd mode manually
> before the first OSD creation and to keep that mode for the whole OSD
> lifecycle. But technically one can change the mode at any arbitrary point in
> time, which would result in run-time settings being out of sync with the
> creation-time ones, with some unclear side effects.
>
> Please also note that this setting was originally intended mostly for
> development/testing purposes, not regular usage. Hence it's flexible but
> rather unsafe if used improperly.
>
>
> Thanks,
>
> Igor
>
> On 4/7/2022 2:40 PM, Frank Schilder wrote:
> > Hi Richard and Igor,
> >
> > are these tweaks required at build-time (osd prepare) only or are they required for every restart?
> >
> > Is this setting "bluestore debug enforce settings=hdd" in the ceph config database or set somewhere else? How does this work if deploying HDD- and SSD-OSDs at the same time?
> >
> > Ideally, all these tweaks should be applicable and settable at creation time only without affecting generic settings (that is, at the ceph-volume command line and not via config side effects). Otherwise it becomes really tedious to manage these.
> >
> > For example, would the following work-flow apply the correct settings *permanently* across restarts:
> >
> > 1) Prepare OSD on fresh HDD with ceph-volume lvm batch --prepare ...
> > 2) Assign dm_cache to logical OSD volume created in step 1
> > 3) Start OSD, restart OSDs, boot server ...
> >
> > I would assume that the HDD settings are burned into the OSD in step 1 and will be used in all future (re-)starts without the need to do anything despite the device being detected as non-rotational after step 2. Is this assumption correct?
> >
> > Thanks and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Richard Bade <hitrich(a)gmail.com>
> > Sent: 06 April 2022 00:43:48
> > To: Igor Fedotov
> > Cc: Ceph Users
> > Subject: [Warning Possible spam] [ceph-users] Re: Ceph Bluestore tweaks for Bcache
> >
> > Just for completeness for anyone that is following this thread. Igor
> > added that setting in Octopus, so unfortunately I am unable to use it
> > as I am still on Nautilus.
> >
> > Thanks,
> > Rich
> >
> > On Wed, 6 Apr 2022 at 10:01, Richard Bade <hitrich(a)gmail.com> wrote:
> > > Thanks Igor for the tip. I'll see if I can use this to reduce the
> > > number of tweaks I need.
> > >
> > > Rich
> > >
> > > On Tue, 5 Apr 2022 at 21:26, Igor Fedotov <igor.fedotov(a)croit.io> wrote:
> > > > Hi Richard,
> > > >
> > > > just FYI: one can use the "bluestore debug enforce settings=hdd" config
> > > > parameter to manually enforce HDD-related settings for a BlueStore OSD
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Igor
> > > >
> > > > On 4/5/2022 1:07 AM, Richard Bade wrote:
> > > > > Hi Everyone,
> > > > > I just wanted to share a discovery I made about running bluestore on
> > > > > top of Bcache in case anyone else is doing this or considering it.
> > > > > We've run Bcache under Filestore for a long time with good results but
> > > > > recently rebuilt all the osds on bluestore. This caused some
> > > > > degradation in performance that I couldn't quite put my finger on.
> > > > > Bluestore osds have some smarts where they detect the disk type.
> > > > > Unfortunately in the case of Bcache it detects as SSD, when in fact
> > > > > the HDD parameters are better suited.
> > > > > I changed the following parameters to match the HDD default values and
> > > > > immediately saw my average osd latency during normal workload drop
> > > > > from 6ms to 2ms. Peak performance didn't change really, but a test
> > > > > machine that I have running a constant iops workload was much more
> > > > > stable as was the average latency.
> > > > > Performance has returned to Filestore or better levels.
> > > > > Here are the parameters.
> > > > >
> > > > > ; Make sure that we use values appropriate for HDD not SSD - Bcache gets detected as SSD
> > > > > bluestore_prefer_deferred_size = 32768
> > > > > bluestore_compression_max_blob_size = 524288
> > > > > bluestore_deferred_batch_ops = 64
> > > > > bluestore_max_blob_size = 524288
> > > > > bluestore_min_alloc_size = 65536
> > > > > bluestore_throttle_cost_per_io = 670000
> > > > >
> > > > > ; Try to improve responsiveness when some disks are fully utilised
> > > > > osd_op_queue = wpq
> > > > > osd_op_queue_cut_off = high
> > > > >
> > > > > Hopefully someone else finds this useful.
> > > > > _______________________________________________
> > > > > ceph-users mailing list -- ceph-users(a)ceph.io
> > > > > To unsubscribe send an email to ceph-users-leave(a)ceph.io
> > > > --
> > > > Igor Fedotov
> > > > Ceph Lead Developer
> > > >
> > > > Looking for help with your Ceph cluster? Contact us at https://croit.io
> > > >
> > > > croit GmbH, Freseniusstr. 31h, 81247 Munich
> > > > CEO: Martin Verges - VAT-ID: DE310638492
> > > > Com. register: Amtsgericht Munich HRB 231263
> > > > Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> > > >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Has anybody run into a 'stuck' OSD service specification? I've tried
to delete it, but it's stuck in 'deleting' state, and has been for
quite some time (even prior to upgrade, on 15.2.x). This is on 16.2.3:
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
osd.osd_spec 504/525 <deleting> 12m label:osd
root@ceph01:/# ceph orch rm osd.osd_spec
Removed service osd.osd_spec
From active monitor:
debug 2021-05-06T23:14:48.909+0000 7f17d310b700 0
log_channel(cephadm) log [INF] : Remove service osd.osd_spec
Yet in ls, it's still there, same as above. --export on it:
root@ceph01:/# ceph orch ls osd.osd_spec --export
service_type: osd
service_id: osd_spec
service_name: osd.osd_spec
placement: {}
unmanaged: true
spec:
  filter_logic: AND
  objectstore: bluestore
We've tried --force, as well, with no luck.
To be clear, the --export even prior to delete looks nothing like the
actual service specification we're using, even after I re-apply it, so
something seems 'bugged'. Here's the OSD specification we're applying:
service_type: osd
service_id: osd_spec
placement:
  label: "osd"
data_devices:
  rotational: 1
db_devices:
  rotational: 0
db_slots: 12
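For reference, we apply it the usual way (file name is an example):
ceph orch apply -i osd_spec.yml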
I would appreciate any insight into how to clear this up (without
removing the actual OSDs, we're just wanting to apply the updated
service specification - we used to use host placement rules and are
switching to label-based).
Thanks,
David
Hi,
There is an operation "radosgw-admin bi purge" that removes all bucket
index objects for one bucket in the rados gateway.
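(i.e. something like "radosgw-admin bi purge --bucket=<bucket-name>", with the bucket name as a placeholder)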
What is the undo operation for this?
After this operation the bucket cannot be listed or removed any more.
Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
http://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
Hello. We are trying to resolve an issue with Ceph. Our OpenShift cluster is blocked and we have tried almost everything.
The current state is:
MDS_ALL_DOWN: 1 filesystem is offline
MDS_DAMAGE: 1 mds daemon damaged
FS_DEGRADED: 1 filesystem is degraded
MON_DISK_LOW: mon be is low on available space
RECENT_CRASH: 1 daemons have recently crashed
We tried performing:
cephfs-journal-tool --rank=gml-okd-cephfs:all event recover_dentries summary
cephfs-journal-tool --rank=gml-okd-cephfs:all journal reset
cephfs-table-tool gml-okd-cephfs:all reset session
ceph mds repaired 0
ceph config rm mds mds_verify_scatter
ceph config rm mds mds_debug_scatterstat
ceph tell mds.gml-okd-cephfs:0 scrub start / recursive,repair,force
After these commands, the MDS comes up, but an error appears:
MDS_READ_ONLY: 1 MDSs are read only
We also tried creating a new fs with a new metadata pool, and deleting and recreating the old fs with the same name using the old/new metadata pool.
We got rid of the errors, but the OpenShift cluster did not want to work with the old persistent volumes. The pods reported an error that they could not find the volume, although it was present and, moreover, was bound to its PVC.
Now we have rolled back the cluster and are trying to clear the MDS error. Any ideas on what to try?
Thanks
Hello,
Setting up our first Ceph cluster in the lab.
Rocky 8.6
Ceph quincy
Using curl install method
Following cephadm deployment steps
Everything works as expected, except that
ceph orch device ls --refresh
only displays the NVMe devices and not the SATA SSDs on the Ceph host.
Tried
sgdisk --zap-all /dev/sda
wipefs -a /dev/sda
Adding a SATA OSD manually, I get:
ceph orch daemon add osd ceph-a:data_devices=/dev/sda
Created no osd(s) on host ceph-a; already created?
An NVMe OSD gets added without issue.
I have looked in the volume log on the node and monitor log on the admin
server and have not seen anything that seems like an obvious clue.
I can see commands running successfully against /dev/sda in the logs.
Ideas?
Thanks,
cb
Hi to all, and thanks for sharing your experience with Ceph!
We have a simple setup with 9 OSDs, all HDD, across 3 nodes, 3 OSDs per node.
We started the cluster with the default, simple bootstrap to test how it works with HDDs. Then we decided to add SSDs and create a pool that uses only SSDs.
In order to have pools on HDD only and pools on SSD only, we edited the CRUSH map to add the class hdd.
We have not added anything for SSD yet, neither disks nor rules; we only added the device class to the existing rules (rough procedure below).
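Roughly, the edit was done with the usual decompile/recompile cycle (file names are just examples):
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: add "class hdd" to the "step take default" lines
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin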
Here are the rules before introducing class hdd:
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule erasure-code {
    id 1
    type erasure
    min_size 3
    max_size 4
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
rule erasure2_1 {
    id 2
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
rule erasure-pool.meta {
    id 3
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
rule erasure-pool.data {
    id 4
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
And here are the rules after the change:
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule erasure-code {
    id 1
    type erasure
    min_size 3
    max_size 4
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
rule erasure2_1 {
    id 2
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
rule erasure-pool.meta {
    id 3
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
rule erasure-pool.data {
    id 4
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
Just doing this caused all PGs bound to the EC pools to be reported as misplaced.
Is that correct, and why?
Best regards
Alessandro Bolgia
I've set up RadosGW with STS on top of my Ceph cluster. It works fine, but I'm also trying to set up authentication with an OpenID Connect provider. I'm having a hard time troubleshooting issues because the radosgw log file doesn't have much information in it. For example, when I try to use the `sts:AssumeRoleWithWebIdentity` API it fails with `{'Code': 'AccessDenied', ...}`, and all I see is the beast log showing an HTTP 403.
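For context, the failing call is essentially the following (role ARN, token file, and endpoint are placeholders from my lab setup):
```
aws sts assume-role-with-web-identity \
  --endpoint-url https://s3.lab \
  --role-arn "arn:aws:iam:::role/WebIdentityRole" \
  --role-session-name test-session \
  --web-identity-token "$(cat oidc_token.jwt)"
```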
Is there a way to enable more verbose logging so I can see what is failing and why I'm getting certain errors from the STS, S3, or IAM APIs?
My ceph.conf looks like this for each node (mildly redacted):
```
[client.radosgw.pve4]
host = pve4
keyring = /etc/pve/priv/ceph.client.radosgw.keyring
log file = /var/log/ceph/client.radosgw.$host.log
rgw_dns_name = s3.lab
rgw_frontends = beast endpoint=0.0.0.0:7480 ssl_endpoint=0.0.0.0:443 ssl_certificate=/etc/pve/priv/ceph/s3.lab.crt ssl_private_key=/etc/pve/priv/ceph/s3.lab.key
rgw_sts_key = 1111111111111111
rgw_s3_auth_use_sts = true
rgw_enable_apis = s3, s3website, admin, sts, iam
```
Hello
We are planning to start QE validation for the release next week.
If you have PRs that are to be part of it, please let us know by
adding "needs-qa" for the 'quincy' milestone ASAP.
Thx
YuriW