Hello,
We recently upgraded our cluster to version 18 and I've noticed some things
that I'd like feedback on before I go down a rabbit hole for
non-issues. cephadm was used for the upgrade and there were no issues.
The cluster has 56 OSDs, all spinners for now, and is only used for RBD images.
I've noticed a much larger number of active scrubs/deep scrubs. I don't
remember seeing that many before, usually around 20-30 scrubs and 15 deep
scrubs I think, whereas now I will have 70 scrubs and 70 deep scrubs
happening. I thought scrubs were limited to 1 per OSD, or am I
misunderstanding osd_max_scrubs? Everything on the cluster is currently at
default values.
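For reference, this is how I'm checking the values (osd.0 is just an
example daemon):

    ceph config get osd osd_max_scrubs        # value in the mon config db (or the default)
    ceph config show osd.0 osd_max_scrubs     # value the running daemon is actually using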
The other thing I've noticed since the upgrade is that any time backfill
happens, client IO drops, and neither is high to begin with: 30 MiB/s of
read/write client IO drops to 10-15 MiB/s with 200 MiB/s of backfill.
Before the upgrade, backfill would hit 500-600 MiB/s with the same 30 MiB/s
of client IO. I realize lots of things could affect this and it could be
unrelated to the cluster, and I'm still investigating, but I wanted to
mention it in case someone could recommend a check or knows of a change in
Reef that could cause this. The mclock profile is client_io.
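Similarly, I'm looking at the mclock settings like this (again, osd.0 is
just an example):

    ceph config show osd.0 osd_mclock_profile
    ceph config show osd.0 osd_mclock_max_capacity_iops_hdd   # IOPS capacity mclock is assuming for the OSD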
Thanks,
Curt
Hi,
Several years ago the diskprediction module was added to the MGR,
collecting SMART data from the OSDs.
There were local and cloud modes available, claiming different
accuracies. Now only the local mode remains.
What is the current status of that MGR module (diskprediction_local)?
We have a cluster where SMART data is available from the disks (tested
with smartctl and visible in the Ceph dashboard), but even with the
diskprediction_local module enabled, no health or lifetime info is shown.
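For reference, this is roughly what we did, with <devid> being one of the
IDs reported by "ceph device ls":

    ceph mgr module enable diskprediction_local
    ceph device ls
    ceph device get-health-metrics <devid>
    ceph device predict-life-expectancy <devid>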
Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
Hello guys,
We are seeing an unexpected mark on one of our pools. Do you guys
know what "removed_snaps_queue" means? We see some notation such as
"d5~3" after this tag. What does that mean? We tried to look into the docs,
but could not find anything meaningful.
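For context, the tag shows up in the pool details, i.e. in the output of
commands like:

    ceph osd pool ls detail
    ceph osd dump | grep removed_snaps_queue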
We are running Ceph Octopus on top of Ubuntu 18.04.
Hello,
We have been running into an issue installing the Pacific Windows RBD
driver on Windows Server 2016. It has no issues with either 2019 or 2022. It
looks like it fails at checkpoint creation. We are installing it as
admin. Has anyone seen this before or know of a solution?
The closest thing I can find as to why it won't install:
******* Product: D:\software\ceph_pacific_beta.msi
******* Action: INSTALL
******* CommandLine: **********
MSI (s) (CC:24) [12:31:30:315]: Machine policy value
'DisableUserInstalls' is 0
MSI (s) (CC:24) [12:31:30:315]: Note: 1: 2203 2:
C:\windows\Installer\inprogressinstallinfo.ipi 3: -2147287038
MSI (s) (CC:24) [12:31:30:315]: Machine policy value
'LimitSystemRestoreCheckpointing' is 0
MSI (s) (CC:24) [12:31:30:315]: Note: 1: 1715 2: Ceph for Windows
MSI (s) (CC:24) [12:31:30:315]: Calling SRSetRestorePoint API.
dwRestorePtType: 0, dwEventType: 102, llSequenceNumber: 0,
szDescription: "Installed Ceph for Windows".
MSI (s) (CC:24) [12:31:30:315]: The call to SRSetRestorePoint API
failed. Returned status: 0. GetLastError() returned: 127
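For reference, we are invoking the installer roughly like this to capture
the verbose log (the paths are just examples):

    msiexec /i D:\software\ceph_pacific_beta.msi /L*v C:\temp\ceph_install.log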
--
Robert Ford
GoDaddy | SRE III
9519020587
Phoenix, AZ
rford(a)godaddy.com
Hi,
I'm using rbd export and import to copy an image from one cluster to another,
and export-diff and import-diff to update the image in the remote cluster.
For example, "rbd --cluster local export-diff ... | rbd --cluster remote import-diff ...".
Sometimes the whole command gets stuck, and I can't tell which end of the pipe it's stuck on.
I did some searching; [1] seems to be the same issue and [2] is also related.
I wonder if there is any way to identify where it's stuck and get more debugging info.
Given [2], I'd suspect the import-diff is stuck, since the rbd client is importing to the
remote cluster. Could network latency be involved here? Ping latency is 7~8 ms.
Any comments are appreciated!
[1] https://bugs.launchpad.net/cinder/+bug/2031897
[2] https://stackoverflow.com/questions/69858763/ceph-rbd-import-hangs
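One thing I'm considering is raising the client debug level on each end
separately and sending it to a log file, roughly like this (log paths are
just examples):

    rbd --cluster local --debug-rbd 20 --log-file /tmp/rbd-export.log export-diff ... | \
      rbd --cluster remote --debug-rbd 20 --log-file /tmp/rbd-import.log import-diff ...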
Thanks!
Tony
Hi,
Say the source image has snapshots s1, s2 and s3.
I expect "export" to behave the same as "deep cp": when a snapshot is
specified together with "--export-format 2", only the specified snapshot and
the snapshots earlier than it should be exported.
What I see is that, no matter which snapshot I specify, "export" with
"--export-format 2" always exports the whole image with all snapshots.
Is this expected?
Could anyone help to clarify?
Thanks!
Tony
Hey ceph-users,
I was wondering if ceph-volume does anything with regard to the management
(creation, setting metadata, ...) of LVs which are used for the
DB / WAL of an OSD?
Reading the documentation at
https://docs.ceph.com/en/latest/man/8/ceph-volume/#new-db, it seems to
indicate that the LV to be used as e.g. the DB needs to be created manually
(without ceph-volume) and exist prior to using ceph-volume to move the
DB to that LV? I suppose the same is true for "ceph-volume lvm create"
or "ceph-volume lvm prepare" with "--block.db".
It's not that creating a few LVs is hard... it's just that ceph-volume
does apply some structure to the naming of the LVM VGs and LVs on the OSD
device and also adds metadata. That would then be up to the user, right?
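To illustrate, my reading of the docs is that the workflow would be roughly
the following (VG/LV names, sizes and the OSD ID/FSID below are made up):

    # create the VG/LV for the DB manually, outside of ceph-volume
    vgcreate ceph-db-vg /dev/nvme0n1
    lvcreate -L 60G -n osd-0-db ceph-db-vg
    # then attach it to the existing OSD
    ceph-volume lvm new-db --osd-id 0 --osd-fsid <osd-fsid> --target ceph-db-vg/osd-0-db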
Regards
Christian
Folks,
I have 3 nodes, each with 1x NVMe (1TB) and 3x 2.9TB SSDs. I'm trying to
build Ceph storage using cephadm on the Ubuntu 22.04 distro.
If I want to use the NVMe for journaling (WAL/DB) for my SSD-based OSDs,
how does cephadm handle it?
I'm trying to find a document that explains how to tell cephadm to deploy the
WAL/DB on the NVMe so it can speed up writes. Do I need to create a partition
for each OSD myself, or will cephadm create them?
Help me understand how this works and whether it is worth doing.
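From what I've found so far, I think this is done with an OSD service spec
that has device filters, something like the following (the size filters are
only my guess for this hardware), applied with "ceph orch apply -i osd-spec.yaml":

    service_type: osd
    service_id: ssd_osds_with_nvme_db
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        size: '2TB:'      # match the 2.9TB SSDs
      db_devices:
        size: ':1.5TB'    # match the 1TB NVMe
      db_slots: 3         # carve the NVMe into one DB LV per SSD OSD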