I have noticed that RBD-Mirror snapshot mode can only manage to take 1 snapshot per second. For example, I have 21 images in a single pool; when the schedule is triggered it takes the mirror snapshot of each image one at a time. It doesn't feel or look like a performance issue, as the OSDs are Micron 9300 PRO NVMes and each server has 2x Intel Platinum 8268 CPUs.
I was hoping that adding more RBD-Mirror instances would help, but that only seems to improve overall throughput. As it stands I have 3 RBD-Mirror instances running on each cluster.
We run a 30-minute snapshot schedule to our remote site, so at 1 snapshot per second I can only squeeze in 1800 mirror snaps every 30 minutes.
I was hoping there might be something I am missing with RBD-Mirror as far as scaling goes.
Maybe splitting the images across multiple pools would be a solution, and perhaps have other benefits?
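If multiple pools are the way to go, I imagine the schedules would just be set per pool, something like this (a sketch only; the pool names are hypothetical and I'm assuming each pool's schedule is processed independently):

rbd mirror snapshot schedule add --pool rbd-pool-a 30m
rbd mirror snapshot schedule add --pool rbd-pool-b 30m
# confirm which schedules are in place
rbd mirror snapshot schedule ls --recursive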
I have an rbd-mirror snapshot on one image that failed to replicate, and now it's not getting cleaned up.
The cause was my own fault, given the steps I took; I'm just trying to understand how to clean up/handle the situation.
Here is how I got into this situation (a rough command sketch follows the list):
- Created manual rbd snapshot on the image
- On the remote cluster I cloned the snapshot
- While the clone existed on the secondary cluster, I made the mistake of deleting the snapshot on the primary
- The subsequent mirror snapshot failed
- I then removed the clone
- The next mirror snapshot was successful but I was left with this mirror snapshot on the primary that I can't seem to get rid of
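Roughly, the command sequence was the following (a sketch; the manual snapshot and clone names here are made up, only the pool/image is real):

# on the primary cluster: manual snapshot
rbd snap create CephTestPool1/vm-100-disk-0@manual-snap
# on the remote cluster, once the snapshot was available there: clone it
rbd clone CephTestPool1/vm-100-disk-0@manual-snap CephTestPool1/vm-100-clone
# back on the primary: the mistake, removing the parent snapshot
rbd snap rm CephTestPool1/vm-100-disk-0@manual-snap
# later, on the remote cluster: remove the clone again
rbd rm CephTestPool1/vm-100-clone

And this is the current snapshot listing on the primary: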
root@Ccscephtest1:/var/log/ceph# rbd snap ls --all CephTestPool1/vm-100-disk-0
SNAPID NAME SIZE PROTECTED TIMESTAMP NAMESPACE
10082 .mirror.primary.90c53c21-6951-4218-9f07-9e983d490993.e0c63479-b09e-4c66-a65b-085b67a19907 2 TiB Thu Jan 21 07:10:09 2021 mirror (primary peer_uuids:[])
10243 .mirror.primary.90c53c21-6951-4218-9f07-9e983d490993.483e55aa-2f64-4bb0-ac0f-7b5aac59830e 2 TiB Thu Jan 21 07:30:08 2021 mirror (primary peer_uuids:[debf975b-ebb8-432c-a94a-d3b101e0f770])
I have tried deleting the snap with "rbd snap rm" like normal user-created snaps, but no luck. Is there any way to force the deletion?
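For reference, the invocation I tried looked roughly like this, using the leftover snapshot name from the listing above:

rbd snap rm CephTestPool1/vm-100-disk-0@.mirror.primary.90c53c21-6951-4218-9f07-9e983d490993.e0c63479-b09e-4c66-a65b-085b67a19907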
I have a hell of a question: how do I put a cluster into HEALTH_ERR status
without consequences?
I'm working on CI tests and I need to check whether our reaction to
HEALTH_ERR is correct. For this I need to take an empty cluster with an
empty pool and do something to it, preferably something quick and reversible.
For HEALTH_WARN the best thing I found is to change the pool size to 1; it
raises the "1 pool(s) have no replicas configured" warning almost instantly,
and it can be reverted very quickly for an empty pool (sketched below).
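Something like this, assuming a hypothetical empty pool called "testpool":

# drop to a single replica -> HEALTH_WARN "1 pool(s) have no replicas configured"
ceph osd pool set testpool size 1
ceph health detail
# revert; for an empty pool the cluster returns to HEALTH_OK almost immediately
ceph osd pool set testpool size 3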
But HEALTH_ERR is a bit more tricky. Any ideas?
Hi all,
we noticed a massive drop in requests per second a cephfs client is able
to perform when we do a recursive chown over a directory with millions
of files. As soon as we see about 170k caps on the MDS, the client
performance drops from about 660 reqs/sec to 70 reqs/sec.
When we then clear dentries and inodes using "sync; echo 2 >
/proc/sys/vm/drop_caches" on the client, the requests go up to ~660 again,
only to drop once more when reaching about 170k caps.
See the attached screenshots.
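For context, this is roughly how we watch the cap counts and reset them (a sketch; "mds1" is a placeholder for the actual MDS daemon name):

# on the MDS host: list client sessions and their cap counts
ceph daemon mds.mds1 session ls | grep -E '"id"|num_caps'
# on the client: drop dentries/inodes, which releases the caps
sync; echo 2 > /proc/sys/vm/drop_caches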
When we stop the chown process for a while and restart it ~25 min later,
it still performs very slowly and the MDS reqs/sec remain low
(~60/sec). Clearing the cache (dentries and inodes) on the client
restores the performance again.
When we run the same chown on another client in parallel, it starts off
with reasonably good performance (while the first client is still performing
poorly), but eventually it gets slow again, just like the first client.
Can someone comment on this and explain it?
How can this be solved, so that the performance remains stable?
We are running ceph version 14.2.16
(762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable) on all ceph
cluster nodes and on all clients.
The OS on all ceph cluster nodes and client nodes is CentOS 7.9. The
filesystem is mounted via CentOS kernel client (latest official version).
Thanks in advance.
~Best
Dietmar
--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Institute of Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rieder(a)i-med.ac.at
Web: http://www.icbi.at
Hi,
I'm looking at the SUSE documentation regarding their option to run RBD on Windows.
I want to try it on a Windows Server 2019 VM, but I got this error:
PS C:\Users\$admin$> rbd create image01 --size 4096 --pool windowstest -m 10.118.199.248,10.118.199.249,10.118.199.250 --id windowstest --keyring C:/ProgramData/ceph/keyring
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: error parsing file C:/ProgramData/ceph/keyring: cannot parse buffer: Malformed input
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: failed to load C:/ProgramData/ceph/keyring: (5) Input/output error
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: error parsing file C:/ProgramData/ceph/keyring: cannot parse buffer: Malformed input
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: failed to load C:/ProgramData/ceph/keyring: (5) Input/output error
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: error parsing file C:/ProgramData/ceph/keyring: cannot parse buffer: Malformed input
rbd: couldn't connect to the cluster!
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: failed to load C:/ProgramData/ceph/keyring: (5) Input/output error
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 monclient: keyring not found
This is the keyring file:
[client.windowstest]
key = AQBJ7wdgdWLIMhAAle+/pg+26XvWsDv8PyPcvw==
caps mon = "allow rw"
caps osd = "allow rwx pool=windowstest"
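For reference, a keyring entry like this would typically be generated on the cluster side with something along these lines (a sketch; I'm not certain this is exactly how it was created here):

# create/fetch the client and write its keyring to a file
ceph auth get-or-create client.windowstest mon 'allow rw' osd 'allow rwx pool=windowstest' -o keyring
# then copy the resulting file to C:\ProgramData\ceph\keyring as plain ASCII/UTF-8 (no BOM)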
And this is the ceph.conf file on the windows client:
[global]
log to stderr = true
run dir = C:/ProgramData/ceph
crash dir = C:/ProgramData/ceph
[client]
keyring = C:/ProgramData/ceph/keyring
log file = C:/ProgramData/ceph/$name.$pid.log
admin socket = C:/ProgramData/ceph/$name.$pid.asok
[global]
mon host = [v2:10.118.199.231:3300,v1:10.118.199.231:6789] [v2:10.118.199.232:3300,v1:10.118.199.232:6789] [v2:10.118.199.233:3300,v1:10.118.199.233:6789]
Commands I've tried:
rbd create image01 --size 4096 --pool windowstest -m 10.118.199.248,10.118.199.249,10.118.199.250 --id windowstest --keyring C:/ProgramData/ceph/keyring
rbd create image01 --size 4096 --pool windowstest -m 10.118.199.248,10.118.199.249,10.118.199.250 --id windowstest --keyring C:\ProgramData\ceph\keyring
rbd create image01 --size 4096 --pool windowstest -m 10.118.199.248,10.118.199.249,10.118.199.250 --id windowstest --keyring "C:/ProgramData/ceph/keyring"
rbd create image01 --size 4096 --pool windowstest -m 10.118.199.248,10.118.199.249,10.118.199.250 --id windowstest --keyring "C:\ProgramData\ceph\keyring"
rbd create blank_image --size=1G
The ceph version is luminous 12.2.8.
I don't know why it can't find or parse the keyring.
Thank you.
Dear Ceph users,
We have a slightly dated Luminous cluster in which dynamic
bucket resharding was accidentally enabled due to a misconfiguration (we don't use
this feature, since the number of objects per bucket is capped).
This resulted in the creation of the RGW reshard pool with lots of bucket
reshard lock objects (we have thousands of buckets), which is leading to
clutter. Also, we've run into a malloc failure issue (similar to
https://tracker.ceph.com/issues/21826 but not the same, since we already use
tcmalloc) on the OSDs on which these reshard lock objects are located, and
we'd like to reduce the number of objects that have to be copied out.
My question to the community is: "Is it safe to discard the bucket reshard
lock objects if we know that we'll never use the reshard feature on the
cluster again?".
The RGWs performed resharding several months ago due to a misconfiguration
and we already have stale bucket instances which are due for cleanup on
this cluster.
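In case it helps frame the question, this is roughly how we look at the reshard state (a sketch; the pool name default.rgw.reshard is an assumption based on default zone naming, and the stale-instances subcommand needs a recent enough Luminous point release):

radosgw-admin reshard list                    # pending reshard entries, if any
radosgw-admin reshard stale-instances list    # stale bucket instances left behind
rados -p default.rgw.reshard ls | head        # the reshard lock objects in question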
Thanks,
Prasad Krishnan
Hello Cephers,
On a new cluster, I only have 2 RBD block images, and the Dashboard
doesn't manage to list them correctly.
I get this message:
Warning
Displaying previously cached data for pool veeam-repos.
Sometimes it disappears, but as soon as I reload or return to the listing
page, it's back.
What I've seen is a high CPU load from ceph-mgr on the active
manager.
And also stack traces like this:
2021-01-15T14:41:12.061+0100 7f7f3fec4700 0 [dashboard ERROR exception]
Dashboard Exception
Traceback (most recent call last):
File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 94,
in dashboard_exception_handler
return handler(*args, **kwargs)
File "/usr/lib/python3/dist-packages/cherrypy/_cpdispatch.py", line
60, in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/__init__.py", line
666, in inner
ret = func(*args, **kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/__init__.py", line
861, in wrapper
return func(*vpath, **params)
File "/usr/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/usr/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/usr/share/ceph/mgr/dashboard/controllers/rbd.py", line 86, in
list
return self._rbd_list(pool_name)
File "/usr/share/ceph/mgr/dashboard/controllers/rbd.py", line 76, in
_rbd_list
status, value = RbdService.rbd_pool_list(pool)
File "/usr/share/ceph/mgr/dashboard/tools.py", line 254, in wrapper
return rvc.run(fn, args, kwargs)
File "/usr/share/ceph/mgr/dashboard/tools.py", line 242, in run
raise ViewCacheNoDataException()
dashboard.exceptions.ViewCacheNoDataException: ViewCache: unable to
retrieve data
Also this one, since I changed some features back and forth on one image:
2021-01-18T11:13:26.383+0100 7f00199ca700 0 [dashboard ERROR
frontend.error]
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/#/block/rbd/edit/veeam-
repos%252Fveeam-repo2-vol1): Cannot read property 'features_name' of
undefined
TypeError: Cannot read property 'features_name' of undefined
at
https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/1.9e79c41bbaed982a50af.js:1:121…
at Array.forEach (<anonymous>)
at R.deepBoxCheck
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/1.9e79c41bbaed982a50af.js:1:120…)
at R.featureFormUpdate
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/1.9e79c41bbaed982a50af.js:1:121…)
at
https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/1.9e79c41bbaed982a50af.js:1:119…
at d.a [as _next]
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/main.c43d13b597196a5f022f.js:2:…)
at d.__tryOrUnsub
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/main.c43d13b597196a5f022f.js:2:…)
at d.next
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/main.c43d13b597196a5f022f.js:2:…)
at l._next
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/main.c43d13b597196a5f022f.js:2:…)
at l.next
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/main.c43d13b597196a5f022f.js:2:…)
But that's perhaps just because I opened an Edit window on the image and it
does not have the data.
The Edit window is empty and I can't edit anything; in particular, I want
to resize the image.
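(For reference, the equivalent CLI operation would presumably be something like the following, on the image from the error above; the 10T target size is just a made-up example value:

rbd resize veeam-repos/veeam-repo2-vol1 --size 10T
)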
Finally, I found a similar bug already registered here, but it seems
resolved for the reporter:
https://tracker.ceph.com/issues/45308
Hmm, as I read it more carefully, it's about CephFS, not RBD, in that
stack trace...
Am I the only one?
Is there a workaround?
I really need the dashboard to be usable, because I want to delegate as
many operations as possible to people who shouldn't need the rights to
connect to the machines and use the CLI, and who don't have the skills to
do so anyway.
Have a good day,
--
Gilles