I am wondering what the problem is with these error messages I am getting. I do not think it is really related to capability release, because those messages appear at a later time.
Two processes are affected by this: an rsync on a ceph-fuse mount and an rsync on an nfs-ganesha mount, running at the same time. The ceph-fuse mount gets the cache pressure notification at a later time.
I think everything starts to go wrong after the MDS reports on the xlock. This message is logged:
2020-06-13 03:38:36.981 7fb5edd82700 0 log_channel(cluster) log [WRN] :
slow request 244.377406 seconds old, received at 2020-06-13
03:34:32.604412: client_request(client.4019800:20354 setattr
mtime=2020-04-22 12:58:47.000000 atime=2020-06-13 03:34:32.000000
#0x100001b9177 2020-06-13 03:34:32.604119 caller_uid=500,
caller_gid=500{500,1,2,3,4,6,10,}) currently failed to xlock, waiting
From the charts here you can see that caps are climbing and inodes are
dropping.
https://snapshot.raintank.io/dashboard/snapshot/4ij6AF1JoDzdZNI6WzCyewkn7Oq…
(I am not entirely sure the units used are correct, and the ino/caps chart should have a dual y-axis.)
It looks like I can trigger this problem by starting multiple concurrent rsyncs on the nfs-ganesha mount. Every time, the same xlock seems to be listed in the MDS log.
2020-06-13 17:32:24.920 7fb5edd82700 0 log_channel(cluster) log [WRN] :
slow request 240.294505 seconds old, received at 2020-06-13
17:28:24.626608: client_request(client.4021284:2468 setattr
mtime=2020-04-22 12:58:47.000000 atime=2020-06-13 17:28:24.000000
#0x100001b9177 2020-06-13 17:28:24.626527 caller_uid=500,
caller_gid=500{500,1,2,3,4,6,10,}) currently failed to xlock, waiting
Question: I have seen this workaround/solution[1] offered a lot, but I do not understand why I have the xlock in the first place. When does one get an xlock?
I think that if I can prevent the xlock, I do not need to set the osd op queue options.
[1]
https://www.mail-archive.com/ceph-users@ceph.io/msg04421.html
The 'workaround' being:
osd op queue = wpq
osd op queue cut off = high
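For completeness, this is how I would apply those settings if I do end up needing them (just a sketch; as far as I know the osd_op_queue change only takes effect after an OSD restart, so please check the docs):

    # store the options centrally in the mon config database (Nautilus)
    ceph config set osd osd_op_queue wpq
    ceph config set osd osd_op_queue_cut_off high
    # restart the OSDs afterwards, e.g. per host:
    systemctl restart ceph-osd.target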
Sometimes the dashboard keeps loading when I switch to the 3 hours range, yet I do not see any load on the Prometheus server. Is anyone seeing something similar?
The Grafana dashboard 'rbd overview' is empty. Its queries use metrics such as 'ceph_rbd_write_ops' that do not exist in Prometheus (I think). Should I enable something more than just 'ceph mgr module enable prometheus'? I am on Nautilus.
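For context, this is all I have enabled so far, plus the per-image stats option that I suspect the 'rbd overview' dashboard needs (the rbd_stats_pools option name is from memory and the pool name is a placeholder, so please correct me if that is wrong):

    ceph mgr module enable prometheus
    # as far as I know, per-image metrics such as ceph_rbd_write_ops are only
    # exported for pools explicitly listed here:
    ceph config set mgr mgr/prometheus/rbd_stats_pools "rbd"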
Hi,
I have a 4-node Ceph Octopus cluster, each node with 12 disks, configured with CephFS (replica 2) and exposed via Samba to a Windows client over 10G.
When a user copies a folder containing thousands of 7 MB files from a Windows 10 client, we get a speed of only 40 MB/s.
The client and Ceph nodes are all connected at 10G. In the same setup, copying a 1 GB file from the Windows client to Samba gets 90 MB/s.
Is there any kernel or network tuning that needs to be done?
Any suggestions?
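For what it is worth, my next step is to compare the CephFS mount directly against the Samba path with something like fio, to see whether Samba or CephFS itself is the bottleneck (the mount path below is a placeholder):

    # write ~100 x 7 MB files straight onto the CephFS mount from a Linux client
    fio --name=smallfiles --directory=/mnt/cephfs/test \
        --rw=write --bs=1M --size=700M --nrfiles=100 \
        --numjobs=4 --group_reporting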
regards
Amudhan P
I had 5 of 10 OSDs fail on one of my nodes; after a reboot, the other 5 OSDs failed to start.
I have tried running ceph-disk activate-all and get back an error message about the cluster fsid not matching the one in /etc/ceph/ceph.conf.
Has anyone experienced an issue like this?
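If it helps, these are the places I have been comparing the fsid values (a sketch; it assumes the ceph-disk layout where activated OSD data partitions are mounted under /var/lib/ceph/osd):

    grep fsid /etc/ceph/ceph.conf            # fsid the local config expects
    ceph fsid                                 # fsid the cluster itself reports
    cat /var/lib/ceph/osd/ceph-*/ceph_fsid    # fsid recorded on each activated OSD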
Hi,
Total newbie question - I'm new to Ceph and am setting up a small test cluster. I've set up five nodes and can see the available drives, but I'm unsure exactly how to add an OSD and specify the locations for the WAL+DB.
Maybe my Google-fu is weak, but the only guides I can find refer to ceph-deploy which, as far as I can see, is deprecated. The guides that talk about using cephadm / ceph only mention adding a drive, not how to specify the WAL+DB locations.
I want to add HDDs as OSDs and put the WAL and DB onto separate LVs on an SSD. How?
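To make the question concrete, what I am hoping for is something along the lines of the following ceph-volume call (just a sketch; the device and VG/LV names are placeholders and I have not tried it yet):

    # one HDD as data, with DB and WAL on pre-created LVs on the SSD
    ceph-volume lvm create --bluestore \
        --data /dev/sdb \
        --block.db ssd_vg/db_sdb \
        --block.wal ssd_vg/wal_sdb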
Will
I am still having this issue with nfs-ganesha on Nautilus. I assume I do not have to change the nfs-ganesha configuration as mentioned here[1], since I did not have any issues with Luminous.
Does anyone with similar symptoms also have this 'xlock, waiting' message? How can it be resolved?
2020-06-11 14:14:19.403 7fb5f0587700 1 mds.a Updating MDS map to version 32035 from mon.0
2020-06-11 14:19:35.287 7fb5edd82700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 33.290954 secs
2020-06-11 14:19:35.287 7fb5edd82700 0 log_channel(cluster) log [WRN] : slow request 33.290953 seconds old, received at 2020-06-11 14:19:01.997415: client_request(client.4020294:34354 setattr mtime=2020-04-22 12:58:47.000000 atime=2020-06-11 14:19:02.000000 #0x100001b9177 2020-06-11 14:19:01.997149 caller_uid=500, caller_gid=500{500,1,2,3,4,6,10,}) currently failed to xlock, waiting
[1]
https://tracker.ceph.com/issues/44976#note-23
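In case someone wants to compare, this is what I run on the MDS host to see the stuck request and which client it belongs to (the daemon name 'a' matches my log above; adjust as needed):

    ceph daemon mds.a dump_ops_in_flight   # shows the blocked setattr and where it is waiting
    ceph daemon mds.a session ls           # maps the client.<id> from that op to a mount/host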
All;
We've been running our Ceph clusters (Nautilus / 14.2.8) for a while now (roughly 9 months), and I've become curious about the output of the "radosgw-admin sync status" command.
Here's the output from our secondary zone:
          realm <guid> (<realm-name>)
      zonegroup <guid> (<zonegroup name>)
           zone <guid> (<zone name>)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: <guid> (<master zone name>)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        3 shards are recovering
                        recovering shards: [39,41,66]
Radosgw is in use, so the active recovery for data doesn't really surprise me.
What I am curious about is these 2 lines:
full sync: 0/64 shards
full sync: 0/128 shards
Is this considered normal? If so, why are those lines present in this output?
Are they relevant to a different type of replication than what we are doing (I'm not aware of a different type of radosgw replication, but I'm not omniscient)?
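For completeness, the only other things I've looked at so far are the sync error list and the per-source data sync status (the zone name here is a placeholder):

    radosgw-admin sync error list
    radosgw-admin data sync status --source-zone=<master zone name>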
Thank you,
Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
DHilsbos(a)PerformAir.com
www.PerformAir.com