I am trying to implement Jaeger tracing in RGW, and I need some advice
on which functions I should actually trace to get a good picture of
real cluster performance.
Till now I have been able to deduce the following:
1. I think we need to add tracing where `rgw` communicates with
librados (particularly in librgw, where the communication actually
happens). The HTTP request and response should not be traced, because
their timing depends on the client's internet speed.
2. In librgw, functions like this one
<https://github.com/ceph/ceph/blob/0360bea127397a41eb282a1eef9af4ff4477b9d4/…>
and its corresponding overloads, and also this function
<https://github.com/ceph/ceph/blob/0360bea127397a41eb282a1eef9af4ff4477b9d4/…>
and its corresponding overloads.
3. I see that pools are ultimately how data enters the CRUSH algorithm
when writing, so I think the creation of pools should also be taken
into account while tracing (creation of a pool should be the main
span, and these functions
<https://github.com/ceph/ceph/blob/0360bea127397a41eb282a1eef9af4ff4477b9d4/…>
should be its child spans).
Bucket operations like this one
<https://github.com/ceph/ceph/blob/0360bea127397a41eb282a1eef9af4ff4477b9d4/…>
do not require tracing because they are HTTP requests.
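For experimenting with whatever spans end up being added, a local
Jaeger backend can be stood up with the all-in-one container. This is
just a sketch, assuming Docker is available; port 16686 serves the UI
and 6831/udp accepts spans from the tracer:
# docker run -d --name jaeger -p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one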
Any kind of guidance will be of great help.
Thank You.
Is there another way to disable telemetry than using:
> ceph telemetry off
> Error EIO: Module 'telemetry' has experienced an error and cannot handle commands: cannot concatenate 'str' and 'UUID' objects
I'm attempting to get all my clusters out of a constant HEALTH_ERR state caused by either the above error or the telemetry endpoint being down.
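For reference, the telemetry module can also be disabled at the mgr
level, which bypasses the module's own (currently broken) command
handler. A sketch, assuming the module name used in recent releases:
# ceph mgr module disable telemetry
With the module unloaded it should no longer be able to raise the
error above.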
After upgrading one of our clusters from Luminous 12.2.12 to Nautilus 14.2.6, I am seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'). We found this because Prometheus was unable to report certain pieces of data, specifically OSD usage and OSD apply/commit latency, which are similar to issues people were having with previous versions of Nautilus.
Bryan Stillwell previously reported this on a separate 14.2.5 cluster we have here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/VW3GNVJGOOW…
That issue was resolved with the upgrade to 14.2.6.
We are seeing a similar issue on this other cluster, with a couple of differences:
This cluster has 1900+ OSDs in it; the previous one had 300+.
The top user is libceph-common instead of mmap:
4.86% libceph-common.so.0 [.] EventCenter::create_time_event
2.78% [kernel] [k] nmi
2.64% libstdc++.so.6.0.19 [.] __dynamic_cast
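For anyone who wants to reproduce the measurement: a per-thread and
per-symbol breakdown like the above can be gathered along these lines
(a sketch; the pid lookup may need adjusting if more than one mgr
process runs on the host):
# top -H -p $(pidof ceph-mgr)
# perf top -p $(pidof ceph-mgr)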
On all our other clusters that have been upgraded to 14.2.6 we are not experiencing this issue, the next largest being 800+ OSDs.
We feel this is related to the size of the cluster, similar to the previous report.
Anyone else experiencing this and/or can provide some direction on how to go about resolving this?
Thanks,
Joe
All;
We are in the middle of upgrading our primary cluster from 14.2.5 to 14.2.8. Our cluster utilizes 6 MDSs for 3 CephFS file systems. 3 MDSs are collocated with MON/MGR, and 3 MDSs are collocated with OSDs.
At this point we have upgraded all 3 of the MON/MDS/MGR servers. The MDS daemons on 2 of the 3 are currently not working, and we are seeing the log messages below.
2020-03-06 11:12:56.184 <> -1 mds.<daemon> unable to obtain rotating service keys; retrying
2020-03-06 11:13:26.184 <> 0 monclient: wait_auth_rotating timed out after 30
2020-03-06 11:13:26.184 <> -1 mds.<daemon> ERROR: failed to refresh rotating keys, maximum retry time reached.
2020-03-06 11:13:26.184 <> 1 mds.<daemon> suicide! Wanted state up:boot
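For what it's worth, "unable to obtain rotating service keys"
generally points at either clock skew between the daemons and the
mons, or a cephx keyring problem, so a couple of checks may be worth
running (a sketch; substitute the real daemon name for mds.<daemon>):
# ceph time-sync-status
# ceph auth get mds.<daemon>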
Any ideas?
Thank you,
Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
DHilsbos@PerformAir.com
www.PerformAir.com
Hi,
Does someone know whether the following hard disk has decent
performance in a Ceph cluster:
Micron 5210 ION 1.92TB, SATA (MTFDDAK1T9QDE-2AV1ZABYY)
The specs state that the disk has power-loss protection; however, I'd
nevertheless like to make sure that all goes well with this disk.
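For what it's worth, the usual way to sanity-check a drive for Ceph's
synchronous-write pattern is a single-threaded 4k sync write test with
fio, along these lines (a sketch; it writes to the raw device, so
point it at a scratch disk and replace /dev/sdX):
# fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=sync-write-test
Drives with working power-loss protection typically sustain thousands
of IOPS on this test; drives without it often drop to a few hundred.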
Best Regards,
Hermann
--
hermann@qwer.tk
PGP/GPG: 299893C7 (on keyservers)
If I do getfattr --only-values --absolute-names -d -m ceph.dir.rbytes
.snap/snap-6, then I get the same size as the unaltered source.
Shouldn't this be 0? And when the source changes, shouldn't this
increase? Or do I not understand this correctly?
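For reference, a side-by-side check looks like this (a sketch;
/mnt/cephfs/dir and snap-6 stand in for the real paths):
# getfattr --only-values --absolute-names -n ceph.dir.rbytes /mnt/cephfs/dir
# getfattr --only-values --absolute-names -n ceph.dir.rbytes /mnt/cephfs/dir/.snap/snap-6
Since a snapshot is a point-in-time view of the directory tree, its
rbytes reflects the tree's size at snapshot time rather than the space
the snapshot has newly consumed, so matching values on an unaltered
source seem plausible.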
Hi,
After upgrading to ceph 14.2.8 I started getting the "1 pool(s) have
non-power-of-two pg_num" warning.
The offending pool had 100 pgs. I tried to adjust it to 128 but it didn't work:
# ceph osd pool set default.rgw.offending pgp_num 128
Error EINVAL: specified pgp_num 128 > pg_num 100
It only worked downwards:
# ceph osd pool set default.rgw.offending pgp_num 64
set pool 7 pgp_num to 64
Is that expected?
And having reduced my pgp_num to 64 (isn't that too small?), should I
be prepared for problems in the future?
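For reference, pgp_num can never exceed pg_num, which is presumably
why only the downward change was accepted; going up to 128 would
require raising pg_num first, along these lines (pool name taken from
the commands above):
# ceph osd pool set default.rgw.offending pg_num 128
# ceph osd pool set default.rgw.offending pgp_num 128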
Regards,
Rodrigo Severo
Hello,
I have a problem with one particular bucket: aborting incomplete
multipart uploads does not remove them from the list, i.e. the storage
space is freed up, but 's3cmd multipart' still shows them. When I
retry the abort, I get a 404, which is correct.
The issue here is that I can only display the first 1000 of such
uploads, and thus can't remove the next ones.
Does anyone know how I can make the aborted ones disappear from the
listing (this was checked on a different bucket, where they disappear
as expected)?
Ceph version is 14.2.4.
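In case it helps, the 1000-entry limit is the per-request cap of the
S3 ListMultipartUploads call; the listing can be continued past it
with the key/upload-id markers, e.g. via awscli (a sketch; bucket name
and endpoint are placeholders):
# aws s3api list-multipart-uploads --bucket <bucket> --endpoint-url <rgw-endpoint>
# aws s3api list-multipart-uploads --bucket <bucket> --endpoint-url <rgw-endpoint> --key-marker <NextKeyMarker> --upload-id-marker <NextUploadIdMarker>
Each response includes NextKeyMarker and NextUploadIdMarker values to
feed into the following call.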
Kind regards,
Maks Kowalik