I am trying to implement Jaeger tracing in RGW, and I need some advice
on which functions I should actually trace to get a good picture of
real cluster performance.
Till now I have been able to deduce the following:
1. I think we need to add tracing where `rgw` communicates with
librados (particularly in librgw, where the communication actually
happens). The HTTP request and response should not be traced, because
their timing depends on the client's internet speed.
2. In librgw, functions like this one
<https://github.com/ceph/ceph/blob/0360bea127397a41eb282a1eef9af4ff4477b9d4/…>
and its corresponding overloads, and also this function
<https://github.com/ceph/ceph/blob/0360bea127397a41eb282a1eef9af4ff4477b9d4/…>
and its corresponding overloads.
3. I see that pools are ultimately how data enters the CRUSH algorithm
when writing, so I think the creation of pools should also be taken
into account while tracing (creation of a pool should be the main
span, and these functions
<https://github.com/ceph/ceph/blob/0360bea127397a41eb282a1eef9af4ff4477b9d4/…>
should be its child spans).
Bucket operations like this one
<https://github.com/ceph/ceph/blob/0360bea127397a41eb282a1eef9af4ff4477b9d4/…>
do not require tracing because they are HTTP requests.
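For experimenting with whatever spans end up being added, a local
Jaeger backend can be stood up with the all-in-one container. This is
just a sketch, assuming Docker is available; port 16686 serves the UI
and 6831/udp accepts spans from the tracer:
# docker run -d --name jaeger -p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one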
Any kind of guidance will be of great help.
Thank You.
Is there another way to disable telemetry than using:
> ceph telemetry off
> Error EIO: Module 'telemetry' has experienced an error and cannot handle commands: cannot concatenate 'str' and 'UUID' objects
I'm attempting to get all my clusters out of a constant HEALTH_ERR state caused by either the above error or the telemetry endpoint being down.
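For reference, the telemetry module can also be disabled at the mgr
level, which bypasses the module's own (currently broken) command
handler. A sketch, assuming the module name used in recent releases:
# ceph mgr module disable telemetry
With the module unloaded it should no longer be able to raise the
error above.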
After upgrading one of our clusters from Luminous 12.2.12 to Nautilus 14.2.6, I am seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'). We found this because Prometheus was unable to report certain pieces of data, specifically OSD usage and OSD apply/commit latency, which are similar to issues people were having with previous versions of Nautilus.
Bryan Stillwell previously reported this on a separate 14.2.5 cluster we have here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/VW3GNVJGOOW…
That issue was resolved with the upgrade to 14.2.6.
We are seeing a similar issue on this other cluster, with a couple of differences:
This cluster has 1900+ OSDs in it; the previous one had 300+.
The top user is libceph-common instead of mmap:
4.86% libceph-common.so.0 [.] EventCenter::create_time_event
2.78% [kernel] [k] nmi
2.64% libstdc++.so.6.0.19 [.] __dynamic_cast
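For anyone who wants to reproduce the measurement: a per-thread and
per-symbol breakdown like the above can be gathered along these lines
(a sketch; the pid lookup may need adjusting if more than one mgr
process runs on the host):
# top -H -p $(pidof ceph-mgr)
# perf top -p $(pidof ceph-mgr)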
On all our other clusters that have been upgraded to 14.2.6 we are not experiencing this issue, the next largest being 800+ OSDs.
We feel this is related to the size of the cluster, similar to the previous report.
Anyone else experiencing this and/or can provide some direction on how to go about resolving this?
Thanks,
Joe
All;
We are in the middle of upgrading our primary cluster from 14.2.5 to 14.2.8. Our cluster utilizes 6 MDSs for 3 CephFS file systems. 3 MDSs are collocated with MON/MGR, and 3 MDSs are collocated with OSDs.
At this point we have upgraded all 3 of the MON/MDS/MGR servers. The MDS daemons on 2 of the 3 are currently not working, and we are seeing the log messages below.
2020-03-06 11:12:56.184 <> -1 mds.<daemon> unable to obtain rotating service keys; retrying
2020-03-06 11:13:26.184 <> 0 monclient: wait_auth_rotating timed out after 30
2020-03-06 11:13:26.184 <> -1 mds.<daemon> ERROR: failed to refresh rotating keys, maximum retry time reached.
2020-03-06 11:13:26.184 <> 1 mds.<daemon> suicide! Wanted state up:boot
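For what it's worth, "unable to obtain rotating service keys"
generally points at either clock skew between the daemons and the
mons, or a cephx keyring problem, so a couple of checks may be worth
running (a sketch; substitute the real daemon name for mds.<daemon>):
# ceph time-sync-status
# ceph auth get mds.<daemon>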
Any ideas?
Thank you,
Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
DHilsbos@PerformAir.com
www.PerformAir.com
Hi,
Does someone know whether the following hard disk has decent
performance in a Ceph cluster:
Micron 5210 ION 1.92TB, SATA (MTFDDAK1T9QDE-2AV1ZABYY)
The specs state that the disk has power-loss protection; however, I'd
nevertheless like to make sure that all goes well with this disk.
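For what it's worth, the usual way to sanity-check a drive for Ceph's
synchronous-write pattern is a single-threaded 4k sync write test with
fio, along these lines (a sketch; it writes to the raw device, so
point it at a scratch disk and replace /dev/sdX):
# fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=sync-write-test
Drives with working power-loss protection typically sustain thousands
of IOPS on this test; drives without it often drop to a few hundred.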
Best Regards,
Hermann
--
hermann@qwer.tk
PGP/GPG: 299893C7 (on keyservers)
If I do getfattr --only-values --absolute-names -d -m ceph.dir.rbytes
.snap/snap-6, then I get the same size as the unaltered source.
Shouldn't this be 0? And when the source changes, shouldn't this
increase? Or do I not understand this correctly?
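For reference, a side-by-side check looks like this (a sketch;
/mnt/cephfs/dir and snap-6 stand in for the real paths):
# getfattr --only-values --absolute-names -n ceph.dir.rbytes /mnt/cephfs/dir
# getfattr --only-values --absolute-names -n ceph.dir.rbytes /mnt/cephfs/dir/.snap/snap-6
Since a snapshot is a point-in-time view of the directory tree, its
rbytes reflects the tree's size at snapshot time rather than the space
the snapshot has newly consumed, so matching values on an unaltered
source seem plausible.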
Hi,
After upgrading to ceph 14.2.8 I started getting the "1 pool(s) have
non-power-of-two pg_num" warning.
The offending pool had 100 pgs. I tried to adjust it to 128 but it didn't work:
# ceph osd pool set default.rgw.offending pgp_num 128
Error EINVAL: specified pgp_num 128 > pg_num 100
It only worked downwards:
# ceph osd pool set default.rgw.offending pgp_num 64
set pool 7 pgp_num to 64
Is that expected?
And having reduced my pgp_num to 64 (isn't that too small?), should I
be prepared for problems in the future?
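For reference, pgp_num can never exceed pg_num, which is presumably
why only the downward change was accepted; going up to 128 would
require raising pg_num first, along these lines (pool name taken from
the commands above):
# ceph osd pool set default.rgw.offending pg_num 128
# ceph osd pool set default.rgw.offending pgp_num 128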
Regards,
Rodrigo Severo
Hello,
I have a problem with one particular bucket: aborting incomplete
multipart uploads does not remove them from the list, i.e. the storage
space is freed up, but 's3cmd multipart' still shows them. When I
retry the abort, I get a 404, which is correct.
The issue here is that I can only display the first 1000 of such
uploads, and thus can't remove the next ones.
Does anyone know how I can make the aborted ones disappear from the
listing (this was checked on a different bucket, where they disappear
as expected)?
Ceph version is 14.2.4.
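In case it helps, the 1000-entry limit is the per-request cap of the
S3 ListMultipartUploads call; the listing can be continued past it
with the key/upload-id markers, e.g. via awscli (a sketch; bucket name
and endpoint are placeholders):
# aws s3api list-multipart-uploads --bucket <bucket> --endpoint-url <rgw-endpoint>
# aws s3api list-multipart-uploads --bucket <bucket> --endpoint-url <rgw-endpoint> --key-marker <NextKeyMarker> --upload-id-marker <NextUploadIdMarker>
Each response includes NextKeyMarker and NextUploadIdMarker values to
feed into the following call.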
Kind regards,
Maks Kowalik