I have a CephFS in production based on 2 pools (data + metadata).
The data pool uses erasure coding with the following profile:
The metadata pool is replicated with size 3.
The CRUSH rules are as follows:
When we installed it, everything was in the same room, but we have since
split our cluster (6 servers, soon 8) across 2 rooms. We therefore updated
the crushmap by adding a room layer (with "ceph osd crush add-bucket
room1 room", etc.) and moved all our servers to the correct place in the
tree ("ceph osd crush move server1 room=room1", etc.).
Now we would like to change the rules to set the failure domain to room
instead of host (to be sure that, in case of a disaster in one of the
rooms, we still have a copy in the other).
What is the best strategy to do this?
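A sketch of the usual approach follows. The pool names, rule names, and EC parameters (k=4, m=2) below are illustrative placeholders, not taken from the thread; adjust them to the actual setup before running anything.

```shell
# For the replicated metadata pool: create a new rule whose failure
# domain is "room", then point the pool at it (triggers rebalancing):
ceph osd crush rule create-replicated replicated_room default room
ceph osd pool set cephfs_metadata crush_rule replicated_room

# For the EC data pool: the failure domain lives in the rule created
# from the erasure-code profile. Create a new profile and rule with the
# SAME k+m as the existing pool, then switch the pool's rule:
ceph osd erasure-code-profile set ec_room_profile \
    k=4 m=2 crush-failure-domain=room
ceph osd crush rule create-erasure ec_room_rule ec_room_profile
ceph osd pool set cephfs_data crush_rule ec_room_rule
```

One caveat: with only two rooms, an EC rule that chooses k+m independent rooms cannot place its chunks. In that case a hand-written rule (edit the decompiled crushmap) that first chooses the two rooms and then hosts within each room is typically needed. Also expect significant data movement when the rule changes.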
I have a query regarding objecter behaviour for homeless sessions. When
all OSDs holding copies of an object (say, with replication 3) are down,
the objecter assigns a homeless session (OSD=-1) to the client request.
Such a request makes a radosgw thread hang indefinitely, since the data
cannot be served while all required OSDs are down. With multiple similar
requests, all the radosgw threads get exhausted and hang indefinitely
waiting for the OSDs to come up. This causes complete service
unavailability, as no rgw threads are left to process valid requests
that could have been directed towards active PGs/OSDs.
I think the objecter or radosgw should terminate the request and return
early in the case of a homeless session. Let me know your thoughts on
this.
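Until such behaviour exists, a possible client-side mitigation is to bound how long a librados op may wait, so RGW threads error out instead of blocking forever. The option name below is the librados op timeout as I understand it (default 0, i.e. no timeout); verify it exists in your release and that the config section matches your RGW instances.

```shell
# Fail client ops after 30 seconds instead of waiting indefinitely
# on a homeless session (section/option names may need adjusting):
ceph config set client.rgw rados_osd_op_timeout 30
```

Note that a nonzero op timeout affects all ops for that client, so transiently slow (but recoverable) requests will also fail.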
I am trying to copy the contents of our storage server into a CephFS,
but am experiencing stability issues with my MDSs. The CephFS sits on
top of an erasure-coded pool with 5 MONs, 5 MDSs and a max_mds setting
of two. My Ceph cluster version is Nautilus, the client is Mimic and
uses the kernel module to mount the FS.
The index of filenames to copy is about 23GB and I am using 16 parallel
rsync processes over a 10G link to copy the files over to Ceph. This
works perfectly for a while, but then the MDSs start reporting oversized
caches (between 20 and 50GB, sometimes more) and an inode count between
1 and 4 million. The inode count in particular seems quite high to me:
each rsync job has 25k files to work with, so even if all 16 processes
opened all their files at the same time, I should not exceed 400k. Even if
I doubled this number to account for the client's page cache, I should get
nowhere near that number of inodes (a sync flush takes about 1 second).
Then after a few hours, my MDSs start failing with messages like this:
-21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map
is_healthy 'MDSRank' had timed out after 15
-20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping
beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
heartbeat is not healthy!
The standby nodes try to take over, but take forever to become active
and will fail as well eventually.
During my research, I found this related topic:
I tried everything suggested there, from increasing and lowering my cache
size to changing the number of segments, etc. I also played around with
the number of active MDSs: two appears to work best, whereas one cannot
keep up with the load and three seems to be the worst of all choices.
Do you have any ideas how I can improve the stability of my MDS daemons
so that they handle the load properly? A single 10G link is a toy, and we
could query the cluster with many more requests per second, but it is
already yielding to 16 rsync processes.
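For this kind of many-small-files workload, the knobs that usually matter are the MDS cache limit and the per-client cap limit. The values below are illustrative only, not recommendations, and `<id>` is a placeholder for an actual MDS name:

```shell
# Give the MDS an explicit memory budget (e.g. 16 GiB) so the cache
# warning threshold matches what the host can actually afford:
ceph config set mds mds_cache_memory_limit 17179869184

# Limit how many caps one client session may hold, so a single rsync
# client cannot pin millions of inodes in the MDS cache:
ceph config set mds mds_max_caps_per_client 500000

# Inspect how many caps each client session is actually holding:
ceph daemon mds.<id> session ls
```

If `session ls` shows one or two clients holding millions of caps, the problem is cap recall rather than raw load, and the cap limit above is the more relevant knob.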
In this <https://ceph.io/community/the-first-telemetry-results-are-in/>
blog post I find this statement:
"So, in our ideal world so far (assuming equal size OSDs), every OSD now
has the same number of PGs assigned."
My issue is that across all pools the number of PGs per OSD is not equal,
and I conclude that this is causing very unbalanced data placement.
As a matter of fact, the data stored on my 1.6TB HDDs in the pool
"hdb_backup" covers a range starting with
osd.228 size: 1.6 usage: 52.61 reweight: 1.00000
and ending with
osd.145 size: 1.6 usage: 81.11 reweight: 1.00000
This heavily impacts the amount of data that can be stored in the cluster.
The Ceph balancer is enabled, but it is not solving this issue.
root@ld3955:~# ceph balancer status
Therefore I would like to ask you for suggestions on how to address this
unbalanced data placement. I have attached pastebins for
- ceph osd df sorted by usage <https://pastebin.com/QLQHjA9g>
- ceph osd df tree <https://pastebin.com/SvhP2hp5>
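Assuming all clients are Luminous or newer, the upmap balancer mode usually equalizes PGs per OSD far better than the default crush-compat mode; alternatively, upmaps can be computed offline per pool. A sketch (the pool name comes from the thread, everything else is standard CLI):

```shell
# Switch the online balancer to upmap mode (requires luminous+ clients):
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on

# Or compute upmap commands offline for the problematic pool and
# review them before applying:
ceph osd getmap -o osd.map
osdmaptool osd.map --upmap upmap.sh --upmap-pool hdb_backup
```

The offline route is useful here because it lets you inspect exactly which PGs would move before committing to the data migration.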
My cluster has multiple crush roots representing the different disk types.
In addition, I have defined multiple pools, one pool for each disk type:
hdd, ssd, nvme.
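As a side note, since Luminous the separate-crush-root-per-media-type layout is no longer necessary: device classes let a single root serve per-media rules, which also tends to play better with the balancer. A minimal sketch (rule names are examples):

```shell
# One rule per device class, all under the same "default" root:
ceph osd crush rule create-replicated rep_hdd default host hdd
ceph osd crush rule create-replicated rep_ssd default host ssd
ceph osd crush rule create-replicated rep_nvme default host nvme
```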
We have a Ceph + CephFS cluster running Nautilus version 14.2.4.
We have Debian buster / Ubuntu bionic clients mounting CephFS in kernel mode without problems.
We now want to mount CephFS from our new CentOS 8 clients. Unfortunately, ceph-common is needed, but there are no packages available for el8 (only el7), and there is no way to install the el7 packages on CentOS 8 (missing dependencies).
Thus, despite the fact that CentOS 8 has a 4.18 kernel (required to use quotas, snapshots, etc.), it seems impossible to mount in kernel mode (good performance) and we still have to use the much slower FUSE mode.
Is it possible to work around this problem? Or when is it planned to provide (even as beta) the Ceph packages for CentOS 8?
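One possible workaround: the kernel client itself ships with the distro kernel; ceph-common only provides the `mount.ceph` helper, which is convenient but not strictly required. A plain mount with the monitor address and secret passed directly should work. The address, user name, and key below are placeholders:

```shell
# Mount CephFS without the mount.ceph helper; the secret= option
# passes the CephX key inline (secretfile= needs the helper):
sudo mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
    -o name=cephfs_user,secret=AQD...base64key...
```

The downside is that the key ends up on the command line (visible in `ps` and shell history), so this is best wrapped in a root-only script or fstab entry.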
This is the seventh bugfix release of the Mimic v13.2.x long term stable
release series. We recommend all Mimic users upgrade.
For the full release notes, see
- Cache trimming is now throttled. Dropping the MDS cache via the “ceph
tell mds.<foo> cache drop” command or large reductions in the cache size
will no longer cause service unavailability.
- Behavior with recalling caps has been significantly improved to not
attempt recalling too many caps at once, leading to instability. MDS with
a large cache (64GB+) should be more stable.
- MDS now provides a config option “mds_max_caps_per_client” (default:
1M) to limit the number of caps a client session may hold. Long running
client sessions with a large number of caps have been a source of
instability in the MDS when all of these caps need to be processed during
certain session events. It is recommended to not unnecessarily increase
this limit.
- The “mds_recall_state_timeout” config parameter has been removed. Late
client recall warnings are now generated based on the number of caps the
MDS has recalled which have not been released. The new config parameters
“mds_recall_warning_threshold” (default: 32K) and
“mds_recall_warning_decay_rate” (default: 60s) set the threshold for this
warning.
- The “cache drop” admin socket command has been removed. The “ceph tell
mds.X cache drop” remains.
- A health warning is now generated if the average osd heartbeat ping
time exceeds a configurable threshold for any of the intervals computed.
The OSD computes 1 minute, 5 minute and 15 minute intervals with average,
minimum and maximum values. New configuration option
“mon_warn_on_slow_ping_ratio” specifies a percentage of
“osd_heartbeat_grace” to determine the threshold. A value of zero disables
the warning. A new configuration option “mon_warn_on_slow_ping_time”,
specified in milliseconds, overrides the computed value, causing a warning
when OSD heartbeat pings take longer than the specified amount. A new
admin command “ceph daemon mgr.# dump_osd_network [threshold]” lists all
connections with a ping time longer than the specified threshold or value
determined by the config options, for the average for any of the 3
intervals. A new admin command “ceph daemon osd.# dump_osd_network
[threshold]” does the same but only includes heartbeats initiated by the
specified OSD.
- The default value of the
“osd_deep_scrub_large_omap_object_key_threshold” parameter has been
lowered to detect an object with a large number of omap keys more easily.
- radosgw-admin introduces two subcommands for managing expire-stale
objects that might be left behind after a bucket reshard in earlier
versions of RGW. One subcommand lists such objects and the other
deletes them. Read the troubleshooting section of the dynamic resharding
docs for details.
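The new network-ping dump described in the notes above can be exercised like this (the daemon ids are placeholders for real mgr/OSD names):

```shell
# List all connections whose average heartbeat ping exceeds 100 ms,
# across the whole cluster, as seen by the mgr:
ceph daemon mgr.a dump_osd_network 100

# Same report, but restricted to heartbeats initiated by one OSD:
ceph daemon osd.0 dump_osd_network 100
```

With no threshold argument, the value derived from the config options (“mon_warn_on_slow_ping_ratio” / “mon_warn_on_slow_ping_time”) is used.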
To better understand how our current users utilize Ceph, we conducted a
public community survey. This information guides the community in deciding
where to spend contribution effort for future development. The survey
results will remain anonymous and will appear only in aggregated form in
future Ceph Foundation publications to the community.
I'm pleased to announce, after much discussion on the Ceph dev mailing
list, that the community has formed the Ceph Survey for 2019.
Because the survey went out later than we would have liked, the deadline
will be January 31st, 2020 at 11:59 PT.
We have discussed using the Ceph telemetry module to collect this data in
the future, to save time for our users. Please let me know of any mistakes
that need to be corrected on the survey. Thanks!
Ceph Community Manager
494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
@Thingee <https://twitter.com/thingee> Thingee
I might have found the reason why several of our clusters (and maybe
Bryan's too) are getting stuck not trimming osdmaps.
It seems that when an osd fails, the min_last_epoch_clean gets stuck
forever (even long after HEALTH_OK), until the ceph-mons are
I've updated the ticket: https://tracker.ceph.com/issues/41154
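A quick way to check whether a cluster is affected, i.e. whether osdmap trimming is stuck, is to compare the oldest and newest committed osdmap epochs the mons are holding; a very large spread long after HEALTH_OK suggests trimming has stalled:

```shell
# Both fields appear in the output of "ceph report"; a healthy cluster
# keeps first_committed within a few hundred epochs of last_committed:
ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'
```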