Dear all,
I am experimenting with Ceph as a replacement for the Andrew File System (https://en.wikipedia.org/wiki/Andrew_File_System). In my current setup, I am using AFS as a distributed filesystem for approximately 1000 users to store personal data and to let them access their home directories and other shared data from multiple locations across different buildings. Authentication is managed by Kerberos (plus an LDAP server). My goal is to replace AFS with CephFS but keep the current Kerberos database.
Right now I've managed to set up a test Ceph cluster with 6 nodes and 11 OSDs, and I can mount CephFS using the kernel driver + CephX.
However, from the Ceph docs, I can't tell whether this is a suitable use case for Ceph, since the default authentication method, CephX, doesn't provide a standard username/password authentication protocol. As far as I understand, it requires creating a keyring with a randomly generated secret, which can then be used to mount the filesystem with the CephFS kernel module (https://docs.ceph.com/en/latest/cephfs/mount-using-kernel-driver/#mounting-…).
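For reference, this is the per-user workflow as I understand it (only a rough sketch; the filesystem name "cephfs", the client name "alice" and the paths are just placeholders from my test setup):
ceph fs authorize cephfs client.alice /home/alice rw        # create a key limited to that path
ceph auth get-key client.alice > /etc/ceph/alice.secret     # extract the generated secret to a root-only file
mount -t ceph mon1:6789:/home/alice /mnt/alice -o name=alice,secretfile=/etc/ceph/alice.secret
So every user would need a keyring/secret distributed to them instead of typing a password, which is exactly what I am trying to avoid.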
As for the Kerberos integration, I found this page in the docs, https://docs.ceph.com/en/latest/dev/ceph_krb_auth/, which is still marked as a draft even though the last update was almost 2 years ago. From that page, I can't tell whether the current version of Ceph supports full integration with GSSAPI/Kerberos/LDAP. Since the docs only refer to keytab files, I was wondering whether Kerberos can only be used as an authentication protocol between Ceph monitors/OSDs/metadata servers and not for mounting the filesystem.
Therefore I am asking:
- whether anyone has tried Ceph for a similar use case,
- what the current status of the Kerberos integration is,
- whether there are alternatives to CephX for mounting CephFS with the kernel driver that use a username/password protocol.
Thank you and best regards,
Alessandro Piazza
Hello,
I am trying to debug slow operations in our cluster running Nautilus
14.2.13. I am analysing the output of the "ceph daemon osd.N dump_historic_ops"
command, and I am noticing that most of the time is spent between the "header_read"
and "throttled" events. For example, below is an operation that took ~160
seconds to complete, and almost all of that time was spent between these two
events.
Going by the descriptions at
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/#…
- header_read: When the messenger first started reading the message off the wire.
- throttled: When the messenger tried to acquire memory throttle space to read the message into memory.
- all_read: When the messenger finished reading the message off the wire.
Does this mean that the slowness I am observing is because the OSD's messaging
layer is not able to acquire the memory required for the message fast
enough?
The system has plenty of available memory (over 300 GB), so how do I tune the OSD
to perform better here?
Appreciate any feedback on this.
{
"description": "osd_op(client.405792.0:98299 3.313
3:c8c63189:::rbd_data.51b046b8b4567.0000000000000180:head [set-alloc-hint
object_size 4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ondisk+write+known_if_redirected e1073)",
"initiated_at": "2020-11-06 16:16:40.924448",
"age": 164.32155802899999,
"duration": 159.57800813,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"client_info": {
"client": "client.405792",
"client_addr": "v1:x.y.156.101:0/3840080733",
"tid": 98299
},
"events": [
{
"time": "2020-11-06 16:16:40.924448",
"event": "initiated"
},
{
"time": "2020-11-06 16:16:40.924448",
"event": "header_read"
},
{
"time": "2020-11-06 16:19:20.481593",
"event": "throttled"
},
{
"time": "2020-11-06 16:19:20.487331",
"event": "all_read"
},
{
"time": "2020-11-06 16:19:20.487333",
"event": "dispatched"
},
{
"time": "2020-11-06 16:19:20.487340",
"event": "queued_for_pg"
},
{
"time": "2020-11-06 16:19:20.487372",
"event": "reached_pg"
},
{
"time": "2020-11-06 16:19:20.487507",
"event": "started"
},
{
"time": "2020-11-06 16:19:20.487586",
"event": "waiting for subops from 1,94"
},
{
"time": "2020-11-06 16:19:20.491873",
"event": "op_commit"
},
{
"time": "2020-11-06 16:19:20.501164",
"event": "sub_op_commit_rec"
},
{
"time": "2020-11-06 16:19:20.502423",
"event": "sub_op_commit_rec"
},
{
"time": "2020-11-06 16:19:20.502438",
"event": "commit_sent"
},
{
"time": "2020-11-06 16:19:20.502456",
"event": "done"
}
]
}
}
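For what it's worth, this is how I was planning to check whether the message throttler is actually the limit; the counter and option names below are my assumptions from the docs and a perf dump, please correct me if they differ on Nautilus:
ceph daemon osd.N perf dump | grep -A 7 throttle-osd_client    # compare "max" against "val" / get_or_fail_fail
# if the byte throttle really is exhausted, I assume the cap can be raised, e.g. to 1 GiB:
ceph config set osd osd_client_message_size_cap 1073741824     # default is ~500 MB; may need an OSD restart to apply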
Hello,
I am running a Nautilus cluster. Is there a way to force the cluster to use
msgr-v1 instead of msgr-v2?
I am debugging an issue and it seems like it could be related to the msgr
layer, so I want to test it using msgr-v1.
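From reading the docs, my assumption is that disabling the v2 binding and restarting the daemons should force everything back to v1 (ms_bind_msgr2 is the option name I found; I have not verified this end to end):
ceph config set global ms_bind_msgr2 false
systemctl restart ceph-osd.target ceph-mon.target    # restart daemons so they re-advertise v1-only addresses
Is that the supported way, or is there something better?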
Thanks,
Shridhar
Hi Anthony
Thank you for your response.
I am looking at the "OSDs highest latency of write operations" panel of the
Grafana dashboard found in the Ceph source in
./monitoring/grafana/dashboards/osds-overview.json. It is a topk graph
that uses ceph_osd_op_w_latency_sum / ceph_osd_op_w_latency_count.
During normal operations we sometimes see latency spikes of 4 seconds at most,
but while bringing the rack back we saw a consistent increase in
latency for a lot of OSDs, into the 20-second range.
The cluster has 1139 OSDs in total, of which we had 5 x 9 = 45 in maintenance.
We did not throttle the backfilling process because we had successfully done the
same maintenance before on a few occasions for other racks without
problems. I will throttle backfills next time we have the same sort of
maintenance in the next rack.
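For the record, this is roughly what I intend to set before the next maintenance window (the values are just a conservative starting point; I am using "ceph config set" so that restarted OSDs pick the values up as they boot):
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep_hdd 0.1    # small sleep between recovery ops on HDD OSDs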
Can you elaborate a bit more on what exactly happens during the peering
process? I understand that the OSDs need to catch up. I also see that the
number of scrubs increases a lot when OSDs are brought back online. Is that
part of the peering process?
Thx, Marcel
> HDDs and concern for latency don't mix. That said, you don't specify
> what you mean by "latency". Does that mean average client write
> latency? median? P99? Something else?
>
> If you have a 15 node cluster and you took a third of it down for two
> hours then yeah you'll have a lot to catch up on when you come back.
> Bringing the nodes back one at a time can help, to spread out the peering.
> Did you throttle backfill/recovery tunables all the way down to 1? In a
> way that the restarted OSDs would use the throttled values as they boot?
>
>
>
>
>> On Nov 5, 2020, at 6:47 AM, Marcel Kuiper <ceph(a)mknet.nl> wrote:
>>
>> Hi
>>
>> We had a rack down for 2 hours for maintenance. 5 storage nodes were
>> involved. We had the noout and norebalance flags set before the start of the
>> maintenance.
>>
>> When the systems were brought back online we noticed a lot of OSDs with
>> high latency (in the 20-second range), mostly OSDs that are not on the
>> storage nodes that were down. It took about 20 minutes for things to
>> settle down.
>>
>> We're running Nautilus 14.2.11. The storage nodes run bluestore and have
>> 9 x 8 TB HDDs and 3 x SSDs for RocksDB, each SSD with 3 x 123 GB LVs.
>>
>> - Can anyone give a reason for these high latencies?
>> - Is there a way to avoid or lower these latencies when bringing systems
>> back into operation?
>>
>> Best Regards
>>
>> Marcel
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
Hi,
Has anybody tried to migrate data from Hadoop to Ceph?
If yes, what is the right way?
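(For context, the route I keep seeing suggested is Hadoop's distcp against the RGW S3 endpoint via the S3A connector; below is only a sketch with placeholder endpoint, bucket and credentials, not something we have verified ourselves:)
hadoop distcp \
  -Dfs.s3a.endpoint=http://rgw.example.com:7480 \
  -Dfs.s3a.access.key=ACCESS_KEY \
  -Dfs.s3a.secret.key=SECRET_KEY \
  -Dfs.s3a.path.style.access=true \
  hdfs://namenode:8020/data/warehouse s3a://target-bucket/warehouse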
Thank you
Hello List,
I think 3 of our 6 nodes have too little memory. This triggers the effect
that the nodes swap a lot and almost kill themselves. That
triggers OSDs to go down, which triggers a rebalance, which does not
really help :D
I already ordered more RAM. Can I temporarily turn down the RAM usage of
the OSDs so we don't get into that vicious cycle, and just suffer from small but
stable performance?
This is ceph version 15.2.5 with bluestore.
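What I am considering, assuming osd_memory_target is the right knob for bluestore OSDs and that ~2 GiB per OSD is still workable ("node4" below is just a hypothetical host name):
ceph config set osd osd_memory_target 2147483648              # 2 GiB per OSD (default is 4 GiB)
# I think a per-host override is also possible, if only the small nodes need it:
ceph config set osd/host:node4 osd_memory_target 2147483648
Would that be enough to stop the swapping until the new RAM arrives?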
Thanks,
Michael
Hi,
I've got a problem on an Octopus (15.2.3, Debian packages) install: the bucket's
S3 index shows a file:
s3cmd ls s3://upvid/255/38355 --recursive
2020-07-27 17:48 50584342
s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
radosgw-admin bi list also shows it
{
"type": "plain",
"idx":
"255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
"entry": { "name":
"255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
"instance": "", "ver": {
"pool": 11,
"epoch": 853842
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 50584342,
"mtime": "2020-07-27T17:48:27.203008Z",
"etag": "2b31cc8ce8b1fb92a5f65034f2d12581-7",
"storage_class": "",
"owner": "filmweb-app",
"owner_display_name": "filmweb app user",
"content_type": "",
"accounted_size": 50584342,
"user_data": "",
"appendable": "false"
},
"tag": "_3ubjaztglHXfZr05wZCFCPzebQf-ZFP",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}
},
but trying to download it via curl (I've set permissions to public) only gets me:
<?xml version="1.0"
encoding="UTF-8"?><Error><Code>NoSuchKey</Code><BucketName>upvid</BucketName><RequestId>tx0000000000000000e716d-005f1f14cb-e478a-pl-war1</RequestId><HostId>e478a-pl-war1-pl</HostId></Error>
(actually nonexistent files return Access Denied in the same context)
Same with other tools:
$ s3cmd get s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4 /tmp
download: 's3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' -> '/tmp/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' [1 of 1]
ERROR: S3 error: 404 (NoSuchKey)
Cluster health is OK.
Any ideas what is happening here?
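Things I was planning to check next (bucket/object names as above; the data pool name and the --check-objects flag are my assumptions from a default setup and the radosgw-admin help):
radosgw-admin object stat --bucket=upvid \
  --object=255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
radosgw-admin bucket check --bucket=upvid --check-objects
# see whether the backing rados objects are still in the data pool at all:
rados -p default.rgw.buckets.data ls | grep juz_nie_zyjesz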
--
Mariusz Gronczewski, Administrator
Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
NOC: [+48] 22 380 10 20
E: admin(a)efigence.com
Hello,
I am running version 14.2.13-1xenial and I am seeing a lot of logs from the msgr-v2
layer on the OSDs. Attached are some of the logs. It looks like these logs
are not controlled by the standard log level configuration, so I couldn't
find a way to disable them.
I am concerned that these logs may be hurting the system performance.
Any input on how I can disable these logs or lower their level?
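What I was going to try, assuming these messages fall under the "ms" debug subsystem (please correct me if msgr-v2 logs them regardless of that setting):
ceph tell 'osd.*' injectargs '--debug_ms 0/0'    # runtime change on all OSDs
ceph config set osd debug_ms 0/0                 # persist it across restarts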
Regards,
Shridhar
Hi,
we upgraded our Ceph cluster from 14.2.9 to 15.2.5, but OSDs on 15.2.5
are not joining the cluster after a restart. They hang with "1234 tick
checking mon for new map".
The systems are CentOS 7.8.
I tried everything I could think of, but nothing helped. The mons and
mgrs are 15.2.5.
Anyone have an idea what could cause this?
2020-11-05T15:18:13.142+0100 7f02fce81700 1 osd.114 pg_epoch: 1234
pg[5.c9s0( v 839'39171 (839'37583,839'39171] local-lis/les=1129/1130
n=495 ec=813/813 lis/c=1129/1081 les/c/f=1130/1082/0 sis=1210)
[NONE,NONE,NONE,NONE,124,NONE,157,NONE,NONE,149,NONE]p124(4) r=-1
lpr=1212 pi=[1081,1210)/1 crt=839'39171 lcod 0'0 mlcod 0'0 unknown
mbc={} ps=[1~3]] state<Start>: transitioning to Stray
2020-11-05T15:18:13.143+0100 7f02fce81700 1 osd.114 pg_epoch: 1234
pg[1.736( v 1105'9685 (815'8100,1105'9685] local-lis/les=1131/1132 n=6
ec=719/719 lis/c=1131/1019 les/c/f=1132/1020/0 sis=1210
pruub=11.135073235s) [] r=-1 lpr=1212 pi=[1019,1210)/1 crt=1105'9685
lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Stray
2020-11-05T15:18:13.143+0100 7f02fce81700 1 osd.114 pg_epoch: 1234
pg[4.3fd( v 1109'105689 (1109'102600,1109'105689]
local-lis/les=1131/1132 n=32 ec=740/740 lis/c=1131/1089
les/c/f=1132/1090/0 sis=1210 pruub=11.134586488s) [] r=-1 lpr=1212
pi=[1089,1210)/1 crt=1109'105689 lcod 0'0 mlcod 0'0 unknown mbc={}]
state<Start>: transitioning to Stray
2020-11-05T15:18:13.144+0100 7f02fce81700 1 osd.114 pg_epoch: 1234
pg[5.123s10( v 839'38976 (839'37410,839'38976] local-lis/les=1119/1120
n=498 ec=813/813 lis/c=1119/1064 les/c/f=1120/1065/0 sis=1210)
[NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE]p? r=-1 lpr=1212
pi=[1064,1210)/1 crt=839'38976 lcod 0'0 mlcod 0'0 unknown mbc={}
ps=[1~3]] state<Start>: transitioning to Stray
2020-11-05T15:18:13.145+0100 7f03140b5700 1 osd.114 1234
set_numa_affinity public network bond0 numa node 2
2020-11-05T15:18:13.145+0100 7f03140b5700 1 osd.114 1234
set_numa_affinity cluster network bond1 numa node 0
2020-11-05T15:18:13.145+0100 7f03140b5700 1 osd.114 1234
set_numa_affinity public and cluster network numa nodes do not match
2020-11-05T15:18:13.145+0100 7f03140b5700 1 osd.114 1234
set_numa_affinity not setting numa affinity
2020-11-05T15:18:14.072+0100 7f031713a700 1 osd.114 1234 tick checking
mon for new map
2020-11-05T15:18:44.397+0100 7f031713a700 1 osd.114 1234 tick checking
mon for new map
2020-11-05T15:19:15.370+0100 7f031713a700 1 osd.114 1234 tick checking
mon for new map
2020-11-05T15:19:46.206+0100 7f031713a700 1 osd.114 1234 tick checking
mon for new map
2020-11-05T15:20:16.400+0100 7f031713a700 1 osd.114 1234 tick checking
mon for new map
2020-11-05T15:20:46.483+0100 7f031713a700 1 osd.114 1234 tick checking
mon for new map
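In the meantime I am gathering more detail roughly like this (standard admin-socket commands, nothing version-specific as far as I know):
ceph versions                                   # confirm what the mons/mgrs/OSDs actually report
ceph daemon osd.114 status                      # the osdmap epoch the stuck OSD thinks it is on
ceph daemon osd.114 config set debug_monc 10    # more detail on the mon session and osdmap subscription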
Regards
--
Ingo Ebel
Human knowledge belongs to the world.
RadioTux.de - Internet-Radio rund um Linux und Open Source
## https://twitter.com/ingoebel
## https://keybase.io/savar
## Jabber: ingo.ebel(a)ingoebel.de
Hi
We had a rack down for 2 hours for maintenance. 5 storage nodes were
involved. We had the noout and norebalance flags set before the start of the
maintenance.
When the systems were brought back online we noticed a lot of OSDs with
high latency (in the 20-second range), mostly OSDs that are not on the
storage nodes that were down. It took about 20 minutes for things to
settle down.
We're running Nautilus 14.2.11. The storage nodes run bluestore and have
9 x 8 TB HDDs and 3 x SSDs for RocksDB, each SSD with 3 x 123 GB LVs.
- Can anyone give a reason for these high latencies?
- Is there a way to avoid or lower these latencies when bringing systems
back into operation?
Best Regards
Marcel