September 2019 - ceph-users

by Ramanathan S

Hi all, I just had created a ceph cluster to use cephfs. When i create the a ceph fs pool i get the filesystem below error. # ceph osd pool create cephfs_data 128 pool 'cephfs_data' created # ceph osd pool create cephfs_metadata 128 pool 'cephfs_metadata' created # ceph fs new cephfs cephfs_metadata cephfs_data new fs with metadata pool 6 and data pool 5 # ceph -s cluster: id: 1c27def45-f0f9-494d-sfke-eb4323432fd health: HEALTH_ERR 1 filesystem is offline 1 filesystem is online with fewer MDS than max_mds services: mon: 2 daemons, quorum ceph-mon01,ceph-mon02 mgr: ceph-adm01(active) mds: cephfs-0/0/1 up osd: 12 osds: 12 up, 12 in data: pools: 2 pools, 256 pgs objects: 0 objects, 0 B usage: 12 GiB used, 588 GiB / 600 GiB avail pgs: 256 active+clean but when i check the max_mds for the ceph fs it says 1 # ceph fs get cephfs | grep max_mds max_mds 1 Let anyone know what am i missing here? Any inputs is much appreciated. Regards, Ram Ceph-explorer..

3 weeks, 3 days

3
3
0 0

MDS rejects clients causing hanging mountpoint on linux kernel client

by Florian Pritz

Hi, We are running a ceph cluster on Ubuntu 18.04 machines with ceph 14.2.4. Our cephfs clients are using the kernel module and we have noticed that some of them are sometimes (at least once) hanging after an MDS restart. The only way to resolve this is to unmount and remount the mountpoint, or reboot the machine if unmounting is not possible. After some investigation, the problem seems to be that the MDS denies reconnect attempts from some clients during restart even though the reconnect interval is not yet reached. In particular, I see the following log entries. Note that there are supposedly 9 sessions. 9 clients reconnect (one client has two mountpoints) and then two more clients reconnect after the MDS already logged "reconnect_done". These two clients were hanging after the event. The kernel log of one of them is shown below too. Running `ceph tell mds.0 client ls` after the clients have been rebooted/remounted also shows 11 clients instead of 9. Do you have any ideas what is wrong here and how it could be fixed? I'm guessing that the issue is that the MDS apparently has an incorrect session count and stops the reconnect process to soon. Is this indeed a bug and if so, do you know what is broken? Regardless, I also think that the kernel should be able to deal with a denied reconnect and that it should try again later. Yet, even after 10 minutes, the kernel does not attempt to reconnect. Is this a known issue or maybe fixed in newer kernels? If not, is there a chance to get this fixed? Thanks, Florian MDS log: > 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.server reconnect_clients -- 9 sessions > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21019865 v1:10.1.7.22:0/3752632657 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21020717 v1:10.1.7.115:0/2841046616 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24171153 v1:10.1.7.243:0/1127767158 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.23978093 v1:10.1.4.71:0/824226283 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24209569 v1:10.1.4.157:0/1271865906 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190930 v1:10.1.4.240:0/3195698606 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190912 v1:10.1.4.146:0/852604154 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.59 reconnect_done > 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close > 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45) > 2019-09-26 16:08:27.483 7f9fe1087700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.49:0/1483641729 conn(0x55af50053f80 0x55af50140800 :6801 s=OPENED pgs=21 cs=1 l=0).fault server, going to standby > 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close > 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.30586072 v1:10.1.67.140:0/3664284158 after 0.00400002 (allowed interval 45) > 2019-09-26 16:08:27.483 7f9fe1888700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.140:0/3664284158 conn(0x55af50055600 0x55af50143000 :6801 s=OPENED pgs=8 cs=1 l=0).fault server, going to standby Hanging client (10.1.67.49) kernel log: > 2019-09-26T16:08:27.481676+02:00 hostnamefoo kernel: [708596.227148] ceph: mds0 reconnect start > 2019-09-26T16:08:27.488943+02:00 hostnamefoo kernel: [708596.233145] ceph: mds0 reconnect denied > 2019-09-26T16:16:17.541041+02:00 hostnamefoo kernel: [709066.287601] libceph: mds0 10.1.4.203:6801 socket closed (con state NEGOTIATING) > 2019-09-26T16:16:18.068934+02:00 hostnamefoo kernel: [709066.813064] ceph: mds0 rejected session > 2019-09-26T16:16:18.068955+02:00 hostnamefoo kernel: [709066.814843] ceph: get_quota_realm: ino (10000000008.fffffffffffffffe) null i_snap_realm

3 years, 2 months

3
6
0 0

ceph osd set-require-min-compat-client jewel failure

by 潘东元

hi,every one, my ceph version 12.2.12，I want to set require min compat client luminous,I use command #ceph osd set-require-min-compat-client luminous but ceph report:Error EPERM: cannot set require_min_compat_client to luminous: 4 connected client(s) look like jewel (missing 0xa00000000200000); add --yes-i-really-mean-it to do it anyway [root@node-1 ~]# ceph features { "mon": { "group": { "features": "0x3ffddff8eeacfffb", "release": "luminous", "num": 3 } }, "osd": { "group": { "features": "0x3ffddff8eeacfffb", "release": "luminous", "num": 15 } }, "client": { "group": { "features": "0x40106b84a842a52", "release": "jewel", "num": 4 }, "group": { "features": "0x3ffddff8eeacfffb", "release": "luminous", "num": 168 } } } so,I run command: [root@node-1 gyt]# ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it set require_min_compat_client to luminous but now,I want to set require min compat client jewel,I use command： [root@node-1 gyt]# ceph osd set-require-min-compat-client jewel Error EPERM: osdmap current utilizes features that require luminous; cannot set require_min_compat_client below that to jewel what‘s the way we are set luminous chang to jewel？

3 years, 11 months

3
2
0 0

osd_pg_create causing slow requests in Nautilus

by Bryan Stillwell

We've run into a problem on our test cluster this afternoon which is running Nautilus (14.2.2). It seems that any time PGs move on the cluster (from marking an OSD down, setting the primary-affinity to 0, or by using the balancer), a large number of the OSDs in the cluster peg the CPU cores they're running on for a while which causes slow requests. From what I can tell it appears to be related to slow peering caused by osd_pg_create() taking a long time. This was seen on quite a few OSDs while waiting for peering to complete: # ceph daemon osd.3 ops { "ops": [ { "description": "osd_pg_create(e179061 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)", "initiated_at": "2019-08-27 14:34:46.556413", "age": 318.25234538000001, "duration": 318.25241895300002, "type_data": { "flag_point": "started", "events": [ { "time": "2019-08-27 14:34:46.556413", "event": "initiated" }, { "time": "2019-08-27 14:34:46.556413", "event": "header_read" }, { "time": "2019-08-27 14:34:46.556299", "event": "throttled" }, { "time": "2019-08-27 14:34:46.556456", "event": "all_read" }, { "time": "2019-08-27 14:35:12.456901", "event": "dispatched" }, { "time": "2019-08-27 14:35:12.456903", "event": "wait for new map" }, { "time": "2019-08-27 14:40:01.292346", "event": "started" } ] } }, ...snip... { "description": "osd_pg_create(e179066 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)", "initiated_at": "2019-08-27 14:35:09.908567", "age": 294.900191001, "duration": 294.90068416899999, "type_data": { "flag_point": "delayed", "events": [ { "time": "2019-08-27 14:35:09.908567", "event": "initiated" }, { "time": "2019-08-27 14:35:09.908567", "event": "header_read" }, { "time": "2019-08-27 14:35:09.908520", "event": "throttled" }, { "time": "2019-08-27 14:35:09.908617", "event": "all_read" }, { "time": "2019-08-27 14:35:12.456921", "event": "dispatched" }, { "time": "2019-08-27 14:35:12.456923", "event": "wait for new map" } ] } } ], "num_ops": 6 } That "wait for new map" message made us think something was getting hung up on the monitors, so we restarted them all without any luck. I'll keep investigating, but so far my google searches aren't pulling anything up so I wanted to see if anyone else is running into this? Thanks, Bryan

4 years

6
16
0 0

official ceph.com buster builds?

by Chad W Seys

Hi all, Am I missing the ceph buster build built by ceph.com ? http://download.ceph.com/debian-nautilus/dists/ Should I be using the Croit supplied builds? Thanks! Chad.

4 years, 2 months

2
2
0 0

MDS failing under load with large cache sizes

by Janek Bevendorff

Hi, I am trying to copy the contents of our storage server into a CephFS, but am experiencing stability issues with my MDSs. The CephFS sits on top of an erasure-coded pool with 5 MONs, 5 MDSs and a max_mds setting of two. My Ceph cluster version is Nautilus, the client is Mimic and uses the kernel module to mount the FS. The index of filenames to copy is about 23GB and I am using 16 parallel rsync processes over a 10G link to copy the files over to Ceph. This works perfectly for a while, but then the MDSs start reporting oversized caches (between 20 and 50GB, sometimes more) and an inode count between 1 and 4 million. Particularly the Inode count seems quite high to me. Each rsync job has 25k files to work with, so if all 16 processes open all their files at the same time, I should not exceed 400k. Even if I double this number to account for the client's page cache, I should get nowhere near that number of inodes (a sync flush takes about 1 second). Then after a few hours, my MDSs start failing with messages like this: -21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 -20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal heartbeat is not healthy! The standby nodes try to take over, but take forever to become active and will fail as well eventually. During my research, I found this related topic: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015959.html, but I tried everything in there from increasing to lowering my cache size, the number of segments etc. I also played around with the number of active MDSs and two appears to work the best, whereas one cannot keep up with the load and three seems to be the worst of all choices. Do you have any ideas how I can improve the stability of my MDS damons to handle the load properly? single 10G link is a toy and we could query the cluster with a lot more requests per second, but it's already yielding to 16 rsync processes. Thanks

4 years, 3 months

4
33
0 0

Re: Authentication failure at radosgw for presigned urls

by Biswajeet Patra

++ceph-users On Fri, Sep 20, 2019 at 4:48 PM Biswajeet Patra < biswajeet.patra(a)flipkart.com> wrote: > Hi, > We recently faced an issue with radosgw authentication of presigned urls. > The presigned url generated by client to download object fails at radosgw > with http error 403 i.e SignatureDoesNotMatch. > > The radosgw computes the signature (*v2 signature in this case*) using > the s3 specification > https://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html. > In this process, a string_to_sign is created by concatenating selected > elements from requests which is then authenticated using hmac to create the > final signature. If this signature matches with the client signature, the > authentication is successful or else it fails with SignatureDoesNotMatch > error. As part of the StringToSign, the CanonicalizedAmzHeaders should only > include headers that start with "x-amz-" and ignore other headers. But in > the radosgw code, there are other meta prefixes that are checked against > the http request headers and if matched are included in the > CanonicalizedAmzHeaders to compute the final signature. For e.g, if a > request header contains "HTTP_X_ACCOUNT" its selected by radosgw to include > in amz_headers but the same will be ignored by AWS SDK as it does not start > with "x-amz-". This will result in different signature computed by client > and radosgw. > > Code Snippet: rgw_common.cc > struct str_len meta_prefixes[] = { STR_LEN_ENTRY("HTTP_X_AMZ"), > STR_LEN_ENTRY("HTTP_X_GOOG"), > STR_LEN_ENTRY("HTTP_X_DHO"), > STR_LEN_ENTRY("HTTP_X_RGW"), > STR_LEN_ENTRY("HTTP_X_OBJECT"), > STR_LEN_ENTRY("HTTP_X_CONTAINER"), > STR_LEN_ENTRY("HTTP_X_ACCOUNT"), > {NULL, 0} }; > > The method init_meta_info() which matches the above prefixes is called > from RGWREST::preprocess() which is invoked for all s3 requests. It will be > helpful to know as to why these prefixes that are not specified by AWS S3 > comes in the path of s3 authentication. Was it added for swift use-cases > only? If so, then why its included in rgw_common.cc? > As a proposed fix, we can remove the highlighted meta prefixes that are > not specified by aws from the s3 authentication path signature calculation. > Let me know if you have any queries or solutions. > > Regards, > Biswajeet > -- *-----------------------------------------------------------------------------------------* *This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the system manager. This message contains confidential information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this email from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.***** **** *Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the organization. Any information on shares, debentures or similar instruments, recommended product pricing, valuations and the like are for information purposes only. It is not meant to be an instruction or recommendation, as the case may be, to buy or to sell securities, products, services nor an offer to buy or sell securities, products or services unless specifically stated to be so on behalf of the Flipkart group. Employees of the Flipkart group of companies are expressly required not to make defamatory statements and not to infringe or authorise any infringement of copyright or any other legal right by email communications. Any such communication is contrary to organizational policy and outside the scope of the employment of the individual concerned. The organization will not accept any liability in respect of such communication, and the employee responsible will be personally liable for any damages or other liability arising.***** **** *Our organization accepts no liability for the content of this email, or for the consequences of any actions taken on the basis of the information *provided,* unless that information is subsequently confirmed in writing. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.* _-----------------------------------------------------------------------------------------_

4 years, 6 months

1
1
0 0

Nautilus: BlueFS spillover

by Eugen Block

Hi, I'm following the discussion for a tracker issue [1] about spillover warnings that affects our upgraded Nautilus cluster. Just to clarify, would a resize of the rocksDB volume (and expanding with 'ceph-bluestore-tool bluefs-bdev-expand...') resolve that or do we have to recreate every OSD? Regards, Eugen [1] https://tracker.ceph.com/issues/38745

4 years, 6 months

4
8
0 0

one read/write, many read only

by khaled atteya

Hi, Is it possible to do this scenario : If one open a file first , he will get read/write permissions and other will get read-only permission if they open the file after the first one. Thanks

4 years, 6 months

2
1
0 0

Doubt about ceph-iscsi and Vmware

by Gesiel Galvão Bernardes

Hi, I'm testing Ceph with Vmware, using Ceph-iscsi gateway. I reading documentation* and have doubts some points: - If I understanded, in general terms, for each VMFS datastore in VMware will match the an RBD image. (consequently in an RBD image I will possible have many VMWare disks). Its correct? - In documentation is this: "gwcli requires a pool with the name rbd, so it can store metadata like the iSCSI configuration". In part 4 of "Configuration", have: "Add a RBD image with the name disk_1 in the pool rbd". In this part, the use of "rbd" pool is a example and I could use any pool for storage of image, or the pool should be "rbd"? Resuming: gwcli require "rbd" pool for metadata and I could use any pool for image, or i will use just "rbd pool" for storage image and metadata? - How much memory ceph-iscsi use? Which is a good number of RAM? Regards Gesiel * https://docs.ceph.com/docs/master/rbd/iscsi-target-cli/

4 years, 6 months

4
5
0 0

2024

2023

2022

2021

2020

2019

ceph-users September 2019