Hi,
I have a very old Ceph cluster running the dumpling release (0.67.1). One
of the three monitors suffered a hardware failure, and I am setting up a
new server running Ubuntu 22.04 LTS to replace it (the other monitors are
still on Ubuntu 12.04 LTS).
I originally deployed the cluster with ceph-deploy, but I can't use it now:
it is a very old version that runs into the apt-key deprecation, and since
ceph-deploy is no longer maintained I can't upgrade it. Even if I could, I
am not sure it would work, because the dumpling release is no longer in
Ceph's official repository.
So I tried to install it manually by cloning it from git:
git clone -b dumpling https://github.com/ceph/ceph.git
But when I try to run "git submodule update --init" or "./autogen.sh" as
per the README file, I encounter this error:
====
root@ceph-mon-04:~/ceph-dumpling/ceph# git submodule update --init
Submodule 'ceph-object-corpus' (git://ceph.com/git/ceph-object-corpus.git)
registered for path 'ceph-object-corpus'
Submodule 'src/libs3' (git://github.com/ceph/libs3.git) registered for path
'src/libs3'
Cloning into '/root/ceph-dumpling/ceph/ceph-object-corpus'...
fatal: repository 'https://ceph.com/git/ceph-object-corpus.git/' not found
fatal: clone of 'git://ceph.com/git/ceph-object-corpus.git' into submodule
path '/root/ceph-dumpling/ceph/ceph-object-corpus' failed
Failed to clone 'ceph-object-corpus'. Retry scheduled
Cloning into '/root/ceph-dumpling/ceph/src/libs3'...
Cloning into '/root/ceph-dumpling/ceph/ceph-object-corpus'...
fatal: repository 'https://ceph.com/git/ceph-object-corpus.git/' not found
fatal: clone of 'git://ceph.com/git/ceph-object-corpus.git' into submodule
path '/root/ceph-dumpling/ceph/ceph-object-corpus' failed
Failed to clone 'ceph-object-corpus' a second time, aborting
root@ceph-mon-04:~/ceph-dumpling/ceph# git submodule update --init
--recursive
Cloning into '/root/ceph-dumpling/ceph/ceph-object-corpus'...
fatal: repository 'https://ceph.com/git/ceph-object-corpus.git/' not found
fatal: clone of 'git://ceph.com/git/ceph-object-corpus.git' into submodule
path '/root/ceph-dumpling/ceph/ceph-object-corpus' failed
Failed to clone 'ceph-object-corpus'. Retry scheduled
Cloning into '/root/ceph-dumpling/ceph/ceph-object-corpus'...
fatal: repository 'https://ceph.com/git/ceph-object-corpus.git/' not found
fatal: clone of 'git://ceph.com/git/ceph-object-corpus.git' into submodule
path '/root/ceph-dumpling/ceph/ceph-object-corpus' failed
Failed to clone 'ceph-object-corpus' a second time, aborting
root@ceph-mon-04:~/ceph-dumpling/ceph#
====
It seems that the repositories required for the submodules are no longer
there. Can anyone point me in the right direction for installing the
dumpling version of Ceph so that I can add a new monitor? At the moment
only 2 of the 3 monitors are up, and I am worried that the cluster will go
down if I lose another monitor.
$ ceph status
cluster 1660b11f-1074-4f5d-aa7c-64b479397a2f
health HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon-01,ceph-mon-02
Which approach should I take?
- Keep trying the manual installation/compile route? (see the URL-rewrite
sketch below)
- Keep trying the ceph-deploy route (by fixing the apt-key deprecation
issue)?
- Install the same old OS (Ubuntu 12.04 LTS) on the new server (I'm not
sure I still have the ISO) and see if that works?
- Upgrade the current cluster first and add the monitor after the upgrade?
(Is it risky to upgrade while in HEALTH_WARN?)
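As a side note on the first option: one workaround I am considering for the
dead submodule URLs (untested, and it assumes the GitHub mirrors of
ceph-object-corpus and libs3 still exist) is to rewrite the git:// URLs
before running the submodule update:
====
cd ~/ceph-dumpling/ceph
# point the dead git://ceph.com/git/ URLs at the GitHub mirrors, and use
# https for github.com since the git:// protocol is no longer served there
git config url."https://github.com/ceph/".insteadOf "git://ceph.com/git/"
git config url."https://github.com/".insteadOf "git://github.com/"
git submodule update --init
====
Whether dumpling will then actually build on Ubuntu 22.04 is of course
another question.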
Any advice is greatly appreciated.
Best regards,
-ip-
Hi guys,
In the perf dump of an RGW instance I have two similar sections.
The first one:
"objecter": {
"op_active": 0,
"op_laggy": 0,
"op_send": 38816,
"op_send_bytes": 199927218,
"op_resend": 0,
"op_reply": 38816,
"oplen_avg": {
"avgcount": 38816,
"sum": 90408
},
"op": 38816,
"op_r": 12624,
"op_w": 26192,
"op_rmw": 0,
"op_pg": 0,
…
}
Second one:
"objecter-0x55b63c38fb80": {
"op_active": 0,
"op_laggy": 0,
"op_send": 5540,
"op_send_bytes": 217343,
"op_resend": 0,
"op_reply": 5540,
"oplen_avg": {
"avgcount": 5540,
"sum": 5636
},
"op": 5540,
"op_r": 680,
"op_w": 4860,
"op_rmw": 0,
"op_pg": 0,
…
}
What is 0x55b63c38fb80 ?
I am trying to monitor the "op_active" metric, but it only updates in the "objecter-0x55b63c38fb80" section and is always 0 in the "objecter" section. That makes it hard to monitor, because the id is dynamic and changes on the next RGW restart.
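The only workaround I could come up with is to sum the metric over every objecter-* section, roughly like this (jq and the admin socket path are assumptions from my setup; <name> is a placeholder):

ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok perf dump \
  | jq '[to_entries[] | select(.key | startswith("objecter")) | .value.op_active] | add'

But it would be nicer to understand why the plain "objecter" section stays at 0.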
Hi,
I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6, but the
following problem already existed when I was still on 17.2.5 everywhere.
I had a major issue in my cluster that I was able to solve with a lot of
your help and even more trial and error. Right now it seems that most of it
is fixed, but I can't rule out that some problem is still hidden. The issue
I'm asking about started during that repair.
When I want to orchestrate the cluster, it logs the command but doesn't do
anything, no matter whether I use the Ceph dashboard or "ceph orch" in
"cephadm shell". I don't get any error message when I try to deploy new
services, redeploy them, etc. The log only says "scheduled" and that's it.
The same happens when I change placement rules. Usually I use tags, but
since they don't work anymore either, I tried host placement and unmanaged.
No success. The only way I can actually start and stop containers is via
systemctl on the host itself.
When I run "ceph orch ls" or "ceph orch ps" I see services I deployed for
testing still shown as being deleted (for weeks now), and in particular a
lot of old MDS daemons listed as "error" or "starting". The list doesn't
match reality at all, because I had to start them by hand.
I tried "ceph mgr fail" and even a complete shutdown of the whole
cluster with all nodes including all mgs, mds even osd - everything
during a maintenance window. Didn't change anything.
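In case it matters, this is how I have been collecting the orchestrator
logs so far (following the cephadm troubleshooting docs, if I understood
them correctly):

ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph mgr fail                     # restart the active mgr so it picks up the setting
ceph -W cephadm --watch-debug     # follow the cephadm log live
ceph log last 100 debug cephadm   # or show the recent entries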
Could you help me? To be honest I'm still rather new to Ceph, and since I
didn't find anything in the logs that caught my eye, I would be thankful
for hints on how to debug this.
Cheers,
Thomas
--
http://www.widhalm.or.at
GnuPG : 6265BAE6 , A84CB603
Threema: H7AV7D33
Telegram, Signal: widhalmt(a)widhalm.or.at
Dear Ceph folks,
Recently one of our clients approached us with a request for per-user encryption, i.e. using an individual encryption key for each user when encrypting files and objects in the store.
Does anyone know (or have experience with) how to do this with CephFS and Ceph RGW?
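To make the request a bit more concrete: on the object store side, what the client seems to have in mind looks like S3 server-side encryption with customer-provided keys (SSE-C), where each user supplies their own key with every request, roughly like this (just a sketch with a made-up key file, bucket and endpoint; we have not verified this against RGW yet):

openssl rand -out user1.key 32
aws --endpoint-url https://rgw.example.com s3 cp report.pdf s3://user1-bucket/report.pdf \
    --sse-c AES256 --sse-c-key fileb://user1.key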
Any suggestions or comments are highly appreciated,
best regards,
Samuel
huxiaoyu(a)horebdata.cn
Hi guys
I deployed the Ceph cluster with cephadm as the root user, but now I need
to change the user to a non-root user.
These are the steps I took:
1 - Created a non-root user on all hosts with passwordless sudo access:
`$USER_NAME ALL = (root) NOPASSWD:ALL`
2 - Generated an SSH key pair and used ssh-copy-id to copy it to all hosts:
`
ssh-keygen (accept the default file name and leave the passphrase empty)
ssh-copy-id USER_NAME@HOST_NAME
`
3 - Ran `ceph cephadm set-user <user>`
But I get an "Error EINVAL: ssh connection to root@hostname failed" error.
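One thing I am not sure about is whether cephadm connects with the key pair
I generated in step 2 or with the cluster's own SSH key, so I am now also
trying the following (commands as I understand them from the docs;
"cephuser" and "host1" are placeholders):
`
ceph cephadm get-pub-key > ~/ceph.pub
ssh-copy-id -f -i ~/ceph.pub cephuser@host1
ceph cephadm set-user cephuser
ceph cephadm check-host host1
`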
How do I deal with this issue?
What needs to be done to change the user to non-root?
Hi,
I can't find any documentation for this upgrade process. Has anybody already done it?
Does the normal apt-get update method work?
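What I have in mind is roughly the following, per host, monitors first, then managers, then OSDs (this assumes a package-based, non-cephadm install and that the target release's apt repository is already configured):

ceph osd set noout
apt-get update
apt-get install --only-upgrade ceph ceph-mon ceph-mgr ceph-osd radosgw
systemctl restart ceph-mon.target   # on the monitor hosts, one at a time
systemctl restart ceph-mgr.target
systemctl restart ceph-osd.target   # on the OSD hosts, one at a time
ceph osd unset noout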
Thank you
Details of this release are summarized here:
https://tracker.ceph.com/issues/61515#note-1
Release Notes - TBD
Seeking approvals/reviews for:
rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to
merge https://github.com/ceph/ceph/pull/51788 for
the core)
rgw - Casey
fs - Venky
orch - Adam King
rbd - Ilya
krbd - Ilya
upgrade/octopus-x - deprecated
upgrade/pacific-x - known issues, Ilya, Laura?
upgrade/reef-p2p - N/A
clients upgrades - not run yet
powercycle - Brad
ceph-volume - in progress
Please reply to this email with approval and/or trackers of known
issues/PRs to address them.
gibba upgrade was done and will need to be done again this week.
LRC upgrade TBD
TIA
Thank you for your response and for raising an important question regarding
the potential bottlenecks within the RGW or the overall Ceph cluster. I
appreciate your insight and would like to provide more information about
the issues I have been experiencing. In my deployment, RGW instances 17-20
have been encountering problems such as hanging or returning errors,
including "failed to read header: The socket was closed due to a timeout"
and "res_query() failed." These issues have led to disruptions and
congestions within the cluster. The index pool is indeed placed on a large
number of NVMe SSDs to ensure fast access and efficient indexing of data.
The number of Placement Groups (PGs) allocated for the index pool is also
configured to be sufficient for the workload
On Tue, Jun 6, 2023 at 21:27 Anthony D'Atri <anthony.datri(a)gmail.com> wrote:
> Do you have reason to believe that your bottlenecks are within RGW not
> within the cluster?
>
> e.g. is your index pool on a large number of NVMe SSDs with sufficient
> PGs? Is your bucket data on SSD as well?
>
>
> On Jun 6, 2023, at 13:52, Ramin Najjarbashi <ramin.najarbashi(a)gmail.com>
> wrote:
>
> I would like to seek your insights and recommendations regarding the
> practice of workload separation in a Ceph RGW (RADOS Gateway) cluster. I
> have been facing challenges with large queues in my deployment and would
> appreciate your expertise in determining whether workload separation is a
> recommended approach or not.
>
>
>
Hi
I would like to seek your insights and recommendations regarding the
practice of workload separation in a Ceph RGW (RADOS Gateway) cluster. I
have been facing challenges with large queues in my deployment and would
appreciate your expertise in determining whether workload separation is a
recommended approach or not.
In my current Ceph cluster, I have 20 RGW instances. Client requests are
directed to RGW1-16, while RGW17-20 are dedicated to administrative tasks
and backend usage. However, I have been encountering errors and congestion
issues due to the accumulation of large queues within the RGW instances.
Considering the above scenario, I would like to inquire about your opinions
on workload separation as a potential solution. Specifically, I am
interested in knowing whether workload separation is recommended in a Ceph
RGW cluster.
To address the queue congestion and improve performance, my proposed
solution includes separating the RGW instances based on their specific
purposes. This entails allocating dedicated instances for client requests,
backend usage, administrative tasks, metadata synchronization with other
zone groups, garbage collection (GC), and lifecycle (LC) operations.
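Concretely, the separation I have in mind would look something like the
following ceph.conf sketch (the section names are placeholders for the real
instance names, and the option names are my understanding of the RGW knobs
for the GC/LC/sync threads - please correct me if these are not the right
ones):

# client-facing instances (RGW1-16): serve S3 traffic only
[client.rgw.client-facing]
rgw_enable_gc_threads = false
rgw_enable_lc_threads = false
rgw_run_sync_thread = false

# dedicated back-end instances (RGW17-20): run GC, lifecycle and multisite sync
[client.rgw.backend]
rgw_enable_gc_threads = true
rgw_enable_lc_threads = true
rgw_run_sync_thread = true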
I kindly request your feedback and insights on the following points:
1. Is workload separation considered a recommended practice in Ceph RGW
deployments?
2. What are the potential benefits and drawbacks of workload separation in
terms of performance, resource utilization, and manageability?
3. Are there any specific considerations or best practices to keep in mind
while implementing workload separation in a Ceph RGW cluster?
4. Can you share your experiences or any references/documentation that
highlight successful implementations of workload separation in Ceph RGW
deployments?
I truly value your expertise and appreciate your time and effort in
providing guidance on this matter. Your insights will contribute
significantly to optimizing the performance and stability of my Ceph RGW
cluster.