Hi,
I am installing Ceph Octopus using cephadm. I managed to install ceph-common
with cephadm, but when trying to add new hosts with "ceph orch host add
ceph2" I get the error:
"Error EINVAL: Failed to connect to ceph2 (ceph2). Check that the host is
reachable and accepts connection using the cephadm SSH key".
I verified that I am able to log in to the ceph2 server over SSH with the
cephadm private key, as described in the error message. But since adding
new hosts still wasn't working, I tried generating a new key and pushing
the public key to the remote servers with:
# ceph cephadm clear-key
# ceph cephadm generate-key
# ceph cephadm get-pub-key > ceph.pub
# ceph config-key get mgr/cephadm/ssh_identity_key > ceph.priv
# ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph1
# ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph2
# ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph3
And then I tested that the private key really works:
# chmod 600 ceph.priv
# ssh -i ceph.priv root@ceph2
At this point passwordless SSH login works. But "ceph orch host add ceph2"
still fails with exactly the same error.
I also tried restarting the manager with "ceph mgr fail", which was
suggested somewhere -> no effect. I also tried rebooting the machines -> no
effect.
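One thing I am not sure about is whether cephadm connects exactly the way I
tested by hand. I assume its connection could be reproduced with the mgr's
stored SSH config, something like:
# ceph cephadm get-ssh-config > ssh_config
# ceph config-key get mgr/cephadm/ssh_identity_key > ceph.priv
# chmod 600 ceph.priv
# ssh -F ssh_config -i ceph.priv root@ceph2
but I may be missing a detail there.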
Any tips on what I could still try?
Thank you very much!
Hi,
I have to use the *rbd-nbd* tool from Ceph. It is part of the Ceph source
code: https://github.com/ceph/ceph/tree/master/src/tools/rbd_nbd
My question is: can we use this *rbd-nbd* tool in the Ceph cluster? By Ceph
cluster I mean the development cluster we build through the *vstart.sh*
script. I am fairly sure we can. I have the script running and can *start*
and *stop* the cluster, but I am struggling to actually use rbd-nbd with
that development cluster.
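For context, this is roughly what I have been attempting from the build
directory (the pool and image names are just my test values, and I assume
rbd-nbd was built along with the other binaries):
$ ../src/vstart.sh -n -d
$ ./bin/ceph osd pool create rbd 8
$ ./bin/rbd create test-img --size 1G
$ sudo ./bin/rbd-nbd -c ./ceph.conf map test-img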
Looking for help.
Thanks.
Hi all,
one of our MONs was down for maintenance for ca. 45 minutes. After this time I started it up again and it joined the cluster.
Unfortunately, things did not go as expected. The MON sub-cluster became unresponsive for a bit more than 10 minutes. Admin commands would hang, even if issued directly to a specific monitor via "ceph tell mon.xxx". In addition, our MDS lost connection to the MONs and reported a laggy connection. Consequently, all ceph fs access was frozen for a bit more than 10 minutes as well.
From the little I could get out with "ceph daemon mon.xxx mon_status" I could see that the restarted MON was in state "synchronizing" (or similar; I'm quoting from memory) while the other MONs were in quorum.
Our cluster is mimic-12.2.8. Somehow, this observation does not fit together with the intended HA of the MON cluster; there should not be any stall at all.
My questions: Why do the MONs become unresponsive for such a long time? What are the MONs doing during this time frame? Are there any config options I should look at? Are there any log messages I should hunt for?
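For reference, I assume the sync-related settings could be inspected on the MONs with something like:
# ceph daemon mon.xxx config show | grep mon_sync
in case any of those (e.g. mon_sync_max_payload_size) are relevant here.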
Any hint is appreciated.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as
down when the network is cut to that rack.
Situation:
- Nautilus cluster
- 3 racks
- 120 OSDs, 40 per rack
We performed a test where we turned off the Top-of-Rack network switch for
each rack. This worked as expected with two racks, but with the third
something weird happened.
Of the 40 OSDs that were supposed to be marked as down, only 36 were.
In the end it took 15 minutes for all 40 OSDs to be marked as down.
$ ceph config set mon mon_osd_reporter_subtree_level rack
That setting is set to make sure that we only accept reports from other
racks.
What we saw in the logs for example:
2020-10-29T03:49:44.409-0400 7fbda185e700 10
mon.CEPH2-MON1-206-U39@0(leader).osd e107102 osd.51 has 54 reporters,
239.856038 grace (20.000000 + 219.856 + 7.43801e-23), max_failed_since
2020-10-29T03:47:22.374857-0400
But osd.51 was still not marked as down after 54 reporters had reported
that it was actually down.
I checked, no ping or other traffic possible to osd.51. Host is unreachable.
Another OSD was marked as down, but that took a couple of minutes as well:
2020-10-29T03:50:54.455-0400 7fbda185e700 10
mon.CEPH2-MON1-206-U39@0(leader).osd e107102 osd.37 has 48 reporters,
221.378970 grace (20.000000 + 201.379 + 6.34437e-23), max_failed_since
2020-10-29T03:47:12.761584-0400
2020-10-29T03:50:54.455-0400 7fbda185e700 1
mon.CEPH2-MON1-206-U39@0(leader).osd e107102 we have enough reporters
to mark osd.37 down
In the end osd.51 was marked down, but only because the MON noticed it had
stopped sending beacons:
2020-10-29T03:53:44.631-0400 7fbda185e700 0 log_channel(cluster) log
[INF] : osd.51 marked down after no beacon for 903.943390 seconds
2020-10-29T03:53:44.631-0400 7fbda185e700 -1
mon.CEPH2-MON1-206-U39@0(leader).osd e107104 no beacon from osd.51 since
2020-10-29T03:38:40.689062-0400, 903.943390 seconds ago. marking down
I haven't seen this happen before in any cluster. It's also strange that
this only happens in this rack; the other two racks work fine.
ID CLASS WEIGHT TYPE NAME
-1 1545.35999 root default
-206 515.12000 rack 206
-7 27.94499 host CEPH2-206-U16
...
-207 515.12000 rack 207
-17 27.94499 host CEPH2-207-U16
...
-208 515.12000 rack 208
-31 27.94499 host CEPH2-208-U16
...
That's what the CRUSH map looks like: straightforward, with 3x replication
over 3 racks.
This issue only occurs in rack *207*.
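For completeness, the failure-detection settings involved can be checked
like this (I assume the ~903s in the log corresponds to the default 900s
beacon grace):
$ ceph config get mon mon_osd_reporter_subtree_level
$ ceph config get mon mon_osd_min_down_reporters
$ ceph config get mon mon_osd_beacon_grace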
Has anybody seen this before or knows where to start?
Wido
Hello all.
I'm trying to deploy the dashboard (Nautilus 14.2.8). After I ran "ceph
dashboard create-self-signed-cert", the cluster started showing this error:
# ceph health detail
HEALTH_ERR Module 'dashboard' has failed: '_cffi_backend.CDataGCP' object
has no attribute 'type'
MGR_MODULE_ERROR Module 'dashboard' has failed: '_cffi_backend.CDataGCP'
object has no attribute 'type'
Module 'dashboard' has failed: '_cffi_backend.CDataGCP' object has no
attribute 'type'
If I run "ceph config set mgr mgr/dashboard/ssl false", the error goes
away. I tried to manually upload the certificates, but I'm still hitting
the error.
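In case the method matters, the upload I attempted was along these lines
(based on the dashboard docs; the file names are mine):
# ceph dashboard set-ssl-certificate -i dashboard.crt
# ceph dashboard set-ssl-certificate-key -i dashboard.key
# ceph mgr fail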
Has anyone experienced something similar?
Thanks, Marcelo.
I am getting an error in log.smbd from the Samba gateway that I don't
understand, and I am looking for help from anyone who has gotten vfs_ceph
working.
Background:
I am trying to get a Samba gateway with CephFS working with the
vfs_ceph module. I observed that the default Samba package on CentOS
7.7 did not come with the ceph.so vfs_ceph module, so I tried to
compile a working Samba version with vfs_ceph.
Newer Samba versions have a requirement for GnuTLS >= 3.4.7, which is
not an available package on CentOS 7.7 without a custom repository. I
opted to build an earlier version of Samba.
On CentOS 7.7, I built Samba 4.11.16 with this smb.conf:
[global]
security = user
map to guest = Bad User
username map = /etc/samba/smbusers
log level = 4
load printers = no
printing = bsd
printcap name = /dev/null
disable spoolss = yes
[cryofs_upload]
public = yes
read only = yes
guest ok = yes
vfs objects = ceph
path = /upload
kernel share modes = no
ceph:user_id = samba.upload
ceph:config_file = /etc/ceph/ceph.conf
I have a file at /etc/ceph/ceph.conf including:
fsid = redacted
mon_host = redacted
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
I have an /etc/ceph/client.samba.upload.keyring with the key for the user
`samba.upload`.
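The keyring file itself looks like this (key redacted; I assume this is
the expected format):
[client.samba.upload]
        key = <redacted>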
However, connecting fails:
smbclient \\\\localhost\\cryofs_upload -U guest
Enter guest's password:
tree connect failed: NT_STATUS_UNSUCCESSFUL
The log.smbd gives these errors:
Initialising custom vfs hooks from [ceph]
[2020/11/11 17:24:37.388460, 3]
../../lib/util/modules.c:167(load_module_absolute_path)
load_module_absolute_path: Module '/usr/local/samba/lib/vfs/ceph.so' loaded
[2020/11/11 17:24:37.402026, 1]
../../source3/smbd/service.c:668(make_connection_snum)
make_connection_snum: SMB_VFS_CONNECT for service 'cryofs_upload' at
'/upload' failed: No such file or directory
There is an /upload directory in the CephFS to which the samba.upload user
has read access.
What does this 'No such file or directory' error mean? Is it that
vfs_ceph isn't finding `/upload`, or is some other file that vfs_ceph
depends on not being found? I have also tried specifying a local path
rather than a CephFS path and get the same error.
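To rule out the cephx side, I assume the same credentials could be tested
outside Samba with something like:
# ceph-fuse --id samba.upload -k /etc/ceph/client.samba.upload.keyring -r /upload /mnt/test
though I have not fully verified that yet.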
Is there any good guide that describes not just the Samba smb.conf, but
also what should be in /etc/ceph/ceph.conf and how to provide the key for
the ceph:user_id? I am really struggling to find good first-hand
documentation for this.
Thanks,
Matt
--
Matt Larson, PhD
Madison, WI 53705 U.S.A.
Hi,
I have 36 OSDs and get this error:
Error ERANGE: pg_num 4096 size 6 would mean 25011 total pgs, which exceeds max 10500 (mon_max_pg_per_osd 250 * num_in_osds 42)
If I want to calculate the maximum number of PGs in my cluster, how does it work if I have an EC pool?
I have a 4:2 data EC pool, and the others are replicated.
These are the pools:
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 597 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 598 flags hashpspool stripe_width 0 application rgw
pool 6 'sin.rgw.log' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 599 flags hashpspool stripe_width 0 application rgw
pool 7 'sin.rgw.control' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 600 flags hashpspool stripe_width 0 application rgw
pool 8 'sin.rgw.meta' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 601 lfor 0/393/391 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
pool 10 'sin.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 602 lfor 0/529/527 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
pool 11 'sin.rgw.buckets.data.old' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 603 flags hashpspool stripe_width 0 application rgw
pool 12 'sin.rgw.buckets.data' erasure profile data-ec size 6 min_size 5 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 604 flags hashpspool,ec_overwrites stripe_width 16384 application rgw
So how can I calculate the PGs?
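My attempt at reproducing the number from the error message, assuming the check counts each PG once per replica/chunk:
budget = mon_max_pg_per_osd * num_in_osds = 250 * 42 = 10500
replicated pools: (1 + 32 + 32 + 32 + 8 + 8 + 32) PGs * size 3 = 435
EC pool at the requested pg_num: 4096 * size 6 = 24576
total = 435 + 24576 = 25011, which exceeds 10500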
This is my osd tree:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 534.38354 root default
-5 89.06392 host cephosd-6s01
36 nvme 1.74660 osd.36 up 1.00000 1.00000
0 ssd 14.55289 osd.0 up 1.00000 1.00000
8 ssd 14.55289 osd.8 up 1.00000 1.00000
15 ssd 14.55289 osd.15 up 1.00000 1.00000
18 ssd 14.55289 osd.18 up 1.00000 1.00000
24 ssd 14.55289 osd.24 up 1.00000 1.00000
30 ssd 14.55289 osd.30 up 1.00000 1.00000
-3 89.06392 host cephosd-6s02
37 nvme 1.74660 osd.37 up 1.00000 1.00000
1 ssd 14.55289 osd.1 up 1.00000 1.00000
11 ssd 14.55289 osd.11 up 1.00000 1.00000
17 ssd 14.55289 osd.17 up 1.00000 1.00000
23 ssd 14.55289 osd.23 up 1.00000 1.00000
28 ssd 14.55289 osd.28 up 1.00000 1.00000
35 ssd 14.55289 osd.35 up 1.00000 1.00000
-11 89.06392 host cephosd-6s03
41 nvme 1.74660 osd.41 up 1.00000 1.00000
2 ssd 14.55289 osd.2 up 1.00000 1.00000
6 ssd 14.55289 osd.6 up 1.00000 1.00000
13 ssd 14.55289 osd.13 up 1.00000 1.00000
19 ssd 14.55289 osd.19 up 1.00000 1.00000
26 ssd 14.55289 osd.26 up 1.00000 1.00000
32 ssd 14.55289 osd.32 up 1.00000 1.00000
-13 89.06392 host cephosd-6s04
38 nvme 1.74660 osd.38 up 1.00000 1.00000
5 ssd 14.55289 osd.5 up 1.00000 1.00000
7 ssd 14.55289 osd.7 up 1.00000 1.00000
14 ssd 14.55289 osd.14 up 1.00000 1.00000
20 ssd 14.55289 osd.20 up 1.00000 1.00000
25 ssd 14.55289 osd.25 up 1.00000 1.00000
31 ssd 14.55289 osd.31 up 1.00000 1.00000
-9 89.06392 host cephosd-6s05
40 nvme 1.74660 osd.40 up 1.00000 1.00000
3 ssd 14.55289 osd.3 up 1.00000 1.00000
10 ssd 14.55289 osd.10 up 1.00000 1.00000
12 ssd 14.55289 osd.12 up 1.00000 1.00000
21 ssd 14.55289 osd.21 up 1.00000 1.00000
29 ssd 14.55289 osd.29 up 1.00000 1.00000
33 ssd 14.55289 osd.33 up 1.00000 1.00000
-7 89.06392 host cephosd-6s06
39 nvme 1.74660 osd.39 up 1.00000 1.00000
4 ssd 14.55289 osd.4 up 1.00000 1.00000
9 ssd 14.55289 osd.9 up 1.00000 1.00000
16 ssd 14.55289 osd.16 up 1.00000 1.00000
22 ssd 14.55289 osd.22 up 1.00000 1.00000
27 ssd 14.55289 osd.27 up 1.00000 1.00000
34 ssd 14.55289 osd.34 up 1.00000 1.00000
These are the CRUSH rules:
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 1,
"rule_name": "replicated_nvme",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -21,
"item_name": "default~nvme"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 2,
"rule_name": "replicated_ssd",
"ruleset": 2,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -2,
"item_name": "default~ssd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 3,
"rule_name": "sin.rgw.buckets.data.new",
"ruleset": 3,
"type": 3,
"min_size": 3,
"max_size": 6,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -2,
"item_name": "default~ssd"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
So everything other than the data pool is on SSD and NVMe with replica 3.
If I calculate the PGs in the EC pool like 36 OSDs * 100 / 6 = 600, does that mean the max pg_num for the EC pool is 512 (the nearest power of two below 600)?
But how does this affect the SSD replicated pools then?
This is the EC pool definition:
crush-device-class=ssd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
Thank you in advance.
Hi All,
I'm not sure if this is the correct place to ask this question; I have tried other channels, but received very little help there.
I am currently very new to Ceph and am investigating it as a possible replacement for a legacy application which used to provide us with replication.
At the moment my company has three servers: two primary servers running Ubuntu and a backup server, also running Ubuntu. The two primary servers each host a virtual machine, and it is these virtual machines that the office workers use for shared folder access, email and as a domain server; the office workers are not aware of the underlying Linux servers. In the past, the legacy software would replicate the running VM files on both primary servers to the backup server. The replication is done at the underlying Linux host level and not from within the guest VMs. I was hoping that I could get Ceph to do this as well.
From what I have read, and I speak under correction, the best Ceph client type for this would be block access (RBD), whereby I would mount the block device and start up the VMs; see my rough sketch below. As I would be running the VMs as per normal routine, would Ceph then have to retrieve the large VM files from the storage nodes across the LAN and bring the data back to the client to run the VM? Is there an option to cache certain parts of the data on certain clients?
Also, neither of the primary servers as they currently stand has the capacity to run both VMs together, so each primary runs a dedicated VM; the backup server currently keeps replicated copies of both VM images from each primary, with the replication provided by the legacy application. I'm also wondering if I need to get a fourth server, so that I have 2 clients and 2 storage nodes.
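The rough sketch I had in mind, based only on my reading of the documentation (the pool and image names are just examples I made up):
# ceph osd pool create vms 64
# rbd pool init vms
# rbd create vms/office-vm1 --size 500G
# rbd map vms/office-vm1    # exposes a /dev/rbdX device to run the VM from
Please correct me if this is the wrong approach.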
Any suggestions or help would be greatly appreciated.
Yours sincerely
Vaughan Beckwith
Bluesphere Technologies
BSC I.T. (Honours)
vaughan.beckwith(a)bluesphere.co.za
Telephone: 011 675 6354
Fax: (011) 675 6423
I've inherited a Ceph Octopus cluster that seems like it needs urgent maintenance before data loss begins to happen. I'm the guy with the most Ceph experience on hand and that's not saying much. I'm experiencing most of the ops and repair tasks for the first time here.
Ceph health output looks like this:
HEALTH_WARN Degraded data redundancy: 3640401/8801868 objects degraded (41.359%),
128 pgs degraded, 128 pgs undersized; 128 pgs not deep-scrubbed in time;
128 pgs not scrubbed in time
Ceph -s output: https://termbin.com/i06u
The crush rule 'cephfs.media' is here: https://termbin.com/2klmq
So, it seems like all PGs are in a 'warning' state for the main pool, which is erasure coded and 11 TiB across 4 OSDs, of which around 6.4 TiB is used. The Ceph services themselves seem happy: they're stable and have quorum. I'm also able to access the web panel fine. The block devices are of different sizes and types (2 large spinners of different sizes, and 2 identical SSDs).
I would welcome any pointers on what my steps to bring this back to full health might be. If it's undersized, can I simply add another block device/OSD? Or will adjusting config somewhere get it to rebalance successfully? (The rebalance jobs have been stuck at 0% for weeks.)
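If it helps, I can post output from commands like these (my guesses at what would be diagnostic; the pool name is guessed from the crush rule above):
$ ceph osd df tree
$ ceph pg ls undersized | head -20
$ ceph osd pool get cephfs.media all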
Thank you for your time reading this message.
Hi,
We have recently deployed a Ceph cluster with:
12 OSD nodes (16 cores + 200 GB RAM + 30 disks of 14 TB each), running CentOS 8
3 monitor nodes (8 cores + 16 GB RAM), running CentOS 8
We are running Ceph Octopus and using RBD block devices.
We have three Ceph client nodes (16 cores + 30 GB RAM, running CentOS 8) across which the RBDs are mapped and mounted, 25 RBDs per client node. Each RBD is 10 TB in size and formatted with an EXT4 file system.
On the network side, we have a 10 Gbps active/passive bond on all the Ceph cluster nodes, including the clients. Jumbo frames are enabled and the MTU is 9000.
This is a new cluster and cluster health reports OK, but we see high I/O wait during writes.
From one of the clients,
15:14:30 CPU %user %nice %system %iowait %steal %idle
15:14:31 all 0.06 0.00 1.00 45.03 0.00 53.91
15:14:32 all 0.06 0.00 0.94 41.28 0.00 57.72
15:14:33 all 0.06 0.00 1.25 45.78 0.00 52.91
15:14:34 all 0.00 0.00 1.06 40.07 0.00 58.86
15:14:35 all 0.19 0.00 1.38 41.04 0.00 57.39
Average: all 0.08 0.00 1.13 42.64 0.00 56.16
and the system load is very high:
top - 15:19:15 up 34 days, 41 min, 2 users, load average: 13.49, 13.62, 13.83
From 'atop', one of the CPUs shows this:
CPU | sys 7% | user 1% | irq 2% | idle 1394% | wait 195% | steal 0% | guest 0% | ipc initial | cycl initial | curf 806MHz | curscal ?%
On the OSD nodes, we don't see much %utilization on the disks.
RBD caching values are default.
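In case it is useful, I assume the raw cluster write path could be checked independently of the file systems with something like (pool/image names are placeholders):
$ rados bench -p <pool> 30 write
$ rbd bench --io-type write <pool>/<image> --io-size 4K --io-threads 16
We have not drawn conclusions from such a test yet.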
Are we overlooking some configuration item?
Thanks and Regards,
At