Hi all,
Due to several reasons (political, heating problems, lack of space,
and so on) we have to plan for our ceph cluster to be hosted externally.
The planned version to set up is Reef.
Reading up on the documentation, we found that it is possible to run in
secure mode.
Our ceph.conf file will list both v2 and v1 addresses for the mons:
mon host = [v2:4.3.2.1:3300/0,v1:4.3.2.1:6789/0]
[v2:4.3.2.2:3300/0,v1:4.3.2.2:6789/0]
[v2:4.3.2.3:3300/0,v1:4.3.2.3:6789/0]
Then we change the following configuration options to secure only:
ms_cluster_mode = secure
ms_service_mode = secure
ms_client_mode = secure
ms_mon_cluster_mode = secure
ms_mon_service_mode = secure
ms_mon_client_mode = secure
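(If the cluster uses the centralized config database rather than a plain
ceph.conf, the same settings can be applied at runtime with ceph config
set, e.g.:

ceph config set global ms_cluster_mode secure
ceph config set global ms_service_mode secure

and likewise for the remaining four options.)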
Then I remounted cephfs on the clients of our test cluster,
but the fs would still mount on port 6789.
I thought that the secure config change above would "force"
the mount onto port 3300 and v2.
Mounting with the option ms_mode=secure did the trick.
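For the record, the working mount looked something like this (client
name and mountpoint are placeholders):

mount -t ceph 4.3.2.1:3300:/ /mnt/cephfs -o name=testclient,ms_mode=secure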
Is that the way cephfs works, that you have to explicitly
specify secure mode? I thought that cephfs clients would
use secure mode with these settings, but maybe I am wrong?
Of course we also plan to restrict the firewalls on the servers so that
only the specific subnet will be able to connect and mount cephfs.
From my understanding of the documentation, this would be the
way to set this up with ceph exposed to the internet.
Is there something that we are missing or something that would
make the setup more secure?
Many thanks in advance
Marcus
Hi,
We are testing rbd-mirroring. There seems to be a permission error with
the rbd-mirror user. Querying the mirror pool status with this user:
rbd --id rbd-mirror mirror pool status rbd
gives:
failed to query services: (13) Permission denied
and results in the following output:
health: UNKNOWN
daemon health: UNKNOWN
image health: OK
images: 3 total
2 replaying
1 stopped
So basically the health and daemon health cannot be obtained due to
permission errors, but the image status can.
When the command is run with admin permissions the health and daemon
health are returned without issue.
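Assuming the user was created with the standard profiles from the
mirroring docs, i.e. along the lines of:

ceph auth get-or-create client.rbd-mirror mon 'profile rbd-mirror' osd 'profile rbd'

its current caps can be double-checked with ceph auth get client.rbd-mirror.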
I tested this on Reef 18.2.2.
Is this expected behavior? If not, I will create a tracker ticket for it.
Gr. Stefan
Env:
- OS: Ubuntu 20.04
- Ceph Version: Octopus 15.0.0.1
- OSD disk: 2.9 TB NVMe
- Block storage (replication 3)
Symptom:
- Peering when an OSD node comes up is very slow. Peering speed varies from PG
to PG, and some PGs may even take 10 seconds. But there is no log output
during those 10 seconds.
- I checked the effect on client VMs: slow MySQL queries actually
occur at the same time.
Below are Ceph OSD logs for both the best and the worst case.
Best Peering Case (0.5 Seconds)
2024-04-11T15:32:44.693+0900 7f108b522700 1 osd.7 pg_epoch: 27368 pg[6.8]
state<Start>: transitioning to Primary
2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8]
state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8]
start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11],
acting_primary 7 -> 6, up_primary 7 -> 6, role 0 -> -1, features acting
2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8]
state<Start>: transitioning to Primary
2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8]
start_peering_interval up [6,11] -> [7,6,11], acting [6,11] -> [7,6,11],
acting_primary 6 -> 7, up_primary 6 -> 7, role -1 -> 0, features acting
Worst Peering Case (11.6 Seconds)
2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20]
state<Start>: transitioning to Stray
2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20]
start_peering_interval up [0,1] -> [0,7,1], acting [0,1] -> [0,7,1],
acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting
2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20]
state<Start>: transitioning to Stray
2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20]
start_peering_interval up [0,7,1] -> [0,7,1], acting [0,7,1] -> [0,1],
acting_primary 0 -> 0, up_primary 0 -> 0, role 1 -> -1, features acting
2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20]
state<Start>: transitioning to Stray
2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20]
start_peering_interval up [0,7,1] -> [0,7,1], acting [0,1] -> [0,7,1],
acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting
*I wish to know*
- Why some PGs take 10 seconds until peering finishes.
- Why the Ceph log is quiet during peering.
- Whether this behavior is intended in Ceph.
*And please give some advice:*
- Is there any way to improve peering speed?
- Or is there a way to keep peering from affecting clients?
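In case it is useful, the state of an individual PG can be inspected
while it peers, e.g. (PG 30.20 and osd.7 taken from the worst case
above):

ceph pg 30.20 query                   # shows recovery_state and past intervals
ceph daemon osd.7 dump_historic_ops   # recent slow ops, run on the OSD host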
P.S.
- I observed the same symptoms in the following environments:
-> Octopus and Reef versions, deployed with both cephadm and ceph-ansible.
Hi,
I'm probably Doing It Wrong here, but: my hosts are in racks, and I
wanted ceph to use that information from the get-go, so I tried to
achieve this during bootstrap.
This has left me with a single sad pg:
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
pg 1.0 is stuck inactive for 33m, current state unknown, last acting []
ceph osd tree shows that CRUSH picked up my racks OK, e.g.:
-3 45.11993 rack B4
-2 45.11993 host moss-be1001
1 hdd 3.75999 osd.1 up 1.00000 1.00000
But root seems empty:
-1 0 root default
and if I decompile the crush map, indeed:
# buckets
root default {
id -1 # do not change unnecessarily
id -14 class hdd # do not change unnecessarily
# weight 0.00000
alg straw2
hash 0 # rjenkins1
}
which does indeed look empty, whereas I have rack entries that contain
the relevant hosts.
And the replication rule:
rule replicated_rule {
id 0
type replicated
step take default
step chooseleaf firstn 0 type rack
step emit
}
I passed this config to bootstrap with --config:
[global]
osd_crush_chooseleaf_type = 3
and an initial spec file with host entries like this:
service_type: host
hostname: moss-be1001
addr: 10.64.16.40
location:
rack: B4
labels:
- _admin
- NVMe
Once the cluster was up I used an osd spec file that looked like:
service_type: osd
service_id: rrd_single_NVMe
placement:
label: "NVMe"
spec:
data_devices:
rotational: 1
db_devices:
model: "NVMe"
I could presumably fix this up by editing the crushmap (to put the racks
into the default bucket), but what did I do wrong? Was this not a
reasonable thing to want to do with cephadm?
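If editing is the way to go, I assume the fix would be a one-liner per
rack rather than a full decompile/recompile, e.g.:

ceph osd crush move B4 root=default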
I'm running
ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
Thanks,
Matthew
Hi,
I got into a weird and unexpected situation today. I added 6 hosts to
an existing Pacific cluster (16.2.13, 20 existing OSD hosts across 2
DCs). The hosts were added to the root=default subtree; their
designated location is one of the two datacenters underneath the
default root. Nothing unusual; I believe many people use different
subtrees to organize their clusters, as we do in our own (and we
haven't seen this issue there yet).
The main application is RGW, the main pool is erasure-coded (k=7,
m=11). The crush rule looks like this:
rule rule-ec-k7m11 {
id 1
type erasure
min_size 3
max_size 18
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 2 type datacenter
step chooseleaf indep 9 type host
step emit
}
After almost all peering had finished, the status showed 6 inactive +
peering PGs for a while. I had to fail the mgr because it didn't
report correct stats anymore; it then showed 16 unknown PGs. The
application noticed the (unexpected) disruption; after putting the
hosts into their designated crush bucket (datacenter) the situation
resolved. But I can't make any sense of it. I tried to reproduce it in
my lab environment (Quincy), but to no avail: in my tests it behaves
as expected, i.e. after new OSDs become active there are remapped PGs,
but nothing happens until I add them to their designated location.
I know I could have prevented this with either
osd_crush_initial_weight = 0 (then move the crush buckets, then
reweight), or by adding the crush buckets first, but usually I don't
need to bother with these things.
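For completeness, a sketch of the first workaround (host/datacenter
names and weights are placeholders):

ceph config set osd osd_crush_initial_weight 0   # new OSDs come up with weight 0
ceph osd crush move <host> datacenter=<dc>       # place the host bucket first
ceph osd crush reweight osd.<id> <weight>        # then set the real weight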
Does anyone have an explanation? I'd appreciate any comments.
Thanks!
Eugen
Ceph version 14.2.7
The ceph osd df tree command takes much longer than usual, but I can't
find out the reason. The monitor node still has plenty of available RAM
and CPU. I checked the monitor and mgr logs, but nothing seems useful.
I also checked an older cluster on version 13.2.10, where ceph osd df
tree still responded normally, so it does not seem to be version related.
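In case it helps to narrow things down, the timing can be measured and
in-flight mon operations checked like this (the daemon id is a
placeholder; run the second command on the mon host):

time ceph osd df tree
ceph daemon mon.<id> ops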
Anyone have any idea? Thanks
Dear Users,
I recently set up a new 3-node ceph cluster. The network is meshed
between all nodes (2 x 25G with DAC).
Storage is flash only (Kioxia 3.2 TB BiCS FLASH 3D TLC, KCMYXVUG3T20).
Ping tests between the nodes show the following latency:
# ping 10.1.3.13
PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
--- 10.1.3.13 ping statistics ---
11 packets transmitted, 11 received, 0% packet loss, time 10242ms
rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms
On another cluster I get much better values, with 10G SFP+ and
fibre cables:
64 bytes from large-ipv6-ip: icmp_seq=42 ttl=64 time=0.081 ms
64 bytes from large-ipv6-ip: icmp_seq=43 ttl=64 time=0.078 ms
64 bytes from large-ipv6-ip: icmp_seq=44 ttl=64 time=0.084 ms
64 bytes from large-ipv6-ip: icmp_seq=45 ttl=64 time=0.075 ms
64 bytes from large-ipv6-ip: icmp_seq=46 ttl=64 time=0.071 ms
64 bytes from large-ipv6-ip: icmp_seq=47 ttl=64 time=0.081 ms
64 bytes from large-ipv6-ip: icmp_seq=48 ttl=64 time=0.074 ms
64 bytes from large-ipv6-ip: icmp_seq=49 ttl=64 time=0.085 ms
64 bytes from large-ipv6-ip: icmp_seq=50 ttl=64 time=0.077 ms
64 bytes from large-ipv6-ip: icmp_seq=51 ttl=64 time=0.080 ms
64 bytes from large-ipv6-ip: icmp_seq=52 ttl=64 time=0.084 ms
64 bytes from large-ipv6-ip: icmp_seq=53 ttl=64 time=0.084 ms
^C
--- large-ipv6-ip ping statistics ---
53 packets transmitted, 53 received, 0% packet loss, time 53260ms
rtt min/avg/max/mdev = 0.071/0.082/0.111/0.006 ms
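(For scale: the averages differ by 0.137 - 0.082 = 0.055 ms, i.e.
roughly 55 microseconds of extra round-trip time on the DAC cluster.)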
If I want the best performance, does the latency difference matter at
all? Should I swap the DACs for SFP+ transceivers with fibre cables to
improve overall ceph performance, or is this nitpicking?
Thanks a lot.
Stefan
Hi all,
I've almost got my ceph back to normal after a triple drive failure.
But it seems my lost+found folder is corrupted.
I've followed the process in
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-…
However, doing an online scrub (as there is still other damage) fails,
as it appears my lost+found inode is corrupted.
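For reference, the online scrub I mean is the one from the docs, of the
form (the filesystem name is a placeholder):

ceph tell mds.<fsname>:0 scrub start / recursive,repair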
Any advice is appreciated
The only thing I can think of is that at the beginning of one of the
steps it asks if I want to re-create something and, from what I could
read in other emails to the list, I thought answering no was the
correct answer, but I am now wondering if I did want to recreate those
entries.
(Specifically, it asks at the cephfs-data-scan init point.)
Thanks in advance,
ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
Arch Linux
3-node cluster
33 OSDs
Hi everyone,
I'm managing a Ceph Quincy 17.2.5 cluster, waiting to upgrade it to
version 17.2.7, composed and configured as follows:
- 16 identical nodes: 256 GB RAM, 32 CPU cores (64 threads), 12 x
rotational HDD (block) + 4 x SATA SSD (RocksDB/WAL)
- Erasure code 11+4 (Jerasure)
- 10 x S3 RGW on dedicated nodes (5 physical nodes)
- 3 x full-SSD dedicated nodes for replicated S3 pools
- 2 x 10 Gbit public network (LACP) + 2 x 10 Gbit cluster network (LACP)
- On all nodes: Ubuntu 20.04.4 LTS, kept up to date
- Ceph deployed in containers on Docker CE (docker-ce
5:20.10.17~3-0~ubuntu-focal)
All pools, except the EC data pool, are configured with replication 3
and stored on dedicated SSD devices on 3 dedicated nodes to guarantee
the necessary performance.
We have encountered a constant but random problem with the availability
of access to bucket data, along with many slow_ops on the rotational
OSDs (data pools), caused neither by saturation of the physical devices
nor by a lack of CPU/RAM on any node.
Slow_ops caused by some requests to some PGs are sometimes reported,
seemingly at random.
The cluster is currently in the recovery/rebalance phase for the
reconstruction of 3 HDDs that we had to recreate from scratch (all 3
HDDs are physically on the same node).
By analyzing the events, we observed the following in the status of
some OSDs impacted by slow_ops (captured 20/05/2024 10:19):
"description": "osd_op(client.186021790.0:57620 29.258s0
29:1a5928ea:::31497ca8-e7d6-4e53-b150-91f9ac02ac67.246100.6329_storage%2ffirstMemories%2f1010543%2f:head
[getxattrs,stat] snapc 0=[]
ondisk+read+known_if_redirected+supports_pool_eio e481205)",
"initiated_at": "2024-05-16T15:03:06.015956+0000",
"age": 963.83795990199997,
"duration": 963.83819621700002,
"type_data": {
"flag_point": "delayed",
"client_info": {
"client": "client.186021790",
"client_addr": "10.151.11.11:0/3913909849",
"tid": 57620
},
"events": [
{
"event": "initiated",
"time": "2024-05-16T15:03:06.015956+0000",
"duration": 0
},
{
"event": "throttled",
"time": "2024-05-16T15:03:06.015956+0000",
"duration": 0
},
{
"event": "header_read",
"time": "2024-05-16T15:03:06.015954+0000",
"duration": 4294967295.9999986
},
{
"event": "all_read",
"time": "2024-05-16T15:03:06.015961+0000",
"duration": 7.2300000000000002e-06
},
{
"event": "dispatched",
"time": "2024-05-16T15:03:06.015962+0000",
"duration": 1.063e-06
},
{
"event": "queued_for_pg",
"time": "2024-05-16T15:03:06.015966+0000",
"duration": 3.332e-06
},
{
"event": "reached_pg",
"time": "2024-05-16T15:03:06.015992+0000",
"duration": 2.6078e-05
},
{
"event": "waiting for readable",
"time": "2024-05-16T15:03:06.016002+0000",
"duration": 1.0348e-05
}
]
}
}
],
"num_ops": 6
}
###########
{
"event": "reached_pg",
"time": "2024-05-16T12:43:11.694220+0000",
"duration": 480.97258642200001
},
In essence, the operations remain suspended in this condition:
"event": "waiting for readable"
while trying to access some PGs.
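(For those who want to pull the same data: the excerpts are the kind of
output produced by ceph daemon osd.<id> dump_ops_in_flight, or
dump_historic_ops for already completed operations; the OSD id is a
placeholder.)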
We intend to upgrade as soon as the recovery/rebalance is completed.
Does anyone have any idea what checks I could run to analyze the
problem more thoroughly?
I can't tell whether the problem is the use of EC, or whether the data
written to some buckets is in a "non-standard" condition that causes
the access to wait for some reason.
Thank you all for your kindness.
Greetings
Andrea Martra
--
Andrea Martra
+39 393 9048451