March 2023 - ceph-users - lists.ceph.io

by Johan Hattne

Dear all;

Up until a few hours ago, I had a seemingly normally-behaving cluster (Quincy, 17.2.5) with 36 OSDs, evenly distributed across 3 of its 6 nodes. The cluster is only used for CephFS, and the only non-standard configuration I can think of is that I had 2 active MDSs but only 1 standby. I had also doubled mds_cache_memory_limit to 8 GB (all OSD hosts have 256 GB of RAM) at some point in the past.

Then I rebooted one of the OSD nodes. The rebooted node held one of the active MDSs. Now the node is back up: ceph -s says the cluster is healthy, but all PGs are in an active+clean+remapped state and 166.67% of the objects are misplaced (dashboard: -66.66% healthy).

The data pool is a threefold replica with 5.4M objects; the number of misplaced objects is reported as 27087410/16252446. The denominator in the ratio makes sense to me (16.2M / 3 = 5.4M), but the numerator does not. I also note that the ratio is *exactly* 5 / 3. The filesystem is still mounted and appears to be usable, but df reports it as 100% full; I suspect it would say 167% but that is capped somewhere.

Any ideas about what is going on? Any suggestions for recovery?

// Best wishes; Johan
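The arithmetic in the post can be checked with exact rational numbers. The following is only a sketch that reuses the two counters quoted above; it confirms the 5/3 ratio and the object count, and says nothing about why Ceph reports those counters.

```python
from fractions import Fraction

# Figures quoted in the post above.
misplaced = 27_087_410      # numerator reported by ceph -s
total_copies = 16_252_446   # denominator reported by ceph -s
replicas = 3                # threefold replicated data pool

ratio = Fraction(misplaced, total_copies)
objects = total_copies // replicas

print(f"logical objects: {objects}")
print(f"misplaced ratio: {ratio} ({float(ratio):.2%})")
print(f"misplaced copies counted per object: {Fraction(misplaced, objects)}")
```

Read this way, every object is counted as misplaced five times although the pool keeps only three copies of it, which is why the numerator looks implausible.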

CephFS thrashing through the page cache

by Ashu Pachauri

We have an internal use case where we back the storage of a proprietary database by a shared file system. We noticed something very odd when testing a workload on a file system backed by a local block device versus CephFS: the amount of network IO done by CephFS is almost double the IO done by the local file system on an attached block device.

We also noticed that CephFS thrashes through the page cache very quickly compared to the amount of data being read, and we think the two issues might be related. So, I wrote a simple test:

1. I wrote 10k files of 400 KB each using dd (approx. 4 GB of data).
2. I dropped the page cache completely.
3. I then read these files serially, again using dd.

The page cache usage shot up to 39 GB for reading such a small amount of data.

Following is the code used to repro this in bash:

    # Write 10,000 files of 100 x 4 KiB blocks each
    for i in $(seq 1 10000); do
      dd if=/dev/zero of=test_${i} bs=4k count=100
    done
    # Flush dirty data, then drop the page cache
    sync; echo 1 > /proc/sys/vm/drop_caches
    # Read every file back serially
    for i in $(seq 1 10000); do
      dd if=test_${i} of=/dev/null bs=4k count=100
    done

The ceph version being used is:

    ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)

The ceph configs being overridden:

    WHO     MASK  LEVEL     OPTION                                 VALUE        RO
    mon           advanced  auth_allow_insecure_global_id_reclaim  false
    mgr           advanced  mgr/balancer/mode                      upmap
    mgr           advanced  mgr/dashboard/server_addr              127.0.0.1    *
    mgr           advanced  mgr/dashboard/server_port              8443         *
    mgr           advanced  mgr/dashboard/ssl                      false        *
    mgr           advanced  mgr/prometheus/server_addr             0.0.0.0      *
    mgr           advanced  mgr/prometheus/server_port             9283         *
    osd           advanced  bluestore_compression_algorithm        lz4
    osd           advanced  bluestore_compression_mode             aggressive
    osd           advanced  bluestore_throttle_bytes               536870912
    osd           advanced  osd_max_backfills                      3
    osd           advanced  osd_op_num_threads_per_shard_ssd       8            *
    osd           advanced  osd_scrub_auto_repair                  true
    mds           advanced  client_oc                              false
    mds           advanced  client_readahead_max_bytes             4096
    mds           advanced  client_readahead_max_periods           1
    mds           advanced  client_readahead_min                   0
    mds           basic     mds_cache_memory_limit                 21474836480
    client        advanced  client_oc                              false
    client        advanced  client_readahead_max_bytes             4096
    client        advanced  client_readahead_max_periods           1
    client        advanced  client_readahead_min                   0
    client        advanced  fuse_disable_pagecache                 false

The cephfs mount options (note that readahead was disabled for this test):

    /mnt/cephfs type ceph (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)

Any help or pointers are appreciated; this is a major performance issue for us.

Thanks and Regards,
Ashu Pachauri
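To put the reported numbers side by side, here is a small sketch of the arithmetic behind the test. Two things in it are assumptions and not statements from the post: whether "39 GB" means decimal or binary gigabytes, and the comparison against a 4 MiB object size (the default CephFS file layout; the post does not say which layout is in use).

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

FILES = 10_000
FILE_BYTES = 4 * KIB * 100          # dd bs=4k count=100
DATASET_BYTES = FILES * FILE_BYTES  # total data written and read back

# Assumption: the post does not say whether "39 GB" is decimal or binary.
cache_decimal = 39 * 10**9
cache_binary = 39 * GIB

for label, cache in (("39 GB", cache_decimal), ("39 GiB", cache_binary)):
    print(f"{label}: {cache / DATASET_BYTES:.2f}x the data read, "
          f"{cache / FILES / MIB:.2f} MiB of page cache per file")

# Assumption: 4 MiB is the default CephFS object size; the file layout
# actually used in the test is not given in the post.
ASSUMED_OBJECT_BYTES = 4 * MIB
one_object_per_file_gib = FILES * ASSUMED_OBJECT_BYTES / GIB
print(f"{FILES} files x 4 MiB = {one_object_per_file_gib} GiB")
```

Under either unit the page cache grows by roughly ten times the data read, or close to 4 MiB per 400 KiB file. That is at least consistent with whole-object-sized reads per file, but the post offers no evidence for or against that reading.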

RGW can't create bucket

by Kamil Madac

Hi,

One of my customers had a correctly working RGW cluster with two zones in one zonegroup. Since a few days ago, users have been unable to create buckets and always get Access Denied. Operations on existing buckets (such as listing or putting objects) still work; bucket creation is the only operation that fails. We also created a new user, but the behavior is the same: the new user cannot create a bucket either. We tried s3cmd, a Python script using the boto library, and the Dashboard as the admin user. We always get Access Denied. The zones are in sync.

Has anyone experienced such behavior? Thanks in advance. Here are some outputs:

    $ s3cmd -c .s3cfg_python_client mb s3://test
    ERROR: Access to bucket 'test' was denied
    ERROR: S3 error: 403 (AccessDenied)

Zones are in sync.

Primary cluster:

    # radosgw-admin sync status
              realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
          zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
               zone 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
      metadata sync no sync (zone is master)
          data sync source: e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq)
                            syncing
                            full sync: 0/128 shards
                            incremental sync: 128/128 shards
                            data is caught up with source

Secondary cluster:

    # radosgw-admin sync status
              realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
          zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
               zone e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq)
      metadata sync syncing
                    full sync: 0/64 shards
                    incremental sync: 64/64 shards
                    metadata is caught up with master
          data sync source: 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
                            syncing
                            full sync: 0/128 shards
                            incremental sync: 128/128 shards
                            data is caught up with source

--
Kamil Madac

compiling Nautilus for el9

by Marc

Is it possible to compile Nautilus for el9? Or maybe just the OSDs?

How mClock profile calculation works, and IOPS

by Luis Domingues

Hi,

I am reading some documentation about mClock and have two questions.

First, about the IOPS. Are those disk IOPS or some other kind of IOPS? And what are the assumptions behind them (block size, sequential or random reads/writes)?

Second, how does mClock calculate its profiles? I have my lab cluster running Quincy, and I have these parameters for mClock:

    "osd_mclock_max_capacity_iops_hdd": "450.000000",
    "osd_mclock_profile": "balanced",

According to the documentation (https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/#bala…), I am expecting to have:

    "osd_mclock_scheduler_background_best_effort_lim": "999999",
    "osd_mclock_scheduler_background_best_effort_res": "90",
    "osd_mclock_scheduler_background_best_effort_wgt": "2",
    "osd_mclock_scheduler_background_recovery_lim": "675",
    "osd_mclock_scheduler_background_recovery_res": "180",
    "osd_mclock_scheduler_background_recovery_wgt": "1",
    "osd_mclock_scheduler_client_lim": "450",
    "osd_mclock_scheduler_client_res": "180",
    "osd_mclock_scheduler_client_wgt": "1",

But what I get is:

    "osd_mclock_scheduler_background_best_effort_lim": "999999",
    "osd_mclock_scheduler_background_best_effort_res": "18",
    "osd_mclock_scheduler_background_best_effort_wgt": "2",
    "osd_mclock_scheduler_background_recovery_lim": "135",
    "osd_mclock_scheduler_background_recovery_res": "36",
    "osd_mclock_scheduler_background_recovery_wgt": "1",
    "osd_mclock_scheduler_client_lim": "90",
    "osd_mclock_scheduler_client_res": "36",
    "osd_mclock_scheduler_client_wgt": "1",

These values seem very low compared to what my disks appear to be able to handle. Is this calculation the expected one? Or did I miss something about how those profiles are populated?

Luis Domingues
Proton AG
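A quick comparison of the two lists shows that they differ by one constant factor. The sketch below recomputes the expected values from the percentages implied by the post (20%, 40%, 150% and 100% of 450 IOPS) and divides them by the observed ones. The interpretation at the end is an assumption, not something stated in the post.

```python
from fractions import Fraction

capacity_iops = 450  # osd_mclock_max_capacity_iops_hdd from the post

# Percentages implied by the values the poster expected for "balanced".
expected_pct = {
    "background_best_effort_res": 20,
    "background_recovery_res": 40,
    "background_recovery_lim": 150,
    "client_res": 40,
    "client_lim": 100,
}
expected = {k: capacity_iops * pct // 100 for k, pct in expected_pct.items()}

# Values the poster actually observed on the OSD.
observed = {
    "background_best_effort_res": 18,
    "background_recovery_res": 36,
    "background_recovery_lim": 135,
    "client_res": 36,
    "client_lim": 90,
}

ratios = {k: Fraction(expected[k], observed[k]) for k in expected}
for key, ratio in ratios.items():
    print(f"{key}: expected {expected[key]}, observed {observed[key]}, ratio {ratio}")
```

Every reservation and limit is exactly one fifth of the expected value. One possible explanation, offered here only as a hypothesis, is that the reported numbers are per OSD op shard: osd_op_num_shards_hdd defaults to 5, and 450 / 5 = 90 matches the observed client limit.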

Excessive occupation of small OSDs

by Nicola Mori

Dear Ceph users,

my cluster is made up of 10 old machines, with uneven numbers of disks and disk sizes. Essentially I have just one big data pool (6+2 erasure code, with host failure domain), for which I am currently seeing very little available space (88 TB, of which 40 TB are occupied, as reported by df -h on hosts mounting the cephfs) compared to the raw space (196.5 TB). I have a total of 104 OSDs and 512 PGs for the pool; I cannot increase the PG number since the machines are old, have very little RAM, and some of them are already overloaded.

In this situation I'm seeing a high occupancy of the small OSDs (500 GB) compared to the bigger ones (2 and 4 TB), even though the weight is set equal to the disk capacity (see below for ceph osd tree). For example, OSD 9 is at 62% occupancy even with weight 0.5 and reweight 0.75, while the highest occupancy for 2 TB OSDs is 41% (OSD 18) and for 4 TB OSDs is 23% (OSD 79). I guess this high occupancy of the 500 GB OSDs, combined with the erasure code size and the host failure domain, might be the cause of the poor available space. Could this be true? The upmap balancer is currently running, but I don't know if and how much it could improve the situation.

Any hint is greatly appreciated, thanks.

Nicola

    # ceph osd tree
    ID   CLASS  WEIGHT     TYPE NAME          STATUS  REWEIGHT  PRI-AFF
     -1         196.47754  root default
     -7          14.55518      host aka
      4    hdd    1.81940          osd.4          up   1.00000  1.00000
     11    hdd    1.81940          osd.11         up   1.00000  1.00000
     18    hdd    1.81940          osd.18         up   1.00000  1.00000
     26    hdd    1.81940          osd.26         up   1.00000  1.00000
     32    hdd    1.81940          osd.32         up   1.00000  1.00000
     41    hdd    1.81940          osd.41         up   1.00000  1.00000
     48    hdd    1.81940          osd.48         up   1.00000  1.00000
     55    hdd    1.81940          osd.55         up   1.00000  1.00000
     -3          14.55518      host balin
      0    hdd    1.81940          osd.0          up   1.00000  1.00000
      8    hdd    1.81940          osd.8          up   1.00000  1.00000
     15    hdd    1.81940          osd.15         up   1.00000  1.00000
     22    hdd    1.81940          osd.22         up   1.00000  1.00000
     29    hdd    1.81940          osd.29         up   1.00000  1.00000
     34    hdd    1.81940          osd.34         up   1.00000  1.00000
     43    hdd    1.81940          osd.43         up   1.00000  1.00000
     49    hdd    1.81940          osd.49         up   1.00000  1.00000
    -13          29.10950      host bifur
      3    hdd    3.63869          osd.3          up   1.00000  1.00000
     14    hdd    3.63869          osd.14         up   1.00000  1.00000
     27    hdd    3.63869          osd.27         up   1.00000  1.00000
     37    hdd    3.63869          osd.37         up   1.00000  1.00000
     50    hdd    3.63869          osd.50         up   1.00000  1.00000
     59    hdd    3.63869          osd.59         up   1.00000  1.00000
     64    hdd    3.63869          osd.64         up   1.00000  1.00000
     69    hdd    3.63869          osd.69         up   1.00000  1.00000
    -17          29.10950      host bofur
      2    hdd    3.63869          osd.2          up   1.00000  1.00000
     21    hdd    3.63869          osd.21         up   1.00000  1.00000
     39    hdd    3.63869          osd.39         up   1.00000  1.00000
     57    hdd    3.63869          osd.57         up   1.00000  1.00000
     66    hdd    3.63869          osd.66         up   1.00000  1.00000
     72    hdd    3.63869          osd.72         up   1.00000  1.00000
     76    hdd    3.63869          osd.76         up   1.00000  1.00000
     79    hdd    3.63869          osd.79         up   1.00000  1.00000
    -21          29.10376      host dwalin
     88    hdd    1.81898          osd.88         up   1.00000  1.00000
     89    hdd    1.81898          osd.89         up   1.00000  1.00000
     90    hdd    1.81898          osd.90         up   1.00000  1.00000
     91    hdd    1.81898          osd.91         up   1.00000  1.00000
     92    hdd    1.81898          osd.92         up   1.00000  1.00000
     93    hdd    1.81898          osd.93         up   1.00000  1.00000
     94    hdd    1.81898          osd.94         up   1.00000  1.00000
     95    hdd    1.81898          osd.95         up   1.00000  1.00000
     96    hdd    1.81898          osd.96         up   1.00000  1.00000
     97    hdd    1.81898          osd.97         up   1.00000  1.00000
     98    hdd    1.81898          osd.98         up   1.00000  1.00000
     99    hdd    1.81898          osd.99         up   1.00000  1.00000
    100    hdd    1.81898          osd.100        up   1.00000  1.00000
    101    hdd    1.81898          osd.101        up   1.00000  1.00000
    102    hdd    1.81898          osd.102        up   1.00000  1.00000
    103    hdd    1.81898          osd.103        up   1.00000  1.00000
     -9          14.55518      host ogion
      7    hdd    1.81940          osd.7          up   1.00000  1.00000
     16    hdd    1.81940          osd.16         up   1.00000  1.00000
     23    hdd    1.81940          osd.23         up   1.00000  1.00000
     33    hdd    1.81940          osd.33         up   1.00000  1.00000
     40    hdd    1.81940          osd.40         up   1.00000  1.00000
     47    hdd    1.81940          osd.47         up   1.00000  1.00000
     54    hdd    1.81940          osd.54         up   1.00000  1.00000
     61    hdd    1.81940          osd.61         up   1.00000  1.00000
    -19          14.55518      host prestno
     81    hdd    1.81940          osd.81         up   1.00000  1.00000
     82    hdd    1.81940          osd.82         up   1.00000  1.00000
     83    hdd    1.81940          osd.83         up   1.00000  1.00000
     84    hdd    1.81940          osd.84         up   1.00000  1.00000
     85    hdd    1.81940          osd.85         up   1.00000  1.00000
     86    hdd    1.81940          osd.86         up   1.00000  1.00000
     87    hdd    1.81940          osd.87         up   1.00000  1.00000
    104    hdd    1.81940          osd.104        up   1.00000  1.00000
    -15          29.10376      host remolo
      6    hdd    1.81897          osd.6          up   1.00000  1.00000
     12    hdd    1.81897          osd.12         up   1.00000  1.00000
     19    hdd    1.81897          osd.19         up   1.00000  1.00000
     28    hdd    1.81897          osd.28         up   1.00000  1.00000
     35    hdd    1.81897          osd.35         up   1.00000  1.00000
     44    hdd    1.81897          osd.44         up   1.00000  1.00000
     52    hdd    1.81897          osd.52         up   1.00000  1.00000
     58    hdd    1.81897          osd.58         up   1.00000  1.00000
     63    hdd    1.81897          osd.63         up   1.00000  1.00000
     67    hdd    1.81897          osd.67         up   1.00000  1.00000
     71    hdd    1.81897          osd.71         up   1.00000  1.00000
     73    hdd    1.81897          osd.73         up   1.00000  1.00000
     74    hdd    1.81897          osd.74         up   1.00000  1.00000
     75    hdd    1.81897          osd.75         up   1.00000  1.00000
     77    hdd    1.81897          osd.77         up   1.00000  1.00000
     78    hdd    1.81897          osd.78         up   1.00000  1.00000
     -5          14.55518      host rokanan
      1    hdd    1.81940          osd.1          up   1.00000  1.00000
     10    hdd    1.81940          osd.10         up   1.00000  1.00000
     17    hdd    1.81940          osd.17         up   1.00000  1.00000
     24    hdd    1.81940          osd.24         up   1.00000  1.00000
     31    hdd    1.81940          osd.31         up   1.00000  1.00000
     38    hdd    1.81940          osd.38         up   1.00000  1.00000
     46    hdd    1.81940          osd.46         up   1.00000  1.00000
     53    hdd    1.81940          osd.53         up   1.00000  1.00000
    -11           7.27515      host romolo
      5    hdd    0.45470          osd.5          up   1.00000  1.00000
      9    hdd    0.45470          osd.9          up   0.75000  1.00000
     13    hdd    0.45470          osd.13         up   1.00000  1.00000
     20    hdd    0.45470          osd.20         up   0.95000  1.00000
     25    hdd    0.45470          osd.25         up   0.75000  1.00000
     30    hdd    0.45470          osd.30         up   1.00000  1.00000
     36    hdd    0.45470          osd.36         up   1.00000  1.00000
     42    hdd    0.45470          osd.42         up   1.00000  1.00000
     45    hdd    0.45470          osd.45         up   0.85004  1.00000
     51    hdd    0.45470          osd.51         up   0.89999  1.00000
     56    hdd    0.45470          osd.56         up   1.00000  1.00000
     60    hdd    0.45470          osd.60         up   1.00000  1.00000
     62    hdd    0.45470          osd.62         up   1.00000  1.00000
     65    hdd    0.45470          osd.65         up   0.85004  1.00000
     68    hdd    0.45470          osd.68         up   1.00000  1.00000
     70    hdd    0.45470          osd.70         up   1.00000  1.00000
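The question about erasure code size and host failure domain can be framed as a capacity bound. The sketch below is an idealized calculation under stated assumptions, not a model of what CRUSH actually does: it takes the host weights from the tree above as capacities, assumes every PG places exactly one of its 8 chunks (6 data + 2 coding) on 8 distinct hosts, and asks how much raw data could be stored at best. The function name and the bisection approach are illustrative choices.

```python
K, M = 6, 2                      # erasure-code profile from the post
CHUNKS = K + M

# Host weights (TiB) copied from the ceph osd tree output above.
HOST_WEIGHTS = {
    "aka": 14.55518, "balin": 14.55518, "bifur": 29.10950, "bofur": 29.10950,
    "dwalin": 29.10376, "ogion": 14.55518, "prestno": 14.55518,
    "remolo": 29.10376, "rokanan": 14.55518, "romolo": 7.27515,
}

def max_raw_fill(host_caps, chunks):
    """Upper bound on raw data when no host may hold more than 1/chunks of it."""
    total = sum(host_caps)

    def slack(raw):
        # Space the hosts can contribute if each holds at most raw/chunks.
        return sum(min(cap, raw / chunks) for cap in host_caps) - raw

    if slack(total) >= 0:
        return total
    lo, hi = 0.0, total
    for _ in range(100):         # bisection on the largest feasible raw size
        mid = (lo + hi) / 2
        if slack(mid) >= 0:
            lo = mid
        else:
            hi = mid
    return lo

raw_total = sum(HOST_WEIGHTS.values())
raw_bound = max_raw_fill(list(HOST_WEIGHTS.values()), CHUNKS)

print(f"raw capacity:               {raw_total:.1f} TiB")
print(f"ideal usable (raw * k/(k+m)): {raw_total * K / CHUNKS:.1f} TiB")
print(f"raw usable with host limit: {raw_bound:.1f} TiB")
print(f"usable with host limit:     {raw_bound * K / CHUNKS:.1f} TiB")
```

Under these assumptions the four largest hosts each carry more than one eighth of the total weight, so they cannot be filled in proportion to their size, and the bound drops below the naive raw × 6/8 figure. The 88 TB reported by df is lower still; this sketch does not attempt to explain that remaining gap.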

monitoring apply_latency / commit_latency ?

by Matthias Ferdinand

Hi,

I would like to understand how the per-OSD data from "ceph osd perf" (i.e. apply_latency, commit_latency) is generated. So far I couldn't find documentation on this. "ceph osd perf" output is nice for a quick glimpse, but is not very well suited for graphing. Output values apparently come from the most recent 5s averages.

With "ceph daemon osd.X perf dump", on the other hand, you get quite a lot of latency metrics, but it is not obvious to me how they aggregate into apply_latency and commit_latency, or into some comparably simple read latency metric (something that is missing completely in "ceph osd perf").

Can somebody shed some light on this?

Regards
Matthias
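For graphing, one common approach with the perf dump counters is to sample them periodically and compute a per-interval average yourself. The sketch below assumes the usual shape of a latency counter in "perf dump" output, a cumulative pair of avgcount (number of operations) and sum (total seconds). The sample values are hypothetical, and this does not answer how "ceph osd perf" derives apply_latency or commit_latency.

```python
def interval_latency_ms(prev, curr):
    """Average latency in milliseconds between two cumulative samples.

    Each sample is a dict with "avgcount" (operations so far) and
    "sum" (total seconds spent so far). Returns None if no operations
    completed in the interval or the counters were reset.
    """
    ops = curr["avgcount"] - prev["avgcount"]
    if ops <= 0:
        return None
    return 1000.0 * (curr["sum"] - prev["sum"]) / ops

# Hypothetical samples of one latency counter, taken some seconds apart.
prev = {"avgcount": 1000, "sum": 2.0}
curr = {"avgcount": 1500, "sum": 3.5}

print(interval_latency_ms(prev, curr))   # 500 ops took 1.5 s in total
```

Dividing the deltas instead of the raw totals keeps the graph responsive to recent behavior; the raw sum / avgcount would average over the whole lifetime of the daemon.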

avg apply latency went up after update from octopus to pacific

by Boris Behrens

Hi,

today I did the first update from octopus to pacific, and it looks like the avg apply latency went up from 1ms to 2ms. All 36 OSDs are 4TB SSDs and nothing else changed. Does anyone know whether this is an issue, or am I just missing a config value?

Cheers
Boris

Failing to create monitor in a working cluster.

by pmestre＠gmail.com

Hello,

I've been running a 3-node Proxmox cluster with 4 Ceph OSDs for 3 years as a production cluster. As a test for moving the Ceph cluster network, I destroyed one of the 3 working monitors and tried to recreate it. After destroying it, the new monitor refuses to join the cluster, even on the old network. I've tried all steps in the "Troubleshooting monitors" section of the documentation.

The new monitor has this config, extracted from ceph --admin-daemon file.asok:

    {
      "name": "n3ceph",
      "rank": -1,
      "state": "probing",
      "election_epoch": 0,
      "quorum": [],
      "features": { "required_con": "2449958197560098820", "required_mon": [ "kraken", "luminous", "mimic", "osdmap-prune", "nautilus", "octopus", "pacific", "elector-pinging" ], "quorum_con": "0", "quorum_mon": [] },
      "outside_quorum": [],
      "extra_probe_peers": [],
      "sync_provider": [],
      "monmap": {
        "epoch": 6,
        "fsid": "5e60d0bb-33b4-42db-bbe7-7032c35ee605",
        "modified": "2023-03-31T11:54:44.616569Z",
        "created": "2019-12-02T13:50:38.097448Z",
        "min_mon_release": 16,
        "min_mon_release_name": "pacific",
        "election_strategy": 1,
        "disallowed_leaders: ": "",
        "stretch_mode": false,
        "tiebreaker_mon": "",
        "removed_ranks: ": "1",
        "features": { "persistent": [ "kraken", "luminous", "mimic", "osdmap-prune", "nautilus", "octopus", "pacific", "elector-pinging" ], "optional": [] },
        "mons": [
          { "rank": 0, "name": "node1", "public_addrs": { "addrvec": [ { "type": "v2", "addr": "10.100.100.1:3300", "nonce": 0 }, { "type": "v1", "addr": "10.100.100.1:6789", "nonce": 0 } ] }, "addr": "10.100.100.1:6789/0", "public_addr": "10.100.100.1:6789/0", "priority": 0, "weight": 0, "crush_location": "{}" },
          { "rank": 1, "name": "node2", "public_addrs": { "addrvec": [ { "type": "v2", "addr": "10.100.100.2:3300", "nonce": 0 }, { "type": "v1", "addr": "10.100.100.2:6789", "nonce": 0 } ] }, "addr": "10.100.100.2:6789/0", "public_addr": "10.100.100.2:6789/0", "priority": 0, "weight": 0, "crush_location": "{}" }
        ]
      },
      "feature_map": { "mon": [ { "features": "0x3f01cfbdfffdffff", "release": "luminous", "num": 1 } ] },
      "stretch_mode": false
    }

The quorum mon stat is as follows:

    {
      "name": "node1",
      "rank": 0,
      "state": "leader",
      "election_epoch": 340,
      "quorum": [ 0, 1 ],
      "quorum_age": 13090,
      "features": { "required_con": "2449958747317026820", "required_mon": [ "kraken", "luminous", "mimic", "osdmap-prune", "nautilus", "octopus", "pacific", "elector-pinging" ], "quorum_con": "4540138314316775423", "quorum_mon": [ "kraken", "luminous", "mimic", "osdmap-prune", "nautilus", "octopus", "pacific", "elector-pinging" ] },
      "outside_quorum": [],
      "extra_probe_peers": [],
      "sync_provider": [],
      "monmap": {
        "epoch": 6,
        "fsid": "5e60d0bb-33b4-42db-bbe7-7032c35ee605",
        "modified": "2023-03-31T11:54:44.616569Z",
        "created": "2019-12-02T13:50:38.097448Z",
        "min_mon_release": 16,
        "min_mon_release_name": "pacific",
        "election_strategy": 1,
        "disallowed_leaders: ": "",
        "stretch_mode": false,
        "tiebreaker_mon": "",
        "removed_ranks: ": "1",
        "features": { "persistent": [ "kraken", "luminous", "mimic", "osdmap-prune", "nautilus", "octopus", "pacific", "elector-pinging" ], "optional": [] },
        "mons": [
          { "rank": 0, "name": "node1", "public_addrs": { "addrvec": [ { "type": "v2", "addr": "10.100.100.1:3300", "nonce": 0 }, { "type": "v1", "addr": "10.100.100.1:6789", "nonce": 0 } ] }, "addr": "10.100.100.1:6789/0", "public_addr": "10.100.100.1:6789/0", "priority": 0, "weight": 0, "crush_location": "{}" },
          { "rank": 1, "name": "node2", "public_addrs": { "addrvec": [ { "type": "v2", "addr": "10.100.100.2:3300", "nonce": 0 }, { "type": "v1", "addr": "10.100.100.2:6789", "nonce": 0 } ] }, "addr": "10.100.100.2:6789/0", "public_addr": "10.100.100.2:6789/0", "priority": 0, "weight": 0, "crush_location": "{}" }
        ]
      },
      "feature_map": {
        "mon": [ { "features": "0x3f01cfbdfffdffff", "release": "luminous", "num": 1 } ],
        "osd": [ { "features": "0x3f01cfbdfffdffff", "release": "luminous", "num": 5 } ],
        "client": [ { "features": "0x2f018fb87aa4aafe", "release": "luminous", "num": 1 }, { "features": "0x3f01cfbdfffdffff", "release": "luminous", "num": 12 } ],
        "mgr": [ { "features": "0x3f01cfbdfffdffff", "release": "luminous", "num": 1 } ]
      },
      "stretch_mode": false
    }

I tried to get a debug log with ceph daemon mon.n3ceph config set debug_mon 10/10 and by restarting the service, but the ceph log file stopped being written after I tried that setting. journalctl -u tells me:

    mar 31 17:35:22 node3 ceph-mon[240916]: 2023-03-31T17:35:22.926+0200 7f49e0699700 -1 mon.n3ceph@-1(probing) e6 get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 at 2023-03-31T17:30:19.347379+0200)
    mar 31 17:35:27 node3 ceph-mon[240916]: 2023-03-31T17:35:27.926+0200 7f49e0699700 -1 mon.n3ceph@-1(probing) e6 get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 at 2023-03-31T17:30:19.347379+0200)
    mar 31 17:35:32 node3 ceph-mon[240916]: 2023-03-31T17:35:32.926+0200 7f49e0699700 -1 mon.n3ceph@-1(probing) e6 get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 at 2023-03-31T17:30:19.347379+0200)

Any ideas? The cluster is running fine with two monitors, but a reboot of one of the nodes might be a big problem.

Kind regards and many thanks.

Ceph Failure and OSD Node Stuck Incident

by petersun＠raksmart.com

We encountered a Ceph failure where the system became unresponsive, with no IOPS or throughput, after a node failed. Upon investigation, it appears that the OSD process on one of the Ceph storage nodes was stuck while the node still responded to ping. During the failure, Ceph was unable to recognize the problematic node, which resulted in all other OSDs in the cluster experiencing slow operations and no IOPS in the cluster at all.

Here's the timeline of the incident:

- At 10:40, an alert is triggered, indicating a problem with the OSD.
- After the alert, Ceph becomes unresponsive with no IOPS or throughput.
- At 11:26, an engineer discovers that there is a gradual OSD failure, with 6 out of 12 OSDs on the node being down.
- At 11:46, the Ceph engineer is unable to SSH into the faulty node and attempts a soft restart, but the "smartmontools" process is stuck while shutting down the server. Ping works during this time.
- After waiting for about one or two minutes, a hard restart is attempted for the server.
- At 11:57, after the Ceph node starts normally, service resumes as usual, indicating that the issue has been resolved.

Here is some basic information about our services:

- `Mon: 5 daemons, quorum host001, host002, host003, host004, host005 (age 4w)`
- `Mgr: host005 (active, since 4w), standbys: host001, host002, host003, host004`
- `Osd: 218 osds: 218 up (since 22h), 218 in (since 22h)`

We have a cluster with 19 nodes, including 15 SSD nodes and 4 HDD nodes. In total, there are 218 OSDs. The SSD nodes have 11 OSDs on Samsung EVO 870 SSDs, with the DB/WAL for each drive on a 1.6 TB NVMe drive. We are using Ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable).

Here is the health check detail:

    [root@node21 ~]# ceph health detail
    HEALTH_WARN 1 osds down; Reduced data availability: 12 pgs inactive, 12 pgs peering; Degraded data redundancy: 272273/43967625 objects degraded (0.619%), 88 pgs degraded, 5 pgs undersized; 18192 slow ops, oldest one blocked for 3730 sec, daemons [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]... have slow ops.
    [WRN] OSD_DOWN: 1 osds down
        osd.174 (root=default,host=hkhost031) is down
    [WRN] PG_AVAILABILITY: Reduced data availability: 12 pgs inactive, 12 pgs peering
        pg 2.dc is stuck peering for 49m, current state peering, last acting [87,95,172]
        pg 2.e2 is stuck peering for 15m, current state peering, last acting [51,177,97]
        ......
        pg 2.f7e is active+undersized+degraded, acting [10,214]
        pg 2.f84 is active+undersized+degraded, acting [91,52]
    [WRN] SLOW_OPS: 18192 slow ops, oldest one blocked for 3730 sec, daemons [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]... have slow ops.

I have the following questions:

1. Why couldn't Ceph detect the faulty node and automatically abandon its resources? Can anyone provide more troubleshooting guidance for this case?
2. What is Ceph's detection mechanism and where can I find related information? All of our production cloud machines were affected and suspended. If RBD is unstable, we cannot continue to use Ceph technology for our RBD source.
3. Did we miss any patches or bug fixes?
4. Can anyone suggest improvements, and how can we quickly detect and avoid similar issues in the future?
