We upgraded our Jewel cluster to Nautilus a few months ago and I've noticed
that op behavior has changed. This is an HDD cluster (NVMe journals and
NVMe CephFS metadata pool) with about 800 OSDs. When on Jewel and running
WPQ with the high cut-off, it was rock solid. When recoveries were going
on, they barely dented the client ops, and when the client load on the
cluster dropped, the backfills would run as fast as the cluster could go. I
could have max_backfills set to 10 and the cluster performed admirably.
After upgrading to Nautilus, the cluster struggles with any kind of
recovery, and if there is any significant client write load the cluster can
get into a death spiral. Even heavy client write bandwidth alone (3-4 GB/s)
can cause heartbeat check warnings to be raised, blocked IO, and even OSDs
becoming unresponsive.
As the person who wrote the WPQ code initially, I know that it was fair and
proportional to the op priority and in Jewel it worked. It's not working in
Nautilus. I've tweaked a lot of things trying to troubleshoot the issue, and
setting the recovery priority to 1 or even zero barely makes any difference.
My best estimate is that the op priority is getting lost before it reaches
the WPQ scheduler, so the scheduler cannot prioritize and dispatch ops
correctly. It's almost as if all ops are being treated the same and there is
no prioritization at all.
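For reference, the settings involved can be checked on a running OSD and
pinned cluster-wide with something like the following (a rough sketch; osd.0
is just an example id, and as far as I know both options only take effect
after an OSD restart):
# Show what the op queue is actually set to on a running OSD
ceph daemon osd.0 config show | grep osd_op_queue
# Pin WPQ and the high cut-off cluster-wide, then restart the OSDs
ceph config set osd osd_op_queue wpq
ceph config set osd osd_op_queue_cut_off high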
Unfortunately, I do not have the time to set up the dev/testing environment
to track this down and we will be moving away from Ceph. But I really like
Ceph and want to see it succeed. I strongly suggest that someone look into
this because I think it will resolve a lot of problems people have had on
the mailing list. I'm not sure whether a bug was introduced with the other
queues that touch more of the op path, or whether something in the op path
restructuring changed how things work (I know that was being discussed
around the time Jewel was released). But my guess is that the problem lies
somewhere between the op being created and it being received into the queue.
I really hope that this helps in the search for this regression. I spent a
lot of time studying the issue to come up with WPQ and saw it work great
when I switched this cluster from PRIO to WPQ. I've also spent countless
hours studying how it's changed in Nautilus.
Thank you,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
In a recent cluster reorganization, we ended up with a lot of
undersized/degraded PGs and a day of recovery from them, when all we
expected was moving some data around. After retracing my steps, I found
something odd. If I crush reweight an OSD to 0 while it is down - it
results in the PGs of that OSD ending up degraded even after the OSD is
restarted. If I do the same reweighting while the OSD is up - data gets
moved without any degraded/undersized states. I would not expect this, so
I wonder whether this is a bug or somehow intended. This is on Ceph
Nautilus 14.2.8. Below are the details.
Andras
First the case that works as I would expect:
# Healthy cluster ...
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 270 up (since 2m), 270 in (since 4h)
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 5312 active+clean
# Reweight an OSD to 0
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map
# Crush map changes - data movement is set up, no degraded PGs:
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 270 up (since 10m), 270 in (since 5h); 175 remapped pgs
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 2562045/232996662 objects misplaced (1.100%)
5137 active+clean
172 active+remapped+backfilling
3 active+remapped+backfill_wait
# Reweight it back to the original weight
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map
# Cluster goes back to clean
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 270 up (since 11m), 270 in (since 5h)
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 5312 active+clean
#
# Now the problematic case
#
# Stop an OSD
[root@xorphosd00 ~]# systemctl stop ceph-osd@0
# We get degraded PGs - as expected
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
1 osds down
Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 269 up (since 16s), 270 in (since 5h)
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 873964/232996662 objects degraded (0.375%)
5230 active+clean
82 active+undersized+degraded
# Reweight the OSD to 0:
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map
# Still degraded - as expected
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
1 osds down
Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 269 up (since 59s), 270 in (since 5h); 175 remapped pgs
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 873964/232996662 objects degraded (0.375%)
1688081/232996662 objects misplaced (0.725%)
5137 active+clean
93 active+remapped+backfilling
82 active+undersized+degraded+remapped+backfilling
# Restarting the OSD
[root@xorphosd00 ~]# systemctl start ceph-osd@0
# And the PGs still stay degraded - THIS IS UNEXPECTED!!!
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 270 up (since 14s), 270 in (since 5h); 175 remapped pgs
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 873964/232996662 objects degraded (0.375%)
1688081/232996662 objects misplaced (0.725%)
5137 active+clean
93 active+remapped+backfilling
82 active+undersized+degraded+remapped+backfilling
# Now for something even more odd - reweight the OSD back to its original
# weight, and all the data gets magically FOUND again on that OSD!!!
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 270 up (since 51s), 270 in (since 5h)
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 5312 active+clean
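In case it helps anyone reproduce or dig into this, the state of the stuck
PGs can be inspected with something like the following (the PG id is just a
placeholder):
# List the PGs that remain degraded and the OSDs they map to
ceph pg dump pgs_brief | grep degraded
# Peering/recovery state of one affected PG (placeholder id)
ceph pg 1.2f query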
Hello,
I have a pool of more than 300 OSDs that are all the identical model
(Seagate ST1800MM0129, size 1.64 TiB).
Only one OSD crashes regularly, but I cannot identify a root cause.
Based on the output of smartctl the disk is ok.
# smartctl -a -d megaraid,1 /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: LENOVO-X
Product: ST1800MM0129
Revision: L2B6
Compliance: SPC-4
User Capacity: 1,800,360,124,416 bytes [1.80 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 10500 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c500bb7822cf
Serial number: WBN0QHX80000E852944J
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon May 18 09:19:41 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 68
Power on minutes since format <not available>
Current Drive Temperature: 33 C
Drive Trip Temperature: 65 C
Manufactured in week 31 of year 2018
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 21
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 709
Elements in grown defect list: 18
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3278853896        1         0  3278853897         32      83933.567          19
write:           0        0         0           0          0      24093.894           0
verify: 3080361880        0         0  3080361880          0      12630.494           0
Non-medium error count: 244
SMART Self-test log
Num  Test              Status     segment  LifeTime  LBA_first_err  [SK ASC ASQ]
     Description                  number   (hours)
# 1  Background short  Completed     -       3761          -        [- - -]
# 2  Background short  Completed     -       3737          -        [- - -]
# 3  Background short  Completed     -       3713          -        [- - -]
# 4  Background short  Completed     -       3689          -        [- - -]
# 5  Background short  Completed     -       3665          -        [- - -]
# 6  Background short  Completed     -       3641          -        [- - -]
# 7  Background short  Completed     -       3617          -        [- - -]
# 8  Background short  Completed     -       3593          -        [- - -]
# 9  Background long   Completed     -       3569          -        [- - -]
#10  Background short  Completed     -       3545          -        [- - -]
#11  Background short  Completed     -       3521          -        [- - -]
#12  Background short  Completed     -       3497          -        [- - -]
#13  Background short  Completed     -       3473          -        [- - -]
#14  Background short  Completed     -       3449          -        [- - -]
#15  Background short  Completed     -       3425          -        [- - -]
#16  Background short  Completed     -       3401          -        [- - -]
#17  Background short  Completed     -       3377          -        [- - -]
#18  Background short  Completed     -       3353          -        [- - -]
#19  Background short  Completed     -       3329          -        [- - -]
#20  Background short  Completed     -       3305          -        [- - -]
Long (extended) Self-test duration: 9459 seconds [157.7 minutes]
I have attached the log of the affected OSD.
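In case it is useful, here are a couple of commands that might help
correlate the crashes with this particular disk (this assumes the crash and
devicehealth mgr modules are enabled; the crash/device ids are placeholders,
and the OSD id is taken from the attached log):
# Crashes recorded by the cluster, and the details of one of them
ceph crash ls
ceph crash info <crash-id>
# SMART/health data as collected by Ceph for the device behind osd.92
ceph device ls-by-daemon osd.92
ceph device get-health-metrics <device-id>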
THX
Thomas
I have uploaded one file belonging to this e-mail:
ceph-osd.92.log.1.gz (578 KB): https://we.tl/t-7DzNCDP3iZ
Hi,
I'm using Nautilus and I'm using the whole cluster mainly for a single
bucket in RadosGW.
There is a lot of data in this bucket (petabyte scale) and I don't want to
waste all of my SSDs on it.
Is there any way to automatically set an aging threshold for this data and,
e.g., move any data older than a month to HDD OSDs?
Does anyone have experience with this:
Pool Placement and Storage Classes:
https://docs.ceph.com/docs/master/radosgw/placement/
But something automatic would be much better for me in this case.
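From that placement doc it looks like the building blocks would be a storage
class backed by an HDD pool plus an S3 lifecycle transition rule; a rough,
untested sketch of what I have in mind (zone/pool/class names are made up):
# Add an HDD-backed storage class to the existing placement target
radosgw-admin zonegroup placement add --rgw-zonegroup default \
  --placement-id default-placement --storage-class COLD
radosgw-admin zone placement add --rgw-zone default \
  --placement-id default-placement --storage-class COLD \
  --data-pool default.rgw.cold.data
# Then an S3 lifecycle rule on the bucket transitions objects after 30 days,
# e.g. with the AWS CLI: aws s3api put-bucket-lifecycle-configuration \
#   --bucket mybucket --lifecycle-configuration file://lc.json
# where lc.json contains:
# { "Rules": [ { "ID": "move-to-hdd", "Filter": { "Prefix": "" },
#     "Status": "Enabled",
#     "Transitions": [ { "Days": 30, "StorageClass": "COLD" } ] } ] }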
Any help would be appreciated.
Thanks a lot,
Khodayar
Hi All,
When we ran "ceph mon enable-msgr2" after the gateway service upgrade, one
of the mon services crashed and never came back. It shows:
/usr/bin/ceph-mon -f --cluster ceph --id mon01 --setuser ceph --setgroup
ceph --debug_monc 20 --debug_ms 5
global_init: error reading config file.
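The error is about reading the config file rather than msgr2 itself, so it
may be worth checking that the file exists and is readable by the ceph user
before digging deeper (the paths below are the defaults and may differ on
your systems):
ls -l /etc/ceph/ceph.conf
sudo -u ceph head /etc/ceph/ceph.conf
# or point the mon at the config file explicitly:
/usr/bin/ceph-mon -f --cluster ceph --id mon01 --setuser ceph --setgroup ceph \
  -c /etc/ceph/ceph.conf --debug_monc 20 --debug_ms 5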
Thanks
AmitG
Hello All,
We have 6 servers.
Configuration for each server:
1 ssd for mon (only on three servers)
1 ssd 1.9 TB for db/wal
1 nvme 1.6 TB for db/wal
10 SAS hdd 3.6 TB for osd
We decided to create a pool of 30 OSDs (5x6) with db/wal on SSD and a pool
of 30 OSDs (5x6) with db/wal on NVMe.
So we created a VM on the pool with db/wal on SSD and a VM on the pool with
db/wal on NVMe.
Fio performance is almost the same on both.
What do you think about it?
I expected better performance from the pool with db/wal on PCIe NVMe.
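In case it matters, the fio job I would expect to show a db/wal difference
(if there is one) is small synchronous random writes rather than large
sequential IO, since that is where the WAL device sits on the critical
path; for example, inside the VM (the device path is a placeholder):
fio --name=sync-randwrite --filename=/dev/vdb --direct=1 --sync=1 \
  --ioengine=libaio --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
  --runtime=60 --time_based --group_reporting
If both flash devices can absorb that WAL traffic comfortably, the HDDs may
simply be the bottleneck in both pools, which would explain the similar
numbers.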
PS: the SSDs are behind a SAS controller; the NVMe is a PCIe Samsung
PM1725B.
Best Regards
Ignazio
Looks like the immediate danger has passed:
[root@gnosis ~]# ceph status
cluster:
id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
health: HEALTH_WARN
nodown,noout flag(s) set
735 slow ops, oldest one blocked for 3573 sec, daemons [mon.ceph-02,mon.ceph-03] have slow ops.
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 288 osds: 268 up, 268 in
flags nodown,noout
data:
pools: 10 pools, 2545 pgs
objects: 86.76 M objects, 218 TiB
usage: 277 TiB used, 1.5 PiB / 1.8 PiB avail
pgs: 2537 active+clean
8 active+clean+scrubbing+deep
io:
client: 34 MiB/s rd, 24 MiB/s wr, 954 op/s rd, 1.01 kop/s wr
I will prepare a new case with info we have collected so far.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Amit Ghadge <amitg.b14(a)gmail.com>
Sent: 20 May 2020 09:44
To: Frank Schilder
Subject: Re: [ceph-users] total ceph outage again, need help
It looks like ceph-01 shows as starting, so I think that is why the command was not executed; you could also try disabling scrubbing temporarily.
Dear cephers,
I'm sitting with a major ceph outage again. The mon/mgr hosts suffer from a packet storm of ceph traffic between ceph fs clients and the mons. No idea why this is happening.
The main problem is that I can't get through to the cluster; admin commands hang forever:
[root@gnosis ~]# ceph osd set nodown
However, "ceph status" returns and shows me that I need to do something:
[root@gnosis ~]# ceph status
cluster:
id: ---
health: HEALTH_WARN
2 MDSs report slow metadata IOs
1 MDSs report slow requests
8 osds down
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active, starting), standbys: ceph-02, ceph-03
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 288 osds: 208 up, 216 in; 153 remapped pgs
data:
pools: 10 pools, 2545 pgs
objects: 86.71 M objects, 218 TiB
usage: 277 TiB used, 1.5 PiB / 1.8 PiB avail
pgs: 2542 active+clean
3 active+clean+scrubbing+deep
io:
client: 152 MiB/s rd, 72 MiB/s wr, 854 op/s rd, 796 op/s wr
Is there any way to get admin commands to the mons with higher priority?
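One thing I might try in the meantime (not sure it helps; the address below
is a placeholder) is to talk to a single mon directly instead of the whole
quorum, and to use the local admin socket on a mon host, which usually still
answers:
ceph -m 192.168.1.11:6789 status
# on the mon host itself:
ceph daemon mon.ceph-02 mon_status
ceph daemon mon.ceph-02 ops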
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
I'm surprised I couldn't find this explained anywhere (I did look), but ...
What is the pgmap and why does it get updated every few seconds on a tiny
cluster that's mostly idle?
I do know what a placement group (PG) is and that when documentation talks
about placement group maps, it is talking about something else -- mapping of
PGs to OSDs by CRUSH and OSD maps.
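For anyone wanting to look at the same thing, the pgmap itself (and its
version, which is what keeps incrementing) can be dumped with, e.g.:
ceph pg stat
ceph pg dump --format json-pretty | head -40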
--
Bryan Henderson San Jose, California