We upgraded our Jewel cluster to Nautilus a few months ago and I've noticed
that op behavior has changed. This is an HDD cluster (NVMe journals and
NVMe CephFS metadata pool) with about 800 OSDs. When on Jewel and running
WPQ with the high cut-off, it was rock solid. When recoveries were going
on, they barely dented the client ops, and when the client load on the
cluster dropped, the backfills would run as fast as the cluster could go. I
could have max_backfills set to 10 and the cluster performed admirably.
After upgrading to Nautilus, the cluster struggles with any kind of
recovery, and if there is any significant client write load the cluster can
get into a death spiral. Even heavy client write bandwidth alone (3-4 GB/s)
can cause heartbeat check warnings to be raised, blocked IO, and even OSDs
becoming unresponsive.
As the person who wrote the WPQ code initially, I know that it was fair and
proportional to the op priority and in Jewel it worked. It's not working in
Nautilus. I've tweaked a lot of things trying to troubleshoot the issue, and
setting the recovery priority to 1 or even zero barely makes any difference.
My best estimate is that the op priority is getting lost before it reaches
the WPQ scheduler, so the scheduler cannot prioritize and dispatch ops
correctly. It's almost as if all ops are being treated the same and there is
no prioritization at all.
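For reference, the settings involved can be checked on a running OSD and
pinned cluster-wide with something like the following (a rough sketch; osd.0
is just an example id, and as far as I know both options only take effect
after an OSD restart):
# Show what the op queue is actually set to on a running OSD
ceph daemon osd.0 config show | grep osd_op_queue
# Pin WPQ and the high cut-off cluster-wide, then restart the OSDs
ceph config set osd osd_op_queue wpq
ceph config set osd osd_op_queue_cut_off high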
Unfortunately, I do not have the time to set up the dev/testing environment
to track this down and we will be moving away from Ceph. But I really like
Ceph and want to see it succeed. I strongly suggest that someone look into
this because I think it will resolve a lot of problems people have had on
the mailing list. I'm not sure whether a bug was introduced with the other
queues that touch more of the op path, or whether something in the op path
restructuring changed how things work (I know that was being discussed
around the time Jewel was released). But my guess is that the problem lies
somewhere between the op being created and it being received into the queue.
I really hope that this helps in the search for this regression. I spent a
lot of time studying the issue to come up with WPQ and saw it work great
when I switched this cluster from PRIO to WPQ. I've also spent countless
hours studying how it's changed in Nautilus.
Thank you,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
In a recent cluster reorganization, we ended up with a lot of
undersized/degraded PGs and a day of recovery from them, when all we
expected was moving some data around. After retracing my steps, I found
something odd. If I crush reweight an OSD to 0 while it is down - it
results in the PGs of that OSD ending up degraded even after the OSD is
restarted. If I do the same reweighting while the OSD is up - data gets
moved without any degraded/undersized states. I would not expect this, so
I wonder whether this is a bug or somehow intended. This is on Ceph
Nautilus 14.2.8. Below are the details.
Andras
First the case that works as I would expect:
# Healthy cluster ...
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 270 up (since 2m), 270 in (since 4h)
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 5312 active+clean
# Reweight an OSD to 0
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map
# Crush map changes - data movement is set up, no degraded PGs:
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 270 up (since 10m), 270 in (since 5h); 175 remapped pgs
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 2562045/232996662 objects misplaced (1.100%)
5137 active+clean
172 active+remapped+backfilling
3 active+remapped+backfill_wait
# Reweight it back to the original weight
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map
# Cluster goes back to clean
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 270 up (since 11m), 270 in (since 5h)
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 5312 active+clean
#
# Now the problematic case
#
# Stop an OSD
[root@xorphosd00 ~]# systemctl stop ceph-osd@0
# We get degraded PGs - as expected
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
1 osds down
Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 269 up (since 16s), 270 in (since 5h)
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 873964/232996662 objects degraded (0.375%)
5230 active+clean
82 active+undersized+degraded
# Reweight the OSD to 0:
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map
# Still degraded - as expected
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
1 osds down
Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 269 up (since 59s), 270 in (since 5h); 175 remapped pgs
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 873964/232996662 objects degraded (0.375%)
1688081/232996662 objects misplaced (0.725%)
5137 active+clean
93 active+remapped+backfilling
82 active+undersized+degraded+remapped+backfilling
# Restarting the OSD
[root@xorphosd00 ~]# systemctl start ceph-osd@0
# And the PGs still stay degraded - THIS IS UNEXPECTED!!!
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 270 up (since 14s), 270 in (since 5h); 175 remapped pgs
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 873964/232996662 objects degraded (0.375%)
1688081/232996662 objects misplaced (0.725%)
5137 active+clean
93 active+remapped+backfilling
82 active+undersized+degraded+remapped+backfilling
# Now for something even more odd - reweight the OSD back to its original
# weight, and all the data gets magically FOUND again on that OSD!!!
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map
[root@xorphosd00 ~]# ceph -s
cluster:
id: 86d8a1b9-761b-4099-a960-6a303b951236
health: HEALTH_WARN
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
osd: 270 osds: 270 up (since 51s), 270 in (since 5h)
flags noout,nobackfill,noscrub,nodeep-scrub
data:
pools: 4 pools, 5312 pgs
objects: 75.87M objects, 287 TiB
usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 5312 active+clean
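In case it helps anyone reproduce or dig into this, the state of the stuck
PGs can be inspected with something like the following (the PG id is just a
placeholder):
# List the PGs that remain degraded and the OSDs they map to
ceph pg dump pgs_brief | grep degraded
# Peering/recovery state of one affected PG (placeholder id)
ceph pg 1.2f query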
Hello,
I have a pool of more than 300 OSDs that are all the identical model
(Seagate ST1800MM0129, size 1.64 TiB).
Only one OSD crashes regularly, but I cannot identify a root cause.
Based on the output of smartctl the disk is ok.
# smartctl -a -d megaraid,1 /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: LENOVO-X
Product: ST1800MM0129
Revision: L2B6
Compliance: SPC-4
User Capacity: 1,800,360,124,416 bytes [1.80 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 10500 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c500bb7822cf
Serial number: WBN0QHX80000E852944J
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon May 18 09:19:41 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 68
Power on minutes since format <not available>
Current Drive Temperature: 33 C
Drive Trip Temperature: 65 C
Manufactured in week 31 of year 2018
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 21
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 709
Elements in grown defect list: 18
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3278853896        1         0  3278853897         32      83933.567          19
write:           0        0         0           0          0      24093.894           0
verify: 3080361880        0         0  3080361880          0      12630.494           0
Non-medium error count: 244
SMART Self-test log
Num  Test              Status     segment  LifeTime  LBA_first_err  [SK ASC ASQ]
     Description                  number   (hours)
# 1  Background short  Completed     -       3761          -        [- - -]
# 2  Background short  Completed     -       3737          -        [- - -]
# 3  Background short  Completed     -       3713          -        [- - -]
# 4  Background short  Completed     -       3689          -        [- - -]
# 5  Background short  Completed     -       3665          -        [- - -]
# 6  Background short  Completed     -       3641          -        [- - -]
# 7  Background short  Completed     -       3617          -        [- - -]
# 8  Background short  Completed     -       3593          -        [- - -]
# 9  Background long   Completed     -       3569          -        [- - -]
#10  Background short  Completed     -       3545          -        [- - -]
#11  Background short  Completed     -       3521          -        [- - -]
#12  Background short  Completed     -       3497          -        [- - -]
#13  Background short  Completed     -       3473          -        [- - -]
#14  Background short  Completed     -       3449          -        [- - -]
#15  Background short  Completed     -       3425          -        [- - -]
#16  Background short  Completed     -       3401          -        [- - -]
#17  Background short  Completed     -       3377          -        [- - -]
#18  Background short  Completed     -       3353          -        [- - -]
#19  Background short  Completed     -       3329          -        [- - -]
#20  Background short  Completed     -       3305          -        [- - -]
Long (extended) Self-test duration: 9459 seconds [157.7 minutes]
I have attached the log of the affected OSD.
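In case it is useful, here are a couple of commands that might help
correlate the crashes with this particular disk (this assumes the crash and
devicehealth mgr modules are enabled; the crash/device ids are placeholders,
and the OSD id is taken from the attached log):
# Crashes recorded by the cluster, and the details of one of them
ceph crash ls
ceph crash info <crash-id>
# SMART/health data as collected by Ceph for the device behind osd.92
ceph device ls-by-daemon osd.92
ceph device get-health-metrics <device-id>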
THX
Thomas
I have uploaded one file belonging to this e-mail:
ceph-osd.92.log.1.gz (578 KB): https://we.tl/t-7DzNCDP3iZ
Hi,
I'm using Nautilus and I'm using the whole cluster mainly for a single
bucket in RadosGW.
There is a lot of data in this bucket (petabyte scale) and I don't want to
waste all of my SSDs on it.
Is there any way to automatically set an aging threshold for this data and,
e.g., move any data older than a month to HDD OSDs?
Does anyone have experience with this:
Pool Placement and Storage Classes:
https://docs.ceph.com/docs/master/radosgw/placement/
But something automatic would be much better for me in this case.
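From that placement doc it looks like the building blocks would be a storage
class backed by an HDD pool plus an S3 lifecycle transition rule; a rough,
untested sketch of what I have in mind (zone/pool/class names are made up):
# Add an HDD-backed storage class to the existing placement target
radosgw-admin zonegroup placement add --rgw-zonegroup default \
  --placement-id default-placement --storage-class COLD
radosgw-admin zone placement add --rgw-zone default \
  --placement-id default-placement --storage-class COLD \
  --data-pool default.rgw.cold.data
# Then an S3 lifecycle rule on the bucket transitions objects after 30 days,
# e.g. with the AWS CLI: aws s3api put-bucket-lifecycle-configuration \
#   --bucket mybucket --lifecycle-configuration file://lc.json
# where lc.json contains:
# { "Rules": [ { "ID": "move-to-hdd", "Filter": { "Prefix": "" },
#     "Status": "Enabled",
#     "Transitions": [ { "Days": 30, "StorageClass": "COLD" } ] } ] }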
Any help would be appreciated.
Thanks a lot,
Khodayar
Hi All,
When we ran "ceph mon enable-msgr2" after the gateway service upgrade, one
of the mon services crashed and never came back. It shows:
/usr/bin/ceph-mon -f --cluster ceph --id mon01 --setuser ceph --setgroup
ceph --debug_monc 20 --debug_ms 5
global_init: error reading config file.
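The error is about reading the config file rather than msgr2 itself, so it
may be worth checking that the file exists and is readable by the ceph user
before digging deeper (the paths below are the defaults and may differ on
your systems):
ls -l /etc/ceph/ceph.conf
sudo -u ceph head /etc/ceph/ceph.conf
# or point the mon at the config file explicitly:
/usr/bin/ceph-mon -f --cluster ceph --id mon01 --setuser ceph --setgroup ceph \
  -c /etc/ceph/ceph.conf --debug_monc 20 --debug_ms 5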
Thanks
AmitG
Hello All,
We have 6 servers.
Configuration for each server:
1 ssd for mon (only on three servers)
1 ssd 1.9 TB for db/wal
1 nvme 1.6 TB for db/wal
10 SAS hdd 3.6 TB for osd
We decided to create a pool of 30 OSDs (5x6) with db/wal on SSD and a pool
of 30 OSDs (5x6) with db/wal on NVMe.
So we created a VM on the pool with db/wal on SSD and a VM on the pool with
db/wal on NVMe.
Fio performance is almost the same on both.
What do you think about it?
I expected better performance from the pool with db/wal on PCIe NVMe.
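In case it matters, the fio job I would expect to show a db/wal difference
(if there is one) is small synchronous random writes rather than large
sequential IO, since that is where the WAL device sits on the critical
path; for example, inside the VM (the device path is a placeholder):
fio --name=sync-randwrite --filename=/dev/vdb --direct=1 --sync=1 \
  --ioengine=libaio --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
  --runtime=60 --time_based --group_reporting
If both flash devices can absorb that WAL traffic comfortably, the HDDs may
simply be the bottleneck in both pools, which would explain the similar
numbers.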
PS: the SSDs are behind a SAS controller; the NVMe is a PCIe Samsung
PM1725B.
Best Regards
Ignazio
Looks like the immediate danger has passed:
[root@gnosis ~]# ceph status
cluster:
id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
health: HEALTH_WARN
nodown,noout flag(s) set
735 slow ops, oldest one blocked for 3573 sec, daemons [mon.ceph-02,mon.ceph-03] have slow ops.
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 288 osds: 268 up, 268 in
flags nodown,noout
data:
pools: 10 pools, 2545 pgs
objects: 86.76 M objects, 218 TiB
usage: 277 TiB used, 1.5 PiB / 1.8 PiB avail
pgs: 2537 active+clean
8 active+clean+scrubbing+deep
io:
client: 34 MiB/s rd, 24 MiB/s wr, 954 op/s rd, 1.01 kop/s wr
I will prepare a new case with info we have collected so far.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Amit Ghadge <amitg.b14(a)gmail.com>
Sent: 20 May 2020 09:44
To: Frank Schilder
Subject: Re: [ceph-users] total ceph outage again, need help
It looks like ceph-01 shows as starting, so I think that is why the command was not executed; you could also try disabling scrubbing temporarily.
Dear cephers,
I'm sitting with a major ceph outage again. The mon/mgr hosts suffer from a packet storm of ceph traffic between ceph fs clients and the mons. No idea why this is happening.
The main problem is that I can't get through to the cluster; admin commands hang forever:
[root@gnosis ~]# ceph osd set nodown
However, "ceph status" returns and shows me that I need to do something:
[root@gnosis ~]# ceph status
cluster:
id: ---
health: HEALTH_WARN
2 MDSs report slow metadata IOs
1 MDSs report slow requests
8 osds down
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active, starting), standbys: ceph-02, ceph-03
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 288 osds: 208 up, 216 in; 153 remapped pgs
data:
pools: 10 pools, 2545 pgs
objects: 86.71 M objects, 218 TiB
usage: 277 TiB used, 1.5 PiB / 1.8 PiB avail
pgs: 2542 active+clean
3 active+clean+scrubbing+deep
io:
client: 152 MiB/s rd, 72 MiB/s wr, 854 op/s rd, 796 op/s wr
Is there any way to get admin commands to the mons with higher priority?
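One thing I might try in the meantime (not sure it helps; the address below
is a placeholder) is to talk to a single mon directly instead of the whole
quorum, and to use the local admin socket on a mon host, which usually still
answers:
ceph -m 192.168.1.11:6789 status
# on the mon host itself:
ceph daemon mon.ceph-02 mon_status
ceph daemon mon.ceph-02 ops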
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
I'm surprised I couldn't find this explained anywhere (I did look), but ...
What is the pgmap and why does it get updated every few seconds on a tiny
cluster that's mostly idle?
I do know what a placement group (PG) is and that when documentation talks
about placement group maps, it is talking about something else -- mapping of
PGs to OSDs by CRUSH and OSD maps.
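For anyone wanting to look at the same thing, the pgmap itself (and its
version, which is what keeps incrementing) can be dumped with, e.g.:
ceph pg stat
ceph pg dump --format json-pretty | head -40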
--
Bryan Henderson San Jose, California