Long running cluster, currently running 14.2.6
I have a certain user whose buckets have become corrupted in that the
following commands:
radosgw-admin bucket check --bucket <bucket>
radosgw-admin bucket list --bucket=<bucket>
return with the following:
ERROR: could not init bucket: (2) No such file or directory
2020-08-04 13:47:03.417 7f94dfea86c0 -1 ERROR: get_bucket_instance_from_oid
failed: -2
radosgw-admin metadata get bucket:<bucket>
is successful.
radosgw-admin metadata get bucket.instance:<bucket>:<bucket_id>
yields: ERROR: can't get key: (2) No such file or directory
radosgw-admin metadata list bucket.instance | grep -i <bucket>
yields no results.
When I drop to rados and look in the index pool I can see 128 objects
matching the bucket_id as derived from the "metadata get" and this seems to
match other functioning buckets.
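For reference, this is roughly how I derived the bucket_id and counted the index shard objects (the JSON below is a mocked sample and the pool name is a placeholder, not my actual values):

```shell
# Sketch: extract bucket_id from "radosgw-admin metadata get bucket:<bucket>"
# output, then match index objects. The JSON is a mocked sample and
# "default.rgw.buckets.index" is a placeholder index pool name.
meta='{"key":"bucket:mybucket","data":{"bucket":{"name":"mybucket","bucket_id":"ffe0a144.4163.1"}}}'
bucket_id=$(echo "$meta" | python3 -c 'import json,sys; print(json.load(sys.stdin)["data"]["bucket"]["bucket_id"])')
echo "$bucket_id"
# Index shard objects are named .dir.<bucket_id>.<shard>; counting them:
#   rados -p default.rgw.buckets.index ls | grep -c ".dir.${bucket_id}"
```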
Unfortunately this issue was silent and happened many months ago, unnoticed.
We have not retained many of the ceph logs from this time. We do have the
civetweb access logs and have found that error codes began on the same day
that we lowered the pg_num on many of the rgw pools (all of them but the
index_pool and the data_pool). OSDs were filestore at that time and have
since been converted to bluestore. Other than the dates lining up we have
no direct evidence these are related, and did not encounter any
inconsistent PGs. We also used this process on other clusters with no ill
effects.
Ideally I would like to repair and restore the functionality of these
buckets given that it appears the objects in the index pool still exist. Is
there any way to repair these? Do these errors correlate to any known
issues? Thanks in advance for any leads.
Respectfully,
*Wes Dillingham*
wes(a)wesdillingham.com
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
Hello,
I am looking into connecting my rados gateway to LDAP and found the
following documentation.
https://docs.ceph.com/docs/master/radosgw/ldap-auth/
I would like to allow an LDAP group to have access to create and manage
buckets.
The questions I still have are the following:
- Do the LDAP users need to log in to some sort of portal before their
corresponding ceph user is created? If so, where do they go to do so? Or
does the creation of ceph users and keys happen automatically?
- How can you access an LDAP user's key and secret after they are
integrated?
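From my reading of that page, the client side would look something like this (a sketch; the uid/password are placeholders, and I reproduce what radosgw-token does with printf+base64 so it runs standalone):

```shell
# Sketch of the documented LDAP token flow (my reading of the ldap-auth docs;
# "myldapuser"/"secret" are placeholders). radosgw-token base64-encodes a
# small JSON document with the LDAP credentials:
export RGW_ACCESS_KEY_ID="myldapuser"
export RGW_SECRET_ACCESS_KEY="secret"
# Equivalent of: radosgw-token --encode --ttype=ldap
token=$(printf '{"RGW_TOKEN":{"version":1,"type":"ldap","id":"%s","key":"%s"}}' \
    "$RGW_ACCESS_KEY_ID" "$RGW_SECRET_ACCESS_KEY" | base64 | tr -d '\n')
echo "$token"
# The S3 client then presents this token as its access key; RGW validates
# the credentials against LDAP on each request.
```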
Thanks in advance for any information you can provide.
Regards,
Jared
Hi,
I've been tasked with moving Jewel clusters to Nautilus. After the final
upgrade, Ceph health warns about legacy tunables. On clusters running SSDs
I enabled the optimal profile, which took weeks to chug through remappings.
My remaining clusters run HDDs. Does anyone have experience with using the
legacy flag? I'd like to clear up the health warning without outright
silencing it, but I also do not want to kick off any remapping.
Does anyone have experience with pushing this change down the road?
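For context, checking the current profile looks like this (the JSON here is a mocked sample of `ceph osd crush show-tunables` output so the sketch is self-contained; the mute option at the end is my understanding, please correct me if wrong):

```shell
# Mocked sample of `ceph osd crush show-tunables` output for a standalone
# sketch; on a real cluster, run the command itself.
tunables='{"choose_local_tries":0,"choose_local_fallback_tries":0,"choose_total_tries":50,"chooseleaf_descend_once":1,"chooseleaf_vary_r":1,"profile":"firefly","optimal_tunables":0,"legacy_tunables":0}'
profile=$(echo "$tunables" | python3 -c 'import json,sys; print(json.load(sys.stdin)["profile"])')
echo "$profile"
# Switching profiles moves data:
#   ceph osd crush tunables optimal    # kicks off large-scale remapping
# My understanding is the warning alone can be muted without data movement:
#   ceph config set mon mon_warn_on_legacy_crush_tunables false
```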
Mike
Hi Eric,
thanks for the clarification, I did misunderstand you.
> You should not have to move OSDs in and out of the CRUSH tree however
> in order to solve any data placement problems (This is the baffling part).
Exactly. Should I create a tracker issue? I think this is not hard to reproduce with a standard crush map where host bucket = physical host, and I would, in fact, expect this scenario to be part of the integration tests.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 04 August 2020 13:58:47
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
All seems in order in terms of your CRUSH layout. You can speed up the rebalancing / scale-out operations by increasing the osd_max_backfills on each OSD (Especially during off hours). The unnecessary degradation is not expected behavior with a cluster in HEALTH_OK status, but with backfill / rebalancing ongoing it's not unexpected. You should not have to move OSDs in and out of the CRUSH tree however in order to solve any data placement problems (This is the baffling part).
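Concretely, something like the following (a sketch; 4 is just an example value, raise it gradually and watch client latency):

```shell
# Raise backfill concurrency during off hours (example value only).
ceph config set osd osd_max_backfills 4              # persistent, Nautilus+
ceph tell 'osd.*' injectargs '--osd_max_backfills 4' # apply to running OSDs
# And back to the default when the window closes:
ceph config set osd osd_max_backfills 1
```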
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Tuesday, August 4, 2020 7:45 AM
To: Eric Smith <Eric.Smith(a)vecima.com>; ceph-users <ceph-users(a)ceph.io>
Subject: Re: Ceph does not recover from OSD restart
Hi Eric,
I added the disks and started the rebalancing. When I ran into the issue, ca. 3 days after the start of rebalancing, it was about 25% done. The cluster does not go to HEALTH_OK before the rebalancing is finished; it shows the "xxx objects misplaced" warning. The OSD crush locations for the logical hosts are in ceph.conf, and the OSDs come up in the proper crush bucket.
> All seems in order then
In what sense?
The rebalancing is still ongoing and usually a very long operation. This time I added only 9 disks, but we will almost triple the number of disks of a larger pool soon, which has 150 OSDs at the moment. I expect the rebalancing for this expansion to take months. Due to a memory leak, I need to restart OSDs regularly. Also, a host may restart or we might have a power outage during this window. In these situations, it will be a real pain if I have to play the crush move game with 300+ OSDs.
This unnecessary redundancy degradation on OSD restart cannot possibly be expected behaviour, or do I misunderstand something here?
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 04 August 2020 13:19:41
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
All seems in order then - when you ran into your maintenance issue, how long was it after you added the new OSDs, and did Ceph ever get to HEALTH_OK so it could trim PG history? Also, did the OSDs just start back up in the wrong place in the CRUSH tree?
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Tuesday, August 4, 2020 7:10 AM
To: Eric Smith <Eric.Smith(a)vecima.com>; ceph-users <ceph-users(a)ceph.io>
Subject: Re: Ceph does not recover from OSD restart
Hi Eric,
> Have you adjusted the min_size for pool sr-rbd-data-one-hdd
Yes. For all EC pools located in datacenter ServerRoom, we currently set min_size=k=6, because we lack physical servers. Hosts ceph-21 and ceph-22 are logical but not physical; disks in these buckets are co-located such that no more than 2 host buckets share the same physical host. With failure domain = host, we can ensure that no more than 2 EC shards are on the same physical host. With m=2 and min_size=k we have continued service with any 1 physical host down for maintenance, and recovery will also happen if a physical host fails. Some objects will have no redundancy for a while then. We will increase min_size to k+1 as soon as we have 2 additional hosts, and simply move the OSDs from buckets ceph-21/22 to these without rebalancing.
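Spelled out, the arithmetic behind this choice (just restating the reasoning above):

```python
# Availability arithmetic for the layout described above:
# EC k=6, m=2, failure domain = host, and at most 2 logical host buckets
# (hence at most 2 EC shards) per physical host.
k, m = 6, 2
shards = k + m                        # 8 shards per PG
min_size = k                          # current interim setting
max_shards_per_physical_host = 2

# One physical host down for maintenance:
remaining = shards - max_shards_per_physical_host
print(remaining)                      # 6 shards left, zero redundancy
assert remaining >= min_size          # I/O continues with min_size = k

# With min_size = k + 1 (the target once 2 more hosts arrive), the same
# failure would pause I/O, which is why min_size = k is the interim choice:
assert remaining < k + 1
```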
The distribution of disks and buckets is listed below as well (longer listing).
Thanks and best regards,
Frank
# ceph osd erasure-code-profile ls
con-ec-8-2-hdd
con-ec-8-2-ssd
default
sr-ec-6-2-hdd
This is the relevant one:
# ceph osd erasure-code-profile get sr-ec-6-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ServerRoom
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8
Note that the pool sr-rbd-data-one (id 2) was created with this profile and later moved to SSD. Therefore, the crush rule does not match the profile's device class any more.
These two are under different roots:
# ceph osd erasure-code-profile get con-ec-8-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8
# ceph osd erasure-code-profile get con-ec-8-2-ssd
crush-device-class=ssd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8
Full physical placement information for OSDs under tree "datacenter ServerRoom":
----------------
ceph-04
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 243 ceph-04 1.8T SSD
osd-phy1 247 ceph-21 1.8T SSD
osd-phy2 254 ceph-04 1.8T SSD
osd-phy3 256 ceph-04 1.8T SSD
osd-phy4 286 ceph-04 1.8T SSD
osd-phy5 287 ceph-04 1.8T SSD
osd-phy6 288 ceph-04 10.7T HDD
osd-phy7 48 ceph-04 372.6G SSD
osd-phy8 264 ceph-21 1.8T SSD
osd-phy9 84 ceph-04 8.9T HDD
osd-phy10 72 ceph-21 8.9T HDD
osd-phy11 145 ceph-04 8.9T HDD
osd-phy14 156 ceph-04 8.9T HDD
osd-phy15 168 ceph-04 8.9T HDD
osd-phy16 181 ceph-04 8.9T HDD
osd-phy17 0 ceph-21 8.9T HDD
----------------
ceph-05
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 240 ceph-05 1.8T SSD
osd-phy1 249 ceph-22 1.8T SSD
osd-phy2 251 ceph-05 1.8T SSD
osd-phy3 255 ceph-05 1.8T SSD
osd-phy4 284 ceph-05 1.8T SSD
osd-phy5 285 ceph-05 1.8T SSD
osd-phy6 289 ceph-05 10.7T HDD
osd-phy7 49 ceph-05 372.6G SSD
osd-phy8 265 ceph-22 1.8T SSD
osd-phy9 74 ceph-05 8.9T HDD
osd-phy10 85 ceph-22 8.9T HDD
osd-phy11 144 ceph-05 8.9T HDD
osd-phy14 157 ceph-05 8.9T HDD
osd-phy15 169 ceph-05 8.9T HDD
osd-phy16 180 ceph-05 8.9T HDD
osd-phy17 1 ceph-22 8.9T HDD
----------------
ceph-06
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 244 ceph-06 1.8T SSD
osd-phy1 246 ceph-21 1.8T SSD
osd-phy2 253 ceph-06 1.8T SSD
osd-phy3 257 ceph-06 1.8T SSD
osd-phy4 282 ceph-06 1.8T SSD
osd-phy5 283 ceph-06 1.8T SSD
osd-phy6 40 ceph-06 372.6G SSD
osd-phy7 50 ceph-06 372.6G SSD
osd-phy8 60 ceph-06 8.9T HDD
osd-phy9 290 ceph-06 10.7T HDD
osd-phy10 291 ceph-21 10.7T HDD
osd-phy11 146 ceph-06 8.9T HDD
osd-phy14 158 ceph-06 8.9T HDD
osd-phy15 170 ceph-06 8.9T HDD
osd-phy16 182 ceph-06 8.9T HDD
osd-phy17 2 ceph-21 8.9T HDD
----------------
ceph-07
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 242 ceph-07 1.8T SSD
osd-phy1 250 ceph-22 1.8T SSD
osd-phy2 252 ceph-07 1.8T SSD
osd-phy3 258 ceph-07 1.8T SSD
osd-phy4 279 ceph-07 1.8T SSD
osd-phy5 280 ceph-07 1.8T SSD
osd-phy6 292 ceph-07 10.7T HDD
osd-phy7 52 ceph-07 372.6G SSD
osd-phy8 63 ceph-07 8.9T HDD
osd-phy9 281 ceph-22 1.8T SSD
osd-phy10 87 ceph-22 8.9T HDD
osd-phy11 148 ceph-07 8.9T HDD
osd-phy14 159 ceph-07 8.9T HDD
osd-phy15 172 ceph-07 8.9T HDD
osd-phy16 183 ceph-07 8.9T HDD
osd-phy17 3 ceph-22 8.9T HDD
----------------
ceph-18
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 241 ceph-18 1.8T SSD
osd-phy1 248 ceph-18 1.8T SSD
osd-phy2 41 ceph-18 372.6G SSD
osd-phy3 31 ceph-18 372.6G SSD
osd-phy4 277 ceph-18 1.8T SSD
osd-phy5 278 ceph-21 1.8T SSD
osd-phy6 53 ceph-21 372.6G SSD
osd-phy7 267 ceph-18 1.8T SSD
osd-phy8 266 ceph-18 1.8T SSD
osd-phy9 293 ceph-18 10.7T HDD
osd-phy10 86 ceph-21 8.9T HDD
osd-phy11 259 ceph-18 10.9T HDD
osd-phy14 229 ceph-18 8.9T HDD
osd-phy15 232 ceph-18 8.9T HDD
osd-phy16 235 ceph-18 8.9T HDD
osd-phy17 238 ceph-18 8.9T HDD
----------------
ceph-19
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 261 ceph-19 1.8T SSD
osd-phy1 262 ceph-19 1.8T SSD
osd-phy2 295 ceph-19 10.7T HDD
osd-phy3 43 ceph-19 372.6G SSD
osd-phy4 275 ceph-19 1.8T SSD
osd-phy5 294 ceph-22 10.7T HDD
osd-phy6 51 ceph-22 372.6G SSD
osd-phy7 269 ceph-19 1.8T SSD
osd-phy8 268 ceph-19 1.8T SSD
osd-phy9 276 ceph-22 1.8T SSD
osd-phy10 73 ceph-22 8.9T HDD
osd-phy11 263 ceph-19 10.9T HDD
osd-phy14 231 ceph-19 8.9T HDD
osd-phy15 233 ceph-19 8.9T HDD
osd-phy16 236 ceph-19 8.9T HDD
osd-phy17 239 ceph-19 8.9T HDD
----------------
ceph-20
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 245 ceph-20 1.8T SSD
osd-phy1 28 ceph-20 372.6G SSD
osd-phy2 44 ceph-20 372.6G SSD
osd-phy3 271 ceph-20 1.8T SSD
osd-phy4 272 ceph-20 1.8T SSD
osd-phy5 273 ceph-20 1.8T SSD
osd-phy6 274 ceph-21 1.8T SSD
osd-phy7 296 ceph-20 10.7T HDD
osd-phy8 76 ceph-21 8.9T HDD
osd-phy9 39 ceph-21 372.6G SSD
osd-phy10 270 ceph-20 1.8T SSD
osd-phy11 260 ceph-20 10.9T HDD
osd-phy14 228 ceph-20 8.9T HDD
osd-phy15 230 ceph-20 8.9T HDD
osd-phy16 234 ceph-20 8.9T HDD
osd-phy17 237 ceph-20 8.9T HDD
CONT is the container name and encodes the physical slot on the host where the OSD is located.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 04 August 2020 12:47:12
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
Have you adjusted the min_size for pool sr-rbd-data-one-hdd at all? Also can you send the output of "ceph osd erasure-code-profile ls" and for each EC profile, "ceph osd erasure-code-profile get <profile>"?
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, August 3, 2020 11:05 AM
To: Eric Smith <Eric.Smith(a)vecima.com>; ceph-users <ceph-users(a)ceph.io>
Subject: Re: Ceph does not recover from OSD restart
Sorry for the many small e-mails: the OSD IDs you asked about are 288-296, as used in the commands. One new OSD per host.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 03 August 2020 16:59:04
To: Eric Smith; ceph-users
Subject: [ceph-users] Re: Ceph does not recover from OSD restart
Hi Eric,
the procedure for re-discovering all objects is:
# Flag: norebalance
ceph osd crush move osd.288 host=bb-04
ceph osd crush move osd.289 host=bb-05
ceph osd crush move osd.290 host=bb-06
ceph osd crush move osd.291 host=bb-21
ceph osd crush move osd.292 host=bb-07
ceph osd crush move osd.293 host=bb-18
ceph osd crush move osd.295 host=bb-19
ceph osd crush move osd.294 host=bb-22
ceph osd crush move osd.296 host=bb-20
# Wait until all PGs are peered and recovery is done. In my case, there was
# only little I/O, no more than 50-100 objects had writes missing and
# recovery was a few seconds.
#
# The bb-hosts are under a separate crush root that I use solely as parking
# space and for draining OSDs.
ceph osd crush move osd.288 host=ceph-04
ceph osd crush move osd.289 host=ceph-05
ceph osd crush move osd.290 host=ceph-06
ceph osd crush move osd.291 host=ceph-21
ceph osd crush move osd.292 host=ceph-07
ceph osd crush move osd.293 host=ceph-18
ceph osd crush move osd.295 host=ceph-19
ceph osd crush move osd.294 host=ceph-22
ceph osd crush move osd.296 host=ceph-20
After peering, no degraded PGs/objects any more, just the misplaced ones as expected.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 03 August 2020 16:45:28
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
You said you had to move some OSDs out and back in for Ceph to go back to normal (The OSDs you added). Which OSDs were added?
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, August 3, 2020 9:55 AM
To: Eric Smith <Eric.Smith(a)vecima.com>; ceph-users <ceph-users(a)ceph.io>
Subject: Re: Ceph does not recover from OSD restart
Hi Eric,
thanks for your fast response. Below is the output, shortened a bit as indicated. Disks have been added to pool 11 'sr-rbd-data-one-hdd' only; this is the only pool with remapped PGs and also the only pool losing track of objects. Every other pool recovers from a restart by itself.
Best regards,
Frank
# ceph osd pool stats
pool sr-rbd-meta-one id 1
client io 5.3 KiB/s rd, 3.2 KiB/s wr, 4 op/s rd, 1 op/s wr
pool sr-rbd-data-one id 2
client io 24 MiB/s rd, 32 MiB/s wr, 380 op/s rd, 594 op/s wr
pool sr-rbd-one-stretch id 3
nothing is going on
pool con-rbd-meta-hpc-one id 7
nothing is going on
pool con-rbd-data-hpc-one id 8
client io 0 B/s rd, 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr
pool sr-rbd-data-one-hdd id 11
53241814/346903376 objects misplaced (15.348%)
client io 73 MiB/s rd, 3.4 MiB/s wr, 236 op/s rd, 69 op/s wr
pool con-fs2-meta1 id 12
client io 106 KiB/s rd, 112 KiB/s wr, 3 op/s rd, 11 op/s wr
pool con-fs2-meta2 id 13
client io 0 B/s wr, 0 op/s rd, 0 op/s wr
pool con-fs2-data id 14
client io 5.5 MiB/s rd, 201 KiB/s wr, 34 op/s rd, 8 op/s wr
pool con-fs2-data-ec-ssd id 17
nothing is going on
pool ms-rbd-one id 18
client io 5.6 MiB/s wr, 0 op/s rd, 179 op/s wr
# ceph osd pool ls detail
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 80 pgp_num 80 last_change 122597 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 application rbd
removed_snaps [1~45]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186437 lfor 0/126858 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 43980465111040 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~3,5~2, ... huge list ... ,11f9d~1,11fa0~2]
pool 3 'sr-rbd-one-stretch' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 160 pgp_num 160 last_change 143202 lfor 0/79983 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd
removed_snaps [1~7,b~2,11~2,14~2,17~9e,b8~1e]
pool 7 'con-rbd-meta-hpc-one' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96357 lfor 0/90462 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 10737418240 stripe_width 0 application rbd
removed_snaps [1~3]
pool 8 'con-rbd-data-hpc-one' erasure size 10 min_size 9 crush_rule 7 object_hash rjenkins pg_num 150 pgp_num 150 last_change 96358 lfor 0/90996 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 32768 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~7,9~2]
pool 11 'sr-rbd-data-one-hdd' erasure size 8 min_size 6 crush_rule 9 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186331 lfor 0/127768 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 219902325555200 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~59f,5a2~fe, ... less huge list ... ,2559~1,255b~1]
removed_snaps_queue [1a64~5,1a6a~1,1a6c~1, ... long list ... ,220a~1,220c~1]
pool 12 'con-fs2-meta1' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 57096 flags hashpspool,nodelete max_bytes 268435456000 stripe_width 0 application cephfs
pool 13 'con-fs2-meta2' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96359 flags hashpspool,nodelete max_bytes 107374182400 stripe_width 0 application cephfs
pool 14 'con-fs2-data' erasure size 10 min_size 9 crush_rule 8 object_hash rjenkins pg_num 1350 pgp_num 1350 last_change 96360 lfor 0/91144 flags hashpspool,ec_overwrites,nodelete max_bytes 879609302220800 stripe_width 32768 fast_read 1 compression_mode aggressive application cephfs
pool 17 'con-fs2-data-ec-ssd' erasure size 10 min_size 9 crush_rule 10 object_hash rjenkins pg_num 55 pgp_num 55 last_change 96361 lfor 0/90473 flags hashpspool,ec_overwrites,nodelete max_bytes 1099511627776 stripe_width 32768 fast_read 1 compression_mode aggressive application cephfs
pool 18 'ms-rbd-one' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 150 pgp_num 150 last_change 143206 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd
removed_snaps [1~3]
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-40 2384.09058 root DTU
-42 0 region Lyngby
-41 2384.09058 region Risoe
2 sub-trees on level datacenter removed for brevity
-49 586.49347 datacenter ServerRoom
-55 586.49347 room SR-113
-65 64.33617 host ceph-04
84 hdd 8.90999 osd.84 up 1.00000 1.00000
145 hdd 8.90999 osd.145 up 1.00000 1.00000
156 hdd 8.90999 osd.156 up 1.00000 1.00000
168 hdd 8.90999 osd.168 up 1.00000 1.00000
181 hdd 8.90999 osd.181 up 0.95000 1.00000
288 hdd 10.69229 osd.288 up 1.00000 1.00000
243 rbd_data 1.74599 osd.243 up 1.00000 1.00000
254 rbd_data 1.74599 osd.254 up 1.00000 1.00000
256 rbd_data 1.74599 osd.256 up 1.00000 1.00000
286 rbd_data 1.74599 osd.286 up 1.00000 1.00000
287 rbd_data 1.74599 osd.287 up 1.00000 1.00000
48 rbd_meta 0.36400 osd.48 up 1.00000 1.00000
-67 64.33617 host ceph-05
74 hdd 8.90999 osd.74 up 1.00000 1.00000
144 hdd 8.90999 osd.144 up 1.00000 1.00000
157 hdd 8.90999 osd.157 up 0.84999 1.00000
169 hdd 8.90999 osd.169 up 0.95000 1.00000
180 hdd 8.90999 osd.180 up 0.89999 1.00000
289 hdd 10.69229 osd.289 up 1.00000 1.00000
240 rbd_data 1.74599 osd.240 up 1.00000 1.00000
251 rbd_data 1.74599 osd.251 up 1.00000 1.00000
255 rbd_data 1.74599 osd.255 up 1.00000 1.00000
284 rbd_data 1.74599 osd.284 up 1.00000 1.00000
285 rbd_data 1.74599 osd.285 up 1.00000 1.00000
49 rbd_meta 0.36400 osd.49 up 1.00000 1.00000
-69 64.70016 host ceph-06
60 hdd 8.90999 osd.60 up 1.00000 1.00000
146 hdd 8.90999 osd.146 up 1.00000 1.00000
158 hdd 8.90999 osd.158 up 0.95000 1.00000
170 hdd 8.90999 osd.170 up 0.89999 1.00000
182 hdd 8.90999 osd.182 up 1.00000 1.00000
290 hdd 10.69229 osd.290 up 1.00000 1.00000
244 rbd_data 1.74599 osd.244 up 1.00000 1.00000
253 rbd_data 1.74599 osd.253 up 1.00000 1.00000
257 rbd_data 1.74599 osd.257 up 1.00000 1.00000
282 rbd_data 1.74599 osd.282 up 1.00000 1.00000
283 rbd_data 1.74599 osd.283 up 1.00000 1.00000
40 rbd_meta 0.36400 osd.40 up 1.00000 1.00000
50 rbd_meta 0.36400 osd.50 up 1.00000 1.00000
-71 64.33617 host ceph-07
63 hdd 8.90999 osd.63 up 1.00000 1.00000
148 hdd 8.90999 osd.148 up 0.95000 1.00000
159 hdd 8.90999 osd.159 up 1.00000 1.00000
172 hdd 8.90999 osd.172 up 0.95000 1.00000
183 hdd 8.90999 osd.183 up 0.84999 1.00000
292 hdd 10.69229 osd.292 up 1.00000 1.00000
242 rbd_data 1.74599 osd.242 up 1.00000 1.00000
252 rbd_data 1.74599 osd.252 up 1.00000 1.00000
258 rbd_data 1.74599 osd.258 up 1.00000 1.00000
279 rbd_data 1.74599 osd.279 up 1.00000 1.00000
280 rbd_data 1.74599 osd.280 up 1.00000 1.00000
52 rbd_meta 0.36400 osd.52 up 1.00000 1.00000
-81 66.70416 host ceph-18
229 hdd 8.90999 osd.229 up 1.00000 1.00000
232 hdd 8.90999 osd.232 up 1.00000 1.00000
235 hdd 8.90999 osd.235 up 1.00000 1.00000
238 hdd 8.90999 osd.238 up 0.95000 1.00000
259 hdd 10.91399 osd.259 up 1.00000 1.00000
293 hdd 10.69229 osd.293 up 1.00000 1.00000
241 rbd_data 1.74599 osd.241 up 1.00000 1.00000
248 rbd_data 1.74599 osd.248 up 1.00000 1.00000
266 rbd_data 1.74599 osd.266 up 1.00000 1.00000
267 rbd_data 1.74599 osd.267 up 1.00000 1.00000
277 rbd_data 1.74599 osd.277 up 1.00000 1.00000
31 rbd_meta 0.36400 osd.31 up 1.00000 1.00000
41 rbd_meta 0.36400 osd.41 up 1.00000 1.00000
-94 66.34016 host ceph-19
231 hdd 8.90999 osd.231 up 1.00000 1.00000
233 hdd 8.90999 osd.233 up 0.95000 1.00000
236 hdd 8.90999 osd.236 up 1.00000 1.00000
239 hdd 8.90999 osd.239 up 1.00000 1.00000
263 hdd 10.91399 osd.263 up 1.00000 1.00000
295 hdd 10.69229 osd.295 up 1.00000 1.00000
261 rbd_data 1.74599 osd.261 up 1.00000 1.00000
262 rbd_data 1.74599 osd.262 up 1.00000 1.00000
268 rbd_data 1.74599 osd.268 up 1.00000 1.00000
269 rbd_data 1.74599 osd.269 up 1.00000 1.00000
275 rbd_data 1.74599 osd.275 up 1.00000 1.00000
43 rbd_meta 0.36400 osd.43 up 1.00000 1.00000
-4 66.70416 host ceph-20
228 hdd 8.90999 osd.228 up 1.00000 1.00000
230 hdd 8.90999 osd.230 up 1.00000 1.00000
234 hdd 8.90999 osd.234 up 0.95000 1.00000
237 hdd 8.90999 osd.237 up 1.00000 1.00000
260 hdd 10.91399 osd.260 up 1.00000 1.00000
296 hdd 10.69229 osd.296 up 1.00000 1.00000
245 rbd_data 1.74599 osd.245 up 1.00000 1.00000
270 rbd_data 1.74599 osd.270 up 1.00000 1.00000
271 rbd_data 1.74599 osd.271 up 1.00000 1.00000
272 rbd_data 1.74599 osd.272 up 1.00000 1.00000
273 rbd_data 1.74599 osd.273 up 1.00000 1.00000
28 rbd_meta 0.36400 osd.28 up 1.00000 1.00000
44 rbd_meta 0.36400 osd.44 up 1.00000 1.00000
-64 64.70016 host ceph-21
0 hdd 8.90999 osd.0 up 1.00000 1.00000
2 hdd 8.90999 osd.2 up 0.95000 1.00000
72 hdd 8.90999 osd.72 up 1.00000 1.00000
76 hdd 8.90999 osd.76 up 1.00000 1.00000
86 hdd 8.90999 osd.86 up 1.00000 1.00000
291 hdd 10.69229 osd.291 up 1.00000 1.00000
246 rbd_data 1.74599 osd.246 up 1.00000 1.00000
247 rbd_data 1.74599 osd.247 up 1.00000 1.00000
264 rbd_data 1.74599 osd.264 up 1.00000 1.00000
274 rbd_data 1.74599 osd.274 up 1.00000 1.00000
278 rbd_data 1.74599 osd.278 up 1.00000 1.00000
39 rbd_meta 0.36400 osd.39 up 1.00000 1.00000
53 rbd_meta 0.36400 osd.53 up 1.00000 1.00000
-66 64.33617 host ceph-22
1 hdd 8.90999 osd.1 up 1.00000 1.00000
3 hdd 8.90999 osd.3 up 1.00000 1.00000
73 hdd 8.90999 osd.73 up 1.00000 1.00000
85 hdd 8.90999 osd.85 up 0.95000 1.00000
87 hdd 8.90999 osd.87 up 1.00000 1.00000
294 hdd 10.69229 osd.294 up 1.00000 1.00000
249 rbd_data 1.74599 osd.249 up 1.00000 1.00000
250 rbd_data 1.74599 osd.250 up 1.00000 1.00000
265 rbd_data 1.74599 osd.265 up 1.00000 1.00000
276 rbd_data 1.74599 osd.276 up 1.00000 1.00000
281 rbd_data 1.74599 osd.281 up 1.00000 1.00000
51 rbd_meta 0.36400 osd.51 up 1.00000 1.00000
# ceph osd crush rule dump
# crush rules outside tree under "datacenter ServerRoom" removed for brevity
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 5,
"rule_name": "sr-rbd-data-one",
"ruleset": 5,
"type": 3,
"min_size": 3,
"max_size": 8,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 50
},
{
"op": "set_choose_tries",
"num": 1000
},
{
"op": "take",
"item": -185,
"item_name": "ServerRoom~rbd_data"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 9,
"rule_name": "sr-rbd-data-one-hdd",
"ruleset": 9,
"type": 3,
"min_size": 3,
"max_size": 8,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -53,
"item_name": "ServerRoom~hdd"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 03 August 2020 15:40
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
Can you post the output of these commands:
ceph osd pool ls detail
ceph osd tree
ceph osd crush rule dump
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, August 3, 2020 9:19 AM
To: ceph-users <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: Ceph does not recover from OSD restart
After moving the newly added OSDs out of the crush tree and back in again, I get to exactly what I want to see:
cluster:
id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
health: HEALTH_WARN
norebalance,norecover flag(s) set
53030026/1492404361 objects misplaced (3.553%)
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs
flags norebalance,norecover
data:
pools: 11 pools, 3215 pgs
objects: 177.3 M objects, 489 TiB
usage: 696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 53030026/1492404361 objects misplaced (3.553%)
2902 active+clean
299 active+remapped+backfill_wait
8 active+remapped+backfilling
5 active+clean+scrubbing+deep
1 active+clean+snaptrim
io:
client: 69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr
Why does a cluster with remapped PGs not survive OSD restarts without losing track of objects?
Why is it not finding the objects by itself?
A power outage of 3 hosts will halt everything for no reason until manual intervention. How can I avoid this problem?
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 03 August 2020 15:03:05
To: ceph-users
Subject: [ceph-users] Ceph does not recover from OSD restart
Dear cephers,
I have a serious issue with degraded objects after an OSD restart. The cluster was in a state of re-balancing after adding disks to each host. Before restart I had "X/Y objects misplaced". Apart from that, health was OK. I now restarted all OSDs of one host and the cluster does not recover from that:
cluster:
id: xxx
health: HEALTH_ERR
45813194/1492348700 objects misplaced (3.070%)
Degraded data redundancy: 6798138/1492348700 objects degraded (0.456%), 85 pgs degraded, 86 pgs undersized
Degraded data redundancy (low space): 17 pgs backfill_toofull
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs
data:
pools: 11 pools, 3215 pgs
objects: 177.3 M objects, 489 TiB
usage: 696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 6798138/1492348700 objects degraded (0.456%)
45813194/1492348700 objects misplaced (3.070%)
2903 active+clean
209 active+remapped+backfill_wait
73 active+undersized+degraded+remapped+backfill_wait
9 active+remapped+backfill_wait+backfill_toofull
8 active+undersized+degraded+remapped+backfill_wait+backfill_toofull
4 active+undersized+degraded+remapped+backfilling
3 active+remapped+backfilling
3 active+clean+scrubbing+deep
1 active+clean+scrubbing
1 active+undersized+remapped+backfilling
1 active+clean+snaptrim
io:
client: 47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
recovery: 195 MiB/s, 48 objects/s
After restarting there should only be a small number of degraded objects, the ones that received writes during OSD restart. What I see, however, is that the cluster seems to have lost track of a huge amount of objects, the 0.456% degraded are 1-2 days worth of I/O. I did reboots before and saw only a few thousand objects degraded at most. The output of ceph health detail shows a lot of lines like these:
[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized
pg 11.9 is stuck undersized for 815.188981, current state active+undersized+degraded+remapped+backfill_wait, last acting [60,148,2147483647,263,76,230,87,169]
[...]
pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting [159,60,180,263,237,3,2147483647,72]
pg 11.4a is stuck undersized for 851.162862, current state active+undersized+degraded+remapped+backfill_wait, last acting [182,233,87,228,2,180,63,2147483647]
[...]
pg 11.22e is stuck undersized for 851.162402, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Degraded data redundancy (low space): 17 pgs backfill_toofull
pg 11.24 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [230,259,2147483647,1,144,159,233,146]
[...]
pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting [84,259,183,170,85,234,233,2]
pg 11.225 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [236,183,1,2147483647,2147483647,169,229,230]
pg 11.22e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [234,183,239,2147483647,170,229,1,86]
POOL_NEAR_FULL 1 pools nearfull
pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)
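As a sanity check on the degraded fraction quoted in that output (plain arithmetic):

```python
# Verify the degraded percentage from the health output above.
degraded, total = 6792562, 1492356704
pct = 100 * degraded / total
print(round(pct, 3))   # 0.455, matching "objects degraded (0.455%)"
```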
It looks like a lot of PGs are not receiving their complete crush map placement, as if the peering is incomplete. This is a serious issue; it looks like the cluster will see a total storage loss if just 2 more hosts reboot - without actually having lost any storage. The pool in question is a 6+2 EC pool.
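For anyone reading the acting sets above: my understanding is that 2147483647 is CRUSH's "no OSD" marker (2^31 - 1), i.e. an unmapped shard slot:

```python
# 2147483647 in an acting set is CRUSH_ITEM_NONE (2**31 - 1): CRUSH mapped
# no OSD to that shard slot, so the PG is undersized/degraded.
CRUSH_ITEM_NONE = 2**31 - 1
assert CRUSH_ITEM_NONE == 2147483647

# acting set of pg 11.22e from the health detail above:
acting = [234, 183, 239, CRUSH_ITEM_NONE, 170, 229, 1, 86]
missing = sum(1 for osd in acting if osd == CRUSH_ITEM_NONE)
print(missing)   # 1 of the 8 EC shards has no home
```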
What is going on here? Why are the PG-maps not restored to their values from before the OSD reboot? The degraded PGs should receive the missing OSD IDs, everything is up exactly as it was before the reboot.
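For reference, the 2147483647 entries in the acting sets above are CRUSH_ITEM_NONE (2^31 - 1), the placeholder Ceph prints when no OSD is currently mapped to an EC shard slot. A small sketch, using the acting set of pg 11.225 copied from the health output above:

```python
# The value 2147483647 (0x7fffffff) in an acting set is CRUSH_ITEM_NONE:
# CRUSH found no OSD for that EC shard slot. Counting these placeholders
# shows how many shards of a PG are currently unmapped.
CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2**31 - 1

def missing_shards(acting):
    """Return the shard positions that have no OSD assigned."""
    return [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]

# Acting set of pg 11.225 from the health output above (6+2 EC, 8 slots):
acting = [236, 183, 1, 2147483647, 2147483647, 169, 229, 230]
print(missing_shards(acting))  # shards 3 and 4 are unmapped
```

With two of eight shards unmapped, a 6+2 PG is left with exactly k shards, which matches the undersized+degraded states reported above.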
Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi Erik,
I added the disks and started the rebalancing. When I ran into the issue, ca. 3 days after the start of rebalancing, it was about 25% done. The cluster does not go to HEALTH_OK before the rebalancing is finished; it shows the "xxx objects misplaced" warning. The OSD crush locations for the logical hosts are set in ceph.conf, and the OSDs come up in the proper crush buckets.
> All seems in order then
In what sense?
The rebalancing is still ongoing and usually a very long operation. This time I added only 9 disks, but we will almost triple the number of disks of a larger pool soon, which has 150 OSDs at the moment. I expect the rebalancing for this expansion to take months. Due to a memory leak, I need to restart OSDs regularly. Also, a host may restart or we might have a power outage during this window. In these situations, it will be a real pain if I have to play the crush move game with 300+ OSDs.
This unnecessary redundancy degradation on OSD restart cannot possibly be expected behaviour, or do I misunderstand something here?
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 04 August 2020 13:19:41
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
All seems in order then - when you ran into your maintenance issue, how long was it after you added the new OSDs, and did Ceph ever get to HEALTH_OK so it could trim PG history? Also, did the OSDs just start back up in the wrong place in the CRUSH tree?
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Tuesday, August 4, 2020 7:10 AM
To: Eric Smith <Eric.Smith(a)vecima.com>; ceph-users <ceph-users(a)ceph.io>
Subject: Re: Ceph does not recover from OSD restart
Hi Eric,
> Have you adjusted the min_size for pool sr-rbd-data-one-hdd
Yes. For all EC pools located in datacenter ServerRoom, we currently set min_size=k=6, because we lack physical servers. Hosts ceph-21 and ceph-22 are logical but not physical, disks in these buckets are co-located such that no more than 2 host buckets share the same physical host. With failure domain = host, we can ensure that no more than 2 EC shards are on the same physical host. With m=2 and min_size=k we have continued service with any 1 physical host down for maintenance and also recovery will happen if a physical host fails. Some objects will have no redundancy for a while then. We will increase min_size to k+1 as soon as we have 2 additional hosts and simply move the OSDs from buckets ceph-21/22 to these without rebalancing.
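As a quick sanity check of this reasoning (a sketch; the numbers are taken from the pool layout described above):

```python
# Sanity check of the min_size reasoning above for the 6+2 EC pools:
# with failure domain = host bucket and at most 2 host buckets per
# physical server, losing one physical server removes at most 2 shards.
k, m = 6, 2
shards_total = k + m                  # 8 shards per PG
max_shards_lost_per_server = 2        # 2 co-located host buckets

surviving = shards_total - max_shards_lost_per_server
assert surviving >= k                 # PGs stay active with min_size = k = 6
redundancy_left = surviving - k       # shards beyond the minimum
print(surviving, redundancy_left)     # 6 0 -> service continues, zero redundancy
```

This is exactly the trade-off described above: one physical host can go down without losing service, but until it recovers some objects have no redundancy left.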
The distribution of disks and buckets is listed below as well (longer listing).
Thanks and best regards,
Frank
# ceph osd erasure-code-profile ls
con-ec-8-2-hdd
con-ec-8-2-ssd
default
sr-ec-6-2-hdd
This is the relevant one:
# ceph osd erasure-code-profile get sr-ec-6-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ServerRoom
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8
Note that the pool sr-rbd-data-one (id 2) was created with this profile and later moved to SSD. Therefore, the crush rule does not match the profile's device class any more.
These two are under different roots:
# ceph osd erasure-code-profile get con-ec-8-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8
# ceph osd erasure-code-profile get con-ec-8-2-ssd
crush-device-class=ssd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8
Full physical placement information for OSDs under tree "datacenter ServerRoom":
----------------
ceph-04
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 243 ceph-04 1.8T SSD
osd-phy1 247 ceph-21 1.8T SSD
osd-phy2 254 ceph-04 1.8T SSD
osd-phy3 256 ceph-04 1.8T SSD
osd-phy4 286 ceph-04 1.8T SSD
osd-phy5 287 ceph-04 1.8T SSD
osd-phy6 288 ceph-04 10.7T HDD
osd-phy7 48 ceph-04 372.6G SSD
osd-phy8 264 ceph-21 1.8T SSD
osd-phy9 84 ceph-04 8.9T HDD
osd-phy10 72 ceph-21 8.9T HDD
osd-phy11 145 ceph-04 8.9T HDD
osd-phy14 156 ceph-04 8.9T HDD
osd-phy15 168 ceph-04 8.9T HDD
osd-phy16 181 ceph-04 8.9T HDD
osd-phy17 0 ceph-21 8.9T HDD
----------------
ceph-05
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 240 ceph-05 1.8T SSD
osd-phy1 249 ceph-22 1.8T SSD
osd-phy2 251 ceph-05 1.8T SSD
osd-phy3 255 ceph-05 1.8T SSD
osd-phy4 284 ceph-05 1.8T SSD
osd-phy5 285 ceph-05 1.8T SSD
osd-phy6 289 ceph-05 10.7T HDD
osd-phy7 49 ceph-05 372.6G SSD
osd-phy8 265 ceph-22 1.8T SSD
osd-phy9 74 ceph-05 8.9T HDD
osd-phy10 85 ceph-22 8.9T HDD
osd-phy11 144 ceph-05 8.9T HDD
osd-phy14 157 ceph-05 8.9T HDD
osd-phy15 169 ceph-05 8.9T HDD
osd-phy16 180 ceph-05 8.9T HDD
osd-phy17 1 ceph-22 8.9T HDD
----------------
ceph-06
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 244 ceph-06 1.8T SSD
osd-phy1 246 ceph-21 1.8T SSD
osd-phy2 253 ceph-06 1.8T SSD
osd-phy3 257 ceph-06 1.8T SSD
osd-phy4 282 ceph-06 1.8T SSD
osd-phy5 283 ceph-06 1.8T SSD
osd-phy6 40 ceph-06 372.6G SSD
osd-phy7 50 ceph-06 372.6G SSD
osd-phy8 60 ceph-06 8.9T HDD
osd-phy9 290 ceph-06 10.7T HDD
osd-phy10 291 ceph-21 10.7T HDD
osd-phy11 146 ceph-06 8.9T HDD
osd-phy14 158 ceph-06 8.9T HDD
osd-phy15 170 ceph-06 8.9T HDD
osd-phy16 182 ceph-06 8.9T HDD
osd-phy17 2 ceph-21 8.9T HDD
----------------
ceph-07
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 242 ceph-07 1.8T SSD
osd-phy1 250 ceph-22 1.8T SSD
osd-phy2 252 ceph-07 1.8T SSD
osd-phy3 258 ceph-07 1.8T SSD
osd-phy4 279 ceph-07 1.8T SSD
osd-phy5 280 ceph-07 1.8T SSD
osd-phy6 292 ceph-07 10.7T HDD
osd-phy7 52 ceph-07 372.6G SSD
osd-phy8 63 ceph-07 8.9T HDD
osd-phy9 281 ceph-22 1.8T SSD
osd-phy10 87 ceph-22 8.9T HDD
osd-phy11 148 ceph-07 8.9T HDD
osd-phy14 159 ceph-07 8.9T HDD
osd-phy15 172 ceph-07 8.9T HDD
osd-phy16 183 ceph-07 8.9T HDD
osd-phy17 3 ceph-22 8.9T HDD
----------------
ceph-18
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 241 ceph-18 1.8T SSD
osd-phy1 248 ceph-18 1.8T SSD
osd-phy2 41 ceph-18 372.6G SSD
osd-phy3 31 ceph-18 372.6G SSD
osd-phy4 277 ceph-18 1.8T SSD
osd-phy5 278 ceph-21 1.8T SSD
osd-phy6 53 ceph-21 372.6G SSD
osd-phy7 267 ceph-18 1.8T SSD
osd-phy8 266 ceph-18 1.8T SSD
osd-phy9 293 ceph-18 10.7T HDD
osd-phy10 86 ceph-21 8.9T HDD
osd-phy11 259 ceph-18 10.9T HDD
osd-phy14 229 ceph-18 8.9T HDD
osd-phy15 232 ceph-18 8.9T HDD
osd-phy16 235 ceph-18 8.9T HDD
osd-phy17 238 ceph-18 8.9T HDD
----------------
ceph-19
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 261 ceph-19 1.8T SSD
osd-phy1 262 ceph-19 1.8T SSD
osd-phy2 295 ceph-19 10.7T HDD
osd-phy3 43 ceph-19 372.6G SSD
osd-phy4 275 ceph-19 1.8T SSD
osd-phy5 294 ceph-22 10.7T HDD
osd-phy6 51 ceph-22 372.6G SSD
osd-phy7 269 ceph-19 1.8T SSD
osd-phy8 268 ceph-19 1.8T SSD
osd-phy9 276 ceph-22 1.8T SSD
osd-phy10 73 ceph-22 8.9T HDD
osd-phy11 263 ceph-19 10.9T HDD
osd-phy14 231 ceph-19 8.9T HDD
osd-phy15 233 ceph-19 8.9T HDD
osd-phy16 236 ceph-19 8.9T HDD
osd-phy17 239 ceph-19 8.9T HDD
----------------
ceph-20
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 245 ceph-20 1.8T SSD
osd-phy1 28 ceph-20 372.6G SSD
osd-phy2 44 ceph-20 372.6G SSD
osd-phy3 271 ceph-20 1.8T SSD
osd-phy4 272 ceph-20 1.8T SSD
osd-phy5 273 ceph-20 1.8T SSD
osd-phy6 274 ceph-21 1.8T SSD
osd-phy7 296 ceph-20 10.7T HDD
osd-phy8 76 ceph-21 8.9T HDD
osd-phy9 39 ceph-21 372.6G SSD
osd-phy10 270 ceph-20 1.8T SSD
osd-phy11 260 ceph-20 10.9T HDD
osd-phy14 228 ceph-20 8.9T HDD
osd-phy15 230 ceph-20 8.9T HDD
osd-phy16 234 ceph-20 8.9T HDD
osd-phy17 237 ceph-20 8.9T HDD
CONT is the container name and encodes the physical slot on the host where the OSD is located.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 04 August 2020 12:47:12
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
Have you adjusted the min_size for pool sr-rbd-data-one-hdd at all? Also can you send the output of "ceph osd erasure-code-profile ls" and for each EC profile, "ceph osd erasure-code-profile get <profile>"?
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, August 3, 2020 11:05 AM
To: Eric Smith <Eric.Smith(a)vecima.com>; ceph-users <ceph-users(a)ceph.io>
Subject: Re: Ceph does not recover from OSD restart
Sorry for the many small e-mails: requested IDs in the commands, 288-296. One new OSD per host.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 03 August 2020 16:59:04
To: Eric Smith; ceph-users
Subject: [ceph-users] Re: Ceph does not recover from OSD restart
Hi Eric,
the procedure for re-discovering all objects is:
# Flag: norebalance
ceph osd crush move osd.288 host=bb-04
ceph osd crush move osd.289 host=bb-05
ceph osd crush move osd.290 host=bb-06
ceph osd crush move osd.291 host=bb-21
ceph osd crush move osd.292 host=bb-07
ceph osd crush move osd.293 host=bb-18
ceph osd crush move osd.295 host=bb-19
ceph osd crush move osd.294 host=bb-22
ceph osd crush move osd.296 host=bb-20
# Wait until all PGs are peered and recovery is done. In my case, there was only little I/O,
# no more than 50-100 objects had writes missing and recovery was a few seconds.
#
# The bb-hosts are under a separate crush root that I use solely as parking space
# and for draining OSDs.
ceph osd crush move osd.288 host=ceph-04
ceph osd crush move osd.289 host=ceph-05
ceph osd crush move osd.290 host=ceph-06
ceph osd crush move osd.291 host=ceph-21
ceph osd crush move osd.292 host=ceph-07
ceph osd crush move osd.293 host=ceph-18
ceph osd crush move osd.295 host=ceph-19
ceph osd crush move osd.294 host=ceph-22
ceph osd crush move osd.296 host=ceph-20
After peering, no degraded PGs/objects any more, just the misplaced ones as expected.
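Since this move-out/move-in dance would be painful with 300+ OSDs, the command pairs could be generated with a small script. A sketch; the bb-/ceph- bucket names and the OSD-to-host mapping mirror the procedure above:

```python
# Generate the move-out / move-in command pairs used in the procedure
# above. The bb-XX buckets are the parking crush root, ceph-XX the real
# host buckets; the mapping is copied from the commands above.
osd_to_host = {288: "04", 289: "05", 290: "06", 291: "21", 292: "07",
               293: "18", 295: "19", 294: "22", 296: "20"}

def crush_moves(prefix):
    """Emit one 'ceph osd crush move' command per OSD."""
    return [f"ceph osd crush move osd.{o} host={prefix}-{h}"
            for o, h in osd_to_host.items()]

park = crush_moves("bb")       # phase 1: move to parking buckets
restore = crush_moves("ceph")  # phase 2: move back after peering
print(park[0])                 # ceph osd crush move osd.288 host=bb-04
```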
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 03 August 2020 16:45:28
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
You said you had to move some OSDs out and back in for Ceph to go back to normal (The OSDs you added). Which OSDs were added?
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, August 3, 2020 9:55 AM
To: Eric Smith <Eric.Smith(a)vecima.com>; ceph-users <ceph-users(a)ceph.io>
Subject: Re: Ceph does not recover from OSD restart
Hi Eric,
thanks for your fast response. Below is the output, shortened a bit as indicated. Disks have been added to pool 11 'sr-rbd-data-one-hdd' only; this is the only pool with remapped PGs and also the only pool losing track of objects. Every other pool recovers from a restart by itself.
Best regards,
Frank
# ceph osd pool stats
pool sr-rbd-meta-one id 1
client io 5.3 KiB/s rd, 3.2 KiB/s wr, 4 op/s rd, 1 op/s wr
pool sr-rbd-data-one id 2
client io 24 MiB/s rd, 32 MiB/s wr, 380 op/s rd, 594 op/s wr
pool sr-rbd-one-stretch id 3
nothing is going on
pool con-rbd-meta-hpc-one id 7
nothing is going on
pool con-rbd-data-hpc-one id 8
client io 0 B/s rd, 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr
pool sr-rbd-data-one-hdd id 11
53241814/346903376 objects misplaced (15.348%)
client io 73 MiB/s rd, 3.4 MiB/s wr, 236 op/s rd, 69 op/s wr
pool con-fs2-meta1 id 12
client io 106 KiB/s rd, 112 KiB/s wr, 3 op/s rd, 11 op/s wr
pool con-fs2-meta2 id 13
client io 0 B/s wr, 0 op/s rd, 0 op/s wr
pool con-fs2-data id 14
client io 5.5 MiB/s rd, 201 KiB/s wr, 34 op/s rd, 8 op/s wr
pool con-fs2-data-ec-ssd id 17
nothing is going on
pool ms-rbd-one id 18
client io 5.6 MiB/s wr, 0 op/s rd, 179 op/s wr
# ceph osd pool ls detail
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 80 pgp_num 80 last_change 122597 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 application rbd
removed_snaps [1~45]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186437 lfor 0/126858 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 43980465111040 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~3,5~2, ... huge list ... ,11f9d~1,11fa0~2]
pool 3 'sr-rbd-one-stretch' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 160 pgp_num 160 last_change 143202 lfor 0/79983 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd
removed_snaps [1~7,b~2,11~2,14~2,17~9e,b8~1e]
pool 7 'con-rbd-meta-hpc-one' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96357 lfor 0/90462 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 10737418240 stripe_width 0 application rbd
removed_snaps [1~3]
pool 8 'con-rbd-data-hpc-one' erasure size 10 min_size 9 crush_rule 7 object_hash rjenkins pg_num 150 pgp_num 150 last_change 96358 lfor 0/90996 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 32768 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~7,9~2]
pool 11 'sr-rbd-data-one-hdd' erasure size 8 min_size 6 crush_rule 9 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186331 lfor 0/127768 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 219902325555200 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~59f,5a2~fe, ... less huge list ... ,2559~1,255b~1]
removed_snaps_queue [1a64~5,1a6a~1,1a6c~1, ... long list ... ,220a~1,220c~1]
pool 12 'con-fs2-meta1' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 57096 flags hashpspool,nodelete max_bytes 268435456000 stripe_width 0 application cephfs
pool 13 'con-fs2-meta2' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96359 flags hashpspool,nodelete max_bytes 107374182400 stripe_width 0 application cephfs
pool 14 'con-fs2-data' erasure size 10 min_size 9 crush_rule 8 object_hash rjenkins pg_num 1350 pgp_num 1350 last_change 96360 lfor 0/91144 flags hashpspool,ec_overwrites,nodelete max_bytes 879609302220800 stripe_width 32768 fast_read 1 compression_mode aggressive application cephfs
pool 17 'con-fs2-data-ec-ssd' erasure size 10 min_size 9 crush_rule 10 object_hash rjenkins pg_num 55 pgp_num 55 last_change 96361 lfor 0/90473 flags hashpspool,ec_overwrites,nodelete max_bytes 1099511627776 stripe_width 32768 fast_read 1 compression_mode aggressive application cephfs
pool 18 'ms-rbd-one' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 150 pgp_num 150 last_change 143206 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd
removed_snaps [1~3]
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-40 2384.09058 root DTU
-42 0 region Lyngby
-41 2384.09058 region Risoe
2 sub-trees on level datacenter removed for brevity
-49 586.49347 datacenter ServerRoom
-55 586.49347 room SR-113
-65 64.33617 host ceph-04
84 hdd 8.90999 osd.84 up 1.00000 1.00000
145 hdd 8.90999 osd.145 up 1.00000 1.00000
156 hdd 8.90999 osd.156 up 1.00000 1.00000
168 hdd 8.90999 osd.168 up 1.00000 1.00000
181 hdd 8.90999 osd.181 up 0.95000 1.00000
288 hdd 10.69229 osd.288 up 1.00000 1.00000
243 rbd_data 1.74599 osd.243 up 1.00000 1.00000
254 rbd_data 1.74599 osd.254 up 1.00000 1.00000
256 rbd_data 1.74599 osd.256 up 1.00000 1.00000
286 rbd_data 1.74599 osd.286 up 1.00000 1.00000
287 rbd_data 1.74599 osd.287 up 1.00000 1.00000
48 rbd_meta 0.36400 osd.48 up 1.00000 1.00000
-67 64.33617 host ceph-05
74 hdd 8.90999 osd.74 up 1.00000 1.00000
144 hdd 8.90999 osd.144 up 1.00000 1.00000
157 hdd 8.90999 osd.157 up 0.84999 1.00000
169 hdd 8.90999 osd.169 up 0.95000 1.00000
180 hdd 8.90999 osd.180 up 0.89999 1.00000
289 hdd 10.69229 osd.289 up 1.00000 1.00000
240 rbd_data 1.74599 osd.240 up 1.00000 1.00000
251 rbd_data 1.74599 osd.251 up 1.00000 1.00000
255 rbd_data 1.74599 osd.255 up 1.00000 1.00000
284 rbd_data 1.74599 osd.284 up 1.00000 1.00000
285 rbd_data 1.74599 osd.285 up 1.00000 1.00000
49 rbd_meta 0.36400 osd.49 up 1.00000 1.00000
-69 64.70016 host ceph-06
60 hdd 8.90999 osd.60 up 1.00000 1.00000
146 hdd 8.90999 osd.146 up 1.00000 1.00000
158 hdd 8.90999 osd.158 up 0.95000 1.00000
170 hdd 8.90999 osd.170 up 0.89999 1.00000
182 hdd 8.90999 osd.182 up 1.00000 1.00000
290 hdd 10.69229 osd.290 up 1.00000 1.00000
244 rbd_data 1.74599 osd.244 up 1.00000 1.00000
253 rbd_data 1.74599 osd.253 up 1.00000 1.00000
257 rbd_data 1.74599 osd.257 up 1.00000 1.00000
282 rbd_data 1.74599 osd.282 up 1.00000 1.00000
283 rbd_data 1.74599 osd.283 up 1.00000 1.00000
40 rbd_meta 0.36400 osd.40 up 1.00000 1.00000
50 rbd_meta 0.36400 osd.50 up 1.00000 1.00000
-71 64.33617 host ceph-07
63 hdd 8.90999 osd.63 up 1.00000 1.00000
148 hdd 8.90999 osd.148 up 0.95000 1.00000
159 hdd 8.90999 osd.159 up 1.00000 1.00000
172 hdd 8.90999 osd.172 up 0.95000 1.00000
183 hdd 8.90999 osd.183 up 0.84999 1.00000
292 hdd 10.69229 osd.292 up 1.00000 1.00000
242 rbd_data 1.74599 osd.242 up 1.00000 1.00000
252 rbd_data 1.74599 osd.252 up 1.00000 1.00000
258 rbd_data 1.74599 osd.258 up 1.00000 1.00000
279 rbd_data 1.74599 osd.279 up 1.00000 1.00000
280 rbd_data 1.74599 osd.280 up 1.00000 1.00000
52 rbd_meta 0.36400 osd.52 up 1.00000 1.00000
-81 66.70416 host ceph-18
229 hdd 8.90999 osd.229 up 1.00000 1.00000
232 hdd 8.90999 osd.232 up 1.00000 1.00000
235 hdd 8.90999 osd.235 up 1.00000 1.00000
238 hdd 8.90999 osd.238 up 0.95000 1.00000
259 hdd 10.91399 osd.259 up 1.00000 1.00000
293 hdd 10.69229 osd.293 up 1.00000 1.00000
241 rbd_data 1.74599 osd.241 up 1.00000 1.00000
248 rbd_data 1.74599 osd.248 up 1.00000 1.00000
266 rbd_data 1.74599 osd.266 up 1.00000 1.00000
267 rbd_data 1.74599 osd.267 up 1.00000 1.00000
277 rbd_data 1.74599 osd.277 up 1.00000 1.00000
31 rbd_meta 0.36400 osd.31 up 1.00000 1.00000
41 rbd_meta 0.36400 osd.41 up 1.00000 1.00000
-94 66.34016 host ceph-19
231 hdd 8.90999 osd.231 up 1.00000 1.00000
233 hdd 8.90999 osd.233 up 0.95000 1.00000
236 hdd 8.90999 osd.236 up 1.00000 1.00000
239 hdd 8.90999 osd.239 up 1.00000 1.00000
263 hdd 10.91399 osd.263 up 1.00000 1.00000
295 hdd 10.69229 osd.295 up 1.00000 1.00000
261 rbd_data 1.74599 osd.261 up 1.00000 1.00000
262 rbd_data 1.74599 osd.262 up 1.00000 1.00000
268 rbd_data 1.74599 osd.268 up 1.00000 1.00000
269 rbd_data 1.74599 osd.269 up 1.00000 1.00000
275 rbd_data 1.74599 osd.275 up 1.00000 1.00000
43 rbd_meta 0.36400 osd.43 up 1.00000 1.00000
-4 66.70416 host ceph-20
228 hdd 8.90999 osd.228 up 1.00000 1.00000
230 hdd 8.90999 osd.230 up 1.00000 1.00000
234 hdd 8.90999 osd.234 up 0.95000 1.00000
237 hdd 8.90999 osd.237 up 1.00000 1.00000
260 hdd 10.91399 osd.260 up 1.00000 1.00000
296 hdd 10.69229 osd.296 up 1.00000 1.00000
245 rbd_data 1.74599 osd.245 up 1.00000 1.00000
270 rbd_data 1.74599 osd.270 up 1.00000 1.00000
271 rbd_data 1.74599 osd.271 up 1.00000 1.00000
272 rbd_data 1.74599 osd.272 up 1.00000 1.00000
273 rbd_data 1.74599 osd.273 up 1.00000 1.00000
28 rbd_meta 0.36400 osd.28 up 1.00000 1.00000
44 rbd_meta 0.36400 osd.44 up 1.00000 1.00000
-64 64.70016 host ceph-21
0 hdd 8.90999 osd.0 up 1.00000 1.00000
2 hdd 8.90999 osd.2 up 0.95000 1.00000
72 hdd 8.90999 osd.72 up 1.00000 1.00000
76 hdd 8.90999 osd.76 up 1.00000 1.00000
86 hdd 8.90999 osd.86 up 1.00000 1.00000
291 hdd 10.69229 osd.291 up 1.00000 1.00000
246 rbd_data 1.74599 osd.246 up 1.00000 1.00000
247 rbd_data 1.74599 osd.247 up 1.00000 1.00000
264 rbd_data 1.74599 osd.264 up 1.00000 1.00000
274 rbd_data 1.74599 osd.274 up 1.00000 1.00000
278 rbd_data 1.74599 osd.278 up 1.00000 1.00000
39 rbd_meta 0.36400 osd.39 up 1.00000 1.00000
53 rbd_meta 0.36400 osd.53 up 1.00000 1.00000
-66 64.33617 host ceph-22
1 hdd 8.90999 osd.1 up 1.00000 1.00000
3 hdd 8.90999 osd.3 up 1.00000 1.00000
73 hdd 8.90999 osd.73 up 1.00000 1.00000
85 hdd 8.90999 osd.85 up 0.95000 1.00000
87 hdd 8.90999 osd.87 up 1.00000 1.00000
294 hdd 10.69229 osd.294 up 1.00000 1.00000
249 rbd_data 1.74599 osd.249 up 1.00000 1.00000
250 rbd_data 1.74599 osd.250 up 1.00000 1.00000
265 rbd_data 1.74599 osd.265 up 1.00000 1.00000
276 rbd_data 1.74599 osd.276 up 1.00000 1.00000
281 rbd_data 1.74599 osd.281 up 1.00000 1.00000
51 rbd_meta 0.36400 osd.51 up 1.00000 1.00000
# ceph osd crush rule dump
# crush rules outside tree under "datacenter ServerRoom" removed for brevity
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 5,
"rule_name": "sr-rbd-data-one",
"ruleset": 5,
"type": 3,
"min_size": 3,
"max_size": 8,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 50
},
{
"op": "set_choose_tries",
"num": 1000
},
{
"op": "take",
"item": -185,
"item_name": "ServerRoom~rbd_data"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 9,
"rule_name": "sr-rbd-data-one-hdd",
"ruleset": 9,
"type": 3,
"min_size": 3,
"max_size": 8,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -53,
"item_name": "ServerRoom~hdd"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 03 August 2020 15:40
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
Can you post the output of these commands:
ceph osd pool ls detail
ceph osd tree
ceph osd crush rule dump
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, August 3, 2020 9:19 AM
To: ceph-users <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: Ceph does not recover from OSD restart
After moving the newly added OSDs out of the crush tree and back in again, I get to exactly what I want to see:
cluster:
id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
health: HEALTH_WARN
norebalance,norecover flag(s) set
53030026/1492404361 objects misplaced (3.553%)
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs
flags norebalance,norecover
data:
pools: 11 pools, 3215 pgs
objects: 177.3 M objects, 489 TiB
usage: 696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 53030026/1492404361 objects misplaced (3.553%)
2902 active+clean
299 active+remapped+backfill_wait
8 active+remapped+backfilling
5 active+clean+scrubbing+deep
1 active+clean+snaptrim
io:
client: 69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr
Why does a cluster with remapped PGs not survive OSD restarts without losing track of objects?
Why is it not finding the objects by itself?
A power outage of 3 hosts will halt everything for no reason until manual intervention. How can I avoid this problem?
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 03 August 2020 15:03:05
To: ceph-users
Subject: [ceph-users] Ceph does not recover from OSD restart
Dear cephers,
I have a serious issue with degraded objects after an OSD restart. The cluster was in a state of re-balancing after adding disks to each host. Before restart I had "X/Y objects misplaced". Apart from that, health was OK. I now restarted all OSDs of one host and the cluster does not recover from that:
cluster:
id: xxx
health: HEALTH_ERR
45813194/1492348700 objects misplaced (3.070%)
Degraded data redundancy: 6798138/1492348700 objects degraded (0.456%), 85 pgs degraded, 86 pgs undersized
Degraded data redundancy (low space): 17 pgs backfill_toofull
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs
data:
pools: 11 pools, 3215 pgs
objects: 177.3 M objects, 489 TiB
usage: 696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 6798138/1492348700 objects degraded (0.456%)
45813194/1492348700 objects misplaced (3.070%)
2903 active+clean
209 active+remapped+backfill_wait
73 active+undersized+degraded+remapped+backfill_wait
9 active+remapped+backfill_wait+backfill_toofull
8 active+undersized+degraded+remapped+backfill_wait+backfill_toofull
4 active+undersized+degraded+remapped+backfilling
3 active+remapped+backfilling
3 active+clean+scrubbing+deep
1 active+clean+scrubbing
1 active+undersized+remapped+backfilling
1 active+clean+snaptrim
io:
client: 47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
recovery: 195 MiB/s, 48 objects/s
After a restart there should be only a small number of degraded objects: the ones that received writes while the OSDs were down. What I see, however, is that the cluster seems to have lost track of a huge number of objects; the 0.456% degraded correspond to 1-2 days' worth of I/O. I have done reboots before and saw only a few thousand degraded objects at most. The output of ceph health detail shows a lot of lines like these:
[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized
pg 11.9 is stuck undersized for 815.188981, current state active+undersized+degraded+remapped+backfill_wait, last acting [60,148,2147483647,263,76,230,87,169]
8...9
pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting [159,60,180,263,237,3,2147483647,72]
pg 11.4a is stuck undersized for 851.162862, current state active+undersized+degraded+remapped+backfill_wait, last acting [182,233,87,228,2,180,63,2147483647]
[...]
pg 11.22e is stuck undersized for 851.162402, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Degraded data redundancy (low space): 17 pgs backfill_toofull
pg 11.24 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [230,259,2147483647,1,144,159,233,146]
[...]
pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting [84,259,183,170,85,234,233,2]
pg 11.225 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [236,183,1,2147483647,2147483647,169,229,230]
pg 11.22e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [234,183,239,2147483647,170,229,1,86]
POOL_NEAR_FULL 1 pools nearfull
pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)
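A side note on reading these numbers: the denominators count object copies/shards, not objects, which is why they are roughly 8x the 177.3 M objects reported by ceph status (most data here sits in 8- and 10-shard EC pools). The percentages follow directly from the raw counters:

```python
# Percentages from 'ceph health detail' are just raw counters divided by
# the total number of object copies/shards (values from the output above).
degraded, misplaced, total = 6792562, 45804316, 1492356704

pct = lambda n: round(100 * n / total, 3)
print(pct(degraded))   # 0.455
print(pct(misplaced))  # 3.069
```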
It looks like a lot of PGs are not receiving their complete crush map placement, as if the peering is incomplete. This is a serious issue: it looks like the cluster would see a total storage outage if just 2 more hosts rebooted - without actually having lost any storage. The pool in question is a 6+2 EC pool.
What is going on here? Why are the PG maps not restored to their values from before the OSD restart? The degraded PGs should receive the missing OSD IDs; everything is up exactly as it was before the reboot.
Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi Eric,
> Have you adjusted the min_size for pool sr-rbd-data-one-hdd
Yes. For all EC pools located in datacenter ServerRoom, we currently set min_size=k=6, because we lack physical servers. Hosts ceph-21 and ceph-22 are logical but not physical, disks in these buckets are co-located such that no more than 2 host buckets share the same physical host. With failure domain = host, we can ensure that no more than 2 EC shards are on the same physical host. With m=2 and min_size=k we have continued service with any 1 physical host down for maintenance and also recovery will happen if a physical host fails. Some objects will have no redundancy for a while then. We will increase min_size to k+1 as soon as we have 2 additional hosts and simply move the OSDs from buckets ceph-21/22 to these without rebalancing.
The distribution of disks and buckets is listed below as well (longer listing).
Thanks and best regards,
Frank
# ceph osd erasure-code-profile ls
con-ec-8-2-hdd
con-ec-8-2-ssd
default
sr-ec-6-2-hdd
This is the relevant one:
# ceph osd erasure-code-profile get sr-ec-6-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ServerRoom
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8
Note that the pool sr-rbd-data-one (id 2) was created with this profile and later moved to SSD. Therefore, the crush rule does not match the profile's device class any more.
These two are under different roots:
# ceph osd erasure-code-profile get con-ec-8-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8
# ceph osd erasure-code-profile get con-ec-8-2-ssd
crush-device-class=ssd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8
Full physical placement information for OSDs under tree "datacenter ServerRoom":
----------------
ceph-04
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 243 ceph-04 1.8T SSD
osd-phy1 247 ceph-21 1.8T SSD
osd-phy2 254 ceph-04 1.8T SSD
osd-phy3 256 ceph-04 1.8T SSD
osd-phy4 286 ceph-04 1.8T SSD
osd-phy5 287 ceph-04 1.8T SSD
osd-phy6 288 ceph-04 10.7T HDD
osd-phy7 48 ceph-04 372.6G SSD
osd-phy8 264 ceph-21 1.8T SSD
osd-phy9 84 ceph-04 8.9T HDD
osd-phy10 72 ceph-21 8.9T HDD
osd-phy11 145 ceph-04 8.9T HDD
osd-phy14 156 ceph-04 8.9T HDD
osd-phy15 168 ceph-04 8.9T HDD
osd-phy16 181 ceph-04 8.9T HDD
osd-phy17 0 ceph-21 8.9T HDD
----------------
ceph-05
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 240 ceph-05 1.8T SSD
osd-phy1 249 ceph-22 1.8T SSD
osd-phy2 251 ceph-05 1.8T SSD
osd-phy3 255 ceph-05 1.8T SSD
osd-phy4 284 ceph-05 1.8T SSD
osd-phy5 285 ceph-05 1.8T SSD
osd-phy6 289 ceph-05 10.7T HDD
osd-phy7 49 ceph-05 372.6G SSD
osd-phy8 265 ceph-22 1.8T SSD
osd-phy9 74 ceph-05 8.9T HDD
osd-phy10 85 ceph-22 8.9T HDD
osd-phy11 144 ceph-05 8.9T HDD
osd-phy14 157 ceph-05 8.9T HDD
osd-phy15 169 ceph-05 8.9T HDD
osd-phy16 180 ceph-05 8.9T HDD
osd-phy17 1 ceph-22 8.9T HDD
----------------
ceph-06
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 244 ceph-06 1.8T SSD
osd-phy1 246 ceph-21 1.8T SSD
osd-phy2 253 ceph-06 1.8T SSD
osd-phy3 257 ceph-06 1.8T SSD
osd-phy4 282 ceph-06 1.8T SSD
osd-phy5 283 ceph-06 1.8T SSD
osd-phy6 40 ceph-06 372.6G SSD
osd-phy7 50 ceph-06 372.6G SSD
osd-phy8 60 ceph-06 8.9T HDD
osd-phy9 290 ceph-06 10.7T HDD
osd-phy10 291 ceph-21 10.7T HDD
osd-phy11 146 ceph-06 8.9T HDD
osd-phy14 158 ceph-06 8.9T HDD
osd-phy15 170 ceph-06 8.9T HDD
osd-phy16 182 ceph-06 8.9T HDD
osd-phy17 2 ceph-21 8.9T HDD
----------------
ceph-07
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 242 ceph-07 1.8T SSD
osd-phy1 250 ceph-22 1.8T SSD
osd-phy2 252 ceph-07 1.8T SSD
osd-phy3 258 ceph-07 1.8T SSD
osd-phy4 279 ceph-07 1.8T SSD
osd-phy5 280 ceph-07 1.8T SSD
osd-phy6 292 ceph-07 10.7T HDD
osd-phy7 52 ceph-07 372.6G SSD
osd-phy8 63 ceph-07 8.9T HDD
osd-phy9 281 ceph-22 1.8T SSD
osd-phy10 87 ceph-22 8.9T HDD
osd-phy11 148 ceph-07 8.9T HDD
osd-phy14 159 ceph-07 8.9T HDD
osd-phy15 172 ceph-07 8.9T HDD
osd-phy16 183 ceph-07 8.9T HDD
osd-phy17 3 ceph-22 8.9T HDD
----------------
ceph-18
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 241 ceph-18 1.8T SSD
osd-phy1 248 ceph-18 1.8T SSD
osd-phy2 41 ceph-18 372.6G SSD
osd-phy3 31 ceph-18 372.6G SSD
osd-phy4 277 ceph-18 1.8T SSD
osd-phy5 278 ceph-21 1.8T SSD
osd-phy6 53 ceph-21 372.6G SSD
osd-phy7 267 ceph-18 1.8T SSD
osd-phy8 266 ceph-18 1.8T SSD
osd-phy9 293 ceph-18 10.7T HDD
osd-phy10 86 ceph-21 8.9T HDD
osd-phy11 259 ceph-18 10.9T HDD
osd-phy14 229 ceph-18 8.9T HDD
osd-phy15 232 ceph-18 8.9T HDD
osd-phy16 235 ceph-18 8.9T HDD
osd-phy17 238 ceph-18 8.9T HDD
----------------
ceph-19
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 261 ceph-19 1.8T SSD
osd-phy1 262 ceph-19 1.8T SSD
osd-phy2 295 ceph-19 10.7T HDD
osd-phy3 43 ceph-19 372.6G SSD
osd-phy4 275 ceph-19 1.8T SSD
osd-phy5 294 ceph-22 10.7T HDD
osd-phy6 51 ceph-22 372.6G SSD
osd-phy7 269 ceph-19 1.8T SSD
osd-phy8 268 ceph-19 1.8T SSD
osd-phy9 276 ceph-22 1.8T SSD
osd-phy10 73 ceph-22 8.9T HDD
osd-phy11 263 ceph-19 10.9T HDD
osd-phy14 231 ceph-19 8.9T HDD
osd-phy15 233 ceph-19 8.9T HDD
osd-phy16 236 ceph-19 8.9T HDD
osd-phy17 239 ceph-19 8.9T HDD
----------------
ceph-20
----------------
CONT ID BUCKET SIZE TYP
osd-phy0 245 ceph-20 1.8T SSD
osd-phy1 28 ceph-20 372.6G SSD
osd-phy2 44 ceph-20 372.6G SSD
osd-phy3 271 ceph-20 1.8T SSD
osd-phy4 272 ceph-20 1.8T SSD
osd-phy5 273 ceph-20 1.8T SSD
osd-phy6 274 ceph-21 1.8T SSD
osd-phy7 296 ceph-20 10.7T HDD
osd-phy8 76 ceph-21 8.9T HDD
osd-phy9 39 ceph-21 372.6G SSD
osd-phy10 270 ceph-20 1.8T SSD
osd-phy11 260 ceph-20 10.9T HDD
osd-phy14 228 ceph-20 8.9T HDD
osd-phy15 230 ceph-20 8.9T HDD
osd-phy16 234 ceph-20 8.9T HDD
osd-phy17 237 ceph-20 8.9T HDD
CONT is the container name and encodes the physical slot on the host where the OSD is located.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 04 August 2020 12:47:12
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
Have you adjusted the min_size for pool sr-rbd-data-one-hdd at all? Also can you send the output of "ceph osd erasure-code-profile ls" and for each EC profile, "ceph osd erasure-code-profile get <profile>"?
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, August 3, 2020 11:05 AM
To: Eric Smith <Eric.Smith(a)vecima.com>; ceph-users <ceph-users(a)ceph.io>
Subject: Re: Ceph does not recover from OSD restart
Sorry for the many small e-mails: requested IDs in the commands, 288-296. One new OSD per host.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 03 August 2020 16:59:04
To: Eric Smith; ceph-users
Subject: [ceph-users] Re: Ceph does not recover from OSD restart
Hi Eric,
the procedure for re-discovering all objects is:
# Flag: norebalance
ceph osd crush move osd.288 host=bb-04
ceph osd crush move osd.289 host=bb-05
ceph osd crush move osd.290 host=bb-06
ceph osd crush move osd.291 host=bb-21
ceph osd crush move osd.292 host=bb-07
ceph osd crush move osd.293 host=bb-18
ceph osd crush move osd.295 host=bb-19
ceph osd crush move osd.294 host=bb-22
ceph osd crush move osd.296 host=bb-20
# Wait until all PGs are peered and recovery is done. In my case, there was only little I/O,
# no more than 50-100 objects had writes missing and recovery was a few seconds.
#
# The bb-hosts are under a separate crush root that I use solely as parking space
# and for draining OSDs.
ceph osd crush move osd.288 host=ceph-04
ceph osd crush move osd.289 host=ceph-05
ceph osd crush move osd.290 host=ceph-06
ceph osd crush move osd.291 host=ceph-21
ceph osd crush move osd.292 host=ceph-07
ceph osd crush move osd.293 host=ceph-18
ceph osd crush move osd.295 host=ceph-19
ceph osd crush move osd.294 host=ceph-22
ceph osd crush move osd.296 host=ceph-20
After peering, no degraded PGs/objects any more, just the misplaced ones as expected.
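For repeatability, the park/return cycle above can be written as a small script. A sketch (the OSD-to-host mapping is copied from the commands above; it only prints the commands rather than running them, so you can review before piping to a shell):

```shell
#!/bin/sh
# Print the "park" moves (to the bb-* hosts) and then the "return" moves
# (back to the ceph-* hosts) for the OSDs listed in this thread.
# Run with norebalance set, and wait for peering between the two phases.
set -eu

pairs="288:04 289:05 290:06 291:21 292:07 293:18 295:19 294:22 296:20"

for p in $pairs; do
    osd=${p%%:*}; host=${p##*:}
    echo "ceph osd crush move osd.$osd host=bb-$host"
done
echo "# wait for peering and recovery here"
for p in $pairs; do
    osd=${p%%:*}; host=${p##*:}
    echo "ceph osd crush move osd.$osd host=ceph-$host"
done
```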
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 03 August 2020 16:45:28
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
You said you had to move some OSDs out and back in for Ceph to go back to normal (The OSDs you added). Which OSDs were added?
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, August 3, 2020 9:55 AM
To: Eric Smith <Eric.Smith(a)vecima.com>; ceph-users <ceph-users(a)ceph.io>
Subject: Re: Ceph does not recover from OSD restart
Hi Eric,
thanks for your fast response. Below the output, shortened a bit as indicated. Disks have been added to pool 11 'sr-rbd-data-one-hdd' only, this is the only pool with remapped PGs and is also the only pool experiencing the "loss of track" to objects. Every other pool recovers from restart by itself.
Best regards,
Frank
# ceph osd pool stats
pool sr-rbd-meta-one id 1
client io 5.3 KiB/s rd, 3.2 KiB/s wr, 4 op/s rd, 1 op/s wr
pool sr-rbd-data-one id 2
client io 24 MiB/s rd, 32 MiB/s wr, 380 op/s rd, 594 op/s wr
pool sr-rbd-one-stretch id 3
nothing is going on
pool con-rbd-meta-hpc-one id 7
nothing is going on
pool con-rbd-data-hpc-one id 8
client io 0 B/s rd, 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr
pool sr-rbd-data-one-hdd id 11
53241814/346903376 objects misplaced (15.348%)
client io 73 MiB/s rd, 3.4 MiB/s wr, 236 op/s rd, 69 op/s wr
pool con-fs2-meta1 id 12
client io 106 KiB/s rd, 112 KiB/s wr, 3 op/s rd, 11 op/s wr
pool con-fs2-meta2 id 13
client io 0 B/s wr, 0 op/s rd, 0 op/s wr
pool con-fs2-data id 14
client io 5.5 MiB/s rd, 201 KiB/s wr, 34 op/s rd, 8 op/s wr
pool con-fs2-data-ec-ssd id 17
nothing is going on
pool ms-rbd-one id 18
client io 5.6 MiB/s wr, 0 op/s rd, 179 op/s wr
# ceph osd pool ls detail
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 80 pgp_num 80 last_change 122597 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 application rbd
removed_snaps [1~45]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186437 lfor 0/126858 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 43980465111040 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~3,5~2, ... huge list ... ,11f9d~1,11fa0~2]
pool 3 'sr-rbd-one-stretch' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 160 pgp_num 160 last_change 143202 lfor 0/79983 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd
removed_snaps [1~7,b~2,11~2,14~2,17~9e,b8~1e]
pool 7 'con-rbd-meta-hpc-one' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96357 lfor 0/90462 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 10737418240 stripe_width 0 application rbd
removed_snaps [1~3]
pool 8 'con-rbd-data-hpc-one' erasure size 10 min_size 9 crush_rule 7 object_hash rjenkins pg_num 150 pgp_num 150 last_change 96358 lfor 0/90996 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 32768 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~7,9~2]
pool 11 'sr-rbd-data-one-hdd' erasure size 8 min_size 6 crush_rule 9 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186331 lfor 0/127768 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 219902325555200 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~59f,5a2~fe, ... less huge list ... ,2559~1,255b~1]
removed_snaps_queue [1a64~5,1a6a~1,1a6c~1, ... long list ... ,220a~1,220c~1]
pool 12 'con-fs2-meta1' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 57096 flags hashpspool,nodelete max_bytes 268435456000 stripe_width 0 application cephfs
pool 13 'con-fs2-meta2' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96359 flags hashpspool,nodelete max_bytes 107374182400 stripe_width 0 application cephfs
pool 14 'con-fs2-data' erasure size 10 min_size 9 crush_rule 8 object_hash rjenkins pg_num 1350 pgp_num 1350 last_change 96360 lfor 0/91144 flags hashpspool,ec_overwrites,nodelete max_bytes 879609302220800 stripe_width 32768 fast_read 1 compression_mode aggressive application cephfs
pool 17 'con-fs2-data-ec-ssd' erasure size 10 min_size 9 crush_rule 10 object_hash rjenkins pg_num 55 pgp_num 55 last_change 96361 lfor 0/90473 flags hashpspool,ec_overwrites,nodelete max_bytes 1099511627776 stripe_width 32768 fast_read 1 compression_mode aggressive application cephfs
pool 18 'ms-rbd-one' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 150 pgp_num 150 last_change 143206 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd
removed_snaps [1~3]
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-40 2384.09058 root DTU
-42 0 region Lyngby
-41 2384.09058 region Risoe
2 sub-trees on level datacenter removed for brevity
-49 586.49347 datacenter ServerRoom
-55 586.49347 room SR-113
-65 64.33617 host ceph-04
84 hdd 8.90999 osd.84 up 1.00000 1.00000
145 hdd 8.90999 osd.145 up 1.00000 1.00000
156 hdd 8.90999 osd.156 up 1.00000 1.00000
168 hdd 8.90999 osd.168 up 1.00000 1.00000
181 hdd 8.90999 osd.181 up 0.95000 1.00000
288 hdd 10.69229 osd.288 up 1.00000 1.00000
243 rbd_data 1.74599 osd.243 up 1.00000 1.00000
254 rbd_data 1.74599 osd.254 up 1.00000 1.00000
256 rbd_data 1.74599 osd.256 up 1.00000 1.00000
286 rbd_data 1.74599 osd.286 up 1.00000 1.00000
287 rbd_data 1.74599 osd.287 up 1.00000 1.00000
48 rbd_meta 0.36400 osd.48 up 1.00000 1.00000
-67 64.33617 host ceph-05
74 hdd 8.90999 osd.74 up 1.00000 1.00000
144 hdd 8.90999 osd.144 up 1.00000 1.00000
157 hdd 8.90999 osd.157 up 0.84999 1.00000
169 hdd 8.90999 osd.169 up 0.95000 1.00000
180 hdd 8.90999 osd.180 up 0.89999 1.00000
289 hdd 10.69229 osd.289 up 1.00000 1.00000
240 rbd_data 1.74599 osd.240 up 1.00000 1.00000
251 rbd_data 1.74599 osd.251 up 1.00000 1.00000
255 rbd_data 1.74599 osd.255 up 1.00000 1.00000
284 rbd_data 1.74599 osd.284 up 1.00000 1.00000
285 rbd_data 1.74599 osd.285 up 1.00000 1.00000
49 rbd_meta 0.36400 osd.49 up 1.00000 1.00000
-69 64.70016 host ceph-06
60 hdd 8.90999 osd.60 up 1.00000 1.00000
146 hdd 8.90999 osd.146 up 1.00000 1.00000
158 hdd 8.90999 osd.158 up 0.95000 1.00000
170 hdd 8.90999 osd.170 up 0.89999 1.00000
182 hdd 8.90999 osd.182 up 1.00000 1.00000
290 hdd 10.69229 osd.290 up 1.00000 1.00000
244 rbd_data 1.74599 osd.244 up 1.00000 1.00000
253 rbd_data 1.74599 osd.253 up 1.00000 1.00000
257 rbd_data 1.74599 osd.257 up 1.00000 1.00000
282 rbd_data 1.74599 osd.282 up 1.00000 1.00000
283 rbd_data 1.74599 osd.283 up 1.00000 1.00000
40 rbd_meta 0.36400 osd.40 up 1.00000 1.00000
50 rbd_meta 0.36400 osd.50 up 1.00000 1.00000
-71 64.33617 host ceph-07
63 hdd 8.90999 osd.63 up 1.00000 1.00000
148 hdd 8.90999 osd.148 up 0.95000 1.00000
159 hdd 8.90999 osd.159 up 1.00000 1.00000
172 hdd 8.90999 osd.172 up 0.95000 1.00000
183 hdd 8.90999 osd.183 up 0.84999 1.00000
292 hdd 10.69229 osd.292 up 1.00000 1.00000
242 rbd_data 1.74599 osd.242 up 1.00000 1.00000
252 rbd_data 1.74599 osd.252 up 1.00000 1.00000
258 rbd_data 1.74599 osd.258 up 1.00000 1.00000
279 rbd_data 1.74599 osd.279 up 1.00000 1.00000
280 rbd_data 1.74599 osd.280 up 1.00000 1.00000
52 rbd_meta 0.36400 osd.52 up 1.00000 1.00000
-81 66.70416 host ceph-18
229 hdd 8.90999 osd.229 up 1.00000 1.00000
232 hdd 8.90999 osd.232 up 1.00000 1.00000
235 hdd 8.90999 osd.235 up 1.00000 1.00000
238 hdd 8.90999 osd.238 up 0.95000 1.00000
259 hdd 10.91399 osd.259 up 1.00000 1.00000
293 hdd 10.69229 osd.293 up 1.00000 1.00000
241 rbd_data 1.74599 osd.241 up 1.00000 1.00000
248 rbd_data 1.74599 osd.248 up 1.00000 1.00000
266 rbd_data 1.74599 osd.266 up 1.00000 1.00000
267 rbd_data 1.74599 osd.267 up 1.00000 1.00000
277 rbd_data 1.74599 osd.277 up 1.00000 1.00000
31 rbd_meta 0.36400 osd.31 up 1.00000 1.00000
41 rbd_meta 0.36400 osd.41 up 1.00000 1.00000
-94 66.34016 host ceph-19
231 hdd 8.90999 osd.231 up 1.00000 1.00000
233 hdd 8.90999 osd.233 up 0.95000 1.00000
236 hdd 8.90999 osd.236 up 1.00000 1.00000
239 hdd 8.90999 osd.239 up 1.00000 1.00000
263 hdd 10.91399 osd.263 up 1.00000 1.00000
295 hdd 10.69229 osd.295 up 1.00000 1.00000
261 rbd_data 1.74599 osd.261 up 1.00000 1.00000
262 rbd_data 1.74599 osd.262 up 1.00000 1.00000
268 rbd_data 1.74599 osd.268 up 1.00000 1.00000
269 rbd_data 1.74599 osd.269 up 1.00000 1.00000
275 rbd_data 1.74599 osd.275 up 1.00000 1.00000
43 rbd_meta 0.36400 osd.43 up 1.00000 1.00000
-4 66.70416 host ceph-20
228 hdd 8.90999 osd.228 up 1.00000 1.00000
230 hdd 8.90999 osd.230 up 1.00000 1.00000
234 hdd 8.90999 osd.234 up 0.95000 1.00000
237 hdd 8.90999 osd.237 up 1.00000 1.00000
260 hdd 10.91399 osd.260 up 1.00000 1.00000
296 hdd 10.69229 osd.296 up 1.00000 1.00000
245 rbd_data 1.74599 osd.245 up 1.00000 1.00000
270 rbd_data 1.74599 osd.270 up 1.00000 1.00000
271 rbd_data 1.74599 osd.271 up 1.00000 1.00000
272 rbd_data 1.74599 osd.272 up 1.00000 1.00000
273 rbd_data 1.74599 osd.273 up 1.00000 1.00000
28 rbd_meta 0.36400 osd.28 up 1.00000 1.00000
44 rbd_meta 0.36400 osd.44 up 1.00000 1.00000
-64 64.70016 host ceph-21
0 hdd 8.90999 osd.0 up 1.00000 1.00000
2 hdd 8.90999 osd.2 up 0.95000 1.00000
72 hdd 8.90999 osd.72 up 1.00000 1.00000
76 hdd 8.90999 osd.76 up 1.00000 1.00000
86 hdd 8.90999 osd.86 up 1.00000 1.00000
291 hdd 10.69229 osd.291 up 1.00000 1.00000
246 rbd_data 1.74599 osd.246 up 1.00000 1.00000
247 rbd_data 1.74599 osd.247 up 1.00000 1.00000
264 rbd_data 1.74599 osd.264 up 1.00000 1.00000
274 rbd_data 1.74599 osd.274 up 1.00000 1.00000
278 rbd_data 1.74599 osd.278 up 1.00000 1.00000
39 rbd_meta 0.36400 osd.39 up 1.00000 1.00000
53 rbd_meta 0.36400 osd.53 up 1.00000 1.00000
-66 64.33617 host ceph-22
1 hdd 8.90999 osd.1 up 1.00000 1.00000
3 hdd 8.90999 osd.3 up 1.00000 1.00000
73 hdd 8.90999 osd.73 up 1.00000 1.00000
85 hdd 8.90999 osd.85 up 0.95000 1.00000
87 hdd 8.90999 osd.87 up 1.00000 1.00000
294 hdd 10.69229 osd.294 up 1.00000 1.00000
249 rbd_data 1.74599 osd.249 up 1.00000 1.00000
250 rbd_data 1.74599 osd.250 up 1.00000 1.00000
265 rbd_data 1.74599 osd.265 up 1.00000 1.00000
276 rbd_data 1.74599 osd.276 up 1.00000 1.00000
281 rbd_data 1.74599 osd.281 up 1.00000 1.00000
51 rbd_meta 0.36400 osd.51 up 1.00000 1.00000
# ceph osd crush rule dump
# crush rules outside tree under "datacenter ServerRoom" removed for brevity
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 5,
"rule_name": "sr-rbd-data-one",
"ruleset": 5,
"type": 3,
"min_size": 3,
"max_size": 8,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 50
},
{
"op": "set_choose_tries",
"num": 1000
},
{
"op": "take",
"item": -185,
"item_name": "ServerRoom~rbd_data"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 9,
"rule_name": "sr-rbd-data-one-hdd",
"ruleset": 9,
"type": 3,
"min_size": 3,
"max_size": 8,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -53,
"item_name": "ServerRoom~hdd"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 03 August 2020 15:40
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
Can you post the output of these commands:
ceph osd pool ls detail
ceph osd tree
ceph osd crush rule dump
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, August 3, 2020 9:19 AM
To: ceph-users <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: Ceph does not recover from OSD restart
After moving the newly added OSDs out of the crush tree and back in again, I get to exactly what I want to see:
cluster:
id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
health: HEALTH_WARN
norebalance,norecover flag(s) set
53030026/1492404361 objects misplaced (3.553%)
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs
flags norebalance,norecover
data:
pools: 11 pools, 3215 pgs
objects: 177.3 M objects, 489 TiB
usage: 696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 53030026/1492404361 objects misplaced (3.553%)
2902 active+clean
299 active+remapped+backfill_wait
8 active+remapped+backfilling
5 active+clean+scrubbing+deep
1 active+clean+snaptrim
io:
client: 69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr
Why does a cluster with remapped PGs not survive OSD restarts without losing track of objects?
Why is it not finding the objects by itself?
A power outage of 3 hosts will halt everything for no reason until manual intervention. How can I avoid this problem?
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 03 August 2020 15:03:05
To: ceph-users
Subject: [ceph-users] Ceph does not recover from OSD restart
Dear cephers,
I have a serious issue with degraded objects after an OSD restart. The cluster was in a state of re-balancing after adding disks to each host. Before restart I had "X/Y objects misplaced". Apart from that, health was OK. I now restarted all OSDs of one host and the cluster does not recover from that:
cluster:
id: xxx
health: HEALTH_ERR
45813194/1492348700 objects misplaced (3.070%)
Degraded data redundancy: 6798138/1492348700 objects degraded (0.456%), 85 pgs degraded, 86 pgs undersized
Degraded data redundancy (low space): 17 pgs backfill_toofull
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs
data:
pools: 11 pools, 3215 pgs
objects: 177.3 M objects, 489 TiB
usage: 696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 6798138/1492348700 objects degraded (0.456%)
45813194/1492348700 objects misplaced (3.070%)
2903 active+clean
209 active+remapped+backfill_wait
73 active+undersized+degraded+remapped+backfill_wait
9 active+remapped+backfill_wait+backfill_toofull
8 active+undersized+degraded+remapped+backfill_wait+backfill_toofull
4 active+undersized+degraded+remapped+backfilling
3 active+remapped+backfilling
3 active+clean+scrubbing+deep
1 active+clean+scrubbing
1 active+undersized+remapped+backfilling
1 active+clean+snaptrim
io:
client: 47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
recovery: 195 MiB/s, 48 objects/s
After restarting, there should be only a small number of degraded objects: the ones that received writes during the OSD restart. What I see, however, is that the cluster seems to have lost track of a huge number of objects; the 0.456% degraded correspond to 1-2 days' worth of I/O. I did reboots before and saw only a few thousand objects degraded at most. The output of ceph health detail shows a lot of lines like these:
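As a sanity check, the percentages in the status output can be reproduced from the raw counts (the totals are apparently object instances, i.e. copies/shards, which is why they exceed the 177.3 M logical objects):

```python
# Reproduce the percentages from the "ceph status" output above.
degraded, misplaced, total = 6_798_138, 45_813_194, 1_492_348_700

pct_degraded = 100 * degraded / total
pct_misplaced = 100 * misplaced / total

assert round(pct_degraded, 3) == 0.456   # matches "0.456%" reported
assert round(pct_misplaced, 3) == 3.070  # matches "3.070%" reported
```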
[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized
pg 11.9 is stuck undersized for 815.188981, current state active+undersized+degraded+remapped+backfill_wait, last acting [60,148,2147483647,263,76,230,87,169]
8...9
pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting [159,60,180,263,237,3,2147483647,72]
pg 11.4a is stuck undersized for 851.162862, current state active+undersized+degraded+remapped+backfill_wait, last acting [182,233,87,228,2,180,63,2147483647]
[...]
pg 11.22e is stuck undersized for 851.162402, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Degraded data redundancy (low space): 17 pgs backfill_toofull
pg 11.24 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [230,259,2147483647,1,144,159,233,146]
[...]
pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting [84,259,183,170,85,234,233,2]
pg 11.225 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [236,183,1,2147483647,2147483647,169,229,230]
pg 11.22e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [234,183,239,2147483647,170,229,1,86]
POOL_NEAR_FULL 1 pools nearfull
pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)
It looks like a lot of PGs are not receiving their complete crush map placement, as if peering is incomplete. This is a serious issue: it looks like the cluster will see a total storage outage if just 2 more hosts reboot - without actually having lost any storage. The pool in question is a 6+2 EC pool.
What is going on here? Why are the PG-maps not restored to their values from before the OSD reboot? The degraded PGs should receive the missing OSD IDs; everything is up exactly as it was before the reboot.
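One detail worth noting: the 2147483647 entries in the acting sets above are CRUSH's "none" marker (0x7fffffff), meaning no OSD is currently mapped to that EC shard. A small sketch (acting sets copied from the health detail above) counting the unmapped shards per PG:

```python
# 2147483647 (0x7fffffff) in an acting set marks an EC shard with no OSD
# assigned. Counting these shows how far short of size=8 (k=6, m=2) the
# placement is for each PG.
CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647

def missing_shards(acting):
    return sum(1 for osd in acting if osd == CRUSH_ITEM_NONE)

# acting sets copied from the health detail output
pg_11_9   = [60, 148, 2147483647, 263, 76, 230, 87, 169]
pg_11_225 = [236, 183, 1, 2147483647, 2147483647, 169, 229, 230]

assert missing_shards(pg_11_9) == 1    # 7 shards left, still >= min_size=6
assert missing_shards(pg_11_225) == 2  # 6 shards left: at min_size already
```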
Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
260 hdd 10.91399 osd.260 up 1.00000 1.00000
296 hdd 10.69229 osd.296 up 1.00000 1.00000
245 rbd_data 1.74599 osd.245 up 1.00000 1.00000
270 rbd_data 1.74599 osd.270 up 1.00000 1.00000
271 rbd_data 1.74599 osd.271 up 1.00000 1.00000
272 rbd_data 1.74599 osd.272 up 1.00000 1.00000
273 rbd_data 1.74599 osd.273 up 1.00000 1.00000
28 rbd_meta 0.36400 osd.28 up 1.00000 1.00000
44 rbd_meta 0.36400 osd.44 up 1.00000 1.00000
-64 64.70016 host ceph-21
0 hdd 8.90999 osd.0 up 1.00000 1.00000
2 hdd 8.90999 osd.2 up 0.95000 1.00000
72 hdd 8.90999 osd.72 up 1.00000 1.00000
76 hdd 8.90999 osd.76 up 1.00000 1.00000
86 hdd 8.90999 osd.86 up 1.00000 1.00000
291 hdd 10.69229 osd.291 up 1.00000 1.00000
246 rbd_data 1.74599 osd.246 up 1.00000 1.00000
247 rbd_data 1.74599 osd.247 up 1.00000 1.00000
264 rbd_data 1.74599 osd.264 up 1.00000 1.00000
274 rbd_data 1.74599 osd.274 up 1.00000 1.00000
278 rbd_data 1.74599 osd.278 up 1.00000 1.00000
39 rbd_meta 0.36400 osd.39 up 1.00000 1.00000
53 rbd_meta 0.36400 osd.53 up 1.00000 1.00000
-66 64.33617 host ceph-22
1 hdd 8.90999 osd.1 up 1.00000 1.00000
3 hdd 8.90999 osd.3 up 1.00000 1.00000
73 hdd 8.90999 osd.73 up 1.00000 1.00000
85 hdd 8.90999 osd.85 up 0.95000 1.00000
87 hdd 8.90999 osd.87 up 1.00000 1.00000
294 hdd 10.69229 osd.294 up 1.00000 1.00000
249 rbd_data 1.74599 osd.249 up 1.00000 1.00000
250 rbd_data 1.74599 osd.250 up 1.00000 1.00000
265 rbd_data 1.74599 osd.265 up 1.00000 1.00000
276 rbd_data 1.74599 osd.276 up 1.00000 1.00000
281 rbd_data 1.74599 osd.281 up 1.00000 1.00000
51 rbd_meta 0.36400 osd.51 up 1.00000 1.00000
# ceph osd crush rule dump
# crush rules outside tree under "datacenter ServerRoom" removed for brevity
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 5,
"rule_name": "sr-rbd-data-one",
"ruleset": 5,
"type": 3,
"min_size": 3,
"max_size": 8,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 50
},
{
"op": "set_choose_tries",
"num": 1000
},
{
"op": "take",
"item": -185,
"item_name": "ServerRoom~rbd_data"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 9,
"rule_name": "sr-rbd-data-one-hdd",
"ruleset": 9,
"type": 3,
"min_size": 3,
"max_size": 8,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -53,
"item_name": "ServerRoom~hdd"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eric Smith <Eric.Smith(a)vecima.com>
Sent: 03 August 2020 15:40
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart
Can you post the output of these commands:
ceph osd pool ls detail
ceph osd tree
ceph osd crush rule dump
-----Original Message-----
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, August 3, 2020 9:19 AM
To: ceph-users <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: Ceph does not recover from OSD restart
After moving the newly added OSDs out of the crush tree and back in again, I get to exactly what I want to see:
cluster:
id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
health: HEALTH_WARN
norebalance,norecover flag(s) set
53030026/1492404361 objects misplaced (3.553%)
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs
flags norebalance,norecover
data:
pools: 11 pools, 3215 pgs
objects: 177.3 M objects, 489 TiB
usage: 696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 53030026/1492404361 objects misplaced (3.553%)
2902 active+clean
299 active+remapped+backfill_wait
8 active+remapped+backfilling
5 active+clean+scrubbing+deep
1 active+clean+snaptrim
io:
client: 69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr
Why does a cluster with remapped PGs not survive OSD restarts without losing track of objects?
Why is it not finding the objects by itself?
A power outage of 3 hosts will halt everything for no reason until manual intervention. How can I avoid this problem?
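For planned restarts, one common mitigation (a sketch of standard Ceph operational practice, not something specific to this cluster) is to set the `noout` flag before taking a host down, so its OSDs are not marked out and PG mappings stay stable while they are gone:

```shell
# Before rebooting a host: prevent its OSDs from being marked out,
# which keeps CRUSH mappings stable during the restart window.
ceph osd set noout

# ... reboot the host and wait for all of its OSDs to report up ...

# After the OSDs are back up, clear the flag so normal out-marking resumes.
ceph osd unset noout
```

This does not help with unplanned power loss, but it avoids triggering remapping churn during maintenance.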
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 03 August 2020 15:03:05
To: ceph-users
Subject: [ceph-users] Ceph does not recover from OSD restart
Dear cephers,
I have a serious issue with degraded objects after an OSD restart. The cluster was in a state of re-balancing after adding disks to each host. Before restart I had "X/Y objects misplaced". Apart from that, health was OK. I now restarted all OSDs of one host and the cluster does not recover from that:
cluster:
id: xxx
health: HEALTH_ERR
45813194/1492348700 objects misplaced (3.070%)
Degraded data redundancy: 6798138/1492348700 objects degraded (0.456%), 85 pgs degraded, 86 pgs undersized
Degraded data redundancy (low space): 17 pgs backfill_toofull
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs
data:
pools: 11 pools, 3215 pgs
objects: 177.3 M objects, 489 TiB
usage: 696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 6798138/1492348700 objects degraded (0.456%)
45813194/1492348700 objects misplaced (3.070%)
2903 active+clean
209 active+remapped+backfill_wait
73 active+undersized+degraded+remapped+backfill_wait
9 active+remapped+backfill_wait+backfill_toofull
8 active+undersized+degraded+remapped+backfill_wait+backfill_toofull
4 active+undersized+degraded+remapped+backfilling
3 active+remapped+backfilling
3 active+clean+scrubbing+deep
1 active+clean+scrubbing
1 active+undersized+remapped+backfilling
1 active+clean+snaptrim
io:
client: 47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
recovery: 195 MiB/s, 48 objects/s
After restarting, there should only be a small number of degraded objects: the ones that received writes during the OSD restart. What I see, however, is that the cluster seems to have lost track of a huge number of objects; the 0.456% degraded amounts to 1-2 days' worth of I/O. I did reboots before and saw only a few thousand objects degraded at most. The output of ceph health detail shows a lot of lines like these:
[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized
pg 11.9 is stuck undersized for 815.188981, current state active+undersized+degraded+remapped+backfill_wait, last acting [60,148,2147483647,263,76,230,87,169]
8...9
pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting [159,60,180,263,237,3,2147483647,72]
pg 11.4a is stuck undersized for 851.162862, current state active+undersized+degraded+remapped+backfill_wait, last acting [182,233,87,228,2,180,63,2147483647]
[...]
pg 11.22e is stuck undersized for 851.162402, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Degraded data redundancy (low space): 17 pgs backfill_toofull
pg 11.24 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [230,259,2147483647,1,144,159,233,146]
[...]
pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting [84,259,183,170,85,234,233,2]
pg 11.225 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [236,183,1,2147483647,2147483647,169,229,230]
pg 11.22e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [234,183,239,2147483647,170,229,1,86]
POOL_NEAR_FULL 1 pools nearfull
pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)
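A note on the acting sets above: the value 2147483647 is not an OSD id. It is CRUSH's ITEM_NONE sentinel (the maximum signed 32-bit integer, 0x7fffffff), printed when CRUSH could not map any OSD to that shard of the erasure-coded PG. A quick check against the `pg 11.9` acting set quoted above:

```python
# CRUSH prints ITEM_NONE (0x7fffffff) for a slot it could not fill.
CRUSH_ITEM_NONE = 0x7FFFFFFF
print(CRUSH_ITEM_NONE)  # 2147483647

# Acting set of pg 11.9 from the health detail output above.
acting = [60, 148, 2147483647, 263, 76, 230, 87, 169]
missing = [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]
print(missing)  # [2] -> shard 2 of this 6+2 EC PG has no OSD assigned
```

So each `2147483647` entry corresponds to one missing shard, which is why these PGs report as undersized.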
It looks like a lot of PGs are not receiving their complete CRUSH placement, as if peering is incomplete. This is a serious issue: it looks like the cluster will see a total storage outage if just 2 more hosts reboot, without actually having lost any storage. The pool in question is a 6+2 EC pool.
What is going on here? Why are the PG maps not restored to their values from before the OSD reboot? The degraded PGs should receive the missing OSD IDs; everything is up exactly as it was before the reboot.
Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hello everyone,
I am running Octopus 15.2.4 and a couple of days ago noticed an ERROR state on the cluster with the following message:
Module 'crash' has failed: dictionary changed size during iteration
I couldn't find much info on this error. I've tried restarting the mon servers, which had no effect. How do I fix the error?
Many thanks
Andrei
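One thing worth noting on the question above: `crash` is a ceph-mgr module (always-on since Nautilus), not a mon service, so restarting the mons will not reset it. A hedged sketch of the usual approach, which is to restart the module by failing over the active mgr:

```shell
# The crash module runs inside ceph-mgr, so act on the mgr, not the mons.
ceph mgr stat              # shows which mgr is currently active

# Fail over to a standby mgr; all mgr modules restart cleanly on the new
# active daemon. Substitute the active mgr name reported above.
ceph mgr fail <active-mgr>

# Afterwards, confirm the module error has cleared.
ceph health detail
```

If no standby mgr is available, restarting the ceph-mgr daemon itself (e.g. via systemd on its host) has the same effect.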