Peter Eisch
Senior Site Reliability Engineer
T1.612.445.5135
virginpulse.com
On 3/13/20, 11:47 AM, "Wido den Hollander" <wido(a)42on.com> wrote:
On 3/13/20 5:44 PM, Peter Eisch wrote:
On 3/13/20, 11:38 AM, "Wido den Hollander"
<wido(a)42on.com> wrote:
On 3/13/20 4:09 PM, Peter Eisch wrote:
Full cluster is 14.2.8.
I had some OSDs drop overnight, which now results in 4 inactive PGs. The
pools had three participating OSDs (2 SSD, 1 SAS). In each pool at least 1
SSD and 1 SAS OSD is working without issue. I've run 'ceph pg repair <pg>'
but it doesn't seem to make any changes.
PG_AVAILABILITY Reduced data availability: 4 pgs inactive, 4 pgs
incomplete
pg 10.2e is incomplete, acting [59,67]
pg 10.c3 is incomplete, acting [62,105]
pg 10.f3 is incomplete, acting [62,59]
pg 10.1d5 is incomplete, acting [87,106]
Using `ceph pg <pg> query` I can see, for each PG, the OSDs which
failed. Respectively they are:
pg 10.2e participants: 59, 68, 77, 143
pg 10.c3 participants: 60, 62, 85, 102, 105, 106
pg 10.f3 participants: 59, 64, 75, 107
pg 10.1d5 participants: 64, 77, 87, 106
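A loop over the affected PGs can pull this information out of the peering state directly. This is a sketch, not from the original mail: it assumes `jq` is installed on the monitor host, and that the peering entry in `ceph pg query` output carries a `down_osds_we_would_probe` field, as recent Ceph releases report.

```shell
# For each stuck PG, ask peering which down OSDs it still wants to probe.
# 'down_osds_we_would_probe' is assumed present in recovery_state; verify
# against your release's 'ceph pg <pg> query' output first.
for pg in 10.2e 10.c3 10.f3 10.1d5; do
  echo "$pg:"
  ceph pg "$pg" query | jq '.recovery_state[] | .down_osds_we_would_probe? // empty'
done
```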
The OSDs which are now down/out, and which have been removed from the
CRUSH map and had their auth removed, are:
62, 64, 68
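For reference, the usual Nautilus-era removal sequence looks roughly like the following. This is my reconstruction of the likely steps, not commands quoted from the thread:

```shell
# Hypothetical reconstruction of the removal steps for each failed OSD.
for id in 62 64 68; do
  ceph osd out "$id"               # stop mapping data to the OSD
  ceph osd crush remove "osd.$id"  # drop it from the CRUSH map
  ceph auth del "osd.$id"          # delete its cephx key
  ceph osd rm "$id"                # remove it from the OSD map
done
```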
Of course I have lots of reports of slow OSDs now from OSDs worried
about the inactive PGs.
How do I properly kick these PGs to have them drop their usage of the
OSDs which no longer exist?
You don't, because those OSDs hold the data you need.
Why did you remove them from the CRUSH map, OSD map and auth? You need
these to rebuild the PGs.
Wido
The drives failed at a hardware level. I've replaced OSDs this way, after
either planned migration or failure, in previous instances without issue.
I didn't realize all the replicated copies were on just one drive in
each pool.
> What should my actions have been in this case?
Try to get those OSDs online again. Maybe try a rescue of the disks or
see whether the OSDs can be made to start.
A tool like ddrescue can help in getting such a thing done.
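A sketch of that rescue route might look like the following. The device paths and OSD numbers are placeholders, `ddrescue` is the GNU tool, and the export/import step assumes `ceph-objectstore-tool` from the same release as the cluster:

```shell
# Clone the failing disk onto a healthy one, retrying bad sectors 3 times;
# /dev/sdX (failed) and /dev/sdY (replacement) are placeholders.
ddrescue -f -r3 /dev/sdX /dev/sdY /root/sdX.map

# If the cloned OSD still won't start, the needed PGs can sometimes be
# exported from its store and imported into a healthy, stopped OSD
# (data paths and PG id here are placeholders):
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-62 \
    --pgid 10.2e --op export --file /root/10.2e.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-59 \
    --op import --file /root/10.2e.export
```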
pool 10 volumes' replicated size 2 min_size 1 crush_rule 1 object_hash
rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 47570
lfor 0/0/40781 flags hashpspool,selfmanaged_snaps stripe_width 0
application rbd
I see you use 2x replication with min_size=1, that's dangerous and can
easily lead to data loss.
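For later reference, moving the pool to a safer policy is two settings (pool name `volumes` taken from the dump above):

```shell
# Raise replication so a single-disk failure no longer risks data loss.
ceph osd pool set volumes size 3
ceph osd pool set volumes min_size 2
```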
I wouldn't say it's impossible to get the data back, but something like
this can take a while (a lot of hours) to be brought back online.
The three NVMe drives which failed within 10 minutes of each other spent the last day at
Kroll/OnTrack for recovery. They can't do anything with them. Apparently they fell
victim to a bug in the NVMe firmware which had been fixed, but the update never got
applied. (It might be worth noting that three more NVMe drives died within 48 hours
before I could get them all 'out', but they staggered themselves so things could
backfill.)
I'm willing to accept the data loss at this point for these four PGs. What can I do
to zero these out, or even just tag them as complete, so we can get our filesystems back
into service (and do due diligence with fsck/chkdsk/etc.)?
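One route that exists for exactly this (destructive: the PG is recreated empty, so its objects are gone for good) is `ceph osd force-create-pg`, available in Nautilus. A sketch, not advice from the thread; verify the flag name against `ceph osd force-create-pg -h` on your release first:

```shell
# Recreate each stuck PG as empty -- PERMANENT DATA LOSS for that PG.
# The safety flag below is assumed from the Nautilus CLI; verify first.
for pg in 10.2e 10.c3 10.f3 10.1d5; do
  ceph osd force-create-pg "$pg" --yes-i-really-mean-it
done
```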
[@cephmon]# ceph pg ls incomplete
PG      OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES        OMAP_BYTES*  OMAP_KEYS*  LOG   STATE       SINCE  VERSION         REPORTED        UP           ACTING       SCRUB_STAMP                 DEEP_SCRUB_STAMP
10.c3   0        0         0          0        0            0            0           0     incomplete  16s    0'0             67570:16611     [84,119]p84  [84,119]p84  2020-03-13 00:06:12.356259  2020-03-11 13:04:17.124901
10.1d5  13882    0         0          0        58201653248  0            0           3063  incomplete  16s    48617'19136670  67570:76106823  [87,77]p87   [87,77]p87   2020-03-12 21:00:43.540659  2020-03-12 21:00:43.540659
* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
[@cephmon]# ceph pg ls down
PG     OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES       OMAP_BYTES*  OMAP_KEYS*  LOG   STATE  SINCE  VERSION         REPORTED     UP           ACTING       SCRUB_STAMP                 DEEP_SCRUB_STAMP
10.2e  88       0         0          0        373293056   0            0           3001  down   20s    49315'16087499  67570:33657  [77,143]p77  [77,143]p77  2020-03-12 07:55:04.030384  2020-03-05 10:26:43.183563
10.f3  244      0         0          0        1027604480  0            0           3015  down   20s    48741'18343076  67570:34213  [75,139]p75  [75,139]p75  2020-03-13 02:32:20.026885  2020-03-13 02:32:20.026885
* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
[@cephmon]#
Again, 62, 64 and 68 were the OSDs which died, and it's clearly now trying to use
others. And yes, I can bump the size to 3 going forward, but we need to get past these
guys first.
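An alternative to recreating the PGs empty, if a surviving acting primary still holds partial data, is marking the PG complete in place with `ceph-objectstore-tool` (`--op mark-complete` is a real op in Nautilus's tool). The OSD must be stopped first; the PG/OSD pairing below is taken from the listing above, and the data path is the default layout, so treat this as a sketch to verify, not a prescribed fix:

```shell
# On the node hosting the acting primary of 10.1d5 (osd.87 per the listing):
systemctl stop ceph-osd@87

# Mark the PG complete in the stopped OSD's store; objects the PG is
# missing stay lost, but peering can finish and the PG goes active.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-87 \
    --pgid 10.1d5 --op mark-complete

systemctl start ceph-osd@87
```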
What should be my next step?
Thanks!