Dear Michael,
> Can you create a test pool with pg_num=pgp_num=1
> and see if the PG gets an OSD mapping?
I meant here with crush rule replicated_host_nvme; sorry, I forgot to mention that.
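A sketch of such a test (the pool name test-nvme is just a placeholder; remember to
delete the pool afterwards):
# ceph osd pool create test-nvme 1 1 replicated replicated_host_nvme
# ceph pg ls-by-pool test-nvme
The second command should show whether the single PG gets an acting set of OSDs.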
Yes, the OSD was still out when the previous health
report was created.
Hmm, this is odd. If this is correct, then osd.41 did report a slow op even though it
was out of the cluster:
From https://pastebin.com/3G3ij9ui:
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons [osd.0,osd.41] have
slow ops.
Not sure what to make of that. It looks almost like you have a ghost osd.41.
I think (some of) the slow ops you are seeing are directed at the health_metrics pool and
can be ignored. If it is too annoying, you could try to find out who runs the client with
ID client.7524484 and disable it. It might be an MGR module.
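One hedged starting point, in case it is an MGR module: the health metrics pool is
usually managed by the devicehealth MGR module. You can list the enabled modules with
# ceph mgr module ls
and, if the devicehealth module turns out to be the source, disable it temporarily to
see whether the slow-op warnings stop.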
Looking at the data you provided and also some older threads of yours
(https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I am starting to consider
that we are looking at the fallout of a past admin operation. A possibility is that an
upmap for PG 1.0 exists that conflicts with the crush rule replicated_host_nvme and,
hence, prevents the assignment of OSDs to PG 1.0. For example, the upmap specifies HDDs,
but the crush rule requires NVMes. The result is an empty set.
I couldn't really find a simple command to list upmaps. The only non-destructive way
seems to be to extract the osdmap and create a clean-up command file. The cleanup file
should contain a command for every PG with an upmap. To check this, you can execute (see
also https://docs.ceph.com/en/latest/man/8/osdmaptool/):
# ceph osd getmap > osd.map
# osdmaptool osd.map --upmap-cleanup cleanup.cmd
If you do this, could you please post as usual the contents of cleanup.cmd?
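As a quicker, rough cross-check (assuming your release prints upmap entries in the osdmap
dump), upmaps also show up as pg_upmap/pg_upmap_items lines in
# ceph osd dump | grep -i upmap
An entry for PG 1.0 there would support the theory above.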
Also, with the OSD map of your cluster, you can simulate certain admin operations and
check resulting PG mappings for pools and other things without having to touch the
cluster; see
https://docs.ceph.com/en/latest/man/8/osdmaptool/.
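For example, to see which OSDs PG 1.0 would be mapped to with the current map (a sketch
based on the osdmaptool man page):
# osdmaptool osd.map --test-map-pg 1.0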
To dig a little bit deeper, could you please post as usual the output of:
- ceph pg 1.0 query
- ceph pg 7.39d query
It would also be helpful if you could post the decoded crush map. You can get the map as
a text file as follows:
# ceph osd getcrushmap -o crush-orig.bin
# crushtool -d crush-orig.bin -o crush.txt
and post the contents of file crush.txt.
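With the binary crush map you can also test the rule directly (a sketch; replace
<rule-id> with the id of replicated_host_nvme, which you can look up in crush.txt):
# crushtool -i crush-orig.bin --test --rule <rule-id> --num-rep 3 --show-mappings
If the rule cannot produce enough OSDs, adding --show-bad-mappings will point at the
failing inputs.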
Did the slow MDS request complete by now?
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Contents of previous messages removed.