Hi Michael,
some quick thoughts.
That you can create a pool with 1 PG is a good sign; the crush rule is OK. That pg query
says it doesn't have PG 1.0 points in the right direction: there is an inconsistency
in the cluster. This is also indicated by the fact that no upmaps seem to exist (the
clean-up script was empty). With the osd map you extracted, you can check what the osd
map believes the mappings of the PGs in pool 1 to be, or whether it also claims the PG
does not exist:
# osdmaptool osd.map --test-map-pgs-dump --pool 1
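For a healthy pool you should get one line per PG with a non-empty set of OSDs, roughly like
this (the exact output format varies between versions and the OSD numbers here are made up):
1.0 [5,12,23] 5
If instead pool 1 is missing from the dump, or PG 1.0 shows an empty set ([] with primary -1),
then the osd map itself already lacks a valid mapping, which would match what pg query reports.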
creation and you are not the only one having problems with this particular pool:
https://www.spinics.net/lists/ceph-users/msg52665.html . Sounds a lot like a bug in
cephadm.
In principle, deleting and recreating the health metrics pool looks like a way forward.
Please look at the procedure described in the thread linked above. Deletion of the pool
there led to some crashes, and surgery on some OSDs was necessary.
However, in your case it might just work, because you redeployed the OSDs in question
already - if I remember correctly.
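For completeness, a rough sketch of what the delete/recreate could look like (standard ceph CLI;
pool name, PG count and crush rule taken from your cluster, but please double-check before
running anything - pool deletion has to be enabled explicitly on the mons first):
# ceph config set mon mon_allow_pool_delete true
# ceph osd pool delete device_health_metrics device_health_metrics --yes-i-really-really-mean-it
# ceph osd pool create device_health_metrics 1 1 replicated replicated_host_nvme
# ceph config set mon mon_allow_pool_delete false
It is also possible that the devicehealth MGR module simply recreates the pool by itself once it
is gone; in that case you would only need to move it onto the right crush rule afterwards with
# ceph osd pool set device_health_metrics crush_rule replicated_host_nvme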
In order to do so cleanly, however, you will probably want to shut down all clients
accessing this pool. Note that clients accessing the health metrics pool are not FS
clients, so the MDS cannot tell you anything about them. The only command that seems to
list all clients is
# ceph daemon mon.MON-ID sessions
which needs to be executed on every mon host. On the other hand, you could also just go
ahead and see if something crashes (an MGR module probably) or disable all MGR modules
during this recovery attempt. I found some info that cephadm creates this pool and starts
an MGR module.
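To see which modules are active and to pause the device health scraping for the duration of the
recovery, something like this should do (devicehealth may be an always-on module that cannot be
disabled outright, hence the monitoring switch):
# ceph mgr module ls
# ceph device monitoring off
and afterwards
# ceph device monitoring on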
If you google "device_health_metric pool" you should find descriptions of
similar cases. It looks solvable.
I will look at the incomplete PG issue. I hope this is just some PG tuning. At least pg
query didn't complain :)
The stuck MDS request could be an attempt to access an unfound object. It should be
possible to locate the fs client and find out what it was trying to do. I see this
sometimes when people are too impatient. They manage to trigger a race condition and an
MDS operation gets stuck (there are MDS bugs and in my case it was an ls command that got
stuck). Usually, evicting the client temporarily solves the issue (but tell the user :).
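If it comes to that, evicting is a one-liner once you know the client ID from client ls (the ID
below is a placeholder):
# ceph tell mds.ceph1 client ls
# ceph tell mds.ceph1 client evict id=<client-id>
Note that an evicted client gets blacklisted for a while by default, so the user may have to
remount or wait until the blacklist entry expires.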
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Michael Thomas <wart(a)caltech.edu>
Sent: 20 October 2020 23:48:36
To: Frank Schilder; ceph-users(a)ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects
On 10/20/20 1:18 PM, Frank Schilder wrote:
Dear Michael,
> Can you create a test pool with
pg_num=pgp_num=1 and see if the PG gets an OSD mapping?
I meant here with crush rule replicated_host_nvme. Sorry, forgot.
Seems to have worked fine:
https://pastebin.com/PFgDE4J1
Yes, the OSD
was still out when the previous health report was created.
Hmm, this is odd. If this is correct, then it did report a slow op even though it was out
of the cluster:
from
https://pastebin.com/3G3ij9ui:
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons [osd.0,osd.41] have
slow ops.
Not sure what to make of that. It looks almost like you have a ghost osd.41.
I think (some of) the slow ops you are seeing are directed to the health_metrics pool and
can be ignored. If it is too annoying, you could try to find out who runs the client with
ID client.7524484 and disable it. Might be an MGR module.
I'm also pretty certain that the slow ops are related to the health
metrics pool, which is why I've been ignoring them.
What I'm not sure about is whether re-creating the device_health_metrics
pool will cause any problems in the ceph cluster.
Looking at the data you provided and also some older
threads of yours (
https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I am starting to
consider that we are looking at the fall-out of a past admin operation. A possibility
is that an upmap for PG 1.0 exists that conflicts with the crush rule
replicated_host_nvme and, hence, prevents the assignment of OSDs to PG 1.0. For example,
the upmap specifies HDDs, but the crush rule requires NVMes. The result is an empty set.
So far I've been unable to locate the client with the ID 7524484. It's
not showing up in the manager dashboard -> Filesystems page, nor in the
output of 'ceph tell mds.ceph1 client ls'.
I'm digging through the compressed logs for the past week to see if I can
find the culprit.
I couldn't really find a simple command to list
up-maps. The only non-destructive way seems to be to extract the osdmap and create a
clean-up command file. The cleanup file should contain a command for every PG with an
upmap. To check this, you can execute (see also
https://docs.ceph.com/en/latest/man/8/osdmaptool/)
# ceph osd getmap > osd.map
# osdmaptool osd.map --upmap-cleanup cleanup.cmd
If you do this, could you please post as usual the contents of cleanup.cmd?
It was empty:
[root@ceph1 ~]# ceph osd getmap > osd.map
got osdmap epoch 52833
[root@ceph1 ~]# osdmaptool osd.map --upmap-cleanup cleanup.cmd
osdmaptool: osdmap file 'osd.map'
writing upmap command output to: cleanup.cmd
checking for upmap cleanups
[root@ceph1 ~]# wc cleanup.cmd
0 0 0 cleanup.cmd
Also, with the OSD map of your cluster, you can
simulate certain admin operations and check resulting PG mappings for pools and other
things without having to touch the cluster; see
https://docs.ceph.com/en/latest/man/8/osdmaptool/.
To dig a little bit deeper, could you please post as usual the output of:
- ceph pg 1.0 query
- ceph pg 7.39d query
Oddly, it claims that it doesn't have pgid 1.0.
https://pastebin.com/pHh33Dq7
It would also be helpful if you could post the decoded
crush map. You can get the map as a txt-file as follows:
# ceph osd getcrushmap -o crush-orig.bin
# crushtool -d crush-orig.bin -o crush.txt
and post the contents of file crush.txt.
https://pastebin.com/EtEGpWy3
Did the slow MDS request complete by now?
Nope.
--Mike