Folks,
I am running into a very strange issue with a brand-new Ceph cluster during initial
testing. The cluster consists of 12 nodes: four of them have SSDs only, the other
eight have a mixture of SSDs and HDDs. The latter nodes are configured so that three
or four HDDs share one SSD for their block.db.
Ceph version is Nautilus.
When writing to the cluster, clients will, at regular intervals, run into I/O stalls
(i.e. writes take up to 25 minutes to complete). Deleting RBD images often takes
forever as well. After several weeks of debugging, what I can say from looking at the
log files is that what appears to take a lot of time is writing to the OSDs:
"time": "2020-05-20 10:52:23.211006",
"event": "reached_pg"
},
{
"time": "2020-05-20 10:52:23.211047",
"event": "waiting for ondisk"
},
{
"time": "2020-05-20 10:53:35.369081",
"event": "done"
}
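For anyone who wants to check their own ops dumps the same way: the events above are in the format reported per op by the OSD admin socket (dump_historic_ops-style output), and diffing consecutive event timestamps shows where each op spends its time. A small sketch (the helper name and the inlined event list are just illustrations of the excerpt above):

```python
import json
from datetime import datetime

# Excerpt of one op's "events" array, as in the log snippet above.
events = [
    {"time": "2020-05-20 10:52:23.211006", "event": "reached_pg"},
    {"time": "2020-05-20 10:52:23.211047", "event": "waiting for ondisk"},
    {"time": "2020-05-20 10:53:35.369081", "event": "done"},
]

def stage_latencies(events):
    """Return (from_event, to_event, seconds) for each consecutive event pair."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    ts = [datetime.strptime(e["time"], fmt) for e in events]
    return [
        (events[i]["event"], events[i + 1]["event"],
         (ts[i + 1] - ts[i]).total_seconds())
        for i in range(len(events) - 1)
    ]

for src, dst, secs in stage_latencies(events):
    print(f"{src} -> {dst}: {secs:.3f}s")
```

For this op it reports roughly 72 seconds between "waiting for ondisk" and "done", which is where the stall shows up.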
But these machines are idle in terms of I/O: there is almost no I/O happening at all
according to sysstat.
I am slowly growing a bit desperate over this, and hence I wonder whether anybody has
ever seen a similar issue? Or does anyone have tips on where to continue debugging?
Servers are from Dell with PERC controllers in HBA mode.
The primary purpose of this Ceph cluster is to serve as backing storage for OpenStack,
and so far I have not been able to reproduce the issue on the SSD-only nodes.
Best regards
Martin