Just shooting in the dark here, but you may be affected by a similar issue I
had a while back; it was discussed here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ZOPBOY6XQOY…
In short: the default for bluefs_buffered_io was changed to false in a
recent Nautilus release, and I guess the same change was applied to newer
releases. That led to severe performance issues and similar symptoms, i.e.
lower memory usage on OSD nodes. Worth checking out.
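If you want to rule it out, something along these lines should show the
current value and flip it back (just a sketch, assuming a release with the
'ceph config' database; as far as I know the option only takes effect after
an OSD restart):

    ceph config get osd bluefs_buffered_io
    ceph config set osd bluefs_buffered_io true
    systemctl restart ceph-osd.target   # on each OSD node, one at a time

Check the thread above for the details.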
Of course, it may be something completely different. You should look into
monitoring all your OSDs separately, checking their utilization, await, and
other parameters, and comparing them to pre-upgrade values, to find the
root cause.
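For example (assuming the sysstat package is installed on the OSD nodes),
something like:

    iostat -x 5        # per-disk %util and await on each OSD node
    ceph osd perf      # per-OSD commit/apply latency in ms

run before and during a benchmark should quickly show whether a few slow
OSDs are dragging the whole cluster down.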
On Mon, 2 Nov 2020 at 11:55, Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
I have been advocating for a long time for publishing test data from a
basic test cluster against different Ceph releases: just a basic Ceph
cluster that covers the most common configs, running the same tests on
each release, so you can compare just the Ceph performance. That would
mean a lot for smaller companies that do not have access to a good test
environment. I have also asked about this at a Ceph seminar.
-----Original Message-----
From: Martin Rasmus Lundquist Hansen [mailto:hansen@imada.sdu.dk]
Sent: Monday, November 02, 2020 7:53 AM
To: ceph-users(a)ceph.io
Subject: [ceph-users] Seriously degraded performance after update to
Octopus
Two weeks ago we updated our Ceph cluster from Nautilus (14.2.0) to
Octopus (15.2.5), an update that was long overdue. We used the Ansible
playbooks to perform a rolling update, and apart from a few minor
problems with the Ansible code, the update went well. The Ansible
playbooks were also used for setting up the cluster in the first place.
Before updating the Ceph software we also performed a full update of
CentOS and the Linux kernel (this part of the update had already been
tested on one of the OSD nodes the week before, and we didn't notice any
problems).
However, after the update we are seeing a serious decrease in
performance, more than a factor of 10 in some cases. I spent a week
trying to come up with an explanation or a solution, but I am completely
blank. Independently of Ceph, I tested the network performance and the
performance of the OSD disks, and I am not really seeing any problems
there.
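(For reference, checks of this kind can be done with something like iperf3
between two nodes and a plain fio run against a scratch file on an OSD
node; the exact commands and file name here are just examples:

    iperf3 -s                     # on one node
    iperf3 -c <other-node>        # on another node
    fio --name=seqwrite --filename=/tmp/fio.test --size=4G \
        --rw=write --bs=4M --direct=1 --ioengine=libaio

Do not point fio at a disk or file that is in use.)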
The specifications of the cluster are:
- 3x Monitor nodes running mgr+mon+mds (Intel(R) Xeon(R) Silver 4108 CPU
@ 1.80GHz, 16 cores, 196 GB RAM)
- 14x OSD nodes, each with 18 HDDs and 1 NVME (Intel(R) Xeon(R) Gold
6126 CPU @ 2.60GHz, 24 cores, 384 GB RAM)
- CentOS 7.8 and Kernel 5.4.51
- 100 Gbps Infiniband
We are collecting various metrics using Prometheus, and on the OSD nodes
we are seeing some clear differences in CPU and memory usage. I collected
some graphs here: http://mitsted.dk/ceph . After the update the system
load is greatly reduced, there is almost no iowait on the CPUs anymore,
and the free memory is no longer used for buffers (I can confirm that the
changes in these metrics are not due to the update of CentOS or the Linux
kernel). All in all, the OSD nodes are now almost completely idle all the
time (and so are the monitors). On the linked page I also attached two
RADOS benchmarks. The first benchmark was performed when the cluster was
initially configured, and the second is the same benchmark after the
update to Octopus. When comparing the two, it is clear that the
performance has changed dramatically. For example, in the write test the
bandwidth dropped from 320 MB/s to 21 MB/s, and the number of IOPS also
dropped significantly.
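(For anyone who wants to reproduce this, the benchmarks are plain rados
bench runs; roughly, with a scratch pool named 'bench' as an example:

    rados bench -p bench 60 write --no-cleanup
    rados bench -p bench 60 seq
    rados -p bench cleanup

The exact pool name and runtime may differ from what was used originally.)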
I temporarily tried disabling the firewall and SELinux on all nodes to
see if it made any difference, but it didn't look like it (I did not
restart any services during this test; I am not sure whether that would
be necessary).
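(In case it is useful to others, on CentOS 7 that test amounted to
something like:

    systemctl stop firewalld    # on every node
    setenforce 0                # set SELinux to permissive

followed by re-running the benchmark; restarting the Ceph daemons
afterwards, e.g. with 'systemctl restart ceph-osd.target', might still be
worth trying.)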
Any suggestions for finding the root cause of this performance decrease
would be greatly appreciated.
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io