Poor performance after (incomplete?) upgrade to Nautilus - ceph-users

8 Jan 2020

Hi all,

for the past few weeks our Nautilus cluster had been struggling with severe performance
issues. Whenever an OSD went down, rebalancing was very slow: there were long periods with
no data transfer at all (neither client nor rebalancing traffic) and periods with
rebalancing traffic only. Client traffic was almost completely stalled until all objects
were back in place (VMs were frozen). PGs were stuck in peering or inactive for a long
time. Sometimes we had to restart the ceph-mon to get the whole process going again.

The issues started all of a sudden; we don't remember making any changes to the
configuration.

The whole cluster was upgraded from Mimic to Nautilus (14.2.3) in September, whereas the
issues only appeared a few weeks ago. Updating to 14.2.5 did not resolve them.

Looking through mailing lists I found the following message:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-July/028035.html

So I ran "ceph osd require-osd-release nautilus" and all of a sudden the
problems were gone! I do not recall running that command right after the upgrade. The
documentation describes this step as "Complete the upgrade by disallowing pre-Nautilus
OSDs and enabling all new Nautilus-only functionality." Since all OSDs, MONs and MGRs
had already been upgraded successfully at that point, there was no reason to believe the
command was necessary.
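
For anyone who wants to check whether their own cluster is in the same state, here is a
minimal sketch in Python. It assumes that "ceph osd dump" prints a line of the form
"require_osd_release <name>" and that the ceph CLI and a usable keyring are available on
the host. The function names and the sample text are my own and purely illustrative; the
sample is not output from our cluster.

```python
import subprocess
from typing import Optional


def parse_require_osd_release(osd_dump_text: str) -> Optional[str]:
    """Return the release named on the require_osd_release line, or None if absent."""
    for line in osd_dump_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "require_osd_release":
            return fields[1]
    return None


def read_osd_dump() -> str:
    """Run `ceph osd dump` and return its text (needs a working ceph CLI)."""
    result = subprocess.run(
        ["ceph", "osd", "dump"], check=True, capture_output=True, text=True
    )
    return result.stdout


# Illustrative excerpt only (hypothetical values), not real cluster output.
SAMPLE = """epoch 1234
flags sortbitwise,recovery_deletes,purged_snapdirs
require_min_compat_client luminous
require_osd_release mimic
"""

# On a real cluster, use parse_require_osd_release(read_osd_dump()) instead.
print(parse_require_osd_release(SAMPLE))
```

If this reports anything older than the release all daemons are running (compare with
"ceph versions"), the final upgrade step has probably not been run yet.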

Therefore I have two questions:
1. What exactly does the command do besides preventing old OSDs from joining?
2. What could have been the issue with the cluster and how did this command fix it?

If it is really that important to run the command, the docs should state this more
clearly.

I appreciate any insight on this topic.

Thanks,
Georg