Hi Jiri,
This doesn't sound too bad. I don't know whether that recovery time is to be expected;
how does it compare with the same operation on mimic at the same utilization? In any case,
single-disk recovery is slow and I plan to avoid it. Replacements are installed together
with upgrades to get all-to-all instead of all-to-one recovery.
Assuming that you replaced the disk only after all objects were recovered (cluster health OK),
degraded objects and undersized PGs should not occur. What exactly was the sequence
of events:
- disk fails
- ceph shows health warn
- PGs peer
- ceph starts recovery
- recovery completes
- ceph shows health ok
- server shut down
- disk is exchanged
- server boot
- ceph health ok
- new OSD deployed
- degraded objects and undersized PGs show up
If the order was exactly as listed here, something serious is wrong.
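Each intermediate state in the list above can be checked from the CLI; a minimal sketch (exact output varies by version):

```shell
ceph health detail   # HEALTH_OK expected after the "recovery completes" step
ceph pg stat         # all PGs should be active+clean, none degraded/undersized
ceph osd tree        # the failed OSD should show as down before the swap
```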
The MGR load problem seems to be a known issue, due to interpreted (slow!) code being
executed in a high-frequency callback.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Jiri D. Hoogeveen <wica128(a)gmail.com>
Sent: 25 June 2020 17:44:23
To: Frank Schilder
Cc: Eugen Block; ceph-users(a)ceph.io
Subject: Re: [ceph-users] Re: Removing pool in nautilus is incredibly slow
Hi Frank,
"With our mimic cluster I have absolutely no problems migrating pools in one go to a
completely new set of disks. I have no problems doubling the number of disks and at the
same time doubling the number of PGs in a pool and let the rebalancing loose in one single
go. No need for slowly increasing weights. No need for slow changes of PG counts. In such
cases,..."
This is also my experience.
I have two clusters running Nautilus 14.2.8, one of them upgraded from mimic two weeks ago.
I do NOT see any performance drop on the client side, but recovery after replacing a
failed OSD is extremely slow.
When I need to replace an OSD, I destroy it, set the noout flag, turn off the
server, replace the disk, and turn the server back on, all within 30 minutes. On mimic I
only got some misplaced objects, and it recovered within an hour.
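For reference, this procedure roughly corresponds to the following sketch (OSD id 12 and the unit name are placeholders; your deployment tooling may differ):

```shell
ceph osd set noout                          # keep CRUSH from marking OSDs out during maintenance
systemctl stop ceph-osd@12                  # stop the failing OSD daemon if still running
ceph osd destroy 12 --yes-i-really-mean-it  # mark it destroyed, keeping its id and CRUSH position
# power off, swap the disk, power on, redeploy OSD 12 on the new disk, then:
ceph osd unset noout
```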
On Nautilus, when I do exactly the same, I get, besides misplaced objects, also degraded
and undersized PGs, and the recovery takes almost a day.
I still need to investigate this (tips are welcome ;) ), but what stands out is the
load on the manager.
Grtz, Jiri
On Thu, 25 Jun 2020 at 17:18, Frank Schilder <frans@dtu.dk> wrote:
I actually don't think this is the problem. I removed a 120 TB file system EC data pool
on mimic without any special flags or magic. The OSDs of the data pool are HDDs with
everything collocated. I had absolutely no problems; the data was removed after 2-3 days
and nobody even noticed. This is a standard operation and should just work, without op
queues running full, heartbeat losses, manual compaction, or the like.
Looking at all the reports that came in on this list over the past 1-2 years
about performance issues starting with nautilus, it really sounds to me like a serious
regression happened. Maybe the messenger introduction? Maybe the prioritizing problem that
Robert LeBlanc reported in
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/W4M5XQRDBLX…
?
I guess anyone who started with nautilus doesn't know the good old times when one could
do admin work without a completely normal cluster collapsing for no reason. Others do.
I find it a bit strange that there is such a long silence on this topic. There are
numerous reports of people having issues with PG changes or rebalancing: benign operations
that should just work.
With our mimic cluster I have absolutely no problems migrating pools in one go to a
completely new set of disks. I have no problems doubling the number of disks and at the
same time doubling the number of PGs in a pool and letting the rebalancing loose in one
single go. No need for slowly increasing weights. No need for slow changes of PG counts.
In such cases, I casually push the recovery options up close to the maximum available
bandwidth and nobody even notices a performance drop. And all this with WAL/DB and data
collocated on the same disk and with rather low RAM available; I can only afford 2 GB per
HDD OSD.
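For context, "pushing the recovery options up" would look something like this (the values are illustrative, not recommendations; tune them for your own hardware):

```shell
ceph tell osd.\* injectargs '--osd_max_backfills 8'         # more concurrent backfills per OSD
ceph tell osd.\* injectargs '--osd_recovery_max_active 8'   # more in-flight recovery ops per OSD
ceph tell osd.\* injectargs '--osd_recovery_sleep 0'        # no throttling between recovery ops
```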
Anyone on nautilus or higher who has the same experience?
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eugen Block <eblock@nde.ag>
Sent: 25 June 2020 16:42:57
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Removing pool in nautilus is incredibly slow
I'm not sure whether your OSDs have their RocksDB on faster devices; if not,
it sounds a lot like RocksDB fragmentation [1] leading to very high
load on the OSDs and occasionally crashing OSDs. If you don't plan to
delete this much data at once on a regular basis you could sit this one
out, but one solution is to re-create the OSDs with RocksDB/WAL on
faster devices.
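Re-creating an OSD with its RocksDB/WAL on a faster device could look like this sketch using ceph-volume (the device paths are placeholders for your data disk and a partition on a fast NVMe):

```shell
# after removing the old OSD from the cluster, wipe the data disk:
ceph-volume lvm zap /dev/sdc --destroy
# create the new OSD with data on the HDD and block.db on the NVMe partition:
ceph-volume lvm create --data /dev/sdc --block.db /dev/nvme0n1p1
```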
[1]
https://www.mail-archive.com/ceph-users@ceph.io/msg03160.html
Quoting Francois Legrand <fleg@lpnhe.in2p3.fr>:
Thanks for the hint.
I tried, but it doesn't seem to change anything...
Moreover, as the OSDs seem quite loaded, I regularly had some OSDs
marked down, which triggered new peering and thus more load!
I set the nodown flag, but I still have some OSDs reported
(wrongly) as down (and back up within a minute), which generates peering
and remapping. I don't really understand the effect of the nodown
flag!
Is there a way to tell Ceph not to peer immediately after an OSD is
reported down (say, wait 60 s)?
I am thinking about restarting all OSDs (or maybe the whole cluster)
to change osd_op_queue_cut_off to high and
osd_op_thread_timeout to something higher than 15 (but I don't think
it will really improve the situation).
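Assuming Nautilus's centralized config store, changing those two settings might look like this (osd_op_queue_cut_off only takes effect after an OSD restart; the timeout value is illustrative):

```shell
ceph config set osd osd_op_queue_cut_off high   # prioritize client ops over recovery under load
ceph config set osd osd_op_thread_timeout 30    # raise the op thread timeout from the default 15
systemctl restart ceph-osd.target               # per host, one host at a time
```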
F.
On 25/06/2020 at 14:26, Wout van Heeswijk wrote:
Hi Francois,
Have you already looked at the option "osd_delete_sleep"? It will
not speed up the process, but it will give you some control over your
cluster's performance.
Something like:
ceph tell osd.\* injectargs '--osd_delete_sleep 1'
kind regards,
Wout
42on
On 25-06-2020 09:57, Francois Legrand wrote:
Does someone have an idea ?
F.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io