W dniu 2020-11-04 01:18, m.sliwinski(a)lh.pl napisał(a):
Just in case - result of ceph report is here:
http://paste.ubuntu.com/p/D7yfr3pzr4/
> Hi
>
> We have a weird issue with our ceph cluster - almost all PGs assigned
> to one specific pool became stuck, locking out all operations without
> reporting any errors.
> Story:
> We have 3 different pools, hdd-backed, ssd-backed and nvme-backed.
> The ssd pool worked fine for a few months.
> Today one of the hosts assigned to the nvme pool restarted, triggering
> recovery in that pool. It went fast and the cluster returned to OK state.
> During these events, or shortly after them, the ssd pool became
> unresponsive. It was impossible to either read from or write to it.
> We decided to slowly restart first the OSDs assigned to it, then, as that
> didn't help, all the mons - without breaking quorum, of course.
> At this moment both the nvme and hdd pools are working fine; the ssd one
> is stuck in recovery.
> All OSDs in that ssd pool use a large amount of CPU and are exchanging
> approx. 1 Mpps per OSD server between each other.
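
For anyone trying to reproduce the diagnosis, a sketch of the standard ceph CLI checks one might run at this point (the pool name "ssd" and the PG id are placeholders, not taken from the report above):

```shell
# Hypothetical diagnostics for a stuck pool; "ssd" and "2.1f" are example names.
ceph health detail            # lists any PGs the cluster itself reports as stuck
ceph pg dump_stuck unclean    # PGs stuck in a non-active+clean state
ceph osd pool stats ssd       # per-pool client and recovery I/O rates
# Query one affected PG directly to see its peering/recovery state
# (replace 2.1f with a real PG id from the dump above):
ceph pg 2.1f query
```

These would show whether the PGs are stuck in peering, whether recovery is actually making progress, and which OSDs each stuck PG maps to.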