Hi Kristof,
Are you seeing high (around 100%) utilization on the OSDs' disks (main or
DB ones) along with slow ops?
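A quick way to check would be something like the following on one of the
busy OSD hosts (the OSD id below is just a placeholder):

  iostat -x 1 10                            # watch %util / await on the HDDs and the NVMe DB devices
  ceph osd perf                             # per-OSD commit/apply latencies
  ceph daemon osd.<id> dump_ops_in_flight   # ops currently waiting in the queue
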
Thanks,
Igor
On 10/6/2020 11:09 AM, Kristof Coucke wrote:
> Hi all,
>
> We have a Ceph cluster which has been expanded from 10 to 16 nodes.
> Each node has between 14 and 16 OSDs, of which 2 are NVMe disks.
> Most disks (except the NVMes) are 16 TB.
>
> The expansion to 16 nodes went OK, but we had configured the system to
> prevent automatic rebalancing towards the new disks (their weight was set
> to 0) so we could control the expansion.
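>
> (For reference, we bring a disk in roughly like this; osd.123 and the
> target weight are just examples:
>
>   ceph osd crush reweight osd.123 0        # new OSD kept at zero weight
>   ceph osd crush reweight osd.123 14.55    # later raised to the disk's real weight
> )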
>
> Last week we started by adding 6 disks (1 disk on each new node), which
> didn't cause a lot of issues.
> When the Ceph status indicated the degraded PG recovery was almost
> finished, we added 2 more disks on each node.
>
> All seemed to go fine until yesterday morning... I/O towards the system
> was slowing down.
>
> Diving into the nodes, we could see that the OSD daemons are consuming
> the CPU, resulting in load averages going near 10 (!).
>
> Neither the RGWs nor the monitors nor the other involved servers are
> having CPU issues (except for the management server, which is fighting
> with Prometheus), so the latency seems to be related to the OSD hosts.
> All of the hosts are interconnected with 25 Gbit links; no bottlenecks
> are being hit on the network either.
>
> Important piece of information: we are using erasure coding (6/3), and we
> do have a lot of small files...
> The current health detail indicates degraded data redundancy, with
> 1192911/103387889228 objects degraded (1 pg degraded, 1 pg undersized).
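>
> (The above comes from "ceph health detail"; to locate the affected PGs,
> something like this should work:
>
>   ceph pg dump_stuck undersized degraded
> )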
>
> Diving into the historic ops of an OSD, we can see that the main latency
> is found between the events "queued_for_pg" and "reached_pg" (averaging
> +/- 3 secs).
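>
> (Pulled from the admin socket with, for example:
>
>   ceph daemon osd.<id> dump_historic_ops
>
> and then comparing the timestamps of the "queued_for_pg" and "reached_pg"
> entries in each op's events list.)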
>
> As the system load is quite high, I assume the systems are busy
> recalculating the erasure code chunks to make use of the new disks we've
> added (though I'm not sure), but I was wondering how I can better
> fine-tune the system or pinpoint the exact bottleneck.
> Latency towards the disks doesn't seem to be an issue at first sight...
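>
> (If it really is recovery/backfill work, would it make sense to throttle
> it, e.g.:
>
>   ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
>   ceph tell osd.* injectargs '--osd-recovery-sleep-hdd 0.1'
>
> or would that not address the queued_for_pg -> reached_pg delay?)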
>
> We are running Ceph 14.2.11
>
> Who can give me some thoughts on how I can better pinpoint the bottleneck?
>
> Thanks
>
> Kristof
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io