New subject: high latency after maintenance]

6 Nov 2020

Hi Anthony

Thank you for your respons

I am looking at the"OSDs highest latency of write operations" panel of the
grafana dashboard found in the ceph source in
./monitoring/grafana/dashboards/osds-overview.json. It is a topk graph
that  uses ceph_osd_op_w_latency_sum / ceph_osd_op_w_latency_count.
During normal operations we see sometime latency spikes of 4 seconds max
but during the bringing back of the rack we saw a consistent increase in
latency for a lot of osds into the 20 seconds range

The cluster has 1139 osds total of which we had 5 x 9 - 45 in maintenance

We did not throttle the backfilling proces because we succesfully did the
same maintenance before on a few occasions for other racks without
problems. I will throttle backfills next time we have the same sort of
maintenance in the next rack

Can you elaborate a bit more what happens exactly during the peering
process? I understand that the osds need to catch up. I also see that the
nr of scrubs increases a lot when osds are brought back online. Is that
part of the peering proces?

Thx, Marcel

...
  HDDs and concern for latency donât mix.  That said,
you donât specify
 what you mean by âlatencyâ.  Does that mean average client write
 latency?  median? P99? Something else?

 If you have a 15 node cluster and you took a third of it down for two
 hours then yeah youâll have a lot to catch up on when you come back.
 Bringing the nodes back one at a time can help, to spread out the peering.
  Did you throttle backfill/recovery tunables all the way down to 1?  In a
 way that the restarted OSDs would use the throttled values as they boot?

  On Nov 5, 2020, at 6:47 AM, Marcel Kuiper
&lt;ceph(a)mknet.nl&gt; wrote:

 Hi

 We had a rack down for 2hours for maintenance. 5 storage nodes were
 involved. We had noout en norebalance flags set before the start of the
 maintenance

 When the systems were brought back online we noticed a lot of osds with
 high latency (in 20 seconds range) . Mostly osds that are not on the
 storage nodes that were down. It took about 20 minutes for things to
 settle down.

 We're running nautilus 14.2.11. The storage nodes run bluestore and have
 9
 x 8T HDD's and 3 x SSD for rocksdb. Each with 3 x 123G LV

 - Can anyone give a reason for these high latencies?
 - Is there a way to avoid or lower these latencies when bringing systems
 back into operation?

 Best Regards

 Marcel
 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io 

Re: high latency after maintenance]