Hi Marcel,
The peering process is used by Ceph OSDs, on a per-placement-group basis, to
agree on the state of that placement group on each of the involved OSDs.
In your case, 2/3 of the placement group metadata that needs to be agreed upon/checked is
on the nodes that did not undergo maintenance. You also need to consider that the acting
primary OSD for everything is now hosted on the OSDs that did not undergo any
maintenance.
This all means that the 'heavy' lifting is done by the nodes that stayed online
until the recovery/backfilling process is completed. Also consider that Ceph
will, most likely, execute peering twice per PG: once when the OSDs start
again, and once when the recovery and backfilling is finished.
I really don't want to say RTFM, but I don't think it is useful to copy it all here:
https://docs.ceph.com/en/latest/dev/peering/#description-of-the-peering-pro…
Peering
the process of bringing all of the OSDs that store a Placement Group (PG) into agreement
about the state of all of the objects (and their metadata) in that PG. Note that agreeing
on the state does not mean that they all have the latest contents.
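If you want to watch this happen on a cluster, the peering/recovery history of a PG can be inspected directly. A sketch (the PG id 2.1f is just a placeholder; substitute one of your own):

```shell
# List any PGs currently in the peering state.
ceph pg ls peering

# Inspect the peering and recovery history of a single PG.
# "2.1f" is a placeholder PG id, e.g. taken from 'ceph pg ls'.
ceph pg 2.1f query | jq '.recovery_state'
```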
Kind regards,
Wout
42on
________________________________________
From: Marcel Kuiper <ceph(a)mknet.nl>
Sent: Friday, 6 November 2020 10:23
To: ceph-users(a)ceph.io
Subject: [ceph-users] Re: high latency after maintenance
Hi Anthony
Thank you for your response.
I am looking at the "OSDs highest latency of write operations" panel of the
grafana dashboard found in the ceph source in
./monitoring/grafana/dashboards/osds-overview.json. It is a topk graph
that uses ceph_osd_op_w_latency_sum / ceph_osd_op_w_latency_count.
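For reference, the panel's value is just the average write latency over each scrape interval: delta(sum) divided by delta(count). A minimal sketch of that arithmetic (the sample numbers are invented):

```shell
# Average write latency over a scrape interval, from two samples of
# ceph_osd_op_w_latency_sum (seconds) and ceph_osd_op_w_latency_count.
avg_write_latency() {
  # args: sum_prev count_prev sum_now count_now
  awk -v sp="$1" -v cp="$2" -v sn="$3" -v cn="$4" '
    BEGIN {
      dc = cn - cp
      if (dc == 0) print 0
      else print (sn - sp) / dc
    }'
}

# 120 writes completed in the interval, 6 s of summed latency:
avg_write_latency 100.0 5000 106.0 5120   # prints 0.05
```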
During normal operations we sometimes see latency spikes of 4 seconds max,
but while bringing the rack back we saw a consistent increase in latency for
a lot of OSDs into the 20-second range.
The cluster has 1139 OSDs total, of which we had 5 x 9 = 45 in maintenance.
We did not throttle the backfilling process because we successfully did the
same maintenance before on a few occasions for other racks without
problems. I will throttle backfills next time we have the same sort of
maintenance in the next rack.
Can you elaborate a bit more on what happens exactly during the peering
process? I understand that the OSDs need to catch up. I also see that the
number of scrubs increases a lot when OSDs are brought back online. Is that
part of the peering process?
Thx, Marcel
HDDs and concern for latency don’t mix. That said,
you don’t specify
what you mean by “latency”. Does that mean average client write
latency? median? P99? Something else?
If you have a 15 node cluster and you took a third of it down for two
hours then yeah you’ll have a lot to catch up on when you come back.
Bringing the nodes back one at a time can help, to spread out the peering.
Did you throttle the backfill/recovery tunables all the way down to 1, in
such a way that the restarted OSDs would use the throttled values as they boot?
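For what it's worth, a sketch of what "all the way down to 1" could look like on Nautilus; setting these centrally before restarting the OSDs means they pick the values up at boot (the sleep value is a common conservative choice, not something from this thread):

```shell
# Throttle recovery/backfill cluster-wide before bringing OSDs back.
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

# Optionally slow each recovery op further (seconds to sleep between ops).
ceph config set osd osd_recovery_sleep_hdd 0.2

# Verify what a given OSD will actually use:
ceph config get osd.0 osd_max_backfills

# Relax again once the cluster is back to HEALTH_OK:
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active
ceph config rm osd osd_recovery_sleep_hdd
```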
On Nov 5, 2020, at 6:47 AM, Marcel Kuiper
<ceph(a)mknet.nl> wrote:
Hi
We had a rack down for 2 hours for maintenance. 5 storage nodes were
involved. We had the noout and norebalance flags set before the start of the
maintenance.
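For anyone following along, the flag dance around such a maintenance window typically looks like this (a sketch, not the exact commands used here):

```shell
# Before taking the rack down: stop CRUSH reacting to the outage.
ceph osd set noout
ceph osd set norebalance

# ... perform maintenance, bring the nodes back up ...

# After the OSDs have rejoined and peered, clear the flags again.
ceph osd unset norebalance
ceph osd unset noout

# Watch recovery progress.
ceph -s
```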
When the systems were brought back online we noticed a lot of OSDs with
high latency (in the 20-second range), mostly OSDs that are not on the
storage nodes that were down. It took about 20 minutes for things to
settle down.
We're running Nautilus 14.2.11. The storage nodes run BlueStore and have
9 x 8 TB HDDs and 3 SSDs for RocksDB, each with 3 x 123 GB LVs.
- Can anyone give a reason for these high latencies?
- Is there a way to avoid or lower these latencies when bringing systems
back into operation?
Best Regards
Marcel
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io