[ceph-users] Re: ceph iscsi latency too high for esxi?

5 Oct 2020

It is a load issue. Your combined load: client io, recovery, scrub is 
higher that what your cluster can handle.
Whereas some ceph commands can block when things are very busy, VMWare 
iSCSI is less tolerant but it is not the problem.
If you have charts, look at the metric for disk % utilization/busy and 
cpu % utilization, most probably your hdds were near 100% at the time of 
the issue, otherwise you can measure them manually at peak load.

For the large majority of vm workloads, you really you should be using 
SSDs. If you have to use hdds, it could help to reduce the recovery and 
scrub load settings since you mentioned in your case the issue 
correlates to them, maybe try:

osd_max_scrubs 1
osd_scrub_load_threshold 0.4
osd_scrub_sleep 0.2

osd_max_backfills 1
osd_recovery_max_active 1
osd_recovery_sleep 0.2

Generally it is beneficial to know what your peak workload is 
(iops/throughput) and what your cluster is capable of giving including 
during recovery and scrubbing.

/Maged

On 04/10/2020 19:47, Golasowski Martin wrote:
...
  Oh, thanks, that does not sound very encouraging.
 In our case it looked the same, we had to reboot three ESXi nodes via 
 IPMI, because it got stuck at ordinary soft reboot.

 1. RecoveryTimeout is set at 25 on our nodes
 2. We have one two-port adapter per node (Connect-X 5) and 4 iSCSI GWs 
 total, one per OSD server. Multipath works, tested it by randomly 
 rebooting one of two switches or manually shutting down the port.
 One curious thing I did not mention before is that we see a number of 
 dropped Rx packets on each NIC that corresponds to the iSCSI VLAN. 
 Increase in the dropped packets seems to correlate with the current 
 IOPS load.

 I am beginning to settle with the version that our cluster is quite 
 low on IOPS generally and slight increase in traffic may significantly 
 raise latency on the iSCSI target. And ESXi is just being very touchy 
 about that.

  On 4 Oct 2020, at 18:59, Phil Regnauld
&lt;pr(a)x0.dk <mailto:pr@x0.dk>> 
 wrote:

 Yep, and we're still experiencing it every few months. One (and only 
 one) of
 our ESXi nodes, which are otherwise identical, is experiencing total 
 freeze
 of all I/O, and it won't recover - I mean, ESXi is so dead, we have 
 to go into
 IPMI and reset the box...

 We're using Croit's software, but the issue doesn't seem to be with 
 CEPH so
 much as with vmware.

 That said, there's a couple of things you should be looking at:

 1. Make sure you remember to set the RecoveryTimeout to 25 ?

 https://docs.ceph.com/en/latest/rbd/iscsi-initiator-esx/

 2. Make sure you have got working multipath across more than 1 adapter.

 What's possibly biting us right now, is that with 2 iscsi gateways in our
 cluster, and although both are autodiscovered at iscsi configuration 
 time,
 we see that the ESXi nodes still only will show one path to each LUN.

 Currently these ESXi nodes have only 1 x 10gbit connected, it looks like
 I'll need to wire up the second connector and set up a second path to
 the iscsi gateway from that. It may not solve the problem, but it might
 lower the I/O on a single gateway enough that we won't see the problem
 anymore (and hopefully our customers stop getting pissed off).

 Cheers,
 Phil

 Golasowski Martin (martin.golasowski) writes:
  For clarity, the issue has been reported also
before:

https://www.spinics.net/lists/ceph-users/msg59798.html<https://www.spini…

 https://www.spinics.net/lists/target-devel/msg10469.html

  On 4 Oct 2020, at 16:46, Steve Thompson
&lt;smt(a)vgersoft.com&gt; wrote:

 On Sun, 4 Oct 2020, Martin Verges wrote:

>> Does that mean that occasional iSCSI path drop-outs are somewhat
> expected?
> Not that I'm aware of, but I have no HDD based ISCSI cluster at 
> hand to
> check. Sorry.

 I use iscsi extensively, but for ZFS and not ceph. Path drop-outs 
 are not common; indeed, so far as I am aware, I have never had one. 
 CentOS 7.8.

 Steve
 --
 ----------------------------------------------------------------------------
 Steve Thompson                 E-mail:      smt AT vgersoft DOT com
 Voyager Software LLC           Web:         http://www DOT vgersoft 
 DOT com
 3901 N Charles St              VSW Support: support AT vgersoft DOT com
 Baltimore MD 21218
 "186,282 miles per second: it's not just a good idea, it's the law"
 ---------------------------------------------------------------------------- 

  _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io 
 <mailto:ceph-users@ceph.io>
 To unsubscribe send an email to ceph-users-leave(a)ceph.io 
 <mailto:ceph-users-leave@ceph.io> 

 -- 

 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io 

2024

2023

2022

2021

2020

2019

[ceph-users] Re: ceph iscsi latency too high for esxi?