Hi Suresh,
24 HDDs backed by only 2 NVMes looks like a high ratio. What rings a
bell in your post is "upgraded from Luminous to Nautilus",
"Elasticsearch" (a workload that mainly reads in order to index data)
and "memory leak".
You might want to take a look at the current value of bluefs_buffered_io
on your OSDs:

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep bluefs_buffered_io
In Nautilus, the default value of bluefs_buffered_io was changed to
false to avoid unexplained excessive swap usage by the OSDs (seen when
true), but this change induced a very high load on the fast devices
(SSDs/NVMes) hosting RocksDB and the WAL.
Considering the high ratio of HDDs per NVMe that you're using, I
wouldn't be surprised if your NVMes were topping out in iostat (%util)
due to bluefs_buffered_io now being false. If so, change it back to true
(and lower osd_memory_target and vm.swappiness), and keep an eye on swap
usage.
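
For reference, a minimal sketch of the checks and changes I have in
mind (device names and values below are only examples; depending on
your exact release, bluefs_buffered_io may only take effect after an
OSD restart):

  # watch utilization of the NVMes hosting RocksDB/WAL
  iostat -xm 5 /dev/nvme0n1 /dev/nvme1n1

  # switch bluefs_buffered_io back to true for all OSDs
  ceph config set osd bluefs_buffered_io true

  # leave more room for the page cache and discourage swapping
  ceph config set osd osd_memory_target 3221225472
  sysctl -w vm.swappiness=10
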
Regards,
Frédéric.
On 15/12/2020 at 21:44, Suresh Rama wrote:
> Dear All,
>
> We have a 38-node HP Apollo cluster with 24 3.7T spinning disks and 2 NVMes
> for journal per node. This is one of our 13 clusters, and it was upgraded
> from Luminous to Nautilus (14.2.11). When one of our OpenStack customers,
> who uses Elasticsearch (they offer Logging as a Service to their end
> users), reported IO latency issues, our SME rebooted two nodes that he felt
> were leaking memory. The reboot didn't help; rather, it worsened the
> situation, and he went ahead and recycled the entire cluster one node at a
> time to fix the slow ops reported by the OSDs. This caused a huge issue:
> the MONs were not able to withstand the flood of messages and started
> crashing.
>
> 1) We audited the network (inspecting TOR switches, iperf, MTR) and nothing
> indicated any issue, but the OSD logs kept complaining about
> BADAUTHORIZER:
>
> 2020-12-13 15:32:31.607 7fea5e3a2700 0 --1- 10.146.126.200:0/464096978 >>
> v1:10.146.127.122:6809/1803700 conn(0x7fea3c1ba990 0x7fea3c1bf600 :-1
> s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
> connect got BADAUTHORIZER
> 2020-12-13 15:32:31.607 7fea5e3a2700 0 --1- 10.146.126.200:0/464096978 >>
> v1:10.146.127.122:6809/1803700 conn(0x7fea3c1c1e20 0x7fea3c1bcdf0 :-1
> s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
> connect got BADAUTHORIZER
> 2020-12-13 15:32:31.607 7fea5e3a2700 0 --1- 10.146.126.200:0/464096978 >>
> v1:10.146.127.122:6809/1803700 conn(0x7fea3c1ba990 0x7fea3c1bf600 :-1
> s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
> connect got BADAUTHORIZER
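>
> For reference, the checks we ran to rule out the usual causes of
> BADAUTHORIZER, clock skew and a cephx key mismatch (osd.12 below is just
> an example):
>
>   # report any clock skew between the MONs
>   ceph time-sync-status
>
>   # compare the key registered with the MONs...
>   ceph auth get osd.12
>   # ...with the key in the OSD's local keyring
>   cat /var/lib/ceph/osd/ceph-12/keyring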
>
> 2) Made sure there was no clock skew; we use timesyncd. After taking out a
> couple of OSDs that were flagged as slow in the ceph health output, the
> situation didn't improve. After 3 days of troubleshooting we upgraded the
> MONs to 14.2.15 and the situation seems to have improved a little, but the
> cluster is still reporting 61308 slow ops, which we really struggled to pin
> to bad OSDs, as taking a couple of them out didn't help. One of the MONs
> (mon2) failed to join the cluster: it was constantly compacting and was
> never able to join (see the sizes below). I suspect that could be because
> the key-value store on mon2 is not up to date with mon1 and mon3. At times
> we had to stop and start the MONs to compact the store and get a better
> response from Ceph (keeping only a single MON running).
>
> root@pistoremon-as-c01:~# du -sh /var/lib/ceph/mon
> 391G /var/lib/ceph/mon
> root@pistoremon-as-c03:~# du -sh /var/lib/ceph/mon
> 337G /var/lib/ceph/mon
> root@pistoremon-as-c02:~# du -sh /var/lib/ceph/mon
> 13G /var/lib/ceph/mon
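>
> For reference, a sketch of how we trigger the MON store compaction (MON
> name from our cluster; the config option goes in ceph.conf):
>
>   # online compaction of a MON's store.db
>   ceph tell mon.pistoremon-as-c02 compact
>
>   # or compact at startup: under [mon] in /etc/ceph/ceph.conf
>   #   mon compact on start = true
>   # then restart the MON daemon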
>
> cluster:
> id: bac20301-d458-4828-9dd9-a8406acf5d0f
> health: HEALTH_WARN
> noout,noscrub,nodeep-scrub flag(s) set
> 1 pools have many more objects per pg than average
> 10969 pgs not deep-scrubbed in time
> 46 daemons have recently crashed
> 61308 slow ops, oldest one blocked for 2572 sec, daemons
> [mon.pistoremon-as-c01,mon.pistoremon-as-c03] have slow ops.
> mons pistoremon-as-c01,pistoremon-as-c03 are using a lot of
> disk space
> 1/3 mons down, quorum pistoremon-as-c01,pistoremon-as-c03
>
> services:
> mon: 3 daemons, quorum pistoremon-as-c01,pistoremon-as-c03 (age 52m),
> out of quorum: pistoremon-as-c02
> mgr: pistoremon-as-c01(active, since 2h), standbys: pistoremon-as-c03,
> pistoremon-as-c02
> osd: 911 osds: 888 up (since 68m), 888 in
> flags noout,noscrub,nodeep-scrub
> rgw: 2 daemons active (pistorergw-as-c01, pistorergw-as-c02)
>
> task status:
>
> data:
> pools: 17 pools, 32968 pgs
> objects: 62.98M objects, 243 TiB
> usage: 748 TiB used, 2.4 PiB / 3.2 PiB avail
> pgs: 32968 active+clean
>
> io:
> client: 56 MiB/s rd, 95 MiB/s wr, 1.78k op/s rd, 4.27k op/s wr
>
> 3) When looking through ceph.log on the MON with tail -f, I saw a lot of
> different timestamps reported in the logs on MON1, which is the leader.
> Confused as to why the live log reports various timestamps?
>
> stat,write 2166784~4096] snapc 0=[] ondisk+write+known_if_redirected
> e951384) initiated 2020-12-13 06:16:58.873964 currently delayed
> 2020-12-13 06:39:37.169504 osd.1224 (osd.1224) 325855 : cluster [WRN] slow
> request osd_op(client.461445583.0:8881223 1.16aa
> 1:55684db0:::rbd_data.9ede65fc7af15.0000000000000000:head [stat,write
> 3547136~4096] snapc 0=[] ondisk+write+known_if_redirected e951384)
> initiated 2020-12-13 06:16:59.082012 currently delayed
> 2020-12-13 06:39:37.169510 osd.1224 (osd.1224) 325856 : cluster [WRN] slow
> request osd_op(client.461445583.0:8881191 1.16aa
> 1:55684db0:::rbd_data.9ede65fc7af15.0000000000000000:head [stat,write
> 2314240~4096] snapc 0=[] ondisk+write+known_if_redirected e951384)
> initiated 2020-12-13 06:16:58.874031 currently delayed
> 2020-12-13 06:39:37.169513 osd.1224 (osd.1224) 325857 : cluster [WRN] slow
> request osd_op(client.461445583.0:8881224 1.16aa
> 1:55684db0:::rbd_data.9ede65fc7af15.0000000000000000:head [stat,write
> 3571712~8192] snapc 0=[] ondisk+write+known_if_redirected e951384)
> initiated 2020-12-13 06:16:59.082094 currently delayed
> ^C
>
> root@pistoremon-as-c03:~# date
> Tue Dec 15 20:12:02 UTC 2020
> root@pistoremon-as-c03:~# timedatectl
> Local time: Tue 2020-12-15 20:12:04 UTC
> Universal time: Tue 2020-12-15 20:12:04 UTC
> RTC time: Tue 2020-12-15 20:12:04
> Time zone: Etc/UTC (UTC, +0000)
> Network time on: yes
> NTP synchronized: yes
> RTC in local TZ: no
>
> 4) We keep seeing the error below; can we assume that, with the size of the
> MON DB and the high workload, the clients are not able to fetch the monmap?
>
> 2020-12-13 13:16:56.197 7f946cddc700 -1 monclient: get_monmap_and_config
> failed to get config
> 2020-12-13 13:17:00.613 7f946cddc700 -1 monclient: get_monmap_and_config
> failed to get config
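>
> For reference, this is how we inspect what the MONs are busy with through
> the admin socket (MON name from our cluster):
>
>   # list in-flight/blocked ops currently tracked by the MON
>   ceph daemon mon.pistoremon-as-c01 ops
>
>   # overall state, including store sync status
>   ceph daemon mon.pistoremon-as-c01 mon_status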
>
> We are getting all sorts of errors; our primary concern is how to get the
> slow ops down and improve the IO latency. The MONs use SSDs for storage,
> with the NOOP scheduler and nr_requests set to 128.
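>
> For reference, how we verify those settings (the device name is just an
> example):
>
>   cat /sys/block/sda/queue/scheduler
>   cat /sys/block/sda/queue/nr_requests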
>
> Please provide some guidance on all 4 issues reported above. Thank you for
> your valuable time.
>