Hi,
On Tue, Sep 8, 2020 at 11:20 PM David Orman <ormandj(a)corenode.com> wrote:
Every time we look at them, we see the same checksum (0x6706be76):
This looks a lot like:
https://tracker.ceph.com/issues/22464
Some more context on this, since I built the work-around for this issue:
* the checksum is for a block of all zeroes
* this seemed to happen when memory ran low
* it is *NOT* related to swap: this happened on systems with swap disabled
and no file-backed mmapped memory (BlueStore-only servers w/o non-OSD disks)
* only showed up on some kernel versions
* retrying the read resolved it; it was very rare to see two consecutive read
failures, and we never saw it persist through 3 retries
* root cause was never found, as I never managed to reliably reproduce this
on test setups where I could play around with bisecting the kernel :(
Here's the patch that added the read retries:
https://github.com/ceph/ceph/pull/23273/files
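The logic of that patch is roughly the following (a simplified Python sketch for illustration, not the actual BlueStore C++ code; the names read_block, verify_csum, and ChecksumError are hypothetical):

```python
# Sketch of retry-on-checksum-mismatch: re-issue the read a configurable
# number of times before surfacing the error. BlueStore's equivalent knob
# is bluestore_retry_disk_reads (default 3).
RETRIES = 3


class ChecksumError(Exception):
    pass


def read_with_retries(read_block, verify_csum, retries=RETRIES):
    """Call read_block() up to retries+1 times until verify_csum(data) passes."""
    last_data = None
    for attempt in range(retries + 1):
        data = read_block()
        if verify_csum(data):
            # A successful retry is what the bluestore_reads_with_retries
            # performance counter would record in the real code.
            return data
        last_data = data
    raise ChecksumError("checksum mismatch persisted after %d retries" % retries)
```

The key property, matching what Paul observed, is that a single transient bad read (e.g. a spurious all-zero block) is absorbed by the first retry and never reaches the caller.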
What you can do is:
1. check the performance counter bluestore_reads_with_retries on affected
OSDs; it should be non-zero
2. increase the setting bluestore_retry_disk_reads (default 3) to see if
that helps
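The perf counters come back as JSON from the OSD admin socket, so a small script can flag affected OSDs. The sample below is an illustrative, heavily trimmed stand-in for real `perf dump` output, which contains many more sections:

```python
import json

# Trimmed, illustrative sample of what an OSD perf dump might contain;
# real output has many sections beyond "bluestore".
sample = json.loads("""
{"bluestore": {"bluestore_reads_with_retries": 2}}
""")

# A non-zero value means this OSD hit checksum mismatches that a retry fixed.
retries = sample.get("bluestore", {}).get("bluestore_reads_with_retries", 0)
if retries > 0:
    print("OSD needed %d read retries" % retries)
```

On a live cluster you would feed this the output of the admin-socket perf dump for each OSD rather than a hard-coded sample.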
Anyway, what you are seeing might be something completely different from
whatever caused this bug... but it's worth playing around with the retry
option.
Paul
That said, we've got the following versions in play (cluster was created
with 15.2.3):
ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
This is a containerized cephadm installation, in case it's relevant.
Distribution is Ubuntu 18.04.4, and the kernel is the HWE kernel:
Linux ceph02 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24
UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
A repair operation 'fixes' it. These are occurring across many PGs on
various servers, and we see no indication of any hardware-related issues.
Any ideas what to do next?
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io