It's important to note that we do not suggest using the SMART "OK" health indicator
as a sign that a drive is sound. We monitor the correctable/uncorrectable error
counts instead, since you can see a dramatic rise in them when drives start to fail.
'OK' will still be reported for SMART health long after a drive is throwing many
uncorrectable errors and needs replacement. You have to look at the actual counters
themselves.
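For example (the device path is just a placeholder, and the exact attribute names
vary by vendor and by SATA vs. SAS), something along these lines shows the raw
counters rather than the overall health flag:

# SATA: dump the raw attribute table and pull out the error-related counters
smartctl -A /dev/sdX | grep -Ei 'realloc|pending|uncorrect|crc'
# SAS: the error counter log and grown defect list are the interesting parts
smartctl -a /dev/sdX | grep -Ei 'uncorrected|grown defect'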
That said, you will generally see these uncorrectable errors in the kernel output from
dmesg, as well.
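Something along these lines usually surfaces them (the exact log strings differ
between drivers, so treat the pattern as a starting point):

# human-readable timestamps, filtered for the typical block-layer/SCSI error messages
dmesg -T | grep -Ei 'i/o error|medium error|unrecovered read'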
On Mon, Jan 9, 2023, at 16:38, Erik Lindahl wrote:
Hi,
We too kept seeing this until a few months ago in a cluster with ~400 HDDs, while all
the drives' SMART statistics were always A-OK. Since we use erasure coding, each PG
involves up to 10 HDDs.
It took us a while to realize we shouldn't expect scrub errors on healthy drives, but
eventually we decided to track it down and found documentation suggesting running
rados list-inconsistent-obj <PG> --format=json-pretty
... before you repair the PG. If you look into that (long) output, you are likely
to find a "read_error" for a specific OSD. We then started to make a note of which
HDD saw the error.
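In case it helps anyone, a quick way to pull that out without reading the whole JSON
by hand (the PG and OSD ids below are just examples, and the jq path assumes the
usual layout of the quincy output):

# count how often each OSD shows up with a read_error shard in this PG
rados list-inconsistent-obj 2.1f --format=json-pretty \
  | jq -r '.inconsistents[].shards[] | select(.errors | index("read_error")) | .osd' \
  | sort | uniq -c
# then map the OSD id back to a host and device
ceph osd metadata 17 | jq -r '.hostname, .devices'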
This helped us identify two HDDs that had multiple read errors within a few weeks, even
though their SMART data was still perfectly fine. Now that *might* just be bad luck, but
we have enough drives that we don't care, so we just replaced them, and since then
I've only had a single drive report an error.
One conclusion (in our case) is that such a drive likely would have failed sooner or
later anyway, even though it hadn't yet crossed a threshold for SMART to worry about.
The alternative is that it's simply a drive with more frequent read errors that is
still technically within the allowed variation. Assuming you have configured your
cluster with reasonable redundancy, you shouldn't run any risk of data loss, but we
figured it's worth replacing a few outlier drives to sleep better.
Cheers,
Erik
--
Erik Lindahl <erik.lindahl(a)gmail.com>
On 9 Jan 2023 at 23:06 +0100, David Orman <ormandj(a)corenode.com>, wrote:
Check "dmesg" on all the Linux hosts and look for signs of failing drives. Look at
SMART data, your HBAs/disk controllers, OOB management logs, and so forth. If you're
seeing scrub errors, it's probably a bad disk backing an OSD or OSDs.
Is there a common OSD in the PGs you've run the repairs on?
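For example (the PG id is only a placeholder), you can pull the acting set for each
inconsistent PG and see whether one OSD keeps turning up:

# health detail lists inconsistent PGs together with their acting OSD sets
ceph health detail | grep -i inconsistent
# or look up the up/acting set for a specific PG
ceph pg map 2.1f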
On Mon, Jan 9, 2023, at 03:37, Kuhring, Mathias wrote:
Hey all,
I'd like to pick up on this topic, since we have also been seeing regular scrub
errors recently.
Roughly one per week for around six weeks now.
It's always a different PG, and the repair command always helps after a while.
But the regular re-occurrence seems a bit unsettling.
How would we best troubleshoot this?
We are currently on ceph version 17.2.1
(ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
Best Wishes,
Mathias
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io