If you're using an Icinga active check that only looks for
SMART overall-health self-assessment test result: PASSED
then it's not doing much for you. That binary status can be shown for a drive that
is decidedly an ex-parrot. You have to look at specific attributes, which is thorny since they
aren't consistently implemented. drivedb.h is a downright mess, which doesn't
help.
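A minimal sketch of what "look at specific attributes" could mean in practice, assuming ATA drives and the JSON output of `smartctl -j -A /dev/sdX` (smartmontools 7.0 or later). The watched IDs (5, 187, 197, 198) and the "any nonzero raw value" rule are my own assumptions, not a standard, and the sample data is made up. Some vendors encode raw values differently, so treat this as a starting point only.

```python
import json

# Attribute IDs often watched on ATA drives. This set is an assumption:
# 5 Reallocated_Sector_Ct, 187 Reported_Uncorrect,
# 197 Current_Pending_Sector, 198 Offline_Uncorrectable.
WATCHED_IDS = {5, 187, 197, 198}


def flag_attributes(smart_json, watched=WATCHED_IDS):
    """Return (id, name, raw_value) for watched attributes whose raw value is nonzero."""
    table = smart_json.get("ata_smart_attributes", {}).get("table", [])
    flagged = []
    for attr in table:
        raw = attr.get("raw", {}).get("value", 0)
        if attr.get("id") in watched and raw > 0:
            flagged.append((attr["id"], attr.get("name", "?"), raw))
    return flagged


# Made-up sample in the shape smartctl's JSON output uses for ATA attributes.
sample = json.loads("""
{
  "smart_status": {"passed": true},
  "ata_smart_attributes": {
    "table": [
      {"id": 5,   "name": "Reallocated_Sector_Ct",  "raw": {"value": 0}},
      {"id": 9,   "name": "Power_On_Hours",         "raw": {"value": 41000}},
      {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 8}}
    ]
  }
}
""")

for attr_id, name, raw in flag_attributes(sample):
    print(f"WARNING: attribute {attr_id} ({name}) raw value {raw}")
```

Note that the sample drive still reports an overall PASSED status while having pending sectors, which is exactly the case the overall-health check misses.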
----- On 12 Apr 24, at 15:17, Albert Shih Albert.Shih(a)obspm.fr wrote:
On 12/04/2024 at 12:56:12+0200, Frédéric Nass wrote:
Hi,
> Have you checked the hardware status of the involved drives other than with
> smartctl? Like with the manufacturer's tools / WebUI (iDRAC / perccli for Dell
> hardware, for example).
Yes, all my disks are «under» periodic checks with smartctl + Icinga.
Actually, I meant lower level tools (drive / server vendor tools).
If these tools don't report any media errors (that is, bad blocks on disks), then
you might just be facing the bit rot phenomenon. But this is very rare and
should happen in a sysadmin's lifetime as often as a Royal Flush hand in a
professional poker player's lifetime. ;-)
If no media error is reported, then you might want to check and update the
firmware of all drives.
You're perfectly right.
It's just a newbie error: I checked the «main» OSD of the PG (meaning the
first in the list) but forgot to check the others.
Ok.
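To avoid looking only at the primary, one option is to let Ceph say which shards carry errors. A rough sketch, assuming the JSON produced by `rados list-inconsistent-obj <pgid> --format=json` has the usual `inconsistents` / `shards` layout; the object name and OSD numbers below are invented for illustration.

```python
def osds_with_shard_errors(report):
    """Map object name -> [(osd, is_primary, errors)] for shards that report errors."""
    result = {}
    for item in report.get("inconsistents", []):
        name = item.get("object", {}).get("name", "?")
        bad = [
            (shard.get("osd"), shard.get("primary", False), shard.get("errors", []))
            for shard in item.get("shards", [])
            if shard.get("errors")
        ]
        if bad:
            result[name] = bad
    return result


# Invented sample: the primary (osd 12) is clean, a replica (osd 37) is not.
sample_report = {
    "epoch": 1234,
    "inconsistents": [
        {
            "object": {"name": "hypothetical_object_1"},
            "errors": [],
            "union_shard_errors": ["read_error"],
            "shards": [
                {"osd": 12, "primary": True, "errors": []},
                {"osd": 37, "primary": False, "errors": ["read_error"]},
                {"osd": 51, "primary": False, "errors": []},
            ],
        }
    ],
}

for obj, shards in osds_with_shard_errors(sample_report).items():
    for osd, is_primary, errors in shards:
        role = "primary" if is_primary else "replica"
        print(f"{obj}: osd.{osd} ({role}) -> {', '.join(errors)}")
```

The OSD numbers it prints are the ones whose host and disk are worth inspecting first.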
On one server I do indeed get some errors on a disk.
But strangely smartctl reports nothing. I will add a check with dmesg.
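A possible shape for such a dmesg check, as a sketch only. The two regular expressions are assumptions about how block-layer I/O errors and SCSI medium errors typically appear in the kernel log; the exact wording varies with kernel version and driver, so adjust them to what your hosts actually print. The sample lines are made up.

```python
import re

# Assumed log patterns; tune them against real output from your own hosts.
PATTERNS = [
    re.compile(r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)"),
    re.compile(r"\[(?P<dev>sd[a-z]+)\].*Medium Error", re.IGNORECASE),
]


def find_disk_errors(lines):
    """Return (device, line) for every log line matching one of PATTERNS."""
    hits = []
    for line in lines:
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                hits.append((match.group("dev"), line.strip()))
                break  # one hit per line is enough
    return hits


# Made-up kernel log excerpt.
sample_log = [
    "[12345.678] blk_update_request: I/O error, dev sdc, sector 123456 op 0x0:(READ)",
    "[12345.680] sd 0:0:2:0: [sdc] tag#5 Sense Key : Medium Error [current]",
    "[12346.000] EXT4-fs (sda1): mounted filesystem with ordered data mode",
]

for dev, line in find_disk_errors(sample_log):
    print(f"{dev}: {line}")
```

On a real host the lines could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout.splitlines()`, and an Icinga plugin would exit non-zero when the returned list is not empty.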
That's why I pointed you to the drive / server vendor tools earlier as sometimes
smartctl is missing the information you want.
Once you've figured it out, you may enable osd_scrub_auto_repair=true to have these
inconsistencies repaired automatically on deep-scrubbing, but make sure you're
using the alert module [1] so that you at least get informed about the scrub errors.
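For reference, the two settings above map to the following ceph CLI calls, wrapped here in a small dry-run helper. This is only a sketch: the `apply` helper is hypothetical, and the alerts module additionally needs its SMTP settings configured before it can send anything, which is not shown.

```python
import subprocess

# The two CLI calls corresponding to the advice above.
COMMANDS = [
    ["ceph", "config", "set", "osd", "osd_scrub_auto_repair", "true"],
    ["ceph", "mgr", "module", "enable", "alerts"],
]


def apply(commands, dry_run=True):
    """Run each command, or only report it when dry_run is True."""
    done = []
    for cmd in commands:
        if not dry_run:
            # check=True raises CalledProcessError if ceph returns non-zero.
            subprocess.run(cmd, check=True)
        done.append(" ".join(cmd))
    return done


# Dry run: prints what would be executed without touching the cluster.
for line in apply(COMMANDS, dry_run=True):
    print(line)
```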
Thanks. I will look into it, because we already have icinga2 on site, so I use
icinga2 to check the cluster.
Is there a list of what the alert module is going to check?
Basically the module checks for ceph status (ceph -s) changes.
https://github.com/ceph/ceph/blob/main/src/pybind/mgr/alerts/module.py
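To illustrate what "checking for status changes" can look like, here is a simplified sketch that compares the health checks of two `ceph status --format json` snapshots. It is not the module's code, just the general idea, and the two snapshots are invented.

```python
def diff_health_checks(old_status, new_status):
    """Compare the health check codes of two parsed `ceph status` JSON documents."""
    old = old_status.get("health", {}).get("checks", {})
    new = new_status.get("health", {}).get("checks", {})
    return {
        "new": sorted(set(new) - set(old)),
        "resolved": sorted(set(old) - set(new)),
    }


# Invented snapshots: scrub errors appear between two polls.
before = {"health": {"status": "HEALTH_OK", "checks": {}}}
after = {
    "health": {
        "status": "HEALTH_ERR",
        "checks": {
            "OSD_SCRUB_ERRORS": {
                "severity": "HEALTH_ERR",
                "summary": {"message": "1 scrub errors"},
            },
            "PG_DAMAGED": {
                "severity": "HEALTH_ERR",
                "summary": {"message": "Possible data damage: 1 pg inconsistent"},
            },
        },
    }
}

changes = diff_health_checks(before, after)
print("new checks:", changes["new"])
print("resolved checks:", changes["resolved"])
```

An icinga2 check could do the same comparison, or simply alert on any check code present in the current snapshot.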
Regards,
Frédéric.
Regards
JAS
--
Albert SHIH 🦫 🐸
France
Local time:
Fri. 12 April 2024 15:13:13 CEST
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io