Hello,
I have a pool of more than 300 OSDs, all on the same disk model (Seagate
ST1800MM0129, 1.64 TiB).
Only one OSD crashes regularly, but I cannot identify a root cause.
Based on the output of smartctl the disk is ok.
# smartctl -a -d megaraid,1 /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: LENOVO-X
Product: ST1800MM0129
Revision: L2B6
Compliance: SPC-4
User Capacity: 1,800,360,124,416 bytes [1.80 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 10500 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c500bb7822cf
Serial number: WBN0QHX80000E852944J
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon May 18 09:19:41 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 68
Power on minutes since format <not available>
Current Drive Temperature: 33 C
Drive Trip Temperature: 65 C
Manufactured in week 31 of year 2018
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 21
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 709
Elements in grown defect list: 18
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3278853896        1         0  3278853897         32      83933.567          19
write:           0        0         0           0          0      24093.894           0
verify: 3080361880        0         0  3080361880          0      12630.494           0
Non-medium error count: 244
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    3761                 - [-   -    -]
# 2  Background short  Completed                   -    3737                 - [-   -    -]
# 3  Background short  Completed                   -    3713                 - [-   -    -]
# 4  Background short  Completed                   -    3689                 - [-   -    -]
# 5  Background short  Completed                   -    3665                 - [-   -    -]
# 6  Background short  Completed                   -    3641                 - [-   -    -]
# 7  Background short  Completed                   -    3617                 - [-   -    -]
# 8  Background short  Completed                   -    3593                 - [-   -    -]
# 9  Background long   Completed                   -    3569                 - [-   -    -]
#10  Background short  Completed                   -    3545                 - [-   -    -]
#11  Background short  Completed                   -    3521                 - [-   -    -]
#12  Background short  Completed                   -    3497                 - [-   -    -]
#13  Background short  Completed                   -    3473                 - [-   -    -]
#14  Background short  Completed                   -    3449                 - [-   -    -]
#15  Background short  Completed                   -    3425                 - [-   -    -]
#16  Background short  Completed                   -    3401                 - [-   -    -]
#17  Background short  Completed                   -    3377                 - [-   -    -]
#18  Background short  Completed                   -    3353                 - [-   -    -]
#19  Background short  Completed                   -    3329                 - [-   -    -]
#20  Background short  Completed                   -    3305                 - [-   -    -]
Long (extended) Self-test duration: 9459 seconds [157.7 minutes]
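The fields in this output that usually matter when judging a SAS disk are the health status, the size of the grown defect list, the uncorrected read errors, and the non-medium error count. As a minimal sketch for comparing this disk against the others in the pool, they could be pulled out of the text as follows. This is not part of smartctl: the function name `summarize_smart` is made up for illustration, and the code assumes each record sits on a single line, as smartctl prints it before any mail-client wrapping.

```python
import re


def summarize_smart(text):
    """Extract a few health indicators from `smartctl -a` text for a SAS disk."""
    patterns = {
        "health": r"SMART Health Status:\s*(.+)",
        "grown_defects": r"Elements in grown defect list:\s*(\d+)",
        "non_medium_errors": r"Non-medium error count:\s*(\d+)",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            value = match.group(1).strip()
            # Numeric fields become ints; the health status stays a string.
            summary[key] = int(value) if value.isdigit() else value

    # The last column of the "read:" row is the total uncorrected error count.
    read_row = re.search(r"^read:\s+(.+)$", text, re.MULTILINE)
    if read_row:
        summary["read_uncorrected"] = int(read_row.group(1).split()[-1])
    return summary


# Lines copied from the output above, un-wrapped.
sample = """\
SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]
Elements in grown defect list: 18
read:   3278853896        1         0  3278853897         32      83933.567          19
Non-medium error count:      244
"""

for field, value in summarize_smart(sample).items():
    print(f"{field}: {value}")
```

Running the same function over the smartctl output of a few healthy disks of the same model would show whether these counters stand out on this one.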
I have attached the log of the affected OSD.
THX
Thomas
I have uploaded 1 file belonging to this email:
ceph-osd.92.log.1.gz <https://we.tl/t-7DzNCDP3iZ> (578 KB, via WeTransfer)
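For anyone who wants a quick first pass over the compressed log before reading it in full, here is a rough sketch that counts lines containing a few keywords. The keywords and the function name `count_keywords` are assumptions for illustration only; they are not taken from the log itself and should be adjusted to whatever the OSD actually prints around a crash.

```python
import gzip
from collections import Counter

# Hypothetical keywords; not derived from the attached log.
KEYWORDS = ("abort", "assert", "error")


def count_keywords(path, keywords=KEYWORDS):
    """Count lines in a gzip-compressed log that contain each keyword.

    Matching is case-insensitive; a line is counted once per keyword.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
        for line in log:
            lowered = line.lower()
            for keyword in keywords:
                if keyword in lowered:
                    counts[keyword] += 1
    return counts
```

With the attachment downloaded to the current directory, `count_keywords("ceph-osd.92.log.1.gz")` returns a `Counter` keyed by keyword.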