Hi Welby,
could you share an OSD log containing such errors, please?
Also - David mentioned a 'repair' which fixes the issue - is that a
BlueStore repair or a PG repair?
If the latter, could you please try a BlueStore deep fsck (via
'ceph-bluestore-tool --command fsck --deep 1') immediately after the
failure has been discovered? Does it succeed?
Thanks,
Igor
On 9/14/2020 8:45 PM, Welby McRoberts wrote:
Hi Igor
We'll take a look at disabling swap on the nodes and see if that
improves the situation.
Having checked across all OSDs, we're not seeing
bluestore_reads_with_retries at anything other than zero. We see
anywhere from 3-10 occurrences of the error per week, but usually only
one or two PGs are inconsistent at any one time.
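For anyone wanting to repeat that check, here is a minimal sketch of pulling the counter out of an OSD's perf dump (the counter is typically fetched with something like 'ceph daemon osd.N perf dump' on the OSD host; the JSON below is a trimmed, assumed sample of that output, not a real dump):

```python
import json

# Trimmed, hypothetical sample of an OSD perf dump; real dumps contain
# many more sections and counters.
perf_dump = json.loads("""
{
  "bluestore": {
    "bluestore_reads_with_retries": 0
  }
}
""")

def reads_with_retries(dump: dict) -> int:
    # The counter is assumed to live in the "bluestore" section;
    # default to 0 if the section or counter is absent.
    return dump.get("bluestore", {}).get("bluestore_reads_with_retries", 0)

print(reads_with_retries(perf_dump))  # prints 0 for this sample
```

A non-zero value on any OSD would point at the read-retry path being exercised.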
Thanks
Welby
On Mon, Sep 14, 2020 at 12:17 PM Igor Fedotov <ifedotov@suse.de> wrote:
Hi David,
you might want to try disabling swap on your nodes. It looks like there
is some implicit correlation between such read errors and enabled
swapping.
Also, I'm wondering whether you can observe non-zero values for the
"bluestore_reads_with_retries" performance counter on your OSDs. How
widespread are these cases? How high does the counter get?
Thanks,
Igor
On 9/9/2020 4:59 PM, David Orman wrote:
Right, you can see the previously referenced ticket/bug in the link I
had provided. It's definitely not an unknown situation.
We have another one today:
debug 2020-09-09T06:49:36.595+0000 7f570871d700 -1 bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device location [0x2f387d70000~1000], logical extent 0xe0000~1000, object 0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
debug 2020-09-09T06:49:36.611+0000 7f570871d700 -1 bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device location [0x2f387d70000~1000], logical extent 0xe0000~1000, object 0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
debug 2020-09-09T06:49:36.611+0000 7f570871d700 -1 bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device location [0x2f387d70000~1000], logical extent 0xe0000~1000, object 0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
debug 2020-09-09T06:49:36.611+0000 7f570871d700 -1 bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device location [0x2f387d70000~1000], logical extent 0xe0000~1000, object 0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
debug 2020-09-09T06:49:37.315+0000 7f570871d700 -1 log_channel(cluster) log [ERR] : 2.3fe shard 123(0) soid 2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head : candidate had a read error
debug 2020-09-09T06:57:08.930+0000 7f570871d700 -1 log_channel(cluster) log [ERR] : 2.3fes0 deep-scrub 0 missing, 1 inconsistent objects
debug 2020-09-09T06:57:08.930+0000 7f570871d700 -1 log_channel(cluster) log [ERR] : 2.3fe deep-scrub 1 errors
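When triaging a batch of these, it can help to pull the interesting fields out of each _verify_csum line mechanically. A small sketch, with the regex tailored to the exact log lines quoted in this thread (other Ceph versions may format these messages differently):

```python
import re

# One of the _verify_csum lines quoted above, joined onto a single line.
line = ("debug 2020-09-09T06:49:36.595+0000 7f570871d700 -1 "
        "bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000 "
        "checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, "
        "device location [0x2f387d70000~1000], logical extent 0xe0000~1000, "
        "object 0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#")

# Extract the OSD data path, the observed and expected checksums, and
# the object name from a _verify_csum error line.
pattern = re.compile(
    r"bluestore\((?P<osd>[^)]+)\) _verify_csum .* "
    r"got (?P<got>0x[0-9a-f]+), expected (?P<want>0x[0-9a-f]+), "
    r".*object (?P<obj>\S+)"
)

m = pattern.search(line)
if m:
    print(m.group("osd"), m.group("got"), m.group("want"), m.group("obj"))
```

Tallying the `got` values across OSD logs this way makes it obvious when one checksum (here 0x6706be76) keeps recurring.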
This happens across the entire cluster, not just one server, so we
don't think it's faulty hardware.
On Wed, Sep 9, 2020 at 12:51 AM Janne Johansson <icepic.dz@gmail.com> wrote:
> I googled "got 0x6706be76, expected" and found some hits regarding ceph, so whatever it is, you are not the first, and that number has some internal meaning.
> The Red Hat solution for a similar issue says that checksum corresponds to reading all zeroes, and hints at a bad write cache on the controller, or something that ends up clearing data instead of writing the correct information on shutdowns.
>
> On Tue, Sep 8, 2020 at 11:21 PM David Orman <ormandj@corenode.com> wrote:
>
>> We're seeing repeated inconsistent PG warnings, generally on the order of 3-10 per week.
>>
>> pg 2.b9 is active+clean+inconsistent, acting [25,117,128,95,151,15]
>>
>> Every time we look at them, we see the same checksum (0x6706be76):
>>
>> debug 2020-08-13T18:39:01.731+0000 7fbc037a7700 -1 bluestore(/var/lib/ceph/osd/ceph-25) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0x61f2021c, device location [0x12b403c0000~1000], logical extent 0x0~1000, object 2#2:0f1a338f:::rbd_data.3.20d195d612942.0000000001db869b:head#
>>
>> This looks a lot like: https://tracker.ceph.com/issues/22464
>> That said, we've got the following versions in play (cluster was created with 15.2.3):
>> ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io