Hi Ashley
Setting bluefs_preextend_wal_files to false should stop any further
corruption of the WAL (subject to the small risk of doing this while
the OSD is active). Over time WAL blocks will be recycled and
overwritten with new good blocks, so the extent of the corruption may
decrease or even disappear entirely. However, you can't tell whether
this has happened, so leaving each OSD running for a while may still
reduce the chances of having to recreate it.
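For reference, the sort of command I mean for changing it on running
OSDs is something like this (an untested-as-pasted sketch; do verify
the syntax for your release before relying on it):

   ceph tell 'osd.*' injectargs '--bluefs_preextend_wal_files=false'   # all running OSDs
   ceph tell osd.9 injectargs '--bluefs_preextend_wal_files=false'     # a single OSD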
Having tried changing the parameter on one OSD, then another, I've
taken the risk of resetting it on all (running) OSDs, and nothing
untoward seems to have happened. I have removed and recreated both
failed OSDs (both on the node that was rebooted). They are in
different CRUSH device classes, so I know they are used by disjoint
sets of PGs. osd.9 has been recreated, backfilled, and stopped/started
without issue. osd.2 has been recreated and is currently backfilling.
When that has finished I will restart osd.2 and expect that the
restart will not find any corruption.
Following that I will cycle through all other OSDs, stopping and
starting each in turn. If one fails to restart, I will replace it,
wait until it backfills, then stop/start it.
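(For what it's worth, the per-node loop I have in mind is roughly the
following; this assumes a plain systemd, non-containerised deployment,
and the OSD ids are just placeholders for whatever is hosted on that
node:)

   # run on one node at a time, for the OSD ids hosted on that node
   for id in 0 1; do
       systemctl restart ceph-osd@$id
       sleep 60
       ceph osd tree | grep "osd.$id"        # confirm it came back up
       # let the cluster settle before touching the next one
       while ! ceph health | grep -q HEALTH_OK; do sleep 30; done
   done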
Do be aware that you can set the parameter globally (for all OSDs)
and/or for individual OSDs. I made sure the global setting was in
place before creating the new OSDs. (There might be other ways to
achieve this on the command line when creating a new OSD.)
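For illustration, the persistent settings look something like this
(again just a sketch; check the syntax against the docs for your
release):

   ceph config set osd bluefs_preextend_wal_files false     # global, all OSDs
   ceph config set osd.9 bluefs_preextend_wal_files false   # one OSD only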
Hope that's clear. But once again, please don't take this as advice on
what you should do. That should come from the experts!
Regards, Chris
On 23/05/2020 10:03, Ashley Merrick wrote:
Hello Chris,
Great to hear, few questions.
Once you have injected bluefs_preextend_wal_files to false, are you
just rebuilding the OSDs that failed? Or are you going through and
rebuilding every OSD, even the working ones?
Or does setting the bluefs_preextend_wal_files value to false and
leaving the OSD running fix the WAL automatically?
Thanks
---- On Sat, 23 May 2020 15:53:42 +0800 *Chris Palmer
<chris.palmer(a)pobox.com>* wrote ----
Hi Ashley
Igor has done a great job of tracking down the problem, and we
have finally shown evidence of the type of corruption it would
produce in one of my WALs. Our feeling at the moment is that the
problem can be worked around by setting bluefs_preextend_wal_files
to false on affected OSDs while they are running (but see below),
although Igor does note that there is a small risk in doing this.
I've agreed a plan of action based on this route, recreating the
failed OSDs, and then cycling through the others until all are
healthy. I've started this now, and so far it looks promising,
although of course I have to wait for recovery/rebalancing. This
is the fastest route to recovery, although there are other options.
I'll post as it progresses. The good news seems to be that there
shouldn't be any actual data corruption or loss, provided that
this can be done before OSDs are taken down (other than as part of
this process). My understanding is that there will be some degree
of performance penalty until the root cause is fixed in the next
release and preextending can be turned back on. However, it does
seem like I can get back to a stable/safe position without waiting
for a software release.
I'm just working through this at the moment though, so please
don't take the above as any form of recommendation. It is
important not to restart OSDs in the meantime, though. I'm sure
Igor will publish some more expert recommendations in due
course...
Regards, Chris
On 23/05/2020 06:54, Ashley Merrick wrote:
Thanks Igor,
Do you have any idea of an ETA or plan for people that are
running 15.2.2 to be able to patch / fix the issue?
I had a read of the ticket and it seems the corruption is
happening but the WAL is not read until OSD restart, so I
imagine we will need some form of fix / patch we can apply to
a running OSD before we then restart it, as a normal OSD
upgrade will require the OSD to restart to apply the code,
resulting in a corrupt OSD.
Thanks
---- On Sat, 23 May 2020 00:12:59 +0800 *Igor Fedotov
<ifedotov(a)suse.de>* wrote ----
Status update:
Finally we have the first patch to fix the issue in master:
https://github.com/ceph/ceph/pull/35201
And the ticket has been updated with the root cause analysis:
https://tracker.ceph.com/issues/45613
On 5/21/2020 2:07 PM, Igor Fedotov wrote:
@Chris - unfortunately it looks like the corruption is permanent,
since valid WAL data have presumably been overwritten with other
data. Hence I don't know of any way to recover - perhaps you can try
cutting the WAL file off, which would allow the OSD to start, with
some of the latest ops lost. One can use the exported BlueFS as a
drop-in replacement for the regular DB volume, but I'm not aware of
the details. And the above are just speculations - I can't say for
sure whether it would help...
I can't explain why the WAL doesn't have a zero block in your case
though.
There's little chance this is a different issue. Just in case, could
you please search for 32K zero blocks over the whole file? And the
same for the other OSD?
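(Any quick method will do - e.g. something like this untested sketch,
adjusting the path to wherever your exported files are. Note it only
looks at 32 KiB-aligned blocks:)

   f=db.wal/002040.log            # example path within the export dir
   blk=32768
   head -c $blk /dev/zero > /tmp/zero32k
   size=$(stat -c %s "$f")
   n=0
   while [ $((n * blk)) -lt "$size" ]; do
       if dd if="$f" bs=$blk skip=$n count=1 2>/dev/null | cmp -s - /tmp/zero32k; then
           printf 'all-zero 32K block at offset 0x%x\n' $((n * blk))
       fi
       n=$((n + 1))
   done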
Thanks,
Igor
Short update on the issue:
Finally we're able to reproduce the issue in master (not Octopus),
investigating further...
@Chris - to make sure you're facing the same issue could you please
check the content of the broken file. To do so:
1) Run "ceph-bluestore-tool --path <path-to-osd> --out-dir <target
dir> --command bluefs-export". This will export the bluefs files to
<target dir>.
2) Check the content of file db.wal/002040.log at offset 0x470000.
This will presumably contain 32K of zero bytes. Is this the case?
No hurry as I'm just making sure the symptoms in Octopus are the
same...
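(If it helps, an untested check along these lines would dump that
region - adjust the path to wherever the export went; 0x470000 is
4096 * 1136:)

   dd if=<target dir>/db.wal/002040.log bs=4096 skip=1136 count=8 2>/dev/null | od -A x -t x1 | head
   # a fully zero region shows up as a single line of 00s followed by '*'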
Thanks,
Igor
On 5/20/2020 5:24 PM, Igor Fedotov wrote:
> Chris,
>
> got them, thanks!
>
> Investigating....
>
>
> Thanks,
>
> Igor
>
> On 5/20/2020 5:23 PM, Chris Palmer wrote:
>> Hi Igor
>> I've sent you these directly as they're a bit chunky. Let me know
>> if you haven't got them.
>> Thx, Chris
>>
>> On 20/05/2020 14:43, Igor Fedotov wrote:
>>> Hi Chris,
>>>
>>> could you please share the full log prior to the first failure?
>>>
>>> Also if possible please set debug-bluestore/debug-bluefs to 20 and
>>> collect another one for a failed OSD startup.
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 5/20/2020 4:39 PM, Chris Palmer wrote:
>>>> I'm getting similar errors after rebooting a node. Cluster was
>>>> upgraded 15.2.1 -> 15.2.2 yesterday. No problems after rebooting
>>>> during the upgrade.
>>>>
>>>> On the node I just rebooted, 2/4 OSDs won't restart. Similar logs
>>>> from both. Logs from one below.
>>>> Neither OSD has compression enabled, although there is a
>>>> compression-related error in the log.
>>>> Both are replicated x3. One has data on HDD & separate WAL/DB on
>>>> an NVMe partition, the other is everything on an NVMe partition
>>>> only.
>>>>
>>>> Feeling kinda nervous here - advice welcomed!!
>>>>
>>>> Thx, Chris
>>>>
>>>>
>>>>
>>>> 2020-05-20T13:14:00.837+0100 7f2e0d273700 3 rocksdb:
>>>> [table/block_based_table_reader.cc:1117] Encountered error while
>>>> reading data from compression dictionary block Corruption: block
>>>> checksum mismatch: expected 0, got 3423870535 in db/000304.sst
>>>> offset 18446744073709551615 size 18446744073709551615
>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
>>>> [db/version_set.cc:3757] Recovered from manifest
>>>> file:db/MANIFEST-000312 succeeded,manifest_file_number is 312,
>>>> next_file_number is 314, last_sequence is 22320582, log_number is
>>>> 309,prev_log_number is 0,max_column_family is
>>>> 0,min_log_number_to_keep is 0
>>>>
>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
>>>> [db/version_set.cc:3766] Column family [default] (ID 0), log
>>>> number is 309
>>>>
>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: EVENT_LOG_v1
>>>> {"time_micros": 1589976840843199, "job": 1, "event":
>>>> "recovery_started", "log_files": [313]}
>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
>>>> [db/db_impl_open.cc:583] Recovering log #313 mode 0
>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb:
>>>> [db/db_impl_open.cc:518] db.wal/000313.log: dropping 9044 bytes;
>>>> Corruption: error in middle of record
>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb:
>>>> [db/db_impl_open.cc:518] db.wal/000313.log: dropping 86 bytes;
>>>> Corruption: missing start of fragmented record(2)
>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb:
>>>> [db/db_impl.cc:390] Shutdown: canceling all background work
>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb:
>>>> [db/db_impl.cc:563] Shutdown complete
>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 rocksdb: Corruption:
>>>> error in middle of record
>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1
>>>> bluestore(/var/lib/ceph/osd/ceph-9) _open_db erroring opening db:
>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bluefs umount
>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 fbmap_alloc
>>>> 0x55daf2b3a900 shutdown
>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bdev(0x55daf3838700
>>>> /var/lib/ceph/osd/ceph-9/block) close
>>>> 2020-05-20T13:14:01.093+0100 7f2e1957ee00 1 bdev(0x55daf3838000
>>>> /var/lib/ceph/osd/ceph-9/block) close
>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0 OSD:init:
>>>> unable to mount object store
>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 ** ERROR:
>>>> osd init failed: (5) Input/output error
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io