Hi Ashley
The command to reset the flag for ALL OSDs is
ceph config set osd bluefs_preextend_wal_files false
And for just an individual OSD:
ceph config set osd.5 bluefs_preextend_wal_files false
And to remove it from an individual one (so you just have the global one
left):
ceph config rm osd.5 bluefs_preextend_wal_files
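If you want to double-check what will actually take effect, something
like this should show the stored value (just how I'd verify it, not an
official procedure):
ceph config get osd bluefs_preextend_wal_files
ceph config get osd.5 bluefs_preextend_wal_files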
BUT: I can't stress enough how important it is to only take down ONE OSD
AT A TIME. And not to take any others down until that one is properly
back up (replaced and backfilled if necessary). *Rebooting nodes without
doing this may very well cause irretrievable data loss, no matter how
long it has been since you reset that parameter.* This all seems to have
worked for me but you should get expert advice.
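As a sanity check before touching each OSD, I'd wait for something like
the following to come back clean (again just my habit, not official
guidance; ok-to-stop only reports whether stopping that OSD would leave
PGs unavailable):
ceph -s
ceph osd ok-to-stop osd.5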
Regards, Chris
On 23/05/2020 16:32, Ashley Merrick wrote:
Hello,
Great news! Can you confirm the exact command you used to inject the
value so I can replicate your exact steps?
I will do that and then leave it a good couple of days before trying a
reboot, to make sure the WAL is completely flushed.
Thanks
Ashley
---- On Sat, 23 May 2020 23:20:45 +0800 *chris.palmer(a)pobox.com *
wrote ----
Status update:
We seem to have success. I followed the steps below. Only one more OSD
(on node3) failed to restart, showing the same WAL corruption messages.
After replacing that & backfilling I could then restart it. So we have a
healthy cluster with restartable OSDs again, with
bluefs_preextend_wal_files=false until it's deemed safe to re-enable it.
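(For reference, the replace-and-backfill step is essentially the
standard out/destroy/zap/recreate sequence, roughly like the below -
placeholder device and ID, and you'd add the WAL/DB options if the OSD
has a separate device:)
ceph osd out 2
ceph osd destroy 2 --yes-i-really-mean-it
ceph-volume lvm zap --destroy /dev/sdX
ceph-volume lvm create --osd-id 2 --data /dev/sdX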
Many thanks Igor!
Regards, Chris
On 23/05/2020 11:06, Chris Palmer wrote:
Hi Ashley
Setting bluefs_preextend_wal_files to false should stop any further
corruption of the WAL (subject to the small risk of doing this while
the OSD is active). Over time WAL blocks will be recycled and
overwritten with new good blocks, so the extent of the corruption may
decrease or even disappear entirely. However you can't tell whether this
has happened. But leaving each OSD running for a while may decrease the
chances of having to recreate it.
Having tried changing the parameter on one, then another, I've taken
the risk of resetting it on all (running) OSDs, and nothing untoward
seems to have happened. I have removed and recreated both failed OSDs
(both on the node that was rebooted). They are in different crush
device classes so I know that they are used by discrete sets of pgs.
osd.9 has been recreated, backfilled, and stopped/started without
issue. osd.2 has been recreated and is currently backfilling. When
that has finished I will restart osd.2 and expect that the restart
will not find any corruption.
Following that I will cycle through all other OSDs, stopping and
starting each in turn. If one fails to restart, I will replace it,
wait until it backfills, then stop/start it.
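In case it's useful, the per-OSD restart cycle looks roughly like this
(assuming OSDs run as plain systemd services rather than under cephadm,
and using osd.3 purely as an example):
ceph osd ok-to-stop osd.3
systemctl stop ceph-osd@3
systemctl start ceph-osd@3
then wait for "ceph -s" to show HEALTH_OK before moving on to the next
one.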
Do be aware that you can set the parameter globally (for all OSDs)
and/or individually. I made sure the global setting was in place
before creating new OSDs. (There might be other ways to achieve this
on the command line when creating a new one.)

Hope that's clear. But once again, please don't take this as advice
on what you should do. That should come from the experts!
Regards, Chris
On 23/05/2020 10:03, Ashley Merrick wrote:
> Hello Chris,
>
> Great to hear, few questions.
>
> Once you have injected the bluefs_preextend_wal_files to false, are
> you just rebuilding the OSDs that failed? Or are you going through
> and rebuilding every OSD, even the working ones?
>
> Or does setting the bluefs_preextend_wal_files value to false and
> leaving the OSD running fix the WAL automatically?
>
> Thanks
>
>
> ---- On Sat, 23 May 2020 15:53:42 +0800 *Chris Palmer
> <chris.palmer(a)pobox.com>* wrote ----
>
> Hi Ashley
>
> Igor has done a great job of tracking down the problem, and we
> have finally shown evidence of the type of corruption it would
> produce in one of my WALs. Our feeling at the moment is that the
> problem can be worked around by setting bluefs_preextend_wal_files
> to false on affected OSDs while they are running (but see below),
> although Igor does note that there is a small risk in doing this.
> I've agreed a plan of action based on this route, recreating the
> failed OSDs, and then cycling through the others until all are
> healthy. I've started this now, and so far it looks promising,
> although of course I have to wait for recovery/rebalancing. This
> is the fastest route to recovery, although there are other options.
>
> I'll post as it progresses. The good news seems to be that there
> shouldn't be any actual data corruption or loss, providing that
> this can be done before OSDs are taken down (other than as part of
> this process). My understanding is that there will be some degree
> of performance penalty until the root cause is fixed in the next
> release and preextending can be turned back on. However it does
> seem like I can get back to a stable/safe position without waiting
> for a software release.
>
> I'm just working through this at the moment though, so please
> don't take the above as any form of recommendation. It is
> important not to try to restart OSDs in the meantime though. I'm
> sure Igor will publish some more expert recommendations in due
> course...
>
> Regards, Chris
>
>
> On 23/05/2020 06:54, Ashley Merrick wrote:
>
>
> Thanks Igor,
>
> Do you have any idea on an ETA or plan for people that are
> running 15.2.2 to be able to patch / fix the issue?
>
> I had a read of the ticket and it seems the corruption is
> happening but the WAL is not read till OSD restart, so I
> imagine we will need some form of fix / patch we can apply to
> a running OSD before we then restart the OSD, as a normal OSD
> upgrade will require the OSD to restart to apply the code,
> resulting in a corrupt OSD.
>
> Thanks
>
>
> ---- On Sat, 23 May 2020 00:12:59 +0800 *Igor Fedotov
> <ifedotov(a)suse.de>* wrote ----
>
> Status update:
>
> Finally we have the first patch to fix the issue in master:
>
> On 5/21/2020 2:07 PM, Igor Fedotov wrote:
>
> @Chris - unfortunately it looks like the corruption is
> permanent since valid WAL data are presumably overwritten with
> other data. Hence I don't know any way to recover - perhaps you
> can try cutting the WAL file off, which will allow the OSD to
> start, with some of the latest ops lost. One can use the
> exported BlueFS as a drop-in replacement for the regular DB
> volume but I'm not aware of the details.
>
> And the above are just speculations, can't say for sure if
> it helps...
>
> I can't explain why the WAL doesn't have a zero block in your
> case though. Little chance this is a different issue. Just in
> case - could you please search for 32K zero blocks over the
> whole file? And the same for another OSD?
>
>
> Thanks,
>
> Igor
>
> > Short update on the issue:
> >
> > Finally we're able to reproduce the issue in master (not
> > octopus), investigating further..
> >
> > @Chris - to make sure you're facing the same issue could
> > you please check the content of the broken file. To do so:
> >
> > 1) run "ceph-bluestore-tool --path <path-to-osd> --out-dir
> > <target dir> --command bluefs-export"
> >
> > This will export bluefs files to <target dir>
> >
> > 2) Check the content of file db.wal/002040.log at offset 0x470000
> >
> > This will presumably contain 32K of zero bytes. Is this
> > the case?
> >
> >
> > No hurry as I'm just making sure symptoms in Octopus are
> > the same...
> >
> >
> > Thanks,
> >
> > Igor
> >
> > On 5/20/2020 5:24 PM, Igor Fedotov wrote:
> >> Chris,
> >>
> >> got them, thanks!
> >>
> >> Investigating....
> >>
> >>
> >> Thanks,
> >>
> >> Igor
> >>
> >> On 5/20/2020 5:23 PM, Chris Palmer wrote:
> >>> Hi Igor
> >>> I've sent you these directly as they're a bit chunky. Let me
> >>> know if you haven't got them.
> >>> Thx, Chris
> >>>
> >>> On 20/05/2020 14:43, Igor Fedotov wrote:
> >>>> Hi Chris,
> >>>>
> >>>> could you please share the full log prior to the first failure?
> >>>>
> >>>> Also if possible please set debug-bluestore/debug-bluefs to 20
> >>>> and collect another one for a failed OSD startup.
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Igor
> >>>>
> >>>>
> >>>> On 5/20/2020 4:39 PM, Chris Palmer wrote:
> >>>>> I'm getting similar errors after rebooting a node. Cluster was
> >>>>> upgraded 15.2.1 -> 15.2.2 yesterday. No problems after rebooting
> >>>>> during upgrade.
> >>>>>
> >>>>> On the node I just rebooted, 2/4 OSDs won't restart. Similar logs
> >>>>> from both. Logs from one below.
> >>>>> Neither OSD has compression enabled, although there is a
> >>>>> compression-related error in the log.
> >>>>> Both are replicated x3. One has data on HDD & a separate WAL/DB
> >>>>> on an NVMe partition, the other is everything on an NVMe
> >>>>> partition only.
> >>>>>
> >>>>> Feeling kinda nervous here - advice welcomed!!
> >>>>>
> >>>>> Thx, Chris
> >>>>>
> >>>>>
> >>>>>
> >>>>> 2020-05-20T13:14:00.837+0100 7f2e0d273700 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 3423870535 in db/000304.sst offset 18446744073709551615 size 18446744073709551615
> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-000312 succeeded,manifest_file_number is 312, next_file_number is 314, last_sequence is 22320582, log_number is 309,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0
> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/version_set.cc:3766] Column family [default] (ID 0), log number is 309
> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1589976840843199, "job": 1, "event": "recovery_started", "log_files": [313]}
> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #313 mode 0
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 9044 bytes; Corruption: error in middle of record
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 86 bytes; Corruption: missing start of fragmented record(2)
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb: [db/db_impl.cc:563] Shutdown complete
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 rocksdb: Corruption: error in middle of record
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 bluestore(/var/lib/ceph/osd/ceph-9) _open_db erroring opening db:
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bluefs umount
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 fbmap_alloc 0x55daf2b3a900 shutdown
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bdev(0x55daf3838700 /var/lib/ceph/osd/ceph-9/block) close
> >>>>> 2020-05-20T13:14:01.093+0100 7f2e1957ee00 1 bdev(0x55daf3838000 /var/lib/ceph/osd/ceph-9/block) close
> >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0 OSD:init: unable to mount object store
> >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 ESC[0;31m ** ERROR: osd init failed: (5) Input/output errorESC[0m
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io