Hi Ashley
The command to reset the flag for ALL OSDs is
ceph config set osd bluefs_preextend_wal_files false
And for just an individual OSD:
ceph config set osd.5 bluefs_preextend_wal_files false
And to remove it from an individual one (so you just have the global one
left):
ceph config rm osd.5 bluefs_preextend_wal_files
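If you want to double-check what will actually take effect, something
like this should show the stored value (just how I'd verify it, not an
official procedure):
ceph config get osd bluefs_preextend_wal_files
ceph config get osd.5 bluefs_preextend_wal_files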
BUT: I can't stress enough how important it is to only take down ONE OSD
AT A TIME. And not to take any others down until that one is properly
back up (replaced and backfilled if necessary). *Rebooting nodes without
doing this may very well cause irretrievable data loss, no matter how
long it has been since you reset that parameter.* This all seems to have
worked for me but you should get expert advice.
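As a sanity check before touching each OSD, I'd wait for something like
the following to come back clean (again just my habit, not official
guidance; ok-to-stop only reports whether stopping that OSD would leave
PGs unavailable):
ceph -s
ceph osd ok-to-stop osd.5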
Regards, Chris
On 23/05/2020 16:32, Ashley Merrick wrote:
Hello,
Great news! Can you confirm the exact command you used to inject the
value so I can replicate your exact steps?
I will do that and then leave it a good couple of days before trying a
reboot, to make sure the WAL is completely flushed.
Thanks
Ashley
---- On Sat, 23 May 2020 23:20:45 +0800 *chris.palmer(a)pobox.com *
wrote ----
Status update:
We seem to have success. I followed the steps below. Only one more OSD
(on node3) failed to restart, showing the same WAL corruption messages.
After replacing that & backfilling I could then restart it. So we have a
healthy cluster with restartable OSDs again, with
bluefs_preextend_wal_files=false until it's deemed safe to re-enable it.
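(For reference, the replace-and-backfill step is essentially the
standard out/destroy/zap/recreate sequence, roughly like the below -
placeholder device and ID, and you'd add the WAL/DB options if the OSD
has a separate device:)
ceph osd out 2
ceph osd destroy 2 --yes-i-really-mean-it
ceph-volume lvm zap --destroy /dev/sdX
ceph-volume lvm create --osd-id 2 --data /dev/sdX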
Many thanks Igor!
Regards, Chris
On 23/05/2020 11:06, Chris Palmer wrote:
Hi Ashley
Setting bluefs_preextend_wal_files to false should stop any further
corruption of the WAL (subject to the small risk of doing this while
the OSD is active). Over time WAL blocks will be recycled and
overwritten with new good blocks, so the extent of the corruption may
decrease or even disappear entirely. However you can't tell whether this
has happened. But leaving each OSD running for a while may decrease the
chances of having to recreate it.
Having tried changing the parameter on one, then another, I've taken
the risk of resetting it on all (running) OSDs, and nothing untoward
seems to have happened. I have removed and recreated both failed OSDs
(both on the node that was rebooted). They are in different crush
device classes so I know that they are used by discrete sets of pgs.
osd.9 has been recreated, backfilled, and stopped/started without
issue. osd.2 has been recreated and is currently backfilling. When
that has finished I will restart osd.2 and expect that the restart
will not find any corruption.
Following that I will cycle through all other OSDs, stopping and
starting each in turn. If one fails to restart, I will replace it,
wait until it backfills, then stop/start it.
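In case it's useful, the per-OSD restart cycle looks roughly like this
(assuming OSDs run as plain systemd services rather than under cephadm,
and using osd.3 purely as an example):
ceph osd ok-to-stop osd.3
systemctl stop ceph-osd@3
systemctl start ceph-osd@3
then wait for "ceph -s" to show HEALTH_OK before moving on to the next
one.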
Do be aware that you can set the parameter globally (for all OSDs)
and/or individually. I made sure the global setting was in place
before creating new OSDs. (There might be other ways to achieve this
on the command line when creating a new one.)

Hope that's clear. But once again, please don't take this as advice
on what you should do. That should come from the experts!
Regards, Chris
On 23/05/2020 10:03, Ashley Merrick wrote:
> Hello Chris,
>
> Great to hear, few questions.
>
> Once you have injected the bluefs_preextend_wal_files to false, are
> you just rebuilding the OSDs that failed? Or are you going through
> and rebuilding every OSD, even the working ones?
>
> Or does setting the bluefs_preextend_wal_files value to false and
> leaving the OSD running fix the WAL automatically?
>
> Thanks
>
>
> ---- On Sat, 23 May 2020 15:53:42 +0800 *Chris Palmer
> <chris.palmer(a)pobox.com>* wrote ----
>
> Hi Ashley
>
> Igor has done a great job of tracking down the problem, and we
> have finally shown evidence of the type of corruption it would
> produce in one of my WALs. Our feeling at the moment is that the
> problem can be worked around by setting bluefs_preextend_wal_files
> to false on affected OSDs while they are running (but see below),
> although Igor does note that there is a small risk in doing this.
> I've agreed a plan of action based on this route, recreating the
> failed OSDs, and then cycling through the others until all are
> healthy. I've started this now, and so far it looks promising,
> although of course I have to wait for recovery/rebalancing. This
> is the fastest route to recovery, although there are other options.
>
> I'll post as it progresses. The good news seems to be that there
> shouldn't be any actual data corruption or loss, providing that
> this can be done before OSDs are taken down (other than as part of
> this process). My understanding is that there will be some degree
> of performance penalty until the root cause is fixed in the next
> release and preextending can be turned back on. However it does
> seem like I can get back to a stable/safe position without waiting
> for a software release.
>
> I'm just working through this at the moment though, so please
> don't take the above as any form of recommendation. It is
> important not to try to restart OSDs in the meantime though. I'm
> sure Igor will publish some more expert recommendations in due
> course...
>
> Regards, Chris
>
>
> On 23/05/2020 06:54, Ashley Merrick wrote:
>
>
> Thanks Igor,
>
> Do you have any idea on an ETA or plan for people that are
> running 15.2.2 to be able to patch / fix the issue?
>
> I had a read of the ticket and it seems the corruption is
> happening but the WAL is not read till OSD restart, so I
> imagine we will need some form of fix / patch we can apply to
> a running OSD before we then restart the OSD, as a normal OSD
> upgrade will require the OSD to restart to apply the code,
> resulting in a corrupt OSD.
>
> Thanks
>
>
> ---- On Sat, 23 May 2020 00:12:59 +0800 *Igor Fedotov
> <ifedotov(a)suse.de>* wrote ----
>
> Status update:
>
> Finally we have the first patch to fix the issue in master:
>
> On 5/21/2020 2:07 PM, Igor Fedotov wrote:
>
> @Chris - unfortunately it looks like the corruption is
> permanent since valid WAL data are presumably overwritten with
> other data. Hence I don't know any way to recover - perhaps you
> can try cutting the WAL file off, which will allow the OSD to
> start, with some of the latest ops lost. One can use the
> exported BlueFS as a drop-in replacement for the regular DB
> volume but I'm not aware of the details.
>
> And the above are just speculations, can't say for sure if
> it helps...
>
> I can't explain why the WAL doesn't have a zero block in your
> case though. Little chance this is a different issue. Just in
> case - could you please search for 32K zero blocks over the
> whole file? And the same for another OSD?
>
>
> Thanks,
>
> Igor
>
> > Short update on the issue:
> >
> > Finally we're able to reproduce the issue in master (not
> > octopus), investigating further..
> >
> > @Chris - to make sure you're facing the same issue could
> > you please check the content of the broken file. To do so:
> >
> > 1) run "ceph-bluestore-tool --path <path-to-osd> --out-dir
> > <target dir> --command bluefs-export"
> >
> > This will export bluefs files to <target dir>
> >
> > 2) Check the content of file db.wal/002040.log at offset 0x470000
> >
> > This will presumably contain 32K of zero bytes. Is this
> > the case?
> >
> >
> > No hurry as I'm just making sure symptoms in Octopus are
> > the same...
> >
> >
> > Thanks,
> >
> > Igor
> >
> > On 5/20/2020 5:24 PM, Igor Fedotov wrote:
> >> Chris,
> >>
> >> got them, thanks!
> >>
> >> Investigating....
> >>
> >>
> >> Thanks,
> >>
> >> Igor
> >>
> >> On 5/20/2020 5:23 PM, Chris Palmer wrote:
> >>> Hi Igor
> >>> I've sent you these directly as they're a bit chunky. Let me
> >>> know if you haven't got them.
> >>> Thx, Chris
> >>>
> >>> On 20/05/2020 14:43, Igor Fedotov wrote:
> >>>> Hi Chris,
> >>>>
> >>>> could you please share the full log prior to the first failure?
> >>>>
> >>>> Also if possible please set debug-bluestore/debug-bluefs to 20
> >>>> and collect another one for a failed OSD startup.
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Igor
> >>>>
> >>>>
> >>>> On 5/20/2020 4:39 PM, Chris Palmer wrote:
> >>>>> I'm getting similar errors after rebooting a node. Cluster was
> >>>>> upgraded 15.2.1 -> 15.2.2 yesterday. No problems after rebooting
> >>>>> during upgrade.
> >>>>>
> >>>>> On the node I just rebooted, 2/4 OSDs won't restart. Similar logs
> >>>>> from both. Logs from one below.
> >>>>> Neither OSD has compression enabled, although there is a
> >>>>> compression-related error in the log.
> >>>>> Both are replicated x3. One has data on HDD & a separate WAL/DB
> >>>>> on an NVMe partition, the other is everything on an NVMe
> >>>>> partition only.
> >>>>>
> >>>>> Feeling kinda nervous here - advice welcomed!!
> >>>>>
> >>>>> Thx, Chris
> >>>>>
> >>>>>
> >>>>>
> >>>>> 2020-05-20T13:14:00.837+0100 7f2e0d273700 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 3423870535 in db/000304.sst offset 18446744073709551615 size 18446744073709551615
> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-000312 succeeded,manifest_file_number is 312, next_file_number is 314, last_sequence is 22320582, log_number is 309,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0
> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/version_set.cc:3766] Column family [default] (ID 0), log number is 309
> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1589976840843199, "job": 1, "event": "recovery_started", "log_files": [313]}
> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #313 mode 0
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 9044 bytes; Corruption: error in middle of record
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 86 bytes; Corruption: missing start of fragmented record(2)
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb: [db/db_impl.cc:563] Shutdown complete
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 rocksdb: Corruption: error in middle of record
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 bluestore(/var/lib/ceph/osd/ceph-9) _open_db erroring opening db:
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bluefs umount
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 fbmap_alloc 0x55daf2b3a900 shutdown
> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bdev(0x55daf3838700 /var/lib/ceph/osd/ceph-9/block) close
> >>>>> 2020-05-20T13:14:01.093+0100 7f2e1957ee00 1 bdev(0x55daf3838000 /var/lib/ceph/osd/ceph-9/block) close
> >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0 OSD:init: unable to mount object store
> >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 ESC[0;31m ** ERROR: osd init failed: (5) Input/output errorESC[0m
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io