Hmm, I'm getting a bit confused. Could you also
send the output of "ceph osd pool ls detail"?
File ceph-osd-pool-ls-detail.txt attached.
Did you look at the disk/controller cache settings?
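If the disks are plain SATA, the drive's own volatile write cache is worth
checking too. For example, something like this (assuming the disk is /dev/sda;
adjust as needed):

  # query the drive's volatile write-cache setting
  hdparm -W /dev/sda
  # an on-board SATA/RAID controller, if any, shows up here
  lspci | grep -i -E 'sata|raid'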
I don't have disk controllers on the Ceph machines. The hard disk is directly
attached to the motherboard via a SATA cable. There may be an on-chip disk
controller on the motherboard, but I'm not sure.
If your worry is fsync persistence: I have thoroughly tested database fsync
reliability on Ceph RBD at hundreds of transactions per second, removing the
network cable, restarting the database machine, etc. while inserts were going
on, and I did not lose a single transaction. I simulated this many times, and
persistence on my Ceph cluster was perfect (i.e. not a single loss).
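Roughly, the shape of the test was as in the sketch below (illustrative only:
psql and the table t with a seq column are stand-ins for whatever client and
schema you use; any database that commits synchronously would do):

  # keep inserting sequential values, one committed transaction each,
  # while pulling the network cable / power-cycling the database machine
  i=0
  while true; do
      psql -c "INSERT INTO t (seq) VALUES ($i)" && i=$((i+1))
  done
  # after recovery: every value the client saw acknowledged must be in
  # the table; a gap would mean a lost transaction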
I think you should start a deep-scrub with "ceph
pg deep-scrub 3.b" and record the output of "ceph -w | grep '3\.b'"
(note the single quotes).
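In other words, in two terminals:

  # terminal 1: follow the cluster log, keeping only lines for pg 3.b
  ceph -w | grep '3\.b'
  # terminal 2: trigger the deep-scrub of pg 3.b
  ceph pg deep-scrub 3.b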
The error messages you included in one of your first
e-mails cover only 1 of the 3 scrub errors (3 log lines per error). We need to
find all 3 errors.
I ran "ceph pg deep-scrub 3.b" again; here is the whole output of ceph -w:
2020-11-02 22:33:48.224392 osd.0 [ERR] 3.b shard 2 soid
3:d577e975:::1000023675e.00000000:head : candidate had a missing snapset key, candidate
had a missing info key
2020-11-02 22:33:48.224396 osd.0 [ERR] 3.b soid 3:d577e975:::1000023675e.00000000:head :
failed to pick suitable object info
2020-11-02 22:35:30.087042 osd.0 [ERR] 3.b deep-scrub 3 errors
Btw, I'm very grateful for your perseverance on this.
Best regards
Sagara