On Tue, 24 Mar 2020, Igor Fedotov wrote:
Hi Sage,
We've got another occurrence for the ticket:
https://tracker.ceph.com/issues/40300
Now I'm trying to understand what's happening in BlueFS when it occurs.
Unfortunately the customer applied the suggested workaround and hence has
likely destroyed the original SST layout.
So I'm wondering whether the BlueFS part of the issue (which prevents the OSD
from restarting) is caused by a single very long read (>4 GB length) from BlueFS.
If so, the BlueRocksSequentialFile::Read implementation seems to be broken
due to its use of int:

  rocksdb::Status Read(size_t n, rocksdb::Slice* result, char* scratch) override {
    int r = fs->read(h, &h->buf, h->buf.pos, n, NULL, scratch);
    ceph_assert(r >= 0);
    *result = rocksdb::Slice(scratch, r);
    ...

Please note that sizeof(int) is 4!
size_t is a long, so we could change the int here (and in _read, and so on
down the stack) to ssize_t...
Also, I'm wondering whether we're obliged to return exactly the requested
amount of data from Read to RocksDB. Couldn't simply capping the read length
in this function be a simple solution?
I think the easiest way to answer that is to look at the PosixStack (or
whatever it's called) implementation in the rocksdb tree and see whether it
ever returns a short read, or whether it wraps read(2) in a loop.
Unfortunately that's not entirely clear...
  if (r < n) {
    if (feof(file_)) {
      // We leave status as ok if we hit the end of the file
      // We also clear the error so that the reads can continue
      // if a new data is written to the file
      clearerr(file_);
    } else {
      // A partial read with an error: return a non-ok status
      s = IOError("While reading file sequentially", filename_, errno);
    }
  }
But, I think the only real reason we'd want to do a short read is if
the read is long due to readahead, in which case the problem is
more that readahead was kludged into the stack at the wrong
point.
sage