On Tue, 24 Mar 2020, Igor Fedotov wrote:
Hi Sage,
We've got another occurrence for the ticket:
https://tracker.ceph.com/issues/40300
Now I'm trying to understand what's happening in BlueFS when it occurs.
Unfortunately the customer applied the suggested workaround and hence has
likely destroyed the original SST layout.
So I'm wondering whether the BlueFS part of the issue (which prevents the OSD
from restarting) is caused by a single very long read (>4 GB length) from BlueFS.
If so, the BlueRocksSequentialFile::Read implementation seems to be broken
due to its use of int:

  rocksdb::Status Read(size_t n, rocksdb::Slice* result, char* scratch) override {
    int r = fs->read(h, &h->buf, h->buf.pos, n, NULL, scratch);
    ceph_assert(r >= 0);
    *result = rocksdb::Slice(scratch, r);
    ...

Please note that sizeof(int) is 4!
size_t is a long, so we could change the int here (and in _read, and so on
down the stack) to ssize_t...
Also, I'm wondering whether we're obliged to return exactly the requested
amount of data from Read to RocksDB. Couldn't simply capping the read length
in this function be a simple solution?
I think the easiest way to answer that is to look at the PosixStack (or
whatever it's called) implementation in the rocksdb tree and see whether it
ever returns a short read, or whether it wraps read(2) in a loop.
Unfortunately that's not entirely clear...
  if (r < n) {
    if (feof(file_)) {
      // We leave status as ok if we hit the end of the file
      // We also clear the error so that the reads can continue
      // if a new data is written to the file
      clearerr(file_);
    } else {
      // A partial read with an error: return a non-ok status
      s = IOError("While reading file sequentially", filename_, errno);
    }
  }
But, I think the only real reason we'd want to do a short read is if
the read is long due to readahead, in which case the problem is
more that readahead was kludged into the stack at the wrong
point.
sage