My recollection is that rocksdb is always flushing,
correct. There are
conveniently only a handful of writers in rocksdb, the main ones being log
files and sst files.
We could probably put an assertion in fsync() to ensure that the
FileWriter buffer is empty and flushed...?
Thanks for your reply, sage:-) I will do that:-)
By the way, I've got another question here:
It seems that BlueStore tries to provide some kind of atomic
I/O mechanism in which data and metadata are either both modified or
both untouched. To accomplish this, for modifications whose size is
larger than prefer_defer_size, BlueStore will allocate new space for
the modifications and release the old storage space. I think, in the
long run, an initially contiguously stored file in bluestore could become
scattered if there have been many random modifications to that file.
Actually, this is what we are experiencing in our test clusters. The
consequence is that after some period of random modification, the
sequential read performance of that file is significantly degraded.
Should we make this atomic I/O mechanism optional? It seems that most
hard disks only guarantee that a sector is never half-modified, for
which, I think, the deferred I/O path is enough. Am I right? Thanks:-)
I mean, in the RBD scenario, since most real hard disks only
guarantee that a sector is never half-modified, providing the atomic
I/O guarantee only for modifications whose size is less than or equal to
that of a disk sector, which deferred IO already covers, should be
enough. So, maybe, this atomic I/O guarantee for large
modifications should be made configurable.
The OSD needs to record both the data update *and* the metadata associated
with it (pg log entry) atomically, so atomic sector updates alone aren't
enough.
You might try looking at the bluestore_prefer_deferred_size option, which
will make writes up to that size take the deferred IO path. This gets
increasingly inefficient the larger the value is, though!
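For reference, a ceph.conf sketch (assuming the per-device-type _hdd/_ssd
variants present in recent releases; values are illustrative, not
recommendations -- defaults differ by release):

```ini
[osd]
# route writes at or below this size through the deferred (WAL) path
bluestore_prefer_deferred_size_hdd = 65536
bluestore_prefer_deferred_size_ssd = 16384
```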
If we really find that fragmentation is a problem over the long term, we
should make the deep scrub process rewrite the data it has read if/when it
is too fragmented.