Why BlueRocksDirectory::Fsync only sync metadata? - Dev - lists.ceph.io


Xuehan Xu

29 Sep 2019

9:29 a.m.

Hi, everyone. I'm trying to read the source code of BlueStore. My question is: why is it sufficient to flush only the log in BlueRocksDirectory::Fsync? Shouldn't it flush the file data first? Is it because rocksdb always flushes file data before doing fsync? Thanks :-)


Sage Weil

29 Sep 2019

9 p.m.

On Sun, 29 Sep 2019, Xuehan Xu wrote:

> Hi, everyone. I'm trying to read the source code of BlueStore. My question is: why is it sufficient to flush only the log in BlueRocksDirectory::Fsync? Shouldn't it flush the file data first? Is it because rocksdb always flushes file data before doing fsync? Thanks :-)

My recollection is that rocksdb is always flushing, correct. There are conveniently only a handful of writers in rocksdb, the main ones being log files and sst files. We could probably put an assertion in fsync() to ensure that the FileWriter buffer is empty and flushed...?

s
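The assertion suggested above can be illustrated with a toy model. This is only a sketch, not Ceph code: `ToyFileWriter`, `ToyDirectory`, and all of their members are hypothetical names invented for this example. The idea is that a directory-level fsync which syncs only metadata is safe as long as no writer still holds unflushed bytes, so the sketch checks exactly that before counting a metadata sync.

```python
class ToyFileWriter:
    """Hypothetical stand-in for a buffered file writer (not Ceph's FileWriter)."""

    def __init__(self, name: str):
        self.name = name
        self.buffer = bytearray()   # bytes appended but not yet flushed
        self.flushed = bytearray()  # bytes already handed to the device

    def append(self, data: bytes) -> None:
        self.buffer += data

    def flush(self) -> None:
        # Move everything that is buffered to the "device" and empty the buffer.
        self.flushed += self.buffer
        self.buffer.clear()


class ToyDirectory:
    """Hypothetical directory whose fsync only syncs metadata."""

    def __init__(self):
        self.writers = []
        self.metadata_syncs = 0

    def open_writer(self, name: str) -> ToyFileWriter:
        writer = ToyFileWriter(name)
        self.writers.append(writer)
        return writer

    def fsync(self) -> None:
        # The proposed check: every writer must already be flushed,
        # otherwise syncing metadata alone would leave data behind.
        for writer in self.writers:
            assert not writer.buffer, (
                f"{writer.name}: {len(writer.buffer)} unflushed bytes at fsync"
            )
        self.metadata_syncs += 1
```

In this model, calling `fsync()` after `append()` without an intervening `flush()` raises `AssertionError`, which is the kind of caller bug the assertion is meant to expose.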


Xuehan Xu

10 Oct 2019

10:06 a.m.

> My recollection is that rocksdb is always flushing, correct. There are conveniently only a handful of writers in rocksdb, the main ones being log files and sst files. We could probably put an assertion in fsync() to ensure that the FileWriter buffer is empty and flushed...?

Thanks for your reply, sage :-) I will do that :-)

By the way, I've got another question. It seems that BlueStore tries to provide a kind of atomic I/O mechanism in which data and metadata are either both modified or both untouched. To accomplish this, for modifications larger than prefer_deferred_size, BlueStore allocates new space for the modified data and releases the old space. I think that, in the long run, an initially contiguous file in BlueStore could become scattered after many random modifications. This is actually what we are seeing in our test clusters: after a period of random modification, the sequential read performance of that file degrades significantly.

Should we make this atomic I/O mechanism optional? It seems that most hard disks only guarantee that a sector is never half-modified, and for that, I think, deferred I/O is enough. Am I right? Thanks :-)
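The fragmentation effect described above can be reproduced in a toy model. This is a sketch under strong assumptions, not BlueStore's allocator: the object is a list of fixed-size blocks, every large overwrite gets fresh space from a simplistic bump allocator that never reuses freed blocks, and in-place overwrites leave the layout unchanged. The function names and parameters are hypothetical.

```python
import random


def count_extents(block_map):
    """Count physically contiguous runs seen when reading in logical order."""
    runs = 1
    for prev, cur in zip(block_map, block_map[1:]):
        if cur != prev + 1:
            runs += 1
    return runs


def simulate(num_blocks, writes, write_len, in_place, seed=0):
    """Apply random overwrites and return the resulting number of extents.

    in_place=True  models overwriting the existing blocks.
    in_place=False models allocating new space for every overwrite.
    """
    rng = random.Random(seed)
    block_map = list(range(num_blocks))  # logical block -> physical block
    next_free = num_blocks               # bump allocator: never reuses space
    for _ in range(writes):
        start = rng.randrange(0, num_blocks - write_len + 1)
        if in_place:
            continue  # same physical blocks are rewritten; mapping is unchanged
        for i in range(write_len):
            block_map[start + i] = next_free + i
        next_free += write_len
    return count_extents(block_map)
```

Comparing `simulate(1024, 200, 4, in_place=True)` with `simulate(1024, 200, 4, in_place=False)` shows the in-place layout staying at a single extent while the allocate-on-write layout splits into many, which is why a later sequential read turns into scattered device reads.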


Xuehan Xu

1:12 p.m.


I mean, in the RBD scenario, since most real hard disks only guarantee that a sector is never half-modified, it should be enough to provide an atomic I/O guarantee only for modifications no larger than a disk sector, which deferred I/O already guarantees. So maybe the atomic I/O guarantee for large modifications should be made configurable.


Sage Weil

5:40 p.m.

On Thu, 10 Oct 2019, Xuehan Xu wrote:


> I mean, in the RBD scenario, since most real hard disks only guarantee that a sector is never half-modified, it should be enough to provide an atomic I/O guarantee only for modifications no larger than a disk sector, which deferred I/O already guarantees. So maybe the atomic I/O guarantee for large modifications should be made configurable.

The OSD needs to record both the data update *and* the metadata associated with it (pg log entry) atomically, so atomic sector updates aren't sufficient.

You might try looking at the bluestore_prefer_deferred_size option, which will make writes take the deferred IO path. This gets increasingly inefficient the larger the value is, though!

If we really find that fragmentation is a problem over the long term, we should make the deep scrub process rewrite the data it has read if/when it is too fragmented.

sage
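The trade-off behind raising bluestore_prefer_deferred_size can be sketched with a rough cost model. The assumptions are loud ones: a deferred write is taken to store its payload twice (once in the key-value log, once at its final location), a non-deferred write stores it once in newly allocated space, metadata overhead is ignored, and the `<=` boundary is a guess rather than the exact comparison BlueStore uses. Both function names are hypothetical.

```python
def choose_write_path(write_len: int, prefer_deferred_size: int) -> str:
    """Pick a path for one write (boundary semantics are an assumption)."""
    if write_len <= prefer_deferred_size:
        return "deferred"
    return "new_allocation"


def device_bytes_written(write_len: int, prefer_deferred_size: int) -> int:
    """Approximate payload bytes hitting the device for one write."""
    if choose_write_path(write_len, prefer_deferred_size) == "deferred":
        # Payload is logged first, then written again to its final location.
        return 2 * write_len
    # Payload is written once, into newly allocated space.
    return write_len
```

Under this model, raising the threshold keeps more overwrites in place and avoids fragmentation, but every write that falls under it costs roughly twice its size in device traffic, so the penalty grows with the value chosen.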
