High io wait when osd rocksdb is compacting - ceph-users

Wido den Hollander

29 Jul 29 Jul

10:04 a.m.

On 29/07/2020 14:52, Raffael Bachmann wrote:

Hi All, I'm kind of crossposting this from here: https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-cep… But since I'm more and more sure that it's a ceph problem I'll try my luck here. Since updating from Luminous to Nautilus I have a big problem. I have a 3 node cluster. Each cluster has 2 nvme ssd and a 10GBASE-T net for ceph. Every few minutes a osd seems to compact the rocksdb. While doing this it uses alot of I/O and blocks. This basically blocks the whole cluster and no VM/Container can read data for some seconds (minutes). While it happens "iostat -x" looks like this: Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util nvme0n1 0.00 2.00 0.00 24.00 0.00 46.00 0.00 95.83 0.00 0.00 0.00 0.00 12.00 2.00 0.40 nvme1n1 0.00 1495.00 0.00 3924.00 0.00 6099.00 0.00 80.31 0.00 352.39 523.78 0.00 2.62 0.67 100.00 And iotop: Total DISK READ: 0.00 B/s | Total DISK WRITE: 1573.47 K/s Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.43 M/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 2306 be/4 ceph 0.00 B/s 1533.22 K/s 0.00 % 99.99 % ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1] In the ceph-osd log I see that rocksdb is compacting. https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08 The pool and one OSD is nearfull. I'd planed to move some data away to another ceph pool. But now I'm not sure anymore if I should go with ceph. I'l move some data away anyway today to see if that helps, but before the upgrade there was the same amount of data an I haven't had a problem. Any hints to solve this are appreciated.

What model/type of NVMe is this? And on a nearfull cluster these problems can arise, it's usually not a good idea to have OSDs be nearfull. What does 'ceph df' tell you? Wido > > Cheers > Raffael > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Reply

Mark Nelson

10:19 a.m.

Hi Raffael, Adam made a PR this year that shards rocksdb data across different column families to help reduce compaction overhead. The goal is to reduce write-amplification during compaction by storing multiple small LSM hierarchies rather than 1 big one. We've seen evidence that this lowers compaction time and overhead, sometimes significantly. That PR was merged to master on April 26th so I don't believe it's in any of the releases yet but you can test it if you have a non-production cluster available. That PR is here: https://github.com/ceph/ceph/pull/34006 Normally though you should have about 1GB of WAL to absorb writes during compaction and rocksdb automatically slows writes down if the buffers start filling up. You should only see a write stall from compaction if you completely fill all of the buffers. Also, you shouldn't see compaction at one level blocking IO to the entire database. Something seems off to me here. If you have OSD logs, you can see a history of the compaction events by running this script: https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py That can give you an idea of how long your compaction events are lasting and what they are doing. Mark On 7/29/20 7:52 AM, Raffael Bachmann wrote:

...

Hi All, I'm kind of crossposting this from here: https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-cep… But since I'm more and more sure that it's a ceph problem I'll try my luck here. Since updating from Luminous to Nautilus I have a big problem. I have a 3 node cluster. Each cluster has 2 nvme ssd and a 10GBASE-T net for ceph. Every few minutes a osd seems to compact the rocksdb. While doing this it uses alot of I/O and blocks. This basically blocks the whole cluster and no VM/Container can read data for some seconds (minutes). While it happens "iostat -x" looks like this: Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util nvme0n1 0.00 2.00 0.00 24.00 0.00 46.00 0.00 95.83 0.00 0.00 0.00 0.00 12.00 2.00 0.40 nvme1n1 0.00 1495.00 0.00 3924.00 0.00 6099.00 0.00 80.31 0.00 352.39 523.78 0.00 2.62 0.67 100.00 And iotop: Total DISK READ: 0.00 B/s | Total DISK WRITE: 1573.47 K/s Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.43 M/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 2306 be/4 ceph 0.00 B/s 1533.22 K/s 0.00 % 99.99 % ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1] In the ceph-osd log I see that rocksdb is compacting. https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08 The pool and one OSD is nearfull. I'd planed to move some data away to another ceph pool. But now I'm not sure anymore if I should go with ceph. I'l move some data away anyway today to see if that helps, but before the upgrade there was the same amount of data an I haven't had a problem. Any hints to solve this are appreciated. Cheers Raffael _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Reply

Igor Fedotov

11:48 a.m.

Hi Raffael, wondering if all OSDs are suffering from slow compaction or just he one which is "near full"? Do other OSDs has that "log_latency_fn slow operation observed for" lines? Have you tried "osd bench" command for your OSDs? Does it show similar numbers for every OSD? You might want to try manual offline DB compaction using ceph-kvstore-tool. Any improvements after that? Thanks, Igor On 7/29/2020 4:35 PM, Raffael Bachmann wrote: > Hi Mark > > Unfortunately it is the production cluster and I don't have another > one :-( > > This is the output of the log parser. I'have nothing to compare them > to. Stupid me has no more logs from before the upgrade. > > python ceph_rocksdb_log_parser.py ceph-osd.1.log > Compaction Statistics ceph-osd.1.log > Total OSD Log Duration (seconds) 55500.457 > Number of Compaction Events 13 > Avg Compaction Time (seconds) 116.498074615 > Total Compaction Time (seconds) 1514.47497 > Avg Output Size: (MB) 422.757656391 > Total Output Size: (MB) 5495.84953308 > Total Input Records 21019590 > Total Output Records 18093259 > Avg Output Throughput (MB/s) 3.53010211372 > Avg Input Records/second 17994.0419635 > Avg Output Records/second 16449.9710169 > Avg Output/Input Ratio 0.891530624966 > > ceph-osd.1.log > > start_offset compaction_time_seconds output_level > num_output_files total_output_size num_input_records > num_output_records output (MB/s) input (r/s) output (r/s) > output/input ratio > 417.204 70.247058 1 5 261853019 1476689 > 1384444 3.55491754393 21021.3643396 19708.2132607 0.937532547476 > 546.271 128.652685 2 7 473883973 1674393 > 1098908 3.51279861751 13014.8313655 8541.66393807 0.656302313734 > 5761.795 60.460736 1 4 211033833 1041408 > 1013909 3.32873133441 17224.5339521 16769.7098494 0.973594402962 > 14912.985 64.958415 1 4 231336608 1316575 > 1249120 3.3963233477 20267.9668215 19229.5332329 0.948764787422 > 15152.316 238.925764 2 14 944635417 2445094 > 1902084 3.77052068592 10233.6975262 7960.98322825 0.77791855855 > 24607.857 53.022134 1 4 188414045 1029179 > 988116 3.38887973778 19410.36549 18635.915333 0.960101206884 > 31259.993 55.442826 1 4 210856392 1296725 > 1221474 3.62694941814 23388.5083708 22031.2362865 0.941968420444 > 31574.193 313.736584 2 18 1213247010 2928742 > 2359960 3.68794259867 9335.03502416 7522.10650703 0.805793067467 > 37708.375 49.78089 1 3 171888381 974097 > 939847 3.29294101107 19567.6895291 18879.6745096 0.96483923059 > 43219.745 51.798215 1 4 193360867 1246101 > 1172257 3.5600318014 24056.8328465 22631.2238752 0.940739956071 > 48041.751 56.559014 1 4 208216413 1451105 > 1367052 3.5108576209 25656.4762604 24170.3647804 0.942076555453 > 48368.403 325.833185 2 19 1289359869 3196156 > 2489088 3.77380036251 9809.17889011 7639.1482347 0.778775504074 > 52693.952 45.057464 1 3 164730093 943326 > 907000 3.48663339848 20936.0651101 20129.8501842 0.961491573433 > > cheers > Raffael > > > On 29/07/2020 15:19, Mark Nelson wrote: >> Hi Raffael, >> >> >> Adam made a PR this year that shards rocksdb data across different >> column families to help reduce compaction overhead. The goal is to >> reduce write-amplification during compaction by storing multiple >> small LSM hierarchies rather than 1 big one. We've seen evidence that >> this lowers compaction time and overhead, sometimes significantly. >> That PR was merged to master on April 26th so I don't believe it's in >> any of the releases yet but you can test it if you have a >> non-production cluster available. That PR is here: >> >> >> https://github.com/ceph/ceph/pull/34006 >> >> >> Normally though you should have about 1GB of WAL to absorb writes >> during compaction and rocksdb automatically slows writes down if the >> buffers start filling up. You should only see a write stall from >> compaction if you completely fill all of the buffers. Also, you >> shouldn't see compaction at one level blocking IO to the entire >> database. Something seems off to me here. >> >> If you have OSD logs, you can see a history of the compaction events >> by running this script: >> >> https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py >> >> >> That can give you an idea of how long your compaction events are >> lasting and what they are doing. >> >> >> Mark >> >> >> On 7/29/20 7:52 AM, Raffael Bachmann wrote: >>> Hi All, >>> >>> I'm kind of crossposting this from here: >>> https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-cep… >>> But since I'm more and more sure that it's a ceph problem I'll try >>> my luck here. >>> >>> Since updating from Luminous to Nautilus I have a big problem. >>> >>> I have a 3 node cluster. Each cluster has 2 nvme ssd and a 10GBASE-T >>> net for ceph. >>> Every few minutes a osd seems to compact the rocksdb. While doing >>> this it uses alot of I/O and blocks. >>> This basically blocks the whole cluster and no VM/Container can read >>> data for some seconds (minutes). >>> >>> While it happens "iostat -x" looks like this: >>> >>> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s >>> %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util >>> nvme0n1 0.00 2.00 0.00 24.00 0.00 46.00 >>> 0.00 95.83 0.00 0.00 0.00 0.00 12.00 2.00 0.40 >>> nvme1n1 0.00 1495.00 0.00 3924.00 0.00 6099.00 >>> 0.00 80.31 0.00 352.39 523.78 0.00 2.62 0.67 100.00 >>> >>> And iotop: >>> >>> Total DISK READ: 0.00 B/s | Total DISK WRITE: 1573.47 K/s >>> Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.43 M/s >>> TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND >>> 2306 be/4 ceph 0.00 B/s 1533.22 K/s 0.00 % 99.99 % >>> ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph >>> [rocksdb:low1] >>> >>> >>> In the ceph-osd log I see that rocksdb is compacting. >>> https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08 >>> >>> The pool and one OSD is nearfull. I'd planed to move some data away >>> to another ceph pool. But now I'm not sure anymore if I should go >>> with ceph. >>> I'l move some data away anyway today to see if that helps, but >>> before the upgrade there was the same amount of data an I haven't >>> had a problem. >>> >>> Any hints to solve this are appreciated. >>> >>> Cheers >>> Raffael >>> _______________________________________________ >>> ceph-users mailing list -- ceph-users(a)ceph.io >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>> >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Reply

Mark Nelson

11:53 a.m.

Wow, that's crazy. You only had 13 compaction events for that OSD over roughly 15 days but the average compaction time was 116 seconds! Notice too though that the average compaction output size is 422MB with an average output throughput of 3.5MB! That's really slow with RocksDB sitting on an NVMe drive. You are only processing about 16K records/second. Here are some of the results from our internal NVMe (Intel P4510) test cluster looking at Sharded vs Unsharded rocksdb. This was based on master from last fall so figure it's about halfway between Nautilus and Octopus. These results are not exactly comparable to yours since we're using some experimental settings, but your compaction events look like they are orders of magnitude slower. https://docs.google.com/spreadsheets/d/1FYFBxwvE1i28AKoLyqrksHptE1Z523NU3Fa… No wonder you are seeing periodic stalls. How many DBs per NVMe drive? What's your cluster workload typically like? Also, can you see if the NVMe drive aqu-sz is getting big waiting for the requests to be serviced? Mark On 7/29/20 8:35 AM, Raffael Bachmann wrote:

...

Hi Mark Unfortunately it is the production cluster and I don't have another one :-( This is the output of the log parser. I'have nothing to compare them to. Stupid me has no more logs from before the upgrade. python ceph_rocksdb_log_parser.py ceph-osd.1.log Compaction Statistics ceph-osd.1.log Total OSD Log Duration (seconds) 55500.457 Number of Compaction Events 13 Avg Compaction Time (seconds) 116.498074615 Total Compaction Time (seconds) 1514.47497 Avg Output Size: (MB) 422.757656391 Total Output Size: (MB) 5495.84953308 Total Input Records 21019590 Total Output Records 18093259 Avg Output Throughput (MB/s) 3.53010211372 Avg Input Records/second 17994.0419635 Avg Output Records/second 16449.9710169 Avg Output/Input Ratio 0.891530624966 ceph-osd.1.log start_offset compaction_time_seconds output_level num_output_files total_output_size num_input_records num_output_records output (MB/s) input (r/s) output (r/s) output/input ratio 417.204 70.247058 1 5 261853019 1476689 1384444 3.55491754393 21021.3643396 19708.2132607 0.937532547476 546.271 128.652685 2 7 473883973 1674393 1098908 3.51279861751 13014.8313655 8541.66393807 0.656302313734 5761.795 60.460736 1 4 211033833 1041408 1013909 3.32873133441 17224.5339521 16769.7098494 0.973594402962 14912.985 64.958415 1 4 231336608 1316575 1249120 3.3963233477 20267.9668215 19229.5332329 0.948764787422 15152.316 238.925764 2 14 944635417 2445094 1902084 3.77052068592 10233.6975262 7960.98322825 0.77791855855 24607.857 53.022134 1 4 188414045 1029179 988116 3.38887973778 19410.36549 18635.915333 0.960101206884 31259.993 55.442826 1 4 210856392 1296725 1221474 3.62694941814 23388.5083708 22031.2362865 0.941968420444 31574.193 313.736584 2 18 1213247010 2928742 2359960 3.68794259867 9335.03502416 7522.10650703 0.805793067467 37708.375 49.78089 1 3 171888381 974097 939847 3.29294101107 19567.6895291 18879.6745096 0.96483923059 43219.745 51.798215 1 4 193360867 1246101 1172257 3.5600318014 24056.8328465 22631.2238752 0.940739956071 48041.751 56.559014 1 4 208216413 1451105 1367052 3.5108576209 25656.4762604 24170.3647804 0.942076555453 48368.403 325.833185 2 19 1289359869 3196156 2489088 3.77380036251 9809.17889011 7639.1482347 0.778775504074 52693.952 45.057464 1 3 164730093 943326 907000 3.48663339848 20936.0651101 20129.8501842 0.961491573433 cheers Raffael On 29/07/2020 15:19, Mark Nelson wrote:

Hi Raffael, Adam made a PR this year that shards rocksdb data across different column families to help reduce compaction overhead. The goal is to reduce write-amplification during compaction by storing multiple small LSM hierarchies rather than 1 big one. We've seen evidence that this lowers compaction time and overhead, sometimes significantly. That PR was merged to master on April 26th so I don't believe it's in any of the releases yet but you can test it if you have a non-production cluster available. That PR is here: https://github.com/ceph/ceph/pull/34006 Normally though you should have about 1GB of WAL to absorb writes during compaction and rocksdb automatically slows writes down if the buffers start filling up. You should only see a write stall from compaction if you completely fill all of the buffers. Also, you shouldn't see compaction at one level blocking IO to the entire database. Something seems off to me here. If you have OSD logs, you can see a history of the compaction events by running this script: https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py That can give you an idea of how long your compaction events are lasting and what they are doing. Mark On 7/29/20 7:52 AM, Raffael Bachmann wrote:

Hi All, I'm kind of crossposting this from here: https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-cep… But since I'm more and more sure that it's a ceph problem I'll try my luck here. Since updating from Luminous to Nautilus I have a big problem. I have a 3 node cluster. Each cluster has 2 nvme ssd and a 10GBASE-T net for ceph. Every few minutes a osd seems to compact the rocksdb. While doing this it uses alot of I/O and blocks. This basically blocks the whole cluster and no VM/Container can read data for some seconds (minutes). While it happens "iostat -x" looks like this: Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util nvme0n1 0.00 2.00 0.00 24.00 0.00 46.00 0.00 95.83 0.00 0.00 0.00 0.00 12.00 2.00 0.40 nvme1n1 0.00 1495.00 0.00 3924.00 0.00 6099.00 0.00 80.31 0.00 352.39 523.78 0.00 2.62 0.67 100.00 And iotop: Total DISK READ: 0.00 B/s | Total DISK WRITE: 1573.47 K/s Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.43 M/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 2306 be/4 ceph 0.00 B/s 1533.22 K/s 0.00 % 99.99 % ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1] In the ceph-osd log I see that rocksdb is compacting. https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08 The pool and one OSD is nearfull. I'd planed to move some data away to another ceph pool. But now I'm not sure anymore if I should go with ceph. I'l move some data away anyway today to see if that helps, but before the upgrade there was the same amount of data an I haven't had a problem. Any hints to solve this are appreciated. Cheers Raffael _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Reply

Mark Nelson

30 Jul 30 Jul

1:44 a.m.

On 7/29/20 7:47 PM, Raffael Bachmann wrote:

...

Hi Mark I think its 15 hours not 15 days. But the compaction time seems really to be slow. I' destroying and recreating all nvme osds one by one. And the recreated ones don't have latency problems and are also much faster compacting the disk. This is since two hours: Compaction Statistics /var/log/ceph/ceph-osd.0.log Total OSD Log Duration (seconds) 7909.104 Number of Compaction Events 11 Avg Compaction Time (seconds) 1.26702554545 Total Compaction Time (seconds) 13.937281 Avg Output Size: (MB) 268.282840729 Total Output Size: (MB) 2951.11124802 Total Input Records 7693669 Total Output Records 7670229 Avg Output Throughput (MB/s) 225.29745104 Avg Input Records/second 533134.954087 Avg Output Records/second 531386.725197 Avg Output/Input Ratio 0.996558805862 Not sure if you are interested in the answers to your questions anymore but: DBs per drive: I think one? (Not yet familiar with all the ceph details. It was "just" a normal osd. the full disk, bluestore including db) Workload: Rather low. Having three node proxmox/ceph cluster is just for not have a single point of failure. The cpus and disks are mostly bored. In iostat the aqu-sz value is getting to about 520 when this occours

Well now that is an interesting detail. That's the number of requests that are backed up waiting to be serviced by the device below Ceph. That would indicate that the device wasn't servicing requests quickly. Not sure what that means in relation to all of your other new findings, but something definitely seems to be behaving strangely.

...

There is one difference between the old osds and the recreated ones. The old ones were partitioned and the mount /var/lib/ceph/osd/ceph-1 was the first partition as xfs. Now they are lvm and /var/lib/ceph/osd/ceph-1 is tmpfs. Both, old and new, are bluestore. I'm still in the middle recreating one by one. Luckily it's not a petabyte cluster with thousand of disks ;-) Anyway, thanks everyone for answering and helping so fast. Having a mailing list this active is really nice. Cheers, Raffael On 29/07/2020 16:53, Mark Nelson wrote:

Wow, that's crazy. You only had 13 compaction events for that OSD over roughly 15 days but the average compaction time was 116 seconds! Notice too though that the average compaction output size is 422MB with an average output throughput of 3.5MB! That's really slow with RocksDB sitting on an NVMe drive. You are only processing about 16K records/second. Here are some of the results from our internal NVMe (Intel P4510) test cluster looking at Sharded vs Unsharded rocksdb. This was based on master from last fall so figure it's about halfway between Nautilus and Octopus. These results are not exactly comparable to yours since we're using some experimental settings, but your compaction events look like they are orders of magnitude slower. https://docs.google.com/spreadsheets/d/1FYFBxwvE1i28AKoLyqrksHptE1Z523NU3Fa… No wonder you are seeing periodic stalls. How many DBs per NVMe drive? What's your cluster workload typically like? Also, can you see if the NVMe drive aqu-sz is getting big waiting for the requests to be serviced? Mark On 7/29/20 8:35 AM, Raffael Bachmann wrote:

Hi Mark Unfortunately it is the production cluster and I don't have another one :-( This is the output of the log parser. I'have nothing to compare them to. Stupid me has no more logs from before the upgrade. python ceph_rocksdb_log_parser.py ceph-osd.1.log Compaction Statistics ceph-osd.1.log Total OSD Log Duration (seconds) 55500.457 Number of Compaction Events 13 Avg Compaction Time (seconds) 116.498074615 Total Compaction Time (seconds) 1514.47497 Avg Output Size: (MB) 422.757656391 Total Output Size: (MB) 5495.84953308 Total Input Records 21019590 Total Output Records 18093259 Avg Output Throughput (MB/s) 3.53010211372 Avg Input Records/second 17994.0419635 Avg Output Records/second 16449.9710169 Avg Output/Input Ratio 0.891530624966 ceph-osd.1.log start_offset compaction_time_seconds output_level num_output_files total_output_size num_input_records num_output_records output (MB/s) input (r/s) output (r/s) output/input ratio 417.204 70.247058 1 5 261853019 1476689 1384444 3.55491754393 21021.3643396 19708.2132607 0.937532547476 546.271 128.652685 2 7 473883973 1674393 1098908 3.51279861751 13014.8313655 8541.66393807 0.656302313734 5761.795 60.460736 1 4 211033833 1041408 1013909 3.32873133441 17224.5339521 16769.7098494 0.973594402962 14912.985 64.958415 1 4 231336608 1316575 1249120 3.3963233477 20267.9668215 19229.5332329 0.948764787422 15152.316 238.925764 2 14 944635417 2445094 1902084 3.77052068592 10233.6975262 7960.98322825 0.77791855855 24607.857 53.022134 1 4 188414045 1029179 988116 3.38887973778 19410.36549 18635.915333 0.960101206884 31259.993 55.442826 1 4 210856392 1296725 1221474 3.62694941814 23388.5083708 22031.2362865 0.941968420444 31574.193 313.736584 2 18 1213247010 2928742 2359960 3.68794259867 9335.03502416 7522.10650703 0.805793067467 37708.375 49.78089 1 3 171888381 974097 939847 3.29294101107 19567.6895291 18879.6745096 0.96483923059 43219.745 51.798215 1 4 193360867 1246101 1172257 3.5600318014 24056.8328465 22631.2238752 0.940739956071 48041.751 56.559014 1 4 208216413 1451105 1367052 3.5108576209 25656.4762604 24170.3647804 0.942076555453 48368.403 325.833185 2 19 1289359869 3196156 2489088 3.77380036251 9809.17889011 7639.1482347 0.778775504074 52693.952 45.057464 1 3 164730093 943326 907000 3.48663339848 20936.0651101 20129.8501842 0.961491573433 cheers Raffael On 29/07/2020 15:19, Mark Nelson wrote:

Hi Raffael, Adam made a PR this year that shards rocksdb data across different column families to help reduce compaction overhead. The goal is to reduce write-amplification during compaction by storing multiple small LSM hierarchies rather than 1 big one. We've seen evidence that this lowers compaction time and overhead, sometimes significantly. That PR was merged to master on April 26th so I don't believe it's in any of the releases yet but you can test it if you have a non-production cluster available. That PR is here: https://github.com/ceph/ceph/pull/34006 Normally though you should have about 1GB of WAL to absorb writes during compaction and rocksdb automatically slows writes down if the buffers start filling up. You should only see a write stall from compaction if you completely fill all of the buffers. Also, you shouldn't see compaction at one level blocking IO to the entire database. Something seems off to me here. If you have OSD logs, you can see a history of the compaction events by running this script: https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py That can give you an idea of how long your compaction events are lasting and what they are doing. Mark On 7/29/20 7:52 AM, Raffael Bachmann wrote: > Hi All, > > I'm kind of crossposting this from here: > https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-cep… > But since I'm more and more sure that it's a ceph problem I'll try > my luck here. > > Since updating from Luminous to Nautilus I have a big problem. > > I have a 3 node cluster. Each cluster has 2 nvme ssd and a > 10GBASE-T net for ceph. > Every few minutes a osd seems to compact the rocksdb. While doing > this it uses alot of I/O and blocks. > This basically blocks the whole cluster and no VM/Container can > read data for some seconds (minutes). > > While it happens "iostat -x" looks like this: > > Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s > %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util > nvme0n1 0.00 2.00 0.00 24.00 0.00 46.00 > 0.00 95.83 0.00 0.00 0.00 0.00 12.00 2.00 0.40 > nvme1n1 0.00 1495.00 0.00 3924.00 0.00 > 6099.00 0.00 80.31 0.00 352.39 523.78 0.00 2.62 0.67 > 100.00 > > And iotop: > > Total DISK READ: 0.00 B/s | Total DISK WRITE: 1573.47 K/s > Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.43 M/s > TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND > 2306 be/4 ceph 0.00 B/s 1533.22 K/s 0.00 % 99.99 % > ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph > [rocksdb:low1] > > > In the ceph-osd log I see that rocksdb is compacting. > https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08 > > The pool and one OSD is nearfull. I'd planed to move some data > away to another ceph pool. But now I'm not sure anymore if I > should go with ceph. > I'l move some data away anyway today to see if that helps, but > before the upgrade there was the same amount of data an I haven't > had a problem. > > Any hints to solve this are appreciated. > > Cheers > Raffael > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io > _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Reply