Hi Igor
Thanks for your answer. All the disks had those log_latency warnings.
"Had", because I think the problem is solved.
After moving some data and almost losing the nearfull NVMe pool
(because one disk had so much latency that Ceph decided to mark it
out), I could start destroying and recreating each NVMe OSD.
I did this because the latency problem still existed even with the
pool only half full. I'm now in the middle of recreating the OSDs one
by one.
The old ones still have latency issues when compacting the RocksDB,
but the new ones don't. So I hope the problem will be gone by
tomorrow.
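Roughly, the per-OSD cycle I'm using looks like this (osd.1 and
/dev/nvme1n1 are just placeholders for the OSD and device in question,
and I wait for the cluster to be healthy again before touching the
next OSD):

ceph osd ok-to-stop osd.1
systemctl stop ceph-osd@1
ceph osd destroy 1 --yes-i-really-mean-it
ceph-volume lvm zap /dev/nvme1n1 --destroy
ceph-volume lvm create --bluestore --data /dev/nvme1n1 --osd-id 1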
There is one difference between the old OSDs and the recreated ones.
The old ones were partitioned, and the /var/lib/ceph/osd/ceph-1 mount
was the first partition, formatted as XFS.
Now they are on LVM and /var/lib/ceph/osd/ceph-1 is a tmpfs. I'm not
yet familiar enough with all the Ceph details to know why this changed
or what exactly the change is. Both the old and the new OSDs are
BlueStore.
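My guess (I may be wrong) is that the old OSDs were created with
ceph-disk, which used a small XFS data partition, while the new ones
were created with ceph-volume, which uses LVM and only mounts a tmpfs
with the activation metadata. They can be compared with, for example:

ceph-volume lvm list
mount | grep /var/lib/ceph/osd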
Cheers,
Raffael
On 29/07/2020 16:48, Igor Fedotov wrote:
Hi Raffael,
wondering if all OSDs are suffering from slow compaction or just the
one which is "near full"?
Do other OSDs have those "log_latency_fn slow operation observed for"
lines?
Have you tried the "osd bench" command for your OSDs? Does it show
similar numbers for every OSD?
You might want to try manual offline DB compaction using
ceph-kvstore-tool. Any improvements after that?
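For example, something along these lines (osd.1 and its mount path are
just examples, and the OSD has to be stopped for the offline
compaction):

ceph tell osd.1 bench
systemctl stop ceph-osd@1
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-1 compact
systemctl start ceph-osd@1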
Thanks,
Igor
On 7/29/2020 4:35 PM, Raffael Bachmann wrote:
> Hi Mark
>
> Unfortunately it is the production cluster and I don't have another
> one :-(
>
> This is the output of the log parser. I have nothing to compare it
> to. Stupid me has no logs left from before the upgrade.
>
> python ceph_rocksdb_log_parser.py ceph-osd.1.log
> Compaction Statistics ceph-osd.1.log
> Total OSD Log Duration (seconds) 55500.457
> Number of Compaction Events 13
> Avg Compaction Time (seconds) 116.498074615
> Total Compaction Time (seconds) 1514.47497
> Avg Output Size: (MB) 422.757656391
> Total Output Size: (MB) 5495.84953308
> Total Input Records 21019590
> Total Output Records 18093259
> Avg Output Throughput (MB/s) 3.53010211372
> Avg Input Records/second 17994.0419635
> Avg Output Records/second 16449.9710169
> Avg Output/Input Ratio 0.891530624966
>
> ceph-osd.1.log
>
> start_offset compaction_time_seconds output_level num_output_files total_output_size num_input_records num_output_records output (MB/s) input (r/s) output (r/s) output/input ratio
> 417.204 70.247058 1 5 261853019 1476689 1384444 3.55491754393 21021.3643396 19708.2132607 0.937532547476
> 546.271 128.652685 2 7 473883973 1674393 1098908 3.51279861751 13014.8313655 8541.66393807 0.656302313734
> 5761.795 60.460736 1 4 211033833 1041408 1013909 3.32873133441 17224.5339521 16769.7098494 0.973594402962
> 14912.985 64.958415 1 4 231336608 1316575 1249120 3.3963233477 20267.9668215 19229.5332329 0.948764787422
> 15152.316 238.925764 2 14 944635417 2445094 1902084 3.77052068592 10233.6975262 7960.98322825 0.77791855855
> 24607.857 53.022134 1 4 188414045 1029179 988116 3.38887973778 19410.36549 18635.915333 0.960101206884
> 31259.993 55.442826 1 4 210856392 1296725 1221474 3.62694941814 23388.5083708 22031.2362865 0.941968420444
> 31574.193 313.736584 2 18 1213247010 2928742 2359960 3.68794259867 9335.03502416 7522.10650703 0.805793067467
> 37708.375 49.78089 1 3 171888381 974097 939847 3.29294101107 19567.6895291 18879.6745096 0.96483923059
> 43219.745 51.798215 1 4 193360867 1246101 1172257 3.5600318014 24056.8328465 22631.2238752 0.940739956071
> 48041.751 56.559014 1 4 208216413 1451105 1367052 3.5108576209 25656.4762604 24170.3647804 0.942076555453
> 48368.403 325.833185 2 19 1289359869 3196156 2489088 3.77380036251 9809.17889011 7639.1482347 0.778775504074
> 52693.952 45.057464 1 3 164730093 943326 907000 3.48663339848 20936.0651101 20129.8501842 0.961491573433
>
> cheers
> Raffael
>
>
> On 29/07/2020 15:19, Mark Nelson wrote:
>> Hi Raffael,
>>
>>
>> Adam made a PR this year that shards rocksdb data across different
>> column families to help reduce compaction overhead. The goal is to
>> reduce write-amplification during compaction by storing multiple
>> small LSM hierarchies rather than 1 big one. We've seen evidence
>> that this lowers compaction time and overhead, sometimes
>> significantly. That PR was merged to master on April 26th so I
>> don't believe it's in any of the releases yet but you can test it if
>> you have a non-production cluster available. That PR is here:
>>
>>
>>
>> https://github.com/ceph/ceph/pull/34006
>>
>>
>> Normally though you should have about 1GB of WAL to absorb writes
>> during compaction and rocksdb automatically slows writes down if the
>> buffers start filling up. You should only see a write stall from
>> compaction if you completely fill all of the buffers. Also, you
>> shouldn't see compaction at one level blocking IO to the entire
>> database. Something seems off to me here.
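>> If you want to double check what an OSD is actually using, something
>> like this should show the RocksDB tuning and counters (osd.1 is just
>> an example, run it on the node hosting that OSD):
>>
>> ceph daemon osd.1 config get bluestore_rocksdb_options
>> ceph daemon osd.1 perf dump rocksdb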
>>
>> If you have OSD logs, you can see a history of the compaction events
>> by running this script:
>>
>>
>> https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
>>
>>
>>
>> That can give you an idea of how long your compaction events are
>> lasting and what they are doing.
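>> Usage is just the script with an OSD log as the argument, e.g. (the
>> log path is an example):
>>
>> python ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.1.log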
>>
>>
>> Mark
>>
>>
>> On 7/29/20 7:52 AM, Raffael Bachmann wrote:
>>> Hi All,
>>>
>>> I'm kind of crossposting this from here:
>>>
>>> https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-cep…
>>> But since I'm more and more sure that it's a Ceph problem, I'll try
>>> my luck here.
>>>
>>> Since updating from Luminous to Nautilus I have a big problem.
>>>
>>> I have a 3-node cluster. Each node has 2 NVMe SSDs and a
>>> 10GBASE-T network for Ceph.
>>> Every few minutes an OSD seems to compact its RocksDB. While doing
>>> this it uses a lot of I/O and blocks.
>>> This basically blocks the whole cluster and no VM/container can
>>> read data for some seconds (or even minutes).
>>>
>>> While it happens "iostat -x" looks like this:
>>>
>>> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
>>> nvme0n1 0.00 2.00 0.00 24.00 0.00 46.00 0.00 95.83 0.00 0.00 0.00 0.00 12.00 2.00 0.40
>>> nvme1n1 0.00 1495.00 0.00 3924.00 0.00 6099.00 0.00 80.31 0.00 352.39 523.78 0.00 2.62 0.67 100.00
>>>
>>> And iotop:
>>>
>>> Total DISK READ: 0.00 B/s | Total DISK WRITE: 1573.47 K/s
>>> Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.43 M/s
>>> TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
>>> 2306 be/4 ceph 0.00 B/s 1533.22 K/s 0.00 % 99.99 % ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1]
>>>
>>>
>>> In the ceph-osd log I see that rocksdb is compacting.
>>>
>>> https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08
>>>
>>> The pool and one OSD are nearfull. I had planned to move some data
>>> away to another Ceph pool, but now I'm not sure anymore if I should
>>> go with Ceph.
>>> I'll move some data away today anyway to see if that helps, but
>>> before the upgrade there was the same amount of data and I didn't
>>> have a problem.
>>>
>>> Any hints to solve this are appreciated.
>>>
>>> Cheers
>>> Raffael
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io