Hi Ashu,
Yeah, please see
https://patchwork.kernel.org/project/ceph-devel/list/?series=733010.
Sorry, I forgot to reply here.
- Xiubo
On 4/4/23 13:58, Ashu Pachauri wrote:
> Hi Xiubo,
>
> Did you get a chance to work on this? I am curious to test out the
> improvements.
>
> Thanks and Regards,
> Ashu Pachauri
>
>
> On Fri, Mar 17, 2023 at 3:33 PM Frank Schilder <frans@dtu.dk> wrote:
>
> Hi Ashu,
>
> thanks for the clarification. That's not an option that is easy
> for us to change. I hope that the modifications Xiubo has in mind
> for the fs clients will improve this. Thanks for flagging this
> performance issue; it would be great if it became part of a test
> suite.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Ashu Pachauri <ashu210890@gmail.com>
> Sent: 17 March 2023 09:55:25
> To: Xiubo Li
> Cc: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: CephFS thrashing through the page cache
>
> Hi Xiubo,
>
> As you have correctly pointed out, I was talking about the
> stripe_unit setting in the file layout configuration. Here is the
> documentation for anyone else's reference:
>
> https://docs.ceph.com/en/quincy/cephfs/file-layouts/
>
> As with any RAID0 setup, the stripe_unit is definitely workload
> dependent. Our use case requires us to read somewhere from a few
> kilobytes to a few hundred kilobytes at once. Having a 4MB default
> stripe_unit definitely hurts quite a bit. We were able to achieve
> almost 2x improvement in terms of average latency and overall
> throughput (for useful data) by reducing the stripe_unit. The rule
> of thumb is that you want to align the stripe_unit to your most
> common IO size.
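For anyone following along, the layout is changed via the layout vxattrs rather than a mount option; a sketch (the directory path and the 64 KiB value are illustrative examples, not recommendations):

```shell
# Set a smaller stripe_unit on a directory; files created under it
# afterwards inherit this layout (existing files keep their old one).
# /mnt/cephfs/mydir and 65536 (64 KiB) are example values.
setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/mydir

# Inspect the resulting layout:
getfattr -n ceph.dir.layout /mnt/cephfs/mydir
```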
>
> > BTW, have you tried to set the 'rasize' option to a small size
> > instead of 0? Won't this work?
>
> No, this won't work; I have tried it already. Since rasize simply
> controls readahead, the minimum IO size to the cephfs client will
> still be the maximum of (rasize, stripe_unit). rasize is only a
> useful configuration when it needs to be larger than the
> stripe_unit. Also, it's worth pointing out that simply setting
> rasize is not sufficient; one also needs to change the
> corresponding configurations that control the maximum/minimum
> readahead for ceph clients.
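A quick illustration of that interaction (a sketch; 4 MiB is just the default stripe_unit discussed in this thread):

```shell
# Minimum transfer per client read is max(rasize, stripe_unit),
# so rasize=0 does not help while stripe_unit stays at 4 MiB.
rasize_kib=0
stripe_unit_kib=4096   # default 4 MiB stripe_unit
min_io_kib=$(( rasize_kib > stripe_unit_kib ? rasize_kib : stripe_unit_kib ))
echo "minimum IO: ${min_io_kib} KiB"   # still 4096 KiB despite rasize=0
```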
>
> Thanks and Regards,
> Ashu Pachauri
>
>
> On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li <xiubli@redhat.com> wrote:
>
> On 15/03/2023 17:20, Frank Schilder wrote:
> > Hi Ashu,
> >
> > are you talking about the kernel client? I can't find "stripe
> > size" anywhere in its mount documentation. Could you possibly
> > post exactly what you did? Mount fstab line, config setting?
>
> There is no mount option for this in either the userspace or
> kernel clients. You need to change the file layout instead, which
> is (4MB stripe_unit, 1 stripe_count and 4MB object_size) by
> default.
>
> Certainly a smaller stripe_unit will work. But IMO it depends on
> the workload, and you should be careful: changing the layout may
> cause other performance issues in some cases. For example, too
> small a stripe_unit may split a sync read into more osd requests
> to different OSDs.
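To put rough numbers on that trade-off (illustrative values, not figures from this thread):

```shell
# With a small stripe_unit, one large sequential read fans out into
# many object reads, potentially hitting many different OSDs.
read_kib=1024          # a 1 MiB sync read (example value)
stripe_unit_kib=64     # a deliberately small stripe_unit (example value)
echo "object requests: $(( read_kib / stripe_unit_kib ))"   # 16, vs 1 at 4 MiB
```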
>
> I will prepare a patch to make the kernel client smarter, instead
> of blindly setting the read size to the stripe_unit every time.
>
> Thanks
>
> - Xiubo
>
>
> >
> > Thanks!
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Ashu Pachauri <ashu210890@gmail.com>
> > Sent: 14 March 2023 19:23:42
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] Re: CephFS thrashing through the page cache
> >
> > Got the answer to my own question; posting here in case someone
> > else encounters the same problem. The issue is that the default
> > stripe size in a cephfs mount is 4 MB. If you are doing small
> > reads (like the 4k reads in the test I posted) inside the file,
> > you'll end up pulling at least 4 MB to the client (and then
> > discarding most of the pulled data) even if you set readahead to
> > zero. So the solution for us was to set a lower stripe size,
> > which aligns better with our workloads.
> >
> > Thanks and Regards,
> > Ashu Pachauri
> >
> >
> > On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri
> > <ashu210890@gmail.com> wrote:
> >
> >> Also, I am able to reproduce the network read amplification
> >> when I try to do very small reads from larger files, e.g.:
> >>
> >> for i in $(seq 1 10000); do
> >> dd if=test_${i} of=/dev/null bs=5k count=10
> >> done
> >>
> >>
> >> This piece of code generates 3.3 GB of network traffic while it
> >> actually reads only approx. 500 MB of data.
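Those figures line up: each dd call reads 10 x 5 KiB = 50 KiB of useful data per file, so (rough arithmetic on the numbers quoted above):

```shell
files=10000; reads_per_file=10; bs_kib=5
useful_mib=$(( files * reads_per_file * bs_kib / 1024 ))
echo "useful data: ${useful_mib} MiB"          # ~488 MiB ("approx 500 MB")
# Observed traffic was ~3.3 GB, i.e. roughly a 6-7x amplification:
echo "amplification: ~$(( 3300 / useful_mib ))x"
```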
> >>
> >>
> >> Thanks and Regards,
> >> Ashu Pachauri
> >>
> >> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri
> >> <ashu210890@gmail.com> wrote:
> >>
> >>> We have an internal use case where we back the storage of a
> >>> proprietary database with a shared file system. We noticed
> >>> something very odd when testing a workload on a local block
> >>> device backed file system vs cephfs: the amount of network IO
> >>> done by cephfs is almost double the IO done in the case of a
> >>> local file system backed by an attached block device.
> >>>
> >>> We also noticed that CephFS thrashes through the page cache
> >>> very quickly compared to the amount of data being read, and we
> >>> think the two issues might be related. So, I wrote a simple
> >>> test:
> >>>
> >>> 1. I wrote 10k files of 400 KB each using dd (approx 4 GB of data).
> >>> 2. I dropped the page cache completely.
> >>> 3. I then read these files serially, again using dd. The page
> >>>    cache usage shot up to 39 GB for reading such a small amount
> >>>    of data.
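For what it's worth, the 39 GB figure is consistent with the client caching roughly one full default 4 MiB stripe unit per file (an inference from the numbers in this thread, not a measured breakdown):

```shell
files=10000
per_file_mib=4   # default 4 MiB stripe_unit/object_size, assumed cached per file
cached_gib=$(( files * per_file_mib / 1024 ))
echo "expected page cache: ~${cached_gib} GiB"   # ~39 GiB, matching the test
```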
> >>>
> >>> Following is the code used to repro this in bash:
> >>>
> >>> for i in $(seq 1 10000); do
> >>> dd if=/dev/zero of=test_${i} bs=4k count=100
> >>> done
> >>>
> >>> sync; echo 1 > /proc/sys/vm/drop_caches
> >>>
> >>> for i in $(seq 1 10000); do
> >>> dd if=test_${i} of=/dev/null bs=4k count=100
> >>> done
> >>>
> >>>
> >>> The ceph version being used is:
> >>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2)
> >>> octopus (stable)
> >>>
> >>> The ceph configs being overridden:
> >>> WHO     MASK  LEVEL     OPTION                                 VALUE        RO
> >>> mon           advanced  auth_allow_insecure_global_id_reclaim  false
> >>> mgr           advanced  mgr/balancer/mode                      upmap
> >>> mgr           advanced  mgr/dashboard/server_addr              127.0.0.1    *
> >>> mgr           advanced  mgr/dashboard/server_port              8443         *
> >>> mgr           advanced  mgr/dashboard/ssl                      false        *
> >>> mgr           advanced  mgr/prometheus/server_addr             0.0.0.0      *
> >>> mgr           advanced  mgr/prometheus/server_port             9283         *
> >>> osd           advanced  bluestore_compression_algorithm        lz4
> >>> osd           advanced  bluestore_compression_mode             aggressive
> >>> osd           advanced  bluestore_throttle_bytes               536870912
> >>> osd           advanced  osd_max_backfills                      3
> >>> osd           advanced  osd_op_num_threads_per_shard_ssd       8            *
> >>> osd           advanced  osd_scrub_auto_repair                  true
> >>> mds           advanced  client_oc                              false
> >>> mds           advanced  client_readahead_max_bytes             4096
> >>> mds           advanced  client_readahead_max_periods           1
> >>> mds           advanced  client_readahead_min                   0
> >>> mds           basic     mds_cache_memory_limit                 21474836480
> >>> client        advanced  client_oc                              false
> >>> client        advanced  client_readahead_max_bytes             4096
> >>> client        advanced  client_readahead_max_periods           1
> >>> client        advanced  client_readahead_min                   0
> >>> client        advanced  fuse_disable_pagecache                 false
> >>>
> >>> The cephfs mount options (note that readahead was disabled for
> >>> this test):
> >>> /mnt/cephfs type ceph
> >>> (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
> >>>
> >>> Any help or pointers are appreciated; this is a major
> >>> performance issue for us.
> >>>
> >>>
> >>> Thanks and Regards,
> >>> Ashu Pachauri
> >>>
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-leave@ceph.io
> >
> --
> Best Regards,
>
> Xiubo Li (李秀波)
>
> Email: xiubli@redhat.com/xiubli@ibm.com
>
> Slack: @Xiubo Li
>