Hi Ashu,
Yeah, please see
https://patchwork.kernel.org/project/ceph-devel/list/?series=733010.
Sorry, I forgot to reply here.
- Xiubo
On 4/4/23 13:58, Ashu Pachauri wrote:
> Hi Xiubo,
>
> Did you get a chance to work on this? I am curious to test out the
> improvements.
>
> Thanks and Regards,
> Ashu Pachauri
>
>
> On Fri, Mar 17, 2023 at 3:33 PM Frank Schilder <frans@dtu.dk> wrote:
>
> Hi Ashu,
>
> thanks for the clarification. That's not an option that is easy
> for us to change. I hope that the modifications Xiubo has in mind
> for the fs clients will improve this. Thanks for flagging this
> performance issue; it would be great if it became part of a test
> suite.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Ashu Pachauri <ashu210890@gmail.com>
> Sent: 17 March 2023 09:55:25
> To: Xiubo Li
> Cc: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: CephFS thrashing through the page cache
>
> Hi Xiubo,
>
> As you have correctly pointed out, I was talking about the
> stripe_unit setting in the file layout configuration. Here is the
> documentation for anyone else's reference:
>
> https://docs.ceph.com/en/quincy/cephfs/file-layouts/
>
> As with any RAID0 setup, the stripe_unit is definitely workload
> dependent. Our use case requires us to read somewhere from a few
> kilobytes to a few hundred kilobytes at once. Having a 4MB default
> stripe_unit definitely hurts quite a bit. We were able to achieve
> almost 2x improvement in terms of average latency and overall
> throughput (for useful data) by reducing the stripe_unit. The rule
> of thumb is that you want to align the stripe_unit to your most
> common IO size.
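For anyone following along, the layout is changed via the layout vxattrs rather than a mount option; a sketch (the directory path and the 64 KiB value are illustrative examples, not recommendations):

```shell
# Set a smaller stripe_unit on a directory; files created under it
# afterwards inherit this layout (existing files keep their old one).
# /mnt/cephfs/mydir and 65536 (64 KiB) are example values.
setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/mydir

# Inspect the resulting layout:
getfattr -n ceph.dir.layout /mnt/cephfs/mydir
```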
>
> > BTW, have you tried to set the 'rasize' option to a small size
> > instead of 0? Won't this work?
>
> No, this won't work; I have tried it already. Since rasize simply
> controls readahead, the minimum IO size to the cephfs client will
> still be the maximum of (rasize, stripe_unit). rasize is only a
> useful configuration when it needs to be larger than the
> stripe_unit. Also, it's worth pointing out that simply setting
> rasize is not sufficient; one also needs to change the
> corresponding configurations that control the maximum/minimum
> readahead for ceph clients.
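A quick illustration of that interaction (a sketch; 4 MiB is just the default stripe_unit discussed in this thread):

```shell
# Minimum transfer per client read is max(rasize, stripe_unit),
# so rasize=0 does not help while stripe_unit stays at 4 MiB.
rasize_kib=0
stripe_unit_kib=4096   # default 4 MiB stripe_unit
min_io_kib=$(( rasize_kib > stripe_unit_kib ? rasize_kib : stripe_unit_kib ))
echo "minimum IO: ${min_io_kib} KiB"   # still 4096 KiB despite rasize=0
```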
>
> Thanks and Regards,
> Ashu Pachauri
>
>
> On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li <xiubli@redhat.com> wrote:
>
> On 15/03/2023 17:20, Frank Schilder wrote:
> > Hi Ashu,
> >
> > are you talking about the kernel client? I can't find "stripe
> > size" anywhere in its mount documentation. Could you possibly
> > post exactly what you did? Mount fstab line, config setting?
>
> There is no mount option for this in either the userspace or
> kernel clients. You need to change the file layout instead, which
> is (4MB stripe_unit, 1 stripe_count and 4MB object_size) by
> default.
>
> Certainly a smaller stripe_unit will work. But IMO it depends on
> the workload, and you should be careful: changing the layout may
> cause other performance issues in some cases. For example, too
> small a stripe_unit may split a sync read into more osd requests
> to different OSDs.
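To put rough numbers on that trade-off (illustrative values, not figures from this thread):

```shell
# With a small stripe_unit, one large sequential read fans out into
# many object reads, potentially hitting many different OSDs.
read_kib=1024          # a 1 MiB sync read (example value)
stripe_unit_kib=64     # a deliberately small stripe_unit (example value)
echo "object requests: $(( read_kib / stripe_unit_kib ))"   # 16, vs 1 at 4 MiB
```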
>
> I will prepare a patch to make the kernel client smarter, instead
> of blindly setting the read size to the stripe_unit every time.
>
> Thanks
>
> - Xiubo
>
>
> >
> > Thanks!
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Ashu Pachauri <ashu210890@gmail.com>
> > Sent: 14 March 2023 19:23:42
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] Re: CephFS thrashing through the page cache
> >
> > Got the answer to my own question; posting here in case someone
> > else encounters the same problem. The issue is that the default
> > stripe size in a cephfs mount is 4 MB. If you are doing small
> > reads (like the 4k reads in the test I posted) inside the file,
> > you'll end up pulling at least 4 MB to the client (and then
> > discarding most of the pulled data) even if you set readahead to
> > zero. So the solution for us was to set a lower stripe size,
> > which aligns better with our workloads.
> >
> > Thanks and Regards,
> > Ashu Pachauri
> >
> >
> > On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri
> > <ashu210890@gmail.com> wrote:
> >
> >> Also, I am able to reproduce the network read amplification
> >> when I try to do very small reads from larger files, e.g.:
> >>
> >> for i in $(seq 1 10000); do
> >> dd if=test_${i} of=/dev/null bs=5k count=10
> >> done
> >>
> >>
> >> This piece of code generates 3.3 GB of network traffic while it
> >> actually reads only approx. 500 MB of data.
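Those figures line up: each dd call reads 10 x 5 KiB = 50 KiB of useful data per file, so (rough arithmetic on the numbers quoted above):

```shell
files=10000; reads_per_file=10; bs_kib=5
useful_mib=$(( files * reads_per_file * bs_kib / 1024 ))
echo "useful data: ${useful_mib} MiB"          # ~488 MiB ("approx 500 MB")
# Observed traffic was ~3.3 GB, i.e. roughly a 6-7x amplification:
echo "amplification: ~$(( 3300 / useful_mib ))x"
```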
> >>
> >>
> >> Thanks and Regards,
> >> Ashu Pachauri
> >>
> >> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri
> >> <ashu210890@gmail.com> wrote:
> >>
> >>> We have an internal use case where we back the storage of a
> >>> proprietary database with a shared file system. We noticed
> >>> something very odd when testing a workload on a local block
> >>> device backed file system vs cephfs: the amount of network IO
> >>> done by cephfs is almost double the IO done in the case of a
> >>> local file system backed by an attached block device.
> >>>
> >>> We also noticed that CephFS thrashes through the page cache
> >>> very quickly compared to the amount of data being read, and we
> >>> think the two issues might be related. So, I wrote a simple
> >>> test:
> >>>
> >>> 1. I wrote 10k files of 400 KB each using dd (approx 4 GB of data).
> >>> 2. I dropped the page cache completely.
> >>> 3. I then read these files serially, again using dd. The page
> >>>    cache usage shot up to 39 GB for reading such a small amount
> >>>    of data.
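For what it's worth, the 39 GB figure is consistent with the client caching roughly one full default 4 MiB stripe unit per file (an inference from the numbers in this thread, not a measured breakdown):

```shell
files=10000
per_file_mib=4   # default 4 MiB stripe_unit/object_size, assumed cached per file
cached_gib=$(( files * per_file_mib / 1024 ))
echo "expected page cache: ~${cached_gib} GiB"   # ~39 GiB, matching the test
```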
> >>>
> >>> Following is the code used to repro this in bash:
> >>>
> >>> for i in $(seq 1 10000); do
> >>> dd if=/dev/zero of=test_${i} bs=4k count=100
> >>> done
> >>>
> >>> sync; echo 1 > /proc/sys/vm/drop_caches
> >>>
> >>> for i in $(seq 1 10000); do
> >>> dd if=test_${i} of=/dev/null bs=4k count=100
> >>> done
> >>>
> >>>
> >>> The ceph version being used is:
> >>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2)
> >>> octopus (stable)
> >>>
> >>> The ceph configs being overridden:
> >>> WHO     MASK  LEVEL     OPTION                                 VALUE        RO
> >>> mon           advanced  auth_allow_insecure_global_id_reclaim  false
> >>> mgr           advanced  mgr/balancer/mode                      upmap
> >>> mgr           advanced  mgr/dashboard/server_addr              127.0.0.1    *
> >>> mgr           advanced  mgr/dashboard/server_port              8443         *
> >>> mgr           advanced  mgr/dashboard/ssl                      false        *
> >>> mgr           advanced  mgr/prometheus/server_addr             0.0.0.0      *
> >>> mgr           advanced  mgr/prometheus/server_port             9283         *
> >>> osd           advanced  bluestore_compression_algorithm        lz4
> >>> osd           advanced  bluestore_compression_mode             aggressive
> >>> osd           advanced  bluestore_throttle_bytes               536870912
> >>> osd           advanced  osd_max_backfills                      3
> >>> osd           advanced  osd_op_num_threads_per_shard_ssd       8            *
> >>> osd           advanced  osd_scrub_auto_repair                  true
> >>> mds           advanced  client_oc                              false
> >>> mds           advanced  client_readahead_max_bytes             4096
> >>> mds           advanced  client_readahead_max_periods           1
> >>> mds           advanced  client_readahead_min                   0
> >>> mds           basic     mds_cache_memory_limit                 21474836480
> >>> client        advanced  client_oc                              false
> >>> client        advanced  client_readahead_max_bytes             4096
> >>> client        advanced  client_readahead_max_periods           1
> >>> client        advanced  client_readahead_min                   0
> >>> client        advanced  fuse_disable_pagecache                 false
> >>>
> >>> The cephfs mount options (note that readahead was disabled for
> >>> this test):
> >>> /mnt/cephfs type ceph
> >>> (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
> >>>
> >>> Any help or pointers are appreciated; this is a major
> >>> performance issue for us.
> >>>
> >>>
> >>> Thanks and Regards,
> >>> Ashu Pachauri
> >>>
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-leave@ceph.io
> >
> --
> Best Regards,
>
> Xiubo Li (李秀波)
>
> Email: xiubli@redhat.com/xiubli@ibm.com
>
> Slack: @Xiubo Li
>