Hi all,
Recently I encountered a situation that requires reliable file storage
with CephFS, where the stored data must not be modified or deleted.
After some research I found that the WORM (write once, read many)
feature is exactly what I need. Unfortunately, as far as I know, there
is no WORM feature in CephFS.
So I was wondering: is there any plan or design for this feature?
Thanks.
Hi all,
We conduct yearly user surveys to better understand how our users utilize
Ceph. The Ceph Foundation collects the data under the Community Data
License Agreement [0], which helps the community make a more informed
decision about where our development efforts for future releases should go.
Back in August, I asked the community to help draft the next survey
[1]. I'm happy to now provide a draft of the 2019 user survey. I'm sending
this to the dev list in hopes of getting feedback before sending it to the
Ceph users list.
The first question I received was about using something other than
SurveyMonkey, since it is not available in some regions. I have been using
another third-party service for our Ceph Days CFP forms, and luckily it
offers a survey service that isn't blocked.
A second question that came up was how to lay out questions for multiple
cluster deployments. An idea I had was to keep our general Ceph user survey
[2] separate from the deployment questions [3]. The general questions only
need to be answered once, and the deployment survey can be answered
multiple times to capture the different configurations. I'm looking into a
way to link the answers of both surveys together.
Any feedback, corrections or ideas?
[0] - https://cdla.io/sharing-1-0/
[1] -
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/Q3NCHOJN45D…
[2] -
https://ceph.io/wp-content/uploads/2019/10/Ceph-User-Survey-general.pdf
[3] -
https://ceph.io/wp-content/uploads/2019/10/Ceph-User-Survey-Clusters.pdf
--
Mike Perez
he/him
Ceph Community Manager
M: +1-951-572-2633
494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
@Thingee <https://twitter.com/thingee>
Hi everyone.
The next DocuBetter meeting is scheduled for tomorrow. This is at
the following time:
1800 PST 27 Nov 2019
0100 UTC 27 Nov 2019
1200 AEST 28 Nov 2019
Etherpad: https://pad.ceph.com/p/Ceph_Documentation
Meeting: https://bluejeans.com/908675367
Agenda: This week we will be discussing the new Getting Started
Guide, RADOS documentation, the possibility of a CephFS guide,
the Sphinx theming on the website, and the possibility of
improving docs bug reporting from the Ceph community of users.
Zac
This is the seventh bugfix release of the Mimic v13.2.x long-term stable
release series. We recommend that all Mimic users upgrade.
For the full release notes, see
https://ceph.io/releases/v13-2-7-mimic-released/
Notable Changes
MDS:
- Cache trimming is now throttled. Dropping the MDS cache via the “ceph
tell mds.<foo> cache drop” command or large reductions in the cache size
will no longer cause service unavailability.
- The behavior when recalling caps has been significantly improved, so
that the MDS no longer tries to recall too many caps at once, which
previously caused instability. MDSs with a large cache (64GB+) should be
more stable.
- The MDS now provides a config option “mds_max_caps_per_client” (default:
1M) to limit the number of caps a client session may hold. Long-running
client sessions holding a large number of caps have been a source of
instability in the MDS when all of those caps need to be processed during
certain session events. It is recommended not to increase this value
unnecessarily.
- The “mds_recall_state_timeout” config parameter has been removed. Late
client recall warnings are now generated based on the number of caps the
MDS has recalled that have not yet been released. The new config parameters
“mds_recall_warning_threshold” (default: 32K) and
“mds_recall_warning_decay_rate” (default: 60s) set the threshold for this
warning.
- The “cache drop” admin socket command has been removed. The “ceph tell
mds.X cache drop” command remains.
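For example, the new cap limit and the retained tell command can be
exercised like this (a sketch; the MDS name “a” and the value shown are
placeholders):

  # Limit the number of caps a single client session may hold (default: 1M).
  ceph config set mds mds_max_caps_per_client 1048576
  # Ask a specific MDS to trim its cache; trimming is now throttled.
  ceph tell mds.a cache drop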
OSD:
- A health warning is now generated if the average osd heartbeat ping
time exceeds a configurable threshold for any of the intervals computed.
The OSD computes 1 minute, 5 minute and 15 minute intervals with average,
minimum and maximum values. New configuration option
“mon_warn_on_slow_ping_ratio” specifies a percentage of
“osd_heartbeat_grace” to determine the threshold. A value of zero disables
the warning. A new configuration option, “mon_warn_on_slow_ping_time”,
specified in milliseconds, overrides the computed value and causes a
warning when OSD heartbeat pings take longer than the specified amount. A
new admin command, “ceph daemon mgr.# dump_osd_network [threshold]”, lists
all connections whose average ping time over any of the three intervals
exceeds the specified threshold or the value determined by the config
options. A new admin command, “ceph daemon osd.# dump_osd_network
[threshold]”, does the same but includes only heartbeats initiated by the
specified OSD (see the example after this list).
- The default value of the
“osd_deep_scrub_large_omap_object_key_threshold” parameter has been
lowered to more easily detect objects with a large number of omap keys.
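For example (a sketch; the daemon names and the 1000 ms threshold are
placeholders):

  # Warn when the average heartbeat ping exceeds this fraction of
  # osd_heartbeat_grace; zero disables the warning.
  ceph config set mon mon_warn_on_slow_ping_ratio 0.05
  # List all connections whose average ping over any interval exceeds 1000 ms.
  ceph daemon mgr.x dump_osd_network 1000
  # Same, but only for heartbeats initiated by osd.0.
  ceph daemon osd.0 dump_osd_network 1000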
RGW:
- radosgw-admin introduces two subcommands for managing expire-stale
objects that might be left behind after a bucket reshard in earlier
versions of RGW. One subcommand lists such objects and the other deletes
them. Read the troubleshooting section of the dynamic resharding docs for
details.
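A sketch of the intended usage, assuming the subcommand names given in the
resharding troubleshooting docs (verify against your release; the bucket
name is a placeholder):

  # List stale objects left behind by an earlier reshard of this bucket.
  radosgw-admin objects expire-stale list --bucket mybucket
  # Delete them.
  radosgw-admin objects expire-stale rm --bucket mybucket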
@Haomai: Please check my reply.
On 01:52 Wed 27 Nov, <haomai(a)xsky.com> wrote:
> Liu, Changcheng <changcheng.liu(a)intel.com>
> >
> > On 02:53 Tue 19 Nov, haomai(a)xsky.com wrote:
> > > Liu, Changcheng <changcheng.liu(a)intel.com>
> > [Changcheng]:
> > 1. Do we have a plan to use RDMA-CM connection management by default for RDMA in Ceph?
> > Currently, RDMA-CM connection management has been integrated into the Ceph code.
> > However, it only works when setting 'ms_async_rdma_cm=true', while the default value of ms_async_rdma_cm is false.
> > It's really not good that we maintain two connection management methods for RDMA in Ceph.
> >
> > What about changing the default connection management to RDMA-CM?
> If we have good test over rdma-cm, it should be ok.
[Changcheng]:
Once rdma-cm is used for connection management, it can support
RoCEv1/RoCEv2/iWARP, which would unify the Ceph RDMA configuration.
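For example, enabling rdma-cm in the messenger would look roughly like
this (a sketch; the device name is a placeholder):

  [global]
  ms_type = async+rdma
  ms_async_rdma_cm = true
  # Without rdma-cm, the device must instead be named explicitly, e.g.:
  # ms_async_rdma_device_name = mlx5_0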
> > >
> > > > 1) Support multiple devices
> > > > [Changcheng]:
> > > > Do you mean separate public & cluster networks, with RDMA used on both?
> > > > Currently, Ceph can work with RDMA in one of the following ways:
> > > > a. Make no difference between public & cluster network; both use the same RDMA device port for the RDMA messenger.
> > > > OR
> > > > b. The public network runs on TCP (posix) and the cluster network runs on RDMA.
> > > > 2) Enable unified ceph.conf for all ceph nodes
> > > > [Changcheng]:
> > > > Do you mean that on some nodes, ceph needs to be set to use a different RDMA device port?
> > >
> > > hmm, yes
[Changcheng]:
To avoid "set different RDMA device port to be used", it's better to
look for the RDMA device according to the RNIC IP address.
What do you think of it?
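For example, with rdma-cm the device could be resolved from each node's
configured networks, so a single ceph.conf works everywhere (a sketch;
the subnets are placeholders):

  [global]
  public_network  = 192.168.1.0/24   # rdma-cm resolves the device from the RNIC IP
  cluster_network = 192.168.2.0/24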
> > [Changcheng]:
> > 2. If there's a plan to run RDMA on separate public & cluster networks, we must use RDMA-CM for connection management, right?
> not exactly, but with rdma-cm it will be easier for the code to support it
[Changcheng]:
Yes, rdma-cm makes it easier for the code to support it.
> > > It's a long story.....
> > [Changcheng]:
> > 3. Is this related to RDMA? Has it been implemented in Ceph?
> I think we should refer to crimson-ceph to support this
[Changcheng]:
Thanks for your info.
>
> > > it means registering the data buffer read from the storage device
> > [Changcheng]:
> > 4. Do you mean: 1) create the RDMA Memory Region (MR) first, 2) use the MR in the bufferlist, 3) post the bufferlist as a work request to the RDMA send queue so it is sent directly, without using tx_copy_chunk?
> yeap
[Changcheng]:
This seems impossible. I don't know whether the bufferlist is used only
for message transmission. If we work in this direction, there could be a
lot of changes.
> > > > II. ToDo:
> > > > 1. Use RDMA Read/Write for better memory utilization
> > > > [Changcheng]:
> > > > Any plan to implement RDMA Read/Write? How to solve the compatibility problem, since the previous implementation is based on RC-Send/RC-Recv?
> > >
> > > Maybe it's not a good idea now
> > [Changcheng]:
> > 5. Is there any background on why we don't use Read/Write semantics in the Ceph RDMA implementation?
> from the vendors' info, Read/Write is not favored.
[Changcheng]:
OK. I don't have performance data on the difference between Read/Write &
Send/Recv. Let's discuss this later.
On 02:53 Tue 19 Nov, haomai(a)xsky.com wrote:
> Liu, Changcheng <changcheng.liu(a)intel.com>
> >
> > Hi Haomai,
> > I read your below presentation:
> > Topic: CEPH RDMA UPDATE
> > Link: https://www.openfabrics.org/images/eventpresos/2017presentations/103_Ceph_H…
> >
> > I want to talk about the items on page 17:
> > I. Work in Progress:
> > 1. RDMA-CM for control path
> > [Changcheng]:
> > Do you also prefer that we use RDMA-CM for connection management?
>
> RDMA-CM has good wrapper
[Changcheng]:
1. Do we have a plan to use RDMA-CM connection management by default for RDMA in Ceph?
Currently, RDMA-CM connection management has been integrated into the Ceph code.
However, it only works when setting 'ms_async_rdma_cm=true', while the default value of ms_async_rdma_cm is false.
It's really not good that we maintain two connection management methods for RDMA in Ceph.
What about changing the default connection management to RDMA-CM?
>
> > 1) Support multiple devices
> > [Changcheng]:
> > Do you mean separate public & cluster networks, with RDMA used on both?
> > Currently, Ceph can work with RDMA in one of the following ways:
> > a. Make no difference between public & cluster network; both use the same RDMA device port for the RDMA messenger.
> > OR
> > b. The public network runs on TCP (posix) and the cluster network runs on RDMA.
> > 2) Enable unified ceph.conf for all ceph nodes
> > [Changcheng]:
> > Do you mean that on some nodes, ceph needs to be set to use a different RDMA device port?
>
> hmm, yes
[Changcheng]:
2. If there's a plan to run RDMA on separate public & cluster networks, we must use RDMA-CM for connection management, right?
>
> > 2. Ceph replication Zero-copy
> > 1) Reduce number of memcpy by half by re-using data buffers on primary OSD
> > [Changcheng]:
> > What does this mean? Is there any technical material about this item?
>
> It's a long story.....
[Changcheng]:
3. Is this related to RDMA? Has it been implemented in Ceph?
>
> > 3. Tx zero-copy
> > Avoid copy-out by using registered memory
> > [Changcheng]:
> > I've read the code; the function tx_copy_chunk copies data into segmented chunks to be sent. How do you solve the zero-copy problem?
>
> it means registering the data buffer read from the storage device
[Changcheng]:
4. Do you mean: 1) create the RDMA Memory Region (MR) first, 2) use the MR in the bufferlist, 3) post the bufferlist as a work request to the RDMA send queue so it is sent directly, without using tx_copy_chunk?
>
> >
> > II. ToDo:
> > 1. Use RDMA Read/Write for better memory utilization
> > [Changcheng]:
> > Any plan to implement RDMA Read/Write? How to solve the compatibility problem, since the previous implementation is based on RC-Send/RC-Recv?
>
> Maybe it's not a good idea now
[Changcheng]:
5. Is there any background on why we don't use Read/Write semantics in the Ceph RDMA implementation?
>
> > 2. ODP - On demand paging
> > [Changcheng]:
> > Do you mean the problem that "the registered Memory Region is pinned to physical pages and can't be swapped out"?
>
> No, it's a transparent registration technique; it's not currently available
[Changcheng]:
Thanks for your info.
>
> > 3. Erasure-coding using HW offload.
> > [Changcheng]:
> > Is this related to the RDMA NIC?
>
> SMART-NIC
[Changcheng]:
Thanks for your info.
>
> >
> > B.R.
> > Changcheng
Hi,
Recently, I've been trying to mount CephFS as a non-privileged user via
ceph-fuse, but it always fails. I looked at the code and found that there
is a remount operation during a ceph-fuse mount. The remount executes the
'mount -i -o remount {mountpoint}' command, and that causes the mount to
fail.
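For reference, this is roughly what I'm running (the user name, keyring
path, and mount point are placeholders from my test setup):

  # Run as a non-root user; the mount fails when ceph-fuse internally
  # executes 'mount -i -o remount <mountpoint>', which requires root.
  ceph-fuse -n client.alice -k /home/alice/alice.keyring /home/alice/cephfs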
How can a non-privileged user mount CephFS via ceph-fuse?
Thanks.
Hi Folks,
We're discussing changing the minimum allocation size in bluestore to
4k. For flash devices this appears to be a no-brainer. We've made the
write path fast enough in bluestore that we're typically seeing either
the same or faster performance with a 4K min_alloc size and the space
savings for small objects easily outweigh the increase in metadata for
large fragmented objects.
For HDDs there are tradeoffs. A smaller allocation size means more
fragmentation when there are small overwrites (like in RBD), which can
mean a lot more seeks. Igor was showing some fairly steep RBD
performance drops for medium-large reads/writes once the OSDs started to
become fragmented. For RGW this isn't nearly as big of a deal though
since typically the objects shouldn't become fragmented. A small (4K)
allocation size does mean, however, that we can write out 4K random writes
sequentially and gain a big IOPS win, which theoretically should benefit
both RBD and RGW.
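For reference, the relevant options can be set like this (a sketch; note
that the min_alloc size is fixed at OSD mkfs time, so a change only
affects newly created OSDs):

  # Applies to OSDs deployed after the change, not to existing ones.
  ceph config set osd bluestore_min_alloc_size_hdd 4096
  ceph config set osd bluestore_min_alloc_size_ssd 4096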
Regarding space-amplification, Josh pointed out that our current 64K
allocation size has huge ramifications for overall space-amp when
writing out medium-sized objects to EC pools. In an attempt to actually
quantify this, I made a spreadsheet with some graphs showing a couple of
examples of how the min_alloc size and replication/EC interact with each
other at different object sizes. The gist of it is that with our
current default HDD min_alloc size (64K), erasure coding can actually
have worse space amplification than 3X replication, even with moderately
large (128K) object sizes. How much this factors into the decision vs
fragmentation is a tough call, but I wanted to at least showcase the
behavior as we work through deciding what our default HDD behavior
should be.
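As a worked example (hypothetical numbers, assuming a 6+3 EC profile): a
128K object is striped into six ~21.3K data shards plus three parity
shards. With a 64K min_alloc size, each of the nine shards rounds up to
64K, so the object consumes 9 x 64K = 576K on disk, a 4.5x space
amplification, versus a flat 3.0x for 3X replication of the same object.
With a 4K min_alloc size, the same shards round up to 24K each, i.e.
9 x 24K = 216K, or about 1.69x.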
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUTo…
Thanks,
Mark
Hi Folks,
Perf meeting is on in ~60 minutes! Discussion topics for today include
bluestore 4K min_alloc size on HDDs and an update on testing the new
performance cluster that Intel donated for the community. Please feel
free to add your own topic!
Etherpad:
https://pad.ceph.com/p/performance_weekly
Bluejeans:
https://bluejeans.com/908675367
Thanks,
Mark