Hi,
I am using the Ceph development cluster through the vstart.sh script. I would
like to measure/benchmark read and write performance (benchmark Ceph at a
low level). For that I want to use the fio tool.
Can we use fio on the development cluster? AFAIK we can: I have seen
the fio option in the CMakeLists.txt of the Ceph source code.
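For reference, fio's built-in rbd engine can be pointed at a vstart cluster with a job file along these lines (a sketch only: the pool name `rbd` and image name `fio_test` are assumptions, and the image must be created first, e.g. with `rbd create fio_test --size 1G`; the WITH_FIO cmake option, if I remember correctly, additionally builds fio plugins for benchmarking the objectstore layer directly):

```ini
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
direct=1

[rand-write-4k]
rw=randwrite
bs=4k
size=256m
iodepth=32
```

Running it from the build directory with the vstart configuration, e.g. `CEPH_CONF=ceph.conf fio rbd.fio`, should make fio talk to the development cluster.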
Thanks in advance.
BR
Hi Ceph Dev Team,
I am new to Ceph and I am wondering if it would be possible to implement a command that lists all objects in a pool/namespace/rbd image that have been modified after a certain point in time.
The background of this question is that I would like to implement incremental backup and restore of rbd images over a long period (e.g. 90 days of daily backups) without keeping a snapshot for each of the backups.
I would instead like to store some extra info alongside the backups that later gives me the possibility to issue a call like: "ceph, tell me which objects of the rbd image have changed since
I made this backup", and then I would like to have the opportunity to restore only those segments of the rbd image that have changed since then.
I have learnt that each object has an mtime, but also that mtime is not a good choice and that it would be better to have something that is strictly monotonically increasing.
If there is nothing like an epoch or version that can be used, would you consider mtime stable enough for this purpose if some extra time margin is added (e.g. does the rados object mtime
have an update_interval like the rbd image mtime)?
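To make the question concrete, the selection logic I have in mind looks roughly like this (illustrative Python only; in practice the name/mtime pairs would come from the python-rados bindings by iterating the pool and stat()-ing each object, and the function name and the `slack` margin are my own invention):

```python
from datetime import datetime, timedelta

def objects_modified_since(object_mtimes, since, slack=timedelta(seconds=0)):
    """Return names of objects whose mtime falls at or after `since`.

    `object_mtimes` maps object name -> mtime (datetime), e.g. collected
    by stat()-ing every object in the pool.  Because object mtime is not
    guaranteed to be strictly monotonic, `slack` widens the window so
    borderline objects are re-backed-up rather than silently missed.
    """
    cutoff = since - slack
    return sorted(name for name, mtime in object_mtimes.items()
                  if mtime >= cutoff)
```

The `slack` parameter is exactly the "extra time" I am asking about: how large would it need to be for mtime to be safe?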
Thanks for your advice/hints,
Peter
Hello everyone,
When I rebased my branch on master and tried to build it, I am
getting this error:
/home/abhinav/GSOC/PR/ceph/src/mds/Mantle.cc:79:5: error: ‘lua_seti’ was
not declared in this scope; did you mean ‘luaL_setn’?
79 | lua_seti(L, -2, i);
| ^~~~~~~~
| luaL_setn
/home/abhinav/GSOC/PR/ceph/src/mds/Mantle.cc:86:32: error: ‘LUA_OK’ was not
declared in this scope; did you mean ‘LUA_QS’?
86 | if (lua_pcall(L, 0, 1, 0) != LUA_OK) {
| ^~~~~~
| LUA_QS
/home/abhinav/GSOC/PR/ceph/src/mds/Mantle.cc:100:10: error: ‘lua_isinteger’
was not declared in this scope; did you mean ‘lua_tointeger’?
100 | if (!lua_isinteger(L, -2) || !lua_isnumber(L, -1)) {
| ^~~~~~~~~~~~~
| lua_tointeger
/home/abhinav/GSOC/PR/ceph/src/mds/Mantle.cc: In constructor
‘Mantle::Mantle()’:
/home/abhinav/GSOC/PR/ceph/src/mds/Mantle.cc:123:21: error:
‘luaopen_coroutine’ was not declared in this scope; did you mean
‘luaopen_string’?
123 | {LUA_COLIBNAME, luaopen_coroutine},
| ^~~~~~~~~~~~~~~~~
| luaopen_string
/home/abhinav/GSOC/PR/ceph/src/mds/Mantle.cc:127:6: error:
‘LUA_UTF8LIBNAME’ was not declared in this scope; did you mean
‘LUA_STRLIBNAME’?
127 | {LUA_UTF8LIBNAME, luaopen_utf8},
| ^~~~~~~~~~~~~~~
| LUA_STRLIBNAME
/home/abhinav/GSOC/PR/ceph/src/mds/Mantle.cc:127:23: error: ‘luaopen_utf8’
was not declared in this scope; did you mean ‘luaopen_math’?
127 | {LUA_UTF8LIBNAME, luaopen_utf8},
| ^~~~~~~~~~~~
| luaopen_math
/home/abhinav/GSOC/PR/ceph/src/mds/Mantle.cc:133:7: error: ‘luaL_requiref’
was not declared in this scope; did you mean ‘luaL_unref’?
133 | luaL_requiref(L, lib->name, lib->func, 1);
OS - Ubuntu 18.04
Lua & Lua dev - 5.1
ceph 16.0.0-6381-g4304ebeca8 (4304ebeca8a7c55b7c583eaf35a0aede807692be)
pacific (dev)
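All of the "not declared" names in the log (`lua_seti`, `lua_isinteger`, `luaopen_coroutine`, `LUA_UTF8LIBNAME`, `luaL_requiref`, `LUA_OK`) were added in Lua 5.2/5.3 and do not exist in Lua 5.1, so the build is most likely picking up the Lua 5.1 headers while Mantle expects a newer Lua. A possible fix on Ubuntu 18.04 (package names assumed; adjust for your setup):

```shell
# Install the Lua 5.3 development headers alongside (or instead of) 5.1
sudo apt-get install liblua5.3-dev

# Rebuild from a clean build directory so cmake re-detects Lua
rm -rf build
./do_cmake.sh
cd build && make
```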
Thank You,
Abhinav Singh
Hi all,
Thanks to Jason, Josh and others, we discussed the replicated persistent
write-back cache during the last CDM. This email continues that
discussion with detailed info about error handling.
The following describes the background and the handling of each error
case; any comments are welcome.
Current implementation:
======================
A persistent write-back cache [1] is implemented in librbd, which
provides an LBA-based, ordered write-back cache using NVDIMM as cache
medium.
The data layout on the cache device is split into three parts: a header,
a vector of log entries, and the customer data, where the customer data
part stores all the customer data.
Every update request (write/discard, etc.) is mapped to a log entry, and
these log entries are stored sequentially into the vector. The vector
acts like a ring buffer and is reused repeatedly.
The header part records the overall information about the cache pool,
in particular the head and tail of the ring: the head indicates the
first valid entry in the log-entry vector, and the tail indicates the
next free entry.
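The head/tail bookkeeping can be sketched with a toy model (illustrative Python only, not the actual librbd code; the class and method names are mine):

```python
class LogRing:
    """Toy model of the cache's ring of log entries.

    The header stores `head` (first valid entry) and `tail` (next free
    entry); the fixed-size vector of entries is reused circularly.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = [None] * capacity
        self.head = 0   # first valid (oldest unflushed) entry
        self.tail = 0   # next free slot
        self.count = 0  # number of valid entries

    def append(self, log_entry):
        """Record a new update request at the tail."""
        if self.count == self.capacity:
            raise BufferError("ring full: retire flushed entries first")
        self.entries[self.tail] = log_entry
        self.tail = (self.tail + 1) % self.capacity
        self.count += 1

    def retire(self):
        """Advance the head after its entry has been flushed to the OSDs."""
        if self.count == 0:
            raise BufferError("ring empty")
        entry, self.entries[self.head] = self.entries[self.head], None
        self.head = (self.head + 1) % self.capacity
        self.count -= 1
        return entry
```

Retiring an entry frees its slot, which a later append then reuses, which is what makes the vector behave as a ring buffer.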
Replicated write-back cache
===========================
The above is the overall implementation of the persistent write-back
cache in librbd; currently the data is stored on the local compute
server as a single copy. To improve redundancy, we are planning to add
more copies across different servers: a replicated client-side
write-back cache built on NVDIMM + RDMA.
Besides librbd, replica daemon services will be started on other
servers, providing management of the NVDIMM devices in those servers.
When librbd starts and a persistent write-back cache is required, it
allocates a cache pool on the local NVDIMM device. Meanwhile, it talks
with the replica daemons to allocate remote replica copies. After
initialization, the replica daemons register the replica pools and expose
them through RDMA connections. All the cache metadata is
stored as part of the rbd image’s metadata. librbd sets up RDMA
connections with the corresponding replica daemons and accesses the data.
With NVDIMM + RDMA, all the copies will have exactly the same data
layout and data. The basic idea is to register the NVDIMM through RDMA
and then use RDMA read/write to access the data, which doesn’t need
the involvement of the CPUs in the remote servers. These parts will use
the RPMA library [2].
When an update request comes in, librbd caches the request in the local
NVDIMM and meanwhile persists it at the same position in the remote
replica copies.
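The "same position in every copy" write path can be illustrated with a toy model (plain Python bytearrays standing in for the local NVDIMM pool and the RPMA-exposed replica pools; the function names are mine):

```python
def replicate_write(copies, offset, data):
    """Write `data` at the same offset into every copy.

    `copies` is a list of bytearrays standing in for the local NVDIMM
    pool and the remote replica pools.  Because all copies share one
    layout, a single offset addresses the same log slot everywhere, so
    a remote write needs no CPU on the replica side to interpret it.
    """
    for buf in copies:
        buf[offset:offset + len(data)] = data

def copies_in_sync(copies):
    """True when all copies hold identical contents."""
    return all(buf == copies[0] for buf in copies)
```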
The rest of this email focuses on how to handle various failure scenarios.
1. Librbd crashes or local NVDIMM error
As the local cache pool is mmapped into the librbd application, the
librbd process crashes when an error happens in the NVDIMM, so the
NVDIMM-error case is the same as a librbd crash.
Once the librbd process crashes, the RDMA connections to the replicas
are lost. The replica daemons monitor the connection status. Once
they detect the disconnection and a timeout has elapsed, they try to
take the exclusive lock of the rbd image. Only one replica
daemon can get the exclusive lock, and it starts to flush the
cached data to the OSDs. Once the flush is complete, it does the
following:
a. The cache metadata of the volume is updated to none and the
exclusive lock is released.
b. Other replica daemons are notified to release their cache pools.
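The single-flusher guarantee above can be sketched as follows (a toy Python model; `try_takeover`, `exclusive_lock` and `flush_fn` are stand-ins of my own, not real Ceph APIs):

```python
import threading

def try_takeover(exclusive_lock, flush_fn):
    """One of several replica daemons wins the image's exclusive lock
    and flushes the cached data; the rest back off."""
    if not exclusive_lock.acquire(blocking=False):
        return False  # another holder exists: someone else flushes, or no flush
    try:
        flush_fn()
        return True
    finally:
        exclusive_lock.release()
```

Note that if the lock is still held, e.g. because librbd itself is alive and only the connection dropped, the non-blocking acquire fails and no flush happens.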
2. Librbd restarts.
When the librbd process restarts, its corresponding replica daemons
see that the RDMA connection is lost, wait some time, and try to
flush.
To prevent unnecessary flushes by the replica daemons, the timeout can
be configured by users. Only after the timeout has expired do the
replica daemons start to flush the cached data.
3. Replica daemon crashes
As above, a timeout needs to be defined. When librbd detects
the disconnection, it tries to recreate the connection. If it still
fails after the timeout, it starts to find a new replica and sync the
data.
Based on our tests, it takes about 1 s to sync 1 GB of data through two
ports of a 100 Gb/s connection.
The failover time includes 1) the time to detect the error, 2) the
timeout, 3) the time to allocate a new replica copy, and 4) the time to
sync the data. The overall time won’t exceed 300 s (the common IO timeout).
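As a rough sanity check of that budget (illustrative arithmetic only: the ~1 GB/s sync rate is the measurement quoted above, while the detection, timeout, and allocation figures below are assumptions):

```python
def failover_time(detect_s, timeout_s, alloc_s, cache_gb, sync_gb_per_s=1.0):
    """Total failover time = detection + configured timeout +
    replica allocation + data sync (at the measured ~1 GB/s)."""
    return detect_s + timeout_s + alloc_s + cache_gb / sync_gb_per_s

# e.g. a 64 GB cache pool with a 30 s configured timeout:
total = failover_time(detect_s=5, timeout_s=30, alloc_s=5, cache_gb=64)
```

Even with generous assumptions the total stays comfortably inside the 300 s IO timeout.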
If the failed replica daemon recovers in time, librbd checks data
integrity by comparing the pool headers. If the data is in sync, IO
handling resumes.
4. RDMA connections between librbd and replica daemons lost
If only the connections are lost, the replica daemons try to take the
exclusive lock. As the exclusive lock is still held by librbd, they
fail to take it, and as a result no flush happens.
librbd also detects the disconnection; its behavior is the same as in
the failure case ‘replica daemon crashes’.
[1] https://github.com/ceph/ceph/pull/35060
[2] https://github.com/pmem/rpma
--
Best wishes
Lisa
Hi Folks,
The weekly performance meeting will start in approx 15 minutes! Ben
England will be presenting his work on Container networking performance
today, and Gabi will also be presenting some of his work looking at
rocksdb performance with different column family sharding and compaction
options.
Hope to see you there!
Etherpad:
https://pad.ceph.com/p/performance_weekly
Bluejeans:
https://bluejeans.com/908675367
Thanks,
Mark
Details of this release summarized here:
https://tracker.ceph.com/issues/48200#note-1
Asking dev leads for early approval as we target the release date -
early next week.
Some suites are still in progress (will try to finish over the weekend)
rados - approved Neha?
rgw - approved Casey?
rbd - approved Jason?
krbd - approved Jason, Ilya?
fs - approved Patrick?
multimds - approved Patrick?
ceph-deploy - in progress
upgrade/client-upgrade-jewel-nautilus (nautilus) - in progress
upgrade/client-upgrade-mimic (nautilus) - in progress
upgrade/client-upgrade-luminous-nautilus (nautilus) - in progress
upgrade/client-upgrade-nautilus-octopus-octopus (octopus) - in progress
upgrade/nautilus-p2 - in progress
upgrade/luminous-x (nautilus) - in progress
upgrade/mimic-x (nautilus) - in progress
upgrade/nautilus-x (octopus) - in progress
ceph-volume - in progress (Jan pls see)
Thx
YuriW
This is the 6th backport release in the Octopus series. This release
fixes a security flaw affecting Messenger V2 for Octopus & Nautilus. We
recommend all users update to this release.
Notable Changes
---------------
* CVE-2020-25660: Fix a regression in Messenger V2 replay attacks
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-15.2.6.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: cb8c61a60551b72614257d632a574d420064c17a