Hi all,
A couple of weeks ago I started an experimental Pech OSD project [1]
for several reasons: I need an easily hackable OSD in C with the IO path
only, without failover, log-based replication, the PG layer and all the
other things. I want to test the performance of different replication
strategies (client-based, primary-copy, chain) on top of the simplest
and fastest file storage (yes, a step back to FileStore), which reads
and writes directly to files without any journals involved.
Eventually this Pech OSD can be a starting point for something
different, something which is not RADOS, which is fast, has minimal
IO ordering requirements and acts as a RAID 1 cluster, e.g. something
like what is described here [2].
Q: What is this name, Pech?
A: Just an anagram of Ceph. It is also a German word which describes
this work perfectly. Pronounced exactly the same: [peh].
Q: Why C, why Linux kernel sources?
A: I found it more comfortable to hack on Ceph, analyzing the protocol
implementation and the monitor and OSD client code, by reading the
Linux kernel C code instead of the legacy OSD C++ code or the Crimson
project.
The Linux kernel path net/ceph has everything I need: the monitor
client, the v1 messenger, osdmap, monmap, and all the headers and
defines. Since kernel sources are by default cleansed of external
library dependencies, it is just a homework exercise to provide a
layer of kernel API in order to build all sources from the net/ceph
path as a userspace application with no modifications made.
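To give a feel for what that layer looks like, here is a hypothetical
minimal sketch (not the actual compat layer, which covers much more)
mapping a few kernel allocation primitives onto libc:

/* Hypothetical userspace shims for kernel allocation primitives.
 * The idea: keep the kernel API names, back them with libc. */
#include <stdlib.h>

typedef unsigned int gfp_t;     /* stand-in for the kernel type */
#define GFP_KERNEL 0u

static inline void *kmalloc(size_t size, gfp_t flags)
{
        (void)flags;            /* allocation flags carry no meaning here */
        return malloc(size);
}

static inline void *kzalloc(size_t size, gfp_t flags)
{
        (void)flags;
        return calloc(1, size); /* kzalloc() zeroes, like calloc() */
}

static inline void kfree(const void *ptr)
{
        free((void *)ptr);      /* kfree() accepts const pointers */
}

With enough such shims in place, the net/ceph sources can compile
unmodified against the userspace layer instead of the kernel proper.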
I also really like the idea of code unification: the same sources
can be compiled and used on both sides.
Continuing this hackery madness, IMO it is possible to compile
drivers/block/rbd.c in userspace and use it as a separate, very
light RBD client. Why? The same single-threaded architecture (see the
next question for details) can be a win in terms of performance for
a client; at the same time it may be interesting for debugging
purposes or fast prototyping.
Q: What is the architecture?
A: I do not use threads; I use cooperative scheduling and jump between
task contexts using setjmp()/longjmp() calls. This model perfectly
fits a UP kernel with preemption disabled, thus the reworked scheduling
(sched.c), workqueue.c and timer.c code runs the event loop.
So again: no atomic operations, no locks, everything is one thread.
In the future the number of event loops can be made equal to the
number of physical CPUs, where each event loop is executed from a
dedicated pthread context and pinned to a particular CPU.
Does that sound similar to Crimson, and can it be described with all
the same buzzwords from advertising brochures? Absolutely.
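To illustrate the switching mechanism, below is a tiny standalone
sketch (not Pech code): one task created on its own heap-allocated
stack via makecontext(), then suspended and resumed with
setjmp()/longjmp(). It assumes plain, non-fortified glibc
setjmp()/longjmp() behavior when jumping between stacks:

#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static jmp_buf sched_env, task_env;

static void task_fn(void)
{
        printf("task: started\n");
        if (!setjmp(task_env))
                longjmp(sched_env, 1);  /* yield to the scheduler */
        printf("task: resumed\n");
        longjmp(sched_env, 2);          /* done, never returns */
}

int main(void)
{
        ucontext_t task_uc, main_uc;

        /* Create the task on its own 64K heap stack. */
        getcontext(&task_uc);
        task_uc.uc_stack.ss_sp = malloc(64 * 1024);
        task_uc.uc_stack.ss_size = 64 * 1024;
        task_uc.uc_link = &main_uc;
        makecontext(&task_uc, task_fn, 0);

        switch (setjmp(sched_env)) {
        case 0:
                swapcontext(&main_uc, &task_uc); /* first activation */
                break;
        case 1:
                printf("sched: task yielded\n");
                longjmp(task_env, 1);   /* resume on the task's stack */
        case 2:
                printf("sched: task finished\n");
                break;
        }
        free(task_uc.uc_stack.ss_sp);
        return 0;
}

A real event loop keeps many such contexts and picks the next
runnable one instead of the hardcoded switch above.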
Q: What can this noop OSD do now?
A: Right now it can only:
o Connect to the monitors and "boot" the OSD, i.e. mark it as UP.
o On Ctrl+C, mark the OSD as DOWN on the monitors and exit gracefully.
Q: What has not yet been ported from the kernel sources?
A: The crypto part is a noop for now, thus monitors should be run with
auth=none. To make cephx work, either a direct copy-paste of the
kernel crypto sources has to be done, or a wrapper over the OpenSSL
library should be written; see the empty interface stubs in
src/ceph/crypto.c for details.
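For illustration, the OpenSSL route could be shaped roughly like this;
the function name and signature here are made up, but the EVP calls
are the standard OpenSSL interface for AES-128-CBC, which is what
cephx uses:

#include <openssl/evp.h>

/* Hypothetical replacement for one of the empty stubs: encrypt a
 * buffer with AES-128-CBC through OpenSSL's EVP interface. */
static int pech_aes_encrypt(const unsigned char *key,
                            const unsigned char *iv,
                            const unsigned char *in, int in_len,
                            unsigned char *out, int *out_len)
{
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len, total = 0;

        if (!ctx)
                return -1;
        if (EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv) != 1)
                goto err;
        if (EVP_EncryptUpdate(ctx, out, &len, in, in_len) != 1)
                goto err;
        total = len;
        if (EVP_EncryptFinal_ex(ctx, out + total, &len) != 1)
                goto err;
        *out_len = total + len;
        EVP_CIPHER_CTX_free(ctx);
        return 0;
err:
        EVP_CIPHER_CTX_free(ctx);
        return -1;
}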
Q: What are the instructions to build and start Pech OSD?
A: Make:
$ make -j8
Start a new Ceph cluster with 1 OSD and then stop everything. We
start the monitors on the specified port and with the -X option,
i.e. auth=none.
$ CEPH_PORT=50000 MON=1 MDS=0 OSD=1 MGR=0 ../src/vstart.sh
--memstore -n -X
$ ../src/stop.sh
Restart only the Ceph monitor(s):
$ MON=1 MDS=0 OSD=0 MGR=0 ../src/vstart.sh
Start pech-osd, accessing the monitor over the v1 protocol:
$ ./pech-osd mon_addrs=ip.ip.ip.ip:50001 name=0 fsid=`cat
./osd0/fsid` log_level=5
For debugging purposes the maximum output log level can be specified:
log_level=7
In order not to confuse valgrind with the stack allocations and
deallocations, the USE_VALGRIND=1 option can be passed to make:
$ make USE_VALGRIND=1
Have fun!
[1] https://github.com/rouming/pech
[2]
https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/N46NR7NBHWBQL4B2AS…
--
Roman
This is the eighth update to the Ceph Nautilus release series. This release
fixes issues across a range of subsystems. We recommend that all users upgrade
to this release. Please note the following important changes in this
release; as always the full changelog is posted at:
https://ceph.io/releases/v14-2-8-nautilus-released
Notable Changes
---------------
* The default value of `bluestore_min_alloc_size_ssd` has been changed
to 4K to improve performance across all workloads.
* The following OSD memory config options related to bluestore cache autotuning can now
be configured during runtime:
- osd_memory_base (default: 768 MB)
- osd_memory_cache_min (default: 128 MB)
- osd_memory_expected_fragmentation (default: 0.15)
- osd_memory_target (default: 4 GB)
The above options can be set with::
ceph config set osd <option> <value>
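For example, to raise the memory target of all OSDs to 6 GiB (the
value is given in bytes)::
ceph config set osd osd_memory_target 6442450944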
* The MGR now accepts `profile rbd` and `profile rbd-read-only` user caps.
These caps can be used to provide users access to MGR-based RBD functionality
such as `rbd perf image iostat` and `rbd perf image iotop`.
* The configuration value `osd_calc_pg_upmaps_max_stddev` used for upmap
balancing has been removed. Instead, use the mgr balancer config
`upmap_max_deviation`, which is now an integer number of PGs of deviation
from the target number of PGs per OSD. This can be set with a command like
`ceph config set mgr mgr/balancer/upmap_max_deviation 2`. The default
`upmap_max_deviation` is 1. There are situations where crush rules
would not allow a pool to ever have completely balanced PGs, for example
when crush requires 1 replica on each of 3 racks but there are fewer OSDs
in 1 of the racks. In those cases, the configuration value can be increased.
* RGW: a mismatch between the bucket notification documentation and the actual
message format was fixed. This means that any endpoints receiving bucket
notifications will now receive the same notifications inside a JSON array
named 'Records'. Note that this does not affect pulling bucket notifications
from a subscription in a 'pubsub' zone, as these are already wrapped inside
that array.
* CephFS: forward scrub with multiple active MDS daemons is now rejected. Scrub
is currently only permitted on a file system with a single rank. Reduce the
number of ranks to one via `ceph fs set <fs_name> max_mds 1`.
* Ceph now refuses to create a file system with a default EC data pool. For
further explanation, see:
https://docs.ceph.com/docs/nautilus/cephfs/createfs/#creating-pools
* Ceph will now issue a health warning if a RADOS pool has a `pg_num`
value that is not a power of two. This can be fixed by adjusting
the pool to a nearby power of two::
ceph osd pool set <pool-name> pg_num <new-pg-num>
Alternatively, the warning can be silenced with::
ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-14.2.8.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 2d095e947a02261ce61424021bb43bd3022d35cb
--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH
GF: Felix Imendörffer HRB 21284 (AG Nürnberg)
Docubetter Meeting -- 2020 Mar 11
There is a general documentation meeting called the "DocuBetter Meeting",
and it is held every two weeks. The next DocuBetter Meeting will be on
March 11, 2020 at 0830 PST, and will run for thirty minutes. Everyone with
a documentation-related request or complaint is invited. The meeting will
be held here: https://bluejeans.com/908675367
Send documentation-related requests and complaints to me by replying to
this email and CCing me at zac.dover@gmail.com.
This message will be sent to dev@ceph.io every Monday morning, North
American time.
Hi everyone.
The next DocuBetter meeting is scheduled for:
11 Mar 2020 0830 PST
11 Mar 2020 1630 UTC
12 Mar 2020 0230 AEST
Etherpad: https://pad.ceph.com/p/Ceph_Documentation
Meeting: https://bluejeans.com/908675367
Thanks, everyone.
Zac Dover