Hi all,
A couple of weeks ago I started an experimental Pech OSD project [1]
for several reasons: I need an easily hackable OSD in C with the IO path
only, without failover, log-based replication, the PG layer and all the
other things. I want to test the performance of different replication
strategies (client-based, primary-copy, chain) on top of the simplest
and fastest file storage (yes, a step back to FileStore), which reads
and writes directly to files without any journals involved.
Eventually this Pech OSD can be a starting point for something
different, something which is not RADOS, which is fast, has minimal
IO ordering requirements and acts as a RAID 1 cluster, e.g. something
like what is described here [2].
Q: What is this name, Pech?
A: Just an anagram of Ceph. It is also a German word which describes
this work perfectly. Pronounced exactly the same: [peh].
Q: Why C, why Linux kernel sources?
A: I found it more comfortable to hack on Ceph, analyzing the protocol
implementation and the monitor and OSD client code, by reading the
Linux kernel C code instead of the legacy OSD C++ code or the Crimson
project.
The Linux kernel path net/ceph has everything I need: the monitor
client, the v1 messenger, osdmap, monmap, and all the headers and
defines. Since kernel sources are by default cleansed of external
library dependencies, it is just a homework exercise to provide a
layer of kernel API in order to build all sources from the net/ceph
path as a userspace application with no modifications made.
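To give a feel for what that layer looks like, here is a hypothetical
minimal sketch (not the actual compat layer, which covers much more)
mapping a few kernel allocation primitives onto libc:

/* Hypothetical userspace shims for kernel allocation primitives.
 * The idea: keep the kernel API names, back them with libc. */
#include <stdlib.h>

typedef unsigned int gfp_t;     /* stand-in for the kernel type */
#define GFP_KERNEL 0u

static inline void *kmalloc(size_t size, gfp_t flags)
{
        (void)flags;            /* allocation flags carry no meaning here */
        return malloc(size);
}

static inline void *kzalloc(size_t size, gfp_t flags)
{
        (void)flags;
        return calloc(1, size); /* kzalloc() zeroes, like calloc() */
}

static inline void kfree(const void *ptr)
{
        free((void *)ptr);      /* kfree() accepts const pointers */
}

With enough such shims in place, the net/ceph sources can compile
unmodified against the userspace layer instead of the kernel proper.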
I also really like the idea of code unification: the same sources
can be compiled and used on both sides.
Continuing this hackery madness, IMO it is possible to compile
drivers/block/rbd.c in userspace and use it as a separate, very
light RBD client. Why? The same single-threaded architecture (see the
next question for details) can be a win in terms of performance for
a client; at the same time it may be interesting for debugging
purposes or fast prototyping.
Q: What is the architecture?
A: I do not use threads; I use cooperative scheduling and jump between
task contexts using setjmp()/longjmp() calls. This model perfectly
fits a UP kernel with preemption disabled, thus the reworked scheduling
(sched.c), workqueue.c and timer.c code runs the event loop.
So again: no atomic operations, no locks, everything is one thread.
In the future the number of event loops can be made equal to the
number of physical CPUs, where each event loop is executed from a
dedicated pthread context and pinned to a particular CPU.
Does that sound similar to Crimson, and can it be described with all
the same buzzwords from advertising brochures? Absolutely.
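To illustrate the switching mechanism, below is a tiny standalone
sketch (not Pech code): one task created on its own heap-allocated
stack via makecontext(), then suspended and resumed with
setjmp()/longjmp(). It assumes plain, non-fortified glibc
setjmp()/longjmp() behavior when jumping between stacks:

#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static jmp_buf sched_env, task_env;

static void task_fn(void)
{
        printf("task: started\n");
        if (!setjmp(task_env))
                longjmp(sched_env, 1);  /* yield to the scheduler */
        printf("task: resumed\n");
        longjmp(sched_env, 2);          /* done, never returns */
}

int main(void)
{
        ucontext_t task_uc, main_uc;

        /* Create the task on its own 64K heap stack. */
        getcontext(&task_uc);
        task_uc.uc_stack.ss_sp = malloc(64 * 1024);
        task_uc.uc_stack.ss_size = 64 * 1024;
        task_uc.uc_link = &main_uc;
        makecontext(&task_uc, task_fn, 0);

        switch (setjmp(sched_env)) {
        case 0:
                swapcontext(&main_uc, &task_uc); /* first activation */
                break;
        case 1:
                printf("sched: task yielded\n");
                longjmp(task_env, 1);   /* resume on the task's stack */
        case 2:
                printf("sched: task finished\n");
                break;
        }
        free(task_uc.uc_stack.ss_sp);
        return 0;
}

A real event loop keeps many such contexts and picks the next
runnable one instead of the hardcoded switch above.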
Q: What can this noop OSD do now?
A: Right now it can only:
o Connect to the monitors and "boot" the OSD, i.e. mark it as UP.
o On Ctrl+C, mark the OSD as DOWN on the monitors and exit gracefully.
Q: What has not yet been ported from the kernel sources?
A: The crypto part is a noop for now, thus monitors should be run with
auth=none. To make cephx work, either a direct copy-paste of the
kernel crypto sources has to be done, or a wrapper over the OpenSSL
library should be written; see the empty interface stubs in
src/ceph/crypto.c for details.
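For illustration, the OpenSSL route could be shaped roughly like this;
the function name and signature here are made up, but the EVP calls
are the standard OpenSSL interface for AES-128-CBC, which is what
cephx uses:

#include <openssl/evp.h>

/* Hypothetical replacement for one of the empty stubs: encrypt a
 * buffer with AES-128-CBC through OpenSSL's EVP interface. */
static int pech_aes_encrypt(const unsigned char *key,
                            const unsigned char *iv,
                            const unsigned char *in, int in_len,
                            unsigned char *out, int *out_len)
{
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len, total = 0;

        if (!ctx)
                return -1;
        if (EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv) != 1)
                goto err;
        if (EVP_EncryptUpdate(ctx, out, &len, in, in_len) != 1)
                goto err;
        total = len;
        if (EVP_EncryptFinal_ex(ctx, out + total, &len) != 1)
                goto err;
        *out_len = total + len;
        EVP_CIPHER_CTX_free(ctx);
        return 0;
err:
        EVP_CIPHER_CTX_free(ctx);
        return -1;
}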
Q: What are the instructions to build and start Pech OSD?
A: Make:
$ make -j8
Start a new Ceph cluster with 1 OSD and then stop everything. We
start the monitors on the specified port and with the -X option,
i.e. auth=none.
$ CEPH_PORT=50000 MON=1 MDS=0 OSD=1 MGR=0 ../src/vstart.sh
--memstore -n -X
$ ../src/stop.sh
Restart only the Ceph monitor(s):
$ MON=1 MDS=0 OSD=0 MGR=0 ../src/vstart.sh
Start pech-osd, accessing the monitor over the v1 protocol:
$ ./pech-osd mon_addrs=ip.ip.ip.ip:50001 name=0 fsid=`cat
./osd0/fsid` log_level=5
For debugging purposes the maximum output log level can be specified:
log_level=7
In order not to confuse valgrind with the stack allocations and
deallocations, the USE_VALGRIND=1 option can be passed to make:
$ make USE_VALGRIND=1
Have fun!
[1] https://github.com/rouming/pech
[2]
https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/N46NR7NBHWBQL4B2AS…
--
Roman
This is the eighth update to the Ceph Nautilus release series. This release
fixes issues across a range of subsystems. We recommend that all users upgrade
to this release. Please note the following important changes in this
release; as always the full changelog is posted at:
https://ceph.io/releases/v14-2-8-nautilus-released
Notable Changes
---------------
* The default value of `bluestore_min_alloc_size_ssd` has been changed
to 4K to improve performance across all workloads.
* The following OSD memory config options related to bluestore cache autotuning can now
be configured during runtime:
- osd_memory_base (default: 768 MB)
- osd_memory_cache_min (default: 128 MB)
- osd_memory_expected_fragmentation (default: 0.15)
- osd_memory_target (default: 4 GB)
The above options can be set with::
ceph config set osd <option> <value>
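For example, to raise the memory target of all OSDs to 6 GiB (the
value is given in bytes)::
ceph config set osd osd_memory_target 6442450944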
* The MGR now accepts `profile rbd` and `profile rbd-read-only` user caps.
These caps can be used to provide users access to MGR-based RBD functionality
such as `rbd perf image iostat` and `rbd perf image iotop`.
* The configuration value `osd_calc_pg_upmaps_max_stddev` used for upmap
balancing has been removed. Instead, use the mgr balancer config
`upmap_max_deviation`, which is now an integer number of PGs of deviation
from the target number of PGs per OSD. This can be set with a command like
`ceph config set mgr mgr/balancer/upmap_max_deviation 2`. The default
`upmap_max_deviation` is 1. There are situations where crush rules
would not allow a pool to ever have completely balanced PGs, for example
when crush requires 1 replica on each of 3 racks but there are fewer OSDs
in 1 of the racks. In those cases, the configuration value can be increased.
* RGW: a mismatch between the bucket notification documentation and the actual
message format was fixed. This means that any endpoints receiving bucket
notifications will now receive the same notifications inside a JSON array
named 'Records'. Note that this does not affect pulling bucket notifications
from a subscription in a 'pubsub' zone, as these are already wrapped inside
that array.
* CephFS: forward scrub with multiple active MDS daemons is now rejected. Scrub
is currently only permitted on a file system with a single rank. Reduce the
number of ranks to one via `ceph fs set <fs_name> max_mds 1`.
* Ceph now refuses to create a file system with a default EC data pool. For
further explanation, see:
https://docs.ceph.com/docs/nautilus/cephfs/createfs/#creating-pools
* Ceph will now issue a health warning if a RADOS pool has a `pg_num`
value that is not a power of two. This can be fixed by adjusting
the pool to a nearby power of two::
ceph osd pool set <pool-name> pg_num <new-pg-num>
Alternatively, the warning can be silenced with::
ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-14.2.8.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 2d095e947a02261ce61424021bb43bd3022d35cb
--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH
GF: Felix Imendörffer HRB 21284 (AG Nürnberg)
Docubetter Meeting -- 2020 Mar 11
There is a general documentation meeting called the "DocuBetter Meeting",
and it is held every two weeks. The next DocuBetter Meeting will be on
March 11, 2020 at 0830 PST, and will run for thirty minutes. Everyone with
a documentation-related request or complaint is invited. The meeting will
be held here: https://bluejeans.com/908675367
Send documentation-related requests and complaints to me by replying to
this email and CCing me at zac.dover@gmail.com.
This message will be sent to dev@ceph.io every Monday morning, North
American time.
Hi everyone.
The next DocuBetter meeting is scheduled for:
11 Mar 2020 0830 PST
11 Mar 2020 1630 UTC
12 Mar 2020 0230 AEST
Etherpad: https://pad.ceph.com/p/Ceph_Documentation
Meeting: https://bluejeans.com/908675367
Thanks, everyone.
Zac Dover