The document formerly known as "HACKING.rst" is now live on the docs
website at the following location:
https://docs.ceph.com/docs/master/dev/developer_guide/dash-devel/
This represents a significant addition to the Developer Guide. If you
haven't seen it before, take a look. It's big.
Please direct pull requests against this new file and not against
HACKING.rst.
Zac Dover
Upstream Docs
Ceph
Sorry for cross-posting. I sent this mail to ceph-maintainers two
months ago but have received no responses so far. After reading the
comments in https://github.com/ceph/ceph-deploy/pull/496, I think I
should check with ceph-devel as well, so I am forwarding this mail to
ceph-devel for more input.
---------- Forwarded message ---------
From: kefu chai <tchaikov(a)gmail.com>
Date: Thu, Jun 4, 2020 at 6:39 PM
Subject: is ceph-deploy still used?
To: <ceph-maintainers(a)ceph.io>
Cc: Neha Ojha <nojha(a)redhat.com>, Josh Durgin <jdurgin(a)redhat.com>,
Brad Hubbard <bhubbard(a)redhat.com>, James Page <james.page(a)ubuntu.com>
Hi ceph maintainers,
When reviewing ceph-deploy PRs, I wonder why we are still
maintaining this tool. As I understand it, we are supposed to deploy Ceph
using the Ansible playbooks offered by ceph-ansible[0], and in the future
we are more likely to deploy a Ceph cluster using cephadm[1].
So the question is: are you still packaging / using ceph-deploy?
cheers,
--
[0] https://github.com/ceph/ceph-ansible
[1] https://ceph.io/ceph-management/introducing-cephadm/
--
Regards
Kefu Chai
Hi folks,
here are the links for slides/sheets I presented at yesterday's perf call.
Slides:
https://docs.google.com/presentation/d/1Qid__UuHmE5PhVmFT8aviZADuiLp32zzbhq…
Sheets:
https://docs.google.com/spreadsheets/d/1ngQA-x7Qpk0HARlkfZIOVFW8TGuAYhmhoW4…
@Josh - some feedback to one of your comments at the call:
Today I ran another experiment: the original(!) delete with sleep=1s
(dropped from the default 2s just to complete faster).
The second stage's parallel writes ran for 2500s (the initial
ones had run for 1000s, as before). Pool removal completed in 2127
seconds, and one can also observe a write-performance drop for a few
seconds before completion in this scenario.
See the "long original deleting" sheet under the second link above. Hence it
looks like bulk removals are no worse than the original approach in this respect...
Thanks,
Igor
Hello everyone,
When I start my RGW server, I get the error "couldn't init
storage provider".
Command I used : "RGW=1 ../src/vstart.sh -d -n -x"
OS:ubuntu : 18.04
Ceph version : ceph version 15.1.0-1866-g053fd8f816
(053fd8f816ec0583fddfc63918dda521a3cf821e) octopus (rc)
I see that the error occurs inside "rgw_sal_rados.cc" in the function
`rgw::sal::RadosStore::init_storage_provider(...)`, but I don't understand
why it is happening.
I tried restarting my system, but it is still not working.
I am not sure which file I should share with you all to diagnose the issue,
so please ask me for any specific file :)
Thank You,
Abhinav Singh
Hi,
I have changed most of the pools in my cluster from 3-replica to EC 4+2.
When I use the ceph df command to show the used capacity of the cluster:
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 1.8 PiB 788 TiB 1.0 PiB 1.0 PiB 57.22
ssd 7.9 TiB 4.6 TiB 181 GiB 3.2 TiB 41.15
ssd-cache 5.2 TiB 5.2 TiB 67 GiB 73 GiB 1.36
TOTAL 1.8 PiB 798 TiB 1.0 PiB 1.0 PiB 56.99
POOLS:
    POOL                             ID   STORED    OBJECTS   USED      %USED   MAX AVAIL
    default-oss.rgw.control           1   0 B             8   0 B           0     1.3 TiB
    default-oss.rgw.meta              2   22 KiB         97   3.9 MiB       0     1.3 TiB
    default-oss.rgw.log               3   525 KiB       223   621 KiB       0     1.3 TiB
    default-oss.rgw.buckets.index     4   33 MiB         34   33 MiB        0     1.3 TiB
    default-oss.rgw.buckets.non-ec    5   1.6 MiB        48   3.8 MiB       0     1.3 TiB
    .rgw.root                         6   3.8 KiB        16   720 KiB       0     1.3 TiB
    default-oss.rgw.buckets.data      7   274 GiB   185.39k   450 GiB    0.14     212 TiB
    default-fs-metadata               8   488 GiB   153.10M   490 GiB   10.65     1.3 TiB
    default-fs-data0                  9   374 TiB     1.48G   939 TiB   74.71     212 TiB
    ...
For 3-replica pools, USED = 3 * STORED, which is exactly right. But for the
EC 4+2 pool (default-fs-data0), USED should be 1.5 * STORED
(374 TiB * 1.5 = 561 TiB), yet ceph df reports 939 TiB, roughly 2.5 * STORED.
P.S. I have another cluster with the same config, and its ceph df output is right.
The difference between them is that this cluster has HDD OSDs of different sizes (8 TB and 12 TB).
I'm not sure whether this is a bug, but the reported usage does not look reasonable.
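For reference, the expected overhead for each pool type can be checked with a quick calculation (a sketch; the numbers are taken from the ceph df output above):

```python
def expected_used(stored, k, m):
    """Expected raw usage for an erasure-coded pool with profile k+m."""
    return stored * (k + m) / k

def expected_used_replicated(stored, replicas):
    """Expected raw usage for a replicated pool."""
    return stored * replicas

# default-fs-data0 stores 374 TiB in an EC 4+2 pool:
print(expected_used(374, 4, 2))   # 561.0 TiB expected USED
# ...but ceph df reports 939 TiB, i.e. roughly 2.5x STORED:
print(939 / 374)
```

So the reported ratio is close to what a 2.5x overhead would give, not the 1.5x an EC 4+2 profile implies.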
Hi Folks,
The weekly performance meeting will be starting in 5 minutes! Today, we
are going to continue discussing refactoring onodes in bluestore to
improve memory usage and CPU overhead. See you there!
Etherpad:
https://pad.ceph.com/p/performance_weekly
Bluejeans:
https://bluejeans.com/908675367
Thanks,
Mark
Hi cephers,
At the CLT meeting today there was agreement to *make Ceph API tests
"required" *again for Pull Requests to be merged:
- The current approach (*"honoring the agreement not to merge failing
PRs"*) is simply not working: PRs have been merged with API tests in
red. While most of these are harmless due to random failures (*we are
working to improve this*), other times API tests warned about real
issues... which eventually slipped into the code. [1]
<https://tracker.ceph.com/issues/47306> [2]
<https://tracker.ceph.com/issues/45717> [3]
<https://github.com/ceph/ceph/pull/36091>
- The cost & risk of debugging issues a posteriori is usually higher
than the pain of retriggering the API tests (*we are working to improve
this*).
- Ceph API tests, even with their downsides, are providing true
integration testing at CI time: this doesn't simply mean complex unit tests
or component testing, it means running a vstart Ceph cluster and actually
testing RADOS, RBD, RGW, CephFS...
*What does this mean?*
If the Ceph API tests are green, great! It's not that hard to achieve: *~75%
of PRs pass the Ceph API tests from the beginning.*
What if they *are NOT* passing?
From GitHub you can access the Ceph API test results in Jenkins by clicking
on *"Details"*, and you'll see a report:
   1. The test may fail due to multiple causes: issues in a Jenkins node,
   GitHub repo fetching, the "make" stage, ... (if this is the case you may
   easily retrigger the Ceph API tests by adding a comment to the PR with
   the text "jenkins test api").
2. If the failure actually happens as a result of the Ceph API tests
themselves, the report will look like this
<https://jenkins.ceph.com/job/ceph-api/2726/>:
From there:
- You can quickly check whether this has already been reported
<https://tracker.ceph.com/search?q=FAIL:%20test_all%20%28tasks.mgr.dashboard…>
(a known issue or a flapping test) or otherwise raise a new issue report
<https://tracker.ceph.com/projects/mgr/issues/new?issue[subject]=FAIL:%20tes…>
.
- If the failure looks like a flapping one, you may retrigger the tests.
   - If, however, the failure is caused by an intentional change in
   behaviour, please reach out to the Dashboard team for help.
*What may you expect from the Dashboard team?*
- We are working to harden Ceph API tests, increase their coverage and
make them more stable. You may check our backlog
<https://pad.ceph.com/p/dashboard-api-test-improvements> of
improvements. You are welcome to contribute with ideas or, even better,
working code ;-)
- We are monitoring every day how Ceph API tests are doing: failure
rate, runtime, ...
   - You can find us on IRC (#ceph-dashboard), GitHub (@ceph/dashboard),
   on this very mailing list, or by pinging us directly: Lenz (in CC) is the
   component lead, Laura (also in CC) is taking care of Dashboard QA, or myself.
Kind regards,
Ernesto
@Haomai,
Does HAVE_IBV_EXP still work with any RNIC in current Ceph repository?
@Nasution:
I have never used the options below:
ms_async_rdma_roce_ver = 0 #RoCEv1, all nodes with same networks. Should I use RoCEv2?
ms_async_rdma_local_gid = fe80:0000:0000:0000:****:****:****:**** #should I use 0000:0000:0000:0000:0000 :****:****:**** one?
To use RDMA, you may need:
1) configure “ulimit -l” to be unlimited
2) For an RNIC with SRQ support:
a. the configuration below should work:
ms_async_rdma_device_name = mlx5_bond_0
ms_cluster_type = async+rdma
ms_public_type = async+posix
b. If you need to distinguish RoCEv1 from RoCEv2, you need to configure "ms_async_rdma_gid_idx".
Reference: https://github.com/ceph/ceph/pull/31517/commits/b971cff51a9179c02f85a27cc19…
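Putting 2a and 2b together, a minimal [global] section for RoCEv2 might look like the sketch below. The gid index value 3 is a placeholder, not a recommendation: the right index is whichever entry in your NIC's GID table maps to RoCEv2 on the network you use.

```ini
[global]
ms_cluster_type = async+rdma
ms_public_type = async+posix
ms_async_rdma_device_name = mlx5_bond_0
# Select RoCEv2 via the GID table entry instead of ms_async_rdma_roce_ver;
# 3 is a placeholder -- check your device's GID table for the RoCEv2 entry.
ms_async_rdma_gid_idx = 3
```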
From: Lazuardi Nasution <mrxlazuardin(a)gmail.com>
Sent: Thursday, September 10, 2020 12:23 AM
To: Liu, Changcheng <changcheng.liu(a)intel.com>
Subject: Ceph with RDMA
Hi,
I'm reading your post regarding Ceph with RDMA. Have you solved your problem? I'm trying the same approach, but currently I'm facing a problem: some OSDs automatically go down not long after they come up, due to no heartbeat reply, even on a newly installed cluster. I'm using the following RDMA-related configuration.
[global]
.......
ms_async_rdma_device_name = mlx5_bond_0
ms_cluster_type = async+rdma
ms_public_type = async+posix
# rbd does not support rdma
ms_async_rdma_polling_us = 0
ms_async_rdma_roce_ver = 0 #RoCEv1, all nodes with same networks. Should I use RoCEv2?
ms_async_rdma_local_gid = fe80:0000:0000:0000:****:****:****:**** #should I use 0000:0000:0000:0000:0000 :****:****:**** one?
[mgr]
ms_type = async+posix
I have set LimitMEMLOCK in the OSD systemd unit file (because the OSD is the only daemon that failed to start without it). Would you mind sharing your configuration of a working Ceph-with-RDMA setup? Am I missing something?
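For reference, the memlock change mentioned above can also be applied as a systemd drop-in rather than by editing the unit file directly (a sketch; the drop-in path follows the usual systemd override convention):

```ini
# /etc/systemd/system/ceph-osd@.service.d/memlock.conf
[Service]
LimitMEMLOCK=infinity
```

After creating the drop-in, run `systemctl daemon-reload` and restart the OSDs so the new limit takes effect.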
Best regards,
Ceph Developers,
There is a proper format for Merge Commits, which has been documented here:
https://docs.ceph.com/docs/master/dev/developer_guide/basic-workflow/#prope…
Kefu is quite keen for us to adhere to this format.
If this needs to be beefed up or slimmed down, let me know.
Zac Dover
Upstream Docs
Ceph