Hey folks, now that Pacific is out I wanted to bring up docs backports.
Today, docs.ceph.com shows master by default, with an appropriate
warning at the top that it represents a development version.
Since the primary audience of the docs is users, not developers, I
suggest that we switch the default branch to the latest stable, i.e.
pacific, and apply the normal backport process to docs that are
relevant to the latest stable release as well.
To kickstart things, I'll prepare a backport of the existing
doc changes since the pacific release.
What do folks think?
On Wed, May 19, 2021 at 11:32:04AM +0800, Zhi Zhang wrote:
> On Wed, May 19, 2021 at 11:19 AM Zhi Zhang <zhang.david2011(a)gmail.com>
> > On Tue, May 18, 2021 at 10:58 PM Mykola Golub <to.my.trociny(a)gmail.com>
> > wrote:
> > >
> > > Could you please provide the full rbd-nbd log? If it is too large for
> > > the attachment then may be via some public url?
> > ceph.rbd-client.log.bz2
> > <https://drive.google.com/file/d/1TuiGOrVAgKIJ3BUmiokG0cU12fnlQ3GR/view?usp=…>
> > I uploaded it to Google Drive. Please check it out.
> We found the reader_entry thread read zero bytes when trying to read the nbd
> request header; rbd-nbd then exited and closed the socket. But we haven't
> figured out why it read zero bytes.
Ok. I was hoping to find some hint in the log as to why the read from the
kernel could return without data, but I don't see one.
From experience it could happen when rbd-nbd got stuck or was too slow, so
the kernel failed after a timeout, but that looked different in the logs
AFAIR. Anyway, you can try increasing the timeout using the rbd-nbd
--timeout option (--io-timeout in newer versions). The default is 30 sec.
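(For example, something like "rbd-nbd map --io-timeout 120 <pool>/<image>"
on recent versions, or --timeout instead on older ones.)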
If that does not help, you will probably find a clue by increasing the
kernel debug level for nbd (it seems that is possible).
Recently (or not so recently, it's been almost 2 years), the nfs-ganesha
project implemented the capability to use asynchronous, non-blocking I/O to
storage backends to prevent thread starvation. The assumption is that the
backend provides non-blocking I/O with a callback mechanism to notify
nfs-ganesha when the I/O is complete, so that nfs-ganesha can subsequently
respond to the client asynchronously, indicating I/O completion.
Ceph looks like it is structured to allow this, with Context objects whose
finish and complete methods let the I/O path signal completion. In general
libcephfs seems to use some form of condition-variable Context to block and
wait for this notification. This would be relatively easy to replace with a
callback Context.
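To make the two styles concrete, here is a minimal, self-contained sketch
(illustrative names, not Ceph's actual classes; the real Context and
friends live in the Ceph tree):

  // Minimal sketch of the two Context styles discussed above.
  #include <condition_variable>
  #include <cstdio>
  #include <functional>
  #include <mutex>
  #include <thread>

  // Rough shape of Ceph's Context interface: finish() does the work,
  // complete() calls finish() and (by default) deletes the object.
  struct Context {
    virtual ~Context() = default;
    virtual void finish(int r) = 0;
    virtual void complete(int r) {
      finish(r);
      delete this;
    }
  };

  // Blocking style: the caller waits on a condition variable until the
  // I/O path calls complete().  Meant to live on the stack, so complete()
  // does not delete the object.
  struct C_BlockingCond : public Context {
    std::mutex lock;
    std::condition_variable cond;
    bool done = false;
    int rval = 0;
    void finish(int r) override {
      std::lock_guard<std::mutex> l(lock);
      rval = r;
      done = true;
      cond.notify_all();
    }
    void complete(int r) override { finish(r); }  // no delete, stack object
    int wait() {
      std::unique_lock<std::mutex> l(lock);
      cond.wait(l, [this] { return done; });
      return rval;
    }
  };

  // Callback style: run a user-supplied callback when the I/O completes,
  // e.g. the hook nfs-ganesha would use to send its async reply.
  struct C_Callback : public Context {
    std::function<void(int)> cb;
    explicit C_Callback(std::function<void(int)> f) : cb(std::move(f)) {}
    void finish(int r) override { cb(r); }
  };

  int main() {
    // Blocking: another thread completes the I/O, we wait for it.
    C_BlockingCond blocking;
    std::thread io([&] { blocking.complete(0); });
    printf("blocking wait -> %d\n", blocking.wait());
    io.join();

    // Callback: heap allocated, complete() runs the callback then deletes.
    Context *ctx = new C_Callback([](int r) { printf("callback -> %d\n", r); });
    ctx->complete(0);
    return 0;
  }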
However, libcephfs does use ObjectCacher and sets the block_writes_upfront
flag, which seems to make any writes that go through ObjectCacher block on
an internal condition variable rather than use the onfreespace Context
object (which maybe should have been
I'm wondering what the implication of setting block_writes_upfront to false
would be for libcephfs, beyond needing to ensure an onfreespace Context
object is passed.
Below are discussion points regarding the addition of jaeger tracing to the
Any feedback is welcome!
* multipart upload use case:
- correlate the flow of multipart upload operations that may be spread
across multiple RGWs (e.g. "put" of different parts on different RGWs)
- on each RGW, we would like to be able to follow the tracepoints of the
operation from the frontend, via librados, down to the OSD
- we would like to be able to correlate the syncing of the object from the
RGWs where the upload is done to RGWs on other zones
- this is probably the use case that would require the most tracing features
and would have the most value
* deployment in case of multisite:
- agents should run per host, co-located with the RGWs
- collectors can be per cluster
- if we have multiple clusters, we should probably follow the "Kafka as
intermediate buffer" architecture from here, having multiple collectors
send the spans/traces to a centralized location
- 1st deployment option would be manual, for which we would provide only
- 2nd deployment option would be in the case of k8s and OpenShift.
more investigation is needed to figure out how to support centralized DB
location if different ceph clusters are in different k8s clusters
- 3rd option would be using cephadm; some work was started by the OSD team,
but this probably won't cover the multisite case
* logs in traces:
- we should probably avoid unstructured, string-based logs and should mainly
use the trace/span names, together with tags, to convey the information
- e.g. use error codes as tags, instead of error messages in logs
- in the future, we may add structured or dictionary based logs. note that
string copy would still be needed for the traces unless we modify the
underlying jaeger code
- for multisite sync tracing between RGWs, the trace should be added to the
bucket-index log: injected when the object is created and extracted by the
RGW that does the sync itself
- adding that to the sync REST (HTTP) API is probably less useful
- would need to piggyback the trace onto the RADOS protocol, inject the span
inside librados and extract it in the OSD for the rest of the tracing. this
could be done similarly to the work done for blkin tracing. need to check
whether we can use the same API or need to add a new one (see the sketch
below)
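A rough sketch of that inject/extract flow, assuming the opentracing-cpp
API that the jaeger client implements (the function names, span names and
the blob carrier are made up for illustration; a real Jaeger tracer would
have to be registered globally, otherwise the default no-op tracer makes
these calls do nothing):

  // Sketch: serialize a span context into a blob that could ride along
  // with a bucket-index log entry or a RADOS op, and restore it on the
  // other side, using opentracing-cpp's stream-based Inject/Extract.
  #include <opentracing/tracer.h>
  #include <sstream>
  #include <string>

  // On the RGW/librados side: start a span and encode its context.
  std::string make_trace_blob() {
    auto tracer = opentracing::Tracer::Global();
    auto span = tracer->StartSpan("put_obj");
    std::ostringstream blob;
    tracer->Inject(span->context(), blob);   // serialize the span context
    span->Finish();
    return blob.str();
  }

  // On the OSD / syncing-RGW side: decode the context, continue the trace.
  void continue_trace(const std::string &blob) {
    auto tracer = opentracing::Tracer::Global();
    std::istringstream in(blob);
    auto parent = tracer->Extract(in);
    if (!parent || !*parent) {
      return;   // nothing to continue from (e.g. tracing disabled upstream)
    }
    auto child = tracer->StartSpan("osd_op",
                                   {opentracing::ChildOf(parent->get())});
    child->Finish();
  }

  int main() {
    auto blob = make_trace_blob();
    continue_trace(blob);
    return 0;
  }

The same pattern should apply whether the context rides in a bucket-index
log entry or in a new field of the RADOS op; the open question above is
only where the blob lives and which API layer does the inject/extract.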
* ops context:
- we should probably add host_id from req_state as a tag to identify the
RGW that emitted the trace
- for multipart upload, we should add the "upload_id" as a tag so that
traces that start on different RGWs could be correlated
- we should use return codes as tags, to indicate the success/failure
reason of the operation. this could be done at the base level (see the
sketch below)
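A minimal sketch of what that tagging could look like, again against the
opentracing-cpp API (the function name, parameters and tag names are
illustrative, not existing RGW code):

  // Sketch: tagging an RGW op span so traces from different RGWs/zones
  // can be correlated and filtered.  Tag names here are placeholders.
  #include <opentracing/tracer.h>
  #include <cstdint>
  #include <string>

  void trace_multipart_part(const std::string &host_id,
                            const std::string &upload_id,
                            int part_num, int ret) {
    auto span = opentracing::Tracer::Global()->StartSpan("put_obj_part");
    span->SetTag("host_id", host_id);      // which RGW emitted the trace
    span->SetTag("upload_id", upload_id);  // correlate parts across RGWs
    span->SetTag("part_num", static_cast<int64_t>(part_num));
    span->SetTag("return_code", static_cast<int64_t>(ret));  // code, not a log
    span->Finish();
  }

  int main() {
    trace_multipart_part("rgw-host-1", "upload-123", 1, 0);
    return 0;
  }

Filtering on upload_id in the Jaeger UI would then group the per-part
spans emitted by different RGWs.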
* locks and thread_local tracers:
- the goal here is to avoid contention on the locks used when a "Finish()"
is called on a span - which sends the data to the agent
- AFAIK, in our threads/coroutine model, a different thread may resume a
coroutine that started on another thread. this means that the span would
use a different tracer to do the sending than the one that was used to
create it. need to make sure that this works (see the sketch below)
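The scenario to validate would look roughly like this (sketch only;
thread_tracer() is a hypothetical per-thread accessor, and whether real
per-thread Jaeger tracers behave correctly here is exactly what needs
checking):

  // Sketch: one tracer handle per thread, with a span created on one
  // thread and finished on another (the coroutine-migration case above).
  #include <opentracing/tracer.h>
  #include <memory>
  #include <thread>
  #include <utility>

  // Hypothetical per-thread tracer accessor; a real implementation would
  // construct a jaeger tracer per thread instead of the global/noop one.
  std::shared_ptr<opentracing::Tracer> &thread_tracer() {
    thread_local std::shared_ptr<opentracing::Tracer> tracer =
        opentracing::Tracer::Global();
    return tracer;
  }

  int main() {
    // Thread A creates the span with its own tracer...
    auto span = thread_tracer()->StartSpan("rgw_op");

    // ...but the coroutine resumes on thread B, which calls Finish().
    std::thread other([s = std::move(span)]() mutable {
      // Which tracer ends up doing the reporting here (thread A's or
      // thread B's), and whether that is lock-contention free and safe,
      // is the thing to verify.
      s->Finish();
    });
    other.join();
    return 0;
  }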
* conditional tracing:
- this was brought up as an important usability issue for tracing, but was
not discussed further. we should set up a separate discussion for that topic
- current code would allow dynamic enabling/disabling of tracing (sketch
below)
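A minimal illustration of what such a runtime toggle could look like
(names are made up; this is not the current code):

  // Sketch: gate tracing behind a runtime-toggleable flag so it can be
  // enabled/disabled without restarting the daemon.
  #include <opentracing/tracer.h>
  #include <atomic>
  #include <memory>

  std::atomic<bool> tracing_enabled{false};

  std::unique_ptr<opentracing::Span> maybe_start_span(const char *name) {
    if (!tracing_enabled.load(std::memory_order_relaxed)) {
      return nullptr;   // callers must tolerate a null span
    }
    return opentracing::Tracer::Global()->StartSpan(name);
  }

  int main() {
    tracing_enabled = true;
    if (auto span = maybe_start_span("list_bucket")) {
      span->Finish();
    }
    return 0;
  }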
Today's meeting minutes:
- 16.2.5 shipped on time, next up is 15.2.13 (no particular urgency
but should start rounding up PRs)
- move release containers to quay (Dimitri)
- we are currently split between dockerhub and quay which causes
- dockerhub is full of legacy cruft, e.g. daemon-base images which
are used only by ceph-ansible and ceph-nano
- only new-style images/tags would be pushed to quay
- update release email/blog template to include links to the container
registry web UI (David)
- improve release notes script (Josh)
- ideally get to the point where the output doesn't require any
- Guillaume is taking over ceph-volume maintainership from Jan
- ceph.io team page is woefully out of date
- high-level development priorities doc (why as opposed to what)
- need to flesh out a strategy for local storage and replica-1 use cases
- want raw devices for OSDs but also the ability to carve out chunks
for db and wal
- existing solutions seem incomplete
- introduce rook-local (rook on bare metal) operator?
- replica-1 use cases
- scratch storage for playing around, prototyping, etc
- storage for workloads such as mongodb that do their own replication
- offload these to the operator or bite the bullet and make the replica-1
corner case work well?
- a bit too complicated of a stack for something that
could be just a local partition but things like mirroring would
- must avoid spreading replica-1 PVs across OSDs
- could be pgp_num = 1 or a custom CRUSH rule
- rados suite environmental issues are being worked out (centos.stream
- fs:workload suite migrated to cephadm seems to be exposing a race in
podman/runc related to starting containers
- component suites migrated to cephadm can pick a single distro
- we have enough distro coverage in cephadm suite
I think we probably need to redo this bit of documentation:
I would just spin up a patch, but I think we might also just want to
reconsider recommending an ingress controller at all.
Some people seem to be taking this to mean that they can shoot down one
of the nodes in the NFS server cluster, and the rest will just pick up
the load. That's not at all how this works.
If an NFS cluster node goes down, then it _must_ be resurrected in some
fashion, period. Otherwise, the MDS will eventually (in 5 mins) time out
the state it held and the NFS clients will not be able to reclaim their
state.
Given that, the bulleted list at the start of the doc above is wrong. We
cannot do any sort of failover if there is a host failure. My assumption
was that the orchestrator took care of starting up an NFS server
elsewhere if the host it was running on went down. Is that not the case?
In any case, I think we should reconsider recommending an ingress
controller at all. It's really just another point of failure, and a lot
of people seem to be misconstruing what guarantees it offers.
Round-robin DNS would be a better option in this situation, and it
wouldn't be as problematic if we want to support things like live
shrinking the cluster in the future.
Jeff Layton <jlayton(a)redhat.com>
Notes from today's call:
* lists.ceph.io changes
* ceph-users - When nonmember posts, reject with notification
instead of Hold for moderation
* Disabled bounce processing on dev, ceph-users, sepia
* all lists? Done.
* Posting via web is disabled (not just a header rule now)
* upgrade tests, rgw failures
* consistently fails, will be investigated
* telemetry crashes integration follow up
* crash triage / crash queue custom queries now in redmine
* will open manageable amount of trackers and see how triage goes,
listen to feedback before next batch
* teuthology dev reinvigorated
* semi-weekly calls on community calendar, link and notes here:
* focus on ease of use
* new version of pulpito, goal to be the main interface to
* simple dev setup (docker-compose) for all services