Note: replying as a ceph cluster admin. Hope that is OK.
Hi Prashant,
that sounds like a very interesting idea. I have a few questions/concerns/suggestions from
the point of view of a cluster admin.
Short version:
- please (!!) keep these logs on the dedicated MON storage below /var/lib/ceph
- however: take the logs out of the MON DB and write them to their own DB/file
- make the last-log size a configuration parameter (the log file becomes a ring buffer);
  the limit could be elastic, a combination of max_size and max_age
- optional: make filtering rules a config option (filter by type/debug level)
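To make the proposal concrete, the ring-buffer limits and the filter rules could look roughly like the sketch below. All option names here are hypothetical and do not exist in Ceph today; this is only meant to illustrate the shape of the configuration:

```
[mon]
# hypothetical options, for illustration only
mon_cluster_log_ring_max_size = 256M          # cap the on-disk size of the last-log ring
mon_cluster_log_ring_max_age  = 7d            # additionally drop entries older than this
mon_cluster_log_ring_levels   = WRN,INF,ERR   # filter: drop DBG before storing
```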
Long version:
1) What is the actual problem.
If I recall the "MON store growing rapidly" cases correctly, the problem was not that the
logs go to the MONs; the problem was that the logs don't get trimmed unless the cluster
health is HEALTH_OK. The MONs apparently had no (performance) problem receiving the logs,
but a capacity problem storing them during health failures. If the logs are really just
used for having the last entries available, why not look at the trimming first? Also,
there is nothing in the logs stored on the MONs that isn't in the syslog, so losing
something here is not really a problem to begin with.
2) .mgr pool
2.1) I have become really tired of these administrative pools that are created on the fly
without any regard to device classes, available capacity, PG allocation and the like. The
first one that showed up without warning was device_health_metrics, which put the cluster
into HEALTH_ERR right away, because the on-the-fly pool creation is, well, not exactly
smart.
We don't even have drives below the default root. We have a lot of different pools on
different (custom!) device classes with different replication schemes to accommodate a
large variety of use cases. Administrative pools showing up randomly somewhere in the tree
are a real pain. There are ceph-users cases where people deleted and recreated such a
pool, only to render the device health module useless, because the module seems to store
the pool ID and there is no way to point it at the new pool.
If you really think about adding a pool for this, please, please make the pool creation
part of the upgrade instructions, with some hints on sizing, PGs and realistic (!!!) IOPS
requirements. I personally use the host syslog and have drives with reasonable performance
and capacity in the hosts, so I can pull debug logs at high logging levels. All host
logs are also aggregated to an rsyslogd instance. I don't see *any* need to aggregate
these logs to a ceph pool.
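For reference, this kind of host-level aggregation needs nothing more than a standard
rsyslog forwarding rule on each ceph host; the file name and host name below are
placeholders, not a recommendation:

```
# /etc/rsyslog.d/90-forward.conf (illustrative)
*.*  @@loghost.example.com:514    # @@ = forward via TCP; single @ would be UDP
```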
2.2) Using a ceph pool for logging is not reliable during critical situations. The whole
point of the logging is to provide information in case of disaster. In case of disaster,
we can safely assume that an .mgr pool will not be available. The logging has to be on an
alternative infrastructure that is not affected by ceph storage outages/health problems.
Having it in the MON stores on local storage is such an alternative infrastructure. Why
not just separate the logging storage from the actual MON DB store and make its maximum
size configurable?
I would propose to keep it on the local dedicated MON storage (though outside of the MON
DB), also to keep setting up a ceph cluster simple. If an additional MGR store were now
required, things would get more complicated. Just tell people that 60G is not enough for a
MON store, and at the same time make the last-log size a config option (it should really
be a ring buffer with a configurable fixed maximum number of entries).
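The ring-buffer idea with a combined max-entries/max-age limit is simple to sketch. This
is illustrative Python, not Ceph code; all class and option names here are made up:

```python
# Sketch of a last-log ring buffer bounded by both entry count and age,
# as proposed above. Hypothetical names, for illustration only.
import collections
import time

class LogRingBuffer:
    def __init__(self, max_entries=10000, max_age_s=86400):
        self.max_age_s = max_age_s
        # deque with maxlen drops the oldest entry automatically on overflow
        self.entries = collections.deque(maxlen=max_entries)

    def append(self, message, now=None):
        now = time.time() if now is None else now
        self.entries.append((now, message))
        self._expire(now)

    def _expire(self, now):
        # Drop entries older than max_age_s from the front of the ring.
        while self.entries and now - self.entries[0][0] > self.max_age_s:
            self.entries.popleft()

    def last(self, n):
        # Return up to the n most recent messages, oldest first.
        return [msg for _, msg in list(self.entries)[-n:]]
```

With such a structure, a burst of clog messages during a health failure can never grow
the store past the configured bounds, regardless of whether trimming logic runs.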
3) MGR performance
While it would possibly make sense to let the MGRs do more work, there is the problem that
this work is not distributed (only one MGR does anything) and that MGR modules do not seem
particularly performance-optimized (too much Python). If one wanted to outsource
additional functionality to the MGRs, a good start would be to make all MGRs active and
distribute the work (like a small distributed-memory compute cluster). A bit more
module-crash resilience and some performance improvements would also be welcome.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Prashant Dhange <pdhange(a)redhat.com>
Sent: 22 March 2023 06:35:36
To: dev(a)ceph.io
Subject: Moving cluster log storage from monstore db
Hi All,
We are looking for input on a new feature to move clog message storage out of the monstore
DB; refer to the Trello card [1] for more details on this topic.
Currently, every clog message goes to the monstore DB, and debug/warning messages generate
clog messages thousands of times per second, which leads to the monstore DB growing at an
exponential rate in a catastrophic failure situation.
The primary use cases for the logm entries in monstore db are :
* For "ceph log last" commands to get historical clog entries
* Ceph dashboard (the mgr is a subscriber of log-info, which propagates clog entries
to the dashboard module)
@Patrick Donnelly<mailto:pdonnell@redhat.com> suggested a viable solution to move
the cluster log storage to a new mgr module which handles the "ceph log last"
command. The clog data can be stored in the .mgr pool via libcephsqlite.
Alternatively, if we do not want to get rid of logm storage in the monstore DB, the other
solutions would be:
  * Stop writing logm entries to the mon DB if excessive entries are being
generated
* Filter out clog DBG entries and only log WRN/INF/ERR entries.
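The level-filtering option mentioned above could behave like this minimal sketch
(hypothetical helper, not actual Ceph code; the level tags follow the DBG/WRN/INF/ERR
names used in this thread):

```python
# Minimal sketch of level-based clog filtering: keep WRN/INF/ERR,
# drop DBG before it reaches persistent storage. Hypothetical code.
KEEP_LEVELS = {"WRN", "INF", "ERR"}

def keep_entry(entry):
    # entry is expected to look like "2023-03-22 ... [DBG] message"
    for level in ("DBG", "WRN", "INF", "ERR"):
        if f"[{level}]" in entry:
            return level in KEEP_LEVELS
    # keep entries with no recognizable level tag, to be safe
    return True
```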
Looking forward to additional perspectives around this topic. Feel free to add your
inputs to the Trello card [1] or reply to this email thread.
[1]
https://trello.com/c/oCGGFfTs/822-better-handling-of-cluster-log-messages-f…
Regards,
Prashant