Hi Frank,
Thanks for the inputs. Please find my answers inline below.
On Wed, Mar 22, 2023 at 2:05 AM Frank Schilder <frans(a)dtu.dk> wrote:
Note: replying as a ceph cluster admin. Hope that is OK.
Hi Prashant,
that sounds like a very interesting idea. I have a few
questions/concerns/suggestions from the point of view of a cluster admin.
Short version:
- please (!!) keep these logs on the dedicated MON storage below /var/lib/ceph
- however: take the logs out of the MON DB and write them to their own DB/file
- make the last-log size a configuration parameter (the log file becomes a ring buffer); the config could be elastic, a combination of max_size and max_age
- optional: make filtering rules a config option (filter by type/debug level)
Long version:
1) What is the actual problem?
If I recall the cases about "MON store growing rapidly" correctly, I believe the problem was not that the logs go to the MONs; the problem was that the logs don't get trimmed unless health is HEALTH_OK. The MONs apparently had no (performance) problem receiving the logs, but a capacity problem storing them during health failures. If the logs are really just used for having the last entries available, why not look at the trimming first? Also, there is nothing in the logs stored on the MONs that isn't in the syslog, so losing something here is not really a problem to begin with.
Yes, you are right. Even in the case of HEALTH_OK, the logm trimming encountered one corner case because of potential corruption of the committed versions (https://tracker.ceph.com/issues/53485). And if we trim logm (cluster log) entries aggressively whenever excessive logm entries are being stored, then there is no point in storing them at all, as they will be trimmed sooner than they can be fetched via "ceph log last" or the mgr dashboard.
2) .mgr pool
2.1) I have become really tired of these administrative pools that are created on the fly without any regard to device classes, available capacity, PG allocation and the like. The first one that showed up without warning was device_health_metrics, which turned the cluster HEALTH_ERR right away, because the on-the-fly pool creation is, well, not exactly smart. We don't even have drives below the default root. We have a lot of different pools on different (custom!) device classes with different replication schemes to accommodate a large variety of use cases. Administrative pools showing up randomly somewhere in the tree are a real pain. There are ceph-users cases where people deleted and recreated such a pool, only to render the device health module useless, because it seems to store the pool ID and there is no way to tell it to use the new pool.
I am not sure, but doesn't changing the pool to a new one using "ceph config set mgr mgr/devicehealth/pool_name <new-pool>" work for device health metrics? Maybe we can address this issue related to the device health module via a tracker?
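For example, something along these lines (untested sketch; the pool and rule names are placeholders, and whether the module actually picks up the new pool depends on the pool-ID caching behaviour you describe):

```
# Create a replacement pool in the desired part of the tree,
# then point the devicehealth module at it.
ceph osd pool create device_health_ssd 16 16 replicated ssd-rule
ceph config set mgr mgr/devicehealth/pool_name device_health_ssd
ceph config get mgr mgr/devicehealth/pool_name
```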
If you really think about adding a pool for that, please please make the pool creation part of the upgrade instructions, with some hints on sizing, PGs and realistic (!!!) IOPS requirements. I personally use the host syslog and have drives with reasonable performance and capacity in the hosts to be able to pull debug logs at high log levels. All host logs are also aggregated to an rsyslogd instance. I don't see *any* need to aggregate these logs to a ceph pool.
2.2) Using a ceph pool for logging is not reliable during critical
situations. The whole point of the logging is to provide information in
case of disaster. In case of disaster, we can safely assume that an .mgr
pool will not be available. The logging has to be on an alternative
infrastructure that is not affected by ceph storage outages/health
problems. Having it in the MON stores on local storage is such an
alternative infrastructure. Why not just separate the logging storage from the actual MON DB store and make its max_size configurable?
Agree on 2.1 and 2.2. I really appreciate your efforts to document these concerns in detail. The other caveat with this solution is that if the mgr pool storing ceph cluster logs is not writable (because of full OSDs, network issues, etc.), then we need to find an alternative way to get hold of the cluster logs for troubleshooting purposes.
I would propose keeping it on the local dedicated MON storage (though outside of the MON DB), also to keep setting up a ceph cluster simple. If an additional MGR store were now needed, things would become more complicated. Just tell people that 60G is not enough for a MON store, and at the same time make the last-log size a config option (it should really be a ring buffer with a configurable fixed max number of entries).
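A minimal sketch of such a ring buffer (hypothetical; the class and option names are invented for illustration, combining the fixed max-entry count with the max_age idea from the short version above):

```python
import time
from collections import deque

class ClusterLogBuffer:
    """Ring buffer for cluster log entries, bounded by count and by age."""

    def __init__(self, max_entries=10000, max_age_s=86400):
        # deque with maxlen silently drops the oldest entry on overflow
        self.entries = deque(maxlen=max_entries)
        self.max_age_s = max_age_s

    def append(self, message, now=None):
        now = time.time() if now is None else now
        self.entries.append((now, message))
        self._expire(now)

    def _expire(self, now):
        # Drop entries older than max_age_s from the head of the buffer.
        while self.entries and now - self.entries[0][0] > self.max_age_s:
            self.entries.popleft()

    def last(self, n):
        """Return the n most recent messages, oldest first."""
        return [m for _, m in list(self.entries)[-n:]]
```

Storage stays strictly bounded regardless of cluster health, which is the point: a log flood during a failure overwrites old entries instead of growing the store.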
3) MGR performance
While it would possibly make sense to let the MGRs do more work, there is the problem that this work is not distributed (only one MGR does anything) and that MGR modules do not seem performance-optimized (too much Python). If one wanted to outsource additional functionality to the MGRs, a good start would be to make all MGRs active and distribute the work (like a small distributed-memory compute cluster). A bit more module-crash resilience and some performance improvements would also be welcome.
Yes, the mgr is not distributed, and a single mgr is responsible for the entire mgr workload. The main job of the mgr is to offload the MONs as much as possible. Another concern here is that if the active mgr handles cluster logging through the new pool, then we will miss cluster logs during any timeframe when all mgrs are down.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Regards,
Prashant
________________________________________
From: Prashant Dhange <pdhange(a)redhat.com>
Sent: 22 March 2023 06:35:36
To: dev(a)ceph.io
Subject: Moving cluster log storage from monstore db
Hi All,
We are looking for input on a new feature to move clog message storage out of the monstore db; refer to the trello card [1] for more details on this topic.
Currently, every clog message goes to the monstore db, and debug/warning messages can generate clog messages thousands of times per second, which leads to the monstore db growing at an exponential rate in a catastrophic failure situation.
The primary use cases for the logm entries in the monstore db are:
* the "ceph log last" command, to retrieve historical clog entries
* the Ceph dashboard (the mgr is a subscriber of log-info, which propagates clog entries to the dashboard module)
@Patrick Donnelly<mailto:pdonnell@redhat.com> suggested a viable solution: move the cluster log storage to a new mgr module which handles the "ceph log last" command. The clog data can be stored in the .mgr pool via libcephsqlite.
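As a rough illustration of what such a module's storage could look like (a sketch only: the schema and helper names are invented, and a real mgr module would open the database through libcephsqlite's ceph VFS on the .mgr pool rather than an in-memory database):

```python
import sqlite3

# In a real mgr module this would go through libcephsqlite's ceph VFS;
# an in-memory database stands in for it here.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE clog (
        stamp   REAL NOT NULL,   -- unix timestamp of the entry
        channel TEXT NOT NULL,   -- e.g. 'cluster' or 'audit'
        level   TEXT NOT NULL,   -- DBG / INF / WRN / ERR
        message TEXT NOT NULL
    )
""")
db.execute("CREATE INDEX clog_stamp ON clog (stamp)")

def log(stamp, channel, level, message, keep=10000):
    """Insert one entry and trim the table to the newest `keep` rows."""
    db.execute("INSERT INTO clog VALUES (?, ?, ?, ?)",
               (stamp, channel, level, message))
    db.execute("""
        DELETE FROM clog WHERE rowid NOT IN
            (SELECT rowid FROM clog ORDER BY stamp DESC LIMIT ?)
    """, (keep,))

def log_last(n):
    """Back a 'ceph log last'-style query: newest n entries, oldest first."""
    rows = db.execute(
        "SELECT level, message FROM clog ORDER BY stamp DESC LIMIT ?",
        (n,)).fetchall()
    return rows[::-1]
```

Trimming on insert keeps the store bounded even when health is not OK, which sidesteps the trimming issue discussed above.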
Alternatively, if we do not want to get rid of logm storage in the monstore db, the other solutions would be:
* stop writing logm entries to the mon db if excessive entries are being generated
* filter out clog DBG entries and only log WRN/INF/ERR entries.
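The second option could be as simple as a severity allow-list applied before an entry is persisted (sketch; the option name is invented, and the level strings are assumed to follow the usual clog abbreviations):

```python
# Hypothetical config option: which clog severities get persisted.
mon_cluster_log_store_levels = {"INF", "WRN", "ERR"}

def should_store(entry_level):
    """Drop DBG (and anything else not allow-listed) before persisting."""
    return entry_level in mon_cluster_log_store_levels

def filter_entries(entries):
    """Keep only the entries whose level passes the allow-list."""
    return [e for e in entries if should_store(e["level"])]
```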
Looking forward to additional perspectives around this topic. Feel free to add your input to the trello card [1] or reply to this email thread.
[1]
https://trello.com/c/oCGGFfTs/822-better-handling-of-cluster-log-messages-f…
Regards,
Prashant