Hi Frank,
Thanks for the inputs. Please find my answers inline below.
On Wed, Mar 22, 2023 at 2:05 AM Frank Schilder <frans(a)dtu.dk> wrote:
Note: replying as a ceph cluster admin. Hope that is OK.
Hi Prashant,
that sounds like a very interesting idea. I have a few
questions/concerns/suggestions from the point of view of a cluster admin.
Short version:
- please (!!) keep these logs on the dedicated MON storage below /var/lib/ceph
- however: take the logs out of the MON DB and write them to their own DB/file
- make the last-log size a configuration parameter (the log file becomes a ring buffer); the config could be elastic, a combination of max_size and max_age
- optional: make filtering rules a config option (filter by type/debug level)
Long version:
1) What is the actual problem?
If I recall the cases about "MON store growing rapidly" correctly, I believe the problem was not that the logs go to the MONs; the problem was that the logs don't get trimmed unless health is HEALTH_OK. The MONs apparently had no (performance) problem receiving the logs, but a capacity problem storing them during health failures. If the logs are really just used for having the last entries available, why not look at the trimming first? Also, there is nothing in the logs stored on the MONs that isn't in the syslog, so losing something here is not really a problem to begin with.
Yes, you are right. Even in the case of HEALTH_OK, the logm trimming encountered one corner case because of potential corruption of the committed versions (https://tracker.ceph.com/issues/53485). And if we trim logm (cluster log) entries aggressively whenever excessive logm entries are being stored, then there is no point in storing them at all, as they will be trimmed sooner than they can be fetched via "ceph log last" or the mgr dashboard.
2) .mgr pool
2.1) I have become really tired of these administrative pools that are created on the fly without any regard to device classes, available capacity, PG allocation and the like. The first one that showed up without warning was device_health_metrics, which turned the cluster HEALTH_ERR right away, because the on-the-fly pool creation is, well, not exactly smart. We don't even have drives below the default root. We have a lot of different pools on different (custom!) device classes with different replication schemes to accommodate a large variety of use cases. Administrative pools showing up randomly somewhere in the tree are a real pain. There are ceph-users cases where people deleted and recreated such a pool, only to render the device health module useless, because it seems to store the pool ID and there is no way to tell it to use the new pool.
I am not sure, but doesn't changing the pool to a new one using "ceph config set mgr mgr/devicehealth/pool_name <new-pool>" work for device health metrics? Maybe we can address this issue related to the device health module via a tracker?
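For example, something along these lines (untested sketch; the pool and rule names are placeholders, and whether the module actually picks up the new pool depends on the pool-ID caching behaviour you describe):

```
# Create a replacement pool in the desired part of the tree,
# then point the devicehealth module at it.
ceph osd pool create device_health_ssd 16 16 replicated ssd-rule
ceph config set mgr mgr/devicehealth/pool_name device_health_ssd
ceph config get mgr mgr/devicehealth/pool_name
```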
If you really think about adding a pool for that, please please make the pool creation part of the upgrade instructions, with some hints on sizing, PGs and realistic (!!!) IOPS requirements. I personally use the host syslog and have drives with reasonable performance and capacity in the hosts to be able to pull debug logs at high log levels. All host logs are also aggregated to an rsyslogd instance. I don't see *any* need to aggregate these logs to a ceph pool.
2.2) Using a ceph pool for logging is not reliable during critical
situations. The whole point of the logging is to provide information in
case of disaster. In case of disaster, we can safely assume that an .mgr
pool will not be available. The logging has to be on an alternative
infrastructure that is not affected by ceph storage outages/health
problems. Having it in the MON stores on local storage is such an
alternative infrastructure. Why not just separate the logging storage from the actual MON DB store and make its max_size configurable?
Agree on 2.1 and 2.2. I really appreciate your efforts to document these concerns in detail. The other caveat with this solution is that if the mgr pool storing ceph cluster logs is not writable (because of full OSDs, network issues, etc.), then we need to find an alternative way to get hold of the cluster logs for troubleshooting purposes.
I would propose keeping it on the local dedicated MON storage (though outside of the MON DB), also to keep setting up a ceph cluster simple. If an additional MGR store were now needed, things would become more complicated. Just tell people that 60G is not enough for a MON store, and at the same time make the last-log size a config option (it should really be a ring buffer with a configurable fixed max number of entries).
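A minimal sketch of such a ring buffer (hypothetical; the class and option names are invented for illustration, combining the fixed max-entry count with the max_age idea from the short version above):

```python
import time
from collections import deque

class ClusterLogBuffer:
    """Ring buffer for cluster log entries, bounded by count and by age."""

    def __init__(self, max_entries=10000, max_age_s=86400):
        # deque with maxlen silently drops the oldest entry on overflow
        self.entries = deque(maxlen=max_entries)
        self.max_age_s = max_age_s

    def append(self, message, now=None):
        now = time.time() if now is None else now
        self.entries.append((now, message))
        self._expire(now)

    def _expire(self, now):
        # Drop entries older than max_age_s from the head of the buffer.
        while self.entries and now - self.entries[0][0] > self.max_age_s:
            self.entries.popleft()

    def last(self, n):
        """Return the n most recent messages, oldest first."""
        return [m for _, m in list(self.entries)[-n:]]
```

Storage stays strictly bounded regardless of cluster health, which is the point: a log flood during a failure overwrites old entries instead of growing the store.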
3) MGR performance
While it would possibly make sense to let the MGRs do more work, there is the problem that this work is not distributed (only one MGR does anything) and that MGR modules do not seem performance-optimized (too much Python). If one wanted to outsource additional functionality to the MGRs, a good start would be to make all MGRs active and distribute the work (like a small distributed-memory compute cluster). A bit more module-crash resilience and some performance improvements would also be welcome.
Yes, the mgr is not distributed, and a single mgr is responsible for the entire mgr workload. The main job of the mgr is to offload the MONs as much as possible. Another concern here is that if the active mgr handles cluster logging through the new pool, then we will miss cluster logs during any timeframe when all mgrs are down.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Regards,
Prashant
________________________________________
From: Prashant Dhange <pdhange(a)redhat.com>
Sent: 22 March 2023 06:35:36
To: dev(a)ceph.io
Subject: Moving cluster log storage from monstore db
Hi All,
We are looking for input on a new feature to move clog message storage out of the monstore db; refer to the trello card [1] for more details on this topic.
Currently, every clog message goes to the monstore db, and debug/warning messages can generate clog messages thousands of times per second, which leads to the monstore db growing at an exponential rate in a catastrophic failure situation.
The primary use cases for the logm entries in the monstore db are:
* the "ceph log last" command, to retrieve historical clog entries
* the Ceph dashboard (the mgr is a subscriber of log-info, which propagates clog entries to the dashboard module)
@Patrick Donnelly<mailto:pdonnell@redhat.com> suggested a viable solution: move the cluster log storage to a new mgr module which handles the "ceph log last" command. The clog data can be stored in the .mgr pool via libcephsqlite.
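As a rough illustration of what such a module's storage could look like (a sketch only: the schema and helper names are invented, and a real mgr module would open the database through libcephsqlite's ceph VFS on the .mgr pool rather than an in-memory database):

```python
import sqlite3

# In a real mgr module this would go through libcephsqlite's ceph VFS;
# an in-memory database stands in for it here.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE clog (
        stamp   REAL NOT NULL,   -- unix timestamp of the entry
        channel TEXT NOT NULL,   -- e.g. 'cluster' or 'audit'
        level   TEXT NOT NULL,   -- DBG / INF / WRN / ERR
        message TEXT NOT NULL
    )
""")
db.execute("CREATE INDEX clog_stamp ON clog (stamp)")

def log(stamp, channel, level, message, keep=10000):
    """Insert one entry and trim the table to the newest `keep` rows."""
    db.execute("INSERT INTO clog VALUES (?, ?, ?, ?)",
               (stamp, channel, level, message))
    db.execute("""
        DELETE FROM clog WHERE rowid NOT IN
            (SELECT rowid FROM clog ORDER BY stamp DESC LIMIT ?)
    """, (keep,))

def log_last(n):
    """Back a 'ceph log last'-style query: newest n entries, oldest first."""
    rows = db.execute(
        "SELECT level, message FROM clog ORDER BY stamp DESC LIMIT ?",
        (n,)).fetchall()
    return rows[::-1]
```

Trimming on insert keeps the store bounded even when health is not OK, which sidesteps the trimming issue discussed above.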
Alternatively, if we do not want to get rid of logm storage in the monstore db, the other solutions would be:
* stop writing logm entries to the mon db if excessive entries are being generated
* filter out clog DBG entries and only log WRN/INF/ERR entries.
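The second option could be as simple as a severity allow-list applied before an entry is persisted (sketch; the option name is invented, and the level strings are assumed to follow the usual clog abbreviations):

```python
# Hypothetical config option: which clog severities get persisted.
mon_cluster_log_store_levels = {"INF", "WRN", "ERR"}

def should_store(entry_level):
    """Drop DBG (and anything else not allow-listed) before persisting."""
    return entry_level in mon_cluster_log_store_levels

def filter_entries(entries):
    """Keep only the entries whose level passes the allow-list."""
    return [e for e in entries if should_store(e["level"])]
```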
Looking forward to additional perspectives around this topic. Feel free to add your input to the trello card [1] or reply to this email thread.
[1]
https://trello.com/c/oCGGFfTs/822-better-handling-of-cluster-log-messages-f…
Regards,
Prashant