Moving cluster log storage from monstore db


Prashant Dhange

21 Mar 2023

10:35 p.m.

Hi All,

We are looking for input on a new feature that would move clog message storage out of the monstore db; refer to the Trello card [1] for more details on this topic.

Currently, every clog message goes to the monstore db. In addition, debug/warning messages can generate clog messages thousands of times per second, which makes the monstore db grow at an exponential rate in a catastrophic failure situation.

The primary use cases for the logm entries in the monstore db are:
- "ceph log last" commands, to get historical clog entries
- the Ceph dashboard (the mgr is a subscriber of log-info, which propagates clog to the dashboard module)

@Patrick Donnelly <pdonnell(a)redhat.com> suggested a viable solution: move the cluster log storage to a new mgr module that handles the "ceph log last" command. The clog data can be stored in the .mgr pool via libcephsqlite.

Alternatively, if we do not want to remove logm storage from the monstore db, the other solutions would be:
- Stop writing logm entries to the mon db if excessive entries are being generated
- Filter out clog DBG entries and only log WRN/INF/ERR entries

Looking forward to additional perspectives on this topic. Feel free to add your input to the Trello card [1] or reply to this email thread.

[1] https://trello.com/c/oCGGFfTs/822-better-handling-of-cluster-log-messages-f…

Regards,
Prashant
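The SQLite-backed storage and the DBG filtering mentioned above can be pictured with a small sketch. This is not Ceph code: it uses Python's standard sqlite3 module with an in-memory database as a stand-in for a database opened through libcephsqlite, and the table name `clog`, the `LEVELS` ordering, and the functions `open_store`, `append`, and `log_last` are all hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical severity ordering for illustration; DBG is the lowest.
LEVELS = {"DBG": 0, "INF": 1, "WRN": 2, "ERR": 3}


def open_store(path=":memory:"):
    """Open the log store. Path and schema are illustrative assumptions."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS clog ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " stamp REAL NOT NULL,"
        " level INTEGER NOT NULL,"
        " message TEXT NOT NULL)"
    )
    return db


def append(db, stamp, level, message, min_level="INF"):
    """Store one entry unless it is below min_level (drops DBG by default)."""
    if LEVELS[level] < LEVELS[min_level]:
        return False
    db.execute(
        "INSERT INTO clog (stamp, level, message) VALUES (?, ?, ?)",
        (stamp, LEVELS[level], message),
    )
    db.commit()
    return True


def log_last(db, n, level="INF"):
    """Return up to n most recent messages at or above level, oldest first."""
    rows = db.execute(
        "SELECT message FROM clog WHERE level >= ? ORDER BY id DESC LIMIT ?",
        (LEVELS[level], n),
    ).fetchall()
    return [row[0] for row in reversed(rows)]
```

A query such as `log_last(db, 100, "WRN")` would then play the role of a "ceph log last"-style lookup against the separate store; how the real command would be wired to a mgr module is outside this sketch.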


Frank Schilder

22 Mar 2023

2:05 a.m.

Note: replying as a ceph cluster admin. Hope that is OK.

Hi Prashant,

that sounds like a very interesting idea. I have a few questions/concerns/suggestions from the point of view of a cluster admin.

Short version:
- please (!!) keep these logs on the dedicated MON storage below /var/lib/ceph
- however: take the logs out of the MON DB and write them to their own DB/file
- make the last-log size a configuration parameter (the log file becomes a ring buffer); the config could be elastic and a combination of max_size and max_age
- optional: make filtering rules a config option (filter by type/debug level)

Long version:

1) What is the actual problem?

If I recall the cases about "MON store growing rapidly" correctly, I believe the problem was not that the logs go to the MONs; the problem was that the logs don't get trimmed unless health is health_ok. The MONs apparently had no (performance) problem receiving the logs, but a capacity problem storing them in case of health failures. If the logs are really just used for having the last entries available, why not look at the trimming first? Also, there is nothing in the logs stored on the MONs that isn't in the syslog, so losing something here does not really seem to be a problem to begin with.

2) .mgr pool

2.1) I have become really tired of these administrative pools that are created on the fly without any regard for device classes, available capacity, PG allocation and the like. The first one that showed up without warning was device_health_metrics, which turned the cluster health_err right away because the on-the-fly pool creation is, well, not exactly smart.

We don't even have drives below the default root. We have a lot of different pools on different (custom!) device classes with different replication schemes to accommodate a large variety of use cases. Administrative pools showing up randomly somewhere in the tree are a real pain. There are ceph-user cases where people deleted and recreated it only to make the device health module useless, because it seems to store the pool ID and there is no way to tell it to use the new pool.

If you really think about adding a pool for that, please please make the pool creation part of the upgrade instructions with some hints on sizing, PGs and realistic (!!!) IOP/s requirements. I personally use the host-syslog and have drives with reasonable performance and capacity in the hosts to be able to pull debug logs with high logging values. All host logs are also aggregated to an rsyslogd instance. I don't see *any* need to aggregate these logs to a ceph pool.

2.2) Using a ceph pool for logging is not reliable during critical situations. The whole point of the logging is to provide information in case of disaster. In case of disaster, we can safely assume that an .mgr pool will not be available. The logging has to be on an alternative infrastructure that is not affected by ceph storage outages/health problems. Having it in the MON stores on local storage is such an alternative infrastructure. Why not just separate the logging storage from the actual MON DB store and make it max_size configurable?

I would propose to keep it on the local dedicated MON storage (however outside of the MON DB), also to keep setting up a ceph cluster simple. If we now needed an additional MGR store, things would be more complicated. Just tell people that 60G is not enough for a MON store and at the same time make the last-log size a config option (it should really be a ring buffer with a configurable fixed max-number of entries).

3) MGR performance

While it would possibly make sense to let the MGRs do more work, there is the problem of this work not being distributed (only 1 MGR does something) and that MGR modules seem not really performance optimized (too much python). If one wanted to outsource additional functionality to the MGRs, a good start would be to make all MGRs active and distribute the work (like a small distributed-memory compute cluster). A bit more module-crash resilience and performance improvements are also welcome.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
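The ring buffer with a fixed maximum number of entries combined with max_age, as suggested in the message above, could look roughly like the following sketch. It is a plain in-memory illustration in Python, not an on-disk format and not Ceph code; the class name `LastLog` and the parameters `max_entries` and `max_age` are hypothetical.

```python
import time
from collections import deque


class LastLog:
    """Keep at most max_entries messages, none older than max_age seconds."""

    def __init__(self, max_entries, max_age):
        self.max_age = max_age
        # A deque with maxlen silently discards the oldest entry when full.
        self.entries = deque(maxlen=max_entries)

    def append(self, message, now=None):
        now = time.time() if now is None else now
        self.entries.append((now, message))
        self.trim(now)

    def trim(self, now=None):
        """Drop entries older than max_age, independent of cluster health."""
        now = time.time() if now is None else now
        while self.entries and now - self.entries[0][0] > self.max_age:
            self.entries.popleft()

    def last(self, n, now=None):
        """Return up to n most recent messages, oldest first."""
        self.trim(now)
        if n <= 0:
            return []
        return [message for _, message in list(self.entries)[-n:]]
```

Because trimming depends only on entry count and age, the store stays bounded even while health is not health_ok, which is the failure mode described under point 1.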

Janne Johansson

3:10 a.m.

> 2) .mgr pool
>
> 2.1) I have become really tired of these administrative pools that are created on the fly without any regards to device classes, available capacity, PG allocation and the like. The first one that showed up without warning was device_health_metrics, which turned the cluster health_err right away because the on-the-fly pool creation is, well, not exactly smart.
>
> We don't even have drives below the default root. We have a lot of different pools on different (custom!) device classes with different replication schemes to accommodate a large variety of use cases. Administrative pools showing up randomly somewhere in the tree are a real pain. There are ceph-user cases where people deleted and recreated it only to make the device health module useless, because it seems to store the pool ID and there is no way to tell it to use the new pool.

Ernesto Puerta

6:10 a.m.

Hi Prashant,

Is this move just limited to the impact of the cluster log in the mon store db or is it part of a larger mon db clean-up effort?

I'm asking this because, besides the cluster log, the mon store db is currently used (and perhaps abused) also by some mgr modules via:

- set_module_option() <https://docs.ceph.com/en/quincy/mgr/modules/#mgr_module.MgrModule.set_module_option> to set MODULE_OPTIONS values via CLI commands.
- set_store() <https://docs.ceph.com/en/quincy/mgr/modules/#mgr_module.MgrModule.set_store>: there are 2 main storage use cases here:
  - *Immutable/sensitive data*: instead of exposing those as MODULE_OPTIONS (password hashes, private certificates, API keys, etc.).
  - *Changing data*: mgr-module internal state. While this shouldn't cause the db to grow in the long term, it might cause short-term/compaction issues (I'm not familiar with rocksdb internals, just extrapolating from experience with sstable/leveldb).

For the latter case, Dashboard developers have been looking for an efficient alternative to persistently store rapidly-changing data. We discarded the idea of using a pool since the Dashboard should be able to operate prior to any OSD provisioning and in case of storage downtimes.

Coming back to your original questions, I understand that there are two different issues at stake:

- *Cluster log processing*: currently mon via Paxos (Do we really need Paxos ack for logs? Can we live with some type of eventually-consistent/best-effort storage here?)
- *Cluster log storage*: currently mon store db. AFAIK this is the main issue, right?

From there, I see 2 possible paths:

- *Keep cluster-wide logs as a Ceph concern:*
  - IMHO putting some throttling in place should be a must, since client-triggered cluster logs could easily become a DoS vector.
  - I wouldn't put them into a rados pool, not so much because of data availability in case of OSD service downtime (logs will still be recoverable from logfiles), but because of the potential interference with user workloads/deployment patterns (as Frank mentioned before).
    - Could we run the ".mgr" pool on a new type of "internal/service-only" colocated OSDs (memstore)?
  - Save logs to a fixed-size/TTL-bound priority or multi-level queue structure?
  - Add some (eventually-consistent) store db to the ceph-mgr?
  - To solve ceph-mgr scalability issues, we recently added a new kind of Ceph utility daemon (ceph-exporter) whose sole purpose is to fetch metrics from co-located Ceph daemons' perf-counters and make those available for Prometheus scraping. We could think about a similar thing but for logs... (although it'd be very similar to the Loki approach below).
- *Move them outside Ceph:*
  - Cephadm + Dashboard now support Centralized Logging via Loki + Promtail <https://ceph.io/en/news/blog/2022/centralized_logging/>, which basically polls all daemon logfiles and sends new log traces to a central service (Loki) where they can be monitored/filtered in real-time.
    - If we find the previous solution too bulky for regular cluster monitoring, we could explore systemd-journal-remote <https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html>/rsyslog/...
  - The main downside of this approach is that it might break the "ceph log" command (rados_monitor_log and log events could still be watched I guess).

Kind Regards,
Ernesto

On Wed, Mar 22, 2023 at 11:12 AM Janne Johansson <icepic.dz(a)gmail.com> wrote:

> Ah, that's why it looked unused after I also had to remake it. Since it gets created when you don't have the OSDs yet, the possibilities for it ending up wrong seem very large.
>
> --
> May the most significant bit of your life be positive.
> _______________________________________________
> Dev mailing list -- dev(a)ceph.io
> To unsubscribe send an email to dev-leave(a)ceph.io
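The throttling point raised in Ernesto's message (client-triggered cluster logs as a possible DoS vector) can be illustrated with a per-source token bucket. This is a minimal sketch in Python under stated assumptions, not how the mon or mgr handles log messages today; `TokenBucket`, `ClogThrottle`, and the `rate` and `burst` parameters are hypothetical names.

```python
class TokenBucket:
    """Allow `rate` messages per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.stamp = None  # time of the previous call, None before first use

    def allow(self, now):
        if self.stamp is not None:
            # Refill in proportion to elapsed time, capped at the burst size.
            elapsed = now - self.stamp
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.stamp = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ClogThrottle:
    """One bucket per log source, so a noisy client cannot starve the others."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}
        self.dropped = {}  # per-source count of rejected messages

    def submit(self, source, now):
        bucket = self.buckets.setdefault(
            source, TokenBucket(self.rate, self.burst)
        )
        if bucket.allow(now):
            return True
        self.dropped[source] = self.dropped.get(source, 0) + 1
        return False
```

Keeping a dropped-message counter per source means the store could still record that entries were suppressed, and from whom, without storing each one.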

Matthias Muench

1:51 p.m.

Hi Prashant, et al.,

separating the logs from the DB might be a good thing.

I would second what Frank suggested: local storage. Local to the mon instance hosts, perhaps just saying that flash is required, which shouldn't be an issue nowadays. This would also give the best latency to avoid starvation on IOPS in case of a disaster.

With redundancy in the instances, data is available, at least from one of the mon instance hosts. Relying on pools would assume that communication is intact even between the actors of the pool. An exclusive pool for just this would still depend on the network connection and introduce additional latency, too.

The other alternatives sound promising as well; however, I would like to raise some concerns.

Pushing the logs only to a central location would impose a dependency on this location in case of a disaster. A disaster could also occur in conjunction with a network issue affecting the connection to the outside world. So it might be an add-on, but for troubleshooting rather some kind of additional challenge.

Eventually consistent distribution of data might be hard for troubleshooting. The basic assumption would be that the logs aren't important enough to be available in full in some of the places, as in the different mon instance hosts. Eventual consistency would also add another level of trouble to troubleshoot in conjunction with a disaster. Those interconnection requirements may be void, or at least the service may be at limited availability, which might not help to get the data to the place where it is needed.

Kind regards,
-matt

--
Matthias Muench
Principal Specialist Solution Architect
EMEA Storage Specialist
matthias.muench(a)redhat.com
Phone: +49-160-92654111

Red Hat GmbH
Technopark II
Werner-von-Siemens-Ring 12
85630 Grasbrunn
Germany
_______________________________________________________________________
Red Hat GmbH, Registered seat: Werner von Siemens Ring 12, D-85630 Grasbrunn, Germany
Commercial register: Amtsgericht Muenchen/Munich, HRB 153243,
Managing Directors: Ryan Barnhart, Charles Cachera, Michael O'Neill, Amy Ross

Prashant Dhange

27 Mar 2023

11:05 p.m.

Thanks Matthias.

On Wed, Mar 22, 2023 at 1:51 PM Matthias Muench <mmuench(a)redhat.com> wrote:

> Hi Prashant, et. al.,
>
> separating the logs from the DB might be a good thing.
>
> I would second what Frank suggested: local storage. Local to the mon instances hosts, perhaps just saying that flash is required which shouldn't be an issue nowadays. This would also give the best latency to avoid starvation on IOPS in case of the disaster.

Yes, we can achieve this, but maybe instead of the mon handling these logs we can delegate this task to the mgr daemon.

> With redundancy in the instances, data is available, at least from one of the mon instance hosts. Relying on pools would assume that communication is intact even between the actors of the pool. An exclusive pool for just this only would still depend on the network connection and introducing additional latency, too.

Rightly said.

> The other alternatives sound promising as well, however, I would like to raise some concerns.
>
> Pushing the logs only to a central location would impose a dependency on this location in case of a disaster. A disaster could be also in conjunction with a network issue affecting the connection to outside world. So, might be an add-on but for troubleshooting rather some kind of additional challenge.
>
> Eventually consistent distribution of data might be hard for troubleshooting. The basic assumption would be that the logs aren't that important to be available in full in some of the places, as in the different mon instance hosts. Eventual consistency also would add another level of trouble to troubleshoot in conjunction with a disaster. Those interconnection requirements may be void or at least the service may be at limited availability that might not help to get the data into the place just in need.

Yes, logging to a central location would be a *SPOF* for log availability. We will consider these inputs. Thanks for your inputs.


Ernesto Puerta

28 Mar 2023

3:25 a.m.

On Tue, Mar 28, 2023 at 8:05 AM Prashant Dhange <pdhange(a)redhat.com> wrote:

> Thanks Matthias.
>
> On Wed, Mar 22, 2023 at 1:51 PM Matthias Muench <mmuench(a)redhat.com> wrote:
>
> Yes, we can achieve this but maybe instead of mon handling these logs we can delegate this task to mgr daemon.
>
> Rightly said.
>
> Yes, it will be *SPOF* for log availability if we log to a central location. We will consider these inputs. Thanks for your inputs.

Prashant Dhange

29 Mar 2023

1:36 p.m.

Thank you Ernesto for the pointers. [1] and [2] explain the Loki integration to a great extent and are easy to understand. Let me spend some more time on it and see how we can benefit from Loki + Promtail compared to the other ideas.

On Tue, Mar 28, 2023 at 3:26 AM Ernesto Puerta <epuertat(a)redhat.com> wrote:

...

Hi Prashant,

Thank you for the feedback! Just a remark on that last point (which I missed from the original email from Matthias): most centralized logging solutions (including Loki or Elasticsearch) already cope with SPOF scenarios, with measures ranging from connection pooling & retrying on the client side [1] to sharding/replication on the server side [2] [3] and scheduled snapshots/back-ups.

From experience (and the Ceph cluster log is the perfect example), centralized logging allows for better log lifecycle management, log-based monitoring/alerting, and dramatically improved troubleshooting. The main downside would be that log streaming generates some network traffic that might interfere with the storage workload, but that can always be solved by routing through a separate network/low-prio vlan.

[1] https://grafana.com/docs/loki/latest/clients/promtail/troubleshooting/#loki…
[2] https://grafana.com/docs/loki/latest/configuration/#memberlist_config
[3] https://www.elastic.co/guide/en/elasticsearch/reference/current/high-availa…

Kind Regards,
Ernesto

On Tue, Mar 28, 2023 at 8:05 AM Prashant Dhange <pdhange(a)redhat.com> wrote:

> Thanks Matthias.
>
> On Wed, Mar 22, 2023 at 1:51 PM Matthias Muench <mmuench(a)redhat.com> wrote:
>
>> Hi Prashant, et. al.,
>>
>> separating the logs from the DB might be a good thing.
>>
>> I would second what Frank suggested: local storage. Local to the mon instances hosts, perhaps just saying that flash is required which shouldn't be an issue nowadays. This would also give the best latency to avoid starvation on IOPS in case of the disaster.
>
> Yes, we can achieve this but maybe instead of mon handling these logs we can delegate this task to mgr daemon.
>
>> With redundancy in the instances, data is available, at least from one of the mon instance hosts. Relying on pools would assume that communication is intact even between the actors of the pool. An exclusive pool for just this only would still depend on the network connection and introducing additional latency, too.
>
> Rightly said.
>
>> The other alternatives sound promising as well, however, I would like to raise some concerns.
>>
>> Pushing the logs only to a central location would impose a dependency on this location in case of a disaster. A disaster could be also in conjunction with a network issue affecting the connection to outside world. So, might be an add-on but for troubleshooting rather some kind of additional challenge.
>>
>> Eventually consistent distribution of data might be hard for troubleshooting. The basic assumption would be that the logs aren't that important to be available in full in some of the places, as in the different mon instance hosts. Eventual consistency also would add another level of trouble to troubleshoot in conjunction with a disaster. Those interconnection requirements may be void or at least the service may be at limited availability that might not help to get the data into the place just in need.
>
> Yes, it will be *SPOF* for log availability if we log to a central location. We will consider these inputs. Thanks for your inputs.
>
>> Kind regards,
>> -matt
>>
>> On 22.03.23 14:10, Ernesto Puerta wrote:
>>
>> Hi Prashant,
>>
>> Is this move just limited to the impact of the cluster log in the mon store db or is it part of a larger mon db clean-up effort?
>>
>> I'm asking this because, besides the cluster log, the mon store db is currently used (and perhaps abused) also by some mgr modules via:
>>
>> - set_module_option() <https://docs.ceph.com/en/quincy/mgr/modules/#mgr_module.MgrModule.set_module_option> to set MODULE_OPTIONS values via CLI commands.
>> - set_store() <https://docs.ceph.com/en/quincy/mgr/modules/#mgr_module.MgrModule.set_store>: there are 2 main storage use cases here:
>>   - *Immutable/sensitive data*: instead of exposing those as MODULE_OPTIONS (password hashes, private certificates, API keys, etc.),
>>   - *Changing data*: mgr-module internal state. While this shouldn't cause the db to grow in the long term, it might cause short-term/compaction issues (I'm not familiar with rocksdb internals, just extrapolating from experience with sstable/leveldb)
>>
>> For the latter case there, Dashboard developers have been looking for an efficient alternative to persistently store rapidly-changing data. We discarded the idea of using a pool since the Dashboard should be able to operate prior to any OSD provisioning and in case of storage downtimes
>>
>> Coming back to your original questions, I understand that there are two different issues at stake:
>>
>> - *Cluster log processing*: currently mon via Paxos (Do we really need Paxos ack for logs? Can we live with some type of eventually-consistent/best-effort storage here?)
>> - *Cluster log storage*: currently mon store db. AFAIK this is the main issue, right?
>>
>> From there, I see 2 possible paths:
>>
>> - *Keep cluster-wide logs as a Ceph concern:*
>>   - IMHO putting some throttling in place should be a must, since client-triggered cluster logs could easily become a DoS vector.
>>   - I wouldn't put them into a rados pool, not so much for the data availability in case of OSD service downtime (logs will still be recoverable from logfiles), but as for the potential interference with user workloads/deployment patterns (as Frank mentioned before).
>>   - Could we run the ".mgr" pool on a new type of "internal/service-only" colocated OSDs (memstore)?
>>   - Save logs to a fixed-size/TTL-bound priority or multi-level queue structure?
>>   - Add some (eventually-consistent) store db to the ceph-mgr?
>>   - To solve ceph-mgr scalability issues, we recently added a new kind of Ceph utility daemon (ceph-exporter) whose sole purpose is to fetch metrics from co-located Ceph daemon's perf-counters and make those available for Prometheus scraping. We could think about a similar thing but for logs... (although it'd be very similar to the Loki approach below).
>> - *Move them outside Ceph:*
>>   - Cephadm + Dashboard now support Centralized Logging via Loki + Promtail <https://ceph.io/en/news/blog/2022/centralized_logging/>, which basically polls all daemon logfiles and sends new log traces to a central service (Loki) where they can be monitored/filtered in real-time.
>>   - If we find the previous solution too bulky for regular cluster monitoring, we could explore systemd-journal-remote <https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html> /rsyslog/...
>>   - The main downside of this approach is that it might break the "ceph log" command (rados_monitor_log and log events could still be watched I guess).
>>
>> Kind Regards,
>> Ernesto
>>
>> On Wed, Mar 22, 2023 at 11:12 AM Janne Johansson <icepic.dz(a)gmail.com> wrote:
>>
>>> > 2) .mgr pool
>>> >
>>> > 2.1) I have become really tired of these administrative pools that are created on the fly without any regards to device classes, available capacity, PG allocation and the like. The first one that showed up without warning was device_health_metrics, which turned the cluster health_err right away because the on-the-fly pool creation is, well, not exactly smart.
>>> >
>>> > We don't even have drives below the default root. We have a lot of different pools on different (custom!) device classes with different replication schemes to accommodate a large variety of use cases. Administrative pools showing up randomly somewhere in the tree are a real pain. There are ceph-user cases where people deleted and recreated it only to make the device health module useless, because it seems to store the pool ID and there is no way to tell it to use the new pool.
>>>
>>> Ah, that's why it looked unused after I also had to remake it. Since it gets created when you don't have the OSDs yet, the possibilities for it ending up wrong seem very large.
>>>
>>> --
>>> May the most significant bit of your life be positive.
>>> _______________________________________________
>>> Dev mailing list -- dev(a)ceph.io
>>> To unsubscribe send an email to dev-leave(a)ceph.io
>>
>> _______________________________________________
>> Dev mailing list -- dev(a)ceph.io
>> To unsubscribe send an email to dev-leave(a)ceph.io
>>
>> --
>> Matthias Muench
>> Principal Specialist Solution Architect
>> EMEA Storage Specialist
>> matthias.muench(a)redhat.com
>> Phone: +49-160-92654111
>>
>> Red Hat GmbH
>> Technopark II
>> Werner-von-Siemens-Ring 12
>> 85630 Grasbrunn
>> Germany
>> _______________________________________________________________________
>> Red Hat GmbH, Registered seat: Werner von Siemens Ring 12, D-85630 Grasbrunn, Germany
>> Commercial register: Amtsgericht Muenchen/Munich, HRB 153243,
>> Managing Directors: Ryan Barnhart, Charles Cachera, Michael O'Neill, Amy Ross

Prashant Dhange

27 Mar 2023

3:54 p.m.

Hi Ernesto, Thanks for your valuable inputs. Kindly find my answers inline below. On Wed, Mar 22, 2023 at 6:11 AM Ernesto Puerta <epuertat(a)redhat.com> wrote:

...

Hi Prashant, Is this move just limited to the impact of the cluster log in the mon store db or is it part of a larger mon db clean-up effort?

Yes, it's limited to moving cluster logs from the monstore db.

...

I'm asking this because, besides the cluster log, the mon store db is currently used (and perhaps abused) also by some mgr modules via:

- set_module_option() <https://docs.ceph.com/en/quincy/mgr/modules/#mgr_module.MgrModule.set_module_option> to set MODULE_OPTIONS values via CLI commands.
- set_store() <https://docs.ceph.com/en/quincy/mgr/modules/#mgr_module.MgrModule.set_store>: there are 2 main storage use cases here:
  - *Immutable/sensitive data*: instead of exposing those as MODULE_OPTIONS (password hashes, private certificates, API keys, etc.),
  - *Changing data*: mgr-module internal state. While this shouldn't cause the db to grow in the long term, it might cause short-term/compaction issues (I'm not familiar with rocksdb internals, just extrapolating from experience with sstable/leveldb)

The config-related information stored in the db should not be a problem here. We are only concerned about logm entries in the event of a health error, and specifically when those logm entries are not getting trimmed.

...

For the latter case there, Dashboard developers have been looking for an efficient alternative to persistently store rapidly-changing data. We discarded the idea of using a pool since the Dashboard should be able to operate prior to any OSD provisioning and in case of storage downtimes.

Coming back to your original questions, I understand that there are two different issues at stake:

- *Cluster log processing*: currently mon via Paxos (Do we really need Paxos ack for logs? Can we live with some type of eventually-consistent/best-effort storage here?)

Yes, we need Paxos ack for logm. The logm entries get written to the monstore on a Paxos proposal and get written to the cluster log on an update from Paxos. Yes, we are working on different approaches, and one of them is to write to a dedicated pool.
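As a rough illustration of the SQLite-backed idea raised at the start of this thread (clog data stored via libcephsqlite and served to "ceph log last"), the sketch below uses Python's standard sqlite3 module with an in-memory database as a stand-in. The table name, the columns, and the helper names open_store, append and log_last are all hypothetical; this is not the schema or API of any existing mgr module, and it does not use libcephsqlite itself.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical cluster-log table.

    ":memory:" stands in for a database that a real mgr module
    would open through libcephsqlite; that part is not shown here.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clog ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " stamp REAL NOT NULL,"
        " channel TEXT NOT NULL,"
        " prio TEXT NOT NULL,"
        " message TEXT NOT NULL)"
    )
    return conn


def append(conn, stamp, channel, prio, message):
    """Insert one log entry; the 'with' block commits the transaction."""
    with conn:
        conn.execute(
            "INSERT INTO clog (stamp, channel, prio, message)"
            " VALUES (?, ?, ?, ?)",
            (stamp, channel, prio, message),
        )


def log_last(conn, n, channel="cluster"):
    """Return the newest n entries of a channel, oldest first."""
    rows = conn.execute(
        "SELECT stamp, prio, message FROM clog"
        " WHERE channel = ? ORDER BY seq DESC LIMIT ?",
        (channel, n),
    ).fetchall()
    return rows[::-1]
```

A query such as log_last(conn, 10) would then play the role of the historical lookup; the availability concerns about pool-backed storage raised elsewhere in this thread apply regardless of the schema.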

...

- *Cluster log storage*: currently mon store db. AFAIK this is the main issue, right?

Yes, that's right.

...

From there, I see 2 possible paths:

- *Keep cluster-wide logs as a Ceph concern:*
  - IMHO putting some throttling in place should be a must, since client-triggered cluster logs could easily become a DoS vector.
  - I wouldn't put them into a rados pool, not so much for the data availability in case of OSD service downtime (logs will still be recoverable from logfiles), but as for the potential interference with user workloads/deployment patterns (as Frank mentioned before).
  - Could we run the ".mgr" pool on a new type of "internal/service-only" colocated OSDs (memstore)?
  - Save logs to a fixed-size/TTL-bound priority or multi-level queue structure?
  - Add some (eventually-consistent) store db to the ceph-mgr?
  - To solve ceph-mgr scalability issues, we recently added a new kind of Ceph utility daemon (ceph-exporter) whose sole purpose is to fetch metrics from co-located Ceph daemon's perf-counters and make those available for Prometheus scraping. We could think about a similar thing but for logs... (although it'd be very similar to the Loki approach below).
- *Move them outside Ceph:*
  - Cephadm + Dashboard now support Centralized Logging via Loki + Promtail <https://ceph.io/en/news/blog/2022/centralized_logging/>, which basically polls all daemon logfiles and sends new log traces to a central service (Loki) where they can be monitored/filtered in real-time.
  - If we find the previous solution too bulky for regular cluster monitoring, we could explore systemd-journal-remote <https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html> /rsyslog/...
  - The main downside of this approach is that it might break the "ceph log" command (rados_monitor_log and log events could still be watched I guess).

This is really helpful. Let me explore these paths. If required, we will propose a meeting with a wider audience to discuss this further.
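To make the throttling suggestion in the quoted list concrete, here is a minimal token-bucket sketch in Python. It is only an illustration of the general technique, not how Ceph throttles anything today; the class name TokenBucket and the parameters rate and burst are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to reason about.

```python
class TokenBucket:
    """Allow at most `burst` entries at once and `rate` entries/second on average.

    Hypothetical helper for illustration only; not part of Ceph.
    """

    def __init__(self, rate, burst):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum number of stored tokens
        self.tokens = float(burst)   # start with a full bucket
        self.last = None             # timestamp of the previous call

    def allow(self, now):
        """Return True if an entry arriving at time `now` may be stored."""
        if self.last is not None:
            elapsed = max(0.0, now - self.last)
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A submitter that is denied could drop the entry, or count dropped entries and emit one summary line later; which of those is appropriate for cluster logs is a design question this sketch does not answer.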

...

Kind Regards, Ernesto

Regards, Prashant


Prashant Dhange

3:34 p.m.

On Wed, Mar 22, 2023 at 3:10 AM Janne Johansson <icepic.dz(a)gmail.com> wrote:

...

2) .mgr pool 2.1) I have become really tired of these administrative pools that are

We don't even have drives below the default root. We have a lot of

Ah, that's why it looked unused after I also had to remake it. Since it gets created when you don't have the OSDs yet, the possibilities for it ending up wrong seem very large.

Ok. I changed mgr/devicehealth/pool_name to a new pool and was able to query the OSD daemons for health metrics, but was not able to get the device health metrics. IMHO we should track this inconsistency in a tracker issue if one is not already filed.

...

-- May the most significant bit of your life be positive.

Prashant Dhange

3:20 p.m.

Hi Frank, Thanks for the inputs. Kindly find my answers inline below... On Wed, Mar 22, 2023 at 2:05 AM Frank Schilder <frans(a)dtu.dk> wrote:

...

Yes, you are right. Even in the case of HEALTH_OK, logm trimming hit one corner case caused by potential corruption of the committed versions (https://tracker.ceph.com/issues/53485). If we trim logm (cluster log) entries aggressively whenever excessive logm entries are being stored, then there is little point in storing them at all, because they would be trimmed before they can be fetched with "ceph log last" or by the mgr dashboard.
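The trade-off described above can be put into numbers. The short Python sketch below computes how long an entry survives in a store capped at a fixed number of entries; the function name retention_seconds and the cap of 10,000 entries are illustrative assumptions, while the rate of about 1000 messages per second is the order of magnitude mentioned at the start of this thread.

```python
def retention_seconds(max_entries, entries_per_second):
    """Seconds an entry stays available before being trimmed.

    Assumes a steady arrival rate and a store that keeps only the
    newest `max_entries` entries (both are simplifying assumptions).
    """
    if entries_per_second <= 0:
        raise ValueError("entries_per_second must be positive")
    return max_entries / entries_per_second


# With a hypothetical cap of 10,000 entries:
quiet = retention_seconds(10_000, 5)       # 2000.0 seconds of history
storm = retention_seconds(10_000, 1000)    # 10.0 seconds of history
```

Under these assumptions a log storm leaves only a few seconds of history, which is the situation where aggressive trimming makes the stored entries close to useless.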

...

I am not sure, but doesn't changing pool_name to a new pool with "ceph config set mgr mgr/devicehealth/pool_name <new-pool>" work for device health metrics? Maybe we can address this device health module issue in a tracker?

...

If you really think about adding a pool for that, please please make the pool creation part of the upgrade instructions with some hints on sizing, PGs and realistic (!!!) IOP/s requirements. I personally use the host-syslog and have drives with reasonable performance and capacity in the hosts to be able to pull debug logs with high logging values. All host logs are also aggregated to an rsyslogd instance. I don't see *any* need to aggregate these logs to a ceph pool.

2.2) Using a ceph pool for logging is not reliable during critical situations. The whole point of the logging is to provide information in case of disaster. In case of disaster, we can safely assume that an .mgr pool will not be available. The logging has to be on an alternative infrastructure that is not affected by ceph storage outages/health problems. Having it in the MON stores on local storage is such an alternative infrastructure. Why not just separate the logging storage from the actual MON DB store and make it max_size configurable?

Agree on 2.1 and 2.2. I really appreciate your effort in documenting these concerns in detail. The other caveat with this solution is that if the mgr pool storing the ceph cluster logs is not writable (because of full OSDs, a network issue, etc.), then we need to find an alternative way to get hold of the cluster logs for troubleshooting purposes.

...

I would propose to keep it on the local dedicated MON storage (however outside of the MON DB) also to keep setting up a ceph cluster simple. If we needed now an additional MGR store, things would be more complicated. Just tell people that 60G is not enough for a MON store and at the same time make the last-log size a config option (it should really be a ring-buffer with a configurable fixed max-number of entries).

3) MGR performance

While it would possibly make sense to let the MGRs do more work, there is the problem of this work not being distributed (only 1 MGR does something) and that MGR modules seem not really performance optimized (too much python). If one wanted to outsource additional functionality to the MGRs, a good start would be to make all MGRs active and distribute the work (like a small distributed-memory compute cluster). A bit more module-crash resilience and performance improvements are also welcome.

Yes, the mgr is not distributed, and a single mgr is responsible for the whole mgr workload. The major job of the mgr is to offload the MONs as much as possible. Another concern here is that if the active mgr handles cluster logging through the new pool, then we will miss cluster logs during the timeframe when all mgrs are down.
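The ring-buffer proposal quoted above (a fixed maximum number of entries, combined with a maximum age as suggested in the short version of the same mail) can be sketched in a few lines of Python. This is only an in-memory illustration under assumed names: ClusterLogRing, max_entries and max_age are hypothetical and do not correspond to existing Ceph options, and a real implementation would also need to persist the buffer on the MON's local storage.

```python
from collections import deque


class ClusterLogRing:
    """Keep only the newest entries, bounded by count and by age.

    Hypothetical helper for illustration only; not part of Ceph.
    """

    def __init__(self, max_entries, max_age):
        self.max_age = max_age                    # seconds
        self.entries = deque(maxlen=max_entries)  # oldest entry drops out first

    def _expire(self, now):
        # Discard entries older than max_age, starting from the oldest.
        while self.entries and now - self.entries[0][0] > self.max_age:
            self.entries.popleft()

    def append(self, stamp, message):
        self.entries.append((stamp, message))
        self._expire(stamp)

    def last(self, n, now):
        """Return up to n newest messages that are still within max_age."""
        if n <= 0:
            return []
        self._expire(now)
        return [message for _, message in list(self.entries)[-n:]]
```

Because the deque has a fixed maxlen, storage stays bounded no matter how fast entries arrive, which addresses the capacity problem; it does not by itself keep history around during a log storm.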

...

Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14

Regards, Prashant



participants (5)

Ernesto Puerta
Frank Schilder
Janne Johansson
Matthias Muench
Prashant Dhange