Thank you for the feedback!
Just a remark on that last point (which I missed in the original email
from Matthias): most centralized logging solutions (including Loki and
Elasticsearch) already cope with SPOF scenarios, from connection
pooling & retrying on the client side to sharding/replication on the
server side, plus scheduled snapshots/backups. From experience (and the
Ceph cluster log is the perfect example), centralized logging allows for
better log lifecycle management, log-based monitoring/alerting, and
dramatically improved troubleshooting. The main downside would be that log
streaming generates some network traffic that might interfere with the
storage workload, but that can always be mitigated by routing it through a
separate network/low-priority VLAN.
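To make the client-side half of that resilience concrete, here is a minimal sketch (not any particular shipper's implementation; class and parameter names are invented) of a log sender with a bounded local buffer and retry with exponential backoff:

```python
import time
from collections import deque

class ResilientLogShipper:
    """Minimal sketch of a client-side log shipper: bounded buffer plus
    retry with exponential backoff. `send` is any callable that raises
    on failure (e.g. an HTTP POST to Loki or Elasticsearch)."""

    def __init__(self, send, max_buffer=10_000, max_retries=5, base_delay=0.5):
        self.send = send
        self.buffer = deque(maxlen=max_buffer)  # oldest entries dropped on overflow
        self.max_retries = max_retries
        self.base_delay = base_delay

    def ship(self, line):
        self.buffer.append(line)
        self.flush()

    def flush(self):
        while self.buffer:
            line = self.buffer[0]
            for attempt in range(self.max_retries):
                try:
                    self.send(line)
                    break
                except ConnectionError:
                    time.sleep(self.base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
            else:
                return  # give up for now; entry stays buffered for the next flush
            self.buffer.popleft()
```

The bounded deque is the important part for the storage-workload concern: if the central endpoint is down, memory use stays capped and only the oldest log lines are lost.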
On Tue, Mar 28, 2023 at 8:05 AM Prashant Dhange <pdhange(a)redhat.com> wrote:
On Wed, Mar 22, 2023 at 1:51 PM Matthias Muench <mmuench(a)redhat.com> wrote:
Hi Prashant et al.,
separating the logs from the DB might be a good thing.
I would second what Frank suggested: local storage. Local to the mon
instance hosts, perhaps just saying that flash is required, which shouldn't
be an issue nowadays. This would also give the best latency to avoid
starvation on IOPS in case of a disaster.
Yes, we can achieve this, but maybe instead of the mon handling these logs
we can delegate this task to the mgr daemon.
With redundancy in the instances, data is available, at least from one of
the mon instance hosts. Relying on pools would assume that communication is
intact even between the actors of the pool. An exclusive pool for just this
purpose would still depend on the network connection and introduce
additional latency, too.
The other alternatives sound promising as well, however, I would like to
raise some concerns.
Pushing the logs only to a central location would impose a dependency on
this location in case of a disaster. A disaster could also come in
conjunction with a network issue affecting the connection to the outside
world.
So, it might be an add-on, but for troubleshooting it is of limited use on
its own. Eventually consistent distribution of data might be hard for
troubleshooting: the basic assumption would be that the logs aren't so
important that they must be available in full in some of the places, as in
the different mon instance hosts. Eventual consistency would also add
another level of trouble to troubleshoot in conjunction with a disaster.
Those interconnection requirements may be void, or at least the service may
be at limited availability, which might not help to get the data into the
place where it is needed.
Yes, it will be a *SPOF* for log availability if we log to a central
location. We will consider these inputs. Thanks for the input.
> Kind regards,
> On 22.03.23 14:10, Ernesto Puerta wrote:
> Hi Prashant,
> Is this move just limited to the impact of the cluster log in the mon
> store db or is it part of a larger mon db clean-up effort?
> I'm asking this because, besides the cluster log, the mon store db is
> currently used (and perhaps abused) also by some mgr modules via:
> - set_module_option(): set MODULE_OPTIONS values via CLI commands.
> - set_store(): there are 2 main storage use cases here:
>    - *Immutable/sensitive data*: instead of exposing those as
> MODULE_OPTIONS (password hashes, private certificates, API keys, etc.).
>    - *Changing data*: mgr-module internal state. While this shouldn't
> cause the db to grow in the long term, it might cause short-term/compaction
> issues (I'm not familiar with RocksDB internals, just extrapolating from
> experience with SSTable/LevelDB).
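As a self-contained illustration of those two storage patterns: real modules call self.set_store()/self.get_store() on the MgrModule base class, which persist into the mon store db; here a plain dict stands in so the sketch runs on its own, and the key names are made up:

```python
import json

class FakeMonStore:
    """Stand-in for the mon KV store as exposed to mgr modules via
    MgrModule.set_store()/get_store(). Illustration only, not Ceph code."""
    def __init__(self):
        self._kv = {}

    def set_store(self, key, value):
        self._kv[key] = value

    def get_store(self, key, default=None):
        return self._kv.get(key, default)

store = FakeMonStore()

# Use case 1: immutable/sensitive data -- written once, read rarely,
# kept out of MODULE_OPTIONS so it never shows up in config dumps.
store.set_store("dashboard/api_key", "s3cr3t")  # hypothetical key name

# Use case 2: rapidly-changing internal state -- frequent rewrites of
# values like this are what can stress rocksdb compaction even though
# the total data size stays small.
state = {"open_sessions": 42}
store.set_store("dashboard/state", json.dumps(state))
```

The second pattern is the problematic one: the value is small, but each update is a fresh write into the db, which is why an eventually-consistent side store keeps coming up as an alternative.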
> For the latter case, Dashboard developers have been looking for an
> efficient alternative to persistently store rapidly-changing data. We
> discarded the idea of using a pool, since the Dashboard should be able to
> operate prior to any OSD provisioning and in case of storage downtimes.
> Coming back to your original questions, I understand that there are two
> different issues at stake:
> - *Cluster log processing*: currently mon via Paxos (Do we really
> need Paxos ack for logs? Can we live with some type of
> eventually-consistent/best-effort storage here?)
> - *Cluster log storage*: currently mon store db. AFAIK this is the
> main issue, right?
> From there, I see 2 possible paths:
> - *Keep cluster-wide logs as a Ceph concern:*
> - IMHO putting some throttling in place should be a must, since
> client-triggered cluster logs could easily become a DoS vector.
> - I wouldn't put them into a rados pool, not so much for the data
> availability in case of OSD service downtime (logs will still
> be recoverable from logfiles), but for the potential interference with
> user workloads/deployment patterns (as Frank mentioned before).
> - Could we run the ".mgr" pool on a new type of
> "internal/service-only" colocated OSDs (memstore)?
> - Save logs to a fixed-size/TTL-bound priority or multi-level
> queue structure?
> - Add some (eventually-consistent) store db to the ceph-mgr?
> - To solve ceph-mgr scalability issues, we recently added a new
> kind of Ceph utility daemon (ceph-exporter) whose sole purpose is to fetch
> metrics from co-located Ceph daemon's perf-counters and make those
> available for Prometheus scraping. We could think about a similar thing but
> for logs... (although it'd be very similar to the Loki approach below).
> - *Move them outside Ceph:*
> - Cephadm + Dashboard now support Centralized Logging via Loki +
> Promtail <https://ceph.io/en/news/blog/2022/centralized_logging/>,
> which basically polls all daemon logfiles and sends new log traces to a
> central service (Loki) where they can be monitored/filtered in real-time.
> - If we find the previous solution too bulky for regular
> cluster monitoring, we could explore systemd-journal-remote
> - The main downside of this approach is that it might break the
> "ceph log" command (rados_monitor_log and log events could still be
> available, I guess).
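For what the "fixed-size/TTL-bound queue" idea above could look like, here is a minimal sketch in Python (pure illustration, not Ceph code; the class name and parameters are invented): old entries fall off when the buffer is full, and stale ones are filtered out on read.

```python
import time
from collections import deque

class TTLLogBuffer:
    """Sketch of a fixed-size, TTL-bound log queue: a bounded deque of
    (timestamp, line) pairs. Appends past `maxlen` evict the oldest
    entry; reads drop entries older than `ttl` seconds."""

    def __init__(self, maxlen=1000, ttl=3600.0, clock=time.monotonic):
        self._entries = deque(maxlen=maxlen)  # (timestamp, line) pairs
        self._ttl = ttl
        self._clock = clock  # injectable for testing

    def append(self, line):
        self._entries.append((self._clock(), line))

    def tail(self):
        cutoff = self._clock() - self._ttl
        return [line for ts, line in self._entries if ts >= cutoff]
```

The appeal for cluster logs is that both memory and staleness are bounded by construction, so no Paxos round or compaction pressure is needed just to keep recent history around.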
> Kind Regards,
> On Wed, Mar 22, 2023 at 11:12 AM Janne Johansson <icepic.dz(a)gmail.com>
>> > 2) .mgr pool
>> > 2.1) I have become really tired of these administrative pools that are
>> created on the fly without any regards to device classes, available
>> capacity, PG allocation and the like. The first one that showed up without
>> warning was device_health_metrics, which put the cluster into HEALTH_ERR
>> right away because the on-the-fly pool creation is, well, not exactly smart.
>> > We don't even have drives below the default root. We have a lot of
>> different pools on different (custom!) device classes with different
>> replication schemes to accommodate a large variety of use cases.
>> Administrative pools showing up randomly somewhere in the tree are a real
>> pain. There are ceph-user cases where people deleted and recreated it only
>> to make the device health module useless, because it seems to store the
>> pool ID and there is no way to tell it to use the new pool.
>> Ah, that's why it looked unused after I also had to remake it. Since
>> it gets created when you don't have the OSDs yet, the possibilities
>> for it ending up wrong seem very large.
>> May the most significant bit of your life be positive.
>> Dev mailing list -- dev(a)ceph.io
>> To unsubscribe send an email to dev-leave(a)ceph.io
> Matthias Muench
> Principal Specialist Solution Architect
> EMEA Storage Specialist
> matthias.muench(a)redhat.com
> Phone: +49-160-92654111
> Red Hat GmbH
> Technopark II
> Werner-von-Siemens-Ring 12
> 85630 Grasbrunn
> Red Hat GmbH, Registered seat: Werner von Siemens Ring 12, D-85630 Grasbrunn,
> Commercial register: Amtsgericht Muenchen/Munich, HRB 153243,
> Managing Directors: Ryan Barnhart, Charles Cachera, Michael O'Neill, Amy Ross