Hi,
We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
with 2 chunks per datacenter, to maximise resilience in case one
datacenter goes down. I have not found a way to create an EC profile
with this 2-level allocation strategy. I created an EC profile with
failure domain = datacenter, but it doesn't work: I guess Ceph wants to
ensure that 5 OSDs are always up (so that the pool remains R/W),
whereas with failure domain = datacenter the guarantee is only 4.
My idea was to create a 2-step allocation with failure domain = host to
achieve our desired configuration, with something like the following in
the CRUSH rule:
step choose indep 3 datacenter
step chooseleaf indep x host
step emit
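Spelled out as a complete rule in a decompiled CRUSH map, what I have in mind
would look roughly like this (a sketch from memory; the chooseleaf count is
exactly the value I am unsure about, 2 being just my guess for "2 chunks per
datacenter", and the rule name/id are placeholders):
```
rule ec42_multidc {
    id 42                                  # any free rule id
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type datacenter    # select the 3 datacenters
    step chooseleaf indep 2 type host      # then 2 OSDs on distinct hosts in each
    step emit
}
```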
Is it the right approach? If yes, what should be 'x'? Would 0 work?
From what I have seen, there is no way to create such a rule with the
'ceph osd crush' commands: I have to download the current CRUSH map, edit
it and upload the modified version. Am I right?
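(For reference, the workflow I have in mind is the usual
get/decompile/edit/compile/set cycle, roughly:)
```
# dump and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# edit crushmap.txt to add the rule, then recompile and inject it
crushtool -c crushmap.txt -o crushmap.new.bin
ceph osd setcrushmap -i crushmap.new.bin
```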
Thanks in advance for your help or suggestions. Best regards,
Michel
Hello guys!
We noticed an unexpected situation. In a recently deployed Ceph cluster we
are seeing raw usage that is a bit odd. We have a new cluster with 5 nodes,
each with the following setup:
- 128 GB of RAM
- 2 Intel(R) Xeon(R) Silver 4210R CPUs
- 1 NVMe of 2 TB for RocksDB caching
- 5 HDDs of 14 TB
- 1 dual-port 25 Gbit NIC in bond mode.
Right after deploying the Ceph cluster, we see a raw usage of about 9 TiB,
even though no load has been applied to the cluster yet. Have you seen such
a situation, or can you help us understand it?
We are using Ceph Octopus, and we have set the following configurations:
```
ceph_conf_overrides:
  global:
    osd pool default size: 3
    osd pool default min size: 1
    osd pool default pg autoscale mode: "warn"
    perf: true
    rocksdb perf: true
  mon:
    mon osd down out interval: 120
  osd:
    bluestore min alloc size hdd: 65536
```
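In case it helps, we can also share the usual capacity reports, e.g.:
```
ceph df detail     # cluster-wide RAW USED / AVAIL plus per-pool usage
ceph osd df tree   # per-OSD size, raw use and %USE, grouped by host
```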
Any tip or help on how to explain this situation is welcome!
Hello,
We are considering CephFS as an alternative to GlusterFS, and have some
questions about performance. Is anyone able to advise us please?
This would be for file systems between 100GB and 2TB in size, average file
size around 5MB, and a mixture of reads and writes. I may not be using the
correct terminology in the Ceph world, but in my parlance a node is a Linux
server running the Ceph storage software. Multiple nodes make up the whole
Ceph storage solution. Someone correct me if I should be using different
terms!
In our normal scenario the nodes in the replicated filesystem would be
around 0.3ms apart, but we're also interested in geographically remote
nodes which would be say 20ms away. We are using third party software which
relies on a traditional Linux filesystem, so we can't use an object storage
solution directly.
So my specific questions are:
1. When reading a file from CephFS, does it read from just one node, or
from all nodes?
2. If reads are from one node, does it choose the node with the fastest
response to optimise performance? Or, if reads are from all nodes, will
reads be no faster than the latency to the furthest node?
3. When writing to CephFS, are all nodes written to synchronously, or is the
write made to one node, which then replicates it to other nodes asynchronously?
4. Can anyone give a recommendation on maximum latency between nodes to
have decent performance?
5. How does CephFS handle a node which suddenly becomes unavailable on the
network? Is the block time configurable, and how good is the healing
process after the lost node rejoins the network?
6. I have read that CephFS is more complicated to administer than
GlusterFS. What does everyone think? Are things like healing after a net
split difficult for administrators new to Ceph to handle?
Thanks very much in advance.
--
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
Hi,
I am reading some documentation about mClock and have two questions.
First, about the IOPS: are those disk IOPS or some other kind of IOPS? And what are the assumptions behind them (block size, sequential or random reads/writes)?
And the second question: how does mClock calculate its profiles?
I have my lab cluster running Quincy, and I have these parameters for mClock:
"osd_mclock_max_capacity_iops_hdd": "450.000000",
"osd_mclock_profile": "balanced",
According to the documentation: https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/#bala… I am expecting to have:
"osd_mclock_scheduler_background_best_effort_lim": "999999",
"osd_mclock_scheduler_background_best_effort_res": "90",
"osd_mclock_scheduler_background_best_effort_wgt": "2",
"osd_mclock_scheduler_background_recovery_lim": "675",
"osd_mclock_scheduler_background_recovery_res": "180",
"osd_mclock_scheduler_background_recovery_wgt": "1",
"osd_mclock_scheduler_client_lim": "450",
"osd_mclock_scheduler_client_res": "180", "osd_mclock_scheduler_client_wgt": "1",
But what I get is:
"osd_mclock_scheduler_background_best_effort_lim": "999999",
"osd_mclock_scheduler_background_best_effort_res": "18",
"osd_mclock_scheduler_background_best_effort_wgt": "2",
"osd_mclock_scheduler_background_recovery_lim": "135",
"osd_mclock_scheduler_background_recovery_res": "36",
"osd_mclock_scheduler_background_recovery_wgt": "1",
"osd_mclock_scheduler_client_lim": "90",
"osd_mclock_scheduler_client_res": "36",
"osd_mclock_scheduler_client_wgt": "1",
These actual values seem very low compared to what my disk should be able to handle.
Is this calculation the expected one, or did I miss something about how those profiles are populated?
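Working backwards from the two lists above (just my own arithmetic, in case I
am misreading the documentation), the expected values correspond to fixed
fractions of the configured 450 IOPS, while the actual values correspond to
the same fractions applied to 90 IOPS, i.e. 450 / 5:
```
# expected: fractions of osd_mclock_max_capacity_iops_hdd = 450
client_res                  = 0.4 * 450 = 180
client_lim                  = 1.0 * 450 = 450
background_recovery_res     = 0.4 * 450 = 180
background_recovery_lim     = 1.5 * 450 = 675
background_best_effort_res  = 0.2 * 450 =  90

# actual: the same fractions, but applied to 90 (= 450 / 5)
client_res                  = 0.4 * 90  = 36
client_lim                  = 1.0 * 90  = 90
background_recovery_res     = 0.4 * 90  = 36
background_recovery_lim     = 1.5 * 90  = 135
background_best_effort_res  = 0.2 * 90  = 18
```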
Luis Domingues
Proton AG
Dear Ceph users,
my cluster is made up of 10 old machines, with an uneven number of disks and uneven disk sizes. Essentially I have just one big data pool (6+2 erasure code, with host failure domain) for which I am currently seeing very poor available space (88 TB, of which 40 TB occupied, as reported by df -h on hosts mounting the CephFS) compared to the raw capacity (196.5 TB). I have a total of 104 OSDs and 512 PGs for the pool; I cannot increase the PG count since the machines are old, have very little RAM, and some of them are already overloaded.
In this situation I'm seeing high occupancy on the small OSDs (500 GB) compared to the bigger ones (2 and 4 TB), even though the weights are set proportional to disk capacity (see the ceph osd tree below). For example, OSD 9 is at 62% occupancy even with weight 0.5 and reweight 0.75, while the highest occupancy for the 2 TB OSDs is 41% (OSD 18) and for the 4 TB OSDs 23% (OSD 79). I guess this high occupancy on the 500 GB OSDs, combined with the erasure code size and the host failure domain, might be the cause of the poor available space; could this be true? The upmap balancer is currently running, but I don't know if and how much it could improve the situation.
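For reference, my naive expectation, assuming only the 6+2 overhead
(usable ≈ raw * k / (k + m)) and perfect balancing, would be roughly:
```
196.5 TB * 6 / 8 ≈ 147 TB usable
```
which is quite far from the 88 TB reported by df, so the imbalance seems to
cost a lot of space.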
Any hint is greatly appreciated, thanks.
Nicola
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 196.47754 root default
-7 14.55518 host aka
4 hdd 1.81940 osd.4 up 1.00000 1.00000
11 hdd 1.81940 osd.11 up 1.00000 1.00000
18 hdd 1.81940 osd.18 up 1.00000 1.00000
26 hdd 1.81940 osd.26 up 1.00000 1.00000
32 hdd 1.81940 osd.32 up 1.00000 1.00000
41 hdd 1.81940 osd.41 up 1.00000 1.00000
48 hdd 1.81940 osd.48 up 1.00000 1.00000
55 hdd 1.81940 osd.55 up 1.00000 1.00000
-3 14.55518 host balin
0 hdd 1.81940 osd.0 up 1.00000 1.00000
8 hdd 1.81940 osd.8 up 1.00000 1.00000
15 hdd 1.81940 osd.15 up 1.00000 1.00000
22 hdd 1.81940 osd.22 up 1.00000 1.00000
29 hdd 1.81940 osd.29 up 1.00000 1.00000
34 hdd 1.81940 osd.34 up 1.00000 1.00000
43 hdd 1.81940 osd.43 up 1.00000 1.00000
49 hdd 1.81940 osd.49 up 1.00000 1.00000
-13 29.10950 host bifur
3 hdd 3.63869 osd.3 up 1.00000 1.00000
14 hdd 3.63869 osd.14 up 1.00000 1.00000
27 hdd 3.63869 osd.27 up 1.00000 1.00000
37 hdd 3.63869 osd.37 up 1.00000 1.00000
50 hdd 3.63869 osd.50 up 1.00000 1.00000
59 hdd 3.63869 osd.59 up 1.00000 1.00000
64 hdd 3.63869 osd.64 up 1.00000 1.00000
69 hdd 3.63869 osd.69 up 1.00000 1.00000
-17 29.10950 host bofur
2 hdd 3.63869 osd.2 up 1.00000 1.00000
21 hdd 3.63869 osd.21 up 1.00000 1.00000
39 hdd 3.63869 osd.39 up 1.00000 1.00000
57 hdd 3.63869 osd.57 up 1.00000 1.00000
66 hdd 3.63869 osd.66 up 1.00000 1.00000
72 hdd 3.63869 osd.72 up 1.00000 1.00000
76 hdd 3.63869 osd.76 up 1.00000 1.00000
79 hdd 3.63869 osd.79 up 1.00000 1.00000
-21 29.10376 host dwalin
88 hdd 1.81898 osd.88 up 1.00000 1.00000
89 hdd 1.81898 osd.89 up 1.00000 1.00000
90 hdd 1.81898 osd.90 up 1.00000 1.00000
91 hdd 1.81898 osd.91 up 1.00000 1.00000
92 hdd 1.81898 osd.92 up 1.00000 1.00000
93 hdd 1.81898 osd.93 up 1.00000 1.00000
94 hdd 1.81898 osd.94 up 1.00000 1.00000
95 hdd 1.81898 osd.95 up 1.00000 1.00000
96 hdd 1.81898 osd.96 up 1.00000 1.00000
97 hdd 1.81898 osd.97 up 1.00000 1.00000
98 hdd 1.81898 osd.98 up 1.00000 1.00000
99 hdd 1.81898 osd.99 up 1.00000 1.00000
100 hdd 1.81898 osd.100 up 1.00000 1.00000
101 hdd 1.81898 osd.101 up 1.00000 1.00000
102 hdd 1.81898 osd.102 up 1.00000 1.00000
103 hdd 1.81898 osd.103 up 1.00000 1.00000
-9 14.55518 host ogion
7 hdd 1.81940 osd.7 up 1.00000 1.00000
16 hdd 1.81940 osd.16 up 1.00000 1.00000
23 hdd 1.81940 osd.23 up 1.00000 1.00000
33 hdd 1.81940 osd.33 up 1.00000 1.00000
40 hdd 1.81940 osd.40 up 1.00000 1.00000
47 hdd 1.81940 osd.47 up 1.00000 1.00000
54 hdd 1.81940 osd.54 up 1.00000 1.00000
61 hdd 1.81940 osd.61 up 1.00000 1.00000
-19 14.55518 host prestno
81 hdd 1.81940 osd.81 up 1.00000 1.00000
82 hdd 1.81940 osd.82 up 1.00000 1.00000
83 hdd 1.81940 osd.83 up 1.00000 1.00000
84 hdd 1.81940 osd.84 up 1.00000 1.00000
85 hdd 1.81940 osd.85 up 1.00000 1.00000
86 hdd 1.81940 osd.86 up 1.00000 1.00000
87 hdd 1.81940 osd.87 up 1.00000 1.00000
104 hdd 1.81940 osd.104 up 1.00000 1.00000
-15 29.10376 host remolo
6 hdd 1.81897 osd.6 up 1.00000 1.00000
12 hdd 1.81897 osd.12 up 1.00000 1.00000
19 hdd 1.81897 osd.19 up 1.00000 1.00000
28 hdd 1.81897 osd.28 up 1.00000 1.00000
35 hdd 1.81897 osd.35 up 1.00000 1.00000
44 hdd 1.81897 osd.44 up 1.00000 1.00000
52 hdd 1.81897 osd.52 up 1.00000 1.00000
58 hdd 1.81897 osd.58 up 1.00000 1.00000
63 hdd 1.81897 osd.63 up 1.00000 1.00000
67 hdd 1.81897 osd.67 up 1.00000 1.00000
71 hdd 1.81897 osd.71 up 1.00000 1.00000
73 hdd 1.81897 osd.73 up 1.00000 1.00000
74 hdd 1.81897 osd.74 up 1.00000 1.00000
75 hdd 1.81897 osd.75 up 1.00000 1.00000
77 hdd 1.81897 osd.77 up 1.00000 1.00000
78 hdd 1.81897 osd.78 up 1.00000 1.00000
-5 14.55518 host rokanan
1 hdd 1.81940 osd.1 up 1.00000 1.00000
10 hdd 1.81940 osd.10 up 1.00000 1.00000
17 hdd 1.81940 osd.17 up 1.00000 1.00000
24 hdd 1.81940 osd.24 up 1.00000 1.00000
31 hdd 1.81940 osd.31 up 1.00000 1.00000
38 hdd 1.81940 osd.38 up 1.00000 1.00000
46 hdd 1.81940 osd.46 up 1.00000 1.00000
53 hdd 1.81940 osd.53 up 1.00000 1.00000
-11 7.27515 host romolo
5 hdd 0.45470 osd.5 up 1.00000 1.00000
9 hdd 0.45470 osd.9 up 0.75000 1.00000
13 hdd 0.45470 osd.13 up 1.00000 1.00000
20 hdd 0.45470 osd.20 up 0.95000 1.00000
25 hdd 0.45470 osd.25 up 0.75000 1.00000
30 hdd 0.45470 osd.30 up 1.00000 1.00000
36 hdd 0.45470 osd.36 up 1.00000 1.00000
42 hdd 0.45470 osd.42 up 1.00000 1.00000
45 hdd 0.45470 osd.45 up 0.85004 1.00000
51 hdd 0.45470 osd.51 up 0.89999 1.00000
56 hdd 0.45470 osd.56 up 1.00000 1.00000
60 hdd 0.45470 osd.60 up 1.00000 1.00000
62 hdd 0.45470 osd.62 up 1.00000 1.00000
65 hdd 0.45470 osd.65 up 0.85004 1.00000
68 hdd 0.45470 osd.68 up 1.00000 1.00000
70 hdd 0.45470 osd.70 up 1.00000 1.00000
Hi,
> On 2 Apr 2023, at 23:14, Matthias Ferdinand <mf+ml.ceph(a)mfedv.net> wrote:
>
> I understand that grafana graphs are generated from prometheus metrics.
> I just wanted to know which OSD daemon-perf values feed these prometheus
> metrics (or if they are generated in some other way).
Yep, these perf metrics are generated in some way 🙂
You can consult the ceph-mgr prometheus module source code [1]
[1] https://github.com/ceph/ceph/blob/main/src/pybind/mgr/prometheus/module.py#…
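A quick way to see what the module actually exports (assuming the default mgr
prometheus endpoint on port 9283; if I remember correctly the OSD values show
up as ceph_osd_commit_latency_ms / ceph_osd_apply_latency_ms):
```
# list the OSD latency metrics exposed by the active mgr
curl -s http://<active-mgr-host>:9283/metrics | grep -E 'ceph_osd_(apply|commit)_latency'
```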
k
Hi,
I would like to understand how the per-OSD data from "ceph osd perf"
(i.e. apply_latency, commit_latency) is generated. So far I couldn't
find documentation on this. "ceph osd perf" output is nice for a quick
glimpse, but not very well suited for graphing. The output values are
apparently averages over the most recent 5 seconds.
With "ceph daemon osd.X perf dump" OTOH, you get quite a lot of latency
metrics, while it is just not obvious to me how they aggregate into
apply_latency and commit_latency. Or some comparably easy read latency
metric (something that is missing completely in "ceph osd perf").
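For context, the only aggregation that seems obvious to me is computing
sum / avgcount for each averaged counter in the dump, something like this
rough sketch (it prints every averaged counter, not only the commit/apply
related ones):
```
# print the running average (sum / avgcount) of every averaged counter
# in the perf dump of osd.0
ceph daemon osd.0 perf dump | jq -r '
  to_entries[] | .key as $section
  | .value | to_entries[]
  | select(.value | type == "object" and has("avgcount") and .avgcount > 0)
  | "\($section).\(.key): \(.value.sum / .value.avgcount)"'
```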
Can somebody shed some light on this?
Regards
Matthias