Hi Mailing-Listers,
I am reaching out for assistance with a deployment issue I am facing
with Ceph on a 4-node RKE2 cluster. We are attempting to deploy Ceph via
the Rook Helm chart, but we are hitting an issue that seems related to a
known bug (https://tracker.ceph.com/issues/61597).
During the OSD preparation phase, the deployment consistently fails with an
IndexError: list index out of range. The logs indicate the problem occurs
when configuring new disks, specifically when using /dev/dm-3 as a metadata
device. It's important to note that /dev/dm-3 is an LVM logical volume on
top of an mdadm RAID, which may or may not be contributing to the issue.
(I swear, this setup worked before.)
Here is a snippet of the error from the deployment logs:
> 2023-11-23 23:11:30.196913 D | exec: IndexError: list index out of range
> 2023-11-23 23:11:30.236962 C | rookcmd: failed to configure devices:
failed to initialize osd: failed ceph-volume report: exit status 1
https://paste.openstack.org/show/bileqRFKbolrBlTqszmC/
We have attempted different configurations, including specifying devices
explicitly and using the useAllDevices: true option with a specified
metadata device (/dev/dm-3 or the /dev/pv_md0/lv_md0 path). However, the
issue persists across multiple configurations.
The tested configurations are as follows:
Explicit device specification:
```yaml
nodes:
  - name: "ceph01.maas"
    devices:
      - name: /dev/dm-1
      - name: /dev/dm-2
      - name: "sdb"
        config:
          metadataDevice: "/dev/dm-3"
      - name: "sdc"
        config:
          metadataDevice: "/dev/dm-3"
```
General device specification with metadata device:
```yaml
storage:
  useAllNodes: true
  useAllDevices: true
  config:
    metadataDevice: /dev/dm-3
```
I would greatly appreciate any insights or recommendations on how to
proceed or work around this issue.
Is there a halfway decent way to apply the fix, or maybe a workaround
that would let us deploy Ceph successfully in our environment?
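In case it helps, this is roughly how I believe the failing ceph-volume report
could be reproduced by hand from the OSD prepare pod on ceph01.maas (the exact
flags Rook passes are my assumption, not taken from its source):
```
# Ask ceph-volume for the same batch report that Rook requests during OSD prepare;
# device paths are from our node, flags are my best guess.
ceph-volume lvm batch --report --format json /dev/sdb /dev/sdc --db-devices /dev/dm-3
```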
Kind regards,
Hi community,
My Ceph cluster is serving S3 with three pools at approximately 4.5k obj/s,
but the RGW lifecycle delete rate per pool is only 60-70 objects/s.
How can I speed up the RGW LC process? 60-70 objects/s is too slow.
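For reference, the knobs I assume are relevant and was planning to try are the
LC worker settings (I am not sure these are the right ones, and I expect the
RGWs need a restart afterwards):

ceph config set global rgw_lc_max_worker 5
ceph config set global rgw_lc_max_wp_worker 5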
Thanks a lot
Question about the osdmaptool deviation calculations:
For instance,
-----
osdmaptool omap --upmap output.txt --upmap-pool cephfs_data-rep3 --upmap-max 1000 --upmap-deviation 5
osdmaptool: osdmap file 'omap'
writing upmap command output to: output.txt
checking for upmap cleanups
upmap, max-count 1000, max deviation 5
limiting to pools cephfs_data-rep3 ([30])
pools cephfs_data-rep3
prepared 0/1000 changes
Unable to find further optimization, or distribution is already perfect
-----
The evaluated pool is all-on-hdd, and the pool was created with PGs > number of hdd OSDs in the cluster, so each hdd OSD is used at least once by this pool.
Is it correct to assume that the osdmaptool is relying on the equations set at
ceph-17.2.5/src/osd/OSDMap.cc:5143
5143 // This function calculates the 2 maps osd_deviation and deviation_osd which
5144 // hold the deviation between the current number of PGs which map to an OSD
5145 // and the optimal number. ...
# pgs_per_weight
# ceph-17.2.5/src/osd/OSDMap.cc:4806
4806 float pgs_per_weight = total_pgs / osd_weight_total;
# target
# ceph-17.2.5/src/osd/OSDMap.cc:5156
5156 float target = osd_weight.at(oid) * pgs_per_weight;
# deviation
# ceph-17.2.5/src/osd/OSDMap.cc:5157
5157 float deviation = (float)opgs.size() - target;
And so for pgs_per_weight I calculate
ceph -f json osd df | jq '[ .nodes[] | select (.device_class == "hdd") .pgs ] | add'
divided by
ceph -f json osd df | jq '[ .nodes[] | select (.device_class == "hdd") .crush_weight ] | add'
(each hdd OSD in this cluster has identical weight)
target = osd_weight.at(oid) * pgs_per_weight
I calculate deviation for each osd
deviation = opgs.size - target
where, opgs.size = the number of PGs at an OSD. i.e. The value of $19 for each $1, in `ceph osd df hdd | awk '{ print $1 " " $19 }'`
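To rule out an arithmetic mistake on my side, here is the whole calculation
consolidated into one jq invocation (this only mirrors the equations above as
I read them, not the actual osdmaptool code path):
-----
ceph -f json osd df | jq -r '
  [ .nodes[] | select(.device_class == "hdd") ] as $hdd
  | ([ $hdd[].pgs ] | add) as $total_pgs
  | ([ $hdd[].crush_weight ] | add) as $weight_total
  | ($total_pgs / $weight_total) as $pgs_per_weight
  | $hdd[]
  | "osd.\(.id)  pgs=\(.pgs)  target=\(.crush_weight * $pgs_per_weight)  deviation=\(.pgs - (.crush_weight * $pgs_per_weight))"
'
-----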
The result is many OSDs with a deviation well above upmap_max_deviation, which is at the default of 5.
So I am wondering whether I am miscalculating something, or whether the osdmaptool takes further things into account when formulating upmap suggestions that I am not aware of.
-Robert
Hello,
We are running a Pacific 16.2.10 cluster and have enabled the balancer module; here is the configuration:
[root@ceph-1 ~]# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.052548",
    "last_optimize_started": "Fri Nov 17 17:09:57 2023",
    "mode": "upmap",
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}
[root@ceph-1 ~]# ceph balancer eval
current cluster score 0.017742 (lower is better)
Here is the balancer configuration of upmap_max_deviation:
# ceph config get mgr mgr/balancer/upmap_max_deviation
5
We have two different sizes of OSDs, 7681G and 3840G. When I checked the PG distribution on each type of OSD, I found it is not even: for the 7681G OSDs the PG count varies from 136 to 158, while for the 3840G OSDs it varies from 60 to 83, so the deviation appears to be almost +/- 10. I am wondering whether this is expected, or whether I need to change upmap_max_deviation to a smaller value.
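If it does turn out that we should tighten it, I assume this would be the way to change it and re-check (the value 1 below is just an example, not a recommendation):

ceph config set mgr mgr/balancer/upmap_max_deviation 1
ceph balancer eval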
Thanks for answering my question.
Hi,
The context is RBD on BlueStore. I checked "extent" on the wiki.
I see "extent" mentioned when talking about snapshots and export/import.
For example, when we create a snapshot, we mark extents; when there is
a write to a marked extent, we make a copy.
I also know that user data on block device maps to objects.
How are "extent" and "object" related?
Can I say an extent is a set of contiguous objects (with default stripe settings)?
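To make my question concrete, this is the mapping I have in mind, assuming default striping (4 MiB objects, stripe_count=1); the offsets below are made up:

offset=$((10*1024*1024))   # extent start: 10 MiB into the image
length=$((6*1024*1024))    # extent length: 6 MiB
objsize=$((4*1024*1024))   # default rbd object size (order 22)
echo "first object: $(( offset / objsize ))"                 # -> 2
echo "last object:  $(( (offset + length - 1) / objsize ))"  # -> 3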
Thanks!
Tony
Hi Groups,
Recently I set up a Ceph cluster with 10 nodes and 144 OSDs, and I use S3 on it with an erasure-coded pool (EC 3+2).
I have a question: how many OSD nodes can fail with erasure code 3+2 while the cluster keeps working normally (read, write)? And would a different erasure code such as EC 7+3 or 8+2 be a better choice?
My understanding is that the erasure code only ensures no data loss, but does not guarantee that the cluster operates normally and does not block IO when OSD nodes are down. Is that right?
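For what it's worth, my assumption is that IO blocking is governed by the pool's min_size, which for an EC k+m pool defaults to k+1 (so 4 for 3+2). This is how I would check it (the pool name is just a placeholder for our data pool):

ceph osd pool get <ec-data-pool> min_size
ceph osd pool get <ec-data-pool> size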
Thanks to the community.
Hi,
We have upgraded one Ceph cluster from 17.2.7 to 18.2.0. Since then we have been having CephFS issues.
For example this morning:
“””
[root@naret-monitor01 ~]# ceph -s
cluster:
id: 63334166-d991-11eb-99de-40a6b72108d0
health: HEALTH_WARN
1 filesystem is degraded
3 clients failing to advance oldest client/flush tid
3 MDSs report slow requests
6 pgs not scrubbed in time
29 daemons have recently crashed
…
“””
The ceph orch, ceph crash and ceph fs status commands were hanging.
After a "ceph mgr fail" those commands started to respond.
Then I noticed that one MDS had most of the slow operations:
“””
[WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 secs
mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 30 secs
mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 secs
“””
Then I tried to restart it with
“””
[root@naret-monitor01 ~]# ceph orch daemon restart mds.cephfs.naret-monitor01.uvevbf
Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host 'naret-monitor01'
“””
After that, CephFS entered this situation:
“””
[root@naret-monitor01 ~]# ceph fs status
cephfs - 198 clients
======
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active cephfs.naret-monitor01.nuakzo Reqs: 0 /s 17.2k 16.2k 1892 14.3k
1 active cephfs.naret-monitor02.ztdghf Reqs: 0 /s 28.1k 10.3k 752 6881
2 clientreplay cephfs.naret-monitor02.exceuo 63.0k 6491 541 66
3 active cephfs.naret-monitor03.lqppte Reqs: 0 /s 16.7k 13.4k 8233 990
POOL TYPE USED AVAIL
cephfs.cephfs.meta metadata 5888M 18.5T
cephfs.cephfs.data data 119G 215T
cephfs.cephfs.data.e_4_2 data 2289G 3241T
cephfs.cephfs.data.e_8_3 data 9997G 470T
STANDBY MDS
cephfs.naret-monitor03.eflouf
cephfs.naret-monitor01.uvevbf
MDS version: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
“””
The file system is totally unresponsive (we can mount it on client nodes, but any operation, even a simple ls, hangs).
During the night we had a lot of MDS crashes; I can share the details.
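In case it is useful, these are the commands I was planning to use next to inspect the blocked requests on the clientreplay rank and the recent crashes (daemon name is from our cluster):
“””
ceph tell mds.cephfs.naret-monitor02.exceuo ops
ceph tell mds.cephfs.naret-monitor02.exceuo session ls
ceph crash ls-new
“””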
Does anybody have an idea on how to tackle this problem?
Best,
Giuseppe
Hi,
src-image is 1GB (provisioned size). I did the following 3 tests.
1. rbd export src-image - | rbd import - dst-image
2. rbd export --export-format 2 src-image - | rbd import --export-format 2 - dst-image
3. rbd export --export-format 2 src-image - | rbd import - dst-image
With #1 and #2, the dst-image size (rbd info) is the same as src-image, which is expected.
With #3, the dst-image size (rbd info) is close to the used size (rbd du), not the provisioned
size of src-image. I'm not sure whether this image is actually usable when writing to it.
The question is: is #3 not supposed to be used at all?
I checked the docs and didn't see anything like "--export-format 2 has to be used when
importing an image that was exported with --export-format 2".
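In case it matters, this is the quick smoke test I had in mind to see whether the #3 dst-image is actually usable (just rbd bench writes and reads; sizes are arbitrary):

rbd bench --io-type write --io-size 4K --io-total 16M dst-image
rbd bench --io-type read --io-size 4K --io-total 16M dst-image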
Any comments?
Thanks!
Tony