Hello.
I have a 5-node Ceph cluster and I'm constantly getting "clients failing to
respond to cache pressure" warnings.
I have 84 CephFS kernel clients (servers) and my users are accessing their
personal subvolumes located on one pool.
My users are software developers and the data is home and user data (Git
repos, Python projects, sample data and newly generated data).
---------------------------------------------------------------------------------
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 146 TiB 101 TiB 45 TiB 45 TiB 30.71
TOTAL 146 TiB 101 TiB 45 TiB 45 TiB 30.71
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 356 MiB 90 1.0 GiB 0 30 TiB
cephfs.ud-data.meta 9 256 69 GiB 3.09M 137 GiB 0.15 45 TiB
cephfs.ud-data.data 10 2048 26 TiB 100.83M 44 TiB 32.97 45 TiB
---------------------------------------------------------------------------------
root@ud-01:~# ceph fs status
ud-data - 84 clients
=======
RANK  STATE   MDS                   ACTIVITY      DNS    INOS   DIRS  CAPS
 0    active  ud-data.ud-04.seggyv  Reqs: 142 /s  2844k  2798k  303k  720k
POOL TYPE USED AVAIL
cephfs.ud-data.meta metadata 137G 44.9T
cephfs.ud-data.data data 44.2T 44.9T
STANDBY MDS
ud-data.ud-02.xcoojt
ud-data.ud-05.rnhcfe
ud-data.ud-03.lhwkml
ud-data.ud-01.uatjle
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
quincy (stable)
-----------------------------------------------------------------------------------
My MDS settings are below:
mds_cache_memory_limit | 8589934592
mds_cache_trim_threshold | 524288
mds_recall_global_max_decay_threshold | 131072
mds_recall_max_caps | 30000
mds_recall_max_decay_rate | 1.500000
mds_recall_max_decay_threshold | 131072
mds_recall_warning_threshold | 262144
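(For reference, these are the values in the config database; they can be checked
and changed with something like the following. This is just a sketch of the
mechanism, using the active MDS name from the fs status output above:)
# check the effective value on the active MDS
ceph config get mds.ud-data.ud-04.seggyv mds_cache_memory_limit
# change a recall setting for all MDS daemons
ceph config set mds mds_recall_max_caps 30000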
I have 2 questions:
1- What should I do to prevent the cache pressure warnings?
2- What can I do to increase speed?
- Thanks
Hi! I'm new to Ceph and I'm struggling to map my existing storage
knowledge onto Ceph...
So I will state my understanding of the context and my questions;
please correct me on anything I got wrong :)
My understanding so far: files (or pieces of files) are placed into PGs,
which are assigned to sets of OSDs. The CRUSH map provides the physical
map of OSDs to be chosen for placement or access.
Pools are a logical name for a storage space, but how can I specify
which OSDs or hosts are part of a pool?
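From what I've read so far, I think the answer involves CRUSH rules rather
than listing OSDs per pool, e.g. a rule that only picks a device class or a
CRUSH subtree and is then assigned to the pool. Something like this (rule and
pool names are made up by me), is that the right direction?
# create a replicated rule that only picks SSD OSDs, host as failure domain
ceph osd crush rule create-replicated ssd-only default host ssd
# point a pool at that rule
ceph osd pool set mypool crush_rule ssd-only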
For replication, how can I specify: if a replica is missing (for a given
time), start rebuilding it on some available OSD?
Is there a notion of a "spare", so that if an OSD goes missing in action the
rebuild starts on another host, and when the old OSD is back (the HDD is
replaced, or the machine is repaired) it is automatically cleaned up and
reused?
I'm thinking about a 3-node cluster with replica=2 and
failure domain = host, such that if one node is down the data
from it is re-replicated onto the remaining nodes (with some drives kept
as spares...).
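The closest thing I've found to "rebuild after a replica has been missing for
a given time" is this monitor setting (default 600 seconds, if I read the docs
right), after which a down OSD is marked "out" and its data is rebuilt on the
remaining OSDs:
# how long an OSD may stay down before it is marked out and rebuilding starts
ceph config get mon mon_osd_down_out_interval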
I am almost certain that, from Ceph's point of view, what I'm thinking is
wrong, so I would love to receive some advice :)
Thanks a lot!
Adrian
Hi Eugen
Please find the details below
root@meghdootctr1:/var/log/ceph# ceph -s
cluster:
id: c59da971-57d1-43bd-b2b7-865d392412a5
health: HEALTH_WARN
nodeep-scrub flag(s) set
544 pgs not deep-scrubbed in time
services:
mon: 3 daemons, quorum meghdootctr1,meghdootctr2,meghdootctr3 (age 5d)
mgr: meghdootctr1(active, since 5d), standbys: meghdootctr2, meghdootctr3
mds: 3 up:standby
osd: 36 osds: 36 up (since 34h), 36 in (since 34h)
flags nodeep-scrub
data:
pools: 2 pools, 544 pgs
objects: 10.14M objects, 39 TiB
usage: 116 TiB used, 63 TiB / 179 TiB avail
pgs: 544 active+clean
io:
client: 24 MiB/s rd, 16 MiB/s wr, 2.02k op/s rd, 907 op/s wr
Ceph Versions:
root@meghdootctr1:/var/log/ceph# ceph --version
ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus
(stable)
Ceph df -h
https://pastebin.com/1ffucyJg
Ceph OSD performance dump
https://pastebin.com/1R6YQksE
Ceph tell osd.XX bench (out of 36 OSDs, only 8 OSDs give a high IOPS value of
250+; of those, 4 OSDs are from HP 3PAR and 4 OSDs from Dell EMC. We are using
only 4 OSDs from HP 3PAR and they have been working fine without any latency
or IOPS issues from the beginning, but the remaining 32 OSDs are from Dell
EMC, of which 4 OSDs are much better than the remaining 28.)
https://pastebin.com/CixaQmBi
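(The per-OSD figures in the paste above were gathered with a simple loop,
roughly like this:)
# run the built-in write benchmark on every OSD, one by one
for i in $(seq 0 35); do ceph tell osd.$i bench; done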
Please help me identify whether the issue is with the Dell EMC storage, Ceph
configuration parameter tuning, or overload in the cloud setup.
On November 1, 2023 at 9:48 PM Eugen Block <eblock(a)nde.ag> wrote:
> Hi,
>
> for starters please add more cluster details like 'ceph status', 'ceph
> versions', 'ceph osd df tree'. Increasing the network to 10G was the right
> thing to do, you don't get far with 1G with real cluster load. How are
> the OSDs configured (HDD only, SSD only or HDD with rocksdb on SSD)?
> How is the disk utilization?
>
> Regards,
> Eugen
>
> Zitat von prabhav(a)cdac.in:
>
> > In a production setup of 36 OSDs( SAS disks) totalling 180 TB
> > allocated to a single Ceph Cluster with 3 monitors and 3 managers.
> > There were 830 volumes and VMs created in Openstack with Ceph as a
> > backend. On Sep 21, users reported slowness in accessing the VMs.
> > Analysing the logs led us to problems with SAS, network congestion
> > and Ceph configuration (as all default values were used). We updated
> > the network from 1 Gbps to 10 Gbps for public and cluster networking.
> > There was no change.
> > The ceph benchmark performance showed that 28 OSDs out of 36 OSDs
> > reported very low IOPS of 30 to 50 while the remaining showed 300+
> > IOPS.
> > We gradually started reducing the load on the Ceph cluster and now
> > the volume count is 650. The slow operations have gradually reduced,
> > but I am aware that this is not the solution.
> > The Ceph configuration was updated, increasing the
> > osd_journal_size to 10 GB and setting:
> > osd_max_backfills = 1
> > osd_recovery_max_active = 1
> > osd_recovery_op_priority = 1
> > bluestore_cache_trim_max_skip_pinned=10000
> >
> > After one month, we now face another issue: the mgr daemon stopped
> > on all 3 quorum members and 16 OSDs went down. From the ceph-mon and
> > ceph-mgr logs I could not determine the reason. Please guide me, as
> > it is a production setup.
Thanks & Regards,
Ms V A Prabha
Joint Director
Centre for Development of Advanced Computing (C-DAC)
"Tidel Park", 8th Floor, "D" Block (North & South)
No. 4, Rajiv Gandhi Salai
Taramani
Chennai – 600113
Ph. No.: 044-22542226/27
Fax No.: 044-22542294
Hi,
Due to idiotic behaviour on my part I made a mistake while replacing some
disks in our data centre and our cluster ended up all powered off!
I have been using ceph for many years (since firefly) but only recently
upgraded to reef and moved to the cephadm / podman setup. I am trying to
figure out how to get it all started up again. I am not very familiar with
docker at all. I can see the bootstrap option but no "recover" option.
It is a small cluster with 3 nodes and two 3TB/4TB disks in each node.
I have had a look at
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#…
but wonder if cephadm does this itself automagically?
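From what I can tell, cephadm runs each daemon as a systemd unit named after
the cluster fsid, so my assumption (a sketch only, please correct me) is that
something like this on each node should bring the daemons back up:
# see which daemons cephadm has deployed on this host
cephadm ls
# start everything belonging to this cluster (units are ceph-<fsid>@<daemon>.service)
systemctl start ceph.target
systemctl list-units 'ceph-*'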
Help please, I don't want to lose my data!
Many thanks
Carl.
Hi,
The following article:
https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/
suggests disabling C-states on your CPUs (on the OSD nodes) as one method to improve performance. The article seems to indicate that the scenario being addressed was one with NVMe drives as OSDs.
Questions:
Will disabling C-states and keeping the processors at the max power state help performance for the following?
1. NVMe OSDs (yes)
2. SSD OSDs
3. Spinning disk OSDs
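(For context, what I understand by "disabling C-states" is roughly the
following; the exact knobs are my own assumption, not something taken from the
article:)
# limit C-states via kernel boot parameters, a commonly cited combination:
#   intel_idle.max_cstate=0 processor.max_cstate=1
# or switch to a low-latency tuned profile at runtime:
tuned-adm profile latency-performance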
-Chris
Hi
A few years ago we were really strapped for space, so we tweaked pg_num
for some pools to ensure all PGs were as close to the same size as
possible, while still observing the power-of-2 rule, in order to get the
most mileage space-wise. We set the auto-scaler to off for the tweaked
pools to get rid of the warnings.
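For reference, the tweak back then was roughly this per pool (pool name and
pg count below are placeholders):
# pin pg_num manually and keep the autoscaler from changing it back
ceph osd pool set <pool> pg_num 512
ceph osd pool set <pool> pg_autoscale_mode off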
We now have a lot more free space, so I flipped the auto-scaler to warn
for all pools and set the bulk flag for the pools expected to be data
pools, leading to this:
"
[WRN] POOL_TOO_FEW_PGS: 4 pools have too few placement groups
Pool rbd has 512 placement groups, should have 2048
Pool rbd_internal has 1024 placement groups, should have 2048
Pool cephfs.nvme.data has 32 placement groups, should have 4096
Pool cephfs.ssd.data has 32 placement groups, should have 1024
[WRN] POOL_TOO_MANY_PGS: 4 pools have too many placement groups
Pool libvirt has 256 placement groups, should have 32
Pool cephfs.cephfs.data has 512 placement groups, should have 32
Pool rbd_ec_data has 4096 placement groups, should have 1024
Pool cephfs.hdd.data has 2048 placement groups, should have 1024
"
That's a lot of warnings *ponder*
"
# ceph osd pool autoscale-status
POOL                  SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
libvirt               2567G                3.0   3031T         0.0025                                 1.0   256                 warn       False
.mgr                  807.5M               2.0   6520G         0.0002                                 1.0   1                   warn       False
rbd_ec                9168k                3.0   6520G         0.0000                                 1.0   32                  warn       False
nvme                  31708G               2.0   209.5T        0.2955                                 1.0   2048                warn       False
.nfs                  36864                3.0   6520G         0.0000                                 1.0   32                  warn       False
cephfs.cephfs.meta    24914M               3.0   6520G         0.0112                                 4.0   32                  warn       False
cephfs.cephfs.data    16384                3.0   6520G         0.0000                                 1.0   512                 warn       False
rbd.ssd.data          798.1G               2.25  6520G         0.2754                                 1.0   64                  warn       False
rbd_ec_data           609.2T               1.5   3031T         0.3014                                 1.0   4096                warn       True
rbd                   68170G               3.0   3031T         0.0659                                 1.0   512                 warn       True
rbd_internal          69553G               3.0   3031T         0.0672                                 1.0   1024                warn       True
cephfs.nvme.data      0                    2.0   209.5T        0.0000                                 1.0   32                  warn       True
cephfs.ssd.data       68609M               2.0   6520G         0.0206                                 1.0   32                  warn       True
cephfs.hdd.data       111.0T               2.25  3031T         0.0824                                 1.0   2048                warn       True
"
"
# ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 3.0 PiB 1.3 PiB 1.6 PiB 1.6 PiB 54.69
nvme 210 TiB 146 TiB 63 TiB 63 TiB 30.21
ssd 6.4 TiB 4.0 TiB 2.4 TiB 2.4 TiB 37.69
TOTAL 3.2 PiB 1.5 PiB 1.7 PiB 1.7 PiB 53.07
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
rbd 4 512 80 TiB 21.35M 200 TiB 19.31 278 TiB
libvirt 5 256 3.0 TiB 810.89k 7.5 TiB 0.89 278 TiB
rbd_internal 6 1024 86 TiB 28.22M 204 TiB 19.62 278 TiB
.mgr 8 1 4.3 GiB 1.06k 1.6 GiB 0.07 1.0 TiB
rbd_ec 10 32 55 MiB 25 27 MiB 0 708 GiB
rbd_ec_data 11 4096 683 TiB 180.52M 914 TiB 52.26 556 TiB
nvme 23 2048 46 TiB 25.18M 62 TiB 31.62 67 TiB
.nfs 25 32 4.6 KiB 10 108 KiB 0 708 GiB
cephfs.cephfs.meta 31 32 25 GiB 1.66M 73 GiB 3.32 708 GiB
cephfs.cephfs.data 32 679 489 B 40.41M 48 KiB 0 708 GiB
cephfs.nvme.data 34 32 0 B 0 0 B 0 67 TiB
cephfs.ssd.data 35 32 77 GiB 425.03k 134 GiB 5.94 1.0 TiB
cephfs.hdd.data 37 2048 121 TiB 68.42M 250 TiB 23.03 371 TiB
rbd.ssd.data 38 64 934 GiB 239.94k 1.8 TiB 45.82 944 GiB
"
The weirdest ones:
Pool rbd_ec_data stores 683 TB in 4096 PGs -> warning says it should be 1024
Pool rbd_internal stores 86 TB in 1024 PGs -> warning says it should be 2048
That makes no sense to me based on the amount of data stored. Is this a
bug, or what am I missing? Ceph version is 17.2.7.
Best regards,
Torkil
--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
Hello, I'm new to ceph and sorry in advance for the naive questions.
1.
As far as I know, CRUSH utilizes the cluster map, which consists of the PG
map among others.
I don't understand why the CRUSH computation is required on the client side,
given that the PG-to-OSDs mapping could be acquired from the PG map.
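For example, this mapping can be asked from the cluster, and my understanding
is that a client computes the same thing locally with CRUSH instead of looking
it up in a table (pool and object names below are made up):
# object -> PG -> OSD set, as computed from the osdmap/crushmap
ceph osd map rbd some-object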
2.
How does the client get a valid (old) OSD set while the PG is being
remapped to the new OSD set that CRUSH returns?
Thanks.
More and more I am annoyed with the 'dumb' design decisions of Red Hat. Just now I have an issue on an 'air-gapped' VM where I am unable to start a docker/podman container, because it tries to contact the repository to update the image and, instead of using the on-disk image, it just fails. (Not to mention the %$#$%#$ who design containers to download stuff from the internet on startup.)
I was wondering if this is also an issue with cephadm. Is there a problem starting containers when the container image repositories are not available, or when there is no internet connection?
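What I'm hoping is that pointing cephadm at a local registry mirror avoids any
internet dependency; something like this is what I have in mind (registry
name, tag and monitor IP are made up by me):
# bootstrap from a local mirror instead of the default upstream registry
cephadm --image registry.local:5000/ceph/ceph:v18.2.1 bootstrap --mon-ip 10.0.0.1
# and make sure later pulls also use the mirror
ceph config set global container_image registry.local:5000/ceph/ceph:v18.2.1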