Dear all
We have an HDD Ceph cluster that could do with some more IOPS. One
solution we are considering is installing NVMe SSDs in the storage
nodes and using them as WAL and/or DB devices for the BlueStore OSDs.
However, we have some questions about this and are looking for some
guidance and advice.
The first question is about the expected benefits. Before we undertake
the effort involved in the transition, we are wondering whether it is
even worth it. How much of a performance boost can one expect when
adding NVMe SSDs as WAL devices to an HDD cluster? And how much faster
than that does it get with the DB also being on SSD? Are there
rule-of-thumb numbers for that? Or maybe someone has done benchmarks
in the past?
The second question is of a more practical nature. Are there any
best practices on how to implement this? I was thinking we won't do one
SSD per HDD - surely an NVMe SSD is fast enough to handle the traffic
from multiple OSDs. But what is a good ratio? Do I put one NVMe SSD per
4 HDDs? Per 6, or even 8? Also, how should I carve up the SSD: using
partitions or using LVM? Last but not least, if I have one SSD handle
the WAL and DB for multiple OSDs, losing that SSD means losing multiple
OSDs. How do people deal with this risk? Is it generally deemed
acceptable, or is this something people tend to mitigate, and if so, how?
Do I run multiple SSDs in RAID?
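For what it's worth, the kind of layout I had in mind is one NVMe SSD
backing the DB/WAL of several HDD OSDs, e.g. roughly the following
(device names are placeholders and the 4:1 ratio is just my assumption):

  ceph-volume lvm batch --bluestore /dev/sda /dev/sdb /dev/sdc /dev/sdd --db-devices /dev/nvme0n1

My understanding is that ceph-volume would then carve the NVMe device
into one LVM logical volume per OSD for the DB (with the WAL co-located
on it), so no manual partitioning would be needed - but please correct
me if that understanding is wrong.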
I do realize that for some of these, there might not be one perfect
answer that fits all use cases. I am looking for best practices and, in
general, just trying to avoid any obvious mistakes.
Any advice is much appreciated.
Sincerely
Niklaus Hofer
--
stepping stone AG
Wasserwerkgasse 7
CH-3011 Bern
Telefon: +41 31 332 53 63
www.stepping-stone.ch
niklaus.hofer(a)stepping-stone.ch
Hi,
as the documentation sends mixed signals in
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#ipv…
"Note
Binding to IPv4 is enabled by default, so if you just add the option to
bind to IPv6 you’ll actually put yourself into dual stack mode."
and
https://docs.ceph.com/en/latest/rados/configuration/msgr2/#address-formats
"Note
The ability to bind to multiple ports has paved the way for dual-stack
IPv4 and IPv6 support. That said, dual-stack operation is not yet
supported as of Quincy v17.2.0."
here are just a few quick questions:
Is dual-stack networking with IPv4 and IPv6 now supported or not?
From which version on is it considered stable?
Are OSDs now able to register themselves with two IP addresses in the
cluster map? MONs too?
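Just for context, what we would be aiming for is roughly the following
in the global section (a sketch of my understanding; the subnets are
placeholders):

  [global]
  ms_bind_ipv4 = true
  ms_bind_ipv6 = true
  public_network = 192.0.2.0/24, 2001:db8::/64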
Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
Hi,
I am having trouble answering this question:
Why is Ceph better than other storage solutions?
I know the usual high-level points about
- scalability,
- flexibility,
- distributed architecture,
- cost-effectiveness.
What convinces me (though it could also be turned into an argument against) is that Ceph as a product has everything I need, namely:
block storage (RBD),
file storage (CephFS),
object storage (S3, Swift),
and "plugins" to run NFS, NVMe over Fabrics, and NFS on top of object storage.
It also offers many other features that are usually sold as paid options (mirroring, geo-replication, etc.) in commercial solutions.
I am having trouble writing this down piece by piece.
I want to convince my managers that we are going in the right direction.
Why not something from Robin.io, Pure Storage, NetApp, or Dell/EMC? Or, from open source, Longhorn or OpenEBS?
If you have ideas, please share them.
Thanks,
S.
Hi All,
We have a somewhat serious situation with a CephFS filesystem (18.2.1)
that has 2 active MDSs and one standby. I tried to restart one of
the active daemons to unstick a bunch of blocked requests. The
standby then went into 'replay' for a very long time, until RAM on that
MDS server filled up; it stayed there for a while, then eventually
appeared to give up and switched over to the standby, but the cycle
started again. So I restarted that MDS, and now I'm in a situation
where I see this:
# ceph fs status
slugfs - 29 clients
======
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 replay slugfs.pr-md-01.xdtppo 3958k 57.1k 12.2k 0
1 resolve slugfs.pr-md-02.sbblqq 0 3 1 0
POOL TYPE USED AVAIL
cephfs_metadata metadata 997G 2948G
cephfs_md_and_data data 0 87.6T
cephfs_data data 773T 175T
STANDBY MDS
slugfs.pr-md-03.mclckv
MDS version: ceph version 18.2.1
(7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
It just stays there indefinitely. All my clients are hung. I tried
restarting all MDS daemons and they just went back to this state after
coming back up.
Is there any way I can somehow escape this state of indefinite
replay/resolve?
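Also, is watching the journal counters a sensible way to tell whether
replay is making any progress at all? I.e. something like the following
on the MDS host, hoping to see rdpos/expos advance (I'm not sure these
are the right counters to look at):

  ceph daemon mds.slugfs.pr-md-01.xdtppo perf dump mds_log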
Thanks so much! I'm kinda nervous since none of my clients have
filesystem access at the moment...
cheers,
erich
Hello,
We are tracking PR #56805:
https://github.com/ceph/ceph/pull/56805
And the resolution of this item would potentially fix a pervasive and
ongoing issue that needs daily attention in our CephFS cluster. I was
wondering whether it will be included in 18.2.3, which I *think* should
be released soon? Is there any way of knowing if that is true?
Thanks again,
erich
Hi,
We recently upgraded one of our clusters from Quincy 17.2.6 to Reef 18.2.1, and since then we have had 3 instances of our RGWs stopping processing requests. We have 3 hosts that each run a single RGW instance, and all 3 seem to stop processing requests at the same time, causing our storage to become unavailable. A restart or redeploy of the RGW service brings them back OK. The cluster was deployed using ceph-ansible, but we have since adopted it into cephadm, which is how the upgrade was performed.
We have enabled debug logging, as there was nothing out of the ordinary in the normal logs, and we are currently sifting through the logs from the last crash.
We are just wondering whether it is possible to run Quincy RGWs instead of Reef ones, as we didn't have this issue prior to the upgrade?
We have 3 clusters in a multisite setup; we are holding off on upgrading the other 2 clusters due to this issue.
Thanks
Iain
Iain Stott
OpenStack Engineer
Iain.Stott(a)thg.com
www.thg.com
Hello! I've installed my 5-node Ceph cluster and then created an NFS service with the command:
ceph nfs cluster create nfshacluster 5 --ingress --virtual_ip 192.168.171.48/26 --ingress-mode haproxy-protocol
I don't fully understand how this is supposed to work, but when I stop the NFS daemon on even one of these nodes, I see that writing to the NFS shares stops (testing via vdbench).
As I understand it, that is wrong: I/O from the stopped daemon should switch over to another NFS daemon without any impact.
Can someone help me troubleshoot this issue, or explain how to build a full-fledged active-active HA NFS cluster for production use?
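Also, should the clients be mounting through the virtual IP for
failover to work? I.e. something like the following, where the export
path is just a placeholder (as I understand it, "ceph nfs cluster info"
shows the virtual IP and backend hosts):

  ceph nfs cluster info nfshacluster
  mount -t nfs -o vers=4.1 192.168.171.48:/myexport /mnt/test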
Thanks!
Ruslan Nurabayev
Senior Engineer
IT Platforms Section
Backbone Network Development Division
Network Development Department
+77012119272
Ruslan.Nurabayev(a)kcell.kz
Hi All,
We have a Slurm cluster with 25 clients, each with 256 cores, each
mounting a CephFS filesystem as their main storage target. The workload
can be heavy at times.
We have two active MDS daemons and one standby. A lot of the time
everything is healthy, but we sometimes get warnings about MDS daemons
being slow on requests, behind on trimming, etc. I realize there may be
a bug in play, but I was also wondering whether we simply don't have
enough MDS daemons to handle the load. Is there a way to know if adding
an MDS daemon would help? We could add a third active MDS if needed,
but I don't want to start adding a bunch of MDSs if that won't help.
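For what it's worth, my understanding is that actually adding one would
just be something like the following (assuming we also add a daemon so
a standby remains); the real question is how to tell beforehand whether
it would help:

  ceph orch apply mds slugfs --placement=4
  ceph fs set slugfs max_mds 3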
The OSD servers seem fine. It's mainly the MDS instances that are
complaining.
We are running reef 18.2.1.
For reference, when things look healthy:
# ceph fs status slugfs
slugfs - 34 clients
======
RANK  STATE   MDS                      ACTIVITY       DNS    INOS   DIRS   CAPS
 0    active  slugfs.pr-md-03.mclckv   Reqs: 273 /s   2759k  2636k  362k   1079k
 1    active  slugfs.pr-md-01.xdtppo   Reqs: 194 /s   868k   674k   67.3k  351k
POOL TYPE USED AVAIL
cephfs_metadata metadata 127G 3281G
cephfs_md_and_data data 0 98.3T
cephfs_data data 740T 196T
STANDBY MDS
slugfs.pr-md-02.sbblqq
MDS version: ceph version 18.2.1
(7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
# ceph -s
cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_OK
services:
mon: 5 daemons, quorum
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
mgr: pr-md-01.jemmdf(active, since 5w), standbys: pr-md-02.emffhz
mds: 2/2 daemons up, 1 standby
osd: 46 osds: 46 up (since 8d), 46 in (since 4w)
data:
volumes: 1/1 healthy
pools: 4 pools, 1313 pgs
objects: 271.17M objects, 493 TiB
usage: 744 TiB used, 384 TiB / 1.1 PiB avail
pgs: 1307 active+clean
4 active+clean+scrubbing
2 active+clean+scrubbing+deep
io:
client: 39 MiB/s rd, 108 MiB/s wr, 1.96k op/s rd, 54 op/s wr
But when things are in "warning" mode, it looks like this:
# ceph -s
cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_WARN
1 filesystem is degraded
1 clients failing to advance oldest client/flush tid
1 MDSs report slow requests
1 MDSs behind on trimming
services:
mon: 5 daemons, quorum
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
mgr: pr-md-01.jemmdf(active, since 5w), standbys: pr-md-02.emffhz
mds: 2/2 daemons up, 1 standby
osd: 46 osds: 46 up (since 8d), 46 in (since 4w)
data:
volumes: 1/1 healthy
pools: 4 pools, 1313 pgs
objects: 271.28M objects, 494 TiB
usage: 746 TiB used, 382 TiB / 1.1 PiB avail
pgs: 1307 active+clean
5 active+clean+scrubbing
1 active+clean+scrubbing+deep
io:
client: 55 MiB/s rd, 2.6 MiB/s wr, 15 op/s rd, 46 op/s wr
And this:
# ceph health detail
HEALTH_WARN 2 clients failing to advance oldest client/flush tid; 2 MDSs
report slow requests; 1 MDSs behind on trimming
[WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest
client/flush tid
mds.slugfs.pr-md-01.xdtppo(mds.0): Client phoenix-06.prism failing
to advance its oldest client/flush tid. client_id: 125780
mds.slugfs.pr-md-02.sbblqq(mds.1): Client phoenix-00.prism failing
to advance its oldest client/flush tid. client_id: 99385
[WRN] MDS_SLOW_REQUEST: 2 MDSs report slow requests
mds.slugfs.pr-md-01.xdtppo(mds.0): 4 slow requests are blocked > 30
secs
mds.slugfs.pr-md-02.sbblqq(mds.1): 67 slow requests are blocked >
30 secs
[WRN] MDS_TRIM: 1 MDSs behind on trimming
mds.slugfs.pr-md-02.sbblqq(mds.1): Behind on trimming (109410/250)
max_segments: 250, num_segments: 109410
The "cure" is the restart the active MDS daemons, one at a time. Then
everything becomes healthy again, for a time.
We also have the following MDS config items in play:
mds_cache_memory_limit = 8589934592
mds_cache_trim_decay_rate = .6
mds_log_max_segments = 250
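In other words, these are the equivalent of:

  ceph config set mds mds_cache_memory_limit 8589934592
  ceph config set mds mds_cache_trim_decay_rate 0.6
  ceph config set mds mds_log_max_segments 250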
Thanks for any pointers!
cheers,
erich