Hi guys,
I need to set up Ceph over RDMA, but I have run into many issues!
The info regarding my cluster:
Ceph version is Reef
The network cards are Broadcom RDMA NICs.
The RDMA connections between the OSD nodes are OK.
I found the ms_type = async+rdma option in the documentation and applied it with
ceph config set global ms_type async+rdma
After that the cluster crashed. To bring the cluster back, I did the following:
Put ms_type = async+posix in ceph.conf
Restart all MON services
The cluster is back, but I don't have any active mgr. All OSDs are down too.
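For completeness, this is what I am planning to try next to fully revert (a rough sketch only; the systemd targets assume a package-based, non-cephadm install):

ceph config rm global ms_type        # drop the RDMA setting from the centralized config db so it cannot resurface
# make sure every node's ceph.conf carries the posix messenger again:
#   [global]
#   ms_type = async+posix
systemctl restart ceph-mgr.target    # mgr and OSD daemons also picked up the messenger change,
systemctl restart ceph-osd.target    # so they need a restart with the reverted setting as well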
Is there a recommended order of steps for setting up Ceph over RDMA?
Thanks
Dear colleagues, I hope somebody can help us.
The starting point: a Ceph cluster v15.2 (installed and managed by Proxmox) with 3 nodes based on physical servers rented from a cloud provider. CephFS is also in use.
Yesterday we discovered that some of our applications had stopped working. During the investigation we realized we have a problem with Ceph, more precisely with CephFS: the MDS daemons suddenly crashed. We tried to restart them and found that they crash again immediately after starting. The crash information:
2024-04-17T17:47:42.841+0000 7f959ced9700 1 mds.0.29134 recovery_done -- successful recovery!
2024-04-17T17:47:42.853+0000 7f959ced9700 1 mds.0.29134 active_start
2024-04-17T17:47:42.881+0000 7f959ced9700 1 mds.0.29134 cluster recovered.
2024-04-17T17:47:43.825+0000 7f959aed5700 -1 ./src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 7f959aed5700 time 2024-04-17T17:47:43.831243+0000
./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)
Over the next hours we read tons of articles, studied the documentation, and checked the overall state of the Ceph cluster with various diagnostic commands, but didn't find anything wrong. In the evening we decided to upgrade to v16, and finally to v17.2.7. Unfortunately, it didn't solve the problem; the MDS continues to crash with the same error. The only difference we found is "1 MDSs report damaged metadata" in the output of ceph -s (see below).
I assumed it might be a known bug, but couldn't find a matching one on https://tracker.ceph.com - there are several issues involving OpenFileTable.cc, but none related to ceph_assert(count > 0).
We also checked the source code of OpenFileTable.cc; here is a fragment of it, from the function OpenFileTable::_journal_finish:
int omap_idx = anchor.omap_idx;
unsigned& count = omap_num_items.at(omap_idx);
ceph_assert(count > 0);
So we guess that the object map is unexpectedly empty for some object in Ceph. But again, we found nothing wrong in our cluster…
Next we turned to the https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ article: we tried resetting the journal (even though it had been OK the whole time) and wiping the sessions with the cephfs-table-tool all reset session command. No result…
I have now decided to continue following that article and have started the cephfs-data-scan scan_extents command; it is running right now. But I doubt it will solve the issue, since there seems to be no problem with our objects in Ceph.
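For my own reference, the order I understand from the disaster-recovery-experts page (a sketch only - the rank and filesystem names assume our single-rank filesystem "cephfs", the filesystem/MDS should be taken down first, and data pool arguments may be required for the data-scan tools depending on the version):

cephfs-journal-tool --rank=cephfs:0 journal export backup.bin    # back up the journal before touching anything
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:0 journal reset                # truncate the journal
cephfs-table-tool all reset session                              # wipe client sessions
# only if the above is not enough: rebuild metadata from the data pool
cephfs-data-scan init
cephfs-data-scan scan_extents
cephfs-data-scan scan_inodes
cephfs-data-scan scan_links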
Is this a new bug, or something else? Any idea is welcome!
The important outputs:
----- ceph -s
cluster:
id: 4cd1c477-c8d0-4855-a1f1-cb71d89427ed
health: HEALTH_ERR
1 MDSs report damaged metadata
insufficient standby MDS daemons available
83 daemons have recently crashed
3 mgr modules have recently crashed
services:
mon: 3 daemons, quorum asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 (age 22h)
mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
mds: 1/1 daemons up
osd: 18 osds: 18 up (since 22h), 18 in (since 29h)
data:
volumes: 1/1 healthy
pools: 5 pools, 289 pgs
objects: 29.72M objects, 5.6 TiB
usage: 21 TiB used, 47 TiB / 68 TiB avail
pgs: 287 active+clean
2 active+clean+scrubbing+deep
io:
client: 2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr
-----ceph fs dump
e29480
enable_multiple, ever_enabled_multiple: 0,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1
Filesystem 'cephfs' (1)
fs_name cephfs
epoch 29480
flags 12 joinable allow_snaps allow_multimds_snaps
created 2022-11-25T15:56:08.507407+0000
modified 2024-04-18T16:52:29.970504+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 14728
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=156636152}
failed
damaged
stopped
data_pools [5]
metadata_pool 6
inline_data disabled
balancer
standby_count_wanted 1
[mds.asrv-dev-stor-1{0:156636152} state up:active seq 6 laggy since 2024-04-18T16:52:29.970479+0000 addr [v2:172.22.2.91:6800/2487054023,v1:172.22.2.91:6801/2487054023] compat {c=[1],r=[1],i=[7ff]}]
-----cephfs-journal-tool --rank=cephfs:0 journal inspect
Overall journal integrity: OK
-----ceph pg dump summary
version 41137
stamp 2024-04-18T21:17:59.133536+0000
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG
sum 29717605 0 0 0 0 6112544251872 13374192956 28493480 1806575 1806575
OSD_STAT USED AVAIL USED_RAW TOTAL
sum 21 TiB 47 TiB 21 TiB 68 TiB
-----ceph pg dump pools
POOLID OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG
8 31771 0 0 0 0 131337887503 2482 140 401246 401246
7 839707 0 0 0 0 3519034650971 736 61 399328 399328
6 1319576 0 0 0 0 421044421 13374189738 28493279 206749 206749
5 27526539 0 0 0 0 2461702171417 0 0 792165 792165
2 12 0 0 0 0 48497560 0 0 6991 6991
Hi.
We're currently getting the errors below, and I'm missing a clear overview of the cause and how to debug it.
3/26/24 9:38:09 PM [ERR] executing _write_files((['dkcphhpcadmin01', 'dkcphhpcmgt028', 'dkcphhpcmgt029', 'dkcphhpcmgt031', 'dkcphhpcosd033', 'dkcphhpcosd034', 'dkcphhpcosd035', 'dkcphhpcosd036', 'dkcphhpcosd037', 'dkcphhpcosd038', 'dkcphhpcosd039', 'dkcphhpcosd040', 'dkcphhpcosd041', 'dkcphhpcosd042', 'dkcphhpcosd043', 'dkcphhpcosd044'],)) failed.
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 240, in _write_remote_file
    await asyncssh.scp(f.name, (conn, tmp_path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
    await source.run(srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
    await self._send_files(path, b'')
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in _send_files
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in _send_files
    await self._send_file(srcpath, dstpath, attrs)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in _send_file
    await self._make_cd_request(b'C', attrs, size, srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in _make_cd_request
    self._fs.basename(path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request
    raise exc
asyncssh.sftp.SFTPFailure: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 79, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_files
    self._write_client_files(client_files, host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1107, in _write_client_files
    self.mgr.ssh.write_remote_file(host, path, content, mode, uid, gid)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 261, in write_remote_file
    host, path, content, mode, uid, gid, addr))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 615, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 56, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 249, in _write_remote_file
    raise OrchestratorError(msg)
orchestrator._interface.OrchestratorError: Unable to write dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied
3/26/24 9:38:09 PM [ERR] Unable to write dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 240, in _write_remote_file
    await asyncssh.scp(f.name, (conn, tmp_path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
    await source.run(srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
    await self._send_files(path, b'')
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in _send_files
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in _send_files
    await self._send_file(srcpath, dstpath, attrs)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in _send_file
    await self._make_cd_request(b'C', attrs, size, srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in _make_cd_request
    self._fs.basename(path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request
    raise exc
asyncssh.sftp.SFTPFailure: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied
3/26/24 9:38:09 PM[INF]Updating dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf
It seems to be related to the permissions the manager writes the files with and to the process copying them around.
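What I intend to check next (just a sketch - the host and fsid are taken from the error above, and I'm assuming the mgr connects as root, or whatever user was set with 'ceph cephadm set-user'):

# does the staging path under /tmp already exist with ownership the SSH user cannot write to,
# e.g. a leftover ceph.conf.new from an earlier failed run?
ssh root@dkcphhpcmgt028 \
  'ls -ld /tmp/var /tmp/var/lib /tmp/var/lib/ceph /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config; \
   ls -l /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new'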
$ sudo ceph -v
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Best regards,
Jesper Agerbo Krogh
Director Digitalization
Digitalization
Topsoe A/S
Haldor Topsøes Allé 1
2800 Kgs. Lyngby
Denmark
Phone (direct): 27773240
   
On 29/05/2023 20.55, Anthony D'Atri wrote:
> Check the uptime for the OSDs in question
I restarted all my OSDs within the past 10 days or so. Maybe OSD
restarts are somehow breaking these stats?
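For reference, this is how I checked the restart times (the orch variant assumes a cephadm-managed cluster, the systemd variant a package-based install):

ceph orch ps --daemon-type osd                        # STATUS column shows "running (Xd)" per OSD
systemctl show -p ActiveEnterTimestamp ceph-osd@12    # when this particular OSD was last (re)started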
>
>> On May 29, 2023, at 6:44 AM, Hector Martin <marcan(a)marcan.st> wrote:
>>
>> Hi,
>>
>> I'm watching a cluster finish a bunch of backfilling, and I noticed that
>> quite often PGs end up with zero misplaced objects, even though they are
>> still backfilling.
>>
>> Right now the cluster is down to 6 backfilling PGs:
>>
>> data:
>> volumes: 1/1 healthy
>> pools: 6 pools, 268 pgs
>> objects: 18.79M objects, 29 TiB
>> usage: 49 TiB used, 25 TiB / 75 TiB avail
>> pgs: 262 active+clean
>> 6 active+remapped+backfilling
>>
>> But there are no misplaced objects, and the misplaced column in `ceph pg
>> dump` is zero for all PGs.
>>
>> If I do a `ceph pg dump_json`, I can see `num_objects_recovered`
>> increasing for these PGs... but the misplaced count is still 0.
>>
>> Is there something else that would cause recoveries/backfills other than
>> misplaced objects? Or perhaps there is a bug somewhere causing the
>> misplaced object count to be misreported as 0 sometimes?
>>
>> # ceph -v
>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
>> (stable)
>>
>> - Hector
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
- Hector
Hello all,
I am trying to troubleshoot a Ceph cluster (version 18.2.2) where users are reporting slow and blocked reads and writes.
When running "ceph status" I am seeing many warnings about its health state:
cluster:
id: cc881230-e0dd-11ee-aa9e-37c4e4e5e14b
health: HEALTH_WARN
6 clients failing to respond to capability release
2 clients failing to advance oldest client/flush tid
1 MDSs report slow requests
1 MDSs behind on trimming
Too many repaired reads on 11 OSDs
Degraded data redundancy: 2 pgs degraded
105 pgs not deep-scrubbed in time
109 pgs not scrubbed in time
1 mgr modules have recently crashed
12 slow ops, oldest one blocked for 97678 sec, daemons
[osd.11,osd.12,osd.15,osd.16,osd.19,osd.20,osd.28,osd.3,osd.32,osd.34]...
have slow ops.
services:
mon: 3 daemons, quorum file03-xx,file04-xx,file05-xx (age 17h)
mgr: file03-xx.xxxxxx(active, since 2w), standbys: file04-xx.xxxxxx
mds: 1/1 daemons up, 1 standby
osd: 44 osds: 44 up (since 17h), 44 in (since 39h); 492 remapped pgs
data:
volumes: 1/1 healthy
pools: 3 pools, 2065 pgs
objects: 66.44M objects, 140 TiB
usage: 281 TiB used, 304 TiB / 586 TiB avail
pgs: 16511162/134215883 objects misplaced (12.302%)
1508 active+clean
487 active+remapped+backfill_wait
53 active+clean+scrubbing+deep
8 active+clean+scrubbing
5 active+remapped+backfilling
2 active+recovering+degraded+repair
2 active+recovering+repair
io:
recovery: 47 MiB/s, 37 objects/s
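Regarding the slow ops warning above, this is roughly how I have been drilling into them so far (run on the node hosting the OSD; osd.11 is just one of the daemons named in the warning):

ceph health detail                       # lists the affected OSDs and the oldest blocked op
ceph daemon osd.11 dump_ops_in_flight    # currently blocked ops on this OSD
ceph daemon osd.11 dump_historic_ops     # recently completed ops, including formerly slow ones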
When checking the output of `ceph -w` I am flooded with crc error messages
like the examples below:
2024-04-24T19:15:40.430334+0000 osd.32 [ERR] 3.566 full-object read crc
0xa5da7fa != expected 0xffffffff on 3:66a8d8f5:::10001c72400.00000007:head
2024-04-24T19:15:40.430507+0000 osd.39 [ERR] 3.270 full-object read crc
0xa1bc3a1e != expected 0xffffffff on 3:0e44aa2f:::1000265a625.00000003:head
2024-04-24T19:15:40.494249+0000 osd.28 [ERR] 3.469 full-object read crc
0x8e757c06 != expected 0xffffffff on 3:962852f4:::10001c72c2d.00000001:head
2024-04-24T19:15:40.529771+0000 osd.32 [ERR] 3.566 full-object read crc
0xa5da7fa != expected 0xffffffff on 3:66a8d8f5:::10001c72400.00000007:head
2024-04-24T19:15:40.582128+0000 osd.19 [ERR] 3.4b full-object read crc
0x9222aec != expected 0xffffffff on 3:d20bddca:::100026fdb01.00000006:head
2024-04-24T19:15:40.583350+0000 osd.19 [ERR] 3.4b full-object read crc
0x9222aec != expected 0xffffffff on 3:d20bddca:::100026fdb01.00000006:head
2024-04-24T19:15:40.662945+0000 osd.28 [ERR] 3.469 full-object read crc
0x8e757c06 != expected 0xffffffff on 3:962852f4:::10001c72c2d.00000001:head
2024-04-24T19:15:40.698197+0000 osd.19 [ERR] 3.4b full-object read crc
0x9222aec != expected 0xffffffff on 3:d20bddca:::100026fdb01.00000006:head
2024-04-24T19:15:40.699389+0000 osd.19 [ERR] 3.4b full-object read crc
0x9222aec != expected 0xffffffff on 3:d20bddca:::100026fdb01.00000006:head
2024-04-24T19:15:40.769191+0000 osd.28 [ERR] 3.469 full-object read crc
0x8e757c06 != expected 0xffffffff on 3:962852f4:::10001c72c2d.00000001:head
2024-04-24T19:15:40.834344+0000 osd.19 [ERR] 3.4b full-object read crc
0x9222aec != expected 0xffffffff on 3:d20bddca:::100026fdb01.00000006:head
2024-04-24T19:15:40.835513+0000 osd.19 [ERR] 3.4b full-object read crc
0x9222aec != expected 0xffffffff on 3:d20bddca:::100026fdb01.00000006:head
I suspect this is the main issue affecting the cluster's health state and performance, so I am trying to address it first.
The "expected 0xffffffff" crc seems like a bug to me and I found an open
ticket (https://tracker.ceph.com/issues/53240) with similar error messages
but I am not sure this is related to my case.
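Not as a fix for the root cause, but these are the commands I was planning to use to map the errors to PGs and objects (PG 3.566 is taken from the log above; list-inconsistent-obj only returns data once a scrub has recorded an inconsistency):

ceph health detail
rados list-inconsistent-obj 3.566 --format=json-pretty
ceph pg deep-scrub 3.566
ceph pg repair 3.566     # only after confirming what is actually inconsistent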
Could someone point me to the steps to solve these errors?
Cheers,
--
Fabio
Hi,
This problem started when trying to add a new storage server to a Quincy v17.2.6 Ceph cluster. Whatever I did, I could not add the drives on the new host as OSDs: not via the dashboard, not via cephadm shell, not by setting osd unmanaged to false.
But then I started realizing that the orchestrator will also no longer automatically manage services. I.e. if a service is set to be managed by labels, removing and adding labels on different hosts for that service has no effect. Same if I set a service to be managed via hostnames. Same if I try to drain a host (the services/podman containers just keep running). I am, however, able to add/rm services via 'cephadm shell ceph orch daemon add/rm'. But Ceph will not manage them automatically using labels/hostnames.
This apparently includes OSD daemons. I cannot create any on the new host either automatically or manually, but I'm hoping the service and OSD issues are related and not two separate problems.
I haven't been able to find any obvious errors in /var/log/ceph, /var/log/syslog, podman logs <container>, etc. I have managed to trigger 'slow ops' errors on monitors by trying to add OSDs manually (and had to restart the monitor). I've also had cephadm shell hang, and had to restart the managers. I'm not an expert and it could be something obvious, but I haven't been able to figure out a solution. If anyone has any suggestions, I would greatly appreciate them.
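For completeness, the checks I have run (or plan to run) so far - mostly taken from the cephadm troubleshooting docs, so hopefully the commands are right:

ceph orch status                     # is the orchestrator backend enabled and responding?
ceph log last cephadm                # recent cephadm module messages, often shows why the serve loop is stuck
ceph cephadm check-host <new-host>   # does the new host pass cephadm's host checks?
ceph mgr fail                        # fail over the active mgr, which restarts the cephadm/orchestrator module
ceph orch device ls --refresh        # force a re-scan of devices on the hosts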
Thanks,
Mike
--
Michael Baer
ceph(a)mikesoffice.com
Hi Eugen,
Thank you for a viable solution to our underlying issue - I'll attempt
to implement it shortly. :-)
However, with all the respect in the world, I believe you are incorrect when
you say the doco is correct (but I will be more than happy to be proven
wrong). :-)
The relevant text (extracted from the last couple of paragraphs of the
documentation page) says:
~~~
If a client already has a capability for file-system name `a` and path
`dir1`, running `fs authorize` again for FS name `a` but path `dir2`,
instead of modifying the capabilities client already holds, a new cap
for `dir2` will be granted:

ceph fs authorize a client.x /dir1 rw
ceph auth get client.x
[client.x]
key = AQC1tyVknMt+JxAAp0pVnbZGbSr/nJrmkMNKqA==
caps mds = "allow rw fsname=a path=/dir1"
caps mon = "allow r fsname=a"
caps osd = "allow rw tag cephfs data=a"
ceph fs authorize a client.x /dir2 rw
updated caps for client.x
ceph auth get client.x
[client.x]
key = AQC1tyVknMt+JxAAp0pVnbZGbSr/nJrmkMNKqA==
caps mds = "allow rw fsname=a path=dir1, allow rw fsname=a path=dir2"
caps mon = "allow r fsname=a"
caps osd = "allow rw tag cephfs data=a"
~~~
The above *seems* to me to say (as per the 2nd `ceph auth get client.x`
example) that a 2nd directory (dir2) *will* be added to the `client.x`
authorisation.
HOWEVER, this does not work in practice - hence my original query.
This is what we originally attempted to do (word for word, only
substituting our CephFS name for "a"), and we got the error in the
original post.
So if the doco says that something can be done *and* gives a working
example, but an end-user (admin) following the exact same commands cannot
achieve the same result and gets an error instead, then either the doco
is incorrect *or* something else is wrong.
BUT your statement ("running 'ceph fs authorize' will overwrite the
existing caps, it will not add more caps to the client") is in direct
contradiction to the documentation ("If a client already has a
capability for file-system name |a| and path |dir1|, running |fs
authorize| again for FS name |a| but path |dir2|, instead of modifying
the capabilities client already holds, a new cap for |dir2| will be
granted").
So there's some sort of "disconnect" there. :-)
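For anyone else hitting this: the fallback we are looking at in the meantime (not what the docs describe) is to set the combined caps directly with ceph auth caps, using the filesystem name "a" from the docs example:

ceph auth caps client.x \
  mds "allow rw fsname=a path=/dir1, allow rw fsname=a path=/dir2" \
  mon "allow r fsname=a" \
  osd "allow rw tag cephfs data=a"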
Cheers
Hi All,
In reference to this page from the Ceph documentation:
https://docs.ceph.com/en/latest/cephfs/client-auth/, down the bottom of
that page it says that you can run the following commands:
~~~
ceph fs authorize a client.x /dir1 rw
ceph fs authorize a client.x /dir2 rw
~~~
This will allow `client.x` to access both `dir1` and `dir2`.
We have a use case where we need to do exactly this. HOWEVER, we are getting
the following error when running the 2nd command on a Reef 18.2.2 cluster:
`Error EINVAL: client.x already has fs capabilities that differ from
those supplied. To generate a new auth key for client.x, first remove
client.x from configuration files, execute 'ceph auth rm client.x', then
execute this command again.`
Is there something we're doing wrong, or is the doco "out of date" (mind you,
that's from both the "latest" and the "reef" versions of the doco), or is
something else going on?
Thanks in advance for the help
Cheers
Dulux-Oz
Dear all
We have an HDD Ceph cluster that could do with some more IOPS. One
solution we are considering is installing NVMe SSDs into the storage
nodes and using them as WAL and/or DB devices for the BlueStore OSDs.
However, we have some questions about this and are looking for some
guidance and advice.
The first one is about the expected benefits. Before we undergo the
effort involved in the transition, we are wondering whether it is even worth
it. How much of a performance boost can one expect when adding NVMe SSDs
as WAL devices to an HDD cluster? And how much faster than that does
it get with the DB also on SSD? Are there rule-of-thumb numbers for
that? Or maybe someone has done benchmarks in the past?
The second question is of a more practical nature. Are there any
best practices on how to implement this? I was thinking we won't do one
SSD per HDD - surely an NVMe SSD is fast enough to handle the traffic
from multiple OSDs. But what is a good ratio? Do I use one NVMe SSD per
4 HDDs? Per 6, or even 8? Also, how should I chop up the SSD - using
partitions or using LVM? Last but not least, if one SSD handles
WAL and DB for multiple OSDs, losing that SSD means losing multiple
OSDs. How do people deal with this risk? Is it generally deemed
acceptable, or is it something people tend to mitigate, and if so, how?
Do I run multiple SSDs in RAID?
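For context on what we are considering: if we deploy the change with cephadm, my understanding is that a drive-group style OSD service spec along these lines expresses "HDDs for data, shared NVMe for DB" (field names as I read them from the cephadm OSD service spec docs - please correct me if they are off). ceph-volume then carves the NVMe into one LVM logical volume per OSD, which would also answer our partitions-vs-LVM question:

service_type: osd
service_id: hdd-with-nvme-db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1      # the HDDs hold the data
  db_devices:
    rotational: 0      # the NVMe SSD(s) hold the DB (and implicitly the WAL)

# non-cephadm equivalent, again as I understand it:
# ceph-volume lvm batch /dev/sd[b-e] --db-devices /dev/nvme0n1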
I do realize that for some of these questions there might not be one perfect
answer that fits all use cases. I am looking for best practices and in
general just trying to avoid any obvious mistakes.
Any advice is much appreciated.
Sincerely
Niklaus Hofer
--
stepping stone AG
Wasserwerkgasse 7
CH-3011 Bern
Telefon: +41 31 332 53 63
www.stepping-stone.ch
niklaus.hofer(a)stepping-stone.ch