On 07:31 Mon 28 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
I am using ceph version 12.2.8
(ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable).
I have not checked the master branch. Do you think this is an issue in
luminous that has been fixed in later versions?
I haven't hit this problem on the master branch. Ceph/RDMA changed a lot
between luminous and master.
Is the configuration below really needed in your luminous ceph.conf?
    ms_async_rdma_local_gid = xxxx
On the master branch, this parameter is not needed at all.
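For reference, on luminous the per-daemon GID can usually be read straight out of sysfs. A minimal sketch, assuming device mlx4_0 and port 1 (the device name is taken from the config below; both values need adjusting for the actual hardware):

```shell
# List candidate GIDs for an RDMA NIC so one value can be copied into
# ms_async_rdma_local_gid. DEV and PORT are assumptions; adjust them.
DEV=mlx4_0
PORT=1
GID_DIR="/sys/class/infiniband/$DEV/ports/$PORT/gids"
if [ -d "$GID_DIR" ]; then
  for f in "$GID_DIR"/*; do
    printf '%s: %s\n' "$(basename "$f")" "$(cat "$f")"
  done
else
  echo "device $DEV not present on this host"
fi
```

RoCE v1 and v2 typically expose separate GID entries for the same port, so picking the wrong index is an easy way to end up on the wrong protocol.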
B.R.
Changcheng
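As a side note, the RDMAWorker counters quoted further down can be checked mechanically: tx_bytes growing while rx_bytes stays at zero means the OSD sends over RDMA but never receives anything back. A rough sketch of that check, using a trimmed copy of the perf dump from the original mail as stand-in input:

```shell
# On a live node the sample would come from:
#   sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1
# Here a trimmed copy of the dump quoted below stands in for it.
sample='{"AsyncMessenger::RDMAWorker-1": {"tx_bytes": 2529, "rx_bytes": 0}}'
tx=$(printf '%s\n' "$sample" | sed -n 's/.*"tx_bytes": \([0-9]*\).*/\1/p')
rx=$(printf '%s\n' "$sample" | sed -n 's/.*"rx_bytes": \([0-9]*\).*/\1/p')
if [ "$tx" -gt 0 ] && [ "$rx" -gt 0 ]; then
  echo "RDMA traffic flowing in both directions (tx=$tx rx=$rx)"
else
  echo "RDMA traffic one-sided or absent (tx=$tx rx=$rx)"
fi
```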
> __________________________________________________________________
>
> From: Liu, Changcheng <changcheng.liu(a)intel.com>
> Sent: 25 October 2019 18:04
> To: Mason-Williams, Gabryel (DLSLtd,RAL,LSCI)
> <gabryel.mason-williams(a)diamond.ac.uk>
> Cc: ceph-users(a)ceph.com <ceph-users(a)ceph.com>; dev(a)ceph.io
> <dev(a)ceph.io>
> Subject: Re: RMDA Bug?
>
> What's your ceph version? Have you verified whether the problem could
> be reproduced on the master branch?
> On 08:33 Fri 25 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
> > I am currently trying to run Ceph on RDMA (either RoCE v1 or v2).
> > However, I am experiencing issues with this.
> >
> > When using Ceph on RDMA, OSDs will randomly become unreachable even
> > if the cluster is left alone. It is also not properly talking over
> > RDMA and is using Ethernet when the config states it should use
> > RDMA, as shown by the identical benchmark results for the two setups.
> >
> > After reloading the cluster
> > [screenshot: cluster status after reload]
> >
> > After 5m 9s the cluster went from being healthy to down.
> >
> > [screenshot: cluster down after 5m 9s]
> >
> > This problem even happens when running a benchmark test on the
> > cluster; OSDs will just fall over. Another curious issue is that it
> > is not properly talking over RDMA and is instead using Ethernet.
> >
> > [screenshot: benchmark results]
> >
> > Next test:
> >
> > [screenshot: next test results]
> >
> > The config used for RDMA is as follows:
> >
> > [global]
> > fsid = aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
> > mon_initial_members = node1, node2, node3
> > mon_host = xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > public_network = xxx.xxx.xxx.xxx/24
> > cluster_network = yyy.yyy.yyy.yyy/16
> > ms_cluster_type = async+rdma
> > ms_public_type = async+posix
> > ms_async_rdma_device_name = mlx4_0
> >
> > [osd.0]
> > ms_async_rdma_local_gid = xxxx
> >
> > [osd.1]
> > ms_async_rdma_local_gid = xxxx
> >
> > [osd.2]
> > ms_async_rdma_local_gid = xxxx
> >
> > Tests to check the system is using RDMA:
> >
> > sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep ms_cluster
> >
> > OUTPUT
> >
> > "ms_cluster_type": "async+rdma",
> >
> > sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1
> >
> > OUTPUT
> >
> > {
> >     "AsyncMessenger::RDMAWorker-1": {
> >         "tx_no_mem": 0,
> >         "tx_parital_mem": 0,
> >         "tx_failed_post": 0,
> >         "rx_no_registered_mem": 0,
> >         "tx_chunks": 9,
> >         "tx_bytes": 2529,
> >         "rx_chunks": 0,
> >         "rx_bytes": 0,
> >         "pending_sent_conns": 0
> >     }
> > }
> >
> > When running over Ethernet I have a completely stable system, with
> > the current benchmarks as follows:
> >
> > [screenshot: Ethernet benchmark results]
> >
> > The config setup when using Ethernet is:
> >
> > [global]
> > fsid = aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
> > mon_initial_members = node1, node2, node3
> > mon_host = xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > public_network = xxx.xxx.xxx.xxx/24
> > cluster_network = yyy.yyy.yyy.yyy/16
> > ms_cluster_type = async+posix
> > ms_public_type = async+posix
> > ms_async_rdma_device_name = mlx4_0
> >
> > [osd.0]
> > ms_async_rdma_local_gid = xxxx
> >
> > [osd.1]
> > ms_async_rdma_local_gid = xxxx
> >
> > [osd.2]
> > ms_async_rdma_local_gid = xxxx
> >
> > Tests to check the system is using async+posix:
> >
> > sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep ms_cluster
> >
> > OUTPUT
> >
> > "ms_cluster_type": "async+posix"
> >
> > sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1
> >
> > OUTPUT
> >
> > {}
> >
> > This is clearly an issue with RDMA and not with the OSDs, as shown
> > by the fact that the system is completely stable over Ethernet but
> > not over RDMA.
> >
> > Any guidance or ideas on how to approach this problem to make Ceph
> > work with RDMA would be greatly appreciated.
> >
> > Regards
> >
> > Gabryel Mason-Williams, Placement Student
> >
> > Address: Diamond Light Source Ltd., Diamond House, Harwell Science
> > & Innovation Campus, Didcot, Oxfordshire OX11 0DE
> >
> > Email: gabryel.mason-williams(a)diamond.ac.uk
> >
> >
> > --
> >
> > This e-mail and any attachments may contain confidential,
> copyright and
> > or privileged material, and are for the use of the intended
> addressee
> > only. If you are not the intended addressee or an authorised
> recipient
> > of the addressee please notify us of receipt by returning the
> e-mail
> > and do not use, copy, retain, distribute or disclose the
> information in
> > or attached to the e-mail.
> > Any opinions expressed within this e-mail are those of the
> individual
> > and not necessarily of Diamond Light Source Ltd.
> > Diamond Light Source Ltd. cannot guarantee that this e-mail or any
> > attachments are free from viruses and we cannot accept liability
> for
> > any damage which you may sustain as a result of software viruses
> which
> > may be transmitted in or with the message.
> > Diamond Light Source Limited (company no. 4375679). Registered in
> > England and Wales with its registered office at Diamond House,
> Harwell
> > Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE,
> United
> > Kingdom
> > _______________________________________________
> > Dev mailing list -- dev(a)ceph.io
> > To unsubscribe send an email to dev-leave(a)ceph.io