Hi Williams,
Besides using the same port (both the public and cluster networks use RDMA)
for the RDMA messenger, I also tried a TCP messenger for the public network
and an RDMA messenger for the cluster network. No serious problems happened.
Ceph was built from source, based on master commit 8cb1f6bd (Wed Nov 6
18:43:41 2019 -0500).
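In case it helps, the hybrid setup is roughly the below (only a sketch; the
RDMA device name is just an example and the subnets are placeholders, use
your own):
    [global]
    ms_type = async+rdma
    ms_public_type = async+posix
    ms_cluster_type = async+rdma
    ms_async_rdma_device_name = irdma1
    public_network = <your public subnet>
    cluster_network = <your cluster subnet>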
I don't have enough nodes to check your problem.
BTW, in "ceph-users Digest, Vol 82, Issue 27", there's the below item:
2. Re: mgr daemons becoming unresponsive (Gregory Farnum)
However, I haven't hit the mgr problem on my side.
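If the mgr is the main blocker on your side, one workaround, which also
appears in your Oct 30 ceph.conf quoted below, is to keep only the mgr on
the posix messenger:
    [mgr]
    ms_type = async+posix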
B.R.
Changcheng
On 08:53 Mon 11 Nov, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
> @Changcheng
>
> Sorry for the late reply as well.
>
> I followed your setup and I have an issue where the MGR cannot connect
> to the cluster and RDMA does not work; I believe the MGR is not
> supported on RDMA.
>
> Thank you for your time but I believe we may be hitting a dead end with
> this approach as we seem to get different results.
>
> Kind regards
>
> Gabryel Mason-Williams
> __________________________________________________________________
>
> From: Liu, Changcheng <changcheng.liu(a)intel.com>
> Sent: 01 November 2019 06:24
> To: Mason-Williams, Gabryel (DLSLtd,RAL,LSCI)
> <gabryel.mason-williams(a)diamond.ac.uk>
> Cc: dev(a)ceph.io <dev(a)ceph.io>
> Subject: Re: RMDA Bug?
>
> @Williams,
> Sorry for the late reply. I'm busy working on getting Ceph/RDMA
> performance data these days.
> I'm using an Intel RDMA NIC with a small cluster; there's no serious
> issue happening.
> For a Mellanox NIC, there's no problem with your ceph.conf from my
> perspective.
> Below are the nodes I used to deploy the cluster:
> 1. server0: 172.16.1.4, /dev/nvme0n1, /dev/nvme1n1
> 2. server1: 172.16.1.2, /dev/nvme0n1, /dev/nvme1n1
>
> Below are my deploy steps:
> [admin@server0 deploy]$ ceph-deploy new server0 --fsid 24280750-d4f7-4d4f-89e4-f95b8fab87ff
> [admin@server0 deploy]$ #change ceph.conf as below:
> [admin@server0 deploy]$ cat ceph.conf
> [global]
> cluster = ceph
> fsid = 24280750-d4f7-4d4f-89e4-f95b8fab87ff
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> osd pool default size = 2
> osd pool default min size = 2
> osd pool default pg num = 64
> osd pool default pgp num = 128
>
> osd pool default crush rule = 0
> osd crush chooseleaf type = 1
>
> mon_allow_pool_delete=true
> osd_pool_default_pg_autoscale_mode=on
>
> ms_type = async+rdma
> ;----changcheng: change device to your dev name----------
> ms_async_rdma_device_name = irdma1
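> ;----hint: `ibv_devices` (from rdma-core) lists the RDMA device names visible on this host; the value above must match one of them----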
> ;----changcheng: ignore below parameters with Mellanox NIC--------
> ;ms_async_rdma_support_srq = false
>
> mon_initial_members = server0
> mon_host = 172.16.1.4
>
> [mon.rdmarhel0]
> host = server0
> mon addr = 172.16.1.4
> [admin@server0 deploy]$ ceph-deploy mon create-initial
> [admin@server0 deploy]$ ceph-deploy admin server0 server1
> [admin@server0 deploy]$ ceph-deploy mgr create server0
> [admin@server0 deploy]$ ceph-deploy osd create --data /dev/nvme0n1 server0
> [admin@server0 deploy]$ ceph-deploy osd create --data /dev/nvme1n1 server0
> [admin@server0 deploy]$ ceph-deploy osd create --data /dev/nvme0n1 server1
> [admin@server0 deploy]$ ceph-deploy osd create --data /dev/nvme1n1 server1
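> After the OSDs are created, one way to confirm that data really flows
> over RDMA is to dump the RDMA worker counters on an OSD and watch
> tx_bytes/rx_bytes grow, e.g.:
> [admin@server0 deploy]$ sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1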
> B.R.
> Changcheng
> On 08:27 Thu 31 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
> > 1. When not defining a public and cluster network, the OSD and MGR
> > nodes do not get recognised
> >
> > sudo ceph -s
> >
> >   cluster:
> >     id:     820f1573-bc4a-4ee0-b702-80ba5ac13c25
> >     health: HEALTH_WARN
> >             3 osds down
> >             3 hosts (3 osds) down
> >             1 root (3 osds) down
> >             no active mgr
> >             too few PGs per OSD (21 < min 30)
> >
> >   services:
> >     mon: 3 daemons, quorum cs04r-sc-com99-05,cs04r-sc-com99-07,cs04r-sc-com99-08 (age 5m)
> >     mgr: no daemons active (since 4m)
> >     osd: 3 osds: 0 up (since 9m), 3 in (since 9m)
> >
> >   data:
> >     pools:   1 pools, 64 pgs
> >     objects: 0 objects, 0 B
> >     usage:   3.0 GiB used, 114 GiB / 117 GiB avail
> >     pgs:     44 stale+active+clean
> >              20 active+clean
> >
> > This is an issue with ms_type being async+rdma, as the daemons are
> > running:
> >
> > sudo systemctl status ceph-osd.target
> >
> > ● ceph-osd.target - ceph target allowing to start/stop all
> >   ceph-osd@.service instances at once
> >   Loaded: loaded (/usr/lib/systemd/system/ceph-osd.target; enabled; vendor preset: enabled)
> >   Active: active since Thu 2019-10-31 08:13:42 GMT; 8min ago
> >
> > sudo systemctl status ceph-mgr.target
> >
> > ● ceph-mgr.target - ceph target allowing to start/stop all
> >   ceph-mgr@.service instances at once
> >   Loaded: loaded (/usr/lib/systemd/system/ceph-mgr.target; enabled; vendor preset: enabled)
> >   Active: active since Thu 2019-10-31 08:13:33 GMT; 11min ago
> >
> > With the config being
> >
> > [global]
> >
> > fsid = 820f1573-bc4a-4ee0-b702-80ba5ac13c25
> >
> > mon_initial_members = node1, node2, node3
> >
> > mon_host = xxx.xx.xxx.aa,xxx.xx.xxx.ac, xxx.xx.xxx.ad
> >
> > auth_cluster_required = cephx
> >
> > auth_service_required = cephx
> >
> > auth_client_required = cephx
> >
> > ms_type = async+rdma
> >
> > ms_async_rdma_device_name = mlx4_0
> >
> __________________________________________________________________
> >
> > From: Liu, Changcheng <changcheng.liu(a)intel.com>
> > Sent: 31 October 2019 01:09
> > To: Mason-Williams, Gabryel (DLSLtd,RAL,LSCI)
> > <gabryel.mason-williams(a)diamond.ac.uk>
> > Cc: dev(a)ceph.io <dev(a)ceph.io>
> > Subject: Re: RMDA Bug?
> >
> > > 2) I'll confirm with my colleague whether the cluster network is
> > > really used in 14.2.4. We also hit a similar problem these days
> > > even when using the TCP async messenger.
> > [Changcheng]:
> >   1) The problem should already be solved in 14.2.4. We hit the
> >      problem in 14.2.1.
> >   2) I'll try to verify your problem when I have time (I'm working
> >      on other affairs). There should be no problem when unifying
> >      both public/cluster networks on the RDMA device.
> > On 23:22 Wed 30 Oct, Liu, Changcheng wrote:
> > > I'm working on the master branch and deployed a two-node cluster.
> > > Data is transferring over RDMA.
> > > [admin@server0 ~]$ sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1
> > > {
> > > "AsyncMessenger::RDMAWorker-1": {
> > > "tx_no_mem": 0,
> > > "tx_parital_mem": 0,
> > > "tx_failed_post": 0,
> > > "tx_chunks": 26966,
> > > "tx_bytes": 52789637,
> > > "rx_chunks": 26916,
> > > "rx_bytes": 52812278,
> > > "pending_sent_conns": 0
> > > }
> > > }
> > >
> > > The only difference is that I don't differentiate public/cluster
> > > network in my cluster.
> > > You can try to make all public/cluster network use RDMA.
> > > Note:
> > > 1) If both public/cluster use RDMA, we can't differentiate them in
> > >    different subnetworks. This is a feature limitation; I'm planning
> > >    to solve it in the future.
> > > 2) I'll confirm with my colleague whether the cluster network is
> > >    really used in 14.2.4. We also hit a similar problem these days
> > >    even when using the TCP async messenger.
> > >
> > > Below is my cluster's ceph configuration.
> > > I also attach the systemd patch used on my side.
> > > [admin@server0 ~]$ cat /etc/ceph/ceph.conf
> > > [global]
> > > cluster = ceph
> > > fsid = 24280750-d4f7-4d4f-89e4-f95b8fab87ff
> > > auth_cluster_required = cephx
> > > auth_service_required = cephx
> > > auth_client_required = cephx
> > >
> > > osd pool default size = 2
> > > osd pool default min size = 2
> > > osd pool default pg num = 64
> > > osd pool default pgp num = 128
> > >
> > > osd pool default crush rule = 0
> > > osd crush chooseleaf type = 1
> > >
> > > mon_allow_pool_delete=true
> > > osd_pool_default_pg_autoscale_mode=off
> > >
> > > ms_type = async+rdma
> > > ms_async_rdma_device_name = mlx5_0
> > >
> > > mon_initial_members = server0
> > > mon_host = 172.16.1.4
> > >
> > > [mon.rdmarhel0]
> > > host = server0
> > > mon addr = 172.16.1.4
> > > [admin@server0 ~]$
> > >
> > > B.R.
> > > Changcheng
> > >
> > > On 13:07 Wed 30 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
> > > > 1. The current problem is that it is still sending data over the
> > > >    ethernet instead of ib.
> > > > 2. [global]
> > > > fsid=xxxx
> > > > mon_initial_members = node1, node2, node3
> > > > mon_host = xxx.xx.xxx.ab,xxx.xx.xxx.ac, xxx.xx.xxx.ad
> > > > auth_cluster_required = cephx
> > > > auth_service_required = cephx
> > > > auth_client_required = cephx
> > > > public_network = xxx.xx.xxx.0/24
> > > > cluster_network = xx.xxx.0.0/16
> > > > ms_cluster_type = async+rdma
> > > > ms_type = async+rdma
> > > > ms_public_type = async+posix
> > > > [mgr]
> > > > ms_type = async+posix
> > > > 3. The ceph cluster is deployed using ceph-deploy. Once it is up,
> > > >    all of the daemons are turned off, the RDMA cluster config is
> > > >    sent around, and once that is complete the daemons are turned
> > > >    back on. The ulimit is set to unlimited; LimitMEMLOCK=infinity
> > > >    is set on ceph-disk@.service, ceph-mds@.service,
> > > >    ceph-mon@.service, ceph-osd@.service and ceph-radosgw@.service,
> > > >    as well as PrivateDevices=no on ceph-mds@.service,
> > > >    ceph-mon@.service and ceph-radosgw@.service. The ethernet mtu
> > > >    is set to 1000.
> > > >
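> > > (Side note: the same unit changes can also be made without editing
> > > the shipped unit files by using a systemd drop-in; a minimal sketch,
> > > with ceph-osd taken as the example unit:
> > >     # /etc/systemd/system/ceph-osd@.service.d/rdma.conf
> > >     [Service]
> > >     LimitMEMLOCK=infinity
> > >     PrivateDevices=no
> > > then run `systemctl daemon-reload` and restart the daemons.)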
> > __________________________________________________________________
> > > >
> > > > From: Liu, Changcheng <changcheng.liu(a)intel.com>
> > > > Sent: 30 October 2019 12:24
> > > > To: Mason-Williams, Gabryel (DLSLtd,RAL,LSCI)
> > > > <gabryel.mason-williams(a)diamond.ac.uk>
> > > > Cc: dev(a)ceph.io <dev(a)ceph.io>
> > > > Subject: Re: RMDA Bug?
> > > >
> > > > 1. What problem do you hit when using RDMA in 14.2.4? Does any
> > > >    log show the error?
> > > > 2. What's your ceph.conf?
> > > > 3. How do you deploy the ceph cluster? RDMA needs to lock some
> > > >    memory, so some system configuration must be changed to meet
> > > >    this requirement.
> > > > On 11:21 Wed 30 Oct, Gabryel Mason-Williams wrote:
> > > > > Liu, Changcheng wrote:
> > > > > > On 07:31 Mon 28 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
> > > > > > > I am using ceph version 12.2.8
> > > > > > > (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable).
> > > > > > >
> > > > > > > I have not checked the master branch. Do you think this is an
> > > > > > > issue in luminous that has been removed in later versions?
> > > > > > I haven't hit the problem on the master branch. Ceph/RDMA changed
> > > > > > a lot from luminous to master.
> > > > > >
> > > > > > Is the below configuration really needed in luminous/ceph.conf?
> > > > > > > ms_async_rdma_local_gid = xxxx
> > > > > > On the master branch, this parameter is not needed at all.
> > > > > > B.R.
> > > > > > Changcheng
> > > > > > >
> > > >
> > __________________________________________________________________
> > > > >
> > > > > Thanks, the issue of the OSDs falling over seems to have gone
> > > > > away after updating to Nautilus 14.2.4. However, I am still
> > > > > unable to get it to properly communicate over RDMA even after
> > > > > removing ms_async_rdma_local_gid.
> > > > > _______________________________________________
> > > > > Dev mailing list -- dev(a)ceph.io
> > > > > To unsubscribe send an email to dev-leave(a)ceph.io
> > > >
> > > >
> > >
> > > > _______________________________________________
> > > > Dev mailing list -- dev(a)ceph.io
> > > > To unsubscribe send an email to dev-leave(a)ceph.io
> > >
> > > From 40fa0d7096364b410e8242c46967029fb949876a Mon Sep 17 00:00:00 2001
> > > From: Changcheng Liu <changcheng.liu(a)aliyun.com>
> > > Date: Tue, 23 Jul 2019 18:50:57 +0800
> > > Subject: [PATCH] rdma systemd: grant access to /dev and unlimit mem
> > >
> > > Signed-off-by: Changcheng Liu <changcheng.liu(a)aliyun.com>
> > >
> > > diff --git a/systemd/ceph-fuse@.service.in b/systemd/ceph-fuse@.service.in
> > > index d603042b12..ff2e9072f6 100644
> > > --- a/systemd/ceph-fuse@.service.in
> > > +++ b/systemd/ceph-fuse@.service.in
> > > @@ -12,6 +12,7 @@ ExecStart=/usr/bin/ceph-fuse -f --cluster ${CLUSTER} %I
> > > LockPersonality=true
> > > MemoryDenyWriteExecute=true
> > > NoNewPrivileges=true
> > > +LimitMEMLOCK=infinity
> > > # ceph-fuse requires access to /dev fuse device
> > > PrivateDevices=no
> > > ProtectControlGroups=true
> > > diff --git a/systemd/ceph-mds@.service.in b/systemd/ceph-mds@.service.in
> > > index 39a2e63105..0e58dfeeea 100644
> > > --- a/systemd/ceph-mds@.service.in
> > > +++ b/systemd/ceph-mds@.service.in
> > > @@ -14,7 +14,8 @@ ExecReload=/bin/kill -HUP $MAINPID
> > > LockPersonality=true
> > > MemoryDenyWriteExecute=true
> > > NoNewPrivileges=true
> > > -PrivateDevices=yes
> > > +LimitMEMLOCK=infinity
> > > +PrivateDevices=no
> > > ProtectControlGroups=true
> > > ProtectHome=true
> > > ProtectKernelModules=true
> > > diff --git a/systemd/ceph-mgr@.service.in b/systemd/ceph-mgr@.service.in
> > > index c98f6378b9..682c7ecef3 100644
> > > --- a/systemd/ceph-mgr@.service.in
> > > +++ b/systemd/ceph-mgr@.service.in
> > > @@ -18,7 +18,8 @@ LockPersonality=true
> > > MemoryDenyWriteExecute=false
> > >
> > > NoNewPrivileges=true
> > > -PrivateDevices=yes
> > > +LimitMEMLOCK=infinity
> > > +PrivateDevices=no
> > > ProtectControlGroups=true
> > > ProtectHome=true
> > > ProtectKernelModules=true
> > > diff --git a/systemd/ceph-mon@.service.in b/systemd/ceph-mon@.service.in
> > > index c95fcabb26..51854fad96 100644
> > > --- a/systemd/ceph-mon@.service.in
> > > +++ b/systemd/ceph-mon@.service.in
> > > @@ -21,7 +21,8 @@ LockPersonality=true
> > > MemoryDenyWriteExecute=true
> > > # Need NewPrivileges via `sudo smartctl`
> > > NoNewPrivileges=false
> > > -PrivateDevices=yes
> > > +LimitMEMLOCK=infinity
> > > +PrivateDevices=no
> > > ProtectControlGroups=true
> > > ProtectHome=true
> > > ProtectKernelModules=true
> > > diff --git a/systemd/ceph-osd@.service.in b/systemd/ceph-osd@.service.in
> > > index 1b5c9c82b8..06c20d7c83 100644
> > > --- a/systemd/ceph-osd@.service.in
> > > +++ b/systemd/ceph-osd@.service.in
> > > @@ -16,6 +16,8 @@ LockPersonality=true
> > > MemoryDenyWriteExecute=true
> > > # Need NewPrivileges via `sudo smartctl`
> > > NoNewPrivileges=false
> > > +LimitMEMLOCK=infinity
> > > +PrivateDevices=no
> > > ProtectControlGroups=true
> > > ProtectHome=true
> > > ProtectKernelModules=true
> > > diff --git a/systemd/ceph-radosgw@.service.in b/systemd/ceph-radosgw@.service.in
> > > index 7e3ddf6c04..fe1a6b9159 100644
> > > --- a/systemd/ceph-radosgw@.service.in
> > > +++ b/systemd/ceph-radosgw@.service.in
> > > @@ -13,7 +13,8 @@ ExecStart=/usr/bin/radosgw -f --cluster ${CLUSTER} --name client.%i --setuser ce
> > > LockPersonality=true
> > > MemoryDenyWriteExecute=true
> > > NoNewPrivileges=true
> > > -PrivateDevices=yes
> > > +LimitMEMLOCK=infinity
> > > +PrivateDevices=no
> > > ProtectControlGroups=true
> > > ProtectHome=true
> > > ProtectKernelModules=true
> > > diff --git a/systemd/ceph-volume@.service b/systemd/ceph-volume@.service
> > > index c21002cecb..e2d1f67b85 100644
> > > --- a/systemd/ceph-volume@.service
> > > +++ b/systemd/ceph-volume@.service
> > > @@ -9,6 +9,7 @@ KillMode=none
> > > Environment=CEPH_VOLUME_TIMEOUT=10000
> > > ExecStart=/bin/sh -c 'timeout $CEPH_VOLUME_TIMEOUT /usr/sbin/ceph-volume-systemd %i'
> > > TimeoutSec=0
> > > +LimitMEMLOCK=infinity
> > >
> > > [Install]
> > > WantedBy=multi-user.target
> > > --
> > > 2.17.1
> > >
> > > _______________________________________________
> > > Dev mailing list -- dev(a)ceph.io
> > > To unsubscribe send an email to dev-leave(a)ceph.io
> >
> >