Hi Venky,
"peer_bootstrap import" is working fine now. The issue was that port 3300 (the mon msgr v2 port) was blocked by a firewall.
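In case anyone else hits the same hang: a quick connectivity check from the host running the cephfs-mirror daemon would have caught it. A minimal sketch (the mon address is whatever is listed under mon_host in the peer's bootstrap token; the firewalld line is just an example and depends on the distro):

# verify the remote cluster's mon ports are reachable from the mirror host
nc -zv <remote-mon-ip> 3300    # msgr v2
nc -zv <remote-mon-ip> 6789    # msgr v1
# example only: open the v2 port on a firewalld-based host
firewall-cmd --permanent --add-port=3300/tcp && firewall-cmd --reload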
Thank you for your help.
Regards,
Anantha
From: Adiga, Anantha
Sent: Monday, August 7, 2023 1:29 PM
To: Venky Shankar <vshankar@redhat.com>; ceph-users@ceph.io
Subject: RE: [ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung
Hi Venky,
Could this version mismatch be the reason that the peer_bootstrap import is hanging? How do I upgrade
cephfs-mirror to Quincy?
root@fl31ca104ja0201:/# cephfs-mirror --version
ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable)
root@fl31ca104ja0201:/# ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
root@fl31ca104ja0201:/#
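If the cephfs-mirror daemon is managed by cephadm (an assumption on my part), a sketch of checking and upgrading it might look like:

# see which image/version the cephfs-mirror daemon is currently running
ceph orch ps | grep cephfs-mirror
# let the orchestrator move everything, including cephfs-mirror, to 17.2.6
ceph orch upgrade start --ceph-version 17.2.6
ceph orch upgrade status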
Thank you,
Anantha
From: Adiga, Anantha
Sent: Monday, August 7, 2023 11:21 AM
To: 'Venky Shankar' <vshankar@redhat.com>; 'ceph-users@ceph.io' <ceph-users@ceph.io>
Subject: RE: [ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung
Hi Venky,
I tried on another secondary Quincy cluster and it is the same problem: the peer_bootstrap
import command hangs.
root@fl31ca104ja0201:/# ceph fs snapshot mirror peer_bootstrap import cephfs
eyJmc2lkIjogIjJlYWMwZWEwLTYwNDgtNDQ0Zi04NGIyLThjZWVmZWQyN2E1YiIsICJmaWxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwgInNpdGVfbmFtZSI6ICJzaGdSLXNpdGUiLCAia2V5IjogIkFRQ0lGdEZrSStTTE5oQUFXbWV6MkRKcEg5ZUdyYnhBOWVmZG9BPT0iLCAibW9uX2hvc3QiOiAiW3YyOjEwLjIzOS4xNTUuMTg6MzMwMC8wLHYxOjEwLjIzOS4xNTUuMTg6Njc4OS8wXSBbdjI6MTAuMjM5LjE1NS4xOTozMzAwLzAsdjE6MTAuMjM5LjE1NS4xOTo2Nzg5LzBdIFt2MjoxMC4yMzkuMTU1LjIwOjMzMDAvMCx2MToxMC4yMzkuMTU1LjIwOjY3ODkvMF0ifQ==
...
...the command does not complete; it just waits here.
^C to exit.
Thereafter some commands do not complete…
root@fl31ca104ja0201:/# ceph -s
cluster:
id: d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
health: HEALTH_OK
services:
mon: 3 daemons, quorum fl31ca104ja0202,fl31ca104ja0203,fl31ca104ja0201 (age 2d)
mgr: fl31ca104ja0201.kkoono(active, since 3d), standbys: fl31ca104ja0202, fl31ca104ja0203
mds: 1/1 daemons up, 2 standby
osd: 44 osds: 44 up (since 2d), 44 in (since 5w)
cephfs-mirror: 1 daemon active (1 hosts)
rgw: 3 daemons active (3 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 25 pools, 769 pgs
objects: 614.40k objects, 1.9 TiB
usage: 2.9 TiB used, 292 TiB / 295 TiB avail
pgs: 769 active+clean
io:
client: 32 KiB/s rd, 0 B/s wr, 33 op/s rd, 1 op/s wr
root@fl31ca104ja0201:/#
root@fl31ca104ja0201:/# ceph fs status cephfs
This command also waits…
I have attached the mgr log.
root@fl31ca104ja0201:/# ceph service status
{
    "cephfs-mirror": {
        "5306346": {
            "status_stamp": "2023-08-07T17:35:56.884907+0000",
            "last_beacon": "2023-08-07T17:45:01.903540+0000",
            "status": {
                "status_json": "{\"1\":{\"name\":\"cephfs\",\"directory_count\":0,\"peers\":{}}}"
            }
        }
    }
}
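The status_json above shows directory_count 0 and an empty peers map, i.e. no peer has been registered yet. To cross-check what the mirroring module has configured, something like the following should work (the daemon status syntax varies slightly between releases):

# peers configured for the filesystem on the primary
ceph fs snapshot mirror peer_list cephfs
# the mirroring module's view of the mirror daemon(s)
ceph fs snapshot mirror daemon status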
Quincy secondary cluster
root@a001s008-zz14l47008:/# ceph mgr module enable mirroring
root@a001s008-zz14l47008:/# ceph fs authorize cephfs client.mirror_remote / rwps
[client.mirror_remote]
key = AQCIFtFkI+SLNhAAWmez2DJpH9eGrbxA9efdoA==
root@a001s008-zz14l47008:/# ceph auth get client.mirror_remote
[client.mirror_remote]
key = AQCIFtFkI+SLNhAAWmez2DJpH9eGrbxA9efdoA==
caps mds = "allow rwps fsname=cephfs"
caps mon = "allow r fsname=cephfs"
caps osd = "allow rw tag cephfs data=cephfs"
root@a001s008-zz14l47008:/#
root@a001s008-zz14l47008:/# ceph fs snapshot mirror peer_bootstrap create cephfs
client.mirror_remote shgR-site
{"token":
"eyJmc2lkIjogIjJlYWMwZWEwLTYwNDgtNDQ0Zi04NGIyLThjZWVmZWQyN2E1YiIsICJmaWxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwgInNpdGVfbmFtZSI6ICJzaGdSLXNpdGUiLCAia2V5IjogIkFRQ0lGdEZrSStTTE5oQUFXbWV6MkRKcEg5ZUdyYnhBOWVmZG9BPT0iLCAibW9uX2hvc3QiOiAiW3YyOjEwLjIzOS4xNTUuMTg6MzMwMC8wLHYxOjEwLjIzOS4xNTUuMTg6Njc4OS8wXSBbdjI6MTAuMjM5LjE1NS4xOTozMzAwLzAsdjE6MTAuMjM5LjE1NS4xOTo2Nzg5LzBdIFt2MjoxMC4yMzkuMTU1LjIwOjMzMDAvMCx2MToxMC4yMzkuMTU1LjIwOjY3ODkvMF0ifQ=="}
root@a001s008-zz14l47008:/#
Thank you,
Anantha
From: Adiga, Anantha
Sent: Friday, August 4, 2023 11:55 AM
To: Venky Shankar <vshankar@redhat.com>; ceph-users@ceph.io
Subject: RE: [ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung
Hi Venky,
Thank you so much for the guidance. Attached is the mgr log.
Note: the 4th node in the primary cluster has smaller-capacity drives; the other 3 nodes
have larger-capacity drives.
32 ssd 6.98630 1.00000 7.0 TiB 44 GiB 44 GiB 183 KiB 148 MiB 6.9 TiB 0.62 0.64 40 up osd.32
-7 76.84927 - 77 TiB 652 GiB 648 GiB 20 MiB 3.0 GiB 76 TiB 0.83 0.86 - host fl31ca104ja0203
1 ssd 6.98630 1.00000 7.0 TiB 73 GiB 73 GiB 8.0 MiB 333 MiB 6.9 TiB 1.02 1.06 54 up osd.1
4 ssd 6.98630 1.00000 7.0 TiB 77 GiB 77 GiB 1.1 MiB 174 MiB 6.9 TiB 1.07 1.11 55 up osd.4
7 ssd 6.98630 1.00000 7.0 TiB 47 GiB 47 GiB 140 KiB 288 MiB 6.9 TiB 0.66 0.68 51 up osd.7
10 ssd 6.98630 1.00000 7.0 TiB 75 GiB 75 GiB 299 KiB 278 MiB 6.9 TiB 1.05 1.09 44 up osd.10
13 ssd 6.98630 1.00000 7.0 TiB 94 GiB 94 GiB 1018 KiB 291 MiB 6.9 TiB 1.31 1.36 72 up osd.13
16 ssd 6.98630 1.00000 7.0 TiB 31 GiB 31 GiB 163 KiB 267 MiB 7.0 TiB 0.43 0.45 49 up osd.16
19 ssd 6.98630 1.00000 7.0 TiB 14 GiB 14 GiB 756 KiB 333 MiB 7.0 TiB 0.20 0.21 50 up osd.19
22 ssd 6.98630 1.00000 7.0 TiB 105 GiB 104 GiB 1.3 MiB 313 MiB 6.9 TiB 1.46 1.51 48 up osd.22
25 ssd 6.98630 1.00000 7.0 TiB 17 GiB 16 GiB 257 KiB 272 MiB 7.0 TiB 0.23 0.24 45 up osd.25
28 ssd 6.98630 1.00000 7.0 TiB 72 GiB 72 GiB 6.1 MiB 180 MiB 6.9 TiB 1.01 1.05 43 up osd.28
31 ssd 6.98630 1.00000 7.0 TiB 47 GiB 46 GiB 592 KiB 358 MiB 6.9 TiB 0.65 0.68 56 up osd.31
-9 64.04089 - 64 TiB 728 GiB 726 GiB 17 MiB 1.8 GiB 63 TiB 1.11 1.15 - host fl31ca104ja0302
33 ssd 5.82190 1.00000 5.8 TiB 65 GiB 65 GiB 245 KiB 144 MiB 5.8 TiB 1.09 1.13 47 up osd.33
34 ssd 5.82190 1.00000 5.8 TiB 14 GiB 14 GiB 815 KiB 83 MiB 5.8 TiB 0.24 0.25 55 up osd.34
35 ssd 5.82190 1.00000 5.8 TiB 77 GiB 77 GiB 224 KiB 213 MiB 5.7 TiB 1.30 1.34 44 up osd.35
36 ssd 5.82190 1.00000 5.8 TiB 117 GiB 117 GiB 8.5 MiB 284 MiB 5.7 TiB 1.96 2.03 52 up osd.36
37 ssd 5.82190 1.00000 5.8 TiB 58 GiB 58 GiB 501 KiB 132 MiB 5.8 TiB 0.98 1.01 40 up osd.37
38 ssd 5.82190 1.00000 5.8 TiB 123 GiB 123 GiB 691 KiB 266 MiB 5.7 TiB 2.07 2.14 73 up osd.38
39 ssd 5.82190 1.00000 5.8 TiB 77 GiB 77 GiB 609 KiB 193 MiB 5.7 TiB 1.30 1.34 62 up osd.39
40 ssd 5.82190 1.00000 5.8 TiB 77 GiB 77 GiB 262 KiB 148 MiB 5.7 TiB 1.29 1.34 55 up osd.40
41 ssd 5.82190 1.00000 5.8 TiB 44 GiB 44 GiB 4.4 MiB 140 MiB 5.8 TiB 0.75 0.77 44 up osd.41
42 ssd 5.82190 1.00000 5.8 TiB 45 GiB 45 GiB 886 KiB 135 MiB 5.8 TiB 0.75 0.78 47 up osd.42
43 ssd 5.82190 1.00000 5.8 TiB 28 GiB 28 GiB 187 KiB 104 MiB 5.8 TiB 0.48 0.49 58 up osd.43
[Also: yesterday I had two cephfs-mirror daemons running, one on fl31ca104ja0201 and one on fl31ca104ja0302.
The cephfs-mirror on fl31ca104ja0201 was stopped. When the import token was run on
fl31ca104ja0302, that cephfs-mirror's log was active. Just in case it is useful, I have attached
that log (cfsmirror-container.log) as well.]
How can I list the token on the target cluster after running the peer_bootstrap create
command?
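In the meantime, the token itself is just base64-encoded JSON, so a token that was already generated can at least be inspected locally. A sketch, with <token> as a placeholder for the string returned by peer_bootstrap create:

echo '<token>' | base64 -d | python3 -m json.tool
# shows the fsid, filesystem, user, site_name, key and mon_host of the peer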
Here is today’s status with your suggestion:
There is only one cephfs-mirror daemon running now; it is on the fl31ca104ja0201 node.
root@fl31ca104ja0201:/# ceph -s
cluster:
id: d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
health: HEALTH_OK
services:
mon: 3 daemons, quorum fl31ca104ja0202,fl31ca104ja0203,fl31ca104ja0201 (age 7m)
mgr: fl31ca104ja0201.kkoono(active, since 13m), standbys: fl31ca104ja0202, fl31ca104ja0203
mds: 1/1 daemons up, 2 standby
osd: 44 osds: 44 up (since 7m), 44 in (since 4w)
cephfs-mirror: 1 daemon active (1 hosts)
rgw: 3 daemons active (3 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 25 pools, 769 pgs
objects: 614.40k objects, 1.9 TiB
usage: 2.8 TiB used, 292 TiB / 295 TiB avail
pgs: 769 active+clean
io:
client: 32 MiB/s rd, 0 B/s wr, 57 op/s rd, 1 op/s wr
root@fl31ca104ja0201:/#
root@fl31ca104ja0201:/#
root@fl31ca104ja0201:/# ceph tell mgr.fl31ca104ja0201.kkoono config set debug_mgr 20
{
"success": ""
}
root@fl31ca104ja0201:/# ceph fs snapshot mirror peer_bootstrap import cephfs
eyJmc2lkIjogImE2ZjUyNTk4LWU1Y2QtNGEwOC04NDIyLTdiNmZkYjFkNWRiZSIsICJmaWxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwgInNpdGVfbmFtZSI6ICJmbGV4Mi1zaXRlIiwgImtleSI6ICJBUUNmd01sa005MHBMQkFBd1h0dnBwOGowNEl2Qzh0cXBBRzliQT09IiwgIm1vbl9ob3N0IjogIlt2MjoxNzIuMTguNTUuNzE6MzMwMC8wLHYxOjE3Mi4xOC41NS43MTo2Nzg5LzBdIFt2MjoxNzIuMTguNTUuNzM6MzMwMC8wLHYxOjE3Mi4xOC41NS43Mzo2Nzg5LzBdIn0=
^CInterrupted
I hit Ctrl-C after 15 min. Once the command is run, the health status goes to HEALTH_WARN.
root@fl31ca104ja0201:/# ceph -s
cluster:
id: d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
health: HEALTH_WARN
6 slow ops, oldest one blocked for 1095 sec, mon.fl31ca104ja0203 has slow ops
services:
mon: 3 daemons, quorum fl31ca104ja0202,fl31ca104ja0203,fl31ca104ja0201 (age 30m)
mgr: fl31ca104ja0201.kkoono(active, since 35m), standbys: fl31ca104ja0202, fl31ca104ja0203
mds: 1/1 daemons up, 2 standby
osd: 44 osds: 44 up (since 29m), 44 in (since 4w)
cephfs-mirror: 1 daemon active (1 hosts)
rgw: 3 daemons active (3 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 25 pools, 769 pgs
objects: 614.40k objects, 1.9 TiB
usage: 2.8 TiB used, 292 TiB / 295 TiB avail
pgs: 769 active+clean
io:
client: 67 KiB/s rd, 0 B/s wr, 68 op/s rd, 21 op/s wr
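For reference, a way to dig further into the slow-ops warning might be the following (assuming access to the admin socket of the mon named in the warning, inside its container on a cephadm deployment):

ceph health detail
# on the host running mon.fl31ca104ja0203, dump the ops currently in flight
ceph daemon mon.fl31ca104ja0203 ops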
-----Original Message-----
From: Venky Shankar <vshankar@redhat.com>
Sent: Thursday, August 3, 2023 11:03 PM
To: Adiga, Anantha <anantha.adiga@intel.com>
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung
Hi Anantha,
On Fri, Aug 4, 2023 at 2:27 AM Adiga, Anantha <anantha.adiga@intel.com> wrote:
Hi,
Could you please provide guidance on how to diagnose this issue:
In this case there are two Ceph clusters in different locations: cluster A with 4 nodes and
cluster B with 3 nodes. Both are already running RGW multi-site, with A as the master.
CephFS snapshot mirroring is being configured on the clusters, with cluster A as the
primary and cluster B as the peer. The bootstrap import step on the primary node hangs.
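For reference, the sequence being followed is roughly the one from the cephfs-mirroring documentation (site name and token shown as placeholders):

# on both clusters
ceph mgr module enable mirroring
# on the primary (cluster A)
ceph fs snapshot mirror enable cephfs
# on the peer (cluster B): create the remote user and a bootstrap token
ceph fs authorize cephfs client.mirror_remote / rwps
ceph fs snapshot mirror peer_bootstrap create cephfs client.mirror_remote <site-name>
# back on the primary: import the token -- this is the step that hangs
ceph fs snapshot mirror peer_bootstrap import cephfs <token>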
On the target cluster:
---------------------------
"version": "16.2.5",
"release": "pacific",
"release_type": "stable"
root@cr21meg16ba0101:/# ceph fs snapshot mirror peer_bootstrap create cephfs client.mirror_remote flex2-site
{"token":
"eyJmc2lkIjogImE2ZjUyNTk4LWU1Y2QtNGEwOC04NDIyLTdiNmZkYjFkNWRiZSIsICJma
Wxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiw
gInNpdGVfbmFtZSI6ICJmbGV4Mi1zaXRlIiwgImtleSI6ICJBUUNmd01sa005MHBMQkFBd
1h0dnBwOGowNEl2Qzh0cXBBRzliQT09IiwgIm1vbl9ob3N0IjogIlt2MjoxNzIuMTguNTU
uNzE6MzMwMC8wLHYxOjE3Mi4xOC41NS43MTo2Nzg5LzBdIFt2MjoxNzIuMTguNTUuNzM6M
zMwMC8wLHYxOjE3Mi4xOC41NS43Mzo2Nzg5LzBdIn0="}
Seems fine up to here.
root@cr21meg16ba0101:/var/run/ceph#
On the source cluster:
----------------------------
"version": "17.2.6",
"release": "quincy",
"release_type": "stable"
root@fl31ca104ja0201:/# ceph -s
cluster:
id: d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
health: HEALTH_OK
services:
mon: 3 daemons, quorum fl31ca104ja0202,fl31ca104ja0203,fl31ca104ja0201 (age 111m)
mgr: fl31ca104ja0201.nwpqlh(active, since 11h), standbys: fl31ca104ja0203, fl31ca104ja0202
mds: 1/1 daemons up, 2 standby
osd: 44 osds: 44 up (since 111m), 44 in (since 4w)
cephfs-mirror: 1 daemon active (1 hosts)
rgw: 3 daemons active (3 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 25 pools, 769 pgs
objects: 614.40k objects, 1.9 TiB
usage: 2.8 TiB used, 292 TiB / 295 TiB avail
pgs: 769 active+clean
root@fl31ca104ja0302:/# ceph mgr module enable mirroring
module 'mirroring' is already enabled
root@fl31ca104ja0302:/# ceph fs snapshot mirror peer_bootstrap import cephfs
eyJmc2lkIjogImE2ZjUyNTk4LWU1Y2QtNGEwOC04NDIyLTdiNmZkYjFkNWRiZSIsICJmaW
xlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwg
InNpdGVfbmFtZSI6ICJmbGV4Mi1zaXRlIiwgImtleSI6ICJBUUNmd01sa005MHBMQkFBd1
h0dnBwOGowNEl2Qzh0cXBBRzliQT09IiwgIm1vbl9ob3N0IjogIlt2MjoxNzIuMTguNTUu
NzE6MzMwMC8wLHYxOjE3Mi4xOC41NS43MTo2Nzg5LzBdIFt2MjoxNzIuMTguNTUuNzM6Mz
MwMC8wLHYxOjE3Mi4xOC41NS43Mzo2Nzg5LzBdIn0=
Going by your description, I'm guessing this is the command that hangs? If that's
the case, set `debug_mgr=20`, repeat the token import step and share the ceph-mgr log.
Also note that you can check the mirror daemon status as detailed in
https://docs.ceph.com/en/latest/dev/cephfs-mirroring/#mirror-daemon-status
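For reference, the daemon status checks from that page look roughly like the following (the asok name is whatever sits under /var/run/ceph on the mirror daemon host; the filesystem id and peer UUID can be discovered via the socket's help command):

ceph --admin-daemon /var/run/ceph/<cephfs-mirror-asok> help
ceph --admin-daemon /var/run/ceph/<cephfs-mirror-asok> fs mirror status cephfs@<filesystem-id>
ceph --admin-daemon /var/run/ceph/<cephfs-mirror-asok> fs mirror peer status cephfs@<filesystem-id> <peer-uuid>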
root@fl31ca104ja0302:/var/run/ceph# ceph --admin-daemon /var/run/ceph/ceph-client.cephfs-mirror.fl31ca104ja0302.sypagt.7.94083135960976.asok status
{
"metadata": {
"ceph_sha1":
"d7ff0d10654d2280e08f1ab989c7cdf3064446a5",
"ceph_version": "ceph version
17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)",
"entity_id":
"cephfs-mirror.fl31ca104ja0302.sypagt",
"hostname":
"fl31ca104ja0302",
"pid": "7",
"root": "/"
},
"dentry_count": 0,
"dentry_pinned_count": 0,
"id": 5194553,
"inst": {
"name": {
"type": "client",
"num": 5194553
},
"addr": {
"type": "v1",
"addr":
"10.45.129.5:0",
"nonce": 2497002034
}
},
"addr": {
"type": "v1",
"addr": "10.45.129.5:0",
"nonce": 2497002034
},
"inst_str": "client.5194553
10.45.129.5:0/2497002034",
"addr_str":
"10.45.129.5:0/2497002034",
"inode_count": 1,
"mds_epoch": 118,
"osd_epoch": 6266,
"osd_epoch_barrier": 0,
"blocklisted": false,
"fs_name": "cephfs"
}
root@fl31ca104ja0302:/home/general# docker logs ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e-cephfs-mirror-fl31ca104ja0302-sypagt --tail 10
debug 2023-08-03T05:24:27.413+0000 7f8eb6fc0280 0 ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable), process cephfs-mirror, pid 7
debug 2023-08-03T05:24:27.413+0000 7f8eb6fc0280 0 pidfile_write: ignore empty --pid-file
debug 2023-08-03T05:24:27.445+0000 7f8eb6fc0280 1 mgrc service_daemon_register cephfs-mirror.5184622 metadata {arch=x86_64,ceph_release=quincy,ceph_version=ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable),ceph_version_short=17.2.6,container_hostname=fl31ca104ja0302,container_image=quay.io/ceph/ceph@sha256:af79fedafc42237b7612fe2d18a9c64ca62a0b38ab362e614ad671efa4a0547e,cpu=Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz,distro=centos,distro_description=CentOS Stream 8,distro_version=8,hostname=fl31ca104ja0302,id=fl31ca104ja0302.sypagt,instance_id=5184622,kernel_description=#82-Ubuntu SMP Tue Jun 6 23:10:23 UTC 2023,kernel_version=5.15.0-75-generic,mem_swap_kb=8388604,mem_total_kb=527946928,os=Linux}
debug 2023-08-03T05:27:10.419+0000 7f8ea1b2c700 0 client.5194553 ms_handle_reset on v2:10.45.128.141:3300/0
debug 2023-08-03T05:50:10.917+0000 7f8ea1b2c700 0 client.5194553 ms_handle_reset on v2:10.45.128.139:3300/0
Thank you,
Anantha
--
Cheers,
Venky