August 2023 - ceph-users

by Reed Dier

8 months, 1 week

6
10
0 0

Re: cephfs snapshot mirror peer_bootstrap import hung

by Adiga, Anantha

Hi Venky, Could this be the reason that the peer-bootstrap import is hanging? how do I upgrade cephfs-mirror to Quincy? root@fl31ca104ja0201:/# cephfs-mirror --version ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable) root@fl31ca104ja0201:/# ceph version ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable) root@fl31ca104ja0201:/# Thank you, Anantha From: Adiga, Anantha Sent: Monday, August 7, 2023 11:21 AM To: 'Venky Shankar' <vshankar(a)redhat.com>; 'ceph-users(a)ceph.io' <ceph-users(a)ceph.io> Subject: RE: [ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung Hi Venky, I tried on another secondary Quincy cluster and it is the same problem. The peer_bootstrap mport command hangs. root@fl31ca104ja0201:/# ceph fs snapshot mirror peer_bootstrap import cephfs eyJmc2lkIjogIjJlYWMwZWEwLTYwNDgtNDQ0Zi04NGIyLThjZWVmZWQyN2E1YiIsICJmaWxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwgInNpdGVfbmFtZSI6ICJzaGdSLXNpdGUiLCAia2V5IjogIkFRQ0lGdEZrSStTTE5oQUFXbWV6MkRKcEg5ZUdyYnhBOWVmZG9BPT0iLCAibW9uX2hvc3QiOiAiW3YyOjEwLjIzOS4xNTUuMTg6MzMwMC8wLHYxOjEwLjIzOS4xNTUuMTg6Njc4OS8wXSBbdjI6MTAuMjM5LjE1NS4xOTozMzAwLzAsdjE6MTAuMjM5LjE1NS4xOTo2Nzg5LzBdIFt2MjoxMC4yMzkuMTU1LjIwOjMzMDAvMCx2MToxMC4yMzkuMTU1LjIwOjY3ODkvMF0ifQ== …… ……. ..command does not complete..waits here ^C to exit. Thereafter some commands do not complete… root@fl31ca104ja0201:/# ceph -s cluster: id: d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e health: HEALTH_OK services: mon: 3 daemons, quorum fl31ca104ja0202,fl31ca104ja0203,fl31ca104ja0201 (age 2d) mgr: fl31ca104ja0201.kkoono(active, since 3d), standbys: fl31ca104ja0202, fl31ca104ja0203 mds: 1/1 daemons up, 2 standby osd: 44 osds: 44 up (since 2d), 44 in (since 5w) cephfs-mirror: 1 daemon active (1 hosts) rgw: 3 daemons active (3 hosts, 1 zones) data: volumes: 1/1 healthy pools: 25 pools, 769 pgs objects: 614.40k objects, 1.9 TiB usage: 2.9 TiB used, 292 TiB / 295 TiB avail pgs: 769 active+clean io: client: 32 KiB/s rd, 0 B/s wr, 33 op/s rd, 1 op/s wr root@fl31ca104ja0201:/# root@fl31ca104ja0201:/# ceph fs status cephfs This command also waits. …… I have attached the mgr log root@fl31ca104ja0201:/# ceph service status { "cephfs-mirror": { "5306346": { "status_stamp": "2023-08-07T17:35:56.884907+0000", "last_beacon": "2023-08-07T17:45:01.903540+0000", "status": { "status_json": "{\"1\":{\"name\":\"cephfs\",\"directory_count\":0,\"peers\":{}}}" } } Quincy secondary cluster root@a001s008-zz14l47008:/# ceph mgr module enable mirroring root@a001s008-zz14l47008:/# ceph fs authorize cephfs client.mirror_remote / rwps [client.mirror_remote] key = AQCIFtFkI+SLNhAAWmez2DJpH9eGrbxA9efdoA== root@a001s008-zz14l47008:/# ceph auth get client.mirror_remote [client.mirror_remote] key = AQCIFtFkI+SLNhAAWmez2DJpH9eGrbxA9efdoA== caps mds = "allow rwps fsname=cephfs" caps mon = "allow r fsname=cephfs" caps osd = "allow rw tag cephfs data=cephfs" root@a001s008-zz14l47008:/# root@a001s008-zz14l47008:/# ceph fs snapshot mirror peer_bootstrap create cephfs client.mirror_remote shgR-site {"token": "eyJmc2lkIjogIjJlYWMwZWEwLTYwNDgtNDQ0Zi04NGIyLThjZWVmZWQyN2E1YiIsICJmaWxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwgInNpdGVfbmFtZSI6ICJzaGdSLXNpdGUiLCAia2V5IjogIkFRQ0lGdEZrSStTTE5oQUFXbWV6MkRKcEg5ZUdyYnhBOWVmZG9BPT0iLCAibW9uX2hvc3QiOiAiW3YyOjEwLjIzOS4xNTUuMTg6MzMwMC8wLHYxOjEwLjIzOS4xNTUuMTg6Njc4OS8wXSBbdjI6MTAuMjM5LjE1NS4xOTozMzAwLzAsdjE6MTAuMjM5LjE1NS4xOTo2Nzg5LzBdIFt2MjoxMC4yMzkuMTU1LjIwOjMzMDAvMCx2MToxMC4yMzkuMTU1LjIwOjY3ODkvMF0ifQ=="} root@a001s008-zz14l47008:/# Thank you, Anantha From: Adiga, Anantha Sent: Friday, August 4, 2023 11:55 AM To: Venky Shankar <vshankar(a)redhat.com<mailto:vshankar@redhat.com>>; ceph-users(a)ceph.io<mailto:ceph-users@ceph.io> Subject: RE: [ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung Hi Venky, Thank you so much for the guidance. Attached is the mgr log. Note: the 4th node in the primary cluster has smaller capacity drives, the other 3 nodes have the larger capacity drives. 32 ssd 6.98630 1.00000 7.0 TiB 44 GiB 44 GiB 183 KiB 148 MiB 6.9 TiB 0.62 0.64 40 up osd.32 -7 76.84927 - 77 TiB 652 GiB 648 GiB 20 MiB 3.0 GiB 76 TiB 0.83 0.86 - host fl31ca104ja0203 1 ssd 6.98630 1.00000 7.0 TiB 73 GiB 73 GiB 8.0 MiB 333 MiB 6.9 TiB 1.02 1.06 54 up osd.1 4 ssd 6.98630 1.00000 7.0 TiB 77 GiB 77 GiB 1.1 MiB 174 MiB 6.9 TiB 1.07 1.11 55 up osd.4 7 ssd 6.98630 1.00000 7.0 TiB 47 GiB 47 GiB 140 KiB 288 MiB 6.9 TiB 0.66 0.68 51 up osd.7 10 ssd 6.98630 1.00000 7.0 TiB 75 GiB 75 GiB 299 KiB 278 MiB 6.9 TiB 1.05 1.09 44 up osd.10 13 ssd 6.98630 1.00000 7.0 TiB 94 GiB 94 GiB 1018 KiB 291 MiB 6.9 TiB 1.31 1.36 72 up osd.13 16 ssd 6.98630 1.00000 7.0 TiB 31 GiB 31 GiB 163 KiB 267 MiB 7.0 TiB 0.43 0.45 49 up osd.16 19 ssd 6.98630 1.00000 7.0 TiB 14 GiB 14 GiB 756 KiB 333 MiB 7.0 TiB 0.20 0.21 50 up osd.19 22 ssd 6.98630 1.00000 7.0 TiB 105 GiB 104 GiB 1.3 MiB 313 MiB 6.9 TiB 1.46 1.51 48 up osd.22 25 ssd 6.98630 1.00000 7.0 TiB 17 GiB 16 GiB 257 KiB 272 MiB 7.0 TiB 0.23 0.24 45 up osd.25 28 ssd 6.98630 1.00000 7.0 TiB 72 GiB 72 GiB 6.1 MiB 180 MiB 6.9 TiB 1.01 1.05 43 up osd.28 31 ssd 6.98630 1.00000 7.0 TiB 47 GiB 46 GiB 592 KiB 358 MiB 6.9 TiB 0.65 0.68 56 up osd.31 -9 64.04089 - 64 TiB 728 GiB 726 GiB 17 MiB 1.8 GiB 63 TiB 1.11 1.15 - host fl31ca104ja0302 33 ssd 5.82190 1.00000 5.8 TiB 65 GiB 65 GiB 245 KiB 144 MiB 5.8 TiB 1.09 1.13 47 up osd.33 34 ssd 5.82190 1.00000 5.8 TiB 14 GiB 14 GiB 815 KiB 83 MiB 5.8 TiB 0.24 0.25 55 up osd.34 35 ssd 5.82190 1.00000 5.8 TiB 77 GiB 77 GiB 224 KiB 213 MiB 5.7 TiB 1.30 1.34 44 up osd.35 36 ssd 5.82190 1.00000 5.8 TiB 117 GiB 117 GiB 8.5 MiB 284 MiB 5.7 TiB 1.96 2.03 52 up osd.36 37 ssd 5.82190 1.00000 5.8 TiB 58 GiB 58 GiB 501 KiB 132 MiB 5.8 TiB 0.98 1.01 40 up osd.37 38 ssd 5.82190 1.00000 5.8 TiB 123 GiB 123 GiB 691 KiB 266 MiB 5.7 TiB 2.07 2.14 73 up osd.38 39 ssd 5.82190 1.00000 5.8 TiB 77 GiB 77 GiB 609 KiB 193 MiB 5.7 TiB 1.30 1.34 62 up osd.39 40 ssd 5.82190 1.00000 5.8 TiB 77 GiB 77 GiB 262 KiB 148 MiB 5.7 TiB 1.29 1.34 55 up osd.40 41 ssd 5.82190 1.00000 5.8 TiB 44 GiB 44 GiB 4.4 MiB 140 MiB 5.8 TiB 0.75 0.77 44 up osd.41 42 ssd 5.82190 1.00000 5.8 TiB 45 GiB 45 GiB 886 KiB 135 MiB 5.8 TiB 0.75 0.78 47 up osd.42 43 ssd 5.82190 1.00000 5.8 TiB 28 GiB 28 GiB 187 KiB 104 MiB 5.8 TiB 0.48 0.49 58 up osd.43 [Also: Yesterday I had two cfs-mirror running one on fl31ca104ja0201 and fl31ca104ja0302. The cfs-mirror on fl31ca104ja0201 was stopped. When the import token was run on fl31ca104ja0302, the cfs-mirror log was active. Just in case it is useful I have attached that log (cfsmirror-container.log) as well. ] How can I list the token on the target cluster after running the create peer_bootstrap command? Here is today’s status with your suggestion: There is only one cfs-mirror daemon running now. It is on fl31ca104ja0201 node. root@fl31ca104ja0201:/# ceph -s cluster: id: d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e health: HEALTH_OK services: mon: 3 daemons, quorum fl31ca104ja0202,fl31ca104ja0203,fl31ca104ja0201 (age 7m) mgr: fl31ca104ja0201.kkoono(active, since 13m), standbys: fl31ca104ja0202, fl31ca104ja0203 mds: 1/1 daemons up, 2 standby osd: 44 osds: 44 up (since 7m), 44 in (since 4w) cephfs-mirror: 1 daemon active (1 hosts) rgw: 3 daemons active (3 hosts, 1 zones) data: volumes: 1/1 healthy pools: 25 pools, 769 pgs objects: 614.40k objects, 1.9 TiB usage: 2.8 TiB used, 292 TiB / 295 TiB avail pgs: 769 active+clean io: client: 32 MiB/s rd, 0 B/s wr, 57 op/s rd, 1 op/s wr root@fl31ca104ja0201:/# root@fl31ca104ja0201:/# root@fl31ca104ja0201:/# ceph tell mgr.fl31ca104ja0201.kkoono config set debug_mgr 20 { "success": "" } root@fl31ca104ja0201:/# ceph fs snapshot mirror peer_bootstrap import cephfs eyJmc2lkIjogImE2ZjUyNTk4LWU1Y2QtNGEwOC04NDIyLTdiNmZkYjFkNWRiZSIsICJmaWxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwgInNpdGVfbmFtZSI6ICJmbGV4Mi1zaXRlIiwgImtleSI6ICJBUUNmd01sa005MHBMQkFBd1h0dnBwOGowNEl2Qzh0cXBBRzliQT09IiwgIm1vbl9ob3N0IjogIlt2MjoxNzIuMTguNTUuNzE6MzMwMC8wLHYxOjE3Mi4xOC41NS43MTo2Nzg5LzBdIFt2MjoxNzIuMTguNTUuNzM6MzMwMC8wLHYxOjE3Mi4xOC41NS43Mzo2Nzg5LzBdIn0= ^CInterrupted Ctrl-C after 15 min. Once the command is run, the health status goes to WARN . root@fl31ca104ja0201:/# ceph -s cluster: id: d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e health: HEALTH_WARN 6 slow ops, oldest one blocked for 1095 sec, mon.fl31ca104ja0203 has slow ops services: mon: 3 daemons, quorum fl31ca104ja0202,fl31ca104ja0203,fl31ca104ja0201 (age 30m) mgr: fl31ca104ja0201.kkoono(active, since 35m), standbys: fl31ca104ja0202, fl31ca104ja0203 mds: 1/1 daemons up, 2 standby osd: 44 osds: 44 up (since 29m), 44 in (since 4w) cephfs-mirror: 1 daemon active (1 hosts) rgw: 3 daemons active (3 hosts, 1 zones) data: volumes: 1/1 healthy pools: 25 pools, 769 pgs objects: 614.40k objects, 1.9 TiB usage: 2.8 TiB used, 292 TiB / 295 TiB avail pgs: 769 active+clean io: client: 67 KiB/s rd, 0 B/s wr, 68 op/s rd, 21 op/s wr -----Original Message----- From: Venky Shankar <vshankar(a)redhat.com<mailto:vshankar@redhat.com>> Sent: Thursday, August 3, 2023 11:03 PM To: Adiga, Anantha <anantha.adiga(a)intel.com<mailto:anantha.adiga@intel.com>> Cc: ceph-users(a)ceph.io<mailto:ceph-users@ceph.io> Subject: [ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung Hi Anantha, On Fri, Aug 4, 2023 at 2:27 AM Adiga, Anantha <anantha.adiga(a)intel.com<mailto:anantha.adiga@intel.com>> wrote: > > Hi > > Could you please provide guidance on how to diagnose this issue: > > In this case, there are two Ceph clusters: cluster A, 4 nodes and cluster B, 3 node, in different locations. Both are already running RGW multi-site, A is master. > > Cephfs snapshot mirroring is being configured on the clusters. Cluster A is the primary, cluster B is the peer. Cephfs snapshot mirroring is being configured. The bootstrap import step on the primary node hangs. > > On the target cluster : > --------------------------- > "version": "16.2.5", > "release": "pacific", > "release_type": "stable" > > root@cr21meg16ba0101:/# ceph fs snapshot mirror peer_bootstrap create > cephfs client.mirror_remote flex2-site > {"token": > "eyJmc2lkIjogImE2ZjUyNTk4LWU1Y2QtNGEwOC04NDIyLTdiNmZkYjFkNWRiZSIsICJma > Wxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiw > gInNpdGVfbmFtZSI6ICJmbGV4Mi1zaXRlIiwgImtleSI6ICJBUUNmd01sa005MHBMQkFBd > 1h0dnBwOGowNEl2Qzh0cXBBRzliQT09IiwgIm1vbl9ob3N0IjogIlt2MjoxNzIuMTguNTU > uNzE6MzMwMC8wLHYxOjE3Mi4xOC41NS43MTo2Nzg5LzBdIFt2MjoxNzIuMTguNTUuNzM6M > zMwMC8wLHYxOjE3Mi4xOC41NS43Mzo2Nzg5LzBdIn0="} Seems fine uptil here. > root@cr21meg16ba0101:/var/run/ceph# > > On the source cluster: > ---------------------------- > "version": "17.2.6", > "release": "quincy", > "release_type": "stable" > > root@fl31ca104ja0201:/# ceph -s > cluster: > id: d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e > health: HEALTH_OK > > services: > mon: 3 daemons, quorum fl31ca104ja0202,fl31ca104ja0203,fl31ca104ja0201 (age 111m) > mgr: fl31ca104ja0201.nwpqlh(active, since 11h), standbys: fl31ca104ja0203, fl31ca104ja0202 > mds: 1/1 daemons up, 2 standby > osd: 44 osds: 44 up (since 111m), 44 in (since 4w) > cephfs-mirror: 1 daemon active (1 hosts) > rgw: 3 daemons active (3 hosts, 1 zones) > > data: > volumes: 1/1 healthy > pools: 25 pools, 769 pgs > objects: 614.40k objects, 1.9 TiB > usage: 2.8 TiB used, 292 TiB / 295 TiB avail > pgs: 769 active+clean > > root@fl31ca104ja0302:/# ceph mgr module enable mirroring module > 'mirroring' is already enabled root@fl31ca104ja0302:/# ceph fs > snapshot mirror peer_bootstrap import cephfs > eyJmc2lkIjogImE2ZjUyNTk4LWU1Y2QtNGEwOC04NDIyLTdiNmZkYjFkNWRiZSIsICJmaW > xlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwg > InNpdGVfbmFtZSI6ICJmbGV4Mi1zaXRlIiwgImtleSI6ICJBUUNmd01sa005MHBMQkFBd1 > h0dnBwOGowNEl2Qzh0cXBBRzliQT09IiwgIm1vbl9ob3N0IjogIlt2MjoxNzIuMTguNTUu > NzE6MzMwMC8wLHYxOjE3Mi4xOC41NS43MTo2Nzg5LzBdIFt2MjoxNzIuMTguNTUuNzM6Mz > MwMC8wLHYxOjE3Mi4xOC41NS43Mzo2Nzg5LzBdIn0= Going by your description, I'm guessing this is the command that hangs? If that's the case, set `debug_mgr=20`, repeat the token import step and share the ceph-mgr log. Also note that you can check the mirror daemon status as detailed in https://docs.ceph.com/en/latest/dev/cephfs-mirroring/#mirror-daemon-status > > > root@fl31ca104ja0302:/var/run/ceph# ceph --admin-daemon > /var/run/ceph/ceph-client.cephfs-mirror.fl31ca104ja0302.sypagt.7.94083135960976.asok status { > "metadata": { > "ceph_sha1": "d7ff0d10654d2280e08f1ab989c7cdf3064446a5", > "ceph_version": "ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)", > "entity_id": "cephfs-mirror.fl31ca104ja0302.sypagt", > "hostname": "fl31ca104ja0302", > "pid": "7", > "root": "/" > }, > "dentry_count": 0, > "dentry_pinned_count": 0, > "id": 5194553, > "inst": { > "name": { > "type": "client", > "num": 5194553 > }, > "addr": { > "type": "v1", > "addr": "10.45.129.5:0", > "nonce": 2497002034 > } > }, > "addr": { > "type": "v1", > "addr": "10.45.129.5:0", > "nonce": 2497002034 > }, > "inst_str": "client.5194553 10.45.129.5:0/2497002034", > "addr_str": "10.45.129.5:0/2497002034", > "inode_count": 1, > "mds_epoch": 118, > "osd_epoch": 6266, > "osd_epoch_barrier": 0, > "blocklisted": false, > "fs_name": "cephfs" > } > > root@fl31ca104ja0302:/home/general# docker logs > ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e-cephfs-mirror-fl31ca104ja030 > 2-sypagt --tail 10 debug 2023-08-03T05:24:27.413+0000 7f8eb6fc0280 0 > ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy > (stable), process cephfs-mirror, pid 7 debug > 2023-08-03T05:24:27.413+0000 7f8eb6fc0280 0 pidfile_write: ignore > empty --pid-file debug 2023-08-03T05:24:27.445+0000 7f8eb6fc0280 1 > mgrc service_daemon_register cephfs-mirror.5184622 metadata > {arch=x86_64,ceph_release=quincy,ceph_version=ceph version 17.2.6 > (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy > (stable),ceph_version_short=17.2.6,container_hostname=fl31ca104ja0302, > container_image=quay.io/ceph/ceph@sha256:af79fedafc42237b7612fe2d18a9c<mailto:container_image=quay.io/ceph/ceph@sha256:af79fedafc42237b7612fe2d18a9c> > 64ca62a0b38ab362e614ad671efa4a0547e,cpu=Intel(R) Xeon(R) Gold 6252 CPU > @ 2.10GHz,distro=centos,distro_description=CentOS Stream > 8,distro_version=8,hostname=fl31ca104ja0302,id=fl31ca104ja0302.sypagt, > instance_id=5184622,kernel_description=#82-Ubuntu SMP Tue Jun 6 > 23:10:23 UTC > 2023,kernel_version=5.15.0-75-generic,mem_swap_kb=8388604,mem_total_kb > =527946928,os=Linux} debug 2023-08-03T05:27:10.419+0000 7f8ea1b2c700 > 0 client.5194553 ms_handle_reset on v2:10.45.128.141:3300/0 debug > 2023-08-03T05:50:10.917+0000 7f8ea1b2c700 0 client.5194553 > ms_handle_reset on v2:10.45.128.139:3300/0 > > Thank you, > Anantha > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io<mailto:ceph-users@ceph.io> To unsubscribe send an > email to ceph-users-leave(a)ceph.io<mailto:ceph-users-leave@ceph.io> > -- Cheers, Venky _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io<mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-leave(a)ceph.io<mailto:ceph-users-leave@ceph.io>

8 months, 1 week

2
4
0 0

v16.2.14 Pacific released

by Yuri Weinstein

We're happy to announce the 14th backport release in the Pacific series. https://ceph.io/en/news/blog/2023/v16-2-14-pacific-released/ Notable Changes --------------- * CEPHFS: After recovering a Ceph File System post following the disaster recovery procedure, the recovered files under lost+found directory can now be deleted. * ceph mgr dump command now displays the name of the mgr module that registered a RADOS client in the name field added to elements of the active_clients array. Previously, only the address of a module's RADOS client was shown in the active_clients array. Getting Ceph ------------ * Git at git://github.com/ceph/ceph.git * Tarball at https://download.ceph.com/tarballs/ceph-16.2.14.tar.gz * Containers at https://quay.io/repository/ceph/ceph * For packages, see https://docs.ceph.com/en/latest/install/get-packages/ * Release git sha1: 238ba602515df21ea7ffc75c88db29f9e5ef12c9

8 months, 1 week

1
0
0 0

Multisite RGW setup not working when following the docs step by step

by Petr Bena

Hello, My goal is to setup multisite RGW with 2 separate CEPH clusters in separate datacenters, where RGW data are being replicated. I created a lab for this purpose in both locations (with latest reef ceph installed using cephadm) and tried to follow this guide: https://docs.ceph.com/en/reef/radosgw/multisite/ Unfortunatelly, even after multiple attempts it always failed when creating a secondary zone. I could succesfully pull the realm from master, but that was pretty much the last trully succesful step. I can notice that immediately after pulling the realm to secondary, radosgw-admin user list returns an empty list (which IMHO should contain replicated user list from master). Continuing by setting default real and zonegroup and creating the secondary zone in secondary cluster I end up having 2 zones in each cluster, both seemingly in same zonegroup, but with replication failing - this is what I see in sync status: (master) [ceph: root@ceph-lab-brn-01 /]# radosgw-admin sync status realm d2c4ebf9-e156-4c4e-9d56-3fff6a652e75 (ceph) zonegroup abc3c0ae-a84d-48d4-8e78-da251eb78781 (cz) zone 97fb5842-713a-4995-8966-5afe1384f17f (cz-brn) current time 2023-08-30T12:58:12Z zonegroup features enabled: resharding disabled: compress-encrypted metadata sync no sync (zone is master) 2023-08-30T12:58:13.991+0000 7f583a52c780 0 ERROR: failed to fetch datalog info data sync source: 13a8c663-b241-4d8a-a424-8785fc539ec5 (cz-hol) failed to retrieve sync info: (13) Permission denied (secondary) [ceph: root@ceph-lab-hol-01 /]# radosgw-admin sync status realm d2c4ebf9-e156-4c4e-9d56-3fff6a652e75 (ceph) zonegroup abc3c0ae-a84d-48d4-8e78-da251eb78781 (cz) zone 13a8c663-b241-4d8a-a424-8785fc539ec5 (cz-hol) current time 2023-08-30T12:58:54Z zonegroup features enabled: resharding disabled: compress-encrypted metadata sync failed to read sync status: (2) No such file or directory 2023-08-30T12:58:55.617+0000 7ff37c9db780 0 ERROR: failed to fetch datalog info data sync source: 97fb5842-713a-4995-8966-5afe1384f17f (cz-brn) failed to retrieve sync info: (13) Permission denied In master there is one user created during the process (synchronization-user), on slave there are no users and when I try to re-create this synchronization user it complains I shouldn't even try and instead execute the command on master. I can see same realm and zonegroup IDs on both sides, zone list is different though: (master) [ceph: root@ceph-lab-brn-01 /]# radosgw-admin zone list { "default_info": "97fb5842-713a-4995-8966-5afe1384f17f", "zones": [ "cz-brn", "default" ] } (secondary) [ceph: root@ceph-lab-hol-01 /]# radosgw-admin zone list { "default_info": "13a8c663-b241-4d8a-a424-8785fc539ec5", "zones": [ "cz-hol", "default" ] } The permission denied error is puzzling me - could it be because real pull didn't sync the users? I tried this multiple times with clean ceph install on both sides - and always ended up the same. I even tried force creating the same user with same secrets on the other side, but it didn't help. How can I debug what kind of secret is secondary trying to use when communicating with master? Could it be that this multisite RGW setup is not yet truly supported in reef? I noticed that the documentation itself seems written for older ceph versions, as there are no mentions about orchestrator (for example in steps where configuration files of RGW need to be edited, which is done differently when using cephadm). I think that documentation is simply wrong at this time. Either it's missing some crucial steps, or it's outdated or otherwise unclear - simply by following all the steps as outlined there, you are likely to end up the same. Thanks for help!

8 months, 1 week

2
1
0 0

radosgw mulsite multi zone configuration: current period realm name not same as in zonegroup

by Adiga, Anantha

Hi, I have a multi zone configuration with 4 zones. While adding a secondary zone, getting this error: root@cs17ca101ja0702:/# radosgw-admin realm pull --rgw-realm=global --url=http://10.45.128.139:8080 --default --access-key=sync_user --secret=sync_secret request failed: (13) Permission denied If the realm has been changed on the master zone, the master zone's gateway may need to be restarted to recognize this user. root@cs17ca101ja0702:/# The realm name is "global". Is the cause of the error due to the primary cluster having a current period listing the realm name as "default" instead of "global" ? However, the realm id is of realm "global" AND the zonegroup does not list realm name but has the correct realm id. See below. How to fix this issue. root@fl31ca104ja0201:/# radosgw-admin realm get { "id": "3da7b5ea-c44b-4d44-aced-fae2aabce97b", "name": "global", "current_period": "b8bc1187-2a2d-4d9e-b7be-c4f4667e3fa6", "epoch": 2 } root@fl31ca104ja0201:/# radosgw-admin realm get --rgw-realm=global { "id": "3da7b5ea-c44b-4d44-aced-fae2aabce97b", "name": "global", "current_period": "b8bc1187-2a2d-4d9e-b7be-c4f4667e3fa6", "epoch": 2 } root@fl31ca104ja0201:/# radosgw-admin zonegroup list { "default_info": "ec8b68db-1900-464f-a21a-2f6e8c107e94", "zonegroups": [ "alldczg" ] } root@fl31ca104ja0201:/# radosgw-admin zonegroup get --rgw-zonegroup=alldczg { "id": "ec8b68db-1900-464f-a21a-2f6e8c107e94", "name": "alldczg", "api_name": "alldczg", "is_master": "true", "endpoints": [ http://10.45.128.139:8080, http://172.18.55.71:8080, http://10.239.155.23:8080 ], "hostnames": [], "hostnames_s3website": [], "master_zone": "ae267592-7cd8-4d67-8792-adc57d104cd6", "zones": [ { "id": "0962f0b4-beb6-4d07-a64d-07046b81529e", "name": "CRsite", "endpoints": [ http://172.18.55.71:8080 ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 11, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" }, { "id": "9129d118-55ac-4859-b339-b8afe0793a80", "name": "BArga", "endpoints": [ http://10.208.11.26:8080 ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 11, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" }, { "id": "ae267592-7cd8-4d67-8792-adc57d104cd6", "name": "ORflex2", "endpoints": [ http://10.45.128.139:8080 ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 11, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" }, { "id": "f5edeb4b-2a37-413b-8587-0ff40d7647ea", "name": "SHGrasp", "endpoints": [ http://10.239.155.23:8080 ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 11, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" } ], "placement_targets": [ { "name": "default-placement", "tags": [], "storage_classes": [ "STANDARD" ] } ], "default_placement": "default-placement", "realm_id": "3da7b5ea-c44b-4d44-aced-fae2aabce97b", "sync_policy": { "groups": [] } } root@fl31ca104ja0201:/# radosgw-admin period get-current { "current_period": "b8bc1187-2a2d-4d9e-b7be-c4f4667e3fa6" } root@fl31ca104ja0201:/# radosgw-admin period get { "id": "b8bc1187-2a2d-4d9e-b7be-c4f4667e3fa6", "epoch": 42, "predecessor_uuid": "2df86f9a-d267-4b52-a13b-def8e5e612a2", "sync_status": [], "period_map": { "id": "b8bc1187-2a2d-4d9e-b7be-c4f4667e3fa6", "zonegroups": [ { "id": "ec8b68db-1900-464f-a21a-2f6e8c107e94", "name": "alldczg", "api_name": "alldczg", "is_master": "true", "endpoints": [ http://10.45.128.139:8080, http://172.18.55.71:8080, http://10.239.155.23:8080 ], "hostnames": [], "hostnames_s3website": [], "master_zone": "ae267592-7cd8-4d67-8792-adc57d104cd6", "zones": [ { "id": "0962f0b4-beb6-4d07-a64d-07046b81529e", "name": "CRsite", "endpoints": [ http://172.18.55.71:8080 ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 11, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" }, { "id": "9129d118-55ac-4859-b339-b8afe0793a80", "name": "BArga", "endpoints": [ http://10.208.11.26:8080 ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 11, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" }, { "id": "ae267592-7cd8-4d67-8792-adc57d104cd6", "name": "ORflex2", "endpoints": [ http://10.45.128.139:8080 ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 11, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" }, { "id": "f5edeb4b-2a37-413b-8587-0ff40d7647ea", "name": "SHGrasp", "endpoints": [ http://10.239.155.23:8080 ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 11, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" } ], "placement_targets": [ { "name": "default-placement", "tags": [], "storage_classes": [ "STANDARD" ] } ], "default_placement": "default-placement", "realm_id": "3da7b5ea-c44b-4d44-aced-fae2aabce97b", "sync_policy": { "groups": [] } } ], "short_zone_ids": [ { "key": "0962f0b4-beb6-4d07-a64d-07046b81529e", "val": 890360385 }, { "key": "9129d118-55ac-4859-b339-b8afe0793a80", "val": 3855275770 }, { "key": "ae267592-7cd8-4d67-8792-adc57d104cd6", "val": 1499747631 }, { "key": "f5edeb4b-2a37-413b-8587-0ff40d7647ea", "val": 815709635 } ] }, "master_zonegroup": "ec8b68db-1900-464f-a21a-2f6e8c107e94", "master_zone": "ae267592-7cd8-4d67-8792-adc57d104cd6", "period_config": { "bucket_quota": { "enabled": false, "check_on_raw": false, "max_size": -1, "max_size_kb": 0, "max_objects": -1 }, "user_quota": { "enabled": false, "check_on_raw": false, "max_size": -1, "max_size_kb": 0, "max_objects": -1 } }, "realm_id": "3da7b5ea-c44b-4d44-aced-fae2aabce97b", "realm_name": "default", "realm_epoch": 2 } root@fl31ca104ja0201:/#

8 months, 1 week

1
1
0 0

CLT Meeting minutes 2023-08-30

by Nizamudeen A

Hello, Finish v18.2.0 upgrade on LRC? It seems to be running v18.1.3 not much of a difference in code commits news on teuthology jobs hanging? cephfs issues because of network troubles Its resolved by Patrick User council discussion follow-up Detailed info on this pad: https://pad.ceph.com/p/user_dev_relaunch First topic will come from David's team 16.2.14 release Pushing to release by this week. Regards, Nizam -- Nizamudeen A Software Engineer Red Hat <https://www.redhat.com/> <https://www.redhat.com/>

8 months, 1 week

1
0
0 0

Is there any way to fine tune peering/pg relocation/rebalance?

by Szabo, Istvan (Agoda)

Hello, Is there a way to somehow fine tune the rebalance even further than basic tuning steps when adding new osds? Today I've added some osd to the index pool and it generated many slow ops due to OSD op latency increase + read operation latency increase = high put get latency. https://ibb.co/album/9mN6GQ osd max backfill, max recovery, recovery ops priority are 1. 1 nvme drive has 4 osd, each osd has around 80pg. The steps how I add the osds: 1. Set norebalance 2. add the osds 3. wait for peering 4. unset rebalance It takes like 15-20 mins to became normal without interrupting the rebalance the user traffic. Thank you, Istvan ________________________________ This message is confidential and is for the sole use of the intended recipient(s). It may also be privileged or otherwise protected by copyright or other legal rules. If you have received it by mistake please let us know by reply email and delete it from your system. It is prohibited to copy this message or disclose its content to anyone. Any confidentiality or privilege is not waived or lost by any mistaken delivery or unauthorized disclosure of the message. All messages sent to and from Agoda may be monitored to ensure compliance with company policies, to protect the company's interests and to remove potential malware. Electronic messages may be intercepted, amended, lost or deleted, or contain viruses.

8 months, 1 week

3
5
0 0

hardware setup recommendations wanted

by Kai Zimmer

Dear listers, my employer already has a production Ceph cluster running but we need a second one. I just wanted to ask your opininion on the following setup. It is planned for 500 TB net capacity, expandable to 2 PB. I expect the number of OSD servers to double in the next 4 years. Erasure Code 3:2 will be used for OSDs. Usage will be file storage, Rados block devices and S3: 5x OSD servers (12x18 TB Toshiba MG09SCA18TE SAS spinning disks for data, 2x512 GB Samsung PM9A1 M.2 NVME SSD 0,55 DWPD for system, 1xAMD 7313P CPU with 16 cores @3GHz, 256 GB RAM, LSI SAS 9500 HBA, Broadcom P425G network adapter 4x25 Gbit/s) 3x MON servers (1x2 TB Samsung PM9A1 M.2 NVME SSD 0,55 DWPD for system, 2x1.6TB Kioxia CD6-V SSD 3.0 DWPD for data, 2x Broadcom P210/N210 network 4x10 GBit/s, 1xAMD 7232P CPU with 8 cores @3.1 GHz, 64 GB RAM) 3x MDS servers (1x2 TB Samsung PM9A1 M.2 NVME SSD 0,55 DWPD for system, 2x1.6 TB Kioxia CD6-V SSD 3.0 DWPD for data, 2x Broadcom P210/N210 network 4x10 GBit/s, 1xAMD 7313P CPU with 16 cores @3 GHz, 128 GB RAM) OSD servers will be connected via 2x25 GBit fibre interfaces "backend" to 2x Mikrotik CRS518-16XS-2XQ (which are connected for high-availability via 100 GBit) For the "frontend" connection to servers/clients via 2x10 GBit we're looking into 3x Mikrotik CRS326-24S+2Q+RM (which are connected for high-availability via 40 GBit) Especially for the "frontend" switches i'm looking for alternatives. Currently we use Huawei C6810-32T16A4Q-LI models with 2x33 LACP connections connected via 10 GBit/s RJ45. But those had ports blocking after a number of errors which resulted in some trouble. We'd like to avoid IOS and clones in general and would prefer a decent web interface. Any comments/recommendations? Best regards, Kai

8 months, 1 week

2
1
0 0

two ways of adding OSDs? LVN vs ceph orch daemon add

by Giuliano Maggi

Hi, I am learning about Ceph, and I found this two ways of adding OSDs: https://docs.ceph.com/en/quincy/install/manual-deployment/#short-form <https://docs.ceph.com/en/quincy/install/manual-deployment/#short-form> (via LVM) AND https://docs.ceph.com/en/quincy/cephadm/services/osd/#creating-new-osds <https://docs.ceph.com/en/quincy/cephadm/services/osd/#creating-new-osds> (ceph orch daemon add osd *<host>*:*<device-path>*) Are these two ways equivalents? Thanks, Giuliano,

8 months, 1 week

2
1
0 0

Reef - what happened to OSD spec?

by Nigel Williams

We upgraded to Reef from Quincy, all went smoothly (thanks Ceph developers!) When adding OSDs, the process seems to have changed, the docs no longer mention OSD spec, and giving it a try it fails when it bumps into the root drive (which has an active LVM). I expect I can add a filter to avoid it. But is using the OSD spec ( https://docs.ceph.com/en/octopus/cephadm/drivegroups/) approach now deprecated? Is the web-interface now the preferred way? thanks.

8 months, 1 week

2
2
0 0

2024

2023

2022

2021

2020

2019

ceph-users August 2023