Hi everyone,
I figure it's time to pull in more brain power on this one. We had an NVMe mostly die in one of our monitor hosts, which caused write latency on that machine to spike. Ceph did the RightThing(tm): once that monitor fell out of quorum it was simply ignored. I pulled the bad drive out of the array and tried to bring the mon and mgr back in (our monitors double-duty as managers).
The manager came up with zero problems, but the monitor got stuck probing.
I removed the bad host from the monmap and stood up a new one on an OSD node to get back to 3 active. That new node added perfectly using the same methods I've tried on the old one.
Network appears to be clean between all hosts. Packet captures show them chatting just fine. Since we are getting ready to upgrade from RHEL7 to RHEL8 I took this as an opportunity to reinstall the monitor as an 8 box to get that process rolling. Box is now on RHEL8 with no changes to how ceph-mon is acting.
I install machines with a kickstart and use our own ansible roles to get it 95% into service. I then follow the manual install instructions (https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#adding-mon…).
Time is in sync, /var/lib/ceph/mon/* is owned by the right UID, keys are in sync, configs are in sync. I pulled the old mon out of "mon initial members" and "mon host". `nc` can talk to all the ports in question and we've tried it with firewalld off as well (ditto with selinux). Cleaned up some stale DNS and even tried a different IP (same DNS name). I started all of this with 14.2.12 but .13 was released while debugging so I've got that on the broken monitor at the moment.
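For anyone following along, the steps from that doc page boil down to roughly the following (paths and ordering reproduced from memory, so treat this as a sketch rather than the exact commands I typed):

  # grab the current mon keyring and monmap from the cluster
  ceph auth get mon. -o /tmp/mon.keyring
  ceph mon getmap -o /tmp/monmap

  # build the new mon's data directory as the ceph user
  sudo -u ceph mkdir -p /var/lib/ceph/mon/ceph-ceph-mon-02
  sudo -u ceph ceph-mon --cluster ceph -i ceph-mon-02 --mkfs \
      --monmap /tmp/monmap --keyring /tmp/mon.keyring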
I manually start the daemon in debug mode (/usr/bin/ceph-mon -d --cluster ceph --id ceph-mon-02 --setuser ceph --setgroup ceph) until it has joined, then use the systemd unit to start it once it's clean. The current state is:
(Lightly sanitized output)
:snip:
2020-11-04 11:38:57.049 7f4232fb3540 0 mon.ceph-mon-02 does not exist in monmap, will attempt to join an existing cluster
2020-11-04 11:38:57.049 7f4232fb3540 0 using public_addr v2:Num.64:0/0 -> [v2:Num.64:3300/0,v1:Num.64:6789/0]
2020-11-04 11:38:57.050 7f4232fb3540 0 starting mon.ceph-mon-02 rank -1 at public addrs [v2:Num.64:3300/0,v1:Num.64:6789/0] at bind addrs [v2:Num.64:3300/0,v1:Num.64:6789/0] mon_data /var/lib/ceph/mon/ceph-ceph-mon-02 fsid 8514c8d5-4cd3-4dee-b460-27633e3adb1a
2020-11-04 11:38:57.051 7f4232fb3540 1 mon.ceph-mon-02@-1(???) e25 preinit fsid 8514c8d5-4cd3-4dee-b460-27633e3adb1a
2020-11-04 11:38:57.051 7f4232fb3540 1 mon.ceph-mon-02@-1(???) e25 initial_members ceph-mon-01,ceph-mon-03, filtering seed monmap
2020-11-04 11:38:57.051 7f4232fb3540 0 mon.ceph-mon-02@-1(???).mds e430081 new map
2020-11-04 11:38:57.051 7f4232fb3540 0 mon.ceph-mon-02@-1(???).mds e430081 print_map
:snip:
2020-11-04 11:38:57.053 7f4232fb3540 0 mon.ceph-mon-02@-1(???).osd e1198618 crush map has features 288514119978713088, adjusting msgr requires
2020-11-04 11:38:57.053 7f4232fb3540 0 mon.ceph-mon-02@-1(???).osd e1198618 crush map has features 288514119978713088, adjusting msgr requires
2020-11-04 11:38:57.053 7f4232fb3540 0 mon.ceph-mon-02@-1(???).osd e1198618 crush map has features 3314933069571702784, adjusting msgr requires
2020-11-04 11:38:57.053 7f4232fb3540 0 mon.ceph-mon-02@-1(???).osd e1198618 crush map has features 288514119978713088, adjusting msgr requires
2020-11-04 11:38:57.054 7f4232fb3540 1 mon.ceph-mon-02@-1(???).paxosservice(auth 54141..54219) refresh upgraded, format 0 -> 3
2020-11-04 11:38:57.069 7f421d891700 1 mon.ceph-mon-02@-1(probing) e25 handle_auth_request failed to assign global_id
^^^ last line repeated every few seconds until process killed
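For anyone who wants to poke at the same state: while it loops on that message, the daemon can also be queried over its admin socket (assuming the default socket location), which shows rank -1, state "probing", and the monmap it believes in:

  ceph daemon mon.ceph-mon-02 mon_status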
I've exhausted everything I can think of so I've just been doing the scientific shotgun (one slug at a time) approach to see what changes. Does anyone else have any ideas?
--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o: (585) 475-3245 | pfmeec@rit.edu
------------------------
Yes, the OSDs are all BlueStore. So does this mean we can assign most of the memory to the OSD processes by setting osd_memory_target?
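For concreteness, what I have in mind is something like this (8 GiB is just an illustrative value, not a recommendation):

  # per-OSD memory target, applied to all OSDs via the config database
  ceph config set osd osd_memory_target 8589934592   # 8 GiB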
> If your OSDs are all BlueStore, page cache isn't nearly as important as
> with Filestore.
Hi,
radosgw-admin -v
ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Multisite sync is something I had working with a previous cluster and an earlier Ceph version, but it isn't working now, and I can't understand why.
If anyone with an idea of a possible cause could give me a clue I would be
grateful.
I have clusters set up using Rook, but as far as I can tell, that's not a
factor.
On the primary cluster, I have this:
radosgw-admin zonegroup get --rgw-zonegroup zonegroup-a
{
    "id": "b115d74a-2d5f-4127-b621-0223f1e96c71",
    "name": "zonegroup-a",
    "api_name": "zonegroup-a",
    "is_master": "true",
    "endpoints": [
        "http://192.168.30.8:80"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "024687e0-1461-4f45-9149-9e571791c2b3",
    "zones": [
        {
            "id": "024687e0-1461-4f45-9149-9e571791c2b3",
            "name": "zone-a",
            "endpoints": [
                "http://192.168.30.8:80"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 11,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        },
        {
            "id": "6ba0ee26-0155-48f9-b057-2803336f0d66",
            "name": "zone-b",
            "endpoints": [
                "http://192.168.30.108:80"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 11,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": [],
            "storage_classes": [
                "STANDARD"
            ]
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "8c38fa05-c19d-4e30-bc98-e2bc84eccb68",
    "sync_policy": {
        "groups": []
    }
}
It's identical on the secondary (that's after a realm pull, an update of
the zone-b endpoints, and a period commit), which I double-checked by
piping the output to md5sum on both sides.
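For reference, the realm pull, zone-b endpoint update, and period commit mentioned above were along these lines (system keys elided; the exact flags I used may have differed slightly):

  radosgw-admin realm pull --url=http://192.168.30.8:80 --access-key=<system-access-key> --secret=<system-secret-key>
  radosgw-admin zone modify --rgw-zonegroup=zonegroup-a --rgw-zone=zone-b --endpoints=http://192.168.30.108:80
  radosgw-admin period update --commit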
The system user created on the primary is
radosgw-admin user info --uid realm-a-system-user
{
    ...
    "keys": [
        {
            "user": "realm-a-system-user",
            "access_key": "IUs+USI5IjA8WkZPRjU=",
            "secret_key": "PGRDSzRERD4lbF9AYThuLzkvW1QvL148Q147PA=="
        }
    ...
}
The zones on both sides have these keys
radosgw-admin zone get --rgw-zone zone-a
{
    ...
    "system_key": {
        "access_key": "IUs+USI5IjA8WkZPRjU=",
        "secret_key": "PGRDSzRERD4lbF9AYThuLzkvW1QvL148Q147PA=="
    },
    ...
}
radosgw-admin zone get --rgw-zonegroup zonegroup-a --rgw-zone zone-b
{
    ...
    "system_key": {
        "access_key": "IUs+USI5IjA8WkZPRjU=",
        "secret_key": "PGRDSzRERD4lbF9AYThuLzkvW1QvL148Q147PA=="
    },
    ...
}
Yet, on the secondary
radosgw-admin sync status
          realm 8c38fa05-c19d-4e30-bc98-e2bc84eccb68 (realm-a)
      zonegroup b115d74a-2d5f-4127-b621-0223f1e96c71 (zonegroup-a)
           zone 6ba0ee26-0155-48f9-b057-2803336f0d66 (zone-b)
  metadata sync preparing for full sync
                full sync: 64/64 shards
                full sync: 0 entries to sync
                incremental sync: 0/64 shards
                metadata is behind on 64 shards
                behind shards: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63]
      data sync source: 024687e0-1461-4f45-9149-9e571791c2b3 (zone-a)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
and on the primary
radosgw-admin sync status
          realm 8c38fa05-c19d-4e30-bc98-e2bc84eccb68 (realm-a)
      zonegroup b115d74a-2d5f-4127-b621-0223f1e96c71 (zonegroup-a)
           zone 024687e0-1461-4f45-9149-9e571791c2b3 (zone-a)
  metadata sync no sync (zone is master)
2020-11-06T10:58:46.345+0000 7fa805c201c0 0 data sync zone:6ba0ee26 ERROR: failed to fetch datalog info
      data sync source: 6ba0ee26-0155-48f9-b057-2803336f0d66 (zone-b)
                        failed to retrieve sync info: (13) Permission denied
Given that all the keys above match, that "permission denied" is a mystery
to me, but it does accord with:
export AWS_ACCESS_KEY_ID="IUs+USI5IjA8WkZPRjU="
export AWS_SECRET_ACCESS_KEY="PGRDSzRERD4lbF9AYThuLzkvW1QvL148Q147PA=="
s3cmd ls --no-ssl --host-bucket= --host=192.168.30.8 # OK, but:
s3cmd ls --no-ssl --host-bucket= --host=192.168.30.108
# ERROR: S3 error: 403 (InvalidAccessKeyId)
# Although
curl -L http://192.168.30.108 # works: <?xml version="1.0" encoding="UTF-8
...
192.168.30.108 is the external IP, but just to be certain I was hitting zone-b, I also tried this from within the cluster using its internal IP:
s3cmd ls --no-ssl --host-bucket= --host=10.41.157.115
# ERROR: S3 error: 403 (InvalidAccessKeyId)
This seems to be the reason it's not syncing, but why?
The user with those keys existed on the primary before the realm pull, in
agreement with every procedure I have seen for setting up multisite.
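One check I can think of, for anyone else hitting this, is whether that user's metadata ever made it across at all, i.e. running this on the secondary:

  radosgw-admin user info --uid realm-a-system-user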
Any suggestions?
Regards,
Michael
--
Hi all,
I moved the CRUSH location of 8 OSDs and rebalancing went on happily (misplaced objects only). Today, osd.1 crashed, restarted and rejoined the cluster. However, it seems not to have re-joined some PGs it was a member of, so I now have undersized PGs for no reason I can see:
PG_DEGRADED Degraded data redundancy: 52173/2268789087 objects degraded (0.002%), 2 pgs degraded, 7 pgs undersized
pg 11.52 is stuck undersized for 663.929664, current state active+undersized+remapped+backfilling, last acting [237,60,2147483647,74,233,232,292,86]
The up and acting sets are:
"up": [
237,
2,
74,
289,
233,
232,
292,
86
],
"acting": [
237,
60,
2147483647,
74,
233,
232,
292,
86
],
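(For reference, both arrays above can be read straight out of a PG query; 2147483647 is the placeholder CRUSH uses when it cannot fill a slot:)

  # the "up", "acting" and "recovery_state" sections are the interesting parts
  ceph pg 11.52 query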
How can I get the PG to complete peering and osd.1 to join? I have an unreasonable number of degraded objects where the missing part is on this OSD.
For completeness, here is the cluster status:
# ceph status
  cluster:
    id:     ...
    health: HEALTH_ERR
            noout,norebalance flag(s) set
            1 large omap objects
            35815902/2268938858 objects misplaced (1.579%)
            Degraded data redundancy: 46122/2268938858 objects degraded (0.002%), 2 pgs degraded, 7 pgs undersized
            Degraded data redundancy (low space): 28 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 299 osds: 275 up, 275 in; 301 remapped pgs
         flags noout,norebalance

  data:
    pools:   11 pools, 3215 pgs
    objects: 268.8 M objects, 675 TiB
    usage:   854 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     46122/2268938858 objects degraded (0.002%)
             35815902/2268938858 objects misplaced (1.579%)
             2907 active+clean
             219  active+remapped+backfill_wait
             47   active+remapped+backfilling
             28   active+remapped+backfill_wait+backfill_toofull
             6    active+clean+scrubbing+deep
             5    active+undersized+remapped+backfilling
             2    active+undersized+degraded+remapped+backfilling
             1    active+clean+scrubbing

  io:
    client:   13 MiB/s rd, 196 MiB/s wr, 2.82 kop/s rd, 1.81 kop/s wr
    recovery: 57 MiB/s, 14 objects/s
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi all,
MDS version v14.2.11
Client kernel 3.10.0-1127.19.1.el7.x86_64
We are seeing a strange issue with a Dovecot use-case on CephFS. Occasionally Dovecot reports a file as locked, such as:
Nov 09 13:55:00 dovecot-backend-00.cern.ch dovecot[27710]:
imap(reguero)<23945><fRA6B6yznq68uE28>: Error: Mailbox Deleted Items:
Timeout (180s) while waiting for lock for transaction log file
/mail/users/r/reguero//mdbox/mailboxes/Deleted
Items/dbox-Mails/dovecot.index.log (WRITE lock held by pid -9605)
We checked all hosts that have mounted the cephfs -- there is no pid 9605.
Is there any way to see who exactly created the lock? ceph_filelock
has a client id, but I didn't find a way to inspect the
cephfs_metadata to see the ceph_filelock directly.
Otherwise, are other Dovecot/CephFS users seeing this? Did you switch to flock or lock files (dotlock) instead of fcntl locks?
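(For clarity, the Dovecot knob I mean is lock_method; the values below are just the documented options, we have not switched yet:)

  # /etc/dovecot/conf.d/10-mail.conf -- fcntl is the default
  lock_method = fcntl    # alternatives: flock, dotlock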
Thanks!
Dan
P.S. Here is the output from the print-locks tool for the kernel client:
Read lock:
Type: 1 (0: Read, 1: Write, 2: Unlocked)
Whence: 0 (0: start, 1: current, 2: end)
Offset: 0
Len: 1
Pid: -9605
Write lock:
Type: 1 (0: Read, 1: Write, 2: Unlocked)
Whence: 0 (0: start, 1: current, 2: end)
Offset: 0
Len: 1
Pid: -9605
and same file from a 15.2.5 fuse client :
Read lock:
Type: 1 (0: Read, 1: Write, 2: Unlocked)
Whence: 0 (0: start, 1: current, 2: end)
Offset: 0
Len: 0
Pid: 0
Write lock:
Type: 1 (0: Read, 1: Write, 2: Unlocked)
Whence: 0 (0: start, 1: current, 2: end)
Offset: 0
Len: 0
Pid: 0
Hi Anthony
> Did you add a bunch of data since then, or change the Ceph release? Do
> you have bluefs_buffered_io set to false?
We did not change the Ceph release in the meantime. It is very well possible that the delays were just not noticed during our previous maintenances. bluefs_buffered_io is set to false (the default setting in 14.2.11). I posted a question about this setting some time ago without any response. Perhaps you are able to answer this:
- If bluefs_buffered_io is set to false, does that mean that all Ceph buffering is done in the OSD processes? Or is the Linux buffer cache still used somewhere? If the Linux buffer cache is still used, what would be your advice on setting osd_memory_target vs. leaving space for Linux buffers?
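(A concrete way to frame the question, using osd.0 as an example of what I can check:)

  # current value of the option on a running OSD
  ceph daemon osd.0 config get bluefs_buffered_io

  # breakdown of memory held inside the OSD process (bluestore caches etc.)
  ceph daemon osd.0 dump_mempools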
> PGs block while peering, so it pays to spread out the peering load.
>
> Scrubs vary as a function of a number of things. Remember that shallow
> scrubs are cheap and frequent, so if you have downtime they'll need to
> catch up when they come back. Especially if you also limit the times of
> day when scrubs can run (which is usually a bad idea). Scrubs are not
> themselves part of peering.
Thank you, we will keep that in mind for the next maintenance.
> Are you using EC? What networking technology?
We are not using EC.
The network for the OSDs consists of 2 x 10 Gbps links in a bond. There is no separate cluster network. So far it looks like the network is nowhere near its maximum.
Kind Regards
Marcel
Hi,
on my freshly deployed cephadm bootstrap node I can no longer run the ceph command. It just hangs:
[root@gedasvl02 ~]# ceph orch dev ls
INFO:cephadm:Inferring fsid c7879f24-1f90-11eb-8ba2-005056b703af
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
[root@gedasvl02 ~]# ceph -s
INFO:cephadm:Inferring fsid c7879f24-1f90-11eb-8ba2-005056b703af
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
I rebooted the node, but this didn't solve the issue. :(
systemctl looks fine, no stopped services.
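If it helps, these are the additional checks I can still run on the node (the mon daemon name here is assumed to follow the default mon.<hostname> pattern):

  cephadm ls                          # daemons cephadm expects on this host
  podman ps -a                        # running containers
  cephadm logs --name mon.gedasvl02   # logs of the local mon container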
Any ideas? It's just a test cluster I can rebuild at any time, but I'm curious what has happened.
Best Regards,
Oliver
Hi,
A couple of questions came up that aren't really documented anywhere; hopefully someone knows the answers:
1. Is there a way to see the replication queue? I want to create metrics, e.g. whether there is any delay in the replication.
2. Is the replication FIFO?
3. How does replication actually work at a lower level?
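(Assuming this is about RGW multisite sync -- these are the closest status commands I know of; the bucket name is a placeholder:)

  radosgw-admin sync status                            # per-shard metadata/data sync position
  radosgw-admin bucket sync status --bucket=<bucket>   # per-bucket view of what is still pending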
Thank you