Hi everyone,
I figure it's time to pull in more brain power on this one. We had an NVMe mostly die in one of our monitor hosts, which caused write latency on that machine to spike. Ceph did the RightThing(tm): once that monitor fell out of quorum it was simply ignored. I pulled the bad drive out of the array and tried to bring the mon and mgr back in (our monitors double-duty as managers).
The manager came up with zero problems, but the monitor got stuck probing.
I removed the bad host from the monmap and stood up a new one on an OSD node to get back to 3 active. That new node added perfectly using the same methods I've tried on the old one.
Network appears to be clean between all hosts. Packet captures show them chatting just fine. Since we are getting ready to upgrade from RHEL7 to RHEL8 I took this as an opportunity to reinstall the monitor as an 8 box to get that process rolling. Box is now on RHEL8 with no changes to how ceph-mon is acting.
I install machines with a kickstart and use our own ansible roles to get it 95% into service. I then follow the manual install instructions (https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#adding-mon…).
Time is in sync, /var/lib/ceph/mon/* is owned by the right UID, keys are in sync, configs are in sync. I pulled the old mon out of "mon initial members" and "mon host". `nc` can talk to all the ports in question and we've tried it with firewalld off as well (ditto with selinux). Cleaned up some stale DNS and even tried a different IP (same DNS name). I started all of this with 14.2.12 but .13 was released while debugging so I've got that on the broken monitor at the moment.
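For anyone following along, the steps from that doc page boil down to roughly the following (paths and ordering reproduced from memory, so treat this as a sketch rather than the exact commands I typed):

  # grab the current mon keyring and monmap from the cluster
  ceph auth get mon. -o /tmp/mon.keyring
  ceph mon getmap -o /tmp/monmap

  # build the new mon's data directory as the ceph user
  sudo -u ceph mkdir -p /var/lib/ceph/mon/ceph-ceph-mon-02
  sudo -u ceph ceph-mon --cluster ceph -i ceph-mon-02 --mkfs \
      --monmap /tmp/monmap --keyring /tmp/mon.keyring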
I manually start the daemon in debug mode (/usr/bin/ceph-mon -d --cluster ceph --id ceph-mon-02 --setuser ceph --setgroup ceph) until it has joined, then use the systemd unit to start it once it's clean. The current state is:
(Lightly sanitized output)
:snip:
2020-11-04 11:38:57.049 7f4232fb3540 0 mon.ceph-mon-02 does not exist in monmap, will attempt to join an existing cluster
2020-11-04 11:38:57.049 7f4232fb3540 0 using public_addr v2:Num.64:0/0 -> [v2:Num.64:3300/0,v1:Num.64:6789/0]
2020-11-04 11:38:57.050 7f4232fb3540 0 starting mon.ceph-mon-02 rank -1 at public addrs [v2:Num.64:3300/0,v1:Num.64:6789/0] at bind addrs [v2:Num.64:3300/0,v1:Num.64:6789/0] mon_data /var/lib/ceph/mon/ceph-ceph-mon-02 fsid 8514c8d5-4cd3-4dee-b460-27633e3adb1a
2020-11-04 11:38:57.051 7f4232fb3540 1 mon.ceph-mon-02@-1(???) e25 preinit fsid 8514c8d5-4cd3-4dee-b460-27633e3adb1a
2020-11-04 11:38:57.051 7f4232fb3540 1 mon.ceph-mon-02@-1(???) e25 initial_members ceph-mon-01,ceph-mon-03, filtering seed monmap
2020-11-04 11:38:57.051 7f4232fb3540 0 mon.ceph-mon-02@-1(???).mds e430081 new map
2020-11-04 11:38:57.051 7f4232fb3540 0 mon.ceph-mon-02@-1(???).mds e430081 print_map
:snip:
2020-11-04 11:38:57.053 7f4232fb3540 0 mon.ceph-mon-02@-1(???).osd e1198618 crush map has features 288514119978713088, adjusting msgr requires
2020-11-04 11:38:57.053 7f4232fb3540 0 mon.ceph-mon-02@-1(???).osd e1198618 crush map has features 288514119978713088, adjusting msgr requires
2020-11-04 11:38:57.053 7f4232fb3540 0 mon.ceph-mon-02@-1(???).osd e1198618 crush map has features 3314933069571702784, adjusting msgr requires
2020-11-04 11:38:57.053 7f4232fb3540 0 mon.ceph-mon-02@-1(???).osd e1198618 crush map has features 288514119978713088, adjusting msgr requires
2020-11-04 11:38:57.054 7f4232fb3540 1 mon.ceph-mon-02@-1(???).paxosservice(auth 54141..54219) refresh upgraded, format 0 -> 3
2020-11-04 11:38:57.069 7f421d891700 1 mon.ceph-mon-02@-1(probing) e25 handle_auth_request failed to assign global_id
^^^ last line repeated every few seconds until process killed
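For anyone who wants to poke at the same state: while it loops on that message, the daemon can also be queried over its admin socket (assuming the default socket location), which shows rank -1, state "probing", and the monmap it believes in:

  ceph daemon mon.ceph-mon-02 mon_status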
I've exhausted everything I can think of so I've just been doing the scientific shotgun (one slug at a time) approach to see what changes. Does anyone else have any ideas?
--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o: (585) 475-3245 | pfmeec@rit.edu
------------------------
Yes, the OSDs are all BlueStore. So does this mean we can assign most of the memory to the OSD processes by setting osd_memory_target?
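For concreteness, what I have in mind is something like this (8 GiB is just an illustrative value, not a recommendation):

  # per-OSD memory target, applied to all OSDs via the config database
  ceph config set osd osd_memory_target 8589934592   # 8 GiB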
> If your OSDs are all BlueStore, page cache isn't nearly as important as
> with Filestore.
Hi,
radosgw-admin -v
ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Multisite sync is something I had working with a previous cluster and an earlier Ceph version, but it isn't working now, and I can't understand why.
If anyone with an idea of a possible cause could give me a clue I would be
grateful.
I have clusters set up using Rook, but as far as I can tell, that's not a
factor.
On the primary cluster, I have this:
radosgw-admin zonegroup get --rgw-zonegroup zonegroup-a
{
    "id": "b115d74a-2d5f-4127-b621-0223f1e96c71",
    "name": "zonegroup-a",
    "api_name": "zonegroup-a",
    "is_master": "true",
    "endpoints": [
        "http://192.168.30.8:80"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "024687e0-1461-4f45-9149-9e571791c2b3",
    "zones": [
        {
            "id": "024687e0-1461-4f45-9149-9e571791c2b3",
            "name": "zone-a",
            "endpoints": [
                "http://192.168.30.8:80"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 11,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        },
        {
            "id": "6ba0ee26-0155-48f9-b057-2803336f0d66",
            "name": "zone-b",
            "endpoints": [
                "http://192.168.30.108:80"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 11,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": [],
            "storage_classes": [
                "STANDARD"
            ]
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "8c38fa05-c19d-4e30-bc98-e2bc84eccb68",
    "sync_policy": {
        "groups": []
    }
}
It's identical on the secondary (that's after a realm pull, an update of
the zone-b endpoints, and a period commit), which I double-checked by
piping the output to md5sum on both sides.
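For reference, the realm pull, zone-b endpoint update, and period commit mentioned above were along these lines (system keys elided; the exact flags I used may have differed slightly):

  radosgw-admin realm pull --url=http://192.168.30.8:80 --access-key=<system-access-key> --secret=<system-secret-key>
  radosgw-admin zone modify --rgw-zonegroup=zonegroup-a --rgw-zone=zone-b --endpoints=http://192.168.30.108:80
  radosgw-admin period update --commit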
The system user created on the primary is
radosgw-admin user info --uid realm-a-system-user
{
    ...
    "keys": [
        {
            "user": "realm-a-system-user",
            "access_key": "IUs+USI5IjA8WkZPRjU=",
            "secret_key": "PGRDSzRERD4lbF9AYThuLzkvW1QvL148Q147PA=="
        }
    ...
}
The zones on both sides have these keys
radosgw-admin zone get --rgw-zone zone-a
{
    ...
    "system_key": {
        "access_key": "IUs+USI5IjA8WkZPRjU=",
        "secret_key": "PGRDSzRERD4lbF9AYThuLzkvW1QvL148Q147PA=="
    },
    ...
}
radosgw-admin zone get --rgw-zonegroup zonegroup-a --rgw-zone zone-b
{
    ...
    "system_key": {
        "access_key": "IUs+USI5IjA8WkZPRjU=",
        "secret_key": "PGRDSzRERD4lbF9AYThuLzkvW1QvL148Q147PA=="
    },
    ...
}
Yet, on the secondary
radosgw-admin sync status
          realm 8c38fa05-c19d-4e30-bc98-e2bc84eccb68 (realm-a)
      zonegroup b115d74a-2d5f-4127-b621-0223f1e96c71 (zonegroup-a)
           zone 6ba0ee26-0155-48f9-b057-2803336f0d66 (zone-b)
  metadata sync preparing for full sync
                full sync: 64/64 shards
                full sync: 0 entries to sync
                incremental sync: 0/64 shards
                metadata is behind on 64 shards
                behind shards: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63]
      data sync source: 024687e0-1461-4f45-9149-9e571791c2b3 (zone-a)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
and on the primary
radosgw-admin sync status
          realm 8c38fa05-c19d-4e30-bc98-e2bc84eccb68 (realm-a)
      zonegroup b115d74a-2d5f-4127-b621-0223f1e96c71 (zonegroup-a)
           zone 024687e0-1461-4f45-9149-9e571791c2b3 (zone-a)
  metadata sync no sync (zone is master)
2020-11-06T10:58:46.345+0000 7fa805c201c0 0 data sync zone:6ba0ee26 ERROR: failed to fetch datalog info
      data sync source: 6ba0ee26-0155-48f9-b057-2803336f0d66 (zone-b)
                        failed to retrieve sync info: (13) Permission denied
Given that all the keys above match, that "permission denied" is a mystery
to me, but it does accord with:
export AWS_ACCESS_KEY_ID="IUs+USI5IjA8WkZPRjU="
export AWS_SECRET_ACCESS_KEY="PGRDSzRERD4lbF9AYThuLzkvW1QvL148Q147PA=="
s3cmd ls --no-ssl --host-bucket= --host=192.168.30.8 # OK, but:
s3cmd ls --no-ssl --host-bucket= --host=192.168.30.108
# ERROR: S3 error: 403 (InvalidAccessKeyId)
# Although
curl -L http://192.168.30.108 # works: <?xml version="1.0" encoding="UTF-8
...
192.168.30.108 is the external IP, but just to be certain I was hitting zone-b, I also tried this from within the cluster using its internal IP:
s3cmd ls --no-ssl --host-bucket= --host=10.41.157.115
# ERROR: S3 error: 403 (InvalidAccessKeyId)
This seems to be the reason it's not syncing, but why?
The user with those keys existed on the primary before the realm pull, in
agreement with every procedure I have seen for setting up multisite.
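One check I can think of, for anyone else hitting this, is whether that user's metadata ever made it across at all, i.e. running this on the secondary:

  radosgw-admin user info --uid realm-a-system-user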
Any suggestions?
Regards,
Michael
--
Hi all,
I moved the CRUSH location of 8 OSDs and rebalancing went on happily (misplaced objects only). Today, osd.1 crashed, restarted and rejoined the cluster. However, it seems not to have re-joined some PGs it was a member of, so I now have undersized PGs for no reason I can see:
PG_DEGRADED Degraded data redundancy: 52173/2268789087 objects degraded (0.002%), 2 pgs degraded, 7 pgs undersized
pg 11.52 is stuck undersized for 663.929664, current state active+undersized+remapped+backfilling, last acting [237,60,2147483647,74,233,232,292,86]
The up and acting sets are:
"up": [
237,
2,
74,
289,
233,
232,
292,
86
],
"acting": [
237,
60,
2147483647,
74,
233,
232,
292,
86
],
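(For reference, both arrays above can be read straight out of a PG query; 2147483647 is the placeholder CRUSH uses when it cannot fill a slot:)

  # the "up", "acting" and "recovery_state" sections are the interesting parts
  ceph pg 11.52 query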
How can I get the PG to complete peering and osd.1 to join? I have an unreasonable number of degraded objects where the missing part is on this OSD.
For completeness, here is the cluster status:
# ceph status
  cluster:
    id:     ...
    health: HEALTH_ERR
            noout,norebalance flag(s) set
            1 large omap objects
            35815902/2268938858 objects misplaced (1.579%)
            Degraded data redundancy: 46122/2268938858 objects degraded (0.002%), 2 pgs degraded, 7 pgs undersized
            Degraded data redundancy (low space): 28 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 299 osds: 275 up, 275 in; 301 remapped pgs
         flags noout,norebalance

  data:
    pools:   11 pools, 3215 pgs
    objects: 268.8 M objects, 675 TiB
    usage:   854 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     46122/2268938858 objects degraded (0.002%)
             35815902/2268938858 objects misplaced (1.579%)
             2907 active+clean
             219  active+remapped+backfill_wait
             47   active+remapped+backfilling
             28   active+remapped+backfill_wait+backfill_toofull
             6    active+clean+scrubbing+deep
             5    active+undersized+remapped+backfilling
             2    active+undersized+degraded+remapped+backfilling
             1    active+clean+scrubbing

  io:
    client:   13 MiB/s rd, 196 MiB/s wr, 2.82 kop/s rd, 1.81 kop/s wr
    recovery: 57 MiB/s, 14 objects/s
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi all,
MDS version v14.2.11
Client kernel 3.10.0-1127.19.1.el7.x86_64
We are seeing a strange issue with a Dovecot use-case on CephFS. Occasionally Dovecot reports a file as locked, such as:
Nov 09 13:55:00 dovecot-backend-00.cern.ch dovecot[27710]:
imap(reguero)<23945><fRA6B6yznq68uE28>: Error: Mailbox Deleted Items:
Timeout (180s) while waiting for lock for transaction log file
/mail/users/r/reguero//mdbox/mailboxes/Deleted
Items/dbox-Mails/dovecot.index.log (WRITE lock held by pid -9605)
We checked all hosts that have mounted the cephfs -- there is no pid 9605.
Is there any way to see who exactly created the lock? ceph_filelock
has a client id, but I didn't find a way to inspect the
cephfs_metadata to see the ceph_filelock directly.
Otherwise, are other Dovecot/CephFS users seeing this? Did you switch to flock or lock files (dotlock) instead of fcntl locks?
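(For clarity, the Dovecot knob I mean is lock_method; the values below are just the documented options, we have not switched yet:)

  # /etc/dovecot/conf.d/10-mail.conf -- fcntl is the default
  lock_method = fcntl    # alternatives: flock, dotlock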
Thanks!
Dan
P.S. Here is the output from the print-locks tool for the kernel client:
Read lock:
Type: 1 (0: Read, 1: Write, 2: Unlocked)
Whence: 0 (0: start, 1: current, 2: end)
Offset: 0
Len: 1
Pid: -9605
Write lock:
Type: 1 (0: Read, 1: Write, 2: Unlocked)
Whence: 0 (0: start, 1: current, 2: end)
Offset: 0
Len: 1
Pid: -9605
and same file from a 15.2.5 fuse client :
Read lock:
Type: 1 (0: Read, 1: Write, 2: Unlocked)
Whence: 0 (0: start, 1: current, 2: end)
Offset: 0
Len: 0
Pid: 0
Write lock:
Type: 1 (0: Read, 1: Write, 2: Unlocked)
Whence: 0 (0: start, 1: current, 2: end)
Offset: 0
Len: 0
Pid: 0
Hi Anthony
> Did you add a bunch of data since then, or change the Ceph release? Do
> you have bluefs_buffered_io set to false?
We did not change the Ceph release in the meantime. It is very well possible that the delays were just not noticed during our previous maintenances. bluefs_buffered_io is set to false (the default setting in 14.2.11). I posted a question about this setting some time ago without any response. Perhaps you are able to answer this:
- If bluefs_buffered_io is set to false, does that mean that all Ceph buffering is done in the OSD processes? Or is the Linux buffer cache still used somewhere? If the Linux buffer cache is still used, what would be your advice on setting osd_memory_target vs. leaving space for Linux buffers?
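(A concrete way to frame the question, using osd.0 as an example of what I can check:)

  # current value of the option on a running OSD
  ceph daemon osd.0 config get bluefs_buffered_io

  # breakdown of memory held inside the OSD process (bluestore caches etc.)
  ceph daemon osd.0 dump_mempools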
> PGs block while peering, so it pays to spread out the peering load.
>
> Scrubs vary as a function of a number of things. Remember that shallow
> scrubs are cheap and frequent, so if you have downtime they'll need to
> catch up when they come back. Especially if you also limit the times of
> day when scrubs can run (which is usually a bad idea). Scrubs are not
> themselves part of peering.
Thank you, we will keep that in mind for the next maintenance.
> Are you using EC? What networking technology?
We are not using EC.
The network for the OSDs consists of 2 x 10 Gbps links in a bond. There is no separate cluster network. So far it looks like the network is nowhere near its maximum.
Kind Regards
Marcel
Hi,
on my freshly deployed cephadm bootstrap node I can no longer run the ceph command. It just hangs:
[root@gedasvl02 ~]# ceph orch dev ls
INFO:cephadm:Inferring fsid c7879f24-1f90-11eb-8ba2-005056b703af
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
[root@gedasvl02 ~]# ceph -s
INFO:cephadm:Inferring fsid c7879f24-1f90-11eb-8ba2-005056b703af
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
I rebooted the node, but this didn't solve the issue. :(
systemctl looks fine, no stopped services.
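If it helps, these are the additional checks I can still run on the node (the mon daemon name here is assumed to follow the default mon.<hostname> pattern):

  cephadm ls                          # daemons cephadm expects on this host
  podman ps -a                        # running containers
  cephadm logs --name mon.gedasvl02   # logs of the local mon container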
Any ideas? It's just a test cluster I can rebuild at any time, but I'm curious what has happened.
Best Regards,
Oliver
Hi,
A couple of questions came up that aren't really documented anywhere; hopefully someone knows the answers:
1. Is there a way to see the replication queue? I want to create metrics, e.g. whether there is any delay in the replication.
2. Is the replication FIFO?
3. How does replication actually work at a lower level?
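(Assuming this is about RGW multisite sync -- these are the closest status commands I know of; the bucket name is a placeholder:)

  radosgw-admin sync status                            # per-shard metadata/data sync position
  radosgw-admin bucket sync status --bucket=<bucket>   # per-bucket view of what is still pending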
Thank you