Hey all!
I'm a first-time Ceph user trying to learn how to set up a cluster. I've
created a basic cluster using the following:
```
cephadm bootstrap --mon-ip <SERVER_1_IP>
ceph orch host add server-2 <SERVER_2_IP> _admin
```
I've created and mounted a filesystem on a host and everything is going
well, but I've noticed that an alert has been triggered:
CephMgrPrometheusModuleInactive.
It seems this alert is trying to `curl server-2:9283`. To check whether this
was a network issue, I ran `ceph mgr fail` to move the active mgr to
server-2. After some time I got the same alert, with the instance being
server-1:9283. Running `ss -l -n -p | grep 9283` shows the port is bound on
server-2 and not server-1. If I run `ceph mgr fail` again, the port becomes
bound on server-1 and not server-2.
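For reference, here is roughly the check sequence I have been running on each
host (a sketch of the commands as I understand them):
```
# confirm the prometheus module is enabled and see where it is advertised
ceph mgr module ls | grep prometheus
ceph mgr services

# check whether anything is listening on the exporter port
ss -l -n -p | grep 9283

# fail over the active mgr and repeat the checks on the other host
ceph mgr fail
```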
Is this alert important? Is there a way to remediate this issue? Let me
know if I am missing something here.
Thanks,
- Josh
Hi Anthony, thanks for reaching out.
It's an erasure-coded data pool (K=4, M=2), but I had more than two disk
failures around the same time, and the data had not yet fully recovered
elsewhere in the cluster.
They are big 12TB Exos drives, so it usually takes a few weeks to backfill /
recover, plus I had snaptrimming going on at the same time.
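For context, the pool was created with a profile along these lines (a sketch;
the profile and pool names here are just placeholders):
```
# an EC profile that can tolerate two simultaneous OSD failures
ceph osd erasure-code-profile set ec-4-2 k=4 m=2
ceph osd pool create cephfs_data_ec 128 128 erasure ec-4-2
```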
FYI - The journal's co-located on drive.
Kind regards
Geoff
On Fri, 24 Feb 2023 at 18:30, Anthony D'Atri <aad(a)dreamsnake.net> wrote:
> Are you only doing 2 replicas?
>
> On Feb 24, 2023, at 08:20, Geoffrey Rhodes <geoffrey(a)rhodes.org.za> wrote:
>
> This has caused a PG to go inactive
Hello all, I'd really appreciate some input from the more knowledgeable
here.
Is there a way I can access OSD objects if I have a BlueFS replay error?
This error prevents me from starting the OSD and also throws an error if I
try using the bluestore or objectstore tools. I can, however, run a
ceph-bluestore-tool show-label without issue.
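For reference, these are the sorts of invocations I mean (a sketch; the OSD
path is just an example from my setup):
```
# this one works fine
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-12

# these hit the same BlueFS replay assertion
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op list
```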
I'm hoping there is another way to access the objects on this OSD, or
possibly a way to purge this log. If deleting this replay log would help
(even with some data loss), I'm happy to try it.
This has caused a PG to go inactive, and I'm considering deleting the PG and
force re-creating it; I saw this mentioned as a last-resort option.
Below is a snippet of where things go wrong. I don't know if there is even a
chance here, or whether this is an unrecoverable state.
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _replay 0x0:
op_dir_link db/031664.sst to 29549
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _replay 0x0:
op_dir_link db/031665.sst to 29550
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _replay 0x0:
op_dir_link db/031666.sst to 29551
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _replay 0x0:
op_dir_link db/CURRENT to 29543
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _replay 0x0:
op_dir_link db/IDENTITY to 5
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _replay 0x0:
op_dir_link db/LOCK to 2
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _replay 0x0:
op_dir_link db/MANIFEST-031657 to 29542
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _replay 0x0:
op_dir_link db/OPTIONS-031645 to 29529
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _replay 0x0:
op_dir_link db/OPTIONS-031660 to 29545
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _replay 0x0:
op_dir_create db.slow
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _replay 0x0: op_jump
seq 5204712 offset 0x20000
2023-01-25T10:05:26.543+0000 7fa773a14240 10 bluefs _read h 0x55d2f1cfdb80
0x10000~10000 from file(ino 1 size 0x0 mtime
2022-10-07T17:55:34.189440+0000 allocated 420000 alloc_commit 420000
extents [1:0x1770170000~20000,1:0x53d1e900000~400000])
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _read left 0x10000 len
0x10000
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _read got 65536
2023-01-25T10:05:26.543+0000 7fa773a14240 10 bluefs _read h 0x55d2f1cfdb80
0x20000~1000 from file(ino 1 size 0x20000 mtime
2022-10-07T17:55:34.189440+0000 allocated 420000 alloc_commit 420000
extents [1:0x1770170000~20000,1:0x53d1e900000~400000])
2023-01-25T10:05:26.543+0000 7fa773a14240 20 bluefs _read fetching
0x0~100000 of 1:0x53d1e900000~400000
2023-01-25T10:05:26.547+0000 7fa773a14240 20 bluefs _read left 0x100000 len
0x1000
2023-01-25T10:05:26.547+0000 7fa773a14240 20 bluefs _read got 4096
2023-01-25T10:05:26.547+0000 7fa773a14240 10 bluefs _replay 0x20000:
txn(seq 5204713 len 0x55 crc 0x81f48b1c)
2023-01-25T10:05:26.547+0000 7fa773a14240 20 bluefs _replay 0x20000:
op_file_update file(ino 29551 size 0x0 mtime
2022-10-07T17:55:34.151007+0000 allocated 0 alloc_commit 0 extents [])
2023-01-25T10:05:26.547+0000 7fa773a14240 20 bluefs _replay 0x20000:
op_dir_link db/031666.sst to 29551
2023-01-25T10:05:26.555+0000 7fa773a14240 -1
/build/ceph-17.2.5/src/os/bluestore/BlueFS.cc: In function 'int
BlueFS::_replay(bool, bool)' thread 7fa773a14240 time
2023-01-25T10:05:26.551808+0000
/build/ceph-17.2.5/src/os/bluestore/BlueFS.cc: 1419: FAILED ceph_assert(r
== q->second->file_map.end())
Kind regards
Geoff
Dear Ceph community,
for about two or three weeks now, we have had CephFS clients regularly
failing to respond to capability releases, accompanied by OSD slow ops. By
now, this happens daily, every time the clients get more active (e.g. during
nightly backups).
We mostly observe it with a handful of highly active clients, so it
correlates with IO volume. But we have over 250 clients which mount the
CephFS and plan to make them all more active soon. What worries me further
is that it doesn't seem to affect only the clients which fail to respond to
the capability release; other clients also just get stuck accessing data on
the CephFS.
So far I've been tracking down the corresponding OSDs via the client
(`cat /sys/kernel/debug/ceph/*/osdc`) and restarting them one by one. But
since this is now a regular/systemic issue, that is obviously not a
sustainable solution. It is usually a handful of OSDs per client, and I
couldn't observe any particular pattern in the involved OSDs yet.
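For reference, the per-client workaround looks roughly like this (a sketch;
the OSD id is just an example):
```
# on the stuck client: list in-flight requests and the OSDs they are waiting on
cat /sys/kernel/debug/ceph/*/osdc

# on the cluster: restart one of the implicated, cephadm-managed OSDs
ceph orch daemon restart osd.90
```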
Our cluster still runs on CentOS 7 with kernel
3.10.0-1160.42.2.el7.x86_64 using cephadm with ceph version 17.2.1
(ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable).
Most active clients are currently on kernel versions such as:
4.18.0-348.el8.0.2.x86_64, 4.18.0-348.2.1.el8_5.x86_64,
4.18.0-348.7.1.el8_5.x86_64
I picked up some logging ideas from an older issue with similar symptoms:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/CKTIM6LF274…
But that issue has already been fixed in the kernel client, and I don't see
similar things in our logs.
I'm also not sure whether the things I'm digging up in the logs are
actually useful, or whether I'm looking in the right places.
So, I enabled "debug_ms 1" for the OSDs as suggested in the other thread.
But this filled up our host disks pretty fast, leading to, e.g., monitors
crashing.
I disabled the debug messages again and trimmed logs to free up space, but
I kept copies of two OSD log files which were involved in a capability
release / slow requests issue.
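For reference, I toggled the logging roughly like this (a sketch):
```
# enable verbose messenger logging for all OSDs
ceph config set osd debug_ms 1

# and later, revert to the default again
ceph config rm osd debug_ms
```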
The copied log files are quite big now (~3GB), and even if I remove things
like the ping traffic, I still have more than 1 million lines just for the
morning until the disk space was full (around 7 hours).
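(For what it's worth, the pre-filtering is just something like the following;
the file name and pattern are only examples.)
```
# drop the periodic osd_ping chatter before looking at anything else
grep -v osd_ping ceph-osd.90.log > ceph-osd.90.filtered.log
wc -l ceph-osd.90.filtered.log
```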
So now I'm wondering how to filter/look for the right things here.
When I grep for "error", I get a few of these messages:
{"log":"debug 2023-02-22T06:18:08.113+0000 7f15c5fff700 1 --
[v2:192.168.1.13:6881/4149819408,v1:192.168.1.13:6884/4149819408]
\u003c== osd.161 v2:192.168.1.31:6835/1012436344 182573 ====
pg_update_log_missing(3.1a6s2 epoch 646235/644895 rep_tid 1014320
entries 646235'7672108 (0'0) error
3:65836dde:::10016e9b7c8.00000000:head by mds.0.1221974:8515830 0.000000
-2 ObjectCleanRegions clean_offsets: [0~18446744073709551615],
clean_omap: 1, new_object: 0 trim_to 646178'7662340 roll_forward_to
646192'7672106) v3 ==== 261+0+0 (crc 0 0 0) 0x562d55e52380 con
0x562d8a2de400\n","stream":"stderr","time":"2023-02-22T06:18:08.115002765Z"}
And if I grep for "failed", I get a couple of those:
{"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700 1 --
[v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161]
\u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00
msgr2=0x55b9ce07e580 crc :-1 s=STATE_CONNECTION_ESTABLISHED
l=1).read_until read
failed\n","stream":"stderr","time":"2023-02-22T06:15:25.243808392Z"}
{"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700 1 --2-
[v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161]
\u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00
0x55b9ce07e580 crc :-1 s=READY pgs=2096664 cs=0 l=1 rev1=1 crypto rx=0
tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble
failed r=-1 ((1) Operation not
permitted)\n","stream":"stderr","time":"2023-02-22T06:15:25.243813528Z"}
Not sure if they are related to the issue.
In the kernel logs of the client (dmesg, journalctl or /var/log/messages),
there seem to be no errors or any stack traces in the relevant time periods.
The only thing I can see is me restarting the relevant OSDs:
[Mi Feb 22 07:29:59 2023] libceph: osd90 down
[Mi Feb 22 07:30:34 2023] libceph: osd90 up
[Mi Feb 22 07:31:55 2023] libceph: osd93 down
[Mi Feb 22 08:37:50 2023] libceph: osd93 up
I noticed a socket closed for another client, but I assume that's more
related to monitors failing due to full disks:
[Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 socket
closed (con state OPEN)
[Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 session
lost, hunting for new mon
[Mi Feb 22 05:59:52 2023] libceph: mon3 (1)172.16.62.13:6789 session
established
I would appreciate it if anybody has a suggestion as to where I should look
next. Thank you for your help.
Best Wishes,
Mathias
Hi,
I'm really lost with my Ceph system. I built a small cluster for home
usage which has two uses for me: I want to replace an old NAS and I want
to learn about Ceph so that I have hands-on experience. We're using it
in our company, but I need some real-life experience without risking any
company or customer data. That's my preferred way of learning.
The cluster consists of 3 Raspberry Pis plus a few VMs running on
Proxmox. I'm not using Proxmox's built-in Ceph because I want to focus on
Ceph itself and not just use it as a preconfigured tool.
All hosts are running Fedora (x86_64 and arm64), and during an upgrade
from F36 to F37 my cluster suddenly showed all PGs as unavailable. I
worked nearly a week to get it back online and I learned a lot about
Ceph management and recovery. The cluster is back but I still can't
access my data. Maybe you can help me?
Here are my versions:
[ceph: root@ceph04 /]# ceph versions
{
    "mon": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "osd": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5
    },
    "mds": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4
    },
    "overall": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 15
    }
}
Here's MDS status output of one MDS:
[ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status
2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454
ms_handle_reset on v2:192.168.23.65:6800/2680651694
2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460
ms_handle_reset on v2:192.168.23.65:6800/2680651694
{
    "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
    "whoami": 0,
    "id": 60984167,
    "want_state": "up:replay",
    "state": "up:replay",
    "fs_name": "cephfs",
    "replay_status": {
        "journal_read_pos": 0,
        "journal_write_pos": 0,
        "journal_expire_pos": 0,
        "num_events": 0,
        "num_segments": 0
    },
    "rank_uptime": 1127.54018615,
    "mdsmap_epoch": 98056,
    "osdmap_epoch": 12362,
    "osdmap_epoch_barrier": 0,
    "uptime": 1127.957307273
}
It's been staying like that for days now. If a counter were moving, I would
just wait, but nothing changes, and all the stats say the MDSs aren't
working at all.
The symptom I have is that the dashboard and all the other tools I use say
it's more or less OK (some old messages about failed daemons and scrubbing
aside). But I can't mount anything. When I try to start a VM that's on
RBD I just get a timeout, and when I try to mount a CephFS, mount just
hangs forever.
Whatever command I give the MDS or the journal, it just hangs. The only
thing I could do was take the CephFS offline, kill the MDSs and do a "ceph
fs reset <fs name> --yes-i-really-mean-it". After that I rebooted all
nodes, just to be sure, but I still have no access to the data.
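Roughly, the sequence I went through was something like this (a sketch; the
filesystem name is mine, the daemon name is just one example):
```
# take the filesystem down and stop the MDS daemons
ceph fs set cephfs down true
ceph orch daemon stop mds.mds01.ceph05.pqxmvt

# reset the filesystem map as a last resort
ceph fs reset cephfs --yes-i-really-mean-it
```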
Could you please help me? I'm kinda desperate. If you need any more
information, just let me know.
Cheers,
Thomas
--
Thomas Widhalm
Lead Systems Engineer
NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
https://www.netways.de | thomas.widhalm(a)netways.de
Hi Cephers,
We have two Octopus 15.2.17 clusters in a multisite configuration. Every
once in a while we have to perform a bucket reshard (most recently, to 613
shards), and this practically kills our replication for a few days.
Does anyone know of any priority mechanics within sync to give other
buckets priority and/or lower the priority of the resharded one? Are there
any improvements to this in higher versions of Ceph that we could take
advantage of if we upgraded the cluster (I haven't found any)?
And how does one safely increase rgw_data_log_num_shards, given that the
documentation only says: "The values of rgw_data_log_num_shards and
rgw_md_log_max_shards should not be changed after sync has started."? Does
this mean that I should block access to the cluster, wait until sync has
caught up with the source/master, change this value, restart the rgws and
unblock access?
Kind Regards,
Tom
Hi,
today I wanted to increase the PGs from 2k -> 4k and random OSDs went
offline in the cluster.
After some investigation we saw that the OSDs got OOM killed (I've seen a
host go from 90GB of used memory to 190GB before the OOM kills happened).
We have around 24 SSD OSDs per host and 128GB/190GB/265GB of memory in these
hosts. All of them experienced OOM kills.
All hosts are octopus / ubuntu 20.04.
And at every step new OSDs crashed with OOM. (We have now set the
pg_num/pgp_num to 2516 to stop the process.)
The OSD logs do not show anything about why this might happen.
Some OSDs also segfault.
I have now started to stop all OSDs on a host and run a "ceph-bluestore-tool
repair" and a "ceph-kvstore-tool bluestore-kv compact" on all of its OSDs.
This takes around 30 minutes per 8TB OSD. When I start the OSDs again, I
instantly get a lot of slow ops from all the other OSDs as they come up (the
8TB OSDs take around 10 minutes in "load_pgs").
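For reference, the per-host procedure looks roughly like this (a sketch; the
OSD id and paths are examples, and the systemd unit name assumes a
package-based install):
```
# stop the OSD, then repair and compact its store offline
systemctl stop ceph-osd@42
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-42
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-42 compact
systemctl start ceph-osd@42
```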
I am unsure what I can do to restore normal cluster performance. Any ideas
or suggestions, or maybe even known bugs?
Maybe a string I could search for in the logs?
Cheers
Boris
Hi,
Our cluster runs Pacific on Rocky8. We have 3 rgw running on port 7480.
I tried to set up an ingress service with the following YAML service
definition, with no luck:
service_type: ingress
service_id: rgw.myceph.be
placement:
  hosts:
    - ceph001
    - ceph002
    - ceph003
spec:
  backend_service: rgw.myceph.be
  virtual_ip: 192.168.0.10
  frontend_port: 443
  monitor_port: 9000
  ssl_cert: |
    -----BEGIN PRIVATE KEY-----
    ...
    -----END PRIVATE KEY-----
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
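For what it's worth, I applied the spec the usual way (a sketch; the file
name is just an example):
```
ceph orch apply -i ingress-rgw.yaml
```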
I also tried to set up the ingress service via the dashboard... still no
luck. So I started debugging the problem.
1. Even though I entered the certificate and the private key in the form,
Ceph complained about a missing haproxy.pem.key file.
I manually added the file to the container definition folder, and the
haproxy containers started!
2. Looking at the HAProxy monitoring page, I realized that there was no
backend server defined, even though I had manually selected the servers
running the rgw in the form.
In the container definition folder, the backend definition in haproxy.cfg
looks like:
...
backend backend
    option forwardfor
    balance static-rr
    option httpchk HEAD / HTTP/1.0
There is no mention of the servers or port 7480.
Once again, I added the server definitions manually:
    server ceph001 192.168.0.1:7480 check
    server ceph004 192.168.0.2:7480 check
    server ceph008 192.168.0.2:7480 check
and redeployed the containers. It's working.
Any idea?
Patrick