Hi all,
I need some help troubleshooting a strange issue. I have two relatively newly set up Ceph clusters (17.2.6) configured for replication. Objects with plain names (example.txt, for example) sync fine, but anything with a slash / folder-style prefix in the name (folder1/folder2/example.txt, for example) won't sync over. I'm not sure why this would be the case, as I'm fairly sure slashes are allowed in object names: https://docs.ceph.com/en/latest/radosgw/layout/. Any ideas, or something obvious I'm missing? Sync status looks normal, and I have tested this with a variety of new and old buckets; the behavior always stays the same: nothing with a slash syncs, but everything without one does.
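For reference, this is roughly how I've been testing it; the bucket name, endpoints and the aws CLI itself are just placeholders for whichever S3 client you prefer:

aws --endpoint-url http://rgw-zone-a.example:8080 s3 cp example.txt s3://testbucket/example.txt
aws --endpoint-url http://rgw-zone-a.example:8080 s3 cp example.txt s3://testbucket/folder1/folder2/example.txt
# then, on the secondary zone:
aws --endpoint-url http://rgw-zone-b.example:8080 s3 ls s3://testbucket --recursive
# only example.txt ever shows up; the folder1/folder2/example.txt key never does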
Thanks in advance,
-Matt Dunavant
Hi all,
I hope this message finds you well. We recently encountered an issue on one
of our OSD servers, leading to network flapping and subsequently causing
significant performance degradation across our entire cluster. Although the
OSDs were correctly marked as down in the monitor, slow ops persisted until
we resolved the network issue. This incident resulted in a major
disruption, especially affecting VMs with mapped RBD images, causing
them to freeze.
In light of this, I have two key questions for the community:
1. Why did slow ops persist even after marking the affected server as down
in the monitor?
2. Are there any recommended configurations for OSD suicide timeouts or OSD down
reports that could help us better handle similar network-related issues in
the future? (A rough sketch of the kind of options I mean is below.)
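For context, these are the kinds of options I have been looking at; the values below are only illustrative placeholders (roughly the defaults), not settings we actually run:

ceph config set osd osd_heartbeat_grace 20               # seconds peers wait before reporting an OSD dead
ceph config set mon mon_osd_min_down_reporters 2         # distinct reporters needed before the mon marks an OSD down
ceph config set mon mon_osd_reporter_subtree_level host  # spread the required reporters across this failure domain
ceph config set osd osd_op_thread_suicide_timeout 150    # seconds before a stuck op thread makes the OSD abort itself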
Best Regards,
Mahnoosh
Hi,
I am bootstrapping a ceph cluster using cephadm, and our cluster uses 3 networks.
We have
- 1 network as public network (10.X.X.0/24) (pub)
- 1 network as cluster network (10.X.Y.0/24) (cluster)
- 1 network for management (172.Z.Z.0/24) (mgmt)
The nodes are reachable over SSH only on the mgmt network. However, they are reachable for our services on the pub network, and I want my MONs to bind to this pub network.
But when I bootstrap my cluster, I set my MON IP and cluster network, and the bootstrap process then tries to add the bootstrap node using the MON IP, and fails because it cannot reach the node. If I apply the proper spec afterwards it works fine, but the bootstrap process itself did not finish properly.
Is there an option to tell cephadm not to use the MON IP but another address to access the node during bootstrap? Even with --skip-prepare-host it still tries to connect to it, and then fails.
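For reference, the bootstrap invocation looks roughly like this (the address is an anonymized placeholder on the pub network):

cephadm bootstrap --mon-ip 10.X.X.10 --cluster-network 10.X.Y.0/24
# 10.X.X.10 is the node's address on the pub network; cephadm then tries to add
# the bootstrap host over SSH using this MON IP, but SSH only works on 172.Z.Z.0/24,
# which is where it fails.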
Thanks,
Luis Domingues
Proton AG
Hi all,
We are testing migrations from a cluster running Pacific to Reef. In Pacific we needed to tweak osd_mclock_max_capacity_iops_hdd to get decent performance out of our cluster.
But in Reef it looks like changing the value of osd_mclock_max_capacity_iops_hdd does not impact cluster performance. Has osd_mclock_max_capacity_iops_hdd become useless?
I did not find anything regarding it in the changelogs, but I could have missed something.
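For reference, this is roughly how we set and check it (the value is only an illustrative placeholder, not what we actually use):

ceph config set osd osd_mclock_max_capacity_iops_hdd 450
ceph config show osd.0 osd_mclock_max_capacity_iops_hdd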
Luis Domingues
Proton AG
Hello,
I have a problem with my Ceph cluster (3x MON nodes, 6x OSD nodes; every
OSD node has 12 rotational disks and one NVMe device for the
BlueStore DB). Ceph is installed by the ceph orchestrator and
uses BlueFS storage on the OSDs.
I've started the upgrade process from version 17.2.6 to 18.2.1 by
invoking:
ceph orch upgrade start --ceph-version 18.2.1
After the upgrade of the MON and MGR processes, the orchestrator tried to
upgrade the first OSD node, but its OSDs keep crashing.
I've stopped the upgrade process, but now I have one OSD node
completely down.
After the upgrade I got some error messages and found
/var/lib/ceph/crashxxxx directories; I'm attaching to this message
the files I found there.
Please, can you advise what I can do now? It seems that RocksDB
is either incompatible or corrupted :-(
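(For completeness, I assume the same crash reports can also be listed from the cluster itself with the generic crash commands, e.g.:

ceph crash ls
ceph crash info <crash-id>

where <crash-id> is one of the IDs from the listing.)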
Thanks in advance.
Sincerely
Jan Marek
--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
Hello, Ceph users!
I have recently noticed that when I reboot a single ceph node,
ceph -s reports "5 hosts down" instead of one. The following
is captured during reboot of a node with two OSDs:
health: HEALTH_WARN
noout flag(s) set
2 osds down
5 hosts (2 osds) down
[...]
mon: 3 daemons, quorum mon1,mon3,mon2 (age 8h)
mgr: mon2(active, since 2d), standbys: mon3, mon1
osd: 34 osds: 32 up (since 2m), 34 in (since 4M)
flags noout
rgw: 1 daemon active (1 hosts, 1 zones)
After the node successfully reboots, ceph -s reports HEALTH_OK
and of course no OSDs and no hosts are reported as being down.
Does anybody else see this as well? This is Ceph 18.2.1, but I think
I have seen this on Ceph 17 as well.
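For reference, a quick generic way to see which hosts and OSDs Ceph is counting as down while the node reboots (nothing site-specific here):

ceph osd tree down      # shows only the down OSDs and the host buckets they sit under
ceph health detail      # expands the "N hosts down" warning into the individual hosts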
Thanks,
-Yenya
--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| https://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall
Hello,
I have an issue with a Ceph 17.2.6 cluster. The dashboard says "The Object
Gateway Service is not configured" when trying to access the Object Gateway
section. It used to work before.
One interesting symptom: the "admin" bucket exists in the output of
"radosgw-admin bucket list" but it does not exist in "radosgw-admin bucket
stats". Rather, I get a number of "ERROR: could not decode buffer info,
caught buffer::error" messages from the "radosgw-admin bucket stats"
command. Also, I cannot remove the "admin" bucket because I also get the
same error (I thought about starting fresh with the admin bucket).
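For reference, the commands involved look roughly like this (only the bucket name "admin" comes from my cluster; the rm flags are just the generic form, I am not sure of the exact invocation I used):

radosgw-admin bucket list
radosgw-admin bucket stats
radosgw-admin bucket stats --bucket=admin
radosgw-admin bucket rm --bucket=admin --purge-objects   # fails with the same buffer::error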
Could someone help me debug this further and eventually resolve the issue?
There is no critical data in the radosgw buckets (the cluster is primarily
accessed via CephFS), so clearing all radosgw buckets is an
option. Ideally, I could repair this, however.
Kind regards,
Manuel
Hi,
I just freshly deployed a new cluster (v18.2.1) using cephadm. Now,
before creating pools, CephFS and so on, I wanted to check that the
dashboard is working and that I get some metrics.
If I navigate to Cluster >> Hosts and open one of the OSD hosts, the
"Performance Details" tab is shown, but all graphs display "no data".
"OSDs" and "Raw Capacity" in that tab display "N/A".
Prometheus is running:
[root@cephmon-01 ~]# ceph orch ps --service_name prometheus
NAME                   HOST        PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
prometheus.cephmon-01  cephmon-01  *:9095  running (54m)  9m ago     3d   51.4M    -        2.43.0   a07b618ecd1d  11b4b19df0d6
However it has no data collected:
[root@cephmon-01 ~]# curl -s -XGET
http://127.0.0.1:9095/api/v1/targets/metadata
{"status":"success","data":[]}
ceph-exporter services also seem to be running:
[root@cephmon-01 ~]# ceph orch ps --service_name ceph-exporter
NAME                      HOST        PORTS  STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
ceph-exporter.cephmon-01  cephmon-01         running (46h)  10m ago    3d   16.7M    -        18.2.1   d2cdd87030d1  191eedccbcd8
ceph-exporter.cephmon-02  cephmon-02         running (3d)   2m ago     3d   16.5M    -        18.2.1   d2cdd87030d1  7a08e9f1401c
ceph-exporter.cephmon-03  cephmon-03         running (3d)   6m ago     3d   16.5M    -        18.2.1   d2cdd87030d1  1eb4856a60d4
ceph-exporter.cephosd-01  cephosd-01         running (2d)   10m ago    2d   19.9M    -        18.2.1   d2cdd87030d1  05642098f7de
ceph-exporter.cephosd-02  cephosd-02         running (2d)   10m ago    2d   19.9M    -        18.2.1   d2cdd87030d1  648715ecaa9d
ceph-exporter.cephosd-03  cephosd-03         running (2d)   10m ago    2d   19.4M    -        18.2.1   d2cdd87030d1  b8bb6dcb5386
ceph-exporter.cephosd-04  cephosd-04         running (2d)   10m ago    2d   19.5M    -        18.2.1   d2cdd87030d1  4f1964f79ffe
ceph-exporter.cephosd-05  cephosd-05         running (2d)   10m ago    2d   19.8M    -        18.2.1   d2cdd87030d1  8ca8cbbf3984
ceph-exporter.cephosd-06  cephosd-06         running (2d)   10m ago    2d   19.4M    -        18.2.1   d2cdd87030d1  a5e2860cc98e
ceph-exporter.cephosd-07  cephosd-07         running (2d)   3m ago     2d   19.8M    -        18.2.1   d2cdd87030d1  4eb01b8ebd33
ceph-exporter.cephosd-08  cephosd-08         running (2d)   3m ago     2d   19.9M    -        18.2.1   d2cdd87030d1  b934866d2a1d
ceph-exporter.cephosd-10  cephosd-10         running (2d)   3m ago     2d   19.4M    -        18.2.1   d2cdd87030d1  457368d07579
ceph-exporter.cephosd-11  cephosd-11         running (2d)   3m ago     2d   19.5M    -        18.2.1   d2cdd87030d1  e561cfac4209
ceph-exporter.cephosd-12  cephosd-12         running (2d)   9m ago     2d   19.9M    -        18.2.1   d2cdd87030d1  0e5773c8e038
as well as node-exporter services:
[root@cephmon-01 ~]# ceph orch ps --service_name node-exporter
NAME                      HOST        PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
node-exporter.cephmon-01  cephmon-01  *:9100  running (46h)  19s ago    3d   11.3M    -        1.5.0    0da6a335fe13  72fefb7966ff
node-exporter.cephmon-02  cephmon-02  *:9100  running (3d)   3m ago     3d   13.5M    -        1.5.0    0da6a335fe13  2041d3d385b0
node-exporter.cephmon-03  cephmon-03  *:9100  running (3d)   7m ago     3d   12.9M    -        1.5.0    0da6a335fe13  6ef204d12a7d
node-exporter.cephosd-01  cephosd-01  *:9100  running (2d)   18s ago    2d   13.0M    -        1.5.0    0da6a335fe13  6b05483b05c6
node-exporter.cephosd-02  cephosd-02  *:9100  running (2d)   18s ago    2d   11.4M    -        1.5.0    0da6a335fe13  ede4995ffb1d
node-exporter.cephosd-03  cephosd-03  *:9100  running (2d)   18s ago    2d   11.5M    -        1.5.0    0da6a335fe13  cfbf15168667
node-exporter.cephosd-04  cephosd-04  *:9100  running (2d)   18s ago    2d   13.2M    -        1.5.0    0da6a335fe13  5dc4794a7f6e
node-exporter.cephosd-05  cephosd-05  *:9100  running (2d)   18s ago    2d   13.4M    -        1.5.0    0da6a335fe13  8dfa1e252f82
node-exporter.cephosd-06  cephosd-06  *:9100  running (2d)   18s ago    2d   13.5M    -        1.5.0    0da6a335fe13  93467e37df08
node-exporter.cephosd-07  cephosd-07  *:9100  running (2d)   4m ago     2d   13.2M    -        1.5.0    0da6a335fe13  11795b83732d
node-exporter.cephosd-08  cephosd-08  *:9100  running (2d)   4m ago     2d   13.5M    -        1.5.0    0da6a335fe13  04197f3a6eb1
node-exporter.cephosd-10  cephosd-10  *:9100  running (2d)   4m ago     2d   13.2M    -        1.5.0    0da6a335fe13  9e904581442c
node-exporter.cephosd-11  cephosd-11  *:9100  running (2d)   4m ago     2d   13.0M    -        1.5.0    0da6a335fe13  5164113044ed
node-exporter.cephosd-12  cephosd-12  *:9100  running (2d)   16s ago    2d   13.3M    -        1.5.0    0da6a335fe13  5c8af368eed4
I'm a bit lost, how can I get this running?
Thanks for any help
Dietmar
Thanks for the response! Yes, it is in use.
The line "watcher=10.1.254.51:0/1544956346 client.39553300 cookie=140244238214096" indicates that a client has the image open.
I am using fio to run a write workload on it.
I guess the feature is not enabled correctly, or some setting is wrong somewhere. Should I restart any process after modifying the Ceph config?
Any thoughts?
I followed the document below to set up the image-level RBD persistent write-back cache;
however, I get an error while using the commands provided by the document.
I have put my commands and descriptions below.
Can anyone give me some pointers? Thanks in advance.
https://docs.ceph.com/en/pacific/rbd/rbd-persistent-write-back-cache/
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html/b… (Chapter 2. Ceph block devices, Red Hat Ceph Storage 5)
I tried using the host-level client config commands; I got no error, but I am not able to get any cache usage output:
ceph config set client rbd_persistent_cache_mode ssd
ceph config set client rbd_plugins pwl_cache
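I assume the stored values can be read back with the generic config get syntax, to confirm they were applied:

ceph config get client rbd_persistent_cache_mode
ceph config get client rbd_plugins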
[root@master-node1 ceph]# rbd info sas-pool/testdrive
rbd image 'testdrive':
size 40 GiB in 10240 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 3de76a7e7c519
block_name_prefix: rbd_data.3de76a7e7c519
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags:
create_timestamp: Thu Jun 29 02:03:41 2023
access_timestamp: Thu Jun 29 07:19:40 2023
modify_timestamp: Thu Jun 29 07:18:00 2023
I checked that the exclusive-lock feature is already enabled,
and when I run the following commands I get error output.
[root@master-node1 ceph]# rbd config image set sas-pool/testdrive rbd_persistent_cache_mode ssd
rbd: invalid config key: rbd_persistent_cache_mode
[root@master-node1 ceph]# rbd config image set sas-pool/testdrive rbd_plugins pwl_cache
rbd: invalid config key: rbd_plugins
root@node1:~# rbd status sas-pool/testdrive
Watchers:
watcher=10.1.254.51:0/1544956346 client.39553300 cookie=140244238214096
I was hoping to see the output include the persistent cache state, like below:
$ rbd status rbd/foo
Watchers:
watcher=10.10.0.102:0/1061883624 client.25496 cookie=140338056493088
Persistent cache state:
host: sceph9
path: /mnt/nvme0/rbd-pwl.rbd.101e5824ad9a.pool
size: 1 GiB
mode: ssd
stats_timestamp: Sun Apr 10 13:26:32 2022
present: true empty: false clean: false
allocated: 509 MiB
cached: 501 MiB
dirty: 338 MiB
free: 515 MiB
hits_full: 1450 / 61%
hits_partial: 0 / 0%
misses: 924
hit_bytes: 192 MiB / 66%
miss_bytes: 97 MiB