So I'm trying to figure out ways to reduce the number of warnings I'm
getting and I'm thinking about the one "client failing to respond to
cache pressure".
Is there maybe a way to tell a client (or all clients) to reduce the
amount of cache it uses or to release caches quickly? Like, all the time?
I know the Linux kernel (and maybe Ceph) likes to cache everything for a
while, and rightfully so, but I suspect that in my use case it may be more
efficient to purge the cache more quickly, or in general just cache
way less overall...?
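For example, this is the kind of thing I have in mind (just a rough sketch;
I'm assuming client_cache_size and mds_max_caps_per_client are the relevant
knobs here, and the values are made up):
# shrink the client's inode/dentry cache (userspace clients; default 16384)
ceph config set client client_cache_size 4096
# lower the per-client caps limit so the MDS recalls caps sooner (default 1048576)
ceph config set mds mds_max_caps_per_client 262144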
We have many thousands of threads all doing different things that are
hitting our filesystem, so I suspect the caching isn't really doing me
much good anyway due to the churn, and is probably causing more problems
than it's helping...
-erich
Hi,
We recently removed an OSD from our Ceph cluster. Its underlying disk has
a hardware issue.
We used the command: ceph orch osd rm osd_id --zap
During the process, the ceph cluster sometimes enters a warning state with slow
ops on this OSD. Our RGW then also fails to respond to requests and returns
503.
We restarted the RGW daemon to make it work again, but the same failure occurred
from time to time. Eventually we noticed that the RGW 503 errors are a result of
the OSD slow ops.
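(For reference, a sketch of how the correlation can be confirmed; osd.N is a
placeholder for the failing OSD's id:)
ceph health detail                    # lists the slow ops and the OSD involved
ceph daemon osd.N dump_ops_in_flight  # run on that OSD's host to see the stuck ops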
Our cluster has 18 hosts and 210 OSDs. We expected that removing an OSD with a
hardware issue would not impact cluster performance & RGW availability. Is our
expectation reasonable? What's the best way to handle OSDs with hardware
failures?
Thank you in advance for any comments or suggestions.
Best Regards,
Mary Zhang
On 4/26/24 15:47, Vahideh Alinouri wrote:
> The result of this command shows one of the servers in the cluster,
> but I have node-exporter daemons on all servers.
The default service specification looks like this:
service_type: node-exporter
service_name: node-exporter
placement:
  host_pattern: '*'
If you apply this YAML, the orchestrator should deploy one
node-exporter daemon to each host of the cluster.
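For example, if the spec is saved as node-exporter.yaml (the file name is just
an example), it can be applied with:
ceph orch apply -i node-exporter.yaml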
Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
Hi,
This looks like a similar case to the previously fixed https://tracker.ceph.com/issues/48382 - https://github.com/ceph/ceph/pull/47308.
Confirmed on cephadm-deployed Ceph 18.2.2/17.2.7 with OpenStack Antelope/Yoga.
I'm getting a "404 NoSuchBucket" error with public buckets when Swift/Keystone integration is enabled - everything else works fine.
With rgw_swift_account_in_url = true and proper endpoints ("https://rgw.test/swift/v1/AUTH_%(project_id)s"),
ticking public access in Horizon properly sets the ACL on the bucket, according to the swift client:
swift -v stat test-bucket
URL: https://rgw.test/swift/v1/AUTH_daksjhdkajdshda/testbucket
Auth Token:
Account: AUTH_daksjhdkajdshda
Container: testbucket
Objects: 1
Bytes: 1021036
Read ACL: .r:*,.rlistings
Write ACL:
Sync To:
Sync Key:
X-Timestamp: 1710947159.41219
X-Container-Bytes-Used-Actual: 1024000
X-Storage-Policy: default-placement
X-Storage-Class: STANDARD
Last-Modified: Thu, 21 Mar 2024 10:30:05 GMT
X-Trans-Id: tx00000092ac12312312312-1231231231-1701e5-default
X-Openstack-Request-Id: tx00000092ac12312312312-1231231231-1701e5-default
Accept-Ranges: bytes
Content-Type: text/plain; charset=utf-8
However, anonymous access still gets the 404 NoSuchBucket error.
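For reference, this is the kind of anonymous request that fails (host and
account are the placeholders from above; with .r:*,.rlistings set I would
expect a 200 listing):
curl -i https://rgw.test/swift/v1/AUTH_daksjhdkajdshda/testbucket/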
Could someone using the latest version of Ceph with Swift/Keystone integration please test public buckets? Thank you.
Best regards,
Bartosz Bezak
Hi guys,
I have tried to add node-exporter to a new host in the Ceph cluster using
the command mentioned in the documentation:
ceph orch apply node-exporter hostname
I think there is a functionality issue, because the cephadm log printed that
node-exporter was applied successfully, but it didn't work!
I tried the below command and it worked!
ceph orch daemon add node-exporter hostname
Which way is the correct way?
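For comparison, the declared spec and the actually running daemons can be
checked with, for example:
ceph orch ls node-exporter                 # what the service spec / placement says
ceph orch ps --daemon-type node-exporter   # which daemons actually run on which hosts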
Hi guys,
I need to set up Ceph over RDMA, but I have run into many issues!
The info regarding my cluster:
Ceph version is Reef
The network cards are Broadcom RDMA NICs.
The RDMA connection between the OSD nodes is OK.
I found the ms_type = async+rdma option in the documentation and applied it using
ceph config set global ms_type async+rdma
After this the cluster crashed. To bring the cluster back, I did the following:
Put ms_type async+posix in ceph.conf
Restart all MON services
The cluster is back, but I don't have any active mgr. All OSDs are down too.
Is there a recommended order of steps for setting up Ceph over RDMA?
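For reference, these are the RDMA-related options I have found so far (only a
sketch; bnxt_re0 is just my guess for the Broadcom device name, it can be
checked with ibv_devices):
# enable RDMA on the cluster (backend) network only, keep posix on the public network
ceph config set global ms_cluster_type async+rdma
ceph config set global ms_async_rdma_device_name bnxt_re0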
Thanks
Dear colleagues, hope that anybody can help us.
The initial point: a Ceph cluster v15.2 (installed and controlled by Proxmox) with 3 nodes based on physical servers rented from a cloud provider. CephFS is also installed.
Yesterday we discovered that some of the applications had stopped working. During the investigation we recognized that we have a problem with Ceph, more precisely with CephFS - the MDS daemons suddenly crashed. We tried to restart them and found that they crashed again immediately after starting. The crash information:
2024-04-17T17:47:42.841+0000 7f959ced9700 1 mds.0.29134 recovery_done -- successful recovery!
2024-04-17T17:47:42.853+0000 7f959ced9700 1 mds.0.29134 active_start
2024-04-17T17:47:42.881+0000 7f959ced9700 1 mds.0.29134 cluster recovered.
2024-04-17T17:47:43.825+0000 7f959aed5700 -1 ./src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 7f959aed5700 time 2024-04-17T17:47:43.831243+0000
./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)
Over the next hours we read tons of articles, studied the documentation, and checked the overall state of the Ceph cluster with various diagnostic commands - but didn't find anything wrong. In the evening we decided to upgrade it to v16, and finally to v17.2.7. Unfortunately, that didn't solve the problem; the MDS continues to crash with the same error. The only difference we found is "1 MDSs report damaged metadata" in the output of ceph -s - see it below.
I supposed that it might be a well-known bug, but couldn't find a matching one on https://tracker.ceph.com - there are several bugs associated with OpenFileTable.cc, but none related to ceph_assert(count > 0).
We also checked the source code of OpenFileTable.cc; here is a fragment of it, from the function OpenFileTable::_journal_finish:
int omap_idx = anchor.omap_idx;
unsigned& count = omap_num_items.at(omap_idx);
ceph_assert(count > 0);
So, we guess that the object map is empty for some object in Ceph, which is unexpected behavior. But again, we found nothing wrong in our cluster…
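In case it is relevant, the open file table objects can also be inspected directly in the metadata pool (a sketch; <metadata-pool> is a placeholder, the objects are named mds<rank>_openfiles.<index>):
rados -p <metadata-pool> ls | grep openfiles
rados -p <metadata-pool> listomapkeys mds0_openfiles.0 | wc -l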
Next, we turned to the https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ article - we tried to reset the journal (even though it was OK all the time) and wipe the sessions using the cephfs-table-tool all reset session command. No result…
Now I have decided to continue following this article and run the cephfs-data-scan scan_extents command; it is running right now. But I doubt it will solve the issue, since there seems to be nothing wrong with our objects in Ceph.
Is it a new bug? Or something else? Any idea is welcome!
The important outputs:
----- ceph -s
  cluster:
    id:     4cd1c477-c8d0-4855-a1f1-cb71d89427ed
    health: HEALTH_ERR
            1 MDSs report damaged metadata
            insufficient standby MDS daemons available
            83 daemons have recently crashed
            3 mgr modules have recently crashed
  services:
    mon: 3 daemons, quorum asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 (age 22h)
    mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
    mds: 1/1 daemons up
    osd: 18 osds: 18 up (since 22h), 18 in (since 29h)
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 289 pgs
    objects: 29.72M objects, 5.6 TiB
    usage:   21 TiB used, 47 TiB / 68 TiB avail
    pgs:     287 active+clean
             2 active+clean+scrubbing+deep
  io:
    client: 2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr
-----ceph fs dump
e29480
enable_multiple, ever_enabled_multiple: 0,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1
Filesystem 'cephfs' (1)
fs_name cephfs
epoch 29480
flags 12 joinable allow_snaps allow_multimds_snaps
created 2022-11-25T15:56:08.507407+0000
modified 2024-04-18T16:52:29.970504+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 14728
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=156636152}
failed
damaged
stopped
data_pools [5]
metadata_pool 6
inline_data disabled
balancer
standby_count_wanted 1
[mds.asrv-dev-stor-1{0:156636152} state up:active seq 6 laggy since 2024-04-18T16:52:29.970479+0000 addr [v2:172.22.2.91:6800/2487054023,v1:172.22.2.91:6801/2487054023] compat {c=[1],r=[1],i=[7ff]}]
-----cephfs-journal-tool --rank=cephfs:0 journal inspect
Overall journal integrity: OK
-----ceph pg dump summary
version 41137
stamp 2024-04-18T21:17:59.133536+0000
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG
sum 29717605 0 0 0 0 6112544251872 13374192956 28493480 1806575 1806575
OSD_STAT USED AVAIL USED_RAW TOTAL
sum 21 TiB 47 TiB 21 TiB 68 TiB
-----ceph pg dump pools
POOLID OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG
8 31771 0 0 0 0 131337887503 2482 140 401246 401246
7 839707 0 0 0 0 3519034650971 736 61 399328 399328
6 1319576 0 0 0 0 421044421 13374189738 28493279 206749 206749
5 27526539 0 0 0 0 2461702171417 0 0 792165 792165
2 12 0 0 0 0 48497560 0 0 6991 6991
Hi you all,
I'm fairly new to Ceph and I'm understanding, day by day, why the
official support is so expensive :)
I'm setting up a Ceph NFS cluster; the recipe I followed can be found
below.
#######################
--> cluster creation
cephadm bootstrap --mon-ip 10.20.20.81 --cluster-network 10.20.20.0/24 \
  --fsid $FSID --initial-dashboard-user adm \
  --initial-dashboard-password 'Hi_guys' --dashboard-password-noupdate \
  --allow-fqdn-hostname --ssl-dashboard-port 443 \
  --dashboard-crt /etc/ssl/wildcard.it/wildcard.it.crt \
  --dashboard-key /etc/ssl/wildcard.it/wildcard.it.key \
  --allow-overwrite --cleanup-on-failure
cephadm shell --fsid $FSID -c /etc/ceph/ceph.conf \
  -k /etc/ceph/ceph.client.admin.keyring
cephadm add-repo --release reef && cephadm install ceph-common
--> adding hosts and set labels
for IP in $(grep ceph /etc/hosts | awk '{print $1}') ; do ssh-copy-id -f
-i /etc/ceph/ceph.pub root@$IP ; done
ceph orch host add cephstage01 10.20.20.81 --labels _admin,mon,mgr,prometheus,grafana
ceph orch host add cephstage02 10.20.20.82 --labels _admin,mon,mgr,prometheus,grafana
ceph orch host add cephstage03 10.20.20.83 --labels _admin,mon,mgr,prometheus,grafana
ceph orch host add cephstagedatanode01 10.20.20.84 --labels osd,nfs,prometheus
ceph orch host add cephstagedatanode02 10.20.20.85 --labels osd,nfs,prometheus
ceph orch host add cephstagedatanode03 10.20.20.86 --labels osd,nfs,prometheus
--> network setup and daemons deploy
ceph config set mon public_network 10.20.20.0/24,192.168.7.0/24
ceph orch apply mon \
  --placement="cephstage01:10.20.20.81,cephstage02:10.20.20.82,cephstage03:10.20.20.83"
ceph orch apply mgr \
  --placement="cephstage01:10.20.20.81,cephstage02:10.20.20.82,cephstage03:10.20.20.83"
ceph orch apply prometheus \
  --placement="cephstage01:10.20.20.81,cephstage02:10.20.20.82,cephstage03:10.20.20.83,cephstagedatanode01:10.20.20.84,cephstagedatanode02:10.20.20.85,cephstagedatanode03:10.20.20.86"
ceph orch apply grafana \
  --placement="cephstage01:10.20.20.81,cephstage02:10.20.20.82,cephstage03:10.20.20.83,cephstagedatanode01:10.20.20.84,cephstagedatanode02:10.20.20.85,cephstagedatanode03:10.20.20.86"
ceph orch apply node-exporter
ceph orch apply alertmanager
ceph config set mgr mgr/cephadm/secure_monitoring_stack true
--> disks and osd setup
for IP in $(grep cephstagedatanode /etc/hosts | awk '{print $1}') ; do
  ssh root@$IP "hostname && wipefs -a -f /dev/sdb && wipefs -a -f /dev/sdc" ; done
ceph config set mgr mgr/cephadm/device_enhanced_scan true
for IP in $(grep cephstagedatanode /etc/hosts | awk '{print $1}') ; do
  ceph orch device ls --hostname=$IP --wide --refresh ; done
for IP in $(grep cephstagedatanode /etc/hosts | awk '{print $1}') ; do
  ceph orch device zap $IP /dev/sdb ; done
for IP in $(grep cephstagedatanode /etc/hosts | awk '{print $1}') ; do
  ceph orch device zap $IP /dev/sdc ; done
for IP in $(grep cephstagedatanode /etc/hosts | awk '{print $1}') ; do
  ceph orch daemon add osd $IP:/dev/sdb ; done
for IP in $(grep cephstagedatanode /etc/hosts | awk '{print $1}') ; do
  ceph orch daemon add osd $IP:/dev/sdc ; done
--> ganesha nfs cluster
ceph mgr module enable nfs
ceph fs volume create vol1
ceph nfs cluster create nfs-cephfs \
  "cephstagedatanode01,cephstagedatanode02,cephstagedatanode03" \
  --ingress --virtual-ip 192.168.7.80 --ingress-mode default
ceph nfs export create cephfs --cluster-id nfs-cephfs \
  --pseudo-path /mnt --fsname vol1
--> nfs mount
mount -t nfs -o nfsvers=4.1,proto=tcp 192.168.7.80:/mnt /mnt/ceph
Is my recipe correct?
The cluster is made up of 3 mon/mgr nodes and 3 osd/nfs nodes; in the
latter I installed one 3 TB SSD for the data and one 300 GB SSD for the
journaling.
My problems are:
- Although I can mount the export, I can't write to it
- I can't understand how to use the sdc disks for journaling (see the spec sketch right after this list)
- I can't understand the concept of "pseudo path"
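For the journaling question, this is the kind of OSD spec I found in the
documentation but have not tried yet (a sketch; the service_id is made up):
cat > osd-spec.yaml <<EOF
service_type: osd
service_id: osd-with-db
placement:
  label: osd
spec:
  data_devices:
    paths:
      - /dev/sdb
  db_devices:
    paths:
      - /dev/sdc
EOF
ceph orch apply -i osd-spec.yaml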
here below you can find the json output of the exports
--> check
ceph nfs export ls nfs-cephfs
ceph nfs export info nfs-cephfs /mnt
------------------------------------
json file
---------
{
  "export_id": 1,
  "path": "/",
  "cluster_id": "nfs-cephfs",
  "pseudo": "/mnt",
  "access_type": "RW",
  "squash": "none",
  "security_label": true,
  "protocols": [
    4
  ],
  "transports": [
    "TCP"
  ],
  "fsal": {
    "name": "CEPH",
    "user_id": "nfs.nfs-cephfs.1",
    "fs_name": "vol1"
  },
  "clients": []
}
------------------------------------
Thanks in advance
Rob
Hi,
I'm trying to estimate the possible impact when large PGs are
split. Here's one example of such a PG:
PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]
Their main application is RGW on EC (currently 1024 PGs on 240 OSDs),
8TB HDDs backed by SSDs. There are 6 RGWs running behind HAProxies. It
took me a while to convince them to do a PG split and now they're
trying to assess how big the impact could be. The fullest OSD is
already at 85% usage, the least filled one at 59%, so there is
definitely room for better balancing, which will be necessary until
the new hardware arrives. The current distribution is around 100 PGs
per OSD, which would usually be fine, but since the PGs are that large,
a difference of only a few PGs has a huge impact on OSD utilization.
I'm targeting 2048 PGs for that pool for now, and will probably do another
split once the new hardware has been integrated.
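In commands, the rough plan would be something like this (the throttle values
are only examples, the pool name is a placeholder):
# limit how much data may be misplaced at once while pgp_num is ramped up (default 0.05)
ceph config set mgr target_max_misplaced_ratio 0.01
# keep backfill pressure on the HDD OSDs low
ceph config set osd osd_max_backfills 1
# then raise pg_num; pgp_num follows in small steps
ceph osd pool set <rgw-data-pool> pg_num 2048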
Any comments are appreciated!
Eugen
Hi,
We're testing with rbd-mirror (mode snapshot) and are trying to get status
updates about snapshots as fast as possible. We want to use rbd-mirror as
a migration tool between two clusters and keep the downtime during migration
as short as possible. Therefore we have tuned the following parameters
and set them to 1 second (default 30 seconds):
rbd_mirror_pool_replayers_refresh_interval
rbd_mirror_image_state_check_interval
rbd_mirror_sync_point_update_age
However, on the destination cluster, the "last_update:" field is only
updated every 30 seconds. Is this tunable?
The goal is to determine when the last snapshot made on the source
has made it to the target, so that a demote (source) and promote (target) can
be initiated.
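For context, we currently watch the status on the target cluster with
something like the following (pool/image names are placeholders):
rbd mirror pool status <pool> --verbose
rbd mirror image status <pool>/<image>    # this is where last_update shows up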
Gr. Stefan