September 2023 - ceph-users

by Ben

Hi, It is running 17.2.5. there are slow requests warnings in cluster log. ceph tell mds.5 dump_ops_in_flight, get the following. These look like outdated and clients were k8s pods. There are warning of the kind in other mds as well. How could they be cleaned from warnings safely? Many thanks. { "ops": [ { "description": "peer_request(mds.3:5311742.0 authpin)", "initiated_at": "2023-09-14T12:25:43.092201+0000", "age": 926013.05098558997, "duration": 926013.051015759, "type_data": { "flag_point": "dispatched", "reqid": "mds.3:5311742", "op_type": "peer_request", "leader_info": { "leader": "3" }, "request_info": { "attempt": 0, "op_type": "authpin", "lock_type": 0, "object_info": "0x60001205d6d.head", "srcdnpath": "", "destdnpath": "", "witnesses": "", "has_inode_export": false, "inode_export_v": 0, "op_stamp": "0.000000" }, "events": [ { "time": "2023-09-14T12:25:43.092201+0000", "event": "initiated" }, { "time": "2023-09-14T12:25:43.092202+0000", "event": "throttled" }, { "time": "2023-09-14T12:25:43.092201+0000", "event": "header_read" }, { "time": "2023-09-14T12:25:43.092207+0000", "event": "all_read" }, { "time": "2023-09-14T12:25:43.092218+0000", "event": "dispatched" } ] } }, { "description": "peer_request(mds.3:5311743.0 authpin)", "initiated_at": "2023-09-14T12:25:43.092371+0000", "age": 926013.05081614305, "duration": 926013.05089185503, "type_data": { "flag_point": "dispatched", "reqid": "mds.3:5311743", "op_type": "peer_request", "leader_info": { "leader": "3" }, "request_info": { "attempt": 0, "op_type": "authpin", "lock_type": 0, "object_info": "0x60001205d6d.head", "srcdnpath": "", "destdnpath": "", "witnesses": "", "has_inode_export": false, "inode_export_v": 0, "op_stamp": "0.000000" }, "events": [ { "time": "2023-09-14T12:25:43.092371+0000", "event": "initiated" }, { "time": "2023-09-14T12:25:43.092371+0000", "event": "throttled" }, { "time": "2023-09-14T12:25:43.092371+0000", "event": "header_read" }, { "time": "2023-09-14T12:25:43.092374+0000", "event": "all_read" }, { "time": "2023-09-14T12:25:43.092381+0000", "event": "dispatched" } ] } }, { "description": "peer_request(mds.4:4503615.0 authpin)", "initiated_at": "2023-09-14T13:40:25.150040+0000", "age": 921530.99314722, "duration": 921530.99326053297, "type_data": { "flag_point": "dispatched", "reqid": "mds.4:4503615", "op_type": "peer_request", "leader_info": { "leader": "4" }, "request_info": { "attempt": 0, "op_type": "authpin", "lock_type": 0, "object_info": "0x60001205c4f.head", "srcdnpath": "", "destdnpath": "", "witnesses": "", "has_inode_export": false, "inode_export_v": 0, "op_stamp": "0.000000" }, "events": [ { "time": "2023-09-14T13:40:25.150040+0000", "event": "initiated" }, { "time": "2023-09-14T13:40:25.150040+0000", "event": "throttled" }, { "time": "2023-09-14T13:40:25.150040+0000", "event": "header_read" }, { "time": "2023-09-14T13:40:25.150045+0000", "event": "all_read" }, { "time": "2023-09-14T13:40:25.150053+0000", "event": "dispatched" } ] } }, { "description": "client_request(client.460983:5731820 getattr pAsLsXsFs #0x60001205c4f 2023-09-14T13:40:25.144336+0000 caller_uid=0, caller_gid=0{})", "initiated_at": "2023-09-14T13:40:25.150176+0000", "age": 921530.99301089498, "duration": 921530.99316312897, "type_data": { "flag_point": "failed to authpin, inode is being exported", "reqid": "client.460983:5731820", "op_type": "client_request", "client_info": { "client": "client.460983", "tid": 5731820 }, "events": [ { "time": "2023-09-14T13:40:25.150176+0000", "event": "initiated" }, { "time": "2023-09-14T13:40:25.150177+0000", "event": "throttled" }, { "time": "2023-09-14T13:40:25.150176+0000", "event": "header_read" }, { "time": "2023-09-14T13:40:25.150180+0000", "event": "all_read" }, { "time": "2023-09-14T13:40:25.150186+0000", "event": "dispatched" }, { "time": "2023-09-14T13:40:25.150195+0000", "event": "failed to authpin, inode is being exported" } ] } } ], "num_ops": 4 }

7 months

2
4
0 0

Ceph 16.2.x excessive logging, how to reduce?

by Zakhar Kirpichenko

Hi, Our Ceph 16.2.x cluster managed by cephadm is logging a lot of very detailed messages, Ceph logs alone on hosts with monitors and several OSDs has already eaten through 50% of the endurance of the flash system drives over a couple of years. Cluster logging settings are default, and it seems that all daemons are writing lots and lots of debug information to the logs, such as for example: https://pastebin.com/ebZq8KZk (it's just a snippet, but there's lots and lots of various information). Is there a way to reduce the amount of logging and, for example, limit the logging to warnings or important messages so that it doesn't include every successful authentication attempt, compaction etc, etc, when the cluster is healthy and operating normally? I would very much appreciate your advice on this. Best regards, Zakhar

7 months

4
12
0 0

CEPH complete cluster failure: unknown PGS

by v1tnam＠gmail.com

I have an 8-node cluster with old hardware. a week ago 4 nodes went down and the CEPH cluster went nuts. All pgs became unknown and montors took too long to be in sync. So i reduced the number of mons to one and mgrs to one as well Now the recovery starts with 100% unknown pgs and then pgs start to move ot inactive . It generally fails to recover in the middle and starts from scratch. It's hold hardware and OSDs have lots of slow ops and probably number of bad sectors as well Any suggestions on how to tackle this. It's a nautilus cluster and pretty old (8-year old hardware) Thanks

7 months

3
2
0 0

RGW multisite - requesting help for fixing error_code: 125

by Jayanth Reddy

Hello Users, We're running 2 Ceph clusters with v17.2.6 and noticing the error message in # radosgw-admin sync error list *"message": "failed to sync bucket instance: (125) Operation canceled"* We've the output as below, [ { "shard_id": 0, "entries": [ { "id": "1_1690711173.869335_133603.1", "section": "data", "name": "b1:d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1:2897", "timestamp": "2023-07-30T09:59:33.869335Z", "info": { "source_zone": "d09d3d16-8601-448b-bf3d-609b8a29647d", "error_code": 125, "message": "failed to sync bucket instance: (125) Operation canceled" } }, { "id": "1_1690711175.505687_133683.1", "section": "data", "name": "b1:d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1:1719", "timestamp": "2023-07-30T09:59:35.505687Z", "info": { "source_zone": "d09d3d16-8601-448b-bf3d-609b8a29647d", "error_code": 125, "message": "failed to sync bucket instance: (125) Operation canceled" } }, and with around 26236 errors # radosgw-admin sync error list | grep -i "(125) Operation canceled" | wc -l 26236 I'm trying to fix these by rewriting objects but I'm having trouble finding the exact object names and the procedure. Any help is really appreciated! Thanks, Jayanth

7 months, 1 week

2
1
0 0

Balancer blocked as autoscaler not acting on scaling change

by bc10＠sanger.ac.uk

Hi Folks, We are currently running with one nearfull OSD and 15 nearfull pools. The most full OSD is about 86% full but the average is 58% full. However, the balancer is skipping a pool on which the autoscaler is trying to complete a pg_num reduction from 131,072 to 32,768 (default.rgw.buckets.data pool). However, the autoscaler has been working on this for the last 20 days, it works through a list of objects that are misplaced but when it gets close to the end, more objects get added to the list. This morning I observed the list get down to c. 7,000 objects misplaced with 2 PGs active+remapped+backfilling, one PG completed the backfilling then the list shot up to c. 70,000 objects misplaced with 3 PGs active+remapped+backfilling. Has anyone come across this behaviour before? If so, what was your remediation? Thanks in advance for sharing. Bruno Cluster details: 3,068 OSDs when all running, c. 60 per storage node OS: Ubuntu 20.04 Ceph: Pacific 16.2.13 from Ubuntu Cloud Archive Use case: S3 storage and OpenStack backend, all pools three-way replicated

7 months, 1 week

5
4
0 0

Re: VM hangs when overwriting a file on erasure coded RBD

by peter.linder＠fiberdirekt.se

Yes, this is all set up. It was working fine until after the problem with the osd host that lost the cluster/sync network occured. There are a few other VMs that keep running along fine without this issue. I've restarted the problematic VM without success (that is, creating a file works, but overwriting it still hangs right away). fsck runs fine so reading the whole image works. I'm kind of stumped as to what can cause this. Because of the lengthy recovery, and then pg autoscaler currenty doing things there are currently lots of PGs that haven't been scrubbed, but I doubt that is an issue here. Den 2023-09-29 kl. 18:52, skrev Anthony D'Atri: > EC for RBD wasn't possible until Luminous IIRC, so I had to ask. You have a replicated metadata pool defined? Does proxmox know that this is an EC pool? When connecting it needs to know both the metata and data pools. > >> On Sep 29, 2023, at 12:49, peter.linder(a)fiberdirekt.se wrote: >> >> (sorry for duplicate emails) >> >> This turns out to be a good question actually. >> >> The cluster is running Quincy, 17.2.6. >> >> The compute node that is running the VM is proxmox, version 7.4-3. Supposedly this is fairly new, but the version of librbd1 claims to be 14.2.21 when I check with "apt list". We are not using proxmox's own ceph cluster release. However we haven't had any issues with this setup before, but we haven't been using neither erasure coded pools nor had the node-half-dead problem for such a long time. >> >> The VM is configured using proxmox which is not libvirt but similar, and krbd is not enabled. I don't know for sure if proxmox has its own librbd linked in qemu/kvm. >> >> "ceph features" looks like this: >> >> { >> "mon": [ >> { >> "features": "0x3f01cfbf7ffdffff", >> "release": "luminous", >> "num": 5 >> } >> ], >> "osd": [ >> { >> "features": "0x3f01cfbf7ffdffff", >> "release": "luminous", >> "num": 24 >> } >> ], >> "client": [ >> { >> "features": "0x3f01cfb87fecffff", >> "release": "luminous", >> "num": 4 >> }, >> { >> "features": "0x3f01cfbf7ffdffff", >> "release": "luminous", >> "num": 12 >> } >> ], >> "mgr": [ >> { >> "features": "0x3f01cfbf7ffdffff", >> "release": "luminous", >> "num": 2 >> } >> ] >> } >> >> Regards, >> >> Peter >> >> >> Den 2023-09-29 kl. 17:55, skrev Anthony D'Atri: >>> Which Ceph releases are installed on the VM and the back end? Is the VM using librbd through libvirt, or krbd? >>> >>>> On Sep 29, 2023, at 09:09, Peter Linder <peter.linder(a)fiberdirekt.se> wrote: >>>> >>>> Dear all, >>>> >>>> I have a problem that after an OSD host lost connection to the sync/cluster rear network for many hours (the public network was online), a test VM using RBD cant overwrite its files. I can create a new file inside it just fine, but not overwrite it, the process just hangs. >>>> >>>> The VM's disk is on an erasure coded data pool and a replicated pool in front of it. EC overwrites is on for the pool. >>>> >>>> The custer consists of 5 hosts and 4 OSDs on each, and separate hosts for compute. There is a public and separate cluster network, separated. In this case, the AOC cable to the cluster network went link down on a host and it had to be replaced and the host was rebooted. Recovery took about a week to complete. The host was half-down for about 12 hours like this. >>>> >>>> I have some other VMs as well with images in the same pool (totally 4), and they seem to work fine, it is just this one that cant overwrite. >>>> >>>> I'm thinking there is somehow something wrong with just this image? >>>> >>>> Regards, >>>> >>>> Peter >>>> _______________________________________________ >>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io

7 months, 1 week

2
1
0 0

cephfs health warn

by Ben

Hi, see below for details of warnings. the cluster is running 17.2.5. the warnings have been around for a while. one concern of mine is num_segments growing over time. clients with warn of MDS_CLIENT_OLDEST_TID increase from 18 to 25 as well. The nodes are with kernel 4.19.0-91.82.42.uelc20.x86_64. It looks like bugs with client library. And rebooting nodes with problem will fix it for short period of time? Any suggestions from community for fixing? Thanks, Ben [root@8cd2c0657c77 /]# ceph health detail HEALTH_WARN 6 hosts fail cephadm check; 2 clients failing to respond to capability release; 25 clients failing to advance oldest client/flush tid; 3 MDSs report slow requests; 3 MDSs behind on trimming [WRN] CEPHADM_HOST_CHECK_FAILED: 6 hosts fail cephadm check host host15w (192.168.31.33) failed check: Unable to reach remote host host15w. Process exited with non-zero exit status 1 host host20w (192.168.31.38) failed check: Unable to reach remote host host20w. Process exited with non-zero exit status 1 host host19w (192.168.31.37) failed check: Unable to reach remote host host19w. Process exited with non-zero exit status 1 host host17w (192.168.31.35) failed check: Unable to reach remote host host17w. Process exited with non-zero exit status 1 host host18w (192.168.31.36) failed check: Unable to reach remote host host18w. Process exited with non-zero exit status 1 host host16w (192.168.31.34) failed check: Unable to reach remote host host16w. Process exited with non-zero exit status 1 [WRN] MDS_CLIENT_LATE_RELEASE: 2 clients failing to respond to capability release mds.code-store.host18w.fdsqff(mds.1): Client k8s-node36 failing to respond to capability release client_id: 460983 mds.code-store.host16w.vucirx(mds.3): Client failing to respond to capability release client_id: 460983 [WRN] MDS_CLIENT_OLDEST_TID: 25 clients failing to advance oldest client/flush tid mds.code-store.host18w.fdsqff(mds.1): Client k8s-node36 failing to advance its oldest client/flush tid. client_id: 460983 mds.code-store.host18w.fdsqff(mds.1): Client failing to advance its oldest client/flush tid. client_id: 460226 mds.code-store.host18w.fdsqff(mds.1): Client k8s-node32 failing to advance its oldest client/flush tid. client_id: 239797 mds.code-store.host15w.reolpx(mds.5): Client k8s-node34 failing to advance its oldest client/flush tid. client_id: 460226 mds.code-store.host15w.reolpx(mds.5): Client k8s-node32 failing to advance its oldest client/flush tid. client_id: 239797 mds.code-store.host15w.reolpx(mds.5): Client failing to advance its oldest client/flush tid. client_id: 460983 mds.code-store.host18w.rtyvdy(mds.7): Client k8s-node34 failing to advance its oldest client/flush tid. client_id: 460226 mds.code-store.host18w.rtyvdy(mds.7): Client failing to advance its oldest client/flush tid. client_id: 239797 mds.code-store.host18w.rtyvdy(mds.7): Client k8s-node36 failing to advance its oldest client/flush tid. client_id: 460983 mds.code-store.host17w.kcdopb(mds.2): Client failing to advance its oldest client/flush tid. client_id: 239797 mds.code-store.host17w.kcdopb(mds.2): Client failing to advance its oldest client/flush tid. client_id: 460983 mds.code-store.host17w.kcdopb(mds.2): Client k8s-node34 failing to advance its oldest client/flush tid. client_id: 460226 mds.code-store.host17w.kcdopb(mds.2): Client k8s-node24 failing to advance its oldest client/flush tid. client_id: 12072730 mds.code-store.host20w.bfoftp(mds.4): Client k8s-node32 failing to advance its oldest client/flush tid. client_id: 239797 mds.code-store.host20w.bfoftp(mds.4): Client k8s-node36 failing to advance its oldest client/flush tid. client_id: 460983 mds.code-store.host19w.ywrmiz(mds.6): Client k8s-node24 failing to advance its oldest client/flush tid. client_id: 12072730 mds.code-store.host19w.ywrmiz(mds.6): Client k8s-node34 failing to advance its oldest client/flush tid. client_id: 460226 mds.code-store.host19w.ywrmiz(mds.6): Client failing to advance its oldest client/flush tid. client_id: 239797 mds.code-store.host19w.ywrmiz(mds.6): Client failing to advance its oldest client/flush tid. client_id: 460983 mds.code-store.host16w.vucirx(mds.3): Client failing to advance its oldest client/flush tid. client_id: 460983 mds.code-store.host16w.vucirx(mds.3): Client failing to advance its oldest client/flush tid. client_id: 460226 mds.code-store.host16w.vucirx(mds.3): Client failing to advance its oldest client/flush tid. client_id: 239797 mds.code-store.host17w.pdziet(mds.0): Client k8s-node32 failing to advance its oldest client/flush tid. client_id: 239797 mds.code-store.host17w.pdziet(mds.0): Client k8s-node34 failing to advance its oldest client/flush tid. client_id: 460226 mds.code-store.host17w.pdziet(mds.0): Client k8s-node36 failing to advance its oldest client/flush tid. client_id: 460983 [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests mds.code-store.host15w.reolpx(mds.5): 4 slow requests are blocked > 5 secs mds.code-store.host20w.bfoftp(mds.4): 6 slow requests are blocked > 5 secs mds.code-store.host16w.vucirx(mds.3): 97 slow requests are blocked > 5 secs [WRN] MDS_TRIM: 3 MDSs behind on trimming mds.code-store.host15w.reolpx(mds.5): Behind on trimming (25831/128) max_segments: 128, num_segments: 25831 mds.code-store.host20w.bfoftp(mds.4): Behind on trimming (27605/128) max_segments: 128, num_segments: 27605 mds.code-store.host16w.vucirx(mds.3): Behind on trimming (28676/128) max_segments: 128, num_segments: 28676

7 months, 1 week

2
8
0 0

radosgw-admin sync error trim seems to do nothing

by Richard Bade

Hi Matthew, At least for nautilus (14.2.22) i have discovered through trial and error that you need to specify a beginning or end date. Something like this: radosgw-admin sync error trim --end-date="2023-08-20 23:00:00" --rgw-zone={your_zone_name} I specify the zone as there's a error list for each zone. Hopefully that helps. Rich ------------------------------ Date: Sat, 19 Aug 2023 12:48:55 -0400 From: Matthew Darwin <bugs(a)mdarwin.ca> Subject: [ceph-users] radosgw-admin sync error trim seems to do nothing To: Ceph Users <ceph-users(a)ceph.io> Message-ID: <95e7edfd-ca29-fc0e-a30a-987f1c43e2d4(a)mdarwin.ca> Content-Type: text/plain; charset=UTF-8; format=flowed Hello all, "radosgw-admin sync error list" returns errors from 2022. I want to clear those out. I tried "radosgw-admin sync error trim" but it seems to do nothing. The man page seems to offer no suggestions https://protect-au.mimecast.com/s/26o0CzvkGRhLoOXfXjZR3?domain=docs.ceph.com Any ideas what I need to do to remove old errors? (or at least I want to see more recent errors) ceph version 17.2.6 (quincy) Thanks.

7 months, 1 week

2
2
0 0

Impacts on doubling the size of pgs in a rbd pool?

by Hervé Ballans

Hi all, I have a Ceph cluster on Quincy (17.2.6), with 3 pools (1 rbd and 1 CephFS volume), each configured with 3 replicas. $ sudo ceph osd pool ls detail pool 7 'cephfs_data_home' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode on last_change 6287147 lfor 0/5364613/5364611 flags hashpspool stripe_width 0 application cephfs pool 8 'cephfs_metadata_home' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 6333341 lfor 0/6333341/6333339 flags hashpspool stripe_width 0 application cephfs pool 9 'rbd_backup_vms' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode on last_change 6365131 lfor 0/211948/249421 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd pool 10 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 6365131 flags hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth $ sudo ceph df --- RAW STORAGE --- CLASS SIZE AVAIL USED RAW USED %RAW USED hdd 306 TiB 186 TiB 119 TiB 119 TiB 39.00 nvme 4.4 TiB 4.3 TiB 118 GiB 118 GiB 2.63 TOTAL 310 TiB 191 TiB 119 TiB 119 TiB 38.49 --- POOLS --- POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL cephfs_data_home 7 512 12 TiB 28.86M 12 TiB 12.85 27 TiB cephfs_metadata_home 8 32 33 GiB 3.63M 33 GiB 0.79 1.3 TiB rbd_backup_vms 9 1024 24 TiB 6.42M 24 TiB 58.65 5.6 TiB .mgr 10 1 35 MiB 9 35 MiB 0 12 TiB I am going to extend the rbd pool (rbd_backup_vms), currently used at 60%. This pool contains 60 disks, i.e. 20 disks by rack in the crushmap. This pool is used for storing VM disk images (available to a separate ProxmoxVE cluster) For this purpose, I am going to add 42 disks of the same size as those currently in the pool, i.e. 14 additional disks on each rack. Currently, this pool is configured with 1024 pgs. Before this operation, I would like to extend the number of pgs, let's say 2048 (i.e. double). I wonder about the overall impact of this change on the cluster. I guess that the heavy moves in the pgs will have a strong impact regarding the iops? I have two questions: 1) Is it useful to make this modification before adding the new OSDs? (I'm afraid of warnings about full or nearfull pgs if not) 2) are there any configuration recommendations in order to minimize these anticipated impacts? Thank you! Cheers, Hervé

7 months, 1 week

3
3
0 0

SSD SATA performance

by Murilo Morais

Good morning everybody! Guys, I have 9x Kingston DC600M/1920 SSDs (SATA) in 3x DL380e, using the P420 (I still don't have an HBA to perform the exchange) in RAID 0. The device's specifications indicate that it achieves 94k/78k RAND-RW IOPS at 4K. I'm using it exclusively for VMs with RBD (I'm using OpenStack). (pool size 3) Performing sequential testing directly on the device I can easily beat these rates, but with random tests I reach fixed rates of 15k, and I think the device just doesn't deliver that. On VMs I get rates of 20k randwrite. What I've already tried: Disable Controller Cache Enable HP Smart Path (it ended up worsening SSD performance) Change the scheduler to "none" I haven't tried putting the controller in HBA mode yet, as the boot disks are on the P420, I still have to align this issue to perform the test. I would like to know if I can improve these rates or if it is just that. Thanks in advance!

7 months, 1 week

1
0
0 0

2024

2023

2022

2021

2020

2019

ceph-users September 2023