Hi,
Our Ceph 16.2.x cluster managed by cephadm is logging a lot of very
detailed messages. On hosts with monitors and several OSDs, the Ceph logs
alone have already eaten through 50% of the endurance of the flash system
drives over a couple of years.
The cluster logging settings are at their defaults, and all daemons seem to
be writing large amounts of debug information to the logs, for example:
https://pastebin.com/ebZq8KZk (it's just a snippet, but it's representative
of the volume and variety of messages).
Is there a way to reduce the amount of logging, for example by limiting it
to warnings or important messages, so that it doesn't record every
successful authentication attempt, compaction and so on while the cluster is
healthy and operating normally?
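For reference, I assume the knobs involved are something along these lines,
though I haven't applied them yet and I'm not sure they are the right ones:

    ceph config set global debug_ms 0/0
    ceph config set global debug_auth 0/0      # guessing this covers the auth messages
    ceph config set mon mon_cluster_log_file_level info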
I would very much appreciate your advice on this.
Best regards,
Zakhar
I have an 8-node cluster with old hardware. A week ago 4 nodes went down and the Ceph cluster went nuts.
All PGs became unknown and the monitors took too long to get in sync.
So I reduced the number of mons to one and the mgrs to one as well.
Now the recovery starts with 100% unknown PGs and then PGs start to move to inactive. It generally fails partway through the recovery and starts from scratch.
It's old hardware; the OSDs have lots of slow ops and probably a number of bad sectors as well.
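To keep recovery from overwhelming the slow disks, my rough plan (untested, and I'm not sure it's the right approach) is something like:

    ceph osd set noout
    ceph osd set norebalance
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1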
Any suggestions on how to tackle this? It's a Nautilus cluster on pretty old (8-year-old) hardware.
Thanks
Hi Folks,
We are currently running with one nearfull OSD and 15 nearfull pools. The most full OSD is about 86% full, but the average is 58%. The balancer is skipping a pool on which the autoscaler is trying to complete a pg_num reduction from 131,072 to 32,768 (the default.rgw.buckets.data pool). The autoscaler has been working on this for the last 20 days: it works through a list of misplaced objects, but when it gets close to the end, more objects get added to the list.
This morning I observed the list get down to c. 7,000 misplaced objects with 2 PGs active+remapped+backfilling; one PG completed its backfilling, then the list shot up to c. 70,000 misplaced objects with 3 PGs active+remapped+backfilling.
Has anyone come across this behaviour before? If so, what was your remediation?
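For reference, I've been watching progress with roughly these commands (nothing beyond the standard status output):

    ceph osd pool get default.rgw.buckets.data pg_num
    ceph osd pool get default.rgw.buckets.data pgp_num
    ceph osd pool autoscale-status
    ceph balancer status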
Thanks in advance for sharing.
Bruno
Cluster details:
3,068 OSDs when all running, c. 60 per storage node
OS: Ubuntu 20.04
Ceph: Pacific 16.2.13 from Ubuntu Cloud Archive
Use case:
S3 storage and OpenStack backend, all pools three-way replicated
Yes, this is all set up. It was working fine until the problem with the OSD
host that lost the cluster/sync network occurred.
There are a few other VMs that keep running along fine without this
issue. I've restarted the problematic VM without success (that is,
creating a file works, but overwriting it still hangs right away). fsck
runs fine, so reading the whole image works.
I'm kind of stumped as to what can cause this.
Because of the lengthy recovery, and with the PG autoscaler currently doing
things, there are lots of PGs that haven't been scrubbed, but I doubt that
is an issue here.
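In case it's useful, what I'm planning to check next on that specific image is roughly this (pool/image are placeholders for the actual names):

    rbd status <pool>/<image>     # any stale watchers?
    rbd info <pool>/<image>
    ceph osd blocklist ls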
On 2023-09-29 at 18:52, Anthony D'Atri wrote:
> EC for RBD wasn't possible until Luminous IIRC, so I had to ask. Do you have a replicated metadata pool defined? Does Proxmox know that this is an EC pool? When connecting, it needs to know both the metadata and data pools.
>
>> On Sep 29, 2023, at 12:49, peter.linder(a)fiberdirekt.se wrote:
>>
>> (sorry for duplicate emails)
>>
>> This turns out to be a good question actually.
>>
>> The cluster is running Quincy, 17.2.6.
>>
>> The compute node that is running the VM is Proxmox, version 7.4-3. Supposedly this is fairly new, but the version of librbd1 claims to be 14.2.21 when I check with "apt list". We are not using Proxmox's own Ceph release. We haven't had any issues with this setup before, although we had neither used erasure-coded pools nor had a node half-dead for such a long time before.
>>
>> The VM is configured using Proxmox, which is not libvirt but similar, and krbd is not enabled. I don't know for sure whether Proxmox has its own librbd linked into qemu/kvm.
>>
>> "ceph features" looks like this:
>>
>> {
>>     "mon": [
>>         {
>>             "features": "0x3f01cfbf7ffdffff",
>>             "release": "luminous",
>>             "num": 5
>>         }
>>     ],
>>     "osd": [
>>         {
>>             "features": "0x3f01cfbf7ffdffff",
>>             "release": "luminous",
>>             "num": 24
>>         }
>>     ],
>>     "client": [
>>         {
>>             "features": "0x3f01cfb87fecffff",
>>             "release": "luminous",
>>             "num": 4
>>         },
>>         {
>>             "features": "0x3f01cfbf7ffdffff",
>>             "release": "luminous",
>>             "num": 12
>>         }
>>     ],
>>     "mgr": [
>>         {
>>             "features": "0x3f01cfbf7ffdffff",
>>             "release": "luminous",
>>             "num": 2
>>         }
>>     ]
>> }
>>
>> Regards,
>>
>> Peter
>>
>>
>> On 2023-09-29 at 17:55, Anthony D'Atri wrote:
>>> Which Ceph releases are installed on the VM and the back end? Is the VM using librbd through libvirt, or krbd?
>>>
>>>> On Sep 29, 2023, at 09:09, Peter Linder <peter.linder(a)fiberdirekt.se> wrote:
>>>>
>>>> Dear all,
>>>>
>>>> I have a problem: after an OSD host lost connection to the sync/cluster rear network for many hours (the public network was online), a test VM using RBD can't overwrite its files. I can create a new file inside it just fine, but not overwrite one; the process just hangs.
>>>>
>>>> The VM's disk is on an erasure-coded data pool with a replicated pool in front of it. EC overwrites are enabled for the pool.
>>>>
>>>> The cluster consists of 5 hosts with 4 OSDs each, plus separate hosts for compute. There are separate public and cluster networks. In this case, the AOC cable to the cluster network went link-down on one host; it had to be replaced and the host was rebooted. Recovery took about a week to complete. The host was half-down for about 12 hours like this.
>>>>
>>>> I have some other VMs as well with images in the same pool (4 in total), and they seem to work fine; it is just this one that can't overwrite.
>>>>
>>>> I'm thinking there is somehow something wrong with just this image?
>>>>
>>>> Regards,
>>>>
>>>> Peter
Hi,
See below for details of the warnings.
The cluster is running 17.2.5, and the warnings have been around for a while.
One concern of mine is num_segments growing over time. The number of clients
warned about in MDS_CLIENT_OLDEST_TID has increased from 18 to 25 as well.
The nodes are running kernel 4.19.0-91.82.42.uelc20.x86_64.
It looks like a bug in the client library. And would rebooting the affected
nodes only fix it for a short period of time? Any suggestions from the
community for fixing this?
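As a stop-gap I'm considering evicting the offending client and raising the
trim limit, roughly like this (assuming eviction is safe for these kernel
clients; the value for max segments is a guess):

    ceph tell mds.code-store.host16w.vucirx client ls
    ceph tell mds.code-store.host16w.vucirx client evict id=460983
    ceph config set mds mds_log_max_segments 256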
Thanks,
Ben
[root@8cd2c0657c77 /]# ceph health detail
HEALTH_WARN 6 hosts fail cephadm check; 2 clients failing to respond to
capability release; 25 clients failing to advance oldest client/flush tid;
3 MDSs report slow requests; 3 MDSs behind on trimming
[WRN] CEPHADM_HOST_CHECK_FAILED: 6 hosts fail cephadm check
host host15w (192.168.31.33) failed check: Unable to reach remote host
host15w. Process exited with non-zero exit status 1
host host20w (192.168.31.38) failed check: Unable to reach remote host
host20w. Process exited with non-zero exit status 1
host host19w (192.168.31.37) failed check: Unable to reach remote host
host19w. Process exited with non-zero exit status 1
host host17w (192.168.31.35) failed check: Unable to reach remote host
host17w. Process exited with non-zero exit status 1
host host18w (192.168.31.36) failed check: Unable to reach remote host
host18w. Process exited with non-zero exit status 1
host host16w (192.168.31.34) failed check: Unable to reach remote host
host16w. Process exited with non-zero exit status 1
[WRN] MDS_CLIENT_LATE_RELEASE: 2 clients failing to respond to capability
release
mds.code-store.host18w.fdsqff(mds.1): Client k8s-node36 failing to
respond to capability release client_id: 460983
mds.code-store.host16w.vucirx(mds.3): Client failing to respond to
capability release client_id: 460983
[WRN] MDS_CLIENT_OLDEST_TID: 25 clients failing to advance oldest
client/flush tid
mds.code-store.host18w.fdsqff(mds.1): Client k8s-node36 failing to
advance its oldest client/flush tid. client_id: 460983
mds.code-store.host18w.fdsqff(mds.1): Client failing to advance its
oldest client/flush tid. client_id: 460226
mds.code-store.host18w.fdsqff(mds.1): Client k8s-node32 failing to
advance its oldest client/flush tid. client_id: 239797
mds.code-store.host15w.reolpx(mds.5): Client k8s-node34 failing to
advance its oldest client/flush tid. client_id: 460226
mds.code-store.host15w.reolpx(mds.5): Client k8s-node32 failing to
advance its oldest client/flush tid. client_id: 239797
mds.code-store.host15w.reolpx(mds.5): Client failing to advance its
oldest client/flush tid. client_id: 460983
mds.code-store.host18w.rtyvdy(mds.7): Client k8s-node34 failing to
advance its oldest client/flush tid. client_id: 460226
mds.code-store.host18w.rtyvdy(mds.7): Client failing to advance its
oldest client/flush tid. client_id: 239797
mds.code-store.host18w.rtyvdy(mds.7): Client k8s-node36 failing to
advance its oldest client/flush tid. client_id: 460983
mds.code-store.host17w.kcdopb(mds.2): Client failing to advance its
oldest client/flush tid. client_id: 239797
mds.code-store.host17w.kcdopb(mds.2): Client failing to advance its
oldest client/flush tid. client_id: 460983
mds.code-store.host17w.kcdopb(mds.2): Client k8s-node34 failing to
advance its oldest client/flush tid. client_id: 460226
mds.code-store.host17w.kcdopb(mds.2): Client k8s-node24 failing to
advance its oldest client/flush tid. client_id: 12072730
mds.code-store.host20w.bfoftp(mds.4): Client k8s-node32 failing to
advance its oldest client/flush tid. client_id: 239797
mds.code-store.host20w.bfoftp(mds.4): Client k8s-node36 failing to
advance its oldest client/flush tid. client_id: 460983
mds.code-store.host19w.ywrmiz(mds.6): Client k8s-node24 failing to
advance its oldest client/flush tid. client_id: 12072730
mds.code-store.host19w.ywrmiz(mds.6): Client k8s-node34 failing to
advance its oldest client/flush tid. client_id: 460226
mds.code-store.host19w.ywrmiz(mds.6): Client failing to advance its
oldest client/flush tid. client_id: 239797
mds.code-store.host19w.ywrmiz(mds.6): Client failing to advance its
oldest client/flush tid. client_id: 460983
mds.code-store.host16w.vucirx(mds.3): Client failing to advance its
oldest client/flush tid. client_id: 460983
mds.code-store.host16w.vucirx(mds.3): Client failing to advance its
oldest client/flush tid. client_id: 460226
mds.code-store.host16w.vucirx(mds.3): Client failing to advance its
oldest client/flush tid. client_id: 239797
mds.code-store.host17w.pdziet(mds.0): Client k8s-node32 failing to
advance its oldest client/flush tid. client_id: 239797
mds.code-store.host17w.pdziet(mds.0): Client k8s-node34 failing to
advance its oldest client/flush tid. client_id: 460226
mds.code-store.host17w.pdziet(mds.0): Client k8s-node36 failing to
advance its oldest client/flush tid. client_id: 460983
[WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
mds.code-store.host15w.reolpx(mds.5): 4 slow requests are blocked > 5
secs
mds.code-store.host20w.bfoftp(mds.4): 6 slow requests are blocked > 5
secs
mds.code-store.host16w.vucirx(mds.3): 97 slow requests are blocked > 5
secs
[WRN] MDS_TRIM: 3 MDSs behind on trimming
mds.code-store.host15w.reolpx(mds.5): Behind on trimming (25831/128)
max_segments: 128, num_segments: 25831
mds.code-store.host20w.bfoftp(mds.4): Behind on trimming (27605/128)
max_segments: 128, num_segments: 27605
mds.code-store.host16w.vucirx(mds.3): Behind on trimming (28676/128)
max_segments: 128, num_segments: 28676
Hi Matthew,
At least for Nautilus (14.2.22), I have discovered through trial and
error that you need to specify a beginning or end date. Something like
this:
radosgw-admin sync error trim --end-date="2023-08-20 23:00:00"
--rgw-zone={your_zone_name}
I specify the zone as there's an error list for each zone.
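After the trim, you can check whether the old entries are gone by listing
again for the same zone:

radosgw-admin sync error list --rgw-zone={your_zone_name}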
Hopefully that helps.
Rich
------------------------------
Date: Sat, 19 Aug 2023 12:48:55 -0400
From: Matthew Darwin <bugs(a)mdarwin.ca>
Subject: [ceph-users] radosgw-admin sync error trim seems to do
nothing
To: Ceph Users <ceph-users(a)ceph.io>
Hello all,
"radosgw-admin sync error list" returns errors from 2022. I want to
clear those out.
I tried "radosgw-admin sync error trim" but it seems to do nothing.
The man page seems to offer no suggestions
https://protect-au.mimecast.com/s/26o0CzvkGRhLoOXfXjZR3?domain=docs.ceph.com
Any ideas what I need to do to remove old errors? (or at least I want
to see more recent errors)
ceph version 17.2.6 (quincy)
Thanks.
Hi all,
I have a Ceph cluster on Quincy (17.2.6), with 3 pools (1 RBD pool plus 1
CephFS volume), each configured with 3 replicas.
$ sudo ceph osd pool ls detail
pool 7 'cephfs_data_home' replicated size 3 min_size 2 crush_rule 1
object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode on
last_change 6287147 lfor 0/5364613/5364611 flags hashpspool stripe_width
0 application cephfs
pool 8 'cephfs_metadata_home' replicated size 3 min_size 2 crush_rule 3
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
6333341 lfor 0/6333341/6333339 flags hashpspool stripe_width 0
application cephfs
pool 9 'rbd_backup_vms' replicated size 3 min_size 2 crush_rule 2
object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode on
last_change 6365131 lfor 0/211948/249421 flags
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 10 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 6365131
flags hashpspool stripe_width 0 pg_num_min 1 application
mgr,mgr_devicehealth
$ sudo ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    306 TiB  186 TiB  119 TiB   119 TiB      39.00
nvme   4.4 TiB  4.3 TiB  118 GiB   118 GiB       2.63
TOTAL  310 TiB  191 TiB  119 TiB   119 TiB      38.49
--- POOLS ---
POOL                  ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
cephfs_data_home       7   512   12 TiB   28.86M   12 TiB  12.85     27 TiB
cephfs_metadata_home   8    32   33 GiB    3.63M   33 GiB   0.79    1.3 TiB
rbd_backup_vms         9  1024   24 TiB    6.42M   24 TiB  58.65    5.6 TiB
.mgr                  10     1   35 MiB        9   35 MiB      0     12 TiB
I am going to extend the rbd pool (rbd_backup_vms), currently used at 60%.
This pool contains 60 disks, i.e. 20 disks per rack in the crush map. The
pool is used for storing VM disk images (made available to a separate
Proxmox VE cluster).
For this purpose, I am going to add 42 disks of the same size as those
currently in the pool, i.e. 14 additional disks on each rack.
Currently, this pool is configured with 1024 pgs.
Before this operation, I would like to increase the number of PGs to, let's
say, 2048 (i.e. double).
I wonder about the overall impact of this change on the cluster. I guess
that the heavy PG movement will have a strong impact on IOPS?
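For reference, the change I have in mind is just the following, possibly
combined with some backfill throttling (the throttle values below are
guesses on my part):

    ceph osd pool set rbd_backup_vms pg_num 2048
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1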
I have two questions:
1) Is it useful to make this modification before adding the new OSDs? (I'm
afraid of warnings about full or nearfull PGs if not.)
2) Are there any configuration recommendations in order to minimize these
anticipated impacts?
Thank you!
Cheers,
Hervé
Good morning everybody!
Guys, I have 9x Kingston DC600M/1920 SSDs (SATA) in 3x DL380e servers,
attached to the P420 controller in RAID 0 (I still don't have an HBA to swap in).
The device's specifications indicate that it achieves 94k/78k RAND-RW IOPS
at 4K.
I'm using them exclusively for VMs with RBD (I'm using OpenStack), with pool
size 3.
Performing sequential testing directly on the device I can easily beat
these rates, but with random tests I hit a fixed rate of 15k IOPS, and I
suspect the device simply doesn't deliver the advertised figures.
On VMs I get rates of 20k randwrite.
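For reference, the random-write test I run directly on the device is roughly
this fio job (parameters from memory; the device path is a placeholder):

    fio --name=randwrite --filename=/dev/sdX --direct=1 --ioengine=libaio \
        --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
        --runtime=60 --time_based --group_reporting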
What I've already tried:
Disable Controller Cache
Enable HP Smart Path (it ended up worsening SSD performance)
Change the scheduler to "none"
I haven't tried putting the controller in HBA mode yet, as the boot disks
are on the P420; I still need to sort that out before I can run that test.
I would like to know if I can improve these rates or if this is simply the best the device can do.
Thanks in advance!