Hi,
on Debian 12, ceph-dashboard is throwing a warning:
"Module 'dashboard' has failed dependency: PyO3 modules may only be
initialized once per interpreter process"
This seems to be related to a PyO3 0.17 change:
https://github.com/PyO3/pyo3/blob/7bdc504252a2f972ba3490c44249b202a4ce6180/…
"
Each #[pymodule] can now only be initialized once per process
To make PyO3 modules sound in the presence of Python sub-interpreters,
for now it has been necessary to explicitly disable the ability to
initialize a #[pymodule] more than once in the same process. Attempting
to do this will now raise an ImportError.
"
Hi all,
we seem to have hit a bug in the CephFS kernel client and I just want to confirm what action to take. We get the error "wrong peer at address" in dmesg, and some jobs on that server seem to get stuck in fs access; log extract below. I found these 2 tracker items, https://tracker.ceph.com/issues/23883 and https://tracker.ceph.com/issues/41519, which don't seem to have fixes.
My questions:
- Is this harmless, or does it indicate invalid/corrupted client cache entries?
- How should we resolve it: ignore, umount+mount, or reboot?
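Unless advised otherwise, this is what I would try first (untested; the
mount point and MDS rank below are placeholders for our setup):
ceph tell mds.1 client ls    # check which sessions the MDS holds for this host
umount /mnt/cephfs           # may need -f/-l if the stuck jobs hold open files
mount -t ceph <mon-addrs>:/ /mnt/cephfs -o name=<client>,secretfile=<path>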
Here is an extract from the dmesg log; the error has survived a couple of MDS restarts already:
[Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:05:18 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:13:50 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:13:50 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:18:13 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:29 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:35 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:22:22 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Mon Mar 6 13:22:22 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:30:54 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[...]
[Thu Mar 9 09:37:24 2023] slurm.epilog.cl (31457): drop_caches: 3
[Thu Mar 9 09:38:26 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:38:26 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:46:58 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:46:58 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:55:30 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:55:30 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 10:04:02 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 10:04:02 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hey ceph-users,
I am running two (now) Quincy clusters doing RGW multi-site replication
with only one actually being written to by clients.
The other site is intended simply as a remote copy.
On the primary cluster I am observing an ever-growing (in objects and
bytes) "sitea.rgw.log" pool; not so the remote "siteb.rgw.log", which
is only 300MiB and around 15k objects, with no growth.
Metrics show that the growth of the pool on the primary has been linear for
at least 6 months, so no sudden spikes or anything. Also, sync status
appears to be totally happy.
There are also no warnings with regard to large OMAPs or anything similar.
I was under the impression that RGW trims its three logs (md, bi,
data) automatically and only keeps data that has not yet been replicated
by the other zonegroup members?
The config option rgw_sync_log_trim_interval ("ceph config get mgr
rgw_sync_log_trim_interval") is set to 1200, so 20 minutes.
So I am wondering whether there might be some inconsistency, and how I can
best analyze what the cause of the accumulating log data is.
There are older questions on the ML, such as [1], but there was not
really a solution or root cause identified.
I know there is manual trimming, but I would rather analyze the
current situation and figure out why auto-trimming is not happening.
* Do I need to go through all buckets, count logs, and look at
their timestamps? Which queries make sense here?
* Is there usually any logging of the log-trimming activity that I
should expect? Or that might indicate why trimming does not happen?
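For reference, these are the checks I have run so far (pool names as above;
the status subcommands are the ones I know of, there may be better ones):
radosgw-admin sync status          # per-zone sync state, shows lag if any
radosgw-admin mdlog status         # metadata log markers
radosgw-admin datalog status       # data log markers
rados df | grep rgw.log            # object/byte counts of the log pools
rados -p sitea.rgw.log ls | head   # sample object names to see which log grows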
Regards
Christian
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/WZCFOAMLWV…
Hello,
This message does not concern Ceph itself, but a hardware defect which can lead to permanent loss of data on a Ceph cluster equipped with the same hardware in separate fault domains.
The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD drives of the 13G generation of DELL servers are subject to a defect which renders them unusable after 70,000 hours of operation, i.e. approximately 7 years and 11 months of activity.
This topic has been discussed here: https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-commu…
The risk is all the greater since these disks may die at the same time in the same server, leading to the loss of all data in the server.
To date, DELL has not provided any firmware fixing this defect, the latest firmware version being "A3B3", released on Sept. 12, 2016: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k
If you have servers running these drives, check their uptime. If they are close to the 70,000-hour limit, replace them immediately.
The smartctl tool does not report the power-on hours for these SSDs, but if you have HDDs in the server, you can query their SMART status and get their power-on hours, which should be about the same as the SSDs'.
The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the drive's SCSI device ID on the MegaRAID controller).
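For example, to pull just the relevant attribute (device ID 0 is only an
illustration; adjust to your controller layout):
smartctl -a -d megaraid,0 /dev/sdc | grep -i Power_On_Hours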
We have informed DELL about this but have no information yet on the arrival of a fix.
We have lost 6 disks, in 3 different servers, in the last few weeks. Our observation is that the drives do not survive a full shutdown and restart of the machine (power off then power on in iDRAC), but they may also die during a simple reboot (init 6) or even while the machine is running.
Fujitsu released a corrective firmware in June 2021 but this firmware is most certainly not applicable to DELL drives: https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf
Regards,
Frederic
Sous-direction Infrastructure and Services
Direction du Numérique
Université de Lorraine
Hello everyone!
Recently we had a very nasty incident with one of our Ceph clusters.
During a basic backfill/recovery operation due to a faulty disk, the CephFS metadata started growing exponentially until it used all available space and the whole cluster DIED. Usage graph screenshot in attachment.
Everything happened very fast: even when the OSDs were marked full, they tripped the failsafe and ate all the free blocks while still trying to allocate space, then died completely, without the possibility of even starting them again.
The only solution was to copy the whole BlueStore to a bigger SSD and resize the underlying BlueStore device. Only about 1/3 of the OSDs were able to start after the move, but that was enough, since we have very redundant settings for the CephFS metadata. Basically, the metadata was moved from 12x 240GB SSDs to 12x 500GB SSDs to have enough space to start again.
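For anyone in the same spot, this is roughly the procedure we used (sketch
from memory; device paths and the OSD id are placeholders):
dd if=/dev/old-ssd of=/dev/new-ssd bs=4M status=progress   # clone the OSD
# grow the partition/LV on the new device, then let BlueFS use the new space:
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-<id>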
Brief info about the cluster:
- CephFS data is stored on ~500x 8TB SAS HDDs using 10+2 erasure coding on 18 hosts.
- CephFS metadata is stored on ~12x 500GB SAS/SATA SSDs using 5x replication on 6 hosts.
- The version was one of the latest 16.x.x Pacific releases at the time of the incident.
- 3x MON+MGR, plus 2 active and 2 hot-standby MDS, run on separate virtual servers.
- The typical file size stored is from hundreds of MBs to tens of GBs.
- This cluster is not the biggest, does not have the most HDDs, and has no special config; I simply see nothing special about it.
During investigation I found out the following:
- Metadata grows any time recovery is running on any of our maintained clusters (~15 clusters of different usages and sizes), but never this much; this was an extreme situation.
- After recovery finished, the size went back to normal.
- I think there is a slight correlation between recovery width (the number of objects recovery has to touch in order to recover everything) and recovery duration, but I have no proof.
- Nothing much else.
I would like to find out why this happened, because I think it can happen again sometime, and someone might lose data if they have less luck.
Any ideas are appreciated, or even info on whether anyone has seen similar behavior, or whether I am the only one struggling with an issue like this :)
Kind regards,
Jakub Petrzilka
Hi,
We're having sporadic problems with a CephFS filesystem where MDSs end up
on the OSD blocklist. We're still digging around looking for a cause
(Ceph-related or elsewhere in the infrastructure).
The cluster isn't massive (68 OSDs spread over 34 hosts), each host is a
VM, with MGR/MON/MDS on non-OSD hosts.
Running Ceph 16.2.10
Any suggestions for debugging this further?
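Not a root cause, but the checks I would start with on 16.2.x (the debug
values below are only suggestions):
ceph osd blocklist ls              # current blocklist entries and their expiry
ceph config set mds debug_mds 10   # more verbose MDS logging while reproducing
ceph config set mds debug_ms 1     # message-level logging, for network issues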
Hello,
I have two main questions here.
1. What can I do when `ceph-bluestore-tool` outputs a stack trace for
`fsck`?
2. How does one recover from lost PGs / data corruption in an RGW
Multi-site setup?
---
I have a Luminous 12.2.12 cluster built on
ceph/daemon:v3.2.10-stable-3.2-luminous-centos-7-x86_64 for all daemons; no
ceph packages are installed on the systems. The OSD nodes have 128GB RAM, 6
SATA SSDs (Micron 5200, 2TB), and 1 NVMe SSD split into 4 OSDs.
osd_memory_target is set to 10GB, so with 10 OSDs per node that should put
me at 100/128GB used.
There are 3 PGs down; the 3 OSDs that had those PGs won't stay online,
and they crash fairly quickly after starting. These are running on SATA
SSDs, which are being replaced with NVMe SSDs. CRUSH-reweighting the SATA
drives down causes some SATA OSDs to crash, and some NVMe drives have slow
or blocked ops (related to the down PGs).
I installed the ceph-osd package on one OSD host. When I ran
`ceph-bluestore-tool`, I got a bunch of tcmalloc and unexpected aio errors;
exact output below. I also tried `ceph-objectstore-tool` but received
similar results. I cloned the other OSD that has the affected PGs so I would
have a copy to work on, but I got exactly the same results as before.
---
From what I can see, this is likely due to bad drives and automation
restarting down OSDs several times. With 3 PGs down, I am assuming my next
step would be to mark those PGs lost. From there, I am unsure what the
recovery procedure is to sync "clean" data from the other zones into the
impacted cluster. Is RGW able to handle this? Do I need to use `rclone`?
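In case it helps frame an answer, this is my current understanding of the
two steps (unverified, destructive, and the IDs are placeholders; corrections
very welcome):
ceph osd lost <osd-id> --yes-i-really-mean-it   # give up on the dead OSDs
ceph osd force-create-pg <pgid>                 # recreate a PG empty if it stays down
radosgw-admin metadata sync init                # on the damaged zone: schedule
radosgw-admin data sync init                    # a full resync from the peer
(then restart the RGW daemons so the full sync actually starts)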
---
$ ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-11 fsck
tcmalloc: large alloc 1283989504 bytes == 0x557fdbe46000 @ 0x7fc87e4126d0
0x7fc873354ae9 0x7fc873356073 0x557f89d3d680 0x557f89d2ebcd 0x557f89d30524
0x557f89d318ef 0x557f89d33147 0x557f89bb0d6f 0x557f89b3c91b 0x557f89b6df8a
0x557f89a2c5e1 0x7fc87299d2e1 0x557f89ab03fa (nil)
tcmalloc: large alloc 2567970816 bytes == 0x5580286c8000 @ 0x7fc87e4126d0
0x7fc873354ae9 0x7fc873356073 0x557f89d3d680 0x557f89d2ebcd 0x557f89d30524
0x557f89d318ef 0x557f89d33147 0x557f89bb0d6f 0x557f89b3c91b 0x557f89b6df8a
0x557f89a2c5e1 0x7fc87299d2e1 0x557f89ab03fa (nil)
tcmalloc: large alloc 5135933440 bytes == 0x5580c17ca000 @ 0x7fc87e4126d0
0x7fc873354ae9 0x7fc873356073 0x557f89d3d680 0x557f89d2ebcd 0x557f89d30524
0x557f89d318ef 0x557f89d33147 0x557f89bb0d6f 0x557f89b3c91b 0x557f89b6df8a
0x557f89a2c5e1 0x7fc87299d2e1 0x557f89ab03fa (nil)
tcmalloc: large alloc 3025510400 bytes == 0x557f8f6e6000 @ 0x7fc87e4126d0
0x7fc873354ae9 0x7fc87335582b 0x557f89d75d19 0x557f89d2edda 0x557f89d30524
0x557f89d318ef 0x557f89d33147 0x557f89bb0d6f 0x557f89b3c91b 0x557f89b6df8a
0x557f89a2c5e1 0x7fc87299d2e1 0x557f89ab03fa (nil)
tcmalloc: large alloc 2269913088 bytes == 0x55832469e000 @ 0x7fc87e3f2e50
0x7fc87e4121b9 0x7fc8756ca4f7 0x7fc8756cd304 0x557f89cc4661 0x557f89ad0858
0x557f89ad2224 0x557f89cb7b1d 0x557f89de584c 0x557f89de6a7e 0x557f89e05e7b
0x557f89d2cf48 0x557f89d2efd2 0x557f89d30524 0x557f89d318ef 0x557f89d33147
0x557f89bb0d6f 0x557f89b3c91b 0x557f89b6df8a 0x557f89a2c5e1 0x7fc87299d2e1
0x557f89ab03fa (nil)
2023-07-30 08:27:27.531919 7fc86f689700 -1 bdev(0x557f8add4240
/var/lib/ceph/osd/ceph-11/block) aio to 929504952320~2269908992 but
returned: 2147479552
/build/ceph-12.2.12/src/os/bluestore/KernelDevice.cc: In function 'void
KernelDevice::_aio_thread()' thread 7fc86f689700 time 2023-07-30 08:27:27.532004
/build/ceph-12.2.12/src/os/bluestore/KernelDevice.cc: 397: FAILED assert(0
== "unexpected aio error")
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7fc8757242c2]
2: (KernelDevice::_aio_thread()+0x1377) [0x557f89cc14c7]
3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x557f89cc725d]
4: (()+0x74a4) [0x7fc8740104a4]
5: (clone()+0x3f) [0x7fc872a65d0f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
2023-07-30 08:27:27.544215 7fc86f689700 -1
/build/ceph-12.2.12/src/os/bluestore/KernelDevice.cc: In function 'void
KernelDevice::_aio_thread()' thread 7fc86f689700 time 2023-07-30
08:27:27.532004
/build/ceph-12.2.12/src/os/bluestore/KernelDevice.cc: 397: FAILED assert(0
== "unexpected aio error")
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7fc8757242c2]
2: (KernelDevice::_aio_thread()+0x1377) [0x557f89cc14c7]
3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x557f89cc725d]
4: (()+0x74a4) [0x7fc8740104a4]
5: (clone()+0x3f) [0x7fc872a65d0f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
-1> 2023-07-30 08:27:27.531919 7fc86f689700 -1 bdev(0x557f8add4240
/var/lib/ceph/osd/ceph-11/block) aio to 929504952320~2269908992 but
returned: 2147479552
0> 2023-07-30 08:27:27.544215 7fc86f689700 -1
/build/ceph-12.2.12/src/os/bluestore/KernelDevice.cc: In function 'void
KernelDevice::_aio_thread()' thread 7fc86f689700 time 2023-07-30
08:27:27.532004
/build/ceph-12.2.12/src/os/bluestore/KernelDevice.cc: 397: FAILED assert(0
== "unexpected aio error")
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7fc8757242c2]
2: (KernelDevice::_aio_thread()+0x1377) [0x557f89cc14c7]
3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x557f89cc725d]
4: (()+0x74a4) [0x7fc8740104a4]
5: (clone()+0x3f) [0x7fc872a65d0f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
*** Caught signal (Aborted) **
in thread 7fc86f689700 thread_name:bstore_aio
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)
1: (()+0x424fc4) [0x557f89d25fc4]
2: (()+0x110e0) [0x7fc87401a0e0]
3: (gsignal()+0xcf) [0x7fc8729affff]
4: (abort()+0x16a) [0x7fc8729b142a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x28e) [0x7fc87572444e]
6: (KernelDevice::_aio_thread()+0x1377) [0x557f89cc14c7]
7: (KernelDevice::AioCompletionThread::entry()+0xd) [0x557f89cc725d]
8: (()+0x74a4) [0x7fc8740104a4]
9: (clone()+0x3f) [0x7fc872a65d0f]
2023-07-30 08:27:27.549175 7fc86f689700 -1 *** Caught signal (Aborted) **
in thread 7fc86f689700 thread_name:bstore_aio
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)
1: (()+0x424fc4) [0x557f89d25fc4]
2: (()+0x110e0) [0x7fc87401a0e0]
3: (gsignal()+0xcf) [0x7fc8729affff]
4: (abort()+0x16a) [0x7fc8729b142a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x28e) [0x7fc87572444e]
6: (KernelDevice::_aio_thread()+0x1377) [0x557f89cc14c7]
7: (KernelDevice::AioCompletionThread::entry()+0xd) [0x557f89cc725d]
8: (()+0x74a4) [0x7fc8740104a4]
9: (clone()+0x3f) [0x7fc872a65d0f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
0> 2023-07-30 08:27:27.549175 7fc86f689700 -1 *** Caught signal
(Aborted) **
in thread 7fc86f689700 thread_name:bstore_aio
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)
1: (()+0x424fc4) [0x557f89d25fc4]
2: (()+0x110e0) [0x7fc87401a0e0]
3: (gsignal()+0xcf) [0x7fc8729affff]
4: (abort()+0x16a) [0x7fc8729b142a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x28e) [0x7fc87572444e]
6: (KernelDevice::_aio_thread()+0x1377) [0x557f89cc14c7]
7: (KernelDevice::AioCompletionThread::entry()+0xd) [0x557f89cc725d]
8: (()+0x74a4) [0x7fc8740104a4]
9: (clone()+0x3f) [0x7fc872a65d0f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
Aborted
$ ceph-objectstore-tool --data-path=/var/lib/ceph/osd/ceph-11 --op list-pgs
tcmalloc: large alloc 1283989504 bytes == 0x5649b1bdc000 @ 0x7f3af5e756d0
0x7f3aeafbbae9 0x7f3aeafbd073 0x56495defb9e0 0x56495deed01d 0x56495deee974
0x56495deefd3f 0x56495def1597 0x56495de0e47f 0x56495dd95dab 0x56495ddcf9e4
0x56495d7de4db 0x7f3aea6042e1 0x56495d86853a (nil)
tcmalloc: large alloc 2567970816 bytes == 0x5649fe45e000 @ 0x7f3af5e756d0
0x7f3aeafbbae9 0x7f3aeafbd073 0x56495defb9e0 0x56495deed01d 0x56495deee974
0x56495deefd3f 0x56495def1597 0x56495de0e47f 0x56495dd95dab 0x56495ddcf9e4
0x56495d7de4db 0x7f3aea6042e1 0x56495d86853a (nil)
tcmalloc: large alloc 5135933440 bytes == 0x564a97560000 @ 0x7f3af5e756d0
0x7f3aeafbbae9 0x7f3aeafbd073 0x56495defb9e0 0x56495deed01d 0x56495deee974
0x56495deefd3f 0x56495def1597 0x56495de0e47f 0x56495dd95dab 0x56495ddcf9e4
0x56495d7de4db 0x7f3aea6042e1 0x56495d86853a (nil)
tcmalloc: large alloc 3025510400 bytes == 0x56496547c000 @ 0x7f3af5e756d0
0x7f3aeafbbae9 0x7f3aeafbc82b 0x56495df34079 0x56495deed22a 0x56495deee974
0x56495deefd3f 0x56495def1597 0x56495de0e47f 0x56495dd95dab 0x56495ddcf9e4
0x56495d7de4db 0x7f3aea6042e1 0x56495d86853a (nil)
tcmalloc: large alloc 2269913088 bytes == 0x564cfa402000 @ 0x7f3af5e55e50
0x7f3af5e751b9 0x7f3aed12d4f7 0x7f3aed130304 0x56495de9fbc1 0x56495de7a5f8
0x56495de7bfc4 0x56495de9307d 0x56495dfa32dc 0x56495dfa450e 0x56495dfc34db
0x56495deeb398 0x56495deed422 0x56495deee974 0x56495deefd3f 0x56495def1597
0x56495de0e47f 0x56495dd95dab 0x56495ddcf9e4 0x56495d7de4db 0x7f3aea6042e1
0x56495d86853a (nil)
/build/ceph-12.2.12/src/os/bluestore/KernelDevice.cc: In function 'void
KernelDevice::_aio_thread()' thread 7f3ae72f0700 time 2023-07-30
08:37:16.531432
/build/ceph-12.2.12/src/os/bluestore/KernelDevice.cc: 397: FAILED assert(0
== "unexpected aio error")
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7f3aed1872c2]
2: (KernelDevice::_aio_thread()+0x1377) [0x56495de9ca27]
3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x56495dea27bd]
4: (()+0x74a4) [0x7f3aeba734a4]
5: (clone()+0x3f) [0x7f3aea6ccd0f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
*** Caught signal (Aborted) **
in thread 7f3ae72f0700 thread_name:bstore_aio
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
(stable)
1: (()+0x94a0f4) [0x56495debe0f4]
2: (()+0x110e0) [0x7f3aeba7d0e0]
3: (gsignal()+0xcf) [0x7f3aea616fff]
4: (abort()+0x16a) [0x7f3aea61842a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x28e) [0x7f3aed18744e]
6: (KernelDevice::_aio_thread()+0x1377) [0x56495de9ca27]
7: (KernelDevice::AioCompletionThread::entry()+0xd) [0x56495dea27bd]
8: (()+0x74a4) [0x7f3aeba734a4]
9: (clone()+0x3f) [0x7f3aea6ccd0f]
Aborted
--
Gregory O’Neill
Details of this release are summarized here:
https://tracker.ceph.com/issues/62231#note-1
Seeking approvals/reviews for:
smoke - Laura, Radek
rados - Neha, Radek, Travis, Ernesto, Adam King
rgw - Casey
fs - Venky
orch - Adam King
rbd - Ilya
krbd - Ilya
upgrade-clients:client-upgrade* - in progress
powercycle - Brad
Please reply to this email with approval and/or trackers of known
issues/PRs to address them.
bookworm distro support is an outstanding issue.
TIA
YuriW
Hi,
I have trouble with large OMAP objects in the RGW index pool of a cluster. Some
background information about the cluster: there is CephFS and RBD usage on the
main cluster, but for this issue I think only S3 is interesting.
There is one realm and one zonegroup with two zones, which have a bidirectional
sync set up. Since this does not allow for auto-resharding, we have to do it by
hand in this cluster – looking forward to Reef!
From the logs:
cluster 2023-07-17T22:59:03.018722+0000 osd.75 (osd.75) 623978 :
cluster [WRN] Large omap object found. Object:
34:bcec3016:::.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.5:head
PG: 34.680c373d (34.5) Key count: 962091 Size (bytes): 277963182
The offending bucket looks like this:
# radosgw-admin bucket stats \
| jq '.[] | select(.marker
=="3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9")
|"\(.num_shards) \(.usage["rgw.main"].num_objects)"' -r
131 9463833
Last week the number of objects was about 12 million, which is why I resharded
the offending bucket twice, I think: once to 129, and the second time to 131
because I wanted some leeway (or lieway? scnr, Sage).
Unfortunately, even after a week the objects were still too big (the log line
above is quite recent), so I looked into it again.
# rados -p raum.rgw.buckets.index ls \
|grep .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9 \
|sort -V
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.0
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.1
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.2
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.3
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.4
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.5
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.6
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.7
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.8
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.9
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.10
# rados -p raum.rgw.buckets.index ls \
|grep .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9 \
|sort -V \
|xargs -IOMAP sh -c \
'rados -p raum.rgw.buckets.index listomapkeys OMAP | wc -l'
1013854
1011007
1012287
1011232
1013565
998262
1012777
1012713
1012230
1010690
997111
Apparently, only 11 shards are in use. This would explain why the "Key count"
(from the log line) is about ten times higher than I would expect.
How can I deal with this issue?
One thing I could try to fix this would be to reshard to a lower number, but I
am not sure whether there are any risks associated with "downsharding". After
that I could reshard back up to something like 97. Or I could directly
"downshard" to 97.
Also, the second zone has a similar problem, but as the error message lets me
know, resharding there would be a bad idea. Will it just take more time until
the sharding is transferred to the second zone?
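For what it's worth, these are the reshard bookkeeping commands I know of, in
case a stale reshard entry is the culprit (not sure how they behave with
multisite):
radosgw-admin reshard status --bucket=<bucket>   # per-shard reshard state
radosgw-admin reshard list                       # pending reshard operations
radosgw-admin bucket limit check                 # objects-per-shard fill level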
Best,
Christian Kugler