November 2023 - ceph-users

resharding RocksDB after upgrade to Pacific breaks OSDs

by Denis Polom

Hi we upgraded our Ceph cluster from latest Octopus to Pacific 16.2.14 and then we followed the docs (https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#r… <https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#r…>) to reshard RocksDB on our OSDs. Despite resharding reports operation as successful, OSD fails to start. # ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-5/ --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" reshard reshard success Oct 30 12:44:17 octopus2 ceph-osd[4521]: /build/ceph-16.2.14/src/kv/RocksDBStore.cc: 1223: FAILED ceph_assert(recreate_mode) Oct 30 12:44:17 octopus2 ceph-osd[4521]: ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific (stable) Oct 30 12:44:17 octopus2 ceph-osd[4521]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x564047cb92b2] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 2: /usr/bin/ceph-osd(+0xaa948a) [0x564047cb948a] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 3: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1609) [0x564048794829] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 4: (BlueStore::_open_db(bool, bool, bool)+0x601) [0x564048240421] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 5: (BlueStore::_open_db_and_around(bool, bool)+0x26b) [0x5640482a5f8b] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 6: (BlueStore::_mount()+0x9c) [0x5640482a896c] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 7: (OSD::init()+0x38a) [0x564047daacea] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 8: main() Oct 30 12:44:17 octopus2 ceph-osd[4521]: 9: __libc_start_main() Oct 30 12:44:17 octopus2 ceph-osd[4521]: 10: _start() Oct 30 12:44:17 octopus2 ceph-osd[4521]: 0> 2023-10-30T12:44:17.088+0000 7f4971ed2100 -1 *** Caught signal (Aborted) ** Oct 30 12:44:17 octopus2 ceph-osd[4521]: in thread 7f4971ed2100 thread_name:ceph-osd Oct 30 12:44:17 octopus2 ceph-osd[4521]: ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific (stable) Oct 30 12:44:17 octopus2 ceph-osd[4521]: 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f4972921730] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 2: gsignal() Oct 30 12:44:17 octopus2 ceph-osd[4521]: 3: abort() Oct 30 12:44:17 octopus2 ceph-osd[4521]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19c) [0x564047cb9303] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 5: /usr/bin/ceph-osd(+0xaa948a) [0x564047cb948a] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 6: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1609) [0x564048794829] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 7: (BlueStore::_open_db(bool, bool, bool)+0x601) [0x564048240421] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 8: (BlueStore::_open_db_and_around(bool, bool)+0x26b) [0x5640482a5f8b] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 9: (BlueStore::_mount()+0x9c) [0x5640482a896c] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 10: (OSD::init()+0x38a) [0x564047daacea] Oct 30 12:44:17 octopus2 ceph-osd[4521]: 11: main() Oct 30 12:44:17 octopus2 ceph-osd[4521]: 12: __libc_start_main() Oct 30 12:44:17 octopus2 ceph-osd[4521]: 13: _start() Oct 30 12:44:17 octopus2 ceph-osd[4521]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. Oct 30 12:44:17 octopus2 ceph-osd[4521]: -1> 2023-10-30T12:44:17.084+0000 7f4971ed2100 -1 /build/ceph-16.2.14/src/kv/RocksDBStore.cc: In function 'int RocksDBStore::do_open(std::ostream&, bool, bool, const string&)' thread 7f4971ed2100 time 2023-10-30T12:44:17.087172+0000 I've submitted bug report here https://tracker.ceph.com/issues/63353 but may be community here have some ideas how to fix it unless it's really a bug. Thanks

6 months, 2 weeks

7
8
0 0

ceph orch problem

by Dario Graña

Hi everyone! I've a ceph cluster running AlmaLinux9, podman and Ceph Quincy (17.6.0). Since yesterday I was having some problems but the last one is that the ceph orch command is hanging up. I have seen the logs but I didn't find anything relevant that could help to fix the problem. Podman shows the daemons running and if I stop one daemon it appears again after a few seconds. I also tried *ceph mgr fail* command, daemon order changes, but ceph orch still not working. ceph orch pause/resume also are not working. Disabling and enabling *cephadm* module didn't help. Any help to understand what's going on would be welcome. Thanks in advance. -- Dario Graña PIC (Port d'Informació Científica) Campus UAB, Edificio D E-08193 Bellaterra, Barcelona http://www.pic.es Avis - Aviso - Legal Notice: http://legal.ifae.es

6 months, 2 weeks

2
3
0 0

data corruption after rbd migration

by Nikola Ciprich

Dear ceph users and developers, we're struggling with strange issue which I think might be a bug causing snapshot data corruption while migrating RBD image we've tracked it to minimal set of steps to reproduce using VM with one 32G drive: rbd create --size 32768 sata/D2 virsh create xml_orig.xml rbd snap create ssd/D1@snap1 rbd export-diff ssd/D1@snap1 - | rbd import-diff - sata/D2 rbd export --export-format 1 --no-progress ssd/D1@snap1 - | xxh64sum 505dde3c49775773 rbd export --export-format 1 --no-progress sata/D2@snap1 - | xxh64sum 505dde3c49775773 # <- checksums match - OK virsh shutdown VM rbd migration prepare ssd/D1 sata/D1Z virsh create xml_new.xml rbd snap create sata/D1Z@snap2 rbd export-diff --from-snap snap1 sata/D1Z@snap2 - | rbd import-diff - sata/D2 rbd migration execute sata/D1Z rbd migration commit sata/D1Z rbd export --export-format 1 --no-progress sata/D1Z@snap2 - | xxh64sum 19892545c742c1de rbd export --export-format 1 --no-progress sata/D2@snap2 - | xxh64sum cc045975baf74ba8 # <- snapshosts differ OS is alma 9 based, kernel 5.15.105, CEPH 17.2.6, qemu-8.0.3 we tried disabling VM disk caches as well as discard, to no avail. my first question is, is it correct to assume creating snapshots while live migrating data is safe? if so, any ideas on where the problem could be? If I could provide more info, please let me know with regards nikola ciprich -- ------------------------------------- Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax: +420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: servis(a)linuxbox.cz -------------------------------------

6 months, 2 weeks

2
2
0 0

17.2.7 quincy

by Matthew Darwin

Hi all, I see17.2.7 quincy is published as debian-bullseye packages. So I tried it on a test cluster. I must say I was not expecting the big dashboard change in a patch release. Also all the "cluster utilization" numbers are all blank now (any way to fix it?), so the dashboard is much less usable now. Thoughts?

6 months, 2 weeks

2
7
0 0

upgrade 17.2.6 to 17.2.7 , any issues?

by Dmitry Melekhov

Hello! I want to do upgrade from 17.2.6 to 17.2.7 , all 3 servers run alma 9. Could you share your experience? Any issues? Thank you!

6 months, 2 weeks

2
2
0 0

Re: RGW access logs with bucket name

by Boris Behrens

Bringing up that topic again: is it possible to log the bucket name in the rgw client logs? currently I am only to know the bucket name when someone access the bucket via https://TLD/bucket/object instead of https://bucket.TLD/object. Am Di., 3. Jan. 2023 um 10:25 Uhr schrieb Boris Behrens <bb(a)kervyn.de>: > Hi, > I am looking forward to move our logs from > /var/log/ceph/ceph-client...log to our logaggregator. > > Is there a way to have the bucket name in the log file? > > Or can I write the rgw_enable_ops_log into a file? Maybe I could work with > this. > > Cheers and happy new year > Boris > -- Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im groÃƒ¼en Saal.

6 months, 2 weeks

4
6
0 0

Emergency, I lost 4 monitors but all osd disk are safe

by Mohamed LAMDAOUAR

Hello, I have 7 machines on CEPH cluster, the service ceph runs on a docker container. Each machine has 4 hdd of data (available) and 2 nvme sssd (bricked) During a reboot, the ssd bricked on 4 machines, the data are available on the HDD disk but the nvme is bricked and the system is not available. is it possible to recover the data of the cluster (the data disk are all available)

6 months, 2 weeks

7
14
0 0

"cephadm version" in reef returns "AttributeError: 'CephadmContext' object has no attribute 'fsid'"

by Martin Conway

Hi I wrote before about issues I was having with cephadm in 18.2.0 Sorry, I didn't see the helpful replies because my mail service binned the responses. I still can't get the reef version of cephadm to work properly. I had updated the system rpm to reef (ceph repo) and also upgraded the containerised ceph daemons to reef before my first email. Both the system package cephadm and the one found at /var/lib/ceph/${fsid}/cephadm.* return the same error when running "cephadm version" Traceback (most recent call last): File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 9468, in <module> main() File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 9456, in main r = ctx.func(ctx) File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2108, in _infer_image ctx.image = infer_local_ceph_image(ctx, ctx.container_engine.path) File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2191, in infer_local_ceph_image container_info = get_container_info(ctx, daemon, daemon_name is not None) File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2154, in get_container_info matching_daemons = [d for d in daemons if daemon_name_or_type(d) == daemon_filter and d['fsid'] == ctx.fsid] File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2154, in <listcomp> matching_daemons = [d for d in daemons if daemon_name_or_type(d) == daemon_filter and d['fsid'] == ctx.fsid] File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 217, in __getattr__ return super().__getattribute__(name) AttributeError: 'CephadmContext' object has no attribute 'fsid' I am running into other issues as well, but I think they may point back to this issue of "'CephadmContext' object has no attribute 'fsid'" Any help would be appreciated. Regards, Martin Conway IT and Digital Media Manager Research School of Physics Australian National University Canberra ACT 2601 +61 2 6125 1599 https://physics.anu.edu.au<https://physics.anu.edu.au/>

6 months, 2 weeks

4
8
0 0

diskprediction_local module and trained models

by Can Özyurt

Hi everyone, We have recently noticed diskprediction_local module only works for a set of manufacturers. Hence we have the following questions: Are there any plans to support more manufacturers in the near future? Can we contribute to the process of training new models and how? Can the existing models be used with disks of other vendors (with hackish methods)? Is it that It just doesn't work or it would not be reliable? Basically how does the trained model and the vendor relation work? Say we have a disk for an existing model, what is the best way to test the trained model and the module itself? Thanks in advance

6 months, 2 weeks

1
0
0 0

CephFS scrub causing MDS OOM-kill

by Denis Polom

Hi, I did setup CephFS forward scrub by executing cmd # ceph tell mds.cephfs:0 scrub start / recursive { "return_code": 0, "scrub_tag": "37a67f72-89a3-474e-8f8b-1e55cb979e14", "mode": "asynchronous" } But immediately after it started, memory usage on MDS that keeps rank 0 increased from 2GB to 32GB until MDS was killed by OOM killer. Why does it consume all the memory? Can it be adjusted somehow? Thank you

6 months, 2 weeks

1
0
0 0

2024

2023

2022

2021

2020

2019

ceph-users November 2023