Hi Team,
Please help with the issue raised below.
Best Regards,
Lokendra
On Wed, Dec 13, 2023 at 2:33 PM Kushagr Gupta <kushagrguptasps.mun(a)gmail.com>
wrote:
> Hi Team,
>
> *Environment:*
> We have deployed a ceph setup using ceph-ansible.
> Ceph-version: 18.2.0
> OS: Almalinux 8.8
> We have a 3 node-setup.
>
> *Queries:*
>
> 1. Is SNMP supported with ceph-ansible? Is there some other way to set up
> an SNMP gateway for the Ceph cluster? (See the sketch after this list.)
> 2. Do we have a procedure to set the backend for the Ceph orchestrator via
> ceph-ansible? Which backend should we use?
> 3. Are there any Ceph MIB files which work independently of Prometheus?
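> For context, per the cephadm docs linked below, an SNMP gateway is described
> by a service spec roughly like the following (all values are placeholders,
> not from our setup); our question is whether there is an equivalent path for
> ceph-ansible:
> cat > snmp-gateway.yml <<'EOF'
> service_type: snmp-gateway
> placement:
>   count: 1
> spec:
>   credentials:
>     snmp_community: public
>   port: 9464
>   snmp_destination: 192.168.1.1:162
>   snmp_version: V2c
> EOF
> ceph orch apply -i snmp-gateway.yml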
>
>
> *Description:*
> We are trying to perform SNMP monitoring for the ceph cluster using the
> following link:
>
> 1.
> https://docs.ceph.com/en/quincy/cephadm/services/snmp-gateway/#:~:text=Ceph's%20SNMP%20integration%20focuses%20on,a%20designated%20SNMP%20management%20platform
> .
> 2.
> https://www.ibm.com/docs/en/storage-ceph/7?topic=traps-deploying-snmp-gatew…
>
> But when we try to follow the steps mentioned in the above links, any
> "ceph orch" command fails with the following error:
> "Error ENOENT: No orchestrator configured (try `ceph orch set backend`)"
>
> After going through the following links:
> 1.
> https://www.ibm.com/docs/en/storage-ceph/5?topic=operations-use-ceph-orches…
> 2.
> https://forum.proxmox.com/threads/ceph-mgr-orchestrator-enabled-but-showing…
> 3. https://docs.ceph.com/en/latest/mgr/orchestrator_modules/
> I think that, since we have deployed the cluster using ceph-ansible, we
> can't use the ceph orch commands.
> When we checked in the cluster, the following are the enabled modules:
> "
> [root@storagenode1 ~]# ceph mgr module ls
> MODULE
> balancer on (always on)
> crash on (always on)
> devicehealth on (always on)
> orchestrator on (always on)
> pg_autoscaler on (always on)
> progress on (always on)
> rbd_support on (always on)
> status on (always on)
> telemetry on (always on)
> volumes on (always on)
> alerts on
> iostat on
> nfs on
> prometheus on
> restful on
> dashboard -
> influx -
> insights -
> localpool -
> mds_autoscaler -
> mirroring -
> osd_perf_query -
> osd_support -
> rgw -
> selftest -
> snap_schedule -
> stats -
> telegraf -
> test_orchestrator -
> zabbix -
> [root@storagenode1 ~]#
> "
> As can be seen above, orchestrator is on.
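> For reference, selecting an orchestrator backend is normally done through the
> cephadm module, roughly as below (a sketch; as far as we understand, on a
> ceph-ansible deployment this only becomes usable after the cluster has been
> adopted by cephadm):
> ceph mgr module enable cephadm
> ceph orch set backend cephadm
> ceph orch status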
>
> Also, we were exploring SNMP further, and as per the file
> "/etc/prometheus/ceph/ceph_default_alerts.yml" on the Ceph storage node, the
> OIDs in that file represent the OIDs for Ceph components exposed via Prometheus.
> For example:
> for the following OID: 1.3.6.1.4.1.50495.1.2.1.2.1
> [root@storagenode3 ~]# snmpwalk -v 2c -c 209ijvfwer0df92jd -O e 10.0.1.36
> 1.3.6.1.4.1.50495.1.2.1.2.1
> CEPH-MIB::promHealthStatusError = No Such Object available on this agent
> at this OID
> [root@storagenode3 ~]#
>
> Kindly help us with this.
>
> Thanks and regards,
> Kushagra Gupta
>
--
~ Lokendra
skype: lokendrarathour
Hello again,
We understood, with the help of Dan van der Ster and Mykola from Clyso, that the issue arises from a hardware crash. After upgrading Ceph, we encountered an unexpected crash that resulted in a reboot.
After comparing the first blocks of running and failed OSDs, we found that the HW crash corrupted the first 23 bytes of the block devices.
The first few bytes of the block device of a failed OSD contain:
00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000010: 0000 0000 0000 0a63 3965 6533 6566 362d .......c9ee3ef6-
00000020: 3733 6437 2d34 3032 392d 3963 6436 2d30 73d7-4029-9cd6-0
00000030: 3836 6363 3935 6432 6632 370a 0201 a901 86cc95d2f27.....
00000040: 0000 c9ee 3ef6 73d7 4029 9cd6 086c c95d ....>.s.@)...l.]
00000050: 2f27 0000 4002 4707 0000 8ceb 6665 c827 /'..@.G.....fe.'
00000060: 0409 0400 0000 6d61 696e 0d00 0000 0a00 ......main……
and a running OSD contains:
00000000: 626c 7565 7374 6f72 6520 626c 6f63 6b20 bluestore block
00000010: 6465 7669 6365 0a38 6637 3732 3532 312d device.8f772521-
00000020: 6535 3663 2d34 6135 622d 6239 3763 2d31 e56c-4a5b-b97c-1
00000030: 6233 3630 6439 6266 6135 340a 0201 a901 b360d9bfa54.....
00000040: 0000 8f77 2521 e56c 4a5b b97c 1b36 0d9b ...w%!.lJ[.|.6..
00000050: fa54 0000 4002 4707 0000 c8eb 6665 cd4c .T..@.G.....fe.L
00000060: 6233 0400 0000 6d61 696e 0d00 0000 0a00 b3....main……
It turned out that the first 23 bytes of data were corrupted during the HW crash. So we copied the first 23 bytes from a running OSD with the following command:
dd if=/dev/ceph-block-21/block-21 of=/root/header.21.dat bs=23 count=1
Then we wrote those exact 23 bytes to every failed OSD block device (after taking a backup), and the problem was resolved:
for i in {12..20} ; do dd if=/dev/ceph-block-$i/block-$i of=/root/backup.$i.1M bs=1M count=1 ; dd if=/root/header.21.dat of=/dev/ceph-block-$i/block-$i bs=23 count=1 ; done
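Before restarting each OSD, the restored label can be sanity-checked with the same tools used above, for example:
ceph-bluestore-tool show-label --dev /dev/ceph-block-12/block-12
ceph-volume lvm activate {osd.id} {osd_fsid}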
At the end of the day, it turned out that the lsiutil tool is not compatible with our kernel, which caused the crash. The following link contains the detailed information:
https://support.huawei.com/enterprise/en/knowledge/KB1000001578
https://supp…
I want to thank Dan and Mykola from Clyso and appreciate their help.
BR,
Huseyin Cotuk
hcotuk(a)gmail.com
Hi all,
A quick reminder that the User + Dev Monthly Meetup that was scheduled for
this week, December 21, is cancelled due to the holidays.
The User + Dev Monthly Meetup will resume in the new year on January 18. If
you have a topic you'd like to present at an upcoming meetup, you're
welcome to submit it here:
https://docs.google.com/forms/d/e/1FAIpQLSdboBhxVoBZoaHm8xSmeBoemuXoV_rmh4v…
Wishing everyone a happy holiday season!
Laura Flores
--
Laura Flores
She/Her/Hers
Software Engineer, Ceph Storage <https://ceph.io>
Chicago, IL
lflores(a)ibm.com | lflores(a)redhat.com <lflores(a)redhat.com>
M: +17087388804
Hello Cephers,
I have two identical Ceph clusters with 32 OSDs each, running radosgw with EC. They were running Octopus on Ubuntu 20.04.
On one of these clusters, I upgraded the OS to Ubuntu 22.04 and Ceph to Quincy 17.2.6. This cluster completed the process without any issue and works as expected.
On the second cluster, I followed the same procedure and upgraded the cluster. After the upgrade, 9 of the 32 OSDs cannot be activated. AFAIU, the labels of these OSDs cannot be read. The ceph-volume lvm activate {osd.id} {osd_fsid} command fails as below:
stderr: failed to read label for /dev/ceph-block-13/block-13: (5) Input/output error
stderr: 2023-12-19T11:46:25.310+0300 7f088cd7ea80 -1 bluestore(/dev/ceph-block-13/block-13) _read_bdev_label bad crc on label, expected 2340927273 != actual 2067505886
All ceph-bluestore-tool and ceph-objectstore-tool commands fail with the same message, so I cannot try repair, fsck, or migrate.
# ceph-bluestore-tool repair --deep yes --path /var/lib/ceph/osd/ceph-13/
failed to load os-type: (2) No such file or directory
2023-12-19T13:57:06.551+0300 7f39b1635a80 -1 bluestore(/var/lib/ceph/osd/ceph-13/block) _read_bdev_label bad crc on label, expected 2340927273 != actual 2067505886
I also tried show-label with ceph-bluestore-tool, without success.
# ceph-bluestore-tool show-label --dev /dev/ceph-block-13/block-13
unable to read label for /dev/ceph-block-13/block-13: (5) Input/output error
2023-12-19T14:01:19.668+0300 7fdcdd111a80 -1 bluestore(/dev/ceph-block-13/block-13) _read_bdev_label bad crc on label, expected 2340927273 != actual 2067505886
I can get the information, including osd_fsid and block_uuid, for all failed OSDs via ceph-volume lvm list, as below.
====== osd.13 ======
[block] /dev/ceph-block-13/block-13
block device /dev/ceph-block-13/block-13
block uuid jFaTba-ln5r-muQd-7Ef9-3tWe-JwvO-qW9nqi
cephx lockbox secret
cluster fsid 4e7e7d1c-22db-49c7-9f24-5a75cd3a3b9f
cluster name ceph
crush device class None
encrypted 0
osd fsid c9ee3ef6-73d7-4029-9cd6-086cc95d2f27
osd id 13
osdspec affinity
type block
vdo 0
devices /dev/mapper/mpathb
All vgs and lvs look healthy.
# lvdisplay ceph-block-13/block-13
--- Logical volume ---
LV Path /dev/ceph-block-13/block-13
LV Name block-13
VG Name ceph-block-13
LV UUID jFaTba-ln5r-muQd-7Ef9-3tWe-JwvO-qW9nqi
LV Write Access read/write
LV Creation host, time ank-backup01, 2023-11-29 10:41:53 +0300
LV Status available
# open 0
LV Size <7.28 TiB
Current LE 1907721
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:33
This is a single node cluster running only radosgw. The environment is as follows:
# ceph -v
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "osd_replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -2,
                "item_name": "default~hdd"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "osd"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "default.rgw.buckets.data",
        "type": 3,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "choose_indep",
                "num": 0,
                "type": "osd"
            },
            {
                "op": "emit"
            }
        ]
    }
]
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 226.29962 root default
-3 226.29962 host ank-backup01
0 hdd 7.29999 osd.0 up 1.00000 1.00000
1 hdd 7.29999 osd.1 up 1.00000 1.00000
2 hdd 7.29999 osd.2 up 1.00000 1.00000
3 hdd 7.29999 osd.3 up 1.00000 1.00000
4 hdd 7.29999 osd.4 up 1.00000 1.00000
5 hdd 7.29999 osd.5 up 1.00000 1.00000
6 hdd 7.29999 osd.6 up 1.00000 1.00000
7 hdd 7.29999 osd.7 up 1.00000 1.00000
8 hdd 7.29999 osd.8 up 1.00000 1.00000
9 hdd 7.29999 osd.9 up 1.00000 1.00000
10 hdd 7.29999 osd.10 up 1.00000 1.00000
11 hdd 7.29999 osd.11 up 1.00000 1.00000
12 hdd 7.29999 osd.12 down 0 1.00000
13 hdd 7.29999 osd.13 down 0 1.00000
14 hdd 7.29999 osd.14 down 0 1.00000
15 hdd 7.29999 osd.15 down 0 1.00000
16 hdd 7.29999 osd.16 down 0 1.00000
17 hdd 7.29999 osd.17 down 0 1.00000
18 hdd 7.29999 osd.18 down 0 1.00000
19 hdd 7.29999 osd.19 down 0 1.00000
20 hdd 7.29999 osd.20 down 0 1.00000
21 hdd 7.29999 osd.21 up 1.00000 1.00000
22 hdd 7.29999 osd.22 up 1.00000 1.00000
23 hdd 7.29999 osd.23 up 1.00000 1.00000
24 hdd 7.29999 osd.24 up 1.00000 1.00000
25 hdd 7.29999 osd.25 up 1.00000 1.00000
26 hdd 7.29999 osd.26 up 1.00000 1.00000
27 hdd 7.29999 osd.27 up 1.00000 1.00000
28 hdd 7.29999 osd.28 up 1.00000 1.00000
29 hdd 7.29999 osd.29 up 1.00000 1.00000
30 hdd 7.29999 osd.30 up 1.00000 1.00000
31 hdd 7.29999 osd.31 up 1.00000 1.00000
Does anybody have any idea why the labels of these OSDs cannot be read? Any help would be appreciated.
Best Regards,
Huseyin Cotuk
hcotuk(a)gmail.com
Hi,
After adding a node to the cluster (3 nodes) with cephadm, how do I add OSDs with the same configuration as on the other nodes?
The other nodes have
12 drives for data (osd-block) and 2 drives for WAL (osd-wal). There are 6 LVs on each WAL disk for the 12 data drives.
I have added the OSDs with
ceph orch daemon add osd hostname:/dev/nvme0n1
How do I attach the wal devices to the OSDs?
I have the WAL volumes created
nvme3n1 259:5 0 349.3G 0 disk
|-ceph--75d65cd1--91e4--4a8f--869b--e2a550f83104-osd--wal--dad0df4e--149a--4e80--b451--79f9b81838b8 253:12 0 58.2G 0 lvm
|-ceph--75d65cd1--91e4--4a8f--869b--e2a550f83104-osd--wal--a5f2e93a--7bf0--4904--a233--3946b855c764 253:13 0 58.2G 0 lvm
|-ceph--75d65cd1--91e4--4a8f--869b--e2a550f83104-osd--wal--cc949e1b--2560--4d38--bc27--558550881726 253:14 0 58.2G 0 lvm
|-ceph--75d65cd1--91e4--4a8f--869b--e2a550f83104-osd--wal--8846f50e--7e92--4f66--a738--ce3a89650019 253:15 0 58.2G 0 lvm
|-ceph--75d65cd1--91e4--4a8f--869b--e2a550f83104-osd--wal--6d646762--483a--40ca--8c51--ea54e0684a94 253:16 0 58.2G 0 lvm
`-ceph--75d65cd1--91e4--4a8f--869b--e2a550f83104-osd--wal--74e58163--de1d--4062--a658--5b0356d43a87 253:17 0 58.2G 0 lvm
How do I attach the WAL volumes to the OSDs? (See the sketch after the listing below.)
osd.36 nvme0n1 259:1 0 5.8T 0 disk `-ceph--3df7c5c3--c2c0--4498--9e17--2af79e448abc-osd--block--804b50ea--d44c--4cad--9177--8d722f737df9 253:0 0 5.8T 0 lvm
osd.37 nvme1n1 259:3 0 5.8T 0 disk `-ceph--30858acc--c48b--4a08--bb98--4c9b59112c59-osd--block--0a3a198b--66ec--4ed9--94da--fb171e190e38 253:1 0 5.8T 0 lvm
nvme3n1
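My understanding so far (unverified) is that cephadm can express this kind of data+WAL layout declaratively as an OSD service spec, roughly like this (the host name below is a placeholder):
cat > osd-with-wal.yml <<'EOF'
service_type: osd
service_id: osd_data_plus_wal
placement:
  hosts:
    - newhost          # placeholder host name
spec:
  data_devices:
    paths:
      - /dev/nvme0n1
      - /dev/nvme1n1
  wal_devices:
    paths:
      - /dev/nvme3n1
EOF
ceph orch apply -i osd-with-wal.yml
For OSDs already created without a separate WAL, ceph-bluestore-tool also has a bluefs-bdev-new-wal command for attaching a new WAL device to an existing OSD, but I have not verified that on this setup.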
Thank you,
Anantha
The current Ceph version that we use is 17.2.7.
We see the errors below in the Manager logs:
2 mgr.server handle_open ignoring open from mds.storage.node01.zjltbu v2:10.40.99.11:6800/1327026642; not ready for session (expect reconnect)
0 7faf43715700 1 mgr finish mon failed to return metadata for mds.storage.node01.zjltbu: (2) No such file or directory
and when running:
# ceph fs status
Error EINVAL: Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1759, in _handle_command
return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 462, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/status/module.py", line 109, in handle_fs_status
assert metadata
AssertionError
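For reference, the traceback above comes from the status module asserting because it received no metadata for that MDS; commands such as the following show what metadata the monitors currently hold (just diagnostics, not a fix):
ceph mds metadata
ceph fs dump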
Does anyone know what these errors mean and what we can do to fix them?
Thanks,
Manolis
Long story short, we've got a lot of empty directories that I'm working on removing. While removing directories, using "perf top -g" we can watch the MDS daemon go to 100% CPU usage in "SnapRealm::split_at" and "CInode::is_ancestor_of".
It's this two-year-old bug that is still around.
https://tracker.ceph.com/issues/53192
To help combat this, we've moved our snapshot schedule down the tree one level so the snaprealm is significantly smaller. Our luck with multiple active MDSs hasn't been great, so we are still on a single MDS. To help split the load, I'm working on moving different workloads to different filesystems within Ceph.
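For anyone wanting to do the same, and assuming the mgr snap_schedule module is what drives the schedule, the move looks roughly like this (with /data/projects as a stand-in path):
ceph fs snap-schedule remove /
ceph fs snap-schedule add /data/projects 1d
ceph fs snap-schedule status /data/projects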
A user can still fairly easily overwhelm the MDS's finisher thread and basically stop all cephfs io through that MDS. I'm hoping we can get some other people chiming in with "Me Too!" so there can be some traction behind fixing this.
It's a longstanding bug so the version is less important, but we are on 17.2.7.
Thoughts?
-paul
--
Paul Mezzanini
Platform Engineer III
Research Computing
Rochester Institute of Technology
“End users is a description, not a goal.”
Hi,
I have an 18.2.0 Ceph cluster and my MDSs are now crashing repeatedly.
After a few automatic restarts, every MDS is removed and only one stays
active. But it's flagged "laggy" and I can't even start a scrub on it.
In the log I have this during crashes:
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
2023-12-13T14:54:02.721+0000 7f15ea108700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.0/rpm/el8/BUILD/ceph-18.2.0/src/mds/MDCache.cc:
In function 'void MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*,
CDentry*, snapid_t, CInode**, CDentry::linkage_t*)' thread 7f15ea108700
time 2023-12-13T14:54:02.720383+0000
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.0/rpm/el8/BUILD/ceph-18.2.0/src/mds/MDCache.cc:
1638: FAILED ceph_assert(follows >= realm->get_newest_seq())
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef
(stable)
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x135) [0x7f15f5ef9dbb]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
2: /usr/lib64/ceph/libceph-common.so.2(+0x2a8f81) [0x7f15f5ef9f81]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
3: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*,
snapid_t, CInode**, CDentry::linkage_t*)+0xae2) [0x55727bc0c672]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
4: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*,
snapid_t)+0xc5) [0x55727bc0d0d5]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
5: (Locker::scatter_writebehind(ScatterLock*)+0x5f6) [0x55727bce40f6]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
6: (Locker::simple_sync(SimpleLock*, bool*)+0x388) [0x55727bceb908]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
7: (Locker::scatter_nudge(ScatterLock*, MDSContext*, bool)+0x30d)
[0x55727bcef25d]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
8: (Locker::scatter_tick()+0x1e7) [0x55727bd0bc37]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
9: (Locker::tick()+0xd) [0x55727bd0c0ed]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
10: (MDSRankDispatcher::tick()+0x1ef) [0x55727bb08e9f]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
11: (Context::complete(int)+0xd) [0x55727bade2cd]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
12: (CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x16d)
[0x7f15f5fea1cd]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
13: (CommonSafeTimerThread<ceph::fair_mutex>::entry()+0x11)
[0x7f15f5feb2a1]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
14: /lib64/libpthread.so.0(+0x81ca) [0x7f15f4ca11ca]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
15: clone()
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
*** Caught signal (Aborted) **
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
in thread 7f15ea108700 thread_name:safe_timer
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef
(stable)
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f15f4cabcf0]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
2: gsignal()
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
3: abort()
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x18f) [0x7f15f5ef9e15]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
5: /usr/lib64/ceph/libceph-common.so.2(+0x2a8f81) [0x7f15f5ef9f81]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
6: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*,
snapid_t, CInode**, CDentry::linkage_t*)+0xae2) [0x55727bc0c672]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
7: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*,
snapid_t)+0xc5) [0x55727bc0d0d5]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
8: (Locker::scatter_writebehind(ScatterLock*)+0x5f6) [0x55727bce40f6]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
9: (Locker::simple_sync(SimpleLock*, bool*)+0x388) [0x55727bceb908]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
10: (Locker::scatter_nudge(ScatterLock*, MDSContext*, bool)+0x30d)
[0x55727bcef25d]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
11: (Locker::scatter_tick()+0x1e7) [0x55727bd0bc37]
Dec 13 15:54:02 ceph04
ceph-ff6e50de-ed72-11ec-881c-dca6325c2cc4-mds-mds01-ceph04-krxszj[33486]:
12: (Locker::tick()+0xd) [0x55727bd0c0ed]
I tried the following to get it back:
ceph fs fail cephfs
cephfs-data-scan cleanup --filesystem cephfs cephfs_data
cephfs-journal-tool --rank cephfs:0 event recover_dentries list
cephfs-table-tool cephfs:all reset session
cephfs-journal-tool --rank cephfs:0 journal reset
cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 --filesystem cephfs cephfs_data
cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 --filesystem cephfs cephfs_data
cephfs-data-scan scan_links --filesystem cephfs
ceph mds repaired 0
ceph fs set cephfs joinable true
(with the data scan commands of course running on several systems
simultaneously)
Unfortunately it didn't help at all.
I'm quite sure that my undersized hardware is the root cause, because
problems with metadata already occurred in the past (with 17.x), and it
was always during times of higher load (e.g. taking Proxmox backups
while deleting CephFS snapshots). I now have a strategy to lower the
load and update the hardware. But still - I need my data back.
Any ideas?
Cheers,
Thomas
--
http://www.widhalm.or.at
GnuPG : 6265BAE6 , A84CB603
Threema: H7AV7D33
Telegram, Signal: widhalmt(a)widhalm.or.at
Hello, I downloaded cephadm from the link below:
https://download.ceph.com/rpm-18.2.0/el8/noarch/
I changed the addresses of the images to point to my private registry,
```
DEFAULT_IMAGE = 'opkbhfpspsp0101.fns/ceph/ceph:v18'
DEFAULT_IMAGE_IS_MAIN = False
DEFAULT_IMAGE_RELEASE = 'reef'
DEFAULT_PROMETHEUS_IMAGE = 'opkbhfpspsp0101.fns/ceph/prometheus:v2.43.0'
DEFAULT_LOKI_IMAGE = 'opkbhfpspsp0101.fns/ceph/loki:2.4.0'
DEFAULT_PROMTAIL_IMAGE = 'opkbhfpspsp0101.fns/ceph/promtail:2.4.0'
DEFAULT_NODE_EXPORTER_IMAGE =
'opkbhfpspsp0101.fns/ceph/node-exporter:v1.5.0'
DEFAULT_ALERT_MANAGER_IMAGE =
'opkbhfpspsp0101.fns/ceph/alertmanager:v0.25.0'
DEFAULT_GRAFANA_IMAGE = 'opkbhfpspsp0101.fns/ceph/ceph-grafana:9.4.7'
DEFAULT_HAPROXY_IMAGE = 'opkbhfpspsp0101.fns/ceph/haproxy:2.3'
DEFAULT_KEEPALIVED_IMAGE = 'opkbhfpspsp0101.fns/ceph/keepalived:2.2.4'
DEFAULT_SNMP_GATEWAY_IMAGE = 'opkbhfpspsp0101.fns/ceph/snmp-notifier:v1.2.1'
DEFAULT_ELASTICSEARCH_IMAGE =
'opkbhfpspsp0101.fns/ceph/elasticsearch:6.8.23'
DEFAULT_JAEGER_COLLECTOR_IMAGE =
'opkbhfpspsp0101.fns/ceph/jaeger-collector:1.29'
DEFAULT_JAEGER_AGENT_IMAGE = 'opkbhfpspsp0101.fns/ceph/jaeger-agent:1.29'
DEFAULT_JAEGER_QUERY_IMAGE = 'opkbhfpspsp0101.fns/ceph/jaeger-query:1.29'
DEFAULT_REGISTRY = 'opkbhfpspsp0101.fns' # normalize unqualified digests
to this
```
but I encounter this error `
File "/sbin/cephadm", line 10098
PK
^
`. Also, there is an unexpected line at the beginning of the cephadm file:
PK^C^D^T^@^@^@^@^@¥<9a>^CW<8e>^[º^×Ü^E^@×Ü^E^@^K^@^@^@__main__.py#!/usr/bin/python3
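From the header shown above, the downloaded file appears to be packaged as a Python zipapp (the PK bytes are a zip archive signature and __main__.py is the first archive member), so editing it as plain text would corrupt it. A sketch of the approach that leaves the binary untouched and overrides the images instead, reusing the registry name from above (the --mon-ip value is a placeholder):
```
cephadm --image opkbhfpspsp0101.fns/ceph/ceph:v18 bootstrap --mon-ip <mon-ip>
ceph config set mgr mgr/cephadm/container_image_prometheus opkbhfpspsp0101.fns/ceph/prometheus:v2.43.0
ceph config set mgr mgr/cephadm/container_image_grafana opkbhfpspsp0101.fns/ceph/ceph-grafana:9.4.7
ceph config set mgr mgr/cephadm/container_image_alertmanager opkbhfpspsp0101.fns/ceph/alertmanager:v0.25.0
ceph config set mgr mgr/cephadm/container_image_node_exporter opkbhfpspsp0101.fns/ceph/node-exporter:v1.5.0
```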
Hi Everyone,
We are in the process of planning a ceph cluster deployment for our data infrastructure.
To provide you with a bit of context, we have deployed hardware across two data halls in our data center, and they are connected via a 10Gb interconnect.
The hardware configuration for the 4 Ceph cluster nodes (2 servers in each data hall):
* 2 x AMD EPYC 7513 - 32 Cores / 64 Threads
* 512GB RAM
* 2 x 960 GB (OS DISKS)
* 8x Micron 7450 PRO 7680GB NVMe - PCIe Gen4
* Intel X550-T2 - 10GbE Dual-Port RJ45 Server Adaptor
Our primary usage is the Object Gateway, and we will be running 4 RGW services.
We are aiming to deploy using cephadm and to utilize all nodes for MON/MGR/RGW and OSDs.
Given our limited experience with Ceph, we are reaching out to the knowledgeable members of this community for recommendations and best practices. We would greatly appreciate any insights or advice you can share regarding the following aspects:
Cluster Topology: Considering our hardware setup with two data halls connected via a 10Gb interconnect, what would be the recommended cluster topology for optimal performance and fault tolerance?
Best Practices for Deployment: Are there any recommended best practices for deploying Ceph in a similar environment? Any challenges we should be aware of?
Thank you in advance for your time and assistance.
Regards,
Amar