December 2023 - ceph-users

OSD has Rocksdb corruption that crashes ceph-bluestore-tool repair

by Malcolm Haak

Hello all, I had an OSD go offline due to UWE. When restarting the OSD service, to try and at least get it to drain cleanly of that data that wasn't damaged, the ceph-osd process would crash. I then attempted to repair it using ceph-bluestore-tool. I can run fsck and it will complete without issue, however when attempting to run repair it crashes in the exact same way that ceph-osd crashes. I'll attach the tail end of the output here: 2023-12-17T20:24:53.320+1000 7fdb7bf17740 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: stored = 1106056583, computed = 657190205, type = 1 in db/020524.sst offset 21626321 size 4014 code = Rocksdb transaction: PutCF( prefix = S key = 'per_pool_omap' value size = 1) -442> 2023-12-17T20:24:53.386+1000 7fdb7bf17740 -1 /usr/src/debug/ceph/ceph-18.2.0/src/os/bluestore/BlueStore.cc: In function 'unsigned int BlueStoreRepairer::apply(KeyValueDB*)' thread 7fdb7bf17740 time 2023-12-17T20:24:53.341999+1000 /usr/src/debug/ceph/ceph-18.2.0/src/os/bluestore/BlueStore.cc: 17982: FAILED ceph_assert(ok) ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x136) [0x7fdb7b6502c9] 2: /usr/lib/ceph/libceph-common.so.2(+0x2504a4) [0x7fdb7b6504a4] 3: (BlueStoreRepairer::apply(KeyValueDB*)+0x5af) [0x559afb98cc7f] 4: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x45fc) [0x559afba2436c] 5: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x204) [0x559afba31014] 6: main() 7: /usr/lib/libc.so.6(+0x27cd0) [0x7fdb7ae45cd0] 8: __libc_start_main() 9: _start() -441> 2023-12-17T20:24:53.390+1000 7fdb7bf17740 -1 *** Caught signal (Aborted) ** in thread 7fdb7bf17740 thread_name:ceph-bluestore- ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) 1: /usr/lib/libc.so.6(+0x3e710) [0x7fdb7ae5c710] 2: /usr/lib/libc.so.6(+0x8e83c) [0x7fdb7aeac83c] 3: raise() 4: abort() 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x191) [0x7fdb7b650324] 6: /usr/lib/ceph/libceph-common.so.2(+0x2504a4) [0x7fdb7b6504a4] 7: (BlueStoreRepairer::apply(KeyValueDB*)+0x5af) [0x559afb98cc7f] 8: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x45fc) [0x559afba2436c] 9: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x204) [0x559afba31014] 10: main() 11: /usr/lib/libc.so.6(+0x27cd0) [0x7fdb7ae45cd0] 12: __libc_start_main() 13: _start() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. The reason I need to get this OSD functioning is I had two other OSD's fail causing a single PG to be in down state. The weird thing is, I got one of those back up without issue (ceph-osd crashed due to root filling and alert not sending) but the PG is still down. So I need to get this other one back up (or the data extracted) to get that PG back from down. Thanks in advance

5 months

2
1
0 0

Cephfs MDS tunning for deep-learning data-flow

by mhnx

Hello everyone! How are you doing? I wasn't around for two years but I'm back and working on a new development. I deployed 2x ceph cluster: 1- user_data:5x node [8x4TB Sata SSD, 2x 25Gbit network], 2- data-gen: 3x node [8x4TB Sata SSD, 2x 25Gbit network], note: hardware is not my choice and I know I have TRIM issue and also I couldn't use any PCI-E nvme for wal+db because 1u servers and no empty slots --------------------- At test phase everything was good, I reached 1GB/s for 18 clients at the same time. But when I migrate to production (60 GPU server client + 40 CPU server client) the speed issue begin because of the default parameters as usual and now I'm working on adaptation by debugging current data work flow I have and I'm researching how can I improve my environment. So far, I couldn't find useful guide or informations in one place and I just wanted to share my findings, benchmarks and ideas with the community and if I'm lucky enough, maybe I will get awesome recommendations from some old friends and enjoy get in touch after a while. :) Starting from here, I will only share technical information about my environment: 1- Cluster user_data: 5x node [8x4TB Sata SSD, 2x 25Gbit network] = Replication 2 - A: I only have 1 pool in this cluster and information is below: - ceph df --- RAW STORAGE --- CLASS SIZE AVAIL USED RAW USED %RAW USED ssd 146 TiB 106 TiB 40 TiB 40 TiB 27.50 TOTAL 146 TiB 106 TiB 40 TiB 40 TiB 27.50 --- POOLS --- POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL .mgr 1 1 286 MiB 73 859 MiB 0 32 TiB cephfs.ud-data.meta 9 512 65 GiB 2.87M 131 GiB 0.13 48 TiB cephfs.ud-data.data 10 2048 23 TiB 95.34M 40 TiB 29.39 48 TiB - B: In this cluster, every user(50) has a subvolume and the quota is 1TB/for each users - C: In each subvolume, users has "home and data" directory. - D: home directory size 5-10GB and client uses it for docker home directory at each login - E: I'm also storing users personal or development data around 2TB/each user - F: I only have 1x active MDS server and 4x standby as below. - ceph fs status > ud-data - 84 clients > ======= > RANK STATE MDS ACTIVITY DNS INOS DIRS > CAPS > 0 active ud-data.ud-04.seggyv Reqs: 372 /s 4343k 4326k 69.7k > 2055k > POOL TYPE USED AVAIL > cephfs.ud-data.meta metadata 130G 47.5T > cephfs.ud-data.data data 39.5T 47.5T > STANDBY MDS > ud-data.ud-01.uatjle > ud-data.ud-02.xcoojt > ud-data.ud-05.rnhcfe > ud-data.ud-03.lhwkml > MDS version: ceph version 17.2.6 > (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable) - What is my issue? 2023-12-15T21:07:47.175542+0000 mon.ud-01 [WRN] Health check failed: 1 clients failing to respond to cache pressure (MDS_CLIENT_RECALL) 2023-12-15T21:09:35.002112+0000 mon.ud-01 [INF] MDS health message cleared (mds.?): Client gpu-server-11 failing to respond to cache pressure 2023-12-15T21:09:35.391235+0000 mon.ud-01 [INF] Health check cleared: MDS_CLIENT_RECALL (was: 1 clients failing to respond to cache pressure) 2023-12-15T21:09:35.391304+0000 mon.ud-01 [INF] Cluster is now healthy 2023-12-15T21:10:00.000169+0000 mon.ud-01 [INF] overall HEALTH_OK For every read and write in client's trying to reach ceph MDS server and requests some data: issue 1: home data is around 5-10GB and users need all the time. I need to store it one time and prevent new requests. issue 2: users process generates new data by only reading some data one time and they write generated data one time. No need to cache this data at all. What I want to do ??? 1- I want to deploy 2x active MDS server for only "home" directory in each subvolume: - These 2x home MDS servers must send the data to client and cache in the client to reduce new requests even for simple "ls" command 2- I want to deploy 2x active MDS server for only "data" directory in each subvolume: - These 2x MDS servers must be configured to not hold any CACHE if it is not required constantly. The cache life time must be short and must be independent. - Constantly requested data from one client must be cached locally in that client to reduce requests and load on the MDS server. ------------------------------------------------------------ I believe you understand my data-flow and my needs. Let's talk what we can do about it. Note: I'm still researching and these are my finding and my plan so far. it is not completed, and this is the main reason why I'm writing this mail. ceph fs set $MYFS max_mds 4 mds_cache_memory_limit | default 4GiB --> 16GiB mds_cache_reservation | default 0.05 --> ?? mds_health_cache_threshold | default 1.5 --> ?? mds_cache_trim_threshold | default 256KiB --> ?? mds_cache_trim_decay_rate | default 1.0 --> ?? mds_cache_mid mds_decay_halflife mds_client_prealloc_inos mds_dirstat_min_interval mds_session_cache_liveness_magnitude mds_session_cache_liveness_decay_rate mds_max_caps_per_client mds_recall_max_caps mds_recall_max_decay_threshold mds_recall_max_decay_rate mds_recall_global_max_decay_threshold mds_session_cap_acquisition_throttle mds_session_cap_acquisition_decay_rate mds_session_max_caps_throttle_ratio mds_cap_acquisition_throttle_retry_request_timeout -Manually pinning directory trees to a particular rank As you can see, I'm at the beginning of this journey and I will be grateful if you can help me, share your knowledge, even I'm ready to help developers to use my system as a test bench to improve ceph as always! Best regards folks! - Özkan

5 months

1
1
0 0

rbd trash: snapshot id is protected from removal

by Eugen Block

Hi, I've been searching and trying things but to no avail yet. This is uncritical because it's a test cluster only, but I'd still like to have a solution in case this somehow will make it into our production clusters. It's an Openstack Victoria Cloud with Ceph backend. If one tries to remove a glance image (openstack image delete {UUID}' which usually has a protected snapshot it will fail to do so, but apparently the snapshot is actually moved to the trash namespace. And since it is protected, I can't remove it: storage01:~ # rbd -p images snap ls 278ffe2b-67a7-40d0-87b7-903f2fc9c3b4 --all SNAPID NAME SIZE PROTECTED TIMESTAMP NAMESPACE 159 1a97db13-307e-4820-8dc2-8549e9ba1ad7 39 MiB Thu Dec 14 08:29:56 2023 trash (snap) storage01:~ # rbd snap rm --snap-id 159 images/278ffe2b-67a7-40d0-87b7-903f2fc9c3b4 rbd: snapshot id 159 is protected from removal. storage01:~ # rbd snap ls images/278ffe2b-67a7-40d0-87b7-903f2fc9c3b4 storage01:~ # This is a small image and only a test environment, but these orphans could potentially fill up lots of space. In a newer openstack version (I tried with Antelope) this doesn't seem to work like that anymore, so that's good. But how would I get rid of that trash snapshot in this cluster? Thanks! Eugen

5 months

2
2
0 0

Corrupted and inconsistent reads from CephFS on EC pool

by aschmitz

Hi everyone, I'm seeing different results from reading files, depending on which OSDs are running, including some incorrect reads with all OSDs running, in CephFS from a pool with erasure coding. I'm running Ceph 17.2.6. # More detail In particular, I have a relatively large backup of some files, combined with SHA-256 hashes of the files (which were verified when the backup was created, approximately 7 months ago). Verifying these hashes currently gives several errors, both in large and small files, but somewhat tilted towards larger files. Investigating which PGs stored the relevant files (using cephfs-data-scan pg_files) didn't show the problem to be isolated to one PG, but did show several PGs that contained OSD 15 as an active member. Taking OSD 15 offline leads to *better* reads (more files with correct SHA-256 hashes), but not completely correct reads. Further investigation implicated OSD 34 as another potential issue, but taking it offline also results in more correct files but not completely. Bringing the stopped OSDs (15 and 34) back online results in the earlier (incorrect) hashes when reading files, as might be expected, but this seems to demonstrate that the correct information (or at least more correct information) is still on the drives. The hashes I receive for a given corrupted file are consistent from read to read (including on different hosts, to avoid caching as an issue), but obviously sometimes change if I take an affected OSD offline. # Recent history I have Ceph configured with a deep scrub interval of approximately 30 days, and they have completed regularly with no issues identified. However, within the past two weeks I added two additional drives to the cluster, and rebalancing took about two weeks to complete: the placement groups I took notice of having issues were not deep scrubbed since the replacement completed, so it is possible something got corrupted during the rebalance. Neither OSD 15 nor 34 is a new drive, and as far as I have experienced (and Ceph's health indications have shown), all of the existing OSDs have behaved correctly up to this point. # Configuration I created an erasure coding profile for the pool in question using the following command: ceph osd erasure-code-profile set erasure_k4_m2 \ plugin=jerasure \ k=4 m=2 \ technique=blaum_roth \ crush-device-class=hdd And the following CRUSH rule is used for the pool: rule erasure_k4_m2_hdd_rule { id 3 type erasure min_size 4 max_size 6 step take default class hdd step choose indep 3 type host step chooseleaf indep 2 type osd step emit } # Questions 1. Does this behavior ring a bell to anyone? Is there something obvious I'm missing or should do? 2. Is deep scrubbing likely to help the situation? Hurt it? (Hopefully not hurt: I've prioritized deep scrubbing of the PGs on OSD 15 and 34, and will likely follow up with the rest of the pool.) 3. Is there a way to force "full reads" or otherwise to use all of the EC chunks (potentially in tandem with on-disk checksums) to identify the correct data, rather than a combination of the data from the primary OSDs? Thanks for any insights you might have, aschmitz

5 months

1
0
0 0

Etag change of a parent object

by Rok Jaklič

Hi, shouldn't etag of a "parent" object change when "child" objects are added on s3? Example: 1. I add an object to test bucket: "example/" - size 0 "example/" has an etag XYZ1 2. I add an object to test bucket: "example/test1.txt" - size 12 "example/test1.txt" has an etag XYZ2 "example/" has an etag XYZ1 ... should this change? I understand that object storage is not hierarchical by design and objects are "not connected" by some other means than the bucket name. Kind regards, Rok

5 months

4
3
0 0

Howto: 'one line patch' in deployed cluster?

by Harry G Coin

Is there a 'Howto' or 'workflow' to implement a one-line patch in a running cluster? With full understanding it will be gone on the next upgrade? Hopefully without having to set up an entire packaging/development environment? Thanks! To implement: * /Subject/: Re: Permanent KeyError: 'TYPE' ->17.2.7: return self.blkid_api['TYPE'] == 'part' * /From/: Sascha Lucas <ceph-users@xxxxxxxxx> Problem found: in my case this is caused by DRBD secondary block devices, which can not be read until promoted to primary. ceph_volume/util/disk.py runs in blkid(): $ blkid -c /dev/null -p /dev/drbd4 blkid: error: /dev/drbd4: Wrong medium type but does not care about its return code. A quick fix is to use the get() method to automatically fall back to None for non existing keys: --- a/ceph_volume/util/device.py 2023-11-10 07:00:01.552497107 +0000 +++ b/ceph_volume/util/device.py 2023-11-10 08:54:40.320718690 +0000 @@ -476,13 +476,13 @@ @property def is_partition(self): self.load_blkid_api() if self.disk_api: return self.disk_api['TYPE'] == 'part' elif self.blkid_api: - return self.blkid_api['TYPE'] == 'part' + return self.blkid_api.get('TYPE') == 'part' return False Don't know why this is triggered in 17.2.7.

5 months, 1 week

1
0
0 0

Ceph orch made block_db too small, not accounting for multiple nvmes, how-to fix it?

by Mikael Öhman

Hi Our hosts have 3 NVMEs and 48 spinning drives each. We found that ceph orch made the default lvm size for the block_db 1/3 the total size of the NVMEs. I suspect that ceph only considered one of the NVMEs when determining the size, based the closely related issue; https://tracker.ceph.com/issues/54541 We started having some bluefs spillover events now, so I'm looking for a way to fix this. Best idea I have so far is to manually specify the "block_db_size" in the osd_spec, then just recreating the entire block_db. Though I'm not sure if that means we'll hit the same issue https://tracker.ceph.com/issues/54541 instead. There would also be a lot of data to move in order to do this to a total of 588 OSD's. Maybe there is a way to just maybe remove and re-add (bigger) block_db? I would appreciate any suggestions or tips. Best regards, Mikael

5 months, 1 week

1
1
0 0

Pool Migration / Import/Export

by duluxoz

Hi All, I find myself in the position of having to change the k/m values on an ec-pool. I've discovered that I simply can't change the ec-profile, but have to create a "new ec-profile" and a "new ec-pool" using the new values, then migrate the "old ec-pool" to the new (see: https://ceph.io/en/news/blog/2015/ceph-pool-migration/ => Using Rados Export/Import). (Yes, a PITA, but it has to be done, and better doing it now when the data-size isn't that big - yet!) My only concern is that the "old ec-pool" is a `--data-pool` and part of an rbd image (ie the image was created with `rbd create --size 2T ec_rbd_pool/disk01 --data-pool ec_pool --image-feature journaling`). So my Q is: What are the "gotchas" (if any) with this, or is it simpler to backup the data (already done), destroy and recreate the rbd image from scratch, and restore the data to the re-created pool(s)/image? FTR: space isn't an issue (I've got plenty to play with) but I'm looking for the quickest way to do this as its a live (but little used) system, so I can take it down for a few hours (or more), but don't particular *want* to have it off-line for longer than absolutely necessary. Thanks in advance for any advice/help/warnings/etc :-) Cheers Dulux-Oz

5 months, 1 week

2
1
0 0

Re: increasing number of (deep) scrubs

by Frank Schilder

Hi Dan, thanks for your answer. I don't have a problem with increasing osd_max_scrubs (=1 at the moment) as such. I would simply prefer a somewhat finer grained way of controlling scrubbing than just doubling or tripling it right away. Some more info. These 2 pools are data pools for a large FS. Unfortunately, we have a large percentage of small files, which is a pain for recovery and seemingly also for deep scrubbing. Our OSDs are about 25% used and I had to increase the warning interval already to 2 weeks. With all the warning grace parameters this means that we manage to deep scrub everything about every month. I need to plan for 75% utilisation and a 3 months period is a bit far on the risky side. Our data is to a large percentage cold data. Client reads will not do the check for us, we need to combat bit-rot pro-actively. The reasons I'm interested in parameters initiating more scrubs while also converting more scrubs into deep scrubs are, that 1) scrubs seem to complete very fast. I almost never catch a PG in state "scrubbing", I usually only see "deep scrubbing". 2) I suspect the low deep-scrub count is due to a low number of deep-scrubs scheduled and not due to conflicting per-OSD deep scrub reservations. With the OSD count we have and the distribution over 12 servers I would expect at least a peak of 50% OSDs being active in scrubbing instead of the 25% peak I'm seeing now. It ought to be possible to schedule more PGs for deep scrub than actually are. 3) Every OSD having only 1 deep scrub active seems to have no measurable impact on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD it would already help a lot. Once this is working, I can eventually increase osd_max_scrubs when the OSDs fill up. For now I would just like that (deep) scrub scheduling looks a bit harder and schedules more eligible PGs per time unit. If we can get deep scrubbing up to an average of 42PGs completing per hour with keeping osd_max_scrubs=1 to maintain current IO impact, we should be able to complete a deep scrub with 75% full OSDs in about 30 days. This is the current tail-time with 25% utilisation. I believe currently a deep scrub of a PG in these pools takes 2-3 hours. Its just a gut feeling from some repair and deep-scrub commands, I would need to check logs for more precise info. Increasing osd_max_scrubs would then be a further and not the only option to push for more deep scrubbing. My expectation would be that values of 2-3 are fine due to the increasingly higher percentage of cold data for which no interference with client IO will happen. Hope that makes sense and there is a way beyond bumping osd_max_scrubs to increase the number of scheduled and executed deep scrubs. Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Dan van der Ster <dvanders(a)gmail.com> Sent: 05 January 2023 15:36 To: Frank Schilder Cc: ceph-users(a)ceph.io Subject: Re: [ceph-users] increasing number of (deep) scrubs Hi Frank, What is your current osd_max_scrubs, and why don't you want to increase it? With 8+2, 8+3 pools each scrub is occupying the scrub slot on 10 or 11 OSDs, so at a minimum it could take 3-4x the amount of time to scrub the data than if those were replicated pools. If you want the scrub to complete in time, you need to increase the amount of scrub slots accordingly. On the other hand, IMHO the 1-week deadline for deep scrubs is often much too ambitious for large clusters -- increasing the scrub intervals is one solution, or I find it simpler to increase mon_warn_pg_not_scrubbed_ratio and mon_warn_pg_not_deep_scrubbed_ratio until you find a ratio that works for your cluster. Of course, all of this can impact detection of bit-rot, which anyway can be covered by client reads if most data is accessed periodically. But if the cluster is mostly idle or objects are generally not read, then it would be preferable to increase slots osd_max_scrubs. Cheers, Dan On Tue, Jan 3, 2023 at 2:30 AM Frank Schilder <frans(a)dtu.dk> wrote: > > Hi all, > > we are using 16T and 18T spinning drives as OSDs and I'm observing that they are not scrubbed as often as I would like. It looks like too few scrubs are scheduled for these large OSDs. My estimate is as follows: we have 852 spinning OSDs backing a 8+2 pool with 2024 and an 8+3 pool with 8192 PGs. On average I see something like 10PGs of pool 1 and 12 PGs of pool 2 (deep) scrubbing. This amounts to only 232 out of 852 OSDs scrubbing and seems to be due to a conservative rate of (deep) scrubs being scheduled. The PGs (dep) scrub fairly quickly. > > I would like to increase gently the number of scrubs scheduled for these drives and *not* the number of scrubs per OSD. I'm looking at parameters like: > > osd_scrub_backoff_ratio > osd_deep_scrub_randomize_ratio > > I'm wondering if lowering osd_scrub_backoff_ratio to 0.5 and, maybe, increasing osd_deep_scrub_randomize_ratio to 0.2 would have the desired effect? Are there other parameters to look at that allow gradual changes in the number of scrubs going on? > > Thanks a lot for your help! > ================= > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

5 months, 1 week

2
3
0 0

cephfs read hang after cluster stuck, but need attach the process to continue

by zxcs

Hi, experts, we are using cephfs with 16.2.* with multi active mds, and recently we see a strange thing, we have some c++ code about read file from cephfs. the client code just call very base read(), and when the cluster hit mds has slow request, and later the cluster back to normal. the read hang. then we need to gdb attach -p `pid of c++ process`, just attach, do nothing, the code then continue running. Our question is why this happens? and config could tune or we need change out c++ read code? Thanks xz

5 months, 1 week

1
0
0 0

2024

2023

2022

2021

2020

2019

ceph-users December 2023