January 2023 - ceph-users

rbd-mirror | ceph quincy Not able to find rbd_mirror_journal_max_fetch_bytes config in rbd mirror

by ankit raikwar

Hello All, In the ceph quincy Not able to find rbd_mirror_journal_max_fetch_bytes config in rbd mirror i configured the ceph cluster almost 400 tb and enable the rbd-mirror in the starting stage i'm able to achive the almost 9 GB speed , but after the rebalane completed of the all the images . rbd-mirror speed got automaticily reduce to between 4 to 5 mbps. in my primary cluster we are continuelsy writing the 50 to 400 mbps data but replication speed only we get the 4 to 5 mbps. also we have the 10 Gbps replication network bandwidth. Note::- I also try to find the option rbd_mirror_journal_max_fetch_bytes but i'm not able to find the this option in the configuration. also when i try to set from the command line it's showing error like command: ceph config set client.rbd rbd_mirror_journal_max_fetch_bytes 33554432 error: Error EINVAL: unrecognized config option 'rbd_mirror_journal_max_fetch_bytes' Cluster version: ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable) Please suggest any alternative way to configurre this option or how i can improve the replication n/w speed. 0 0 Reply

1 year, 3 months

1
0
0 0

nfs RGW export makes nfs-gnaesha server in crash loop

by Ben Gao

Hi, This is running Quincy 17.2.5 deployed by rook on k8s. RGW nfs export will crash Ganesha server pod. CephFS export works just fine. Here are steps of it: 1, create export: bash-4.4$ ceph nfs export create rgw --cluster-id nfs4rgw --pseudo-path /bucketexport --bucket testbk { "bind": "/bucketexport", "path": "testbk", "cluster": "nfs4rgw", "mode": "RW", "squash": "none" } 2, check pods status afterwards: rook-ceph-nfs-nfs1-a-679fdb795-82tcx 2/2 Running 0 4h3m rook-ceph-nfs-nfs4rgw-a-5c594d67dc-nlr42 1/2 Error 2 4h6m 3, check failing pod’s logs: 11/01/2023 08:11:53 : epoch 63be6f49 : rook-ceph-nfs-nfs4rgw-a-5c594d67dc-nlr42 : nfs-ganesha-1[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90 11/01/2023 08:11:54 : epoch 63be6f49 : rook-ceph-nfs-nfs4rgw-a-5c594d67dc-nlr42 : nfs-ganesha-1[main] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend 11/01/2023 08:11:54 : epoch 63be6f49 : rook-ceph-nfs-nfs4rgw-a-5c594d67dc-nlr42 : nfs-ganesha-1[main] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0) 11/01/2023 08:11:57 : epoch 63be6f49 : rook-ceph-nfs-nfs4rgw-a-5c594d67dc-nlr42 : nfs-ganesha-1[main] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE 11/01/2023 08:11:57 : epoch 63be6f49 : rook-ceph-nfs-nfs4rgw-a-5c594d67dc-nlr42 : nfs-ganesha-1[main] export_defaults_commit :CONFIG :INFO :Export Defaults now (options=03303002/00080000 , , , , , , , , expire= 0) 2023-01-11T08:11:57.853+0000 7f59dac7c200 -1 auth: unable to find a keyring on /var/lib/ceph/radosgw/ceph-admin/keyring: (2) No such file or directory 2023-01-11T08:11:57.853+0000 7f59dac7c200 -1 AuthRegistry(0x56476817a480) no keyring found at /var/lib/ceph/radosgw/ceph-admin/keyring, disabling cephx 2023-01-11T08:11:57.855+0000 7f59dac7c200 -1 auth: unable to find a keyring on /var/lib/ceph/radosgw/ceph-admin/keyring: (2) No such file or directory 2023-01-11T08:11:57.855+0000 7f59dac7c200 -1 AuthRegistry(0x7ffe4d092c90) no keyring found at /var/lib/ceph/radosgw/ceph-admin/keyring, disabling cephx 2023-01-11T08:11:57.856+0000 7f5987537700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1] 2023-01-11T08:11:57.856+0000 7f5986535700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1] 2023-01-11T08:12:00.861+0000 7f5986d36700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1] 2023-01-11T08:12:00.861+0000 7f59dac7c200 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication failed to fetch mon config (--no-mon-config to skip) 4, delete the export: ceph nfs export delete nfs4rgw /bucketexport Ganesha servers go back normal: rook-ceph-nfs-nfs1-a-679fdb795-82tcx 2/2 Running 0 4h30m rook-ceph-nfs-nfs4rgw-a-5c594d67dc-nlr42 2/2 Running 10 4h33m Any ideas to make it work? Thanks Ben

1 year, 3 months

1
0
0 0

Removing OSDs - draining but never completes.

by Wyll Ingersoll

Running ceph-pacific 16.2.9 using ceph orchestrator. We made a mistake adding a disk to the cluster and immediately issued a command to remove it using "ceph orch osd rm ### --replace --force". This OSD had no data on it at the time and was removed after just a few minutes. "ceph orch osd rm status" shows that it is still "draining". ceph osd df shows that the osd being removed has -1 PGs. So - why is the simple act of removal taking so long and can we abort it and manually remove that osd somehow? Note: the cluster is also doing a rebalance while this is going on, but the osd being removed never had any data and should not be affected by the rebalance. thanks!

1 year, 3 months

2
1
0 0

[rgw] Upload object with bad performance after the cluster running few months

by can zhu

ceph version is: 16.2.10 Use “rclone” tools to upload the big object: rclone copy ./zh-cn_windows_10_business_editions_version_21h1_updated_jul_2021_x64_dvd_f49026f5.iso smd:test --progres Transferred: 670 MiB / 5.293 GiB, 12%, 15.195 MiB/s, ETA 5m12s Transferred: 0 / 1, 0% Elapsed time: 46.0s Transferring: * zh-cn_windows_10_busin…1_x64_dvd_f49026f5.iso: 12% /5.293Gi, 15.195Mi/s, 5m12s The speed is: 15MB/s But at the beginning the speed can reach 100MB/s

1 year, 3 months

1
0
0 0

Move bucket between realms

by mahnoosh shahidi

Hi all, Is there any way in rgw to move a bucket from one realm to another one in the same cluster? Best regards, Mahnoosh

1 year, 3 months

1
0
0 0

adding OSD to orchestrated system, ignoring osd service spec.

by Wyll Ingersoll

When adding a new OSD to a ceph orchestrated system (16.2.9) on a storage node that has a specification profile that dictates which devices to use as the db_devices (SSDs), the newly added OSDs seem to be ignoring the db_devices (there are several available) and putting the data and db/wal on the same device. We installed the new disk (HDD) and then ran "ceph orch device zap /dev/xyz --force" to initialize the addition process. The OSDs that were added originally on that node were layed out correctly, but the new ones seem to be ignoring the OSD service spec. How can we make sure the new devices added are layed out correctly? thanks, Wyllys

1 year, 3 months

2
3
0 0

Re: Serious cluster issue - Incomplete PGs

by Deep Dish

Eugen, I never insinuated my circumstance is resultant from buggy software, and acknowledged operational missteps. Let's please leave that there. Ceph remains a technology I like and will continue to use. Our operational understanding has evolved greatly as a result of current circumstances. Removed OSDs are gone and not recoverable. (ie. lockbox keys gone, VG groups removed).. My objective of this post is to validate understanding of an alternate recovery (of available, not complete) data scenario: 1. Cluster has blocked IO due to Incomplete pages. Therefore any online operations on affected pools / images / filesystems are blocked. # ceph -s cluster: id: health: HEALTH_WARN 1 hosts fail cephadm check cephadm background work is paused Reduced data availability: 28 pgs inactive, 28 pgs incomplete 5 pgs not deep-scrubbed in time 3 slow ops, oldest one blocked for 347227 sec, daemons [osd.25,osd.50,osd.51] have slow ops. services: mon: 5 daemons, quorum (age 8h) mgr: (active, since 27m) mds: 2/2 daemons up, 3 standby osd: 70 osds: 70 up (since 3d), 45 in (since 3d); 24 remapped pgs data: volumes: 2/2 healthy pools: 9 pools, 1056 pgs objects: 10.64M objects, 40 TiB usage: 61 TiB used, 266 TiB / 327 TiB avail pgs: 2.652% pgs not active 1027 active+clean 24 remapped+incomplete 4 incomplete 1 active+clean+scrubbing+deep 2. Since pages are incomplete and supporting data lost, I found a documented process that will mark pages are complete and unblock IO for the cluster. I fully understand that pgs that have 0 objects will have no impact on data integrity, however those pgs containing objects will result in complete data loss for only those affected pgs. Link: https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-… Based on above referenced link, commands to this effect would mark incomplete PGs as complete (examples): ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op mark-complete --pgid 2.50 ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op mark-complete --pgid 2.57 3. My cluster, at present, has a total 28 incomplete pgs. Of these, 7 reference approximately 644 GB of now lost / irrecoverable data, the rest reference 0 objects and 0 bytes (empty). The cluster holds a total of 61.3T of data, leaving ~60.8T available for recovery. 4. If I were to mark ALL incomplete pgs as complete, the cluster would be operable - meaning I can interact with pool images and surviving files on cephfs pools. 5. Although data loss may affect the contents RBD images, these images would be able to be mapped ( rbd map ) and available for alternate recovery methods (e.g. DD contents to a seperate volume for use at a recovery facility, or attempt to read via available recovery tools that interact with the filesystem on those block devices (XFS in this case). Lost data would be equivalent of blocks of 0's in the overall image data stream where data was lost. 6. The above could be successful in extracting available / recoverable. 7. Upon marking the 2 incomplete pages affecting CephFS volume as complete, CephFS would be accessible minus affected files. How would these files be represented? (Corrupted or simply 0 bytes)? Thank you. Date: Tue, 10 Jan 2023 08:15:31 +0000 From: Eugen Block <eblock(a)nde.ag> Subject: [ceph-users] Re: Serious cluster issue - Incomplete PGs To: ceph-users(a)ceph.io Message-ID: <20230110081531.Horde.NfeIXEvXkBYy6JFyMgYbpX2(a)webmail.nde.ag> Content-Type: text/plain; charset=utf-8; format=flowed; DelSp=Yes Hi, > Backups will be challenging. I honestly didn't anticipate this kind of > failure with ceph to be possible, we've been using it for several years now > and were encouraged by orchestrator and performance improvements in the 17 > code branch. that's exactly what a backup is for, to be prepared for the unexpected. Besides the fact that ceph didn't actually fail (you removed too many/too early OSDs) you can't expect a bug free software, no matter how long it has been running successfully. > - Identifying the pools / images / files that are affected by incomplete > pages; The PGs start with a number which reflects the pools in your cluster, check the output of 'ceph osd pool ls detail'. There's no easy way to tell which images or files are affected, you can query each OSD and list the PG's objects, but that doesn't work for missing OSDs/PGs, of course. I'm not sure how promising it is, but maybe try a for loop over all rbd images and just execute a 'rbd info <pool>/<image>' for each image, maybe it will tell you which image is incomplete. > - Extracting and reconstructing data for RBD images (these images are XFS > formatted filesystems); > - Extracting and reconstructing data for CephFS Files not affected by > incomplete PGs. If you kept the disks you removed too early (and didn't wipe them) there may be a chance to export the PG chunks with ceph-objectstore-tool [2]. I haven't used that myself in a production cluster so be careful and get familiar with the commands in a test environment first. If you already wiped the temporary OSDs I don't see a chance to recover from this. Regards, Eugen [2] https://docs.ceph.com/en/pacific/man/8/ceph-objectstore-tool/ Zitat von Deep Dish <deeepdish(a)gmail.com>: > Thanks for the insight Eugen. > > Here's what basically happened: > > - Upgrade from Nautilus to Quincy via migration to new cluster on temp > hardware; > - Data from Nautilus migrated successfully to older / lab-type equipment > running Quincy; > - Nautilus Hardware rebuilt for Quincy, data migrated back; > - As data was migrating we set the older notes to maintenance mode and > started to drain them; > - After several days many OSDs were showing as spinning in "deleting" > status on portal and we were marked OUT; > - This point we made the incorrect assumption those OSDs were no longer > required and proceeded to remove those nodes / OSDs. > > I understand Incomplete pages are basically lost. And it's likely a > lengthy task to attempt to salvage data. > > Backups will be challenging. I honestly didn't anticipate this kind of > failure with ceph to be possible, we've been using it for several years now > and were encouraged by orchestrator and performance improvements in the 17 > code branch. > > The fact is of the Incomplete pages that have object counts > 0, there's > about 644 GB of data that's tied up in this mess. There are other > incomplete PGs with object = 0 which I understand can be manually marked as > complete. The cluster has a data usage of 61 TiB. Of this I can > categorize about 14TB of critical data, 40 TB of data that is of medium / > high importance. > > There's 14TB in RBD images that would be critical on an EC pool there are > other images, however of lower importance at this point; > > There's also about a 20TB CephFS file system of lower data importance as > well. > > Question - Can you kindly point me to procedures for: > > - Identifying the pools / images / files that are affected by incomplete > pages; > - Extracting and reconstructing data for RBD images (these images are XFS > formatted filesystems); > - Extracting and reconstructing data for CephFS Files not affected by > incomplete PGs. > > Much appreciated. > > > ------------------------------ > > Date: Mon, 09 Jan 2023 10:12:49 +0000 > From: Eugen Block <eblock(a)nde.ag> > Subject: [ceph-users] Re: Serious cluster issue - Incomplete PGs > To: ceph-users(a)ceph.io > Message-ID: > <20230109101249.Horde.hAHCWQijFMYLNdX8a2YQDVV(a)webmail.nde.ag> > Content-Type: text/plain; charset=utf-8; format=flowed; DelSp=Yes > > Hi, > > can you clarify what exactly you did to get into this situation? What > about the undersized PGs, any chance to bring those OSDs back online? > Regarding the incomplete PGs I'm not sure there's much you can do if > the OSDs are lost. To me it reads like you may have > destroyed/recreated more OSDs than you should have, just recreating > OSDs with the same IDs is not sufficient if you destroyed too many > chunks. Each OSD only contains a chunk of the PG due to the erasure > coding. I'm afraid those objects are lost and you would have to > restore from backup. To get the cluster into a healthy state again > there a couple of threads, e. g. [1], but recovering the lost chunks > from ceph will probably not work. > > Regards, > Eugen > > [1] https://www.mail-archive.com/ceph-users@ceph.io/msg14757.html > > Zitat von Deep Dish <deeepdish(a)gmail.com>: > >> Hello. I really screwed up my ceph cluster. Hoping to get data off it >> so I can rebuild it. >> >> In summary, too many changes too quickly caused the cluster to develop >> incomplete pgs. Some PGS were reporting that OSDs were to be probes. >> I've created those OSD IDs (empty), however this wouldn't clear >> incompletes. Incompletes are part of EC pools. Running 17.2.5. >> >> This is the overall state: >> >> cluster: >> >> id: 49057622-69fc-11ed-b46e-d5acdedaae33 >> >> health: HEALTH_WARN >> >> Failed to apply 1 service(s): > osd.dashboard-admin-1669078094056 >> >> 1 hosts fail cephadm check >> >> cephadm background work is paused >> >> Reduced data availability: 28 pgs inactive, 28 pgs incomplete >> >> Degraded data redundancy: 55 pgs undersized >> >> 2 slow ops, oldest one blocked for 4449 sec, daemons >> [osd.25,osd.50,osd.51] have slow ops. >> >> >> >> These are PGs that are incomplete that HAVE DATA (Objects > 0) [ via ceph >> pg ls incomplete ]: >> >> 2.35 23199 0 0 0 95980273664 0 >> 0 2477 incomplete 10s 2104'46277 28260:686871 >> [44,4,37,3,40,32]p44 [44,4,37,3,40,32]p44 >> 2023-01-03T03:54:47.821280+0000 2022-12-29T18:53:09.287203+0000 >> 14 queued for deep scrub >> 2.53 22821 0 0 0 94401175552 0 >> 0 2745 remapped+incomplete 10s 2104'45845 28260:565267 >> [60,48,52,65,67,7]p60 [60]p60 >> 2023-01-03T10:18:13.388383+0000 2023-01-03T10:18:13.388383+0000 >> 408 queued for scrub >> 2.9f 22858 0 0 0 94555983872 0 >> 0 2736 remapped+incomplete 10s 2104'45636 28260:759872 >> [56,59,3,57,5,32]p56 [56]p56 >> 2023-01-03T10:55:49.848693+0000 2023-01-03T10:55:49.848693+0000 >> 376 queued for scrub >> 2.be 22870 0 0 0 94429110272 0 >> 0 2661 remapped+incomplete 10s 2104'45561 28260:813759 >> [41,31,37,9,7,69]p41 [41]p41 >> 2023-01-03T14:02:15.790077+0000 2023-01-03T14:02:15.790077+0000 >> 360 queued for scrub >> 2.e4 22953 0 0 0 94912278528 0 >> 0 2648 remapped+incomplete 20m 2104'46048 28259:732896 >> [37,46,33,4,48,49]p37 [37]p37 >> 2023-01-02T18:38:46.268723+0000 2022-12-29T18:05:47.431468+0000 >> 18 queued for deep scrub >> 17.78 20169 0 0 0 84517834400 0 >> 0 2198 remapped+incomplete 10s 3735'53405 28260:1243673 >> [4,37,2,36,66,0]p4 [41]p41 >> 2023-01-03T14:21:41.563424+0000 2023-01-03T14:21:41.563424+0000 >> 348 queued for scrub >> 17.d8 20328 0 0 0 85196053130 0 >> 0 1852 remapped+incomplete 10s 3735'54458 28260:1309564 >> [38,65,61,37,58,39]p38 [53]p53 >> 2023-01-02T18:32:35.371071+0000 2022-12-28T19:08:29.492244+0000 >> 21 queued for deep scrub >> >> At present I'm unable to reliably access my data due to incomplete pages >> above. I'll post whatever outputs requested (won't post now as it can be >> rather verbose). Is there hope?

1 year, 3 months

2
1
0 0

Move bucket between realms

by mahnoosh shahidi

Hi all, Is there any way in rgw to move a bucket from one realm to another one in the same cluster? Best regards, Mahnoosh

1 year, 3 months

1
0
0 0

Move bucket between realms

by mahnoosh shahidi

Hi all, Is there any way in rgw to move a bucket from one realm to another one in the same cluster? Best regards, Mahnoosh

1 year, 3 months

1
0
0 0

OSD crash with "FAILED ceph_assert(v.length() == p->shard_info->bytes)"

by Yu Changyuan

One of OSD(other OSDs are fine) was crashed, and try "ceph-bluestore-tool fsck" also crashed with same error. Besides destroy this OSD and re-create, are there any other steps I can do to restore the OSD? Below is part of message: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc: 3228: FAILED ceph_assert(v.length() == p->shard_info->bytes) ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x562c19f7f73c] 2: /usr/bin/ceph-osd(+0x57f956) [0x562c19f7f956] 3: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x7cf) [0x562c1a56d1ef] 4: (BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0x5dd) [0x562c1a5c5ebd] 5: (BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0xd1) [0x562c1a5c70e1] 6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x2077) [0x562c1a5cb237] 7: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x562c1a5e66d6] 8: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x58) [0x562c1a22a878] 9: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xeb0) [0x562c1a41cff0] 10: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x267) [0x562c1a42d357] 11: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x562c1a25dd52] 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x562c1a20168e] 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x562c1a088fc9] 14: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x562c1a2e7e78] 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x562c1a0a64c8] 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x562c1a7232a4] 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x562c1a726184] 18: /lib64/libpthread.so.0(+0x81ca) [0x7f2d2a4081ca] 19: clone() -10> 2023-01-10T09:28:02.143+0000 7f2cff3de700 -1 *** Caught signal (Aborted) ** And this is "meta" file of crash log: { "crash_id": "2023-01-10T09:28:02.137396Z_a504670d-32c3-46ee-8398-84389c9c2d95", "timestamp": "2023-01-10T09:28:02.137396Z", "process_name": "ceph-osd", "entity_name": "osd.3", "ceph_version": "16.2.10", "utsname_hostname": "dskm1-r0", "utsname_sysname": "Linux", "utsname_release": "5.18.19", "utsname_version": "#1-NixOS SMP PREEMPT_DYNAMIC Sun Aug 21 13:18:56 UTC 2022", "utsname_machine": "x86_64", "os_name": "CentOS Stream", "os_id": "centos", "os_version_id": "8", "os_version": "8", "assert_condition": "v.length() == p->shard_info->bytes", "assert_func": "void BlueStore::ExtentMap::fault_range(KeyValueDB*, uint32_t, uint32_t)", "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc", "assert_line": 3228, "assert_thread_name": "tp_osd_tp", "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::ExtentMap::fault_range(KeyValueDB*, uint32_t, uint32_t)' thread 7f2cff3de700 time 2023-01-10T09:28:02.016735+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc: 3228: FAILED ceph_assert(v.length() == p->shard_info->bytes)\n", "backtrace": [ "/lib64/libpthread.so.0(+0x12cf0) [0x7f2d2a412cf0]", "gsignal()", "abort()", "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x562c19f7f78d]", "/usr/bin/ceph-osd(+0x57f956) [0x562c19f7f956]", "(BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x7cf) [0x562c1a56d1ef]", "(BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0x5dd) [0x562c1a5c5ebd]", "(BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0xd1) [0x562c1a5c70e1]", "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x2077) [0x562c1a5cb237]", "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x562c1a5e66d6]", "(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x58) [0x562c1a22a878]", "(ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xeb0) [0x562c1a41cff0]", "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x267) [0x562c1a42d357]", "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x562c1a25dd52]", "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x562c1a20168e]", "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x562c1a088fc9]", "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x562c1a2e7e78]", "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x562c1a0a64c8]", "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x562c1a7232a4]", "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x562c1a726184]", "/lib64/libpthread.so.0(+0x81ca) [0x7f2d2a4081ca]", "clone()" ] } I also update full crash log to github gist: https://gist.github.com/yuchangyuan/2016f259175940f64e2eed528d633794 -- Best wishes ~

1 year, 3 months

1
0
0 0

2024

2023

2022

2021

2020

2019

ceph-users January 2023