Hi Sebastian,
You can find some more discussion and fixes for this type of fs
corruption here:
https://www.spinics.net/lists/ceph-users/msg76952.html
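In short, the approach discussed there is roughly the following (a
sketch only, please read the full thread before running anything; the
first-damage.py flags below are from memory, so double-check the
tool's --help and the CephFS disaster-recovery docs first):

    # 17.2.7 added a consistency check that makes the MDS abort when
    # it is about to write back a corrupt dentry; turn the abort off
    # while cleaning up (detection itself stays on):
    ceph config set mds mds_abort_on_newly_corrupt_dentry false

    # Re-run a repair scrub over the affected trees:
    ceph tell mds.cephfs:0 scrub start / recursive,repair,force

    # Remove any remaining corrupt dentries with the tool from the
    # Ceph source tree (src/tools/cephfs/first-damage.py), run against
    # the metadata pool while the fs is offline, per the thread;
    # <metadata-pool> is a placeholder for your metadata pool name:
    python3 first-damage.py --memo run.1 <metadata-pool>

    # Once the damage table is clean, restore the default:
    ceph config set mds mds_abort_on_newly_corrupt_dentry true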
--
Dan van der Ster
CTO
Clyso GmbH
p: +49 89 215252722 | a: Vancouver, Canada
w: https://clyso.com | e: dan.vanderster(a)clyso.com
We are hiring:
https://www.clyso.com/jobs/
On Fri, Nov 24, 2023 at 5:48 AM Sebastian Knust
<sknust(a)physik.uni-bielefeld.de> wrote:
>
> Hi,
>
> After updating from 17.2.6 to 17.2.7 with cephadm, our cluster went
> into the MDS_DAMAGE state. We had some prior issues with faulty kernel
> clients not releasing capabilities, so the update might just be a
> coincidence.
>
> `ceph tell mds.cephfs:0 damage ls` lists 56 affected files, all with
> details of this general form:
>
> {
>     "damage_type": "dentry",
>     "id": 123456,
>     "ino": 1234567890,
>     "frag": "*",
>     "dname": "some-filename.ext",
>     "snap_id": "head",
>     "path": "/full/path/to/file"
> }
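>
> Since `damage ls` returns a JSON array, the affected paths can be
> pulled out in one go, e.g. with jq (assuming it is installed):
>
>     ceph tell mds.cephfs:0 damage ls | jq -r '.[].path'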
>
> The behaviour when accessing file information in the (kernel-mounted)
> filesystem is somewhat inconsistent. Generally, the first `stat` call
> results in "Input/output error", while the next call returns all `stat`
> data as expected for an undamaged file. Once the stat call succeeds,
> the file can be read with `cat`, and its content is complete and
> correct (verified against backup).
>
> Scrubbing the affected subdirectories with `ceph tell mds.cephfs:0 scrub
> start /path/to/dir/ recursive,repair,force` does not fix the issue.
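>
> For completeness: individual entries can also be dropped from the
> damage table with
>
>     ceph tell mds.cephfs:0 damage rm <id>
>
> but as far as I understand that only clears the damage table entry
> and does not repair anything, so it is no real fix either.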
>
> Trying to delete the file results in an "Input/output error". If the
> preceding stat calls succeeded, the deletion also crashes the active
> MDS, leaving these messages in the system journal:
> > Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: mds.0.cache.den(0x10012271195 DisplaySettings.json) newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012271197 state=1073741824 | inodepin=1 0x56413e1e2780]
> > Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012271197 state=1073741824 | inodepin=1 0x56413e1e2780]
> > Nov 24 14:21:15 iceph-18.servernet ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: 2023-11-24T13:21:15.654+0000 7f3fdcde0700 -1 mds.0.cache.den(0x10012271195 DisplaySettings.json) newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x1001>
> > Nov 24 14:21:15 iceph-18.servernet ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: 2023-11-24T13:21:15.654+0000 7f3fdcde0700 -1 log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012>
> > Nov 24 14:21:15 iceph-18.servernet ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc: In function 'void MDSRank::abort(std::string_view)' thread 7f3fdcde0700 time 2023-11-24T13:21:15.655088+0000
> > Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc: In function 'void MDSRank::abort(std::string_view)' thread 7f3fdcde0700 time 2023-11-24T13:21:15.655088+0000
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc: 937: ceph_abort_msg("abort() called")
> >
> > ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> > 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd7) [0x7f3fe5a1cb03]
> > 2: (MDSRank::abort(std::basic_string_view<char, std::char_traits<char> >)+0x7d) [0x5640f2e6fa2d]
> > 3: (CDentry::check_corruption(bool)+0x740) [0x5640f30e4820]
> > 4: (EMetaBlob::add_primary_dentry(EMetaBlob::dirlump&, CDentry*, CInode*, unsigned char)+0x47) [0x5640f2f41877]
> > 5: (EOpen::add_clean_inode(CInode*)+0x121) [0x5640f2f49fc1]
> > 6: (Locker::adjust_cap_wanted(Capability*, int, int)+0x426) [0x5640f305e036]
> > 7: (Locker::process_request_cap_release(boost::intrusive_ptr<MDRequestImpl>&, client_t, ceph_mds_request_release const&, std::basic_string_view<char, std::char_traits<char> >)+0x599) [0x5640f307f7e9]
> > 8: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0xc06) [0x5640f2f2a7c6]
> > 9: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x13c) [0x5640f2f2ef6c]
> > 10: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5db) [0x5640f2e7727b]
> > 11: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x5640f2e778bc]
> > 12: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x5640f2e60c2f]
> > 13: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f3fe5c97ed8]
> > 14: (DispatchQueue::entry()+0x50f) [0x7f3fe5c9531f]
> > 15: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f3fe5d5f381]
> > 16: /lib64/libpthread.so.0(+0x81ca) [0x7f3fe4a0b1ca]
> > 17: clone()
>
> Deleting the file with cephfs-shell also gives an Input/output error (5).
>
> Does anyone have an idea of how to proceed here? I am perfectly fine
> with losing the affected files; they can all easily be restored from
> backup.
>
> Cheers
> Sebastian
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io