CephFS error state with one bad file - ceph-users

2 Jan 2024

Hi again, hopefully with problems for the last time.

We had an MDS crash earlier, with the MDS staying in a failed state, and we used a command
to reset the filesystem (this was wrong, I know now; thanks to Patrick Donnelly for pointing
this out). I then ran a full scrub on the filesystem, which reported two damaged files. One
of those got repaired, but the following file keeps giving errors and can't be removed.
What can I do now? Some information follows below.

# ceph tell mds.atlassian-prod:0 damage ls
[
    {
        "damage_type": "backtrace",
        "id": 2244444901,
        "ino": 1099534008829,
        "path": "/app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01"
    }
]
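
For reference, a minimal sketch of how the "ino" above maps to a RADOS object name. As far as I understand it, CephFS names a file's data objects <inode in hex>.<chunk index>, and the backtrace is kept in the "parent" xattr of the first object. The helper name backtrace_object_name and the pool name cephfs_data are my own placeholders, not something taken from this cluster.

```shell
# Object that should hold the file's backtrace: <inode in hex>.00000000
# (helper name is hypothetical, chosen for this sketch)
backtrace_object_name() {
    printf '%x.00000000\n' "$1"
}

# ino taken from the damage entry above
backtrace_object_name 1099534008829   # prints 100015581fd.00000000

# Possible follow-up on a node with admin access. "cephfs_data" is an
# ASSUMED pool name; substitute the filesystem's real data pool:
#   rados -p cephfs_data getxattr "$(backtrace_object_name 1099534008829)" parent \
#     | ceph-dencoder type inode_backtrace_t import - decode dump_json
```

If the getxattr call fails or the decoded path does not match the "path" in the damage entry, that would at least narrow down what the scrub is complaining about.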

Trying to repair the error (online research shows this should work for a backtrace damage
type)
----------
# ceph tell mds.atlassian-prod:0 scrub start /app1/shared/data/repositories/11271 recursive,repair,force
{
    "return_code": 0,
    "scrub_tag": "d10ead42-5280-4224-971e-4f3022e79278",
    "mode": "asynchronous"
}
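
Since the scrub runs in "asynchronous" mode, a return_code of 0 only says the scrub was accepted, not that the repair worked. A rough sketch of how the result could be rechecked afterwards follows. The function names damage_ids and recheck_damage are made up for this example, and the sed pattern assumes the pretty-printed JSON layout shown in this mail.

```shell
MDS="mds.atlassian-prod:0"

# Pull the numeric damage ids out of pretty-printed `damage ls` JSON on stdin.
damage_ids() {
    sed -n 's/^[[:space:]]*"id":[[:space:]]*\([0-9][0-9]*\),\{0,1\}[[:space:]]*$/\1/p'
}

# Defined but not called here, because it needs a live cluster:
# show the scrub state, then list whatever ids remain in the damage table.
recheck_damage() {
    ceph tell "$MDS" scrub status
    ceph tell "$MDS" damage ls | damage_ids
}
```

An id that is still listed after the scrub has gone idle means the MDS still considers the entry damaged. "ceph tell $MDS damage rm <id>" only removes the entry from the damage table; whether that is appropriate here is exactly what I am unsure about.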

Cluster logs after this
----------
1/2/24 9:37:05 AM
[INF]
scrub summary: idle

1/2/24 9:37:02 AM
[INF]
scrub summary: idle+waiting paths [/app1/shared/data/repositories/11271]

1/2/24 9:37:01 AM
[INF]
scrub summary: active paths [/app1/shared/data/repositories/11271]

1/2/24 9:37:01 AM
[INF]
scrub summary: idle+waiting paths [/app1/shared/data/repositories/11271]

1/2/24 9:37:01 AM
[INF]
scrub queued for path: /app1/shared/data/repositories/11271
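
The log entries above are listed newest first. A small sketch that flips such a paste into chronological order, one entry per line, in case that is easier to read. The function name scrub_timeline is made up, and it assumes three-line entries separated by blank lines, as pasted here.

```shell
# Read blank-line-separated log entries (timestamp, level, message) from
# stdin and print them oldest first, one entry per line.
scrub_timeline() {
    awk 'BEGIN { RS = ""; FS = "\n" }
         { rec[NR] = $1 "  " $2 "  " $3 }
         END { for (i = NR; i >= 1; i--) print rec[i] }'
}
```

Read that way, the scrub was queued at 9:37:01, became active, and was back to idle by 9:37:05.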

But the error doesn't disappear, and I still can't remove the file.

On the client, trying to remove the file (we have a backup)
----------
$ rm -f /mnt/shared_disk-app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01
rm: cannot remove '/mnt/shared_disk-app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01': Input/output error

Best regards,
Sake