Hi Patrick,
Thanks for the reply.
On Fri, 2020-09-04 at 10:25 -0700, Patrick Donnelly wrote:
> > We then started using the cephfs (we keep VM images on the cephfs).
> > The MDS were showing an error. I restarted the MDS but they didn't
> > come back. We then followed the instructions here:
> > https://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disas…
> > up to truncating the journal. The MDS started again. However, as
> > soon as we started writing to the cephfs the MDS crashed. A scrub
> > of the cephfs revealed backtrace damage.
> I'm confused why you started the disaster recovery procedure when the
> procedure you followed should result in no damage to the PGs (and
> subsequently CephFS). It'd be helpful to know what this original
> error was.
So, when we re-enabled the cephfs I was monitoring the cluster with
ceph -w and noticed lots of errors going past, something like:
2020-09-03 09:30:24.711 7fd1d2932700 -1 log_channel(cluster) log [ERR]
: replayed ESubtreeMap at 8537805160800 subtree root 0x1 not in cache
2020-09-03 09:30:24.712 7fd1d2932700 0 mds.0.journal journal subtrees:
{0x1=[],0x100=[]}
2020-09-03 09:30:24.712 7fd1d2932700 0 mds.0.journal journal
ambig_subtrees:
2020-09-03 09:30:24.712 7fd1d2932700 -1 log_channel(cluster) log [ERR]
: replayed ESubtreeMap at 8537805208638 subtree root 0x1 not in cache
2020-09-03 09:30:24.712 7fd1d2932700 0 mds.0.journal journal subtrees:
{0x1=[],0x100=[]}
2020-09-03 09:30:24.712 7fd1d2932700 0 mds.0.journal journal
ambig_subtrees:
2020-09-03 09:30:24.714 7fd1d2932700 0 mds.0.journal EMetaBlob.replay
missing dir ino 0x1000003857d
2020-09-03 09:30:24.714 7fd1d2932700 -1 log_channel(cluster) log [ERR]
: failure replaying journal (EMetaBlob)
2020-09-03 09:30:24.714 7fd1d2932700 1 mds.store07 respawn!
I, perhaps foolishly, restarted the MDS daemons. Eventually the last
one didn't come back and the cephfs was in an error state.
I am not quite sure what we tried at this stage. I think we started the
cephfs scrub, which found some backtrace errors. However, again perhaps
foolishly, we started using the cephfs during the scrub and the MDS
crashed when the clients started writing to it. At this stage, should
we have waited for the scrub to complete before allowing the clients
to write to the filesystem?
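For next time, I assume the safe sequence would be to start the scrub and then poll its status until it reports no active scrubs before letting clients back on. A sketch only (mds.store07 is one of our MDS names; the commands are printed rather than executed here since they need a live cluster):

```shell
# Sketch: printed, not executed -- these need a live cluster.
# 'scrub start' kicks off an online scrub from the root; 'scrub status'
# reports whether it is still running. The idea is to wait until status
# shows no active scrubs before allowing client writes again.
for cmd in \
    "ceph tell mds.store07 scrub start / recursive" \
    "ceph tell mds.store07 scrub status"
do
    echo "$cmd"
done
```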
At that stage we started the recovery procedure.
> Backtrace damage is usually resolved with a scrub.
This is not clear from the documentation.
> > We have now followed the remaining steps of the disaster recovery
> > procedure and are waiting for the cephfs-data-scan scan_extents to
> > complete.
> > It would be really helpful if you could give an indication of how
> > long this process will take (we have ~40TB in our cephfs) and how
> > many workers to use.
> I don't have any recent data on how long it could take but you might
> try using at least 8 workers.
We are using 4 workers and the first stage hasn't completed yet. Is it
safe to interrupt and restart the procedure with more workers? Can the
workers be run on different machines?
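In case it is useful to anyone following along, my understanding of the worker split is that each invocation scans a disjoint slice, so they can run in parallel and on different hosts. A sketch (the pool name cephfs_data is a placeholder for our data pool, and the echo keeps this from touching a live cluster):

```shell
M=8                                  # total number of workers
for n in $(seq 0 $((M - 1))); do
    # Each worker scans a disjoint 1/M slice of the objects in the
    # data pool, so the invocations can run in parallel, even on
    # different machines with access to the cluster.
    echo cephfs-data-scan scan_extents --worker_n "$n" --worker_m "$M" cephfs_data
done
```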
> > The other missing bit of documentation is the cephfs scrubbing. Is
> > that something we should run routinely?
> CephFS scrubbing is usually done when something goes wrong or backing
> metadata needs to be updated for some reason as part of an upgrade
> (e.g. Mimic and snapshot formats). It's not considered necessary to
> do it on a routine basis. RADOS PG scrubbing is sufficient for
> ensuring that the backing data is routinely checked for
> correctness/redundancy.
OK, that's very helpful information. Does the cephfs need to be in a
particular state for the scrub to be run?
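In other words, would something like the following be the right shape once the MDS is active again? A sketch from my notes (mds.store07 and the repair flag are my assumptions, and the commands are printed rather than executed since they need a live cluster):

```shell
# Sketch: printed, not executed -- needs a live cluster with an active
# MDS. 'damage ls' lists recorded metadata damage; 'scrub start' with
# the repair flag asks the MDS to fix what it can (e.g. bad backtraces).
for cmd in \
    "ceph tell mds.store07 damage ls" \
    "ceph tell mds.store07 scrub start / recursive,repair"
do
    echo "$cmd"
done
```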
Perhaps our restarting of the cephfs uncovered an earlier error:
2020-08-31 12:54:45.976 7f10fe790700 0 mds.2.journal EMetaBlob.replay
missing dir ino 0x10002024c23
2020-08-31 12:54:45.979 7f10fe790700 -1 log_channel(cluster) log [ERR]
: failure replaying journal (EMetaBlob)
2020-08-31 12:54:45.979 7f10fe790700 1 mds.store06 respawn!
which we hadn't appreciated. Would a scrub have resolved that?
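If this happens again, I guess we should inspect the journal before restarting anything. A sketch of what I have in mind (assuming rank 0 of a filesystem named cephfs; printed rather than executed since it needs a live cluster):

```shell
# Sketch: printed, not executed. 'journal inspect' checks the journal
# for corruption without modifying anything; 'event recover_dentries'
# salvages what it can from the journal into the metadata store.
for cmd in \
    "cephfs-journal-tool --rank=cephfs:0 journal inspect" \
    "cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary"
do
    echo "$cmd"
done
```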
Thanks a lot for your replies.
Regards
magnus
The University of Edinburgh is a charitable body, registered in Scotland, with
registration number SC005336.