Thanks, guys
I can't add more RAM right now, nor do I have access to a server that
does, and I fear it wouldn't be enough anyway. I'll give the swap idea a
go and try to track down the thread you mentioned, Frank.
'cephfs-journal-tool journal inspect' tells me the journal is fine. I
was able to back it up cleanly; however, the apparent size of the file
reported by du is 53TB. Does that sound right to you? The actual size
is 3.7GB.
IIRC it's a sparse file. So yes that sounds normal.
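(A quick way to confirm that it's sparseness rather than a genuinely huge
export is to compare du's apparent size with its allocated size; the file
name below is just a placeholder for wherever the journal was exported to.)

    du -h --apparent-size journal-backup.bin   # logical size; huge for sparse files
    du -h journal-backup.bin                   # blocks actually allocated on disk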
'cephfs-journal-tool event get list' starts listing events but
eventually gets killed, as expected.
Does it go OOM too?
.. dan
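(If it's not clear whether the tool is being OOM-killed or dying for some
other reason, the kernel log will usually say; these are plain Linux
commands, nothing Ceph-specific:)

    dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
    # or on systemd hosts:
    journalctl -k | grep -i oom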
'cephfs-journal-tool event get summary'
Events by type:
OPEN: 314260
SUBTREEMAP: 1134
UPDATE: 547973
Errors: 0
Those numbers seem really high to me. For reference, this is an approx.
128TB (usable space) cluster with 5,050,000 objects in the metadata pool.
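(For what it's worth, the journal span behind those events can be read from
the journal header; write_pos minus expire_pos is roughly how much the MDS
has to replay. If I remember the tool correctly, it's just:)

    cephfs-journal-tool header get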
On Thu, Oct 22, 2020 at 5:23 PM Frank Schilder <frans(a)dtu.dk> wrote:
If you can't add RAM, you could try provisioning swap on a reasonably
fast drive. There is a thread from this year where someone had a similar
problem, the MDS running out of memory during replay. He was able to
quickly add sufficient swap and the MDS managed to come up. It took a
long time, but that might be faster than getting more RAM and will not
lose data. Your clients will not be able to do much, if anything, during
recovery though.
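(In case it saves someone a search, a swap file on a fast local drive can
be provisioned with the stock commands below; the size and path are only
examples, so adjust them to what the drive can actually hold:)

    fallocate -l 128G /mnt/fastssd/swapfile
    chmod 600 /mnt/fastssd/swapfile
    mkswap /mnt/fastssd/swapfile
    swapon /mnt/fastssd/swapfile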
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Dan van der Ster <dan(a)vanderster.com>
Sent: 22 October 2020 18:11:57
To: David C
Cc: ceph-devel; ceph-users
Subject: [ceph-users] Re: Urgent help needed please - MDS offline
I assume you aren't able to quickly double the RAM on this MDS? Or fail
over to a new MDS with more RAM?
Failing that, you shouldn't reset the journal without recovering
dentries; otherwise the cephfs_data objects won't be consistent with
the metadata.
The full procedure to be used is here:
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-…
Back up the journal, recover dentries, then reset the journal.
(The steps after that might not be needed.)
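(For reference, the journal-recovery part of that page boils down to
roughly the following; treat this as a sketch and double-check it against
the doc for your release before running anything destructive:)

    cephfs-journal-tool journal export backup.bin
    cephfs-journal-tool event recover_dentries summary
    cephfs-journal-tool journal reset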
That said -- maybe there is a more elegant procedure than using
cephfs-journal-tool. A cephfs dev might have better advice.
-- dan
On Thu, Oct 22, 2020 at 6:03 PM David C <dcsysengineer(a)gmail.com> wrote:
>
> I'm pretty sure it's replaying the same ops every time; the last
> "EMetaBlob.replay updated dir" before it dies always refers to the same
> directory. Interestingly, that particular dir shows up in the log
> thousands of times. The dir appears to be where a desktop app is
> collecting some analytics; I don't know if that's likely to be a red
> herring or the reason the journal appears to be so long. It's a dir I'd
> be quite happy to lose changes to or remove from the file system
> altogether.
>
> I'm loath to update during an outage, although I have seen people
> update the MDS code independently to get out of a scrape; I suspect
> you wouldn't recommend that.
>
> I feel like this leaves me with having to manipulate the journal in
> some way. Is there a nuclear option where I can choose to disregard
> the uncommitted events? I assume that would be a journal reset with
> cephfs-journal-tool, but I'm unclear on the impact of that. I'd expect
> to lose any metadata changes that were made since my cluster filled
> up, but are there further implications? I also wonder which is the
> riskier option, resetting the journal or attempting an update.
>
> I'm very grateful for your help so far
>
> Below is more of the debug 10 log with ops relating to the
> aforementioned dir (name changed but inode is accurate):
>
> 2020-10-22 16:44:00.488850 7f424659e700 10 mds.0.journal
> EMetaBlob.replay updated dir [dir 0x10009e1ec8d /path/to/desktop/app/
> [2,head] auth v=911968 cv=0/0 state=1610612736 f(v0 m2020-10-14
> 16:32:42.596652 1=0+1) n(v6164 rc2020-10-22 08:46:44.932805 b133337592
> 89216=89215+1)/n(v6164 rc2020-10-22 08:46:43.950805 b133337592
> 89214=89213+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x5654f8288300]
> 2020-10-22 16:44:00.488864 7f424659e700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [dentry
> #0x1/path/to/desktop/app/Upload [2,head] auth (dversion lock) v=911967
> inode=0x5654f8288a00 state=1610612736 | inodepin=1 dirty=1
> 0x5654f82794a0]
> 2020-10-22 16:44:00.488873 7f424659e700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [inode 0x10009e1ec8e [...2,head]
> /path/to/desktop/app/Upload/ auth v911967 f(v0 m2020-10-22
> 08:46:44.932805 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805
> b133337592 89216=89215+1) (iversion lock) | dirfrag=1 dirty=1
> 0x5654f8288a00]
> 2020-10-22 16:44:00.488884 7f424659e700 10 mds.0.journal
> EMetaBlob.replay dir 0x10009e1ec8e
> 2020-10-22 16:44:00.488885 7f424659e700 10 mds.0.journal
> EMetaBlob.replay updated dir [dir 0x10009e1ec8e
> /path/to/desktop/app/Upload/ [2,head] auth v=904150 cv=0/0
> state=1073741824 f(v0 m2020-10-22 08:46:44.932805 89215=89215+0) n(v2
> rc2020-10-22 08:46:44.932805 b133337592 89215=89215+0)
> hs=42926+1178,ss=0+0 dirty=2375 | child=1 0x5654f8289100]
> 2020-10-22 16:44:00.488898 7f424659e700 10 mds.0.journal
> EMetaBlob.replay added (full) [dentry
> #0x1/path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
> [2,head] auth NULL (dversion lock) v=904149 inode=0
> state=1610612800|bottomlru | dirty=1 0x56586df52f00]
> 2020-10-22 16:44:00.488911 7f424659e700 10 mds.0.journal
> EMetaBlob.replay added [inode 0x1000e4c0ff4 [2,head]
> /path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
> auth v904149 s=0 n(v0 1=1+0) (iversion lock) 0x566ce168ce00]
> 2020-10-22 16:44:00.488918 7f424659e700 10
> mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
> 2020-10-22 16:44:00.488920 7f424659e700 10 mds.0.journal
> EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
> /path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
> auth v904149 dirtyparent s=0 n(v0 1=1+0) (iversion lock) |
> dirtyparent=1 dirty=1 0x566ce168ce00]
> 2020-10-22 16:44:00.488924 7f424659e700 10 mds.0.journal
> EMetaBlob.replay inotable tablev 481253 <= table 481328
> 2020-10-22 16:44:00.488926 7f424659e700 10 mds.0.journal
> EMetaBlob.replay sessionmap v 240341131 <= table 240378576
> 2020-10-22 16:44:00.488927 7f424659e700 10 mds.0.journal
> EMetaBlob.replay request client.16250824:1416595263 trim_to 1416595263
> 2020-10-22 16:44:00.491462 7f424659e700 10 mds.0.log _replay
> 57437755528637~11764673 / 57441334490146 2020-10-22 09:08:56.198798:
> EOpen [metablob 0x10009e1ec8e, 1881 dirs], 16748 open files
> 2020-10-22 16:44:00.491471 7f424659e700 10 mds.0.journal EOpen.replay
> 2020-10-22 16:44:00.491472 7f424659e700 10 mds.0.journal
> EMetaBlob.replay 1881 dirlumps by unknown.0
> 2020-10-22 16:44:00.491475 7f424659e700 10 mds.0.journal
> EMetaBlob.replay dir 0x10009e1ec8e
> 2020-10-22 16:44:00.491478 7f424659e700 10 mds.0.journal
> EMetaBlob.replay updated dir [dir 0x10009e1ec8e
> /path/to/desktop/app/Upload/ [2,head] auth v=904150 cv=0/0
> state=1073741824 f(v0 m2020-10-22 08:46:44.932805 89215=89215+0) n(v2
> rc2020-10-22 08:46:44.932805 b133337592 89215=89215+0)
> hs=42927+1178,ss=0+0 dirty=2376 | child=1 0x5654f8289100]
> 2020-10-22 16:44:03.783487 7f424ada7700 5
> mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 14
> 2020-10-22 16:44:03.784082 7f424fd2c700 5
> mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 14
> rtt 0.00100003
> 2020-10-22 16:44:07.783586 7f424ada7700 5
> mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 15
> 2020-10-22 16:44:07.784097 7f424fd2c700 5
> mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 15
> rtt 0.00100003
> 2020-10-22 16:44:11.783678 7f424ada7700 5
> mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 16
> 2020-10-22 16:44:11.784223 7f424fd2c700 5
> mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 16
> rtt 0.00100003
> 2020-10-22 16:44:15.783788 7f424ada7700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2020-10-22 16:44:15.783814 7f424ada7700 0
> mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors
> (last acked 4.00013s ago); MDS internal heartbeat is not healthy!
>
> On Thu, Oct 22, 2020 at 3:30 PM Dan van der Ster <dan(a)vanderster.com> wrote:
> >
> > I wouldn't adjust it.
> > Do you have the impression that the mds is replaying the exact same
> > ops every time the mds is restarting? or is it progressing and
> > trimming the journal over time?
> >
> > The only other advice I have is that 12.2.10 is quite old, and might
> > miss some important replay/mem fixes.
> > I'm thinking of one particular memory bloat issue we suffered (it
> > manifested on a multi-mds cluster, so I am not sure if it is the root
> > cause here https://tracker.ceph.com/issues/45090 )
> > I don't know enough about the changelog diffs to suggest upgrading
> > right now in the middle of this outage.
> >
> >
> > -- dan
> >
> > > On Thu, Oct 22, 2020 at 4:14 PM David C <dcsysengineer(a)gmail.com> wrote:
> > >
> > > I've not touched the journal segments; the current value of
> > > mds_log_max_segments is 128. Would you recommend I increase (or
> > > decrease) that value? And do you think I should change
> > > mds_log_max_expiring to match that value?
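(For anyone reproducing this, the live values can be read from the MDS
admin socket; the daemon name below is this cluster's, so adjust as needed:)

    ceph daemon mds.hostnamecephssd01 config get mds_log_max_segments
    ceph daemon mds.hostnamecephssd01 config get mds_log_max_expiring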
> > >
> > > > On Thu, Oct 22, 2020 at 3:06 PM Dan van der Ster <dan(a)vanderster.com> wrote:
> > > >
> > > > You could decrease the mds_cache_memory_limit but I don't think
> > > > this will help here during replay.
> > > >
> > > > You can see a related tracker here:
> > > > https://tracker.ceph.com/issues/47582
> > > > This is possibly caused by replaying a very large journal. Did you
> > > > increase the journal segments?
> > > >
> > > > -- dan
> > > >
> > > > On Thu, Oct 22, 2020 at 3:35 PM David C <dcsysengineer(a)gmail.com> wrote:
> > > > >
> > > > > Dan, many thanks for the response.
> > > > >
> > > > > I was going down the route of looking at mds_beacon_grace, but
> > > > > I now realise that when I start my MDS, it's swallowing up
> > > > > memory rapidly and it looks like the oom-killer is eventually
> > > > > killing the mds. With debug upped to 10, I can see it's doing
> > > > > EMetaBlob.replays on various dirs in the filesystem and I can't
> > > > > see any obvious issues.
> > > > >
> > > > > This server has 128GB of RAM, with 111GB free with the MDS stopped.
> > > > >
> > > > > The mds_cache_memory_limit is currently set to 32GB
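(Should it come to lowering the cache limit, it can be changed on the fly
through the admin socket rather than via a restart; the value is in bytes
and the 16GiB below is only an example. As noted above, it may not help
during replay.)

    ceph daemon mds.hostnamecephssd01 config set mds_cache_memory_limit 17179869184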
> > > > >
> > > > > Could this be a case of simply reducing the mds cache until I
> > > > > can get this started again, or is there another setting I
> > > > > should be looking at? Is it safe to reduce the cache memory
> > > > > limit at this point?
> > > > >
> > > > > The standby is currently down and has been deliberately down
> > > > > for a while now.
> > > > >
> > > > > Log excerpt from debug 10 just before MDS is killed (path/to/dir
> > > > > refers to a real path in my FS)
> > > > >
> > > > > 2020-10-22 13:29:49.527372 7fc72d39f700 10
> > > > > mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
> > > > > 2020-10-22 13:29:49.527374 7fc72d39f700 10 mds.0.journal
> > > > > EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
> > > > > /path/to/dir/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp auth v904149
> > > > > dirtyparent s=0 n(v0 1=1+0) (iversion lock) | dirtyparent=1 dirty=1
> > > > > 0x561c23d66e00]
> > > > > 2020-10-22 13:29:49.527378 7fc72d39f700 10 mds.0.journal
> > > > > EMetaBlob.replay inotable tablev 481253 <= table 481328
> > > > > 2020-10-22 13:29:49.527380 7fc72d39f700 10 mds.0.journal
> > > > > EMetaBlob.replay sessionmap v 240341131 <= table 240378576
> > > > > 2020-10-22 13:29:49.527383 7fc72d39f700 10 mds.0.journal
> > > > > EMetaBlob.replay request client.16250824:1416595263 trim_to 1416595263
> > > > > 2020-10-22 13:29:49.530097 7fc72d39f700 10 mds.0.log _replay
> > > > > 57437755528637~11764673 / 57441334490146 2020-10-22 09:08:56.198798:
> > > > > EOpen [metablob 0x10009e1ec8e, 1881 dirs], 16748 open files
> > > > > 2020-10-22 13:29:49.530106 7fc72d39f700 10 mds.0.journal EOpen.replay
> > > > > 2020-10-22 13:29:49.530107 7fc72d39f700 10 mds.0.journal
> > > > > EMetaBlob.replay 1881 dirlumps by unknown.0
> > > > > 2020-10-22 13:29:49.530109 7fc72d39f700 10 mds.0.journal
> > > > > EMetaBlob.replay dir 0x10009e1ec8e
> > > > > 2020-10-22 13:29:49.530111 7fc72d39f700 10 mds.0.journal
> > > > > EMetaBlob.replay updated dir [dir 0x10009e1ec8e /path/to/dir/ [2,head]
> > > > > auth v=904150 cv=0/0 state=1073741824 f(v0 m2020-10-22 08:46:44.932805
> > > > > 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805 b133337592
> > > > > 89215=89215+0) hs=42927+1178,ss=0+0 dirty=2376 | child=1
> > > > > 0x56043c4bd100]
> > > > > 2020-10-22 13:29:50.275864 7fc731ba8700 5
> > > > > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 13
> > > > > 2020-10-22 13:29:51.026368 7fc73732e700 5
> > > > > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 13
> > > > > rtt 0.750024
> > > > > 2020-10-22 13:29:51.026377 7fc73732e700 0
> > > > > mds.beacon.hostnamecephssd01 MDS is no longer laggy
> > > > > 2020-10-22 13:29:54.275993 7fc731ba8700 5
> > > > > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 14
> > > > > 2020-10-22 13:29:54.277360 7fc73732e700 5
> > > > > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 14
> > > > > rtt 0.00100003
> > > > > 2020-10-22 13:29:58.276117 7fc731ba8700 5
> > > > > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 15
> > > > > 2020-10-22 13:29:58.277322 7fc73732e700 5
> > > > > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 15
> > > > > rtt 0.00100003
> > > > > 2020-10-22 13:30:02.276313 7fc731ba8700 5
> > > > > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 16
> > > > > 2020-10-22 13:30:02.477973 7fc73732e700 5
> > > > > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 16
> > > > > rtt 0.202007
> > > > >
> > > > > Thanks,
> > > > > David
> > > > >
> > > > > On Thu, Oct 22, 2020 at 1:41 PM Dan van der Ster <
dan(a)vanderster.com> wrote:
> > > > > >
> > > > > > You can disable that beacon by increasing mds_beacon_grace to
> > > > > > 300 or 600. This will stop the mon from failing that mds over
> > > > > > to a standby. I don't know if that is set on the mon or mgr,
> > > > > > so I usually set it on both. (You might as well disable the
> > > > > > standby too -- no sense in something failing back and forth
> > > > > > between two mdses).
> > > > > >
> > > > > > Next -- looks like your mds is in active:replay. Is it doing
> > > > > > anything? Is it using lots of CPU/RAM? If you increase
> > > > > > debug_mds do you see some progress?
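(A minimal sketch of applying the mds_beacon_grace suggestion above on a
Luminous cluster, assuming injectargs on the mons and the admin socket on
the active mgr -- hostnameceph03 being this cluster's active mgr -- with
the 600s value just the example grace period mentioned:)

    ceph tell mon.* injectargs '--mds_beacon_grace=600'
    ceph daemon mgr.hostnameceph03 config set mds_beacon_grace 600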
> > > > > >
> > > > > > -- dan
> > > > > >
> > > > > >
> > > > > > On Thu, Oct 22, 2020 at 2:01 PM David C <dcsysengineer(a)gmail.com> wrote:
> > > > > > >
> > > > > > > Hi All
> > > > > > >
> > > > > > > My main CephFS data pool on a Luminous 12.2.10 cluster hit
> > > > > > > capacity overnight; metadata is on a separate pool which
> > > > > > > didn't hit capacity, but the filesystem stopped working,
> > > > > > > which I'd expect. I increased the osd full-ratio to give me
> > > > > > > some breathing room to get some data deleted once the
> > > > > > > filesystem is back online. When I attempt to restart the MDS
> > > > > > > service, I see the usual stuff I'd expect in the log but then:
> > > > > > >
> > > > > > > heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > > > > > >
> > > > > > >
> > > > > > > Followed by:
> > > > > > >
> > > > > > > > mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last
> > > > > > > > acked 4.00013s ago); MDS internal heartbeat is not healthy!
> > > > > > >
> > > > > > >
> > > > > > > Eventually I get:
> > > > > > >
> > > > > > > >
> > > > > > > > mds.beacon.hostnamecephssd01 is_laggy 29.372 > 15 since last acked beacon
> > > > > > > > mds.0.90884 skipping upkeep work because connection to Monitors appears laggy
> > > > > > > > mds.hostnamecephssd01 Updating MDS map to version 90885 from mon.0
> > > > > > > > mds.beacon.hostnamecephssd01 MDS is no longer laggy
> > > > > > >
> > > > > > >
> > > > > > > The "MDS is no longer laggy" appears to be
where the
service fails
> > > > > > >
> > > > > > > Meanwhile a ceph -s is showing:
> > > > > > >
> > > > > > > >
> > > > > > > >   cluster:
> > > > > > > >     id:     5c5998fd-dc9b-47ec-825e-beaba66aad11
> > > > > > > >     health: HEALTH_ERR
> > > > > > > >             1 filesystem is degraded
> > > > > > > >             insufficient standby MDS daemons available
> > > > > > > >             67 backfillfull osd(s)
> > > > > > > >             11 nearfull osd(s)
> > > > > > > >             full ratio(s) out of order
> > > > > > > >             2 pool(s) backfillfull
> > > > > > > >             2 pool(s) nearfull
> > > > > > > >             6 scrub errors
> > > > > > > >             Possible data damage: 5 pgs inconsistent
> > > > > > > >
> > > > > > > >   services:
> > > > > > > >     mon: 3 daemons, quorum hostnameceph01,hostnameceph02,hostnameceph03
> > > > > > > >     mgr: hostnameceph03(active), standbys: hostnameceph02, hostnameceph01
> > > > > > > >     mds: cephfs-1/1/1 up {0=hostnamecephssd01=up:replay}
> > > > > > > >     osd: 172 osds: 161 up, 161 in
> > > > > > > >
> > > > > > > >   data:
> > > > > > > >     pools:   5 pools, 8384 pgs
> > > > > > > >     objects: 76.25M objects, 124TiB
> > > > > > > >     usage:   373TiB used, 125TiB / 498TiB avail
> > > > > > > >     pgs:     8379 active+clean
> > > > > > > >              5 active+clean+inconsistent
> > > > > > > >
> > > > > > > >   io:
> > > > > > > >     client: 676KiB/s rd, 0op/s rd, 0op/s w
> > > > > > >
> > > > > > >
> > > > > > > The 5 pgs inconsistent is not a new issue; that is from past
> > > > > > > scrubs. I just haven't gotten around to manually clearing
> > > > > > > them, although I suppose they could be related to my issue.
> > > > > > >
> > > > > > > The cluster has no clients connected
> > > > > > >
> > > > > > > I did notice in the ceph.log that some OSDs in the same host
> > > > > > > as the MDS service briefly went down when trying to restart
> > > > > > > the MDS, but examining the logs of those particular OSDs
> > > > > > > isn't showing any glaring issues.
> > > > > > >
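(For the record, the MDS debug level can be bumped without a restart via
the admin socket, though if the daemon keeps dying, setting debug_mds in
the [mds] section of ceph.conf is the surer way to catch the next replay
attempt:)

    ceph daemon mds.hostnamecephssd01 config set debug_mds 10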
> > > > > > > Full MDS log at debug 5 (can go higher if needed):
> > > > > > >
> > > > > > > 2020-10-22 11:27:10.987652 7f6f696f5240 0 set uid:gid to 167:167 (ceph:ceph)
> > > > > > > 2020-10-22 11:27:10.987669 7f6f696f5240 0 ceph version 12.2.10
> > > > > > > (177915764b752804194937482a39e95e0ca3de94) luminous (stable), process
> > > > > > > ceph-mds, pid 2022582
> > > > > > > 2020-10-22 11:27:10.990567 7f6f696f5240 0 pidfile_write: ignore empty
> > > > > > > --pid-file
> > > > > > > 2020-10-22 11:27:11.027981 7f6f62616700 1 mds.hostnamecephssd01 Updating
> > > > > > > MDS map to version 90882 from mon.0
> > > > > > > 2020-10-22 11:27:15.097957 7f6f62616700 1 mds.hostnamecephssd01 Updating
> > > > > > > MDS map to version 90883 from mon.0
> > > > > > > 2020-10-22 11:27:15.097989 7f6f62616700 1 mds.hostnamecephssd01 Map has
> > > > > > > assigned me to become a standby
> > > > > > > 2020-10-22 11:27:15.101071 7f6f62616700 1 mds.hostnamecephssd01 Updating
> > > > > > > MDS map to version 90884 from mon.0
> > > > > > > 2020-10-22 11:27:15.105310 7f6f62616700 1 mds.0.90884 handle_mds_map i am
> > > > > > > now mds.0.90884
> > > > > > > 2020-10-22 11:27:15.105316 7f6f62616700 1 mds.0.90884 handle_mds_map state
> > > > > > > change up:boot --> up:replay
> > > > > > > 2020-10-22 11:27:15.105325 7f6f62616700 1 mds.0.90884 replay_start
> > > > > > > 2020-10-22 11:27:15.105333 7f6f62616700 1 mds.0.90884 recovery set is
> > > > > > > 2020-10-22 11:27:15.105344 7f6f62616700 1 mds.0.90884 waiting for osdmap
> > > > > > > 73745 (which blacklists prior instance)
> > > > > > > 2020-10-22 11:27:15.149092 7f6f5be09700 0 mds.0.cache creating system
> > > > > > > inode with ino:0x100
> > > > > > > 2020-10-22 11:27:15.149693 7f6f5be09700 0 mds.0.cache creating system
> > > > > > > inode with ino:0x1
> > > > > > > 2020-10-22 11:27:41.021708 7f6f63618700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:27:43.029290 7f6f5f610700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:27:43.029297 7f6f5f610700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 4.00013s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:27:45.866711 7f6f5fe11700 1
heartbeat_map
reset_timeout
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:01.021965 7f6f63618700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:03.029862 7f6f5f610700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:03.029885 7f6f5f610700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 4.00113s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:28:06.022033 7f6f63618700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:07.029955 7f6f5f610700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:07.029961 7f6f5f610700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 8.00126s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:28:11.022099 7f6f63618700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:11.030024 7f6f5f610700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:11.030028 7f6f5f610700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 12.0014s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:28:15.030092 7f6f5f610700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:15.030099 7f6f5f610700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 16.0015s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:28:16.022165 7f6f63618700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:19.030163 7f6f5f610700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:19.030169 7f6f5f610700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 20.0016s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:28:21.022231 7f6f63618700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:23.030233 7f6f5f610700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:23.030241 7f6f5f610700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 24.0008s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:28:26.022295 7f6f63618700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:27.030305 7f6f5f610700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:27.030311 7f6f5f610700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 28.0009s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:28:28.401161 7f6f5fe11700 1
heartbeat_map
reset_timeout
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:28:28.401168 7f6f5fe11700 1
mds.beacon.hostnamecephssd01
> > > > > > > is_laggy
29.372 > 15 since last acked beacon
> > > > > > > 2020-10-22 11:28:28.401177 7f6f5fe11700 1 mds.0.90884
skipping upkeep work
> > > > > > > because
connection to Monitors appears laggy
> > > > > > > 2020-10-22 11:28:28.401187 7f6f62616700 1
mds.hostnamecephssd01 Updating
> > > > > > > MDS map to
version 90885 from mon.0
> > > > > > > 2020-10-22 11:28:31.659817 7f6f64595700 0
mds.beacon.hostnamecephssd01
> > > > > > > MDS is no
longer laggy
> > > > > > > 2020-10-22 11:36:15.880009 7f88ee4ac240 0 set uid:gid
to
167:167
> > > > > > > (ceph:ceph)
> > > > > > > 2020-10-22 11:36:15.880026 7f88ee4ac240 0 ceph
version
12.2.10
> > > > > > >
(177915764b752804194937482a39e95e0ca3de94) luminous
(stable), process
> > > > > > > ceph-mds, pid
2022663
> > > > > > > 2020-10-22 11:36:15.883118 7f88ee4ac240 0
pidfile_write:
ignore empty
> > > > > > > --pid-file
> > > > > > > 2020-10-22 11:36:15.921200 7f88e73cd700 1
mds.hostnamecephssd01 Updating
> > > > > > > MDS map to
version 90887 from mon.2
> > > > > > > 2020-10-22 11:36:20.270298 7f88e73cd700 1
mds.hostnamecephssd01 Updating
> > > > > > > MDS map to
version 90888 from mon.2
> > > > > > > 2020-10-22 11:36:20.270329 7f88e73cd700 1
mds.hostnamecephssd01 Map has
> > > > > > > assigned me to
become a standby
> > > > > > > 2020-10-22 11:36:20.272917 7f88e73cd700 1
mds.hostnamecephssd01 Updating
> > > > > > > MDS map to
version 90889 from mon.2
> > > > > > > 2020-10-22 11:36:20.277063 7f88e73cd700 1 mds.0.90889
handle_mds_map i am
> > > > > > > now
mds.0.90889
> > > > > > > 2020-10-22 11:36:20.277069 7f88e73cd700 1 mds.0.90889
handle_mds_map state
> > > > > > > change up:boot
--> up:replay
> > > > > > > 2020-10-22 11:36:20.277079 7f88e73cd700 1 mds.0.90889
replay_start
> > > > > > > 2020-10-22
11:36:20.277086 7f88e73cd700 1 mds.0.90889
recovery set is
> > > > > > > 2020-10-22
11:36:20.277096 7f88e73cd700 1 mds.0.90889
waiting for osdmap
> > > > > > > 73746 (which
blacklists prior instance)
> > > > > > > 2020-10-22 11:36:20.322318 7f88e0bc0700 0 mds.0.cache
creating system
> > > > > > > inode with
ino:0x100
> > > > > > > 2020-10-22 11:36:20.322918 7f88e0bc0700 0 mds.0.cache
creating system
> > > > > > > inode with
ino:0x1
> > > > > > > 2020-10-22 11:36:47.922531 7f88e43c7700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:36:47.922549 7f88e43c7700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 4.00013s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:36:50.914516 7f88e83cf700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:36:51.351457 7f88e4bc8700 1
heartbeat_map
reset_timeout
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:07.923089 7f88e43c7700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:07.923126 7f88e43c7700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 3.99913s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:37:10.914767 7f88e83cf700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:11.923216 7f88e43c7700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:11.923223 7f88e43c7700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 7.99926s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:37:15.914831 7f88e83cf700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:15.923286 7f88e43c7700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:15.923294 7f88e43c7700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 11.9994s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:37:19.923359 7f88e43c7700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:19.923366 7f88e43c7700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 15.9995s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:37:20.914917 7f88e83cf700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:23.923430 7f88e43c7700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:23.923437 7f88e43c7700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 19.9996s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:37:25.914981 7f88e83cf700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:27.923501 7f88e43c7700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:27.923508 7f88e43c7700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 23.9998s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:37:30.915046 7f88e83cf700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:31.923572 7f88e43c7700 1
heartbeat_map
is_healthy
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:31.923579 7f88e43c7700 0
mds.beacon.hostnamecephssd01
> > > > > > > Skipping
beacon heartbeat to monitors (last acked 27.9999s
ago); MDS
> > > > > > > internal
heartbeat is not healthy!
> > > > > > > 2020-10-22 11:37:32.412628 7f88e4bc8700 1
heartbeat_map
reset_timeout
> > > > > > >
'MDSRank' had timed out after 15
> > > > > > > 2020-10-22 11:37:32.412635 7f88e4bc8700 1
mds.beacon.hostnamecephssd01
> > > > > > > is_laggy
28.4889 > 15 since last acked beacon
> > > > > > > 2020-10-22 11:37:32.412643 7f88e4bc8700 1 mds.0.90889
skipping upkeep work
> > > > > > > because
connection to Monitors appears laggy
> > > > > > > 2020-10-22 11:37:32.412657 7f88e73cd700 1
mds.hostnamecephssd01 Updating
> > > > > > > MDS map to
version 90890 from mon.2
> > > > > > > 2020-10-22 11:37:35.978858 7f88e934c700 0
> > > > > > > mds.beacon.hostnamecephssd01 MDS is no longer laggy
> > > > > >
> > > > > >
> > > > > > Thanks in advance for any assistance you can provide!
> > > > > > David
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io