Hi,
I was very anxious a few hours ago because the sst files were growing so fast
that I didn't think the free space on the mon servers could keep up.
Let me start from the beginning. I have a cluster with OSDs deployed on
SATA disks (7200 rpm), 10 TB per OSD, and I use an EC pool for more capacity.
I added new OSDs to the cluster last week and it had been recovering well
so far. After that, while the cluster was still recovering, I increased the
pg_num.
On top of that, clients were still writing data to the cluster the whole time.
The cluster became unhealthy last night: some OSDs were down and one mon
was down.
Then I found that the mon servers' root directories were running out of free
space; the sst files in /var/lib/ceph/mon/ceph-xxx/store.db/
were growing rapidly.
Frank Schilder <frans(a)dtu.dk> wrote on Thu, 29 Oct 2020 at 19:15:
I think you really need to sit down and explain the
full story. Dropping
one-liners with new information will not work via e-mail.
I have never heard of the problem you are facing, so you did something
that possibly no-one else has done before. Unless we know the full history
from the last time the cluster was health_ok until now, it will almost
certainly not be possible to figure out what is going on via e-mail.
Usually, setting "norebalance" and "norecovery" should stop any
recovery
IO and allow the PGs to peer. If they do not become active, something is
wrong and the information we got so far does not give a clue what this
could be.
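For reference, a minimal sketch of setting and later clearing these flags with the standard ceph CLI (they are cluster-wide flags, run from any admin node):

```shell
# Pause rebalancing and recovery IO so the PGs get a chance to peer
ceph osd set norebalance
ceph osd set norecovery

# ...once the PGs are active again, re-enable recovery
ceph osd unset norecovery
ceph osd unset norebalance
```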
Please post the output of "ceph health detail", "ceph osd pool stats"
and
"ceph osd pool ls detail" and a log of actions and results since last
health_ok status here, maybe it gives a clue what is going on.
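The requested output could be collected in one go, for example:

```shell
# Gather the status output requested above
ceph health detail
ceph osd pool stats
ceph osd pool ls detail
```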
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Zhenshi Zhou <deaderzzs(a)gmail.com>
Sent: 29 October 2020 09:44:14
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] monitor sst files continue growing
I reset the pg_num after adding the OSDs; it made some PGs inactive (stuck
in the activating state).
Frank Schilder <frans@dtu.dk<mailto:frans@dtu.dk>> 于2020年10月29日周四
下午3:56写道:
This does not explain incomplete and inactive PGs. Are you hitting
https://tracker.ceph.com/issues/46847 (see also the thread "Ceph does not
recover from OSD restart")? In that case, temporarily stopping and
restarting all new OSDs might help.
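A sketch of what stopping and restarting a newly added OSD could look like, assuming the usual systemd ceph-osd@<id> unit naming; the OSD id 42 below is a placeholder:

```shell
# On the host carrying the newly added OSD (42 is a placeholder id)
systemctl stop ceph-osd@42
# wait until "ceph osd tree" reports it down, then bring it back
systemctl start ceph-osd@42
```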
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Zhenshi Zhou <deaderzzs@gmail.com>
Sent: 29 October 2020 08:30:25
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] monitor sst files continue growing
After adding OSDs to the cluster, the recovery and backfill process has
not finished yet.
Zhenshi Zhou <deaderzzs@gmail.com<mailto:deaderzzs@gmail.com><mailto:
deaderzzs@gmail.com<mailto:deaderzzs@gmail.com>>> 于2020年10月29日周四 下午3:29写道:
I stopped the MGR myself because it was taking too much memory.
As for the PG status, I added some OSDs in this cluster, and it
Frank Schilder <frans@dtu.dk<mailto:frans@dtu.dk><mailto:frans@dtu.dk
<mailto:frans@dtu.dk>>> 于2020年10月29日周四 下午3:27写道:
Your problem is the overall cluster health. The MONs store cluster history
information that will be trimmed once it reaches HEALTH_OK. Restarting the
MONs only makes things worse right now. The health status is a mess, no
MGR, a bunch of PGs inactive, etc. This is what you need to resolve. How
did your cluster end up like this?
It looks like all OSDs are up and in. You need to find out
- why there are inactive PGs
- why there are incomplete PGs
This usually happens when OSDs go missing.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Zhenshi Zhou <deaderzzs@gmail.com>
Sent: 29 October 2020 07:37:19
To: ceph-users
Subject: [ceph-users] monitor sst files continue growing
Hi all,
My cluster is in a bad state. The sst files in /var/lib/ceph/mon/xxx/store.db
keep growing, and Ceph warns that the mons are using a lot of disk space.
I set "mon compact on start = true" and restarted one of the monitors, but
it has been starting up and compacting for a long time and seems to have no end.
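For the record, the config fragment in question, plus a way to trigger compaction on a running monitor without restarting it (mon.a is a placeholder id):

```shell
# ceph.conf fragment corresponding to the setting above:
#   [mon]
#   mon compact on start = true

# Trigger a compaction on a running monitor instead of restarting it
ceph tell mon.a compact

# Watch how much space each monitor's store is using
du -sh /var/lib/ceph/mon/ceph-*/store.db
```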