Hi cephists,
We have a 10-node cluster running Nautilus 14.2.9.
All objects are in an EC pool. We have the mgr balancer plugin in upmap mode
doing its rebalancing:
health: HEALTH_OK
pgs:
1985 active+clean
190 active+remapped+backfilling
65 active+remapped+backfill_wait
io:
client: 0 B/s wr, 0 op/s rd, 0 op/s wr
recovery: 770 MiB/s, 463 objects/s
We have restarted osd.0 on one of our OSD nodes, and this was the
status immediately after:
```
health: HEALTH_WARN
1 osds down
Degraded data redundancy: 4531479/531067647 objects
degraded (0.853%), 109 pgs degraded
```
Then the OSD came back up:
```
health: HEALTH_WARN
Degraded data redundancy: 4963207/531067545 objects
degraded (0.935%), 120 pgs degraded
```
And after a minute or so has passed it settled on:
```
health: HEALTH_WARN
Degraded data redundancy: 295515/531067347 objects
degraded (0.056%), 10 pgs degraded, 10 pgs undersized
```
The upmap balancer was running during the osd.0 restart; the restart itself
was successful, without any issues.
This left us wondering: how could a simple OSD restart cause
degraded PGs? Could this be related to the upmap balancer running?
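For reference, this is the kind of check I was planning to run to see
whether the degraded PGs line up with the balancer's upmap entries (I
hope I have the Nautilus commands right):
```
# list PGs currently reported as degraded
ceph pg ls degraded
# show the pg_upmap_items the balancer has installed, to compare PG ids
ceph osd dump | grep pg_upmap_items
```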
Thanks!
--
Vyteni
Hi list,
When reading the documentation for the new way of mirroring [1], some
questions arose, especially with the following sentence:
> Since this mode is not point-in-time consistent, the full snapshot
delta will need to be synced prior to use during a failover scenario.
1) I'm not sure I follow. Why is snapshot-based mirroring not point-in-time
consistent, whereas journal-based mirroring is?
2) Also, I am not sure what the implications of the second part of the
referenced sentence are: what significance does the word "full" bear in
"full snapshot delta"? Further, isn't it obvious that in order to do a
failover, the other side should have the latest snapshot? I am
probably missing what the documentation is trying to tell me.
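For context, this is my (possibly incomplete) understanding of how the
snapshot-based mode is driven, based on the same docs (pool/image names
below are just placeholders):
```
# enable snapshot-based mirroring for a single image
rbd mirror image enable mypool/myimage snapshot
# create a mirror snapshot by hand...
rbd mirror image snapshot mypool/myimage
# ...or schedule one periodically
rbd mirror snapshot schedule add --pool mypool --image myimage 1h
```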
All in all, I think some background concepts are missing in order to
read this chapter of the documentation. It might well be that this is
explained somewhere else in the docs, but maybe we should link to that.
If someone can explain the above 2 points, I'm willing to take a shot at
a patch to make this clearer in the docs.
Thanks,
Hans
[1] https://docs.ceph.com/docs/octopus/rbd/rbd-mirroring/
Hi,
Running 14.2.6 on Debian Buster (backports).
Have set up a cephfs with 3 data pools and one metadata pool:
myfs_data, myfs_data_hdd, myfs_data_ssd, and myfs_metadata.
Using ceph.dir.layout.pool, the data of all files is stored in either
myfs_data_hdd or myfs_data_ssd. This has also been checked by dumping
the ceph.file.layout.pool attributes of all files.
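For reference, the layouts were assigned and checked roughly like this
(the mount point and paths are just examples):
```
# pin new files under a directory to the SSD pool
setfattr -n ceph.dir.layout.pool -v myfs_data_ssd /mnt/myfs/ssd-dir
# verify which pool an individual file's data went to
getfattr -n ceph.file.layout.pool /mnt/myfs/ssd-dir/somefile
```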
The filesystem has 1617949 files and 36042 directories.
There are, however, approximately as many objects in the first pool created
for the cephfs, myfs_data, as there are files. Their number also grows or
shrinks as files are created or deleted (so they cannot be some leftover from
earlier exercises). Note how the USED size is reported as 0 bytes,
correctly reflecting that no file data is stored in that pool.
POOL_NAME      USED     OBJECTS  CLONES  COPIES   MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS   RD       WR_OPS    WR       USED COMPR  UNDER COMPR
myfs_data      0 B      1618229  0       4854687  0                   0        0         2263590  129 GiB  23312479  124 GiB  0 B         0 B
myfs_data_hdd  831 GiB  136309   0       408927   0                   0        0         106046   200 GiB  269084    277 GiB  0 B         0 B
myfs_data_ssd  43 GiB   1552412  0       4657236  0                   0        0         181468   2.3 GiB  4661935   12 GiB   0 B         0 B
myfs_metadata  1.2 GiB  36096    0       108288   0                   0        0         4828623  82 GiB   1355102   143 GiB  0 B         0 B
Is this expected?
I was assuming that in this scenario all objects, both their data and any
keys, would be either in the metadata pool or in the two pools where the
file data is stored.
Are these some additional metadata keys that are stored in the first
data pool created for the cephfs? That would not be so nice in case the OSD
selection rules for that pool use worse disks than the data itself...
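For what it's worth, this is how I have been poking at the objects in
myfs_data (<object-name> stands for one of the names returned by 'rados ls'):
```
# pick an object from the first data pool
rados -p myfs_data ls | head -1
# it holds no data...
rados -p myfs_data stat <object-name>
# ...but does it carry xattrs or omap keys?
rados -p myfs_data listxattr <object-name>
rados -p myfs_data listomapkeys <object-name>
```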
Btw: is there any tool to see the amount of key-value data associated
with a pool? 'ceph osd df' gives omap and meta per OSD, but not broken
down per pool.
Best regards,
Håkan
Hi Dylan,
It looks like you have 10GB of heap waiting to be released -- try `ceph tell
mds.$(hostname) heap release` to free that up.
Otherwise, I've found it safe to incrementally inject decreased
mds_cache_memory_limit values on production MDSs running v12.2.12. I'd
start by decreasing the size just a few hundred MB at a time while
tailing the mds log with `debug mds = 2` or running `watch --color ceph
fs status` to see the cache sizes decrease and stabilize after each
change.
(In my case I've decreased from ~16GB caches to ~4GB across 9 active
MDSs -- I moved at around 500MB per injection and there were no slow
requests or client issues.)
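Something along these lines is what I mean by injecting in small steps
(the value below is only an example -- start from whatever your current
limit is and step down a few hundred MB at a time):
```
# lower the cache limit slightly on the active MDS
ceph tell mds.mds1-ceph2-qh2 injectargs '--mds_cache_memory_limit=450500000000'
# watch the cache shrink and stabilize before the next step
ceph daemon mds.$(hostname -s) cache status
watch --color ceph fs status
```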
BTW, we also increase `mds cache trim threshold` to allow the MDS to
trim more caps per 5s tick -- if you find the LRU is not trimming
quickly enough you could try 1.5 or 2x the default value.
If things get hairy you could increase the mds_beacon_grace (on the mon
and/or mds) to tolerate longer missed heartbeats rather than failing
the mds.
Cheers, Dan
On Thu, May 28, 2020 at 7:09 AM Dylan McCulloch <dmc(a)unimelb.edu.au> wrote:
Hi all,
The single active MDS on one of our Ceph clusters is close to running out of RAM.
MDS total system RAM = 528GB
MDS current free system RAM = 4GB
mds_cache_memory_limit = 451GB
current mds cache usage = 426GB
Presumably we need to reduce our mds_cache_memory_limit and/or mds_max_caps_per_client, but would like some guidance on whether it’s possible to do that safely on a live production cluster when the MDS is already pretty close to running out of RAM.
Cluster is Luminous - 12.2.12
Running single active MDS with two standby.
890 clients
Mix of kernel client (4.19.86) and ceph-fuse.
Clients are 12.2.12 (398) and 12.2.13 (3)
The kernel clients have stayed under "mds_max_caps_per_client": "1048576". But the ceph-fuse clients appear to hold very large numbers according to the ceph-fuse asok.
e.g.
"num_caps": 1007144398,
"num_caps": 1150184586,
"num_caps": 1502231153,
"num_caps": 1714655840,
"num_caps": 2022826512,
Dropping caches on the clients appears to reduce their cap usage but does not free up RAM on the MDS.
What is the safest method to free cache and reduce RAM usage on the MDS in this situation (without having to evict or remount clients)?
I’m concerned that reducing mds_cache_memory_limit even in very small increments may trigger a large recall of caps and overwhelm the MDS.
We also considered setting a reduced mds_cache_memory_limit on both the standby MDS. Would a subsequent failover to an MDS with a lower cache limit be safe?
Some more details below and I’d be more than happy to provide additional logs.
Thanks,
Dylan
# free -b
total used free shared buff/cache available
Mem: 540954992640 535268749312 4924698624 438284288 761544704 3893182464
Swap: 0 0 0
# ceph daemon mds.$(hostname -s) config get mds_cache_memory_limit
{
"mds_cache_memory_limit": "450971566080"
}
# ceph daemon mds.$(hostname -s) cache status
{
"pool": {
"items": 10593257843,
"bytes": 425176150288
}
}
# ceph daemon mds.$(hostname -s) dump_mempools | grep -A2 "mds_co\|anon"
"buffer_anon": {
"items": 3935,
"bytes": 4537932
--
"mds_co": {
"items": 10595391186,
"bytes": 425255456209
# ceph daemon mds.$(hostname -s) perf dump | jq '.mds_mem.rss'
520100552
# ceph tell mds.$(hostname) heap stats
tcmalloc heap stats:------------------------------------------------
MALLOC: 496040753720 (473061.3 MiB) Bytes in use by application
MALLOC: + 11085479936 (10571.9 MiB) Bytes in page heap freelist
MALLOC: + 22568895888 (21523.4 MiB) Bytes in central cache freelist
MALLOC: + 31744 ( 0.0 MiB) Bytes in transfer cache freelist
MALLOC: + 34186296 ( 32.6 MiB) Bytes in thread cache freelists
MALLOC: + 2802057216 ( 2672.2 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 532531404800 (507861.5 MiB) Actual memory used (physical + swap)
MALLOC: + 1315700736 ( 1254.8 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 533847105536 (509116.3 MiB) Virtual address space used
MALLOC:
MALLOC: 44496459 Spans in use
MALLOC: 22 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
# ceph fs status
hpc_projects - 890 clients
============
+------+--------+----------------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+--------+----------------+---------------+-------+-------+
| 0 | active | mds1-ceph2-qh2 | Reqs: 304 /s | 167M | 167M |
+------+--------+----------------+---------------+-------+-------+
+--------------------+----------+-------+-------+
| Pool | type | used | avail |
+--------------------+----------+-------+-------+
| hpcfs_metadata | metadata | 17.4G | 1893G |
| hpcfs_data | data | 1014T | 379T |
| test_nvmemeta | data | 0 | 1893G |
| hpcfs_data_sandisk | data | 312T | 184T |
+--------------------+----------+-------+-------+
+----------------+
| Standby MDS |
+----------------+
| mds3-ceph2-qh2 |
| mds2-ceph2-qh2 |
+----------------+
MDS version: ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
Hello,
if I understand correctly:
if we upgrade a running Nautilus cluster to Octopus, we will have
downtime while the MDS daemons are updated.
Is this correct?
Kind regards
Andreas Schiefer
Head of System Administration
---
HOME OF LOYALTY
CRM- & Customer Loyalty Solution
by UW Service
Gesellschaft für Direktwerbung und Marketingberatung mbH
Alter Deutzer Postweg 221
51107 Koeln (Rath/Heumar)
Germany
Phone: +49 221 98696 0
Fax: +49 221 98696 5222
info(a)uw-service.de
www.hooloy.de
Amtsgericht Koeln HRB 24 768
VAT ID: DE 164 191 706
Managing Director: Ralf Heim
---
FYI. Hope to see some awesome CephFS submissions for our virtual IO500 BoF!
Thanks,
John
---------- Forwarded message ---------
From: committee--- via IO-500 <io-500(a)vi4io.org>
Date: Fri, May 22, 2020 at 1:53 PM
Subject: [IO-500] IO500 ISC20 Call for Submission
To: <io-500(a)vi4io.org>
*Deadline*: 08 June 2020 AoE
The IO500 <http://io500.org/> is now accepting and encouraging submissions
for the upcoming 6th IO500 list. Once again, we are also accepting
submissions to the 10 Node Challenge to encourage the submission of small
scale results. The new ranked lists will be announced via live-stream at a
virtual session. We hope to see many new results.
The benchmark suite is designed to be easy to run and the community has
multiple active support channels to help with any questions. Please note
that submissions of all sizes are welcome; the site has customizable
sorting so it is possible to submit on a small system and still get a very
good per-client score for example. Additionally, the list is about much
more than just the raw rank; all submissions help the community by
collecting and publishing a wider corpus of data. More details below.
Following the success of the Top500 in collecting and analyzing historical
trends in supercomputer technology and evolution, the IO500
<http://io500.org/> was created in 2017, published its first list at SC17,
and has grown exponentially since then. The need for such an initiative has
long been known within High-Performance Computing; however, defining
appropriate benchmarks had long been challenging. Despite this challenge,
the community, after long and spirited discussion, finally reached
consensus on a suite of benchmarks and a metric for resolving the scores
into a single ranking.
The multi-fold goals of the benchmark suite are as follows:
1. Maximizing simplicity in running the benchmark suite
2. Encouraging optimization and documentation of tuning parameters for
performance
3. Allowing submitters to highlight their “hero run” performance numbers
4. Forcing submitters to simultaneously report performance for
challenging IO patterns.
Specifically, the benchmark suite includes a hero-run of both IOR and mdtest
configured however possible to maximize performance and establish an
upper-bound for performance. It also includes an IOR and mdtest run with
highly constrained parameters forcing a difficult usage pattern in an
attempt to determine a lower-bound. Finally, it includes a namespace search
as this has been determined to be a highly sought-after feature in HPC
storage systems that has historically not been well-measured. Submitters
are encouraged to share their tuning insights for publication.
The goals of the community are also multi-fold:
1. Gather historical data for the sake of analysis and to aid
predictions of storage futures
2. Collect tuning data to share valuable performance optimizations
across the community
3. Encourage vendors and designers to optimize for workloads beyond
“hero runs”
4. Establish bounded expectations for users, procurers, and
administrators
*10 Node I/O Challenge*
The 10 Node Challenge is conducted using the regular IO500 benchmark,
however, with the rule that exactly *10 client nodes* must be used to run
the benchmark. You may use any shared storage with, e.g., any number of
servers. When submitting for the IO500 list, you can opt in to
“Participate in the 10 compute node challenge only”, then we will not
include the results in the ranked list. Other 10-node submissions
will be included in the full list and in the ranked list. We will announce
the result in a separate derived list and in the full list but not on the
ranked IO500 list at https://io500.org/.
This information and rules for ISC20 submissions are available here:
https://www.vi4io.org/io500/rules/submission
Thanks,
The IO500 Committee
_______________________________________________
IO-500 mailing list
IO-500(a)vi4io.org
https://www.vi4io.org/mailman/listinfo/io-500
Hello,
we are currently experiencing problems with ceph pg repair not working
on Ceph Nautilus 14.2.8.
ceph health detail is showing us an inconsistent pg:
[aaaaax-yyyy ~]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 18.19a is active+clean+inconsistent+snaptrim_wait, acting
[21,15,39,18,0,9]
When we try to repair it, nothing happens.
[aaaaax-yyyy ~]# ceph pg repair 18.19a
instructing pg 18.19as0 on osd.21 to repair
There are no new entries in OSD 21's log file.
We have no trouble repairing PGs in our other clusters, so I assume it
might be something related to this cluster using erasure coding. But
this is just a wild guess.
I found a similar problem in this mailing list -
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-April/026304.html
Unfortunately the solution of waiting more than a week until it fixes
itself isn't quite satisfying.
Is there anyone who has had similar issues and knows how to repair these
inconsistent pgs or what is causing the delay?
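In case it helps, these are the next things we were planning to check
(assuming I have the Nautilus commands right):
```
# show the detailed inconsistency report from the last deep-scrub
rados list-inconsistent-obj 18.19a --format=json-pretty
# re-run a deep scrub, then retry the repair
ceph pg deep-scrub 18.19a
ceph pg repair 18.19a
# check whether the repair might simply be queued behind other scrubs
ceph daemon osd.21 config get osd_max_scrubs
```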
--
Kind regards
Daniel Aberger
Your Profihost Team
-------------------------------
Profihost AG
Expo Plaza 1
30539 Hannover
Germany
Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
URL: http://www.profihost.com | E-Mail: info(a)profihost.com
Registered office: Hannover, VAT ID DE813460827
Register court: Amtsgericht Hannover, registration no. HRB 202350
Management board: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe
Supervisory board: Prof. Dr. iur. Winfried Huck (Chairman)
Hello all,
I hope you can help me with some very strange problems which arose
suddenly today. I tried to search, also in this mailing list, but could
not find anything relevant.
At some point today, without any action from my side, I noticed some
OSDs in my production cluster would go down and never come up.
I am on Luminous 12.2.13, CentOS7, kernel 3.10: my setup is non-standard
as OSD disks are served off a SAN (which is for sure OK now, although I
cannot exclude some glitch).
I tried rebooting the OSD servers a few times, ran "activate --all", and
added bluestore_ignore_data_csum=true to the [osd] section of ceph.conf...
the number of "down" OSDs changed for a while but now seems rather stable.
There are actually two classes of problems (a bit more detail right below):
- ERROR: osd init failed: (5) Input/output error
- failed to load OSD map for epoch 141282, got 0 bytes
*First problem*
This affects 50 OSDs (all disks of this kind, on all but one server):
these OSDs are reserved for object storage but I am not yet using them
so I may in principle recreate them. But I would be interested in
understanding what the problem is, and in learning how to solve it for
future reference.
Here is what I see in logs:
.....
2020-05-21 21:17:48.661348 7fa2e9a95ec0 1 bluefs add_block_device bdev
1 path /var/lib/ceph/osd/cephpa1-72/block size 14.5TiB
2020-05-21 21:17:48.661428 7fa2e9a95ec0 1 bluefs mount
2020-05-21 21:17:48.662040 7fa2e9a95ec0 1 bluefs _init_alloc id 1
alloc_size 0x10000 size 0xe83a3400000
2020-05-21 21:52:43.858464 7fa2e9a95ec0 -1 bluefs mount failed to replay
log: (5) Input/output error
2020-05-21 21:52:43.858589 7fa2e9a95ec0 1 fbmap_alloc 0x55c6bba92e00
shutdown
2020-05-21 21:52:43.858728 7fa2e9a95ec0 -1
bluestore(/var/lib/ceph/osd/cephpa1-72) _open_db failed bluefs mount:
(5) Input/output error
2020-05-21 21:52:43.858790 7fa2e9a95ec0 1 bdev(0x55c6bbdb6600
/var/lib/ceph/osd/cephpa1-72/block) close
2020-05-21 21:52:44.103536 7fa2e9a95ec0 1 bdev(0x55c6bbdb8600
/var/lib/ceph/osd/cephpa1-72/block) close
2020-05-21 21:52:44.352899 7fa2e9a95ec0 -1 osd.72 0 OSD:init: unable to
mount object store
2020-05-21 21:52:44.352956 7fa2e9a95ec0 -1 ** ERROR: osd init
failed: (5) Input/output error
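For this first class of OSDs, would it make sense to run a fsck on one
of them to get more detail? Something like this (with the OSD stopped)
is what I had in mind:
```
# consistency check of the BlueStore metadata on the failing OSD
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/cephpa1-72
```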
*Second problem*
This affects 11 OSDs, which I use *in production* for Cinder block
storage: it looks like all PGs for this pool are currently OK.
Here is the excerpt from the logs.
.....
-5> 2020-05-21 20:52:06.756469 7fd2ccc19ec0 0 _get_class not
permitted to load kvs
-4> 2020-05-21 20:52:06.759686 7fd2ccc19ec0 1 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/rgw/cls_rgw.cc:3869:
Loaded rgw class!
-3> 2020-05-21 20:52:06.760021 7fd2ccc19ec0 1 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/log/cls_log.cc:299:
Loaded log class!
-2> 2020-05-21 20:52:06.760730 7fd2ccc19ec0 1 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/replica_log/cls_replica_log.cc:135:
Loaded replica log class!
-1> 2020-05-21 20:52:06.760873 7fd2ccc19ec0 -1 osd.63 0 failed to
load OSD map for epoch 141282, got 0 bytes
0> 2020-05-21 20:52:06.763277 7fd2ccc19ec0 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h:
In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fd2ccc19ec0
time 2020-05-21 20:52:06.760916
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h:
994: FAILED assert(ret)
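For this second class, I have seen the suggestion (not yet tried here,
and I am not sure it is safe) of re-injecting the missing map into the
OSD's store, roughly like this (the data path is only guessed from the
osd.72 example above):
```
# with osd.63 stopped: fetch the missing map epoch from the monitors...
ceph osd getmap 141282 -o /tmp/osdmap.141282
# ...and write it into the OSD's store
ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-63 \
    --op set-osdmap --file /tmp/osdmap.141282
```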
Does anyone have an idea how I could fix these problems, or what I could
do to try and shed some light? And also, what caused them, and whether
there is some magic configuration flag I could use to protect my cluster?
Thanks a lot for your help!
Fulvio