On 9/25/2020 6:07 PM, Saber(a)PlanetHoster.info wrote:
> Hi Igor,
>
> The only thing abnormal about this osdstore is that it was created by
> Mimic 13.2.8, and I can see that the OSD sizes in this osdstore are not
> the same as the others in the cluster (while they should be exactly
> the same size).
>
> Can it be https://tracker.ceph.com/issues/39151 ?
Hmm, maybe... Did you change the hardware for this OSD's node at some
point, as happened in the ticket?
And it's still unclear to me if the issue is reproducible for you.
Could you please also run fsck first, and then repair, for this OSD and
collect the log(s)?
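For reference, a minimal sketch of those two invocations, run with the OSD
stopped; the OSD data path is an assumption:

  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<id>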
Thanks,
Igor
>
> Thanks!
> Saber
> CTO @PlanetHoster
>
>> On Sep 25, 2020, at 5:46 AM, Igor Fedotov <ifedotov(a)suse.de> wrote:
>>
>> Hi Saber,
>>
>> I don't think this is related. The new assertion happens along the write
>> path, while the original one occurred on allocator shutdown.
>>
>>
>> Unfortunately there is not much information to troubleshoot this...
>> Are you able to reproduce the case?
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 9/25/2020 4:21 AM, Saber(a)PlanetHoster.info wrote:
>>> Hi Igor,
>>>
>>> We had an OSD crash a week into running Nautilus. I have attached
>>> the logs; is it related to the same bug?
>>>
>>>
>>>
>>>
>>> Thanks,
>>> Saber
>>> CTO @PlanetHoster
>>>
>>>> On Sep 14, 2020, at 10:22 AM, Igor Fedotov <ifedotov(a)suse.de> wrote:
>>>>
>>>> Thanks!
>>>>
>>>> Now got the root cause. The fix is on its way...
>>>>
>>>> Meanwhile you might want to try to work around the issue by setting
>>>> "bluestore_hybrid_alloc_mem_cap" to 0 or by using a different allocator,
>>>> e.g. avl for bluestore_allocator (and optionally for
>>>> bluefs_allocator too).
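For reference, a minimal sketch of that workaround as ceph.conf entries on
the affected OSD node, followed by an OSD restart; a suggestion only, not a
tested recipe:

  [osd]
  # variant 1: disable the hybrid allocator's memory cap
  bluestore_hybrid_alloc_mem_cap = 0
  # variant 2: switch to the avl allocator instead
  #bluestore_allocator = avl
  #bluefs_allocator = avl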
>>>>
>>>>
>>>> Hope this helps,
>>>>
>>>> Igor.
>>>>
>>>>
>>>>
>>>> On 9/14/2020 5:02 PM, Jean-Philippe Méthot wrote:
>>>>> Alright, here’s the full log file.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Jean-Philippe Méthot
>>>>> Senior Openstack system administrator
>>>>> Administrateur système Openstack sénior
>>>>> PlanetHoster inc.
>>>>> 4414-4416 Louis B Mayer
>>>>> Laval, QC, H7P 0G1, Canada
>>>>> TEL : +1.514.802.1644 - Poste : 2644
>>>>> FAX : +1.514.612.0678
>>>>> CA/US : 1.855.774.4678
>>>>> FR : 01 76 60 41 43
>>>>> UK : 0808 189 0423
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Sep 14, 2020, at 06:49, Igor Fedotov <ifedotov(a)suse.de> wrote:
>>>>>>
>>>>>> Well, I can see a duplicate admin socket command
>>>>>> registration/de-registration (and the second de-registration
>>>>>> asserts), but I don't understand how this could happen.
>>>>>>
>>>>>> Would you share the full log, please?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>> On 9/11/2020 7:26 PM, Jean-Philippe Méthot wrote:
>>>>>>> Here’s the out file, as requested.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Jean-Philippe Méthot
>>>>>>> Senior Openstack system administrator
>>>>>>> Administrateur système Openstack sénior
>>>>>>> PlanetHoster inc.
>>>>>>> 4414-4416 Louis B Mayer
>>>>>>> Laval, QC, H7P 0G1, Canada
>>>>>>> TEL : +1.514.802.1644 - Poste : 2644
>>>>>>> FAX : +1.514.612.0678
>>>>>>> CA/US : 1.855.774.4678
>>>>>>> FR : 01 76 60 41 43
>>>>>>> UK : 0808 189 0423
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Sep 11, 2020, at 10:38, Igor Fedotov <ifedotov(a)suse.de> wrote:
>>>>>>>>
>>>>>>>> Could you please run:
>>>>>>>>
>>>>>>>> CEPH_ARGS="--log-file log --debug-asok 5" ceph-bluestore-tool
>>>>>>>> repair --path <...> ; cat log | grep asok > out
>>>>>>>>
>>>>>>>> and share 'out' file.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Igor
>>>>>>>>
>>>>>>>> On 9/11/2020 5:15 PM, Jean-Philippe Méthot wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We’re upgrading our cluster from Mimic to Nautilus, one OSD node
>>>>>>>>> at a time. The release notes recommend running the following
>>>>>>>>> command to fix stats after an upgrade:
>>>>>>>>>
>>>>>>>>> ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0
>>>>>>>>>
>>>>>>>>> However, running that command gives us the following error
>>>>>>>>> message:
>>>>>>>>>
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: In
>>>>>>>>>> function 'virtual Allocator::SocketHook::~SocketHook()'
>>>>>>>>>> thread 7f1a6467eec0 time 2020-09-10 14:40:25.872353
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: 53
>>>>>>>>>> : FAILED ceph_assert(r == 0)
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x14a) [0x7f1a5a823025]
>>>>>>>>>> 2: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 3: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 6: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 8: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 10: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> 2020-09-10 14:40:25.873 7f1a6467eec0 -1
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: In function 'virtual
>>>>>>>>>> Allocator::SocketHook::~SocketHook()' thread 7f1a6467eec0
>>>>>>>>>> time 2020-09-10 14:40:25.872353
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: 53: FAILED ceph_assert(r == 0)
>>>>>>>>>>
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x14a) [0x7f1a5a823025]
>>>>>>>>>> 2: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 3: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 6: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 8: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 10: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> *** Caught signal (Aborted) **
>>>>>>>>>> in thread 7f1a6467eec0 thread_name:ceph-bluestore-
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (()+0xf630) [0x7f1a58cf0630]
>>>>>>>>>> 2: (gsignal()+0x37) [0x7f1a574be387]
>>>>>>>>>> 3: (abort()+0x148) [0x7f1a574bfa78]
>>>>>>>>>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x199) [0x7f1a5a823074]
>>>>>>>>>> 5: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 6: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 9: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 11: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 13: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> 2020-09-10 14:40:25.874 7f1a6467eec0 -1 *** Caught signal
>>>>>>>>>> (Aborted) **
>>>>>>>>>> in thread 7f1a6467eec0 thread_name:ceph-bluestore-
>>>>>>>>>>
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (()+0xf630) [0x7f1a58cf0630]
>>>>>>>>>> 2: (gsignal()+0x37) [0x7f1a574be387]
>>>>>>>>>> 3: (abort()+0x148) [0x7f1a574bfa78]
>>>>>>>>>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x199) [0x7f1a5a823074]
>>>>>>>>>> 5: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 6: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 9: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 11: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 13: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS
>>>>>>>>>> <executable>` is needed to interpret this.
>>>>>>>>>
>>>>>>>>> What could be the source of this error? I haven’t found much
>>>>>>>>> of anything about it online.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Jean-Philippe Méthot
>>>>>>>>> Senior Openstack system administrator
>>>>>>>>> Administrateur système Openstack sénior
>>>>>>>>> PlanetHoster inc.
>>>>>>>>> 4414-4416 Louis B Mayer
>>>>>>>>> Laval, QC, H7P 0G1, Canada
>>>>>>>>> TEL : +1.514.802.1644 - Poste : 2644
>>>>>>>>> FAX : +1.514.612.0678
>>>>>>>>> CA/US : 1.855.774.4678
>>>>>>>>> FR : 01 76 60 41 43
>>>>>>>>> UK : 0808 189 0423
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
Hi all,
I had an issue where the Docker containers on all the Ceph nodes just seemed
to stop at some point, effectively shutting down the cluster. Restarting
Ceph on all of the nodes restored the cluster to normal working order.
I would like to find out why this occurred; any ideas on where to look?
Many thanks,
Darrin
> On Oct 26, 2020, at 00:07, Anthony D'Atri <anthony.datri(a)gmail.com> wrote:
>
>> I'm not entirely sure if primary on SSD will actually make the read happen on SSD.
>
> My understanding is that by default reads always happen from the lead OSD in the acting set. Octopus seems to (finally) have an option to spread the reads around, which IIRC defaults to false.
I also remember that “by default reads always happen from the lead OSD in the acting set”. I dug through git blame and it seems ceph-fuse has had a --localize-reads option since 10 years ago [1], though it is not documented anywhere. I can’t find such a setting in the kernel ceph module.
[1]: https://github.com/ceph/ceph/commit/7912f5c7034bd26d22615d1be1d398849e124749
> I’ve never seen anything that implies that lead OSDs within an acting set are a function of CRUSH rule ordering. I’m not asserting that they aren’t though, but I’m … skeptical.
That conclusion is from experiments. I created an empty pool with the above-mentioned CRUSH rule, and all 32 PGs had an SSD as primary.
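For reference, that experiment was roughly along these lines; a sketch, with the test pool name made up:

  ceph osd pool create mixed-test 32 32 replicated mixed_replicated_rule
  ceph pg ls-by-pool mixed-test        # inspect UP/ACTING and the primary of each PG
  ceph osd pool delete mixed-test mixed-test --yes-i-really-really-mean-it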
> Setting primary affinity would do the job, and you’d want to have cron continually update it across the cluster to react to topology changes. I was told of this strategy back in 2014, but haven’t personally seen it implemented.
I’m also considering this. But if I set the primary affinity of the HDDs to 0, then what will happen if I create another all-HDD pool? Or should I just set the primary affinity to a very small value, say 0.00001?
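For reference, primary affinity is set per OSD, roughly like this; the OSD id and the tiny value are just examples:

  ceph osd primary-affinity osd.12 0          # never preferred as primary
  ceph osd primary-affinity osd.12 0.00001    # almost never preferred as primary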
> That said, HDDs are more of a bottleneck for writes than reads and just might be fine for your application. Tiny reads are going to limit you to some degree regardless of drive type, and you do mention throughput, not IOPS.
>
> I must echo Frank’s notes about capacity too. Ceph can do a lot of things, but that doesn’t mean something exotic is necessarily the best choice. You’re concerned about 3R only yielding 1/3 of raw capacity in an all-SSD cluster, but the architecture you propose limits you anyway because of drive size. Consider chassis, CPU, RAM, RU, and switch-port costs as well, and the cost of you fussing over an exotic solution instead of the hundreds of other things in your backlog.
>
> And your cluster as described is *tiny*. Honestly I’d suggest considering one of these alternatives:
>
> * Ditch the HDDs, use QLC flash. The emerging EDSFF drives are really promising for replacing HDDs for density in this kind of application. You might even consider ARM if IOPS aren’t a concern.
> * An NVMeoF solution
Thanks for the advice, we will discuss these. But this deployment is on existing server hardware, so we don’t have many choices, and our budget is very limited. We want to make the best use of our existing SSDs, and we have plenty of cold data to fill our HDDs, so we are not worried about wasting HDD capacity.
Sorry Anthony, I sent this mail twice. I forgot to CC the mailing list at first.
> Cache tiers are “deprecated”, but then so are custom cluster names. Neither appears
>
>> For EC pools there is an option "fast_read" (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read…), which states that a read will return as soon as the first k shards have arrived. The default is to wait for all k+m shards (all replicas). This option is not available for replicated pools.
>>
>> Now, not sure if this option is not available for replicated pools because the read will always be served by the acting primary, or if it currently waits for all replicas. In the latter case, reads will wait for the slowest device.
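For reference, a minimal sketch of toggling that option on an EC pool; the pool name is an assumption:

  ceph osd pool set my-ec-pool fast_read 1
  ceph osd pool get my-ec-pool fast_read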
>>
>> I'm not sure if I interpret this correctly. I think you should test the setup with HDD only and SSD+HDD to see if read speed improves. Note that write speed will always depend on the slowest device.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <frans(a)dtu.dk>
>> Sent: 25 October 2020 15:03:16
>> To: 胡 玮文; Alexander E. Patrakov
>> Cc: ceph-users(a)ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>>
>> A cache pool might be an alternative, heavily depending on how much data is hot. However, then you will have much less SSD capacity available, because it also requires replication.
>>
>> Looking at the setup, you have only 10*1T = 10T of SSD but 20*6T = 120T of HDD, so you will probably run short of SSD capacity. Or, looking at it the other way around, with copies on 1 SSD + 3 HDDs, you will only be able to use about 30T out of the 120T HDD capacity.
>>
>> With this replication, the usable storage will be 10T and raw used will be 10T SSD and 30T HDD. If you can't do anything else on the HDD space, you will need more SSDs. If your servers have more free disk slots, you can add SSDs over time until you have at least 40T SSD capacity to balance SSD and HDD capacity.
>>
>> Personally, I think the 1SSD + 3HDD is a good option compared with a cache pool. You have the data security of 3-times replication and, if everything is up, need only 1 copy in the SSD cache, which means that you have 3 times the cache capacity.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: 胡 玮文 <huww98(a)outlook.com>
>> Sent: 25 October 2020 13:40:55
>> To: Alexander E. Patrakov
>> Cc: ceph-users(a)ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>>
>> Yes. This is a limitation of the CRUSH algorithm, in my mind. In order to guard against 2 host failures, I’m going to use 4 replicas, 1 on SSD and 3 on HDD. This will work as intended, right? Because at least I can ensure the 3 HDDs are from different hosts.
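For reference, applying that to a pool would look roughly like this; the pool name and min_size are assumptions:

  ceph osd pool set cephfs_data crush_rule mixed_replicated_rule
  ceph osd pool set cephfs_data size 4
  ceph osd pool set cephfs_data min_size 2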
>>
>>>> On Oct 25, 2020, at 20:04, Alexander E. Patrakov <patrakov(a)gmail.com> wrote:
>>>
>>> On Sun, Oct 25, 2020 at 12:11 PM huww98(a)outlook.com <huww98(a)outlook.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> We are planning a new pool to store our dataset using CephFS. These data are almost read-only (but not guaranteed) and consist of a lot of small files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we will deploy about 10 such nodes. We aim to get the highest read throughput.
>>>>
>>>> If we just use a replicated pool of size 3 on SSD, we should get the best performance; however, that only leaves us 1/3 of the usable SSD space. And EC pools are not friendly to such small-object read workloads, I think.
>>>>
>>>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 data replicas, each on a different host (failure domain): 1 of them on SSD, the other 2 on HDD. And normally every read request is directed to the SSD. So, if every SSD OSD is up, I’d expect the same read throughput as an all-SSD deployment.
>>>>
>>>> I’ve read the documents and did some tests. Here is the crush rule I’m testing with:
>>>>
>>>> rule mixed_replicated_rule {
>>>> id 3
>>>> type replicated
>>>> min_size 1
>>>> max_size 10
>>>> step take default class ssd
>>>> step chooseleaf firstn 1 type host
>>>> step emit
>>>> step take default class hdd
>>>> step chooseleaf firstn -1 type host
>>>> step emit
>>>> }
>>>>
>>>> Now I have the following conclusions, but I’m not very sure:
>>>> * The first OSD produced by CRUSH will be the primary OSD (at least if I don’t change the “primary affinity”). So, the above rule is guaranteed to map an SSD OSD as the primary of each PG, and every read request will be served from the SSD if it is up.
>>>> * It is currently not possible to enforce that the SSD and HDD OSDs are chosen from different hosts. So, if I want to ensure data availability even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of the ideal value of 3, on the pool using the above CRUSH rule.
>>>>
>>>> Am I correct about the above statements? How would this work from your experience? Thanks.
>>>
>>> This works (i.e. guards against host failures) only if you have
>>> strictly separate sets of hosts that have SSDs and that have HDDs.
>>> I.e., there should be no host that has both, otherwise there is a
>>> chance that one hdd and one ssd from that host will be picked.
>>>
>>> --
>>> Alexander E. Patrakov
>>> CV: https://nam10.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpc.cd%2FPL…
Hi, my cluster crashed when one of my DCs went down. 'ceph -s' doesn't show
me the current working status and nothing has changed for a long time. How
can I see what Ceph is really doing?
  cluster:
    health: HEALTH_ERR
            mons fond-beagle,guided-tuna are using a lot of disk space
            1/3 mons down, quorum fond-beagle,guided-tuna
            18/404368 objects unfound (0.004%)
            Reduced data availability: 235 pgs inactive, 72 pgs down, 9 pgs incomplete
            Possible data damage: 3 pgs recovery_unfound
            Degraded data redundancy: 306574/2607020 objects degraded (11.760%), 10 pgs degraded, 10 pgs undersized
            2 pgs not deep-scrubbed in time
            32408 slow ops, oldest one blocked for 62348 sec, daemons [osd.0,osd.10,osd.11,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19]... have slow ops.

  services:
    mon: 3 daemons, quorum fond-beagle,guided-tuna (age 31m), out of quorum: alive-lynx
    mgr: fond-beagle(active, since 31m)
    osd: 52 osds: 28 up (since 30m), 28 in (since 11h); 3 remapped pgs

  data:
    pools:   7 pools, 2305 pgs
    objects: 404.37k objects, 1.7 TiB
    usage:   2.7 TiB used, 22 TiB / 24 TiB avail
    pgs:     6.681% pgs unknown
             3.514% pgs not active
             306574/2607020 objects degraded (11.760%)
             18/404368 objects unfound (0.004%)
             2060 active+clean
             154  unknown
             72   down
             9    incomplete
             7    active+undersized+degraded
             3    active+recovery_unfound+undersized+degraded+remapped
Hi All,
My main CephFS data pool on a Luminous 12.2.10 cluster hit capacity
overnight. Metadata is on a separate pool, which didn't hit capacity, but the
filesystem stopped working, which I'd expect. I increased the OSD full-ratio
to give me some breathing room to get some data deleted once the filesystem
is back online. When I attempt to restart the MDS service, I see the usual
stuff I'd expect in the log, but then:
heartbeat_map is_healthy 'MDSRank' had timed out after 15
Followed by:
mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last
> acked 4.00013s ago); MDS internal heartbeat is not healthy!
Eventually I get:
>
> mds.beacon.hostnamecephssd01 is_laggy 29.372 > 15 since last acked beacon
> mds.0.90884 skipping upkeep work because connection to Monitors appears
> laggy
> mds.hostnamecephssd01 Updating MDS map to version 90885 from mon.0
> mds.beacon.hostnamecephssd01 MDS is no longer laggy
The "MDS is no longer laggy" appears to be where the service fails
Meanwhile a ceph -s is showing:
>
> cluster:
> id: 5c5998fd-dc9b-47ec-825e-beaba66aad11
> health: HEALTH_ERR
> 1 filesystem is degraded
> insufficient standby MDS daemons available
> 67 backfillfull osd(s)
> 11 nearfull osd(s)
> full ratio(s) out of order
> 2 pool(s) backfillfull
> 2 pool(s) nearfull
> 6 scrub errors
> Possible data damage: 5 pgs inconsistent
> services:
> mon: 3 daemons, quorum hostnameceph01,hostnameceph02,hostnameceph03
> mgr: hostnameceph03(active), standbys: hostnameceph02, hostnameceph01
> mds: cephfs-1/1/1 up {0=hostnamecephssd01=up:replay}
> osd: 172 osds: 161 up, 161 in
> data:
> pools: 5 pools, 8384 pgs
> objects: 76.25M objects, 124TiB
> usage: 373TiB used, 125TiB / 498TiB avail
> pgs: 8379 active+clean
> 5 active+clean+inconsistent
> io:
> client: 676KiB/s rd, 0op/s rd, 0op/s w
The 5 pgs inconsistent is not a new issue; those are from past scrubs. I just
haven't gotten around to manually clearing them, although I suppose they
could be related to my issue.
The cluster has no clients connected.
I did notice in the ceph.log that some OSDs on the same host as the MDS
service briefly went down when trying to restart the MDS, but examining the
logs of those particular OSDs isn't showing any glaring issues.
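For reference, the full-ratio bump mentioned above was done along these
lines; the exact value is an assumption:

  ceph osd set-full-ratio 0.97
  ceph osd dump | grep ratio    # verify the resulting ratios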
Full MDS log at debug 5 (can go higher if needed):
2020-10-22 11:27:10.987652 7f6f696f5240 0 set uid:gid to 167:167
(ceph:ceph)
2020-10-22 11:27:10.987669 7f6f696f5240 0 ceph version 12.2.10
(177915764b752804194937482a39e95e0ca3de94) luminous (stable), process
ceph-mds, pid 2022582
2020-10-22 11:27:10.990567 7f6f696f5240 0 pidfile_write: ignore empty
--pid-file
2020-10-22 11:27:11.027981 7f6f62616700 1 mds.hostnamecephssd01 Updating
MDS map to version 90882 from mon.0
2020-10-22 11:27:15.097957 7f6f62616700 1 mds.hostnamecephssd01 Updating
MDS map to version 90883 from mon.0
2020-10-22 11:27:15.097989 7f6f62616700 1 mds.hostnamecephssd01 Map has
assigned me to become a standby
2020-10-22 11:27:15.101071 7f6f62616700 1 mds.hostnamecephssd01 Updating
MDS map to version 90884 from mon.0
2020-10-22 11:27:15.105310 7f6f62616700 1 mds.0.90884 handle_mds_map i am
now mds.0.90884
2020-10-22 11:27:15.105316 7f6f62616700 1 mds.0.90884 handle_mds_map state
change up:boot --> up:replay
2020-10-22 11:27:15.105325 7f6f62616700 1 mds.0.90884 replay_start
2020-10-22 11:27:15.105333 7f6f62616700 1 mds.0.90884 recovery set is
2020-10-22 11:27:15.105344 7f6f62616700 1 mds.0.90884 waiting for osdmap
73745 (which blacklists prior instance)
2020-10-22 11:27:15.149092 7f6f5be09700 0 mds.0.cache creating system
inode with ino:0x100
2020-10-22 11:27:15.149693 7f6f5be09700 0 mds.0.cache creating system
inode with ino:0x1
2020-10-22 11:27:41.021708 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:27:43.029290 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:27:43.029297 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 4.00013s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:27:45.866711 7f6f5fe11700 1 heartbeat_map reset_timeout
'MDSRank' had timed out after 15
2020-10-22 11:28:01.021965 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:03.029862 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:03.029885 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 4.00113s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:06.022033 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:07.029955 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:07.029961 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 8.00126s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:11.022099 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:11.030024 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:11.030028 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 12.0014s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:15.030092 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:15.030099 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 16.0015s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:16.022165 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:19.030163 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:19.030169 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 20.0016s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:21.022231 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:23.030233 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:23.030241 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 24.0008s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:26.022295 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:27.030305 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:27.030311 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 28.0009s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:28.401161 7f6f5fe11700 1 heartbeat_map reset_timeout
'MDSRank' had timed out after 15
2020-10-22 11:28:28.401168 7f6f5fe11700 1 mds.beacon.hostnamecephssd01
is_laggy 29.372 > 15 since last acked beacon
2020-10-22 11:28:28.401177 7f6f5fe11700 1 mds.0.90884 skipping upkeep work
because connection to Monitors appears laggy
2020-10-22 11:28:28.401187 7f6f62616700 1 mds.hostnamecephssd01 Updating
MDS map to version 90885 from mon.0
2020-10-22 11:28:31.659817 7f6f64595700 0 mds.beacon.hostnamecephssd01
MDS is no longer laggy
2020-10-22 11:36:15.880009 7f88ee4ac240 0 set uid:gid to 167:167
(ceph:ceph)
2020-10-22 11:36:15.880026 7f88ee4ac240 0 ceph version 12.2.10
(177915764b752804194937482a39e95e0ca3de94) luminous (stable), process
ceph-mds, pid 2022663
2020-10-22 11:36:15.883118 7f88ee4ac240 0 pidfile_write: ignore empty
--pid-file
2020-10-22 11:36:15.921200 7f88e73cd700 1 mds.hostnamecephssd01 Updating
MDS map to version 90887 from mon.2
2020-10-22 11:36:20.270298 7f88e73cd700 1 mds.hostnamecephssd01 Updating
MDS map to version 90888 from mon.2
2020-10-22 11:36:20.270329 7f88e73cd700 1 mds.hostnamecephssd01 Map has
assigned me to become a standby
2020-10-22 11:36:20.272917 7f88e73cd700 1 mds.hostnamecephssd01 Updating
MDS map to version 90889 from mon.2
2020-10-22 11:36:20.277063 7f88e73cd700 1 mds.0.90889 handle_mds_map i am
now mds.0.90889
2020-10-22 11:36:20.277069 7f88e73cd700 1 mds.0.90889 handle_mds_map state
change up:boot --> up:replay
2020-10-22 11:36:20.277079 7f88e73cd700 1 mds.0.90889 replay_start
2020-10-22 11:36:20.277086 7f88e73cd700 1 mds.0.90889 recovery set is
2020-10-22 11:36:20.277096 7f88e73cd700 1 mds.0.90889 waiting for osdmap
73746 (which blacklists prior instance)
2020-10-22 11:36:20.322318 7f88e0bc0700 0 mds.0.cache creating system
inode with ino:0x100
2020-10-22 11:36:20.322918 7f88e0bc0700 0 mds.0.cache creating system
inode with ino:0x1
2020-10-22 11:36:47.922531 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:36:47.922549 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 4.00013s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:36:50.914516 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:36:51.351457 7f88e4bc8700 1 heartbeat_map reset_timeout
'MDSRank' had timed out after 15
2020-10-22 11:37:07.923089 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:07.923126 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 3.99913s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:10.914767 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:11.923216 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:11.923223 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 7.99926s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:15.914831 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:15.923286 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:15.923294 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 11.9994s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:19.923359 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:19.923366 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 15.9995s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:20.914917 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:23.923430 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:23.923437 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 19.9996s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:25.914981 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:27.923501 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:27.923508 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 23.9998s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:30.915046 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:31.923572 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:31.923579 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 27.9999s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:32.412628 7f88e4bc8700 1 heartbeat_map reset_timeout
'MDSRank' had timed out after 15
2020-10-22 11:37:32.412635 7f88e4bc8700 1 mds.beacon.hostnamecephssd01
is_laggy 28.4889 > 15 since last acked beacon
2020-10-22 11:37:32.412643 7f88e4bc8700 1 mds.0.90889 skipping upkeep work
because connection to Monitors appears laggy
2020-10-22 11:37:32.412657 7f88e73cd700 1 mds.hostnamecephssd01 Updating
MDS map to version 90890 from mon.2
2020-10-22 11:37:35.978858 7f88e934c700 0 mds.beacon.hostnamecephssd01
MDS is no longer laggy
Thanks in advance for any assistance you can provide!
David
For some days now I have been recovering my Ceph cluster. It all started with
OSDs being killed by OOM, so I created a script to delete the corrupted PGs
from the OSDs (I say corrupted because those PGs are the cause of the 100%
RAM usage by the OSDs).
Great, I am almost done with all the OSDs of my cluster, but now the monitors
are consuming all of the servers' RAM, and the managers too. Why? Why do they
use 60 GB of RAM? Is there something to limit that? I have tried configuring
every kind of RAM limit to the minimum.
Hi,
My rgw.buckets.index pool has the cluster in WARN. I'm either not understanding the real issue or I'm making it worse, or both.
OMAP_BYTES: 70461524
OMAP_KEYS: 250874
I thought I'd head this off by deleting rgw objects which would normally get deleted in the near future, but this only seemed to make the values grow. Before I deleted lots of objects, the values were:
OMAP_BYTES: 65450132
OMAP_KEYS: 209843
I read that the default threshold is 200k, but I haven't found the proper way to manage this situation. What reading should I dive into? I could probably craft a command to increase the threshold to clear the warning, but I'm guessing that might not be great long-term.
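For reference, the threshold in question is presumably osd_deep_scrub_large_omap_object_key_threshold, and the usual longer-term fix for a large bucket index is resharding; a sketch only, with the value, bucket name and shard count as assumptions:

  ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 400000   # hides the warning, not the cause
  radosgw-admin bucket reshard --bucket=my-bucket --num-shards=101            # spreads index keys over more shards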
Other errata which might matter:
Size: 3
Pool: nvme
CLASS SIZE AVAIL USED RAW USED %RAW USED
nvme 256 TiB 165 TiB 91 TiB 91 TiB 35.53
Errata: the complete statements:
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
43.d 2 0 0 0 0 70461524 250874 3070 active+clean 36m 185904'456870 185904:1357091 [99,90,48]p99 [99,90,48]p99 2020-10-21 13:53:42.102363 2020-10-21 13:53:42.102363
Thanks!
peter
Peter Eisch
Senior Site Reliability Engineer
T1.612.445.5135
virginpulse.com
Hello everyone,
I recently created a new Ceph 14.2.7 Nautilus cluster. The cluster consists
of 3 racks with 2 OSD nodes in each rack and 12 new HDDs in each node. The
HDD model is TOSHIBA MG07ACA14TE 14 TB. All data pools are EC pools.
Yesterday I decided to increase the PG number on one of the pools with the
command "ceph osd pool set photo.buckets.data pg_num 512". After that, many
OSDs started to crash and go "out" and "down". I tried increasing
recovery_sleep to 1s but the OSDs still crashed. The OSDs started working
properly only when I set the "norecover" flag, but OSD scrub errors appeared
after that.
In the OSD logs during the crashes I found this:
---
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN
E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc:
In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
RecoveryMessages*)'
thread 7f8af535d700 time 2020-10-21 15:12:11.460092
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN
E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc:
648: FAILED ceph_assert(pop.data.length() ==
sinfo.aligned_logical_offset_to_chunk_offset( aft
er_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 1:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x14a) [0x55fc694d6c0f]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 2: (()+0x4dddd7)
[0x55fc694d6dd7]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 3:
(ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
RecoveryMessages*)+0x1740) [0x55fc698cafa0]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 4:
(ECBackend::handle_recovery_read_complete(hobject_t const&,
boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t,
ceph::buffer::v14_2_0::list, std::less<pg_shard_t>,
std::allocator<std::pair<pg_shard_t const, ceph::buffer::v14_2_0::list> >
>, boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type>&, boost::optional<std::map<std::string,
ceph::buffer::v14_2_0::list, std::less<std::string>,
std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> >
> >, RecoveryMessages*)+0x734) [0x55fc698cb804]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 5:
(OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*,
ECBackend::read_result_t&>&)+0x94) [0x55fc698ebbe4]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 6:
(ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c)
[0x55fc698bfdcc]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 7:
(ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
RecoveryMessages*, ZTracer::Trace const&)+0x109c) [0x55fc698d6b8c]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 8:
(ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x17f)
[0x55fc698d718f]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 9:
(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a)
[0x55fc697c18ea]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 10:
(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x5b3) [0x55fc697676b3]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 11:
(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>,
ThreadPool::TPHandle&)+0x362) [0x55fc695b3d72]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 12: (PGOpItem::run(OSD*,
OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62)
[0x55fc698415c2]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 13:
(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f)
[0x55fc695cebbf]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 14:
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6)
[0x55fc69b6f976]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 15:
(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55fc69b72490]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 16: (()+0x7e65)
[0x7f8b1ddede65]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 17: (clone()+0x6d)
[0x7f8b1ccb188d]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: *** Caught signal (Aborted) **
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: in thread 7f8af535d700
thread_name:tp_osd_tp
---
Current EC profile and pool info below:
# ceph osd erasure-code-profile get EC42
crush-device-class=hdd
crush-failure-domain=host
crush-root=main
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
pool 25 'photo.buckets.data' erasure size 6 min_size 4 crush_rule 6
object_hash rjenkins pg_num 512 pgp_num 280 pgp_num_target 512
autoscale_mode warn last_change 43418 lfor 0/0/42223 flags hashpspool
stripe_width 1048576 application rgw
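For reference, a profile like the one above would have been created roughly
like this; a sketch reconstructed from the dump above:

  ceph osd erasure-code-profile set EC42 k=4 m=2 plugin=jerasure \
      technique=reed_sol_van crush-device-class=hdd \
      crush-failure-domain=host crush-root=main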
Current ceph status:
ceph -s
  cluster:
    id:     9ec8d309-a620-4ad8-93fa-c2d111e5256e
    health: HEALTH_ERR
            norecover flag(s) set
            1 pools have many more objects per pg than average
            4542629 scrub errors
            Possible data damage: 6 pgs inconsistent
            Degraded data redundancy: 1207268/578535561 objects degraded (0.209%), 51 pgs degraded, 35 pgs undersized
            85 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum ceph-osd-101,ceph-osd-201,ceph-osd-301 (age 2w)
    mgr: ceph-osd-101(active, since 3w), standbys: ceph-osd-301, ceph-osd-201
    osd: 72 osds: 72 up (since 11h), 72 in (since 21h); 48 remapped pgs
         flags norecover
    rgw: 6 daemons active (ceph-osd-101.rgw0, ceph-osd-102.rgw0, ceph-osd-201.rgw0, ceph-osd-202.rgw0, ceph-osd-301.rgw0, ceph-osd-302.rgw0)

  data:
    pools:   26 pools, 15680 pgs
    objects: 96.46M objects, 124 TiB
    usage:   303 TiB used, 613 TiB / 917 TiB avail
    pgs:     1207268/578535561 objects degraded (0.209%)
             14068769/578535561 objects misplaced (2.432%)
             15290 active+clean
             312   active+recovering
             30    active+undersized+degraded+remapped+backfilling
             21    active+recovering+degraded
             13    active+remapped+backfilling
             6     active+clean+inconsistent
             5     active+recovering+undersized+remapped
             3     active+clean+scrubbing+deep
So now my cluster is stuck and can't recover properly. Can someone give me
some information about this problem? Is it a bug?