Btw: after the storm has passed, I highly suggest you consider using jumbo frames.
It works like a charm.
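A minimal sketch of what that involves, assuming the cluster network sits on an interface like eth1 (a placeholder name) and that every node and switch port along the path is raised to the same MTU:

ip link set dev eth1 mtu 9000    # on every node, on the cluster-facing interface
ip -o link show eth1             # confirm the new MTU took effect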
[]'s
Arthur (aKa Guilherme Geronimo)
On 04/09/2019 15:50, Guilherme Geronimo wrote:
I see that you have many inactive PGs, probably because of the 6 OSDs
that are OUT+DOWN.
Problems with "flapping" OSDs I usually solve by (see the sketch below):
* setting the NOUP flag
* restarting the "fragile" OSDs
* checking their logs to confirm everything is OK
* removing the NOUP flag
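A rough sketch of that first recipe, assuming systemd-managed OSDs (replace <id> with the number of the flapping OSD and run the restart on the host that owns it):

ceph osd set noup                  # keep restarted OSDs from being marked up yet
systemctl restart ceph-osd@<id>    # on the OSD's host
journalctl -u ceph-osd@<id> -f     # watch the log until it settles
ceph osd unset noup                # let the OSDs come back up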
Another solution (sketched below as well) is:
* setting the NOIN and NOUP flags
* taking the fragile OSD out
* restarting the "fragile" OSDs
* checking their logs to confirm everything is OK
* removing the NOUP flag
* grabbing a coffee and waiting until all the data has drained
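Roughly the second recipe, under the same assumptions (<id> again being the fragile OSD):

ceph osd set noin
ceph osd set noup
ceph osd out <id>                  # stop placing new data on the fragile OSD
systemctl restart ceph-osd@<id>    # on its host
ceph osd unset noup
ceph -s                            # repeat until recovery/backfill finishes, then enjoy the coffee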
[]'s
Arthur (aKa Guilherme Geronimo)
On 04/09/2019 15:32, Bryan Stillwell wrote:
> We are not using jumbo frames anywhere on this cluster (all mtu
> 1500). The cluster was originally built in October of 2016 and has
> the following history:
>
> 2016-10-04: Created with Hammer (0.94.3)
> 2017-05-03: Upgraded to Hammer (0.94.10)
> 2017-10-09: Upgraded to Jewel (10.2.9)
> 2017-11-02: Upgraded to Jewel (10.2.10)
> 2018-04-30: Upgraded to Luminous (12.2.5)
> 2018-09-05: Upgraded to Luminous (12.2.8)
> 2019-04-05: Upgraded to Luminous (12.2.11)
> 2019-04-18: Upgraded to Luminous (12.2.12)
> 2019-07-26: Upgraded to Nautilus (14.2.2)
>
> It wasn't until after the Nautilus upgrade when this problem started
> showing up.
>
> Here's the output you requested:
>
> [root@a2mon002 ~]# ceph -s
>   cluster:
>     id:     XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
>     health: HEALTH_ERR
>             nodown,norebalance,noscrub,nodeep-scrub flag(s) set
>             1 nearfull osd(s)
>             19 pool(s) nearfull
>             1 scrub errors
>             Reduced data availability: 6014 pgs inactive, 3 pgs down, 5958 pgs peering, 83 pgs stale
>             Possible data damage: 1 pg inconsistent
>             Degraded data redundancy: 1601/81648846 objects degraded (0.002%), 4 pgs degraded, 5 pgs undersized
>             1048 slow requests are blocked > 32 sec
>
>   services:
>     mon: 3 daemons, quorum a2mon002,a2mon003,a2mon004 (age 17m)
>     mgr: a2mon004(active, since 53m), standbys: a2mon003, a2mon002
>     mds: cephfs:2 {0=a2mon004=up:active(laggy or crashed),1=a2mon003=up:active(laggy or crashed)} 1 up:standby
>     osd: 143 osds: 141 up, 137 in; 486 remapped pgs
>          flags nodown,norebalance,noscrub,nodeep-scrub
>
>   data:
>     pools:   20 pools, 6288 pgs
>     objects: 27.22M objects, 98 TiB
>     usage:   308 TiB used, 114 TiB / 422 TiB avail
>     pgs:     0.048% pgs unknown
>              95.611% pgs not active
>              1601/81648846 objects degraded (0.002%)
>              53012/81648846 objects misplaced (0.065%)
>              5379 peering
>              495  remapped+peering
>              269  active+clean
>              75   stale+peering
>              46   activating
>              7    stale+remapped+peering
>              3    unknown
>              3    active+undersized+degraded
>              3    down
>              2    activating+remapped
>              1    activating+undersized
>              1    active+clean+scrubbing
>              1    remapped+inconsistent+peering
>              1    activating+undersized+degraded
>              1    stale+activating
>              1    creating+peering
>
> [root@a2mon002 ~]# ceph versions
> {
>     "mon": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
>     },
>     "osd": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 141
>     },
>     "mds": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 1
>     },
>     "overall": {
>         "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 148
>     }
> }
>
>
> We had seen the slow peering shortly after the Nautilus upgrade, but
> it eventually recovered. We then started filling the cluster up to
> test another Nautilus bug (https://tracker.ceph.com/issues/41255),
> but then a disk started to die (which caused the inconsistent PG).
> When we marked it out we ran into this peering problem again, but it
> seems much worse this time.
>
> Thanks,
> Bryan
>
>> On Sep 4, 2019, at 11:55 AM, Guilherme Geronimo <guilherme.geronimo(a)gmail.com> wrote:
>>
>> Hey Bryan,
>>
>> I suppose all nodes are using jumbo frames (MTU 9000), right?
>> I would suggest checking the OSD->MON communication (a quick sketch below).
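>> Something like this from an OSD host is usually enough to spot an MTU or connectivity problem (<mon-host> is a placeholder; 3300/6789 are the standard msgr2/msgr1 mon ports):
>>
>> ping -c 3 -M do -s 8972 <mon-host>   # full-size jumbo frame, fragmentation not allowed
>> ping -c 3 -M do -s 1472 <mon-host>   # same test for a 1500-byte MTU path
>> nc -zv <mon-host> 3300               # msgr2
>> nc -zv <mon-host> 6789               # msgr1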
>>
>> Can you send us the output of these commands?
>> * ceph -s
>> * ceph versions
>>
>> []'s
>> Arthur (aKa Guilherme Geronimo)
>>
>> On 04/09/2019 14:18, Bryan Stillwell wrote:
>>> Our test cluster is seeing a problem where peering is going
>>> incredibly slow shortly after upgrading it to Nautilus (14.2.2)
>>> from Luminous (12.2.12).
>>>
>>> From what I can tell it seems to be caused by "wait for new map"
>>> taking a long time. When looking at dump_historic_slow_ops on
>>> pretty much any OSD I see stuff like this:
>>>
>>> # ceph daemon osd.112 dump_historic_slow_ops
>>> [...snip...]
>>> {
>>>     "description": "osd_pg_create(e180614 287.4b:177739 287.75:177739 287.1c3:177739 287.1cf:177739 287.1e1:177739 287.2dd:177739 287.2fc:177739 287.342:177739 287.382:177739)",
>>>     "initiated_at": "2019-09-03 15:12:41.366514",
>>>     "age": 4800.8847047119998,
>>>     "duration": 4780.0579745630002,
>>>     "type_data": {
>>>         "flag_point": "started",
>>>         "events": [
>>>             {
>>>                 "time": "2019-09-03 15:12:41.366514",
>>>                 "event": "initiated"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 15:12:41.366514",
>>>                 "event": "header_read"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 15:12:41.366501",
>>>                 "event": "throttled"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 15:12:41.366547",
>>>                 "event": "all_read"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 15:39:03.379456",
>>>                 "event": "dispatched"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 15:39:03.379477",
>>>                 "event": "wait for new map"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 15:39:03.522376",
>>>                 "event": "wait for new map"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 15:53:55.912499",
>>>                 "event": "wait for new map"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 15:59:37.909063",
>>>                 "event": "wait for new map"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 16:00:43.356023",
>>>                 "event": "wait for new map"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 16:20:50.575498",
>>>                 "event": "wait for new map"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 16:31:48.689415",
>>>                 "event": "started"
>>>             },
>>>             {
>>>                 "time": "2019-09-03 16:32:21.424489",
>>>                 "event": "done"
>>>             }
>>>         ]
>>>     }
>>>
>>> It always seems to be in osd_pg_create() with multiple "wait for
>>> new map" messages before it finally does something. What could be
>>> causing it to take so long to get the OSD map? The mons don't appear
>>> to be overloaded in any way.
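>>> A quick way to compare where an OSD sits versus the mons on osdmap epochs (using osd.112 from above purely as an example) would be something like:
>>>
>>> ceph osd dump | head -1        # current epoch according to the mons
>>> ceph daemon osd.112 status     # oldest_map / newest_map this OSD has caught up to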
>>>
>>> Thanks,
>>> Bryan
>