Hi,
I have installed all Ceph packages from Sage's repo, namely
ceph ceph-common ceph-mds ceph-mgr-dashboard ceph-mon ceph-osd
libcephfs2 librados2 libradosstriper1 librbd1 librgw2
python-ceph-argparse python-cephfs python-rados python-rbd python-rgw
after adding his repo and executing
apt upgrade
on all MGR nodes 3 hours ago.
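For reference, the steps were roughly (using the repo line quoted further
down in this thread):
  echo "deb https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/ bionic main" \
    > /etc/apt/sources.list.d/ceph-shaman.list
  apt update
  apt upgrade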
I can confirm that since this upgrade no errors have occurred; the MGR is
working, i.e. it is bringing Ceph back to a healthy status.
Currently there are (only) 1000+ blocked slow requests, but compared to
the previous days I would say:
don't worry, be happy.
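(The blocked requests can be inspected with e.g.
  ceph health detail
if anyone needs the details.)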
Regards
Thomas
On 07.11.2019 at 14:33, Sage Weil wrote:
On Thu, 7 Nov 2019, Thomas Schneider wrote:
> Hi,
> I have installed package
> ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb
> manually:
> root@ld5505:/home# dpkg --force-depends -i ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb
> (Reading database ... 107461 files and directories currently installed.)
> Preparing to unpack ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb ...
> Unpacking ceph-mgr (14.2.4-1-gd592e56-1bionic) over (14.2.4-1-gd592e56-1bionic) ...
> dpkg: ceph-mgr: dependency problems, but configuring anyway as you requested:
>  ceph-mgr depends on ceph-base (= 14.2.4-1-gd592e56-1bionic); however:
>   Package ceph-base is not configured yet.
> Setting up ceph-mgr (14.2.4-1-gd592e56-1bionic) ...
>
> Then I restarted ceph-mgr.
> However, there's no effect, i.e. the log entries are still the same.
The ceph-mgr package is sufficient.
Note that the only change on top of 14.2.4 is that the mgr devicehealth
module will scrape OSDs only, not mons.
You can probably/hopefully induce the (previously) bad behavior by
triggering a scrape manually with 'ceph device scrape-health-metrics'?
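(Something like
  ceph device scrape-health-metrics
with no arguments should scrape all devices, or pass a single device id
to scrape just that one.)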
sage
> Or should I install dependencies, namely
> ceph-base_14.2.4-1-gd592e56-1bionic_amd64.deb, too?
> Or any other packages?
>
> Installation from repo fails when using this repo-file:
> root@ld5506:~# more /etc/apt/sources.list.d/ceph-shaman.list
> deb https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/ bionic main
>
> W: Failed to fetch
> https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/dists/bionic…
> 500 Internal Server Error [IP: 147.204.6.136 8080]
> W: Some index files failed to download. They have been ignored, or old
> ones used instead.
>
> Regards
> Thomas
>
> On 07.11.2019 at 10:04, Oliver Freyermuth wrote:
>> Dear Thomas,
>>
>> the most correct thing to do is probably to add the full repo
>> (the original link was still empty for me, but
>> https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/
>> seems to work).
>> The commit itself suggests the ceph-mgr package should be sufficient.
>>
>> I'm still pondering, though, since our cluster is close to production
>> (and for now disk health monitoring is disabled) -
>> but updating only the mgrs should be fine for us too. I hope to
>> have time for the experiment later today ;-).
>>
>> Cheers,
>> Oliver
>>
>> On 07.11.19 at 08:57, Thomas Schneider wrote:
>>> Hi,
>>>
>>> can you please advise which package(s) should be installed?
>>>
>>> Thanks
>>>
>>>
>>>
>>> On 06.11.2019 at 22:28, Sage Weil wrote:
>>>> My current working theory is that the mgr is getting hung up when it
>>>> tries to scrape the device metrics from the mon. The 'tell' mechanism
>>>> used to send mon-targeted commands is pretty kludgey/broken in
>>>> nautilus and earlier. It's been rewritten for octopus, but isn't
>>>> worth backporting--it never really caused problems until the
>>>> devicehealth module started using it heavily.
>>>>
>>>> In any case, this PR just disables scraping of mon devices for
>>>> nautilus:
>>>>
>>>> https://github.com/ceph/ceph/pull/31446
>>>>
>>>> There is a build queued at
>>>>
>>>> https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/d592e56e$
>>>>
>>>> which should get packages in 1-2 hours.
>>>>
>>>> Perhaps you can install that package on the mgr host and try to
>>>> reproduce the problem again?
>>>>
>>>> I noticed a few other oddities in the logs while looking through
>>>> them, like
>>>>
>>>> https://tracker.ceph.com/issues/42666
>>>>
>>>> which will hopefully have a fix ready for 14.2.5. I'm not sure
>>>> about that auth error message, though!
>>>>
>>>> sage
>>>>
>>>>
>>>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
>>>>
>>>>> Dear Sage,
>>>>>
>>>>> good news - it happened again, with debug logs!
>>>>> There's nothing obvious to my eye, it's uploaded as:
>>>>> 0b2d0c09-46f3-4126-aa27-e2d2e8572741
>>>>> It seems the failure roughly coincided with my attempt to access
>>>>> the dashboard. It must have happened within the last ~5-10
>>>>> minutes of the log.
>>>>>
>>>>> I'll now go back to "stable operation"; in case you need anything
>>>>> else, just let me know.
>>>>>
>>>>> Cheers and all the best,
>>>>> Oliver
>>>>>
>>>>> On 02.11.19 at 17:38, Oliver Freyermuth wrote:
>>>>>> Dear Sage,
>>>>>>
>>>>>> at least for the simple case:
>>>>>> ceph device get-health-metrics osd.11
>>>>>> => mgr crashes (but in that case, it crashes fully, i.e. the
>>>>>> process is gone)
>>>>>> I have now uploaded a verbose log as:
>>>>>> ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e
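>>>>>> (For the record, the upload was done with something like
>>>>>> ceph-post-file /var/log/ceph/ceph-mgr.<id>.log
>>>>>> which prints the id to pass along.)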
>>>>>>
>>>>>> One potential cause of this (and maybe the other issues) might be
>>>>>> that some of our OSDs are on non-JBOD controllers and hence are
>>>>>> created as a single-disk RAID 0 per drive,
>>>>>> so a simple smartctl on the device will not work
>>>>>> (-d megaraid,<number> would be needed instead).
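>>>>>> (i.e. something like
>>>>>> smartctl -a -d megaraid,0 /dev/sda
>>>>>> for the first drive behind the controller, rather than plain
>>>>>> smartctl -a /dev/sda).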
>>>>>>
>>>>>> Now I have both mgrs active again, debug logging on, and device
>>>>>> health metrics on again, and am waiting for them to go silent
>>>>>> again. Let's hope the issue reappears before the disks fill up
>>>>>> with logs ;-).
>>>>>>
>>>>>> Cheers,
>>>>>> Oliver
>>>>>>
>>>>>> On 02.11.19 at 02:56, Sage Weil wrote:
>>>>>>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
>>>>>>>> Dear Cephers,
>>>>>>>>
>>>>>>>> interestingly, after:
>>>>>>>> ceph device monitoring off
>>>>>>>> the mgrs seem to be stable now - the active one still went
>>>>>>>> silent a few minutes later, but the standby took over and was
>>>>>>>> stable, and after restarting the broken one, it has now been
>>>>>>>> stable for an hour, too. So probably a restart of the mgr is
>>>>>>>> needed after disabling device monitoring to get things stable again.
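>>>>>>>> (Restarting here means something like
>>>>>>>> systemctl restart ceph-mgr@<id>
>>>>>>>> on the respective mgr host.)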
>>>>>>>>
>>>>>>>> So it seems to be caused by a problem with the device health
>>>>>>>> metrics. In case this is a red herring and the mgrs become
>>>>>>>> unstable again in the next days, I'll let you know.
>>>>>>> If this seems to stabilize things, and you can tolerate inducing
>>>>>>> the failure again, reproducing the problem with mgr logs cranked
>>>>>>> up (debug_mgr = 20, debug_ms = 1) would probably give us a good
>>>>>>> idea of why the mgr is hanging. Let us know!
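>>>>>>> (Something like
>>>>>>> ceph config set mgr debug_mgr 20
>>>>>>> ceph config set mgr debug_ms 1
>>>>>>> should do it, or set them in the [mgr] section of ceph.conf and
>>>>>>> restart the mgr.)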
>>>>>>>
>>>>>>> Thanks,
>>>>>>> sage
>>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Oliver
>>>>>>>>
>>>>>>>> On 01.11.19 at 23:09, Oliver Freyermuth wrote:
>>>>>>>>> Dear Cephers,
>>>>>>>>>
>>>>>>>>> this is a 14.2.4 cluster with device health metrics enabled -
>>>>>>>>> since about a day, all mgr daemons go "silent" on me after a
>>>>>>>>> few hours, i.e. "ceph -s" shows:
>>>>>>>>>
>>>>>>>>>   cluster:
>>>>>>>>>     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>>>>>>>>>     health: HEALTH_WARN
>>>>>>>>>             no active mgr
>>>>>>>>>             1/3 mons down, quorum mon001,mon002
>>>>>>>>>
>>>>>>>>>   services:
>>>>>>>>>     mon: 3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
>>>>>>>>>     mgr: no daemons active (since 56m)
>>>>>>>>>     ...
>>>>>>>>> (the third mon has a planned outage and will come back in a
>>>>>>>>> few days)
>>>>>>>>>
>>>>>>>>> Checking the logs of the mgr daemons, I find some "reset"
>>>>>>>>> messages at the time when they go "silent", first for the
>>>>>>>>> first mgr:
>>>>>>>>>
>>>>>>>>> 2019-11-01 21:34:40.286 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>>>> 2019-11-01 21:34:41.458 7f2e0d59b700 0 client.0 ms_handle_reset on v2:10.160.16.1:6800/401248
>>>>>>>>> 2019-11-01 21:34:42.287 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>>>>
>>>>>>>>> and a bit later, on the standby mgr:
>>>>>>>>>
>>>>>>>>> 2019-11-01 22:18:14.892 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>>>> 2019-11-01 22:18:16.022 7f7be9e72700 0 client.0 ms_handle_reset on v2:10.160.16.2:6800/352196
>>>>>>>>> 2019-11-01 22:18:16.893 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>>>>
>>>>>>>>> Interestingly, the dashboard still works, but presents outdated
>>>>>>>>> information, showing for example zero I/O going on.
>>>>>>>>> I believe this started to happen mainly after the third mon
>>>>>>>>> went into the known downtime, but I am not fully sure if this
>>>>>>>>> was the trigger, since the cluster is still growing.
>>>>>>>>> It may also have been the addition of 24 more OSDs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I also find other messages in the mgr logs which seem
>>>>>>>>> problematic, but I am not sure they are related:
>>>>>>>>> ------------------------------
>>>>>>>>> 2019-11-01 21:17:09.849 7f2df4266700 0 mgr[devicehealth] Error reading OMAP: [errno 22] Failed to operate read op for oid
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
>>>>>>>>>     ioctx.operate_read_op(op, devid)
>>>>>>>>>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
>>>>>>>>>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
>>>>>>>>> InvalidArgumentError: [errno 22] Failed to operate read op for oid
>>>>>>>>> ------------------------------
>>>>>>>>> or:
>>>>>>>>> ------------------------------
>>>>>>>>> 2019-11-01 21:33:53.977 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.51 ()
>>>>>>>>> 2019-11-01 21:33:53.978 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.52 ()
>>>>>>>>> 2019-11-01 21:33:53.979 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.53 ()
>>>>>>>>> ------------------------------
>>>>>>>>>
>>>>>>>>> The reason why I am cautious about the health metrics is that
>>>>>>>>> I observed a crash when trying to query them:
>>>>>>>>> ------------------------------
>>>>>>>>> 2019-11-01 20:21:23.661 7fa46314a700 0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
>>>>>>>>> 2019-11-01 20:21:23.661 7fa46394b700 0 mgr[devicehealth] handle_command
>>>>>>>>> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
>>>>>>>>>  in thread 7fa46394b700 thread_name:mgr-fin
>>>>>>>>>
>>>>>>>>> ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
>>>>>>>>> 1: (()+0xf5f0) [0x7fa488cee5f0]
>>>>>>>>> 2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
>>>>>>>>> 3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>>>> 4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>>>> 5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>>>> 6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
>>>>>>>>> 7: (()+0x709c8) [0x7fa48ae479c8]
>>>>>>>>> 8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>>>>>>> 9: (()+0x5aaa5) [0x7fa48ae31aa5]
>>>>>>>>> 10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>>>>>>> 11: (()+0x4bb95) [0x7fa48ae22b95]
>>>>>>>>> 12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
>>>>>>>>> 13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
>>>>>>>>> 14: (()+0x16c23d) [0x55c3c204023d]
>>>>>>>>> 15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
>>>>>>>>> 16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
>>>>>>>>> 17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
>>>>>>>>> 18: (()+0x7e65) [0x7fa488ce6e65]
>>>>>>>>> 19: (clone()+0x6d) [0x7fa48799488d]
>>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>>>>> ------------------------------
>>>>>>>>>
>>>>>>>>> I have issued:
>>>>>>>>> ceph device monitoring off
>>>>>>>>> for now and will keep waiting to see if the mgrs go silent again.
>>>>>>>>> If there are any better ideas or this issue is known, let me know.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Oliver
>>>>>>>>>
>>>>>>>>>