-----Original message-----
From: Yan, Zheng <ukernel(a)gmail.com>
Sent: Wed 06-11-2019 14:16
Subject: Re: [ceph-users] mds crash loop
To: Karsten Nielsen <karsten(a)foo-bar.dk>;
CC: ceph-users(a)ceph.io;
> On Wed, Nov 6, 2019 at 4:42 PM Karsten Nielsen <karsten(a)foo-bar.dk> wrote:
> >
> > -----Original message-----
> > From: Yan, Zheng <ukernel(a)gmail.com>
> > Sent: Wed 06-11-2019 08:15
> > Subject: Re: [ceph-users] mds crash loop
> > To: Karsten Nielsen <karsten(a)foo-bar.dk>;
> > CC: ceph-users(a)ceph.io;
> > > On Tue, Nov 5, 2019 at 5:29 PM Karsten Nielsen <karsten(a)foo-bar.dk> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Last week I upgraded my ceph cluster from luminous to mimic 13.2.6.
> > > > It was running fine for a while, but yesterday my mds went into a
> > > > crash loop.
> > > >
> > > > I have 1 active and 1 standby mds for my cephfs, both of which are
> > > > running the same crash loop.
> > > > I am running ceph based on https://hub.docker.com/r/ceph/daemon
> > > > version v3.2.7-stable-3.2-minic-centos-7-x86_64 with an etcd kv store.
> > > >
> > > > Log details are: https://paste.debian.net/1113943/
> > > >
> > >
> > > please try again with debug_mds=20. Thanks
> > >
> > > Yan, Zheng
> >
> > Yes, I have set that and had to move to pastebin.com, as paste.debian.net
> > apparently only supports 150k.
> >
> >
> > https://pastebin.com/Gv7c5h54
> >
>
> Looks like the on-disk root inode is corrupted. Have you encountered
> anything unusual during the upgrade?
>
> Please run 'rados -p <cephfs metadata pool> stat 1.00000000.inode' and
> check whether the object was modified before or after the 'luminous ->
> 13.2.6' upgrade.
The file was modified before the upgrade.
> To fix the corrupted object, run 'cephfs-data-scan init --force-init',
> then restart the mds. After the mds becomes active, run 'ceph daemon
> mds.x scrub_path / force repair'.
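For reference, the full sequence described above would look roughly like this
(a sketch only: 'cephfs_metadata' and 'mds.a' are placeholders for the actual
metadata pool and mds name in this cluster):

  # check when the root inode object was last modified
  rados -p cephfs_metadata stat 1.00000000.inode

  # re-initialise the corrupted root inode metadata
  cephfs-data-scan init --force-init

  # restart the mds; once it is active again, repair the tree from the root
  ceph daemon mds.a scrub_path / force repair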
>
>
> > - Karsten
> >
> > >
> > > > Thanks for any hints
> > > > - Karsten
> > >
> > >
>
>
Hi,
I recently upgraded my 3-node cluster to Proxmox 6 / Debian 10 and
recreated my Ceph cluster with a new release (14.2.4, BlueStore),
basically hoping to gain some I/O speed.
The installation went flawlessly and reading is faster than before
(~80 MB/s); however, the write speed is still really slow (~3.5 MB/s).
I wonder if I can do anything to speed things up?
My hardware is as follows:
3 nodes, each with a Supermicro X8DTT-HIBQF mainboard,
2 OSDs per node (2 TB SATA hard disks, WDC WD2000F9YZ-0),
interconnected via InfiniBand (40 Gbit/s).
The network should be reasonably fast; I measure ~16 Gbit/s with iperf,
so this seems fine.
I use Ceph for RBD only, so my measurement is simply a very basic "dd"
read and write test inside a virtual machine (Debian 8), like the
following:
read:
dd if=/dev/vdb | pv | dd of=/dev/null
-> 80 MB/s
write:
dd if=/dev/zero | pv | dd of=/dev/vdb
-> 3.5 MB/s
When I do the same in the virtual machine on a disk that is on NFS
storage, I get about 30 MB/s for reading and writing.
If I disable the write cache on all OSD disks via "hdparm -W 0
/dev/sdX", I gain a little performance; the write speed is then 4.3 MB/s.
Thanks to help from the list, I plan to install a second Ceph cluster
which is SSD-based (Samsung PM1725b) and should be much faster; however,
I still wonder whether there is any way to speed up my hard-disk-based
cluster?
Thank you in advance for any help,
Best Regards,
Hermann
--
hermann(a)qwer.tk
PGP/GPG: 299893C7 (on keyservers)
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
Well, even after restarting the MGR service, the relevant log is still
full of these error messages:
2019-11-06 17:46:22.363 7f81ffdcc700 0 auth: could not find secret_id=3865
2019-11-06 17:46:22.363 7f81ffdcc700 0 cephx: verify_authorizer could
not get service secret for service mgr secret_id=3865
As you can see, the secret_id changes.
However, I have no idea which service this secret_id belongs to.
In my opinion, these errors are preventing the MGR from doing its job:
bringing the cluster back to a healthy state.
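Two generic checks that sometimes narrow down cephx "could not find secret_id"
messages (a suggestion only, not a diagnosis; stale rotating service keys are
often down to clock skew or mixed daemon versions):

  # check that the monitors agree on the time (rotating cephx keys are time-based)
  ceph time-sync-status

  # make sure all daemons are running the same release
  ceph versions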
On 06.11.2019 at 17:41, Mac Wynkoop wrote:
> I actually just had some unresponsive mgr daemons. If it happens again,
> I'll check whether it's the same error. Restarting them fixed the issue.
> Mac Wynkoop
>
>
>
>
> On Wed, Nov 6, 2019 at 8:43 AM Thomas Schneider <74cmonty(a)gmail.com> wrote:
>
> Hi,
>
> does anybody get these error messages in the MGR log?
> 2019-11-06 15:41:44.765 7f10db740700 0 auth: could not find
> secret_id=3863
> 2019-11-06 15:41:44.765 7f10db740700 0 cephx: verify_authorizer could
> not get service secret for service mgr secret_id=3863
>
>
> THX
>
> On 06.11.2019 at 10:43, Oliver Freyermuth wrote:
> > Hi all,
> >
> > interestingly, now that the third mon has been missing for almost a week
> > (those planned interventions always take longer than expected...),
> > we are getting mgr failovers (but without crashes).
> >
> > In the mgr log, I find:
> >
> > 2019-11-06 07:50:05.409 7fce8a0dc700 0 client.0 ms_handle_reset on
> > v2:10.160.16.1:6800/618072
> > ...
> > ... the normal churning ...
> > ...
> > 2019-11-06 07:52:44.113 7fce8a0dc700 -1 mgr handle_mgr_map I was
> > active but no longer am
> > 2019-11-06 07:52:44.113 7fce8a0dc700 1 mgr respawn e:
> > '/usr/bin/ceph-mgr'
> >
> > In the mon log, I see:
> > ...
> > 2019-11-06 07:44:11.565 7f1f44453700 4 rocksdb:
> > [db/db_impl_files.cc:356] [JOB 225] Try to delete WAL files size
> > 10830909, prev total WAL file size 10839895, number of live WAL
> files 2.
> >
> > 2019-11-06 07:44:11.565 7f1f3a43f700 4 rocksdb:
> > [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction
> > starting
> > 2019-11-06 07:44:11.565 7f1f44c54700 4 rocksdb: (Original Log Time
> > 2019/11/06-07:44:11.565802) [db/db_impl_compaction_flush.cc:2374]
> > [default] Manual compaction from level-0 to level-6 from 'mgrstat ..
> > 'mgrstat; will stop at (end)
> > ...
> > 2019-11-06 07:50:36.734 7f1f3a43f700 4 rocksdb:
> > [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction
> > starting
> > 2019-11-06 07:52:27.046 7f1f4144d700 0 log_channel(cluster) log
> [INF]
> > : Manager daemon mon001 is unresponsive, replacing it with standby
> > daemon mon002
> > ...
> >
> > There's a lot of compaction going on (probably due to the prolonged
> > HEALTH_WARN state, so not really unexpected), so I wonder whether the
> > actual cause for flagging the mgr as "unresponsive" is the heavy
> > compaction on the mons.
> > It will be interesting to see what happens when we finally have the
> > third mon back and the cluster becomes healthy again...
> >
> > Has anybody seen something similar after running for a week or more
> > with Nautilus on old and slow hardware?
> >
> > Cheers,
> > Oliver
> >
> > On 02.11.19 at 18:20, Oliver Freyermuth wrote:
> >> Dear Sage,
> >>
> >> good news - it happened again, with debug logs!
> >> There's nothing obvious to my eye, it's uploaded as:
> >> 0b2d0c09-46f3-4126-aa27-e2d2e8572741
> >> It seems the failure happened roughly in parallel with me trying to
> >> access the dashboard. It must have happened within the last ~5-10
> >> minutes of the log.
> >>
> >> I'll now go back to "stable operation", in case you need anything
> >> else, just let me know.
> >>
> >> Cheers and all the best,
> >> Oliver
> >>
> >> On 02.11.19 at 17:38, Oliver Freyermuth wrote:
> >>> Dear Sage,
> >>>
> >>> at least for the simple case:
> >>> ceph device get-health-metrics osd.11
> >>> => mgr crashes (but in that case, it crashes fully, i.e. the process
> >>> is gone)
> >>> I have now uploaded a verbose log as:
> >>> ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e
> >>>
> >>> One potential cause of this (and maybe the other issues) might be that
> >>> some of our OSDs sit behind non-JBOD controllers and are therefore built
> >>> as one RAID 0 per disk, so a simple smartctl on the device will not work
> >>> (-d megaraid,<number> would be needed instead; see the sketch below).
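For illustration only (the device name and array index below are made up;
adjust them to the actual controller layout), a manual SMART query against
such a disk would look roughly like:

  # list the devices smartctl can see, including the -d type it would use
  smartctl --scan

  # example: first physical disk behind a MegaRAID controller exposed as /dev/sda
  smartctl -a -d megaraid,0 /dev/sda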
> >>>
> >>> Now I have both mgrs active again, debug logging on, and device health
> >>> metrics on again, and am waiting for them to go silent again. Let's hope
> >>> the issue reappears before the disks fill up with logs ;-).
> >>>
> >>> Cheers,
> >>> Oliver
> >>>
> >>> On 02.11.19 at 02:56, Sage Weil wrote:
> >>>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
> >>>>> Dear Cephers,
> >>>>>
> >>>>> interestingly, after:
> >>>>> ceph device monitoring off
> >>>>> the mgrs seem to be stable now - the active one still went silent
> >>>>> a few minutes later, but the standby took over and was stable, and
> >>>>> after restarting the broken one, it has been stable for an hour, too.
> >>>>> So a restart of the mgr is probably needed after disabling device
> >>>>> monitoring to get things stable again.
> >>>>>
> >>>>> So it seems to be caused by a problem with the device health
> >>>>> metrics. In case this is a red herring and the mgrs become unstable
> >>>>> again in the next days,
> >>>>> I'll let you know.
> >>>>
> >>>> If this seems to stabilize things, and you can tolerate inducing the
> >>>> failure again, reproducing the problem with mgr logs cranked up
> >>>> (debug_mgr = 20, debug_ms = 1) would probably give us a good idea of
> >>>> why the mgr is hanging. Let us know!
> >>>>
> >>>> Thanks,
> >>>> sage
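For anyone following along, one way to raise those settings at runtime (a
sketch; 'mgr.mon001' is just the daemon name used elsewhere in this thread,
and the overrides should be reverted once the logs are captured):

  # cluster-wide for all mgr daemons, via the config database (Nautilus)
  ceph config set mgr debug_mgr 20
  ceph config set mgr debug_ms 1

  # or on a single running daemon via its admin socket (run on that mgr's host)
  ceph daemon mgr.mon001 config set debug_mgr 20
  ceph daemon mgr.mon001 config set debug_ms 1

  # afterwards, drop the overrides again
  ceph config rm mgr debug_mgr
  ceph config rm mgr debug_ms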
> >>>>
> >>>> >
> >>>>> Cheers,
> >>>>> Oliver
> >>>>>
> >>>>> On 01.11.19 at 23:09, Oliver Freyermuth wrote:
> >>>>>> Dear Cephers,
> >>>>>>
> >>>>>> this is a 14.2.4 cluster with device health metrics enabled - for
> >>>>>> about a day now, all mgr daemons have been going "silent" on me
> >>>>>> after a few hours, i.e. "ceph -s" shows:
> >>>>>>
> >>>>>> cluster:
> >>>>>> id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
> >>>>>> health: HEALTH_WARN
> >>>>>> no active mgr
> >>>>>> 1/3 mons down, quorum mon001,mon002
> >>>>>> services:
> >>>>>> mon: 3 daemons, quorum mon001,mon002 (age 57m), out
> >>>>>> of quorum: mon003
> >>>>>> mgr: no daemons active (since 56m)
> >>>>>> ...
> >>>>>> (the third mon has a planned outage and will come back in a few
> >>>>>> days)
> >>>>>>
> >>>>>> Checking the logs of the mgr daemons, I find some "reset"
> >>>>>> messages at the time when it goes "silent", first for the first mgr:
> >>>>>>
> >>>>>> 2019-11-01 21:34:40.286 7f2df6a6b700 0
> log_channel(cluster) log
> >>>>>> [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data,
> >>>>>> 2.3 TiB used, 136 TiB / 138 TiB avail
> >>>>>> 2019-11-01 21:34:41.458 7f2e0d59b700 0 client.0
> ms_handle_reset
> >>>>>> on v2:10.160.16.1:6800/401248
> >>>>>> 2019-11-01 21:34:42.287 7f2df6a6b700 0
> log_channel(cluster) log
> >>>>>> [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data,
> >>>>>> 2.3 TiB used, 136 TiB / 138 TiB avail
> >>>>>>
> >>>>>> and a bit later, on the standby mgr:
> >>>>>>
> >>>>>> 2019-11-01 22:18:14.892 7f7bcc8ae700 0
> log_channel(cluster) log
> >>>>>> [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858
> >>>>>> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3
> >>>>>> TiB used, 136 TiB / 138 TiB avail
> >>>>>> 2019-11-01 22:18:16.022 7f7be9e72700 0 client.0
> ms_handle_reset
> >>>>>> on v2:10.160.16.2:6800/352196
> >>>>>> 2019-11-01 22:18:16.893 7f7bcc8ae700 0
> log_channel(cluster) log
> >>>>>> [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858
> >>>>>> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3
> >>>>>> TiB used, 136 TiB / 138 TiB avail
> >>>>>>
> >>>>>> Interestingly, the dashboard still works, but presents outdated
> >>>>>> information, showing for example zero I/O going on.
> >>>>>> I believe this started to happen mainly after the third mon went
> >>>>>> into the known downtime, but I am not fully sure this was the
> >>>>>> trigger, since the cluster is still growing.
> >>>>>> It may also have been the addition of 24 more OSDs.
> >>>>>>
> >>>>>>
> >>>>>> I also find other messages in the mgr logs which seem
> >>>>>> problematic, but I am not sure they are related:
> >>>>>> ------------------------------
> >>>>>> 2019-11-01 21:17:09.849 7f2df4266700 0 mgr[devicehealth] Error
> >>>>>> reading OMAP: [errno 22] Failed to operate read op for oid
> >>>>>> Traceback (most recent call last):
> >>>>>> File "/usr/share/ceph/mgr/devicehealth/module.py", line 396,
> >>>>>> in put_device_metrics
> >>>>>> ioctx.operate_read_op(op, devid)
> >>>>>> File "rados.pyx", line 516, in
> >>>>>> rados.requires.wrapper.validate_func
> >>>>>>
> (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
> >>>>>> D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
> >>>>>> File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op
> >>>>>>
> (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
> >>>>>> InvalidArgumentError: [errno 22] Failed to operate read op
> for oid
> >>>>>> ------------------------------
> >>>>>> or:
> >>>>>> ------------------------------
> >>>>>> 2019-11-01 21:33:53.977 7f7bd38bc700 0 mgr[devicehealth]
> Fail to
> >>>>>> parse JSON result from daemon osd.51 ()
> >>>>>> 2019-11-01 21:33:53.978 7f7bd38bc700 0 mgr[devicehealth]
> Fail to
> >>>>>> parse JSON result from daemon osd.52 ()
> >>>>>> 2019-11-01 21:33:53.979 7f7bd38bc700 0 mgr[devicehealth]
> Fail to
> >>>>>> parse JSON result from daemon osd.53 ()
> >>>>>> ------------------------------
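A hedged way to look at what the devicehealth module actually stored (the pool
name below, device_health_metrics, is the module's default in Nautilus; the
object names in it are device ids chosen by the module, not OSD numbers):

  # list the objects the devicehealth module wrote, one per device id
  rados -p device_health_metrics ls

  # dump the per-device OMAP entries for one of them (<devid> is a placeholder)
  rados -p device_health_metrics listomapvals <devid>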
> >>>>>>
> >>>>>> The reason why I am cautious about the health metrics is that I
> >>>>>> observed a crash when trying to query them:
> >>>>>> ------------------------------
> >>>>>> 2019-11-01 20:21:23.661 7fa46314a700 0 log_channel(audit) log
> >>>>>> [DBG] : from='client.174136 -' entity='client.admin'
> >>>>>> cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11",
> >>>>>> "target": ["mgr", ""]}]: dispatch
> >>>>>> 2019-11-01 20:21:23.661 7fa46394b700 0 mgr[devicehealth]
> >>>>>> handle_command
> >>>>>> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal
> >>>>>> (Segmentation fault) **
> >>>>>> in thread 7fa46394b700 thread_name:mgr-fin
> >>>>>>
> >>>>>> ceph version 14.2.4
> (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
> >>>>>> nautilus (stable)
> >>>>>> 1: (()+0xf5f0) [0x7fa488cee5f0]
> >>>>>> 2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
> >>>>>> 3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>> 4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>> 5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>> 6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
> >>>>>> 7: (()+0x709c8) [0x7fa48ae479c8]
> >>>>>> 8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >>>>>> 9: (()+0x5aaa5) [0x7fa48ae31aa5]
> >>>>>> 10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >>>>>> 11: (()+0x4bb95) [0x7fa48ae22b95]
> >>>>>> 12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
> >>>>>> 13: (ActivePyModule::handle_command(std::map<std::string,
> >>>>>> boost::variant<std::string, bool, long, double,
> >>>>>> std::vector<std::string, std::allocator<std::string> >,
> >>>>>> std::vector<long, std::allocator<long> >, std::vector<double,
> >>>>>> std::allocator<double> > >, std::less<void>,
> >>>>>> std::allocator<std::pair<std::string const,
> >>>>>> boost::variant<std::string, bool, long, double,
> >>>>>> std::vector<std::string, std::allocator<std::string> >,
> >>>>>> std::vector<long, std::allocator<long> >, std::vector<double,
> >>>>>> std::allocator<double> > > > > > const&,
> >>>>>> ceph::buffer::v14_2_0::list const&,
> std::basic_stringstream<char,
> >>>>>> std::char_traits<char>, std::allocator<char> >*,
> >>>>>> std::basic_stringstream<char, std::char_traits<char>,
> >>>>>> std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
> >>>>>> 14: (()+0x16c23d) [0x55c3c204023d]
> >>>>>> 15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
> >>>>>> 16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
> >>>>>> 17: (Finisher::finisher_thread_entry()+0x156)
> [0x7fa48b439cc6]
> >>>>>> 18: (()+0x7e65) [0x7fa488ce6e65]
> >>>>>> 19: (clone()+0x6d) [0x7fa48799488d]
> >>>>>> NOTE: a copy of the executable, or `objdump -rdS
> <executable>`
> >>>>>> is needed to interpret this.
> >>>>>> ------------------------------
> >>>>>>
> >>>>>> I have issued:
> >>>>>> ceph device monitoring off
> >>>>>> for now and will keep waiting to see if the mgrs go silent again.
> >>>>>> If there are any better ideas or this issue is known, let me know.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Oliver
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >>
> >
> >
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
-----Original message-----
From: Yan, Zheng <ukernel(a)gmail.com>
Sent: Wed 06-11-2019 08:15
Subject: Re: [ceph-users] mds crash loop
To: Karsten Nielsen <karsten(a)foo-bar.dk>;
CC: ceph-users(a)ceph.io;
> On Tue, Nov 5, 2019 at 5:29 PM Karsten Nielsen <karsten(a)foo-bar.dk> wrote:
> >
> > Hi,
> >
> > Last week I upgraded my ceph cluster from luminous to mimic 13.2.6.
> > It was running fine for a while, but yesterday my mds went into a crash loop.
> >
> > I have 1 active and 1 standby mds for my cephfs, both of which are running
> > the same crash loop.
> > I am running ceph based on https://hub.docker.com/r/ceph/daemon version
> > v3.2.7-stable-3.2-minic-centos-7-x86_64 with an etcd kv store.
> >
> > Log details are: https://paste.debian.net/1113943/
> >
>
> please try again with debug_mds=20. Thanks
>
> Yan, Zheng
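For completeness, in Mimic that debug level can be raised either cluster-wide
through the config database or on the running daemon; a sketch, with 'mds.a'
as a placeholder for the actual mds name:

  # cluster-wide for all mds daemons
  ceph config set mds debug_mds 20

  # or on one running daemon via its admin socket
  ceph daemon mds.a config set debug_mds 20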
Yes, I have set that and had to move to pastebin.com, as paste.debian.net apparently only supports 150k.
https://pastebin.com/Gv7c5h54
- Karsten
>
> > Thanks for any hints
> > - Karsten
>
>
Hi,
Last week I upgraded my ceph cluster from luminous to mimic 13.2.6.
It was running fine for a while, but yesterday my mds went into a crash loop.
I have 1 active and 1 standby mds for my cephfs, both of which are running the same crash loop.
I am running ceph based on https://hub.docker.com/r/ceph/daemon version v3.2.7-stable-3.2-minic-centos-7-x86_64 with an etcd kv store.
Log details are: https://paste.debian.net/1113943/
Thanks for any hints
- Karsten
Hi,
I'm trying Ceph for the first time and am using the repository below:
deb https://download.ceph.com/debian-nautilus/ stretch main
But it seems that this repository only has the ceph-deploy package,
not the rest of Ceph.
Why is that? How can I get all of the updated Nautilus packages?
Regards,
Rodrigo Severo
pgs: 14.377% pgs not active
3749681/537818808 objects misplaced (0.697%)
810 active+clean
156 down
124 active+remapped+backfilling
1 active+remapped+backfill_toofull
1 down+inconsistent
When looking at the down PGs, all disks are online.
41.3db 53775 0 0 0 401643186092 0
0 3044 down 6m 161222'303144 162913:4630171
[32,96,128,115,86,129,113,124,57,109]p32
[32,96,128,115,86,129,113,124,57,109]p32 2019-11-03
Any way to see why the PG is down?
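A hedged starting point (the PG id is the one from the listing above; jq is
only used for readability): query the PG itself and look at recovery_state,
which usually names the OSDs or events it is blocked on.

  # ask the PG why it is down; "recovery_state" lists what it is waiting for
  ceph pg 41.3db query | jq '.recovery_state'

  # summary of down/inconsistent PGs and any blocking OSDs
  ceph health detail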
Hi All,
I'm using ceph-14.2.4 and testing on a FIPS-enabled cluster. Downloading
objects works, but Ceph raises a segmentation fault while uploading.
Please help me here, and please describe the debugging steps so that I can
reproduce this in a development environment.
Thanks,
Amit G