September 2020 - ceph-users

by Ramanathan S

Hi all,

I have just created a Ceph cluster to use CephFS. After creating the CephFS pools and the filesystem, I get the error below.

# ceph osd pool create cephfs_data 128
pool 'cephfs_data' created
# ceph osd pool create cephfs_metadata 128
pool 'cephfs_metadata' created
# ceph fs new cephfs cephfs_metadata cephfs_data
new fs with metadata pool 6 and data pool 5
# ceph -s
  cluster:
    id:     1c27def45-f0f9-494d-sfke-eb4323432fd
    health: HEALTH_ERR
            1 filesystem is offline
            1 filesystem is online with fewer MDS than max_mds

  services:
    mon: 2 daemons, quorum ceph-mon01,ceph-mon02
    mgr: ceph-adm01(active)
    mds: cephfs-0/0/1 up
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   2 pools, 256 pgs
    objects: 0 objects, 0 B
    usage:   12 GiB used, 588 GiB / 600 GiB avail
    pgs:     256 active+clean

But when I check max_mds for the filesystem, it says 1:

# ceph fs get cephfs | grep max_mds
max_mds 1

Can anyone tell me what I am missing here? Any input is much appreciated.

Regards,
Ram
Ceph-explorer..
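One possible reading of the status output, offered as an assumption rather than a confirmed diagnosis: the `mds:` field in `ceph -s` appears to follow the pattern `<fs name>-<up>/<in>/<max_mds> up`. Under that assumption, `cephfs-0/0/1 up` would mean that no MDS daemon is up, although max_mds is 1. The sketch below only parses the string from the post; the function name and the field layout are illustrative assumptions.

```python
import re


def parse_mds_summary(text):
    """Split an 'mds:' summary such as 'cephfs-0/0/1 up' into its parts.

    Assumed layout (not confirmed by the post): <fs>-<up>/<in>/<max_mds> up
    """
    match = re.fullmatch(
        r"(?P<fs>.+)-(?P<up>\d+)/(?P<in_count>\d+)/(?P<max_mds>\d+) up",
        text.strip(),
    )
    if match is None:
        raise ValueError("unrecognised mds summary: %r" % text)
    return {
        "fs": match.group("fs"),
        "up": int(match.group("up")),
        "in": int(match.group("in_count")),
        "max_mds": int(match.group("max_mds")),
    }


summary = parse_mds_summary("cephfs-0/0/1 up")
# True when fewer MDS daemons are up than max_mds asks for.
needs_mds = summary["up"] < summary["max_mds"]
print(summary, needs_mds)
```

If that reading is right, the thing to check would be whether any MDS daemon has been deployed and started at all, since the pools and the filesystem themselves were created without errors.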

3 weeks, 1 day


kernel client osdc ops stuck and mds slow reqs

by Dan van der Ster

Hi all,

We are quite regularly (a couple of times per week) seeing:

HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond to capability release client_id: 52919162
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs

This is caused by osdc ops stuck in a kernel client, e.g.:

10:57:18 root hpc-be028 /root
→ cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
REQUESTS 9 homeless 0
46559317 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
46559322 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
46559323 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559341 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559342 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559345 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559621 osd243 3.6313e8ef 3.8ef [243,330,521]/243 [243,330,521]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.0000007a 0x400014 1 read
46559629 osd243 3.b280c852 3.852 [243,113,539]/243 [243,113,539]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.0000007f 0x400014 1 read
46559928 osd243 3.1ee7bab4 3.ab4 [243,332,94]/243 [243,332,94]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f 0x400024 1 write
LINGER REQUESTS
BACKOFFS

We can unblock those requests by doing `ceph osd down osd.243` (or restarting osd.243).

This is ceph v14.2.6 and the client kernel is el7 3.10.0-957.27.2.el7.x86_64.

Is there a better way to debug this?

Best Regards,
Dan
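When several clients are affected, it may help to see at a glance which OSD the stuck requests are waiting on. The following is a minimal sketch, assuming the osdc debug file has the layout shown in the post (a REQUESTS header, one request per line with the target OSD in the second column, then the LINGER REQUESTS and BACKOFFS sections). The function name is hypothetical and the sample is trimmed to three of the nine requests.

```python
from collections import Counter

# Three of the nine request lines from the post; the header is kept verbatim.
SAMPLE_OSDC = """\
REQUESTS 9 homeless 0
46559317 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
46559323 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559928 osd243 3.1ee7bab4 3.ab4 [243,332,94]/243 [243,332,94]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f 0x400024 1 write
LINGER REQUESTS
BACKOFFS
"""


def count_requests_per_osd(osdc_text):
    """Count in-flight requests per target OSD in osdc debug output."""
    counts = Counter()
    in_requests = False
    for line in osdc_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "REQUESTS":
            in_requests = True      # request lines follow this header
            continue
        if fields[0] in ("LINGER", "BACKOFFS"):
            in_requests = False     # later sections are not counted
            continue
        if in_requests and len(fields) > 1 and fields[1].startswith("osd"):
            counts[fields[1]] += 1
    return counts


per_osd = count_requests_per_osd(SAMPLE_OSDC)
print(per_osd.most_common())
```

In the post every stuck request targets osd243, which matches the observation that marking that OSD down releases them.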

1 year, 2 months


does ceph rgw has any option to limit bandwidth

by Zhenshi Zhou

Hi,

Is there any RADOS Gateway option to limit bandwidth?

2 years, 10 months


Zabbix module Octopus 15.2.3

by Gert Wieberdink

Trying to configure the Zabbix module in Octopus 15.2.3, in a CentOS 8.1 environment. Installed zabbix40-agent for CentOS 8.1 (from the EPEL repository). This also installs zabbix_sender. After enabling the Zabbix module in Ceph, I configured my Zabbix host and Zabbix identifier.

# ceph zabbix config-set zabbix_host <zabbix-fqdn>
# ceph zabbix config-set zabbix_identifier <ident>
# ceph zabbix config-show
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/zabbix/module.py", line 407, in handle_command
    return 0, json.dumps(self.config, index=4, sort_keys=True), ''
  File "/lib64/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
TypeError: __init__() got an unexpected keyword argument 'index'
# ceph -v
ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)
# ceph health detail
HEALTH_OK

Has anyone found a solution?

rgds,
-gw
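The traceback points at the keyword `index=4` passed to `json.dumps`. The standard-library function has no such parameter; the pretty-printing parameter is `indent`, so `index` looks like a typo for it. The sketch below reproduces the failure outside Ceph and shows the call with `indent`. The dictionary contents are hypothetical placeholders, and this is an illustration of the Python behaviour, not the patch that was applied upstream.

```python
import json

# Hypothetical stand-in for the module's configuration dictionary.
config = {"zabbix_host": "zabbix.example.com", "identifier": "ceph-demo"}

# The misspelled keyword is forwarded to the JSON encoder and rejected.
try:
    json.dumps(config, index=4, sort_keys=True)
    misspelled_ok = True
except TypeError as exc:
    misspelled_ok = False
    misspelled_error = str(exc)

# With the documented keyword the same call pretty-prints the dictionary.
pretty = json.dumps(config, indent=4, sort_keys=True)
print(pretty)
```

Until a fixed package is installed, a local workaround along these lines would mean editing line 407 of /usr/share/ceph/mgr/zabbix/module.py by hand and restarting the active mgr, which is an assumption about the deployment rather than something the post confirms.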

2 years, 11 months


Remapped PGs

by David Orman

Hi,

We see that we have 5 'remapped' PGs, but are unclear why/what to do about it. We shifted some target ratios for the autobalancer and it resulted in this state. When adjusting ratio, we noticed two OSDs go down, but we just restarted the container for those OSDs with podman, and they came back up. Here's status output:

###################
root@ceph01:~# ceph status
INFO:cephadm:Inferring fsid x
INFO:cephadm:Inferring config x
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
  cluster:
    id:     41bb9256-c3bf-11ea-85b9-9e07b0435492
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph01,ceph04,ceph02,ceph03,ceph05 (age 2w)
    mgr: ceph03.ytkuyr(active, since 2w), standbys: ceph01.aqkgbl, ceph02.gcglcg, ceph04.smbdew, ceph05.yropto
    osd: 168 osds: 168 up (since 2d), 168 in (since 2d); 5 remapped pgs

  data:
    pools:   3 pools, 1057 pgs
    objects: 18.00M objects, 69 TiB
    usage:   119 TiB used, 2.0 PiB / 2.1 PiB avail
    pgs:     1056 active+clean
             1    active+clean+scrubbing+deep

  io:
    client:   859 KiB/s rd, 212 MiB/s wr, 644 op/s rd, 391 op/s wr

root@ceph01:~#
###################

When I look at ceph pg dump, I don't see any marked as remapped:

###################
root@ceph01:~# ceph pg dump |grep remapped
INFO:cephadm:Inferring fsid x
INFO:cephadm:Inferring config x
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
dumped all
root@ceph01:~#
###################

Any idea what might be going on/how to recover? All OSDs are up. Health is 'OK'. This is Ceph 15.2.4 deployed using Cephadm in containers, on Podman 2.0.3.

3 years, 1 month


Re: mds lost very frequently

by Stefan Kooman

Hi,

After setting:

ceph config set mds mds_recall_max_caps 10000 (5000 before change)

and

ceph config set mds mds_recall_max_decay_rate 1.0 (2.5 before change)

And then:

ceph tell 'mds.*' injectargs '--mds_recall_max_caps 10000'
ceph tell 'mds.*' injectargs '--mds_recall_max_decay_rate 1.0'

our up:active MDS stopped responding and the standby-replay stepped in ... and hit an assert (same as in this thread):

2020-02-06 16:42:16.712 7ff76a528700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2020-02-06 16:42:17.616 7ff76ff1b700  0 mds.beacon.mds2 MDS is no longer laggy
2020-02-06 16:42:20.348 7ff76d716700 -1 /build/ceph-13.2.8/src/mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7ff76d716700 time 2020-02-06 16:42:20.351124
/build/ceph-13.2.8/src/mds/Locker.cc: 5307: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)

 ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7ff7759939de]
 2: (()+0x287b67) [0x7ff775993b67]
 3: (()+0x28a9ea) [0x5585eb2b79ea]
 4: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
 5: (MDSRank::active_start()+0x135) [0x5585eb146be5]
 6: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
 7: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
 8: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
 9: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
 10: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
 12: (()+0x76db) [0x7ff7752846db]
 13: (clone()+0x3f) [0x7ff77446a88f]

2020-02-06 16:42:20.348 7ff76d716700 -1 *** Caught signal (Aborted) **
 in thread 7ff76d716700 thread_name:ms_dispatch

 ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
 1: (()+0x12890) [0x7ff77528f890]
 2: (gsignal()+0xc7) [0x7ff774387e97]
 3: (abort()+0x141) [0x7ff774389801]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7ff775993ae6]
 5: (()+0x287b67) [0x7ff775993b67]
 6: (()+0x28a9ea) [0x5585eb2b79ea]
 7: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
 8: (MDSRank::active_start()+0x135) [0x5585eb146be5]
 9: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
 10: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
 11: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
 12: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
 13: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
 15: (()+0x76db) [0x7ff7752846db]
 16: (clone()+0x3f) [0x7ff77446a88f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Quoting Yan, Zheng (ukernel(a)gmail.com):

> Please try below patch if you can compile ceph from source. If you
> can't compile ceph or the issue still happens, please set debug_mds =
> 10 for standby mds (change debug_mds to 0 after mds becomes active).
>
> Regards
> Yan, Zheng
>
> diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
> index 1e8b024b8a..d1150578f1 100644
> --- a/src/mds/MDSRank.cc
> +++ b/src/mds/MDSRank.cc
> @@ -1454,8 +1454,8 @@ void MDSRank::rejoin_done()
>  void MDSRank::clientreplay_start()
>  {
>    dout(1) << "clientreplay_start" << dendl;
> -  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
>    mdcache->start_files_to_recover();
> +  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
>    queue_one_replay();
>  }
>
> @@ -1487,8 +1487,8 @@ void MDSRank::active_start()
>
>    mdcache->clean_open_file_lists();
>    mdcache->export_remaining_imported_caps();
> -  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
>    mdcache->start_files_to_recover();
> +  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
>
>    mdcache->reissue_all_caps();
>    mdcache->activate_stray_manager();

AFAICT this patch has never been tested and never committed. Do you still think this might fix the issue? Any hints on how we might reproduce this issue (a failing active MDS that then hits this specific recovery scenario)?

We will happily apply this patch and do testing to check if it really fixes the issue.

Gr. Stefan

P.s. For my understanding: the MDS should never stop responding by setting these parameters, right?

--
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info(a)bit.nl

3 years, 2 months


Default data pool in CEPH

by Gabriel Medve

Hi,

I have Ceph 15.2.4 running in Docker. How do I configure it to use a specific data pool? I tried putting the following lines in ceph.conf, but the change has no effect:

[client.myclient]
rbd default data pool = Mydatapool

I need this in order to use an erasure-coded pool with CloudStack. Can you help me? Where is the ceph.conf that I need to configure?

Thanks.

3 years, 2 months


Re: NFS Ganesha NFSv3

by Gabriel Medve

Hi,

Thanks for the reply.

cephadm runs the Ceph containers automatically. How do I set privileged mode in a Ceph container?

--
> On 23/9/20 at 13:24, Daniel Gryniewicz wrote:
>> NFSv3 needs privileges to connect to the portmapper. Try running
>> your docker container in privileged mode, and see if that helps.
>>
>> Daniel
>>
>> On 9/23/20 11:42 AM, Gabriel Medve wrote:
>>> Hi,
>>>
>>> I have a CEPH 15.2.5 running in a docker , i configure nfs ganesha
>>> with nfs version 3 but i can not mount it.
>>> If configure ganesha with nfs version 4 i can mounted without
>>> problems but i need the version 3 .
>>>
>>> The error is mount.nfs: Protocol not supported
>>>
>>> Can help me?
>>>
>>> Thanks.

3 years, 2 months


MDS rejects clients causing hanging mountpoint on linux kernel client

by Florian Pritz

Hi,

We are running a ceph cluster on Ubuntu 18.04 machines with ceph 14.2.4. Our cephfs clients are using the kernel module and we have noticed that some of them are sometimes (at least once) hanging after an MDS restart. The only way to resolve this is to unmount and remount the mountpoint, or reboot the machine if unmounting is not possible.

After some investigation, the problem seems to be that the MDS denies reconnect attempts from some clients during restart even though the reconnect interval is not yet reached. In particular, I see the following log entries. Note that there are supposedly 9 sessions. 9 clients reconnect (one client has two mountpoints) and then two more clients reconnect after the MDS already logged "reconnect_done". These two clients were hanging after the event. The kernel log of one of them is shown below too.

Running `ceph tell mds.0 client ls` after the clients have been rebooted/remounted also shows 11 clients instead of 9.

Do you have any ideas what is wrong here and how it could be fixed? I'm guessing that the issue is that the MDS apparently has an incorrect session count and stops the reconnect process too soon. Is this indeed a bug and if so, do you know what is broken?

Regardless, I also think that the kernel should be able to deal with a denied reconnect and that it should try again later. Yet, even after 10 minutes, the kernel does not attempt to reconnect. Is this a known issue or maybe fixed in newer kernels? If not, is there a chance to get this fixed?

Thanks,
Florian

MDS log:

> 2019-09-26 16:08:27.479 7f9fdde99700  1 mds.0.server reconnect_clients -- 9 sessions
> 2019-09-26 16:08:27.479 7f9fdde99700  0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700  0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700  0 log_channel(cluster) log [DBG] : reconnect by client.21019865 v1:10.1.7.22:0/3752632657 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700  0 log_channel(cluster) log [DBG] : reconnect by client.21020717 v1:10.1.7.115:0/2841046616 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700  0 log_channel(cluster) log [DBG] : reconnect by client.24171153 v1:10.1.7.243:0/1127767158 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700  0 log_channel(cluster) log [DBG] : reconnect by client.23978093 v1:10.1.4.71:0/824226283 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700  0 log_channel(cluster) log [DBG] : reconnect by client.24209569 v1:10.1.4.157:0/1271865906 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700  0 log_channel(cluster) log [DBG] : reconnect by client.20190930 v1:10.1.4.240:0/3195698606 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700  0 log_channel(cluster) log [DBG] : reconnect by client.20190912 v1:10.1.4.146:0/852604154 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700  1 mds.0.59 reconnect_done
> 2019-09-26 16:08:27.483 7f9fdde99700  1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1087700  0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.49:0/1483641729 conn(0x55af50053f80 0x55af50140800 :6801 s=OPENED pgs=21 cs=1 l=0).fault server, going to standby
> 2019-09-26 16:08:27.483 7f9fdde99700  1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.30586072 v1:10.1.67.140:0/3664284158 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1888700  0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.140:0/3664284158 conn(0x55af50055600 0x55af50143000 :6801 s=OPENED pgs=8 cs=1 l=0).fault server, going to standby

Hanging client (10.1.67.49) kernel log:

> 2019-09-26T16:08:27.481676+02:00 hostnamefoo kernel: [708596.227148] ceph: mds0 reconnect start
> 2019-09-26T16:08:27.488943+02:00 hostnamefoo kernel: [708596.233145] ceph: mds0 reconnect denied
> 2019-09-26T16:16:17.541041+02:00 hostnamefoo kernel: [709066.287601] libceph: mds0 10.1.4.203:6801 socket closed (con state NEGOTIATING)
> 2019-09-26T16:16:18.068934+02:00 hostnamefoo kernel: [709066.813064] ceph: mds0 rejected session
> 2019-09-26T16:16:18.068955+02:00 hostnamefoo kernel: [709066.814843] ceph: get_quota_realm: ino (10000000008.fffffffffffffffe) null i_snap_realm
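The comparison described in the post (the announced session count, the clients that reconnected, and the clients that were denied) can be extracted from an MDS log mechanically. The following is a minimal sketch, assuming the log messages look exactly like the ones quoted in the post. The function name is hypothetical and the sample is trimmed to two of the nine accepted reconnects.

```python
import re

# A few of the MDS log lines from the post, kept verbatim.
SAMPLE_LOG = """\
2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.server reconnect_clients -- 9 sessions
2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0
2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0
2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.59 reconnect_done
2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45)
2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.30586072 v1:10.1.67.140:0/3664284158 after 0.00400002 (allowed interval 45)
"""


def summarize_reconnect(log_text):
    """Return (announced sessions, accepted clients, denied clients)."""
    announced = None
    accepted = []
    denied = []
    for line in log_text.splitlines():
        match = re.search(r"reconnect_clients -- (\d+) sessions", line)
        if match:
            announced = int(match.group(1))
            continue
        match = re.search(r"reconnect by (client\.\d+)", line)
        if match:
            accepted.append(match.group(1))
            continue
        match = re.search(r"denied reconnect attempt .* from (client\.\d+)", line)
        if match:
            denied.append(match.group(1))
    return announced, accepted, denied


announced, accepted, denied = summarize_reconnect(SAMPLE_LOG)
print(announced, accepted, denied)
```

Run against the full log from the post, this kind of summary would show 9 announced sessions, 9 accepted reconnects and 2 denied clients, which is the mismatch the post describes.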

3 years, 2 months


diskprediction_local fails with python3-sklearn 0.22.2

by Eric Dold

Hello,

The mgr module diskprediction_local fails under Ubuntu 20.04 (focal) with python3-sklearn version 0.22.2. The Ceph version is 15.2.3.

When the module is enabled I get the following error:

  File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 112, in serve
    self.predict_all_devices()
  File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 279, in predict_all_devices
    result = self._predict_life_expentancy(devInfo['devid'])
  File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 222, in _predict_life_expentancy
    predicted_result = obj_predictor.predict(predict_datas)
  File "/usr/share/ceph/mgr/diskprediction_local/predictor.py", line 457, in predict
    pred = clf.predict(ordered_data)
  File "/usr/lib/python3/dist-packages/sklearn/svm/_base.py", line 585, in predict
    if self.break_ties and self.decision_function_shape == 'ovo':
AttributeError: 'SVC' object has no attribute 'break_ties'

Best Regards
Eric
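One plausible explanation, stated as an assumption since the post does not confirm it: the module loads a model that was serialized with an older scikit-learn, and the installed 0.22.2 code reads an attribute (`break_ties`) that the stored object never had. Deserialization restores an object's saved attributes without running `__init__`, so attributes introduced by newer code are simply missing. The sketch below shows that mechanism with a hypothetical stand-in class; it does not use scikit-learn or the Ceph model files.

```python
class SVCLike:
    """Hypothetical stand-in for a newer estimator class (not scikit-learn code)."""

    def __init__(self, decision_function_shape="ovr", break_ties=False):
        self.decision_function_shape = decision_function_shape
        self.break_ties = break_ties  # attribute added by the "newer" release

    def predict(self, rows):
        # Mirrors the attribute access on the failing line of the traceback.
        if self.break_ties and self.decision_function_shape == "ovo":
            raise ValueError("break_ties is not supported with 'ovo'")
        return [0 for _ in rows]


def restore_without_init(cls, state):
    """Rebuild an object the way default unpickling does: no __init__ call."""
    obj = cls.__new__(cls)
    obj.__dict__.update(state)
    return obj


# State as an "older" release would have saved it: no break_ties entry.
old_state = {"decision_function_shape": "ovr"}
model = restore_without_init(SVCLike, old_state)

try:
    model.predict([[1.0, 2.0]])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

If this is what happens, the usual remedies are to regenerate the stored models with the installed scikit-learn or to run the scikit-learn version the models were created with.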

3 years, 4 months

