June 2020 - ceph-users - lists.ceph.io

by Ramanathan S

Hi all,

I have just created a Ceph cluster to use CephFS. When I create the CephFS pools and the filesystem, I get the error below.

    # ceph osd pool create cephfs_data 128
    pool 'cephfs_data' created
    # ceph osd pool create cephfs_metadata 128
    pool 'cephfs_metadata' created
    # ceph fs new cephfs cephfs_metadata cephfs_data
    new fs with metadata pool 6 and data pool 5
    # ceph -s
      cluster:
        id:     1c27def45-f0f9-494d-sfke-eb4323432fd
        health: HEALTH_ERR
                1 filesystem is offline
                1 filesystem is online with fewer MDS than max_mds

      services:
        mon: 2 daemons, quorum ceph-mon01,ceph-mon02
        mgr: ceph-adm01(active)
        mds: cephfs-0/0/1 up
        osd: 12 osds: 12 up, 12 in

      data:
        pools:   2 pools, 256 pgs
        objects: 0 objects, 0 B
        usage:   12 GiB used, 588 GiB / 600 GiB avail
        pgs:     256 active+clean

But when I check max_mds for the filesystem, it says 1:

    # ceph fs get cephfs | grep max_mds
    max_mds 1

Can anyone tell me what I am missing here? Any input is much appreciated.

Regards,
Ram
Ceph-explorer..


kernel client osdc ops stuck and mds slow reqs

by Dan van der Ster

Hi all,

We are quite regularly (a couple of times per week) seeing:

    HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
    MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
        mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond to capability release client_id: 52919162
    MDS_SLOW_REQUEST 1 MDSs report slow requests
        mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs

This is caused by osdc ops stuck in a kernel client, e.g.:

    10:57:18 root hpc-be028 /root → cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
    REQUESTS 9 homeless 0
    46559317 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
    46559322 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
    46559323 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
    46559341 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
    46559342 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
    46559345 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
    46559621 osd243 3.6313e8ef 3.8ef [243,330,521]/243 [243,330,521]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.0000007a 0x400014 1 read
    46559629 osd243 3.b280c852 3.852 [243,113,539]/243 [243,113,539]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.0000007f 0x400014 1 read
    46559928 osd243 3.1ee7bab4 3.ab4 [243,332,94]/243 [243,332,94]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f 0x400024 1 write
    LINGER REQUESTS
    BACKOFFS

We can unblock those requests by doing `ceph osd down osd.243` (or restarting osd.243).

This is ceph v14.2.6 and the client kernel is el7 3.10.0-957.27.2.el7.x86_64.

Is there a better way to debug this?

Best Regards,
Dan
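When a dump like the one above has many entries, it can help to summarise which OSD the stuck requests are waiting on. The following is only a rough sketch: the field layout (numeric tid first, `osd<N>` second, operation name last) is inferred from the sample lines in this message, not from kernel documentation, and the function name is made up for illustration.

```python
from collections import Counter

# Three request lines copied from the osdc dump in this message.
SAMPLE = """\
REQUESTS 9 homeless 0
46559317 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
46559322 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
46559928 osd243 3.1ee7bab4 3.ab4 [243,332,94]/243 [243,332,94]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f 0x400024 1 write
LINGER REQUESTS
BACKOFFS
"""


def stuck_requests_per_osd(osdc_text):
    """Count in-flight requests per (osd, operation) pair.

    Assumed layout (inferred from the sample only): request lines begin
    with a numeric tid, followed by "osd<N>", and end with the op name.
    Section headers such as "REQUESTS 9 homeless 0" are skipped.
    """
    counts = Counter()
    for line in osdc_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].startswith("osd"):
            counts[(fields[1], fields[-1])] += 1
    return counts


for (osd, op), n in sorted(stuck_requests_per_osd(SAMPLE).items()):
    print(f"{osd}: {n} pending {op} request(s)")
```

On a client, the input would be the contents of the `osdc` file under `/sys/kernel/debug/ceph/`, read with root privileges.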


Zabbix module Octopus 15.2.3

by Gert Wieberdink

I am trying to configure the Zabbix module in Octopus 15.2.3 in a CentOS 8.1 environment. I installed zabbix40-agent for CentOS 8.1 (from the EPEL repository), which also installs zabbix_sender. After enabling the Zabbix module in Ceph, I configured my Zabbix host and Zabbix identifier.

    # ceph zabbix config-set zabbix_host <zabbix-fqdn>
    # ceph zabbix config-set zabbix_identifier <ident>
    # ceph zabbix config-show
    Error EINVAL: Traceback (most recent call last):
      File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in _handle_command
        return self.handle_command(inbuf, cmd)
      File "/usr/share/ceph/mgr/zabbix/module.py", line 407, in handle_command
        return 0, json.dumps(self.config, index=4, sort_keys=True), ''
      File "/lib64/python3.6/json/__init__.py", line 238, in dumps
        **kw).encode(obj)
    TypeError: __init__() got an unexpected keyword argument 'index'
    # ceph -v
    ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)
    # ceph health detail
    HEALTH_OK

Has anyone found a solution?

rgds,
-gw
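The traceback points at the keyword `index=4` in the `json.dumps` call. Python's `json.dumps` has no `index` parameter; the pretty-printing parameter is `indent`, so this looks like a typo in the module. The sketch below reproduces the error outside Ceph and shows the likely intended call. The `config` dictionary and the two function names are hypothetical stand-ins, not the module's actual code.

```python
import json

# Hypothetical stand-in for the module's configuration dictionary.
config = {"zabbix_host": "zabbix.example.com", "identifier": "ceph-test"}


def show_config_broken(cfg):
    # Same keyword as in the traceback: json.dumps rejects unknown keywords.
    return json.dumps(cfg, index=4, sort_keys=True)


def show_config_fixed(cfg):
    # Likely intended call: "indent" controls pretty-printing.
    return json.dumps(cfg, indent=4, sort_keys=True)


try:
    show_config_broken(config)
except TypeError as exc:
    print("broken call raised TypeError:", exc)

print(show_config_fixed(config))
```

If that reading is right, changing `index` to `indent` on the line of `module.py` named in the traceback would be a local workaround until a fixed package is available.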


Re: mds lost very frequently

by Stefan Kooman

Hi,

After setting:

    ceph config set mds mds_recall_max_caps 10000        (5000 before change)
    ceph config set mds mds_recall_max_decay_rate 1.0    (2.5 before change)

and then:

    ceph tell 'mds.*' injectargs '--mds_recall_max_caps 10000'
    ceph tell 'mds.*' injectargs '--mds_recall_max_decay_rate 1.0'

our up:active MDS stopped responding and the standby-replay stepped in ... and hit an assert (same as in this thread):

    2020-02-06 16:42:16.712 7ff76a528700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
    2020-02-06 16:42:17.616 7ff76ff1b700 0 mds.beacon.mds2 MDS is no longer laggy
    2020-02-06 16:42:20.348 7ff76d716700 -1 /build/ceph-13.2.8/src/mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7ff76d716700 time 2020-02-06 16:42:20.351124
    /build/ceph-13.2.8/src/mds/Locker.cc: 5307: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)

     ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7ff7759939de]
     2: (()+0x287b67) [0x7ff775993b67]
     3: (()+0x28a9ea) [0x5585eb2b79ea]
     4: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
     5: (MDSRank::active_start()+0x135) [0x5585eb146be5]
     6: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
     7: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
     8: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
     9: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
     10: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
     11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
     12: (()+0x76db) [0x7ff7752846db]
     13: (clone()+0x3f) [0x7ff77446a88f]

    2020-02-06 16:42:20.348 7ff76d716700 -1 *** Caught signal (Aborted) **
     in thread 7ff76d716700 thread_name:ms_dispatch

     ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
     1: (()+0x12890) [0x7ff77528f890]
     2: (gsignal()+0xc7) [0x7ff774387e97]
     3: (abort()+0x141) [0x7ff774389801]
     4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7ff775993ae6]
     5: (()+0x287b67) [0x7ff775993b67]
     6: (()+0x28a9ea) [0x5585eb2b79ea]
     7: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
     8: (MDSRank::active_start()+0x135) [0x5585eb146be5]
     9: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
     10: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
     11: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
     12: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
     13: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
     14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
     15: (()+0x76db) [0x7ff7752846db]
     16: (clone()+0x3f) [0x7ff77446a88f]
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Quoting Yan, Zheng (ukernel(a)gmail.com):

> Please try below patch if you can compile ceph from source. If you
> can't compile ceph or the issue still happens, please set debug_mds =
> 10 for standby mds (change debug_mds to 0 after mds becomes active).
>
> Regards
> Yan, Zheng
>
> diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
> index 1e8b024b8a..d1150578f1 100644
> --- a/src/mds/MDSRank.cc
> +++ b/src/mds/MDSRank.cc
> @@ -1454,8 +1454,8 @@ void MDSRank::rejoin_done()
>  void MDSRank::clientreplay_start()
>  {
>    dout(1) << "clientreplay_start" << dendl;
> -  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
>    mdcache->start_files_to_recover();
> +  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
>    queue_one_replay();
>  }
>
> @@ -1487,8 +1487,8 @@ void MDSRank::active_start()
>
>    mdcache->clean_open_file_lists();
>    mdcache->export_remaining_imported_caps();
> -  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
>    mdcache->start_files_to_recover();
> +  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
>
>    mdcache->reissue_all_caps();
>    mdcache->activate_stray_manager();

AFAICT this patch has never been tested and never committed. Do you still think this might fix the issue? Do you have any hints on how we might reproduce this issue (a failing active MDS hitting this specific recovery scenario)? We will happily apply this patch and do testing to check whether it really fixes the issue.

Gr. Stefan

P.s. For my understanding: the MDS should never stop responding by setting these parameters, right?

-- 
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info(a)bit.nl


MDS rejects clients causing hanging mountpoint on linux kernel client

by Florian Pritz

Hi,

We are running a ceph cluster on Ubuntu 18.04 machines with ceph 14.2.4. Our cephfs clients are using the kernel module and we have noticed that some of them are sometimes (at least once) hanging after an MDS restart. The only way to resolve this is to unmount and remount the mountpoint, or reboot the machine if unmounting is not possible.

After some investigation, the problem seems to be that the MDS denies reconnect attempts from some clients during restart even though the reconnect interval is not yet reached. In particular, I see the following log entries. Note that there are supposedly 9 sessions. 9 clients reconnect (one client has two mountpoints) and then two more clients reconnect after the MDS already logged "reconnect_done". These two clients were hanging after the event. The kernel log of one of them is shown below too.

Running `ceph tell mds.0 client ls` after the clients have been rebooted/remounted also shows 11 clients instead of 9.

Do you have any ideas what is wrong here and how it could be fixed? I'm guessing that the issue is that the MDS apparently has an incorrect session count and stops the reconnect process too soon. Is this indeed a bug and if so, do you know what is broken?

Regardless, I also think that the kernel should be able to deal with a denied reconnect and that it should try again later. Yet, even after 10 minutes, the kernel does not attempt to reconnect. Is this a known issue or maybe fixed in newer kernels? If not, is there a chance to get this fixed?

Thanks,
Florian

MDS log:

> 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.server reconnect_clients -- 9 sessions
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21019865 v1:10.1.7.22:0/3752632657 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21020717 v1:10.1.7.115:0/2841046616 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24171153 v1:10.1.7.243:0/1127767158 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.23978093 v1:10.1.4.71:0/824226283 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24209569 v1:10.1.4.157:0/1271865906 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190930 v1:10.1.4.240:0/3195698606 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190912 v1:10.1.4.146:0/852604154 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.59 reconnect_done
> 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1087700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.49:0/1483641729 conn(0x55af50053f80 0x55af50140800 :6801 s=OPENED pgs=21 cs=1 l=0).fault server, going to standby
> 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.30586072 v1:10.1.67.140:0/3664284158 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1888700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.140:0/3664284158 conn(0x55af50055600 0x55af50143000 :6801 s=OPENED pgs=8 cs=1 l=0).fault server, going to standby

Hanging client (10.1.67.49) kernel log:

> 2019-09-26T16:08:27.481676+02:00 hostnamefoo kernel: [708596.227148] ceph: mds0 reconnect start
> 2019-09-26T16:08:27.488943+02:00 hostnamefoo kernel: [708596.233145] ceph: mds0 reconnect denied
> 2019-09-26T16:16:17.541041+02:00 hostnamefoo kernel: [709066.287601] libceph: mds0 10.1.4.203:6801 socket closed (con state NEGOTIATING)
> 2019-09-26T16:16:18.068934+02:00 hostnamefoo kernel: [709066.813064] ceph: mds0 rejected session
> 2019-09-26T16:16:18.068955+02:00 hostnamefoo kernel: [709066.814843] ceph: get_quota_realm: ino (10000000008.fffffffffffffffe) null i_snap_realm


diskprediction_local fails with python3-sklearn 0.22.2

by Eric Dold

Hello,

The mgr module diskprediction_local fails under Ubuntu 20.04 (focal) with python3-sklearn version 0.22.2. The Ceph version is 15.2.3. When the module is enabled I get the following error:

      File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 112, in serve
        self.predict_all_devices()
      File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 279, in predict_all_devices
        result = self._predict_life_expentancy(devInfo['devid'])
      File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 222, in _predict_life_expentancy
        predicted_result = obj_predictor.predict(predict_datas)
      File "/usr/share/ceph/mgr/diskprediction_local/predictor.py", line 457, in predict
        pred = clf.predict(ordered_data)
      File "/usr/lib/python3/dist-packages/sklearn/svm/_base.py", line 585, in predict
        if self.break_ties and self.decision_function_shape == 'ovo':
    AttributeError: 'SVC' object has no attribute 'break_ties'

Best Regards
Eric
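One plausible reading of this traceback, offered as an assumption since the message does not confirm it, is that the module loads a model that was serialized with an older scikit-learn whose `SVC` did not yet have a `break_ties` attribute. Restoring a saved object in Python normally refills its attribute dictionary without running `__init__`, so attributes introduced later are simply absent. The toy class below is a hypothetical stand-in, not scikit-learn code, and only illustrates that mechanism.

```python
class Classifier:
    """Hypothetical stand-in for a model class; not scikit-learn's SVC."""

    def __init__(self):
        self.decision_function_shape = "ovr"
        self.break_ties = False  # attribute that only the newer class version sets

    def predict(self):
        # Same shape as the check shown in the traceback.
        if self.break_ties and self.decision_function_shape == "ovo":
            return "tie-breaking path"
        return "default path"


def restore(state):
    """Rebuild an object the way default unpickling does: __init__ is not called."""
    obj = Classifier.__new__(Classifier)
    obj.__dict__.update(state)
    return obj


# Attribute state as an older class version would have saved it (no break_ties).
old_state = {"decision_function_shape": "ovr"}
model = restore(old_state)

try:
    model.predict()
except AttributeError as exc:
    print("restored model fails:", exc)

# In this toy setting, supplying the missing default makes predict() work again.
model.__dict__.setdefault("break_ties", False)
print(model.predict())
```

If the assumption holds, the durable fix would be to regenerate the stored models with the installed scikit-learn version rather than patching attributes by hand.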


v15.2.4 Octopus released

by David Galloway

We're happy to announce the fourth bugfix release in the Octopus series. In addition to a security fix in RGW, this release brings a range of fixes across all components. We recommend that all Octopus users upgrade to this release. For detailed release notes with links & changelog please refer to the official blog entry at https://ceph.io/releases/v15-2-4-octopus-released

Notable Changes
---------------

* CVE-2020-10753: rgw: sanitize newlines in s3 CORSConfiguration's ExposeHeader (William Bowling, Adam Mohammed, Casey Bodley)

* Cephadm: There were a lot of small usability improvements and bug fixes:

  * Grafana when deployed by Cephadm now binds to all network interfaces.
  * `cephadm check-host` now prints all detected problems at once.
  * Cephadm now calls `ceph dashboard set-grafana-api-ssl-verify false` when generating an SSL certificate for Grafana.
  * The Alertmanager is now correctly pointed to the Ceph Dashboard
  * `cephadm adopt` now supports adopting an Alertmanager
  * `ceph orch ps` now supports filtering by service name
  * `ceph orch host ls` now marks hosts as offline, if they are not accessible.

* Cephadm can now deploy NFS Ganesha services. For example, to deploy NFS with a service id of mynfs, that will use the RADOS pool nfs-ganesha and namespace nfs-ns::

    ceph orch apply nfs mynfs nfs-ganesha nfs-ns

* Cephadm: `ceph orch ls --export` now returns all service specifications in yaml representation that is consumable by `ceph orch apply`. In addition, the commands `orch ps` and `orch ls` now support `--format yaml` and `--format json-pretty`.

* Cephadm: `ceph orch apply osd` supports a `--preview` flag that prints a preview of the OSD specification before deploying OSDs. This makes it possible to verify that the specification is correct, before applying it.

* RGW: The `radosgw-admin` sub-commands dealing with orphans -- `radosgw-admin orphans find`, `radosgw-admin orphans finish`, and `radosgw-admin orphans list-jobs` -- have been deprecated. They have not been actively maintained and they store intermediate results on the cluster, which could fill a nearly-full cluster. They have been replaced by a tool, currently considered experimental, `rgw-orphan-list`.

* RBD: The name of the rbd pool object that is used to store rbd trash purge schedule is changed from "rbd_trash_trash_purge_schedule" to "rbd_trash_purge_schedule". Users that have already started using `rbd trash purge schedule` functionality and have per pool or namespace schedules configured should copy "rbd_trash_trash_purge_schedule" object to "rbd_trash_purge_schedule" before the upgrade and remove "rbd_trash_trash_purge_schedule" using the following commands in every RBD pool and namespace where a trash purge schedule was previously configured::

    rados -p <pool-name> [-N namespace] cp rbd_trash_trash_purge_schedule rbd_trash_purge_schedule
    rados -p <pool-name> [-N namespace] rm rbd_trash_trash_purge_schedule

  or use any other convenient way to restore the schedule after the upgrade.

Getting Ceph
------------

* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-14.2.10.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 7447c15c6ff58d7fce91843b705a268a1917325c

-- 
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
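The RBD note in this announcement asks for the same two `rados` commands to be run in every pool and namespace that had a trash purge schedule. A small dry-run helper can print those commands for review before anything is executed. This is a sketch only: the function name and the pool and namespace values are hypothetical placeholders, and nothing here runs the commands.

```python
import shlex

OLD_OBJECT = "rbd_trash_trash_purge_schedule"
NEW_OBJECT = "rbd_trash_purge_schedule"


def migration_commands(pool, namespace=None):
    """Return the copy and remove commands from the announcement for one target."""
    base = ["rados", "-p", pool]
    if namespace:
        base += ["-N", namespace]
    copy_cmd = base + ["cp", OLD_OBJECT, NEW_OBJECT]
    remove_cmd = base + ["rm", OLD_OBJECT]
    # shlex.quote keeps unusual pool or namespace names safe to paste into a shell.
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in (copy_cmd, remove_cmd)]


# Hypothetical targets: (pool, namespace) pairs; None means the default namespace.
targets = [("rbd", None), ("rbd", "tenant-a")]

for pool, namespace in targets:
    for command in migration_commands(pool, namespace):
        print(command)
```

Printing rather than executing keeps the copy step reviewable, which matters because the remove step deletes the old object.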


Provide more documentation for MDS performance tuning on large file systems

by Janek Bevendorff

Hello,

Over the last week I have tried optimising the performance of our MDS nodes for the large number of files and concurrent clients we have. It turns out that despite various stability fixes in recent releases, the default configuration still doesn't appear to be optimal for keeping the cache size under control and avoiding intermittent I/O blocks.

Unfortunately, it is very hard to tweak the configuration to something that works, because the tuning parameters needed are largely undocumented or only described in very technical terms in the source code, making them quite unapproachable for administrators not familiar with all the CephFS internals. I would therefore like to ask whether it would be possible to document the "advanced" MDS settings more clearly as to what they do and in what direction they have to be tuned for more or less aggressive cap recall, for instance (sometimes it is not clear if a threshold is a min or a max threshold).

I am in the very (un)fortunate situation of having folders with several 100K direct subfolders or files (and one extreme case with almost 7 million dentries), which is a pretty good benchmark for measuring cap growth while performing operations on them. For the time being, I came up with this configuration, which seems to work for me, but is still far from optimal:

    mds  basic     mds_cache_memory_limit     10737418240
    mds  advanced  mds_cache_trim_threshold   131072
    mds  advanced  mds_max_caps_per_client    500000
    mds  advanced  mds_recall_max_caps        17408
    mds  advanced  mds_recall_max_decay_rate  2.000000

The parameters I am least sure about (because I understand the least how they actually work) are mds_cache_trim_threshold and mds_recall_max_decay_rate. Despite reading the description in src/common/options.cc, I understand only half of what they're doing and I am also not quite sure in which direction to tune them for optimal results.

Another point where I am struggling is the correct configuration of mds_recall_max_caps. The default of 5K doesn't work too well for me, but values above 20K also don't seem to be a good choice. While high values result in fewer blocked ops and better performance without destabilising the MDS, they also lead to slow but unbounded cache growth, which seems counter-intuitive. 17K was the maximum I could go. Higher values work for most use cases, but when listing very large folders with millions of dentries, the MDS cache size slowly starts to exceed the limit after a few hours, since the MDSs are failing to keep clients below mds_max_caps_per_client despite not showing any "failing to respond to cache pressure" warnings.

With the configuration above, I do not have cache size issues any more, but it comes at the cost of performance and slow/blocked ops. A few hints as to how I could optimise my settings for better client performance would be much appreciated, and so would additional documentation for all the "advanced" MDS settings.

Thanks a lot
Janek
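For readers who want to experiment with the values from this message, the sketch below turns them into commands of the form `ceph config set mds <option> <value>`, which is how another message in this digest sets MDS options, and shows the sizes in more readable units. The helper name is made up for illustration, and the values are this poster's, not recommendations.

```python
# Values copied from the configuration listed in this message.
settings = {
    "mds_cache_memory_limit": "10737418240",
    "mds_cache_trim_threshold": "131072",
    "mds_max_caps_per_client": "500000",
    "mds_recall_max_caps": "17408",
    "mds_recall_max_decay_rate": "2.000000",
}


def config_set_commands(section, options):
    """Build one 'ceph config set' command line per option (printed, not executed)."""
    return [f"ceph config set {section} {name} {value}" for name, value in options.items()]


for command in config_set_commands("mds", settings):
    print(command)

# Unit check: the memory limit is a byte count, and the recall value is the "17K" from the text.
print("cache memory limit in GiB:", int(settings["mds_cache_memory_limit"]) / 2**30)
print("recall max caps in units of 1024:", int(settings["mds_recall_max_caps"]) / 1024)
```

Changing one option at a time and watching cache size and blocked ops between changes makes it easier to attribute an effect to a specific setting.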


atime with cephfs

by Oliver Freyermuth

Dear Cephers,

we are currently mounting CephFS with relatime, using the FUSE client (version 13.2.6):

    ceph-fuse on /cephfs type fuse.ceph-fuse (rw,relatime,user_id=0,group_id=0,allow_other)

For the first time, I wanted to use atime to identify old unused data. My expectation with "relatime" was that the access time stamp would be updated less often, for example, only if the last file access was >24 hours ago. However, that does not seem to be the case:

----------------------------------------------
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
$ cat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root > /dev/null
$ sync
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
----------------------------------------------

I also tried this via an nfs-ganesha mount, and via a ceph-fuse mount with admin caps, but atime never changes.

Is atime really never updated with CephFS, or is this configurable? Something as coarse as "update at maximum once per day only" would be perfectly fine for the use case.

Cheers,
Oliver
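For comparison, the relatime rule that the Linux kernel applies on local filesystems, as described in mount(8), is: update atime on access if the stored atime is not later than mtime or ctime, or if it is at least 24 hours old. The sketch below encodes that rule and applies it to the timestamps from the `stat` output in this message (truncated to microseconds). Whether ceph-fuse follows the same rule is exactly the open question here, so this shows the expectation, not CephFS behaviour; the function name and the `now` value are illustrative.

```python
from datetime import datetime, timedelta


def relatime_should_update(atime, mtime, ctime, now):
    """Relatime rule as documented for Linux: decide whether a read updates atime."""
    if atime <= mtime or atime <= ctime:
        return True  # file changed since it was last marked as accessed
    return now - atime >= timedelta(hours=24)  # otherwise at most one update per day


# Timestamps from the stat output in this message (nanoseconds truncated).
atime = datetime(2019, 4, 10, 15, 50, 4, 975959)
mtime = datetime(2019, 4, 10, 15, 50, 5, 651613)
ctime = datetime(2019, 4, 10, 15, 50, 6, 141006)

# Illustrative "now": any later moment gives the same answer for these values.
now = datetime(2019, 4, 11, 9, 0, 0)

print(relatime_should_update(atime, mtime, ctime, now))
```

Since the stored atime is earlier than both mtime and ctime, the rule says a read should have updated it regardless of the 24-hour window, which supports the observation that atime is not being updated at all on this mount.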


Octopus OSDs dropping out of cluster: _check_auth_rotating possible clock skew, rotating keys expired way too early

by Wido den Hollander

Hi,

On a recently deployed Octopus (15.2.2) cluster (240 OSDs) we are seeing OSDs randomly drop out of the cluster.

Usually it's 2 to 4 OSDs spread out over different nodes. Each node has 16 OSDs and not all the failing OSDs are on the same node.

The OSDs are marked as down and all they keep printing in their logs is:

    monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-06-04T07:57:17.706529-0400)

Looking at their status through the admin socket:

    {
        "cluster_fsid": "68653193-9b84-478d-bc39-1a811dd50836",
        "osd_fsid": "87231b5d-ae5f-4901-93c5-18034381e5ec",
        "whoami": 206,
        "state": "active",
        "oldest_map": 73697,
        "newest_map": 75795,
        "num_pgs": 19
    }

The message brought me to my own ticket I created 2 years ago: https://tracker.ceph.com/issues/23460

The first thing I've checked is NTP/time. Double, triple check this. All the times are in sync on the cluster. Nothing wrong there.

Again, it's not all the OSDs on a node failing. Just 1 or 2 dropping out.

Restarting them brings them back right away and then within 24h some other OSDs will drop out.

Has anybody seen this behavior with Octopus as well?

Wido

