Hi,
We are running a ceph cluster on Ubuntu 18.04 machines with ceph 14.2.4.
Our CephFS clients use the kernel module, and we have noticed that some
of them occasionally hang after an MDS restart (we have observed this at
least once). The only way to resolve this is to unmount and remount the
mountpoint, or to reboot the machine if unmounting is not possible.
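In case it is useful, this is roughly what the manual recovery looks like
on an affected machine (a sketch only; the mount point, monitor names and
client name are placeholders for our actual setup):

    # a plain umount usually fails because processes are blocked in the mount
    umount /mnt/cephfs || umount -f /mnt/cephfs
    # last resort before rebooting: lazy unmount
    umount -l /mnt/cephfs
    # remount with the kernel client
    mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs \
        -o name=cephfs-user,secretfile=/etc/ceph/cephfs-user.secret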
After some investigation, the problem seems to be that the MDS denies
reconnect attempts from some clients during restart even though the
reconnect interval is not yet reached. In particular, I see the following
log entries. Note that there are supposedly 9 sessions: 9 clients
reconnect (one client has two mountpoints), and then two more clients
attempt to reconnect after the MDS has already logged "reconnect_done".
These two clients were hanging afterwards; the kernel log of one of them
is shown below as well.
Running `ceph tell mds.0 client ls` after the clients have been
rebooted/remounted also shows 11 clients instead of 9.
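For reference, this is roughly how I inspect and clean up the leftover
sessions (a sketch; it assumes jq is available, the JSON field name is a
guess, and the client id is just the example from the log below):

    # count and list the sessions rank 0 still knows about
    ceph tell mds.0 client ls | jq length
    ceph tell mds.0 client ls | jq '.[].inst'
    # evict one of the stale leftovers by client id
    ceph tell mds.0 client evict id=24167394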
Do you have any ideas what is wrong here and how it could be fixed? My
guess is that the MDS has an incorrect session count and therefore ends
the reconnect phase too soon. Is this indeed a bug, and if so, do you
know what is broken?
Regardless, I also think that the kernel client should be able to deal
with a denied reconnect and retry later. Yet even after 10 minutes the
kernel does not attempt to reconnect. Is this a known issue, or perhaps
already fixed in newer kernels? If not, is there a chance of getting
this fixed?
Thanks,
Florian
MDS log:
> 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.server reconnect_clients -- 9 sessions
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21019865 v1:10.1.7.22:0/3752632657 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21020717 v1:10.1.7.115:0/2841046616 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24171153 v1:10.1.7.243:0/1127767158 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.23978093 v1:10.1.4.71:0/824226283 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24209569 v1:10.1.4.157:0/1271865906 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190930 v1:10.1.4.240:0/3195698606 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190912 v1:10.1.4.146:0/852604154 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.59 reconnect_done
> 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1087700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.49:0/1483641729 conn(0x55af50053f80 0x55af50140800 :6801 s=OPENED pgs=21 cs=1 l=0).fault server, going to standby
> 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.30586072 v1:10.1.67.140:0/3664284158 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1888700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.140:0/3664284158 conn(0x55af50055600 0x55af50143000 :6801 s=OPENED pgs=8 cs=1 l=0).fault server, going to standby
Hanging client (10.1.67.49) kernel log:
> 2019-09-26T16:08:27.481676+02:00 hostnamefoo kernel: [708596.227148] ceph: mds0 reconnect start
> 2019-09-26T16:08:27.488943+02:00 hostnamefoo kernel: [708596.233145] ceph: mds0 reconnect denied
> 2019-09-26T16:16:17.541041+02:00 hostnamefoo kernel: [709066.287601] libceph: mds0 10.1.4.203:6801 socket closed (con state NEGOTIATING)
> 2019-09-26T16:16:18.068934+02:00 hostnamefoo kernel: [709066.813064] ceph: mds0 rejected session
> 2019-09-26T16:16:18.068955+02:00 hostnamefoo kernel: [709066.814843] ceph: get_quota_realm: ino (10000000008.fffffffffffffffe) null i_snap_realm
Hi all,
I am trying to run the Ceph client tools on an Odroid XU4 (armhf) with
Ubuntu 20.04 and Python 3.8.5.
Unfortunately, every "ceph" command (even "ceph --help") fails with the
following error:
Traceback (most recent call last):
  File "/usr/bin/ceph", line 1275, in <module>
    retval = main()
  File "/usr/bin/ceph", line 981, in main
    cluster_handle = run_in_thread(rados.Rados,
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1342, in run_in_thread
    raise Exception("timed out")
Exception: timed out
I access an existing Ceph cluster from this server (same hardware).
I checked the code in question; it just starts a thread and joins it,
waiting for the RadosThread to finish.
Maybe this is a Python issue in combination with the armhf architecture?
Perhaps someone can help.
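In case it helps, this is roughly how I plan to narrow it down further
(a sketch; the timeout, debug levels and monitor address are arbitrary
examples):

    # check which client packages and python bindings are installed
    dpkg -l | grep -E 'ceph|rados'
    python3 -c "import rados; print(rados.__file__)"
    # see how far the client gets before timing out
    ceph -s --connect-timeout 30 --debug-monc=20 --debug-ms=1
    # verify the monitors are reachable from this host at all
    grep mon_host /etc/ceph/ceph.conf
    nc -vz <mon-ip> 3300
    nc -vz <mon-ip> 6789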
Thanks and greetings
Dominik
Hi,
The docs have scant detail on doing a migration to bluestore using a
per-osd device copy:
https://docs.ceph.com/en/latest/rados/operations/bluestore-migration/#per-o…
This mentions "using the copy function of ceph-objectstore-tool", but
ceph-objectstore-tool doesn't have a copy function (all the way from v9 to
current).
Has anyone actually tried doing this?
Is there any further detail available on what is involved, e.g. a broad
outline of the steps?
Of course, detailed instructions would be even better, even if accompanied
by "here be dragons!" warnings.
Cheers,
Chris
Hello
The mgr module diskprediction_local fails under Ubuntu 20.04 (Focal)
with python3-sklearn version 0.22.2. The Ceph version is 15.2.3.
When the module is enabled, I get the following error:
File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 112, in
serve
self.predict_all_devices()
File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 279, in
predict_all_devices
result = self._predict_life_expentancy(devInfo['devid'])
File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 222, in
_predict_life_expentancy
predicted_result = obj_predictor.predict(predict_datas)
File "/usr/share/ceph/mgr/diskprediction_local/predictor.py", line 457,
in predict
pred = clf.predict(ordered_data)
File "/usr/lib/python3/dist-packages/sklearn/svm/_base.py", line 585, in
predict
if self.break_ties and self.decision_function_shape == 'ovo':
AttributeError: 'SVC' object has no attribute 'break_ties'
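For reference, this is how I'm checking whether the sklearn version
mismatch is the culprit (a sketch; the version pin is a guess on my part,
based on the assumption that the pre-trained models shipped with the
module were pickled with a pre-0.22 sklearn, which had no break_ties
attribute on SVC):

    # check which scikit-learn the mgr's python is using
    python3 -c "import sklearn; print(sklearn.__version__)"
    # experiment: install an older scikit-learn (this shadows the distro package)
    pip3 install 'scikit-learn<0.22'
    # restart the module
    ceph mgr module disable diskprediction_local
    ceph mgr module enable diskprediction_local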
Best Regards
Eric
Hello all,
wrt: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXN…
Yesterday we hit a problem with osd_pglog memory, similar to the thread above.
We have a 56-node object storage (S3+Swift) cluster with 25 OSD disks per node. We run 8+3 EC for the data pool (metadata is on a replicated NVMe pool).
The cluster has been running fine, and (as relevant to this post) the memory usage has been stable at 100 GB per node. We've been running with the default pg_log setting of 3000 entries. The user traffic doesn't seem to have been exceptional lately.
Last Thursday we updated the OSDs from 14.2.8 to 14.2.13. On Friday the memory usage on the OSD nodes started to grow. On each node it grew steadily by about 30 GB/day, until the servers started OOM-killing OSD processes.
After a lot of debugging we found that the pg_logs were huge. The pg_log of each OSD process had grown to ~22 GB, which we naturally didn't have memory for, and the cluster ended up in an unstable state. This is significantly more than the 1.5 GB in the post above. We do have ~20k PGs, which may directly affect the size.
We've reduced the pg_log to 500 entries, started offline trimming where we can, and otherwise just waited. The pg_log size dropped to ~1.2 GB on at least some nodes, but we're still recovering and still have a lot of OSDs down and out.
We're unsure whether version 14.2.13 triggered this, or the OSD restarts did (or something unrelated that we don't see).
This mail is mostly to ask whether there are good guesses as to why the pg_log size per OSD process exploded. Any technical (and moral) support is appreciated. Also, since we're currently not sure whether 14.2.13 triggered this, this is also meant as a data point for other debuggers.
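For reference, this is roughly what the reduction and the offline trim
look like on our side (a sketch; the OSD id and pgid are placeholders,
please double-check before running the objectstore-tool step yourself):

    # cap the per-PG log length going forward
    ceph config set osd osd_max_pg_log_entries 500
    ceph config set osd osd_min_pg_log_entries 500
    # offline trim on a stopped OSD, one PG at a time
    systemctl stop ceph-osd@17
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 --op list-pgs
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
        --pgid 11.3f --op trim-pg-log
    systemctl start ceph-osd@17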
Cheers,
Kalle Happonen
This is the 15th backport release in the Nautilus series. This release
fixes a ceph-volume regression introduced in v14.2.13 and includes a few
other fixes. We recommend users update to this release.
For detailed release notes with links and changelog, please refer to the
official blog entry at https://ceph.io/releases/v14-2-15-nautilus-released
Notable Changes
---------------
* ceph-volume: Fixes lvm batch --auto, which broke backward
compatibility when using only non-rotational devices (SSD and/or NVMe).
* BlueStore: Fixes a bug in collection_list_legacy which could make PGs
inconsistent during scrub when OSDs older than 14.2.12 are running mixed
with newer ones.
* MGR: progress module can now be turned on/off, using the commands:
`ceph progress on` and `ceph progress off`.
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-14.2.15.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: afdd217ae5fb1ed3f60e16bd62357ca58cc650e5
We're happy to announce the fourth bugfix release in the Octopus series.
In addition to a security fix in RGW, this release brings a range of fixes
across all components. We recommend that all Octopus users upgrade to this
release. For detailed release notes with links and changelog, please
refer to the official blog entry at https://ceph.io/releases/v15-2-4-octopus-released
Notable Changes
---------------
* CVE-2020-10753: rgw: sanitize newlines in s3 CORSConfiguration's ExposeHeader
(William Bowling, Adam Mohammed, Casey Bodley)
* Cephadm: There were a lot of small usability improvements and bug fixes:
* Grafana when deployed by Cephadm now binds to all network interfaces.
* `cephadm check-host` now prints all detected problems at once.
* Cephadm now calls `ceph dashboard set-grafana-api-ssl-verify false`
when generating an SSL certificate for Grafana.
* The Alertmanager is now correctly pointed to the Ceph Dashboard.
* `cephadm adopt` now supports adopting an Alertmanager.
* `ceph orch ps` now supports filtering by service name.
* `ceph orch host ls` now marks hosts as offline if they are not
accessible.
* Cephadm can now deploy NFS Ganesha services. For example, to deploy NFS
with a service id of mynfs that will use the RADOS pool nfs-ganesha and
namespace nfs-ns:

    ceph orch apply nfs mynfs nfs-ganesha nfs-ns
* Cephadm: `ceph orch ls --export` now returns all service specifications in
yaml representation that is consumable by `ceph orch apply`. In addition,
the commands `orch ps` and `orch ls` now support `--format yaml` and
`--format json-pretty`.
* Cephadm: `ceph orch apply osd` supports a `--preview` flag that prints a preview of
the OSD specification before deploying OSDs. This makes it possible to
verify that the specification is correct, before applying it.
* RGW: The `radosgw-admin` sub-commands dealing with orphans --
`radosgw-admin orphans find`, `radosgw-admin orphans finish`, and
`radosgw-admin orphans list-jobs` -- have been deprecated. They have
not been actively maintained and they store intermediate results on
the cluster, which could fill a nearly-full cluster. They have been
replaced by a tool, currently considered experimental,
`rgw-orphan-list`.
* RBD: The name of the rbd pool object that is used to store the
rbd trash purge schedule has changed from "rbd_trash_trash_purge_schedule"
to "rbd_trash_purge_schedule". Users that have already started using the
`rbd trash purge schedule` functionality and have per-pool or per-namespace
schedules configured should copy the "rbd_trash_trash_purge_schedule"
object to "rbd_trash_purge_schedule" before the upgrade and remove
"rbd_trash_trash_purge_schedule" using the following commands in every RBD
pool and namespace where a trash purge schedule was previously
configured:

    rados -p <pool-name> [-N namespace] cp rbd_trash_trash_purge_schedule rbd_trash_purge_schedule
    rados -p <pool-name> [-N namespace] rm rbd_trash_trash_purge_schedule
or use any other convenient way to restore the schedule after the
upgrade.
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-15.2.4.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 7447c15c6ff58d7fce91843b705a268a1917325c
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
Over the weekend I had multiple OSD servers in my Octopus cluster
(15.2.4) crash and reboot at nearly the same time. The OSDs are part of
an erasure coded pool. At the time the cluster had been busy with a
long-running (~week) remapping of a large number of PGs after I
incrementally added more OSDs to the cluster. After bringing all of the
OSDs back up, I have 25 unfound objects and 75 degraded objects. There
are other problems reported, but I'm primarily concerned with these
unfound/degraded objects.
The pool with the missing objects is a cephfs pool. The files stored in
the pool are backed up on tape, so I can easily restore individual files
as needed (though I would not want to restore the entire filesystem).
I tried following the guide at
https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg….
I found a number of OSDs that are still 'not queried'. Restarting a
sampling of these OSDs changed the state from 'not queried' to 'already
probed', but that did not recover any of the unfound or degraded objects.
I have also tried 'ceph pg deep-scrub' on the affected PGs, but never
saw them get scrubbed. I also tried doing a 'ceph pg force-recovery' on
the affected PGs, but only one seems to have been tagged accordingly
(see ceph -s output below).
The guide also says "Sometimes it simply takes some time for the cluster
to query possible locations." I'm not sure how long "some time" might
take, but it hasn't changed after several hours.
My questions are:
* Is there a way to force the cluster to query the possible locations
sooner?
* Is it possible to identify the files in cephfs that are affected, so
that I could delete only the affected files and restore them from backup
tapes? (A rough sketch of what I have in mind for this is below.)
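For that second question, the idea I've been toying with relies on the
fact that cephfs data-pool object names encode the file's inode number in
hex; a sketch (the pgid, object name and mount point are placeholders):

    # list the unfound objects in one of the affected PGs
    ceph pg <pgid> list_unfound
    # an object named e.g. "10000000abc.00000000" belongs to inode 0x10000000abc;
    # convert to decimal and look the file up on a mounted cephfs
    ino=$((16#10000000abc))
    find /mnt/cephfs -inum "$ino"
    # once the files are restored from tape, the objects could be given up on
    # (destructive, I have not run this yet):
    # ceph pg <pgid> mark_unfound_lost delete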
--Mike
ceph -s:

  cluster:
    id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
    health: HEALTH_ERR
            1 clients failing to respond to capability release
            1 MDSs report slow requests
            25/78520351 objects unfound (0.000%)
            2 nearfull osd(s)
            Reduced data availability: 1 pg inactive
            Possible data damage: 9 pgs recovery_unfound
            Degraded data redundancy: 75/626645098 objects degraded (0.000%), 9 pgs degraded
            1013 pgs not deep-scrubbed in time
            1013 pgs not scrubbed in time
            2 pool(s) nearfull
            1 daemons have recently crashed
            4 slow ops, oldest one blocked for 77939 sec, daemons [osd.0,osd.41] have slow ops.

  services:
    mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
    mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
    mds: archive:1 {0=ceph4=up:active} 3 up:standby
    osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs

  task status:
    scrub status:
      mds.ceph4: idle

  data:
    pools:   9 pools, 2433 pgs
    objects: 78.52M objects, 298 TiB
    usage:   412 TiB used, 545 TiB / 956 TiB avail
    pgs:     0.041% pgs unknown
             75/626645098 objects degraded (0.000%)
             135224/626645098 objects misplaced (0.022%)
             25/78520351 objects unfound (0.000%)
             2421 active+clean
             5    active+recovery_unfound+degraded
             3    active+recovery_unfound+degraded+remapped
             2    active+clean+scrubbing+deep
             1    unknown
             1    active+forced_recovery+recovery_unfound+degraded

  progress:
    PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
      [............................]
Hello,
Over the last week I have tried optimising the performance of our MDS
nodes for the large number of files and concurrent clients we have. It
turns out that despite various stability fixes in recent releases, the
default configuration still doesn't appear to be optimal for keeping the
cache size under control and avoiding intermittent I/O blocks.
Unfortunately, it is very hard to tweak the configuration to something
that works, because the necessary tuning parameters are largely
undocumented or only described in very technical terms in the source
code, making them quite unapproachable for administrators who are not
familiar with all the CephFS internals. I would therefore like to ask
whether it would be possible to document the "advanced" MDS settings more
clearly: what they do and in which direction they need to be tuned for
more or less aggressive cap recall, for instance (sometimes it is not
clear whether a threshold is a minimum or a maximum).
I am in the very (un)fortunate situation of having folders with several
hundred thousand direct subfolders or files (and one extreme case with
almost 7 million dentries), which makes for a pretty good benchmark for
measuring cap growth while performing operations on them. For the time
being, I came up with this configuration, which seems to work for me, but
is still far from optimal:
mds basic mds_cache_memory_limit 10737418240
mds advanced mds_cache_trim_threshold 131072
mds advanced mds_max_caps_per_client 500000
mds advanced mds_recall_max_caps 17408
mds advanced mds_recall_max_decay_rate 2.000000
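For completeness, this is roughly how I apply and monitor these settings
at runtime (a sketch; the MDS daemon name is a placeholder and the perf
dump field names may differ between releases):

    # apply the values listed above
    ceph config set mds mds_cache_memory_limit 10737418240
    ceph config set mds mds_cache_trim_threshold 131072
    ceph config set mds mds_max_caps_per_client 500000
    ceph config set mds mds_recall_max_caps 17408
    ceph config set mds mds_recall_max_decay_rate 2.0
    # watch the effect on cache size and cap counts
    ceph daemon mds.<name> cache status
    ceph daemon mds.<name> perf dump | grep -E 'caps|inodes'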
The parameters I am least sure about (because I understand least how
they actually work) are mds_cache_trim_threshold and
mds_recall_max_decay_rate. Despite reading the descriptions in
src/common/options.cc, I understand only half of what they do, and I am
also not quite sure in which direction to tune them for optimal
results.
Another point where I am struggling is the correct configuration of
mds_recall_max_caps. The default of 5K doesn't work too well for me, but
values above 20K also don't seem to be a good choice. While high values
result in fewer blocked ops and better performance without destabilising
the MDS, they also lead to slow but unbounded cache growth, which seems
counter-intuitive. 17K was the maximum I could go to. Higher values work
for most use cases, but when listing very large folders with millions of
dentries, the MDS cache size slowly starts to exceed the limit after a
few hours, since the MDSs fail to keep clients below
mds_max_caps_per_client despite not showing any "failing to respond to
cache pressure" warnings.
With the configuration above, I no longer have cache size issues, but it
comes at the cost of performance and slow/blocked ops. A few hints on how
I could optimise my settings for better client performance would be much
appreciated, and so would additional documentation for all the "advanced"
MDS settings.
Thanks a lot
Janek