I noticed a similar issue tonight. I'm still looking into the details, but here are the client logs I have:
Oct 9 19:27:59 mon5-cx kernel: libceph: mds0 ***:6800 socket closed (con state OPEN)
Oct 9 19:28:01 mon5-cx kernel: libceph: mds0 ***:6800 connection reset
Oct 9 19:28:01 mon5-cx kernel: libceph: reset on mds0
Oct 9 19:28:01 mon5-cx kernel: ceph: mds0 closed our session
Oct 9 19:28:01 mon5-cx kernel: ceph: mds0 reconnect start
Oct 9 19:28:01 mon5-cx kernel: ceph: mds0 reconnect denied
Oct 9 19:28:01 mon5-cx kernel: ceph: dropping dirty+flushing Fw state for ffff9109011c9980 1099517142146
Oct 9 19:28:01 mon5-cx kernel: ceph: dropping dirty+flushing Fw state for ffff91096cc788d0 1099517142307
Oct 9 19:28:01 mon5-cx kernel: ceph: dropping dirty+flushing Fw state for ffff9107da741f10 1099517142312
Oct 9 19:28:01 mon5-cx kernel: ceph: dropping dirty+flushing Fw state for ffff9109d5c40e60 1099517141612
Oct 9 19:28:01 mon5-cx kernel: ceph: dropping dirty+flushing Fw state for ffff9108c9337da0 1099517142313
Oct 9 19:28:01 mon5-cx kernel: ceph: dropping dirty+flushing Fw state for ffff9109d5c70340 1099517141565
Oct 9 19:28:01 mon5-cx kernel: ceph: dropping dirty+flushing Fw state for ffff910955acf810 1099517141792
Oct 9 19:28:01 mon5-cx kernel: ceph: dropping dirty+flushing Fw state for ffff91095ff56cf0 1099517142006
Oct 9 19:28:01 mon5-cx kernel: ceph: dropping dirty+flushing Fw state for ffff91096cc7f280 1099517142309
Oct 9 19:28:01 mon5-cx kernel: libceph: mds0 ***:6800 socket closed (con state NEGOTIATING)
Oct 9 19:28:02 mon5-cx kernel: ceph: mds0 rejected session
Oct 9 19:28:02 mon5-cx monit: Lookup for '/srv/repos' filesystem failed -- not found in /proc/self/mounts
Oct 9 19:28:02 mon5-cx monit: Filesystem '/srv/repos' not mounted
Oct 9 19:28:02 mon5-cx monit: 'repos' unable to read filesystem '/srv/repos' state
...
Oct 9 19:28:09 mon5-cx kernel: ceph: get_quota_realm: ino (1.fffffffffffffffe) null i_snap_realm
Oct 9 19:28:24 mon5-cx kernel: ceph: get_quota_realm: ino (1.fffffffffffffffe) null i_snap_realm
Oct 9 19:28:39 mon5-cx kernel: ceph: get_quota_realm: ino (1.fffffffffffffffe) null i_snap_realm
...
Oct 9 21:27:09 mon5-cx kernel: ceph: get_quota_realm: ino (1.fffffffffffffffe) null i_snap_realm
Oct 9 21:27:24 mon5-cx kernel: ceph: get_quota_realm: ino (1.fffffffffffffffe) null i_snap_realm
Oct 9 21:27:27 mon5-cx monit: Lookup for '/srv/repos' filesystem failed -- not found in /proc/self/mounts
Oct 9 21:27:27 mon5-cx monit: Filesystem '/srv/repos' not mounted
Oct 9 21:27:27 mon5-cx monit: 'repos' unable to read filesystem '/srv/repos' state
Oct 9 21:27:27 mon5-cx monit: 'repos' trying to restart
>>> Do you have statistics on the size of the OSDMaps or count of them
>>> which were being maintained by the OSDs?
>> No, I don't think so. How can I find this information?
>
> Hmm I don't know if we directly expose the size of maps. There are
> perfcounters which expose the range of maps being kept around but I
> don't know their names off-hand.
FWIW I've been told that the size of an OSDMap is roughly equivalent to the output of `ceph pg dump | wc`, which, if true, would seem to mean that they're trivially small for most purposes. Reality may of course be quite different and/or more nuanced.
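If it helps, here are a couple of rough ways to look at this on a Filestore OSD (the OSD id and paths below are just examples for a default deployment; adjust to taste):

  # epoch range of maps this OSD is currently keeping (oldest_map / newest_map)
  ceph daemon osd.0 status

  # on-disk footprint and count of the stored osdmap objects
  du -sh /var/lib/ceph/osd/ceph-0/current/meta
  find /var/lib/ceph/osd/ceph-0/current/meta -name 'osdmap*' | wc -l

Comparing oldest_map/newest_map against the current epoch from `ceph osd dump | head -1` gives a rough idea of how many maps are being kept around.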
From my initial testing it looks like 14.2.4 fully supports the
deduplication mentioned here:
https://docs.ceph.com/docs/master/dev/deduplication/
However, I'm not sure where the struct object_manifest part fits in relation to foo and foo-chunk, and I'm not sure what the offsets/caspool should be.
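My tentative reading of that doc (and I may well be misreading it) is that struct object_manifest is internal to the OSD, and the user-facing piece is the rados manifest operation, roughly along the lines of the doc's example:

  rados -p base_pool set-chunk foo 0 1024 --target-pool caspool foo-chunk 0

i.e. the offset/length describe which byte range of foo is backed by the chunk object foo-chunk, and the caspool is simply the pool holding the deduplicated chunks. Please correct me if that's wrong.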
If this still isn't fully implemented, how does the dedup tool work? If I remove a file but it exists elsewhere on the volume, will it be purged, or would the tool need to run again to clear the data?
When trying to modify a zone in one of my clusters to promote it to the
master zone, I get this error:
~ $ radosgw-admin zone modify --rgw-zone atl --master
failed to update zonegroup: 2019-10-09 15:41:53.409 7f9ecae26840 0 ERROR:
found existing zone name atl (94d26f94-d64c-40d1-9a33-56afa948d86a) in
zonegroup seast
(17) File exists
~ $
Anyone have any ideas what's going on here?
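For context, the sequence I was expecting to need (going from the multisite failover docs, assuming the period and zonegroup are otherwise healthy) is roughly:

  radosgw-admin zone modify --rgw-zone=atl --master --default
  radosgw-admin period update --commit

followed by restarting the radosgw instances. Given the 'File exists' error, I'm also planning to compare `radosgw-admin zonegroup get --rgw-zonegroup=seast` against `radosgw-admin zone get --rgw-zone=atl` to check whether the zone id recorded in the zonegroup matches the zone's own id.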
Thanks all,
Mac
Hi all
I have a smallish test cluster (14 servers, 84 OSDs) running 14.2.4. Monthly OS patching, and the reboots that go along with it, have left the cluster very unwell.
Many of the servers in the cluster are OOM-killing the ceph-osd processes when they try to start (6 OSDs per server, running on Filestore). Strace shows the ceph-osd processes spending hours reading through the 220k osdmap files after being started.
This behavior started after we recently made it about 72% full to see how things behaved. We also upgraded it to Nautilus 14.2.2 at about the same time.
I’ve tried starting just one OSD per server at a time in hopes of avoiding the OOM killer. Also tried setting noin, rebooting the whole cluster, waiting a day, then marking each of the OSDs in manually. The end result is the same either way. About 60% of PGs are still down, 30% are peering, and the rest are in worse shape.
Anyone out there have suggestions about how I should go about getting this cluster healthy again? Any ideas appreciated.
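In case it helps frame suggestions, the knobs I'm currently eyeing (just a sketch of what I'm considering, not something I've confirmed helps) are along the lines of:

  # stop recovery/backfill traffic while OSDs are still churning through old maps
  ceph osd set nobackfill
  ceph osd set norecover
  ceph osd set norebalance

  # in ceph.conf on the OSD hosts, keep fewer maps in memory
  [osd]
      osd_map_cache_size = 20
      osd_map_message_max = 10

and then bringing OSDs back one at a time per host.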
Thanks!
- Aaron
Good morning
Q: Is it possible to have a 2nd cephfs_data volume and expose it to the same OpenStack environment?
Reason being:
Our current profile is configured with an erasure code of k=3,m=1 (rack level), but we're looking to buy another roughly 6 PB of storage with controllers, and we were thinking of moving to an erasure profile of k=2,m=1, since we're not so focused on data redundancy as on disk space and performance.
From what I understand you can't change the erasure profile of an existing pool, therefore we would essentially need to build a new Ceph cluster. We're trying to understand whether we can attach it to the existing OpenStack platform, then gradually move all the data over from the old cluster into the new one, destroy the old cluster, and integrate it with the new one.
If anyone has any recommendations for getting more space and performance at the cost of data redundancy, while keeping at least one rack of fault tolerance, please let me know as well.
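To make the question a bit more concrete: what we'd ideally like (and this is only a sketch pieced together from the docs, so I'm not sure it's sound) is to add a second data pool with a different EC profile to the existing filesystem instead of building a whole new cluster, something like:

  ceph osd erasure-code-profile set ec-k2m1 k=2 m=1 crush-failure-domain=rack
  ceph osd pool create cephfs_data2 1024 1024 erasure ec-k2m1
  ceph osd pool set cephfs_data2 allow_ec_overwrites true
  ceph fs add_data_pool cephfs cephfs_data2
  # point new directories at the new pool
  setfattr -n ceph.dir.layout.pool -v cephfs_data2 /mnt/cephfs/newdata

(the profile name, pool name, PG counts, filesystem name and mount path are placeholders). If something like that is workable, it might sidestep the second-cluster question entirely.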
Regards
--
Jeremi-Ernst Avenant, Mr.
Cloud Infrastructure Specialist
Inter-University Institute for Data Intensive Astronomy
5th Floor, Department of Physics and Astronomy,
University of Cape Town
Tel: 021 959 4137
Web: www.idia.ac.za
E-mail (IDIA): jeremi(a)idia.ac.za
Rondebosch, Cape Town, 7600
Hi!
Is it possible, and if so how, to remove all permissions to a subdirectory for a user?
I tried this:
ceph auth caps client.XYZ mon 'allow r' mds 'allow r, allow rws path=/XYZ, allow path=/ABC' osd 'allow rw pool=cephfs_data'
but got:
Error EINVAL: mds capability parse failed, stopped at ', allow path=/ABC' of 'allow r, allow rws path=/XYZ, allow path=/ABC'
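Looking at the error again, I suspect the parser simply wants a permission spec after every 'allow', so the last clause is missing its r/w letters. As far as I can tell there is no explicit 'deny' in MDS caps, so the closest valid variant would be something like the sketch below, which still grants read on /ABC rather than removing access to it:

  ceph auth caps client.XYZ mon 'allow r' mds 'allow r, allow rws path=/XYZ, allow r path=/ABC' osd 'allow rw pool=cephfs_data'

Is leaving a path out of the caps really the only way to withhold access to it?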
Thanks
Lars
Thx for the hint.
I fiddled around with the configuration and found this:
> root@vm-2:~# ceph zabbix send
> Failed to send data to Zabbix
while
> root@vm-2:~# zabbix_sender -vv -z 192.168.15.253 -p 10051 -s vm-2 -k ceph.num_osd -o 32
> zabbix_sender [1724513]: DEBUG: answer [{"response":"success","info":"processed: 1; failed: 0; total: 1; seconds spent: 0.000041"}]
> info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000041"
> sent: 1; skipped: 0; total: 1
works just fine. I figured out that it could be a hostname mismatch between what "ceph zabbix send" transmits and the hostname that is configured on the Zabbix server. And well... it's almost embarrassing that I missed this for about 3 months now, but:
The hostname the ceph zabbix module was submitting was in capital letters, while the hostname configured in Zabbix was lowercase, even though the hostname for that machine is in fact lowercase.
I don't know why the ceph zabbix module makes it uppercase.
I configured the host on zabbix with capital letters and now it works...
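(For anyone hitting the same thing: instead of renaming the host on the Zabbix side, it should also be possible to pin the name the module reports via its identifier option, if I'm reading the module options correctly; I haven't tested this path myself:

  ceph zabbix config-set identifier vm-2
  ceph zabbix config-show
  ceph zabbix send
)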
kind regards
Ingo Schmidt
----------------------------------------
IT-Department
Island municipality Langeoog
with in-house operations
Tourismus Service and Schiffahrt
Hi,
I am currently dealing with a cluster that's been in use for 5 years and
during that time, has never had its radosgw usage log trimmed. Now that
the cluster has been upgraded to Nautilus (and has completed a full
deep-scrub), it is in a permanent state of HEALTH_WARN because of one
large omap object:
$ ceph health detail
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
1 large objects found in pool '.usage'
As far as I can tell, there are two thresholds that can trigger that
warning:
* The default omap object size warning threshold,
osd_deep_scrub_large_omap_object_value_sum_threshold, is 1G.
* The default omap object key count warning threshold,
osd_deep_scrub_large_omap_object_key_threshold, is 200000.
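(For completeness, you can double-check what a given OSD is actually using, in case something has overridden the defaults, by querying its admin socket on the OSD host, e.g.:

  ceph daemon osd.6 config get osd_deep_scrub_large_omap_object_key_threshold
  ceph daemon osd.6 config get osd_deep_scrub_large_omap_object_value_sum_threshold
)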
In this case, this was the original situation:
osd.6 [WRN] : Large omap object found. Object:
15:169282cd:::usage.20:head Key count: 5834118 Size (bytes): 917351868
So that's 5.8M keys (way above threshold) and 875 MiB total object size
(below threshold, but not by much).
The usage log in this case was no longer needed that far back, so I
trimmed it to keep only the entries from this year (radosgw-admin usage
trim --end-date 2018-12-31), a process that took upward of an hour.
After the trim (and a deep-scrub of the PG in question¹), my situation
looks like this:
osd.6 [WRN] Large omap object found. Object: 15:169282cd:::usage.20:head
Key count: 1185694 Size (bytes): 187061564
So both the key count and the total object size have diminished by about
80%, which is about what you expect when you trim 5 years of usage log
down to 1 year of usage log. However, my key count is still almost 6
times the threshold.
I am aware that I can silence the warning by increasing
osd_deep_scrub_large_omap_object_key_threshold by a factor of 10, but
that's not my question. My question is what I can do to prevent the
usage log from creating such large omap objects in the first place.
Now, there's something else that you should know about this radosgw,
which is that it is configured with the defaults for usage log sharding:
rgw_usage_max_shards = 32
rgw_usage_max_user_shards = 1
... and this cluster's radosgw is pretty much being used by a single
application user. So the fact that it's happy to shard the usage log 32
ways is irrelevant as long as it puts the usage log for one user all
into one shard.
So, I am assuming that if I bump rgw_usage_max_user_shards up to, say,
16 or 32, all *new* usage log entries will be sharded. But I am not
aware of any way to reshard the *existing* usage log. Is there such a
thing?
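(For concreteness, the bump I have in mind is simply something like this in ceph.conf on the radosgw host, followed by a radosgw restart; the section name below is a placeholder for whatever your gateway instance is called:

  [client.rgw.gateway1]
      rgw_usage_max_shards = 32
      rgw_usage_max_user_shards = 16
)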
Otherwise, it seems like the only option in this situation would be to
clear the usage log altogether, and tweak the sharding knobs, which
should at least make the problem not reappear. Or else, bump
osd_deep_scrub_large_omap_object_key_threshold and just live with the
large object.
Also, is anyone aware of any adverse side effects of increasing these
thresholds, and/or changing the usage log sharding settings, that I
should keep in mind here?
Thanks in advance for your thoughts.
Cheers,
Florian
¹For anyone reading this in the archives because they've run into the same problem and are wondering how to find out which PGs in a pool have too-large objects, here's a jq one-liner:
ceph --format=json pg ls-by-pool <poolname> \
| jq '.pg_stats[]|select(.stat_sum.num_large_omap_objects>0)'