Hello
I am running Ceph 14.2.7 with the balancer in crush-compat mode (needed
because of old clients), but it doesn't seem to be doing anything. It
used to work in the past, and I am not sure what changed. I created a big
pool, ~285 TB stored, and it doesn't look like it ever got balanced:
pool 43 'fs-data-k5m2-hdd' erasure size 7 min_size 6 crush_rule 7
object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode warn
last_change 48647 lfor 0/42080/42102 flags
hashpspool,ec_overwrites,nearfull stripe_width 20480 application cephfs
OSD utilization varies between ~50% and ~80%, with about 60% raw
used. I am using a mixture of 9 TB and 14 TB drives. The number of PGs
per drive varies between 103 and 207.
# ceph osd df | grep hdd | sort -k 17 | (head -n 2; tail -n 2)
160 hdd 12.53519 1.00000  13 TiB 6.0 TiB 5.9 TiB  74 KiB 12 GiB 6.6 TiB 47.74 0.79 120 up
146 hdd 12.53519 1.00000  13 TiB 6.0 TiB 6.0 TiB  51 MiB 13 GiB 6.5 TiB 48.17 0.80 119 up
 79 hdd  8.99799 1.00000 9.0 TiB 7.3 TiB 7.2 TiB  42 KiB 16 GiB 1.7 TiB 80.91 1.34 186 up
 62 hdd  8.99799 1.00000 9.0 TiB 7.3 TiB 7.2 TiB 112 KiB 16 GiB 1.7 TiB 81.44 1.35 189 up
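As a stopgap I have been tempted to at least look at what a manual reweight would do (dry run only, nothing I have actually applied):

# ceph osd test-reweight-by-utilization 110

but I would much rather get the balancer itself to do its job. Its current status is: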
# ceph balancer status
{
    "last_optimize_duration": "0:00:00.339635",
    "plans": [],
    "mode": "crush-compat",
    "active": true,
    "optimize_result": "Some osds belong to multiple subtrees: {0: ['default', 'default~hdd'], ...",
    "last_optimize_started": "Thu Apr 9 11:17:40 2020"
}
Does anybody know how to debug this?
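In case it matters, these are the commands I was planning to compare next, to see whether the balancer is tripping over the per-class shadow trees or over old clients (just my guesses, not a confirmed diagnosis):

# ceph osd crush tree --show-shadow
# ceph features
# ceph osd crush weight-set ls

The first should show the default~hdd shadow root mentioned in the optimize_result, the second which client releases are connected (crush-compat is what we use because of old clients), and the third whether a compat weight-set exists at all.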
Thanks,
Vlad
Hello,
for some time I've been investigating problems causing bucket index
corruption. In my case it's been caused by numerous bugs related to index
resharding and bucket lifecycle policies.
One of those bugs, present in versions prior to 14.2.8, made the index
omap key names contain a unicode NULL, which manifests like this:
root@mach0122:~/mkw # radosgw-admin bi list --bucket=sysa-user-logs | grep idx | tail -5
    "idx": "_multipart_user-ec24b85efa1c/user-ec24b85efa1c-AA-U79.log-2020031521.gz.u6_qPM0X2mXQRXD0hEfRK2dCc4dD4El\u0000.4",
    "idx": "_multipart_user-ec24b85efa1c/user-ec24b85efa1c-AA-U79.log-2020031600.gz.bvZdIu5KnsAQyXoa_ZWpix-BYD-yvcz\u0000.15",
    "idx": "_multipart_user-ec24b85efa1c/user-ec24b85efa1c-AA-U79.log-2020031600.gz.wX9jHyKnJ21w5DYL_rdfTsb7zH3tUEa\u0000.10",
    "idx": "_multipart_user-ec24b85efa1c/user-ec24b85efa1c-AA-U79.log-2020031601.gz.K6GhZfNzcqDCcjkrn5GskRWS8ufXSbO\u0000.8",
    "idx": "_multipart_user-ec24b85efa1c/user-ec24b85efa1c-AA-U79.log-2020031601.gz.vEr4VCu0QU07te_RHNJ9Wi1cb8tsiYq\u0000.7",
I am unable to remove them with rmomapkey. I also tried the method from
https://github.com/ceph/ceph/blob/8c63b26fe88bb02d894705cb1beec289668fb43d/…
but it fails with:
File "./idx_removal.py", line 96, in remove_key
iocontext.remove_omap_keys(write_op, key)
File "rados.pyx", line 516, in rados.requires.wrapper.validate_func
(/build/ceph-14.2.8/obj-x86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:4992)
File "rados.pyx", line 3607, in rados.Ioctx.remove_omap_keys
(/build/ceph-14.2.8/obj-x86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:45059)
File "rados.pyx", line 543, in rados.cstr_list
(/build/ceph-14.2.8/obj-x86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:5665)
File "rados.pyx", line 539, in rados.cstr
(/build/ceph-14.2.8/obj-x86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:5463)
TypeError: keys must be a string
Does anyone know what else can be done to remove such an omap key? For
obvious reasons I can't use the clear_omap method...
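What I'm considering trying next is to go through librados directly and pass the key with the NUL byte already decoded, as a byte string, instead of going through the CLI. Below is only a sketch: the index pool name and the .dir object name are placeholders I'd still have to fill in, and I'm assuming the "idx" string from "bi list" is the literal omap key name.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
# placeholders: the bucket index pool and the .dir.<bucket-instance-id>.<shard> object
ioctx = cluster.open_ioctx('default.rgw.buckets.index')
oid = '.dir.<bucket-instance-id>.<shard>'

# turn the \u0000 escape shown by "bi list" into a real NUL byte in the key
key = (u'_multipart_user-ec24b85efa1c/user-ec24b85efa1c-AA-U79.log-2020031521.'
       u'gz.u6_qPM0X2mXQRXD0hEfRK2dCc4dD4El\u0000.4').encode('utf-8')

with rados.WriteOpCtx() as write_op:
    # remove_omap_keys() expects an iterable of keys, not a single string
    ioctx.remove_omap_keys(write_op, (key,))
    ioctx.operate_write_op(write_op, oid)

ioctx.close()
cluster.shutdown()

I don't know yet whether the bindings accept keys with embedded NULs at all, so I'd be grateful for confirmation before running anything like this against a production index.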
Kind regards,
Maks Kowalik
Hey guys,
I am evaluating using M.2 SSDs as OSDs for an all-flash pool. Is anyone using that in production and can share their experience? I am a little bit concerned about the lifetime of the M.2 drives.
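If we do go ahead with them, my plan was at least to track the wear indicators regularly, e.g. (assuming NVMe M.2 drives; for SATA M.2 it would just be smartctl):

# nvme smart-log /dev/nvme0 | grep percentage_used
# smartctl -a /dev/nvme0

but I would still like to hear real-world endurance numbers first.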
Best regards
Felix
IT-Services
Telefon 02461 61-9243
E-Mail: f.stolte(a)fz-juelich.de
Hello,
I have an issue since my Nautilus -> Octopus upgrade.
My cluster has many RBD images (~3k or so).
Each of them has ~30 snapshots.
Each day, I create and remove at least one snapshot per image.
Since Octopus, when I remove the "nosnaptrim" flag, each OSD uses 100%
of its CPU time.
The whole cluster collapses: OSDs no longer see each other, and most of
them are seen as down.
I do not see any progress being made: it does not look like the problem
will resolve on its own.
What can I do?
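The only mitigation I can think of so far is to throttle trimming while I investigate, something like (the values are guesses on my side, not a recommendation):

# ceph config set osd osd_snap_trim_sleep 3
# ceph config set osd osd_pg_max_concurrent_snap_trims 1

but that obviously only slows the problem down rather than explaining it.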
Best regards,
So, this is following on from a discussion in the #ceph IRC channel, where we seem to have reached the limit of what we can do.
I have a ~15 node, 311 OSD cluster. (20 OSDs per node).
The cluster is Nautilus - the 3 MONs + the first 8 OSD hosts were installed as Mimic and upgraded to Nautilus with ceph-ansible; the remaining OSD hosts were added directly with Nautilus, as they were only added a few weeks ago.
Yesterday, suddenly, about half of the OSDs (~140) were marked Down, and a number of slow operations were detected.
Initially, examining the logs (and with a bit of help from IRC), I noticed that the ansible roles used to build the newer OSDs had configured chrony incorrectly, and their clocks were drifting.
(There were BADAUTHORIZER errors in OSD logs, too.)
I fixed the chrony configuration... and we (including people in IRC) expected everything to just... stabilise.
Things have not stabilised, which leads me to suspect that there are other issues at play.
After noticing a number of issues with mgrs deadlocking in Nautilus - e.g. https://tracker.ceph.com/issues/17170 and https://tracker.ceph.com/issues/43048 - I tried stopping all mgrs and mons, and then slowly bringing them up.
This has not helped.
Interestingly, the OSDs with slow ops (some of which are marked down) report ops_in_flight which are "wait for new map", whilst the lead mon believes those same ops are timed out.
(I can of course, telnet to every OSD, even the down ones, from other OSDs, including ones which report issues talking to them on the same port; and from the lead mon.)
I am wondering if this is an example of https://tracker.ceph.com/issues/44184, as we did create a new pool shortly after adding the new OSD host nodes... but it isn't clear from that ticket [or the discussion on this list] how to fix this, other than removing the pool - which I can't do, as we need this pool to exist, and the pool it replaces needs to be decommissioned.
Can anyone advise what I should do next? At present, obviously, the cluster is unusable.
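For completeness, what I was about to try next (but haven't, in case it makes things worse) is to stop the map churn and restart the affected OSDs in small batches, roughly:

# ceph osd set noout
# ceph osd set nodown

and then, host by host, systemctl restart ceph-osd@<id> for the OSDs that are stuck, unsetting the flags once things settle - but I would appreciate confirmation before touching anything.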
All;
We set up a CephFS on a Nautilus (14.2.8) cluster in February, to hold backups. We finally have all the backups running, and are just waiting for the system to reach steady-state.
I'm concerned about the usage numbers: the Dashboard Capacity panel shows the cluster as 37% used, while under Filesystems --> <FSName> --> Pools --> <data> --> Usage, it shows 71% used.
Does CephFS place a limit on the size of a filesystem? Is there a limit to how large a pool can be in Ceph? Where is the sizing discrepancy coming from, and do I need to address it?
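For reference, the numbers I intend to compare next are the raw and per-pool figures from the CLI, since I suspect the two dashboard views are reporting different things (raw cluster capacity vs. the data pool's max avail):

# ceph df detail
# ceph osd df tree

If someone can confirm which of these the two dashboard panels correspond to, that would already help.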
Thank you,
Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
DHilsbos(a)PerformAir.com
www.PerformAir.com
On Tue, Apr 7, 2020 at 3:36 AM alean Huang <woalean(a)gmail.com> wrote:
>
> hi,
> I appreciate your work very much. There is a problem with the Ceph MDS that I have run into.
> ceph version: luminous
> The MDS cache grows to 20G while the limit is 4G, after a 'stat *' from a client mounted via the kernel client (CentOS 7, kernel 4.14).
> There are ~10 million files in this dir. The MDS log shows recall_client_state does not work after recalling some caps.
>
> log is like this:
> 2020-04-07 17:08:36.841304 7f23ddd15700 10 mds.2.server recall_client_state: session client.2366013 172.16.200.20:0/4162573294 caps 6801282, leases 0
> 2020-04-07 17:08:36.841323 7f23ddd15700 15 mds.2.server session recall threshold (16384) hit at 0; skipping!
> 2020-04-07 17:08:36.841326 7f23ddd15700 7 mds.2.server recalled (throttled) 0 client caps.
>
> mds_recall_max_decay_rate = 2.5
It looks like your setting for mds_recall_max_caps is larger than
mds_recall_max_decay_threshold. Are you changing these configurations?
If so, why?
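If you want to double-check what the MDS is actually running with, you can query it through the admin socket on the MDS host, for example:

ceph daemon mds.<name> config get mds_recall_max_caps
ceph daemon mds.<name> config get mds_recall_max_decay_threshold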
--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Hi Folks
We are using Ceph as the storage backend on our 6-node Proxmox VM cluster. To monitor our systems we use Zabbix, and I would like to get some Ceph data into Zabbix so we get alarms when something goes wrong.
Ceph mgr has a module, "zabbix", which uses "zabbix_sender" to actively send data, but I cannot get the module working. It always responds with "failed to send data".
The network side seems to be fine:
root@vm-2:~# traceroute 192.168.15.253
traceroute to 192.168.15.253 (192.168.15.253), 30 hops max, 60 byte packets
1 192.168.15.253 (192.168.15.253) 0.411 ms 0.402 ms 0.393 ms
root@vm-2:~# nmap -p 10051 192.168.15.253
Starting Nmap 7.70 ( https://nmap.org ) at 2019-09-18 08:40 CEST
Nmap scan report for 192.168.15.253
Host is up (0.00026s latency).
PORT STATE SERVICE
10051/tcp open zabbix-trapper
MAC Address: BA:F5:30:EF:40:EF (Unknown)
Nmap done: 1 IP address (1 host up) scanned in 0.61 seconds
root@vm-2:~# ceph zabbix config-show
{"zabbix_port": 10051, "zabbix_host": "192.168.15.253", "identifier": "VM-2", "zabbix_sender": "/usr/bin/zabbix_sender", "interval": 60}
root@vm-2:~#
But if I try "ceph zabbix send" I get "failed to send data to zabbix", and this shows up in the system journal:
Sep 18 08:41:13 vm-2 ceph-mgr[54445]: 2019-09-18 08:41:13.272 7fe360fe4700 -1 mgr.server reply reply (1) Operation not permitted
The log of ceph-mgr on that machine states:
2019-09-18 08:42:18.188 7fe359fd6700 0 mgr[zabbix] Exception when sending: /usr/bin/zabbix_sender exited non-zero: zabbix_sender [3253392]: DEBUG: answer [{"response":"success","info":"processed: 0; failed: 44; total: 44; seconds spent: 0.000179"}]
2019-09-18 08:43:18.217 7fe359fd6700 0 mgr[zabbix] Exception when sending: /usr/bin/zabbix_sender exited non-zero: zabbix_sender [3253629]: DEBUG: answer [{"response":"success","info":"processed: 0; failed: 44; total: 44; seconds spent: 0.000321"}]
I'm guessing this could have something to do with user rights, but I have no idea where to start tracking this down.
Maybe someone here has a hint?
If more information is needed, I will gladly provide it.
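In case it helps narrow it down, I also wanted to test the trapper path by hand with zabbix_sender, outside of the mgr module - something along these lines (the item key below is just a placeholder; I'd take a real one from the Ceph template imported in Zabbix):

root@vm-2:~# zabbix_sender -vv -z 192.168.15.253 -p 10051 -s "VM-2" -k <item.key.from.template> -o 0

Since the module output above says "processed: 0; failed: 44", my suspicion is that the data reaches the server but no matching host/items exist for the identifier "VM-2" - but I'm not sure how to verify that from the Ceph side.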
greetings
Ingo
Hi cephers,
I'm looking for some advice on what to do about drives of different
sizes in the same cluster.
We have so far kept the drive sizes consistent on our main Ceph cluster
(using 8TB drives). We're getting some new hardware with larger, 12TB
drives next, and I'm pondering how best to configure them. If I simply
add them, they will have 1.5x the data (which is less of a problem), but
will also get 1.5x the IOPS - so I presume they will slow the whole
cluster down as a result (these drives will be busy, while the rest will
be less so). I'm wondering how people generally handle this.
I'm more concerned about these larger drives being busier than the rest
- so I'd like to be able to put, for example, a third of a drive's worth
of less-accessed data on them in addition to the usual data, to use the
extra capacity without increasing the load on them. Is there an easy way
to accomplish this? One possibility is to run two OSDs on the drive (in
two crush hierarchies), which isn't ideal. Can I somehow run just one
OSD and put it into two crush roots, or something similar?
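One half-measure I've thought about (just an idea, I haven't tested the effect) is to leave the crush weights proportional to capacity but lower the primary affinity on the 12TB OSDs so they serve fewer reads, e.g.:

# ceph osd primary-affinity osd.<id> 0.66

That obviously doesn't help with write load, so I'm still interested in better approaches.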
Andras