Hi guys,
We recently upgraded ceph-mgr to 15.2.4 (Octopus) in our production
clusters. The status of the cluster is now as follows:
# ceph versions
{
    "mon": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 5
    },
    "mgr": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 3
    },
    "osd": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 1933
    },
    "mds": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 14
    },
    "overall": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 1955
    }
}
We are now seeing some problems in this cluster:
1. it always takes a significantly longer time to get the result of `ceph pg
dump`.
2. the ceph-exporter might fail to get cluster metrics.
3. sometimes the cluster shows a few inactive/down PGs, but it recovers very
soon.
We did an investigation on the ceph-mgr but haven't found the root cause yet.
There are some scattered clues (I am not sure if they can help):
1. the ms_dispatch thread is always busy, saturating one core.
2. the message size is significantly larger than 40K:
2020-09-24T14:47:50.216+0000 7f8f811f6700 1 --
[v2:{mgr_ip}:6800/111,v1:{mgr_ip}:6801/111] <== osd.3038
v2:{osd_ip}:6800/384927 431 ==== pg_stats(17 pgs tid 0 v 0) v2 ====
42153+0+0 (secure 0 0 0) 0x55dae07c1800 con 0x55daf6dde400
3. we get some "Fail to parse JSON result" errors:
2020-09-24T15:47:42.739+0000 7f8f8da0f700 0 [devicehealth ERROR root]
Fail to parse JSON result from daemon osd.1292 ()
4. on the sending side, we can see lots of faults:
2020-09-24T14:53:17.725+0000 7f8fa866e700 1 --
[v2:{mgr_ip}:6800/111,v1:{mgr_ip}:6801/111] >> v1:{osd_ip}:0/1442957044
conn(0x55db38757400 legacy=0x55db03d8e800 unknown :6801
s=STATE_CONNECTION_ESTABLISHED l=1).tick idle (909347879) for more than
900000000 us, fault.
2020-09-24T14:53:17.725+0000 7f8fa866e700 1 --1-
[v2:{mgr_ip}:6800/111,v1:{mgr_ip}:6801/111] >> v1:{osd_ip}:0/1442957044
conn(0x55db38757400 0x55db03d8e800 :6801 s=OPENED pgs=1572189 cs=1
l=1).fault on lossy channel, failing
5. at other times, the mgr-fin thread is busy, saturating one core.
and from the perf dump we got:
    "finisher-Mgr": {
        "queue_len": 1359862,
        "complete_latency": {
            "avgcount": 14,
            "sum": 40300.307764855,
            "avgtime": 2878.593411775
        }
    },
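In case anyone wants to look at the same counters, this is roughly how the
finisher queue can be polled on the active mgr; "mgr.x" below is only a
placeholder for the real daemon name (which `ceph mgr stat` or the admin
sockets under /var/run/ceph will show):

  # "mgr.x" is a placeholder for the actual daemon name; run on the active mgr host.
  ceph daemon mgr.x perf dump | jq '."finisher-Mgr"'
  # Watch whether queue_len keeps growing (a sign the finisher is falling behind).
  watch -n 5 "ceph daemon mgr.x perf dump | jq '.\"finisher-Mgr\".queue_len'"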
Sorry that these clues are a little messy. Do you have any comments on
this?
Thanks.
Regards,
Hao
Without knowing the source code and just from my observations, I would say
that every time the OSD map changes, the CRUSH/PG mapping is recalculated to
account for it. However, a running backfill is not stopped; only PGs in
backfill_wait would be reconsidered.
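If you want to watch this while adding OSDs, a rough sketch (assuming a
release where `ceph pg ls` accepts state filters) is:

  # PGs currently moving data - these keep running across OSD map changes.
  ceph pg ls backfilling
  # PGs only queued for backfill - these are the ones that can be re-planned
  # when the OSD map changes again.
  ceph pg ls backfill_wait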
--
Martin Verges
Managing director
Mobile: +49 174 9335695
E-Mail: martin.verges(a)croit.io
Chat: https://t.me/MartinVerges
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx
On Sat, 26 Sept 2020 at 13:33, Marc Roos <
M.Roos(a)f1-outsourcing.eu> wrote:
>
> When I add an OSD, rebalancing takes place; let's say Ceph relocates
> 40 PGs.
>
> Then I add another OSD during rebalancing, when Ceph has only relocated
> 10 PGs and still has 30 PGs to do.
>
> What happens then:
>
> 1. Does Ceph just finish the relocation of these 30 PGs, then
> calculate how the new environment with the newly added OSD should be
> laid out, and start relocating that?
>
> 2. Or does Ceph finish only the relocation of the PG it is currently
> doing, then immediately recalculate how PGs should be distributed,
> without finishing these 30 PGs it was planning to do?
Hey folks,
I have managed to fat-finger a config apply command and accidentally
deleted the CRD for one of my pools. The operator went ahead and tried to
purge it, but fortunately since it's used by CephFS it was unable to.
Redeploying the exact same CRD does not make the operator stop trying to
delete it though.
Any hints on how to make the operator forget about the deletion request and
leave it be?
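For context, a generic way to check from the Kubernetes side whether the
object is still marked for deletion is something like this (the kind, name
and namespace below are placeholders, not my real ones):

  # A non-empty deletionTimestamp means the API object is still being deleted,
  # and the finalizers list shows what is blocking it.
  kubectl -n rook-ceph get cephblockpool replicapool -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'
  kubectl -n rook-ceph get cephblockpool replicapool -o jsonpath='{.metadata.finalizers}{"\n"}'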
--
Cheers,
Peter Sarossy
Technical Program Manager
Data Center Data Security - Google LLC.
We currently run an SSD cluster and HDD clusters and are looking at possibly
creating a cluster for NVMe storage. For spinners and SSDs, the max
recommended per OSD host server seemed to be 16 OSDs (I know it depends on the
CPUs and RAM, e.g. 1 CPU core and 2 GB of memory per OSD).
Questions:
1. If we do a JBOD setup, the servers can hold 48 NVMes; if the servers
were bought with 48 cores and 100+ GB of RAM, would this make sense?
2. Should we just RAID 5 groups of NVMe drives instead (and buy less
CPU/RAM)? There is a reluctance to waste even a single drive on RAID
because redundancy is basically Ceph's job.
3. The plan was to build this with Octopus (hopefully there are no issues
we should know about; I did just see one posted today, but this is a
few months off).
4. Any feedback on max OSDs?
5. Right now they run 10Gb everywhere with 80Gb uplinks; I was thinking
this would need at least 40Gb links to every node (the hope is to use these
to speed up image processing at the application layer locally in the DC).
I haven't spoken to the Dell engineers yet, but my concern with NVMe is that
the RAID controller would end up being the bottleneck (next in line after
network connectivity).
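Some back-of-the-envelope numbers behind questions 1 and 5, based on
assumptions rather than measurements: with the default osd_memory_target of
roughly 4 GB, 48 OSDs would want on the order of 192 GB of RAM plus headroom
for the OS, so 100+ GB is likely tight for a 48-OSD box. On the network side,
even if each NVMe only sustains ~2 GB/s, 48 drives could in theory push
~96 GB/s (~768 Gb/s), while a 40Gb link is only ~5 GB/s, so the NICs (or a
RAID/HBA controller in front of the drives) would saturate long before the
drives do.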
Regards,
-Brent
Existing Clusters:
Test: Nautilus 14.2.11 with 3 osd servers, 1 mon/man, 1 gateway, 2 iscsi
gateways ( all virtual on nvme )
US Production(HDD): Nautilus 14.2.11 with 12 osd servers, 3 mons, 4
gateways, 2 iscsi gateways
UK Production(HDD): Nautilus 14.2.11 with 12 osd servers, 3 mons, 4 gateways
US Production(SSD): Nautilus 14.2.11 with 6 osd servers, 3 mons, 3 gateways,
2 iscsi gateways
Hi,
I recently restarted a storage node in our Ceph cluster and had an
issue bringing one of the OSDs back online. This storage node has
multiple HDDs, each a dedicated OSD for a data pool, and a single NVMe
drive with an LVM partition assigned as an OSD in a metadata pool.
After rebooting the host, the OSD using an LVM partition did not
restart. When I try to manually start the OSD using systemctl, I can
follow the launch of a podman container and see an error message before
the container shuts down again:
Sep 23 14:02:06 X bash[30318]: Running command:
/usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/boot/cephfs_meta --path /var/lib/ceph/osd/ceph-165
--no-mon-config
Sep 23 14:02:06 X bash[30318]: stderr: failed to read label for
/dev/boot/cephfs_meta: (2) No such file or directory
Sep 23 14:02:06 X bash[30318]: --> RuntimeError: command returned
non-zero exit status: 1
1. I can see the existence of the /dev/boot/cephfs_meta symlink to a
device ../dm-3
2. `lsblk` shows the lvm partition 'boot-cephfs_meta' under nvme0n1p3
3. `sudo lvscan --all` shows it as activated:
` ACTIVE '/dev/boot/cephfs_meta' [3.42 TiB] inherit`
This is on a CentOS 8 system, with ceph version 15.2.1
(9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)
Related issues I have found include:
1. https://github.com/rook/rook/issues/2591
2. https://github.com/rook/rook/issues/3289
The indicated solutions for these involve installing the
LVM2 package, which I completed with `sudo dnf install lvm2`, then
restarting the system and the container. This did not
resolve the problem for the LVM-partition-based OSD.
This LVM-based OSD was initially created with a `ceph-volume` command:
`ceph-volume lvm create --bluestore --data /dev/sd<x> --block.db /dev/nvme0n1<partition-nr>`
Is there a workaround for this problem where the container process is
unable to read the label of the LVM partition and fails to start the
OSD?
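(In case it helps with diagnosis: assuming the Ceph CLI tools are available
on the host, or reachable via `cephadm shell`, these are the host-side checks
I would expect to show whether the label and LV are readable outside the
container; the commands below are only a sketch:)

  # Read the BlueStore label directly from the host, outside the podman container;
  # /dev/boot/cephfs_meta is the same LV path the container complains about.
  ceph-bluestore-tool show-label --dev /dev/boot/cephfs_meta
  # Show what ceph-volume knows about the LVs/OSDs on this host, including osd.165.
  ceph-volume lvm list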
Thanks,
Matt
--
Matt Larson, PhD
Madison, WI 53705 U.S.A.