Hi All,
Long story short, we're doing disaster recovery on a CephFS cluster and are at a point where we have 8 PGs stuck incomplete. Just before the disaster, I had increased pg_num on two of the pools, and pgp_num had not finished catching up yet. I've since forced pgp_num to the current values.
So far, I've tried mark_unfound_lost, but the PGs don't report any unfound objects, and I've tried force-create-pg, but that has no effect except on one of the PGs, which went to creating+incomplete. During the disaster recovery, I had to re-create several OSDs (due to unreadable superblocks), and now one of the new OSDs, as well as one of the existing OSDs, won't start. The startup log from osd.29 is here: https://pastebin.com/PX9AAj8m; it seems to indicate that the OSD won't start because it's supposed to hold copies of the incomplete placement groups.
ceph pg 5.38 query (one of the incomplete PGs) gives: https://pastebin.com/Jf4GnZTc
I have hunted around on the OSDs listed for all of these placement groups for any sign of a PG copy that I could mark as complete with ceph-objectstore-tool, but can't find any. I don't care about the data in these PGs, but I can't abandon the filesystem.
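For reference, this is roughly what I've been running against each (stopped) OSD; the data path and pgid here are just examples:

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29 --op list-pgs
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29 --pgid 5.38 --op mark-complete

i.e. list the PGs actually present on the OSD, and only run mark-complete if a copy of one of the incomplete PGs turns up.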
Any help would be greatly appreciated.
-TJ Ragan
Dear all,
is it possible to upgrade from 13.2.2 directly to 13.2.8 after setting "ceph osd set pglog_hardlimit" (per the Mimic 13.2.5 release notes), or do I need to follow this path:
13.2.2 -> 13.2.5 -> 13.2.6 -> 13.2.8?
Thanks!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Looking to roll out an all-flash Ceph cluster. Wanted to see if anyone else is using Micron drives, and get some basic input on my design so far.
Basic Config
Ceph OSD Nodes
8x Supermicro A+ Server 2113S-WTRT
- AMD EPYC 7601 32-core 2.2GHz
- 256GB RAM
- AOC-S3008L-L8e HBA
- 10Gb SFP+ for client network
- 40Gb QSFP+ for Ceph cluster network
OSD
10x Micron 5300 PRO 7.68TB in each ceph node
- 80 total drives across the 8 nodes
WAL/DB
5x Micron 7300 MAX NVMe 800GB per Ceph Node
- Plan on dedicating 1 NVMe for every 2 OSDs
Still thinking through an external monitor node, as I have a lot of options, but this is a pretty good start. Open to suggestions as well!
I think an 800 GB NVMe per 2 SSDs is overkill. One OSD usually only
requires a 30 GB block.db, so 400 GB per OSD is a lot. On the other
hand, does the 7300 have twice the IOPS of the 5300? In fact, I'm not
sure a 7300 + 5300 OSD will perform better than a plain 5300 OSD at all.
It would be interesting if you could benchmark & compare it though :)
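If the NVMes do stay in the design, a rough ceph-volume sketch for capping the DB at ~30 GB per OSD would look something like this (device names illustrative, untested):

  ceph-volume lvm batch --bluestore /dev/sda /dev/sdb --db-devices /dev/nvme0n1 --block-db-size 30G

leaving the rest of the NVMe unused, or available to serve more OSDs per DB device.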
On Fri, Jan 31, 2020 at 2:06 PM EDH - Manuel Rios
<mriosfer(a)easydatahost.com> wrote:
>
> Hmm, change 40Gbps to 100Gbps networking.
>
> 40Gbps technology is just a bond of 4x10G links, with some added latency due to link aggregation.
> 100Gbps and 25Gbps have less latency and good performance. In Ceph, about 50% of the latency comes from network commits and the other 50% from disk commits.
40G Ethernet is not the same as a 4x 10G bond. A bond load-balances on a
per-packet (or, more usually, per-flow) basis; a 40G link uses all four
lanes even for a single packet. (100G is "just" 4x 25G in the same way.)
I also wouldn't agree that network and disk latency is a 50/50 split
in Ceph, unless you have some NVRAM disks or something.
Even for the network, the processing and queuing in the network
stack dominate over the serialization delay from a 40G/100G
difference (4 kB at 100G is 320 ns, and 800 ns at 40G, for the
serialization; I don't have any figures for processing times on
40/100G Ethernet, but 10G fiber is around 300 ns and 10GBASE-T around
2300 ns).
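(For the serialization arithmetic: time = frame bits / line rate, so 4 kB = 32000 bits gives 32000 / 100e9 s = 320 ns at 100G and 32000 / 40e9 s = 800 ns at 40G.)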
Paul
Appreciate the input.
Looking at those articles, it sounds like the 40G they're discussing is 4x bonded 10G connections.
I'm looking at native 40Gbps without bonding for throughput. Is that still the same?
https://www.fs.com/products/29126.html
Yep most of this is based on the white paper with a few changes here and there.
From: "EDH - Manuel Rios" <mriosfer(a)easydatahost.com>
To: "adamb" <adamb(a)medent.com>, "ceph-users" <ceph-users(a)ceph.io>
Sent: Friday, January 31, 2020 8:05:52 AM
Subject: RE: Micron SSD/Basic Config
Hmm, change 40Gbps to 100Gbps networking.
40Gbps technology is just a bond of 4x10G links, with some added latency due to link aggregation.
100Gbps and 25Gbps have less latency and good performance. In Ceph, about 50% of the latency comes from network commits and the other 50% from disk commits.
A fast graph: https://blog.mellanox.com/wp-content/uploads/John-Kim-030416-Fig-3a-1024x74…
Article: https://blog.mellanox.com/2016/03/25-is-the-new-10-50-is-the-new-40-100-is-…
Micron have their own white paper for Ceph, and it looks like it performs fine:
https://www.micron.com/-/media/client/global/documents/products/other-docum…
As your budget is high, please buy 3x ~$1.5K nodes for your monitors and you will sleep better. They just need 4 cores / 16GB RAM and 2x 128GB SSD or NVMe M.2.
After upgrading my Ceph cluster from Luminous to Nautilus 14.2.6,
"ceph health detail" complains from time to time about "Long heartbeat
ping times on front/back interface seen".
As far as I understand (after reading
https://docs.ceph.com/docs/nautilus/rados/operations/monitoring/), this
means that the ping from one OSD to another exceeded 1 s.
I have some questions about these network performance checks:
1) What exactly is meant by front and back interface?
2) I can see the involved OSDs only in the output of "ceph health detail"
(while the problem is present), but I can't find this information in the log
files. In the mon log I can only see messages such as:
2020-01-28 11:14:07.641 7f618e644700 0 log_channel(cluster) log [WRN] :
Health check failed: Long heartbeat ping times on back interface seen,
longest is 1416.618 msec (OSD_SLOW_PING_TIME_BACK)
but the involved OSDs are not reported in this log.
Do I just need to increase the verbosity of the mon log?
3) Is 1 s a reasonable value for this threshold? How can this value be
changed, i.e. what is the relevant configuration variable?
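(From that same docs page, my guess is that the 1 s default comes from mon_warn_on_slow_ping_ratio (default 0.05) times osd_heartbeat_grace (default 20 s), and that mon_warn_on_slow_ping_time, in milliseconds, overrides it when set, e.g. something like "ceph config set global mon_warn_on_slow_ping_time 2000" — but I'd appreciate confirmation.)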
4) https://docs.ceph.com/docs/nautilus/rados/operations/monitoring/
suggests using the dump_osd_network command. I think there is an error on
that page: it says the command should be issued against ceph-mgr.x.asok,
while I think ceph-osd.x.asok should be used instead.
I have another Ceph cluster (also running Nautilus 14.2.6) where there
are no OSD_SLOW_PING_* error messages in the mon logs, but:
ceph daemon /var/run/ceph/ceph-osd.x.asok dump_osd_network 1
reports a lot of entries (i.e. pings exceeding 1 s). How can this be
explained?
Thanks, Massimo
On 1/28/20 6:58 PM, Anthony D'Atri wrote:
>
>
>> I did this once. This cluster was running IPv6-only (still is) and thus
>> I had the flexibility of new IPs.
>
> Dumb question — how was IPv6 a factor in that flexibility? Was it just that you had unused addresses within an existing block?
>
There are no dumb questions :-)
Usually Ceph is put into RFC1918 IPv4 space (10.x, 172.x), and those
ranges are often more difficult to route between networks.
IPv6 address space is globally routable in most networks, which makes
this easier.
As long as the hosts can talk IPv4/IPv6 with each other, you can perform
such a migration.
Wido
I am testing failure scenarios for my cluster. I have 3 monitors. Let's say mons 1 and 2 go down, so the monitors can't form a quorum: how can I recover?
Are the instructions at the following link valid for deleting mons 1 and 2 from the monmap? https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.2.3/ht…
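(For reference, the procedure I understand from those docs is roughly the following, run on the surviving mon host with all mon daemons stopped; the mon ids/names here are examples:

  ceph-mon -i 3 --extract-monmap /tmp/monmap
  monmaptool /tmp/monmap --rm 1 --rm 2
  ceph-mon -i 3 --inject-monmap /tmp/monmap

then start mon 3 again.)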
One more question: let's say I delete mons 1 and 2 from the monmap, so the cluster has only mon 3 remaining and mon 3 has quorum. Now what happens if mons 1 and 2 come back up? Do they rejoin mon 3, so there will again be 3 monitors in the cluster?
Thanks
Is it possible to create an EC-backed RBD via the ceph-iscsi tools (gwcli,
rbd-target-api)? It appears that a pre-existing RBD created with the rbd
command can be imported, but there is no means to directly create an
EC-backed RBD. The API seems to expect a single pool field in the request
body.
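(Concretely, what seems possible today is to create the image manually first; pool and image names here are just examples:

  rbd create iscsi-pool/lun0 --size 1T --data-pool ec-data

and then import that pre-existing image through gwcli.)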
Perhaps there is a lower-level construct where you can set metadata on a
particular RADOS pool to always use pool X as the data-pool when pool Y is
used for the RBD header and metadata. That way the clients, in our case
ceph-iscsi, needn't be modified or concerned with the dual-pool situation
unless it is explicitly specified.
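(Something along these lines may already exist via pool-level RBD config, though I haven't verified it plays nicely with ceph-iscsi; pool names are examples:

  rbd config pool set iscsi-pool rbd_default_data_pool ec-data

which, if I read the docs correctly, makes new images created in iscsi-pool default their data objects to ec-data.)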
For our particular use case we expose limited functionality of
rbd-target-api to clients, and it would be helpful for them to keep track of
a single pool rather than two; but if a data-pool and a "main" pool could
both be passed via the API, that would be okay too.
Thanks a lot.
Respectfully,
*Wes Dillingham*
wes(a)wesdillingham.com
LinkedIn <http://www.linkedin.com/in/wesleydillingham>