The only other thing I can think of is that a firewall is dropping idle
connections, although Ceph should be sending heartbeats more often than the
5-minute idle timeout common on most firewalls. In the logs, is it the
monitor marking the OSDs out, or are OSD peers reporting them down? That
would give you an idea where to look.
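For example, grepping the cluster log on a monitor host should show who is
doing the reporting (paths assume a default install, and the exact message
wording varies a bit between releases, so treat this as a sketch):

$ grep -E 'reported failed|failure reported|marked down' /var/log/ceph/ceph.log

If the failure reports come from a handful of peer OSDs, look at the
heartbeat paths between those hosts; if the monitor is marking the OSD down
on its own, look at the OSD-to-mon connection instead.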
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Sat, Aug 17, 2019 at 10:26 AM Lorenz Kiefner <root+cephusers(a)deinadmin.de> wrote:
Hello again,
all links are at least 10/50 Mbit/s upstream/downstream, mostly 40/100
Mbit/s, with some VMs at hosting companies running at 1/1 Gbit/s. All my 39
OSDs on 17 hosts in 11 locations (5 of them currently connected via consumer
internet links) form nearly a full mesh of wireguard VPN links, routed by
bird with OSPF. Speed is not great, as you can imagine, but sufficient for
me.
Some hosts are x86, some are ARMv7 on ODROID HC-1 (Samsung smartphone
SoC). Could this mix of architectures be a problem?
My goal is to provide a shared filesystem with my friends and to provide
backup space on RBD images. This seems possible, but it is really annoying
when OSDs are randomly marked down.
If there were network issues, I would expect all OSDs on the affected host
to be marked down, but only one OSD on this host is marked down. If I log in
on that host and restart the OSD, the same OSD will probably be marked down
again within 10-30 minutes. And this only happens when there is *no*
backfill or recovery running. I would expect network issues and packet drops
to be more likely on a saturated line than on an idle one.
Are there some (more) config keys for OSD ping timeouts in Luminous? I
would be very happy for some more ideas!
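The ones I have found so far (defaults as I understand them, so please
correct me; they can be checked on a live daemon with
ceph daemon osd.N config show | grep heartbeat):

[osd]
# how often an OSD pings its heartbeat peers (default 6s)
osd_heartbeat_interval = 6
# how long without a reply before a peer is reported failed (default 20s)
osd_heartbeat_grace = 20

[mon]
# how many distinct reporters before the mon marks an OSD down (default 2)
mon_osd_min_down_reporters = 2

I could raise the grace period to buy tolerance for slow links, at the cost
of slower failure detection.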
Thank you all
Lorenz
On 16.08.19 at 17:01, Robert LeBlanc wrote:
Personally, I would not try to build a Ceph cluster across consumer
internet links; the upload speeds are usually so slow and Ceph is so chatty
that it would make for a horrible experience. If you are looking for a
backup solution, I would look at some sort of n-way rsync setup, or
btrfs/zfs volumes that send/receive to each other. I really don't think Ceph
is a good fit.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Thu, Aug 15, 2019 at 12:37 AM Lorenz Kiefner <root+cephusers(a)deinadmin.de> wrote:
Oh no, it's not that bad. It's

$ ping -s 65000 dest.inati.on

on a VPN connection that has an MTU of 1300 via IPv6. So I suspect I only
get an answer when all 51 fragments come back intact. It's clear that big
packets with lots of fragments are more affected by packet loss than
64-byte pings.
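If I assume each fragment is lost independently with some probability p, a
ping that needs n fragments only succeeds with probability (1-p)^n. Working
backwards from my numbers: a per-fragment loss of only about 0.45% gives

    1 - 0.9955^51 ≈ 20% loss for the 64k pings, and
    1 - 0.9955^4  ≈ 2% loss for a 5k ping (4-5 fragments),

which matches what I measured. So the raw link loss is probably well under
1%, and the big pings just amplify it.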
I just repeated this ping test (at 9 o'clock in the morning) and got hardly
any drops (less than 1%), even at the 64k size. So it really depends on the
time of day. It seems some ISPs are dropping packets, especially in the
evening...
A few minutes ago I restarted all down-marked OSDs, but they are getting
marked down again... Ceph seems to be tolerant of packet loss in general (it
surely affects performance, but that is irrelevant for me).
Could erasure-coded pools pose some problems?
Thank you all for every hint!
Lorenz
On 15.08.19 at 08:51, Janne Johansson wrote:
On Wed, Aug 14, 2019 at 17:46, Lorenz Kiefner <root+cephusers(a)deinadmin.de> wrote:
Is Ceph sensitive to packet loss? On some VPN links I have up to 20%
packet loss on 64k packets but less than 3% on 5k packets in the evenings.
20% seems crazy high, there must be something really wrong there.
At 20%, you would get tons of packet timeouts waiting for all those lost
frames, then resends of (at least!) those 20% extra, which in turn would
lose 20% of the resends, all while the main streams of data try to move
forward whenever some older packet does get through. This is a really bad
situation to design for.
I think you should look for a link solution that doesn't drop that many
packets, instead of changing the software you try to run over that link;
everything else you run over it will notice the loss too and act badly in
some way or other.
Heck, 20% is like taking a math schoolbook, removing all instances of
"3" and "8", and seeing if kids can learn to count from it. 8-/
--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io