The only other thing I can think of is that a firewall is dropping idle connections, although Ceph should be sending heartbeats more often than the typical 5-minute idle timeout on most firewalls. In the logs, is it the monitor marking the OSDs out on its own, or the OSD peers reporting them down? That would give you an idea of where to look.
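
Something along these lines should show which side is doing the marking (log paths and exact message wording are from memory, and osd.12 is just a placeholder):

# on a monitor host, the cluster log records who reported the failure
$ grep -i "reported failed" /var/log/ceph/ceph.log
$ grep -i "marked down" /var/log/ceph/ceph.log

# on the affected host, the OSD's own log shows heartbeat trouble with its peers
$ grep "heartbeat_check: no reply" /var/log/ceph/ceph-osd.12.log

If the peers are doing the reporting, the heartbeat_check lines should also name which OSDs stopped answering.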
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sat, Aug 17, 2019 at 10:26 AM Lorenz Kiefner <root+cephusers@deinadmin.de> wrote:

Hello again,

All links are at least 10/50 Mbit/s upstream/downstream, mostly 40/100 Mbit/s, with some VMs at hosting companies running at 1/1 Gbit/s. My 39 OSDs on 17 hosts in 11 locations (5 of them currently connected via consumer internet links) form a nearly full mesh of WireGuard VPN links, routed by bird with OSPF. Speed is not great, as you can imagine, but sufficient for me.
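
For reference, one of the mesh links looks roughly like this (keys, hostname and addresses below are placeholders, not my real config):

[Interface]
PrivateKey = <this node's private key>
ListenPort = 51820

[Peer]
PublicKey = <peer's public key>
Endpoint = peer.example.org:51820
AllowedIPs = <peer's tunnel address>/128
# keepalive so NAT/stateful firewalls on the consumer links don't expire the idle tunnel
PersistentKeepalive = 25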

Some hosts are x86, some are ARMv7 on ODROID-HC1 (a Samsung smartphone SoC). Could this mix of architectures be a problem?

My goal is to provide a shared filesystem for my friends and to provide backup space on RBD images. This seems possible, but it is really annoying when OSDs are randomly marked down.

If there were network issues, I would expect all OSDs on the affected host to be marked down, but only one OSD on that host is marked down. If I log in to that host and restart the OSD, the same OSD will probably be marked down again within 10-30 minutes. And this only happens when there is *no* backfill or recovery running. I would expect network issues and packet drops to be more likely on a saturated line than on an idle one.

Are there any (more) config keys for OSD ping timeouts in Luminous? I would be very happy about some more ideas!
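
These are the knobs I have found so far; the values below are just what I am experimenting with in ceph.conf, not recommendations:

[osd]
# default 20s as far as I know; raising it makes flaky OSDs slower to be marked down
osd heartbeat grace = 60
# default 6s between heartbeat pings to peer OSDs
osd heartbeat interval = 6

[mon]
# require failure reports from at least this many reporters before marking an OSD down
mon osd min down reporters = 2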

Thank you all

Lorenz


On 16.08.19 at 17:01, Robert LeBlanc wrote:
Personally, I would not try to build a Ceph cluster across consumer internet links; their upload speed is usually so slow, and Ceph is so chatty, that it would make for a horrible experience. If you are looking for a backup solution, I would look at some sort of n-way rsync setup, or btrfs/zfs volumes that send/receive to each other. I really don't think Ceph is a good fit.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Aug 15, 2019 at 12:37 AM Lorenz Kiefner <root+cephusers@deinadmin.de> wrote:

Oh no, it's not that bad. It's

$ ping -s 65000 dest.inati.on

on a VPN connection that has an MTU of 1300 via IPv6. So I suspect I only get an answer when all 51 fragments come back intact. It's clear that big packets with lots of fragments are hit much harder by packet loss than 64-byte pings.
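
Back-of-the-envelope, assuming fragment losses are independent and that the reply fragments as well (so roughly 51 + 51 fragments have to survive per ping):

# chance that at least one of 102 fragments is lost, at 0.5% loss per fragment
$ python3 -c 'p = 0.005; print(1 - (1 - p)**(51 * 2))'

That comes out around 0.40, so even half a percent of per-fragment loss would already eat ~40% of the 64k pings.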

I just repeated this ping test (at 9 o'clock in the morning) and got hardly any drops (less than 1%), even at the 64k size. So it really depends on the time of day. It seems some ISPs are dropping packets, especially in the evening...

A few minutes ago I restarted all the down-marked OSDs, but they are getting marked down again... Ceph itself seems tolerant of packet loss (it surely affects performance, but that is irrelevant for me).


Could erasure-coded pools be causing problems here?


Thank you all for every hint!

Lorenz


On 15.08.19 at 08:51, Janne Johansson wrote:
On Wed, Aug 14, 2019 at 17:46, Lorenz Kiefner <root+cephusers@deinadmin.de> wrote:
Is ceph sensitive to packet loss? On some VPN links I have up to 20%
packet loss on 64k packets but less than 3% on 5k packets in the evenings.

20% seems crazy high, there must be something really wrong there.

At 20%, you would get tons of packet timeouts to wait out for all those lost frames, then resends of (at least!) those 20% extra, which in turn lose 20% of the resends, all while the main streams of data try to move forward whenever some older packet does get through. This is a really bad situation to design for.

I think you should look for a link solution that doesn't drop that many packets instead of changing the software you try to run over that link; everything else you run over it will notice the loss too and act badly in some way or other.

Heck, 20% is like taking a math schoolbook, removing all instances of "3" and "8", and seeing if kids can learn to count from it. 8-/
 
--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io